I like that idea for verification -- I may just use it! I agree with what you said -- I have to be careful with error handling or the whole protocol will get bloated and waste transfer time. Thank you for the ideas!
As for actually writing data to the chips, I have good news and bad news.
The good news is that I am successfully writing to all four chips (SST39SF040) in parallel now -- I write the word 0x12345678 over and over again to fill all four chips, and I can read it back in my EEPROM burner to confirm that I wrote it correctly. The other great news is that the two large chips I've been using (SST39SF040 and the Am29F040B) both use the exact same JEDEC write command set, so they both work perfectly with the same code. At this point I'm 100% confident that every function of the SIMM burner is going to work.
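For reference, here is a minimal sketch in C of the byte-program sequence the two chips share. It is not my actual firmware: bus_write, bus_read, and the small in-memory chip behind them are made-up stand-ins so the example runs on a PC. The command addresses and data values (0xAA to 0x5555, 0x55 to 0x2AAA, 0xA0 to 0x5555) are the ones from the SST39SF040 datasheet. Completion is detected with the toggle bit (DQ6), which is one of the status methods these chips offer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulated chip: only 32 KB, enough to cover address 0x5555. */
#define SIM_SIZE  0x8000u
#define MAX_POLLS 1000

static uint8_t  sim_mem[SIM_SIZE];
static int      sim_unlock_step;  /* progress through the unlock sequence */
static int      sim_busy_reads;   /* status reads left while "programming" */
static uint8_t  sim_toggle;       /* DQ6 state returned while busy */
static unsigned bus_cycles;       /* counts every bus read and write */

/* Reset the simulated chip to the erased state (all 0xFF). */
void sim_erase(void)
{
    memset(sim_mem, 0xFF, sizeof sim_mem);
    sim_unlock_step = sim_busy_reads = 0;
    sim_toggle = 0;
    bus_cycles = 0;
}

/* Stand-in for one write cycle on the chip's address/data bus. */
void bus_write(uint32_t addr, uint8_t data)
{
    bus_cycles++;
    addr &= SIM_SIZE - 1;
    if (sim_unlock_step == 0 && addr == 0x5555 && data == 0xAA) {
        sim_unlock_step = 1;
    } else if (sim_unlock_step == 1 && addr == 0x2AAA && data == 0x55) {
        sim_unlock_step = 2;
    } else if (sim_unlock_step == 2 && addr == 0x5555 && data == 0xA0) {
        sim_unlock_step = 3;
    } else if (sim_unlock_step == 3) {
        sim_mem[addr] &= data;    /* programming can only clear bits */
        sim_busy_reads = 3;       /* pretend programming takes 3 reads */
        sim_unlock_step = 0;
    } else {
        sim_unlock_step = 0;      /* anything else aborts the sequence */
    }
}

/* Stand-in for one read cycle; DQ6 toggles while the chip is busy. */
uint8_t bus_read(uint32_t addr)
{
    bus_cycles++;
    if (sim_busy_reads > 0) {
        sim_busy_reads--;
        sim_toggle ^= 0x40;
        return sim_toggle;
    }
    return sim_mem[addr & (SIM_SIZE - 1)];
}

/* Program one byte: three unlock writes, the data write, then poll DQ6
 * until two consecutive reads agree, and finally verify the data. */
bool flash_program_byte(uint32_t addr, uint8_t data)
{
    bus_write(0x5555, 0xAA);
    bus_write(0x2AAA, 0x55);
    bus_write(0x5555, 0xA0);
    bus_write(addr, data);

    uint8_t prev = bus_read(addr);
    for (int i = 0; i < MAX_POLLS; i++) {
        uint8_t cur = bus_read(addr);
        if (((prev ^ cur) & 0x40) == 0)   /* DQ6 stopped toggling */
            return bus_read(addr) == data;
        prev = cur;
    }
    return false;                          /* timed out */
}
```

Even in the best case that is seven bus cycles per byte (four writes, two status reads, one verify read), which is why the cost of each individual bus cycle matters so much.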
The bad news is that the SPI GPIO expander significantly lowers performance. I'm not even counting any communication overhead yet, and a complete write takes about 1 minute and 45 seconds. If I only write to the two chips that aren't connected to the GPIO expander, it takes about 47 seconds. So that extra ~1 minute comes from my serial communication with the GPIO expander, which I can't really do anything about. The problem is that I have to write several data bytes as an unlock sequence before I write each byte to the chips. Plus, once I have written the byte to program, I have to read data back from the chips to determine when the byte has finished programming. Each read or write through the expander requires a 4-byte SPI transaction, which takes over 4 microseconds at an SPI clock rate of 8 MHz (32 bits at 8 MHz is 4 microseconds before any overhead), whereas the chips themselves can handle many of these unlock and status cycles with timings down in the tens of nanoseconds -- I just can't talk to the chips that fast over SPI.
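To put rough numbers on that, here is a back-of-the-envelope calculation in C. It assumes a complete write covers all 524,288 addresses of a 512 KB chip, and it uses the approximate timings above (105 seconds with the expander, 47 seconds without). The function names are made up for this example.

```c
/* Time in microseconds to clock n_bytes over SPI at clock_hz,
 * ignoring chip-select handling and any software overhead. */
double spi_transaction_us(unsigned n_bytes, double clock_hz)
{
    return n_bytes * 8.0 / clock_hz * 1e6;
}

/* Extra time per programmed word, in microseconds, when the
 * expander-connected chips are included in the write. */
double extra_us_per_word(double with_expander_s,
                         double without_expander_s,
                         unsigned long words)
{
    return (with_expander_s - without_expander_s) * 1e6 / (double)words;
}
```

Calling spi_transaction_us(4, 8e6) gives 4 microseconds, and extra_us_per_word(105.0, 47.0, 524288UL) gives roughly 110 microseconds. Under those assumptions, the expander costs the equivalent of about 27 bare SPI transactions for every word programmed.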
I probably should have used some kind of parallel device (a bidirectional latch or something like that? I don't know what the terminology would be) that connects to two 16-bit buses, and can save input from them and output to them as requested. That way I could have shared my 16 data lines with all 32 of the SIMM data lines, rather than using SPI for the other 16 data lines. I think that would have been faster (although I would have lost some of my electrical testing capabilities, I think, because it wouldn't have pull-ups built-in). Anyway, too late for that at this point.
It's still faster than my EEPROM burner, which takes over 2 minutes to write a single chip, but once I factor in communication overhead, I'm afraid it's going to perform worse. On the other hand, since all four chips are being flashed simultaneously, it's still going to be way faster than the EEPROM burner when you factor in the total time it takes to burn enough chips for the entire SIMM (not to mention the annoyance of removing/inserting PLCC chips).