ZaneKaminski
Well-known member
In this first post, I will articulate my vision of accelerator cards for the SE and SE/30. What follows is a bit verbose, but hopefully this proposal will lead us to a finished product. I plan to start developing this thing in a few months. Right now, I'm trying just to gather more information.
Anyone with a suggestion for functionality or implementation, please supply it. I want to make the right thing the first time haha.
Function of an Accelerator
An accelerator holds the main CPU in reset (puts it to sleep) and supplies another (faster) CPU, which I'll refer to as the "accelerated CPU." Now, the main hurdle to overcome with these accelerators is that even with a faster CPU, the rest of the board can't handle the increased clock speed, and moreover it is impossible for a PDS card to change the system clock or processor bus clock. The only way to do so would be to change the onboard crystal oscillator, and except in special cases (e.g. the IIsi), that would not work. As well, on the 128k through SE, the same oscillator drives the video and the CPU, so changing it would screw up the display timing.
So we have established that the original processor bus and accelerated CPU must reside in separate "clock domains." Therefore there must be some synchronization logic that allows a faster processor to interface with a slower bus while also providing an increase in performance. This synchronization is to be implemented in some "glue logic" residing in an FPGA or CPLD (basically a programmable piece of hardware) sitting "between" the new CPU and processor bus/PDS.
Now, the simplest way to achieve the requisite synchronization between the accelerated CPU and the stock processor bus would be to, whenever the accelerated CPU accesses memory (to fetch an instruction or to load/store data), perform that access at the speed of the stock bus, forcing the processor to wait until the slow access has completed. This is all fine and good, but with a 68000 (which has no onboard instruction or data caches), this strategy wouldn't yield an increase in performance! Indeed, if the accelerated clock is not a whole-number multiple of the bus clock, it will actually slow the computer down! Doing this with an '020 or '030 would yield a slight increase in performance, since these chips have (small) onboard caches, but this approach would not be nearly fast enough. So there must be a cache of main memory on the accelerator board, so that most of the time, the accelerated CPU does not need to access system memory, and instead can pull the data at high speed from its onboard cache.
Now, all vintage accelerator boards have some cache, usually implemented as 64k or so of high-speed SRAM. This is all well and good, but at some point, there will be a "cache miss," i.e. the data requested will not be present in the cache, and so the CPU must wait for the access to be performed at low speed. And then, if the cache is full, some data must be "evicted," and maybe your program was just about to use that data, and then it would have to wait again... So an effective caching strategy must be implemented. The situation is similar for writing, with one notable exception: every instruction must be fetched from RAM (ignoring the onboard caches of the '020+), so slow reads penalize every instruction executed, whereas slow writes only penalize the instructions that actually store data. The effect of slowing the writes is therefore not as bad. Still, we can do better.
So in order to design an effective caching strategy, we must understand which devices in the Macintosh can act as a "bus master," i.e. access RAM, and we must also note that there are peripheral devices which have control registers mapped into the address space. The values in these registers cannot be cached. Okay, in the 128k through SE, there are basically two bus masters: the CPU, with R/W access, and the "Burrell logic" servicing the sound and video, which only reads from memory, never writes. For the SE/30, the only bus master is the CPU. The video is fetched by an independent circuit that ticks along in phase with the 16 MHz pixel clock and fetches data from the "read side" of some dual-ported VRAM. (Though how is the sound data fetched? Maybe I'm wrong about the SE/30.)
Caching Strategy
Firstly, when 128 Mbytes of DDR2 or DDR3 costs 5 or 6 bucks, there is no reason not to cache the entire main memory of the Mac as well as the ROM. We could spend a long time developing and tuning a cache eviction strategy that works well for Macintosh, or we could just cache the whole memory and be done with it.
So when reading, we can refer to the high-speed cache of main memory/ROM, but not for the peripheral registers. Those cannot be cached, since they change with the state of the peripherals, so the accelerated CPU must wait for them to be accessed at low speed. No big deal. The only thing is that, at first, reads of the ROM (and maybe RAM too) would be "cache misses," in that when the Mac and accelerator card are booted, the card's cache is empty. So the cache would need to be filled first with at least the contents of ROM. There's no point maintaining a table of which cache entries are valid; just make sure the cache is correct when the accelerated CPU begins operating. The only gotcha is that I think the Macs rearrange the memory map during boot, so the cache logic must be sensitive to that. Or we could maintain page tables and valid bits and that sort of thing to keep track of which cache entries are meaningful, but I prefer the entire cache to be valid at all times.
Writing follows similar reasoning but is more complicated. The 128k through SE store their framebuffers in main memory, so writes to this area must occur at low speed. However, for the SE/30's entire main memory, and for all of the rest of the memory of the 128k through SE, writes don't even need to be stored in RAM. As long as the data stored by the accelerated CPU is maintained in the cache, it doesn't matter if that data is duplicated in the main memory, since nothing can access it. This isn't true of the SE/30's VRAM. Obviously if we want the display to ever change, we must actually write to the onboard VRAM. The peripheral registers must be written to as well. Now, we can perform a further optimization when writing to video memory and peripheral registers. We can queue one or more writes to the peripheral registers and VRAM, so that the processor doesn't have to wait for the operation to be completed unless the write queue is full. Therefore several write operations to video or peripherals can be performed serially with no penalty, and then the accelerated CPU can go on to do some arithmetic or logical operations while the values are actually being written. The only catch is that, when reading the peripherals, it must be ensured that all pending write operations have completed. VRAM can be cached and only the CPU can change it, so we can refer to the cache when reading VRAM immediately after a VRAM write operation has been queued. The write queue should be somewhat configurable, in terms of its size and whether it is enabled at all.
Hardware Implementation
For the SE, the best solution is an FPGA implementing the cache/glue and a 68000 soft core. This will provide the lowest cost, unless you insist upon an authentic 68k-series chip. The ao68000 project is apparently fully tested and runs in an Altera Cyclone II (a somewhat older model) at up to 82 MHz. It's pretty easy on the FPGA resources, too! The creators note, however, that some instructions take longer to execute in their core than on a true 68000. Whatever, with the accelerator card, the CPU doesn't have to contend with the "Burrell logic" making the processor wait (that would only slow down video/peripheral writes), so I'll call it 10x faster. Maybe the newer Cyclone V series can run it at an even higher clock speed, I dunno exactly.
For the SE/30, I think we want a 68030 at 50 MHz, and maybe a 68882 FPU (though it's not necessary; the existing one can be used, albeit at low speed, or we can try to get a soft core for it). An FPGA would still implement the cache and glue logic, but a physical 68030 would be more readily doable than a soft core, since I'm not sure a 68030 soft core even exists. We could do an '040, but I believe the 68040 accelerators for '030 machines have some software compatibility issues. (Maybe relating to the larger cache of the '040? Or something relating to the pipelining? I can't remember.) The other option is to try to utilize this new Apollo "68080" core. I dunno if they'll let us have it, but that would certainly make for the fastest SE/30 ever. And if they let us use their core, that would certainly lower the cost of the card. Here's the link for the Apollo: http://www.apollo-accelerators.com/files/Apollo_datasheet.pdf
Software Compatibility
Plug and play. There's nothing more to say. A well-designed accelerator like I have described is completely transparent to the peripherals and the video logic and all that. It's like running your emulator of choice at 8x speed. The system is completely usable. Animations are (well, should be) rendered during the vertical blanking interval, which will still occur every 1/60 of a second. The difference is that a lot more instructions can be executed between each frame. Go try it in Mini vMac if you have any doubts. The original Macintoshes had deterministic instruction execution timing, but with the caching architecture of the 68020+, routines that rely on exact timing are basically impossible anyway. So we don't need to worry about software compatibility too much.
More Cool Stuff
Well, we oughta choose an FPGA with some resources to spare, and integrate some other useful features onto the board as long as it will not be too costly: USB (mass storage, human interface device), VGA, stuff like that. VGA in particular is trivially easy with an FPGA. We can figure out how to make it work later. Driver software will have to be developed for the Mac OS in order to support these features. Also discussed was the possibility of an open-source reimplementation of the Micron Xceed.
Another benefit of this approach is that we can change out the contents of the ROM whenever we like.
With the caching strategy I have described, we can max out the memory of the machine, even when only the minimum is installed.
Aaaaand we can also switch between the accelerated CPU and the stock one at will, though the stock one would be bound to use the installed ROM. I will leave the algorithm for that as an exercise for a savvy reader; it is possible by hooking into the vertical blanking interrupt (in software) and reset vector of the processors.