Macintosh 68060 Redux

I don't know that 40MHz bus operation (80MHz CPU) would be attainable in Macs without the scarce and expensive late chips.
The LC 475 happily runs at a 40MHz bus speed with almost no changes. You just need to move one resistor to divide the clock to the SCSI chip to make sure you get stable SCSI and change the RAM/VRAM/ROM wait states to keep everything in spec. Which you know about :)

Edit - ah, you mean the CPUs, my bad. I thought you meant the host.
 
OK, I've got an answer to why BlockCopy performance was reduced, along with scrolling. Both of these routines use MOVE16 in order to get the maximum memory throughput if you've gotta move data through the CPU. This is a new instruction for the 68040 used to move entire lines (16 bytes, 4 longs) and avoid thrashing the caches. It's used in the optimized 68040 routines for BlockCopy and a few QuickDraw routines (scrolling, notably). And it turns out the timing is different on the 68060, as compared to the 68040.

Per Motorola docs:

060 takes 18 cycles for MOVE16 src/dst register postincrement. 11 cycles if cache hit (unlikely).
040 takes 14 cycles for MOVE16 src/dst register postincrement. Motorola seems to indicate this figure is relative to BCLK, but I don't quite believe that. It is likely a mix of PCLK and BCLK.

In both cases, these execution timings assume ideal 2-1-1-1 memory timing. The 33mhz Quadra 650 uses 5-2-2-2 (11 BCLK cycles) on a page burst read and 4-2-2-2 (10 cycles) on a page burst write. So on a memory to memory transfer we're taking (11 [read] + 10 [write] - 2 * 5 [Motorola assumed memory cycles]) * 2 [BCLK is half PCLK] = 22 additional PCLK cycles over Motorola specifications.

(14 [040 timing] + 22 extra)/(18 [060 timing] + 22 extra) = 68060 has 90% of the MOVE16 bandwidth compared to the 68040, exactly as I found in the System Info benchmarks. Easily verified, also: I wrote a quick MOVE16 test to move around 200MB from RAM to RAM as fast as possible, and it returned 91% on the 060 as compared to the 040. Exactly as predicted, so this is functioning as intended. Presumably some of the special behavior around this instruction has become more expensive internally to sequence.
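For anyone who wants to replay the arithmetic above, here is a minimal Python sketch. It only restates the figures quoted in this post; the constant and function names are made up for illustration, and it treats Motorola's MOVE16 counts as PCLK cycles, which is the same assumption the calculation above makes.

```python
# MOVE16 cycle counts quoted above, treated as PCLK cycles (an assumption,
# matching the calculation in the post) with ideal 2-1-1-1 memory.
MOVE16_040 = 14
MOVE16_060 = 18

# Quadra 650 burst timings at 33 MHz, in BCLK cycles.
READ_BURST = 5 + 2 + 2 + 2    # 11 cycles for a page burst read
WRITE_BURST = 4 + 2 + 2 + 2   # 10 cycles for a page burst write
IDEAL_BURST = 2 + 1 + 1 + 1   # 5 cycles, the timing Motorola assumes

PCLK_PER_BCLK = 2  # BCLK runs at half of PCLK


def extra_pclk_cycles():
    """PCLK cycles added by real memory over the ideal read + write bursts."""
    extra_bclk = READ_BURST + WRITE_BURST - 2 * IDEAL_BURST
    return extra_bclk * PCLK_PER_BCLK


def relative_move16_bandwidth():
    """060 MOVE16 throughput as a fraction of 040 throughput at the same clock."""
    extra = extra_pclk_cycles()
    return (MOVE16_040 + extra) / (MOVE16_060 + extra)


if __name__ == "__main__":
    print("extra PCLK cycles:", extra_pclk_cycles())
    print(f"060 vs 040 MOVE16 bandwidth: {relative_move16_bandwidth():.0%}")
```

The point of writing it out is that the memory wait states dominate both totals, which is why a 4 cycle difference in the instruction itself only shows up as roughly a 10% difference in throughput.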

As that last mystery has been solved, I believe I'm getting the most performance out of the 060 itself so some benchmarks are in order. All tests are done with 68MB of interleaved RAM, 32mhz BCLK (early 060 can't quite manage 33.3), System 7.1, and a ZuluSCSI Slim. I didn't bother with Disk tests as they're essentially unchanging. FPU tests also excluded as I still haven't gotten that working and probably won't bother.

Note that the below numbers are all using the standard Quadra 650 timings for 33mhz: there's definitely more performance on the table by optimizing but this is a CPU comparison rather than trying to milk as much performance out of the Wombat platform :)

Norton System Info 3.2.1. Detailed explanations of each subtest in attached text file.

060 sys info.jpg

Speedometer 4. Disclaimer: I maintain that Speedometer is a crap benchmark for measuring anything faster than a 68030. Almost all of its tests don't test much RAM and fit in CPU caches so it is of very limited use to measure overall system performance. Nevertheless, here's a screenshot of a comparison at 32mhz. Note that KWhet and Math tests hit SANE so these will differ from other 040 benchmarks with a FPU.

1763655742021.jpeg

Finally, the most interesting stuff: a collection of real-world application benchmarks.

1763654497242.png

Some clarification for those unfamiliar: MacDoom is unoptimized and can't be reasonably compared to other platforms as it performs poorly. Take those numbers as a relative comparison only, and one that's highly dependent on memory performance.

MacBench 3 graphs are attached. The CPU performance benchmark is a black box; it doesn't seem to test memory all that heavily. I would take it with a grain of salt for real world performance. The QuickDraw benchmarks, however, are handy. The 6100 they use as a baseline supposedly doesn't have an L2 cache installed, so it would be best compared to the 060 without FPU.

Altogether, it looks to me like the 68060 is delivering on the 1.6-1.7x 68040 performance promised by Motorola, clock for (P)clock. Motorola was very squirrely about which clock on the 040 actually governs internal execution; based on my observations I believe it's the higher clock. Motorola did no favors by trying to mask that the 68040 is essentially a clock-doubled 030 with massively improved caches and bus interface.

Thoughts on these numbers.... the 060 clearly could benefit from additional memory throughput, it seems to have performance to spare but is limited by memory. The L2 cache on the 060 above is running at bus speed; even so it's possible to notice where it picks up the slack. The cache supports 2-1-1-1 (5) read cycles on a hit compared to the 5-2-2-2 (11) of the stock RAM. However, the cache introduces a 1 clock penalty on cache misses that have to go on to RAM: that can be observed in the System Info BlockMove benchmarks.
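To put rough numbers on that trade-off, here is a small Python sketch of the average burst read cost as a function of L2 hit rate. The cycle counts are the ones quoted above; the function names and the hit rates you feed it are hypothetical, and it assumes the 1 clock miss penalty is paid once per burst.

```python
# Burst read costs in bus clocks, taken from the figures quoted above.
L2_HIT_CYCLES = 2 + 1 + 1 + 1      # 5 cycles on an L2 hit
RAM_BURST_CYCLES = 5 + 2 + 2 + 2   # 11 cycles from stock RAM, no cache fitted
MISS_PENALTY = 1                   # assumed to be paid once per missed burst


def average_read_cycles(hit_rate):
    """Average bus clocks per burst read with the L2 cache installed."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_cycles = RAM_BURST_CYCLES + MISS_PENALTY
    return hit_rate * L2_HIT_CYCLES + (1.0 - hit_rate) * miss_cycles


def break_even_hit_rate():
    """Hit rate at which the cache exactly pays back its miss penalty."""
    miss_cycles = RAM_BURST_CYCLES + MISS_PENALTY
    return MISS_PENALTY / (miss_cycles - L2_HIT_CYCLES)


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9):  # hypothetical hit rates, not measurements
        print(f"hit rate {rate:.0%}: {average_read_cycles(rate):.2f} cycles")
    print(f"break-even hit rate: {break_even_hit_rate():.1%}")
```

Under these assumptions the cache only needs to hit about one read in seven to come out ahead, while a streaming copy that never hits pays the extra clock on every burst, which would be consistent with the BlockMove result.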

The 68040 uses a 32 bit bus running at half core speed. In a system designed for the 040, memory performance between the 040 and 060 should be similar for the same bus clock if we exclude the MOVE16 special case. However, unlike the 68040, the 68060 is able to support a full speed bus. I don't know what Amiga 68060 accelerators with onboard RAM typically do, but I would assume they are running the onboard RAM at full speed for much better RAM performance.

Most Pentium systems are going to have a 64 bit data bus running at full speed (60mhz+), so there's a clear advantage to the Pentium on memory throughput and latency. A Pentium Overdrive in a 486 system would be an interesting matchup as that's going to reduce both bus speed and width to something comparable to the Wombat. Early PowerPC Macs have a 64 bit RAM/ROM bus running at half speed; this is a fairer comparison but favors the PPC when running native code.

As usual, I don't promise this to be in any way usable and don't intend to polish it to the point where I would be comfortable recommending it as something to actually use. With that in mind, it does seem surprisingly usable and stable in the limited testing I've done. The source is up on GitHub and should work with LC060 CPUs now. Unrelated: after putting the 060ISP in place of the unused 040FPSP I found the system will boot and work in 24 bit mode. Pointless but interesting to confirm.
 

To pre-empt the question of what remains to be done to make this usable:
  • Slot Manager, SCSI Manager, and DerefHandle all need to be patched in ROM to reflect the new bus (access) error stack frames and the possibility of branch prediction access errors (BPE). Possibly more routines: any bus error handler that looks at the content of the stack frame or might run while branch prediction is enabled requires an update.
  • Mac OS is overriding our bus error handler, and possibly other vectors too
  • Any use of PTEST by applications or Systems will be immediately fatal and must be patched or emulated
  • FPU / FPSP currently causes a lock at finder load
  • (In)SANE / ΩSANE requires updates to support new FPSP vectors
  • Anything that looks at content of FPU stack frames must be patched (SANE)
  • Newer versions of Mac OS (7.5+) throw a F-line exception with FPU disabled and don't work with it enabled
  • Test networking, nubus cards, appletalk, etc - all of these I've not tested.
  • Lots and lots of application testing
I would posit that a user-space component is required here, similar to what was done for 68030 accelerators in 68000 systems like the SE and for 68040s in early 68030 systems. Something that takes an application blacklist and enables/disables CPU features according to known issues, and perhaps could patch applications at runtime. Additionally it would need to patch the System itself and mitigate issues like the bus error handler getting messed with. It's not quite as dire a situation as Amiga has to deal with, I think, but still very nontrivial.
 
Would a hardware solution sitting between the 68060 and the rest of the 040/030-based machine be something the Amiga accelerators used? Some sort of compatibility layer?
 
Amiga 060 accelerators typically have a tiny ROM that patches things enough for Kickstart to work. Some of the newer designs skip this and need a patched/post-3.1 version of Kickstart.
 
So I wonder if something similar could be done with Mac ROMs
Not sensibly on a native 040 system. Earlier accelerators for Macintosh (the Turbo 040, for example) and Amiga ones can take advantage of the presence of an 020/030 bus, which allows a single 8 bit ROM to (slowly) supply 32 bits of data across all byte lanes using dynamic bus sizing.

The 040 and 060 lack dynamic bus sizing so any adapter with onboard ROM would require a full 32 bit wide ROM and/or multiple chips in order to directly execute code from ROM. At that point you may as well just install/utilize the ROM socket on the logic board with a modified ROM.
 
Here is what my Amiga 4000 with a BFG9060 at 100 MHz (68060 Rev 6 overclocked) and ZZ9000 HDMI RTG card looks like using ShapeShifter. This uses the built in IDE for disk access- when I use my 4091 SCSI-2 card disk access is faster.

IMG_9893-2.jpg
IMG_9894-2.jpg
 
Thanks for that! Here are the numbers adjusted for clock speed. As expected, it's essentially within the margin of error since Speedometer tests fit within CPU caches. The small variance can be explained by the Amiga having more going on in the background. FPU tests have been left out (as I didn't enable it), and disk and graphics also, since those would be virtualized/emulated.

If you have time, I would be curious to see the detailed System Info numbers as that has several tests that exercise the memory subsystem - I expect the BFG card to run away with those due to the faster more modern memory. The various graphics subtests would also be interesting to compare since that'll be the major weakness of emulating.

1764904081728.png
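For reference, clock adjustment of this kind can be sketched in a few lines of Python. This is not the spreadsheet behind the chart, just an illustration: the function name is hypothetical, the example score is invented, and the linear scaling only makes sense for tests that run out of the CPU caches.

```python
def scale_to_clock(score, measured_mhz, target_mhz):
    """Linearly rescale a benchmark score from one CPU clock to another.

    Only meaningful for cache-resident tests, where the score tracks the
    core clock rather than the memory subsystem.
    """
    if measured_mhz <= 0 or target_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    return score * target_mhz / measured_mhz


if __name__ == "__main__":
    # Hypothetical score of 10.0 measured at 100 MHz, rescaled to 64 MHz
    # (64 MHz assumes a 2x core multiplier on a 32 MHz bus).
    print(scale_to_clock(10.0, 100.0, 64.0))
```

Memory-bound subtests will not follow this line, which is exactly why the detailed System Info numbers are the more interesting comparison.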

The followup question becomes, with 1994 memory technology, would it have been possible to do a memory subsystem supporting a full speed 060 bus? The Quadra 800 already expected 60ns DRAM and IIRC that would have been high end at the time. I don't know my memory history all that well but I don't think there were huge improvements to latency back then... It'd be "easy" to just scale all the timings for higher clocks but then memory performance is going to remain more or less the same. Possibly one could look at period Pentium designs' timings since the early ones operated with either 1x or odd fractional multipliers to keep max bus speed around 50-60mhz (AFAIK).

To get some real world numbers I could rig the Q650 for a 20mhz bus (2x multiplier) and compare to a 40mhz bus (1x multiplier) but I'd have to disassemble and hack on the adapter to make that possible. And of course hack the ROM a little to force particular timings. Maybe a rainy day project... There's not a huge point in digging into the nitty gritty like this, but I enjoy it.
 
I think it's hard to do much better, yeah. It's already interleaving and I don't think memory of the era ever really went below 50ns?

I'm out of my depth, but maybe with a custom quad-interleave memory controller it could keep up? Multi-channel wouldn't be period, that's too late. I don't know if anyone used quad-interleave but at least that's an extension of what was already possible. But I'm not sure that fits the spirit of your technical challenge here.
 
I will do my best to get the Norton numbers- may you please send me the exact version you are using?

Tests for different video modes aren’t really easy because the video mode and color depth is preset before starting shapeshifter.

I expect the performance of your hack to be roughly equivalent to an A3660 card at 66 MHz. The BFG at 100MHz is a serious card and can run Quake at playable framerates. I think it has the best memory speed of all Amiga 4000 accelerator cards, for both fast memory and chip memory access.

Thank you
 
Same - DRAM is not my specialty either. I've been procrastinating digging into DRAM for a while now. I can figure timings, though.

Given the current timings assume 60ns DRAM, I'd assume that 50mhz would probably require 6-3-3-3 or possibly 7-3-3-3 to keep the absolute access times roughly the same. That means performance is going to be more or less identical, except where some time is saved due to the finer clock granularity (seen below when comparing 40mhz burst to 50mhz) - so the problem remains...

1764944652744.png 1764944134642.png
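The same comparison can be worked out in wall-clock terms with a short Python sketch. The timing tuples are the ones mentioned above; the helper names are hypothetical, and the 16 byte line size is the 040/060 burst length.

```python
def burst_time_ns(timing, bus_mhz):
    """Wall-clock time of one burst, given its per-beat cycle counts."""
    cycles = sum(timing)
    return cycles * 1000.0 / bus_mhz  # one cycle lasts 1000 / MHz nanoseconds


def burst_bandwidth_mb_s(timing, bus_mhz, line_bytes=16):
    """Peak burst bandwidth in MB/s (10**6 bytes per second)."""
    # bytes per nanosecond times 1000 gives MB/s
    return line_bytes / burst_time_ns(timing, bus_mhz) * 1000.0


if __name__ == "__main__":
    cases = [
        ("5-2-2-2 @ 33 MHz (stock Q650 read)", (5, 2, 2, 2), 33.0),
        ("6-3-3-3 @ 50 MHz (guessed)", (6, 3, 3, 3), 50.0),
        ("7-3-3-3 @ 50 MHz (guessed)", (7, 3, 3, 3), 50.0),
    ]
    for label, timing, mhz in cases:
        ns = burst_time_ns(timing, mhz)
        mb = burst_bandwidth_mb_s(timing, mhz)
        print(f"{label}: {ns:.0f} ns per line, {mb:.1f} MB/s")
```

With these guessed timings a burst read takes about 333 ns at 33mhz versus 300 to 320 ns at 50mhz, so the faster bus buys only a few percent unless the DRAM itself gets faster.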

I suppose the realistic solution for the period would have been adding L2 cache. As seen above even a 2-1-1-1 cache on half speed bus had notable benefits. With the flexibility of the CLKEN function on the 060, I think it'd be practical to implement L2 cache in such a way that cache accesses could take place in a full speed clock domain (so 50 - 75mhz) and on cache miss use CLKEN and bus buffers until bus cycle concludes. This is similar to what I do on my 040 hyperdrive accelerator but simpler because of the CLKEN function. With this trick you could probably manage 3-1-1-1 (@ 50 - 75mhz) cache accesses.

Of course, you then run into the usual coherency problems: either you update the OS/ROMs to have a function to invalidate L2 cache entries where required (how NeXT handled it) or snoop 24/7 (how DayStar/Diimo handled it), which costs performance. PPC chips had hardware help for L2 caches so the software didn't have to deliberately handle L2 cache maintenance. 68K didn't have that, so any time you used cpush/cinv instructions you'd need to address the L2 cache and tell it to invalidate entries too... or you snoop all bus traffic and update cache entries that way.

I've attached a binhex of System Info 3.2.1. There's not much of a point in testing the other video modes - 8bpp makes the most sense. This is one place where emulation is expected to lose to the original hardware though. I'm assuming the emulator dev had all sorts of fun translating the Mac packed pixel framebuffer to the planar framebuffers of the Amiga and emulating CLUT behaviors.

Real world performance should be notably better than an A3640/60; the A3640 is something analogous to a Turbo/Carrera 040 in a Mac IIci with no L2 cache. A pretty dire scenario: no line writes, no interleaving, and an 030-040 bus translation penalty. The closest thing you'd find to the Quadra 650-060 would be a period Amiga accelerator with an 040/060 and onboard 72 pin SIMMs, but I'm not sure if such a beast existed. 040 type bus, half speed, supporting line reads/writes and interleaved asynchronous RAM.

The modern Amiga 060 accelerators all seem to use SDRAM or DDR and presumably run the local 060 bus at full speed, so they're going to have much better performance compared to earlier hardware.
 

Sure- I had an A3660 card with a 64MB local ram module. So those do exist.

Please see system info below for BFG9060 at 100MHz.

Please note that resolution is 720p 8bit on graphics. Let me try 640x480 and see if graphics scores improve

IMG_1724.jpg
IMG_1725.jpg
 
Thanks for those numbers! Below are the CPU numbers with scaling. As expected, emulated QuickDraw performance is dire; there's not really any way around that. For your A3660, it looks like there are some add-on modules that fit in the CPU socket to add local RAM that is not part of the A3640/A3660 design. Probably that's what you have? "Real Power RAM"?

1764964367719.png

Conclusion.... as expected, the BFG board has a huge advantage in memory throughput of nearly 2x, as would be expected by using SDRAM at full bus speed with better timings. However, in the real world tests (Sort, Tree, Search) the difference isn't devastating, and the "slow" L2 cache makes up almost all of the ground by masking the Q650's higher latency on RAM reads. The instruction test is an outlier as that tests memory latency heavily and a write-through cache won't help at all there, but CPUs generally spend more time reading data and processing it than they do writing or doing large block copies. There's a few whitepapers floating out there which explore this in more detail as it pertains to cache implementations.

This confirms in my opinion that the period appropriate solution would have been to add a L2 cache. I have successfully re-invented the wheel :)
 
There are several weird A3640/3660 add ons to add CPU RAM. Rev 0.1 of the Z3660 also did that. The problem is they disrupt Zorro III DMA. I no longer own any of those cards.

BFG 9060 is a great card that is fully DMA capable.

A period accurate 68060 card with 72 pin SIMMs is the Cyberstorm Mk III.
 
This is correct, the BFG has the fastest RAM access of all current and past Amiga accelerators - in the region of 90MB/s for long word reads. Chip mem access speed maxes out at 7MB/s too.
 