
Macintosh 68060 Redux

Even Wolfenstein 3D for Mac doesn't run super great on 68k Macs, and that's a game that runs on 386s on the PC side.
Seems they both had too much focus on PowerPC optimization.
Much akin to Intel Macs, or PowerPC Macs, during the transition to the next platform. Why would anyone spend time developing for a dead platform?

I'm impressed with the numbers and then much less so. Clock for clock, I was reading that an 060 should be outperforming an 040 by 2 or even 3 times. There must be something else going on there.

Not being negative, I'm very impressed you got it to a point where it boots Mac OS and runs benchmarks. Neat as heck. Thank you! Interesting news in an otherwise depressing week for me.
 
The '060 does perform much better than a similarly clocked '040:

Code:
BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          6.4586  :       0.17  :       0.05
STRING SORT         :         0.84488  :       0.38  :       0.06
BITFIELD            :      1.8995e+06  :       0.33  :       0.07
FP EMULATION        :          1.0745  :       0.52  :       0.12
FOURIER             :          64.008  :       0.07  :       0.04
ASSIGNMENT          :         0.17618  :       0.67  :       0.17
IDEA                :          24.606  :       0.38  :       0.11
HUFFMAN             :          13.158  :       0.36  :       0.12
NEURAL NET          :        0.080567  :       0.13  :       0.05
LU DECOMPOSITION    :          4.8719  :       0.25  :       0.18
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 0.371
FLOATING-POINT INDEX: 0.133
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : NetBSD 11.0_BETA
C compiler          : gcc version 12.5.0 (nb1 20250721)
libc                :
MEMORY INDEX        : 0.088
INTEGER INDEX       : 0.096
FLOATING-POINT INDEX: 0.074

Compare that with:

Code:
TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          21.343  :       0.55  :       0.18
STRING SORT         :          1.8755  :       0.84  :       0.13
BITFIELD            :      1.0275e+07  :       1.76  :       0.37
FP EMULATION        :          2.8269  :       1.36  :       0.31
FOURIER             :          182.44  :       0.21  :       0.12
ASSIGNMENT          :         0.43796  :       1.67  :       0.43
IDEA                :          113.77  :       1.74  :       0.52
HUFFMAN             :          41.017  :       1.14  :       0.36
NEURAL NET          :         0.19577  :       0.31  :       0.13
LU DECOMPOSITION    :          11.413  :       0.59  :       0.43
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 1.202
FLOATING-POINT INDEX: 0.338
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 
L2 Cache            : 
OS                  : NetBSD 10.0_STABLE
C compiler          : gcc version 10.5.0 (nb3 20231008) 
libc                : 
MEMORY INDEX        : 0.274
INTEGER INDEX       : 0.321
FLOATING-POINT INDEX: 0.187

The '060 is 3.1 times faster on the memory index, 3.34 times faster on integer, and 2.5 times faster on floating point.

The '040 is running at 36 MHz, and the '060 at 66 MHz, so taking clock into account, the '060 is 1.7 times (memory), 1.8 times (integer) and 1.36 times (floating) faster at the same clock, to give a rough idea.
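
For anyone wanting to redo that arithmetic, here is a minimal Python sketch. The index values are copied from the two result listings above and the clocks are the 36 MHz and 66 MHz quoted in this post; the variable and function names are made up for illustration. It assumes performance scales linearly with clock, which is only a rough approximation.

```python
# Index values copied from the two BYTEmark listings above ('040 first, '060 second).
idx_040 = {"memory": 0.088, "integer": 0.096, "floating-point": 0.074}
idx_060 = {"memory": 0.274, "integer": 0.321, "floating-point": 0.187}

MHZ_040 = 36.0  # clock quoted for the '040
MHZ_060 = 66.0  # clock quoted for the '060


def speedups(slow, fast, slow_mhz, fast_mhz):
    """Return {index name: (raw ratio, ratio normalised to equal clock)}."""
    clock_ratio = fast_mhz / slow_mhz
    return {
        name: (fast[name] / slow[name], fast[name] / slow[name] / clock_ratio)
        for name in slow
    }


results = speedups(idx_040, idx_060, MHZ_040, MHZ_060)
for name, (raw, per_clock) in results.items():
    print(f"{name:15s} raw x{raw:.2f}  per-clock x{per_clock:.2f}")
```

Because the sketch divides the unrounded ratios, the floating-point figure lands at about 1.38 rather than the 1.36 obtained from the rounded 2.5.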
 
I did some quick Doom benchmarks as well. Drumroll please:

040@25mhz - 10.4 fps
060@50mhz superscalar+no branch prediction - 12.4 fps
060@50mhz superscalar+branch prediction - 14.3 fps
040@40mhz - 16 fps
060@66mhz superscalar+no branch prediction - 16.4 fps
060@66mhz superscalar+branch prediction - ??? fps

... it's just that badly optimized. This is a small window and low detail also. Still, it's clear the branch cache is very beneficial. My best guess is that Doom is currently extremely memory-bandwidth bound and is likely thrashing the internal caches.

These 040 numbers are about what I'd expect for a machine of that era and are on par with other contemporary machines. However, these 060 numbers are far below what I'd expect out of that CPU; the 060 should be getting into the upper 20s. There's definitely something weird going on here and I don't think it's caused by Doom. If it were just a poorly optimized program, it should affect both CPUs and you'd still see the expected performance gap between them. Instead we see an 060 that struggles to outperform a slower CPU.
 
To re-iterate: this is all extremely preliminary, so that wouldn't surprise me at all. I've already noted that memory performance as measured by Norton System Info is lower than expected, and it remains to be seen why that is. It seems to carry over into the video benchmarks (CopyBits large and scrolling), as both simply move lots of data as fast as possible (using the CPU, not DMA or offload), so something does seem odd.

Do keep in mind that video and memory performance is nearly linear to bus clock on these machines, unlike the Amiga, and for simply blasting bits to/from RAM/VRAM the 060 would be expected to be similar to an 040 at the same bus clock. For all those numbers I quoted, the bus clock is either equal to the CPU speed (040) or half the CPU speed (060), so that 040 at 40mhz is going to have significantly more bandwidth than the 060 on a 25mhz bus.
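
To put rough numbers on that, the sketch below derives the bus clock for each Doom run quoted earlier (bus equals CPU clock on the 040, half the CPU clock on the 060, as described above) and divides the frame rate by it. The fps figures are copied from the earlier list; treating "66mhz" as exactly 66 and the helper names are assumptions for illustration only.

```python
# (cpu, cpu_mhz, note, fps) copied from the Doom figures earlier in the thread.
runs = [
    ("040", 25, "", 10.4),
    ("060", 50, "no branch prediction", 12.4),
    ("060", 50, "branch prediction", 14.3),
    ("040", 40, "", 16.0),
    ("060", 66, "no branch prediction", 16.4),
]


def bus_mhz(cpu, cpu_mhz):
    """Bus clock as described above: 1:1 on the 040, half the CPU clock on the 060."""
    return cpu_mhz if cpu == "040" else cpu_mhz / 2


# Frame rate per MHz of bus clock for each run.
fps_per_bus_mhz = [(cpu, mhz, fps / bus_mhz(cpu, mhz)) for cpu, mhz, _, fps in runs]

for (cpu, mhz, note, fps), (_, _, ratio) in zip(runs, fps_per_bus_mhz):
    print(f"{cpu}@{mhz}MHz {note:22s} bus {bus_mhz(cpu, mhz):4.1f} MHz  {ratio:.2f} fps per bus MHz")
```

Looked at per bus MHz, every 060 run in that list comes out ahead of both 040 runs, which would fit the idea that Doom is limited more by the bus than by the core. It is a crude normalisation, so treat it as a sanity check rather than a measurement.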

I am not personally familiar with ByteMark so I don't know which benchmarks would be more affected by memory bandwidth over performance in CPU caches... maybe there's a mac version I can try.
 
There's definitely something up with the memory access. I've been futzing around with clocking today and inexplicably at 64 mhz (vs 66.6mhz before) the memory access issue clears up and the results of the blockcopy and copybits routines are then similar or better than the 040, as expected.

I need to check the configuration of the djMEMC; perhaps it's getting programmed incorrectly somehow. The timings should be a product of the Gestalt ID only (no dynamic cycle timing à la NeXT), so I don't know why that would be, but it's repeatable on my test board. I've not yet moved the 060 over to my second Wombat board to verify there. The 040/060 bus wouldn't tolerate a slipped clock cycle, so that rules out some oddity with CLKEN, I think.... more digging to be done.

64 mhz is also just slow enough that the branch cache works without crashing on my CPUs, turning in 17.6 fps in Doom or 19 fps with L2 cache added. Not going to grab a full set of benchmarks right this moment but it's noticeable in the results.

I've missed getting lost in the weeds like this. While I'm definitely not committed to seeing this through - this is all very rough and ready experimentation - it's certainly looking more and more like there'd be potential to run Mac OS on a 68060 in a practical sense. Still a lot of major issues though; for instance, I haven't done anything at all about floating point. However, running with an LC CPU (or disabling the FPU via the PCR) would probably be the more reasonable solution given how seldom the FPU is used.
 
Keep going!

Once you get to a certain point where it works relatively well but needs optimization and refinement, hopefully it will encourage others to build on your good work here.

Who knows, maybe you can work on getting an 88k CPU working next!

 
Amiga PiStorm 32 doesn't have fully working MMU BTW. Neither does the Vampire. The devs just can't be botha'd. Kinda makes it a non-starter for MacOS. I have some experience with 060 on Atari and that's mostly fine since the Falcon breaks so much ST software anyway. I know of people trying to replicate the X68000 060 accelerators but the hardware for that is far from mature or stable. Beyond that the other 68k-based platforms don't have too many 060 update attempts; nothing's really been tried for NeXT, for example.

The Amiga makes updating to 060 look much easier than it actually is. A LOT of software on Amiga dies on 060 unless patched, and I could see a lot of Mac software sharing the same fate. The Amiga hugely benefits from having had the 060 for almost three decades; the very last A4000Ts in fact shipped from the factory with 060 CPUs offered as an official option. Virtually the entire game library was patched via WHDLoad, and NewTek and friends kept updating the software/hardware that kept the Amiga alive in its little video/music professional niches well past that point. (Another reason for 060 adoption was that the PowerPC Amiga "transition" was an expensive farce, with numerous companies trying over the last two decades to rebadge overpriced surplus PPC router SoCs as "next-generation Amiga".)

How bad is the stability currently, and what games besides Doom run? Any of the Marathons/Ambrosia titles?
 
I figured out the tweak necessary for the ROM to correctly operate with FPU-optional mode, allowing for LC cpus or full CPUs with the FPU disabled in software. This allows me to run quite a bit more now as we aren't crashing the second something goes for floating point code. Hard to say anything on stability as this 060 is marginal at 64mhz, but freezes are very infrequent and of unclear cause. I'd sooner attribute it to the CPU.

Fun fact, the 060 would sometimes fail to boot at room temperature at 66.66mhz. Resetting would not change anything, a full power cycle would be required to try again. It gets worse with cooling: a peltier would always prevent it from booting.

I did some quick tracing with the ISP last night and the ISP is hit to a surprising degree - even on mouse movement, which I found interesting. There are several big issues that still need to be resolved; several routines are messing with the bus error handler, which has to stay in place to run safely with branch prediction on. No progress on why memory is somewhat slower, except I found that my 040 numbers were coming from a Quadra 605 ROM, not a 650 ROM, and the 605 ROM uses the tighter Quadra 800 memory timings. There still seems to be a bit of a gap compared to the 040, but not as bad.

Marathon works, and gets a nice improvement.

1763175933665.png

Speedometer is a crap benchmark for measuring anything faster than a 68030 as almost all of its tests are tiny and fit in CPU caches. So it is of limited use to measure overall system performance. Nevertheless, here's a screenshot of a comparison at 32mhz. Note that KWhet and Math tests hit SANE so the disabled FPU on the 060 will influence those.

1763176184804.jpeg

I have done some quick software testing with all the software I had on my SE/30 disk image. Again I am in no way suggesting this is usable, but surprisingly I didn't find anything that failed. The below is what I tested and it all seems to launch and work at least in very cursory testing.

ClarisWorks 4, Photoshop 3, StuffIt Deluxe 5.5, BBEdit, JPEGView, Symantec C 7, SoftWindows, ResEdit, MPW
A-Train, Beyond Dark Castle, Dark Castle, Civilization, Cosmic Osmo, Crystal Quest, F/A-18 Hornet, Flight Simulator 4
Hellcats, Indiana Jones 3, Lemmings, Maelstrom 1.4, Oregon Trail, Pax Imperia 1.5, Prince of Persia, Railroad Tycoon
SimAnt, SimCity 1.4, SimCity 2000, Test Drive 2, Tetris, Abuse, Escape Velocity I, Marathon, Out of This World
SimTower, Slick Willie 3, The Incredible Machine 3, Doom, Warcraft, Wolfenstein
 
No progress on why memory is somewhat slower, except I found that my 040 numbers were coming from a Quadra 605 rom not a 650 which uses the tighter Quadra 800 memory timings. Still seems to be a bit of a gap compared to the 040 but not as bad.
That depends on the machine ID as well. Q800 memory timings are even tighter than Q650 as it's tuned for 70ns memory, using the same ROM. (IIRC, if you load an LC475 ROM you get the Q800 timings for both, oddly enough).
 
The '060 does perform much better than a similarly clocked '040:

[…]
OS : NetBSD 11.0_BETA
C compiler : gcc version 12.5.0 (nb1 20250721)

OS : NetBSD 10.0_STABLE
C compiler : gcc version 10.5.0 (nb3 20231008)
I’m not questioning the rough results in judging the 060 vs. the 040, but the above is a bit vague and compares apples and peaches: different OS versions under the benchmark (with potentially different cache or locking behaviour), and very different compilers. Doesn't that add quite a range of variation to the benchmark results?
 
That depends on the machine ID as well. Q800 memory timings are even tighter than Q650 as it's tuned for 70ns memory, using the same ROM. (IIRC, if you load an LC475 ROM you get the Q800 timings for both, oddly enough).
Exactly what happened. So all the 040 memory scores were somewhat higher than they should have been given I'm trying to compare 040 to 060 rather than milk every bit of performance out of the Wombat platform. (Yet.)

Comparing the 68060 ROM vs the unmodified ROM, there's still an interesting difference in blockcopy, but it's more like 10% rather than 20%. I verified with @cy384 's Wombat tool that the djMEMC registers are identical. It might be that I need to verify my cache flush patch, as a flush is performed during blockmoves.

A challenge I'll put out for some workshopping: Apple loves screwing with the bus (access) error handler, in ROM code as well as in System and even certain pieces of application code. Notably Slot Manager, SCSI Manager 4.3 and derefhandle are changing it, then Mac OS itself later in the boot process. Functionally this doesn’t seem to be causing major issues right this minute, but it's a no-no to do when branch prediction is enabled, as there is the possibility of receiving spurious bus (access) errors due to branch prediction. The correct behavior in such a scenario is to check a new bit in the stack frame, then clear the branch prediction cache if it is set, and return if there's not an actual bus error.

I think that figuring out a way to protect/intercept certain handlers from being changed is needed. Ideally one would chain handlers so we first check in new ROM code whether it's an actual bus error or a branch prediction error, and then hand off to whatever handler Apple last installed if there's a real error. Mostly, futzing with the handler is done very temporarily to, say, probe whether a piece of hardware exists, and checking the fields isn't actually required. So it is not a big issue from an 040 to 060 migration standpoint, aside from those cases I mentioned which do so.
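
The chaining idea could look roughly like the model below. It is Python purely to show the control flow; a real handler would be 68k assembly in ROM. The bit position used for the branch prediction error flag, the rule that "no other status bits set" means no real error, and all the names (`FSLW_BPE`, `AccessErrorChain`, `clear_branch_cache`) are assumptions for illustration and should be checked against the 68060 user's manual.

```python
FSLW_BPE = 1 << 2  # ASSUMED position of the branch-prediction-error bit in the fault status word


class AccessErrorChain:
    """Model of a ROM-owned access error handler that sits in front of the last installed one."""

    def __init__(self, default_handler):
        self.installed = default_handler  # what ROM/System/app code believes is the vector
        self.branch_cache_clears = 0

    def install(self, handler):
        """Stand-in for code that overwrites the bus error vector; the chain stays in front."""
        previous, self.installed = self.installed, handler
        return previous

    def clear_branch_cache(self):
        # On hardware this would be a cache control register write; here we only count calls.
        self.branch_cache_clears += 1

    def dispatch(self, fslw):
        """Called for every access error with the status word taken from the stack frame."""
        if fslw & FSLW_BPE:
            self.clear_branch_cache()
            if (fslw & ~FSLW_BPE) == 0:
                # ASSUMPTION: nothing else flagged, so this was a spurious error; just resume.
                return "resumed"
        # Genuine access error: hand off to whatever handler was installed last.
        return self.installed(fslw)


chain = AccessErrorChain(lambda fslw: "system handler")
chain.install(lambda fslw: "slot probe handler")
print(chain.dispatch(FSLW_BPE))  # spurious branch prediction error
print(chain.dispatch(1 << 5))    # some other fault bit, treated as a real error
```

The model glosses over the hard part, which is how `install` gets intercepted in the first place when code writes the vector table directly.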

The Slot Manager bus error handler needs updating anyways, as it needs to look at the fields of the stack frame, and the 060 differs massively here. Same for the SCSI bus error handler. Some mitigation is also possible by not turning on branch prediction until later in boot, perhaps when the Finder loads. For any kind of final version of 060 support it's probably necessary to have a user space component that handles processor features according to a blacklist of running applications anyways. So you can disable features when there's a known conflict with particular applications, similar to what was done with 030 accelerators on an SE.
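
As a minimal sketch of that blacklist idea, assuming nothing about how it would actually hook into Mac OS: the feature names come from this thread, while the application names and the `enabled_features` helper are hypothetical placeholders, not known conflicts.

```python
# Processor features discussed in this thread that could be toggled.
ALL_FEATURES = frozenset({"superscalar", "branch_prediction", "fpu"})

# HYPOTHETICAL entries for illustration; no real conflicts are implied.
BLACKLIST = {
    "ExampleApp": {"branch_prediction"},
    "OtherApp": {"branch_prediction", "superscalar"},
}


def enabled_features(running_apps, blacklist=BLACKLIST, features=ALL_FEATURES):
    """Return the features that may stay on given the currently running applications."""
    disabled = set()
    for app in running_apps:
        # Unknown applications contribute nothing and leave everything enabled.
        disabled |= blacklist.get(app, set())
    return set(features) - disabled


print(sorted(enabled_features(["Finder"])))
print(sorted(enabled_features(["Finder", "ExampleApp"])))
```

Taking the union of the disabled sets is the conservative choice: a feature stays off as long as any running application is known to conflict with it.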

I can't think of a sensible way to detect futzing with VBR though or the actual memory the vector table lives at. Given we run in supervisor mode that limits options further.

A similar approach might be needed for FPU exceptions and illegal instruction if PTEST emulation ends up being required. The FPU handlers already have some provisions in place in SANE, some expansion on what is already being done is probably required but it didn't look like a major issue. But if the System file on 7.5+ uses PTEST at all then something would need to be done about that...
 
I love the fact that there are amazing skills out there and that open source enables crazy ideas like this. Are 68060 processors more plentiful than 68040 @ 40mhz?
 
No, very much the opposite. 68060s are uncommon and already in high demand by the Amiga and Atari communities. I don't know that 40mhz bus operation (80mhz CPU) would be attainable in Macs without the scarce and expensive late chips. Even 33mhz bus speed might be problematic.

That said, I think using LC cpus for Mac is a very reasonable solution since so little requires the FPU on Mac OS and especially not games. Late model LC chips that can run at higher frequencies aren't too expensive (yet).

Still all very hypothetical though as in my opinion there is quite a ways to go to make this usable.
 
As far as I know very few machines/cards were sold with 060 chips so few would still be around (and the cut down versions probably outnumber the full 060's).
 