OK, I've got an answer to why BlockCopy performance was reduced, along with scrolling. Both of these routines use MOVE16 to get the maximum memory throughput when you've gotta move data through the CPU. This instruction was new with the 68040; it moves entire cache lines (16 bytes, 4 longs) and avoids thrashing the caches. It's used in the optimized 68040 routines for BlockCopy and a few QuickDraw routines (scrolling, notably).
And it turns out the timing is different on the 68060, as compared to the 68040.
Per Motorola docs:
060 takes 18 cycles for MOVE16 src/dst register postincrement. 11 cycles if cache hit (unlikely).
040 takes 14 cycles for MOVE16 src/dst register postincrement.
Motorola seems to indicate this figure is relative to BCLK, but I don't quite believe that. It is likely a mix of PCLK and BCLK.
In both cases, these execution timings assume ideal 2-1-1-1 memory timing. The 33mhz Quadra 650 uses 5-2-2-2 (11 BCLK cycles) on a page burst read and 4-2-2-2 (10 cycles) on a page burst write. So on a memory to memory transfer we're taking (11 [read] + 10 [write] - 2 * 5 [Motorola assumed memory cycles]) * 2 [BCLK is half PCLK] = 22 additional PCLK cycles over Motorola specifications.
(14 [040 timing] + 22 extra)/(18 [060 timing] + 22 extra) = 36/40 = 0.90
So the 68060 has 90% of the MOVE16 bandwidth of the 68040, exactly as I found in the System Info benchmarks. Easily verified, also: I wrote a quick MOVE16 test to move around 200MB from RAM to RAM as fast as possible, and it returned 91% on the 060 as compared to the 040. Exactly as predicted, so this is functioning as intended. Presumably some of the special behavior around this instruction has become more expensive internally to sequence.
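For anyone who wants to plug in their own memory timings, here's the same arithmetic as a small Python sketch. The cycle counts are the ones quoted above; the function names are just made up for illustration, and it takes the 14 and 18 cycle figures as PCLK cycles, which is an assumption given how vague the docs are on that point.

```python
# Cycle counts quoted in this post (Motorola figures plus Quadra 650 burst timings).
MOVE16_040 = 14                 # 68040 MOVE16, postincrement (assumed PCLK cycles)
MOVE16_060 = 18                 # 68060 MOVE16, postincrement, cache miss (assumed PCLK cycles)
IDEAL_BURST = 2 + 1 + 1 + 1     # 2-1-1-1 burst Motorola assumes, in BCLK cycles
READ_BURST = 5 + 2 + 2 + 2      # Quadra 650 page burst read, in BCLK cycles
WRITE_BURST = 4 + 2 + 2 + 2     # Quadra 650 page burst write, in BCLK cycles
PCLK_PER_BCLK = 2               # BCLK is half PCLK


def extra_pclk_cycles():
    """Extra PCLK cycles per MOVE16 copy over Motorola's ideal-memory figure."""
    extra_bclk = READ_BURST + WRITE_BURST - 2 * IDEAL_BURST
    return extra_bclk * PCLK_PER_BCLK


def relative_move16_bandwidth():
    """68060 MOVE16 bandwidth as a fraction of the 68040's, same memory system."""
    extra = extra_pclk_cycles()
    return (MOVE16_040 + extra) / (MOVE16_060 + extra)


if __name__ == "__main__":
    print("extra PCLK cycles:", extra_pclk_cycles())
    print("060 vs 040 MOVE16 bandwidth:", relative_move16_bandwidth())
```

Swapping in different READ_BURST and WRITE_BURST values shows how the gap changes: the slower the memory, the less the 4 cycle difference in the instruction itself matters.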
As that last mystery has been solved, I believe I'm getting the most performance out of the 060 itself so some benchmarks are in order. All tests are done with 68MB of interleaved RAM, 32mhz BCLK (early 060 can't quite manage 33.3), System 7.1, and a ZuluSCSI Slim. I didn't bother with Disk tests as they're essentially unchanging. FPU tests also excluded as I still haven't gotten that working and probably won't bother.
Note that the below numbers are all using the standard Quadra 650 timings for 33mhz: there's definitely more performance on the table by optimizing, but this is a CPU comparison rather than an attempt to milk as much performance as possible out of the Wombat platform.
Norton System Info 3.2.1. Detailed explanations of each subtest in attached text file.
Speedometer 4. Disclaimer: I maintain that Speedometer is a crap benchmark for measuring anything faster than a 68030. Almost all of its tests don't test much RAM and fit in CPU caches so it is of very limited use to measure overall system performance. Nevertheless, here's a screenshot of a comparison at 32mhz. Note that KWhet and Math tests hit SANE so these will differ from other 040 benchmarks with a FPU.
Finally, the most interesting stuff: a collection of real-world application benchmarks.
Some clarification for those unfamiliar: MacDoom is unoptimized and can't be reasonably compared to other platforms as it performs poorly. Take those numbers as a relative comparison only, and one that's highly dependent on memory performance.
MacBench 3 graphs are attached. The CPU performance benchmark is a black box; it doesn't seem to test memory all that heavily. I would take it with a grain of salt for real world performance. The QuickDraw benchmarks however are handy. The 6100 benchmark they use as a baseline supposedly doesn't have an L2 cache installed so it would be best compared to the 060 without FPU.
Altogether, it looks to me like the 68060 is delivering on the 1.6-1.7x 68040 performance promised by Motorola, clock for (P)clock. Motorola was very squirrelly about which clock on the 040 actually governs internal execution; based on my observations I believe it's the higher clock. Motorola did no favors by trying to mask that the 68040 is essentially a clock-doubled 030 with massively improved caches and bus interface.
Thoughts on these numbers.... the 060 clearly could benefit from additional memory throughput, it seems to have performance to spare but is limited by memory. The L2 cache on the 060 above is running at bus speed; even so it's possible to notice where it picks up the slack. The cache supports 2-1-1-1 (5) read cycles on a hit compared to the 5-2-2-2 (11) of the stock RAM. However, the cache introduces a 1 clock penalty on cache misses that have to go on to RAM: that can be observed in the System Info BlockMove benchmarks.
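To put rough numbers on where the L2 cache helps and where it hurts, here's a back-of-envelope Python sketch. It assumes the 1 clock miss penalty simply adds to the 11 cycle RAM burst read, and hit_rate is a hypothetical input rather than something I measured.

```python
# Burst read cycle counts from this post, in BCLK cycles.
L2_HIT_CYCLES = 2 + 1 + 1 + 1      # 2-1-1-1 on an L2 cache hit
RAM_READ_CYCLES = 5 + 2 + 2 + 2    # 5-2-2-2 stock RAM page burst read
MISS_PENALTY = 1                   # extra clock when a miss goes on to RAM (assumed additive)


def avg_read_cycles(hit_rate):
    """Average burst read cost with the L2 installed, for a given hit rate (0..1)."""
    miss_cycles = RAM_READ_CYCLES + MISS_PENALTY
    return hit_rate * L2_HIT_CYCLES + (1 - hit_rate) * miss_cycles


def break_even_hit_rate():
    """Hit rate at which the L2 costs the same on average as having no L2 at all."""
    miss_cycles = RAM_READ_CYCLES + MISS_PENALTY
    return MISS_PENALTY / (miss_cycles - L2_HIT_CYCLES)


if __name__ == "__main__":
    for rate in (0.0, 0.25, 0.5, 0.9):
        print(rate, avg_read_cycles(rate))
    print("break-even hit rate:", break_even_hit_rate())
```

Under those assumptions the cache only loses on workloads that almost never hit, like a big streaming copy, which fits with the penalty showing up in the BlockMove tests.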
The 68040 uses a 32 bit bus running at half core speed. In a system designed for the 040, memory performance between the 040 and 060 should be similar for the same bus clock if we exclude the MOVE16 special case. However, the 68060 is now able to support a full speed bus, unlike the 68040. I don't know what Amiga 68060 accelerators with onboard RAM typically do, but I would assume they are running the accelerator onboard RAM at full speed for much better RAM performance.
Most Pentium systems are going to have a 64 bit data bus running at full speed (60mhz+), so there's a clear advantage to the Pentium on memory throughput and latency. A Pentium Overdrive in a 486 system would be an interesting matchup as that's going to reduce both bus speed and width to something comparable to the Wombat.
Early PowerPC Macs have a 64 bit RAM/ROM bus running at half speed; this is a more fair comparison but favors the PPC when running native code.
As usual, I don't promise this to be in any way usable and don't intend to polish it to the point where I would be comfortable recommending it as something to actually use. With that in mind it does seem surprisingly usable and stable in the limited testing I've done. The source is up on Github and should work with LC060 CPUs now. Unrelated: after putting the 060ISP in place of the unused 040FPSP I found the system will boot and work in 24 bit mode. Pointless but interesting to confirm.