Testing a 6200 and comparison with 6100

zigzagjoe · Feb 22, 2026

You might want to peek at the assembly coming out of your PPC compiler. Perhaps it's unoptimized or unusually bad. Hand-written assembly probably would make more sense to sidestep compiler shenanigans and improve accuracy.

David Cook · Feb 22, 2026

Phipli said:
What speed does the 040 L1 run at? 33 or 66MHz? (I assume 33, but sometimes worth asking the dumb questions).

According to me, 33 MHz. According to Apple and Motorola marketing department 66 MHz.

David Cook · Feb 22, 2026

zigzagjoe said:
Perhaps it's unoptimized or unusually bad.

Agreed. That wouldn't surprise me. That's my big caveat to all of this. I'm using a period-correct compiler (Metrowerks CodeWarrior 11 Gold) with pure C code that is not specifically tailored for a PowerPC processor. I am positive that if I wrote this code differently and chose 603 instruction ordering it could do better.

As you know, the cache tester is really simple. Its purpose is just to detect the existence of a cache at various steps. It doesn't exercise the cache with writes or random accesses. And, it is focused on data, not code.
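To make "detect the existence of a cache at various steps" concrete, here is a minimal sketch. It is not the actual tester. It assumes the steps are increasing buffer sizes and that a cache shows up as a jump in read time once the buffer no longer fits in it. The function names, the size range, and the use of clock() are illustrative choices, not taken from the real program.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read every word once. volatile keeps the compiler from dropping the loads. */
uint32_t read_pass(const volatile uint32_t *buf, size_t words)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Time `passes` sequential read-only sweeps over a buffer of `bytes` bytes.
   Returns elapsed CPU seconds, or -1.0 if the buffer could not be allocated.
   The checksum of the last sweep is stored in *checksum_out (if not NULL). */
double seconds_for_size(size_t bytes, int passes, uint32_t *checksum_out)
{
    size_t words = bytes / sizeof(uint32_t);
    uint32_t *buf = malloc(words * sizeof *buf);
    uint32_t sum = 0;
    clock_t start, stop;
    size_t i;
    int p;

    if (buf == NULL)
        return -1.0;
    for (i = 0; i < words; i++)
        buf[i] = (uint32_t)i;          /* incrementing fill, so the sum is known */

    start = clock();
    for (p = 0; p < passes; p++)
        sum = read_pass(buf, words);
    stop = clock();

    free(buf);
    if (checksum_out != NULL)
        *checksum_out = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Double the working set from 1 KB to 1 MB and print the time per byte read. */
void sweep(int passes)
{
    size_t bytes;
    for (bytes = 1024; bytes <= 1024UL * 1024UL; bytes *= 2) {
        double s = seconds_for_size(bytes, passes, NULL);
        printf("%8lu bytes: %g s per byte\n",
               (unsigned long)bytes, s / ((double)bytes * passes));
    }
}
```

Calling sweep() with a pass count large enough for clock() to resolve should print one line per size. Where the per-byte time jumps would suggest a cache boundary, though on a modern machine the numbers will look nothing like a 603 or an 040.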

Phipli · Feb 22, 2026

David Cook said:
According to me, 33 MHz. According to Apple and Motorola marketing department 66 MHz.

Fair, just double checking given the performance difference.

David Cook · Feb 23, 2026

The performance portion of the code is an unrolled loop that reads 32 bytes per iteration. So, basically, nothing else is as impactful on the result as this portion of code.

The addition operation is used to verify that memory is valid. The buffer has been preloaded with an incrementing value where the end sum is known. (This is a cache checker program.)

sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
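For anyone wanting to try this outside CodeWarrior: incrementing the result of a cast, as in the lines above, is not valid in standard C and depends on the compiler accepting it. A minimal portable sketch of the same idea follows. The names fill_incrementing, expected_sum, and sum_unrolled are made up for illustration, and it assumes the buffer is preloaded with 0, 1, 2, ..., which is only a guess at what "incrementing value" means here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: preload the buffer with 0, 1, 2, ... */
void fill_incrementing(uint32_t *buf, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++)
        buf[i] = (uint32_t)i;
}

/* Closed form for 0 + 1 + ... + (words - 1), truncated to 32 bits the same
   way the running sum wraps. Assumes words is below 2^32. */
uint32_t expected_sum(size_t words)
{
    uint64_t n = (uint64_t)words;
    return (uint32_t)(n * (n - 1) / 2);
}

/* Unrolled read-and-add: eight 32-bit loads (32 bytes) per iteration. */
uint32_t sum_unrolled(const uint32_t *buf, size_t words)
{
    const uint32_t *p = buf;
    const uint32_t *end = buf + words;
    uint32_t sum = 0;

    assert(words % 8 == 0);   /* the unrolled body consumes 8 words at a time */
    while (p != end) {
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
    }
    return sum;
}
```

The validity check is then just a comparison of sum_unrolled(buf, words) against expected_sum(words) after each pass. A mismatch would mean a read returned bad data.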

I've checked the PowerPC disassembly and it looks fine. There are two operations per C line on PPC, as opposed to a single 040 operation, which is to be expected on RISC vs CISC. However, someone with more expertise in PPC assembly might know of an optimization.

The disassembly is interesting in that the PPC code switches between loading one register and then another. I assume using multiple registers allows a performance gain where it can execute two or more operations (a read to one register using the load/store unit and an add to another register using the integer unit) in parallel. Cool.

00000098: 807C0000 lwz r3,0(r28)
0000009C: 841C0004 lwzu r0,4(r28)
000000A0: 7CC61A14 add r6,r6,r3
000000A4: 849C0004 lwzu r4,4(r28)
000000A8: 7CC60214 add r6,r6,r0
000000AC: 847C0004 lwzu r3,4(r28)
000000B0: 7CC62214 add r6,r6,r4
000000B4: 841C0004 lwzu r0,4(r28)
000000B8: 7CC61A14 add r6,r6,r3
000000BC: 849C0004 lwzu r4,4(r28)
000000C0: 7CC60214 add r6,r6,r0
000000C4: 847C0004 lwzu r3,4(r28)
000000C8: 841C0004 lwzu r0,4(r28)
000000CC: 7CC62214 add r6,r6,r4
000000D0: 7CC61A14 add r6,r6,r3
[two operations to prepare to loop and finally]
000000DC: 7CC60214 add r6,r6,r0

Here's the 040:
0000007E: D69A ADD.L (A2)+,D3
00000080: D69A ADD.L (A2)+,D3
00000082: D69A ADD.L (A2)+,D3
00000084: D69A ADD.L (A2)+,D3
00000086: D69A ADD.L (A2)+,D3
00000088: D69A ADD.L (A2)+,D3
0000008A: D69A ADD.L (A2)+,D3
0000008C: D69A ADD.L (A2)+,D3

- David

croissantking · Feb 25, 2026

Phipli said:
What speed does the 040 L1 run at? 33 or 66MHz? (I assume 33, but sometimes worth asking the dumb questions).

Ask @Melkhior
