Testing a 6200 and comparison with 6100

You might want to peek at the assembly coming out of your PPC compiler. Perhaps it's unoptimized or unusually bad. Hand-written assembly probably would make more sense to sidestep compiler shenanigans and improve accuracy.
 
Perhaps it's unoptimized or unusually bad.

Agreed. That wouldn't surprise me. That's my big caveat to all of this. I'm using a period-correct compiler (Metrowerks CodeWarrior 11 Gold) with pure C code that is not specifically tailored for a PowerPC processor. I am positive that if I wrote this code differently and ordered the instructions for the 603, it could do better.

As you know, the cache tester is really simple. Its purpose is just to detect the existence of a cache at various steps. It doesn't exercise the cache with writes or random accesses. And, it is focused on data, not code.
 
The performance portion of the code is an unrolled loop that reads 32 bytes per pass (eight 4-byte loads). So, basically, nothing else is as impactful on the result as this portion of code.

The addition operation is used to verify that memory is valid. The buffer has been preloaded with an incrementing value where the end sum is known. (This is a cache checker program.)

sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
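For anyone who wants to try the same idea on a modern compiler, here is a minimal sketch of the fill-then-sum check described above. It is not the actual tester source. The function names, the WORDS_PER_PASS constant, and the use of uint32_t (standing in for the 4-byte unsigned long on these Macs) are all my assumptions for illustration, and it uses a typed pointer instead of the cast-then-increment form, which standard C does not accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is a sketch, not the tester's code. */
#define WORDS_PER_PASS 8u   /* 8 x 4-byte words = 32 bytes per loop pass */

/* Preload the buffer with 0, 1, 2, ... so the final sum is predictable. */
void fill_incrementing(uint32_t *buf, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++)
        buf[i] = (uint32_t)i;
}

/* Closed form for 0 + 1 + ... + (words - 1), truncated to 32 bits. */
uint32_t expected_sum(size_t words)
{
    uint64_t n = (uint64_t)words;
    return (uint32_t)(n * (n - 1u) / 2u);
}

/* Unrolled read loop: eight loads and adds per pass, no writes. */
uint32_t sum_unrolled(const uint32_t *p, size_t words)
{
    const uint32_t *end = p + words;
    uint32_t sum = 0;

    /* The unrolling assumes the buffer is a whole number of passes. */
    assert(words % WORDS_PER_PASS == 0);

    while (p < end) {
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
    }
    return sum;
}

/* Nonzero when the summed contents match the known end value. */
int buffer_is_valid(const uint32_t *buf, size_t words)
{
    return sum_unrolled(buf, words) == expected_sum(words);
}
```

In a real run you would time sum_unrolled over buffers of increasing size and only trust a timing when buffer_is_valid agrees. The timing part is left out here because it depends on the platform.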

I've checked the PowerPC disassembly and it looks fine. There are two operations per C line on PPC, as opposed to a single 040 operation, which is to be expected on RISC vs CISC. However, someone with more expertise in PPC assembly might know of an optimization.

The disassembly is interesting in that the PPC code switches between loading one register and then another. I assume using multiple registers allows a performance gain where it can execute two or more operations (a read to one register using the load/store unit and an add to another register using the integer unit) in parallel. Cool.

00000098: 807C0000 lwz r3,0(r28)
0000009C: 841C0004 lwzu r0,4(r28)
000000A0: 7CC61A14 add r6,r6,r3
000000A4: 849C0004 lwzu r4,4(r28)
000000A8: 7CC60214 add r6,r6,r0
000000AC: 847C0004 lwzu r3,4(r28)
000000B0: 7CC62214 add r6,r6,r4
000000B4: 841C0004 lwzu r0,4(r28)
000000B8: 7CC61A14 add r6,r6,r3
000000BC: 849C0004 lwzu r4,4(r28)
000000C0: 7CC60214 add r6,r6,r0
000000C4: 847C0004 lwzu r3,4(r28)
000000C8: 841C0004 lwzu r0,4(r28)
000000CC: 7CC62214 add r6,r6,r4
000000D0: 7CC61A14 add r6,r6,r3
[two operations to prepare to loop and finally]
000000DC: 7CC60214 add r6,r6,r0

Here's the 040:
0000007E: D69A ADD.L (A2)+,D3
00000080: D69A ADD.L (A2)+,D3
00000082: D69A ADD.L (A2)+,D3
00000084: D69A ADD.L (A2)+,D3
00000086: D69A ADD.L (A2)+,D3
00000088: D69A ADD.L (A2)+,D3
0000008A: D69A ADD.L (A2)+,D3
0000008C: D69A ADD.L (A2)+,D3

- David
 