Testing a 6200

Phipli

I wrote this mostly over a year ago but haven't clicked post. I wanted to do a bit more, but things have conspired against it so I thought I'd share.

I'm actually nervous posting it because people are so passionate about these machines, one way or the other.

I tried to do testing as fairly as possible to see where the performance differences were, and experimented to see where improvements could be made.

The biggest thing I learnt along the way is that SpeedDoubler is a mandatory install for beige PPC macs if you run any 68k code on them at all.


@cheesestraws @joshc @Snial @Cory5412
 
Reading this now! :-) Since @noglin is a P5200 fan, I thought I'd notify them here. OK, you mention Noglin later.

Based on my analysis of disk partitions on my PB1400c, as standard it will come with the 68K IDE driver, but Mac OS 8.1 and later will install the PPC driver (on a different driver partition).

Suggested edit: "Based on in which benchmarks the Performa 6200/75 performs least we’ll compared to the PM 6100/66", I think "we'll" is a typo and should be "well".
 
@Snial

The main issue seems to be graphics speed based on testing of this specific machine.

Given the 630 is faster, and tests show that it isn't due to 68k driver code / QuickDraw code, the hardware is clearly capable of more.

Possible issues are the number of wait states they had to add for the increased bus speed, or something nasty to do with the bridge. What did I see that might be a clue? It is good at writing images to the frame buffer compared to the 630, but bad at translating an image in VRAM.

Given they had a few issues early on, I wonder whether they detuned the video performance heavily to avoid crashes? Accepting bad performance rather than stability problems?

I'd love to compare a release day ROM to see if graphics performance is better.

Also, I hadn't thought before, but I wonder what it would be like with an LCPDS video card?

Also, how differently does a 6300 (5300) perform? How about when underclocked to 75MHz?

So many questions.

Regardless, the results weren't what I was expecting from what I've previously heard from either advocates or detractors.
 
Suggested edit: "Based on in which benchmarks the Performa 6200/75 performs least we’ll compared to the PM 6100/66", I think "we'll" is a typo and should be "well".
Yeah, that's a phone autocorrect; it has been a nightmare lately. It keeps editing valid text, like changing "of" to "if", but only after I look away and press space. It also frequently breaks words that start with a more common word by forcing a space midway through. I strongly feel autocorrect has got much worse over the last 15 years. It tries to be smart and is worse for it.
 
The main issue seems to be graphics speed based on testing of this specific machine.
So, if @noglin pops up on this thread, then they probably have the best insight.
Given the 630 is faster, and tests show that it isn't due to 68k driver code / QuickDraw code, the hardware is clearly capable of more.

<snip> wait states they had to add for the increased bus speed, or something nasty to do with the bridge. What did I see that might be a clue? It is good at writing images to the frame buffer compared to the 630, but bad at translating an image in VRAM.
Sounds plausible. I thought I was going to find a 68040 bus signal diagram here:


And I'm sure I've seen it before, but I couldn't find it. The P5200/6200 need 80ns RAM, but I guess the '040 bus runs at 37.5MHz? There's some extensive discussion I'm sure you're aware of here:


Noglin suggested that the graphics slowdown is due to the write buffers on the PPC/'040 bridge. Graphics writes likely use stfd instructions, which write 64-bit values, but these get broken down into 2x 32-bit writes, so even a single write fills 2 write buffers. Hence it functions as a bottleneck. He did numerous graphics tests which identified huge bus latencies depending on how much data was being shuffled and processed between caches and the '040 bus.
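As a minimal sketch (the function and names are mine, not from noglin's tests), this is the sort of copy loop a PPC compiler turns into lfd/stfd pairs, and hence the sort of code that hits the write-buffer bottleneck:

```c
/* Sketch: copying pixel data with 64-bit FP loads/stores.  The
 * compiler can emit an lfd/stfd pair per iteration; on the 6200 the
 * bridge splits each 64-bit stfd into two 32-bit writes on the '040
 * bus, so a single store occupies two write buffers. */
void copy_rows(double *dst, const double *src, long n)
{
    long i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```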

<snip> compare a release day ROM <snip> LCPDS video card? <snip> 6300 (5300) perform [75MHz]? <snip>
OK.
Regardless, the results weren't what I was expecting from what I've previously heard from either advocates or detractors.
I guess these machines are contentious, because they were pioneering consumer PPC Macs, leading to Workstation level over-expectations for some people (not me, my first experience with a P5200 I felt was awesome vs my LC II :-D ).
 
I guess these machines are contentious, because they were pioneering consumer PPC Macs, leading to Workstation level over-expectations for some people (not me, my first experience with a P5200 I felt was awesome vs my LC II :-D ).
They're excellent spreadsheet machines. Especially if you're doing lots of FP and the version of Excel supports the PPC FPU.
 
@Snial - I should probably share this here too for background, although I think I showed you specifically before:

 
@Snial - I should probably share this here too for background, although I think I showed you specifically before:

Thanks, I've read it again. PowerPC implementations made a lot of different microarchitecture choices, which can significantly affect performance in conjunction with the complete system design. The PPC603 is significantly different to the PPC601.

[Attached: PPC601 and PPC603 block diagrams]

The most basic differences (apart from the 601's support for POWER instructions) are that the 601 has a 256-bit path from the cache to its 8-entry Instruction Queue, so it can fill all 8 entries in a single fetch if possible. The 603 only has a 64-bit path to a 6-entry IQ, so at best it can fill 2 entries per fetch. The 601 can dispatch 3 instructions per cycle, but the 603 can only dispatch 2 (though its BPU interfaces directly with instruction fetch, allowing some branches to be annulled).

I believe completion must always be in-order for both. There are other aspects which limit PPC603 throughput. The PPC603 in fact closely resembles the MC88110 which came out a few years earlier:
[Attached: MC88110 block diagram]
The responsibilities of the IU and FPU were changed, but the rest is similar. It looks to me like the MC88110 team negotiated compromises in order to get the PPC603 to market on time: drop some functional units, re-partition the ALU / FPU datapath, decode the new instruction set.

And PowerPC's designers seemingly failed to consider one aspect of RISC vs CISC. RISC was originally designed for Unix workstations, where the expectation was that executables would normally be recompiled for a given machine. And that's normal there: e.g. when playing with the MAME SparcStation 1 emulator, part of the standard install procedure is to recompile the kernel. But consumer computers aren't like that: the users generally don't have compilers, so the software can't be recompiled (hence the need for the 68K emulator, which RISC workstations never needed). Therefore microarchitecture choices have a much bigger impact. E.g. Cyrix's and AMD's Pentium 1 competitors suffered because their design choices must have looked good on paper (e.g. SPECint) but performed worse in reality; whereas the Pentium was basically 2x 486s strapped in parallel, with an FPU that was more independent.

Re: FPU, did the PPC601 have the same or equivalent fused MAC instruction optimisations in the FPU? That could explain the 603's FPU competitiveness. But also, as an aside, where you said the P6200 is good for spreadsheets: it's a bit like how the Sinclair QL, which is normally about 2x slower than a Mac 128K thanks to the QL's 68008 CPU, can nearly match the Mac's maths performance, because that's mostly internal cycles, which are the same on both.
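(For context: fused multiply-add is part of the base PowerPC instruction set, so the 601 does have fmadd; whether each implementation executes it equally quickly is the open question above. A minimal sketch, with a function name of my own choosing, of the kind of expression a compiler can fuse:)

```c
/* Illustrative sketch only: a*x + y maps onto a single fmadd on
 * PowerPC, saving an instruction (and an intermediate rounding)
 * versus a separate fmul + fadd. */
double axpy(double a, double x, double y)
{
    return a * x + y;   /* candidate for: fmadd f1,f1,f2,f3 */
}
```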
 
But consumer computers aren't like that: the users generally don't have compilers, so the software can't be recompiled (hence the need for the 68K emulator, which RISC workstations never needed). Therefore microarchitecture choices have a much bigger impact.
Yes, I've wondered about this. I've seen some software note that it contains optimisations for the 604 (I think I've seen Photoshop mention it), but most PPC software will have targeted the 601, or a common core of instructions. Any extra opcodes or optimisation tricks will mostly have been left out.

Do any period Macintosh development environments have switches to target, or even optimise, for specific types of PPC?

In 68k land you seem to mostly get the choice to either target 68000 or 68020 in CodeWarrior, although I haven't really dug into the specifics as my programming is rarely speed critical.
 
Yes, I've wondered about this. I've seen some software note that it contains optimisations for the 604 (I think I've seen Photoshop mention it), but most PPC software will have targeted the 601, or a common core of instructions. Any extra opcodes or optimisation tricks will mostly have been left out.
It's mostly the scheduling rules that change. Unified vs Harvard caches make a difference too, because a Harvard cache can trade increased data cache misses for increased instruction cache hits. That is, if code exhibits more locality but data doesn't, then a Harvard cache might win, because on a unified cache the extra load/store misses will evict code cache lines. For example, code with a fairly tight maths loop crunching through a whole pile of data will have lots of data cache misses. On a unified cache they'll cause close to 100% load/store misses and instruction misses too, but on a Harvard cache it'll just be 100% data cache misses.

Hence, the 6200 can be relatively good for large spreadsheets.

So, in turn, if you schedule that math loop on a 603: unrolling code; exploiting registers or whatever so that it occupies close to 8kB, then you gain better instruction performance and still no misses, but on a 601 you'll end up reloading 8kB of code for every 32kB of data. A tighter 4kB loop would only end up reloading 4kB for every 32kB of data.
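As a hedged sketch of that trade-off (function names and the unroll factor are mine): the tight version below has a tiny code footprint, while the unrolled one trades code size for fewer loop overheads and more independent FPU work. On a 603 the unrolled body still sits comfortably in the 8kB I-cache while the data streams through the D-cache; on the 601's 32kB unified cache the streaming data competes with that code.

```c
/* Tight version: a few dozen bytes of code, trivially cache resident. */
double sum_tight(const double *p, long n)
{
    double s = 0.0;
    long i;
    for (i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Unrolled version: four independent accumulators hide FPU latency
 * at the cost of a bigger code footprint.  Assumes n is a multiple
 * of 4, for brevity. */
double sum_unrolled(const double *p, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i < n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```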

Do any period Macintosh development environments have switches to target, or even optimise, for specific types of PPC?
I'll check my version of CodeWarrior, though I guess based on your paragraph below, you already have?
In 68k land you seem to mostly get the choice to either target 68000 or 68020 in CodeWarrior, although I haven't really dug into the specifics as my programming is rarely speed critical.
I usually target just a 68000 on everything, but that can also be more efficient on a 68040; though perhaps not under the '040 emulator. 68020 code gains efficiency from the extra addressing modes (which take a constant number of cycles to calculate), but the RISC-wards shift of the 68040 makes simple code more efficient again (I think).
 
So, in turn, if you schedule that math loop on a 603: unrolling code; exploiting registers or whatever so that it occupies close to 8kB, then you gain better instruction performance and still no misses, but on a 601 you'll end up reloading 8kB of code for every 32kB of data. A tighter 4kB loop would only end up reloading 4kB for every 32kB of data.
What I'm hearing is that I should write everything in hand crafted assembly? :)
I usually target just a 68000 on everything
Same to be honest - in fact, the book I first started learning from recommended doing it unless there was reason not to. I think I remember you get smaller applications even, so it really does make sense unless you need the extra performance.
I'll check my version of CodeWarrior, though I guess based on your paragraph below, you already have?
I'll look right now... as above, I don't tend to look at the PPC settings.
[Attached: screenshot of CodeWarrior's PPC processor options]
Yeah, it lists individual CPU targets. So I guess my question is, if I select 604e, what happens if I run it on a 601?
 
What I'm hearing is that I should write everything in hand crafted assembly? :)
LOL! Well, compilers might do those things too.
Same to be honest - in fact, the book I first started learning from recommended doing it unless there was reason not to. I think I remember you get smaller applications even, so it really does make sense unless you need the extra performance.
It should be possible for a compiler (or human) to code shorter for '020, because it's a strict superset and some of the addressing modes require wasting a register on an '000 for intermediate calculations.
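A hypothetical illustration (the C and the commented assembly are my sketch of typical compiler output, not CodeWarrior's exactly):

```c
/* On a 68020+ the table access can use one scaled-index mode:
 *     move.l (0,a0,d0.l*4),d1
 * The 68000 has no *4 scaling, so it must burn a register on the
 * intermediate calculation:
 *     move.l d0,d2
 *     asl.l  #2,d2
 *     move.l 0(a0,d2.l),d1
 */
long fetch(long *table, long i)
{
    return table[i];
}
```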
I'll look right now... as above, I don't tend to look at the PPC settings.
Yeah, it lists individual CPU targets. So I guess my question is, if I select 604e, what happens if I run it on a 601?
There we go! The big differences, from memory:
  • Two integer units (so uninterrupted ALU operations benefit).
  • Longer pipeline compared with a 603, 603e, 601, 604 and 750, so more potential latency for jumps.
  • Better branch prediction (dynamic branch prediction, branch hints [don't think they work on 601]).
The upshot, I think, is that code with short calculations and lots of sequences of nested if statements would close the gap between a 601 and a 604e.
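As a toy illustration of the first point (my example, so hedge accordingly): two independent dependency chains that a 604/604e can feed to its two simple integer units in parallel, whereas the 601's single integer unit has to serialise them.

```c
/* Two independent integer chains: a 604/604e can issue from both
 * chains each cycle; a 601 executes them one integer op at a time. */
long mix(long a, long b, long c, long d)
{
    long x = (a + b) ^ (a - b);   /* chain 1 */
    long y = (c + d) ^ (c - d);   /* chain 2, independent of chain 1 */
    return x | y;
}
```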
 