
How did the PowerPC 603 / 5200 at 75 MHz compare to PCs (486/Pentium)?

I was curious so I looked and the dnet statistics server is still up:

cgi.distributed.net/speed/

You can pull up graphs like this:

[Attached graph from the dnet speed database: client throughput vs. clock speed for various CPUs]

The throughput is fairly linear with clock speed, and the slope of each line shows the efficiency of the chip. For this thread, here are some specific results I looked up for some tuned assembly code:

Code:
rc564
CPU                   MHz       Rate
PowerPC 603e          100    310,190
  scaled to            75    232,000
Intel Pentium (P5)     75     96,326
Intel 486dx4           75     71,124

ogr
CPU                   MHz       Rate
PowerPC 603e          200  1,524,307
  scaled to            75    571,000
Intel Pentium (P5)     75    277,653
Intel 486dx4          100    252,230

rc572
CPU                   MHz       Rate
PowerPC 603e          240    793,275
  scaled to            75    248,000
Intel Pentium (P5)     75     76,413
Intel 486dx4          100     65,739
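
For reference, the "scaled to" rows above look like simple linear scaling with clock: 310,190 x 75/100 ≈ 232,600; 1,524,307 x 75/200 ≈ 571,600; 793,275 x 75/240 ≈ 247,900, which matches the rounded figures in the table.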

This is an interesting source. Thanks!

The 603e should not be confused with the 603. While they are very similar, they are different enough that they should be treated separately:
- 603e: stores are faster (2x throughput)
- 603e: the SRU can do some integer work, so it is effectively "~2 ALUs" (like the 68060 and the Pentium, it can dual-issue and dual-complete two integer instructions, subject to pairing requirements).
- 603e: L1 is twice as big (4-way instead of 2-way), though unlikely to matter for this test.

The 603 has entries for rc564 and ogr (the ogr entry has only 1 data point, but all the others have low stddev, so it should be OK).

As for this as a benchmark: this workload measures only the ALU (no FPU work at all), and it won't exercise the cache and LSU much, since I presume most accesses are linear, in one tight inner loop that mostly does bit-shifting and rotate work (for which the PowerPC rlwnm and rlwimi are a great fit).
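
To make that concrete, here is a minimal sketch (mine, not the dnet core) of the rotate that the RC5 inner loop lives on, and why it maps so well to the PowerPC:

Code:
#include <stdint.h>

/* 32-bit rotate-left, the core operation of an RC5 round.  On PowerPC a
 * compiler can turn this whole expression into a single rotlw (i.e. rlwnm
 * with a full mask); on a 486/Pentium it becomes a ROL plus the usual
 * loop/address bookkeeping around it. */
static inline uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* One RC5 half-round, roughly as a keysearch core would use it
 * (illustrative names, not the hand-tuned assembly):
 *     A = rotl32(A ^ B, B) + S[i];
 */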

I wouldn't expect these numbers to be representative of daily or typical usage. They are representative of "how well can this CPU do on this exact problem". But since the problem is essentially one inner loop, it does not exercise the system very broadly.

CPU                   MHz  Test   Rate (keys/s)  StdDev     StdDev %  Samples
PowerPC 603e          100  rc564     310,190.14  13,672.54     4.41      7
PowerPC 603e *         75  rc564     232,642.61
Intel Pentium (P5)    133  rc564     181,776.30  12,626.97     6.95     71
Motorola 68060         50  rc564     140,113.75  15,970.69    11.40     12
PowerPC 603            75  rc564     135,136.50   3,250.26     2.41      4
PowerPC 601            60  rc564     105,464.20   1,443.11     1.37      5
Intel Pentium (P5)     75  rc564      96,326.73   6,379.38     6.62     15
Intel 486dx2           66  rc564      63,815.65   4,806.73     7.53     26

* scaled to 75 MHz

I would have gladly swapped my 603 at 75 MHz for a P133 back in the day :), so I think Quake 1 is definitely a more representative test.

CPU                   MHz  Test  Rate (nodes/s)  StdDev     StdDev %  Samples
PowerPC 603e          100  ogr          738,631       0.00     0.00      1
PowerPC 603e *         75  ogr          553,973
Intel Pentium (P5)    133  ogr          535,293  24,375.21     4.55     21
PowerPC 603            75  ogr          341,123       0.00     0.00      1
Motorola 68060         50  ogr          315,917  27,017.24     8.55     12
PowerPC 601            66  ogr          314,784       0.00     0.00      1
PowerPC 601 **         60  ogr          286,167
Intel Pentium (P5)     75  ogr          277,653  23,432.00     8.44      2
Intel 486dx2           50  ogr          142,276       0.00     0.00      1

*  scaled to 75 MHz
** scaled to 60 MHz

Although I'm not convinced, I appreciate the additional data points we got here. They definitely show that the PowerPC, and the 603 in particular, can beat Intel on some problems.

I think @Snial mentioned earlier in the thread (I hope I don't recall it wrong) that he suspected the PowerPC was designed using only a small set of example problems, and I think that is why it does relatively well on SPECint92 but then has a lot of trouble in real life.

I've done some microbenchmarking of the 603 as well, and things like the fact that it stalls the LSU completely on cache misses easily lead to the ALU stalling too; and if you deal with IO or VRAM, those cache misses take a long time to resolve. In contrast, both the 68060 and Intel (AFAIK) can proceed with loads that hit the cache even if a store missed. I think effects like these (cache misses), which happen all the time in daily usage, are completely lost on this type of very isolated "one inner loop" problem. I know nothing about chip design, but I would imagine that if they had used larger, more complex programs, maybe they would have let loads pass a miss, and also not let an FPU divide stall the whole CPU. Those two fixes (I have no idea how many transistors they would have cost) would have made the 603 a much stronger CPU, and would also have made it possible to do FPU/ALU work while a store cache miss is being handled.
 
If Quake is your measure of daily usage, then yes, the only thing faster than a Pentium™ is another one with more cache. My point is, I think you are looking at a very narrow part of the scene, and the comparison is not objective. Even including other games, on nearly the same hardware, shows there is more to the picture. Here is a screen capture from a Google video showing several different Socket 7 CPUs (note the baseline) running at nearly the same clock speed. Look at the variability even within this group of very similar x86 CPUs running different commercial software. Now look at the Quake vs Quake 3 results: the ranking nearly reverses.

x86games.png

True, one specific piece of skillfully hand-tuned software runs particularly well on one specific chip. In my opinion there is more to consider. There are plenty of articles describing the Quake "phenomenon":

liam-on-linux.livejournal.com/49259.html

"The one single product that killed the Cyrix chips was id Software's Quake."
 
If Quake is your measure of daily usage, then yes, the only thing faster than a Pentium™ is another one with more cache. My point is, I think you are looking at a very narrow part of the scene, and the comparison is not objective.

It is not perfect, I give you that. And Doom might be a better benchmark for these machines, or at least a complement.

I think PowerPC can beat Pentiums on Quake 1. A 604 should be able to win over a Pentium (at the same MHz) on Quake 1, as its FPU can complete the divide in parallel while it still has three integer units that can do work. Amazing CPU for its time. It really seems to have fixed all the issues with the 603/603e (the two major pains being: a) a cache miss stalls the LSU completely, which in turn quickly dries up possible work for the other units, and b) it can only retire two instructions per cycle, in order, which means any multi-cycle instruction will quickly stall the CPU).
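
To illustrate why that parallel FP divide matters so much for Quake specifically, here is a rough C sketch (my own, not id's code, and with made-up parameter names) of the span-drawing structure: perspective-correct texture coordinates are computed with an FP divide only every 16 pixels, and in the hand-scheduled assembly that divide is issued so that it retires while the integer pixel loop runs. On a Pentium or a 604 the divide is essentially hidden; on a 603/603e it is not.

Code:
#include <stdint.h>

enum { SPAN = 16, TEX_W = 64, TEX_MASK = 63 };

/* Perspective-correct span with one divide per 16-pixel segment.
 * uz = u/z, vz = v/z, iz = 1/z at the left edge; duz/dvz/diz are the
 * per-pixel gradients. */
void draw_span(uint8_t *dst, const uint8_t *tex, int count,
               float uz, float vz, float iz,
               float duz, float dvz, float diz)
{
    float z  = 1.0f / iz;                 /* divide for the first boundary */
    float u0 = uz * z, v0 = vz * z;

    while (count > 0) {
        int n = count < SPAN ? count : SPAN;

        uz += duz * n;  vz += dvz * n;  iz += diz * n;
        z = 1.0f / iz;                    /* the long-latency divide: the real
                                             code starts it early enough that
                                             it overlaps the pixel loop below */
        float u1 = uz * z, v1 = vz * z;

        float us = (u1 - u0) / n, vs = (v1 - v0) / n;
        float u = u0, v = v0;
        for (int i = 0; i < n; i++) {     /* fixed-point integer loop in the
                                             real renderer */
            dst[i] = tex[((int)v & TEX_MASK) * TEX_W + ((int)u & TEX_MASK)];
            u += us;  v += vs;
        }
        dst += n;  count -= n;
        u0 = u1;   v0 = v1;
    }
}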

The 603 is both low-clocked and bad at being "superscalar", with so many ways it can stall itself. The 603e has the same fundamental issues, but was made with a larger L1 and higher clock and bus speeds, which reduce the impact. And the 604 just seems wonderful (the challenge would be finding ways to keep it 100% busy, rather than keeping it from tripping over itself).

Ironically, the 603/603e has a quite capable FPU. It is the completion unit that is the problem, and that shows up in integer code as well. If the 603/603e had had a slightly more flexible completion unit, it would have made a big difference, I believe.

Even including other games, on nearly the same hardware, shows there is more to the picture. Here is a screen capture from a Google video showing several different Socket 7 CPUs (note the baseline) running at nearly the same clock speed. Look at the variability even within this group of very similar x86 CPUs running different commercial software. Now look at the Quake vs Quake 3 results: the ranking nearly reverses.

GLQuake/Quake III (and Forsaken as well?) were hardware accelerated (i.e. the triangle drawing was offloaded to a 3D accelerator). I would not expect software-rendered Quake 1 to predict performance in those scenarios at all. Also, unless they used the exact same 3D accelerator for all systems, the accelerated games say more about the accelerator than about the CPU.

As for Doom, it locked the camera so that walls stay vertical and floors stay flat on screen, so you don't need per-pixel perspective division: the depth is constant along a vertical wall column or a horizontal floor span (which is exactly how Doom drew them, and why it could run at a playable frame rate). In other words, integer math plus loads and stores are sufficient. I think Doom would be a great complementary test; I will try to get some Doom benchmarks as well.
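
For contrast, here is a minimal sketch of a Doom-style column inner loop (modeled on, but not copied from, Doom's R_DrawColumn): because depth is constant down a wall column, the texture step is computed once per column, and the per-pixel work is just integer adds and a table lookup.

Code:
#include <stdint.h>

/* Draw one vertical wall column.  frac/fracstep are 16.16 fixed point;
 * fracstep is derived once per column from the column's (constant)
 * distance, so there is no divide anywhere in this loop. */
void draw_wall_column(uint8_t *dst, int screen_stride,
                      const uint8_t *texcol,       /* 128-texel column   */
                      int count, uint32_t frac, uint32_t fracstep)
{
    while (count-- > 0) {
        *dst = texcol[(frac >> 16) & 127];   /* texel lookup              */
        dst += screen_stride;                /* move down one screen row  */
        frac += fracstep;                    /* fixed-point texture step  */
    }
}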

I read up a little on the Cyrix 6x86: its P150+ model really runs at 120 MHz but was marketed as "as fast as a 150 MHz Pentium". They had gambled that they could beat the Pentium by focusing on the integer unit, and charge a premium. I can't imagine the panic they must have felt as Quake started to influence parents' computer purchases. Comparing the P120+ (100 MHz) with the 603e (100 MHz) on Quake 1 would be interesting, but I can't find Quake scores for the P120+.

Going back to Quake 1, I should add that for the 133 MHz Pentium figure compared with the 5200, I took the worst reported Quake 1 FPS (at 320x200) of all the desktop Pentium 133s, but I was not aware of the Cyrix non-pipelined FPU, and the P150+ performs worse.
 
I read up a little on the Cyrix 6x86: its P150+ model really runs at 120 MHz but was marketed as "as fast as a 150 MHz Pentium". They had gambled that they could beat the Pentium by focusing on the integer unit, and charge a premium. I can't imagine the panic they must have felt as Quake started to influence parents' computer purchases. Comparing the P120+ (100 MHz) with the 603e (100 MHz) on Quake 1 would be interesting, but I can't find Quake scores for the P120+.

The whole "PR" thing was a mess. Sometimes it was easy to work out the actual speed of the silicon (a 'PR150'chip would have a '50MHz x 2.5' BIOS/jumper block setting on the chip, for example), other times it was obscured and you had to go digging (as with AMD's Athlon line marketed as "3500+" when you know there was no way it was clocked anywhere near that).
For all the shady marketing that Apple could be accused of, they never resorted to "PR" advertising on Macs vs. Intel. Sure they'd have comparisons and say things like 'the G3 at 300MHz runs Photoshop 3.2x faster than a 400MHz Pentium II,' but they didn't market the computer as a "P1200+" or whatever.

One thing Intel did well in the early '90s was integer units. Their FPUs were historically weak, especially compared to emerging RISC designs, so Intel put a lot of focus on the Pentium's FPU. Cyrix, aware of the history, put lots of resources into an Intel-killing integer unit and kind of coasted on its historically-decent FPU (they were normally faster than Intel's) in designing the 6x86. They both met their design goals: the Pentium had an excellent FPU (well, except for the whole FDIV bug) and Cyrix had quite a good integer unit. Unfortunately for Cyrix, the market was moving on and FPUs were not only mandatory now, but more things relied heavily on them, specifically games and especially Quake.
 
I found an interesting article for the 603: "PowerPC 603, A microprocessor for Portable Computers" - written by the PowerPC 603 design team.

- "As this article shows, its design focuses on meeting these challenges and achieving an optimum balance between power and performance for the portable system."
..
- "The 603’s caches are blocking type, meaning that a cache may not be accessed if it is processing a miss. The time a cache must wait for data during a miss then is
guaranteed idle time and an opportunity for dynamic power management logic to disable the cache and MMU clocks. For timing reasons, the 603 disables cache and MMU clocks only during misses"
- "most of the benchmarks in this analysis were small enough to fit in the cache, producing a very low miss rate". (two are from SpecInt92)
- "Trace-driven simulation shows the performance of the 603 to be an estimated 75 Specint92 and 85 Specfp92 at 80 MHz with 1-Mbyte external secondary cache". "These figures exceed the performance of many current desktop and portable design alternatives"

It is interesting to note how they achieved their goals but failed the objective. The 603 only ever ended up being used in desktops plugged into the wall, or where it was desirable to have no L2 at all, and with slow VRAM on a 32-bit bus :). I think this is why I doubt benchmarks of the 5200 that only let the 603 spin as fast as possible while hitting L1 or L2, when the real problem is that daily usage involves a lot of VRAM (and IO).

Let's see how close their estimates were:

66 MHz, L2 = 1 MB: estimated SPECint92 66 (1 SPECint92/MHz), actual 63.7. OK, pretty close!
66 MHz, L2 = 256 kB: actual 60.6. Oops, roughly 10% less! Even with a "very low miss rate", every miss matters a lot when it also misses in L2!


We can probably scale the 256 kB result to 75 MHz: 60.6 x 75/66 ≈ 68.9, so the 5200 (with its 256 kB L2) would likely get about 68.9 on SPECint92.
 

I found an interesting article for the 603: "PowerPC 603, A microprocessor for Portable Computers" - written by the PowerPC 603 design team.
Good find! I've now read about half of it.
<snip> optimum balance between power and performance for the portable system." <snip> caches are blocking <snip> benchmarks <snip> fit in the cache <snip> (two are from SpecInt92) <snip> estimated 75 Specint92 and 85 Specfp92 at 80 MHz with 1-Mbyte external secondary cache". "These figures exceed the performance of many current desktop and portable design alternatives"

It is interesting to note how they achieved their goals but failed the objective. <snip> the real problem is daily usage has a lot of VRAM (and IO).
MHz  L2/kB  Est   Actual
 66   1024   66    63.7
 66    256   -     60.6
 75    256  68.9   -

To my mind, the article shows how innovative the PPC603 was: they made many design decisions that would have led to a great portable computer, but it ultimately failed because the primary application (a Mac) needed a 68K emulator that used a 512 kB instruction vector table (64K entries x 8 bytes per vector) plus the remaining code for execution. I remember reading a bit about the Davidian emulator, where he said that each table entry didn't contain the address of the handler for an instruction, but the first two instructions for emulating it (the second being a jump to somewhere else in the emulator, perhaps even the Next Instruction Fetch). This works well when you have a large cache and a large secondary cache; however, it hammers a smaller cache, which is why Pico-Mac's Musashi-based emulator is crippled.
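
Here is a hedged sketch of that dispatch layout (not Davidian's code, and C can only approximate it with function pointers rather than branching straight into a code table), mainly to make the size argument concrete: 64K opcode words x 8 bytes of stub each is 512 kB, far more than any 8 or 16 kB L1 can hold.

Code:
#include <stdint.h>
#include <stdio.h>

/* In the real emulator each of the 64K table entries holds the first two
 * PowerPC instructions of the handler for that 68K opcode word (8 bytes),
 * and dispatch is a branch to table_base + (opcode << 3).  A
 * function-pointer table stands in for that code table here. */
typedef void (*op_stub)(uint16_t opcode);

static void unhandled(uint16_t opcode)
{
    printf("unhandled 68K opcode 0x%04x\n", opcode);
}

static op_stub dispatch_table[65536];           /* 64K entries             */

void init_dispatch(void)
{
    for (long i = 0; i < 65536; i++)
        dispatch_table[i] = unhandled;          /* real handlers go here   */
}

void emulate_one(const uint16_t *pc68k)
{
    uint16_t opcode = pc68k[0];                 /* fetch a 68K opcode word */
    dispatch_table[opcode](opcode);             /* "jump into the table"   */
}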

The PDM-based PowerMac design was a late adoption, a back-up plan. So, it's probably fair to say that even if the Motorola 603 designers knew about the Gary Davidian emulator (quite possible), they didn't factor it in to the design. A clue to that is where they say:

"We took measurements for several benchmarks running under Minix, a Unix-like operating system."

So, they'd ported Minix to the 603, which I think was still stuck at version 1.x in the early 1990s, pre-MMU implementations. They must have known Macs were the primary market, but it suggests that they either thought Pink or Copeland would be working by the time 603 laptops appeared (or even native 603-based A/UX) and therefore Mac laptops would be running native PPC code.

Of course, that didn't happen:


"I got invited to a party at Austin, Texas.. to celebrate.. I go «I'd been complaining to Motorola, this is gonna be a disaster.. the cache is too small, I told your guys and they ignored me saying this is what they're doing.» immediately, the next day I got an email from them saying «Yeah, we're gonna make the 603e»."

Your focus isn't on the emulator though, but the impact of the data cache design in the face of high workstation-level video bandwidth. Perhaps again, the designers assumed that either 603-based Macs would be running native or would have graphics chips that precluded the need for the CPU to handle low-level stuff. It doesn't take a lot of logic to perform basic blitting or blending as the Amiga and the later Atari ST chipsets already proved (and certainly PC chipsets by the early 90s as well as obviously SGI stuff). DMA can do 8bpp and aligned 16bpp or 32bpp direct memory to video transfers so it's not an unreasonable assumption. It's just not the combination of factors that inevitably worked their way in to a budget consumer Mac.
 
Good find! I've now read about half of it.

MHz  L2/kB  Est   Actual
 66   1024   66    63.7
 66    256   -     60.6
 75    256  68.9   -

To my mind, the article shows how innovative the PPC603 was: they made many design decisions that would have led to a great portable computer, but it ultimately failed because the primary application (a Mac) needed a 68K emulator that used a 512 kB instruction vector table (64K entries x 8 bytes per vector) plus the remaining code for execution. I remember reading a bit about the Davidian emulator, where he said that each table entry didn't contain the address of the handler for an instruction, but the first two instructions for emulating it (the second being a jump to somewhere else in the emulator, perhaps even the Next Instruction Fetch). This works well when you have a large cache and a large secondary cache; however, it hammers a smaller cache, which is why Pico-Mac's Musashi-based emulator is crippled.
Their methodology was essentially: measure the real test, note the ~2.5% cache miss rate, then trace the test without considering the CPU state, assuming that 2.5% of LSU accesses are cache misses and that the cost of a miss is the cost of looking it up in a 1 MB L2.

Had they had a requirement that the 603 must also perform well without an L2, or had they known early on that an L2 cache would not ship with the laptop, their methodology would very likely have shown that the 603 as designed was not ideal. It just seems that the 603 should have had some way of dealing with cache misses other than ending up stalling the whole CPU.

The PDM-based PowerMac design was a late adoption, a back-up plan. So, it's probably fair to say that even if the Motorola 603 designers knew about the Gary Davidian emulator (quite possible), they didn't factor it in to the design. A clue to that is where they say:

"We took measurements for several benchmarks running under Minix, a Unix-like operating system."
So, they'd ported Minix to the 603, which I think was still stuck at version 1.x in the early 1990s, pre-MMU implementations. They must have known Macs were the primary market, but it suggests that they either thought Pink or Copeland would be working by the time 603 laptops appeared (or even native 603-based A/UX) and therefore Mac laptops would be running native PPC code.
I think the operating system used when running the benchmarks does not matter that much.

Of course, that didn't happen:


"I got invited to a party at Austin, Texas.. to celebrate.. I go «I'd been complaining to Motorola, this is gonna be a disaster.. the cache is too small, I told your guys and they ignored me saying this is what they're doing.» immediately, the next day I got an email from them saying «Yeah, we're gonna make the 603e»."

Your focus isn't on the emulator though, but the impact of the data cache design in the face of high workstation-level video bandwidth. Perhaps again, the designers assumed that either 603-based Macs would be running native or would have graphics chips that precluded the need for the CPU to handle low-level stuff. It doesn't take a lot of logic to perform basic blitting or blending as the Amiga and the later Atari ST chipsets already proved (and certainly PC chipsets by the early 90s as well as obviously SGI stuff). DMA can do 8bpp and aligned 16bpp or 32bpp direct memory to video transfers so it's not an unreasonable assumption. It's just not the combination of factors that inevitably worked their way in to a budget consumer Mac.
 
This attached paper is also interesting. It seems they didn't consider *expensive* cache misses at all. Similar to what was speculated earlier in the thread.

Was the 603 badly designed, or was it just not used in the way it was designed for? One could probably argue both ways.
I don't know how feasible it would have been within the constraints, but to me, if they had allowed cache-hit loads while the BIU processed a cache inhibited store, it would have let the CPU do a lot of work during IO and VRAM accesses, which would have made a huge difference for everyday usage. Clearly it does rather well on benchmarks where IO/VRAM is not involved and almost everything hits the cache, but those workloads are probably rare for the average user/software.

As for the 68k emulator on the 5200, which has a 256 kB L2, I think it is fine. Mac OS 7.5.1 feels faster than Mac OS 8.5, and definitely faster than Mac OS 9, so the 68k emulator was probably not the main issue. I recall thinking the 5300cs (it's a long time since I used it) felt a bit laggy compared to the 5200. For the 68k emulator on the 5300 laptop, it just sounds like a miss that the 603 team didn't have "must work well without an L2" as an early requirement, given that Apple was probably the only customer that could move portables in a volume that mattered.
 


Part of the problem with the 5200/6200 series is that the PPC is strapped to a bus converter that adapts it to the peripheral environment of a quadra; the chip is bottlenecked at the bus.
 
I think the only pre-G4 machine that would have some value to me is the 8100/80AV: it gets some expansion, but also the 601, and it can still boot 7.1.2. Pretty sure it's the last series that can boot 7.1.
 
Part of the problem with the 5200/6200 series is that the PPC is strapped to a bus converter that adapts it to the peripheral environment of a quadra; the chip is bottlenecked at the bus.
This is true, but I'm not 100% sure yet exactly what is causing the quite substantial slowdown.

Here is what I do know: the 603 LSU stalls on cache miss, meaning no more loads or stores can happen until the cache miss has been fully resolved. The 603 can still do FPU and ILU instructions, but it will soon need more data to work on and thus issue a load, which will stall. At this point the CPU is completely stalled and does 0 instructions per cycle. This is the same for the 603e btw, only the 604 and later PowerPCs can continue to do LSU work while a cache miss is in progress. So during a cache miss, the CPU can, after just a few cycles, be completely stalled, performing like a scalar cpu, like the 486.

Thus it is critical that cache misses are resolved fast. That's why it really needs a large L2, to reduce the time the CPU has to go to slow RAM. And as we can read, the 603 was really designed with the assumption that cache misses happen rarely and, when they do, are not so expensive (thanks to a large L2). This is the case for typical "benchmark" code like SPECint, nbench, etc.: mostly tight loops hitting cache-friendly sequential RAM and doing some calculations in between. This is hindsight bias, but for the 5200, if the 603 had been built with at least one benchmark that dealt with cache-inhibited storage (such as VRAM or disk), I think the 603 team would have done something to let cache-hit loads proceed.
As for the 5200/6200 and the "Quadra environment": the 5200/6200 has a 64-bit PPC bus that connects to the 256 kB L2 cache, so that part is fine. For cache-efficient code it is not too bad even when you miss L2, as a full block (32 bytes) is loaded on the miss. Based on the 5200/6200 developer notes, for RAM/IO there is the "Capella" chip, which translates the 603's 64-bit bus protocol to the 68040 bus protocol and connects to F108 (for RAM), Valkyrie (for VRAM; read/write buffers with 4 entries), and PrimeTime II (IO).

So on a write to (cache-inhibited) VRAM, doing an stfd (a 64-bit write) on the 64-bit bus, Capella will see it and translate it into two 32-bit writes on the 68040 bus protocol. The Valkyrie chip will process those and use 2 of its 4 entries (stalling if none are available). I see a 27-CPU-cycle LSU stall when writing an stfd to VRAM.

Now, where exactly is that time spent? Is it bus overhead? Is it Valkyrie not emptying the write buffer in time? I think that to really know, I would have to learn the bus protocol of the 603 and of the 68040, and do some more microbenchmarks.
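
For anyone curious, this is roughly the shape of microbenchmark I mean (a hedged sketch, not my actual test code): time a long run of uncached stores with the timebase register and convert to CPU cycles. The timebase ticks far more slowly than the core clock (once per four bus clocks), so you need many stores per measurement; the VRAM pointer and the clock values are placeholders for the real machine.

Code:
#include <stdint.h>

/* Read the lower 32 bits of the PowerPC timebase. */
static inline uint32_t read_tb(void)
{
    uint32_t t;
    __asm__ __volatile__("mftb %0" : "=r"(t));
    return t;
}

/* Average CPU cycles per store to a cache-inhibited region (e.g. VRAM).
 * cpu_hz / tb_hz converts timebase ticks to CPU cycles; both are
 * machine-specific placeholders.  The stfd variant stores doubles instead. */
double cycles_per_store(volatile uint32_t *vram, int n,
                        double cpu_hz, double tb_hz)
{
    uint32_t t0 = read_tb();
    for (int i = 0; i < n; i++)
        vram[i & 255] = 0;                   /* uncached 32-bit stores   */
    uint32_t t1 = read_tb();
    return (double)(t1 - t0) * (cpu_hz / tb_hz) / (double)n;
}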

A 486, which is scalar, will behave the same way (AFAIK, but I don't know the 486 well at all), i.e. it cannot do anything else while a cache miss is in progress. But a Pentium can continue to do work during a cache miss, which is a huge advantage.
 
This attached paper is also interesting. It seems they didn't consider *expensive* cache misses at all. Similar to what was speculated earlier in the thread.
Sonya Gary and Carl Dietz were involved with both papers; interesting. Also, it's impressive how you keep finding these papers!
<snip> badly designed, or was it just not used in the way it was designed for? <snip> cache-hit loads while the BIU processed a cache inhibited store <snip> huge difference for every day usage.
As I was reading the second paper, it looked to me as though the design team were frustrated. There are quite a number of excuses about short design times and constraints, and it's likely those were a factor in the design compromises.

"Performance modeling was a difiicult and time consuming part of the 603' development effort... There were several hundred parameters in the simulator that enabled observation of various design parameters and design trade-offs."

Translation: "You didn't give us enough time.. look how complex modelling is!" Or "SPEC89 and SPEC92.. There are nearly 80 billion
instructions in these applications.."
Translation: "No-one could have modelled it well enough in the time-frame." Or "The full trace, in this case, took days while the sampled trace took minutes to run with BRAT simulation." Translation: "BRAT made simulation possible so don't diss it!"

They spent time looking at various cache trade-offs, but many of them were already known for most architectures: e.g. the 68030, 68040 and M88K series all had split I/D cache designs, while only the 601 had a unified cache. This suggests that the 603 team had had pressure from higher management and were being defensive about split caches. Or, consider Fig 3:

[Fig. 3 from the paper: performance vs. number of completion buffers]
Notice the kink after 4 completion buffers? This suggests there's a bottleneck elsewhere in the architecture (maybe it's a bottleneck they knew about).

Allowing some kind of out-of-order completion for FPU operations and integer ops would have helped quite a bit re: Quake, but that wasn't one of their benchmarks ;-)!

"The configuration of five completion buffers and five GPR renames optimized integer performance, including a load/add loop of concern to a major customer."

This is a reference to Apple, the only major customer! Why would Apple need Load/Add? Critical loop Color QuickDraw? This kind of thing must really have annoyed them, because the M88K already had an execution unit for saturating graphics ops of precisely the right kind for Color QuickDraw; stuff that only appeared in CPUs and PowerPC much later.

This is probably the closest they come, IMHO, to talking about the issues you've identified. Since (I guess) much of the 603 team was ex-M88K, it must have been very disappointing to have worked on the M88K and given it features that would have worked well with graphics on a Mac, only to see it thrown away, and then to be patronised by being asked to add compensating features that wouldn't have done what they had already done.

You can see that in other parts of the paper where they discuss execution units. The M88110 had lots of execution units while the PowerPC 601 didn't, so they spend quite a bit of time justifying extra execution units.

I agree that it probably would have been quite cheap to allow cached loads/stores and ALU/FPU/branch instruction execution to continue while non-cached loads and stores took time to resolve. It probably needs one reservation station, a non-cached write buffer, and a little extra scoreboarding.

those workloads are probably rare for the average user/software.

As for the 68k emulator on the 5200, which has a 256 kB L2, I think it is fine. Mac OS 7.5.1 feels faster than Mac OS 8.5, and definitely faster than Mac OS 9, so the 68k emulator was probably not the main issue. I recall thinking the 5300cs (it's a long time since I used it) felt a bit laggy compared to the 5200. For the 68k emulator on the 5300 laptop, it just sounds like a miss that the 603 team didn't have "must work well without an L2" as an early requirement, given that Apple was probably the only customer that could move portables in a volume that mattered.
Replying to this in a future email.
 
Part of the problem with the 5200/6200 series is that the PPC is strapped to a bus converter that adapts it to the peripheral environment of a quadra; the chip is bottlenecked at the bus.
Is this a possible reference to the 5200/6200 being a "Road Apple"? There's a classic article that corrects the myths that have grown up around these Macs. I must say that my original experience of a Performa 5200 (with 12MB of RAM, running System 7.5.2) was wonderful. I was house-sitting for a friend who had one and used it to write my Manchester Uni application (I scanned all the pages and imported them as images into a ClarisWorks 3.0 word processing document; overlaid text boxes where all the responses needed to go; filled it in; then covered the images with borderless white rectangles before finally printing it all out onto the application).

It just seemed amazingly powerful compared with my Performa 400 (LCII)!

Later I bought into the Road Apple description of the 5200/6200, until I saw this article:


<snip> 603 LSU stalls on cache miss <snip> does 0 instructions per cycle. This is the same for the 603e btw, <snip> performing like a scalar cpu, like the 486. <snip>

So on a write to (cache-inhibited) VRAM, doing an stfd (a 64-bit write) on the 64-bit bus, Capella will see it and translate it into two 32-bit writes on the 68040 bus protocol. The Valkyrie chip will process those and use 2 of its 4 entries (stalling if none are available). I see a 27-CPU-cycle LSU stall when writing an stfd to VRAM.
Yes, that's a puzzle if the write buffer is being filled more slowly than it empties. Unless there's some kind of write-completion acknowledgement that's required for uncached memory?

<snip> bus protocol of the 603 and of the 68040, and do some more microbenchmarks.
Which sounds like more fun! (or at least educational).
A 486, which is scalar, will behave the same way (AFAIK, but I don't know the 486 well at all), i.e. it cannot do anything else while a cache miss is in progress. But a Pentium can continue to do work during a cache miss, which is a huge advantage.
It's probably more imperative for the Pentium design: because it had so few registers, memory transactions were more important.

<snip> 68k emulator on the 5200 <snip> probably not the main issue. I recall thinking the 5300cs (long time since I used it now) felt a bit laggy <snip> amiss that the 603 team didn't have "must work well without an L2" as a requirement early given that probably Apple was the only one that could move portables in a volume that mattered
Agreed. I suspect though that given the size and number of teams it was more likely a case of everyone passing on responsibility to another team, combined with the inevitable Chinese whispers from communication via layers of management (which I've seen quite often). How far along was the PPC603 development when Tesseract was cancelled in March 1993? We can guess. The PPC603 reached first Silicon by October 1993 and was an 18-month project.


Thus, it began in March 1992. Therefore, it's reasonable they assumed Tesseract was the intended architecture. The PPC 601 took about 15 months to develop and had prototypes by October 1992. So, that meant another 18 months between PPC 601 first silicon and PowerMac 6100 and 20 months between PPC603 first silicon and the Performa 5200 (May 1995). PPC 604 made first Silicon in Jan 1994, and the first Power Macs to use it appeared in August 1995 (as did the PB5300), so that's 20 months again.


My PB1400c/166 (128kB L2) started out as a PB1400cs/117 (0 L2), but even the 117 seemed OK to me under Mac OS 8.1, even if I do stressy things like running SoftWindow 3.1 on it.

It does seem like (with hindsight) they made some real development mistakes. Apple must have known, by March 1993 what the cache size for the 603 was. They could have simulated the 68K emulator running on a PowerPC Mac with only 8kB of cache at the very least. They could have made IBM produce some sample 601s with half the cache and tried them on a 6100 without L2. They could have worked on a dynamic recompilation emulator in the 26 months between knowing about the 603 cache sizes and releasing the P5200. Hmmm.

I keep bringing up the emulator, but you're right that this thread is about graphics/game/demo bottlenecks on those early PPC603(e) Macs. I'll try not to in future posts!
 
Is this a possible reference to the 5200/6200 being a "Road Apple"? There's a classic article that corrects the myths that have grown up around these Macs. I must say that my original experience of a Performa 5200 (with 12MB of RAM, running System 7.5.2) was wonderful. I was house-sitting for a friend who had one and used it to write my Manchester Uni application (I scanned all the pages and imported them as images into a ClarisWorks 3.0 word processing document; overlaid text boxes where all the responses needed to go; filled it in; then covered the images with borderless white rectangles before finally printing it all out onto the application).

It just seemed amazingly powerful compared with my Performa 400 (LCII)!
It should be amazingly powerful compared to a Performa 400. The question is, how was it measuring up against its competition, which by 1995 was Pentium/top 486.

Yes, that's a puzzle if the write buffer is being written at < the rate it would empty. Unless, there's some kind of write-completion acknowledgement that's required for uncached memory?
L2 cache miss is roughly 32 cpu cycles for RAM. And valkyrie write buffer is seemingly available again after 8 cpu cycles.

Which sounds like more fun! (or at least educational).

It's probably more imperative for the Pentium design, because it had so few registers, memory transactions were more important.
It is equally important for the 603, as it completely stalls. It doesn't matter how many registers you have available if you cannot load them :)

My PB1400c/166 (128kB L2) started out as a PB1400cs/117 (0 L2), but even the 117 seemed OK to me under Mac OS 8.1, even if I do stressy things like running SoftWindow 3.1 on it.
Did you ever run a native Pentium 100 or Pentium 166 to compare with? Because of course it will feel snappy compared to an even older Mac :)

It does seem like (with hindsight) they made some real development mistakes. Apple must have known, by March 1993 what the cache size for the 603 was. They could have simulated the 68K emulator running on a PowerPC Mac with only 8kB of cache at the very least. They could have made IBM produce some sample 601s with half the cache and tried them on a 6100 without L2. They could have worked on a dynamic recompilation emulator in the 26 months between knowing about the 603 cache sizes and releasing the P5200. Hmmm.

I keep bringing up the emulator, but you're right that this thread is about graphics/game/demo bottlenecks on those early PPC603(e) Macs. I'll try not to in future posts!
The thread was really about comparing the 5200 with its natural competition, Pentium/late 486, and Amiga as well.

I don't think it should be "graphics only".

I think the spec92 shows one thing (the 603 is competitive with the Pentium on specint92), while qualitatively, as I recall it, the 5200 always felt slow compared to friends' PCs (was that due to the emulator, or the 5200 graphics, or the combination of bus overhead on cache misses, which the 603 is particularly bad at handling... I'm not sure).

Quake shows the 5200 was quite behind the PC. Doom/Marathon2, which do not require the parallel FPU, also feels much slower on Mac versus Doom on Pentiums (I tried Doom on the 5200, but without recompiling it I cannot get a timedemo/FPS figure for it).

I checked out Speedometer 4.02, used its benchmark database of various Macs, ran it on my machine, and also found some numbers in a Usenet thread. On 8bit graphics, the 5200 is *really* slow compared to 601 macs, and even to 68040 macs. So either the valkyrie really is very slow, or something actually was problematic with how the 68040 bus interacted with the PPC 603 bus.
 

It should be amazingly powerful compared to a Performa 400. The question is, how was it measuring up against its competition, which by 1995 was Pentium/top 486.
OK, so all of my workplaces were PC-centric apart from a company I wrote iOS apps for in 2010 to 2011 who allowed Macs for those purposes. That's typical in the UK. In the 1995-1996 timeframe I was first given a 99MHz 486 DX4 PC by a cheap PC maker who went bust about a year later (Escom!!). It had 4 MB and was terrible for compiling Turbo C++ 3.5 for Windows apps on it to the point where I persuaded them to swap it for a 66MHz 486 DX2 that at least had 8MB of RAM. That was the fastest PC I used until 1999.

By comparison my P400 compiling THINK C 5.04 apps felt fast - frankly, my PB100 compiling THINK C 5.04 apps felt about as good as the DX2! At the time I was mostly looking at computer performance from a perspective of productivity - Macs of almost any speed felt more productive than even 486-class PCs. I wasn't a games player, so I didn't really experience raw performance in the same way.

In spring 1996 I was house-sitting and using the P5200 and indeed it felt a lot more capable than any of the PCs I'd used. So, I hadn't experienced a Pentium/100 or faster PC.

In late 1996 I was still writing software for the same company, but had moved to Manchester Uni. I persuaded them it was cheaper to buy me Softwindows 3.0 to go with the Turbo C++. On my PM4400/160 Turbo C++ seemed a little slower than the DX2, but much faster than the DX4 so I was fine with it. I have it installed on my PB1400 now so I can directly compare.
L2 cache miss is roughly 32 cpu cycles for RAM. And valkyrie write buffer is seemingly available again after 8 cpu cycles.
Interesting.
It is equally important for the 603, as it completely stalls. It doesn't matter how many registers you have available if you cannot load them :)
Agreed, I just meant that it would be more obvious to the Pentium developers; like you say, the 603 designers assumed latency wouldn't be that bad, and if it was, it was worth it to save power.
Did you ever run a native Pentium 100 or Pentium 166 to compare with? Because of course it will feel snappy compared to an even older Mac :)
As you can see from earlier, I didn't until 1999, and I wouldn't have noticed given the way I used and assessed Macs vs PCs.
The thread was really about comparing the 5200 with its natural competition, Pentium/late 486, and Amiga as well.
Good point.
I don't think it should be "graphics only".
I keep thinking your demo stuff implies you're focussed on graphics.
I think the spec92 shows one thing <snip> while qualitatively, as I recall it, the 5200 always felt slow compared <snip: emulator/Valkyrie/bus/cache>... I'm not sure). Quake shows the 5200 was quite behind the PC. Doom/Marathon2 <snip> also feels much slower on Mac<snip>
OK
<snip> Speedometer 4.02 <snip> 8bit graphics, the 5200 is *really* slow compared to 601 macs, and even to 68040 macs. So either the valkyrie really is very slow, or something actually was problematic with how the 68040 bus interacted with the PPC 603 bus.
OK.
 
OK, so all of my workplaces were PC-centric apart from a company I wrote iOS apps for in 2010 to 2011 who allowed Macs for those purposes. That's typical in the UK. In the 1995-1996 timeframe I was first given a 99MHz 486 DX4 PC by a cheap PC maker who went bust about a year later (Escom!!). It had 4 MB and was terrible for compiling Turbo C++ 3.5 for Windows apps on it to the point where I persuaded them to swap it for a 66MHz 486 DX2 that at least had 8MB of RAM. That was the fastest PC I used until 1999.

By comparison my P400 compiling THINK C 5.04 apps felt fast - frankly, my PB100 compiling THINK C 5.04 apps felt about as good as the DX2! At the time I was mostly looking at computer performance from a perspective of productivity - Macs of almost any speed felt more productive than even 486-class PCs. I wasn't a games player, so I didn't really experience raw performance in the same way.

In spring 1996 I was house-sitting and using the P5200 and indeed it felt a lot more capable than any of the PCs I'd used. So, I hadn't experienced a Pentium/100 or faster PC.

In late 1996 I was still writing software for the same company, but had moved to Manchester Uni. I persuaded them it was cheaper to buy me Softwindows 3.0 to go with the Turbo C++. On my PM4400/160 Turbo C++ seemed a little slower than the DX2, but much faster than the DX4 so I was fine with it. I have it installed on my PB1400 now so I can directly compare.
That's interesting! OK, I will need to fire up my Pentium 100 and do some comparisons eventually. I have only used it to watch some demos, and my memory of my friends' 486s/Pentiums might have been too colored by how the games performed, although I recall thinking Windows felt quite snappy.

L2 cache miss is roughly 32 cpu cycles for RAM. And valkyrie write buffer is seemingly available again after 8 cpu cycles.
Interesting.
valkyrie writes: https://68kmla.org/bb/index.php?thr...2xx-63xx-max-vram-bandwidth.49018/post-551354
(I should re-measure it, as the details elude me now, but 4 stw => fast, while with 5 stw, 33% of the stw take 8 CPU cycles longer, implying that the write buffer frees an entry every 8 CPU cycles. But this is not the sustained speed: in another test I made with 320x240 writes, the time between stfd's (each == 2 entries in the write buffer) is 27 CPU cycles.)

And for the L2 miss, reading from RAM: https://68kmla.org/bb/index.php?thr...cache-work-on-nubus-powerpc.47395/post-532041

I keep thinking your demo stuff implies you're focussed on graphics.
That is *my* primary interest, but it doesn't paint the full picture :). Also, I want to really understand the 603 and the 5200, because only that way can I figure out how to get optimal speed out of them. The insights into where the 603 has trouble have already made my asm routines much faster, and I can now rather easily beat any compiler at max optimization, since the compiler simply has no idea which memory is cache-inhibited and may cause stalls, or how best to handle that.
 
Is this a possible reference to the 5200/6200 being a "Road Apple"? There's a classic article that corrects the myths that have grown up around these Macs.
No, it was us reading the hardware dev notes and the PowerPC 603 manual. The 603 is operated as a 64-bit data-bus chip, so it fills its caches in 64-bit chunks, and a fetch cycle on the CPU requires two data cycles on the 040 bus; there's no real way around that. The CPU is bottlenecked at the memory bus.
 
OK, this was interesting... the Quadra 630 beats my 5200 on Speedometer 8-bit graphics! I ran the test several times; 8-bit graphics is actually quite a bit faster under Mac OS 7.5.1 than under Mac OS 8.5.1 (but 16-bit is the same).

Machine      8-bit graphics  16-bit graphics  Mac OS
Quadra 630        1.13              -          7.5.5
5200              0.95             0.71        7.5.1
5200              0.81             0.71        8.5.1
 
No, it was us reading the hardware dev notes and the PowerPC 603 manual. The 603 is operated as a 64-bit data-bus chip, so it fills its caches in 64-bit chunks, and a fetch cycle on the CPU requires two data cycles on the 040 bus; there's no real way around that. The CPU is bottlenecked at the memory bus.
True, although this is only when there are cache misses both in L1 *and* L2. So it will not be constantly bottlenecked by the memory bus, only occasionally (although when it does happen, it will be expensive!).
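
To put numbers on it: a 32-byte L1 line fill is four 64-bit beats on the 603 side, i.e. eight 32-bit data transfers on the 040-style bus, before the line is complete.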

The problem comes with cache-inhibited data (like storing pixels to VRAM, or IO, although that is generally less time-critical), as well as in the (few) situations where it is hard to make the code cache-friendly (or with naively written code).
 