
How did the PowerPC 603 / 5200 at 75 MHz compare to PCs (486/Pentium)?

noglin

Are there (reliable?) benchmarks that compare these CPUs or these machines?

I think most likely the PPC 603 performed quite a bit better than a 486DX2/66 P24D.

Against the Pentium at 75 MHz I'm less sure. In real-world use cases the P75 would likely be quite a bit faster, due to the bus and due to having integer instructions that can fuse and do more work compared to the 603 (which, in contrast to the 603e in the 5300, gets no help from the SRU for integer instructions). However, I think the 603 would likely win on pure float benchmarks.


PPC 603 in Macintosh Performa 5200 (1995, CPU 1994)
- 75 MHz (and 37.5 MHz bus)
- 1.6M transistors (85mm^2, 0.5 micron process)
- 2+1 instructions per cycle (FPU & IU, + branch fold)
- L1: 8 KB / 8 KB
- the 603 has a 64-bit memory bus (the Macintosh Performa 5200 has its L2 cache on the 64-bit bus, and the motherboard's two busses are 32-bit and bridged to the 64-bit bus)

486DX2 P24D (CPU: 1992)
- 66 MHz (and 33 MHz bus)
- 1.2M transistors
- L1: 8 KB unified
- P24D (L1 "write-back" cache)

Pentium (P54C) (CPU: 1994, socket 5/7)
- 75 MHz (and 50 MHz bus)
- 3.2M transistors
- L1: 8 KB / 8 KB (write-back)
 
I like the PowerPC design far more than x86, and I think the PPC 603(e) series is under-appreciated.

If we compare SpecInt92 and SpecFp92 ratings for the processors in question:

CPU | SPECint92 | SPECfp92 | Link
PowerPC 603 (66MHz) | 60 | 70 | Google Groups
PowerPC 603 (80MHz) | 75 | 85 | Google Groups
PowerPC 603e (100MHz) | 120 | 105 | Google Groups
Pentium 66 / 256kB L2 | 64.5 | 56.9 | TechMonitor
DX2 at 66 / 256kB L2 | 32.2 | 16.0 | TechMonitor

We can see that a Pentium is about 7.5% faster per MHz than a PowerPC 603, but a PowerPC 603e is 23% faster than a Pentium per MHz. I don't know the L2 cache size for the PowerPC tests (it could be 0kB). However, clearly, a DX2 is about half the speed of a PowerPC 603.
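If you want to check the per-MHz arithmetic yourself, here's a quick throwaway sketch in C (just the table numbers above plugged into a few lines; the variable names are mine):

Code:
#include <stdio.h>

int main(void)
{
    /* SPECint92 figures copied from the table above */
    double ppc603_66   = 60.0;   /* PowerPC 603  @  66 MHz      */
    double ppc603e_100 = 120.0;  /* PowerPC 603e @ 100 MHz      */
    double p5_66       = 64.5;   /* Pentium @ 66 MHz, 256 kB L2 */

    double r603  = ppc603_66   / 66.0;   /* SPECint92 per MHz */
    double r603e = ppc603e_100 / 100.0;
    double rp5   = p5_66       / 66.0;

    printf("Pentium vs 603 per MHz:  %+.1f%%\n", (rp5 / r603 - 1.0) * 100.0);
    printf("603e vs Pentium per MHz: %+.1f%%\n", (r603e / rp5 - 1.0) * 100.0);
    return 0;  /* prints roughly +7.5% and +22.8% */
}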

In practice it's hard to compare machines, because early Pentium machines were mostly running 16-bit Windows 3.1, and even Windows 95 was somewhat of a hybrid 32-bit/16-bit OS; whereas PowerPC Macs were running a 32-bit OS, most of which ran under 68K emulation apart from the performance-critical components. At a hardware level, Windows PCs were transitioning to PCI from late 1994 but were still largely in the ISA / EISA / VESA Local Bus world; whereas PowerPC Macs at the time were using NuBus (32-bit and faster than ISA) plus some PDS cards. Mac OS had HFS-formatted SCSI (a better file system and higher-performance drives) vs still-FAT-formatted 8 MHz IDE drives for PCs (though obviously consumer Macs were using IDE by then).

In other words, there were marginal differences & the PowerPC was much lower power than Pentium, which meant they were at least competitive.

Arguably though, Apple made a mistake with PowerPC. If they'd continued with the 88K, they might have been able to get 88K-based Macs out in 1992, since there were already prototypes in operation in 1991 with a suitable emulator, and the up-and-coming 88110 with two integer units could have matched comparable Pentium machines. According to this link, a 40MHz 88110 could manage 37.8 SPECint92 and 50.5 SPECfp92 (62.4/83.3 at 66MHz). Given that Motorola won the battle to keep the M88K bus on PowerPC chips, it wouldn't surprise me if the PowerPC 603 is a heavily reworked M88110 (with just one integer unit). It would make sense from a development viewpoint.

See also, the thread on fantasy 88K-based Macs.
 
At the same clock I would estimate the 603 at around 5x faster than the 486 and around 50% faster than the Pentium. This CPU benchmark has been run on a lot of different hardware over the years, and is one way to try to compare different machines using the same software:


The compiler and OS can make a difference, but that would be true when comparing different machines anyway.
 
That's a very nice resource, thanks! Especially since the source code of the benchmark is available. I will try (eventually) to port it so I can run it under Mac OS on a 5200.

Once I have results I will update and add them. Probably it will be close to how the 601 performs (relative changes): 486DX2@66: +16%, +70%, +230%; P90 scaled to 66: -39%, -31%, +17%.
CPU | Machine | L2 (or L3) cache | OS | Compiler | MEM | INT | FP
603e 200 MHz | 200MHz Freescale MPC8241 LinkStation (LS1) (Buffalo NAS) w/ MPC603e Motorola PowerPC core | 16KB (probably a typo; this is likely L1) | Linux 2.4.17_mvl21-sandpoint | gcc-3.3.5 | 0.713 | 1.050 | 0.877
603e 180 MHz | Motorola PPC 603ev 180 MHz | 256 KB | Linux-pmac-2.1.24 | egcs-1.0 | 0.721 | 1.016 | 1.396
603e 166 MHz | Amiga dual CPU: PPC 603e 166MHz and 68040-25MHz | 0 KB | PPC-Linux | gcc 2.7.2.1 ppclinux | 0.610 | 0.875 | 1.020
P54C 90 MHz | Intel Pentium 75 - 200, 90MHz | ? | Linux 2.4.28 | gcc-2.95.4 20011002 | 0.255 | 0.273 | 0.479
P54C 90 MHz scaled to 66 MHz | | | | | 0.200 | 0.351 | 0.351
601 66 MHz | Apple Power Macintosh 601 66MHz | ? | MkLinux | gcc 2.7.2.1 (?) | 0.122 | 0.240 | 0.410
68060 50 MHz | Amiga with Motorola 68060 50 MHz CPU | ? | Linux-2.0.36 | gcc 2.7.2.3 | 0.175 | 0.215 | 0.096
486DX2 66 MHz | Intel 486DX2 66 MHz | 256 KB | Linux-2.1.66 | gcc 2.7.2.3 | 0.105 | 0.142 | 0.124
 
Returning to this thread after several months as I missed the replies. Thanks for the comments. Good find on the links to MPR articles, as that magazine was very well respected, but had a really expensive subscription (Manchester University, UK subscribed to it when I was there).

It's interesting to note how much better the 603e did over the Pentium on these Byte benchmarks - I guess the SPEC marks operate in quite a different way. I've recently been re-reading the BitSavers architecture links on the 603e, 88110 and Pentium and thinking about why the 603e and Pentium could beat the 88110 even though the 88110 had many more execution engines (all of them are 2-issue, superscalar CPUs, and the 603, 88110 and Pentium all have 8kB instruction + 8kB data caches, while the 603e has 16kB caches).

An obvious place to start is to think about the most common instructions in any piece of code. One tends to find that the most basic instructions have a similar distribution:
  1. An ALU operation.
  2. A Load/Store operation.
  3. A comparison.
  4. A branch.
This means that any 2-issue superscalar CPU can achieve similar performance if it contains at least 2 execution engines targeted at these operations. The 603e has a separate load/store execution engine and Integer Unit; the Pentium appears to only have 2 ALU pipelines, but in reality both of those pipelines include the equivalent of a load/store unit, because x86 instructions have reg,mem / mem,reg operations. Therefore it could be argued that the Pentium has the equivalent of 4 execution units in some circumstances. Finally, the 88110 also has ALU units (in fact 2) plus effectively a load/store unit (though it's integrated into the MMU and cache).
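To make the reg,mem point a bit more concrete, here's a hedged illustration; the instruction sequences in the comment are my guess at typical compiler output, not something I've measured:

Code:
/* One statement from a typical inner loop: */
long acc_step(long acc, const long *p)
{
    return acc + *p;
}

/*
 * Hand-waved instruction sequences (a guess, not measured):
 *
 *   Pentium:  add eax,[esi]      ; load folded into the ALU op
 *
 *   PPC 603:  lwz r4,0(r3)       ; separate load (load/store unit)
 *             add r5,r5,r4       ; then a 3-operand add (IU)
 *
 * The single x86 instruction effectively uses an address-generation/
 * load resource plus an ALU, which is why a paired U+V can look like
 * up to 4 "RISC-equivalent" operations per cycle in the best case.
 */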

All of these CPUs have their own kinds of limitations though: the 603e only has 1 IU while the Pentium and 88110 are restricted on how they can pair up instructions.

Both 8kB caches: 32-byte line size x 2-way, same organisation as the 88110 for both caches.



88110 Features (Issue restrictions, Branch Target Instruction Cache)


As an early superscalar CPU, the 88110 has some significant issue limitations in that issue happens in order. Both slots stall if the 1st slot is stalled; the 2nd slot's instruction moves into the 1st slot if it can't yet be issued but the 1st can; and both proceed only if both slots can be issued. This means that no instruction can ever bypass a stalled issue slot (and stalls will frequently happen due to data dependencies). In essence, it has a 2-entry reservation station shared across all units (though the branch unit also has its own reservation station once a branch has been issued).

The 88110 implements what Motorola calls a Target Instruction Cache, which has 32 entries and each entry contains 2x 32-bit instructions + a logical address tag. Initially, they're all marked as invalid, but whenever a branch is taken, the next two instructions fetched are copied to a random, free TIC entry and marked as valid. Then if a branch at that address is taken in the future, the issue queue immediately replaces whatever was there with the corresponding two instructions from the TIC. This eliminates a branch penalty about 50% of the time, depending on whether the branch instruction itself was in the first or second issue queue slot.
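As a toy model of that behaviour (the entry count and random replacement come from the description above; everything else is invented purely for illustration):

Code:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the 88110 Target Instruction Cache described above:
 * 32 fully-associative entries, each holding the two instructions that
 * follow a taken branch, tagged by the branch target address. The real
 * hardware picks a free entry; this sketch just picks a random one.   */
#define TIC_ENTRIES 32

typedef struct {
    int      valid;
    uint32_t tag;      /* logical address of the branch target   */
    uint32_t insn[2];  /* first two instructions at that target  */
} tic_entry;

static tic_entry tic[TIC_ENTRIES];

/* On a taken branch: if the target hits, the two cached instructions
 * can be fed straight into the issue queue (branch bubble hidden).   */
int tic_lookup(uint32_t target, uint32_t out[2])
{
    int i;
    for (i = 0; i < TIC_ENTRIES; i++)
        if (tic[i].valid && tic[i].tag == target) {
            memcpy(out, tic[i].insn, sizeof tic[i].insn);
            return 1;                    /* hit  */
        }
    return 0;                            /* miss: normal fetch path */
}

/* On a miss, the next two fetched instructions are copied into an entry. */
void tic_fill(uint32_t target, const uint32_t insn[2])
{
    tic_entry *e = &tic[rand() % TIC_ENTRIES];
    e->valid = 1;
    e->tag = target;
    memcpy(e->insn, insn, sizeof e->insn);
}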

In addition, the 88110 uses a static branch prediction scheme for bcnd (branch-conditional) instructions whereby backward branches are predicted taken. Finally, the 88110 can execute subsequent instructions in the instruction stream while a branch in the branch reservation station is waiting for its condition (stored in a GPR) to be evaluated and can back out of those instructions if the branch wasn't predicted correctly.

So, the 88110 can short-cut a cache fetch when a branch takes place, but it can't eliminate the branch itself.

(88110 MMU translation caches: PATC = 32 entries, BATC = 8 entries.)

Pentium Issue & Branch Prediction


The Pentium has more severe issue limitations than the 88110. Both pipelines (U and V) proceed in lockstep, where any stall in one pipeline before the EX stage also stalls the other pipeline. The EX stage, however, can complete a U-pipeline instruction while V is stalled in EX. In addition, both pipelines can only be used if instructions are 'simple': no dependencies, no displacement + immediate operands, and almost always no prefixes. Quite a lot of instructions are deemed 'simple', though, including most mov/alu reg,reg/mem/immediate and mem,reg instructions plus inc/dec/lea/push/pop/jmp/call/jcc instructions. However, jumps can only be paired if they go to V.
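If it helps, here's a very rough sketch of the U/V pairing decision as summarised above; the predicates are placeholders and the real Pentium rules have far more corner cases:

Code:
#include <stdbool.h>

typedef struct {
    bool simple;          /* in the 'simple' class (mov/alu/inc/lea/...) */
    bool has_prefix;      /* prefixes mostly break pairing               */
    bool disp_and_imm;    /* displacement + immediate operand            */
    bool is_jump;         /* jmp/call/jcc                                */
    int  dest_reg, src_reg;
} insn;

/* Can instruction v issue in the V pipe alongside u in the U pipe? */
static bool can_pair(const insn *u, const insn *v)
{
    if (!u->simple || !v->simple) return false;
    if (u->has_prefix || v->has_prefix) return false;
    if (u->disp_and_imm || v->disp_and_imm) return false;
    if (u->is_jump) return false;            /* jumps only pair in V */
    /* a register dependency between the pair blocks issue of V      */
    if (v->src_reg == u->dest_reg || v->dest_reg == u->dest_reg)
        return false;
    return true;
}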

The Pentium branch prediction uses a dual 32-byte prefetch buffer; one for sequential execution (including predicted non-taken branches) and the other (speculative) filled according to the branch target buffer once a branch is predicted taken. Both buffers are cleared once a branch has been mis-predicted.

PPC603e Issue & Branch Prediction

Instruction issue is more flexible on the 603(e). The 603e has some rename registers (ctr, lr, 5x GPR, 4x FPR) and it has reservation stations for each Execution Unit (branch conditional, IU, FPU).

The PPC603(e) has static branch instruction prediction and can perform branch folding. The Branch Processing Unit decodes Instruction Queue branches prior to the dispatch stage, allowing following instructions in the 6-entry instruction queue to be issued without a branch cycle or penalty in some cases. It can speculatively execute the predicted branch instruction stream, but won't write-back results until the branch is properly completed. This avoids having to restore CPU state at the expense of periodic stalling.

The PPC603 has a few more limitations compared to the PPC603e: stores have a 2-cycle latency & throughput, and add & compare instructions aren't performed in the SRU (I guess they're performed in the IU).

Conclusion

It's interesting (IMHO) to compare the design decisions in these three superscalar architectures, because they all have very similar levels of performance and a similar instruction issue rate.

The biggest individual weakness of the 88110 is likely to be lack of reservation stations and rename registers: the 2 entry issue queue essentially is the reservation station. The second biggest weakness is likely to be issue restrictions and the third weakness, the branch unit placed at the Execution stage (though the branch target cache will help).

The biggest individual weakness of the Pentium is likely to be the far more serious issue restrictions than any of the RISC architectures. The second-biggest weakness is the complex pipelining in both execution engines, in that reg,reg instructions in one pipeline will get stalled by reg,mem instructions in the other. The third-biggest weakness is the lack of registers. I was thinking that the 32-byte prefetch buffer was a major weakness, but actually it's the same size as the 8-entry Instruction Queue on a PPC 601 and only 25% longer than the one on the PPC 603(e). This implies that instruction throughput is probably fairly similar.

The most critical issue in all three is likely to be the size of the caches. Even though the code density of x86 is about 33% better than PPC (according to this paper), it's not enough to compensate for the 2x (16kB) cache sizes on the PPC603e. The most surprising aspect about the Pentium is that it can actually keep up with the other architectures, even though it needed over 2x the number of transistors as an 88110 and PPC603.

Finally, returning to the 88110. I think it's fair to say that if Apple had stuck with the 88110 and dumped the CPU/Pink OS collaboration with IBM, we could easily have had cutting-edge RISC Macs by 1992, a full year before the Pentium was released and the boost in revenue to Motorola could have kept the RISC Macs competitive until at least the end of the actual PowerPC era (though I am a substantial PPC603e fan). Another upshot could have been the emergence of Radiation-hardened 88K (rather than PPC) CPUs ending up on Mars!
 
Hi folks,
I can see that people are still reading this thread, so I'll add another post. I've been reading through old issues of Microprocessor Report. Feb 12, 1992 is an interesting one, because it's discussing the up-and-coming new generation of CPUs: the 486/50, the P5 (before it was renamed Pentium), the RS/6000 RSC (which was modified to become the PPC601); the R4000; PA-RISC and SuperSparc.

There's an interesting comment about the 88110:

"Despite the appeal of the IBM alliance, it is a shame that Apple won’t be building 88110 systems, since it probably would have given them a lead of at least a year over the first PowerPC chips and would have provided considerably higher performance. The 88110 isn’t yet production- ready, however; sources indicate there are still serious bugs in the MMU."

In Simulation they predicted a really high performance:

[Attached image: MPR table of simulated performance figures]
It concurs with my earlier comment about the advantages the 88110 could have brought.
<snip>Finally, returning to the 88110. I think it's fair to say that if Apple had stuck with the 88110 and dumped the CPU/Pink OS collaboration with IBM, we could easily have had cutting-edge RISC Macs by 1992, a full year before the Pentium was released and the boost in revenue to Motorola could have kept the RISC Macs competitive until at least the end of the actual PowerPC era (though I am a substantial PPC603e fan). Another upshot could have been the emergence of Radiation-hardened 88K (rather than PPC) CPUs ending up on Mars!
Also, possibly, if Motorola hadn't had to work on the PPC, the 88110's MMUs could have been debugged earlier.
 
Once I have results will update and add it it. Probably it will be close to how the 601 performs (relative changes): 486DX2@66: +16%, +70%, +230%; P90 scaled to 66: -39%, -31%, +17%.
The table you generated didn't seem to have the PM5200 entry (no 603@75MHz). Did you manage to get the results? Also, what's your dev environment for PM5200 coding?

-cheers
 
Interesting read! There are some things to be added for the 603/603e:
- the 603e can execute add/cmp in both the SRU and the ILU, while the 603 can only execute these in the ILU
- for the 603/603e, one should also consider the completion queue, with 5 entries (program order), from which only 2 can be retired per cycle, and store/FPU/SRU instructions can only retire from the last slot. So e.g. an FPU div at 18 cycles will sit there and occupy that entry, during which no store nor any SRU instruction can retire (a rough sketch of this follows below).
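Here's a rough sketch of how I picture that retirement constraint (just my mental model, not lifted from the 603 user's manual; I'm reading "last slot" as the oldest entry):

Code:
#include <stdbool.h>
#include <string.h>

/* Toy model of a 5-entry, in-order completion queue: at most 2 retire
 * per cycle, and store/FPU/SRU results only retire from the oldest
 * slot. Purely illustrative.                                          */
enum { CQ_DEPTH = 5 };

typedef struct {
    bool occupied;
    bool finished;        /* executed, waiting to retire in order */
    bool needs_last_slot; /* store / FPU / SRU instruction        */
} cq_entry;

/* q[0] is the oldest entry; returns how many instructions retired.
 * An unfinished 18-cycle fdiv sitting in q[0] keeps everything behind
 * it from retiring (retirement is in order), so stores and SRU ops
 * queued after it simply pile up until it completes.                 */
static int retire_one_cycle(cq_entry q[CQ_DEPTH])
{
    int retired = 0;
    while (retired < 2 && q[0].occupied && q[0].finished) {
        /* a store/FPU/SRU op only retires if it began the cycle in
         * the oldest slot, i.e. nothing retired ahead of it          */
        if (q[0].needs_last_slot && retired > 0)
            break;
        memmove(&q[0], &q[1], (CQ_DEPTH - 1) * sizeof q[0]);
        q[CQ_DEPTH - 1].occupied = false;
        retired++;
    }
    return retired;
}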

Another thing: the x86 has many "complex" instructions for very tight inner loops that make it possible to do some truly ingenious things that run very fast; Michael Abrash's texture mapper code comes to mind (pg 28, listing 2): http://www.chrishecker.com/images/5/5e/Gdmtex5.pdf
 
I just did now. I took the code, used CW7.1 and set the global optimizer to max, targeted specifically 603 and enabled all optimizer features.

Code:
TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          26.039  :       0.67  :       0.22
STRING SORT         :          1.3046  :       0.58  :       0.09
BITFIELD            :      1.6681e+07  :       2.86  :       0.60
FP EMULATION        :          2.9556  :       1.42  :       0.33
FOURIER             :          660.26  :       0.75  :       0.42
ASSIGNMENT          :         0.28291  :       1.08  :       0.28
IDEA                :          96.775  :       1.48  :       0.44
HUFFMAN             :          48.546  :       1.35  :       0.43
NEURAL NET          :         0.31874  :       0.51  :       0.22
LU DECOMPOSITION    :          13.668  :       0.71  :       0.51
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 1.190
FLOATING-POINT INDEX: 0.648
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
C compiler          : CodeWarrior Pro 7.2.1 (version 4.2.6 build 832)
libc                : MSL_All_PPC_D.Lib
MEMORY INDEX        : 0.247
INTEGER INDEX       : 0.341
FLOATING-POINT INDEX: 0.359
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


I'm somewhat surprised. I'm not sure this is apples to apples. It is odd that it is ahead of the 601 by that much, and it is odd that it does better on integer than on FP relative to the Pentium.

This could be due to compiler advancements, as CodeWarrior Pro 7 came out in 2001, after Motorola bought Metrowerks, and Motorola had quite an advanced PowerPC compiler. Most results for the Pentium are with gcc 2.7.2.3, which came out in 1997.

I think I will have to re-try using another compiler that is also from 1997, e.g. CodeWarrior Pro 2.


CPU | Machine | L2 (or L3) cache | OS | Compiler | MEM | INT | FP
603e 200 MHz | 200MHz Freescale MPC8241 LinkStation (LS1) (Buffalo NAS) w/ MPC603e Motorola PowerPC core | 16KB (probably a typo; this is likely L1) | Linux 2.4.17_mvl21-sandpoint | gcc-3.3.5 | 0.713 | 1.050 | 0.877
603e 180 MHz | Motorola PPC 603ev 180 MHz | 256 KB | Linux-pmac-2.1.24 | egcs-1.0 | 0.721 | 1.016 | 1.396
603e 166 MHz | Amiga dual CPU: PPC 603e 166MHz and 68040-25MHz | 0 KB | PPC-Linux | gcc 2.7.2.1 ppclinux | 0.610 | 0.875 | 1.020
P54C 90 MHz | Intel Pentium 75 - 200, 90MHz | ? | Linux 2.4.28 | gcc-2.95.4 20011002 | 0.255 | 0.273 | 0.479
603 75 MHz | Macintosh Performa 5200 | 256 KB | Mac OS 8.5.1 | CW 7.1 | 0.247 | 0.341 | 0.359
601 66 MHz | Apple Power Macintosh 601 66MHz | ? | MkLinux | gcc 2.7.2.1 (?) | 0.122 | 0.240 | 0.410
68060 50 MHz | Amiga with Motorola 68060 50 MHz CPU | ? | Linux-2.0.36 | gcc 2.7.2.3 | 0.175 | 0.215 | 0.096
486DX2 66 MHz | Intel 486DX2 66 MHz | 256 KB | Linux-2.1.66 | gcc 2.7.2.3 | 0.105 | 0.142 | 0.124
 
Tested with CW Pro 1 which is from 1997, way poorer results than with CW Pro 7 (except for fpu). In both cases I enabled all optimizations, and set to optimize for the 603 specifically.

These results are quite surprising to me. Either CW is a much better compiler than gcc-2.95.4, or the 603 at 75 MHz is comparable to a Pentium at 90 MHz on this benchmark.

The 5200 always felt very slow, even compared to a friend's P75, both in general usage and especially when it came to gaming. One reason is of course that the P75 had a 320x240 mode (which, it turns out, the 5200 has as well; a true pity this was not widely announced). On top of that, the P75 has a 50 MHz system bus, so graphics-heavy operations probably had an advantage there, and as I mentioned, when you really want to optimize some tight inner loop, I think the x86 with its odd instructions would allow some ingenious solutions. Curious to hear your thoughts on this @Snial.

CPU | Machine | L2 (or L3) cache | OS | Compiler | MEM | INT | FP
603e 200 MHz | 200MHz Freescale MPC8241 LinkStation (LS1) (Buffalo NAS) w/ MPC603e Motorola PowerPC core | 16KB (probably a typo; this is likely L1) | Linux 2.4.17_mvl21-sandpoint | gcc-3.3.5 | 0.713 | 1.050 | 0.877
603e 180 MHz | Motorola PPC 603ev 180 MHz | 256 KB | Linux-pmac-2.1.24 | egcs-1.0 | 0.721 | 1.016 | 1.396
603e 166 MHz | Amiga dual CPU: PPC 603e 166MHz and 68040-25MHz | 0 KB | PPC-Linux | gcc 2.7.2.1 ppclinux | 0.610 | 0.875 | 1.020
P54C 90 MHz | Intel Pentium 75 - 200, 90MHz | ? | Linux 2.4.28 | gcc-2.95.4 20011002 | 0.255 | 0.273 | 0.479
603 75 MHz | Macintosh Performa 5200 | 256 KB | Mac OS 8.5.1 | CW Pro 7.1 | 0.247 | 0.341 | 0.359
603 75 MHz | Macintosh Performa 5200 | 256 KB | Mac OS 8.5.1 | CW Pro 1.7.4 | 0.204 | 0.310 | 0.368
601 66 MHz | Apple Power Macintosh 601 66MHz | ? | MkLinux | gcc 2.7.2.1 (?) | 0.122 | 0.240 | 0.410
68060 50 MHz | Amiga with Motorola 68060 50 MHz CPU | ? | Linux-2.0.36 | gcc 2.7.2.3 | 0.175 | 0.215 | 0.096
486DX2 66 MHz | Intel 486DX2 66 MHz | 256 KB | Linux-2.1.66 | gcc 2.7.2.3 | 0.105 | 0.142 | 0.124
 
Tested with CW Pro 1 which is from 1997, way poorer results than with CW Pro 7 (except for fpu). In both cases I enabled all optimizations, and set to optimize for the 603 specifically.
These results are quite surprising to me. Either CW is a much better compiler than gcc-2.95.4, or the 603 at 75 MHz is comparable to a Pentium at 90 MHz on this benchmark.
I haven't looked at it in a while, but I suspect the Byte benchmark is not realistic. Measuring a bunch of small loops isn't very representative of the data-dependent, branch-heavy code that interactive programs use. The strength of SPEC was that it was a bunch of real-life codes, so it was representative of real-life use. That's why it became important to workstation/server vendors and as a source of traces and studies for academic research in computer architecture. Every new instance of SPEC is still studied and characterized in great detail, even if it has lost importance in the eyes of end-users.

I suspect the RISC people of that era consistently under-estimated the importance of memory accesses (bandwidth, latency) due to the inability to simulate large code, and consistently over-estimated the benefits of a larger number of registers, as compilers of the era weren't really able to do a good job of optimizing complex code. GCC, for instance, had been developed to pick and schedule complex instructions for the various CISC architectures of the 80s. Compilers wouldn't become really good at producing RISC code from poor high-level code until the SSA form became common in the 00s (it had been widely studied in the 90s). Meanwhile, Intel had to optimize for memory accesses in their register-starved architecture, forcing them into being less efficient on unrealistic micro-benchmarks but more efficient on real-life code and things like SPEC.

Unfortunately, even the geriatric versions of SPEC are behind a paywall :-( (to SPEC's credit, you can apparently still buy them!)
 
And for fun (I was near the system and only needed a reason), NetBSD 9.0 and gcc 7.4 on a SPARCstation 5/110 (110 MHz MicroSPARC II):

Code:
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 1.237
FLOATING-POINT INDEX: 0.659
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : NetBSD 9.0
C compiler          : gcc version 7.4.0 (nb3 20190319)
libc                :
MEMORY INDEX        : 0.175
INTEGER INDEX       : 0.473
FLOATING-POINT INDEX: 0.366
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

The SS5 was introduced in March '94, so would have been a year older than the 5200, and about contemporary to the P54C (revised Pentium including the 90 MHz one). I'm somewhat surprised by the poor memory performance, but this system has 8x32 MiB which may affect the result (this would have been a very, very, very expensive configuration at the time, between 8 and 32 MiB total would have been much more common).
 
Here are specint92 results that I found:

Machine | SPECint92
Micronics M4P PCI, 486DX4 100MHz / 33 MHz | 51.4
PPC 603 @ 66 MHz (L1: 8KB/8KB, L2: 256KB) | 60.6
PPC 603 @ 66 MHz (L1: 8KB/8KB, L2: 1MB) | 63.7
Pentium 610/75 - 75 MHz (L1: 8KB/8KB, L2: 512KB) | 89.1
Pentium 735/90 - 90 MHz (L1: 8KB/8KB, L2: 512KB) | 104.5
P5 100 MHz (L1: 8KB/8KB, L2: 256KB) | 138.5

Estimating 603@75 MHz as: 60.6/66*75 = 68.9

Based on this, the P75 is about 30% faster than a 603@75MHz.
 
I just did now. I took the code, used CW7.1 and set the global optimizer to max, targeted specifically 603 and enabled all optimizer features.
Brilliant! They're really interesting results. Thanks for doing that!
I'm somewhat surprised. I'm not sure this is apples to apples. It is odd it is ahead of 601 by that much, and it is odd that it does better on integer than on FPU compared to the Pentium.
I think that's because the Pentium can do simple double-precision floating-point operations such as FMUL and FADD in 1 clock cycle. Also, the Pentium can combine an FXCH with another floating-point operation in one cycle, alleviating one of the main x87 bottlenecks. The PPC603 can only manage a throughput of 1 cycle per single-precision floating-point operation; double precision takes 2 cycles. By default, I think, C uses double-precision floating-point operations. If you can somehow force CW Pro 7 to treat doubles as single precision, then you should see a marked improvement in 603 (and 603e) FPU performance.
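If you want to try the single-precision experiment, something like this should do; it's plain C with float used explicitly, since I haven't checked whether CW Pro 7 has a switch that demotes doubles:

Code:
/* Same kernel with double and with float. Per the throughput figures
 * quoted above, the float version should sustain roughly one FP op per
 * cycle on a 603/603e vs one per two cycles for double, while a Pentium
 * should run both at about the same speed. Just a suggested experiment. */
double dot_double(const double *a, const double *b, long n)
{
    double s = 0.0;
    long i;
    for (i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

float dot_float(const float *a, const float *b, long n)
{
    float s = 0.0f;                 /* keep the accumulator single      */
    long i;
    for (i = 0; i < n; i++)
        s += a[i] * b[i];           /* no promotion to double in ANSI C */
    return s;
}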
This could be due to compiler advancements as CodeWarrior Pro 7 came out 2001, after Motorola bought metrowerks, and Motorola had a quite an advanced PowerPC compiler. Most results for the Pentium are gcc 2.7.2.3 which came out 1997.
I think by the same token it might be worth comparing with MrC (which I think was the more optimised Apple PowerPC compiler) and/or the latest gcc for PowerPC, on both the Pentium and PowerPC 603(e). Is that hard, or is there already a suitable equivalent to Retro68 for PowerPC development?


I think I will have to re-try using another compiler that is also from 1997, e.g. CodeWarrior Pro 2.
This begs the question: "How did PPC603 performance compare with a Pentium?", vs "How does PPC603 performance compare with a Pentium?"

And for fun (I was near the system and only needed a reason), NetBSD 9.0 and gcc 7.4 on a SPARCstation 5/110 (110 MHz MicroSPARC II):

Code:
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 1.237
FLOATING-POINT INDEX: 0.659
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : NetBSD 9.0
C compiler          : gcc version 7.4.0 (nb3 20190319)
libc                :
MEMORY INDEX        : 0.175
INTEGER INDEX       : 0.473
FLOATING-POINT INDEX: 0.366
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

The SS5 was introduced in March '94, so would have been a year older than the 5200, and about contemporary to the P54C (revised Pentium including the 90 MHz one). I'm somewhat surprised by the poor memory performance, but this system has 8x32 MiB which may affect the result (this would have been a very, very, very expensive configuration at the time, between 8 and 32 MiB total would have been much more common).
Also very interesting! Thanks for doing that. I had the sole use of a Sun Sparcstation IPX in the late 90s at Manchester Uni. I felt privileged to be able to have a Sun workstation even though I knew that its 40MHz MicroSparc(?) was about the same as the 66MHz 486DX2 I used in my previous workplace (the IPX RAM was much bigger though, 64MB instead of 8MB). Weird also how the FPU still seems slower than the Pentium!

Going back to:
I haven't looked at it in while, but I suspect the Byte benchmark is not realistic. Measuring a bunch of small loops isn't very representative or data-dependent branch-heavy code that interactive programs use. The strength of SPEC was it was bunch of real-life codes, so was representative of real-life use.
Mostly agree; tiny loops will just fit in even early-90s caches. But I think it's representative of real-life Unix workstation use. In my thinking, a lot more Unix (now Linux) applications are built upon a mix of application code, tools and libraries; whereas I suspect that personal computer applications in the 90s revolved around monolithic applications mostly sitting on top of monolithic OSs (though Windows 3.1 and above made extensive use of DLLs).

That's why it became important to workstation/server vendors and as a source of trace and studies for academic research in computer architecture. Every new instance of SPEC is still studied and characterized in great details, even if it has lost importance in the eyes of end-users.
I'd guess that SPEC has mostly mattered to academics even though it is more representative for general use than the 90s Byte benchmarks.
I suspect the RISC people of that ERA consistently under-estimated the importance of memory accesses (BW, latency) due to the inability to simulate large code.
This is a curious question, as I'm not sure of the answer. I know for sure that Steve Furber and Sophie Wilson, who designed the ARM processor, were very aware of memory accesses and bandwidth. At one point (around 1983) they approached Intel to see if they'd licence the 80286 core but with a different bus interface, because they saw how poor it was. The ARM CPU put memory accesses and bandwidth front and centre. I've read the early RISC papers (RISC-1 and MIPS) and I don't recall them saying much about memory bandwidth, but the numerous editions of Computer Architecture, certainly the second edition I own (1996) and probably the earlier edition, have a major chapter on it. Since the earliest RISC processors had no cache and they wanted to achieve 1 instruction per cycle, I would have thought that their understanding was at the forefront of Computer Science. But I don't have proof (yet, as I haven't re-read the papers).

It's certainly true that the early RISC developers understood the importance of memory hierarchies + bandwidth and that's why they advocated 3 operand, register operations on a large (≥32) register set. Caches were already part of many computer systems (a number of pdp-11 computers had caches, not sure about the VAX-11/780). Memory hierarchies and the principle of locality were a core part of the University of East Anglia Computer Science course in the mid/late 1980s (I was there for that).

And consistently over-estimated the benefits for a larger number of registers, as compilers of the era weren't really able to do a good job of optimizing complex code. GCC for instance had been developed to pick and schedule complex instructions for the various CISC architectures of the 80s.
I think this conflates a couple of aspects of early and mid-80s compilers. Firstly, I think that in order to understand the motivations of early RISC CPU design we can't consider GCC, because it was a late 80s compiler. Early 80s C compilers were things like PCC / Lattice etc and all of these were very stack oriented. They were terrible, shoving everything on a memory stack and having to perform rubbish LOAD, LOAD, operate, STORE type sequences for everything, unless a variable was designated as a register variable.

So, by the standards of the compilers RISC designers were competing against, I think they were able to make good use of registers. The early RISC papers gave examples of code sequences which used multiple registers and, I believe, covered concepts like common sub-expression elimination and loop unrolling, which traded multiple registers for higher performance. Also, pipeline latency forced multiple-register usage to avoid stalls. And even for these compilers, register usage wasn't as luxurious as we might think, because e.g. RISC-I, RISC-II and SPARC allocated 24 registers per stack frame: In x 8, Local/Temp x 8, Out x 8 + System x 8. They only had to really think about register usage for 8 regs.

Compilers wouldn't become really good at producing RISC code from poor high-level code until the SSA form became common in the 00s (it had been widely studied in the 90s).
SSA helped a lot, I agree. For both RISC and x86 (despite its limitations).
Meanwhile, Intel had to optimize for memory accesses in their register-starved architecture, forcing them into being less efficient on unrealistic micro-benchmarks but more efficient on real-life code and things like SPEC.
This is probably true for the 486 onwards, precisely because they had to compete with RISC, but I'm not sure it's true for the 8086 up to the 80386. In my mind, x86 would be better at micro-benchmarks because they don't need so many register resources and the heyday of RISC in the late 1980s to the late 1990s is a good argument for saying RISC was better at real-life code and SPEC benchmarks. I think the MPR reports from 1992 show how even then, the 486 wasn't a serious workstation CPU due to its poor integer and FPU performance, but that Intel were at least trying. This changed with the 486DX2 (which was about as good as the IPX I had at least for Integer performance) and then the Pentium. By the time the P6 and Pentium II arrived, it was easily killing off low to mid-end RISC workstations.

[Attached image: SPECint89/SPECfp89 comparison table from MPR]
(from: https://websrv.cecs.uci.edu/~papers/mpr/MPR/ARTICLES/060202.PDF . You can scan through lots of articles to see how the dynamic changes in favour of Intel. Here a top-spec 50MHz 486 + 256K of cache beats, on SPECint89, a 33MHz Sparc with 64kB of cache, a 33MHz RS/6000 with no cache and a 33MHz PA-RISC with 32K+64K of cache. It beats nothing on SPECfp89. The site doesn't provide earlier editions, so it's hard to see what things were like, say in 1990 when the 25MHz 486 appeared, or how they benchmarked systems prior to SPEC'89 for the earliest RISC CPUs).
Unfortunately, even the geriatric versions of SPEC are behind a paywall :-( (to SPEC credit, you can apparently still buy them!)
Agreed.

Feel free to debunk this, I've had a couple of glasses of wine and could be talking tosh :-D !
 
I just did now. I took the code, used CW7.1

Thanks for sharing your results. Would you consider uploading your build? There are not many entries with results from these computers for comparison. Does CW7.1 also build for 68k? This thread might be a good place for it:


It's possible to basically turn the number sorting test into a memory test by increasing the size of the array. I did this to test differences in cache configuration:


Small arrays fit in the L1, medium in L2, and large in main memory. But to test vs other hardware would be more difficult because results are not available for comparison.
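For what it's worth, this is roughly what I mean by a size sweep; my own quick version, not the actual NUMERIC SORT harness, with qsort() standing in for the benchmark's sort:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Small arrays stay in L1, medium ones in L2, large ones spill to
     * main memory, so the elems/sec figure should show clear knees.   */
    static const long sizes[] = { 1024, 8192, 65536, 524288, 2097152 };
    int s;

    for (s = 0; s < (int)(sizeof sizes / sizeof sizes[0]); s++) {
        long n = sizes[s], i;
        long *buf = malloc((size_t)n * sizeof *buf);
        clock_t t0, t1;
        double secs;

        if (!buf)
            return 1;
        for (i = 0; i < n; i++)
            buf[i] = rand();
        t0 = clock();
        qsort(buf, (size_t)n, sizeof *buf, cmp_long);
        t1 = clock();
        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs <= 0.0)
            secs = 1.0 / CLOCKS_PER_SEC;   /* avoid divide-by-zero */
        printf("%8ld elements: %.0f elems/sec\n", n, n / secs);
        free(buf);
    }
    return 0;
}

Plot elements/sec against size and the knees should roughly line up with the L1 and L2 sizes of the machine under test.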
 
I suspect the RISC people of that ERA consistently under-estimated the importance of memory accesses (BW, latency) due to the inability to simulate large code. And consistently over-estimated the benefits for a larger number of registers, as compilers of the era weren't really able to do a good job of optimizing complex code.
It's possible to download the original RISC research papers for free, so I had a look at the 1981 RISC-I paper by D. A. Patterson.


It's a good read. In summary:
  1. Even in this paper [page 448], they consider the memory hierarchy. "In most computers, the interface to memory is a main performance bottleneck".. "we assumed that we can access main memory in a single RISC CPU cycle".. "also under the assumption that two CPU cycles are required.. Performance degraded only 10%".
  2. They considered caches even for early RISC: "An on-chip cache would be beneficial for RISC".. "An instruction cache is a desirable commodity" (that's because of their two CPU cycle instruction fetch change).
  3. They missed some obvious memory access tricks such as interleaved memory; or running the memory interface at twice the CPU clock speed; or the DRAM trick used by the early ARM CPUs to prioritise sequential access in column address mode.
  4. They based their RISC compiler on PCC as discussed in the "Overall Performance" section on page 447. These compilers, as you surmise were bad at allocating variables to registers on CISC architectures, but at least the compilers could be compared.
  5. RISC-I uses overlapping register windows, which also were allocated to part of the address space (even though they really would have been on-chip registers). The PCC implementation, I think, simply treated registers as a memory stack. The paper notes that most CISC HLL function calls just used scalars as parameters, so changes were fairly trivial. Also, note that if the 'C' compiler did: LOAD a->tmp, LOAD b->tmp2, ALU tmp1,tmp2, STORE tmp1->c; then RISC-I could just do ALU c,a,b. Rubbish register allocation algorithms didn't matter so much thanks to the 3-op design and 32 available regs.
  6. RISC-I doesn't seem to have much pipelining apart from overlapped instruction fetching and execution. You can see this from their description of the main instruction cycle on page 448. "..we estimated.. RISC cycle is 400ns: 100ns to read.. registers; 200ns to perform a 32-bit addition; and 100ns to store.. register.." A 400ns memory cycle was about the same as the VAX-11/780 memory cycle and within the capabilities of early 80s DRAM.
  7. You were mostly correct that RISC-I used a lot of trivial programs for their detailed analysis, 7/11 were ≤101 instructions on a VAX and the largest 3/11 were 774 to 1578 instructions. However, they did some analysis on some more substantial programs: PCC itself, CIFPLOT (VLSI mask layout plotter); NROFF; A Pascal P-code compiler; the Macro expansion phase of the SCALD I design system (whatever that is).
  8. They didn't actually develop any real hardware. They looked at potential VLSI implementations for part of the CPU and the rest was simulated (but the IBM 801 was mostly simulated and that was highly influential).
I hope this sheds a bit of light on actual early RISC analysis and design decisions.

-cheers from Julz
 
On the 603 (in the 5200), L1 is 2 cycles, L2 is 8 cycles, and system RAM is 32 cycles *if* an entry is already in the TLB; if not, it is 168 cycles(!). Sure, that is also partially because Apple used a 32-bit bus to the system RAM; maybe it could have been 16 cycles, but that is still 8 times slower than what they designed for.

Another thing that hurts on the 603 is that the completion queue is shared among all units and only two slots can be used to retire, and for store / FPU and SRU, it must be on the last slot. Which means that if you do a fdiv (18 cycles), then that will cause a significant stall. As far as I know, on the Pentium instead, an fdiv (33 cycles) can execute without blocking the 2 integer units.
 
It seems I should clarify my "era" comments :-) I didn't mean the whole 80s/90s, but specifically the era for which @noglin put up a result table - when Apple moved from the '040 to the PPC601, Intel from the 80486 to the Pentium, and the RISC vendors were moving from their early implementations to newer designs. SuperSPARC was '92, Pentium '93, R4000 '91 (and first to 64 bits!), PA-7100 '92, 21064 '92 as well. We all know Apple started using PPC in '94.

Of course the RISC designers were very much aware of the memory issue - "underestimated", not "ignored" :-) The trick was that, like their 68k-based predecessors, early RISC workstations (not personal computer!) had large caches - most VME-based Sun 3 ('020-based) had 64 KiB of external caches, and so did the original 4/260 (the first SPARC system, also VME). The CPUs had been designed with that feature in mind, and they worked very well. When the MC68040 was introduced, a lot of the performance was due to the integrated caches, but those were much smaller than the big external chips of the RISC vendor so the performance still lagged. Motorola was following the RISC trend with the 88100/88200 combo (the 88110 had some onboard cache).

By the time the "new" CPU arrived, *everybody* knew how important memory subsystems were - there's no doubt about that. But in my opinion, the RISC vendor underestimated how important that was in real-life situation, and to an extant thought their performance lead was due to the "better" design of RISC vs. CISC, a view that could be supported by some micro-benchmarks. And it turned out that wasn't necessarily the case, IMHO. The new batch of RISC CPUs I listed above were great (with the SuperSPARC being the best of course, as I'm a SPARC man), but when the P5 (a.k.a. the original Pentium) and more so the P54C (the 3.3V, 75-100 MHz one), the lead was much smaller than it used to be. And at a much lower cost due to economy of scale. Intel had put enough internal cache (faster than external) to support their memory-focused ISA, and had a fast 64-bits bus to feed the CPU, same as most others, and that was enough to close most of the gap - in a similar way to the '040 being much faster than the '030, provided you didn't disable the internal caches. The error was not in the designs - it was in the belief that Intel wouldn't be able to catch-up because it was CISC, when it was able because it could also use fast internal caches, same as everyone else, and managed to do some superscalar, close enough to everyone else.

Mostly agree, tiny loops will just fit in even early 90s caches. but I think that it's representative of real-life Unix workstation use. In my thinking, a lot more Unix (now Linux) applications are built upon a mix of application code, tools and libraries; whereas I suspect that personal computer applications in the 90s revolved around monolithic applications mostly sitting on top of monolithic OS's (though Windows 3.1 and above made extensive use of DLLs).
How the code is built is to a large extent irrelevant, unless the ISA+ABI has really, really poor support for calls to dynamically loaded functions. The issue was that the type of performance-gobbling algorithm was moving fast from the bunch of loops that had made the bread-and-butter of CDC & Cray simulation codes, to much more complex access patterns due to interactive use with graphics (the 3M of early workstations: 1 MIPS, 1 MebiByte, 1 MegaPixel), and the spreading habit of recompiling a bunch of codes from that new-fangled "free software" idea.

The whole "but it's so much easier to pipeline in RISC!" works if you have enough ILP to extract, and ILP from indepedent iterations of loop is almost unbounded - add more registers, unroll more, and voilà, more ILP! RISC wins! But then you recompile gcc itself/then with gcc, and it's basically pointer chasing and latency-bound. And then what you want is the fewer possible number of dependent instructions between you loads. Indirect accesses don't play very nice with memory hierarchy, either...

I'd guess that SPEC has mostly mattered to academics even though it is more representative for general use than the 90s Byte benchmarks.
It was also important in the workstation/server markets in the 90s. And it is still estimated/measured/quoted for server-oriented CPUs today (with CPU2017 being the current one), though people only really care about "rate" and not "speed", as single-process benchmarks on 64+ core CPUs don't make much sense.

So, by the standards of compilers, RISC designers were competing against, I think they were able to make good use of registers. The early RISC papers gave examples of code sequences which used multiple registers and, I believe, covered concepts like common-sub-expression elimination and loop unrolling which traded multiple registers for higher performance.
I'll be the devil's advocate here - and say what a relative of mine working in marketing would say, "turns the flaw into a virtue". Does having more registers give you performance... or do you *need* more registers because otherwise you don't get any performance? The P5 and P6 (vs. all the RISC of the era) eventually proved IMHO that there's no absolute answer, and that various trade-offs are viable. In ASIC design, logic is cheap, memory is very expensive, and registers are memory.

The original ARM1 has a *lot* of area dedicated to the register file. It's a choice, and clearly they were doing something right as Arm is still around.

[that being said, I was and remain a RISC fan, I just think it's more guidelines than actual rules, and they should be ignored if it helps with performance and/or versatility of the software... so three-input instructions are fine, carry shouldn't be thrown away once computed, and double-output instructions shouldn't be ruled out definitively...]

I think the MPR reports from 1992 show how even then, the 486 wasn't a serious workstation CPU due to its poor integer and FPU performance, but that Intel were at least trying.
I'm not sure the data supports that conclusion for integer. In SPECint89, the 486 is better than the low-end SPARC, the 33 MHz PA-RISC, and the cacheless RS/6000, almost matches the cached RS/6000, and is beaten by the fast PA-RISC, which would have been an order of magnitude more expensive in a system. It's also beaten by the non-existent CPUs, which anyone can do (I'm of the "if I can't buy it and have it delivered this month, it doesn't exist and is irrelevant" school of thought; it has saved me a ton of grief waiting on empty promises).

As for FP, yes, the 486 is bad compared to the competition. But while a necessary feature for everyone today, back then if you didn't do lots of numerical simulation it didn't matter. Apple chose to ship a lot of systems with no FPU, and that was perfectly fine at the time. The vast majority of the desktop market didn't care, and I think it cost the RISC vendors dearly. Why pay through the nose for a feature you don't need?

Ultimately, people buy hardware to run software. So the criteria really are (a) can it run my software? and (b) can I afford it? And the ultimate choice is usually the cheapest system that is "good enough" for the software... UNIX workstations were more expensive by a much larger margin than they were faster in the 90s, while the 486, P5 and cheap PPC systems were "good enough" for a larger and larger fraction of the existing software. (I still love my Sun Ultra 1 Creator 200MHz which I bought brand new :-) ).
 
Hi @noglin !
On 603 (on 5200), L1 is 2 cycles
Really? From my reading of the manual:
[Attached image: excerpt from the PPC603 user manual]
The instruction cache bus is 64 bits wide, so it has to read one 64-bit word per cycle in order to feed two instructions to the IQ per cycle. Or do you mean the data cache?
L2 is 8 cycles
Credible. Is this for the critical word or for the entire 8 word burst?
and system ram is 32 cycles *if* an entry is already in TLB, if not it is 168 cycles(!).
TLB misses have to be handled in software. I'm sure you have a good source for this, but such a cost for a TLB miss would be basically insufferable. From this MPR article, the 603 has 64-entry x 4kB TLBs, giving 256kB of coverage. Going by this recent post of mine:


The miss rate is 0.9% for a 256kB cache, which might be similar for 256kB of TLBs. So, every 0.9% of instructions (and/or data), we'd have a 168 cycle hit. That's an average of a 1.5 cycle cost per instruction, making the 5200 terribly slow. However, the MPR report:

[Attached image: TLB miss-handler excerpt from the MPR report]
claims the TLB miss-handler can fit in two cache lines (16 instructions). The question then is whether the miss-handler itself is likely to be in L1, L2 or main memory. A miss rate of 0.9% in any of the caches will mean that most of the time the TLB miss-handler will be in L1, bringing the handler code itself down to about 8 to 16 cycles, or an additional 4 cycles (12-20). Reloading a TLB entry will probably require loading 64 bits from outside the TLB, but some of these accesses themselves will be in the cache for the same reasons (TLB misses will also exhibit locality, and a 0.9% miss rate means that most of the L1 cache and virtually all the L2 cache is likely to contain TLB data).

In the worst case though we'd need to load 16-words from main RAM for the handler = 32*16=512c, then probably 4 to 8 words from main RAM for the TLB data at 32 cycles per word = 128c to 256c => 640c to 768c. I think this is as bad as it'll get on System 7, because AFAIK, page tables can never be paged out to disk.

In either case, because the Pentium has hardware TLB miss table search it'll be much faster anyway.

Sure, that is also partially because Apple used a 32 bit bus to the system ram, maybe it could have been 16 cycles, but that is still 8 times slower than what they designed for.
Indeed, and is it the case that the 5200 can't use FPM mode? Is it really that bad? A 75MHz 603 needing 32 cycles for a main RAM read implies 32/75=426ns RAM access times from, I believe 70ns RAM?
Another thing that hurts on the 603 is that the completion queue is shared among all units and only two slots can be used to retire, and for store / FPU and SRU, it must be on the last slot. Which means that if you do a fdiv (18 cycles), then that will cause a significant stall. As far as I know, on the Pentium instead, an fdiv (33 cycles) can execute without blocking the 2 integer units.
Doesn't this part of the Pentium manual contradict what you're saying?
[Attached image: excerpt from the Pentium manual]
The Pentium's FPU is integrated with the Integer Unit on the U pipeline:
[Attached images: further excerpts from the Pentium manual]
Because the FPU is integrated into the U pipeline (apart from FXCH); no instruction can enter either pipeline until both pipelines complete their WB stages; and the FPU bus is shared with the U pipeline for most operations, I think this means that the FDIV will stall integer operations in both pipelines.

Oh, there is one thing I keep forgetting to add: I do agree that Pentium instructions can do more in a single instruction than the 603 can. In effect (and I think I kinda mentioned this earlier), the address decode stages add the equivalent of what would be two more load/store units in a RISC processor, even though they're part of the U and V pipelines: a P5 can therefore issue and execute the equivalent of up to 4 RISC instructions. OTOH, people tend to neglect the fact that RISC CPUs have 3-operand ALU instructions which may often require two 2-operand CISC instructions.

Now I'm off to dig a trench so I can duck for when I get shot down in the next reply ;) !

-cheers from Julz
 
