
How did the PowerPC 603 / 5200 at 75 MHz compare to PCs (486/Pentium)?

Hi @noglin ,

I'll try and reply to the rest later, but just skipping ahead:
I'm not sure I understand the "tag RAM", would you help and explain that a bit and why that would indicate L2 being 2-way? (To me it seemed it was N-way based on the empirical data where there was no clear pattern showing e.g. only every other block being cached).
I suspect it's not N-way, because that would need a lot of tag RAM.

It's really very similar to what the L1 cache has: 8kB of content + 128x2 tags. The L2 cache stores 256kB of recently used DRAM (or VRAM) accesses. So the L2 cache, like the L1 cache, is divided into a number of 32-byte 'lines', i.e. blocks. Tag RAM records which address range of DRAM is currently held by a given cache line; essentially it's a copy of the upper part of the address for that line. A direct-mapped cache would nominally need to check the top 32-5 = 27 address bits, but the index bits select the entry directly, so the tag memory doesn't need to include them. The remainder is stored in RAM per tag entry, so a 256kB cache would need 256kB/32B = 8192 entries x (27-13 = 14 bits, because 13 bits form the index and don't need to be checked), which is a fair amount of RAM in itself: about 14kB.

Every time Capella sees a bus request address, it indexes the tag RAM using bits 5 to 17 (13 bits) of the address; if the stored tag matches the upper 14 bits of the address, then the L2 cache line is correct and the 603 can read from the cache directly.

That's how it'd work for a direct-mapped (1-way) cache. For a 2-way L2 cache we need an extra tag bit, because each L2 cache entry could come from 2 places (8192 entries x 15 bits). A 2-way cache would have better performance than the same-sized 1-way cache. You'd also need 1 more bit for the LRU, bumping it up to 31 bits for both entries of a set. Capella would therefore have a 31-bit 'data' interface to the tag RAM, as it'd have to read both tags at the same time and then invert the LRU bit.
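
A quick back-of-the-envelope of those tag RAM sizes in C (just the arithmetic from this post, assuming 32-bit addresses and 32-byte lines; nothing here is measured from Capella):

Code:
#include <stdio.h>

int main(void) {
    const unsigned cache_bytes = 256 * 1024;   /* 256kB L2 */
    const unsigned addr_bits   = 32;
    const unsigned offset_bits = 5;            /* 32-byte lines */

    /* direct-mapped: 8192 lines, 13 index bits, 14 tag bits per line */
    unsigned lines  = cache_bytes / 32;                    /* 8192 */
    unsigned tag_dm = addr_bits - offset_bits - 13;        /* 14   */
    printf("direct-mapped: %u x %u = %u bits (~%u kB)\n",
           lines, tag_dm, lines * tag_dm, lines * tag_dm / 8 / 1024);

    /* 2-way: 4096 sets of 2, 12 index bits, 15 tag bits, plus 1 LRU bit per set */
    unsigned sets    = lines / 2;                          /* 4096 */
    unsigned tag_2w  = addr_bits - offset_bits - 12;       /* 15   */
    unsigned per_set = 2 * tag_2w + 1;                     /* the 31-bit tag word */
    printf("2-way: %u sets x %u = %u bits (~%u kB)\n",
           sets, per_set, sets * per_set, sets * per_set / 8 / 1024);
    return 0;
}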

Your algorithm that goes sequentially through RAM would still see the same 32-cycle spike every 256kB. That's also the case for an LRU, N-way L2 cache, because the least recently used entry will always be the one from 256kB ago.
As for the DRAM read frame buffer, yes, this is on my "TODO" to investigate. It's quite a trade-off, because not writing to VRAM directly means you have to do it later, so you add overhead work, but you *might* gain it back. I have some ideas on how I might be able to make that work quickly, but it would require several things to be put in place first.
Cool!

I had an idea about how to avoid FDIV in an implementation of Quake. Quake has a limited horizon, so Z probably has a range between 1mm and 100m. Storing a large 100,000-element array of 1/Z would therefore cover 100m of range at 1mm resolution; this reduces the problem to a memory fetch at the cost of 400kB (32-bit floats) or 800kB (64-bit floats). Anything further away is going to be (a) very small and (b) easy to interpolate without needing an FDIV.
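
A minimal sketch of that in C (the 1mm quantisation, metre units and MAX_MM constant are just the numbers from the paragraph above, not anything taken from Quake):

Code:
#define MAX_MM 100000                 /* 1mm .. 100m in 1mm steps */

static float recip_z[MAX_MM + 1];     /* ~400kB of 32-bit floats */

void init_recip_table(void) {
    for (int i = 1; i <= MAX_MM; i++) /* entry i holds 1/(i millimetres) */
        recip_z[i] = 1.0f / (i * 0.001f);
}

/* Replaces 1.0f/z with a table fetch for z in [1mm, 100m]. */
static inline float recip_lookup(float z_metres) {
    int mm = (int)(z_metres * 1000.0f);
    if (mm < 1) mm = 1;
    if (mm > MAX_MM)                  /* rare far case: small, interpolate or divide */
        return 1.0f / z_metres;
    return recip_z[mm];
}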

The second idea was to compute two 16-bit pixels and combine them in a single 32-bit word before storing the result. That way the two-cycle 32-bit store is equivalent to 2x one-cycle 16-bit stores (though packing the pixels might cost another cycle). Maybe, also, because we have 32 GPRs we can unroll loops a bit more than a Pentium could with its crummy 8 registers: 32 GPRs can hold 64 x 16-bit pixels' worth of data; 16 GPRs can hold 32 pixels.
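
And a sketch of the pixel-packing idea (plain C; it assumes the destination is 4-byte aligned and big-endian, as on the 603, so the left pixel goes in the high half of each word):

Code:
#include <stdint.h>

/* Write a span of 16-bit pixels two at a time: one 32-bit store per pair. */
void write_span(uint32_t *dst, const uint16_t *src, int count)
{
    int i;
    for (i = 0; i + 1 < count; i += 2)
        dst[i >> 1] = ((uint32_t)src[i] << 16) | src[i + 1];  /* left pixel in high half */
    if (i < count)                      /* odd trailing pixel */
        ((uint16_t *)dst)[i] = src[i];
}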
 
Hi @noglin ,

I'll try and reply to the rest later, but just skipping ahead:

I suspect it's not N-way, because that would need a lot of tag RAM.

It's really very similar to what the L1 cache has: 8kB of content + 128x2 tags. The L2 cache stores 256kB of recently used DRAM (or VRAM) accesses. So the L2 cache, like the L1 cache, is divided into a number of 32-byte 'lines', i.e. blocks. Tag RAM records which address range of DRAM is currently held by a given cache line; essentially it's a copy of the upper part of the address for that line. A direct-mapped cache would nominally need to check the top 32-5 = 27 address bits, but the index bits select the entry directly, so the tag memory doesn't need to include them. The remainder is stored in RAM per tag entry, so a 256kB cache would need 256kB/32B = 8192 entries x (27-13 = 14 bits, because 13 bits form the index and don't need to be checked), which is a fair amount of RAM in itself: about 14kB.

Every time Capella sees a bus request address, it indexes the tag RAM using bits 5 to 17 (13 bits) of the address; if the stored tag matches the upper 14 bits of the address, then the L2 cache line is correct and the 603 can read from the cache directly.

That's how it'd work for a direct-mapped (1-way) cache. For a 2-way L2 cache we need an extra tag bit, because each L2 cache entry could come from 2 places (8192 entries x 15 bits). A 2-way cache would have better performance than the same-sized 1-way cache. You'd also need 1 more bit for the LRU, bumping it up to 31 bits for both entries of a set. Capella would therefore have a 31-bit 'data' interface to the tag RAM, as it'd have to read both tags at the same time and then invert the LRU bit.
Nice. I had a pretty good idea of the L1 but I had no understanding of the L2. OK, so any addresses that match in the low 17 bits map to the same set in the L2, and Capella will kick out the LRU entry and store the other (upper) 15 bits in the tag RAM (and update the LRU). This is very useful information!

This makes it even more mysterious how the interaction actually happens between Capella, the L2 and the CPU. Again, for the case where the 603 has an L1 miss, it puts out a burst read request with the requested word address (the critical word). Capella will see the burst read request that the CPU put on the bus. The L2 cannot just send out the block - it doesn't even know whether what it holds is actually the block containing that word. Capella must check its tag RAM entries, and if it sees that the L2 has it, it must make the L2 put out that block (with the critical word first, possibly signalled via P_D4-0).

However, there is no other connection between Capella and the L2 other than the 603's bus (the 64-bit data bus and 32-bit address bus), so somehow this must happen using the 603's bus protocol?

I'm starting to think that the P_D4-0 is somehow used to facilitate this out of band of the 603 busses. It would need 3 bits to tell it which word is the critical one (which to send first), and one bit to tell L2 that it should respond, that is 4 bits.

Your algorithm that goes sequentially through RAM would still see the same 32-cycle spike every 256kB. That's also the case for an LRU, N-way L2 cache, because the least recently used entry will always be the one from 256kB ago.
Uff. Indeed, and I had even commented that in my L2 post.


Cool!

I had an idea about how to avoid FDIV in an implementation of Quake. Quake has a limited horizon, so Z probably has a range between 1mm and 100m. Storing a large 100,000-element array of 1/Z would therefore cover 100m of range at 1mm resolution; this reduces the problem to a memory fetch at the cost of 400kB (32-bit floats) or 800kB (64-bit floats). Anything further away is going to be (a) very small and (b) easy to interpolate without needing an FDIV.
Not a bad start ;), I can tell you that I do use a LUT indeed, but it is *much* smaller, and uses some mathematical tricks to make it possible. One day I will document it, but I'll have to hold it a secret until I've made an even better demo for the 5200 as I actually have a competitor in this space :D (which I am very happy that I do, way more fun when more people are trying to push the same hardware, actually, if you make a demo for the 5200 I will tell you the trick :D).

The second idea was to compute two 16-bit pixels and combine them in a single 32-bit word before storing the result. That way the two-cycle 32-bit store is equivalent to 2x one-cycle 16-bit stores (though packing the pixels might cost another cycle). Maybe, also, because we have 32 GPRs we can unroll loops a bit more than a Pentium could with its crummy 8 registers: 32 GPRs can hold 64 x 16-bit pixels' worth of data; 16 GPRs can hold 32 pixels.
That's right! I actually already pack 2 pixels into a word whenever I know there are more than 2 pixels left in the span :)
 
From the developer notes for 5200/6200:
The memory subsystem of the Power Macintosh 5200 and 6200 computers consists of
a 4 MB ROM and a 256 KB second-level (L2) cache, in addition to the internal cache
memory of the PowerPC 603 microprocessor. The ROM and the cache are contained on
the 160-pin ROM/Cache DIMM card that plugs into the main logic board. The Capella
custom IC provides burst mode control to the cache and ROM
If P_D4-0 is the out-of-band way of handling this, a fifth bit would be needed to differentiate whether it is the cache or the ROM that should answer, so that might explain the 5 bits of P_D4-0 (whether it is active, which word is critical, and whether L2 or ROM should respond)?
 
Nice. I had a pretty good idea of the L1 but I had no understanding of the L2. OK, so any addresses that match in the low 17 bits map to the same set in the L2, and Capella will kick out the LRU entry and store the other (upper) 15 bits in the tag RAM (and update the LRU). This is very useful information!
Yep! I think most of the cache stuff I learned from "Computer Architecture, a Quantitative Approach, 2nd edition".
This makes it even more mysterious how the interaction actually happens between Capella, the L2 and the CPU. Again, for the case where the 603 has an L1 miss, it puts out a burst read request with the requested word address (the critical word). Capella will see the burst read request that the CPU put on the bus. The L2 cannot just send out the block - it doesn't even know whether what it holds is actually the block containing that word. Capella must check its tag RAM entries, and if it sees that the L2 has it, it must make the L2 put out that block (with the critical word first, possibly signalled via P_D4-0).
However, there is no other connection between Capella and the L2 other than the 603's bus (the 64-bit data bus and 32-bit address bus), so somehow this must happen using the 603's bus protocol?
Yes, but I think there's something missing from the diagram, or some kind of error. When Capella gets an address for ROM or one that matches an L2 cache entry, it must select the right DIMM or cache bank. And because there are 32 bytes per cache line but only 8 bytes on the CPU data bus, it needs to select one of 4 banks, indexed by P_A4-3. The 4 banks are shown, but not the means by which they're selected, unless that's the P_D4-0 and they're already de-multiplexed into 4 individual bank selects. And of course, there's a bit missing which enables the critical word to be selected.
Not a bad start ;), I can tell you that I do use a LUT indeed, but it is *much* smaller, and uses some mathematical tricks to make it possible. One day I will document it, but I'll have to hold it a secret until I've made an even better demo for the 5200 as I actually have a competitor in this space :D (which I am very happy that I do, way more fun when more people are trying to push the same hardware, actually, if you make a demo for the 5200 I will tell you the trick :D).
I can't even imagine I'd ever be able to do a PM5200 demo! I'm just an embedded developer with an MPhil in computer architecture. I don't have a PM5200 or 6200 and I don't think I have the space for yet another Mac. But if you DM me, I seriously promise I won't blab to your competitor (whom I didn't know existed). I'm just quite pleased I'm coming up with similar optimisations to the ones you have! Oh, do you have an assembly listing of that inner Pentium loop with the FDIV?
That's right! I actually already pack 2 pixels into a word whenever I know there are more than 2 pixels left in the span :)
Yay! That 603 is good for something other than pretty die shots ;-) !
 
From the developer notes for 5200/6200:
I saw that too, a bit earlier, when searching through all the cache references (previously I'd just stopped once I'd seen what I thought explained what I wanted to know). So, we know Capella handles the burst mode.

If P_D4-0 is the out-of-band way of handling this, a fifth bit would be needed to differentiate whether it is the cache or the ROM that should answer, so that might explain the 5 bits of P_D4-0 (whether it is active, which word is critical, and whether L2 or ROM should respond)?
I'm the puzzled one now: why is anything going out from Capella on the CPU data bus?
 
Hi @noglin,

I found the FDIV code from Quake (I think). There are at least 2 versions: d_draw.asm and d_draw16.asm, as well as .s versions of those files (presumably to compile the Linux version on a Pentium). Not sure about the difference, unless one is for 16-bit graphics and the other for an 8-bit palette.


Interestingly, from my viewpoint, the use of FDIV doesn't completely overlap with the integer unit. It saves some time, but a lot of it is FPU instructions which would be very similar on a 603 (except the PM5200 version would be a table look-up). There's also frequent reloading of FPU and integer constants, which the 603 could keep in FP regs and GPRs. Also, it occurred to me that memory reads on a Pentium will actually slow dual-issue integer operation down to effectively single-issue.

This is why: the U and V pipelines can both compute an effective memory address simultaneously for 'simple' instructions. However, the data cache is only single-ported, so if both pipelines need to read a memory address, one will be stalled by at least 1 cycle. The code performs quite a lot of memory accesses.

Code:
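; (orientation comments added; this listing is the span-setup path from d_draw*.asm)
; The FPU code below computes s/z, t/z and 1/z at the start of the span, then
; starts a 65536/z FDIV so the long divide overlaps the integer work that follows.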
LSpanLoop:
 fild ds:dword ptr[4+ebx]
 fild ds:dword ptr[0+ebx]
 fld st(1)
 fmul ds:dword ptr[_d_sdivzstepv]
 fld st(1)
 fmul ds:dword ptr[_d_sdivzstepu]
 fld st(2)
 fmul ds:dword ptr[_d_tdivzstepu]
 fxch st(1)
 faddp st(2),st(0)
 fxch st(1)
 fld st(3)
 fmul ds:dword ptr[_d_tdivzstepv]
 fxch st(1)
 fadd ds:dword ptr[_d_sdivzorigin]
 fxch st(4)
 fmul ds:dword ptr[_d_zistepv]
 fxch st(1)
 faddp st(2),st(0)
 fxch st(2)
 fmul ds:dword ptr[_d_zistepu]
 fxch st(1)
 fadd ds:dword ptr[_d_tdivzorigin]
 fxch st(2)
 faddp st(1),st(0)
 fld ds:dword ptr[fp_64k]
 fxch st(1)
 fadd ds:dword ptr[_d_ziorigin]
 fdiv st(1),st(0)
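 ; (added) the FDIV above is 65536/zi at the span start; it keeps running in the
 ; FPU while the integer loads below execute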
 mov ecx,ds:dword ptr[_d_viewbuffer]
 mov eax,ds:dword ptr[4+ebx]
 mov ds:dword ptr[pspantemp],ebx
 mov edx,ds:dword ptr[_tadjust]
 mov esi,ds:dword ptr[_sadjust]
 mov edi,ds:dword ptr[_d_scantable+eax*4]
 add edi,ecx
 mov ecx,ds:dword ptr[0+ebx]
 add edi,ecx
 mov ecx,ds:dword ptr[8+ebx]
 cmp ecx,8
 ja LSetupNotLast1
 dec ecx
 jz LCleanup1
 mov ds:dword ptr[spancountminus1],ecx
 fxch st(1)
 fld st(0)
 fmul st(0),st(4)
 fxch st(1)
 fmul st(0),st(3)
 fxch st(1)
 fistp ds:dword ptr[s]
 fistp ds:dword ptr[t]
 fild ds:dword ptr[spancountminus1]
 fld ds:dword ptr[_d_tdivzstepu]
 fld ds:dword ptr[_d_zistepu]
 fmul st(0),st(2)
 fxch st(1)
 fmul st(0),st(2)
 fxch st(2)
 fmul ds:dword ptr[_d_sdivzstepu]
 fxch st(1)
 faddp st(3),st(0)
 fxch st(1)
 faddp st(3),st(0)
 faddp st(3),st(0)
 fld ds:dword ptr[fp_64k]
 fdiv st(0),st(1)
 jmp LFDIVInFlight1
LCleanup1:
 fxch st(1)
 fld st(0)
 fmul st(0),st(4)
 fxch st(1)
 fmul st(0),st(3)
 fxch st(1)
 fistp ds:dword ptr[s]
 fistp ds:dword ptr[t]
 jmp LFDIVInFlight1
 align 4
LSetupNotLast1:
 fxch st(1)
 fld st(0)
 fmul st(0),st(4)
 fxch st(1)
 fmul st(0),st(3)
 fxch st(1)
 fistp ds:dword ptr[s]
 fistp ds:dword ptr[t]
 fadd ds:dword ptr[zi8stepu]
 fxch st(2)
 fadd ds:dword ptr[sdivz8stepu]
 fxch st(2)
 fld ds:dword ptr[tdivz8stepu]
 faddp st(2),st(0)
 fld ds:dword ptr[fp_64k]
 fdiv st(0),st(1)
LFDIVInFlight1:
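 ; (added) integer texture-mapping setup continues here while the 65536/zi FDIV
 ; started just above is still in flight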
 add esi,ds:dword ptr[s]
 add edx,ds:dword ptr[t]
 mov ebx,ds:dword ptr[_bbextents]
 mov ebp,ds:dword ptr[_bbextentt]
 cmp esi,ebx
 ja LClampHighOrLow0
LClampReentry0:
 mov ds:dword ptr[s],esi
 mov ebx,ds:dword ptr[pbase]
 shl esi,16
 cmp edx,ebp
 mov ds:dword ptr[sfracf],esi
 ja LClampHighOrLow1
LClampReentry1:
 mov ds:dword ptr[t],edx
 mov esi,ds:dword ptr[s]
 shl edx,16
 mov eax,ds:dword ptr[t]
 sar esi,16
 mov ds:dword ptr[tfracf],edx
 sar eax,16
 mov edx,ds:dword ptr[_cachewidth]
 imul eax,edx
 add esi,ebx
 add esi,eax
 cmp ecx,8
 jna LLastSegment

This suggests there are ways for tuned 603 assembler to compete.
 
Hi @noglin,

I found the FDIV code from Quake (I think). There are at least 2 versions: d_draw.asm and d_draw16.asm, as well as .s versions of those files (presumably to compile the Linux version on a Pentium). Not sure about the difference, unless one is for 16-bit graphics and the other for an 8-bit palette.
Hey Snial!

Perspective texturing: screen x and y are proportional to 1/view_z. The texture is on the 3D triangle and has texture coordinates [u,v] at the vertices of the triangle. One interpolates u/view_z, v/view_z and 1/view_z (as these are linear in screen space); then, per pixel, one needs to recover u and v to look up the texture: view_z = 1/(1/view_z), u = (u/view_z) * view_z, v = (v/view_z) * view_z.

But that is too expensive on the older hardware, so you typically only calculate u and v exactly every 8 or 16 pixels, compute u_dx = (u2-u1) >> 3 (and the same for v_dx), and then step u and v as you plot each pixel of the span.
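
Roughly, in C (a simplified sketch of that scheme, not the actual Quake code; the 8-pixel segment length, float maths and the sample_texture helper are illustrative assumptions):

Code:
/* hypothetical texel fetch; stands in for the fixed-point texture walk */
extern unsigned short sample_texture(float u, float v);

#define SPAN 8   /* recompute u,v exactly every 8 (or 16) pixels */

void draw_segment(float uoz, float voz, float ooz,    /* u/z, v/z, 1/z at x0 */
                  float duoz, float dvoz, float dooz, /* their per-pixel steps */
                  unsigned short *dst, int count)
{
    float z  = 1.0f / ooz;             /* one divide for the span start */
    float u0 = uoz * z, v0 = voz * z;

    while (count > 0) {
        int n = count < SPAN ? count : SPAN;

        /* exact u,v at the far end of this segment (one more divide) */
        float z1 = 1.0f / (ooz + n * dooz);
        float u1 = (uoz + n * duoz) * z1;
        float v1 = (voz + n * dvoz) * z1;

        /* linear steps inside the segment: the "(u2-u1) >> 3" idea */
        float du = (u1 - u0) / n, dv = (v1 - v0) / n;
        for (int i = 0; i < n; i++) {
            *dst++ = sample_texture(u0, v0);
            u0 += du;
            v0 += dv;
        }

        uoz += n * duoz; voz += n * dvoz; ooz += n * dooz;
        u0 = u1; v0 = v1;              /* next segment starts where this one ended */
        count -= n;
    }
}

The real d_draw16 additionally kicks off the divide for the *next* segment before drawing the current one, which is how the FDIV latency gets hidden behind the integer work.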

The d_draw16 calculates u and v at the start of the span, and u and v at the end of the span (16 pixels further on), as initial setup code, *and* sets up an FDIV so that while drawing this span, the next span's view_z is already being computed; then it repeats.

with up to 16 of these (and a jump table to jump in according to how many remaining pixels there are in the span):

Code:
sbb ecx,ecx
mov ds:byte ptr[9+edi],al
add ebx,ebp
mov al,ds:byte ptr[esi]
adc esi,ds:dword ptr[advancetable+4+ecx*4]
add edx,ds:dword ptr[tstep]

The other one, d_draw.asm, recalculates every 8 pixels, so with an FDIV in flight it does at most 8 of the above.

Note: u, v, u_dx and v_dx are fixed-point values; whenever the fractional part of u accumulates to an integer we just move by 1 in the texture, but when the fractional part of v accumulates to 1 we must move by the texture width.
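
In simplified C, that per-pixel stepping looks something like this (16.16 fixed point assumed; the real asm avoids the per-pixel multiply with the adc/advance-table trick shown in the fragment above):

Code:
#include <stdint.h>

/* Step u and v in 16.16 fixed point: a whole-number carry in u moves one
 * texel right, a carry in v moves down by one whole texture row. */
void texture_span(uint8_t *dst, int count,
                  uint32_t u, uint32_t v,          /* 16.16 texture coords */
                  int32_t du, int32_t dv,          /* 16.16 per-pixel steps */
                  const uint8_t *texture, int texwidth)
{
    while (count--) {
        *dst++ = texture[(v >> 16) * texwidth + (u >> 16)];
        u += (uint32_t)du;
        v += (uint32_t)dv;
    }
}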

Chris Hecker's "perspective texture mapping" explains exactly what this is doing, and it is quite beautiful:


I think you are right that no one really pushed the PowerPC to its max for triangle rasterizers. I know for sure I have unexplored paths and I think I can make it even faster. Perhaps this weekend I will make some attempts.
 
@noglin,

Thanks for the reply. I had read a bit about the "u/view_z, v/view_z and 1/view_z" type maths from the developer's blog series, but kinda skimmed over it, so your succinct post helps me understand a bit more, along with the "u_dx = (u2-u1) >> 3" approximate interpolation and the meaning of d_draw16 vs d_draw.

A slight update on my side:
This is why: the U and V pipelines can both compute an effective memory address simultaneously for 'simple' instructions. However, the data cache is only single-ported, so if both pipelines need to read a memory address, one will be stalled by at least 1 cycle. The code performs quite a lot of memory accesses.
It turns out this is not quite correct. From the Pentium manual (Page 3-14): "The storage array in the data cache is single ported but interleaved on 4 byte boundaries to be able to provide data for two simultaneous accesses to the same cache line."

So it turns out that if two accesses are made at the same 8-byte alignment within a cache line, it will take two cycles (the tags are triple-ported: one snoop plus 2 simultaneous accesses). It's not clear whether two fetches from different cache lines can be processed concurrently, but it seems likely given that the tags can be accessed simultaneously (why look up both lines if only one could be read? If only one could be read, but a cache line is interleaved, all you'd need to do is check that both references hit the same line).

What is true though is: "Simple instructions are entirely hardwired; they do not require any microcode control and, in general, execute in one clock. The exceptions are the ALU mem,reg and ALU reg,mem instructions which are two and three clock operations respectively." and "Although in general two paired instructions may proceed in parallel independently, there is an exception for paired "read-modify-write" instructions. Read-modify-write instructions are ALU operations with an operand in memory. When two of these instructions are paired there is a sequencing delay of two clocks in addition to the three clocks required to execute the individual instructions."

Oh, right, so basically RMW instructions have to be indivisible: they take 3 cycles, but they'll block some aspect of the data cache?
 
I've just been having a little look at the Quake minimal Intel spec.


Pentium 75MHz, 8MB RAM (16MB RAM under Win 95), VGA graphics (it doesn't say PCI, but probably all P75s were VESA local bus or PCI by then), MS-DOS 5.0/Win 95 or Win 98; 80MB free disk space; 2x CD-ROM.

What it doesn't say is whether the P75 needs an L2 cache, but maybe that was standard then too. This thread:

No L2 cache costs about 10-15% performance on a Pentium system. To be honest, I don't notice much of a difference.

Says that there's a 10% to 15% loss of performance - i.e. a 75MHz without L2 is roughly equivalent to a 65-68MHz Pentium with L2 cache. There's also an FPS quote for a Pentium MMX 233:

L2 on = 53.3fps, L2 off=48.9fps.

So, 75MHz => 15.7fps without L2. There's a frame-rate list here:


P5-75, Generic FX m/b w/256K PB cache & 32meg 60ns EDO DRAM = 18.8

-cheers from Julz
 
So, 75MHz => 15.7fps without L2. There's a frame-rate list here:

P5-75, Generic FX m/b w/256K PB cache & 32meg 60ns EDO DRAM = 18.8
That's a good source. I do think both entries are with L2 ("w/256K"). The fps difference is likely due only to the VRAM bandwidth (S3 Trio64V+, 64-bit with EDO DRAM) vs. the likely 32-bit generic Chips & Tech DRAM.

System | Video | OS | FPS
P5-75, Generic FX m/b w/256K PB cache & 32meg 60ns EDO DRAM | ExpertColor DSV3365E (S3 Trio64V+ chipset) w/2meg EDO DRAM & SDD 5.3 | Dos 7.0 | 18.8 / 8.3 / 6.0
P-75, NEC notebook w/256K cache & 40meg 70ns DRAM | Chips & Tech w/1meg DRAM | | 15.9

I stumbled upon this today: a Linpack benchmark study from 2014 that has entries for PowerPC and Pentium. This is double-precision floating point, but with the larger problem sizes it at least has data dependencies and cache pressure rather than "fake loops". Still, the results are all over the place, showing how critical the compiler is.

Overall, though, I think it is fair to say that the 603 at 75MHz is slower than a P75 (looking only at the CPU). Still, it's impressive for a CPU with half the transistor count to be so close.

Computer | OS/compiler | n=100 (Mflop/s)
Apple Power Macintosh 6100/60 | Absoft v4.0 F77 -O | 9.6
Intel Pentium 75 MHz | g77 -march=pentium -O3 | 8.92
Apple Power Macintosh 7100/66 | Absoft v4.0 F77 -O | 8.6
Apple Performa 6230CD/603/75 | Absoft f77 Power PC v4.1 | 5.9
Gateway P5-60 (60 MHz Pentium) | 77/32/mf/d1/warn/5/fp5/ot | 5.3
Apple Power Macintosh 6100/60 | Absoft F77 SDK | 3.4


I had not realized before this discussion that the 601 really is way more powerful than the 603. Kind of makes me tempted... :D
 


That's a good source. <snip> vram bandwidth
Yes, graphics-heavy applications will be limited by VRAM bandwidth and cache can be less effective if there's less locality.
<snip> Linpack <snip> 2014 <snip> PowerPC and Pentium. This is for double precision <snip> all over the place <snip> 603 75MHZ is slower than a P75
Interesting table. Surprised the PPC 603 can even get close with its single-precision optimised FPU.
I had not realized before this discussion that the 601 really is way more powerful than the 603. Kind of makes me tempted... :D
Similar transistor count to a P60, and it can dispatch more instructions per clock even though it has fewer functional units. To me, though, the PPC603 (functionally) looks very much like an M88110 with a different decoder (and some other mods). My guess is that's why Motorola insisted on having the M88K bus on PowerPC: their quickest route was to hack an M88K until it became a PPC.

[die shots of the M88110 and the PPC603]

They don't literally look the same though. They'd look closer if the M88110 were rotated 90º and the MMUs were placed below the caches. Oddly enough, the PPC603 die shot has two sections called "Branch Processing Unit" - maybe my hacked M88110 => PPC 603 theory is tosh :-) !
 
Interesting table. Surprised the PPC 603 can even get close with its single-precision optimised FPU.
I thought the 603 is mostly double-precision? All registers are double precision, instruction timings are the same in double precision, and the only advantage of single precision is that you can use the "multiply & add" and some "fast math" instructions (which the compiler is unlikely to be given flags to use for a scientific benchmark).

Similar transistor count to a P60, and it can dispatch more instructions per clock even though it has fewer functional units. To me, though, the PPC603 (functionally) looks very much like an M88110 with a different decoder (and some other mods). My guess is that's why Motorola insisted on having the M88K bus on PowerPC: their quickest route was to hack an M88K until it became a PPC.

[die shots of the M88110 and the PPC603]

They don't literally look the same though. They'd look closer if the M88110 were rotated 90º and the MMUs were placed below the caches. Oddly enough, the PPC603 die shot has two sections called "Branch Processing Unit" - maybe my hacked M88110 => PPC 603 theory is tosh :) !
Oh, nice die shot! I had wished there was a "move float register into GP register" instruction. Instead you must do it in 3 steps (convert the float reg to an integer, store that float register, then load that address into an integer register). If there had been a 2-cycle "round float to int and place it in this integer reg", many loops could have been pipelined very nicely using both units.

Regarding our previous discussion of "P_D4-0" and if/how the L2 can load/send the critical word first: I think the D4-0 are these D signals on the bus interface - could that make sense?
 

I thought the 603 is mostly double-precision? All registers are double precision, instruction timings are the same in double precision, and the only advantage of single precision is that you can use the "multiply & add" and some "fast math" instructions (which the compiler is unlikely to be given flags to use for a scientific benchmark).
In this post:

I remarked:

The PPC603 can only manage a throughput of 1 cycle per single-precision floating-point operation; double precision takes 2 cycles.

I'm sure I got that from reading the 603 manual on bitsavers. I'll check again, I could easily have misread it.

Oh, nice die shot! I had wished there was a "move float register into GP register" instruction. Instead you must do it in 3 steps (convert the float reg to an integer, store that float register, then load that address into an integer register). If there had been a 2-cycle "round float to int and place it in this integer reg", many loops could have been pipelined very nicely using both units.
I have a colour die shot of the PPC603e which I've used as the back cover for my PowerBook 1400c - it looks amazing! GPR=Int(FPR) taking 3 instructions is pretty shocking!

Regarding our previous discussion of "P_D4-0" and if/how the L2 can load/send the critical word first: I think the D4-0 are these D signals on the bus interface - could that make sense?
My interpretation was that an L1 cache miss would generate an L2 cache fetch that accessed the critical word first followed by the rest of the cache line, but if the L2 cache also missed, then the Performa 5200's L2 cache controller would potentially have to store the previous, dirty L2 cache line; then fetch an entire L2 cache line from RAM in order, and during that time the PPC603 wouldn't see or make use of any of those transactions (CPU bus access would be suspended). Then the CPU would fetch the critical word first from the L2, followed by the rest of the cache line as before.

I tend to think that, despite my lack of proof, because (a) I find it hard to imagine Apple's L2 cache controller would replicate the behaviour of the 603 itself; (b) an L2 fetch could easily turn into an L2 cache line store followed by an L2 cache line fetch [unless it's write-through]; and (c) L2 cache timings aren't affected by the fetch order, but DRAM timings are.

But, again, I could be very wrong.
 
In this post:

I'm sure I got that from reading the 603 manual on bitsavers. I'll check again, I could easily have misread it.
Just checked the manual, and uh hehe.. I recalled incorrectly... only instructions that essentially skip the first stage of the FPU are equally fast in single precision; otherwise double precision does take one extra cycle. (And it's far worse for already expensive instructions like fdiv vs fdivs.)

Pg36:
Implementation Note—Single-precision multiply-type instructions operate faster than
their double-precision equivalents. See Chapter 6, “Instruction Timing,” for more
information.

Pg41:
The FPU contains a single-precision multiply-add array and the floating-point status and
control register (FPSCR). The multiply-add array allows the 603e to efficiently implement
multiply and multiply-add operations. The FPU is pipelined so that single-precision
instructions and double-precision instructions can be issued back-to-back.

And the actual timings do show it: for all but add/sub and similarly basic instructions, double precision takes one extra cycle in the first stage of the FPU.

I have a colour die shot of the PPC603e which I've used as the back cover for my PowerBook 1400c - it looks amazing! GPR=Int(FPR) taking 3 instructions is pretty shocking!
Oh that is cool, is it in digital form as well? OK found a nice one! Is it this one?

View attachment 87073

My interpretation was that an L1 cache miss would generate an L2 cache fetch that accessed the critical word first followed by the rest of the cache line, but if the L2 cache also missed, then the Performa 5200's L2 cache controller would potentially have to store the previous, dirty L2 cache line; then fetch an entire L2 cache line from RAM in order, and during that time the PPC603 wouldn't see or make use of any of those transactions (CPU bus access would be suspended). Then the CPU would fetch the critical word first from the L2, followed by the rest of the cache line as before.

I tend to think that, despite my lack of proof, because (a) I find it hard to imagine Apple's L2 cache controller would replicate the behaviour of the 603 itself; (b) an L2 fetch could easily turn into an L2 cache line store followed by an L2 cache line fetch [unless it's write-through]; and (c) L2 cache timings aren't affected by the fetch order, but DRAM timings are.

But, again, I could be very wrong.
It would be interesting to really read and understand the exact bus protocol. I think that's the only way to understand how the L2 and the mysterious P_D4-0 actually work. As I recall, when I measured the L2, the first word that was in the L2 but not in the L1 took 8 cycles, which means 2 bus cycles. So I think that means the critical word does come first. But how - I do not know.
 

I now believe Quake FPS is the best way to answer this question. Benchmarks - we've seen it - are not reliable (not representative workloads, and they depend as much on the compiler as on the CPU...).
I've now run Quake on my 5200 (note: one must first start a game, then open the console and type timedemo demo2; if one just types timedemo demo2 without starting a game first, the fps won't be reported at the end, as it just continues playing the next demo).

I was initially surprised, but it makes sense: the Amiga with a 68060 at 50MHz performs better than the 5200 system.

Year | CPU | MHz | Bus MHz | L1 (KB) | ALU pairing | Transistors | Quake FPS
1992 | 486 DX2 | 66 | 33 | 8 unified | scalar | 1.2M | 4.97
1994 | PPC 603 | 75 | 37.5 | 8/8 | dual (1 ALU) | 1.6M | 7.9
1994 | 68060 | 50 | 50 | 8/8 | dual (~2 ALU) | 2.5M | 9.6
1994 | Pentium (P54C) | 75 | 50 | 8/8 | dual (~2 ALU) | 3.2M | 18.8
1995 | Pentium (P54CS) | 133 | 66 | 8/8 | dual (~2 ALU) | 3.3M | 25.2

ALU: the pairing only happens under certain conditions, e.g. the 68060/Pentium have constraints on which integer instructions can pair, and the 603 has various situations that stall the other units - it is complicated.

Quake FPS for PC (320x200, no sound, timedemo demo2):
486DX2 66MHz Asus, 256K cache, 32MB 70ns, Cirrus 5334 w/2MB DRAM = 4.97
P5-75, Generic FX m/b w/256K PB cache & 32meg 60ns EDO DRAM = 18.8
P5-133, NEC PowerPlayer LE FX m/b w/256K PB cache & 16meg EDO DRAM, Matrox Mystique w/2meg SGRAM = 25.2

Amiga 68060/50 (likely 320x200, timedemo demo2, as mentioned in other notes) = 9.6

Performa 5200 / 603 75 MHz, v1.09 (set to 320x200, not scaled, no sound) = 7.9
(my machine)

If anyone sees any mistakes in the data, let me know :)
 
I now believe Quake FPS is the best way to answer this question. <snip>
ALU: the pairing only happens under certain conditions, e.g. the 68060/Pentium have constraints on which integer instructions can pair, and the 603 has various situations that stall the other units - it is complicated.

If anyone sees any mistakes in the data, let me know :)
But to summarise, a dual-pipeline ALU makes up most of the difference and I suspect FPU parallelism makes up the rest. Motorola also had the handicap of having to spread engineers across multiple general-purpose CPU projects: initially the 88K (the 88110 had a dual-pipeline 2x ALU and lots of handy MMX-type instructions on a separate pipeline), then the 68060, and then PowerPC. I'm now starting to think that the PPC603 was an intentionally crippled M88110-derived CPU. I still like it though.

Actually it'd be good to see someone run it on a 60MHz PPC 601.
 
I now believe Quake FPS is the best way to answer this question.

Respectfully, I disagree. I think that would be one of the worst ways. Even among contemporary x86 chips, Quake is notorious for skewing results due to the hand-tuned assembly made specifically for that particular FPU. There are detailed explanations of this available. If Quake is the only software you want to run, then sure, the results tell you what you want to know. But for the general case, I think compiled code is a much more neutral way to compare different CPUs running the same software. How well does Quake run with compiled code on each different CPU?

I have some software hand coded in assembly for the G4 vector unit, and it shows that chip is massively ahead of anything else you could compare it to at the time. But does that single result mean anything in general?

I think quake 2 and quake 3 were made with diverse hardware in mind, and these show more legitimate results for comparing different systems.

If you want to compare hand-tuned code in narrow use cases, some software like photoshop was made for different hardware, and presumably at least some effort was spent tuning it for each platform. You may be able to run something like a test suite of filters on each type of CPU and compare. Perhaps this has already been done. Also, the dnetc clients were tuned for each particular task/CPU combination. Their statistics server used to show data for each type of client vs clock speed, but I have not checked recently.

 
I was curious so I looked and the dnet statistics server is still up:

cgi.distributed.net/speed/

You can pull up graphs like this:

[graph: dnetCPUs.png - client throughput vs. clock speed for various CPUs]

The throughput is fairly linear vs clock speed, and the slope of the lines shows the efficiency of the chip. For this thread, here are some specific results I looked up for some tuned assembly code:

Code:
rc564               MHz      rate
PowerPC 603e        100    310,190
          scaled to  75    232,000
Intel Pentium (P5)   75     96,326
Intel 486dx4         75     71,124

ogr
PowerPC 603e        200  1,524,307
          scaled to  75    571,000
Intel Pentium (P5)   75    277,653
Intel 486dx4        100    252,230

rc572
PowerPC 603e        240    793,275
          scaled to  75    248,000
Intel Pentium (P5)   75     76,413
Intel 486dx4        100     65,739
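
The "scaled to 75" rows look like simple linear clock scaling; a tiny sanity check of that arithmetic (numbers copied from the table above, units as reported by dnetc):

Code:
#include <stdio.h>

int main(void) {
    /* assume the rate scales linearly with clock: rate * 75 / measured_MHz */
    printf("603e rc564 @75MHz ~ %.0f\n", 310190.0  * 75.0 / 100.0);  /* ~232,600 */
    printf("603e ogr   @75MHz ~ %.0f\n", 1524307.0 * 75.0 / 200.0);  /* ~571,600 */
    printf("603e rc572 @75MHz ~ %.0f\n", 793275.0  * 75.0 / 240.0);  /* ~247,900 */
    return 0;
}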
 
But to summarise, a dual-pipeline ALU makes up most of the difference and I suspect FPU parallelism makes up the rest. Motorola also had the handicap of having to spread engineers across multiple general-purpose CPU projects: initially the 88K (the 88110 had a dual-pipeline 2x ALU and lots of handy MMX-type instructions on a separate pipeline), then the 68060, and then PowerPC. I'm now starting to think that the PPC603 was an intentionally crippled M88110-derived CPU. I still like it though.

Actually it'd be good to see someone run it on a 60MHz PPC 601.
Hey @Snial! I think the bus speed also matters quite a bit here given the many loads and stores; the 601, which is handicapped by not having dual issue, gets 7.6 fps (although this number isn't directly comparable as the timedemo was on a different map, it's in the same ballpark): https://www.vogons.org/viewtopic.php?t=46851

It would be interesting to get a number for the 603e; when I get a chance I will try it with my 5300 board, as the 603e has dual ALU issue.
 
Respectfully, I disagree. I think that would be one of the worst ways.
Alright let's talk! :D

Even among contemporary x86 chips, Quake is notorious for skewing results due to the hand-tuned assembly made specifically for that particular FPU. There are detailed explanations of this available. If Quake is the only software you want to run, then sure, the results tell you what you want to know. But for the general case, I think compiled code is a much more neutral way to compare different CPUs running the same software. How well does Quake run with compiled code on each different CPU?

I have some software hand coded in assembly for the G4 vector unit, and it shows that chip is massively ahead of anything else you could compare it to at the time. But does that single result mean anything in general?

I think quake 2 and quake 3 were made with diverse hardware in mind, and these show more legitimate results for comparing different systems.

If you want to compare hand-tuned code in narrow use cases, some software like photoshop was made for different hardware, and presumably at least some effort was spent tuning it for each platform. You may be able to run something like a test suite of filters on each type of CPU and compare. Perhaps this has already been done. Also, the dnetc clients were tuned for each particular task/CPU combination. Their statistics server used to show data for each type of client vs clock speed, but I have not checked recently.


Quake 2 and especially 3 are irrelevant for the 603. They were not made for that type of CPU; they came much later and were meant for more powerful CPUs. I am not suggesting using Quake 1 to measure "any CPU" - I'm thinking specifically of the 603 and its "competitors".

You propose Photoshop; I think Quake 1 can be argued for with the exact same argument: at least some care has been put into making it run well on the platform it is ported to. Each frame has to be delivered as fast as possible or the product is worse. Every frame includes a bunch of different types of work (integer work, FPU work, cache, RAM and also VRAM) - it exercises it all. Furthermore, standard software rarely has any way to run a "timedemo demo2" that performs the same work for a minute and then reports a figure that can be directly compared.

I didn't have the pleasure of growing up with a 68060, but I had the 5200 (with a 603) and my friends had 486s and Pentiums, and this ranking matches how the machines felt in daily usage.

In contrast, the benchmarks that have been looked at in this thread are either theoretical estimates, which are kind of useless (Motorola quoted "estimated SPECint92" numbers for the 603, which unfortunately I've then seen reported as actual measured values elsewhere), or, for the benchmark that had source code, it turns out the compiler impacts the result more than the CPU. And most of it is "toy problems". So I think all of them are worse than Quake 1.

So I personally think Quake 1 is about as good as it gets for these machines / that time period. The part I have doubts about is whether the Mac port left a lot on the table for the 603. Perhaps the Amiga 68060 port has exercised every trick in the book, while the Mac port has an easy 20% win left for the 603. I can say that at least Duke Nukem, Marathon 2 and Wolfenstein on the 5200 don't suggest that the Mac port of Quake 1 is particularly bad (if anything, I was positively surprised it got 7.9 fps).


All this said, I'm curious to see what alternatives you would propose; I saw you shared one, so I will look into that one next.
 