Testing a 6200 and comparison with 6100

( @Phipli ) it adds up, but there are a few reasons why I don't think that's the cause.

Firstly, there are no Read-Modify-Write memory instructions on a PowerPC, just load and store. So, this sequence of bus cycles in non-cached DRAM would have to be implemented in software, not in the Capella bridge chip (load a 32-bit word to a register; insert the byte into the right position; write the long word back). But that doesn't make sense either: it would mean that the ROM had to consider the underlying PPC/'040 bridge.
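For illustration, the software sequence would look something like this (a made-up sketch of the idea, not actual ROM code):

Code:
#include <stdint.h>

/* Illustrative only: what a byte store becomes when emulated in software
 * on a pure load/store machine. byte_pos is 0..3 within the word,
 * big-endian numbering. */
void store_byte_via_rmw(volatile uint32_t *word, unsigned byte_pos,
                        uint8_t value)
{
    unsigned shift = (3u - byte_pos) * 8u;       /* big-endian byte lane  */
    uint32_t w = *word;                          /* load the 32-bit word  */
    w = (w & ~(0xFFu << shift))                  /* clear the target byte */
      | ((uint32_t)value << shift);              /* insert the new byte   */
    *word = w;                                   /* write the word back   */
}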

Secondly, I don't think it makes sense to be implemented in Capella either. For cacheable addresses, whole words get loaded into the L2, then the L1 cache - a block of 32 bytes (8 words) is loaded at a time, and if a byte or half-word is modified there (which is now on the PPC side of course), the whole cache line (i.e. 8 words) is written out when the line needs to be flushed. So cacheable memory never needs an RMW sequence, nor a byte modification sequence.

Thirdly, from the developer note, it seems fairly clear that the interface between the 68040 bus and the PPC603 is a set of 543 latches:


So, I don't think it makes sense for Capella to emulate a byte store to uncached RAM by loading an entire word into the D-latches from DRAM, modifying the intended byte, then writing the 32-bit word back out. For byte loads it wouldn't matter; a PPC603 would still just load a 64-bit double word and select the right byte.

It also makes little sense given that the only way Capella can know there's a non-cacheable byte write is if the PPC603 tells it; there are signals from the CPU for that purpose. It's far easier for the DRAM interface to include byte-lane selects and simply pass those signals on. Even though the CPU will probably send out 64 bits plus the byte-select signals, Capella must know how to select the high or low 32-bit word, because DRAM is 32 bits wide. It must do that for normal writes anyway.


From page 8-15 it works like this:

A29-A31 (which are the low-order bits in the insane IBM convention) select the byte lane while the three signals TSIZ0:2 == 001, meaning a 1-byte transfer.

It's very easy for Capella to use this information to select the right byte lane in DRAM. A similar approach works for 16-bit and 32-bit reads/writes.
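As a sketch of what that decode amounts to (my own illustration; the one-hot enable is just a stand-in for whatever select lines Capella actually drives):

Code:
#include <stdint.h>

/* Byte-lane decode for a 1-byte transfer (TSIZ0:2 == 001). a29_31 is the
 * 0..7 byte offset within the 64-bit double word, with A29 as its most
 * significant bit. A29 picks which 32-bit word of the pair (DRAM being
 * 32 bits wide); A30-A31 pick the byte lane within that word. */
typedef struct {
    unsigned word_sel;  /* 0 or 1: which 32-bit half of the 64-bit bus */
    uint8_t  lane_en;   /* one-hot byte-lane enable, byte 0 = 0x8      */
} ByteSelect;

ByteSelect decode_byte_lane(unsigned a29_31)
{
    ByteSelect s;
    s.word_sel = (a29_31 >> 2) & 1u;                /* A29     */
    s.lane_en  = (uint8_t)(0x8u >> (a29_31 & 3u));  /* A30:A31 */
    return s;
}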

Poor video RAM performance is, I would deduce, down to other reasons.
Sorry, I think you are confused or missing my point; perhaps I was unclear.

I don't mean a read-modify-write sequence as in an instruction that causes a read-modify-write bus cycle. I don't think the 68K ROM used those, as only specific instructions produce them. I'm referring to the QuickDraw use case of reading a particular address (framebuffer for a particular pixel), modifying the data, then writing it back. The point of mentioning byte width (as an example for 8bpp) is that an entire bus transfer cycle is required regardless of width if the address is cache inhibited. If for each pixel changed you're doing two full bus cycles, as I said, that gets expensive quickly because you accumulate additional wasted cycles from bus translation overhead.

Edit: To further clarify, most QuickDraw accesses act like that - basic drawing instructions and the like. In some cases it's possible to do larger transfers (blits, moves, and the like) but that depends a lot on the specific request, organization of framebuffer memory, bit depth, etc. A line from A to B is, in most cases, going to cause a series of single-pixel changes requiring exactly the read-modify-write trace (but not an RMW cycle) I mentioned above, as the framebuffer is typically cache inhibited.
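As a concrete (made-up) example of the access pattern I mean, an XOR-mode vertical line in an 8bpp framebuffer:

Code:
#include <stdint.h>

/* Illustration of the per-pixel trace described above: in a transfer mode
 * like srcXor, each pixel is a read of the framebuffer byte, a modify,
 * and a write back - two full uncached bus cycles per pixel. The base
 * and rowBytes parameters are hypothetical. */
void vline_xor_8bpp(volatile uint8_t *base, long rowBytes,
                    int x, int y0, int y1, uint8_t pat)
{
    volatile uint8_t *p = base + (long)y0 * rowBytes + x;
    for (int y = y0; y <= y1; y++) {
        *p ^= pat;       /* read + modify + write, nothing cached  */
        p += rowBytes;   /* 640 bytes to the next row at 640x480x8 */
    }
}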

Cache inhibition can be handled in a couple of ways. Most commonly it's handled by MMU flags on particular pages/portions of the address space that indicate whether the caches should be inhibited or not. This is how Apple did it on all models of 68K Mac, and I would assume on PPC as well. It's possible for an external device (at least on the 040) to indicate that a particular transfer shouldn't be cached, but this should only be used in very specific cases. Assuming the page is marked as cache inhibited via the MMU, on any access the caches will not be checked, nor will the result of the bus cycle be stored in cache. On an 040 this is the primary way a bus cycle of less than a line width would happen.
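For concreteness, on the '040 that per-page flag is the two-bit cache-mode (CM) field in the page descriptor; a minimal sketch of flipping it (field position and encoding as I recall them from the MC68040 User's Manual, so treat as illustrative):

Code:
/* CM field, bits 6:5 of a 68040 page descriptor:
 *   00 = cacheable write-through   01 = cacheable copyback
 *   10 = noncacheable serialized   11 = noncacheable */
#define CM_SHIFT           5
#define CM_MASK            (3UL << CM_SHIFT)
#define CM_NOCACHE_SERIAL  (2UL << CM_SHIFT)

unsigned long mark_page_noncacheable(unsigned long page_desc)
{
    return (page_desc & ~CM_MASK) | CM_NOCACHE_SERIAL;
}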

You want to inhibit the caches for IO devices, for example, and in other circumstances where the contents of the addressed memory may change unexpectedly (DMA). In most 68K Macs only the ROM and (most) main memory DRAM are cacheable; NuBus space and IO space are not, and the framebuffer lives somewhere in there.
 
<snip> perhaps I was unclear. <snip> I'm referring to the QuickDraw use case of reading a particular address (framebuffer for a particular pixel), modifying the data, then writing it back. <snip> bus translation overhead <snip> most QuickDraw accesses act like that - basic drawing instructions and the like. In some cases it's possible to do larger transfers (blits, moves, and the like) but that depends a lot on the specific request, organization of framebuffer memory, bit depth, etc. A line from A to B is, in most cases, going to cause a series of single-pixel changes requiring exactly the read-modify-write trace (but not an RMW cycle) I mentioned above, as the framebuffer is typically cache inhibited.


Cache inhibition <snip> MMU flags on particular pages/portions of the address space <snip> main memory DRAM are cacheable; NuBus space and IO space are not, and the framebuffer lives somewhere in there.
Indeed. On a 6100, video is in DRAM, so that area of DRAM gets marked as non-cacheable. On a 6200, the frame buffer is 1MB of DRAM behind Valkyrie, which can buffer up to 4 transactions.

So, to properly resolve the question I think someone needs to write a couple of tests to work out what the cost of the bus translation overhead is. Reading other similar threads, to which we've contributed:



Then people like @Melkhior seem to know how the 603-to-'040 bus interface worked, and somewhere in my hazy memory I recall seeing a diagram, but I've no idea where. One of the comments on these threads claimed the '040 bus worked at 33MHz, not 37.5MHz. Surely the mismatch would cause a pretty bad overhead in most circumstances, not just graphics, not least because a P5200's 32-bit '040 data bus is half the width of the 6100's: 68K emulation would be badly affected too. Finally, @noglin estimated that although writes to Valkyrie were fast, reads could take ≈50 PPC603 cycles.

And one other thing: the Dev note says Valkyrie's clock is synced to the CPU clock for performance reasons during data transfers. So, given that it's on the '040 side, it seems to me that the '040 bus also runs at 37.5MHz, because otherwise you'd get extra cycles inserted for the bus interface, and on top of that Valkyrie would have to sync to the PPC603, which surely would slow things down even further?

Perhaps there's another possibility, not related to the '040 bus itself, nor QuickDraw, but maybe the TLB? The 603 uses the PowerPC inverted page table, but TLB misses are handled in software via an exception. I understand that inverted page tables were so awful that Apple didn't use the standard 8-cycle TLB miss handling; their solution, though faster (or better) than inverted page tables, was still much slower than 8 cycles.

If so, we may have a rationale for why QuickDraw is slow on a P6200/P5200 vs a Q630 vs a PM6100:

  1. Both the Q630 and PM6100 have hardware TLB walks, so these will be fast.
  2. Some QuickDraw operations could quite easily hammer the TLB in a way that general 68K emulation doesn't, because they're far less local. E.g. when you draw a vertical line at 640x480x8bpp you're skipping 640 bytes at a time; that's a new 4K page (a potential TLB miss) every 6.4 pixels - see the quick calculation after this list.
  3. OTOH 68K code has a high degree of locality.
  4. Therefore, the P5200/6200 graphics are a pathological case.
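Back-of-envelope for point 2, as promised:

Code:
#include <stdio.h>

/* A vertical line at 640x480x8bpp advances rowBytes = 640 bytes per
 * pixel, so it crosses into a new 4K page every 4096/640 = 6.4 pixels;
 * each crossing is a potential software-handled TLB miss on the 603. */
int main(void)
{
    const double rowBytes = 640.0, pageSize = 4096.0, height = 480.0;
    double pixelsPerPage = pageSize / rowBytes;  /* 6.4                */
    double crossings = height / pixelsPerPage;   /* 75 per full-height */
    printf("%.1f pixels per page, ~%.0f page crossings per %.0f-pixel line\n",
           pixelsPerPage, crossings, height);
    return 0;
}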
I freely admit I might be talking tosh here, but at least I have a claim to some imaginative tosh 😁 Now it's midnight and my brain is fried. Serves me right ;) !
 
Anecdotally, I used a Performa 6200 as my main machine for about 3 years or so, and I have owned several since. I never really noticed a speed problem, because my machine before the 6200 was an SE. I also had use of a 4400 around that time, but the 6200 always performed fine at what I used it for: ClarisWorks, Warcraft II, Photoshop, REALBasic. I even used to render images with Bryce 2 on it; it took an age but it was worth the wait. I wish I still had those Bryce files from back then. People have to remember what the machines were intended for and what else was around at the time. Macs were hellishly expensive, and a lot of people would've still been using Pluses/SEs around the time the 6200 was released. It was a huge jump from those machines.
 
Anecdotally, I used a Performa 6200 as my main machine for about 3 years or so <snip> People have to remember what the machines were intended for and what else was around at the time. <snip> It was a huge jump from those machines.
Yeah, there is nothing wrong with it.

The discussion around the video performance is driven by wondering whether we can improve it, given that it's lower than expected. Otherwise, I don't feel that previous claims about 68k emulation performance and L1 cache limitations actually bear fruit.

I suspect in part people just didn't understand that 75MHz doesn't always mean faster than 66MHz.
 
Yeah, there is nothing wrong with it.
Yes, and case in point: my dad ran his graphic design business on a 6200 for ~4 years. It certainly wasn't the ideal machine for that job, but it got him by until he upgraded to a G4. Power Macs before the G3s came out were silly prices, so for a one-man-band operation the Performas were much more achievable.
 
Yes, and case in point: my dad ran his graphic design business on a 6200 for ~4 years. <snip> the Performas were much more achievable.
Exactly this. I couldn't afford a IIci when I was in college and starting my own studio; all I could afford was a IIsi. I did what I could with what I had — it wasn't my dream setup but it was still good enough.
 
Nice article!

I'd love to compare a release day ROM to see if graphics performance is better.

I have a 6200, “repatriated” last year, that I’d given to my sister ~25 years ago for web design. As a primarily Windows-based user, she needed to be able to see how her pages looked on a Mac. It certainly performed sufficiently for that task at the time.

I do not remember what ROM it has, but I will check.
 
@Phipli : I checked the compiler options for my CodeWarrior Gold 11 (Academic) to see how they differ from yours. And indeed there are fewer options:

[Screenshot: CWGold11PpcOpts.png (CodeWarrior Gold 11 PPC processor options)]
As you can see, only the PPC601, 603 and 604 were supported at the time (I picked it up in late 1996), not the 603e, 604e and 750.

Yeah, there is nothing wrong with it.
This is one of the reasons why I keep hoping there's going to be an InfiniteMac.org P5200 emulator at some point. DingusPPC can emulate a PPC603, but Valkyrie isn't emulated yet. Still, I think enough is understood about the chip to implement pretty much every use of it. Also it'd mean the Q630 could be added.
The discussion around the video performance is driven by wondering whether we can improve it, given that it's lower than expected. Otherwise, I don't feel that previous claims about 68k emulation performance and L1 cache limitations actually bear fruit.
Which is why I think the first step is to figure out what causes the slowdown in current tests.
I suspect in part people just didn't understand that 75MHz doesn't always mean faster than 66MHz.
Possibly. I decided to see if I could find a review of the P5200 online. MacWorld has one (on Archive.org), and there's a section on performance which is instructive:

[Screenshot: 1770722422305.png (performance section of the MacWorld P5200 review)]
If this was true, it means the '040 bus runs at 3 PPC cycles per '040 bus cycle and this would explain quite a bit of the performance hit. How could we prove this? I'm not sure how to force regions of RAM to be uncacheable, but I don't think we need to as we can make use of what we understand from the L1 (writeback, 2-way) and L2 (writethrough) caches with 32-byte cache lines each. Consider these PPC tests:

  • Take a single L1 data cache line, which can be obtained by allocating 64 bytes of RAM and constructing a pointer aligned to 32 bytes. Write 32 bytes of data repeatedly to the same cache line (millions of times); incrementing 32-bit values will do, writing 8 of them per pass. This measures the L1 bandwidth for one way; it's a write-back cache, so nothing goes out to L2. The test loop needs to otherwise use only registers (we have 32, so that should be OK). A sketch of this test appears after the list.
  • A similar test follows: allocate a 4kB+64b block of RAM; generate 2 pointers, one at the base address plus whatever is needed to align to 32 bytes, the other at that address + 4kB. These two lines map to the same set, occupying both ways. Write the same data to both lines alternately. This measures the L1 bandwidth across both ways. The result should be the same bandwidth.
  • The third test extends this a little further: we allocate a 12kB+64b block of RAM and generate 4 pointers: int32_t *p0 = (int32_t *)(((uintptr_t)baseAddr + 31) & ~(uintptr_t)31), *p1 = &p0[1024], *p2 = &p1[1024], *p3 = &p2[1024]; This time, when we write to p2 and p3 it will force an L1 cache line flush into L2, and then the same thing will happen when we write back to p0 and p1. But because L2 is write-through, it will force an L2 cache line flush too, writing to the '040 bus. Since the CPU <--> L2 bus is 64 bits (8B), a 32-byte line goes out as 4 transfers (8 x 32-bit words). The CPU will write the first long word, I think, in 2 x 37.5MHz cycles, and would like to write the rest in 1 cycle each. However, because the L2 cache is write-through and this behaviour is governed by Capella, all those writes will go immediately to the '040 bus as well; which means the first write will take 2 x 37.5MHz cycles (because it'll go into the latches), but the rest will take much longer: approximately 4 x 37.5MHz cycles per word if the '040 bus runs at 33MHz, or 6 x 37.5MHz if it runs at 25MHz (3 cycles per 32-bit word x 2 words per 64-bit word). But ultimately, the speed might be limited by the DRAM, which I think is 80ns (12.5MHz for a /RAS + /CAS, or 25MHz for other words in the page).
  • The fourth test is like the third, but the pointer is set to somewhere in the VRAM address space (which is listed in the developer note). This tests whether VRAM performs better than DRAM even though it's uncacheable. My understanding is that it does, because it's 60ns DRAM.
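As promised, a minimal sketch of the first test in plain C (C99 types and clock() used here for brevity; a period CodeWarrior build would presumably use unsigned long and TickCount() or a finer timer instead):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define PASSES 1000000UL   /* millions of writes to the same line */

int main(void)
{
    /* 64 bytes lets us carve out one 32-byte-aligned L1 line. */
    uint8_t *raw = malloc(64);
    if (raw == NULL) return 1;
    volatile uint32_t *line =
        (volatile uint32_t *)(((uintptr_t)raw + 31) & ~(uintptr_t)31);

    clock_t t0 = clock();
    for (unsigned long pass = 0; pass < PASSES; pass++) {
        /* 8 x 32-bit stores = one whole 32-byte cache line per pass;
         * the loop otherwise touches only registers. */
        for (unsigned i = 0; i < 8; i++)
            line[i] = (uint32_t)(pass + i);
    }
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("~%.1f MB/s one-way L1 write bandwidth\n",
           (PASSES * 32.0) / (secs * 1e6));
    free(raw);
    return 0;
}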
We can then repeat the set of scenarios with a slightly different test. In this case, instead of writing an updated counter from a register, we read from the pointer, add a value, and write it back. L1 performance should be very high. The L1+L2 tests should now be faster than the one where we hit video RAM, because although L1 gets flushed and L2 is write-through, reads will come from the L2 cache, because that's 256kB. The final test essentially tests @zigzagjoe 's hypothesis, I think. My guess is that the final test will be dramatically slower, because Valkyrie prioritises writes over reads!

If so, it'd tell us how to improve performance a little. If DRAM reads are fairly fast, then we could, in theory, cache video in DRAM, essentially as an off-screen pixmap. RMW graphics go to that fake buffer, but are later pushed to Valkyrie. We can also use that mechanism to independently test the actual performance of QuickDraw by writing our test routines to target only the off-screen pixmap instead of the on-screen one.
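A minimal sketch of that shadow-buffer idea (all names hypothetical; the point is that per-pixel RMW hits cacheable DRAM, and only a linear copy ever touches Valkyrie):

Code:
#include <stdint.h>
#include <string.h>

/* Hypothetical shadow framebuffer: QuickDraw-style read-modify-write goes
 * to a cacheable DRAM copy; a periodic flush pushes the dirty rows to the
 * real (uncacheable) Valkyrie framebuffer as one linear write burst. */
void flush_shadow_rows(volatile uint8_t *vram, const uint8_t *shadow,
                       long rowBytes, int firstRow, int lastRow)
{
    long offset = (long)firstRow * rowBytes;
    size_t len = (size_t)(lastRow - firstRow + 1) * (size_t)rowBytes;
    /* writes are the fast direction into Valkyrie's buffers */
    memcpy((void *)(vram + offset), shadow + offset, len);
}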
 
If this was true, it means the '040 bus runs at 3 PPC cycles per '040 bus cycle and this would explain quite a bit of the performance hit.
That is absolutely garbage. We have the developer notes and also, it makes no sense. The bus runs at 75/2.
I'm not sure how to force regions of RAM to be uncacheable
We can probably disable the whole cache... But we're lacking in hardware documentation for this platform. Remind me and I'll do some research.

I'll have to reread the rest of your post when I'm settled on the sofa.
 
If this was true, it means the '040 bus runs at 3 PPC cycles per '040 bus cycle and this would explain quite a bit of the performance hit
You know something - I can't prove this either way. I can't find a speed for the 040 bus quoted anywhere at all by a primary source (I'm ignoring all other sources because there are a lot of contradictory statements out there).

The only thing is that this would obviously impact RAM too, which feels like it would be a bit foolish. 25MHz 32bit RAM on a 75MHz 64bit processor is... Well. Not amazing.
 
This may have been brought up in all of this but could these tests differ in MacOS 8.0-8.6?

I had a 6200 board that died in a tragic heatsink accident, and I found it to be absolutely charming. It was slow, it was not great, and it was the second beige retro machine I picked up. I miss it.
 
That is absolutely garbage. We have the developer notes and also, it makes no sense. The bus runs at 75/2.
Dev notes say the CPU bus runs at 37.5MHz, but they don't say how fast the '040 bus runs. Various contributors here have said maybe there's an extra wait state to take it down to 33MHz. Yet I'm sure I've seen a document on how '040 bus transactions work. Maybe it was in the '040 manual itself. I guess it's likely that if it's an '040 bus it actually follows that protocol.

We can probably disable the whole cache... But we're lacking in hardware documentation for this platform. Remind me and I'll do some research.

I'll have to reread the rest of your post when I'm settled on the sofa.

You know something - I can't prove this either way. I can't find a speed for the 040 bus quoted anywhere at all by a primary source (I'm ignoring all other sources because there are a lot of contradictory statements out there).
Exactly, but I think it could be tested.
The only thing is that this would obviously impact RAM too, which feels like it would be a bit foolish. 25MHz 32bit RAM on a 75MHz 64bit processor is... Well. Not amazing.
True. But most accesses won't hammer the '040 bus; it's just not amazing for code that performs a lot of memory modifications, like graphics.

I looked up the MC68040 bus protocol: it's normally 2 cycles per transaction, or 5 cycles for a four-word (4 x 32-bit) cache line fill. I think the F108 should be able to use page mode, but I'll ignore that, because graphics operations will generally operate on 1 byte at a time. Let's assume both the P630 and P5200 have 80ns RAM.

So, on the P630 we have a 30ns address cycle, then we need to wait for data; the F108 generates /RAS (30ns), /CAS (30ns) and waits 80ns (rounded to 90ns) for the first access (the only access in the worst case), so that's a total of 150ns, or 5 bus cycles.

On the P5200 we have 40ns cycles. So that's /RAS (40ns) + /CAS (40ns) + 80ns (probably needing to be rounded to 120ns) = 200ns.

This accounts for a 33% slow-down when accessing uncached DRAM.

But for Valkyrie it'd be a bit faster. A 2-cycle word access would take 60ns, whereas the same access would take 50ns on a P5200. Thus the P5200 will be about 20% slower. In fact, 56.6/43.8 is 29% slower, which is pretty close.
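Restating the DRAM arithmetic above as a quick check (my assumptions: 30ns cycles at 33MHz on the P630 and 40ns at 25MHz on the P5200, with the DRAM wait rounded up to whole bus cycles plus one cycle of margin, matching the rounding above):

Code:
#include <stdio.h>

static double random_read_ns(double cycle_ns, double dram_ns)
{
    int waitCycles = (int)(dram_ns / cycle_ns) + 1; /* 80ns -> 3 cycles */
    return cycle_ns              /* /RAS */
         + cycle_ns              /* /CAS */
         + cycle_ns * waitCycles;
}

int main(void)
{
    printf("P630:  %.0f ns\n", random_read_ns(30.0, 80.0)); /* 150 ns */
    printf("P5200: %.0f ns\n", random_read_ns(40.0, 80.0)); /* 200 ns */
    return 0;
}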
 
Dev notes say the CPU bus runs at 37.5MHz, but they don't say how fast the '040 bus runs. <snip> I looked up the MC68040 bus protocol: it's normally 2 cycles per transaction, or 5 cycles for a four-word (4 x 32-bit) cache line fill. <snip> Thus the P5200 will be about 20% slower. In fact, 56.6/43.8 is 29% slower, which is pretty close.
You're making a lot of faulty assumptions here... I would strongly recommend reviewing the MEMC or MEMCjr documentation floating around to get an idea of what 040 bus transactions look like in application and how those correlate to memory timings. Wait states don't change the bus clock frequency, and more often than not you'll be using at least one wait state in order to meet timing needs. Outside of cases specifically designed for speed anyway, such as a cache or possibly the Valkyrie write buffers.

Ultimately some work with an LA and specific test cases would be best, as you suggested. As I recall there are a few ROM traps regarding cacheability, but you probably could just directly manipulate the MMU.
 
You're making a lot of faulty assumptions here.<snip> to get an idea of what 040 bus transactions look like
I'm certainly making some assumptions, because I, too, don't have a P5200/6200 and I'm only an embedded firmware engineer. But I think maybe I wasn't clear enough in my earlier comment either. Other people have suggested there were extra wait states because they figured the '040 bus ran at 33MHz while the CPU bus ran at 37.5MHz. I didn't have an opinion on that. For all I know it could be right.

For my comment about 3 hours ago, to get an idea of what '040 bus transactions look like I went to:


Chapter 8, page 15:
[Screenshot: 1770758040210.png (bus cycle timing from chapter 8, page 15)]
This is where I get my idea that a basic '040 bus cycle is 2 cycles.

I'm not sure that the P5200 '040 bus is 25MHz; I'm just doing a rough estimate of what things might look like if it were. It's unlikely the F108 memory controller can generate more than one address phase per clock, so 3 cycles would be the minimum, and RAM would make it longer. From my reading of a Motorola 256Kbit x 4-bit 80ns DRAM datasheet...

A random read would be 160ns minimum and fast page mode reads would be 55ns.

Wait states aren't changing the bus clock frequency
I didn't think they were, I had just read that the '040 bus was at 25MHz.
and more often than not you'll be using at least one wait state in order to meet timing needs.
Yes, that's how I understand it. On a 33MHz bus that would mean maybe 180ns in total (6 x 30ns) for a random read, but maybe 360ns for a burst of 4 reads. On a 25MHz bus, perhaps you'd need 4 cycles (160ns), but that sounds tight, so perhaps 5 are needed (1 wait state?), so that's 200ns for a random read. Then 2 cycles would be needed per FPM read, giving 200+240ns = 440ns. So the P5200 is then 22% slower.
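The same arithmetic as a quick check (cycle counts as guessed above: a 6-cycle random read at 33MHz versus 5 cycles at 25MHz, each followed by three 2-cycle fast-page-mode reads):

Code:
#include <stdio.h>

int main(void)
{
    double p630  = 6 * 30.0 + 3 * (2 * 30.0);  /* 180 + 180 = 360 ns */
    double p5200 = 5 * 40.0 + 3 * (2 * 40.0);  /* 200 + 240 = 440 ns */
    printf("P630: %.0f ns, P5200: %.0f ns (%.0f%% slower)\n",
           p630, p5200, 100.0 * (p5200 - p630) / p630);
    return 0;
}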

Again, though, a new estimate using a DRAM datasheet leads to a similar conclusion, 20% to 25% slower on the P5200.
<snip> possibly the Valkyrie write buffers.
And there may be some interaction between the write and read buffers, given that Valkyrie's read and write buffers go through the same bus interface and that, of course, lots of VRAM bandwidth is taken up just generating video. Maybe it's something really dumb, like the P5200 bus runs at 25MHz and so does Valkyrie, which means more time is spent reading VRAM to generate the display, and this compounds the reduction in available CPU bandwidth for accessing Valkyrie.

Ultimately some work with an LA [logic analyser] and specific test cases would be best, as you suggested. As I recall there are a few ROM traps regarding cacheability, but you probably could just directly manipulate the MMU.
Cool. I guess it's possible to read the bus speed directly using an LA! It's a good point about cacheability traps.
 
Taking a step back and using logic, I don't see why the 040 bus wouldn't be 37.5MHz - there is no motivation that I can see for it to be otherwise. The hardware was spec'd for up to 40MHz at least; preceding versions happily run at 40 or even 50MHz. Why add the complexity of a mismatched clock and further reduce performance both by the slower clock and by the mismatch itself? I suspect the reason Apple never mentions its speed is that there is no reason to - it's the same 37.5 as the PPC part of the bus...

(Keeping in mind that the Dev note tells us that the bus is 37.5, the 030 part is 16MHz, the SCSI clock is 18.75MHz, and the SCC is clocked at 8MHz... I doubt they forgot the 040 bus)

But the only way I'd find out is by sticking a probe on a pin I know sees a clock for the '040 bus somewhere, but I don't know where I'd find that, and I don't have a working 6200 anyway.
 
Read the whole thread, interesting.

About Valkyrie performance: if you look at the Valkyrie DeclROM on a 5200/6200, there is 68k code in there, so part of it runs emulated. It may be just for basic stuff like resolution listing or initial bring-up, or there could be some more advanced routines in there as well.
 