Testing a 6200

( @Phipli ) it adds up, but there are a few reasons why I don't think that's the cause.

Firstly, there are no read-modify-write memory instructions on the PowerPC, just loads and stores. So this sequence of bus cycles in non-cached DRAM would have to be implemented in software, not in the Capella bridge chip (load a 32-bit word into a register; insert the byte into the right position; write the word back). But that doesn't make sense either: it would mean the ROM had to be aware of the underlying PPC/'040 bridge.
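To illustrate, here's a minimal Python sketch of what that software sequence would look like - purely hypothetical, just modelling the two bus transactions a word-only memory would force for every byte store:

```python
def emulated_byte_store(mem_words, addr, byte_val):
    """Emulate a byte store to memory that only supports whole
    32-bit word accesses: load the containing word, insert the
    byte, write the word back (two bus transactions)."""
    index = addr >> 2                # which 32-bit word
    shift = (3 - (addr & 3)) * 8     # big-endian byte lanes
    word = mem_words[index]                        # transaction 1: read
    word = (word & ~(0xFF << shift)) | ((byte_val & 0xFF) << shift)
    mem_words[index] = word                        # transaction 2: write
    return mem_words

mem = [0x11223344]
emulated_byte_store(mem, 1, 0xAB)    # store 0xAB at byte address 1
# mem[0] is now 0x11AB3344
```

Two full transactions per byte written, which is exactly why nobody would want this path for a framebuffer.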

Secondly, I don't think it makes sense for it to be implemented in Capella either. For cacheable addresses, whole cache lines get loaded into the L2, then the L1 cache - a block of 32 bytes (8 words) at a time - and if a byte or half-word is modified there (which is now on the PPC side, of course), the whole line (i.e. 8 words) is written out when the cache line needs to be flushed. So cacheable memory never needs an RMW sequence, nor a byte-modification sequence.

Thirdly, from the developer note, it seems fairly clear that the interface between the 68040 bus and the PPC603 is a set of '543 latches:


So, I don't think it makes sense for Capella to emulate a byte store to uncached RAM by loading an entire word from DRAM into the D-latches, modifying the intended byte, then writing the 32-bit word back out. For byte loads it wouldn't matter: a PPC603 would still just load a 64-bit double word and select the right byte.

It also makes little sense given that the only way Capella can know there's a non-cacheable byte write is if the PPC603 tells it - so there are signals from the CPU for that purpose. It's far easier for the DRAM interface to include byte-lane selects and simply pass those signals on. Even though the CPU will probably send out 64 bits plus the byte-select signals, Capella must know how to select the high or low 32-bit word, because DRAM is 32 bits wide - it must do that for normal writes anyway.


From page 8-15 it works like this:

A29-A31 (which are the low-order bits in the insane IBM convention) select the byte lane, while the three transfer-size signals TSIZ0:2 == 0b001, meaning 1 byte.

It's very easy for Capella to use this information to select the right byte lane in DRAM. A similar approach works for 16-bit and 32-bit reads and writes.
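As a sketch of that decode (my own illustration, not Capella's actual logic - aligned 1-, 2- and 4-byte transfers only, where TSIZ encodes the byte count directly):

```python
def dram_byte_enables(low_addr_bits, tsiz):
    """Derive a 4-bit byte-enable mask for a 32-bit-wide DRAM word
    from the low-order address bits (A29-A31) and the transfer-size
    signals TSIZ0:2 (0b001 = 1 byte, 0b010 = 2 bytes, 0b100 = 4).
    Bit 3 of the mask is byte lane 0 (the most significant byte)."""
    size = tsiz                # for these sizes, TSIZ is the byte count
    lane = low_addr_bits & 3   # starting byte lane within the 32-bit word
    return (((1 << size) - 1) << (4 - size - lane)) & 0xF

dram_byte_enables(1, 0b001)   # byte write to lane 1    -> 0b0100
dram_byte_enables(0, 0b100)   # aligned 32-bit transfer -> 0b1111
```

The point being: the information needed to drive byte-lane selects is already sitting on the CPU bus pins; no RMW required.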

Poor Video RAM performance is, I would deduce, for other reasons.
Sorry, I think you are confused or missing my point, perhaps I was unclear.

I don't mean a read-modify-write sequence as in an instruction that causes a read-modify-write bus cycle. I don't think the 68K ROM used those, as only specific instructions do. I'm referring to the QuickDraw use case of reading a particular address (the framebuffer location for a particular pixel), modifying the data, then writing it back. The point of mentioning byte width (as an example for 8bpp) is that an entire bus transfer cycle is required regardless of width if the address is cache-inhibited. If for each pixel changed you're doing two full bus cycles then, as I said, that gets expensive quickly, because you accumulate additional wasted cycles from bus-translation overhead.

Edit: To further clarify, most QuickDraw accesses act like that - basic drawing instructions and the like. In some cases it's possible to do larger transfers (blits, moves, and the like), but that depends a lot on the specific request, the organization of framebuffer memory, bit depth, etc. A line from A to B is, in most cases, going to cause a series of single-pixel changes requiring exactly the read-modify-write trace (but not an RMW cycle) I mentioned above, as the framebuffer is typically cache-inhibited.
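A toy model of that trace in Python (hypothetical names, just to count the cost - with the caches inhibited, every access goes to the bus):

```python
class UncachedFramebuffer:
    """Toy model of a cache-inhibited 8bpp framebuffer: every
    access is a full bus transaction, with no cache to absorb it."""
    def __init__(self, width, height):
        self.rowbytes = width            # 8bpp: one byte per pixel
        self.mem = bytearray(width * height)
        self.transactions = 0

    def set_pixel(self, x, y, colour):
        addr = y * self.rowbytes + x
        _ = self.mem[addr]               # read  (transaction 1)
        self.mem[addr] = colour          # write (transaction 2)
        self.transactions += 2

fb = UncachedFramebuffer(640, 480)
for y in range(100):                     # 100-pixel vertical line
    fb.set_pixel(320, y, 0xFF)
# 100 single-pixel changes -> 200 uncached bus transactions
```

And each of those transactions then pays whatever translation overhead the '040-to-603 bridge adds on top.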

Cache inhibition can be handled in a couple of ways. Most commonly it's handled by MMU flags on particular pages/portions of the address space that indicate whether the caches should be inhibited. This is how Apple did it on all models of 68K Mac, and I would assume on PPC as well. It's possible for an external device (at least on the '040) to indicate that a particular transfer shouldn't be cached, but this should only be used in very specific cases. Assuming the page is marked cache-inhibited via the MMU, on any access the caches will not be checked, nor will the result of the bus cycle be stored in the cache. On an '040 this is the primary way a bus cycle of less than a line width would happen.

You want to inhibit the caches for I/O devices, for example, and in other circumstances where the contents of the addressed memory may change unexpectedly (DMA). In most 68K Macs only the ROM and (most of) main memory DRAM are cachable; NuBus space and I/O space are not, and the framebuffer lives somewhere in there.
 
<snip> perhaps I was unclear. <snip> I'm referring to the QuickDraw use case of reading a particular address (framebuffer for a particular pixel), modifying the data, then writing it back. <snip> bus translation overhead <snip> most QuickDraw accesses act like that - basic drawing instructions and the like. In some cases it's possible to do larger transfers (blits, moves, and the like) but that depends a lot on the specific request, organization of framebuffer memory, bit depth, etc. A line from A to B is going in most cases to cause a series of single-pixel changes requiring that exact read-modify-write trace (but not a RMW cycle) I mentioned above as the framebuffer is typically cache inhibited.


Cache inhibition <snip> MMU flags on particular pages/portions of address space <snip> main memory DRAM is cachable; nubus space and IO space is not and framebuffer lives somewhere in there.
Indeed. So, on a 6100, video is in DRAM, which means that area of DRAM gets marked as non-cacheable. On a 6200, the frame buffer is 1 MB of DRAM behind Valkyrie, which can buffer up to four transactions.

So, to properly resolve the question I think someone needs to write a couple of tests to work out what the cost of the bus translation overhead is. Reading other similar threads, to which we've contributed:



Then, people like @Melkhior seem to know how the 603-to-'040 bus interface worked, and somewhere in my hazy memory I recall seeing a diagram, but I've no idea where. One of the comments on these threads claimed the '040 bus ran at 37.5MHz, not 33MHz. Surely the mismatch would cause pretty bad overhead in most circumstances, not just graphics - not least because a P5200's 32-bit '040 data bus is half the width of the 6100's, so 68K emulation would be badly affected too. Finally, @noglin estimated that although writes to Valkyrie were fast, reads could take ≈50 PPC 603 cycles.

And one other thing: the dev note says Valkyrie's clock is synced to the CPU clock for performance reasons during data transfers. Given that Valkyrie is on the '040 side, it seems to me that the '040 bus also runs at 37.5MHz, because otherwise you'd get extra cycles inserted by the bus interface, and on top of that Valkyrie would have to sync to the PPC603, which surely would slow things down even further?

Perhaps there's another possibility, related neither to the '040 bus itself nor to QuickDraw: the TLB? The 603 uses the PowerPC inverted page table, but TLB misses are handled in software via an exception. I understand inverted page tables were so awful that Apple didn't use the standard TLB-miss handler (reportedly ~8 cycles); their own solution, though faster (or better) than walking the inverted page table, was still much slower than those 8 cycles.

If so, we may have a rationale for why QuickDraw is slow on a P6200/P5200 vs a Q630 vs a PM6100:

  1. Both the Q630 and PM6100 have hardware TLB walks, so these will be fast.
  2. Some QuickDraw operations could quite easily hammer the TLB in a way that general 68K emulation doesn't, because they're far less local. E.g. when you draw a vertical line at 640x480x8bpp you're skipping 640 bytes per pixel; that's a new 4K page (and a potential TLB miss) every 6.4 pixels.
  3. OTOH 68K code has a high degree of locality.
  4. Therefore, the P5200/6200 graphics are a pathological case.
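The arithmetic behind point 2, as a quick sanity check (assuming 4 KB pages and a page-aligned framebuffer):

```python
PAGE = 4096
ROWBYTES = 640            # 640 x 480 at 8bpp: one byte per pixel

# A vertical line steps ROWBYTES per pixel, so it enters a new
# 4K page - a potential software TLB miss on the 603 - every:
pixels_per_page = PAGE / ROWBYTES           # 6.4 pixels

# A full-height (480 px) vertical line therefore crosses:
pages_touched = -(-480 * ROWBYTES // PAGE)  # ceil -> 75 pages
```

Seventy-five pages for one line is far more than the 603's data TLB holds, so a handful of such operations could keep the software miss handler busy.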
I freely admit I might be talking tosh here, but at least I have a claim to some imaginative tosh 😁 Now it's midnight and my brain is fried. Serves me right ;) !
 