
How does the L2 cache work on NuBus (PowerPC)?

noglin

Active member
There is detailed information available about the L1 cache (as it is part of the CPU): its associativity (on the 603 it is 2-way), its cache policies, that replacement is LRU, which bits of the physical address form the tag, which bits of the EA/PA form the index, etc.

As L2 is not part of the CPU, information about it is unlikely to be readily available. But perhaps someone here knows?

I'm particularly interested in the L2 on the Macintosh Performa 5200, but I suppose the design is likely very similar across NuBus machines.
 

noglin

Active member
Some insights:
- the L2 cache is (as expected) not split: data can use all of the 256 kB.
- I have not checked the associativity (N-way) yet. I think allocating 2x128 kB, writing both, and then reading interleaved from both would tell whether it is N-way (see the sketch below for one way to probe this).
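
Here is a rough sketch of how I might probe the associativity, using a conflict set rather than the 2x128 kB interleave above: touch K lines spaced exactly one cache size (256 kB) apart so they all map to the same set, then time repeated passes over them. Names and constants (read_tb, L2_SIZE, PASSES) are placeholders; untested, and it assumes GCC inline asm on a 32-bit PowerPC where mftb is available.

```c
/*
 * Hypothetical conflict-set probe (not the 2x128 kB test itself).
 * Touch K lines spaced exactly one cache size apart so they all map
 * to the same L2 set, then time repeated passes over them.  The pass
 * time should jump once K exceeds the number of ways.
 */
#include <stdio.h>
#include <stdlib.h>

#define L2_SIZE (256 * 1024)
#define MAX_K   8
#define PASSES  10000

/* Read the low word of the PowerPC time base (8-cycle resolution here). */
static unsigned long read_tb(void)
{
    unsigned long t;
    __asm__ __volatile__("mftb %0" : "=r"(t));
    return t;
}

int main(void)
{
    volatile char *buf = malloc((size_t)MAX_K * L2_SIZE);
    int k, p, i;

    if (!buf)
        return 1;

    for (k = 1; k <= MAX_K; k++) {
        unsigned long start, end;

        /* Warm the k conflicting lines. */
        for (i = 0; i < k; i++)
            (void)buf[(size_t)i * L2_SIZE];

        start = read_tb();
        for (p = 0; p < PASSES; p++)
            for (i = 0; i < k; i++)
                (void)buf[(size_t)i * L2_SIZE];
        end = read_tb();

        printf("k=%d  ticks/pass=%lu\n", k, (end - start) / PASSES);
    }
    return 0;
}
```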

How I approached the micro-benchmark:
- this is on a 5200 with an 8 kB data L1 cache and a 256 kB L2 cache
- I use the time base ("tb", which ticks every 8 cycles): I wait for tb to tick, measure a single load, and immediately afterwards read tb again (a sketch of this loop follows the list)
- I allocate memory aligned to a 4 kB page and write 512 kB of data in reverse order, from the last position down to offset 0
- I use a load to measure cache access, with a stride of 8 cache blocks of 32 bytes = 256 bytes (so no load benefits from a previous load)
- I first read the last memory written (most of this will thus be in L1, though e.g. the stack will pollute it a bit); in the plots this is 0 kB to 8 kB
- I then read the memory from 8 kB to 256 kB; this has a high chance of being in L2, but L2 is shared with instructions as well, and any interrupt can severely modify the data in L2
- finally I read 256 kB to 512 kB, which will be in RAM (not L1/L2)
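
For reference, a minimal sketch of the measurement loop described above. The names and constants are mine and the real benchmark differs in details; it assumes GCC inline asm for mftb on a 32-bit PowerPC and that one tb tick = 8 CPU cycles on this machine.

```c
/*
 * Minimal sketch of the measurement loop: spin until the time base
 * ticks, issue one load, read the time base again.  Output is CSV:
 * byte offset, cycles (ticks * 8).
 */
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE  (512 * 1024)   /* 2x the 256 kB L2            */
#define STRIDE    (32 * 8)       /* 8 cache blocks of 32 bytes  */
#define PAGE_SIZE 4096
#define TB_CYCLES 8              /* cycles per time-base tick   */

static unsigned long read_tb(void)
{
    unsigned long t;
    __asm__ __volatile__("mftb %0" : "=r"(t));
    return t;
}

int main(void)
{
    char *raw = malloc(BUF_SIZE + PAGE_SIZE);
    volatile char *buf;
    long i;

    if (!raw)
        return 1;
    /* Align to a 4 kB page. */
    buf = (volatile char *)(((unsigned long)raw + PAGE_SIZE - 1)
                            & ~(unsigned long)(PAGE_SIZE - 1));

    /* Write the buffer back to front, so the last bytes written
     * (offsets 0..8 kB) are the ones most likely still in L1. */
    for (i = BUF_SIZE - 1; i >= 0; i--)
        buf[i] = (char)i;

    /* One timed load per stride position. */
    for (i = 0; i < BUF_SIZE; i += STRIDE) {
        unsigned long t0, t1;

        t0 = read_tb();
        while ((t1 = read_tb()) == t0)   /* wait for a fresh tick   */
            ;
        (void)buf[i];                    /* the load being measured */
        printf("%ld,%lu\n", i, (read_tb() - t1) * TB_CYCLES);
    }

    free(raw);
    return 0;
}
```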

The plots are all from the same data but show different regions (the byte positions that were loaded). In table.csv, the first column is the byte position and the second column is the number of cycles.

Details:
- as expected, it is not a split cache, i.e. data can occupy all of the 256 kB
- L1: I only have 8-cycle resolution; the specs say 2-cycle throughput and 2-cycle latency, so 0 really means ~2 cycles (tb did not have time to tick), and 8 is when it must go fetch from L2
- L2: a load takes 8 cycles when the data is in L2, and we can see these spread out all over the 256 kB, indicating that data can be placed anywhere (it is not a split cache)
- RAM: when the data is not in L2, the best case is 32 cycles; this is both because RAM is slower and because it sits on a 32-bit bus (L1/L2 are 64-bit). The worst case is ~168 cycles, and we can see that this happens on 4 kB (page) boundaries. So after a miss, something makes subsequent accesses within that page faster: it is not in L2, but it might be the MMU page entries helping, or maybe the memory controller has a buffer?
 

Attachments

  • L1.png
  • L2.png
  • RAM.png
  • all.png
  • table.csv.zip

noglin

Active member
Actually, I think I had the explanation of the RAM accesses the wrong way around:
- I think the typical access should take 32 cycles, and it is when a new page is first touched that it gets very expensive, because the access misses in the TLB and the hardware must search the page table entries (PTEs). After those ~168 cycles it won't have to do that again until the next new page. (I wonder how many pages the on-chip TLB entries can cover? A sketch for probing this is below.)
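
One way to get at that question empirically might be something like the following sketch (again, names and constants are placeholders, not the actual benchmark, and it assumes GCC inline asm for mftb): warm the TLB by touching N distinct pages, then time one load per page on a second pass; once N exceeds the number of data TLB entries, the expensive page-miss time should reappear.

```c
/*
 * Hypothetical TLB-reach probe.  Touch N distinct pages once to warm
 * the TLB, then time one load per page on a second pass.  Only one
 * cache line per page is touched, so the data itself stays cache
 * resident and the timed pass mostly reflects translation cost.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 256
#define TB_CYCLES 8

static unsigned long read_tb(void)
{
    unsigned long t;
    __asm__ __volatile__("mftb %0" : "=r"(t));
    return t;
}

/* Stagger the offset within each page so the probes do not all land
 * in the same cache set. */
static size_t probe(int i)
{
    return (size_t)i * PAGE_SIZE + (size_t)((i * 64) & (PAGE_SIZE - 1));
}

int main(void)
{
    volatile char *buf = malloc((size_t)MAX_PAGES * PAGE_SIZE);
    int n, i;

    if (!buf)
        return 1;

    for (n = 8; n <= MAX_PAGES; n *= 2) {
        unsigned long ticks = 0;

        /* Warm pass: one access per page loads the translations. */
        for (i = 0; i < n; i++)
            (void)buf[probe(i)];

        /* Timed pass: cheap while all n translations are still cached. */
        for (i = 0; i < n; i++) {
            unsigned long t0 = read_tb();
            (void)buf[probe(i)];
            ticks += read_tb() - t0;
        }
        printf("pages=%d  avg_cycles=%lu\n", n, ticks * TB_CYCLES / n);
    }
    return 0;
}
```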
 

Attachments

  • 2024-05-10-012210_867x1006_scrot.png