So this week I had a serious breakthrough on why the JavaScript accelerator in TenFourFox chugs so badly on the G5. Apple, true to form, only documents some of this. I wrote a whole big blurb for the TenFourFox blog, but here are the highlights:
- The D-cache is very different. Not a problem for me right now, but it might be if I start manipulating the data cache to get more performance. AltiVec D-cache instructions like dst actually can cause pipeline bubbles, worsening performance.
- Loads and stores to nearby addresses should have nops between them to force them into separate dispatch groups, or you risk pipeline stalls when the G5 discovers the aliasing fault. Tweaking this bought me some extra points in V8.
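- Watch out for cracked and microcoded instructions, which the G5 splits into multiple internal operations and which restrict how dispatch groups can form.
- mtctr should be first in a dispatch group if possible. I bet other instructions have the same restriction.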
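To make the nop-padding point concrete, here is a minimal sketch of how a JIT emitter might separate a store from a load of the same address. The helper name emit_store_then_load and its pad_nops parameter are made up for illustration and are not the nanojit's actual API. How many nops it takes to push the load into the next dispatch group depends on where the store lands in its own group, so treat the count as something to tune rather than a known constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The PowerPC nop is "ori r0,r0,0". */
#define PPC_NOP 0x60000000u

/* D-form encoding of "stw rS, d(rA)" (primary opcode 36). */
uint32_t ppc_stw(unsigned rs, unsigned ra, int16_t d) {
    return (36u << 26) | (rs << 21) | (ra << 16) | (uint16_t)d;
}

/* D-form encoding of "lwz rT, d(rA)" (primary opcode 32). */
uint32_t ppc_lwz(unsigned rt, unsigned ra, int16_t d) {
    return (32u << 26) | (rt << 21) | (ra << 16) | (uint16_t)d;
}

/*
 * Hypothetical helper: emit a store, then pad_nops nops, then a load
 * from the same address. Returns the number of words written, or 0 if
 * the buffer is too small. pad_nops is an assumption to be tuned.
 */
size_t emit_store_then_load(uint32_t *buf, size_t cap,
                            unsigned rs, unsigned rt, unsigned ra,
                            int16_t d, size_t pad_nops) {
    size_t n = 0;
    assert(buf != NULL);
    if (cap < pad_nops + 2)
        return 0;
    buf[n++] = ppc_stw(rs, ra, d);
    while (pad_nops-- > 0)
        buf[n++] = PPC_NOP;          /* filler between store and load */
    buf[n++] = ppc_lwz(rt, ra, d);
    return n;
}
```

Calling emit_store_then_load(buf, 8, 3, 4, 1, 8, 3) writes stw r3,8(r1), three nops, then lwz r4,8(r1). Keeping the pad count a parameter makes it easy to benchmark different spacings on real hardware.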
Here's the one that really frosted my cake, and Apple doesn't mention it anywhere:
- mcrxr is emulated in software on the G5. Because the nanojit uses lots of code to check for overflow state, we heavily use XER, and I used mcrxr to get the XER overflow bit into a CR for branching (for a variety of reasons, summary overflow is unsuitable). This works great on the G3 and G4, which have it in hardware. On the G5, it causes an illegal instruction, a pipeline spill and emulation in software by the OS. Yikes! This is by far the biggest reason why the G4 ran rings around the G5. Now, the G5 is back on top.
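For the curious, here is a small C model of what mcrxr does architecturally, next to one possible replacement built from instructions the G5 does have in hardware (mfxer, mtcrf, rlwinm, mtxer). This is a sketch under my own assumptions, not necessarily the sequence that ships in TenFourFox. The struct and function names are invented for illustration, and the replacement clobbers r0 as a scratch register.

```c
#include <assert.h>
#include <stdint.h>

/* Just the registers this sketch needs. */
typedef struct {
    uint32_t xer;
    uint32_t cr;
    uint32_t r0;   /* scratch register used by the replacement */
} PpcRegs;

/* XER bits 0-3 (IBM numbering) are the top nibble: SO, OV, CA, reserved. */
#define XER_OV     0x40000000u
#define TOP_NIBBLE 0xF0000000u

/* mcrxr cr0: copy XER[0:3] into CR field 0, then clear XER[0:3]. */
void model_mcrxr_cr0(PpcRegs *r) {
    r->cr = (r->cr & ~TOP_NIBBLE) | (r->xer & TOP_NIBBLE);
    r->xer &= ~TOP_NIBBLE;
}

/* Assumed replacement sequence, modeled one instruction per line. */
void model_replacement_cr0(PpcRegs *r) {
    r->r0 = r->xer;                                        /* mfxer  r0           */
    r->cr = (r->cr & ~TOP_NIBBLE) | (r->r0 & TOP_NIBBLE);  /* mtcrf  0x80,r0      */
    r->r0 &= ~TOP_NIBBLE;                                  /* rlwinm r0,r0,0,4,31 */
    r->xer = r->r0;                                        /* mtxer  r0           */
}

/* After either sequence, OV sits in the GT position of CR0. */
int overflow_bit_in_cr0(const PpcRegs *r) {
    return (r->cr & XER_OV) != 0;
}

/* Returns 1 if both models leave XER and CR in the same state. */
int sequences_agree(uint32_t xer, uint32_t cr) {
    PpcRegs a = { xer, cr, 0 };
    PpcRegs b = { xer, cr, 0 };
    model_mcrxr_cr0(&a);
    model_replacement_cr0(&b);
    return a.xer == b.xer && a.cr == b.cr;
}
```

Either way, the overflow bit ends up where a conditional branch on CR0 can test it. The difference is that the second version is four ordinary instructions instead of one trap into the OS.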
With these fixes, the raw nanojit (without the fixes I originally used) drops from a dismal 5200ms in SunSpider to 1760ms on my quad 2.5GHz. I bet the people with dual 2.7s do even better. This will be in TenFourFox 4.0.1.
I can see why people got frustrated with optimizing for the G5; Apple never documented this stuff very well.

