Really interesting how the cost function changes when your host clock is 20x faster than the emulated cpu.
Yes and that's today's topic (I think, assuming I know what you mean by cost function)!
Today I did something ILLEGAL!
Yes, I implemented the ILLEGAL instruction!!! In many ways, ILLEGAL is pretty simple, because it doesn't need to do much computation, it just causes an exception. But that meant I needed to implement the beginning of exception handling and in so doing I learned something about the Davidian PowerMac 68K emulator. I know that Exception emulation is going to be much faster than the way a real 68K would generate an exception stack frame, because the Emulator can store a word on a stack in the equivalent of 0.1 68K cycles.
So, that means I can use a single routine to generate both kinds of exception stack frames:
Format A

| Offset from SP | Size | Value |
|---|---|---|
| +0 | Word | Flags: bit 4 = read/write (0: write, 1: read); bit 3 = instruction/not (0: instruction); bits 2-0 = function code |
| +2 | Longword | Access address |
| +6 | Word | IR |
| +8 | Word | SR |
| +10 | Longword | PC |
Format B

| Offset from SP | Size | Value |
|---|---|---|
| +0 | Word | SR |
| +2 | Longword | PC |
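To illustrate the "single routine" idea, here is a minimal sketch, not MØBius's actual code. It assumes a hypothetical `Cpu` struct with a tiny big-endian RAM array; every name in it is made up for the example. Format B is always pushed, and the extra Format A fields are added on top only when requested, so the final layout matches the two tables.

```c
#include <stdint.h>

/* Hypothetical emulator state; names are illustrative, not from MØBius. */
typedef struct {
    uint32_t sp;          /* stack pointer, used here as an offset into ram */
    uint32_t pc;
    uint16_t sr;
    uint16_t ir;          /* instruction register */
    uint8_t  ram[256];    /* tiny emulated RAM, big-endian like the 68K */
} Cpu;

static void push16(Cpu *c, uint16_t v)
{
    c->sp -= 2;
    c->ram[c->sp]     = (uint8_t)(v >> 8);
    c->ram[c->sp + 1] = (uint8_t)v;
}

static void push32(Cpu *c, uint32_t v)
{
    push16(c, (uint16_t)v);           /* low word ends up at the higher address */
    push16(c, (uint16_t)(v >> 16));
}

/* Pushes PC and SR (Format B); if format_a is non-zero, also pushes IR,
   access address and flags (Format A). Returns the frame size in bytes. */
static uint32_t push_exception_frame(Cpu *c, int format_a,
                                     uint16_t flags, uint32_t access_addr)
{
    uint32_t start = c->sp;
    push32(c, c->pc);
    push16(c, c->sr);
    if (format_a) {
        push16(c, c->ir);
        push32(c, access_addr);
        push16(c, flags);
    }
    return start - c->sp;
}
```

The sketch only builds the frame (6 bytes for Format B, 14 for Format A). A full exception entry would also update SR and load the new PC from the vector table, which is left out here.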
An ILLEGAL instruction uses Format B. On a 68000 it takes 34 cycles, equivalent to 654 cycles on an RP2040 at 125MHz. My exception routine currently takes 100 cycles for ILLEGAL, which makes it 6.54x faster than a real Mac.
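For anyone who wants to check the arithmetic, here it is as a small sketch. The three constants are the figures quoted in the post; the helper names are made up for the example.

```c
/* Figures taken from the post; the helper names are illustrative. */
#define M68K_ILLEGAL_CYCLES   34.0   /* 68000 cycles for the ILLEGAL exception */
#define HOST_EQUIV_CYCLES    654.0   /* the same time span in RP2040 cycles at 125MHz */
#define HOST_ROUTINE_CYCLES  100.0   /* current emulator exception routine */

/* RP2040 cycles available per emulated 68K cycle: 654 / 34, about 19.2. */
static double host_cycles_per_68k_cycle(void)
{
    return HOST_EQUIV_CYCLES / M68K_ILLEGAL_CYCLES;
}

/* Speed-up over the real machine: 654 / 100 = 6.54. */
static double speedup_vs_real_mac(void)
{
    return HOST_EQUIV_CYCLES / HOST_ROUTINE_CYCLES;
}

/* Cost of the emulated exception expressed in 68K cycles: about 5.2. */
static double routine_cost_in_68k_cycles(void)
{
    return HOST_ROUTINE_CYCLES / host_cycles_per_68k_cycle();
}
```

In other words, the emulated exception costs roughly 5 equivalent 68K cycles instead of 34.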
What does this mean? It means that whereas there's quite a significant overhead for taking exceptions on a real Mac (added to the cost of the trap dispatch), on the emulator there's a relatively low cost, which also applies to 'A' line and 'F' line traps. And this means that ToolBox calls, particularly fairly trivial ones like Rect calculations, are relatively much faster.
So, this also provides an insight into the Davidian PowerMac 68K emulator, because Gary Davidian's emulator would have found exactly the same thing: generating an exception frame in PowerPC code would have been relatively fast compared with normal instruction emulation. On top of that, the Davidian emulator emulates the Mac ROM's 'A' line trap dispatch entirely in PowerPC code too. So, instead of perhaps a dozen or more (other people here know) 68K instructions, each individually emulated, dispatching to the correct ToolBox function, an 'A' line instruction performs both the hardware operation of the trap and the dispatch entirely in native PPC.
So, it'll be relatively faster.
Also when the number of regs differs. You mentioned earlier that you use global registers, but I’m not sure I understood - does that mean that A0-7 and D0-7 each get stored in a (cortex) memory location? Or do you map certain 68k regs to cortex regs?
Partly. The first issue, which is also true for any 32-bit ARM emulation of a 68000, is that ARM doesn't have enough built-in registers to hold D0-7 and A0-7. So the emulator defines a Context data structure containing all the regs, and when the emulator needs to access one of them it must perform an LDR rd,[rContext,#RegOffset] to load the reg; then some ALU stuff; then an STR rd,[rContext,#RegOffset] to store it at the end.
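In C terms, the load / ALU / store pattern looks roughly like this. It is only a sketch: the `CpuContext` layout and the function name are hypothetical, and condition-code handling is omitted.

```c
#include <stdint.h>

/* Hypothetical layout; the real CpuContext is not shown in this thread. */
typedef struct {
    uint32_t d[8];   /* D0-D7 */
    uint32_t a[8];   /* A0-A7 */
} CpuContext;

/* Emulates ADD.L Dsrc,Ddst (condition codes omitted for brevity). */
static void add_l_dn_dn(CpuContext *ctx, unsigned src, unsigned dst)
{
    uint32_t s = ctx->d[src & 7];   /* LDR from the context */
    uint32_t t = ctx->d[dst & 7];   /* LDR from the context */
    t += s;                         /* the ALU stuff */
    ctx->d[dst & 7] = t;            /* STR back to the context */
}
```

Every emulated register access becomes a memory access on the host, which is the overhead being described.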
And this is what the Cyclone 68K emulator does explicitly, because Cyclone is written in 32-bit ARM code. However, a number of critical 68K registers can be globally allocated to ARM registers. For example, MØBius keeps the Instruction Register, the CCR (not SR), the PC and the pointer to the CpuContext (which holds the 68K regs) all in Cortex M0 registers. These are the ones most frequently used, i.e. on pretty much every instruction. PC isn't even the real 68K PC: it's pre-mapped to Cortex memory, either somewhere in Flash when executing ROM code or in Cortex RAM when executing RAM code. That way I can pick up the next instruction with just an LDRH rather than having to translate the logical value of PC to a physical Cortex location.
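A rough C sketch of the pre-mapped PC idea follows. The `FetchState` struct and both functions are hypothetical names for illustration. It also assumes the emulated memory image is stored as 16-bit words already in host byte order, which is an assumption of the sketch rather than something stated about MØBius.

```c
#include <stdint.h>

/* Assumption: emulated memory is kept as 16-bit words already in host byte
   order, so a plain halfword load returns the opcode. */
typedef struct {
    const uint16_t *pc;        /* host pointer, not the 68K logical PC */
    const uint16_t *base;      /* host address where the current region starts */
    uint32_t        base_68k;  /* 68K logical address of that region */
} FetchState;

/* Fetch the next opcode: a halfword load plus a pointer bump on the host. */
static uint16_t fetch_opcode(FetchState *f)
{
    return *f->pc++;
}

/* The logical PC is only reconstructed when it is actually needed,
   e.g. when building an exception stack frame. */
static uint32_t logical_pc(const FetchState *f)
{
    return f->base_68k + (uint32_t)((f->pc - f->base) * 2);
}
```

The point of the design is that the common case (fetch) needs no address translation, while the rare case (something asks for the real PC) pays for a subtraction and an add.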
But a 'C' emulator has additional problems, because the developer might not be able to get the compiler to allocate global registers, and even if they can, the allocation will be more restricted, because compilers need more temporary host registers to compile efficiently. So, probably on Musashi, there are no Cortex registers globally allocated to 68K registers, and again that'll slow it down with lots of LDR and STR instructions to and from the Cpu Context.
Just going back to the instruction backlog issue with MIDI: this was a problem with real hardware as well, which is why MIDI includes spec for external clock and clock sync. Sometimes of course, those aren't an option, which is where cycle exact emulation shines.
Sure, but unless Cycle-exact emulation is different to Cycle-accurate emulation, then it still won't fix that issue. Cycle-accurate emulation counts the right number of emulated CPU cycles so that the emulated CPU executes the same number of cycles as the real CPU would per time slice.
But it still executes the corresponding instructions as a block before performing its other emulation tasks (i.e. the IO stuff and video etc). Then at the beginning of the next time slice it synchronises real-time with the number of cycles emulated in the previous time slice.
So, if the emulated CPU picks up MIDI clock events, then it partly solves it. The CPU knows when to send the MIDI events on playback, but its sense of time is still compressed, so either the events still get sent in a fraction of the time (even though the CPU thinks it's sending them at the right time), or they're sent to a buffer which then handles the I/O after the emulated CPU time slice and then the events still come out as a block. This is what I saw with miniVMac and PCE-Mac with cycle-accurate emulation.
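To put numbers on the compression, here is an illustrative model only, not how Mini vMac or PCE actually schedule things. It assumes each slice's instructions run back to back at `speedup` times real 68K speed from the start of the slice and that an event is output the moment the emulated CPU sends it. The function and parameter names are made up.

```c
#include <stdint.h>

/* Illustrative model: returns the real time (ns) at which an event leaves
   the emulator, given the emulated time (ns) at which the CPU sent it. */
static uint64_t real_time_of_event_ns(uint64_t emulated_ns,
                                      uint64_t slice_ns,
                                      uint64_t speedup)
{
    uint64_t slice_start = (emulated_ns / slice_ns) * slice_ns;
    uint64_t offset      = emulated_ns % slice_ns;
    return slice_start + offset / speedup;   /* offset within the slice is compressed */
}
```

With a hypothetical 16ms slice and a 20x host, events the CPU sends at 4ms and 12ms of emulated time come out at 0.2ms and 0.6ms of real time: 8ms apart as far as the CPU is concerned, 0.4ms apart on the wire.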
Conversely, MØBius isn't cycle-accurate at all, but it will be able to correctly run MIDI software in accurate real-time, because it has a core dedicated to running the CPU emulation.
But maybe cycle-exact as you say is also a core dedicated to CPU emulation and does:
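C:
#include <stdint.h>
extern uint64_t nanos(void);     // Assumed helper: monotonic real time in ns.
extern uint32_t CpuExe(int n);   // Assumed helper: execute n instructions, return their cycle time in ns.
void CpuTask(void)
{
    static uint64_t deadline;    // Real time at which the emulated CPU should be "now".
    if (deadline == 0)
        deadline = nanos();      // First call: start from the current real time.
    deadline += CpuExe(1);       // Execute one instruction and add its cycle time in ns.
    while (nanos() < deadline)
        ;                        // Stall until real-time reached.
    // Don't perform any IO or video - that's on a different Core.
}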
Yes, then it would work.
I really like the idea of an emulator where one thread is used to manage each hardware implementation, and another thread is used as the IO monitor, such that synchronization is managed by a central loop that doesn't reflect a part of the original hardware. This also makes it much easier to toss in another piece of hardware or tweak a component after the fact, without having to deal with timing issues. But for this to work, you need to ensure that your loop is cycling faster than the fastest expected combination of IO signals, so you're always assured you're back at the next cycle for any piece of hardware before you get a collision.
Indeed. That sounds very clever, and challenging unless you have a really fast, modern CPU. It might sound like I'm trivialising it by only remarking on it briefly, but in reality I'm too mind-blown by the concept to comment on it until I learn more about it. It sounds incredible though, and thanks for introducing me to it.
MØBius is going to be fairly fortunate: because original Mac hardware is so crude, the second core won't need to do much synchronisation ... he says naively! For example, I think I can emulate the VIA using the RP2040's own timers. It should be possible to map serial functions directly. Video will be a background task using DMA, with the second core handling some of the critical stuff to restart video frames (this is pretty standard). Audio is trickier: a really accurate emulation would need 15kHz interrupts on the second core so that Mac software that races the beam to fill in audio samples would still work, but I think I'll just copy the buffer on every frame and then play that out using DMA. Mouse and keyboard will be USB translated to the original Mac protocol, which doesn't need µs-accurate timing (and can't have it, because you won't get a USB mouse XY change faster than every ms). I can probably use pico-mac's code. Disk emulation won't emulate the IWM, but will use the fake Sony driver concept to access a Flash file system. I should be able to support about 1.2MB to 1.6MB even on a basic Pico's internal Flash (one 800kB disk and one 400kB disk). SD support will be an upgrade, but perhaps I can port that code from the existing pico-mac.