More pico-mac stuff

Really interesting how the cost function changes when your host clock is 20x faster than the emulated cpu.
Also when the number of regs differs. You mentioned earlier that you use global registers, but I’m not sure I understood - does that mean that A0-7 and D0-7 each get stored in a (cortex) memory location? Or do you map certain 68k regs to cortex regs?
 

adespoton

Well-known member
Just going back to the instruction backlog issue with MIDI: this was a problem with real hardware as well, which is why MIDI includes a spec for external clock and clock sync. Sometimes of course, those aren't an option, which is where cycle-exact emulation shines.

I really like the idea of an emulator where one thread is used to manage each hardware implementation, and another thread is used as the IO monitor, such that synchronization is managed by a central loop that doesn't reflect a part of the original hardware. This also makes it much easier to toss in another piece of hardware or tweak a component after the fact, without having to deal with timing issues. But for this to work, you need to ensure that your loop is cycling faster than the fastest expected combination of IO signals, so you're always assured you're back at the next cycle for any piece of hardware before you get a collision.
 

Snial

Well-known member
Really interesting how the cost function changes when your host clock is 20x faster than the emulated cpu.
Yes and that's today's topic (I think, assuming I know what you mean by cost function)!

Today I did something ILLEGAL!

Yes, I implemented the ILLEGAL instruction!!! In many ways, ILLEGAL is pretty simple, because it doesn't need to do much computation, it just causes an exception. But that meant I needed to implement the beginning of exception handling and in so doing I learned something about the Davidian PowerMac 68K emulator. I know that Exception emulation is going to be much faster than the way a real 68K would generate an exception stack frame, because the Emulator can store a word on a stack in the equivalent of 0.1 68K cycles.

So, that means I can use a single routine to generate both kinds of exception stack frames:

Format A

Offset from SP | Size     | Value
+0             | Word     | Flags word: bits 15:5 undefined; bit 4 = 0:Write, 1:Read; bit 3 = 0:Instruction; bits 2:0 = Function code
+2             | Longword | Access address
+6             | Word     | IR
+8             | Word     | SR
+10            | Longword | PC

Format B

Offset from SP | Size     | Value
+0             | Word     | SR
+2             | Longword | PC

An ILLEGAL instruction is Format B. On a 68000 it takes 34 cycles, equivalent to 654 cycles on an RP2040 at 125MHz. So, my Exception routine currently takes 100 cycles for ILLEGAL, which is 6.54x faster than a Real Mac.
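In C terms, the Format B push boils down to something like this (a sketch with hypothetical helper names - the real routine is Cortex M0+ assembly, and I'm ignoring the SSP/USP switch and trace-bit clearing a real 68000 also does):

C:
// Sketch of taking a Format B exception: push SR and PC, then fetch the
// handler from the vector table. Helper and field names are hypothetical.
static void TakeExceptionB(Cpu68k *cpu, unsigned vector)
{
    uint16_t oldSr = cpu->sr;
    cpu->sr |= SR_S;                   // Exceptions run in Supervisor mode.
    cpu->a[7] -= 6;                    // Room for a word SR + longword PC.
    WriteWord(cpu->a[7] + 0, oldSr);   // +0: SR
    WriteLong(cpu->a[7] + 2, cpu->pc); // +2: PC
    cpu->pc = ReadLong(vector * 4);    // Handler address from the vector table.
}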

What does this mean? It means that whereas there's quite a significant overhead for taking exceptions on a real Mac (added to the cost of the trap dispatch), on the emulator there's a relatively low cost, which also applies to 'A'-line and 'F'-line traps. And this means that ToolBox calls, particularly fairly trivial ones like Rect calculations, are relatively much faster.

So, this also provides an insight into the Davidian PowerMac 68K emulator, because Gary Davidian's emulator would have found exactly the same thing: generating an exception frame in PowerPC code would have been relatively fast compared with normal instruction emulation. However, on top of that, the Davidian emulator emulates the Mac ROM's 'A'-line trap dispatch entirely in PowerPC code too. So, instead of perhaps a dozen or more (others here will know) 68K instructions dispatching to the correct ToolBox function, each of which would be individually emulated, an 'A'-line instruction performs both the hardware operation of the trap and the dispatch entirely in native PPC.

So, it'll be relatively faster.

Also when the number of regs differs. You mentioned earlier that you use global registers, but I’m not sure I understood - does that mean that A0-7 and D0-7 each get stored in a (cortex) memory location? Or do you map certain 68k regs to cortex regs?
Partly. The first issue, which is also true for any 32-bit ARM emulation of a 68000, is that ARM doesn't have enough spare registers to hold D0-7 and A0-7. So the emulator defines a Context data structure containing all the regs, and when the emulator needs to access one of them it must perform an LDR rd,[rContext,#RegOffset] to load the reg; then some ALU stuff; then an STR rd,[rContext,#RegOffset] to store it at the end.
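For example, emulating ADD.L D1,D0 ends up looking something like this (a sketch; the register names and offsets are hypothetical):

Code:
; Sketch of ADD.L D1,D0 via the context structure (rContext points at
; the CpuContext; D0_OFF/D1_OFF are hypothetical offsets).
    ldr r0,[rContext,#D0_OFF]  ; Load 68K D0 from memory.
    ldr r1,[rContext,#D1_OFF]  ; Load 68K D1.
    adds r0,r0,r1              ; The actual ALU work (sets NZCV too).
    str r0,[rContext,#D0_OFF]  ; Store D0 back.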

And this is what the Cyclone 68K emulator does explicitly, because Cyclone is written in 32-bit ARM code. However, a number of critical 68K registers can be globally allocated to ARM registers. For example, MØBius allocates the Instruction Register, the CCR (not the full SR), the PC and the pointer to the CpuContext holding the 68K regs, all in Cortex M0 registers. These are the ones most frequently used, i.e. on pretty much every instruction. PC isn't even the real 68K PC - it's pre-mapped to Cortex memory, either somewhere in Flash when executing ROM code or Cortex RAM when executing RAM code. That way I can pick up the next instruction just with an LDRH rather than having to translate the logical value of PC to a physical Cortex location.

But a 'C' emulator has additional problems, because the developer might not be able to get the compiler to allocate global registers, and even if it did, it'll be more restricted, because the compiler needs more temporary host registers to compile efficiently. So, probably on Musashi(?), there are no Cortex registers globally allocated to 68K registers, and again, that'll slow it down with lots of LDR, STR instructions to and from the Cpu Context.
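For what it's worth, GCC does have global register variables that could claw some of this back (a sketch of the extension; I don't know whether Musashi or any other C core actually uses it):

C:
/* GCC extension: pin hot emulator state to fixed Cortex registers for the
   whole program (build with -ffixed-r6 -ffixed-r7 so the compiler never
   allocates them elsewhere). Names are mine, not from any real core. */
#include <stdint.h>

register uint32_t m68k_pc  asm("r6"); /* Pre-mapped PC, as in MØBius. */
register uint32_t m68k_ccr asm("r7"); /* Flags, e.g. in ARM N:Z:C:V order. */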
Just going back to the instruction backlog issue with MIDI: this was a problem with real hardware as well, which is why MIDI includes a spec for external clock and clock sync. Sometimes of course, those aren't an option, which is where cycle-exact emulation shines.
Sure, but unless cycle-exact emulation is different from cycle-accurate emulation, it still won't fix that issue. Cycle-accurate emulation counts the right number of emulated CPU cycles so that the emulated CPU executes the same number of cycles as the real CPU would per time slice.

But it still executes the corresponding instructions as a block before performing its other emulation tasks (i.e. the IO stuff and video etc). Then at the beginning of the next time slice it synchronises real-time with the number of cycles emulated in the previous time slice.

So, if the emulated CPU picks up MIDI clock events, then it partly solves it. The CPU knows when to send the MIDI events on playback, but its sense of time is still compressed, so either the events still get sent in a fraction of the time (even though the CPU thinks it's sending them at the right time), or they're sent to a buffer which then handles the I/O after the emulated CPU time slice and then the events still come out as a block. This is what I saw with miniVMac and PCE-Mac with cycle-accurate emulation.

Conversely, MØBius isn't cycle-accurate at all, but it will be able to correctly run MIDI software in accurate real-time, because it has a core dedicated to running the CPU emulation.

But maybe cycle-exact, as you say, means a core dedicated to CPU emulation that does something like:

C:
void CpuTask(void)
{
    uint64_t ns = 0; // Running real-time target in nanoseconds.
    for (;;) {
        ns += CpuExe(1); // Execute one instruction; returns its cycle time in ns.
        while (nanos() < ns)
            ; // Stall until real-time reached.
        // Don't perform any IO or video - that's on a different core.
    }
}

Yes, then it would work.

I really like the idea of an emulator where one thread is used to manage each hardware implementation, and another thread is used as the IO monitor, such that synchronization is managed by a central loop that doesn't reflect a part of the original hardware. This also makes it much easier to toss in another piece of hardware or tweak a component after the fact, without having to deal with timing issues. But for this to work, you need to ensure that your loop is cycling faster than the fastest expected combination of IO signals, so you're always assured you're back at the next cycle for any piece of hardware before you get a collision.
Indeed. That sounds very clever - and challenging unless you have a really fast, modern CPU. It might sound like I'm trivialising it by only remarking on it briefly, but in reality I'm too mind-blown by the concept to comment on it until I learn more about it. It sounds incredible though, and thanks for introducing me to it.

MØBius is going to be fairly fortunate, because original Mac hardware is so crude that the second core won't need to do much synchronisation ... he says naively! For example:
  • VIA: I think I can emulate it using the RP2040's own timers.
  • Serial: it should be possible to map the serial functions directly.
  • Video: a background task using DMA, with the second core handling some of the critical stuff to restart video frames (this is pretty standard).
  • Audio: a really accurate emulation would need 15kHz interrupts on the second core so that Mac software that races the beam to fill in audio samples would still work, but I think I'll just copy the buffer on every frame and then play that out using DMA.
  • Mouse and keyboard: USB to the original Mac protocol, but it doesn't need µs-accurate timing (and it can't have it, because you won't get a USB mouse XY change faster than every ms). I can probably use pico-mac's code.
  • Disk: this won't emulate the IWM, but will use the fake Sony driver concept to access a Flash file system. I should be able to support about 1.2MB to 1.6MB even on a basic Pico's internal Flash (one 800kB disk and one 400kB disk). SD support will be an upgrade, but perhaps I can port that code from the existing pico-mac.
 

adespoton

Well-known member
A really accurate emulation would need 15kHz interrupts on the second core so that Mac software that races the beam to fill in audio samples would still work, but I think I'll just copy the buffer on every frame and then play that out using DMA.
IIRC (fuzzy memory, it was a long time ago), I think this is what Paul eventually ended up doing for Mac II audio in Mini vMac. It works well enough for everything I've thrown at it! He spent years trying to figure out how to get the Mac II audio to sync properly, and eventually just settled on frame-by-frame, ignoring the random stutter this can theoretically produce. I do also remember him having a mild issue with the buffer he was filling, but I don't think that was an implementation issue, just an "oops, forgot about that blocking event!" issue.
 

Snial

Well-known member
IIRC (fuzzy memory, it was a long time ago), I think this is what Paul eventually ended up doing for Mac II audio in Mini vMac. It works well enough for everything I've thrown at it! He spent years trying to figure out how to get the Mac II audio to sync properly, and eventually just settled on frame-by-frame, ignoring the random stutter this can theoretically produce. I do also remember him having a mild issue with the buffer he was filling, but I don't think that was an implementation issue, just an "oops, forgot about that blocking event!" issue.
It's tricky. Theoretically, one of the nice things about having direct access to hardware in a bare-metal implementation is that accurate audio could be done without tying up a core at all. Here, you get a 15kHz timer to trigger a DMA channel which copies the next pair of bytes to the audio. Then a PIO plays every other byte (because one of the bytes is used as a PWM controlling the IWM's disk rotation speed). But just copying the entire buffer in one go and playing it out at the right rate will be fine for most purposes, and simpler.

Today my big achievement was *NOTHING*! Yes, I finally implemented NOP! In the meantime I've implemented TAS, TST and TRAP.

MØBius has a 256-entry vector table indexed by the upper byte of an opcode, but many 68000 opcodes encode multiple instructions within the low byte of a single vector. I could create more dispatch tables, but each 256-entry table would add another 1kB to the emulator, and such tables would be very sparse.
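The top-level dispatch is then something like this (a sketch; the register names are hypothetical, and the table entries need their Thumb bit set):

Code:
; Sketch of the 256-entry top-level dispatch on the opcode's upper byte.
CpuNext:
    ldrh rOpcode,[rPc]     ; PC is pre-mapped, so this fetches the 68K opcode.
    adds rPc,rPc,#2
    lsrs rd,rOpcode,#8     ; The upper byte selects the vector...
    lsls rd,rd,#2          ; ...x4 bytes per handler address.
    ldr rd,[rVecTable,rd]
    bx rd                  ; Jump to the handler.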

So, in these cases, I perform a binary subdivision to decode the rest of the instructions. In the worst case, that would seem to require up to 128 bit tests and 128 branches, but that's still half the space of another table; and since, as I just wrote, the low-byte decodes don't contain many instructions, in practice it doesn't take many tests and branches.

An ARM Cortex M0+ doesn't have bit-test instructions with an immediate bit number (unlike the 68000's BTST instruction), so the easiest way to perform a bit test is to shift the bit of interest to, e.g., the most significant bit and then branch on whether the result is negative:

Code:
    lsls rd,rOpcode,#(31-9) ; Puts bit 9 into bit 31 (and the N flag).
    bmi OpBit9Set
OpBit9Reset:
    ;An instruction to emulate or more decoding.
    bx rNext
OpBit9Set:
    ;An instruction to emulate or more decoding.
    bx rNext

In fact though we can do better than a simplistic binary search, by making use of both the carry and negative flags to decode 2 bits with one shift:
Code:
    lsls rd,rOpcode,#(31-8) ; Bits<9:8> go to Carry:N.
    bcs Op1xPattern         ; 1x pattern.
    bmi Op01Pattern         ; 01 pattern.
Op00Pattern: ;3 cycle decode.
    ;An instruction to emulate
    bx rNext
Op01Pattern: ;4 cycle decode.
    ;An instruction to emulate
    bx rNext
Op1xPattern:
    bmi Op11Pattern         ; N still holds bit 8.
Op10Pattern: ;4 cycle decode.
    ;An instruction to emulate
    bx rNext
Op11Pattern: ;5 cycle decode.
    ;An instruction to emulate
    bx rNext

Thus we can decode 2 bits at a time, taking from 3p to 5p cycles (where p is the number of bit pairs). This will be quicker than vectoring for up to 4 or 5 bits and hardly slower for up to 8 bits.

For the $4Exx block that includes NOP: {TRAP, LINK, UNLK, MOVE USP, RESET, NOP, STOP, RTE, RTS, TRAPV, RTR, JSR, JMP} there are a lot of instructions to decode that way, but it's still going to be easily efficient enough for this kind of decoding.

On a full, original, 32-bit ARM instruction set you can use the same technique to perform, e.g. bit counts faster than trying to generate AND masks and then branching based on them.

Again, this kind of testing would be less efficient in 'C', because you can't do a 4-way branch (though you could do a simple binary subdivision).
 

Snial

Well-known member
I hope people aren't finding it irritating that I'm providing a near daily update for MØBius.

Today was a bit of a show-STOPper!! Yes, I implemented STOP along with {LINK, UNLK, MOVE USP, RESET, STOP, RTE, RTS, TRAPV, RTR}.

That's a lot of ops in one day, but most of them are closely related, because they either involve pushing or popping from SP; accessing SP or USP; trapping on various conditions (e.g. privilege mode) or switching from Supervisor to User mode. This means that for this block of opcodes, only JMP and JSR remain (and they're basically PEA/LEA type instructions).

STOP correctly implements the trace flag, but I think I need to update RTE to support Trace. MØBius doesn't and won't need to check for interrupts, exceptions or trace on every instruction fetch, which helps to speed things up. This is because:
  • Real Cortex M0+ interrupts will generate interrupts on MØBius - by changing the Next Reg's value (MØBius dispatches each instruction by executing bx regNext, so by changing the value of that register we can make the CPU take an exception or interrupt at the right point - i.e. after the end of the currently emulated instruction). This is similar to how a number of Forths implement execution tracing without any performance overhead.
  • In other cases, exceptions, trace and switching between Supervisor and User mode only matters when there's a state change to SR.
Here's an interesting question: we know the early Systems and 68K Mac ROMs use the upper 8 bits of 24-bit addresses for various memory-management flags. Do they also use CCR<7:5>, which as I understand it have no official use on an M68000? I would hope not, because it's so easy to trash those bits using a MOVE <ea>,CCR instruction... and on early 68K ROMs, the Mac runs in Supervisor mode anyway <facepalm>, so even MOVE <ea>,SR is allowed!

It would still be OK if 68K Macs did.
 

adespoton

Well-known member
Personally, I'm enjoying the daily updates :) Keep them coming!

I don't know if CCR<7:5> is used at all, but I don't see any reason why they would have done so. As you say, early Macs were running in Supervisor mode, and the OS and Frankensteinian ROM then had to account for that in later revisions. And by the time they would have been thinking to refactor or rewrite that logic, PPC and Copland were already on the horizon.
 
Here, you get a 15kHz timer to trigger a DMA channel which copies the next pair of bytes to the audio. Then a PIO plays every other byte (because one of the bytes is used as a PWM controlling the IWM's disk rotation speed). But just copying the entire buffer in one go and playing it out at the right rate will be fine for most purposes, and simpler.

I think you can set up a DMA to fill the PIO fifo at full speed, and then set the PIO to run at 15kHz. The DMA can be paced by whether the fifo has space, so you could initiate it and it would keep feeding the PIO as needed. Which is pretty much exactly what you said, except that “copying the entire buffer” can be a DMA.
How large is the audio buffer? Do programs fill that buffer and then enable audio, or do they trickle the data in? I think you would need to know the number of words to transfer via DMA, but I guess you could periodically check if there was more data in the buffer and restart the DMA.
 

Snial

Well-known member
I think you can set up a DMA to fill the PIO fifo at full speed, and then set the PIO to run at 15kHz.
I think that's what I intend to do initially.
The DMA can be paced by whether the fifo has space, so you could initiate it and it would keep feeding the PIO as needed.
The PIO buffer FIFO would essentially achieve the same thing as a proper 15kHz trigger for the DMA, so that's a good point.
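Something like this, I imagine (a sketch against the Pico SDK; sound_pio, sound_sm and sound_buf are placeholders, not pico-mac's actual names):

C:
#include "hardware/dma.h"
#include "hardware/pio.h"

// Sketch: pace a DMA channel off the PIO TX FIFO's DREQ, so it only
// refills the FIFO as the (15kHz-clocked) state machine drains it.
void start_audio_dma(PIO sound_pio, uint sound_sm,
                     const uint16_t sound_buf[], uint count)
{
    int ch = dma_claim_unused_channel(true);
    dma_channel_config c = dma_channel_get_default_config(ch);
    channel_config_set_transfer_data_size(&c, DMA_SIZE_16);
    channel_config_set_read_increment(&c, true);   // Walk the sound buffer.
    channel_config_set_write_increment(&c, false); // Always the TX FIFO.
    channel_config_set_dreq(&c, pio_get_dreq(sound_pio, sound_sm, true));
    dma_channel_configure(ch, &c,
                          &sound_pio->txf[sound_sm], // Write: PIO TX FIFO.
                          sound_buf,                 // Read: audio buffer.
                          count,                     // e.g. 370 words/frame.
                          true);                     // Start immediately.
}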
<snip> How large is the audio buffer? Do programs fill that buffer and then enable audio
On an early Mac, there's a (I think) 370 x 16-bit word buffer and the two bytes of each word are output to a pair of PWMs, one for the IWM drive speed and the other for audio (the PWM + filter is a cheap DAC). So, audio is synchronised to the video scan (as it would be for UHF video on a TV) and the audio is basically enabled all the time (silence is a buffer full of 0x80).


(search for: "THE SOUND GENERATOR".. "The sound circuitry scans the sound buffer at a fixed rate of 370 words per
video frame, repeating the full cycle 60.15 times per second")
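In code terms, each buffer word splits something like this (a sketch; the names are mine, and I believe the high byte is the sound sample):

C:
// Sketch: splitting one 370-word sound-buffer entry into its two PWM bytes.
uint16_t w = sound_buf[i];
uint8_t sample    = w >> 8;   // High byte: audio sample (0x80 = silence).
uint8_t iwm_speed = w & 0xFF; // Low byte: PWM value for the IWM drive speed.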
I think you would need to know the number of words to transfer via DMA, but I guess you could periodically check if there was more data in the buffer and restart the DMA
I was / am working on a Mac Plus-era application I call Page-Aaah (a pun on the Fairlight CMI's Page-R) which would generate 4-channel sampled audio on the fly by writing to the buffer directly, so I'd certainly hit that problem. Racing the beam with the audio gives a bit of extra time to generate the audio, so it can make sense.

Let's consider what happens if audio generation takes 75% of the CPU. If you start generating samples at the beginning of a video frame (kicked off by the VBL interrupt) then you'll gradually overtake the scan-line generation. That's good, because it's a disaster if you fall behind. By the end of the buffer, the scans will only be 75% of the way through. This gives you 25% of a frame left for other tasks (equivalent to a 1.65MHz 68000 - not much).

But that's quite tight, and if your VBL is delayed a bit you'll get glitches at the beginning. So, instead it's probably better to start audio generation half-way through the buffer at the beginning of a VBL. Then, the hardware could delay you by up to half a frame; it wouldn't glitch and you'd still be ahead of the beam. By the time you wrap round to the beginning (for the 'next' frame of audio), the scan is 37.5% through the buffer; by the time it's 50% through the buffer you're only 25% through the new buffer. Finally, by the time you've finished the next half of the buffer, the scan is 75% through buffer.

So, then it still takes the same amount of time, but you've allowed for more latency (e.g. mouse interrupts or other timer/housekeeping stuff in a VBL).

Oh, I should add another note to that: the Linux document (and the early Mac hardware documents) explains that there are two sound buffers, just as there are two video frame buffers, and you can use VIA port outputs to switch between them. Later Macs added much better audio buffering that wasn't synchronised to what would be the Mac's 512x342 screen (which doesn't exist once Macs had higher-resolution/colour video) and I think at least one Compact Mac design didn't have the second video buffer either. So, in my description above, I'm just assuming use of the single, main audio buffer. There's a discussion about this on the 68KMLA, I think perhaps w.r.t. the KABOOM game - it's all about avoiding audio pops on an early Mac.


(Did @jkheiser finish KABOOM?).
 

Snial

Well-known member
We need a countdown! What is it, about 25 instructions left?
Good point too! There are 36 opcodes left, and between 25 and 32 instructions left after today's update! So, very good guess!
Personally, I'm enjoying the daily updates :) Keep them coming!
Thanks, that's encouraging!
I don't know if CCR<7:5> is used at all, but I don't see any reason why they would have done so. As you say, early Macs were running in Supervisor mode, and the OS and Frankensteinian ROM then had to account for that in later revisions. And by the time they would have been thinking to refactor or rewrite that logic, PPC and Copland were already on the horizon.
That's what I think. On further reflection, I don't think it'll cost anything (or much) in the CPU to maintain those bits :) !

Today's update includes JSR and JMP (very similar), which fully completes the $4Exx instruction block. It's amazing how short the emulation of some instructions can be: both of them put together (they share code) take just 7 x 16-bit words, i.e. 14 bytes. This includes 3 subroutine calls, but because even a JMP takes 8 cycles there's plenty of time.

I've also completed MOVEM.W/L <ea>,regList. It's much shorter than MOVEM.W/L regList,<ea> (36 bytes). I can see I made a mistake in that my current code only handles Words and gets the source/destination wrong - other than that, it's great :ROFLMAO: ! So, I'll need to fix it.

Finally, I've implemented LEA, which is very much like JMP and PEA except that the <ea> value goes to an address reg. Again this is short, just 18 bytes long. If the other instructions were this compact (they won't be), the rest of the emulator would only add around 450 bytes!
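In sketch form, LEA is roughly this (hypothetical names; I'm assuming A0-A7 sit after D0-D7 in the CpuContext, and that the shared <ea> resolver returns its result in rEa):

Code:
; Sketch of LEA: resolve the <ea>, then store it to An in the CpuContext.
OpLea:
    bl CpuCalcEa           ; Shared <ea> resolver (also used by JMP/PEA).
    lsls rd,rOpcode,#20    ; Opcode bits 11:9 = the An field...
    lsrs rd,rd,#29         ; ...now in bits 2:0.
    adds rd,rd,#8          ; A regs follow the 8 D regs in the context.
    lsls rd,rd,#2          ; 4 bytes per 32-bit reg.
    str rEa,[rContext,rd]  ; Write the effective address to An.
    bx rNext               ; Dispatch the next instruction.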

This brings me to my next topic:

XIP Cache Performance

Originally I aimed to get MØBius into 16kB of Flash so that it would fit in the RP2040's XIP cache. Increasingly it looks like the emulator will fit in 8kB, just half the RP2040's cache! To me, that's fairly mind-blowing, given that we're used to a 68K emulator being fairly hefty, with the vector table alone taking up 256kB!

You might think there's nothing to be gained once the emulator is <16kB, but that's not true, because the XIP cache will have to be shared between the emulator code, Mac ROM code and disk blocks (unless a disk block is DMA'd directly from flash to RAM using the replacement Sony driver, but that sounds complex to me). How well will it handle that sharing?

The RP2040's 16kB XIP cache is two-way set-associative with 8 bytes per cache line. Every cache access is checked rather like a page-table lookup: the address bits within the cache line pass straight through; the next few address bits (up to the size of a set) index a line in the cache; the address bits above those are compared against that line's stored tag to decide whether the access is a hit or a miss; and the remaining top bits are effectively hard-coded (only part of the address range is cached, the rest is non-cacheable).

If the RP2040 XIP cache were one-way set-associative (i.e. direct-mapped), the layout would be:


Range                    | Tag                                      | Index                                 | Line
0x10000000 .. 0x10FFFFFF | A23:A22:A21:A20:A19:A18:A17:A16:A15:A14 | A13:A12:A11:A10:A9:A8:A7:A6:A5:A4:A3 | A2:A1:A0

The 11 address bits A<13:3> would index a cache line and if that cache line's tag matched the address bits A<23:14>, there would be a hit, otherwise there's a miss and the cache controller would load 8 bytes from flash and update that cache line's tag.

But the RP2040 XIP cache is two-way set associative, which means that there's only 10 cache line index bits, but the tag for each is 11 bits.

Range                    | Tag                                          | Set | Index                            | Line
0x10000000 .. 0x10FFFFFF | A23:A22:A21:A20:A19:A18:A17:A16:A15:A14:A13 | 0   | A12:A11:A10:A9:A8:A7:A6:A5:A4:A3 | A2:A1:A0
0x10000000 .. 0x10FFFFFF | A23:A22:A21:A20:A19:A18:A17:A16:A15:A14:A13 | 1   | A12:A11:A10:A9:A8:A7:A6:A5:A4:A3 | A2:A1:A0
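In C terms, the two-way lookup splits a flash address like this (a sketch mirroring the table above; the variable names are mine):

C:
// Sketch: how the two-way XIP cache decomposes a flash address.
uint32_t line_off = addr & 0x7;           // A2:A0  - byte within the 8-byte line.
uint32_t index    = (addr >> 3) & 0x3FF;  // A12:A3 - 1 of 1024 lines per set.
uint32_t tag      = (addr >> 13) & 0x7FF; // A23:A13 - compared in both sets.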

I am going somewhere with this! For every cache access there are now two possible locations for the address being cached, and on a miss, (probably) the cache line in the least-recently-used set gets replaced with the new data.

Now let's think about what happens when the emulator is basically 8kB. The emulator code (relative to whatever base address; 0x10000000 will do) will occupy 0x10000000 to 0x10001FFF, i.e. a range where A<12:0> goes from 0000000000000 to 1111111111111, which means it can occupy the whole of one set. And, unlike an emulator with a vector for all 64K opcodes, the 256 vectors in MØBius will ensure that that set has a high utilisation: much higher than any Mac ROM or Flash-disk fetches (because those addresses account for perhaps 10% or less of all accesses).

So, in turn it means that the Mac ROM and/or Flash disk will (for the most part) be cached in set 1 of the 16kB XIP cache and will function like an 8kB direct-mapped cache. And this means I can calculate the likely hit/miss rate for the Mac ROM!

To do that I'll reference the Bible of computer architecture: "Computer Architecture: A Quantitative Approach" by John L Hennessy and David A Patterson. Chapter 5 is on the memory hierarchy, where we find, in section 5.2, Figure 5.7:

Size | Instruction Cache | Data Cache | Unified Cache
8KB  | 1.10%             | 10.19%     | 4.57%

Because the XIP cache isn't a Harvard cache, it's intuitive to think it'll function as either a data cache (because the Mac ROM is accessed using Cortex M0+ ldr instructions) or a unified cache with a miss rate of 4.57%. But it won't, because the Mac ROM is a ROM of instructions: it'll function as an 8kB, direct-mapped instruction cache, just as a cached ROM would if the emulator were in fact a microcoded CPU, like a real 68000.

Now we can calculate access times for the ROM, because if they're terrible, they'll make a significant impact on the emulator's performance. On the XIP cache, hits take 1 cycle and misses take an awful 40 cycles. Yikes! But the miss rate is only 1.1%. This gives an average access time of 1c x 98.9/100 + 40c x 1.1/100 = 1.429 Cortex M0+ cycles.

An encouraging result: Mac ROM access can easily be tolerated. If 10% of emulator accesses are to the ROM (worst-case, i.e. when running a ToolBox trap), then each of those accesses costs an extra 0.429 cycles on average, an overall performance hit of 4.29%, leaving 95.9% of the ideal performance. :cool: !
 

Snial

Well-known member
Today's update and a curious CCR inconsistency!

A few new instructions: CHK, ADDQ, SUBQ, Scc, DBcc!

Most interpretive emulators have extensive flag-handling code that handles individual flags across a number of statements. For example PCE-mac's M68K emulator does:

C:
void e68_cc_set_add (e68000_t *c, unsigned d, unsigned s1, unsigned s2)
{
    uint16_t set = 0;
    d &= 1;
    s1 &= 1;
    s2 &= 1;
    if (d) {
        set |= E68_SR_N;
        if (s1 && s2)
            set |= E68_SR_C | E68_SR_X;
        if (!(s1 || s2))
            set |= E68_SR_V;
    }
    else {
        if (s1 || s2)
            set |= E68_SR_C | E68_SR_X;
        if (s1 && s2)
            set |= E68_SR_V;
    }
    c->sr &= ~(E68_SR_X | E68_SR_N | E68_SR_V | E68_SR_C);
    c->sr |= set;
}

Code like this makes sense for a portable implementation, because you can't rely on the underlying architectural features. It's not fast though. I want to avoid this, by using the Cortex M0+'s flag handling as much as possible. This also makes sense in MØBius, because the Cortex M0+ has a very similar set of flags: NZCV to the M68000's XNZVC. But note: C and V are the opposite way around.

Most of the time this doesn't matter, because most real code just performs an arithmetic, bitwise or compare operation and then takes a jump based on a meaningful condition such as EQ, NE, GT, LT, MI, PL, CS, CC etc., where it wouldn't make any difference what order they're in. As long as the flags have the same meaning, then simply recording the Cortex M0+'s flag results after the operation itself is good enough.
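For example, capturing the flags after an emulated ADD might look like this (a sketch, not MØBius's actual code; MRS is one of the few 32-bit instructions the M0+ supports):

Code:
    adds rD,rD,rS       ; The emulated op sets the real NZCV flags...
    mrs rCcr,APSR       ; ...capture them (NZCV live in bits 31:28)...
    lsrs rCcr,rCcr,#28  ; ...and keep N:Z:C:V as bits 3:0 of the CCR reg.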

And most of the flag results on a Cortex M0+ match an M68000. This means we can also optimise conditional evaluation for instructions such as Scc, DBcc, Bcc. Instead of analysing the individual bits for the relevant condition code, we can just vector on the M68000's condition field to 16 conditions that all look like:

Code:
;... earlier conditions 0000 to 1001
bpl CpuCcrTestTrue ; 1010=Pl
b CpuCcrTestFalse

bmi CpuCcrTestTrue ; 1011=Mi
b CpuCcrTestFalse

bge CpuCcrTestTrue ; 1100=Ge
;...later conditions 1101 to 1111

And that's very fast. But I found out it's not quite right! On an ARM CPU, the carry output is the same in many cases, except for subtracts and compares, where it's inverted. This has two consequences:
  • Firstly, BHI and BLS are evaluated differently. On the 68K, HI is ~C & ~Z, but on ARM it's C & ~Z. Likewise, LS is C|Z on the 68K, but ~C|Z on ARM. This is a consequence of ARM (like the 6502) performing a subtract a,b as cy:a+~b+1 rather than borrowing as in a-b.
  • A subtract instruction must invert the carry. However, this won't fix the problem above, because BHI and BLS would still not perform the correct HI and LS tests (they would if we didn't invert carry for subtracts, but then all the other cases where someone might use HI or LS conditions on a 68K would be wrong).
I fixed the former by forcing a short-circuit evaluation: for HI, seeing Carry=1 means it's false, otherwise it's an NE test. For LS, seeing Carry=1 means it's true, otherwise it's an EQ test. So, this costs a couple of cycles for those tests.

I fixed the latter by inverting bit 1 of the emulator's CCR reg (which is carry, because the reg is in ARM format). Interestingly, most of the other flags work the same way (e.g. N and V), so signed compares still work as normal even for subtracts.
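Putting the two fixes together, the short-circuited HI test looks something like this in sketch form (hypothetical code, reusing the labels from the condition-vector example above):

Code:
; Sketch: 68K HI = ~C & ~Z, with the CCR reg holding ARM-format N:Z:C:V
; in bits 3:0, carry (bit 1) already inverted for subtracts.
CpuCcHi:
    lsrs rd,rCcr,#2      ; Bit 1 (C) is the last bit shifted out -> Cortex carry.
    bcs CpuCcrTestFalse  ; Carry set: HI is false - short-circuit out.
    lsrs rd,rd,#1        ; Now shift out Z (originally bit 2).
    bcc CpuCcrTestTrue   ; C=0 and Z=0: HI is true.
    b CpuCcrTestFalse    ; C=0 but Z=1: false.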

31 opcodes, 20 to 26 instructions to go.

Conclusion: copying existing functionality from another CPU really can help with emulation performance, but subtle differences can be tricky!
 