Fantasy M88100 Macs

Snial · Mar 3, 2024

Let's imagine, for a moment, Apple had decided to go with the M88x00 RISC successors to 680x0 Macs instead of PowerPC.

Gary Davidian had already created an emulator which worked on an LC and could boot Mac OS 7.x by 1991. So, in theory this could have meant an architectural shift, a full 2 years ahead of PowerPC (1992 instead of 1994), probably with the M88200 at 25MHz to 50MHz instead of PPC at 60..80MHz.

The M88x00 was quite a different CPU to PowerPC, mostly because it split the 16kB data and instruction L1 caches off into external chips, which resulted in greater (3-cycle) latency. The instruction set was simpler and easier to learn, but it lacked some of the bit field masking instructions (which are really useful for emulation). In addition it had four execution units: Data load/store, Instruction Load, IU, FPU. So, in many ways it looked like a primitive PowerPC 603e.

What would an M88 Mac have looked like? First, I want to give model names. The 680x0 Macs ended up with 3-digit models and PowerPC Macs with 4 digits (1x00, 3x00, 4x00, 5x00, 6x00, 7x00, 8x00). I'm going to switch the other way and give the M88K Macs letter + 2-digit model numbers, released in this order: R41 (consumer, like an LC Mac), R61 (1 slot NuBus big Pizza style), R71 (3 slot NuBus). There's no top-end M88K Mac to displace Quadra 950 Macs as they sit alongside M88K Macs until they reach maturity.

The inner emulation core for the M88x00 Macs will look a lot like this:

Code:

Macro Next
ld.h r19,rIp,0 ;rIp^instruction, r19=opcode.
ld r20,rVecBase,r19<<2 ;dependency on r19.
jmp.n r20 ;jump with delayed branch again, dep on r19.
add rIp,2 ;6c.
ENDM

;So, NOPs at 8c each? So, at 25MHz, that's about 3MIPs for NOP.
;Simple Ins:
OpCodeMoveLd0d0:
ld r5,rDRegs,kD0
ld r6,rDRegs,kD1
add.u r5,r5,r6 ;update flags too.
lcr r10;save flags (flags need translating for a number of scenarios).
st.l r5,rDRegs,n ;store result.
Next ;So, 7cycles to 10 cycles, so 18c, so 1.39MIPs.

So, one of the upshots is that an R41 at 25MHz would emulate slower than a 68000 and therefore the performance (like PowerPC Macs) would come from converted ROM routines and also native code segments in the loaded OS and applications. In the M88K model, QuickDraw, Segment Manager and the Modern Memory Manager are native, as is a nanokernel (like PPC). But there's no PPC segmentation (which is different to Mac OS 68K code segments) and no fragments. Instead, we use RCDE resources which mirror CODE resources. When an application is loaded on an M88K Mac, it looks for RCDE 0, which contains the jump table for the M88K: normal execution addresses for 68K CODE resources, but execution addresses | (1<<0) for M88K RCDE resources, i.e. an odd address. Thus if 68K code calls an RCDE segment routine the emulator traps into M88K execution. Then Mixed-mode management duplicates RCDE jump tables so that M88K to M88K calls execute directly from the shadow jump tables, but M88K to M68K calls re-invoke the 68K emulator.
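
The bit-0 tagging of jump-table entries can be sketched in a few lines of Python. This is only an illustration under assumed names: `tag_native`, `is_native` and `call_via_jump_table` are hypothetical, not Apple or Motorola APIs. It relies on 68K instructions always sitting on even addresses, so an odd entry can never be a valid 68K target.

```python
# Hypothetical model of a jump table whose entries carry an ISA tag in bit 0.
# 68K code is always word-aligned, so an odd address is free to mean "M88K".

NATIVE_TAG = 1 << 0  # bit 0 set marks an M88K (RCDE) routine


def tag_native(address):
    """Return the jump-table entry for an M88K routine at `address`."""
    if address & 3:
        raise ValueError("M88K instructions are 32-bit aligned")
    return address | NATIVE_TAG


def is_native(entry):
    """True when a jump-table entry refers to M88K code."""
    return bool(entry & NATIVE_TAG)


def call_via_jump_table(entry):
    """Decide how the emulator would treat a call through `entry`.

    Returns (mode, real_address); a real emulator would transfer control
    instead of returning a description.
    """
    if is_native(entry):
        return ("m88k", entry & ~NATIVE_TAG)
    return ("68k", entry)
```

With this layout the emulator only needs a single-bit test on the target address to decide whether to keep interpreting or to drop into native code.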

Application ABIs also change, to be more register oriented on the 68K side as well as M88K side. D0..D2, A0..A1 are callee-saved; D3..D5, A2..A3 are caller-saved; D6, D7, A4 are temps. On a mixed-mode call, they are immediately loaded into their equivalent M88K registers: R2..R4, R5..R6. Other regs are used as per the Motorola standard ABI. This also means mixed 68K / M88K apps need modified compilers that use the correct ABI.
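
As a rough sketch of what that register hand-over might look like, here is a small Python model. The one-to-one pairing (D0..D2 to R2..R4, A0..A1 to R5..R6) is my reading of the ranges given above, and the function names are hypothetical.

```python
# Assumed pairing, read off the ranges in the post:
# D0..D2 -> R2..R4 and A0..A1 -> R5..R6.
ARG_MAP = {"D0": 2, "D1": 3, "D2": 4, "A0": 5, "A1": 6}


def marshal_to_m88k(regs68k):
    """Copy the mapped 68K registers into a 32-entry M88K register file."""
    m88k = [0] * 32  # r0 is hard-wired to zero on the 88100
    for name, index in ARG_MAP.items():
        m88k[index] = regs68k.get(name, 0) & 0xFFFFFFFF
    return m88k


def marshal_to_68k(m88k, regs68k):
    """Copy the mapped registers back when returning to emulated 68K code."""
    updated = dict(regs68k)  # leave unmapped registers (e.g. D7) untouched
    for name, index in ARG_MAP.items():
        updated[name] = m88k[index] & 0xFFFFFFFF
    return updated
```

If the emulator already kept these 68K registers live in those M88K registers, the copy would disappear entirely, which is presumably the point of choosing a register-oriented ABI.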

M88K environments have access to globals (like PPC). The emulator emulates a 68000 (not 68LC040) for simplicity and returns gestalt68000 when you pass it the selector gestaltProcessorType. But it still supports VM traps and 32-bit addressing. A-line and F-line traps are mapped to M88K addresses (with bit 0 set) for supported M88K routines, which also applies to M88K natively patched System code.

Function pointers have the same issues as per the PowerPC Mixed-Mode manager, except that as described above, an address needs to have bit 0 set to indicate an M88K routine, via the Empp(address) Toolbox API which is implemented as nothing for 68K apps, but checks the Jump-Table for an equivalent M88K routine in the RCDE resource. So, this is a simplified routine descriptor concept (though we could support Routine descriptors (section 1-15) with kM88kISA=2).

M88K doesn't support shared import libraries (1-20). The M88K environment shadows A5 in a native 'linker' register, and as standard supports up to 64kB of globals (vs 32kB for the 68K environment), but the application's resource info can specify any additional amount of globals, which are added after the A5 world (because address offsets are unsigned and +ve in the M88K).
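
The 32kB versus 64kB figures fall straight out of the offset encodings. A minimal sketch, assuming 68K globals live at negative 16-bit displacements below A5 and M88K globals are reached with unsigned 16-bit immediates above the linker register (the function names are made up for illustration):

```python
def signed16_range():
    """Displacement range of a 68K d16(An) addressing mode."""
    return (-(1 << 15), (1 << 15) - 1)


def unsigned16_range():
    """Offset range of an unsigned 16-bit immediate, as assumed for M88K."""
    return (0, (1 << 16) - 1)


def globals_bytes_68k():
    """Globals sit below A5, so only the negative half is usable."""
    lowest, _ = signed16_range()
    return -lowest


def globals_bytes_m88k():
    """Globals sit at and above the linker register."""
    lowest, highest = unsigned16_range()
    return highest - lowest + 1
```

Here `globals_bytes_68k()` gives 32768 and `globals_bytes_m88k()` gives 65536, matching the 32kB and 64kB limits quoted above.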

The Memory Manager on the M88K however, does and must support Macs with just 4MB of RAM, with VM turned on to at least 6MB (historical note: the 6100.. series was supposed to originally ship with 4MB) and because RCDE resources must be contiguous, File Mapping has the same semantics as for the PowerPC environment: application code doesn't occupy heap space (1-54). In addition, System, Finder and other Mac OS 68K code segments are re-organised to be contiguous in order to support application code file mapping (but not Read-only data file mapping).

The 25MHz R41/4/40, 40MHz R61/6/60 and 50MHz R71/8/100 ship in 1992 with System 7.1 so no 7.1.2 version is needed, running with VM in 2MB of RAM leaving 4MB of VM spare with 6MB of VM, or 4MB of RAM with VM turned off. It means fewer 68K Macs are released as the cross-over happens earlier. A cut-down Disk Tools runs in 3MB of RAM with VM turned off to allow 4MB RAM systems to be set up. Obviously far less native software would be available by then: System OS would be almost entirely 68K, ClarisWorks 1.5 ships with key M88K acceleration as does Photoshop and an early version of RenderMan. The R41 is targeted at Math, Physics and Biology students, offering slightly slower than LC speeds (12MHz equivalent) in normal GUI operation, but workstation speeds for targeted Math, Physics and Organic chemistry modelling thanks to NIH sponsored application frameworks. Dual Ensoniq DOC and Audio AV/ MIDI PDS (and NuBus) later in 1992 opens the R41 to R71 to early DAW applications.

Well, this is fantasy ;-)

http://www.bitsavers.org/components/motorola/88000/MC88100_RISC_Microprocessor_Users_Manual_2ed_1990.pdf

Transplanting the Mac’s Central Processor: Gary Davidian and His 68000 Emulator

Last week Apple announced its switch from Intel to its own ARM-based processors for all future #Macs. Apple has done this multiple times in its history, starting with the transition from Motorola 68000 chips to PowerPC in the early 1990s.

computerhistory.org

https://developer.apple.com/library/archive/documentation/mac/pdf/PPC_System_Software/Intro_to_PowerPC.pdf

Mac Release Dates 1992 - Macs By Year: EveryMac.com

Technical specifications for all Apple Mac models released in 1992. Dates sold, processor type, memory info, hard drive details and more.

everymac.com

MAME 0.255 Released!

MAME 0.255 As you may have expected, it’s MAME 0.255 release day! Following on from April’s breakthroughs, Namco System 10 MP3 audio is now supported, making Golgo 13: Juusei no Requiem, Seishun Quiz Colorful High School and Nice Tsukkomi fully playable. On top of that, Point Blank 3 and Gunbalin...

forums.atariage.com

Phipli · Mar 3, 2024

Would it support multiple processors?

Could you bump the 68k performance by retaining a 68000 or 68020 for unported code?

Would traps cause any new or different challenges vs. PPC?

Snial said:
So, in many ways it looked like a primitive PowerPC 603e.

Would it ultimately benefit from fast cache the same way as the 603 architecture did in the G3?

Melkhior · Mar 3, 2024

Phipli said:
Would it support multiple processors?

Multi-processor (either true SMP or some other schemes like the stuff implemented in MP PPC machines under OS9) is mostly a software issue - you need the software to support it. And that means adapting the software to the quirks of the underlying hardware, and back in the day there were a lot of quirks. SMP was costly, and desktops/laptops didn't move to it unless either forced to by lack of single-CPU performance (read: Apple during the PPC era) or when SMP became available in a single chip (e.g. the Pentium D).

During the early days:

The MC68020 was 'easy' as it has no data cache, so you only need to worry about self-modifying code, as for UP. MP performance would be quickly limited by bus utilization - probably no more than 2 would be useful.
The MC68030 had no specific MP support, so any possibly shared pages of data must be cache-inhibited, killing performance to MC68020 level (unless the software has tight control about what is and isn't possibly shared).
The MC68040 and MC68060 have broken MP support [1], and any possibly shared pages of data must be marked as write-through instead of write-back, severely crippling performance (unless the software has tight control about what is and isn't possibly shared).
The MC88100 basically required a pair of MC88200 CMMUs (one for the instruction bus, one for the data bus), and had some MP support that may or may not be good (I did not investigate, too many impossible to find chips).
The MC88110 has a proper external bus with good MP support. It actually supports a decent level of hardware coherency, so SMP on it would be fine - and probably was, as at least Motorola shipped dual-CPU systems.
The PPC601 (and all those with a similar bus or compatible to it) has an external bus that's very similar to the MC88110 (with extra stuff to make it more complex) and also has proper MP support, and SMP was fine (IBM at least did ship a lot of SMP RS/6000 based on the 601, at least duals and quads).

In the above, "the software has tight control about what is and isn't possibly shared" is very difficult to achieve; for instance in a modern Unix-like system (Linux, NetBSD), any page in a process virtual space is accessible by any thread. Unless the OS can guarantee the process is single-threaded, can flush the cache when migrating the process, and can identify pages that will be shared with the OS (I/O, ...), and maybe more, then all pages will be 'possibly shared'. Which makes SMP '030, '040 and '060 rather pointless. Some were built back in the day, but then process threads might not be a feature of the OS, making things a lot easier, and there were a lot of proprietary OSes with different assumptions.

[1] At the time some people were unhappy about it

Snial · Mar 3, 2024

Phipli said:
Would it support multiple processors?

The M88K data sheet talks about multiple processor support - which is their rationale for offloading both the MMUs and caches. So, I'd say that alternate history would be the same as for real PPC Macintosh history. Also, it's likely they'd be introduced in the equivalent time-frame as for PPC604 Macs, so 1.25 years later, in summer 1993 along with M88K A/ROSE.

Phipli said:
Could you bump the 68k performance by retaining a 68000 or 68020 for unported code?

I thought about this. I imagine there would be some issues: the 680x0 has a different bus architecture, so you'd have to translate between the two and this would raise costs, beyond supporting the second CPU. Also, I'm not sure how the M88K emulator ROM would work. I would imagine the M88K being in charge and then the mixed-mode manager being adapted so that if there was a hardware 680x0 it would use that and use it for the whole of a legacy application. But the ROM and a bit of the System would still contain M88K code (e.g. QuickDraw, Memory Manager etc), so I guess the A-line and F-line traps would still have to mode switch, and do it differently to the 68K emulator, because the 68K registers wouldn't be directly accessible from the M88K (as they're on the 680x0 chip itself) and this would add a different kind of overhead.

It's the mixed-mode emulation / co-processor execution that makes the trade-offs particularly uncertain. A low-end R41 would benefit the most from co-processor CPU support, because its 68K emulation could be slower than a real 68000, but a 68000 co-processor has its own overhead, which implies that either it should use a 16MHz 68000 (plausible, cf Mac Portable or PowerBook 100) or skip straight to using a 68020 or 68030. But if it did that, then it'd start competing with more expensive 680x0 Macs (and would be more expensive anyway).

There are other implications of this approach too - though in some ways it's not so different to the Quadra AVs with the 66.7 MHz AT&T DSP 3210 co-processors. There's a practical incentive to including a 680x0 due to slow emulation in the 1992-era, but also it might slow down the transition, because there's less incentive to switch.

I'm imagining that this transition is different anyway, with the early Rx1 machines being relatively low-volume and targeted at specific users along with an emphasis on developers. Here I'm also assuming that 90s-era Apple could be competent at such a strategy. But Apple had been rife with competing teams since the 80s (cf Apple ///, Lisa vs Mac, Apple ||gs), so ironically maybe it'd provide an outlet for that, with perhaps M88K OS getting accelerated, so we'd see Copland being subsumed into Mac OS development earlier and therefore Mac OS 8.5/8.6 features (Oct '98) appearing in Mac OS 7.5 / 7.6 with its look & feel (1994, 1996). By 1994/1996, given there's no PPC transition this time, the M88K would easily be fast enough to emulate the fastest 68K Macs as Dynamic translators would have arrived and I'd be offering ROM upgrades too. Then the early R41 users who suffered slow emulation would have nearly all native apps (CW4, Nisus 6, Metrowerks 10 Gold) and would have upgraded to 16MB of RAM and 2GB of HD and even a 25MHz R41 would then be as fast as a 33MHz or 40MHz '040, so their patience would have been rewarded.

Phipli said:
Would traps cause any new or different challenges vs. PPC?

AFAIK PPC Toolbox calls aren't done via traps, they're just there for 68K emulation.

Phipli said:
Would it ultimately benefit from fast cache the same way as the 603 architecture did in the G3?

If I understand it correctly, the M88200 had two integrated 16kB caches (rather than separate caches), so they were already much more like the 603e. Also, I imagine the emulator would have been written with M88K caches in mind (as I imagine the original M88K emulator would have been), so they wouldn't have suffered the 8kB L1 cache issue found on the 603 since the PPC emulator targeted a 32kB cache.

Fantasy is lots of fun - liberate those M88K Macs ;-) !

Melkhior · Mar 3, 2024

Snial said:
If I understand it correctly, the M88200 had two integrated 16kB caches (rather than separate caches), so they were already much more like the 603e.

The MC88200 aren't CPUs, they are Cache/Memory Management Units for the MC88100 (the first implementation of the MC88000 architecture). The MC88100 doesn't have internal caches or MMU, and it has two external 32-bit memory buses ("P-Bus"): one for instruction, one for data. Unifying that into a single physical memory for a 'normal' OS required external support, which Motorola supplied with the MC88200. Those connected to the MC88100 "P-Bus" on one side, connected a MMU and internal 16 KiB cache to it, and connected to an external "M-Bus" (shareable between MC88200) where the memory controller and peripherals would live.

That was an expensive solution, though implementing a CPU in multiple chips was common for early RISC designs. The choice of integrating the FPU yet externalizing the MMU/cache combo was less common, but might have made sense for some embedded designs - for instance, an NCR X-terminal based on the MC88100 ignored the MC88200 and implemented two physical memories, one for instructions (the xserver would be loaded to it by the ethernet interface using TFTP) and one for data. Can't do that for a normal OS... CPUs with such separated buses never caught on for workstation/desktop/laptop designs, but did have some success in embedded markets, such as the "three-bus" members of the AM29000 family.

The only other implementation of the MC88000 was the MC88110, which was a one-chip design like the MC68040 or MC68060. FPU, MMU, caches are all integrated, with just an external 64-bits bus common to instruction and data. After that, Motorola went PowerPC.

The MC88100 would have been too expensive to use with its companion chips for a Mac (and turns out, for almost everybody else as well!). The MC88110 would have been usable instead of the PPC601, but Motorola alone would have had a difficult time keeping up with everybody else...

Phipli · Mar 3, 2024

Snial said:
the 680x0 has a different bus architecture, so you'd have to translate between the two and this would raise costs, beyond supporting the second CPU

You say this, but Apple frequently did this with the LC line of PowerPC macs, which retained the LC-PDS.

Snial said:
they're just there for 68K emulation.

Hum, I thought they were fundamental to how Mac OS works? Hence needing to emulate 68k traps on PPC.

Snial · Mar 3, 2024

Melkhior said:
The MC88200 aren't CPUs, they are Cache/Memory Management Units for the MC88100 (the first implementation of the MC88000 architecture).... <snip>

Thanks, my goof, confusing the M88110 second-generation with the M88200 MMU/Cache units.

Melkhior said:
That was an expensive solution...

Yep.

Melkhior said:
The only other implementation of the MC88000 was the MC88110, which was a one-chip design like the MC68040 or MC68060. FPU, MMU, caches are all integrated, with just an external 64-bits bus common to instruction and data. After that, Motorola went PowerPC.

The MC88100 would have been too expensive to use with its companion chips for a Mac (and turns out, for almost everybody else as well!). The MC88110 would have been usable instead of the PPC601, but

The MC88110 is a pretty amazing chip though, with 10 execution units and 2x Integer Units. Unfortunately, it has 8kB caches like the 603, so what I said to @Phipli would have turned out to have been wrong: the emulator would have had to be modified again or you'd have to have an L2 cache.

Motorola alone would have had a difficult time keeping up with everybody else...

But this is a fantasy topic - one can imagine that some aspects could turn out differently even if it's a bit implausible. Basically I'm trying to imagine what such a Mac would feel like, with a much slower emulated environment, but some nippy (for the day) natively accelerated components.

Snial · Mar 3, 2024

Phipli said:
You say this, but Apple frequently did this with the LC line of PowerPC macs, which retained the LC-PDS.

Hum, I thought they were fundamental to how Mac OS works? Hence needing to emulate 68k traps on PPC.

I thought the linkage was different if it's PPC native application code to PPC native Toolbox calls, but it's likely I'm not checking properly. Emulated 68K apps to 68K traps, obviously go through the trap table; Emulated 68K apps to PPC native go through the trap table, and (I guess) a UPP then forces a switch to PPC mode; Native PPC to Emulated 68K traps would go through a UPP that does a mixed mode switch to a 68K trap; Native PPC to a native PPC Toolbox call that's then patched to 68K code would also have to revector via the Mixed-mode manager.

Surely it's like that, because the PPC ABI uses registers, and Toolbox 68K traps use the stack, so if a native PPC application called a native PPC Toolbox function and it had to go through the same 68K trap address just like the emulated 68K would, wouldn't there be an additional, unnecessary overhead?

Melkhior · Mar 3, 2024

Snial said:
The MC88110 is a pretty amazing chip though, with 10 execution units and 2x Integer Units. Unfortunately, it has 8kB caches like the 603

It is a nice CPU, and yes it's likely somewhat crippled by small caches - as most CPU would be, the L1 cache size being limited by area (almost any CPU without built-in L2) or by meeting L1 access latency requirements (almost everything since the introduction of the built-in L2).

The 64-bit memory bus does help performance, though it is annoying for "modern use" as that requires a LOT of pins if you want to connect an FPGA to it (and a lot of traces to route). The higher bus frequency is also a challenge for homebrews (MC680x0, x<=4 have slower buses, 68060 can run the bus at 1:1, 1:2 or 1:4 the CPU frequency).

Snial said:
But this is a fantasy topic - one can imagine that some aspects could turn out differently even if it's a bit implausible. Basically I'm trying to imagine what such a Mac would feel like, with a much slower emulated environment, but some nippy (for the day) natively accelerated components.

If Apple had started the transition from m68k to something else a little earlier than they did, then the state of the software would have been different. From my point-of-view, rather than the PPC solution of back-and-forth between 68k and PPC code, they could have updated A/UX to run natively on the new platform, and used that to run "classic" 68k app in emulated sandboxes (similar to v68k). More intrusive than what we got UI-wise, but would have been more future-proof - it eventually happened in MacOS X but that was already mostly there in A/UX and then more or less abandoned during the MacOS 8/9 era.

cheesestraws · Mar 3, 2024

Melkhior said:
... could have updated A/UX to run natively on the new platform, and used that to run "classic" 68k app in emulated sandboxes (similar to v68k). More intrusive than what we got UI-wise, but would have been more future-proof

Not to derail the thread but I think you (and other people) seriously overestimate how complete A/UX's "sandboxing" is. A/UX was not future proof at all and even aside from the licensing problems wouldn't have made a sane basis for a next generation platform.

Snial · Mar 3, 2024

Melkhior said:
It is a nice CPU, and yes it's likely somewhat crippled by small caches - as most CPU would be, the L1 cache size being limited by area (almost any CPU without built-in L2) or by meeting L1 access latency requirements (almost everything since the introduction of the built-in L2).

Yes.

Melkhior said:
The 64-bit memory bus does help performance, though it is annoying for "modern use" as that requires a LOT of pins if you want to connect an FPGA to it (and a lot of traces to route). The higher bus frequency is also a challenge for homebrews (MC680x0, x<=4 have slower buses, 68060 can run the bus at 1:1, 1:2 or 1:4 the CPU frequency).

The M88110 is ambitious for the period.

Melkhior said:
if Apple had started the transition from m68k to something else a little earlier than they did, then the state of the software would have been different. From my point-of-view, rather than the PPC solution of back-and-forth between 68k and PPC code, they could have updated A/UX to run natively on the new platform, and used that to run "classic" 68k app in emulated sandboxes (similar to v68k).

Good to see AMS (aka v68k) still being worked on.

Melkhior said:
More intrusive than what we got UI-wise, but would have been more future-proof - it eventually happened in MacOS X but that was already mostly there in A/UX and then more or less abandoned during the MacOS 8/9 era.

I guess there were multiple avenues Apple explored including MAE. All of them require 68K emulation at some point; whether it's a 68K app in an emulated Sandbox or in the case of the PowerMacs, using the new processor to accelerate a 68K system until it becomes the dominant environment. Real Unix systems avoided that when they went RISC, because it was pretty much all 'C' source in the first place. This I guess is part of how foresighted Unix was, being ported from a pdp-7 (in a manner of speaking) to pdp-11, then a few related LSI-11s before Interdata 8/32 (an IBM/360 style architecture) and then VAX-11/780.

Apple has gotten away with emulation at each stage, except for 6502 to 68K emulation (which was just skipped apart from the Apple ][ on an LC card), due to the need to support legacy code. I guess this is because most of the user base won't be technically aware enough to recompile code.

cheesestraws said:
Not to derail the thread but I think you (and other people) seriously overestimate how complete A/UX's "sandboxing" is. A/UX was not future proof at all and even aside from the licensing problems

Interesting.

cheesestraws said:
wouldn't have made a sane basis for a next generation platform.

Except if SANE was supported, that would have made a SANE basis for all future generation platforms ;-) .

Snial · Mar 3, 2024

Phipli said:
Hum, I thought they were fundamental to how Mac OS works? Hence needing to emulate 68k traps on PPC.

I've been looking into this a little bit more. So, to recap, the context is:

Phipli said:
Would traps cause any new or different challenges vs. PPC?

AFAIK PPC Toolbox calls aren't done via traps, they're just there for 68K emulation.

I originally bought the Inside Macintosh volume on the PowerPC System Software way before I had a PowerPC Mac, because I was very fascinated by how Apple did it. However, looking through the book they seem to spend quite a bit of time covering the basic 4 cases of which architecture calls functions in another (or same) architecture and when the Mixed-Mode manager is invoked to do this. It's only when we get to page 2-8 that it finally talks about how System (e.g. ROM) code is called for those combinations. The upshot is that if the emulator is running 68K code, then it uses the Trap despatch table as normal (as it has to), but PowerPC code always uses the System Software Import Library, and there's one of those in the ROM too. So, e.g. a PPC app calling a QuickDraw PPC routine would be called directly as a Shared Library routine (as we'd call it now), and it wouldn't go through a trap despatch table, but a 68K app calling the same routine would invoke the trap as normal (because it doesn't know it's running in an emulator) and the trap would point to a routine descriptor in ROM which would start with a Mixed-mode entry point instruction in 68K code and that instruction would switch to PPC mode; convert the 68K stack to work for the PPC convention (registers) and then call the PPC routine directly, before switching back on exit.

I suspect that Gary Davidian's M88K 68K emulator didn't even support Mixed-mode execution, since he was only interested in proving the RISC CPU could run 68K Mac OS 7. From the oral history transcript and video this seems to be the case. This means that M88K based R41.. R61 Macs would either have to do what the PPC Macs did, or would possibly choose a simpler Mixed-Mode mechanism, because it's earlier in history.

An obvious mechanism, without Software Import Libraries is to:

Build Mixed mode execution directly into the A-line (and F-line) trap handler; so that the emulator looks at the lowest bit of the trap despatch table address and if it's a 1, it's in M88K code. I would probably do this for Cortex M0 traps with M0bius (note to the reader, it's an idea for a complete Mac 256K on a standard Raspberry PI PICO).
The M88K doesn't directly have an A-line and F-line equivalent, but there are multiple mechanisms for OS calls:
- opcodes in the jmp instruction range: 0xfyxxc020 to 0xfyxxc7ff and jsr in the range 0xfyxxc820 to 0xfyxxe7ff (where y=4..7 and x=0..f) are undefined and should trap, giving well over 16-bits of possible trap vectors.
- we could use 2 instruction jsr's: add rd,OsVectorBase,#Immediate then jsr rd (wasteful).
- We could use the MMU to ensure that all applications see the set of OS routines in low memory vectors and they could be 2 instruction jumps as in the previous case or the MMU just maps the OS despatch table to RAM containing the correct addresses, then it's just a jsr followed by a jmp. I'd probably favour this mechanism.
M88K routine calls to 68K emulated routines would then just be a jsr to a jmp that switches to the emulator .. but actually you'd have to save the routine offset, so it'd still need at least 2 words.
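
The first option in the list above (mixed mode built into the A-line handler) might look roughly like this. It is a sketch under loud assumptions: the flat table indexed by the low 12 bits of the opcode is a simplification for illustration, not how the real Mac OS trap tables are organised, and `dispatch_trap` and the trap numbers in the usage note are made up.

```python
A_LINE_MASK = 0xF000
A_LINE = 0xA000
NATIVE_TAG = 1  # lowest bit of a table entry set => handler is M88K code


def dispatch_trap(opcode, trap_table):
    """Model of an A-line handler that also performs the mode switch.

    `trap_table` is a hypothetical flat mapping from the low 12 bits of the
    trap word to a handler address. Returns (mode, address) instead of
    actually jumping.
    """
    if (opcode & A_LINE_MASK) != A_LINE:
        raise ValueError("not an A-line opcode")
    entry = trap_table[opcode & 0x0FFF]
    if entry & NATIVE_TAG:
        return ("m88k", entry & ~NATIVE_TAG)
    return ("68k", entry)
```

For example, with `table = {0x123: 0x40000001, 0x124: 0x00802000}`, the call `dispatch_trap(0xA123, table)` returns `("m88k", 0x40000000)` while `dispatch_trap(0xA124, table)` stays in the emulator.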

All in all, mixed mode execution is messy, because we have to support binary execution (which Unix and A/UX didn't have to, or couldn't easily be adapted to) for a previous instruction set. And that's also similar to the problems faced by the switch from 16-bit 8086 code to 16-bit protected '286 code (where segments have a different meaning) to 32-bit x86 code (instruction set has different semantics, effective address modes are different, operand length is byte/long instead of byte/word, regs are 32-bits instead of 16-bits): lots of thunking up or down, but at least the emulation for the older CPU is built into the hardware.

Snial · Mar 5, 2024

A little addendum on the M88K inner interpreter loop. I noticed that the PowerPC System Software guide explained that each opcode vectored to a two instruction table, so that some instructions could be emulated by a single PPC instruction (and then the second instruction jumped back to the Fetch loop). I think the same could be done here:

R2 to R25 are general purpose. Maybe we need about 8 for temporaries (rBaseVec, rIp, rSr + others; I use 3 more in Next). We can probably alias d0 to d3 and A0..A2 in M88K registers (perhaps even close to d0..d7 and A0..A7) and this means several types of instructions can be done in just one instruction. Because Next doesn't update condition codes, we can use the real condition codes to maintain them most of the time. This gives us code such as:

Code:

Next:
ld.h r25,rIp,0 ;rIp^instruction.
mak r24,r25,0,3 ;lsl 3
add r23,rBaseVec,r24
jmp.n r23 ;jump with delayed branch, dep on r23.
add rIp,2 ;6c.

;So, NOPs at 8c each? So, at 25MHz, that's about 3MIPs for NOP.
;Simple Ins (move.l D0,D1).
MoveLD0D1:
jmp.n Next
add.u rD1,rD0,r0 ;update flags too.

The important thing is that most instructions on a 68K too are the simple instructions: MOVE, ADD, CMP, BRA.s. We can support a good proportion of these directly. MOVE.L, AND.L, OR.L, EOR.L for 49 to 256 opcodes each; probably most BRA.s (256 opcodes), all BRF.s and all BRF; a number of immediate shifts can be done. All of MOVEQ (4 x 256 opcodes). This takes us up to about 3 MIPS for a decent proportion of instructions. On the downside, the jump table then becomes 64k x 8 bytes = 512kB, of which only 16kB will be resident in the L1 cache (the instruction cache, since the slots are now executed rather than loaded) and on the M88110, only 8kB, leading to the same problems faced by the PPC603.
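
To make the table arithmetic concrete, here is a small Python sketch of the slot addressing done by `mak`/`add` in Next, plus the cache-residency numbers. The helper names are made up, and the residency figure is a best case that ignores everything else competing for the cache.

```python
SLOT_BYTES = 8     # two 4-byte M88K instructions per 68K opcode
OPCODES = 1 << 16  # one slot for every possible 16-bit opcode word


def slot_address(base, opcode):
    """Address jumped to for `opcode`: base + (opcode << 3)."""
    return base + ((opcode & 0xFFFF) << 3)


def table_bytes():
    """Total size of the two-instruction dispatch table."""
    return OPCODES * SLOT_BYTES


def resident_slots(cache_bytes):
    """Best-case number of slots a cache of `cache_bytes` can hold."""
    return cache_bytes // SLOT_BYTES
```

`table_bytes()` is 524288 (512kB); a 16kB cache can hold at most 2048 of the 65536 slots and an 8kB cache only 1024, so the hit rate depends heavily on how few distinct opcodes the running 68K code uses.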

Phipli · Mar 5, 2024

Snial said:
This takes us up to about 3 MIPS for a decent proportion of instructions.

That's about a 16MHz 68000 or a 12MHz 68020/030 - useful numbers for the timeframe, especially with bursts of native code from a statistical analysis of the most critical parts of the toolbox/OS identifying where to put effort first in porting.

Snial · Mar 5, 2024

Phipli said:
That's about a 16MHz 68000 or a 12MHz 68020/030 - useful numbers for the timeframe, especially with bursts of native code from a statistical analysis of the most critical parts of the toolbox/OS identifying where to put effort first in porting.

Surely, 3MIPs tops = 12MHz 68000 (since at best it takes 4 cycles per instruction). Unfortunately, I've found out it won't be that good, because .u doesn't mean update flags, instead the 88100 is very much like a classical RISC which lacks shared state and instead tests the result of a register directly for conditions which means that the emulator must copy the result to a register that can be used for eq/ne, le/lt, ge/gt style branches; possibly clear or set the carry flag (in another register) and the same will apply to overflow. This leads to:

Code:

NextAluFlagsD1: ;We need one of these for each destination register.
add rFlags,rD1,r0 ;
add rVFlag,r0,0 ;clear VFlag.
Next:
ld.h r25,rIp,0 ;rIp^instruction.
mak r24,r25,0,3 ;lsl 3
add r23,rBaseVec,r24
jmp.n r23 ;jump with delayed branch, dep on r23.
add rIp,2 ;6c.

;So, NOPs at 8c each? So, at 25MHz, that's about 3MIPs for NOP.
;Simple Ins (move.l D0,D1).
MoveLD0D1:
br.n NextAluFlagsD1 ;br, not jmp: jmp only takes a register target.
add rD1,rD0,r0 ;delay slot does the move; flags are set in NextAluFlagsD1.

10c minimum now means 2.5MIPs, 10MHz 68000, 7.5MHz 68020. Emulators are curious things: load/store tends to be relatively fast, but decoding tends to be relatively slow, so some instructions would likely exceed the 10MHz performance while others (e.g. those using disp(An, Xn.w) addressing) are going to be slow to decode.
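The flag handling in NextAluFlagsD1 above can be modelled at a higher level like this. It is only a sketch of the idea, not the emulator itself: the class and method names are hypothetical, and clearing carry alongside overflow is my reading of how 68K MOVE and logic instructions set the condition codes.

```python
MASK32 = 0xFFFFFFFF


class LazyFlags:
    """Hypothetical model: keep the last ALU result instead of a flags register."""

    def __init__(self):
        self.result = 0  # plays the role of rFlags
        self.v = 0       # plays the role of rVFlag
        self.c = 0       # carry, held in yet another register

    def set_logic(self, value):
        # MOVE/AND/OR/EOR: N and Z follow the result, V and C are cleared.
        self.result = value & MASK32
        self.v = 0
        self.c = 0

    @property
    def z(self):
        # Zero is tested directly on the saved result (eq/ne style branch).
        return int(self.result == 0)

    @property
    def n(self):
        # Negative is just bit 31 of the saved result (lt/ge style branch).
        return self.result >> 31


flags = LazyFlags()
flags.set_logic(0x80000000)  # e.g. move.l D0,D1 with D0 = $80000000
print(flags.n, flags.z, flags.v, flags.c)
```

The point is that N and Z cost nothing until a branch actually needs them, whereas V and C have to be written explicitly by each emulated instruction.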

Phipli · Mar 5, 2024

Snial said:
Surely, 3MIPs tops = 12MHz 68000

I was working off the table here which gives 0.175MIPS per MHz for a 68000, so 16 = 2.8MIPS.

Snial · Mar 5, 2024

I wrote a little Python program to perform a static analysis on the Mac Plus ROM. Each 16-bit word in the ROM is read and a count of how many times it occurs is stored in an array 0..65535. For example, if the value 0 occurs 1000 times, then the top-left cell would be drawn in whatever colour the count 1000 maps to.

In the Mac Plus ROM, no value is used more than 1574 times and you can see only a relatively small set of values are actually used. The y axis is the upper byte and the x axis the lower byte. They're not all opcodes, as I haven't tried to distinguish, so there are immediates, addresses and other data mingled in there, but most of it will be representative.

The top 25% is the MOVE instruction. That's used a lot as you can see. The scattered band 6/16ths of the way down is branch instructions. They're used a lot too. There's a tendency for the top-left of each group to be brighter than the bottom right - this shows that lower register numbers are in fact used more often. I think it's quite fascinating.

Code:

import tkinter
import array

# init tk
root = tkinter.Tk()

# create canvas: 256 x 256 cells of 3 x 3 pixels, one cell per 16-bit value.
myCanvas = tkinter.Canvas(root, bg="white", height=768, width=768)

# each instruction word is 2 bytes, big-endian; this reads up to 65536 words.
memCount=array.array('i', (0 for i in range(0, 65536)))
maxRange=0

with open("vMac.ROM", "rb") as f:  # binary mode, so read() returns bytes
    for theIns in range(0, 65536):
        word=f.read(2)
        if len(word)<2:
            break  # stop early if the file is shorter than expected
        ins=word[0]*256+word[1]
        memCount[ins]+=1
        if memCount[ins]>maxRange:
            maxRange=memCount[ins]

print("MaxRange="+str(maxRange)+"\n")

for y in range(0,256):
    for x in range(0,256):
        coord = x*3, y*3, x*3+3, y*3+3
        # scale the count to a 24-bit colour; integer maths keeps %02x happy.
        val=memCount[y*256+x]*16777215//max(maxRange, 1)
        r=val//65536
        g=(val//256)%256
        b=val%256
        dummy = myCanvas.create_rectangle(coord, fill='#%02x%02x%02x' % (r,g,b))

# add to window and show
myCanvas.pack()
root.mainloop()
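The banding described above follows from the top four bits of a 68000 opcode word selecting the instruction group, so each group covers 16 rows of the image. As a sketch, the counts could be summed per group as below. The group labels are abbreviated, and group_of and totals_by_group are names invented here; counts stands in for the memCount array.

```python
# 68000 instruction groups by the top nibble of the opcode word.
GROUPS = {
    0x0: "bit ops / immediate", 0x1: "MOVE.B", 0x2: "MOVE.L", 0x3: "MOVE.W",
    0x4: "miscellaneous", 0x5: "ADDQ/SUBQ/Scc/DBcc", 0x6: "Bcc/BSR/BRA",
    0x7: "MOVEQ", 0x8: "OR/DIV/SBCD", 0x9: "SUB/SUBX",
    0xA: "A-line (Mac Toolbox traps)", 0xB: "CMP/EOR",
    0xC: "AND/MUL/ABCD/EXG", 0xD: "ADD/ADDX", 0xE: "shift/rotate",
    0xF: "F-line",
}


def group_of(word):
    """Return the group label for a 16-bit word, treated as an opcode."""
    return GROUPS[(word >> 12) & 0xF]


def totals_by_group(counts):
    """Sum a 65536-entry count array into one total per top nibble."""
    totals = {}
    for word, n in enumerate(counts):
        label = group_of(word)
        totals[label] = totals.get(label, 0) + n
    return totals


print(group_of(0x2200))  # move.l D0,D1
print(group_of(0x6000))  # a branch, 6/16ths of the way down the image
```

One caveat: the top quarter of the image spans groups 0 to 3, and group 0 holds the immediate and bit instructions, so only the lower three sixteenths of that quarter are MOVE proper.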

Snial · Mar 5, 2024

Phipli said:
I was working off the table here which gives 0.175MIPS per MHz for a 68000, so 16 = 2.8MIPS.

Oh, I see what you mean then: since the average number of cycles per instruction is >4 (it's 5.7), 2.8MIPS=16MHz. I was only comparing the fastest emulated instruction with the fastest 68K instruction. I don't know what the rest would be, but if it could be equivalent to 16MHz, then that's great!
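The two ways of comparing can be put side by side. This is just the arithmetic from the posts above: 0.175 MIPS per MHz is the table figure Phipli quotes, 0.25 corresponds to the best case of 4 cycles per instruction, and the function names are made up here.

```python
def emulated_mips(clock_mhz, cycles_per_instruction):
    """MIPS of the emulator for a given host clock and cycles per 68K instruction."""
    return clock_mhz / cycles_per_instruction


def equivalent_68000_mhz(mips, mips_per_mhz=0.175):
    """Clock a real 68000 would need for the same MIPS (default: table average)."""
    return mips / mips_per_mhz


mips = emulated_mips(25, 10)                         # 10c dispatch at 25MHz
print(mips)
print(round(1 / 0.175, 1))                           # average cycles per instruction
print(round(equivalent_68000_mhz(2.8), 1))           # the 2.8MIPS figure
print(round(equivalent_68000_mhz(mips), 1))          # against the average 68000
print(round(equivalent_68000_mhz(mips, 0.25), 1))    # against best-case 4c timing
```

So the 2.5MIPS best case from earlier looks like a 10MHz 68000 when compared against 4-cycle instructions, but more like 14MHz against the average rate, provided the emulator could sustain it, which it won't for the slower-to-decode instructions.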

CC_333 · Mar 6, 2024

Snial said:
Except if SANE was supported, that would have made a SANE basis for all future generation platforms ;-) .

If we lived in a SANE world, everything would be as it was in the mid 90s, but faster (none of this bloated nonsense we've been getting over the past ~20 years).

c

Phipli · Mar 6, 2024

Snial said:
Oh, I see what you mean then: since the average number of cycles per instruction is >4 (it's 5.7), 2.8MIPS=16MHz. I was only comparing the fastest emulated instruction with the fastest 68K instruction. I don't know what the rest would be, but if it could be equivalent to 16MHz, then that's great!

Sorry, I forgot to paste the link :

Instructions per second - Wikipedia

en.wikipedia.org

I'm not advocating its accuracy particularly - it's just useful / interesting.
