Fantasy M88100 Macs

Snial

Well-known member
Oh my bad, I assumed reading multiple values was more efficient like on some other CPUs. I'm not familiar enough with the opcodes / timings.
Oh, like on the early ARM where MEMC can read multiple consecutive words without having to issue row strobes (hence all the references to 1N+1S clock cycles type thing).

Yeah, Freeport was the SE's codename.
OK! A revelation!
 

Snial

Well-known member
Continuing the analysis of NuMac graphics performance, I found the proper article I linked to via MacTech. In fact it's a Develop article from June 1994 and is still hosted on VintageApple.org, but as a PDF so the images are retained:


PpcQd1_3_5.jpg
What this means is that QuickDraw 1.3.5 on a PowerMacintosh is about twice as fast as on a Quadra 700, but there are some exceptions. CopyBits 1-bit (large image) on PPC is 4.8x faster. DrawPicture can also be about 4x faster. As they explain in the article, NuBus video cards are NuBus bandwidth limited (10MHz, or 20MHz at the time); the internal video frame buffer is 64-bits wide which helps with the video performance.

So, the upshot is this: 8-bit video performance is mostly bus-speed limited (which is why the PPC version is usually 2x faster, because video is on a 64-bit bus). A NuMac 41 would have the same bus speed & width as an LC III (25MHz) which is the same as a Q700. So, basically the graphics performance would be the same as a Q700 for 8-bit video, except for CopyBits (1-bit) and DrawPicture (1-bit) where it'd be about 25% to 50% faster (i.e. reflecting the performance difference between a 68040 and a M88K at 25MHz). And that would be great, because for those applications, you're getting a Q700 for the price of an LCIII.

A NuMac 61 runs at 33MHz on a 32-bit, 33MHz bus, so the general graphics performance would be about 32% faster for 8-bit operations and 65% to 98% faster for CopyBits 1-bit and DrawPicture.

Gary Davidian's RLC's Graphics Performance

One other thing is worth noting: what was the graphics performance of Gary Davidian's M88K RLC emulator? From my memory of the CHM interviews this isn't discussed, but from what we know of the project we can guess. And it's fairly simple: RLC's purpose was to prove that an M88K could emulate a Mac (an LC) with the original LC's ROM (plus the emulator code). So it's fair to deduce that he didn't recode QuickDraw for the RLC; instead QuickDraw ran 100% emulated, about 10x slower than native M88K code, i.e. about 1x to 1.3x the speed of an SE, whereas an LC ran at 1.8x the speed of an SE. So, RLC felt like a Macintosh LC at 72% of the speed - good enough for a proof-of-concept. I don't yet know if RLC's main motherboard had a 16-bit data bus (it too was a custom board, not an actual LC board). If it was 16 bits, then it would probably run even slower (the RLC CPU card was obviously a full 32-bit bus design).
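That 72% figure falls straight out of the ratios quoted above; here's a trivial check (the SE-relative inputs are the estimates from this paragraph, not measured values):

```c
#include <assert.h>
#include <math.h>

/* Speeds expressed relative to a Macintosh SE (SE = 1.0). The inputs
 * are the estimates from the paragraph above, not measurements. */
static double rlc_vs_lc(double rlc_vs_se, double lc_vs_se)
{
    return rlc_vs_se / lc_vs_se;
}
```

With the upper estimate of 1.3x an SE for fully emulated QuickDraw and 1.8x an SE for a real LC, the ratio is about 0.72, i.e. the "72% of an LC" figure.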
 

Arbee

Well-known member
Assuming it's using the V8's integrated audio and video, the RAM has to be 16 bits wide as far as I'm aware.
 

Snial

Well-known member
Assuming it's using the V8's integrated audio and video, the RAM has to be 16 bits wide as far as I'm aware.
It might have been. From Gary Davidian's oral history transcript:

".. in parallel, the development of the-- what we call the RLC was going on. It was a RISC version of the Macintosh LC and it had an 88100 CPU and two 88200 cache MMUs. And I was able to develop the emulator for that, that ran the Macintosh LC ROM unmodified. So, it could run any version of the OS that the Macintosh LC itself could run. So...
"this is a Macintosh RLC. It looks exactly like a Macintosh LC. In fact, it probably says “Macintosh LC” on it. It’s the real Macintosh LC plastics. So, it had to fit into the same form factor. I think that’s the 88100 and the two 88200s or maybe it’s-- there’s three of them.
".. .. And I think it was probably mid-’91 when RLC was actually up and running. And then I think it was-- well, one thing I did, one I had RLC up and running, I actually did some experimentation to look at actually using-- running RISC code from Macintosh application. So, you know, I put in some hooks to actually run-- let you actually run 88000 code and I had some support for actually making toolbox calls from that 88000 code. So, I took one of those DTS applications-- I think it was called “Traffic Light” at the time and implemented enough of this interfacing back to the Toolbox so that that application could run and it was just a way to show that you could-- you know, compiling for 88000 and running it with an emulated operating system. So, that was sort of proof of concept, but some of the stuff that I did for that did influence some of the Mixed Mode later on, but we’re jumping ahead a bit. So, but that was sort of the first native applications."

This is the RLC motherboard:

1710282668378.png
From https://archive.computerhistory.org/resources/access/text/2019/09/102781078-05-01-acc.pdf

And this is the LC motherboard (from LowEndMac image):

1710282898917.png
I think we can see that they're not the same, just based on the position of the PDS slot and the arrangement of the SIMM sockets. What we can tell, though, is that it has 2x 30-pin SIMMs, which means it had a 16-bit RAM bus on the motherboard. So, @Arbee must be right here. The 88200 cache chips were 32-bit though (and Harvard, so the total bandwidth there is 64 bits).

What we can also tell is that GD developed the beginnings of the Mixed Mode Manager here, so that he could run 88K applications on the existing emulated Toolbox. He doesn't say anything here about modifying the Toolbox, so I'd conclude he didn't - and it's important that he didn't, because they wanted to know that the emulator really worked as a 68020 emulator and that you didn't have to substitute 88K code just to get it working. In my NuMac descriptions the Mixed Mode Manager is more sophisticated, but not as advanced as that for PowerPC: QuickDraw and the Memory Manager would definitely have been converted, but there's no FAT code or UPPs; instead applications use CdeR ("coder") resources for accelerated functionality.

Note also, GD's term: "proof-of-concept", which I happened to use in my previous post :) .
 

Snial

Well-known member
I hope they removed that Maxell battery.......
I think the image is from the CHM itself and perhaps can be seen in Video 2, so, er, maybe not!

Meanwhile, I quite like this as the NuMac design:

1712154363047.png
Four-slice unit. Ethernet, AV, 40MHz Second Processor, PSU, RAM Bank and Cache upgrades (for both CPU slices) also available.
 

Snial

Well-known member
I'm really really liking that black Apple IIgs keyboard......
Nifty isn't it?

Oh, but this is a R41/25 NuMac with an MSP430 RISC-based keyboard controller. Totally, totally different ;) ! You can use a ||gs keyboard too though!

In this iteration, the minimal StudentStation™ configuration consists of just the R41/25 slice, with no hard disk, nor FDD, but a 10Mbps LocalTalk interface the computer can netboot from. The unit only has an ADB, LocalTalk and video interface, with SCSI being delegated to an optional HDD slice (the PDS-based docking connector is present though). It has 4MB of battery-backed soldered RAM and a single 72-pin SIMM which can take 4MB or 8MB modules (and later up to 32MB modules). Sold with a greyscale 12" monitor, a student could buy one for the same price as an LCIII in mid-1992. Innovative commercials show a student working at one in their room; printing out an assignment on a StyleWriter in the Common area, before popping the unit into their backpack to then plug it into a multi-processor NuMac in a Computer Science lab, and finally taking it home on vacation between Semesters.
 

Phipli

Well-known member
Nifty isn't it?

Oh, but this is a R41/25 NuMac with an MSP430 RISC-based keyboard controller. Totally, totally different ;) ! You can use a ||gs keyboard too though!

In this iteration, the minimal StudentStation™ configuration consists of just the R41/25 slice, with no hard disk, nor FDD, but a 10Mbps LocalTalk interface the computer can netboot from. The unit only has an ADB, LocalTalk and video interface, with SCSI being delegated to an optional HDD slice (the PDS-based docking connector is present though). It has 4MB of battery-backed soldered RAM and a single 72-pin SIMM which can take 4MB or 8MB modules (and later up to 32MB modules). Sold with a greyscale 12" monitor, a student could buy one for the same price as an LCIII in mid-1992. Innovative commercials show a student working at one in their room; printing out an assignment on a StyleWriter in the Common area, before popping the unit into their backpack to then plug it into a multi-processor NuMac in a Computer Science lab, and finally taking it home on vacation between Semesters.
Is it bundled with ClarisWorks and HyperCard?
 

Snial

Well-known member
Is it bundled with ClarisWorks and HyperCard?
It's not a poxy-Performa[*], mate ;-) ! Besides, there's no HDD, CD nor FDD. Students would boot their minimalist R41s from a University AppleTalk server and use whatever tools it provides. So, it's fairly likely that it'd have HyperCard, but I've no idea what the utilisation rate for ClarisWorks was in Universities. The battery backup for at least 4MB of RAM would probably use 2MB for System 7.1 (not 7.1.2 as that arrived in 1994) leaving 2MB for apps. From the Uni server, VM (over a 10Mbps connection) might be set to 6MB for a basic system, which I think would often be sluggish. I imagine a compression tool when moving (purge CODE segments and compress data if the same app is on the target location, or compress both). At the target location you'd either connect to a bigger system or interface to an AppleTalk server (e.g. Mum or Dad's IIsi/5/40 at home).

Purging all the CODE segments is quite a good concept really. If the jump table and stack are retained, then it should be possible to restart the app by reloading all the CODE segments based on the jump table contents and then just returning.

[*] For all the Performa fans in the 68KMLA, my first Mac was a Performa 400; it's still going strong and I love it to bits. What a trouper!
 

MOS8_030

Well-known member
I thought I had one of these around somewhere. I remember running (processing) these back-in-the-day at Motorola.
 

Attachments

  • XC88200.jpg

Snial

Well-known member
NuMacSE Prototype Emulator
It'd be quite interesting to create a quick-and-dirty NuMacSE-based emulator as a first step. Gary Davidian's first M88K emulator ran a Mac SE ROM, because obviously it's easier to write a 68000 emulator than a 68020 emulator. Only when its feasibility was proved was the RLC developed.

We can use existing emulators to hack a NuMacSE emulator together. I'd start with the PCE emulator, because its architecture is well defined and the code base is cleaner than miniVMac's. Let's look at its basic structure (open arrow means 'Uses', long dashed line = Level 1 interpretation):

1712518752241.png
Source code is at the top, which gets compiled into object code, which is linked into an executable. An Arch(itecture) wraps a Chipset which Executes an MC68000 core CPU. Then we pass the MacSE ROM to the executable to interpret the ROM. So, note: when the MacSE ROM makes a memory read/write request for address x, the e68000 emulator's memory read/write routines are handled by the SE Chipset code for the corresponding addresses, which results in the ROM, RAM or I/O being accessed.

Therefore the simplest way to implement a NuMac SE is as follows (short dashed = Level 2 interpretation).
1712519110377.png
First we replace the PCE's e68000 emulator with (an adapted) MAME MC88100 emulator, which gives us a Mac 'Plus' architecture that can't execute the MacSE ROM. In addition, we take the PCE's e68000 CPU code (which is pretty standalone) and compile it into M88K code using m88k-gcc (which exists and has done since about 1990 or 1991). This is then bolted onto the end of the MacSE ROM. Now all we need to do to execute the MacSE ROM is to patch the first instruction (or whatever address the MC88100 boots from) with an MC88100 jmp instruction to the beginning of the M88K-compiled e68000 emulator. Gary Davidian says that only 6 bytes were modified in the LC ROM. Now PCE's MacPlus Arch's MC88100 emulator will be running the e68000 emulator, which can execute the MacSE ROM, with the exception of the checksum check, which is patched to ignore the error (perhaps by substituting a 68000 NOP (0x4E71) for the branch to the checksum error handler).

So, note, again, when the MacSE ROM needs to make a memory read/write request for address x, the e68000 emulator's memory read/write routines are simply a reference to the same address, which is interpreted by the MC88100 emulator whose routines are handled by the SE Chipset code for the corresponding addresses, which result in the ROM, RAM or I/O being accessed just as before.
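The pass-through point can be sketched as a chain of read handlers; the function names here are purely illustrative, not PCE's real API:

```c
#include <assert.h>
#include <stdint.h>

/* Level 0: the host-side chipset decode, as in PCE's Mac SE/Plus Arch.
 * Illustrative stub -- a real decode routes to ROM, RAM or I/O. */
static uint8_t chipset_read8(uint32_t addr)
{
    return (uint8_t)(addr & 0xFF);   /* marker value for the demo */
}

/* Level 1: the MC88100 emulator's load handler. It has no chipset of
 * its own, so every guest load lands straight in the host's decode. */
static uint8_t m88k_read8(uint32_t addr)
{
    return chipset_read8(addr);
}

/* Level 2: the e68000 emulator, itself running *as* M88K guest code.
 * Its memory accesses are plain loads, so they pass through unchanged:
 * no second chipset model is needed at this level. */
static uint8_t e68000_read8(uint32_t addr)
{
    return m88k_read8(addr);
}
```

The key property is that both levels resolve to the same single chipset decode, which is why the inner CPU emulator stays "pure".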

It looks like I'm labouring this point, but it is useful. In a normal emulator, the host binary needs to emulate the chipset as well as the CPU and so if a host is emulating another system which is emulating a second system, then both levels end up with their own chipset emulation. E.g. if you run miniVMac compiled for PPC 604e BeOS running under SheepShaver on your Intel PC under Linux, then miniVMac emulates Mac Plus video addresses which get interpreted into BeOS API calls which are then interpreted into Linux API calls. But when Apple jumps CPUs, the chipset isn't emulated, so the CPU is a pure CPU emulator which is why it's much faster. It means that Apple mostly has to solve one problem at a time.

The same process applies to a NuMac M88K emulator (and would e.g. apply to a version of the Dingus PPC emulator if the Davidian 68K emulation was replaced by a third party 68K emulator): the inner CPU emulator is a pure emulator and this saves on a lot of work emulating the chipset (as that's already done[*]).

Conclusion
It should be possible to develop a rough M88K MacSE emulator largely using existing code. It's very basic in that there's no support for a Mixed Mode Manager; no colour QuickDraw; no accelerated QuickDraw (it's still all 68K code); no 32-bit clean support in ROM (the e68000 emulator must do that). Its purpose would be to prove M88K MacSE emulation, just as Gary Davidian did.

The obvious next steps would be:

  1. Executing M88K native code by modifying the e68000 emulator (which is the M88K binary) to interpret odd JMP instruction addresses as a jump to native code (test code could then be patched into an application).
  2. Changing the e68000 emulator to perhaps the Cyclone 68000 emulator. This is an emulator much like the Davidian PPC emulator, but in ARM-32 assembler. Here, the C source is compiled into an executable which, when run, generates a massive 65536-entry table, somewhat like the Davidian PPC emulator's, whose entries point to native ARM assembler fragments, which can then be assembled with the ARM assembler. By replacing the ARM assembler fragments it's possible to build a close M88K equivalent without doing all the work from scratch. Just doing this would speed up the NuMac emulator significantly.
  3. There exist a few implementations of QuickDraw in 'C' (Executor / MACE / Advanced Mac Substitute); they could be a starting point for recompiling a native M88K QuickDraw.
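Step 2's table-driven dispatch can be sketched in plain C, with function pointers standing in for Cyclone's generated ARM fragments (all names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_OPCODES 65536

typedef void (*OpHandler)(uint16_t opcode);

/* One handler slot per 16-bit 68000 opcode word, Cyclone-style.
 * In Cyclone the slots point at generated ARM fragments; plain C
 * function pointers are the illustrative equivalent. */
static OpHandler op_table[NUM_OPCODES];

static int g_nops;   /* demo side-effect counter */

static void op_illegal(uint16_t opcode) { (void)opcode; /* raise exception */ }
static void op_nop(uint16_t opcode)     { (void)opcode; g_nops++; }

/* Build the table once at startup. A real generator fills in every
 * addressing-mode variant; we fill in just NOP (0x4E71) as a demo. */
static void build_table(void)
{
    for (int i = 0; i < NUM_OPCODES; i++)
        op_table[i] = op_illegal;
    op_table[0x4E71] = op_nop;
}

/* The inner interpreter step is then a single indexed call. */
static void dispatch(uint16_t opcode)
{
    op_table[opcode](opcode);
}
```

The appeal of this layout is that retargeting the emulator means regenerating the 65536 fragments, not rewriting the decoder.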

At this point, we'd have something much closer to a recreation of an RLC.

[*] At this point you might wonder why we would start with a Mac SE ROM instead of taking the Dingus emulator, replacing its PPC emulator with MAME's M88K emulator and then replacing a PowerMac's Davidian 68K emulator with e.g. the PCE e68000 emulator. And the reason is simple: the PowerMac ROM contains a significant amount of PPC code (e.g. Colour QuickDraw and the Memory Manager, plus the Mixed Mode Manager and a few other things), so we'd have to do a lot more to the ROM to run it as an RLC ROM, even if a real NuMac's ROM would be much closer in functionality to a PowerMac ROM.
 
Last edited:

Snial

Well-known member
NuMac Emulation Progress & The MiniMixedModeManager (MinxedMan)
So, I've made a bit of progress on the NuMac Emulation. I've built a new copy of the PCE Mac Plus emulator, but more importantly, I now have a working M88K gcc compiler (3.3.6) via the GXemul emulator, which emulates the Luna88K, one of the few Motorola 88K computers. I can also transfer data in and out of it via an MSDOS FAT disk image.

It's a bit crude, obviously. I had tried some other approaches and I'll take you through a bit of that pain first and why I ended up doing it this way.

GCC supported the m88k fairly early on; at the time it was one of the important new architectures that made gcc more generically portable and therefore more usable, primarily because it was the only alternative to the official (non-free) Motorola compiler. I thought gcc still supported the m88k, but in reality support was dropped after version 3.3.6, sometime in the Noughties. I only know all this because someone is building an M88K LLVM back-end.

I grabbed V3.3 from GitHub (I'm not sure if it was 3.3.6) and binutils 2.16. I struggled to get it built on my Mac mini under macOS Catalina (Gcc 4.7 or so?): the source kept failing to #include <stdlib.h> because that was dependent on some defines that weren't defined under my macOS environment after ./configure some_parameters and even after I sorted that it then complained at the link stage because it didn't seem to be able to link in .dylibs. So, then I thought, OK, I'll install Linux as a VirtualBox.org VM on my Mac. I need to learn quite a bit more about virtual machine environments, so it's a good education.

I got a bit further with that - I could compile binutils 2.16 for m88k-coff (but not m88k-elf), but I couldn't compile gcc 3.3 for either (m88k-none-coff nor m88k-none-elf). So, if anyone knows how to do that from source, or even how to do it under macOS Catalina, that'd be welcome. I followed the general process for building cross-compilers from https://www.niksula.hut.fi/~buenos/cross-compiler.html, but I also used a page specifically about compiling gcc for macOS.

Because I wasn't quite sure whether I needed a yet earlier version of gcc and, if so, which binutils was compatible and what the object format was, I figured that it was better to go the emulation route, because then I'd be able to find out directly. And that turned out to be fairly easy. The GXemul emulator is small (just 6MB of compressed source) and simple to configure and make; it was pretty simple to run it with a pre-built OpenBSD image, and once I created a FAT16 disk image without any partitions using DiskUtil, I could get GXemul to mount it, so now I can transfer data in and out.

So, now I can start work on compiling PCE's 68K emulator (or another standalone, more efficient 'C' emulator if I find it). I'll need to add some basic code for booting (but the M88K really does boot from 0), but the emulation environment itself is relatively simple if I ignore M88K exceptions and run it with the MMUs off. I still have to map M88K interrupts to 68K interrupts, but I don't think that's too hard. It means I could in theory support System 6 or System 7, since I'll be emulating 24-bit mode code.

Now onto the next bit:

The MiniMixedModeManager (MinxedMan)

The PPC Mixed Mode Manager is actually fairly complex, far more complex than I would be able to recreate for the M88K demo. For 100% 68K emulation (the initial case) it's not too bad, because we won't be able to execute native code at all! But as soon as we can execute native code, even in terms of M88K pre-built libraries as proposed earlier, some kind of mini-mixed-mode manager is needed.

To make this feasible at a hobby development level, MinxedMan needs to grossly simplify the model. That's because the PPC Mixed Mode Manager probably took a number of years to become fully comprehensive (i.e. recreating all the new Toolbox headers) and I can't spare that kind of time to reimplement the same system (even if I could use the same headers, though that's a thought). Also, the NuMacs are intended to be earlier systems, so they would be less complex anyway. And there are always multiple engineering solutions to any one problem, with different trade-offs: although the only thing I initially want to demonstrate is being able to execute pure native functions (leaf functions that don't switch back to the emulator), I need to consider what the model would be if it were comprehensive.

A while back I figured that an easy way to implement 68K-to-M88K mode switches would be to place odd jump addresses in the jump table. The jump table is set up by the segment loader, so for M88K native code support the segment loader is changed to look for corresponding CdeR code resources and substitute them, filling in odd jump addresses as appropriate.

The 68K emulator would generate an address exception, and that exception would trigger the mode switch. This works because the M88K generates alignment errors on 16-bit and 32-bit data loads from odd addresses, and of course all 68K code is just data to the M88K. So this kind of Mixed Mode Manager can use the odd-address technique to switch from 68K to M88K; and (from what I've read) M88K calls or jumps to an odd address just zero the bottom 2 bits, so the tag is harmless to native code (there's no unaligned-address exception for code on the M88K).
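The odd-address tagging could be tested in the emulator's jump handler along these lines (a sketch; the bit-0 convention is just the proposal above, not an existing ABI):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Jump-table entries with bit 0 set are treated as M88K native entry
 * points; even addresses are interpreted as 68K code, per the
 * proposal above. */
static bool needs_mode_switch(uint32_t jump_target)
{
    return (jump_target & 1u) != 0;
}

/* Clear the tag bit to recover the real (word-aligned) native entry;
 * the M88K hardware would effectively do this anyway on a jump. */
static uint32_t native_entry(uint32_t jump_target)
{
    return jump_target & ~1u;
}
```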

However, it turns out there's a better way by using the M88K's MMU (in both VM and physical addressing mode). Let's consider how it works with VM on. The technique is simply that an M88K routine placed in the jump table (or as a ToolBox routine) would cause a page fault if the emulator tried to interpret it, because it would have to be execute-only on the M88K. And similarly, a 68K routine placed in the jump table (or ToolBox) would cause M88K execution to page fault for the opposite reason: it can't jump to 'code' in a data page. So, we diagnose the page fault to determine if we need a mode switch.

So, you don't need unaligned function pointers - bit 0 is marked safe!! You just need to assign M88K CdeR segments in application code space.
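A sketch of the fault diagnosis; the page kinds and fault inputs here are invented for illustration (the real 88200 has its own fault-status registers):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-page permissions in this sketch: 68K code is data to the M88K,
 * and M88K CdeR pages are execute-only. */
typedef enum { PAGE_68K_DATA, PAGE_M88K_EXEC } PageKind;

typedef enum { MODE_68K_EMULATED, MODE_M88K_NATIVE } ExecMode;

/* Does this page fault mean "switch ISA and retry"? The emulator
 * *reads* instructions as data, so hitting an execute-only page means
 * it has reached native code; the native CPU *fetches* instructions,
 * so fetching from a data page means it has reached 68K code. */
static bool fault_is_mode_switch(ExecMode mode, PageKind page, bool was_ifetch)
{
    if (mode == MODE_68K_EMULATED && page == PAGE_M88K_EXEC && !was_ifetch)
        return true;   /* emulated -> native switch */
    if (mode == MODE_M88K_NATIVE && page == PAGE_68K_DATA && was_ifetch)
        return true;   /* native -> emulated switch */
    return false;      /* a genuine access error / normal page fault */
}
```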

This technique would be very versatile: you wouldn't necessarily need UPP routine descriptors and it could work for all ISA switches for callbacks, VBLs, Toolbox patches etc. There's still the issue of mapping the ABI (which routine descriptors handle too) for the different ISAs, but I'll come to that shortly.

There are two significant downsides to the technique:
  1. The application heap for a Mixed-Mode application could be quite a bit bigger, because all CdeR resources have to sit on distinct physical (and virtual) pages, especially since in this system there's no file mapping. This, in fact, can be largely solved using what I'm calling Logical Memory.
  2. In the PPC Mixed Mode Manager, there's a lot of SDK (and OS) support to make it easier to write PPC applications, convert (or partially convert) them to PPC, or write applications that compile transparently for both targets (hence the PPC comments in the "Would like to development a new THINK Pascal App for 68K" topic). In this system, because of the use-cases and practicality, relatively little M88K code would be written, so a higher burden is put on the application developer instead. I don't intend to fix this problem.
Logical Memory

So, the way to overcome the extra physical memory usage is to use the MMU regardless of whether VM is turned on or off. The MMU is on in either case (which may also be true for 68030 and 68040 Macs even when VM is turned off), because it's very handy for remapping memory arbitrarily, e.g. covering up gaps in the memory map. It doesn't really matter if that's not what the ≥68030 ROM and System 7 actually do; the M88K can do it. When VM is off, there's a logical-to-physical translation where the logical address space = the physical address space (plus extra entries to handle I/O and ROM space).

In 680x0 System 7 VM, as far as I understand, the logical address space is organised the same way as physical memory is when VM is turned on, except the logical address range is the size of your VM swap and physical pages are a cache of the current virtual pages (except for the system heap which must be in physical memory).

The trick here is that even if all these rules apply, your logical address space can be larger than your virtual address space (i.e. swap space). For example, your VM manager could allocate applications on 1MB boundaries in the logical address space leading to an average of 0.5MB gaps between applications (or more if the average application size was <0.5MB). And this is OK as long as the total number of logical pages actually used < the VM memory size. To take a very trivial example, let's say the VM memory size is 8kB (2 pages) and we have two tiny 4kB apps: Hello and Bye. Hello is loaded first and is assigned the logical address space 0x300000 to 0x3FFFFF, but its SIZE resource says just 4kB so it's mapped to the first VM page. Bye is loaded second and is assigned the logical address space 0x400000 to 0x4FFFFF, with SIZE=4kB and therefore mapped to the second VM page. There are big logical address gaps that just can't be used by the applications. The page directories and tables would reflect this: entries that are invalid.

So: logical address space ≥ virtual address space ≥ physical address space.
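The Hello/Bye example above reduces to a tiny mapping rule (addresses as in the example; everything else is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE    0x1000u     /* 4kB pages */
#define APP_STRIDE   0x100000u   /* apps placed on 1MB logical boundaries */
#define LOGICAL_BASE 0x300000u   /* first app's logical base, as in the example */

/* Logical base address of the nth loaded application (Hello = 0, Bye = 1). */
static uint32_t app_logical_base(unsigned n)
{
    return LOGICAL_BASE + n * APP_STRIDE;
}

/* Which VM (swap) page backs a given app, for apps that each fit in a
 * single 4kB page, as Hello and Bye do. Unused logical pages inside
 * the 1MB gaps simply have invalid page-table entries and cost no
 * swap space at all. */
static uint32_t vm_page_for_app(unsigned n)
{
    return n;   /* Hello -> VM page 0, Bye -> VM page 1 */
}
```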

So, the technique for mixed M88K / 68K apps is to play around with the logical application heap size so that M88K CdeR segments can be kept all together (and thus save space by not being aligned on 4kB pages); and also kept separate from 68K CODE segments.

A mixed 68K / M88K application could do this as follows. It defines minimal and preferred SIZE resources as per normal, but it also internally defines minimal and preferred SIZEs for the 68K and M88K code requirements of the M88K execution environment. For example, a pure 68K application might need at least 256kB for code in the application heap (though there's 384kB of code in total), while the mixed environment might need 192kB for 68K code and from 64kB to 128kB for M88K code (giving a total of 512kB). So, if you're running it on an M88K, the logical space assigned for M88K code is 128kB (which could be assigned after the Master Pointers or perhaps even before), and the 68K application heap space follows.

Because the M88K code is in its own logical space, its code segments can be purged (down to 64kB) to make room for 68K code (e.g. UI code) and vice-versa (e.g. for some high-speed rendering) without the total exceeding the application heap size at any one point and without M88K code segments colliding with 68K code segments or without M88K code segments wasting parts of 4kB pages.

Consider a case where the M88K needs to load a code segment, but the application heap is 'full' and contains M88K or 68K code and data resources that can be purged. The memory manager performs basically the same actions as normal: it would first purge resources for its own ISA (in this case purging code segments not in the current call stack) and, if there's still not enough space for the code segment, reallocate any possible fragmented logical pages; then, if there's still not enough space, apply the same algorithm to 68K code segments (or other resources) until space is found for the needed M88K code segment. Thus the memory manager and segment loader have the same kind of behaviour, but are slightly more flexible in that they can reduce some wasted memory due to fragmentation.
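The purge order just described, as a sketch; the resource record and the two-pass loop are my invention for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RES_M88K_CODE, RES_68K_CODE } ResKind;

typedef struct {
    ResKind kind;
    bool    in_call_stack;  /* can't purge segments we're executing */
    bool    purged;
    size_t  size;
} Resource;

/* Purge resources of the requesting ISA first (pass 0), then the other
 * ISA (pass 1), until `needed` bytes are free. Returns true on success.
 * The real memory manager would also coalesce fragmented pages between
 * passes, which this sketch omits. */
static bool purge_for(Resource *res, size_t n, ResKind own_isa, size_t needed)
{
    size_t freed = 0;
    for (int pass = 0; pass < 2 && freed < needed; pass++) {
        ResKind target = (pass == 0) ? own_isa
            : (own_isa == RES_M88K_CODE ? RES_68K_CODE : RES_M88K_CODE);
        for (size_t i = 0; i < n && freed < needed; i++) {
            if (res[i].kind == target && !res[i].in_call_stack && !res[i].purged) {
                res[i].purged = true;
                freed += res[i].size;
            }
        }
    }
    return freed >= needed;
}

/* Demo: an M88K load of 8kB purges the idle M88K segment and nothing else. */
static bool purge_demo(void)
{
    Resource r[3] = {
        { RES_68K_CODE,  false, false, 4096 },
        { RES_M88K_CODE, true,  false, 4096 },   /* in call stack: protected */
        { RES_M88K_CODE, false, false, 8192 },
    };
    return purge_for(r, 3, RES_M88K_CODE, 8192)
        && r[2].purged && !r[1].purged && !r[0].purged;
}
```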

It's the programmer's job to make sure the application still functions given these more fluid constraints, and there's a bit of extra VM complexity (possibly an extra layer of page mapping), but that's small (I think) compared with a more comprehensive PPC-style Mixed Mode Manager. It would handle VBL and Time Manager tasks and callbacks the same way (you lock code segments at the top or bottom of the application heap(s) as needed).

There's one more thing to take into account: ABI differences. This technique doesn't have the same kinds of UPPs or Routine Descriptors to handle conversions, but the same issue still exists. The solution here is that for routines that can be called from a different ISA, the programmer provides the conversion by writing a function that handles it:

C:
#include <stdint.h>

typedef void (*MixProc)(void);
typedef long tMyReturnType;                        // example return type
typedef struct { uint32_t d[8], a[8]; } t68KRegs;  // saved 68K context

tMyReturnType MyFuncM88K(char *aSrc, char *aDst, uint32_t aLen);
typedef tMyReturnType (*tMyFunc)(char *aSrc, char *aDst, uint32_t aLen);

void MyFuncM88KAbi(t68KRegs *aRegs, MixProc aProc);

// A manually defined routine descriptor: entry 0 is the native routine
// (in the real layout this slot could be an actual bsr instruction so
// that the descriptor address itself is callable), entry 1 is the ABI
// translator, always 4 bytes later.
const MixProc gMyFuncMix[2]={
    (MixProc)&MyFuncM88K,
    (MixProc)&MyFuncM88KAbi
};

// Now the function looks like a proper function, but it's not.
#define MyFunc(aSrc, aDst, aLen) (((tMyFunc)gMyFuncMix[0])((aSrc), (aDst), (aLen)))

/**
 * M88K to M88K call, indirectly via gMyFuncMix
 * or just directly.
 */
tMyReturnType MyFuncM88K(char *aSrc, char *aDst, uint32_t aLen)
{
    // action blah blah blah.
    tMyReturnType myReturnValue = 0;
    (void)aSrc; (void)aDst; (void)aLen;
    return myReturnValue;
}

/**
 * Programmer has to manually translate the ABI, most easily by
 * debugging the 68K version to see how all the parameters are
 * allocated: here a0 = aSrc, the second long on the stack = aDst,
 * d0 = aLen.
 */
void MyFuncM88KAbi(t68KRegs *aRegs, MixProc aProc)
{
    tMyReturnType ret;
    ret = ((tMyFunc)aProc)((char *)(uintptr_t)aRegs->a[0],
                           ((char **)(uintptr_t)aRegs->a[7])[1],
                           aRegs->d[0]);
    aRegs->d[0] = (uint32_t)ret;    // return ret in d0.
    aRegs->a[7] += sizeof(long);    // deallocate stack.
}

Obviously you'd factor common ABI translations. The way it works is that when MinxedMan is invoked, it calls the second entry instead of the first (which is always 4 bytes later), passing the other ISA's context and the target routine address as parameters. The second entry unpacks the other ISA's parameters, then calls the target routine. Note: because the target routine is passed as the second parameter, the same ABI converter can be used for different routines where the same conversion is needed.
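The dispatcher side of that might look like this (a self-contained sketch with its own minimal `t68KRegs`; the demo shim just forwards the call rather than unpacking a real 68K frame):

```c
#include <assert.h>
#include <stdint.h>

typedef void (*MixProc)(void);
typedef struct { uint32_t d[8], a[8]; } t68KRegs;   /* illustrative */

typedef void (*AbiProc)(t68KRegs *regs, MixProc target);

/* A descriptor is two entries: [0] = native routine, [1] = ABI shim.
 * When MinxedMan fields a 68K->M88K switch, it calls entry [1],
 * handing it the 68K context and entry [0] as the real target. */
static void minxedman_dispatch(const MixProc *descriptor, t68KRegs *regs68k)
{
    AbiProc shim = (AbiProc)descriptor[1];
    shim(regs68k, descriptor[0]);
}

/* Demo: a native routine and a shim that records the call. */
static uint32_t g_result;
static void demo_native(void) { g_result = 42; }
static void demo_shim(t68KRegs *regs, MixProc target)
{
    (void)regs;     /* a real shim unpacks regs/stack here */
    target();
}

static uint32_t run_demo(void)
{
    const MixProc desc[2] = { demo_native, (MixProc)demo_shim };
    t68KRegs regs = {{0}, {0}};
    minxedman_dispatch(desc, &regs);
    return g_result;
}
```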

I think that covers this concept. I'll have goofed up some details about the Segment Loader, Memory Manager and Mixed Mode Manager, but I think the concepts are probably sound and I'll correct errors in due course (likely, assuming people read this and spot obvious errors).

Thanks for trawling through this ongoing fiction!
 
Last edited:

Snial

Well-known member
Here's a minor update. I've been reading some of the BitSavers BlueBook documentation, which covers, amongst many interesting things, 32-bit QuickDraw, the Universal ROM and MacVM.


It also covers a few interesting snippets, like the occasional use of 'Nu' as a prefix for a hardware or software product, and a minimal NetBooting Mac, especially with regard to educational environments. It's interesting to see some of my assumptions partially validated by Apple design documentation from the very late '80s.

Gary Davidian crops up a number of times, particularly in the section on the Universal ROM (Vol 2). He also improved the trap dispatch mechanism.

Finally, I saw some proper confirmation that the System 7.0 (and 7.1) VM implements the CLOCK algorithm. This is basically the simplest VM page-replacement algorithm: effectively a ring buffer. Physical memory pages are just used in sequence whenever a new VM page needs swapping in; the oldest physical page is paged out first to make room (and written back to the swap file if it's changed). When the head pointer gets to the last physical page used by VM, it wraps back to the first physical page used by VM.
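As described, that's a rotating hand over the physical page pool; here's a minimal sketch (a fuller CLOCK implementation would also skip pages whose reference bit is set, which this bare ring omits):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PHYS_PAGES 4   /* illustrative pool size */

typedef struct {
    int  vm_page;   /* which VM page currently occupies this frame */
    bool dirty;     /* needs write-back before reuse */
} Frame;

static Frame frames[NUM_PHYS_PAGES];
static int hand;    /* the clock hand: next frame to reuse, wraps around */

/* Bring vm_page in, evicting whatever the hand points at.
 * Returns the frame index used. */
static int page_in(int vm_page)
{
    int f = hand;
    if (frames[f].dirty) {
        /* write frames[f].vm_page back to the swap file here */
        frames[f].dirty = false;
    }
    frames[f].vm_page = vm_page;
    hand = (hand + 1) % NUM_PHYS_PAGES;   /* wrap to the first frame */
    return f;
}

/* Demo: five page-ins walk frames 0..3 and then wrap to frame 0. */
static bool clock_demo(void)
{
    bool ok = true;
    ok &= (page_in(10) == 0);
    ok &= (page_in(11) == 1);
    ok &= (page_in(12) == 2);
    ok &= (page_in(13) == 3);
    ok &= (page_in(14) == 0);             /* hand wrapped */
    ok &= (frames[0].vm_page == 14);      /* page 10 was evicted */
    return ok;
}
```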
 

Snial

Well-known member
A while back I was trying to estimate the SPECint92 and SPECfp92 performance of a NuMac R41/25 running native code, which would differ from a normal MC88110 as it only has a 4kB instruction + 4kB data cache instead of 8kB each. Here I try to use my copy of Computer Architecture: A Quantitative Approach, 2nd edition, to help. In the chapter on memory hierarchy, it provides miss rates for caches of various sizes on the SPEC92 benchmarks. Here are the miss rates from the relevant part of the table:

Size   Instruction Cache   Data Cache   Unified Cache
4kB    1.78%               15.94%       7.24%
8kB    1.10%               10.19%       4.57%

Oddly enough, even though the miss rate of an 8kB unified cache is lower than that of the 4kB data cache, data accesses are far less frequent than instruction accesses, so a 4kB + 4kB split is usually better than an 8kB unified cache (though in my calculations below, it's actually slightly worse).
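We can sanity-check that claim with the standard per-access weighting from Hennessy & Patterson, assuming the same 4-instruction-fetches-per-1-data-access mix I use later in this post (note this weighting differs slightly from the per-instruction figure I compute further down):

```python
# Split 4kB+4kB vs unified 8kB, weighted per memory access.
# Miss rates are the SPEC92 figures from the table above; the 80/20
# instruction/data mix is an assumption (4 fetches per data access).

I_FRACTION = 0.8   # 4 of every 5 memory accesses are instruction fetches

def split_miss_rate(i_miss, d_miss):
    """Effective miss rate of a split I/D cache pair, per memory access."""
    return I_FRACTION * i_miss + (1 - I_FRACTION) * d_miss

split_4k = split_miss_rate(0.0178, 0.1594)   # 4kB I + 4kB D
unified_8k = 0.0457                          # single 8kB unified cache
print(f"split 4kB+4kB: {split_4k:.4%}, unified 8kB: {unified_8k:.4%}")
# The split pair comes out at ~4.61%, just a shade worse than unified.
```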

I estimate that a 25MHz MC88110 (4kB+4kB caches) will come in lower than simply scaling down a 40MHz 88110, which could manage 37.8 SPECint92 and 50.5 SPECfp92 (with a 256kB L2 cache, I think). So, it's lower than 23.6 SPECint92 + 31.6 SPECfp92, because the missing L2 cache will hurt performance. From "Computer Architecture", the miss rate of a 256kB level-2 cache is about 0.9% (L2 caches are unified, so this is read off against the unified cache sizes). 1992 DRAM had a cycle time of 120ns, with RAS=80ns and CAS=15ns, so the remaining latency is 120-80-15=25ns.

The cache line size on the MC88110 is 8 words = 32 bytes, so a 4kB cache has 128 lines. It's reasonable to assume an L2 cache is designed to keep up with the bus interface on an MC88110 and that a cache refill uses a burst transaction:
1722637700111.png
I'll assume that the L2 cache can deliver 8 words in 10 cycles at 40MHz (25ns cycle time), thus the average access time is 25ns*10/8=31.25ns; and that the DRAM access to refill 4 words is 120ns + 3*(15+25)ns = 240ns, giving an average access time of 60ns per word. If the miss rate is 0.9% for a 256kB cache, then the average access time for a 40MHz MC88110 is 0.009*60ns+(1-0.009)*31.25ns = 31.5ns, almost identical to the ideal of 31.25ns. So during an L1 miss, the CPU runs at 25ns/31.5ns = 79.4% of full speed.

For a 25MHz CPU, the ideal average access time would be 8 words in 10 cycles at 25MHz (40ns cycle) => 50ns per word. In reality, for the refills that come from DRAM, it'll be 60ns. Since about 25% of instructions access memory, the average miss rate from the L1 caches will be 1.78%+15.94%/5=4.968%. It's "/5" because in total there are 4 instruction fetches + 1 data access, hence data is 1 in 5 accesses. This compares with 1.10%+10.19%/5=3.14% for the 8kB+8kB caches on a real MC88110. For a real MC88110, 96.86% of the time the speed is 100%, and for 3.14% of the time it's 79.4% => 99.4% of the ideal performance. This is why caches are good!

For the 4kB + 4kB cache version, 95% of the time the speed is 100%, and for 5% of the time it's 40ns/60ns = 67%, so the overall speed is 98.4% of the ideal. This makes the R41/25 98.4/99.4, about 98.9%, of the scaled-down performance, which again is very good. The performance would be about 23.6 * 0.989 = 23.3 SPECint92 and 31.6 * 0.989 = 31.3 SPECfp92. Please bear in mind I may have goofed on the maths, but it seems fairly reasonable to me - and note, this is for native, not emulated code.
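For anyone who wants to check or tweak the arithmetic, here's the whole estimate as a short Python script. All inputs are the figures quoted above; the last digit of the results may differ slightly from my hand-rounded numbers:

```python
# Re-running the back-of-envelope SPEC estimate from the post.

# 40MHz MC88110 with 256kB L2:
l2_avg   = 25 * 10 / 8                # 8-word burst in 10 cycles @25ns -> 31.25ns/word
dram_avg = (120 + 3 * (15 + 25)) / 4  # 4-word fast-page refill, 240ns -> 60ns/word
l2_miss  = 0.009                      # miss rate of a 256kB L2 cache
mem_avg  = l2_miss * dram_avg + (1 - l2_miss) * l2_avg  # ~31.5ns
slow     = 25 / mem_avg               # speed during an L1 miss vs 25ns cycle: ~79.4%

# L1 miss rates per instruction, as in the post (4 fetches : 1 data access)
miss_4k = 0.0178 + 0.1594 / 5         # 4kB+4kB caches -> 4.968%
miss_8k = 0.0110 + 0.1019 / 5         # 8kB+8kB caches -> ~3.14%

perf_8k = (1 - miss_8k) + miss_8k * slow       # ~99.4% of ideal (real MC88110)
perf_4k = (1 - miss_4k) + miss_4k * (40 / 60)  # 25MHz: 40ns cycle vs 60ns DRAM
ratio   = perf_4k / perf_8k                    # ~98.9-99.0%

print(f"SPECint92 ~= {23.6 * ratio:.1f}, SPECfp92 ~= {31.6 * ratio:.1f}")
```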

The main reason it's not as bad as it might be is that there's less of a discrepancy between a 25MHz CPU and 120ns DRAM than between a 40MHz CPU and 120ns DRAM. I assume the R41/25 uses fast-page mode.

Going back to comparing the R41/25 with Intel computers of the time:

CPU          I/D+L2    SpecInt92   Ratio    SpecFp92   Ratio    Date
i486DX/33    8+128     18.2        78.1%    8.3        26.5%    Sep 92
i486DX2/50   8+128     25.7        110%     12.2       39.0%    Mar 93
R41/25       4/4+0     23.3        100%     31.3       100%     'Apr 92'

In other words, the R41/25 base model would have competitive performance: 28% faster than a 486DX/33 despite the lack of L2 cache, and nearly 4x faster for floating-point arithmetic, making it a great student computer for science and maths.

In a future post I might revisit QuickDraw performance for the NuMac R41/25. In my earlier post I just assumed I could scale PowerPC graphics performance down to MC88110/25MHz speeds, but I didn't know at the time that the MC88110 actually has a dedicated Graphics Execution Unit, which might well speed up QuickDraw. I'm not sure how to estimate this though - I guess by comparing PowerPC snippets for core routines with the MC88110 + Graphics Unit equivalents.
 

Snial

Well-known member
Periodically I still think about M881x0-based Macs, the NuMacs I've been alluding to. And of course, this thread is essentially me using the 68KMLA forum topic as a blog (sorry). In the fantasy history they get launched in summer 1992, and there are a few questions I keep coming back to, namely:
  1. Can the NuMacs compete on relative performance and price against PC compatibles?
  2. Can the NuMacs sit alongside the actual historical Mac lineup?
  3. Is the NuMac 41/25 practical?
I want to make a few more notes on that here.

Competitiveness

It's worth noting that in mid-1992, PCs were still pretty expensive. I took a look through Byte Magazine archives from around that time, when Gateway were the new, cheap-end kid on the block and found an advert containing this desktop lineup:

[image: Gateway desktop lineup advert]


The LC II, which had been launched around that time, was comparable to the 25MHz 386SX (without the monitor and maybe the bundled software). The NuMac 41/25 is intended to be at that kind of price point, but obviously it'd deliver a SpecMark89 of about 23 vs a SpecMark89 of 3.4 for a 25MHz 386DX (and maybe 3.0 for a 25MHz 386SX). So, that model has performance comparable with a DX2/50 (and much higher FPU performance) for a lower price.

Similarly, the NuMac 61/40 would provide performance comparable with a DX4/75 and significantly better than a DX2/50, and those machines were still in the future. At the same time, a NuMac 61/40 would be expensive, upwards of $6000, and significantly slower than a Pentium 60, which was going to be launched a year later. But the NuMac 61/40 also had an MP version with a second CPU and glue ASIC (fitted to the two empty sockets).

Market Positioning

Both NuMacs achieve competitiveness and fit within the market by being less general-purpose than conventional Macs or PCs, namely by having a slice-oriented physical design. In both cases this means a base unit with only: a Duo-type docking connector (with no motorised retractor); video out; basic audio in/out; [Ethernet on the NuMac 61/40;] 1 serial port; and 1 ADB port. Bringing them up to Quadra-style facilities increases the price massively. In addition, for normal operation they're slower than comparable Macs, due to emulation performance hits. These machines are designed for scientists and university students who need access to local computing power and non-local resources.

Practicalities

Firstly, how does a NuMac Dock work if it's not motorised? The simplest answer IMHO is that the docking connector includes the power input, so you can't power the NuMac base unit and then connect it to the Dock (because the NuMac's power cable would get in the way). Also, the power switch on the Dock locks the connector cover in place when in the on position, so you can't power the Dock and then try to plug in the NuMac. Switching it off releases the lock, allowing the cover to fold back; then you connect the NuMac to the Dock and switch it on.

I thought a bit about the NuMac 41/25 workflow in previous posts and then started to question the practicality. The basic idea is that a student would carry their NuMac 41/25 slice from their college to home for vacations or reading weeks (as they're called in the UK, does that term exist in the US?). Because it has no built-in storage, and the internet wasn't generally available in the early 90s, my initial idea was to maintain a battery-backed RAM image of the Mac, but purge code segments in order to store entire documents (or projects) on the computer and compress them too. However, I figured that might be somewhat restrictive, because purging code segments might only free up a few hundred KiB per app when VM might be set to 8MB.

But I then realised that users could store data on floppy disks, because there are likely to be floppy drives attached to the servers the R41/25 would be connected to in a lab, or in the student's accommodation common room. Here, they just bring up a backup disk image from the server; copy all the data or projects they need to it; physically walk to the shared floppy drive; and insert as many floppy disks as needed to copy the data. Then, back at their parents', they do the reverse operation, inserting the floppy disks in the home Mac, whose AppleTalk server then pops up a disk image containing the data for the R41/25 to use.

So, this gives a general-purpose solution and because disk images and AppleTalk servers were easy under System 7.x, entirely feasible for the day.

However, I then realised that if a user wants to store all their data in battery-backed RAM, they can do much better than just purging code segments. They could do the same operation where the remote server mounts the disk image and the user copies all they want to it; then, when they've finished, the entire RAM on the R41/25 is purged, leaving just a RAM disk image and a RAM-disk-over-AppleTalk driver, which the remote server fills with the compressed disk image first, before offloading any remaining data to real floppy disks. Because a 2:1 compression ratio can often be achieved, this gives the user around 6MB to 8MB of usable space.

Back at their parents', the reverse operation happens again: the R41's local RAM disk image is offloaded back to the home Mac, then the R41 reboots over AppleTalk straight into the home Mac's 68K-based System 7.

In other news, I did manage to contact Gary Davidian and he replied, once! I might have another try at some point in the future if I manage to get my M88K emulated SE going!
 

olePigeon

Well-known member
I think the image is from the CHM itself and perhaps can be seen in Video 2, so, er, maybe not!

Meanwhile, I quite like this as the NuMac design:

View attachment 71944
Four-slice unit. Ethernet, AV, 40MHz Second Processor, Power PSU, Ram Bank and Cache upgrades (for both CPU slices) also available.
I feel like with 3D printing and homebrew hardware being so affordable these days, we could actually make functional versions of these design studies. Also ... that would have to be an 8 cm Mini-CD version of the CD 300, wouldn't it?
 