
IIsiFPGA: HDMI for the 68030 PDS slot

Melkhior

Well-known member
As for the current 'publicness': the hardware & gateware should be up-to-date on GitHub; the declaration ROM isn't, but it is near-identical to the one from the NuBusFPGA (I need to do some clean-up work so they can share sources and avoid code duplication); the patch for the ROM to support 'memory expansion' isn't public yet, but it will be when I find the time.
 

Bunsen

Admin-Witchfinder-General
Also it could be worth investigating solutions to use SDRAM as expansion memory for '030 systems, with the SDRAM chips directly connected to the '030 bus

This would certainly be a boon for the IIfx, given the rarity of the proprietary 68 pin RAM SIMMs for that model. Especially if the SDRAM could run at the CPU clock speed.
 

Phipli

Well-known member
Also it could be worth investigating solutions to use SDRAM as expansion memory for '030 systems, with the SDRAM chips directly connected to the '030 bus

This would certainly be a boon for the IIfx, given the rarity of the proprietary 68 pin RAM SIMMs for that model. Especially if the SDRAM could run at the CPU clock speed.
Don't forget the IIfx PDS has a different pinout :)

Just a warning to prevent accidents
 

Melkhior

Well-known member
This would certainly be a boon for the IIfx, given the rarity of the proprietary 68 pin RAM SIMMs for that model. Especially if the SDRAM could run at the CPU clock speed.
Commonly available SDRAM runs at up to 133 MHz, but the question is the initial delay - if a burst read can be serviced as 2-1-1-1 and a write in 3 cycles, it's basically a cache. Even 3-1-1-1 reads / 3-cycle writes would be excellent. More initial wait cycles might be needed in practice, but X-1-1-1 won't be an issue with SDRAM for some X, probably reasonably low.
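
For a rough sense of scale, here's a back-of-the-envelope sketch (my own numbers, assuming a 20 MHz '030 bus like the IIsi's, i.e. 50 ns per clock, and the '030's four-longword burst fill):
Code:
/* Back-of-the-envelope only: peak throughput of an X-1-1-1 burst of four
 * longwords (16 bytes), assuming a hypothetical 20 MHz '030 bus clock. */
#include <stdio.h>

int main(void)
{
    const double clock_ns = 50.0;                 /* 20 MHz -> 50 ns per clock */
    for (int x = 2; x <= 5; x++) {
        double burst_ns = (x + 3) * clock_ns;     /* X + 1 + 1 + 1 clocks      */
        printf("%d-1-1-1: %.0f ns per burst, ~%.0f MB/s peak\n",
               x, burst_ns, 16.0 / burst_ns * 1000.0);
    }
    return 0;
}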

The IIfx would be a likely candidate, as are some of the LC-class machines that have limited memory expansion capability.

But the IIfx is a weird beast: the PDS slot isn't actually 'direct' to the processor, some signals go through intermediate chips, which changes the electrical properties of those signals. And there's some timing trickery going on, as the speed of the PDS changes depending on the address - and of course, no word on what happens when you use addresses in the memory range... and while you do get the half-speed 20 MHz clock, you don't get the full-speed 40 MHz one!

Don't forget the IIfx PDS has a different pinout :)
The pinout in terms of which signal goes to which pin is very similar (the clock is on a different pin, and only a handful of signals are changed); it's really all the other properties, much harder to deal with, that are different :-( (I did try to make the IIsiFPGA 'IIfx-compatible', but I don't have a lot of confidence).
 

Phipli

Well-known member
The pinout in terms of which signal goes to which pin is very similar (the clock is on a different pin, and only a handful of signals are changed); it's really all the other properties, much harder to deal with, that are different :-( (I did try to make the IIsiFPGA 'IIfx-compatible', but I don't have a lot of confidence).
Sounds like you're already on top of it :)

I don't have an fx, so haven't really paid attention, do I remember hearing that some of the power pins are different?

I'm not very familiar with the fx at all.
 

Melkhior

Well-known member
I don't have an fx, so haven't really paid attention, do I remember hearing that some of the power pins are different?
I don't have one either (that's why it's untested), but I looked in detail at the pinouts of the IIsi/SE/30/IIfx to try and make the IIsiFPGA compatible - though with the current going prices of the IIfx, the odds of me ever getting one are pretty much zero.

I don't think any of the power pins (as in VCC of various voltages) are different, though IIRC some ground pins are - but their alternate use is for signals for which being grounded is a reasonable default.

I would definitely double-check the pinouts before turning on a IIfx with a IIsiFPGA anyway; shorting a IIfx would be an expensive and almost sacrilegious mistake :)

Sounds like you need an MC88920 :)
That would be the old-school way, yes :) But FPGAs have built-in PLLs, so regenerating a 40 MHz signal from the 20 MHz clock should be easy when using an FPGA. I suspect that to support memory, the real issue is what happens with the PDS signals when you address the unused memory range: full-speed synchronous access (good!), slow-speed synchronous access (can be lived with), or something else, like the request not going through at all? Every other machine (including LCs) is easy because there's nothing in the way, but the IIfx is being a IIfx and has to be 'the special one' in every possible way...
 

Melkhior

Well-known member
Wait, you mean you're not doing it exclusively using 74 series logic!? :ROFLMAO:
Hehe, the sad part is I've spent enough time with '030 design documents in general and caches in particular that I could probably make a serious argument about why this would be unlikely to work timing-wise...
 

Bolle

Well-known member
some signals go through intermediate chips
It's more like some signals don't go through intermediate chips... the majority of signals are buffered and disconnected from the actual CPU bus.
The whole address and data buses, and most of the control signals as well. This makes it problematic (aka impossible) to access NuBus from a PDS bus-master card, because the arbitration logic that's built around a handful of PALs on the logic board doesn't know how to handle that case.
 

demik

Well-known member
Well, very very little testing so far, but at last System 7 is now booting with a reasonable amount of memory in my IIsi:

View attachment 59716
Almost 133 MiB of memory: 1 MiB from soldered bank A (minus the internal video), 4x1 MiB from SIMMs, and 128 MiB mapped in the DDR3 of the IIsiFPGA :) Boot takes extremely long due to the memory testing, but after that the machine doesn't feel any slower than before. I did implement burst mode using a dedicated port to the DDR3, so reads shouldn't be too terrible, but the latency is still likely higher than that of the 'real' memory. I haven't had time to do any benchmarking.

Just in time too, before I have to leave for some vacation ;-) I will have to see how far I can push the ROM (there's still another 120 MiB I could use: 256 MiB of DDR3 total, minus the 8 MiB for the framebuffer and the 128 MiB already mapped), and then share some code on GitHub for those interested (but this will have to wait for my return from vacation).

I suspect any '030 machine compatible with a IIsi ROM & with a '030 PDS or cache slot _should_ be amenable to a similar patch of the ROM to enable some additional bank(s?) of memory.

Impressive work! LC IIIs do have a socketed ROM, so maybe something is possible there as well.
 

Melkhior

Well-known member
Impressive work! LC IIIs do have a socketed ROM, so maybe something is possible there as well.
Yes - I discovered later that there's another SIMM socket on my board, empty, which is quite likely a ROM socket for a 1 MiB ROM as a substitute for the soldered one... no idea why I thought it didn't have one. So it should be feasible for the LC III as well, though it would require reverse-engineering another ROM to patch the relevant parts (the IIsi ROM was already somewhat documented, so it was a better starting point anyway). The hardware side is just a different form factor and pinout, a fairly easy redesign - the schematics should be identical otherwise. However, the LC III is much slimmer than the IIsi, so my FPGA daughterboard won't fit in the closed case :-(
 

Trash80toHP_Mini

NIGHT STALKER
It's more like some signals don't go through intermediate chips... the majority of signals are buffered and disconnected from the actual CPU bus.
The whole address and data buses, and most of the control signals as well. This makes it problematic (aka impossible) to access NuBus from a PDS bus-master card, because the arbitration logic that's built around a handful of PALs on the logic board doesn't know how to handle that case.
What's the functional status of the NuBus Adapter Card for the IIsi in that case? Support limited to slave-only cards?
Confused there - I got my Rocket up and running under RocketShare in the IIsi, and it's a bus-mastering card, no?

I'm probably way out in the weeds kinda confused again here? :oops:
 

Melkhior

Well-known member
During the vacation, as a new 'crazy project', I decided to have a shot at designing a co-processor in my FPGA for the 68030. The '020 and '030 have this nifty (if slow) co-processor interface. Not a traditional memory-mapped device, but a real co-processor defining new instructions that can be added to your code to do extra stuff.

So far it only does AES encryption in a not-very-efficient way, but to the best of my knowledge it's the first non-Motorola co-processor for the '030 (I currently use synchronous cycles; it should be easy to downgrade to asynchronous for the '020). Motorola did the 68851 MMU for the '020 (a simpler MMU is built into the '030), and the 68881/68882 FPUs for the '020 and '030. I don't know of any others - does anyone know better? Sun did their own MMU, but as far as I can make out from the Sun 3 schematics and the NetBSD kernel source, that was memory-mapped. Weitek did FPUs that were occasionally used with 68k (Sun's FPA comes to mind), but as far as I can tell they were also memory-mapped. No one back then seems to have bothered doing an '020/'030-only device using the co-processor interface. Memory-mapped with 32-bit addressing is a lot more generic and made a lot more sense, to be fair. The interface was dropped in the '040. Shame.

The nice thing about the co-processor interface is that you can use e.g. the AES instructions inline, as if they were part of the CPU core (same as the FPU instructions). So you can define a set of macros such as:
Code:
#define KAES32E0(x, y) asm("lea %1, %%a0\n"                \
               ".word 0xFC10\n.word 0x0000 + _num_%0\n" : "+d" (x) :  "m" (y) : "a0")
Here the F-line instruction (0xFC10) indicates a co-processor instruction, and the next word is sent to the co-processor. The co-processor can then request the CPU to send the memory content pointed to by the Effective Address (EA) in the co-processor instruction (here it's hardwired to %a0, as I wasn't sure how to produce the right opcodes for arbitrary effective addresses; the LEA takes care of that for me). It can also request the register (the number is passed as part of the co-processor instruction word), and then send the result back to the same register.
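
As a side note, here's how that F-line word decodes, assuming the standard '020/'030 cpGEN format (1111 | CpID | 000 | effective address) - the little sanity check below is my own annotation, not part of the actual code:
Code:
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint16_t op = 0xFC10;
    assert((op >> 12) == 0xF);        /* F-line: a co-processor instruction     */
    assert(((op >> 9) & 0x7) == 6);   /* CpID = 6 (0 is the MMU, 1 the FPU)     */
    assert(((op >> 6) & 0x7) == 0);   /* cpGEN instruction type                 */
    assert(((op >> 3) & 0x7) == 2);   /* EA mode 010: address register indirect */
    assert((op & 0x7) == 0);          /* EA register 0, i.e. (A0)               */
    return 0;
}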

Then you can use it to implement an AES round in C easily:
Code:
#define AES_ROUND1T(TAB,I,X0,X1,X2,X3,Y)        \
  {                            \
    X0 = TAB[I++];                    \
    KAES32E0(X0, Y);                    \
    X1 = TAB[I++];                    \
    KAES32E1(X1, Y);                    \
    X2 = TAB[I++];                    \
    KAES32E2(X2, Y);                    \
    X3 = TAB[I++];                    \
    KAES32E3(X3, Y);                    \
  }

Each instruction handles the full update of the relevant word from the four words in the array Y; they are basically merged versions of the RV32K AES instructions. For instance, the first instruction, KAES32E0, is equivalent to the RV32K sequence:

Code:
X0 = aes32esmi0(TAB[I++], Y0);
X0 = aes32esmi1(X0, Y1);
X0 = aes32esmi2(X0, Y2);
X0 = aes32esmi3(X0, Y3);

(The numbering means different things in the two ISAs: in RV32K it's the byte offset within the word, in my code it's which word of the array to start with; original RV32K code here.)
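
To make the calling convention concrete, here is a hypothetical call site (the variable names below are mine for illustration, not from the actual code; note that Y has to be an actual array in scope so that the "m" constraint in the macros names the state itself):
Code:
uint32_t y[4];           /* current AES state words, read by the co-processor */
uint32_t rk[4 * 11];     /* expanded key schedule (AES-128: 11 round keys)    */
int i = 4;               /* index of round 1's first key word                 */
uint32_t x0, x1, x2, x3;

/* ... expand the key into rk[], load plaintext ^ rk[0..3] into y[] ... */

AES_ROUND1T(rk, i, x0, x1, x2, x3, y);   /* one middle round: y -> (x0..x3) */
y[0] = x0; y[1] = x1; y[2] = x2; y[3] = x3;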

You can do a lot of powerful stuff with that interface - though the software side might quickly become a problem. For instance, in the example code above, the array Y is sent four times (once per instruction) despite the fact that it doesn't change. It could be sent just once, but then that creates extra state in the "CPU"... and that extra state needs to be saved/restored when context switching, which MacOS obviously won't do.

Kinda useless (I guess ssheven could theoretically benefit, but I don't have a NIC in my IIsi to test that theory), but I'm happy it seems to work :)
 

Phipli

Well-known member
During the vacation, as a new 'crazy project', I decided to have a shot at designing a co-processor in my FPGA for the 68030. The '020 and '030 have this nifty (if slow) co-processor interface. Not a traditional memory-mapped device, but a real co-processor defining new instructions that can be added to your code to do extra stuff.

Amazing work as ever!

The most amazing thing to implement would be hardware accelerated decompression. Using retro hardware I swear I spend 90% of my time sat waiting for stuffit to decompress files.

There is a public implementation of the .sit algorithms in a Linux compression utility that supports .sit...

For me, that would be the single most useful coprocessor.

There were hardware nubus compression cards back in the day. Fairly rare though.
 

Melkhior

Well-known member
The most amazing thing to implement would be hardware accelerated decompression. Using retro hardware I swear I spend 90% of my time sat waiting for stuffit to decompress files.
There is a public implementation of the .sit algorithms in a Linux compression utility that supports .sit...
For me, that would be the single most useful coprocessor.
There were hardware nubus compression cards back in the day. Fairly rare though.
Co-processors are nice, but they might not be the best option for that use case. They shine when the compute-to-data ratio is high. That's the case for floating-point (vs. SW emulation), that's the case for crypto. But I'm not so sure for compression, where usually it's about data manipulation and data-based behavior (tests and branches). With my current interface, there are about a dozen bus cycles for every instruction, plus some more I don't see when the CPU gets data from memory (unless it's in cache). It might be worth it for AES, but I haven't yet checked whether it's actually faster than my table-based software implementation! You need to do a significant amount of work in each instruction to justify that kind of overhead - and you probably need to have some state on the co-processor side to minimize said overhead.

My suspicion is that the NuBus accelerators uploaded large amounts of input data to on-board memory, ran the algorithm internally, then copied the result back to main memory. Provided you implement double-buffering to hide some of the latency (so transfers and compute overlap, the same principle as any GPGPU or similar accelerator), the overhead is manageable. Edit: there's no theoretical reason why something like that couldn't be done in the NuBusFPGA and/or IIsiFPGA, BTW.
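
Just to illustrate the double-buffering idea (everything below is hypothetical - the card_* calls and the CHUNK size are inventions for the sketch, not an existing NuBusFPGA/IIsiFPGA API):
Code:
#include <stddef.h>

#define CHUNK 65536   /* hypothetical size of one on-board buffer */

/* Hypothetical card interface: copy a chunk into on-board RAM, start the
 * (de)compression of that buffer, then wait for it and read the result back. */
extern void   card_upload(int buf, const void *src, size_t len);
extern void   card_start(int buf);
extern size_t card_wait_and_download(int buf, void *dst);

/* Process 'nchunks' chunks of 'input', overlapping the upload of chunk n+1
 * with the card working on chunk n (classic double-buffering). */
size_t process_stream(const char *input, size_t nchunks, char *output)
{
    size_t produced = 0;
    int cur = 0;

    if (nchunks == 0)
        return 0;
    card_upload(cur, input, CHUNK);                  /* prime the first buffer */
    for (size_t n = 0; n < nchunks; n++) {
        card_start(cur);                             /* card works on 'cur'... */
        if (n + 1 < nchunks)                         /* ...while we fill the other one */
            card_upload(1 - cur, input + (n + 1) * CHUNK, CHUNK);
        produced += card_wait_and_download(cur, output + produced);
        cur = 1 - cur;
    }
    return produced;
}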

I guess it depends on what decompressing StuffIt really requires. Unfortunately I don't think they ever published the source code for the expander, let alone a vintage 68k version. One would first need to reimplement a full SW expander, and then figure out what needs to be accelerated. It would be a lot of software work :-(
 

Phipli

Well-known member
They shine when the compute-to-data ratio is high. That's the case for floating-point (vs. SW emulation), that's the case for crypto.
Oh, ok, I have to admit I thought decompression was one of the high compute to data things. An excessively compressed 20MB file can take most of an hour to decompress on an old Mac.
Unfortunately I don't think they ever published the source code for the expander, let alone a vintage 68k version. One would first need to reimplement a full SW expander, and then figure out what needs to be accelerated.
The coprocessor doesn't need to be in 68k though, does it? That's why I mentioned there is an open-source implementation - I assume it is in C++ or similar.

I really thought that would be a suitable activity - turning something that is complex (judging by the time it takes) into something fast.

Completely understand that it is extremely complex software wise.
 

zigzagjoe

Well-known member
Oh, ok, I have to admit I thought decompression was one of the high compute to data things. An excessively compressed 20MB file can take most of an hour to decompress on an old Mac.

The coprocessor doesn't need to be in 68k though, does it? That's why I mentioned there is an open-source implementation - I assume it is in C++ or similar.

I really thought that would be a suitable activity - turning something that is complex (judging by the time it takes) into something fast.

Completely understand that it is extremely complex software wise.
To expand a little... with compression, at a high level, the entire operation is not just turning a block of data into more data via a straightforward transform or other set of mathematical operations. There's a variety of cases involved, from literal data encodings to backrefs, dictionaries and more, that all require quite a bit of working RAM and reference data to perform the decompression. The 68020/30 coprocessor-type accesses are best suited to computationally intensive but relatively straightforward transformations (little if-then logic, just math) that do not require a lot of data. Encryption involves very little logic and not a lot of data, but a *lot* of math, making it generally well suited for acceleration. Or, using floating-point algorithms as an example, computing a square root takes a while but only requires a single floating-point number in and out.

To make (de)compression suitable for a co-processor use case, you'd need to drill into the algorithms and find out which operations are the most expensive, and whether those are well suited to being implemented in logic with the prior constraints in place. As Melkhior said, the case where you hand off a large chunk of data to the card's buffers (private RAM) and tell the card to "please do the thing" instead is easier, but still a lot of work.

Definitely possible, but the LOE (level of effort) to implement is high, since you'd basically have to build an independent expander application as a prerequisite for any of this. The Mac environment is sufficiently old and esoteric that it's not exactly trivial to take a current piece of code and backport it, especially if you were to use a native toolchain (read: an old C compiler).

If you were asking me to solve the decompression use case practically, I'd probably aim for some automation involving a second machine and drop folder(s). Drop an archive into an AppleShare folder and have some AppleScript (or that native Linux utility you mentioned) decompress it automatically; you can then pull the result from that folder. Not elegant, but more straightforward.
 