
IIsiFPGA: HDMI for the 68030 PDS slot

Melkhior

Well-known member
Oh, ok, I have to admit I thought decompression was one of the high compute-to-data things. An excessively compressed 20MB file can take most of an hour to decompress on an old Mac.
It depends on the algorithm. It might be sufficiently compute-intensive for a co-processor to be useful. We've come such a long way that it's hard to tell where the real costs are on a vintage system without a detailed analysis. Perhaps I have too much bias from modern CPUs :-/

The coprocessor doesn't need to be a 68k though, does it?
The idea of a co-processor on the '020/'030 is that it is driven by co-processor F-line instructions in the 68k code-flow, and it can then implement 'anything' in terms of functionality. So it is quite tightly coupled to the 68k core, but isn't a 68k itself (and it doesn't have to be a processor at all). For instance, it can request reads or writes directly to the A and D registers of the 68k. Other types of devices cannot do that - they only communicate through memory. It's rather unique; not many other CPUs did similar things (perhaps the NS32K? It also has a co-processor interface but I don't know the specifics). Also, there's specific support for branches based on a condition inside the co-processor. It's also rather neat :)
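To make that 'tightly coupled' interface a bit more concrete, here's a rough sketch of the co-processor interface registers (CIRs) that the CPU reads and writes - in CPU address space, not regular memory space - whenever it hits a co-processor F-line instruction. This is from memory of the MC68030 User's Manual, so the names and offsets should be double-checked there before relying on them:

```c
/* Sketch of the M68k co-processor interface register (CIR) map.
 * The '020/'030 accesses these via CPU-space bus cycles when it
 * executes a co-processor F-line instruction.  Offsets are from
 * memory of the MC68030 User's Manual - verify before use. */
enum m68k_cir_offset {
    CIR_RESPONSE      = 0x00, /* co-processor tells the CPU what it needs next */
    CIR_CONTROL       = 0x02, /* exception acknowledge / abort                 */
    CIR_SAVE          = 0x04, /* context save (FSAVE)                          */
    CIR_RESTORE       = 0x06, /* context restore (FRESTORE)                    */
    CIR_OPERATION     = 0x08, /* operation word                                */
    CIR_COMMAND       = 0x0A, /* command word of the F-line instruction        */
    CIR_CONDITION     = 0x0E, /* condition selector for cpBcc/cpScc/cpDBcc     */
    CIR_OPERAND       = 0x10, /* operand transfers (CPU <-> co-processor)      */
    CIR_REG_SELECT    = 0x14, /* which D/A register(s) the co-processor wants  */
    CIR_INSTR_ADDRESS = 0x18, /* instruction address transfers                 */
    CIR_OPERAND_ADDR  = 0x1C  /* operand address transfers                     */
};
```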

The granularity, in theory, can be anything - the limit is how long an instruction will run, because at some point the 68k will need to service interrupts and so on. The interface is also built in a way that 'instructions' (as in, stuff running inside the co-processor, not 68k instructions...) can overlap (the '882 takes advantage of that, for instance). Each co-processor F-line instruction is executed as part of the 68k control-flow, and the CPU will only move to the next 68k instruction once the co-processor says it doesn't need the 68k's help anymore. So on the '882, if you do an FMUL from memory to an FP register (which is internal to the FPU), the '882 will request the memory data from the CPU (which will do the relevant load and then send the data to the FPU), then check for possible early exceptions and tell the CPU it's good to go. The CPU can then move on to the next 68k instructions while the FPU is computing the FMUL (all 71 cycles of it...).
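As a purely illustrative example of that overlap - assuming the compiler actually schedules the integer work between the FMUL and the store, which it may or may not do - something like this lets the '030 keep doing integer bookkeeping while the '882 grinds through the multiply:

```c
/* Minimal, illustrative sketch of FPU/CPU overlap on a 68030 + 68882.
 * Assumption: the compiler emits roughly FMUL (handed to the '882),
 * then the integer add, then the FMOVE store.  While the '882 spends
 * its ~71 cycles on the multiply, the '030 keeps executing the integer
 * instructions; only the store of 'v' has to wait for the FPU result. */
void scale(const double *in, double *out, const int *weights,
           long *weight_sum, int n, double k)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        double v = in[i] * k;   /* FMUL issued to the co-processor      */
        sum += weights[i];      /* integer work overlaps the FMUL       */
        out[i] = v;             /* store waits until the '882 is done   */
    }
    *weight_sum = sum;
}
```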

If you want something that doesn't need that level of synchronization / granularity / access, then a 'regular' device is the best option. Which is probably why nobody other than Motorola ever made co-processors for the '020/'030, as far as I know... Except briefly for the MMU/FPU, it's a solution in search of a problem... and that's probably why the interface got dropped in the '040!

That's why I mentioned there is an open-source implementation; I assume it is in CPP, or similar.
I really thought that would be a suitable activity - turning something that is complex (judging by the time it takes) into something fast.
Completely understand that it is extremely complex software-wise.
The software is the hardest part... Starting from an existing high-level source code implementation, the easiest way would be a regular memory-mapped device with a 'normal' processor running some micro-code. The NuBusFPGA/IIsiFPGA already have a VexRiscv core for QuickDraw acceleration, because it was way easier than creating dedicated hardware.
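To give an idea, the 'micro-code' on such an embedded core could look roughly like the loop below - a minimal sketch with a made-up mailbox layout in FPGA RAM; the addresses, fields and the decompress() entry point are illustrative assumptions, not the actual NuBusFPGA/IIsiFPGA firmware:

```c
/* Hypothetical firmware loop for an embedded core (e.g. VexRiscv) in
 * the FPGA.  The host writes the input data and a command into FPGA
 * memory; the core runs the decompressor and reports completion.
 * All addresses and fields are assumptions for illustration only. */
#include <stdint.h>
#include <stddef.h>

#define MAILBOX_BASE  0x00010000u  /* assumed mailbox location in FPGA RAM */
#define SRC_OFFSET    0x00100000u  /* assumed input buffer in FPGA RAM     */
#define DST_OFFSET    0x00180000u  /* assumed output buffer in FPGA RAM    */

struct mailbox {
    volatile uint32_t cmd;     /* 0 = idle, 1 = go                          */
    volatile uint32_t status;  /* 0 = busy, 1 = done, 2 = error             */
    volatile uint32_t src_len; /* compressed size, written by the host      */
    volatile uint32_t dst_len; /* in: output buffer size, out: actual size  */
};

/* Stand-alone pure decompression function, e.g. compiled from an
 * open-source implementation; returns the output size or -1 on error. */
long decompress(const uint8_t *src, size_t src_len, uint8_t *dst, size_t dst_max);

void firmware_main(void)
{
    /* FPGA RAM is assumed to start at address 0 as seen by this core. */
    struct mailbox *mb = (struct mailbox *)(uintptr_t)MAILBOX_BASE;

    for (;;) {
        while (mb->cmd == 0)                 /* wait for the host          */
            ;
        mb->status = 0;
        long n = decompress((const uint8_t *)(uintptr_t)SRC_OFFSET, mb->src_len,
                            (uint8_t *)(uintptr_t)DST_OFFSET, mb->dst_len);
        if (n < 0) {
            mb->status = 2;                  /* report the error           */
        } else {
            mb->dst_len = (uint32_t)n;       /* report the actual size     */
            mb->status = 1;
        }
        mb->cmd = 0;                         /* ready for the next command */
    }
}
```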

So, given:
(a) a C/CPP stand-alone pure function taking an input buffer and an output buffer and doing the decompression from input to output (of a significant amount of data)
(b) an application calling said function (repeatedly) to decompress a SIT file
It would not be very hard to accelerate said application by
(1) uploading the input data to some memory on the *FPGA
(2) executing a micro-code version of the function on an embedded core (big enough to offer some performance benefit)
(3) downloading the decompressed data back into the original output buffer
Speeding things up further would then mostly be a question of whether the transfers and the computation can be overlapped by double-buffering - there's plenty of RAM available in the *FPGA. Alternatively, (1) and (3) could be replaced by DMA bus-master accesses from/to host memory to avoid the copies. Surely more complex, maybe faster - NuBus block transfers on Quadras would be faster than CPU-initiated copies, and perhaps so would burst transfers from/to the MDU in a IIsi (it wouldn't work on an SE/30, no support for burst there). A rough host-side sketch of the simple copy-based flow follows.
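In host-side terms it might look roughly like this - again a minimal sketch with made-up slot-window and mailbox addresses (matching the firmware sketch above), purely to illustrate steps (1)-(3) and not the real *FPGA programming model:

```c
/* Illustrative host-side flow for off-loading decompression to the
 * *FPGA board.  The slot window address, mailbox layout, and buffer
 * offsets are assumptions for the sake of the example only. */
#include <stdint.h>
#include <string.h>

#define FPGA_RAM      ((volatile uint8_t *)0xF9000000)  /* assumed slot window      */
#define FPGA_MAILBOX  ((volatile uint32_t *)0xF9010000) /* assumed mailbox location */
#define MB_CMD        0
#define MB_STATUS     1
#define MB_SRC_LEN    2
#define MB_DST_LEN    3
#define SRC_OFFSET    0x00100000u   /* input buffer in FPGA RAM  */
#define DST_OFFSET    0x00180000u   /* output buffer in FPGA RAM */

/* Drop-in replacement for the stand-alone pure function of (a):
 * same shape of interface, but the work happens on the FPGA. */
long decompress_chunk(const uint8_t *src, long src_len,
                      uint8_t *dst, long dst_max)
{
    /* (1) upload the compressed input into FPGA memory */
    memcpy((void *)(FPGA_RAM + SRC_OFFSET), src, src_len);
    FPGA_MAILBOX[MB_SRC_LEN] = (uint32_t)src_len;
    FPGA_MAILBOX[MB_DST_LEN] = (uint32_t)dst_max;

    /* (2) kick the embedded core and wait for it to finish */
    FPGA_MAILBOX[MB_CMD] = 1;
    while (FPGA_MAILBOX[MB_STATUS] == 0)
        ;                                    /* could yield to the OS here */
    if (FPGA_MAILBOX[MB_STATUS] != 1)
        return -1;

    /* (3) download the decompressed output back to the caller's buffer */
    long out_len = (long)FPGA_MAILBOX[MB_DST_LEN];
    memcpy(dst, (const void *)(FPGA_RAM + DST_OFFSET), out_len);
    return out_len;
}
```

Double-buffering would then just mean two such mailbox/buffer pairs, so the 68k can be copying chunk N+1 in and chunk N-1 out while the embedded core chews on chunk N.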
 