
NuBusFPGA: HDMI on NuBus Macs

demik

Well-known member
The last line is a recent modification of the gateware, where writes are put in a FIFO and immediately acknowledged to the NuBus, thus saving on write latency and boosting write bandwidth significantly. It doesn't change read behavior. That single modification pushes the Speedo 4.02 number for unaccelerated 8-bits up by 30%, to 0.486. It also helps the accelerated mode, which (with basic rectangular bitblit & solid fill) moves from 0.8 to 0.9; it also helps all lower-depth numbers, but less significantly.
The 0.486 is close to what you get with Toby in an identical machine; from DCDMF3's description of Toby, I suspect the loss comes from the higher read latency (Toby has really fast VRAM!): I have to cross clock domains from NuBus to the system clock, then wait for the DDR3, then cross back to NuBus, and all of that takes time.

You are welcome.

Pretty significant improvement indeed. 0.9 is kinda impressive to be fair, good job !

Toby VRAM is also dual ported, which means reads from the DAC and writes from NuBus are not interfering with each other. Read latency is 40-60ns IIRC. All chips are also sort of in parallel, so that's a 32 bit interface as well. Can BRAM be used in a similar way ? (Sorry my knowledge of FPGA is limited)
 

Melkhior

Well-known member
Pretty significant improvement indeed. 0.9 is kinda impressive to be fair, good job !
Thanks :) At this stage it would be mostly software work to improve performance, to support more primitives. In particular, putting offscreen GWorlds in the 248 MiB of unused memory would help, as then you can do HW bitblit from offscreen to onscreen at full speed... but no idea how to implement that on the QuickDraw side.

Toby VRAM is also dual ported, which means reads from the DAC and writes from NuBus are not interfering with each other. Read latency is 40-60ns IIRC. All chips are also sort of in parallel, so that's a 32 bit interface as well. Can BRAM be used in a similar way ? (Sorry my knowledge of FPGA is limited)
Yes, BRAM can be dual-ported and can do 2R/1W ; they probably could be used for a Toby-like framebuffer. However the amount of BRAM available is too limited in 'reasonable' FPGAs ; mine is limited to 50 blocks of 36 KiBit, or 225 KiBytes (assuming there is no rounding waste). A 1920x1080@8-bits framebuffer would require over 500 blocks when using 1024x32 bits per block - that's basically a Kintex 355T or larger, already a ~$2.5k FPGA.
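The block-count arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the figures (50 blocks of 36 Kibit, 1024x32 bits used per block, 1920x1080 at 8 bits) are the ones quoted in the post.

```python
import math

def bram_capacity_kib(blocks, kibit_per_block=36):
    """Total BRAM capacity in KiB, assuming no rounding waste."""
    return blocks * kibit_per_block / 8

def blocks_for_framebuffer(width, height, bits_per_pixel,
                           words=1024, word_bits=32):
    """Blocks needed when each block is used as `words` x `word_bits`."""
    framebuffer_bits = width * height * bits_per_pixel
    return math.ceil(framebuffer_bits / (words * word_bits))

print(bram_capacity_kib(50))                  # capacity of the 50 available blocks
print(blocks_for_framebuffer(1920, 1080, 8))  # blocks for a 1080p 8-bit framebuffer
```

With these inputs the capacity comes out at 225 KiB and the 1080p framebuffer needs a little over 500 blocks, in line with the numbers given above.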

External DDR3 is a lot cheaper, has enough bandwidth, and the extra read latency doesn't slow the framebuffer that much - most accesses are writes at 8+ bits depth (NuBus has write granularity, I assume MacOS takes advantage of that and doesn't do a read-modify-write cycle for byte accesses), and reads are mostly for bitblits, which you really want accelerated anyway.

What could be possible (and more cost-effective) for reduced latency would be a NuBus-side BRAM-based cache, but it would be complex to design as you need single-byte write buffering to avoid re-introducing extra write latency... And I'm not even sure that would be worth the effort (and resource usage in the FPGA).
 

Jamieson

Well-known member
Does the system read from the card's frame buffer very often? Seems like it would be almost always writing to it. Optimizing the logic to have the FPGA acknowledge a NuBus write cycle as fast as possible like you're doing seems like a straightforward way to speed things up. Going any faster would require writing more firmware to do the blitter and acceleration routines.

I was looking at the "Village Tronic" cards. Looks like they took a conventional 2d accelerated chip like the Cirrus Logic Alpine and then used a small FPGA to translate between the NuBus interface and whatever parallel interface that graphics chip used. That way they leveraged the blitter and other acceleration stuff that had already been done by the ASIC designers. Not clear however how closely the graphics chip routines line up with QuickDraw accelerated routines.
 

Melkhior

Well-known member
@Jamieson My guess is reads are a lot less common than writes ("most accesses are write at 8+ bits"), but I haven't actually checked/measured. I could count transactions of both types in the hardware and expose the counters to SW to confirm the hunch, which might actually be quite interesting.

Going faster indeed requires extra software; to be clear on the naming, in the NuBusFPGA it's shared between a dedicated INIT (to send commands to the device) and the special 'firmware' for the VexRiscv core (which is just some basic drawing functions and a dispatcher). It's not in the Declaration ROM though, which most people would think of as the card's 'firmware', where there's just the basic stuff to handle the 'dumb' framebuffer (it does include e.g. changing depth).

Using a dedicated chip to implement acceleration made sense; what you _really_ need is just some blitting and rectangular solid fills. All accelerating graphics chips of the era can do that. For anything else QD can do but the chip can't, you can fall back on software. And even if the chips could do stuff QuickDraw can't use, it was probably cheaper to re-use a high-volume off-the-shelf ASIC than to design a brand new one.
 

Melkhior

Well-known member
Week-end update...

Acceleration was fixed to work in 16 & 32 bits in addition to 8 bits (it still doesn't do anything in 1/2/4).

Managed to get some level of word-sized DMA working using the original implementation I was using. It's a faithful replication of Apple's PAL code from the early days, and even in Verilog it's a mess in my opinion. No way I could add burst mode in that code.

So I just wrote my own NuBus interface in Migen, with a FSM I actually understand (and just enough Verilog to handle the negative-edge sampling). As a side-effect, I was able to ACK writes in the cycle following the request instead of needing a 1-cycle delay. My guess is the 68040 waits for two NuBus cycles between requests with my test code, as the write bandwidth jumped from 8 MB/s (4 bytes every 5 cycles at 10 MHz) to 10 MB/s (4 bytes every 4 cycles) from removing this 1-cycle delay. The other BW numbers also improved, but less so as I didn't improve the read latency (or not by a lot). The Speedo 4.02 numbers for 8 bits jumped to 0.52 (unaccelerated) and marginally above 1 (accelerated).
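As a quick sanity check of the write-bandwidth figures above, here is a minimal Python sketch. The function name and parameters are illustrative only; the inputs (4-byte transfers, a 10 MHz bus clock, 5 versus 4 cycles per transfer) come straight from the post.

```python
def nubus_write_bandwidth_mb_s(cycles_per_transfer,
                               bytes_per_transfer=4,
                               clock_hz=10_000_000):
    """Sustained write bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return bytes_per_transfer * clock_hz / cycles_per_transfer / 1e6

before = nubus_write_bandwidth_mb_s(5)  # with the extra 1-cycle ACK delay
after = nubus_write_bandwidth_mb_s(4)   # ACK in the cycle following the request
print(before, after, f"{after / before - 1:.0%} gain")
```

Removing one cycle out of five gives a 25% gain in write bandwidth, which is the jump from 8 MB/s to 10 MB/s described above.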

I've been annoyed by the internal BW inefficiency in my design - when I did it for the SBusFPGA's cg6, VexRiscv could only do 32-bits Wishbone, and it's using the system bus to access memory (32-bits @ 100 MHz). But now VexRiscv has support for wider Wishbone, so I moved to an updated core with a 128-bits Wishbone for data, and added a bypass to access a dedicated port to the SDRAM directly with a 128-bits bus @ 100 MHz. Theoretically it should improve bandwidth significantly, and so the accelerated numbers as well (unaccelerated won't notice). Practically, the Speedo 4.02 numbers for 8-bits jumped to around 1.15 and the 16-bits number (which I can now test) is just above 1. Not quite as good as the built-in video of the Q650, but apparently better than the Q605 Speedometer uses as a reference :)

@Jamieson Test done, it seems there are still quite a lot of reads, with a read:write ratio (for transactions, not bytes) at around 1:2 to 1:2.5 depending on whether acceleration is disabled or enabled.
 

Jamieson

Well-known member
When I have to interface a modern FPGA to something much slower like a NuBus interface, typically I will use an internal fast clock (100MHz to 200MHz) to oversample the NuBus signals, including sampling the bus clock and treating that like any other bus signal, rather than use it to clock logic directly. That way I can very quickly detect the rising/falling edge of that bus clock, and now since I am already in a fast clock domain, I have lots of clock cycles to do stuff before the next rising/falling edge. If it wasn't for the DRAM refresh cycles messing up the latency, reads and writes to the frame buffer could go pretty fast, with the ACK asserted on the very next cycle. I usually do everything in VHDL, haven't looked into the Migen stuff -- is that the stuff based on python?
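The oversampling idea described above can be illustrated with a small behavioural model in plain Python (not HDL, and not anyone's actual gateware). Everything here is an assumption made for the sketch: a 10 MHz bus clock that is high for 75 ns and low for 25 ns, a 100 MHz sampling clock with an arbitrary 3 ns phase offset, and a two-stage synchronizer in front of the edge detector.

```python
def bus_clock(t_ns, period_ns=100, high_ns=75):
    """Level of a 10 MHz clock assumed high for 75 ns, then low for 25 ns."""
    return 1 if (t_ns % period_ns) < high_ns else 0

def detect_edges(sample_period_ns=10, phase_ns=3, duration_ns=1000):
    """Oversample bus_clock from an asynchronous fast clock.

    Returns (time_ns, kind) tuples, kind being 'rise' or 'fall', stamped at
    the fast-clock tick where the edge comes out of a two-stage synchronizer.
    """
    sync0 = sync1 = prev = bus_clock(phase_ns)  # registers start settled
    edges = []
    for t in range(phase_ns + sample_period_ns, duration_ns, sample_period_ns):
        # All three registers update together on each fast-clock tick.
        prev, sync1, sync0 = sync1, sync0, bus_clock(t)
        if sync1 != prev:
            edges.append((t, "rise" if sync1 else "fall"))
    return edges

edges = detect_edges()
falls = [t for t, kind in edges if kind == "fall"]
rises = [t for t, kind in edges if kind == "rise"]
print("falling edges seen at (ns):", falls)
print("rising edges seen at (ns):", rises)
```

In this model every falling edge is reported within two fast-clock periods of the real edge, and the rest of the 100 ns bus cycle is left for logic running in the fast domain. Real hardware would also need to account for metastability and setup/hold margins, which a model like this ignores.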
 

Melkhior

Well-known member
@Jamieson I hadn't thought of that. SBus (the first vintage bus I played with) is clocked at up to 25 MHz so it makes sense to use it to clock some logic directly, but at 10 MHz the CDC from NuBus is _slow_ (and the Wishbone component I use has issues with such different clocks). For reads, the CDC probably accounts for 30% to 50% of the latency, if not more. The system clock is 100 MHz, so even with the 75/25 duty cycle it should reliably detect the negative edge. Food for thought, thanks; removing the CDC would speed up things significantly (and save quite a bit of resources).

Yes, Migen is python-based. My NuBus implementation is here for instance, with Verilog only used to capture signals on the negative edge of the NuBus clock (Migen assumes everything is always done on the positive edge).
 

Melkhior

Well-known member
@Jamieson Tried a quick implementation of your idea - and it does improve read latency; rough measurement says it's now in the 920-950 ns range instead of 1300-1400 ns. That helps unaccelerated performance quite a bit for scrolling, with the Speedo 4.02 for 8 bits jumping to 0.62. It also helps accelerated numbers but much less significantly, in line with the fact they're less reliant on reads; it's now at about 1.2. However, I'm not super confident in my implementation; adding some phase drift between clocks in simulation shows my signals are probably not always properly timed on the NuBus side... but it's definitely a good idea performance-wise, thanks!

I've also added acceleration for 1/2/4 bits for only the special case where everything is byte-aligned (so left side of source and destination and width are all multiples of 8/4/2), which Speedometer 4.02 happens to be doing at least to some extent, thus pushing all 1/2/4 bits numbers just over 1. Feels a bit like "benchmark optimization" :)
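The byte-alignment condition above can be written down as a small predicate. This is a hypothetical Python sketch of the test only (the function name and argument order are invented here, and the real check lives in the card's software): at 1, 2 and 4 bits per pixel a byte holds 8, 4 and 2 pixels, so the source left edge, destination left edge and width must all be multiples of that count.

```python
def is_byte_aligned_blit(depth_bits, src_left, dst_left, width):
    """True if a low-depth blit touches only whole bytes on every row."""
    if depth_bits not in (1, 2, 4):
        raise ValueError("only sub-byte depths need this check")
    pixels_per_byte = 8 // depth_bits  # 8, 4 or 2 pixels in each byte
    return all(value % pixels_per_byte == 0
               for value in (src_left, dst_left, width))

print(is_byte_aligned_blit(1, 8, 16, 64))  # all multiples of 8
print(is_byte_aligned_blit(1, 8, 16, 63))  # width is not a multiple of 8
print(is_byte_aligned_blit(2, 4, 6, 8))    # destination is not a multiple of 4
```

Any blit that fails the predicate would fall back to the unaccelerated path, as described above.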
 

Jamieson

Well-known member
Registering the NuBus signals and oversampling with a fast local clock is a good technique for eliminating one clock domain crossing, which is always nasty business. It's cool to see the benchmark numbers increasing!

Do you think if some of the acceleration routines like rectfill and bitblit were written directly in Verilog that there would be further performance gains?
 

demik

Well-known member
Woah, always interesting to read you guys' posts.

Registering the NuBus signals and oversampling with a fast local clock is a good technique for eliminating one clock domain crossing, which is always nasty business. It's cool to see the benchmark numbers increasing!

Is this the same thing as using a PLL to get the 10 MHz NuBus to 100 MHz ? Would that eliminate phase drift ?
 

Jamieson

Well-known member
Using a PLL to step the NuBus clock might work OK as long as the NuBus clock is constant 10MHz. If for some reason it disappears or changes frequency the FPGA PLL could get into a bad state. In this example it's OK to run the local fast clock asynchronous to the NuBus clock. Say if the local clock is 200MHz and you're using that to sample the NuBus signals, you can detect the falling/rising edge within 5ns and that's probably good enough. As I understand it, on the rising edge of NuBus clock that's when bus signals start to change. On the falling edge of the NuBus clock that's when the bus signals are supposed to be stable and valid. It's not a bad scheme, it's just that the 10MHz clock seems a bit slow even by late 80's standards.
 

Melkhior

Well-known member
Do you think if some of the acceleration routines like rectfill and bitblit were written directly in Verilog that there would be further performance gains?
Probably. Currently they are C code on a customized VexRiscv (e.g. it has things like 'fsr' from Zbt to better handle misaligned stuff) with a 128-bits interface to the memory. But ultimately, it's just an in-order scalar processor, so it's going to be somewhat limited in raw performance - those 1.6 GB/s are nowhere near saturated (80-100 MB/s is likelier in ideal condition, if that, or about a 32-bits word every 4 or 5 cycles on average).
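To put the figures above side by side, here is a short Python sketch of how little of the memory interface is used. The helper names are made up; the inputs (a 128-bit port at 100 MHz, one 32-bit word every 4 or 5 cycles) are the rough estimates quoted in the post.

```python
def peak_mb_s(width_bits, clock_mhz):
    """Peak bandwidth of a bus moving `width_bits` on every clock cycle."""
    return width_bits / 8 * clock_mhz

def achieved_mb_s(word_bits, cycles_per_word, clock_mhz):
    """Bandwidth when one word is moved every `cycles_per_word` cycles."""
    return word_bits / 8 * clock_mhz / cycles_per_word

peak = peak_mb_s(128, 100)
for cycles in (4, 5):
    rate = achieved_mb_s(32, cycles, 100)
    print(f"{rate:.0f} MB/s of {peak:.0f} MB/s = {rate / peak:.1%} utilization")
```

Under these assumptions the core uses only around 5% to 6% of the available bandwidth, which is why a wider or out-of-order engine could still help.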
Dedicated bitblt/rectfill could be faster, but also less versatile and much more complex to implement (I tried...). I've also tried my own little vector engine (based on an existing crypto engine...), but it didn't seem that much faster and I had to hand-write the assembly. Vex is easier to deal with :)
NaxRiscv would be the 'easy' upgrade, as it's superscalar and out-of-order, and could probably saturate the memory interface much more efficiently. But for now it's a bit too high-end, I haven't been able to generate the 'basic' core I need (it can run Linux just fine, but I don't need to run Linux ;-) ).
Is this the same thing as using a PLL to get the 10 MHz NuBus to 100 MHz ? Would that eliminate phase drift ?
Theoretically yes, a PLL could lock the relationship between the two clocks. But I'm not sure the 7-series MMCM/PLL will accept such a low frequency as input, or the 75/25 duty cycle. For now I haven't tried that, the sysclk is generated from the 48 MHz crystal on the FPGA board (and so is the video clock). I could perhaps use the NuBus90 clock, which is 20 MHz 50/50... but that's only available on Quadras.

Anyway, to stress-test the interface, I've gone back to the idea of writing a small driver to use the other 248 MiB of memory as a RAM disk. Given that I can have full access to the 256 MiB in the superslot, it shouldn't be too hard (easier than on SBusFPGA where I have a dedicated DMA, but also slower as simple memory load/store won't use block accesses... I need to fix busmaster access and implement them at some point).
 

Melkhior

Well-known member
New update...
There's now a simple RAM disk using the leftover SDRAM. It's pre-initialized by the firmware, but for some reason doesn't auto-mount - I have to probe it with 'disk aid' for it to mount and be usable, no idea why. Currently using basic slave-mode accesses, no DMA.
On the framebuffer side, I've pushed the Vex as far as it would go and then a bit more, adding support for custom 64-bits load/store (using pairs of 32-bits registers), which boosts screen-to-screen copy performance somewhat. I've also added more support for pattern-to-screen, which Speedometer doesn't seem to worry about (it's quite a basic benchmark, not sure it really reflects reality in any way... who spends time drawing ellipses?), although it might improve 16-bits a bit.
Current results with Speedo 4.02:

speedo402.jpg
So at this stage, it's in the ballpark of the Q650 internal video for 8 and 16 bits - although in a quite biased way; the final upward scrolling is faster on the NuBusFPGA than on the internal video, while previous "'040 pushes pixels to screen" steps are still slower. That's with 1920x1080 resolution on the NuBusFPGA, did not yet try to get some better numbers by running it at a lower resolution ;-) (... which would defeat the point of having made the board in the first place).
 

demik

Well-known member
That's some nice scores ! Yeah sure the score is an average, but it's kinda neat anyway.

Does the RAMdisk need some sort of Declaration Trickery ? Or is it only software ?
 

Melkhior

Well-known member
Does the RAMdisk need some sort of Declaration Trickery ? Or is it only software ?
It's only software ... in the declaration ROM :) The resources in the DeclRom declare the device and the driver, and the driver is built in the DeclRom along with the (non-accelerated) Framebuffer driver. There's nothing additional needed on the MacOS side - except probing with 'disk aid' (I've also embedded a custom RLE32-compressed image in the unused ROM space to initialize the drive). Also, it's more 'hackish' than the FB driver: for some reason, the driver doesn't seem to get the slot base address (works fine for the FB), so I have currently hardwired it to $C. Either I messed up something, or the Slot Manager is not really expecting multiple devices sharing the same slot...
 

MrFahrenheit

Well-known member
This is a really cool project.

Are you at a stage where people can buy these from you? Is the hardware pretty much done and all improvements are being done in software? How would someone update the firmware in the future?
 

cheesestraws

Well-known member
I have to probe it with 'disk aid' for it to mount and be usable, no idea w

The driver is responsible for any auto-mounting of volumes, I believe. You can manually mount SCSI drives without a driver partition if the Apple driver is already resident in RAM, for example, and the Apple driver only supports auto-mounting one partition whereas third-party SCSI drivers support mounting more. I don't know where you need to tell it to do the mount—I've never written a block storage driver—but I am pretty sure it is the driver's responsibility not the OS's.

(See also how SCSI removable storage works: attempting to automount there is undesirable, which is quite possibly why the decision was made to leave it up to the driver when to do it? That's speculation, though.)
 

Melkhior

Well-known member
This is a really cool project.
Thanks :)
Are you at a stage where people can buy these from you? Is the hardware pretty much done and all improvements are being done in software? How would someone update the firmware in the future?
The entire project is available on GitHub (minus stuff I could have forgotten to commit/push...), but there is currently no plan to manufacture/sell any. The design is somewhat cumbersome and quite expensive with the two connected boards. The hardware has no known major issue (the VGA might be a bit dodgy but I didn't really look into it as the HDMI works great), pretty much everything is done in the gateware (FPGA/CPLD configuration) and the software. A 'production' version would probably need a re-spin to get rid of the VGA and reclaim some FPGA pins, and then to rework the interface with NuBus to get rid of the CPLD and some of the extraneous hardware like all the extra LEDs... with the big issue being whether to try to integrate the FPGA directly instead of using a daughterboard. More complex to design but would fit in the form factor and might be more cost-optimized.

The Declaration Rom is stored inside the gateware (along with the accelerator firmware), and it's reasonably easy to upgrade as it only requires an external power supply for the board and a USB cable; upgrade is done via the FPGA board manufacturer's software over USB (but can also be done over the FPGA JTAG with Xilinx' tools). Replacing the gateware is needed for e.g. changing the resolution (a single resolution and its clock are baked into it, MacOS can only change the depth). The CPLD must be configured/upgraded over its own JTAG connection, which is a pain as you need the proper software and hardware - I have a Windows 7 virtual machine just to run the proper version of Xilinx ISE for the CPLD... Fortunately, I've yet to need a change in the CPLD configuration as it's very simple - and on one of my two carrier boards it seems the JTAG connection to the CPLD is no longer working :-(

This is basically just a proof-of-concept to show that it is doable for a hobbyist to create a NuBus device (and SBus before that), and to learn some stuff along the way...
 

Melkhior

Well-known member
but I am pretty sure it is the driver's responsibility not the OS's.
Makes sense, thanks. I need to figure out what I need to do to ask MacOS to mount my volume then. Not that the RAM disk is super useful, but it helps testing the NuBus interface.
 