
NuBusFPGA: HDMI on NuBus Macs

Trash80toHP_Mini

NIGHT STALKER
IIRC, I/O line availability limits of FPGA boards for PDS implementation was a major factor covered in the previous thread? Does your setup have enough lines available to support a PDS implementation?

< unimplemented tangent mode >
If so, once NuBus is up and running you might find one particular PDS foray for your card as fascinating as NuBus has been for you.
PM forthcoming ;)
< /unimplemented tangent mode >

@Jamieson and everyone else, please don't take my post as anything but an attempt to insert a generally addressed primer on the definitions applicable to this and other discussions of Video Card performance. Much of the confusion about these things can be attributed to misguided comparisons made when LEM was fairly new to the game that persist in "reviews" there to this day.

As that's where most, if not all, folks learn about this aspect of the RetroMac journey, some clarification is needed. I'm no expert, so I wish someone technically qualified (and a much better writer) would come up with a detailed, bulleted explanation as the first post in a new topic on this for a dedicated discussion.
 

Mu0n

Well-known member
Really cool project. I love using a RGB2HDMI on my monochrome macs and I'd be all over this if I even owned a nubus equipped mac, but I don't.

I might land my hands on a working color classic soon, so I'll have to research my options to bring its video to the modern world.
 

Melkhior

Well-known member
@demik I'm not even sure what to call it... There's a section 'NuBus bit and byte structure' in DCDMF3 to deal with the numbering, and it seems like NuBus stays little-endian:
The NuBus bit structure is not the same as the bit structure of the MC68020 bus, the
MC68030 bus, or the MC68040 bus. To achieve byte-addressing consistency, the
Macintosh computers perform byte swapping of data between the microprocessor and
the NuBus. This section explains the rationale and details of this implementation.
It's already something my brain doesn't like, and then you get the width changes: Wishbone is 32 bits LE like NuBus, but the DDR3 interface is 16 bits wide with a 128-bit controller. And those 128-bit accesses (required to get enough bandwidth) have to be cut into 1/2/4/8/16/32-bit pixels in the right order - and for 16 and 32 you then need to extract the three color components... Needless to say, I saw a lot of weird-looking columns and colors on the way to success. That's the beauty of FPGAs: you don't really need to understand the mess, you can go the software route of trial-and-error-until-it-works ;-)
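To make the slicing problem concrete, here is a minimal Python sketch of cutting one wide memory word into pixels and pulling color components out of a 16-bit pixel. It is not the gateware: the function names are hypothetical, the pixel order inside the word is left as a parameter precisely because the thread says no two stages agree on it, and the x-R-G-B 1-5-5-5 layout is an assumption about the "Thousands" mode.

```python
def split_word(word: int, depth: int, width: int = 128, msb_first: bool = True) -> list[int]:
    """Cut a `width`-bit word into pixels of `depth` bits each.

    msb_first=True returns the pixel held in the most significant bits first;
    which order is right for a given bus stage is an assumption to verify.
    """
    if depth <= 0 or width % depth:
        raise ValueError("depth must evenly divide the word width")
    mask = (1 << depth) - 1
    # Extract starting from the least significant bits.
    pixels = [(word >> (i * depth)) & mask for i in range(width // depth)]
    return pixels[::-1] if msb_first else pixels


def byteswap(value: int, nbytes: int) -> int:
    """Reverse the byte order of an `nbytes`-wide value (big <-> little endian)."""
    return int.from_bytes(value.to_bytes(nbytes, "big"), "little")


def unpack_xrgb1555(pixel: int) -> tuple[int, int, int]:
    """Split a 16-bit pixel into 5-bit (red, green, blue), assuming x-R-G-B 1-5-5-5."""
    return (pixel >> 10) & 0x1F, (pixel >> 5) & 0x1F, pixel & 0x1F
```

For example, `unpack_xrgb1555(byteswap(0x007C, 2))` gives pure red only if the byte swap is really needed at that stage; getting that wrong is exactly the kind of mistake that produces the wrong-colored columns described above.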

Truth be told, once the interface to NuBus worked, fixing the framebuffer was the easy part, as it had already been debugged on the SBus for switchable 8/32 bits support.
 

Melkhior

Well-known member
@Trash80toHP_Mini Number of pins could be an issue for PDS, but the particular board I use was suitable for SBus which has a lot more signals than NuBus - SBus has separate address (28) and data (32) signals. In fact SBus has a 64-bits extended transfer mode that borrows the address lines and a few more as data lines during such extended transfers (none of which I currently implement as the SPARCstation 20 host controller can't do extended transfers, and I've yet to migrate the project to my Ultra 1).

And there are boards with a lot more I/O even in a higher-density form factor, such as the Trenz TE0712 family (which I've been thinking about for a SBusFPGA V2.0 re-using the HDMI design from NuBusFPGA V1.0, as I don't understand the word 'unreasonable' ;-) ). And FPGAs can have even more I/O than those BGA-484 Artix-7... the real issue is cost rather than feasibility. Those Trenz boards are very nice but quite expensive (the TE0712-02-72C36-A is 264€ excl VAT!).

edit: typo
 

Trash80toHP_Mini

NIGHT STALKER
I'm hoping one of our resident boffins might start a new topic, taking up the quest to graft your board onto an interstitial 040 PGA carrier card. That'll be a real doozy.😜
 

Melkhior

Well-known member
Thanks to @cheesestraws' help, I was able to find the missing bit (literally - the 'sysheap' flag was missing...) and figure out how to patch traps in a way that works in 8.1 (untested in older systems yet).

Still super-preliminary alpha-quality (if that) and very ugly (e.g. I hardwire the slot, nothing but 8-bits depth works, ...), but I have some basic BitBlt [1] acceleration on the hardware, for screen-to-screen rectangular copies (e.g. scrolling). Probably still buggy, but it's a start and the machine didn't crash :)

As for speed, well, modern hardware is fast compared to whatever could be built back in the day; having just this one preliminary function running on a small VexRiscv core @ 100MHz, the Speedometer 4.02 score for 8-bits jumps from 0.365 to 0.663. Still a lot slower than the internal video, but I haven't seen a higher number for a NuBus device, even accelerated - however the lowendmac results are usually on slower hosts like a IIcx, which is likely unfair to those vintage devices.

[1] BitBlt is an internal trap used by e.g. StdBits with a really crappy 'let's poke in the undocumented stack' interface, but the graphic semantic is easier than StdBits as all the cropping, scaling, ... has been done already.
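For readers unfamiliar with what a screen-to-screen rectangular copy involves, here is a minimal Python sketch for an 8-bit framebuffer. It is not the code running on the VexRiscv; `blit_rect` and its parameters are hypothetical names, and it does no clipping or bounds checking (the post notes that cropping has already been done by the time BitBlt is reached). The one subtle point it illustrates is overlap: when scrolling, source and destination share rows, so the row order has to be chosen such that no source row is overwritten before it is read.

```python
def blit_rect(fb: bytearray, stride: int,
              src_x: int, src_y: int, dst_x: int, dst_y: int,
              width: int, height: int) -> None:
    """Copy a width x height rectangle inside one 8-bit framebuffer.

    `stride` is the number of bytes per scanline. No clipping is done.
    """
    rows = range(height)
    if dst_y > src_y:
        # Moving downwards: go bottom-up so source rows are read before
        # the copy overwrites them.
        rows = reversed(rows)
    for row in rows:
        s = (src_y + row) * stride + src_x
        d = (dst_y + row) * stride + dst_x
        # The slice on the right is copied out first, so horizontal overlap
        # within a single row is safe (memmove-like behaviour).
        fb[d:d + width] = fb[s:s + width]
```

Scrolling a 4x4 test screen up by one line would be `blit_rect(fb, 4, 0, 1, 0, 0, 4, 3)`.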

Edit: for the curious, the INIT code is here, the code running on the Vex is here (a lot of it unused, a lot of reuse from the SBusFPGA's cg6), and the HW glue between the two is here.
 

Trash80toHP_Mini

NIGHT STALKER
Still a lot slower than the internal video, but I haven't seen a higher number for a NuBus device, even accelerated - however the lowendmac results are usually on slower hosts like a IIcx, which is likely unfair to those vintage devices.
And you never will, NuBus itself bottlenecks performance. Machine doesn't matter and the comparisons on LEM were made almost two decades ago by folks who didn't understand bottlenecks vis a vis NuBus, processor bus nor QuickDraw Acceleration for that matter. ;)
 

uyjulian

Well-known member
Probably in order to get "faster" graphics, need to do some pre-processing on the graphics data like queueing, culling, avoiding commands for non-visible stuff, compression, etc. before transferring that data over to the graphics card.

If the benchmark is doing some equivalent of glReadPixels, that will kill performance also.

You probably won't gain that much "raw" transfer speed anyways, just like connecting a GPU to a 1x PCIe Thunderbolt link, or combining X11 forwarding with OpenGL server-side rendering. So the CPU <-> Graphics card communication needs to be optimized.
 

demik

Well-known member
Thanks to @cheesestraws' help, I was able to find the missing bit (literally - the 'sysheap' flag was missing...) and figure out how to patch traps in a way that works in 8.1 (untested in older systems yet).

Still super-preliminary alpha-quality (if that) and very ugly (e.g. I hardwire the slot, nothing but 8-bits depth works, ...), but I have some basic BitBlt [1] acceleration on the hardware, for screen-to-screen rectangular copies (e.g. scrolling). Probably still buggy, but it's a start and the machine didn't crash :)

As for speed, well, modern hardware is fast compared to whatever could be built back in the day; having just this one preliminary function running on a small VexRiscv core @ 100MHz, the Speedometer 4.02 score for 8-bits jumps from 0.365 to 0.663. Still a lot slower than the internal video, but I haven't seen a higher number for a NuBus device, even accelerated - however the lowendmac results are usually on slower hosts like a IIcx, which is likely unfair to those vintage devices.

That's really impressive.

Does accelerating quickdraw require a different implementation for each color depth or can this be factorised somewhat ?
 

Melkhior

Well-known member
Does accelerating quickdraw require a different implementation for each color depth or can this be factorised somewhat ?
It can probably be factorized to some extent; 8/16/32-bits can probably be unified - once 8 is working (byte aligned), then 16/32 are just more data with more favorable alignment. 1/2/4 is probably a bit harder as you have sub-byte alignment - and since they use less memory they tend to be faster, so there is less incentive to accelerate them... I only tried 8 so far because that was the assumption in the code I carried from my Sun cg6 emulation code (the cg6 hardware is 8-bits only).
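The alignment difference can be illustrated with a small Python sketch that computes, for one scanline, which bytes a run of pixels touches and which bits of the first and last byte belong to it. The function name is hypothetical, and it assumes the leftmost pixel sits in the most significant bits of a byte. For 8/16/32 bits the edge masks are always 0xFF, so a plain byte copy is enough; for 1/2/4 bits the edge bytes need a read-modify-write under a mask.

```python
def row_span(x: int, width: int, depth: int) -> tuple[int, int, int, int]:
    """Return (first_byte, last_byte, first_mask, last_mask) for one scanline.

    Covers `width` pixels starting at pixel `x`, at `depth` bits per pixel.
    The masks mark the bits of the first/last byte that belong to the run
    (leftmost pixel in the most significant bits: an assumption).
    If first_byte == last_byte, the effective mask is first_mask & last_mask.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    start_bit = x * depth
    end_bit = (x + width) * depth - 1            # last bit covered, inclusive
    first_mask = 0xFF >> (start_bit % 8)
    last_mask = (0xFF << (7 - end_bit % 8)) & 0xFF
    return start_bit // 8, end_bit // 8, first_mask, last_mask
```

With `depth=8` every call returns full masks, while `row_span(3, 7, 1)` returns partial masks on both ends, which is the extra work that makes the low depths harder to accelerate.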

Also I have only looked at BitBlt so far, and only for screen-to-screen; pattern-to-screen is something I should look at as I think that's where some of the Rect stuff ends up eventually. Other primitives are more complex, though theoretically they just need the appropriate software; I'm basically reinventing the acceleration model of the 8*24 GC, with a QuickDraw implementation running on a RISC core inside the NuBus device. Although I'm not targeting a full implementation, way too much work...

There's some speedo 4.02 numbers for the 8*24 GC from lowendmac; the unaccelerated numbers are significantly better in the IIfx than in slower systems and ultimately pretty good. That 0.8 unaccelerated in the Q650 is impressive, I wonder why I'm so much slower. Latency to access the memory? Or can some of those systems use burst mode to the device?

Anyway, the world has changed a lot since the days of QuickDraw and stuff custom-written in 68k assembly... fun to give it a go, but I'm not sure how far this will go.
 

volvo242gt

Well-known member
A IIfx will be much quicker than other Mac II machines, Nubus video card-wise. I remember having a SuperMac Spectrum/8*24PDQ in my old IIci. Was very slow. The built-in VampireVideo was much quicker. Installed the same card into my old IIfx (not the same model, but the same card). Video was as quick as the built-in video on the IIci. Now, if Apple had implemented VRAM and built-in video on the IIfx, the SuperMac card would have been slower than the theoretical built-in video.
 

demik

Well-known member
There's some speedo 4.02 numbers for the 8*24 GC from lowendmac; the unaccelerated numbers are significantly better in the IIfx than in slower systems and ultimately pretty good. That 0.8 unaccelerated in the Q650 is impressive, I wonder why I'm so much slower. Latency to access the memory? Or can some of those systems use burst mode to the device?

I have a few Q650s, IIcx and Tobys+DPD cards in another location. Will run a set of speedo benchmarks when I get there :)
 

CC_333

Well-known member
I have a question:

How would a theoretical NuBus90 (the max. throughput of which is basically 1.75x faster than it is on original NuBus) video card perform relative to the onboard video in, say, a Quadra 840av?

I would expect it to at least hold its own, but the onboard video would still probably be faster.

c
 

Jamieson

Well-known member
@Jamieson As mentioned by @cheesestraws it seems some drivers are using lower level interfaces that have 'easier' semantic (see e.g. my comments here). But so far I've yet to be able to figure out how to patch anything (not even SysBeep) without crashing the system :-( so no acceleration yet. And I've yet to get busmastering to work (not needed for a framebuffer, but might be useful for other uses).

On the plus side: now Thousands of Colors (15/16 bits) is working as well, and seems, as expected, a bit faster than Millions (24/32). @demik : that was a bit harder than expected, but just because I had to figure out which 3 sets of 5 bits to use for each color in the 16 bits (figuring out endianness at bit and byte level is a mess when going from 68k -> NuBus -> Wishbone -> DDR3 and then DDR3 -> Wide FIFO -> Down-Converter -> Output and no-one agrees)
Hrm, that link points to a Swedish talk/variety show. After an hour or so they still have not gotten around to talking about hardware video acceleration....? Was there another file you meant to point to? Takk!
 

Melkhior

Well-known member
@demik: it would be quite interesting to compare the same board in different hosts. I also should get around to writing a stream-like benchmark to measure bandwidth to RAM or VRAM to see how different systems behave (and another for latency similar to lmbench's lat_mem_rd).
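As an illustration of the latency half of that idea, here is a minimal Python sketch of a pointer-chasing loop: every load depends on the previous one, so the time per step approximates load-to-use latency rather than bandwidth. This is only a sketch of the technique, not the benchmark used in the thread. The function names are hypothetical, the randomized single-cycle chain (Sattolo's algorithm) is a choice made here for illustration, and in Python the interpreter overhead dominates the timing, so a real measurement on a 68k Mac would need C or assembly.

```python
import random
import time


def make_chain(n: int, seed: int = 0) -> list[int]:
    """Build a random permutation that forms one single cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain: list[int], steps: int) -> tuple[int, float]:
    """Follow the chain for `steps` dependent loads; return (final index, ns per step)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]              # the next address depends on this load
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps * 1e9
```

Because the chain is a single cycle, walking `len(chain)` steps touches every element exactly once before returning to the start, which avoids accidentally measuring a short loop that fits in a cache.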

@CC_333: NuBus90 speeds up burst transfers only, which I don't think the host will initiate to the device - though maybe for move16 or other 'bulk' instructions? Early Quadras don't support it at all on the host controller, though later ones probably can handle NuBus90 bursts from master devices. All Quadras support it for device-to-device, but that's not super useful except in some very special use cases.

@Jamieson: Weird! I meant to post a link to a post in another thread, this one I think.
 

Jamieson

Well-known member
Reading up on early video card designs, it seems like a lot of the complexity was managing access to the frame buffer memory. The host wants to write to it without waiting, the video DAC wants to read from it continuously, and on top of that there are DRAM refresh cycles going on. In a world of 7400 series logic chips and a few PALs, that was quite a feat.

Seems that with modern hardware there is "lots" of time available to manage access to the frame buffer memory block. Assuming an FPGA with 100 or 200MHz clock, on the write side, the NuBus interface can write 32 bits every 100ns (best case); an FPGA has 10 or 20 clocks available to do stuff in that same period. On the read side, the H and V blanking intervals provide a good opportunity to quickly read a few lines worth of pixels and push them into a FIFO to keep feeding the video DAC. Assuming some typical DDR2 or DDR3 hanging off the FPGA, is that interface fairly low latency?
 

Melkhior

Well-known member
@Jamieson Memory is still something you need to be careful about - vintage 640x480/8@75Hz only requires about 23 MB/s of bandwidth from the 300 KiB framebuffer, but 1920x1080/32@60Hz (Full HD) requires nearly 500 MB/s from the nearly 8 MiB framebuffer!

100 MHz is what I get (reasonably easily) from my FPGA, and the 16-bits DDR3 interface ends up running at 400 MHz, i.e. 800 MT/s (so 1.6 GB/s of theoretical bandwidth). That's already a very large fraction of the achievable bandwidth consumed by the DAC. You can get more bandwidth by using a faster (read: much more expensive) FPGA and memory chip, or with an extra channel and DDR3 chip and a FPGA with more pins (also more expensive).
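The figures above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the function names are hypothetical, blanking intervals are ignored (only visible pixels are counted, averaged over the whole frame), and the 1.6 GB/s is the theoretical number quoted above, not a measured achievable bandwidth.

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size in bytes of one frame of visible pixels."""
    return width * height * bits_per_pixel // 8


def scanout_bandwidth(width: int, height: int, bits_per_pixel: int, refresh_hz: int) -> int:
    """Average bytes per second the display output reads from the framebuffer."""
    return framebuffer_bytes(width, height, bits_per_pixel) * refresh_hz


vintage = scanout_bandwidth(640, 480, 8, 75)        # 23_040_000 B/s, about 23 MB/s
full_hd = scanout_bandwidth(1920, 1080, 32, 60)     # 497_664_000 B/s, nearly 500 MB/s
share = full_hd / 1.6e9                             # about 0.31 of the theoretical 1.6 GB/s
```

Roughly 31% of the theoretical peak goes to scan-out alone; since the achievable DDR3 bandwidth is below the theoretical peak, the real share is higher still.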

The issue with DDR3 is not so much the 'average' latency; I have no idea about my design, but CPUs with DDR3 see latencies in the 80-120 ns range with lmbench3/lat_mem_rd. That's huge compared to the 0.3333 ns cycle time of a 3 GHz CPU, but not so much compared to the 100 ns cycle time of NuBus. But this 'average' can be vastly exceeded during refresh cycles - the access has to wait for the SDRAM to become available...
So you end up with a deep FIFO (64 KiB!) to cover 'worst case' latencies in front of the DAC. And if someone else hogs the memory (e.g. a really fast acceleration engine...), then you can still have FIFO underrun and a shifted display.
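To get a feel for what a 64 KiB FIFO buys, the following Python sketch estimates how long a full FIFO can keep feeding the output while memory is unavailable. The function names are hypothetical, and the drain rate assumes the Full HD 32-bit 60 Hz mode from the earlier figures, averaged over the frame; the thread does not state which mode the FIFO was sized for.

```python
def fifo_cover_time_us(fifo_bytes: int, drain_bytes_per_s: float) -> float:
    """Microseconds a full FIFO lasts when drained at a constant rate with no refill."""
    return fifo_bytes / drain_bytes_per_s * 1e6


def fifo_scanlines(fifo_bytes: int, width: int, bits_per_pixel: int) -> float:
    """Number of scanlines of pixel data that fit in the FIFO."""
    return fifo_bytes / (width * bits_per_pixel // 8)


drain = 1920 * 1080 * 4 * 60                         # assumed Full HD 32-bit @ 60 Hz
cover_us = fifo_cover_time_us(64 * 1024, drain)      # about 132 microseconds
lines = fifo_scanlines(64 * 1024, 1920, 32)          # about 8.5 scanlines
```

Under these assumptions the FIFO covers on the order of a hundred microseconds of stalled memory, or about eight and a half scanlines, before the display would underrun.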
 

demik

Well-known member
Had a few hours today to play with my IIcx and Quadra lab boards. So as @Melkhior suspected, running a Toby in a Mac II is a little unfair to what it can do. I used it a few years back on a Q650 and it didn't feel that much slower than the onboard video (about 50% slower)

I read somewhere it's about as good as an unaccelerated board could do. Couldn't test the Dual Page card as said earlier because I have not found a way to display something on a VGA monitor.

Benchmarking configurations
IIcx

- System 7.5.5, Appearance Manager
- 8 MB RAM
- A/ROSE Ethernet and A/ROSE serial on other slots (Shouldn't matter)

Lab board Wombat (Q650/Q800):
- System 7.5.5
- 136 MB RAM.
- onboard VRAM maxed out

Lab board Cyclone (Q840av):
- System 7.5.5
- 128 MB RAM.
- onboard VRAM maxed out
- possibly defective NuBus (Toby was crashing in the NuBus slot with the DAV connector)

Benchmark Method:
Ran speedometer 4.02 with 5 iterations. On the Q650, tested a few Tobys, no difference found.

Results


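| Combination | B&W | 4 Colors | 16 Colors | 256 Colors | Thousands of Colors |
|---|---|---|---|---|---|
| IIcx + Toby | 0.261 | 0.247 | 0.239 | 0.224 | - |
| Wombat + Onboard | 1.255 | 1.240 | 1.255 | 1.280 | 1.347 |
| Wombat + Toby | 0.879 | 0.753 | 0.617 | 0.507 | - |
| Cyclone + Onboard | 1.627 | 1.554 | 1.474 | 1.464 | 1.395 |
| Cyclone + Toby | 1.112 | 0.936 | 0.726 | 0.590 | - |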

Looking at the IIcx running the benchmark, I feel like Toby is transaction bound in that machine. On Quadras, it's bandwidth bound (a combination of processor copy performance and NuBus speed; not sure how DMA efficient NuBus is).

I sort of feel like you are latency bound. Would be interesting to see how your benchmark turns out. Tracing the data in a Toby board, it goes to the 74LS245 buffers (±25-40ns) and to the Memory chips (120ns)
 

Melkhior

Well-known member
@demik Thanks for the numbers ! Seems logical from what I observe.

A super-crude, basic implementation of a stream-like and a lat_mem_rd-like benchmark gives me, with all numbers rounded to a couple of digits for readability:
| HW \ Bench | Write-only BW | Triad-like (2:1 R:W) BW | Load-to-use latency |
|---|---|---|---|
| Q650 RAM | 18 MB/s | 22 MB/s | 400 ns |
| Q650 NubusFPGA (synchronous) | 4.4 MB/s | 3.1 MB/s | 1300-1400 ns |
| Q650 NubusFPGA (buf. write) | 8 MB/s | 4 MB/s | 1300-1400 ns |
The last line is a recent modification of the gateware, where the writes are put in a FIFO and immediately acknowledged to the NuBus, thus saving on write latency - and boosting write bandwidth significantly. It doesn't change read behavior. That single modification pushes the Speedometer 4.02 number for unaccelerated 8-bits up by 30%, to 0.486. It also helps the accelerated mode, which (with basic rectangular bitblit & solid fill) moves from 0.8 to 0.9; it also helps all lower-depth numbers but less significantly.
The 0.486 is close to what you get with Toby in an identical machine; from DCDMF3's description of Toby, I suspect the loss comes from the higher read latency (Toby has really fast VRAM!) - I have to cross clock domains from NuBus to the system clock, then wait for the DDR3, then cross back to NuBus, and all of that takes time.
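A rough Python calculation shows why read latency matters so much here. It assumes each read is a separate 32-bit (4-byte) transaction and that reads are fully serialized, one completing before the next starts; both are assumptions, not something stated in the thread, and the function name is hypothetical.

```python
def latency_bound_bandwidth(bytes_per_access: int, latency_ns: float) -> float:
    """Bytes per second when every access waits out the full latency before the next."""
    return bytes_per_access / (latency_ns * 1e-9)


low = latency_bound_bandwidth(4, 1400)     # about 2.86e6 B/s
high = latency_bound_bandwidth(4, 1300)    # about 3.08e6 B/s
```

That gives roughly 2.9 to 3.1 MB/s, the same order as the measured triad-like bandwidth for the synchronous design. The triad mixes reads and writes, so the two numbers are not directly comparable, but the match suggests that reads are bound by latency rather than by raw bus speed.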

Edit: For the sake of completeness, I should say that the write buffer is effectively a non-coherent cache. Theoretically, after a write, an immediate read to the same address could bypass it and get a stale value from memory if the write has not yet reached the memory... however in practice it doesn't seem to cause issues (I suspect the write FIFO is at least as fast as the read CDC, so the write will always get to the Wishbone bus before the read, thus avoiding the issue)
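The hazard described in that edit can be shown with a toy Python model of a posted-write FIFO. It is purely illustrative: the class and its `drain_before_read` option are hypothetical, not how the gateware is structured. Draining the FIFO before each read is just one simple way to restore coherence, at the cost of added read latency.

```python
from collections import deque


class PostedWriteMemory:
    """Toy model: writes are acknowledged at once and reach memory later."""

    def __init__(self, drain_before_read: bool = False):
        self.mem: dict[int, int] = {}
        self.fifo: deque[tuple[int, int]] = deque()
        self.drain_before_read = drain_before_read

    def write(self, addr: int, value: int) -> None:
        # The bus sees the write as complete as soon as it is queued.
        self.fifo.append((addr, value))

    def drain_one(self) -> None:
        # One queued write actually reaches memory.
        if self.fifo:
            addr, value = self.fifo.popleft()
            self.mem[addr] = value

    def read(self, addr: int) -> int:
        if self.drain_before_read:
            while self.fifo:
                self.drain_one()
        # Without draining, the read goes straight to memory and can miss
        # a write that is still sitting in the FIFO.
        return self.mem.get(addr, 0)
```

With the default settings, a `write` followed immediately by a `read` of the same address returns the old value; with `drain_before_read=True` it returns the new one. The real design avoids the problem in practice only because, as suspected above, the write path wins the race.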
 