
NuBusFPGA: HDMI on NuBus Macs

demik

Well-known member
During use? I was thinking more of "halt the computer, take the board out, load the new bitstream, replug the board and the new monitor"...

Yes but on that board the FPGA is only used to read from the dual port video memory and generate the video signals. It's way easier than yours which has to handle the NuBus part and memory controller as well.

Turns out, 32 bits is slow, but not as slow as I feared; X11 on NetBSD/sparc is probably slowed down by a lot of CPU-based XRender. MacOS 8.1 is probably a lot more frugal in its use of CPU and read-back bandwidth.

For kicks, the monitor control panel on the NuBusFPGA "goblin" framebuffer running in 1920x1080 and set to 'millions of colors', captured by "apple-caps-4" and then converted to JPG by GraphicConverter 3.9.1:
View attachment 40701
I'm quite happy with that :)

That rocks!

That being said, wouldn't 16-bit video be a good middle ground? If I read your LiteX link properly, the DDR3 backend is using a 16-bit bus as well. It should ease the FIFO pressure too.
 

Melkhior

Well-known member
That being said, wouldn't 16-bit video be a good middle ground? If I read your LiteX link properly, the DDR3 backend is using a 16-bit bus as well. It should ease the FIFO pressure too.
The external interface to DDR3 is 16 bits wide, and is running DDR at 4x the system clock. The controller exposes ports of 128 bits running at sysclk to match the bandwidth. The FIFO buffers the 128 bits in sysclk, then the data crosses into the video clock domain, and finally it's down-converted to the appropriate bit width (1/2/4/8/32 so far).
Adding 16 bits would be fairly easy and would indeed offer a middle ground between 8-bit indexed and 32-bit direct, at the expense of yet another down-converter + extra muxing in the video clock domain.
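
To make the width matching above concrete, here is a small Python sketch of the arithmetic and of what a down-converter does to one 128-bit word. It is only an illustration, not the actual gateware: the function names are hypothetical, and the "most significant pixel first" ordering is an assumption (the real ordering depends on the endianness choices in the design).

```python
# Width matching between the DDR3 pins and the controller port.
DDR_BUS_BITS = 16        # external DDR3 data width (from the post)
CLOCK_RATIO = 4          # DDR3 I/O clock is 4x sysclk (from the post)
TRANSFERS_PER_CLOCK = 2  # double data rate: one transfer per clock edge

# Bits delivered by the pins during one sysclk period.
bits_per_sysclk = DDR_BUS_BITS * CLOCK_RATIO * TRANSFERS_PER_CLOCK


def pixels_per_word(depth_bits, word_bits=128):
    """Number of pixels carried by one wide FIFO word at a given depth."""
    return word_bits // depth_bits


def down_convert(word, depth_bits, word_bits=128):
    """Split one wide word into pixels.

    Assumption: the most significant bits hold the first pixel.
    """
    mask = (1 << depth_bits) - 1
    count = word_bits // depth_bits
    return [
        (word >> (word_bits - depth_bits * (i + 1))) & mask
        for i in range(count)
    ]


if __name__ == "__main__":
    print("bits per sysclk:", bits_per_sysclk)
    for depth in (1, 2, 4, 8, 16, 32):
        print(depth, "bpp ->", pixels_per_word(depth), "pixels per word")
```

Each supported depth needs its own split of the same 128-bit word, which is why adding a 16-bit mode means one more down-converter and one more input on the output mux.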
 

cheesestraws

Well-known member
I have nothing very much to add to this at this stage other than to say that this project has cheered me up during a very tiring and melancholy few weeks.

And yes, to agree with what other people have said: Quadra onboard video is fast. No unaccelerated NuBus card can reasonably keep up.
 

Melkhior

Well-known member
PDS would be faster, but is a lot more sensitive as an interface (you're basically interfacing with the CPU directly...) and less versatile machine-wise - though I have only tried the NuBusFPGA in my Q650 so far, so that's very theoretical.

Also : "No unaccelerated NuBus card can reasonably keep up" (emphasis mine) - the Sun CG6 framebuffer is accelerated already, and in SBusFPGA I experimented with a Jareth 'vector' accelerator for the Goblin framebuffer used in NuBusFPGA :)

(and yes, all the names are just because the Apple card is called Toby - if you don't get the reference, you're likely younger than I am...).
 

cheesestraws

Well-known member
Oh, yup, I wasn't saying you couldn't do acceleration :) I'm just saying that comparing your current card with Quadra onboard video as I think you did upthread is unfair to your card.
 

Melkhior

Well-known member
@cheesestraws Not necessarily unfair - unaccelerated video is heavily reliant on the bus performance between the CPU and the VRAM, and NuBus was already obsolete by the Quadra 650 era; the results show that NuBus isn't a great choice for unaccelerated video - its strengths lie elsewhere: widely available connector, compatible with older systems, easier to interface with, ... Some might have expected such a recent design to be much faster than other solutions out-of-the-box, and I think it's important to show the limits (performance) as well as the strengths (you do get a 1920x1080 display at 1/2/4/8/32-bit depths on a classic Mac!).
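
A rough way to see why bus performance dominates unaccelerated video is to compute how many bytes a full 1920x1080 screen takes at each depth, and how long the CPU would need to push them across the bus. The sketch below is illustrative arithmetic only: the 10 MB/s sustained rate is a made-up placeholder, not a measurement from this thread, and the 16-bit row anticipates the mode that is only added later in the discussion.

```python
WIDTH, HEIGHT = 1920, 1080
DEPTHS = (1, 2, 4, 8, 16, 32)  # bits per pixel

# Hypothetical sustained CPU-to-VRAM write rate, in bytes per second.
# This is an assumption for illustration, not a measured NuBus figure.
ASSUMED_BUS_RATE = 10_000_000


def frame_bytes(depth_bits, width=WIDTH, height=HEIGHT):
    """Size of one full frame in bytes at the given pixel depth."""
    return width * height * depth_bits // 8


def full_redraw_seconds(depth_bits, bus_bytes_per_second=ASSUMED_BUS_RATE):
    """Time to write every pixel once at the assumed bus rate."""
    return frame_bytes(depth_bits) / bus_bytes_per_second


if __name__ == "__main__":
    for depth in DEPTHS:
        size = frame_bytes(depth)
        secs = full_redraw_seconds(depth)
        print(f"{depth:2d} bpp: {size:>9,d} bytes, {secs:.3f} s per full redraw")
```

Whatever the real sustained rate is, the cost scales linearly with depth, so each halving of the pixel depth halves the traffic for the same screen area.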

For acceleration the primary issue on Macs is the software side, so it's likely to stay unaccelerated for a while. Whereas the NetBSD console (mostly for scrolling) and X11 using EXA have clear interfaces for acceleration, System 7/MacOS 8 don't really :-(
 

uyjulian

Well-known member
Is OpenGL(ES) a suitable vector for acceleration?

I know that older accelerated cards required an extension to enable acceleration, and broke between different System versions due to the fact that it hooked/patched QuickDraw calls.
One such example is the Macintosh Display Card 8•24 GC

For newer video cards that provide OpenGL acceleration, I'm not too sure how they work.
If it worked like QuickDraw->OpenGL->HAL, then maybe it would be easy to switch out the OpenGL or HAL layer.
However if it worked like:
QuickDraw->HAL
OpenGL->HAL
I'm not too sure if it would be OK to use that method.

I know Apple provided OpenGL drivers for MacOS 8 and later.

Personally, I would think of implementing OpenGLES on the card side and providing an RPC interface for it.
And then use gl4es on the Mac side to translate OpenGL to OpenGLES.

So the Mac-side chain would look something like this:
QuickDraw->(QuickDraw to OpenGL translation layer)->OpenGL->(gl4es translation layer to translate OpenGL to OpenGLES)->OpenGLES->(RPC interface to communicate to the video card)

And then on the video card side, you could use that OpenGLES data as-is, or convert it to something different, e.g. Vulkan with ANGLE.
 

Melkhior

Well-known member
@uyjulian Short version, no. Long version, h*ll no :)

More seriously - on the one hand, OpenGL (even ES) is an atrociously complicated thing to implement in hardware. Doing an OGL-capable FOSH device has been an ambition/hope for many people for a long time, and it's still not there. Even if a design existed, it wouldn't fit in a small (or even not-so-small but not yet stupid-expensive) FPGA anyway. That's a non-starter hardware-side, just way too complex.

On the other hand, the wealth of features of OpenGL is not helpful to implement QuickDraw, which is only designed to move around integers in a 2D plane, not to do sophisticated transform & shading of 3D surfaces. It would only make sense for e.g. QuickDrawGX & QuickDraw3D, if that.

In your chain, the first step is already a complicated one - where do you change QuickDraw to call some acceleration? Once you have figured that out, you can fit any 'simple' hardware design that can do the job behind it; no need to go for something overly complex. To emulate the CG6 framebuffer from Sun, I have a simple RISC-V core and some C code on top of it, thus easily emulating the acceleration features of the CG6 that the software actually uses (which is probably less than 10% of what the 'real' hardware can do!). That's probably a much easier approach, and can be very 'incremental'; you can start by patching some very specific sub-sub-sub-routine and then work your way up to more generalized versions.

Edit: +link
 

Jamieson

Well-known member
Interesting thread. So it sounds like without acceleration in the card itself, NuBus is the performance bottleneck. It also sounds like accelerated graphics cards can achieve good performance despite the NuBus interface. So I'm curious what "accelerated" means in this context. Does that mean the CPU tells the card "draw a box from X1Y1 to X2Y2 and make it color XXX" or "draw a line from x1y1 to x2y2"?
 

cheesestraws

Well-known member
Does that mean the CPU tells the card "draw a box from X1Y1 to X2Y2 and make it color XXX" or "draw a line from x1y1 to x2y2"?

Depends on the card and the kind of acceleration. QuickDraw is kind of designed to allow some of the work to be done on the card through "bottleneck routines" which different cards can override. So, for example, a card can accelerate "draw line" but not "draw box" should it desire to.
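
As a loose illustration of the bottleneck idea, here is a toy Python model: drawing calls go through a per-port table of routines, and a card driver replaces only the entries it can speed up. This is not real Toolbox code. The keys lineProc and rectProc are borrowed from QuickDraw's QDProcs record purely as labels, and everything else (class, functions, returned strings) is hypothetical.

```python
def std_line(start, end):
    """Default routine: the CPU writes every pixel over the bus."""
    return f"CPU draws line {start}->{end}"


def std_rect(rect):
    """Default routine: the CPU fills the rectangle itself."""
    return f"CPU fills rect {rect}"


class Port:
    """Toy graphics port with a replaceable table of bottleneck routines."""

    def __init__(self):
        self.procs = {"lineProc": std_line, "rectProc": std_rect}

    def draw_line(self, start, end):
        return self.procs["lineProc"](start, end)

    def fill_rect(self, rect):
        return self.procs["rectProc"](rect)


def card_line(start, end):
    """Hypothetical accelerated routine: send one short command to the card."""
    return f"card draws line {start}->{end}"


if __name__ == "__main__":
    port = Port()
    port.procs["lineProc"] = card_line  # accelerate lines only
    print(port.draw_line((0, 0), (10, 10)))
    print(port.fill_rect((0, 0, 5, 5)))  # still the default CPU path
```

The point of the model is only that overriding is per routine: anything left in the table keeps going through the slow default path.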

Apparently in practice (according to @Melkhior) this is a bit impractical and it's easier to accelerate at a slightly lower level. I've never written the actual firmware for one of these, so I bow to others' expertise. But either way, different cards can accelerate different operations.
 

Melkhior

Well-known member
@Jamieson As mentioned by @cheesestraws it seems some drivers are using lower level interfaces that have 'easier' semantics (see e.g. my comments here). But so far I've yet to be able to figure out how to patch anything (not even SysBeep) without crashing the system :-( so no acceleration yet. And I've yet to get busmastering to work (not needed for a framebuffer, but might be useful for other uses).

On the plus side: now Thousands of Colors (15/16 bits) is working as well, and as expected it seems a bit faster than Millions (24/32). @demik : that was a bit harder than expected, but just because I had to figure out which 3 sets of 5 bits to use for each color in the 16 bits (figuring out endianness at bit and byte level is a mess when going from 68k -> NuBus -> Wishbone -> DDR3 and then DDR3 -> Wide FIFO -> Down-Converter -> Output and no-one agrees)
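
To show why the "which 3 sets of 5 bits" question is so sensitive to byte order, here is a short Python sketch. It assumes the usual 1-5-5-5 layout (top bit unused, then red, green, blue, 5 bits each) stored big-endian; treat that layout as an assumption for illustration rather than a statement about what the gateware does internally. The function names are hypothetical.

```python
def unpack_rgb555(pixel):
    """Split a 16-bit pixel into 5-bit (r, g, b).

    Assumed layout: bit 15 unused, bits 14-10 red, 9-5 green, 4-0 blue.
    """
    return (pixel >> 10) & 0x1F, (pixel >> 5) & 0x1F, pixel & 0x1F


def expand_5_to_8(value):
    """Scale a 5-bit channel to 8 bits by replicating the top bits."""
    return (value << 3) | (value >> 2)


def pixel_from_bytes(two_bytes, byteorder):
    """Interpret two bytes from memory as one 16-bit pixel."""
    return int.from_bytes(two_bytes, byteorder)


if __name__ == "__main__":
    pure_red = bytes([0x7C, 0x00])  # 0x7C00 stored big-endian
    right = unpack_rgb555(pixel_from_bytes(pure_red, "big"))
    wrong = unpack_rgb555(pixel_from_bytes(pure_red, "little"))
    print("big-endian read:   ", right)
    print("little-endian read:", wrong)
```

A single byte swap anywhere along the path turns pure red into a mostly blue pixel, so one misjudged stage is enough to scramble every color on screen.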
 

cheesestraws

Well-known member
But so far I've yet to be able to figure out how to patch anything (not even SysBeep) without crashing the system :-(

From the declrom or an extension? I'd suggest doing it as a loadable extension at least to start with—I suspect there are good reasons why cards tended to implement their acceleration as INITs or CDEVs. I can send you an example of doing it in an INIT that's known working if that'd be helpful.
 

Melkhior

Well-known member
@cheesestraws From either; I've tried to do an INIT in CodeWarrior (to remove any potential issue with Retro68) but with no more luck. The INIT will load just fine at boot time, and as long as I put back in SetTrapAddress the address I got from GetTrapAddress it works OK. But anything else will crash the system, no matter the content of the function. I _think_ I get the address of the function OK, as the technique works for the VBL interrupt callback. A working example of a trap patch would go a long way, so if you can share one I'd be very grateful!
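
For readers unfamiliar with the pattern being attempted, here is a toy Python model of a "head patch": read the current handler, install a replacement, and have the replacement chain to the saved original. The two helper functions mimic GetTrapAddress and SetTrapAddress in name only; the table, the handlers and the log are all hypothetical. The model deliberately says nothing about the 68k-specific parts (where the patch code lives after the INIT returns, register preservation, calling conventions), which it cannot capture.

```python
trap_table = {}  # toy stand-in for the system trap dispatch table
log = []         # records the order in which handlers run


def get_trap_address(trap_name):
    """Model of reading the current handler for a trap."""
    return trap_table[trap_name]


def set_trap_address(trap_name, handler):
    """Model of installing a new handler for a trap."""
    trap_table[trap_name] = handler


def original_sys_beep(duration):
    log.append(f"original SysBeep({duration})")


trap_table["SysBeep"] = original_sys_beep


def install_head_patch(trap_name):
    """Save the old handler, install one that runs first and then chains."""
    old_handler = get_trap_address(trap_name)

    def patched(duration):
        log.append("patch runs first")
        old_handler(duration)  # chain to the saved original

    set_trap_address(trap_name, patched)
    return old_handler


if __name__ == "__main__":
    install_head_patch("SysBeep")
    trap_table["SysBeep"](30)  # the "system" dispatching the trap
    print(log)
```

In this model, putting the saved handler back with set_trap_address restores the original behavior exactly, which matches the one case reported as working above.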

Edit: as soon as I posted that, I realized the interrupt callback is in the DeclROM using Retro68, so that might not be a good reference point for an INIT in CW...
 

cheesestraws

Well-known member
A working example of a trap patch would go a long way, so if you can share one I'd be very grateful!

Here you go, have a Very Silly Example.


This is an INIT, written in the CodeWarrior Pro C dialect, which patches DrawMenuBar to give those old PowerBooks the M1 Pro/Max notch. :D I seem to have omitted to commit the project file, but can send you that too, if you would like.

If it's unclear why bits of that are happening, let me know and I'll explain it
 

Trash80toHP_Mini

NIGHT STALKER
Dunno if this came up in the general discussion, if so I apologize as I missed it in skimming?

Accelerated or not, no NuBus Card has ever nor will ever match the performance of onboard Video within its limitations. A PDS VidCard could, but NuBus Bandwidth is lacking for what passes for performance numbers when talking about onboard video.

NuBus is for nice big fat color depth pixels by the barrel. 68K onboard video is in a different class. Lesser if you're a graphics type and greater if a gamer or running general purpose apps at lower bit depths than needed for graphics on a high end NuBus VidCard.
 

Jamieson

Well-known member
Dunno if this came up in the general discussion, if so I apologize as I missed it in skimming?

Accelerated or not, no NuBus Card has ever nor will ever match the performance of onboard Video within its limitations. A PDS VidCard could, but NuBus Bandwidth is lacking for what passes for performance numbers when talking about onboard video.

NuBus is for nice big fat color depth pixels by the barrel. 68K onboard video is in a different class. Lesser if you're a graphics type and greater if a gamer or running general purpose apps at lower bit depths than needed for graphics on a high end NuBus VidCard.

It's my understanding that NuBus generally runs slower than the processor bus and there is more of a protocol or handshaking involved, and that's the reason for the reduced bandwidth. Whereas if the FPGA is sitting on a PDS card, it's directly connected to the ADDR/DATA/CTRL lines of the CPU itself. So then by decoding the address lines, one could get the absolute maximum write bandwidth into the FPGA.

Are the SE/30 and IIci PDS slots similar enough to consider developing some sort of video card that could work in both machines? Was there ever a commercial product like that developed?
 

Trash80toHP_Mini

NIGHT STALKER
Yep, NuBus runs at 10MHz, slower than even the Mac II system bus. It was plenty fast enough with the simple Toby Frame Buffer Card when the II was new, but quickly fell behind. Early onboard Mac Video can be considered a forerunner of the ISA world's VLB in the 486 era.

Unless @Melkhior has changed his spec part way thru this thread, his FPGA is still running on NuBus with its 10MHz bottleneck.

QuickDraw Acceleration has a rep far beyond what it actually does, in many cases confusing folks into thinking they're getting a NuBus card that will be "faster" than onboard video. In many ways users threw away the benefits of QuickDraw Accelerations by bad, but convenient choices of user interface options, like using the "hand tool" for unaccelerated scrolling rather than making use of the scroll bars or page up/page down which are.

In no case can "Acceleration" overcome limits of bandwidth.

Quite a few 030 PDS VidCards were developed, some with and some without acceleration.

However the IIci Cache Slot is, by Apple's definition, NOT a PDS slot. It's documented as an Application Specific Slot, and was the first implementation of one; it has no way to connect to the world outside a Mac's case, whereas by definition a true PDS Slot must.
 

Melkhior

Well-known member
@cheesestraws First, thanks for the example! I could use the project as well, as it seems I didn't get it right on my own. Trying your code in my project, main() runs fine (like with my own similar attempt), but 8.1 freezes after loading the extensions, presumably at the point where it would draw the menu bar. main() is called and finishes fine, but it seems the callback is never called (after trying the 'vanilla' code I added some debug writes to my virtual framebuffer that QEMU forwards to the console, so I have low-overhead tracing). I _guess_ it crashes when calling the address put by SetTrapAddress.

I tried 7.1 as well, which crashes 'nicer' by telling me 'bad f-line':

crash71.jpeg

Currently I have CW11 installed, but I can add a newer one, just need to know which version the project is using.

@Jamieson Yes, NuBus is not as fast as direct access through PDS, mostly because of the slower frequency mentioned by @Trash80toHP_Mini. The requests also need to go through the NuBus controller and follow the protocol, which adds to the overhead as you said. Those are the downsides - the upside is that a NuBus device is independent from the CPU bus, something novel at the time; the same board can work in either Macintosh II ('030) or Quadra ('040). Other buses of the era were commonly tied to a CPU bus, and using them with a different CPU involved a lot of logic on the host side to remain compatible with the old devices (e.g. VMEbus was basically the 68k bus, but was/is used with other CPUs such as SPARC in the Sun 4 family).

@Trash80toHP_Mini Yes, the faster bus of the PDS makes a lot of difference, in particular for code not going through regular QuickDraw calls or going through unaccelerated ones. The NuBusFPGA is (meant to be) NuBus-compliant, so it suffers from the bandwidth limitation. It should be possible to do a version with a '040 PDS, but the connector is virtually unobtainium it seems :-( (the NuBus connector is a still-in-use DIN standard so widely available).
 

demik

Well-known member
On the plus side: now Thousands of Colors (15/16 bits) is working as well, and as expected it seems a bit faster than Millions (24/32). @demik : that was a bit harder than expected, but just because I had to figure out which 3 sets of 5 bits to use for each color in the 16 bits (figuring out endianness at bit and byte level is a mess when going from 68k -> NuBus -> Wishbone -> DDR3 and then DDR3 -> Wide FIFO -> Down-Converter -> Output and no-one agrees)

That's really nice, good job. Isn't NuBus big-endian in 68k Macs? IIRC it was supposed to follow the host CPU endianness.
 