
QuadraFPGA: HDMI for the 68040 PDS slot

Melkhior

Well-known member
Hello all,

Of course it was coming :) Following the NuBusFPGA and IIsiFPGA, here comes the QuadraFPGA, adding the Highly Desirable Macintosh Interface to Macintosh Quadras.

It's very similar to the IIsiFPGA, but is designed to connect to the Quadra PDS slot and talk to the 68040 there, including burst support on reads and writes. So far I've only tested it in my Quadra 650. It requires a KEL 8807-140-170LH connector, which is nearly impossible to find: thanks to member @Jockelill for sending me a couple for my prototypes :) (there's a thread on TD about a possible alternative, untested so far).

Currently the full framebuffer available in the NuBusFPGA/IIsiFPGA is supported, including audio & acceleration. It's much faster than either the NuBusFPGA or the IIsiFPGA, thanks to the 68040 performance and the direct bus. Here are some Speedometer 4.0.2 results when set to 640x480, as most other Macs are tested:
QuadraFPGA-640x480.jpg
Quite happy with that :)

To limit cost, the board is much smaller than a full PDS card (100mm*100mm, cheap at JLCPCB!), so the HDMI connector is way inside the case; it's not very pretty. A 3D-printed backplate and an extension cable (@Jockelill found that one) would improve things.

Currently it doesn't do memory expansion, as the Q650 I have has no ROM slot, and it seems it will be more complicated than on the IIsi due to the more complex memory controller & associated initialization.

It's all on GitHub, like the others.
 

Phipli

Well-known member
Absolutely awesome - accelerated PDS cards are all but nonexistent.

How have you arranged the VRAM? Built in 24bit Quadra video (Q700/900/950) uses 32bits per pixel (wastefully) to improve performance, is that an option for you given your abundance of VRAM?
 

Melkhior

Well-known member
Absolutely awesome - accelerated PDS cards are all but nonexistent.
Probably did make sense back then; in my case, while acceleration does help, the benefits are much less in the QuadraFPGA than in the others.
QD on '040 is quite fast, so you need quite a bit of acceleration to improve on the raw framebuffer... and it does extensively use 'move16', so burst support helps.

How have you arranged the VRAM? Built in 24bit Quadra video (Q700/900/950) uses 32bits per pixel (wastefully) to improve performance, is that an option for you given your abundance of VRAM?
In all variants, all pixels are aligned on a power-of-two number of bits, so thousands (3x5) is on 16 bits and millions (3x8) is on 32 bits. Packing more efficiently for space would be much, much less efficient in terms of accesses. And as you said, plenty of space available anyway.
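
A minimal sketch of the alignment argument, assuming a 32-bit access word and treating each pixel as one access (the function names are illustrative only). It counts how many pixels of a 640-pixel row would cross a word boundary with packed 3-byte pixels versus aligned 4-byte pixels:

```python
# Toy comparison of packed 24-bit pixels vs. 32-bit aligned pixels.
# Assumption: a 32-bit access word; names here are illustrative only.

WORD_BYTES = 4  # one 32-bit access


def words_touched(pixel_index, bytes_per_pixel):
    """Return how many 32-bit words a single pixel access spans."""
    first_byte = pixel_index * bytes_per_pixel
    last_byte = first_byte + bytes_per_pixel - 1
    return last_byte // WORD_BYTES - first_byte // WORD_BYTES + 1


def straddling_pixels(count, bytes_per_pixel):
    """Count pixels among the first `count` that cross a word boundary."""
    return sum(1 for i in range(count) if words_touched(i, bytes_per_pixel) > 1)


packed_24 = straddling_pixels(640, 3)   # millions, packed on 3 bytes
aligned_32 = straddling_pixels(640, 4)  # millions, aligned on 4 bytes
aligned_16 = straddling_pixels(640, 2)  # thousands, aligned on 2 bytes
print(packed_24, aligned_32, aligned_16)
```

With packed pixels, half of them end up split across two words, so a single pixel write would need two accesses (or a read-modify-write); with power-of-two sizes that never happens.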
 

Phipli

Well-known member
In all variants, all pixels are aligned on power-of-two bits, so thousands (3x5) is on 16-bits and millions (3x8) is on 32-bits. Packing more efficiently for space would be much, much less efficient in terms of accesses. And as you said, plenty of space available anyway.
Excellent :)

I'd be curious to see a benchmark comparison between a Q950 @24bit and your card. See how much you have gained.

Have you speed checked each element of the acceleration (Norton System Info gives a breakdown by some specific QuickDraw routines if needed) in case any of them result in a slowdown versus raw framebuffer?

Image translation is likely to be the big gain I guess.
 

Melkhior

Well-known member
I'd be curious to see a benchmark comparison between a Q950 @24bit and your card. See how much you have gained.
I don't have a Q950, and I think my Q650 is limited to 15/16 and can't do 24/32 on the internal video :-(

Have you speed checked each element of the acceleration (Norton System Info gives a breakdown by some specific QuickDraw routines if needed) in case any of them result in a slowdown versus raw framebuffer?
Image translation is likely to be the big gain I guess.
No, for now I haven't done any performance testing beyond Speedometer, whose only virtues are to be small, fast, and trivial to use. Though you get what you pay for, which isn't much in terms of actual relevance I'd say... Visually, the difference between NuBusFPGA and QuadraFPGA is more significant than the numbers suggest - I would expect unaccelerated QuadraFPGA to be more comfortable at high depth than accelerated NuBusFPGA on the Q650. The acceleration only really helps scrolling, while moving from NuBus to PDS helps *everything*. Adding acceleration on the QuadraFPGA does help scrolling some more, which is good, but it doesn't feel 'mandatory' the way it does on the NuBusFPGA.

Of course the acceleration is currently limited in scope due to software; more could be added. But I suspect it would help the NuBusFPGA and IIsiFPGA a lot more than the QuadraFPGA.
 

Phipli

Well-known member
I don't have a Q950, and I think my Q650 is limited to 15/16 and can't do 24/32 on the internal video :-(
Yeah, I'm in the same boat. Only 24bit Quadra I have is an 840 and the video is different and possibly slower on that.

No, for now I haven't done any performance testing beyond Speedometer, whose only virtues are to be small, fast, and trivial to use.
I suggest grabbing Norton's System Info, it's only a few hundred k in size and gives a breakdown when you ask for more detail. I'll see if I can find a copy. It tends to be bundled which is less convenient.
 

Melkhior

Well-known member
(Norton System Info gives a breakdown by some specific QuickDraw routines if needed)
Turns out, I had 3.5.3 on a drive and mounting the Toast image is fairly quick.

Norton-Q650-640x480.jpg
Sorry for the cursor and pardon the French :) Using the default Q950 setup as reference, hopefully it was tested at 8 bits. "Système testé" is the Q650 internal video set to 640x480, 8 bits. Q650-QuadraFPGA-640x480-8bits should be self-explanatory :) No optimization done; this is 8.1, no disk cache, AppleTalk enabled, extensions loaded.

QuadraFPGA is overall the fastest of the 68k machines here, but it is helped a lot by the "Défilement" (scrolling) result, along with two of the CopyBits tests (G/A and G/N); unfortunately I'm not sure what those mean, but I guess they are screen-to-screen somehow, as that's what is currently accelerated (and they end up faster than the 6100/60 too!). I'm not sure why the others are slower than the internal video and the Q950 - if they do some readback, the higher read latency could be a factor. Or maybe the write latency is higher when the FIFOs are full and not using burst writes. Maybe I should try larger FIFOs :)
 

Phipli

Well-known member
Ah, you got there first :)

Well, for future searches, here is Norton System Info for anyone looking for it.
 

Attachments

  • System Info ƒ.sit.hqx
    662.9 KB

Phipli

Well-known member
hopefully it was tested at 8 bits
It will have been, in the summary results window there is an info button that gives details of how the tested machine was set up, but they always default to 256 colours.
CopyBits (G/A and G/N); unfortunately I'm not sure what those means,
Blitting basically - aligned and not aligned.

I'm not sure why the others are slower than the internal video and the Q950 - if they do some readback, the higher read latency could be a factor. Or maybe the write latency is higher when the FIFOs are full and not using burst write. Maybe I should try larger FIFOs
It looks like the 950's drawing primitives are faster then, but you're winning on anything that is memory intensive - moving images, copying images and blitting various sized blocks of VRAM.
 

Melkhior

Well-known member
This is very funny.
View attachment 65011
I guess Defilement means something quite different in French!
Hehe, yes, in French it's just "scrolling" basically in that context. The "accent aigu" on that first "é" is quite important :)

It looks like the 950's drawing primitives are faster then, but you're winning on anything that is memory intensive - moving images, copying images and blitting various sized blocks of VRAM.
Any internal data movement that can be accelerated will be much faster on the *FPGA, as in the ideal case it's a bunch of cached 64-bit reads/writes over a dedicated 128-bit memory channel in the 100 MHz VexRiscv.

But the drawing primitives should be similar - it's the same code on the same CPU, and the QuadraFPGA will accept writes instantly if there's room in the FIFOs. Larger FIFOs are inconclusive for now, but that's no surprise: unless they are big enough to never get full, given the kind of tests done, the steady state will be 'full FIFO' even if it takes a bit longer to get there. For instance in the "lines" test, the access pattern will look a lot like random writes to the memory subsystem (diagonals!), so latency will not be great, and the FIFO will empty slowly, get full, and the back-pressure will slow the '040 also - at least that's my guess.
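
The 'full FIFO steady state' reasoning can be illustrated with a toy model. This is only a sketch of the guess above, not of the actual gateware: the per-tick write and drain rates are made-up values, and `simulate` is a hypothetical helper.

```python
# Toy model of a write FIFO between the '040 and the memory subsystem.
# All rates and depths are made-up illustration values, not measurements.

def simulate(depth, writes_per_tick, drains_per_tick, ticks):
    """Return (first tick at which the FIFO is full or None, stalled writes)."""
    level = 0
    first_full = None
    stalled = 0
    for t in range(ticks):
        # The memory side retires entries first...
        level = max(0, level - drains_per_tick)
        # ...then the CPU tries to push its writes for this tick.
        for _ in range(writes_per_tick):
            if level < depth:
                level += 1
            else:
                stalled += 1  # back-pressure: the CPU would have to wait
        if level == depth and first_full is None:
            first_full = t
    return first_full, stalled


# CPU producing faster than memory drains: both depths end up full.
small = simulate(depth=32, writes_per_tick=2, drains_per_tick=1, ticks=10_000)
large = simulate(depth=1024, writes_per_tick=2, drains_per_tick=1, ticks=10_000)
# Memory keeping up with the CPU: the FIFO never fills.
balanced = simulate(depth=32, writes_per_tick=1, drains_per_tick=1, ticks=10_000)
print(small, large, balanced)
```

In this model a deeper FIFO only delays the moment it first fills; once full, the stall rate per tick is the same for both depths, and a FIFO that can be drained as fast as it is filled never stalls at all.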

"Real" VRAM would have a much more homogeneous write latency across access patterns: a worse best-case scenario, but a better worst-case scenario.
 

Melkhior

Well-known member
and the back-pressure will slow the '040 also - at least that's my guess.
Well, a hypothesis has to be disprovable to be scientific, and unfortunately this one seems disproved. The FIFOs don't seem to be full a sufficient fraction of the time to meaningfully affect performance through back-pressure. A much larger FIFO (1024 instead of 32) pretty much removes all occurrences of FIFO full, yet doesn't help performance.

So I'm not sure why QD would be faster on the Q650 and Q950 native video than on the QuadraFPGA...

... but it might be a "dirty" trick from Apple :)
Unrelated investigation into supporting memory expansion revealed this in the Q650 ROM:
Code:
           4088010e 00 00 c0 00     ulong     C000h                   ;newtc
           40880112 f9 00 c0 60     ulong     F900C060h               ;newt0
           40880116 80 7f c0 40     ulong     807FC040h               ;newt1
           4088011a 03 f6           ushort    3F6h                    ;templOff      0x40880504
           4088011c 06 5e           ushort    65Eh                    ;specialOff    0x4088076C
           4088011e 0f 6e           ushort    F6Eh                    ;physicalOff   0x4088107C
This is the default setup for the '040 MMU in 32-bit mode. 'TC' is the Translation Control Register, and 0xC000 just says 'enable with 8 KiB pages'. The next two are values used to initialize the Transparent Translation Registers. On the 68040 there are two pairs: DTT0/ITT0 for data/instruction, and DTT1/ITT1. Apple only has two values as they reuse the structure from the '030 MMU, where data and instruction are not distinguished in the TT - Apple doesn't use the TT on the '030 anyway. On the '040, they stuff the same value in D and I, so they make do with two values for the four registers.

And decoding the init values reveals an interesting tidbit:
  • The top byte (bits 31-24) is the top byte of the address to match, so here we have $F9 and $80 - $F9 is where the internal video lives.
  • The next byte (bits 23-16) is a mask for the previous value. Any bit set to one is ignored in the address. So the $F9, masked by $00, matches all addresses from $F900_0000 to $F9FF_FFFF. But the $80 with $7F matches all addresses with bit 31 set, from $8000_0000 to $FFFF_FFFF - which overlaps with the other one. However on the 68040, if both *TT0 and *TT1 match, the other bits are taken from *TT0.
  • The third byte from the beginning (bits 15-8) is just enable & user bits, and is identical for both.
  • The fourth byte is the meaningful one - because bits 5 and 6 are the cache mode. And here the internal video is 'b11, while everything else in I/O space is 'b10.
Quoting from the '040 user manual:
CM—Cache Mode
This field selects the cache mode and access serialization as follows:
00 = Cachable, Write-through
01 = Cachable, Copyback
10 = Noncachable, Serialized
11 = Noncachable
Which means that both are non-cachable, but in addition everything other than the internal video is "serialized". From a different part of the user manual (4.3.2 Cache-Inhibited Accesses):
If the CM field indicates serialized, then the sequence of read and write accesses to the
page is guaranteed to match the sequence of the instruction order. Without serialization,
the IU pipeline allows read accesses to occur before completion of a write-back for a
previous instruction.
Not sure how much this helps on the 68040 (not being sequentially consistent is pretty much a requirement for all modern systems; that's why there are entire chapters dedicated to memory consistency in ISA documentation, and they are usually the hardest parts to understand), but if Apple decided to use one of the TT solely to disable serialization on the internal video, it must have some sort of performance-increasing effect!
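
The decoding above can be sketched as a few lines of Python. This only covers the fields discussed in the post (base, mask, enable bit, cache mode, plus the two TC bits) and ignores the supervisor/user field of the TT registers; the helper names are illustrative and not taken from any Apple or Motorola source.

```python
# Decode the 68040 TC and transparent-translation values from the ROM listing.
# Only the fields discussed in the post are handled; the supervisor/user
# matching field is ignored. Helper names are illustrative only.

CACHE_MODES = {
    0b00: "Cachable, Write-through",
    0b01: "Cachable, Copyback",
    0b10: "Noncachable, Serialized",
    0b11: "Noncachable",
}


def decode_tc(tc):
    """Translation Control: bit 15 enables the MMU, bit 14 selects 8K pages."""
    return {"enabled": bool(tc & 0x8000), "page_size": 8192 if tc & 0x4000 else 4096}


def decode_ttr(value):
    """Split a TT register value into base, mask, enable and cache mode."""
    return {
        "base": (value >> 24) & 0xFF,
        "mask": (value >> 16) & 0xFF,
        "enabled": bool(value & 0x8000),
        "cache_mode": CACHE_MODES[(value >> 5) & 0b11],
    }


def ttr_matches(value, address):
    """True if the top address byte matches the base, ignoring masked bits."""
    fields = decode_ttr(value)
    top = (address >> 24) & 0xFF
    return fields["enabled"] and ((top ^ fields["base"]) & ~fields["mask"] & 0xFF) == 0


def effective_cache_mode(tt0, tt1, address):
    """TT0 takes priority when both registers match, as noted in the post."""
    for value in (tt0, tt1):
        if ttr_matches(value, address):
            return decode_ttr(value)["cache_mode"]
    return None  # not transparently translated by either register


NEWTC, NEWT0, NEWT1 = 0xC000, 0xF900C060, 0x807FC040
print(decode_tc(NEWTC))
for addr in (0xF9000000, 0xFE000000, 0x40880000):
    print(hex(addr), effective_cache_mode(NEWT0, NEWT1, addr))
```

With the values from the listing, an address in $F9 (internal video) resolves to "Noncachable", while an address in $FE (the PDS slot) only matches the second register and resolves to "Noncachable, Serialized".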

If I had a Q650 with a ROM slot, I'd try adding serialization to the internal video to see if/by how much performance is impacted, to eliminate that as an option. Alternatively (or in addition), TT0 could be set to match $FE (the PDS slot) instead of $F9 (internal video) to see what happens then - though it might have some side effects if something wants serialization in the QuadraFPGA...

Still some stuff to test and understand I guess :)
 

bigmessowires

Well-known member
@Melkhior: casually drops amazing hardware creation

I admire your skills. Hope you can find a good alternative for that hard-to-find connector.

I guess Defilement means something quite different in French!

I'm a word nerd. I looked it up, both words appear to have a shared Middle English and Anglo-French ancestry related to the idea of figurative and literal trampling under foot, or literally "to full" as in the process of making cloth from wool. In English, this eventually developed connotations of "to make dirty". In French, défile means to march past or parade - a different foot-related concept, but you can see how it's related. I would then guess "défilement" literally means to parade the window content through the visible region.
 

Phipli

Well-known member
The acceleration only really helps scrolling, while moving from NuBus to PDS helps *everything*
One thing we're not really seeing is a hidden benefit - even in a situation where the speed is the same on the card or built in video, the accelerated card is possibly reducing the CPU load and enabling the CPU to undertake other work... assuming graphics can be asynchronous. This would give you greater performance improvements in things like games where there is a mix of game engine code and graphics code running alongside each other.

Not things like Doom, which is just a huge amount of pixels being thrown around and probably doesn't benefit at all from QuickDraw acceleration, but hugely benefits from a fast connection to the CPU.
 

Melkhior

Well-known member
Hope you can find a good alternative for that hard-to-find connector.
The 3M one looks highly plausible from the datasheets (I added it on TD), but unfortunately has a MOQ of 200 at places like Mouser or Digikey :-(
That's a bit expensive to check if they would really work, in particular as they are 13,47 € each right now...

One thing we're not really seeing is a hidden benefit - even in a situation where the speed is the same on the card or built in video, the accelerated card is possibly reducing the CPU load and enabling the CPU to undertake other work... assuming graphics can be asynchronous.
Unfortunately, I'm not too confident about that assumption. My current acceleration code basically works synchronously - it doesn't return to caller until the job is done. I'm not sure this can be avoided: the application has basically full access to the framebuffer in System 7/MacOS 8, and so can access any area at any time. If the acceleration engine is running asynchronously, then the application could read/write from/to stale data... The only way to guarantee this doesn't happen is to do what the 8*24 GC was doing: offload *all* QuickDraw operations so they are properly serialized device-side, and ask people nicely not to write to the framebuffer directly... (see Drawing directly to the screen in the develop article).
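
The stale-data hazard can be shown with a toy model. Everything below is a made-up illustration (a list standing in for the framebuffer, a queue standing in for the accelerator); it is not the QuadraFPGA driver code.

```python
# Toy illustration of the stale-data hazard with an asynchronous blitter.
# Everything here is a made-up model, not the actual driver or gateware.
from collections import deque


class ToyAccelerator:
    def __init__(self, framebuffer):
        self.fb = framebuffer
        self.pending = deque()

    def copy_async(self, src, dst, length):
        """Queue a copy and return immediately (the risky variant)."""
        self.pending.append((src, dst, length))

    def copy_sync(self, src, dst, length):
        """Queue a copy and wait for it before returning to the caller."""
        self.copy_async(src, dst, length)
        self.wait_idle()

    def wait_idle(self):
        """Drain the queue, standing in for the hardware finishing its work."""
        while self.pending:
            src, dst, length = self.pending.popleft()
            self.fb[dst:dst + length] = self.fb[src:src + length]


fb = [1, 2, 3, 4, 0, 0, 0, 0]
acc = ToyAccelerator(fb)

acc.copy_async(0, 4, 4)
stale = fb[4]   # the application reads before the copy has happened
acc.wait_idle()
fresh = fb[4]   # same read once the accelerator is idle

fb_sync = [1, 2, 3, 4, 0, 0, 0, 0]
ToyAccelerator(fb_sync).copy_sync(0, 4, 4)
print(stale, fresh, fb_sync)
```

The synchronous variant gives up any overlap between CPU and accelerator work, but the application can never observe the framebuffer in a half-updated state.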

This would give you greater performance improvements in things like games where there is a mix of game engine code and graphics code running alongside each other.
I suspect most games rendered directly in the FB for performance reasons. That includes ...

Not things like Doom which are just a huge amount of pixels being thrown around and probably doesn't benefit at all from quickdraw acceleration, but hugely benefits from a fast connection to the CPU.
... things like Doom. And probably others. IIRC, I didn't see much activity on the accelerator running SimCity 2000, for instance.

PDS (and built-in videos) are better on 68k Macs for fast graphics than NuBus in practice, even if I can fool benchmarks by accelerating some primitives a lot :)

For now, the one application I've seen that really benefits from the fast blitter is CodeWarrior: scrolling big text windows line-by-line is much, much better on NuBusFPGA with the acceleration enabled.
 

Arbee

Well-known member
I don’t know if it’s documented anywhere, but I’d bet that forcing serialization temporarily disables some of the out of order execution and that’s the performance drain. PPC has the famous EIEIO instruction to manage that in a finer-grained way, but on 040 they pretty much had to do it this way so existing NuBus cards and drivers could work.
 

Jockelill

Well-known member
This!! Just amazing!

I got the connectors from Quest Comp. Derf had an MOQ of 10pcs and asked for 25$/piece. The quest comp one was 9$ instead (when I bought 6).
 

Melkhior

Well-known member
Blitting basically - aligned and not aligned.
Forgot to say: you are absolutely correct; turns out there's a log on the second monitor with more explicit names. P and G are also in French, it turns out (they translated everything!), and stand for 'Petit' (small) and 'Grand' (large). So the QuadraFPGA wins on Large, but loses on Small (perhaps I should add a heuristic to not accelerate below a certain size?).
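
Such a heuristic could be as simple as a size check before dispatching. The sketch below is only an assumption of what it might look like: the threshold is a placeholder that would have to be tuned by measurement, and the two paths are just labels.

```python
# Sketch of a size-based dispatch for a blit. The threshold is a placeholder
# that would need tuning by measurement; both paths are stand-in labels.

ACCEL_MIN_PIXELS = 32 * 32  # hypothetical cut-off, not a measured value


def choose_path(width, height, min_pixels=ACCEL_MIN_PIXELS):
    """Return which path a blit of the given size would take."""
    if width <= 0 or height <= 0:
        return "none"  # empty rectangle, nothing to do
    if width * height < min_pixels:
        return "cpu"  # small blits: setup cost may outweigh the gain
    return "accelerator"


for size in [(8, 8), (32, 32), (640, 480)]:
    print(size, choose_path(*size))
```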
 

Phipli

Well-known member
Forgot to say: you are absolutely correct; turns out there's a log on the second monitor with more explicit names. P and G are also in French, it turns out (they translated everything!), and stand for 'Petit' (small) and 'Grand' (large). So the QuadraFPGA wins on Large, but loses on Small (perhaps I should add a heuristic to not accelerate below a certain size?).
Andy Hertzfeld did say something about the trick to fast graphics code being not to be shy to make different branches for different scenarios and not to try to make one global case 😆
 

zigzagjoe

Well-known member
Fantastic work, as always.

Macbench has a good set of tests for the various copybits routines. Takes a while to run though and make sure you use the same color depth if comparing.
 