If DMA is not being used, I wonder if implementing some sort of compression, e.g. LZO or LZ4, for transferring sectors would make transfers faster and allow squeezing more bandwidth out of the NuBus interface.
Nah, there's not nearly enough CPU time to make something like that work out.
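To put a rough shape on why compression doesn't pay off here, this is a minimal sketch of the break-even point. It assumes PIO, so the CPU does both the copy and the decompression and the two times simply add. The function names and all of the numbers are hypothetical, chosen only for illustration and not measured on any hardware.

```python
def transfer_time(size_bytes, bus_bytes_per_s, ratio=1.0, decompress_bytes_per_s=None):
    """Seconds to deliver size_bytes of payload using PIO.

    ratio: compressed size / original size (1.0 means no compression).
    decompress_bytes_per_s: output rate of the decompressor on the host CPU,
    or None when no decompression step is needed.
    Assumption: copy and decompression run back to back on the same CPU.
    """
    bus_time = size_bytes * ratio / bus_bytes_per_s
    cpu_time = 0.0 if decompress_bytes_per_s is None else size_bytes / decompress_bytes_per_s
    return bus_time + cpu_time


def break_even_decompress_rate(bus_bytes_per_s, ratio):
    """Decompressor output rate above which compression starts to win.

    Derived from size*ratio/B + size/D < size/B, which gives D > B / (1 - ratio).
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be between 0 and 1 (exclusive)")
    return bus_bytes_per_s / (1.0 - ratio)


# Hypothetical figures: a 2 MB/s PIO path and data that compresses to half size.
BUS_RATE = 2_000_000
RATIO = 0.5
needed = break_even_decompress_rate(BUS_RATE, RATIO)
plain = transfer_time(512, BUS_RATE)
packed = transfer_time(512, BUS_RATE, RATIO, needed)
print(f"decompressor must sustain more than {needed:.0f} bytes/s to help")
print(f"512-byte sector: plain {plain * 1e6:.0f} us, compressed at break-even {packed * 1e6:.0f} us")
```

Under these assumptions the decompressor has to produce output at twice the bus rate just to break even, which is a lot to ask of a CPU that is already the bottleneck.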
Realistically, the limited max throughput of NuBus isn't a major limiter, and I am also testing the worst case (Mac II class NuBus). The main benefit of NuCF, from what I've determined, is that small accesses are much "cheaper" to execute. Mac OS loves to make single-sector sequential accesses, especially during early boot; these are cheap on CF even without any kind of prefetch. On SCSI, however, with the limited CPU performance of the 030 machines, those hundreds of small accesses add up to waste a lot of CPU time in the driver, SCSI Manager, SCSI controller, and the drive itself. Even with the limitations of the early NuBus implementation (all later machines should be faster!), the benefit of CF over SCSI remains for these small accesses, as they are not limited by bus speed.
That performance gain is mostly due to having fewer software and hardware layers to go through. However, as CPUs got faster and the SCSI implementations better, the CPU limitation on small transfers and the SCSI hardware limitation on large transfers become less of an issue. So while the Mac II machines see a 3x or greater performance improvement on CF over SCSI disks, in Quadras the PIO mode CF merely becomes somewhat faster rather than several times faster under all conditions. Less compelling.
DMA: There isn't a huge point to DMA for this type of activity on a 68030-based machine. Because of how the 030 bus arbitration works, a CF transfer will block the bus, halting the CPU for extended periods of time. The situation is a little different on the 040 machines, which can have NuBus cards execute bus master transfers while the CPU runs within its caches, so it can do some productive work while locked out of main memory.
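The small-access argument can be sketched with a simple fixed-overhead model: each request pays a constant per-request cost (driver, manager, controller) plus the time to move the bytes. This is only an illustration. The function name and every number below are made up for the example and are not measurements from NuCF or any SCSI setup.

```python
def effective_throughput(request_bytes, overhead_s, bus_bytes_per_s):
    """Bytes per second actually delivered for one request size.

    overhead_s: fixed per-request cost in seconds (software and hardware layers).
    bus_bytes_per_s: raw transfer rate once data is moving.
    """
    return request_bytes / (overhead_s + request_bytes / bus_bytes_per_s)


# Hypothetical figures, for illustration only.
HEAVY_OVERHEAD = 0.002    # 2 ms per request through many layers
LIGHT_OVERHEAD = 0.0002   # 0.2 ms per request through few layers
SLOW_BUS = 1_000_000      # 1 MB/s
FAST_BUS = 2_000_000      # 2 MB/s

for size in (512, 64 * 1024):
    heavy = effective_throughput(size, HEAVY_OVERHEAD, SLOW_BUS)
    light = effective_throughput(size, LIGHT_OVERHEAD, SLOW_BUS)
    heavy_fast = effective_throughput(size, HEAVY_OVERHEAD, FAST_BUS)
    print(f"{size:6d} bytes: heavy {heavy:9.0f} B/s, light {light:9.0f} B/s, "
          f"heavy on 2x bus {heavy_fast:9.0f} B/s")
```

In this model, cutting the per-request overhead helps single-sector requests far more than doubling the bus rate does, while for large requests the overhead nearly disappears and the raw transfer rate is what matters.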
So you can kind of see how the design "divide" comes down.
Design A) PIO based NuBus implementation for 68030s
Similar to my current design, expanded to 32-bit width in hardware. I expect I would see most if not all of the practical max performance on the 030 machines, and probably also on the Quadra 700/900. Conceptually still fast on later machines, just not a quantum improvement over SCSI. Relatively simple hardware by comparison, and it would use the existing driver.
Design B) Bus mastering UDMA NuBus: Bus mastering, block transfers, and ATA UDMA on 040 machines
After the Q700/900, NuBus block transfers to RAM were implemented, so bus mastering block transfers would be required for the "next" level of performance gain on an 040-class system. That next level of performance would come with markedly increased hardware complexity to support the new NuBus features, and the ATA/CF interface would probably need to support UDMA-class transfer modes in order to get data fast enough. Also a huge increase in complexity!
I'll probably experiment with design A, possibly with the option of bus mastering for my edification. I don't expect to get around to design B. Given that prototype cards can get away without being spec-compliant, I may just put a pair of CPLDs acting as the inverting bus drivers/registers/etc needed for NuBus on a prototype board to play with. Maybe a final product could result, maybe not... it's an experiment. But this is all well down the road.
---------------------------
Current updates: I've finished hardware testing in all of my SE/30 machines with various combinations of IO boards and accelerators. No concerns. I need to double-check the IIsi but don't expect issues. I'm also working on final software testing. I still need to knock out a quick hardware tester/initial firmware tool so I can actually test cards in a streamlined fashion, but most of that's already written, as that code was required to bring up the design in the first place.
One of the tested scenarios: one of my booster 2.0 accelerators, a NuCF PDS, and a 30Video HC (GS) on top. As always, the chassis card supports need to be bent out of the way and insulated, but that's all. I can't guarantee 3-card stacks will always work, but I expect this particular one to be kosher more often than not.
It is more difficult to remove the CF card than I'd prefer as it can only be gripped by the sides, but I think this layout remains the most sensible for the moment (on a standalone PDS card, anyways).
I also tried to see if I could salvage my socketed NuCF board, so I stacked it in my socketed Carrera design. This did actually work, but it wouldn't tolerate any additional IO cards with the Carrera running. The socketed NuCF forgoes the buffers for the CF (as it was intended for IIcx...) and the anemic drive of the Carrera's FPGAs isn't enough to hit the CMOS levels (required by CF) on a heavily loaded bus.