The diminishing returns thing isn't really an issue. The amount of time spent waiting for the delivery of data to an FPU is much, much smaller than the amount of time it takes for the FPU to run an instruction.
For instance, let's use a Classic II as an example. It has a 16 bit, 16 MHz bus which can transfer 16 MB/sec (two clocks per transfer at 16 MHz is 8 million transfers per second, and at 16 bits each that means 16 million bytes per second). Although it's oversimplifying, let's say that an FPU operation loads two 64 bit floating point numbers and an instruction, and transfers a 64 bit number when done. That amounts to 13 transfers of 16 bits at a time (four per 64 bit number plus one for the instruction). If an FPU were infinitely fast, then our Classic II would be able to do 8 million divided by 13, or about 600,000 floating point operations a second when moving this much data. More typically there'd be fewer transfers, but we're just guesstimating.
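As a rough sketch of the arithmetic above, the bus-only ceiling can be worked out in a few lines of Python. The variable names are just illustrative labels, and the model assumes every transfer costs exactly two bus clocks, as in the estimate.

```python
# Bus-only ceiling for a Classic II, assuming an infinitely fast FPU.
# All names are illustrative; the figures come from the estimate above.
bus_hz = 16_000_000          # 16 MHz bus clock
clocks_per_transfer = 2      # two clocks per 16 bit transfer
bytes_per_transfer = 2       # 16 bits

transfers_per_sec = bus_hz // clocks_per_transfer
bytes_per_sec = transfers_per_sec * bytes_per_transfer

# Two 64 bit operands (4 transfers each), one instruction transfer,
# and one 64 bit result (4 transfers).
transfers_per_op = 4 + 4 + 1 + 4

ops_ceiling = transfers_per_sec / transfers_per_op

print(transfers_per_sec)   # 8000000
print(bytes_per_sec)       # 16000000
print(transfers_per_op)    # 13
print(round(ops_ceiling))  # 615385
```

That 615,385 is where the "about 600,000" figure comes from.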
An m68882 takes anywhere from 130 clocks to 400 clocks for common operations with extremes at nearly 1000 clocks. But, again, for the sake of argument, let's say that the operations do require that much transfer (when they usually require less) and that the FPU takes 130 clocks (when it usually requires more). The 13 transfers cost 26 bus clocks, so at 16 MHz you could do 16 million / (130+26) operations, or about 102,000 floating point operations per second.
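Let's compare this with a 50 MHz m68882 on a 16 bit, 16 MHz bus. The FPU now runs 50/16 = 3.125 times faster than the bus, so its 130 clocks take only 130/3.125 bus clocks: we get 16 million / (130/3.125+26), or 236,000 floating point operations per second or so.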
Finally, let's compare this with a 50 MHz m68882 on a 32 bit, 50 MHz bus, where the same data needs only seven transfers (14 clocks): we get 50 million / (130+14), or 347,000 floating point operations per second.
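The three worst-case figures all follow the same pattern, so here is a minimal sketch of that model in Python. The function name `flops` and its parameters are made up for illustration, and it assumes the FPU and the bus never overlap their work, which is the simplification used in this estimate.

```python
def flops(bus_hz, fpu_hz, fpu_clocks, bus_clocks):
    """Estimate operations per second when FPU work and bus transfers
    happen strictly one after the other (no overlap).

    FPU clocks are converted into bus clocks so both terms share a unit.
    """
    fpu_in_bus_clocks = fpu_clocks * bus_hz / fpu_hz
    return bus_hz / (fpu_in_bus_clocks + bus_clocks)

# Worst case: 130 FPU clocks per operation.
# 13 transfers on the 16 bit bus cost 26 bus clocks;
# 7 transfers on the 32 bit bus cost 14 bus clocks.
stock = flops(16e6, 16e6, 130, 26)      # 16 MHz FPU, 16 bit 16 MHz bus
fast_fpu = flops(16e6, 50e6, 130, 26)   # 50 MHz FPU, 16 bit 16 MHz bus
fast_all = flops(50e6, 50e6, 130, 14)   # 50 MHz FPU, 32 bit 50 MHz bus

for label, value in [("stock", stock), ("fast FPU", fast_fpu), ("fast all", fast_all)]:
    print(f"{label}: about {value / 1000:.0f} thousand operations per second")
```

Plugging in other clock counts or transfer counts is just a matter of changing the last two arguments.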
So does the bus make the FPU's speed increase irrelevant? No, not even in this worst-case scenario.
Let's look at the numbers if we use more realistic figures - nine 16 bit transfers and 400 clocks per FPU operation:
16 bit, 16 MHz everything: 16 million / (400+18), or 38,000 FLOPS
16 bit, 16 MHz bus, 50 MHz m68882: 16 million / (400/3.125+18), or 109,000 FLOPS
32 bit, 50 MHz everything: 50 million / (400+18), or 119,000 FLOPS
To summarize, in the worst case a 50 MHz m68882 is 2.3 times faster than the stock 16 MHz part on a Classic II's bus, versus 3.4 times faster with a 50 MHz, 32 bit bus.
In the typical case, a 50 MHz m68882 is about 2.9 times faster on the Classic II's bus, versus 3.1 times faster with the 50 MHz, 32 bit bus.
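For anyone who wants to check the ratios, here is a small Python sketch that computes the speedups from time per operation. The helper `ops_per_sec` and the `cases` table are hypothetical names for illustration; the clock counts are the ones used in this post, including the 18 bus clocks the typical case uses for both bus widths.

```python
def ops_per_sec(bus_hz, fpu_hz, fpu_clocks, bus_clocks):
    # Time per operation in seconds: FPU time plus bus time, no overlap.
    return 1.0 / (fpu_clocks / fpu_hz + bus_clocks / bus_hz)

cases = {
    # name: (FPU clocks, bus clocks on 16 bit bus, bus clocks on 32 bit bus)
    "worst": (130, 26, 14),
    "typical": (400, 18, 18),  # the post uses 18 bus clocks in both setups
}

speedups = {}
for name, (fpu_clocks, bus16, bus32) in cases.items():
    base = ops_per_sec(16e6, 16e6, fpu_clocks, bus16)        # stock Classic II
    fpu_only = ops_per_sec(16e6, 50e6, fpu_clocks, bus16)    # faster FPU only
    everything = ops_per_sec(50e6, 50e6, fpu_clocks, bus32)  # faster FPU and bus
    speedups[name] = (fpu_only / base, everything / base)
    print(f"{name}: FPU only {speedups[name][0]:.2f}x, "
          f"FPU and bus {speedups[name][1]:.2f}x")
```

The gap between the two columns is what the slow bus costs you, and it shrinks as the FPU clocks per operation go up.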