Anyway, it's good news about the DDR2 memory. Two chips for 256MB would be all that a 5x0-series machine could need, so I'd be of a mind to not include a RAM expansion slot; just solder the two chips down, wire them to the onboard controller (which apparently would provide a 32-bit bus, yeah?),
Not exactly. If you use the dedicated DDR2 controller cells built into the FPGAs, you'd have two separate 16-bit buses. You could stripe your 32-bit operations across the two controllers using the FPGA logic, but you'd have to include that in your development; you wouldn't just be sending 32-bit transactions to a single controller. In practice you'd probably do it by creating a logical 32-bit controller that stripes each 32-bit transaction across the two physical 16-bit controllers. That might add a little latency, but at DDR2 speeds it probably doesn't matter.
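To make the striping idea concrete, here's a rough Python sketch of the logical-32-bit-over-two-physical-16-bit scheme. The controller classes and method names are invented purely for illustration; a real design would implement this in HDL on the FPGA, not in software.

```python
# Sketch: a logical 32-bit controller striping transactions across two
# physical 16-bit DDR2 controllers. All names here are hypothetical.

class Ddr2Controller16:
    """Stand-in for one physical 16-bit DDR2 controller."""
    def __init__(self):
        self.mem = {}

    def write16(self, addr, halfword):
        self.mem[addr] = halfword & 0xFFFF

    def read16(self, addr):
        return self.mem.get(addr, 0)


class Striped32Controller:
    """Logical 32-bit controller built on two 16-bit controllers."""
    def __init__(self):
        self.lo = Ddr2Controller16()  # carries bits 15:0 of each word
        self.hi = Ddr2Controller16()  # carries bits 31:16 of each word

    def write32(self, addr, word):
        # Split the 32-bit word and issue both halves in parallel.
        self.lo.write16(addr, word & 0xFFFF)
        self.hi.write16(addr, (word >> 16) & 0xFFFF)

    def read32(self, addr):
        # Reassemble the word from the two controllers.
        return (self.hi.read16(addr) << 16) | self.lo.read16(addr)
```

Usage would look like `ctrl.write32(0x100, 0xDEADBEEF)` followed by `ctrl.read32(0x100)`; the CPU side only ever sees 32-bit transactions, while each physical controller only ever sees 16-bit ones.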
If you use the development tool that creates a custom DDR2 controller (and ignore the built-in controllers), then you could have a true 32- (or 64-) bit-wide controller, which has the advantage of saving some I/O pins (no duplicated address/control pins) but the disadvantage of using more of the FPGA's general logic. There's so much logic available these days that it may not matter.
DDR2 memory chips are available in 512Mb, 1Gb, and 2Gb densities, or were; it's been a few years since I was a DDR2 driver monkey. Every density is available in 4-, 8-, and 16-bit widths.
So a 32-bit-wide DDR2 controller could be populated with eight X4 chips, four X8 chips, or two X16 chips.
In practice, the only reason to use anything other than X16 chips is to increase your total capacity.
So you'd use two X16 chips for a 32-bit-wide bus, or four X16 chips for a 64-bit-wide bus.
and call it good. If a 750fx or gx doesn't operate properly with a 32-bit memory width, either the FPGA can buffer for the extra bits or the design can be extended to four 1Gb chips to create the necessary 64-bit bus with 512MB of total RAM. Could you imagine? Over 12x the original maximum RAM. No need for VM or RamDoubler here.
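The capacity arithmetic here is easy to check. A tiny helper (the function name is just for illustration) computes total RAM from identical chips filling a bus:

```python
# Capacity math only: how much RAM you get from identical DDR2 chips
# filling a bus of a given width. Densities are in megabits.

def total_megabytes(chip_density_mb, chip_width_bits, bus_width_bits):
    """Total RAM in MB from identical chips populating the full bus."""
    n_chips = bus_width_bits // chip_width_bits
    return n_chips * chip_density_mb // 8  # megabits -> megabytes

# Two 1Gb X16 chips on a 32-bit bus:
print(total_megabytes(1024, 16, 32))  # 256 MB
# Four 1Gb X16 chips on a 64-bit bus:
print(total_megabytes(1024, 16, 64))  # 512 MB
```

So the two-chip 32-bit layout gives 256MB, and widening to four chips doubles both the bus and the capacity to 512MB, matching the figures above.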
Xilinx typically takes a given FPGA chip/die and makes it available in several different packages. The different packages have different numbers of pins/balls and so affect how many I/O pins/balls are available.
So, for example, a given series of Xilinx FPGA might come in versions with 25K, 50K, 100K, and 200K configurable logic blocks (CLBs), perhaps with some number of DSP slices and DDR2 controllers on the chip. Then each of those versions might be available in 144-, 208-, 256-, 484-, and 586-pin packages. (The above numbers are made up, but in the neighborhood.) The larger packages yield more available I/O pins, so the ratio between on-chip logic and I/O pins can vary greatly.
The Xilinx chips I looked at recently contain four DDR2 controllers, each 16 bits wide, which, on the face of it, would suit the above needs well. However, two of the DDR2 controllers are not pinned out except in the largest (most expensive) packages, i.e. the ones with the most pins/balls. You might need those largest packages anyway in order to get enough I/O pins, but it's something to note: in effect, these chips only have two usable DDR2 controllers except in the largest two packages.
But regardless of all that, you could satisfy any of these needs with a single X16 DDR2 controller by, as you noted, buffering the transactions up to the required width. Additionally, DDR2 transactions are performed in a burst, typically of four reads or writes. The burst is going to happen (take up time) no matter what; if some portion of the data is not needed, those beats of the burst are signalled to be ignored, but they still happen and still take up time.
So, if your controller is doing every 16-bit transaction in bursts of four anyway, it's always going to be taking the time to read or write 64 bits.
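The burst arithmetic is worth spelling out. A minimal sketch (helper names are mine, not from any real tool) showing bits moved per burst, plus peak bandwidth at a standard DDR2 speed grade:

```python
# DDR2 minimum-burst math: with a burst length of 4, every access
# moves bus_width * 4 bits, whether or not all beats are wanted.

def bits_per_burst(bus_width_bits, burst_length=4):
    """Bits transferred by one minimum-length burst."""
    return bus_width_bits * burst_length

def peak_bandwidth_mb_s(megatransfers_per_s, bus_width_bits):
    """Peak bandwidth in MB/s for a given transfer rate and bus width."""
    return megatransfers_per_s * bus_width_bits // 8

print(bits_per_burst(16))            # 64 bits per burst on one 16-bit controller
print(bits_per_burst(32))            # 128 bits per burst on a 32-bit bus
print(peak_bandwidth_mb_s(400, 16))  # 800 MB/s for DDR2-400 on a 16-bit bus
```

So even a lone 16-bit controller hands you 64 bits per burst whether you want them or not, which is why buffering narrow CPU transactions up to that width costs little.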
All that said, I think soldering the DDR2 memory chips down is definitely the way to go. They're reliable and tiny: about 12.5mm x 10mm and less than 1mm high.