
Homebrew microcomputers & video generation

Bunsen

Admin-Witchfinder-General
Hey Grackle! Good to hear of your progress :)

The "caching" idea is that you can take a more relaxed approach: the Prop needs to grab one or two K inside of every 1/60th of a second window, but it could do it by, say, halting the CPU *once* and doing the transfer in one big chunk at the end of every vertical refresh cycle.
Ah ok, I'm with you now. That does sound elegant.
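
Something like this, roughly - a C-flavored sketch of the idea (the Prop side would really be Spin/PASM, and every name here is invented):

    #include <stdint.h>

    extern uint8_t host_ram[];          /* the host CPU's video memory            */
    extern uint8_t prop_cache[2048];    /* Propeller-side copy of the frame data  */

    void halt_host_cpu(void);           /* e.g. assert HALT/BUSREQ (hypothetical) */
    void resume_host_cpu(void);

    void on_vertical_blank(void)        /* runs once per 1/60 s refresh */
    {
        halt_host_cpu();                /* steal the bus exactly once per frame */
        for (int i = 0; i < 2048; i++)
            prop_cache[i] = host_ram[i];    /* one big ~2 K burst */
        resume_host_cpu();              /* host runs untouched the rest of the frame */
    }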

The cool thing about the Propeller, of course, is it's a fully general purpose CPU so in theory you could order it to do things like line/circle drawing, vectors, etc. (Also, you could wire a mouse to the Propeller and let the Prop completely handle the cursor generation, tracking, and even some of the high-level aspects of a GUI like window drawing and text handling. Think of it as a sort of Fisher-Price My First NeXT.)
Sounds more like a My First XTerm to me - one device to handle the whole UI.

Or you could try driving an ISA video card.
This is ... interesting.

Maybe make a single-board computer with just a serial port for a console to start with. If you expose the CPU bus on an expansion connector, you can interface anything you like to it later on.
This too.

 

Grackle

Member
Couldn't leave well enough alone... I had to go back and add some counters and offsets to make it scroll.


I know everybody in the history of everything has done a XOR effect, but it's still pretty cool to do it yourself for the first time.
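
For anyone playing along at home, the entire effect is one XOR of the pixel coordinates plus the scroll offsets. Here's the math as a minimal C sketch - on the FPGA it's the same expression in HDL, evaluated as the beam scans:

    #include <stdint.h>

    #define WIDTH  640
    #define HEIGHT 480

    /* Classic XOR pattern; bump xoff/yoff each frame to make it scroll. */
    void render_xor(uint8_t fb[HEIGHT][WIDTH], unsigned xoff, unsigned yoff)
    {
        for (unsigned y = 0; y < HEIGHT; y++)
            for (unsigned x = 0; x < WIDTH; x++)
                fb[y][x] = (uint8_t)((x + xoff) ^ (y + yoff));
    }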

Next step is to add a framebuffer. I'm not sure how involved that'll be; it means talking to memory but I think this device has a controller built in.

 

CC_333

Well-known member
This is absolutely fascinating!

How expensive are these FPGA things? I think I might get myself one, mostly because I'm curious.

I don't really have any productive purpose for one (yet!), but I probably never will unless I find out exactly what it is and learn how to use it.

OK, I don't want to hijack the thread, I just wanted to interject some of my thoughts on the matter.

Carry on...

c

 

Grackle

Member
Hey no problem, we've touched on a bit of everything here!

The board I have is the Digilent Nexys2, which I got a few years ago for $100 with an academic discount. Apparently they have a Nexys3 now, with a Xilinx Spartan 6 (vs the Spartan 3 on my board). If you get that one you can use the new Xilinx "Vivado" development environment, which is supposed to have some improvements. There's also the Basys2 which is super cheap if you're a student.

The other direction to go is Altera, which I've heard has a better development environment. However, their boards are a little more expensive.

Both Xilinx and Altera have free (limited) licenses for their software. One of the biggest differences is that Altera's free license gives you access to a logic analyzer tool that lets you see signal timings on your device, which can be very handy.

 

CC_333

Well-known member
Hi,

I looked up your suggestions, and the Basys2 looks good. It's cheap, which is nice ($50 for a US student isn't bad at all). One question, though: do I have to prove anything special (like what I'm majoring in, what school I'm in, etc.), or can I just say that I'm a regular college student?

Quite frankly, the more expensive ones would probably be a waste for me, since I know almost nothing about them.

I could upgrade to one in the future, though.

Thanks!

c

 

Bunsen

Admin-Witchfinder-General
You might also check out fpga4fun, which has some reasonably cheap entry-level devices and a few cool projects and tutorials.

 

Bunsen

Admin-Witchfinder-General
PropBerry – Propeller & Raspberry Pi combo

For PropBerry, I was thinking of using a Parallax Propeller* (or just Prop) as a super I/O co-processor for the RPi, where the Prop would offload the real-time I/O and let the RPi handle the higher-level program features. After talking about this combo on the Parallax forums, the Prop's VGA video capabilities were mentioned, which got me thinking about using the PropBerry as a VGA serial terminal console and shelving the I/O co-processor idea for now.
RBox: A diy 32 bit game console for the price of a latte

Uses the smallest and cheapest 32 bit CPU to generate 3D graphics and sound.
The RBox is a game console that is simple enough to build on the prototype area of an NXP LPC111X dev kit; no PCB required, just a crystal and a few capacitors and resistors.

Features:

  • 320x240 composite or S-video output generated entirely in software
  • 256 colors with standard palette, up to 8k colors
  • 8-bit 15 kHz stereo audio
  • ~$1 analog joystick
  • ~$1 CPU
(from http://zuzebox.wordpress.com/2011/01/31/an-update-to-list-of-homebrew-video-games-consoles/ which has some dead links)

And this is probably about as minimalistic as you can get:

Bit banger

Bit banger is my most constrained and minimalistic microcontroller-based demo yet. It won the Oldschool 4k compo at Revision 2011.
Bit banger is built around an ATtiny15 microcontroller, which runs at 1.6 MHz and has 1 kB of flash ROM and a claustrophobic 32 bytes of RAM. The entire demo is cycle-counted.

At a clock rate of 1.6 MHz, the visible part of each line of the VGA signal swooshes by in exactly 36 clock cycles. The entire line, including horizontal blanking, is 51 clock cycles wide. During this time, both graphics and sound must be generated.
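(Quick sanity check on those numbers: 1.6 MHz / 51 cycles per line ≈ 31.4 kHz, which is almost exactly the standard 31.47 kHz VGA horizontal rate - so every last cycle really is spoken for.)
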
His Craft demoboard, based on an ATmega88, is a bit more featureful.

 

Grackle

Member
Oh, Linus Åkesson's stuff is awesome.

Recently I read about the Macintosh Display Card 8/24 GC, which is a NuBus display card that uses an Am29000 RISC CPU to do QuickDraw operations on its framebuffer. One of the particularly neat things about it is that it can operate as a NuBus master, which allows it to accelerate other dumb framebuffer cards you might have installed. Very cool! There is a MacTech article about it here.

Ideally I would like to do something a little like that, but I can't seem to get around needing a dual port RAM, and I don't have enough control over the external bus to do a funky workaround. Argh.

 

Bunsen

Admin-Witchfinder-General
Gameduino: a game adapter for microcontrollers

(spoiler: it's an FPGA. But it's also cheap-ish, powerful, programmed and ready to go)

Gameduino is a game adapter for Arduino - or anything else with an SPI interface - that has plugs for a VGA monitor and stereo speakers.

  • video output is 400x300 pixels in 512 colors
     all color processed internally at 15-bit precision
     compatible with any standard VGA monitor (800x600 @ 72Hz)
  • background graphics
     512x512 pixel character background
     256 characters, each with independent 4 color palette
     pixel-smooth X-Y wraparound scroll
  • foreground graphics
     each sprite is 16x16 pixels with per-pixel transparency
     each sprite can use 256, 16 or 4 colors
     four-way rotate and flip
     96 sprites per scan-line, 1536 texels per line
     pixel-perfect sprite collision detection
  • audio output is a stereo 12-bit frequency synthesizer
     64 independent voices, 10-8000 Hz
     per-voice sine wave or white noise
     sample playback channel



The adapter is controlled via SPI read/write operations, and looks to the CPU like a 32Kbyte RAM. There is a handy reference poster showing how the whole system works, and a set of sample programs and library.
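
To give a flavor of what "looks to the CPU like a 32Kbyte RAM" means in practice, here's a rough C sketch of a write based on the documented protocol (two address bytes, top bit set for writes, auto-incrementing address) - the spi_* functions are platform-specific placeholders, not any real vendor API:

    #include <stdint.h>

    void    spi_select(void);           /* pull chip select low (platform-specific) */
    void    spi_deselect(void);         /* release chip select (platform-specific)  */
    uint8_t spi_transfer(uint8_t out);  /* clock one byte out, return the byte read */

    /* Write len bytes into the adapter's 32 KB address space starting at addr. */
    void gd_write(uint16_t addr, const uint8_t *data, uint16_t len)
    {
        spi_select();
        spi_transfer(0x80 | (addr >> 8));   /* high address byte, bit 7 = write */
        spi_transfer(addr & 0xFF);          /* low address byte                 */
        while (len--)
            spi_transfer(*data++);          /* address auto-increments in the adapter */
        spi_deselect();
    }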
 

Gorgonops

Moderator
Staff member
Ironically, the Gameduino seems to pretty much emulate the capabilities of a Propeller. (Even down to having 32k of RAM. Granted, there's probably somewhat more actually available on this for tiles and sprites, since the Prop will be using at least some HUB RAM for the video and sound generation software.)

Still, I can see the attraction. One feature the Propeller could really benefit from is a hardware "SPI slave mode". It's possible to make it be one in software but the Prop is better at generating clocks than following them. (For that matter, it'd be nice if it were somewhat easier to make it follow an external clock for parallel transfers, so it'd be easier to use in place of things like 6522s.) That'd make it easier to use as a coprocessor in more "conventional" computer designs where the Prop doesn't run the whole show.

 

Bunsen

Admin-Witchfinder-General
So, I've been wracking my brains trying to come up with a fast way of clocking video data out of a low-end micro without tying up the CPU overmuch.

I was reading up about SPI, as I remembered that was the port Sprite was using on the SE/ARM to direct-drive the Mac's 1-bit video. And about DMA, as that is pretty much all about getting bits in and out of the micro with minimal CPU intervention.

Every micro under the sun seems to have SPI, and all but the very cheapest seem to have at least one or two DMA channels: the Cypress PSoCs, for example, allow DMA between any port and any internal logic block, in either direction.

I can't find the relevant piece to quote here, but I recall Gorgonops mentioning that Sprite's SE/ARM (LPC ARM running a b&w compact Mac CRT, acting as an external GPU for another ARM via USB) only achieved four frames per second. Having read this, though:

{SPI clock} frequencies are commonly in the range of 1–100 MHz. {*}
I thought to myself, huh? That should be plenty fast enough to drive a compact Mac CRT, so what's the glitch here? 

* Yes, I know "clock" =/= bps

I went over to have a closer read of that part of his project, this being the most relevant section (my emphasis added):

I also had some speed issues. The LPC has no problem pushing the pixels to the display quickly enough thanks to the SPI controller I used, and even with the tedious task of fetching the data from the external RAM first, it ran just fine. Problems started appearing when I wanted to implement the USB-interface to actually make the Dockstar write to display RAM: The ARM still had enough power to actually perform the tasks, but I ran into timing issues. Basically, I couldn't handle the USB-transfers quickly enough to be done before I had to write another line to the CRT, thereby throwing off the timings and introducing many ugly glitches in the image. I solved that by creating a routine estimating the time I had left before I had to write another line, and only processing as many bytes as I could do in that time. The disadvantage was that a routine like that introduces a lot of overhead in switching over the DRAM; in the end I could only upload about 4 full frames per second to the GPU. Luckily, implementing RLE acceleration was already planned from the start, making me only hit the 4FPS worst-case-scenario when the complete screen had to be redrawn.
So, thoughts:

It seems a bit like by introducing a second device to act as a GPU, he's actually made this more complex than it needed to be. Now, I get that his desire was to offload video wrangling from the Dockstar, so it could just get on with the task of running smoothly as a server.

But, if that's not your desire, hijacking an SPI output pin from the main device to drive 1-bit video directly seems like a much less clunky approach that should in theory be fast enough to smoothly update a screen - especially if you can drive the SPI from DMA. Ditching the external SRAM (or not using it for video) and using a micro with enough on-die RAM that you can reserve 171 kbits/21kB for video (512*342) should speed things up too, neh?

Going for a smaller screen rez is also an option: QVGA (320x240) LCDs are common, and require only 9.6kB of RAM @1bpp. Alternatively, I *think* you should be able to make the bit width of the Mac screen any arbitrary size without any hardware mods, as it's just pulses to drive the level of the electron gun up and down. 480x342 for example, or 480x320 (half-VGA) with a few blank lines. Even if that's not possible, a smaller screen could be displayed within the 512x342 by just padding it out with zeros at either end.
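
That padding costs next to nothing, too; at 1bpp it's just a couple of zero bytes at each end of every line. A throwaway C sketch for centering a 480-pixel-wide image in the 512-bit scanline (purely illustrative):

    #include <stdint.h>

    /* Center a 480-pixel (60-byte) line inside the Mac's 512-pixel (64-byte)
       1bpp scanline: 16 blank pixels (2 bytes) on each side. */
    void pad_line(const uint8_t src[60], uint8_t dst[64])
    {
        dst[0] = dst[1] = 0x00;          /* left padding        */
        for (int i = 0; i < 60; i++)
            dst[2 + i] = src[i];         /* 480 pixels of image */
        dst[62] = dst[63] = 0x00;        /* right padding       */
    }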

NB: altering the vertical linecount of the Mac screen is basically not doable.

Alternatively, I wonder why he didn't link the two devices directly via SPI rather than via USB, with all its issues.

One tidbit in particular from the SPI article on wikipedia caught my eye:

Arbitrary choice of message size, content, and purpose
So (and other text on the SE/ARM sort of implies this might have been Sprite's approach), it seems like it should be possible for the micro to do the following:

  • Set SPI clock to Mac CRT scan rate *
  • Initiate SPI transfer at the start of each scanline
  • Order DMA controller to begin transfer of 512 bits from location Y



And you're done.

* independently of CPU master clock, unless I'm reading everything wrong
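
To make the shape of that concrete, here's the per-line loop in very rough C. Every function and constant below is invented; it's a sketch of the idea, not any real vendor API:

    #include <stdint.h>

    #define LINES      342
    #define LINE_BYTES 64   /* 512 pixels at 1 bpp */

    extern uint8_t framebuffer[LINES][LINE_BYTES];

    void hsync_pulse(void);                          /* wiggle a GPIO for horizontal sync (placeholder) */
    void vsync_interval(void);                       /* vertical sync + blanking between frames         */
    void dma_to_spi(const uint8_t *src, uint32_t n); /* queue n bytes to the SPI TX channel via DMA     */

    void scan_frame(void)
    {
        for (int line = 0; line < LINES; line++) {
            hsync_pulse();                              /* start of scanline */
            dma_to_spi(framebuffer[line], LINE_BYTES);  /* 512 bits clocked out, no CPU involvement */
            /* the CPU is free here until the next line */
        }
        vsync_interval();                               /* then do it all again, 60 times a second */
    }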

 

Bunsen

Admin-Witchfinder-General
Incidentally, the paragraph following Sprite's quote above:

Later on, I'd changed my mind: it would probably be more challenging but in the end more satisfying and universal to make a kernel frame buffer driver. This way, when I would plug in the GPU, the kernel would recognize it and create a framebuffer device. A framebuffer device is an abstract representation of a graphic card plus display, and there are a lot of programs which can talk to that: MPlayer, image viewing tools and even X.org. Getting X.org running on the device made things much simpler: any program I would want to run, including webbrowsers and Macintosh emulators, could run on top of that without any modifications to the programs themselves. So I took to work, and the result was nice to look at: Firefox, running on my workstation, could render itself to a second X session and into the classic CRT of the Mac.
So there you have something fairly cool - with a $12 LPC dev board and a little soldering, your boxmac is now a plug & play external USB monitor (for *nix/X11 systems only), even without the main CPU in Sprite's SE/ARM. Albeit a somewhat slow one, if you follow his design to the letter. A faster micro - one with a USB 2.0 device port, in particular - should get you something more performant.

 

Gorgonops

Moderator
Staff member
So (and other text on the SE/ARM sort of implies this might have been Sprite's approach), it seems like it should be possible for the micro to do the following:

  • Set SPI clock to Mac CRT scan rate *
  • Initiate SPI transfer at the start of each scanline
  • Order DMA controller to begin transfer of 512 bits from location Y



And you're done.

* independently of CPU master clock, unless I'm reading everything wrong
So, it's not a bad idea at all. There are a few... semi-gotchyas:

1: Obviously it assumes that you're able to set the SPI speed to exactly the desired pixel clock. How trivially you can do that with a given MCU undoubtedly varies. (It may depend on using an external clock crystal of an odd frequency so that the dividers the hardware offers come out right.)
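
(Concretely: the compact Mac's dot clock is 15.6672 MHz. From, say, a 72 MHz core clock the nearest integer dividers give you 18 MHz (/4) or 14.4 MHz (/5) - both way off - so you'd likely end up hanging a crystal at some multiple of 15.6672 MHz off the part instead.)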

2: You'll need a couple GPIO pins and some tightly-coded timing loops to generate the horizontal and vertical sync pulses. Not a huge problem certainly, but the video generation still won't be as "set and forget" as you might like.

3: The last bit. I hesitate to speak "authoritatively" on this point, but... the SPI protocol has "built in" to it the insertion of additional "0" bits in between data bytes. Which means that you can't just get a clean pixel stream out of it; you'll have a bit stuck on every 9th pixel. Here's someone complaining about that.

Note that the thread calls out the solution: at least some microcontrollers use hardware called a USART to accelerate serial bitstreams (including SPI), and if your chosen hardware supports using the USART in "Raw" mode without the SPI overhead you're still okay.

(And in fact, using the USART for this purpose is apparently fairly common for "toy" MCU video setups. Honestly I'd be sort of surprised if Sprite wasn't using a USART for the pixel clocking on his hardware setup... in fact, he says "The LPC1343 had almost everything I needed: enough flash to store a large program in, USB for the connection to the Dockstar, hardware timers to make the timing to the CRT easier and a SPI-port I could abuse to output pixels to the display without too much CPU overhead."... I'm sure by "SPI" he's actually using it in RAW USART mode.)

I suspect the real problem with his design achieving better than the "4FPS" raw framerate is the software overhead from driving the external RAM chip. An MCU that had enough onboard RAM to hold the framebuffer might well do much to solve that problem.

 

Bunsen

Admin-Witchfinder-General
Well, he seemed pretty clear that the "GPU" (which is the micro with the external RAM) could handle the video just fine by itself, but that the problems began when linking the two micros up over USB. Are you thinking that it's a combination of the two, and that dropping either of them would have helped?

I've left him a question in the comments over there; I'll pop in again in a bit and see if he's replied.

the SPI protocol has "built in" to it the insertion of additional "0" bits in between data bytes
Ah. Well, that's annoying. Thanks for the link; again, something I will follow up on.

Incidentally, I've gotten interested in building something myself, probably based around the eZ80 AcclaimPlus! micro. (ugh, awful name)

 

Gorgonops

Moderator
Staff member
Well, he seemed pretty clear that the "GPU" (which is the micro with the external RAM) could handle the video just fine by itself, but that the problems began when linking the two micros up over USB. Are you thinking that it's a combination of the two, and that dropping either of them would have helped?
It's not my design, but looking from the outside I'd almost definitely say "the combination". I don't think the villain here is USB per se (although being limited to USB 1.0 speeds undoubtedly doesn't help), but the fact that he's asking a single-core MCU to multitask pretty hard and it simply doesn't have the resources for it. Here's what he said on the blog:

"Basically, I couldn't handle the USB-transfers quickly enough to be done before I had to write another line to the CRT, thereby throwing off the timings and introducing many ugly glitches in the image. I solved that my creating a routine estimating the time I had left before I had to write another line, and only processing as many bytes as I could do in that time. The disadvantage was that a routine like that introduces a lot of overhead in switching over the DRAM; in the end I could only upload about 4 full frames per second to the GPU.
Note the line about "switching over the DRAM". One thing that may not be particularly obvious about his design until you think about it a bit is that the DRAM isn't really "RAM" so far as the LPC1343 is concerned. It's a random 4-bit-wide DRAM device hanging off GPIO pins, and thus requires a software loop to "bit-bang" values in and out of it. Without digging into his code I'd be hard-pressed to lay out in detail exactly where the worst problems are, but here are some things to ponder.

1: With the DRAM device being software-driven and inadequate framebuffer memory available internally, he has to find time, multiple times per frame, to suck lines of pixels from DRAM into main memory so the USART hardware can clock them out using DMA. (He can't use DMA straight from the DRAM.) This takes time, obviously, and it'd be interesting to know whether it's something that can be done while the DMA loop is executing. (I.e., does the DMA controller cause RAM contention that would prevent the main CPU from refilling the buffer while the DMA is reading from it?)

2: USB has limited granularity when it comes to initiating transactions: you can only start one once per millisecond. (You can transfer a semi-arbitrary amount of data once you've started the transaction, but that's your granularity for starting and stopping one.) A single frame of video at 60 Hz is only 16.6 ms. Remember that even if we're using DMA to push pixels during the 342 active lines of the display, you're still going to have to spend some time sending the hsync pulses and setting up the DMA transfers for each line. (Hypothetically maybe the hardware offers a hardware timer able to handle the hsync? Again, I'm too lazy to look at the code.) So somehow you have to find time in all that shuffling of data to and from RAM and twiddling of hsync pulses to take a transfer block "when you can" and gulp data from USB without breaking the refresh loop. Pure speculation, but I'm guessing what he ended up resorting to is only taking data during the front/back porches. (And being able to do *that* would still basically require being able either to set up timers for vsync and hsync and have them happen automagically, or to respond to interrupts to do the needful *during* a USB transfer. It does look like DMA for USB transfers is supported, but perhaps you can't do that at the same time you're using the USART.)

3: If you google for "LPC1343 usb transfer speed" the first few links will point to developers hashing out the various limitations that USB has on that device, including one thread where someone makes a pretty solid case that the best the device can do is around 4 Mbit/s. That's 500 KB/s, which is only about 2/5 of what you'd need for 60 FPS. So even best case, that class of device would only give you 24 FPS with USB working non-stop. Divide 24 by 4 and you come up with 6. It's an interesting coincidence that he's getting about 1/6th of the theoretical bandwidth the chip is capable of, while roughly 1/6th of your average video frame is vsync front/back porch/blanking area. That sort of supports the speculation above that he can't receive data from USB and refresh the screen at the same time, be that because of DMA memory contention, issues with driving his chosen RAM chip, whatever... but, again, it's pure speculation.
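
(Checking that against the frame size: 512 x 342 bits is 21,888 bytes per frame, and 500 KB/s divided by 21,888 bytes comes to roughly 23 frames per second, so the ~24 FPS best case holds up.)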

I'm sure he's doing about as well as his chosen hardware can handle, and I'm totally not criticizing the genius of his design, but... it is an interesting illustration of how a few TTL gates, a hardware shift register, and a few other little bits can make an 8 MHz 68000 (or, heck, a 2 MHz 6502) outperform a 72 MHz ARM when it comes to walking and chewing gum at the same time. This is why I'm just generically not that fond of software-generated video solutions. (With the partial exception of the Propeller, because it's basically built specifically for the job.)

Incidentally, I've gotten interested in building something myself, probably based around the eZ80 AcclaimPlus! micro. (ugh, awful name)
The eZ80 is cool. The only thing really not to like about it is that it only comes in surface mount. (Yeah, yeah, I know, not a problem for these kids today.) I've been thinking about ordering a couple of DIP-package Z180s just for kicks; it should be fairly straightforward to substitute one for a plain Z80 in some of those single-board wire-wrap projects I've been looking at, and it includes an onboard MMU, DMA controller, and some UARTs that might come in handy and save a chip or two. It's "purely 8-bit" instead of "24-bit" like the eZ80, though.

 