ZaneKaminski

Serious proposal: accelerator and peripheral expansion system


I have designed the basic architecture for an accelerator “system” which can be adapted to any 68k Macintosh with a PDS slot.

The goals of the design were as follows:

  • moderate cost
  • at least 50x faster than a Mac SE
  • basic architecture can be used for any 68000, 68020, 68030 Mac with a PDS. Maybe 68040. (Especially SE, SE/30, IIci. IIfx difficult because of its unique I/O architecture.)
  • maxes out the RAM of your system (probably can’t do this for IIfx and other DMA machines)
  • enough bandwidth to connect peripherals: WiFi, SD card, display (at least 1024x768x4-bit)

tl;dr: scroll down and look at the pictures

 

I’ve designed the basic architecture implementing these features and have chosen all of the main (expensive) parts. What follows is my reasoning about the system and the conclusions that informed the design, and then the design itself as it stands so far.

 

 

Originally, I looked for a 68k-compatible core which I could synthesize in an FPGA. Some cores exist which would give a nice boost to an SE, but they are incompatible with ‘030 instructions, or I didn’t feel they would be fast enough. I have the feeling that a high-performance 68030-compatible soft core does exist, but it could not realistically be licensed for this application. This approach would also be the cheapest way to implement the system, but I don’t know if it could be 50x faster than a Mac SE.

 

I am no CPU designer, nor do I intend to be; designing the core myself is not my interest. After a lot of thinking about it, I also concluded that the best 68k-compatible core built in an FPGA would be slower than the best 68k emulator running on a modern, cheap-ish microprocessor.

 

So I decided to pursue a solution where the 68k instructions are executed by emulator software running on an ARM microprocessor. Initially, I thought that the 68k instructions could be converted to ARM instructions in hardware, but that approach is, long story short, impossible given the hardware available today and some details of the MC68k ISA, particularly its variable instruction length.

 

Instead, the 68k instructions would be translated to ARM in software. “Just-in-time compilation” is the term for this. I will explain more later. I already have a fairly detailed design for this part. Actually, the task of designing and implementing the emulator/translator is easier than the task of making the accelerator halfway affordable.
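To make the idea concrete, here is a rough sketch, in C, of the kind of cached-translation dispatch loop I have in mind. Every name here (tcache_lookup, translate_block, current_68k_pc) is a placeholder of my own, not settled design:

#include <stdint.h>

/* A block of generated ARMv8-A code, callable like a function. */
typedef void (*host_block_fn)(void);

host_block_fn tcache_lookup(uint32_t m68k_pc);    /* NULL if not yet translated     */
host_block_fn translate_block(uint32_t m68k_pc);  /* 68k -> ARMv8-A, cache result   */
uint32_t      current_68k_pc(void);               /* next 68k PC after a block runs */

/* Top-level loop: reuse cached translations; translate on first encounter. */
void emulator_run(uint32_t entry_pc)
{
    uint32_t pc = entry_pc;
    for (;;) {
        host_block_fn blk = tcache_lookup(pc);
        if (blk == NULL)
            blk = translate_block(pc);  /* first execution pays the translation cost */
        blk();                          /* runs until the block ends or jumps away   */
        pc = current_68k_pc();
    }
}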

 

No existing emulator software is quite right for this purpose, and an OS would get in the way of achieving the highest performance. The goal is more of an emulator-kernel that runs bare-metal, without an OS supervising it.

 

Alongside the ARM processor, an FPGA (plus some level shifters) would implement the bus interface. The emulator would issue commands to the FPGA (e.g. read a longword from address 0x12345678), which would perform the operation on the PDS bus and return the result to the emulator.
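As a sketch of what issuing such a command might look like from the emulator's side (the packet layout and function names below are invented for illustration, not a defined protocol):

#include <stdint.h>

/* Hypothetical command sent from the emulator to the bus-interface FPGA. */
enum fpga_op { CMD_READ8 = 1, CMD_READ16, CMD_READ32,
               CMD_WRITE8, CMD_WRITE16, CMD_WRITE32 };

struct fpga_cmd {
    uint8_t  op;    /* one of the CMD_* codes                   */
    uint32_t addr;  /* address on the Mac side, e.g. 0x12345678 */
    uint32_t data;  /* write data; ignored for reads            */
};

/* Placeholders for the physical link (qSPI, GPIO, whatever it ends up being). */
void     fpga_send(const struct fpga_cmd *c);
uint32_t fpga_read_result(void);

/* "Read a longword from address 0x12345678," as described above. */
uint32_t pds_read32(uint32_t addr)
{
    struct fpga_cmd c = { CMD_READ32, addr, 0 };
    fpga_send(&c);              /* FPGA runs the PDS bus cycle      */
    return fpga_read_result();  /* then hands the result back to us */
}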

 

 

 

At this point, I was thinking that the accelerator would be too expensive to make. In particular, the requirements are a fast processor, a high-bandwidth, low-latency CPU-FPGA interconnect, a cheap circuit board, and easy DIY assembly. These goals are all in competition!

 

A faster processor generally means more pins, so the PCB has to be more expensive. The same goes for a higher-bandwidth interconnect; both require narrower traces and more layers. I accepted that the processor and RAM would come in BGA packages, which are pretty hard to solder, so my aim was to keep the FPGA in a package that I can readily solder. I also wanted to avoid a 6- or 8-layer board, which some would say is the minimum for a design of this complexity; that would be really expensive to make in small quantities.

 

The best option in terms of the FPGA-CPU interconnect would be to choose a CPU with a generic “external bus interface” to talk to the FPGA, along with a separate DDR2/3 controller. This would give the best throughput from the accelerator CPU to the Macintosh (particularly the I/O devices, VRAM, and ROM; we don’t need to touch main memory). However, this would require more pins on the CPU (since it would need dual memory interfaces) and on the FPGA, and FPGAs with enough user I/O for two wide bus interfaces are mostly BGA parts, which are hard to solder. A two-in-one FPGA+ARM SoC sounds good, since they have fast interconnects, but they are quite expensive and have a lot of pins, so that’s no good either. Some high-speed serial interfaces (e.g. USB 2.0 high-speed) provide the requisite bandwidth, but their latency is too poor for random access. USB has as much throughput as an SE/30’s processor bus, but requires that data be transferred 512 bits at a time, when we really just want to transfer 32 bits of address and 32 bits of data, along with a 4- or 8-bit command. So in terms of latency, USB would be basically 8x slower than the PDS.

 

Complicating things further was a nagging detail about the implementation of the emulator. The M68k has more registers than 32-bit ARM, so the whole state of the M68k (D0-D7, A0-A7, PC, etc.) cannot be kept in the general-purpose registers of such an ARM processor. I would have to find some extra space in the ARM’s NEON registers or something: perhaps keep only D0-D7 and the PC in the ARM general-purpose registers and stick A0-A7 in the NEON registers, moving address register values into the GP registers only when required. That sucks! Luckily, ARMv8-A (“64-bit ARM”) has plenty of registers: 31 of them, 64 bits each. So an ARMv8-A chip would be nice; it’s just that they are a bit rare outside of smartphones and usually have a lot of pins.
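For reference, the state that has to live somewhere looks roughly like this; the AArch64 register assignments in the comments are just one possible mapping I'm toying with, not a decided ABI:

#include <stdint.h>

/* Architectural state of the emulated 68k. On ARMv8-A (31 x 64-bit GP
   registers) all of this fits in host registers with room to spare; on
   32-bit ARM it has to spill to memory or NEON. Mapping is illustrative. */
struct m68k_state {
    uint32_t d[8];  /* D0-D7 -> e.g. AArch64 w8-w15  (one possible assignment) */
    uint32_t a[8];  /* A0-A7 -> e.g. AArch64 w16-w23 (one possible assignment) */
    uint32_t pc;    /* program counter -> e.g. w24                             */
    uint16_t sr;    /* status register / CCR; the X bit will need special care */
};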

 

Additionally frustrating was the problem of adding video output. Ideally, the CPU would provide an RGB-type “LCD interface” that could be easily converted to VGA, but again, that requires more pins. The LCD interface could instead live in the FPGA, but then the FPGA would need its own SDRAM or something, the video performance would be constrained by the speed of the FPGA-CPU bus, and the FPGA would have to come in a BGA package because of the pin count, which is undesirable.

 

So the aim was to pick a powerful processor with few pins, an FPGA with exposed pins (not BGA), and then get the cost of the board down somehow, all while ensuring that the FPGA-CPU interface is fast enough and that there’s enough bandwidth out of the CPU to an external peripheral board.

 

 

My solution was to divide the project into three separate boards:

  • Main accelerator board, which will plug into the Mac. This will be different for each machine supported.
  • Processor system-on-a-module card, shaped like a DDR2 SODIMM, but containing just the processor and its RAM. This will plug into the accelerator board.
  • I/O board, optional. Connects to the processor card via USB through the main accelerator board. This would go in the hole on the back of the Mac SE or would be a dummy NuBus card. (NuBus is slow, no need to actually use it.)

post-6543-0-73575900-1477511594_thumb.png

 

 

Note that I have indicated the specific interfaces between the processor board and the accelerator and I/O board. These interfaces will be discussed in greater detail later.

 

Three boards sounds expensive and difficult (three different products to make!), but the benefits are pretty clear:

  • The same processor card can fit into accelerators for different machines.
  • In the future, someone else can make a faster processor card that plugs into the existing main board. I have defined a minimum set of I/O interfaces that the processor card must support. Almost all modern MPUs have these interfaces, so the design of the processor card is quite orthogonal to the design of the main board.
  • Vice versa, too: the same processor card can plug into an improved main board.
  • Encourages future development of peripheral I/O hardware, since the I/O board is separate and will be much easier to design than the other two pieces. Someone can add video output that way.
  • The processor card has to be very dense, and a dense PCB costs more. If the entire board had to have the density of the processor card, it would be prohibitively expensive.
  • Since the processor card is relatively simple, hopefully I will get it right the first time. The other parts will be cheaper to redo if I mess up on them.
  • I can definitely assemble the main board and I/O boards myself. The processor board has these BGA parts that most consider to be impossible to DIY solder. It certainly is possible, just difficult. I prefer using a hot plate to solder SMD components, but I’ve never done BGA. Professional assembly and testing is expensive in small quantities, so we want to minimize the size and amount of parts on the processor board in case we have to go that route.

 

 

Okay, now let’s discuss my design for the main accelerator board. Here is one for a Macintosh SE.

The coolest and most expensive components (e.g. FPGA, processor) have already been chosen and their part numbers are marked in the image.

Other parts have not been explicitly chosen but I have provided a cost estimate for these parts.

“Dumb” parts like headers don’t have a cost estimate yet.

These diagrams only show the “main” component costs associated with the design. There will be lots of other little fiddly components that will cost money, but they are generally cheap.

post-6543-0-56280300-1477513362_thumb.png

 

 

Here’s a similar diagram for the SE/30.

Since the SE/30 has wider address and data buses, the address and data lines coming from the FPGA are multiplexed to save pins on the FPGA. FPGAs with more pins than this one are either older and more expensive, or newer, in a BGA package, and still more expensive. Some will criticize the choice of an FPGA over a cheaper CPLD-type device, but $12 is cheap for an FPGA, and I am considering offloading part of the 68030 MMU emulation to the FPGA, which justifies a more complex device. Actually, fully emulating the memory management unit is the bottleneck of 68030 emulation. We will have to see what the best way to do that is.

(Somewhat relevant are NeXT’s machines, which use a similar multiplexed address and data bus to communicate with their peripherals. Most of the Cube’s peripherals are accessed through special DMA-capable I/O ASICs, so it’s basically free to build into these custom chips a latch or two to demultiplex the bus. It saves room on the board.)

post-6543-0-27253000-1477513517_thumb.png

 

 

Here’s the processor board, showing my choice of processor, the NXP QorIQ LS1012A.

This processor is cheap for its speed ($20 in quantities of 10 or so), has the new “64-bit” ARMv8-A ISA, and has only 211 pins.

NXP says it’s designed for low-cost, 4-layer boards. That’s what we want. Actually, this processor is supposed to be for routers and network-attached storage, but it’s perfect for our application.

Also important to note is the minimum feature size of the processor board. 0.07mm is the minimum width for traces, less than half of the 0.15mm specified for the other boards. It would be too expensive to make the main board with such fine features.

post-6543-0-76557700-1477511659_thumb.png

 

 

Here’s something pretty basic in the way of an I/O board. The idea is that the top edge of the graphic represents the ports facing out of the computer.

The microcontroller on here is from the Atmel SAM D21 family, one of my favorites: a cheap, really flexible chip with a 48 MHz ARM Cortex-M0+ core. Especially cool are its clock system and generic clock generator peripheral. I used it for the system controller on the main board as well.

The features are WiFi, SD card slot, and a UART (serial port). Obviously these features will have to be implemented in the emulator software running on the processor.

By the way, XOSC is short for “crystal oscillator” and SWD is short for “serial wire debug,” used to program and debug ARM microcontrollers.

This board would have a USB connection through a header on the main board to the processor board. This is USB 2.0 “full speed,” meaning 12 Mbit/s. Too slow for video, but adequate for this SD card and serial stuff. 12 Mbit/s is slower than the SCSI on an SE/30, so that’s why I call this board “basic.”

post-6543-0-58823700-1477513400_thumb.png

 

 

Here’s a faster version of the same thing. Upgrading the processor to the Atmel SAM S70 series gives us some new interfaces and possibilities:

  • USB 2.0 “high-speed” fully satisfies any desire for SD card and WiFi performance. 480 Mbit/s blows away the I/O on even the original iMac (which only had “full-speed” USB).
  • This microcontroller supports JTAG, which allows us to use a single port to program and debug the I/O controller, bus interface FPGA, and the main processor. Convenient.
  • Also the processor is like 8x faster than the I/O controller on the slower board. Sounds like overkill, but this is the cheapest microcontroller with high-speed USB 2.0.

post-6543-0-65214200-1477511600_thumb.png

 

 

 

Here’s a quick sketch of a display board. I haven’t thought as much about this one.

USB 2.0 high-speed is just barely enough to do 8-bit 1024x768 x 60fps. 4-bit color would be more comfortable.
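Rough arithmetic behind that claim, using raw bit rates and ignoring USB protocol overhead:

1024 x 768 x 8 bits x 60 fps ≈ 377 Mbit/s (about 79% of high-speed USB's raw 480 Mbit/s)
1024 x 768 x 4 bits x 60 fps ≈ 189 Mbit/s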

The basic idea is that the main processor can, during the vertical blanking interval, queue the entire frame to be sent over USB to the display controller, which then figures out how to display it.

I don’t know how much a VGA DAC (digital-to-analog converter) costs.

post-6543-0-36934900-1477511606_thumb.png

Higher resolutions and color depths are possible if a display controller with USB 3.0 is used. Right now, that's too expensive.

 

 

One of the problems with the entire design is that the qSPI interface between the main processor and the bus-interface FPGA puts a limit on the speed at which the accelerator can talk to the Mac’s RAM, VRAM (if any), ROM, and I/O devices. A 68000 bus access takes 4 clock cycles, and an ‘020+ access takes 3, but it may take longer than that just for the processor to give the access command to the FPGA. The qSPI interface used for this is a 4-bit interface that can run at 62.5 MHz (a limitation of the QorIQ LS1012A main processor).

 

Consider the case of writing to memory on a Macintosh SE, such as when changing the display. In order to write to memory, the main processor must transfer to the FPGA a 4-bit command, a 24-bit address, and 16 bits of data. This takes 11 qSPI cycles at 62.5 MHz, or 0.176 us. Then the FPGA performs the bus access, taking 4 cycles (forget about waiting for the video circuitry’s access for a sec) at 7.83 MHz, so that’s 0.51 us. Then the processor must ask the FPGA what the result of the access was: another 4-bit command and 4 bits of data, at least, so 0.032 us. Therefore, the total time spent on the access is 0.72 microseconds, of which 0.21 us was spent transferring data between the processor and the FPGA. That’s not that bad, representing a maximum PDS bus utilization of 71%.

 

But now consider the case of writing to VRAM on an SE/30. In order to write to memory, the main processor must transfer to the FPGA a 4-bit command, a 32-bit address, and 32 bits of data. This takes 17 qSPI cycles at 62.5 MHz, or 0.272 us. The bus access takes 3 cycles at 16.67 MHz: 0.18 us. Finally, 0.032 us is spent reading the result. Now the total is 0.484 microseconds, of which only 0.18 us was spent using the PDS bus, for a utilization of 37%.
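For anyone who wants to check the arithmetic or try other bus speeds, here's a little throwaway C program (not project code) that reproduces these figures, plus the IIci case mentioned below:

#include <stdio.h>

/* Best-case PDS utilization for one write access funneled through the
   4-bit qSPI link. out_bits covers command + address + data going out,
   result_bits the command + status coming back; times are in microseconds. */
static double pds_utilization(int out_bits, int result_bits, double qspi_mhz,
                              int bus_cycles, double bus_mhz)
{
    double t_out    = out_bits / 4.0 / qspi_mhz;     /* send command+addr+data */
    double t_bus    = bus_cycles / bus_mhz;          /* actual PDS bus cycle   */
    double t_result = result_bits / 4.0 / qspi_mhz;  /* read back the status   */
    return t_bus / (t_out + t_bus + t_result);
}

int main(void)
{
    /* Mac SE:  4-bit cmd + 24-bit addr + 16-bit data, 4 cycles at 7.83 MHz  */
    printf("SE:    %.0f%%\n", 100.0 * pds_utilization(44, 8, 62.5, 4, 7.83));
    /* SE/30:   4-bit cmd + 32-bit addr + 32-bit data, 3 cycles at 16.67 MHz */
    printf("SE/30: %.0f%%\n", 100.0 * pds_utilization(68, 8, 62.5, 3, 16.67));
    /* IIci:    same command, 3 cycles at 25 MHz                             */
    printf("IIci:  %.0f%%\n", 100.0 * pds_utilization(68, 8, 62.5, 3, 25.0));
    return 0;
}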

 

The case for a IIci is even worse: only 28% utilization. Also, the cycles may not line up perfectly, so I’ve given the best-case utilization here, and 8- and 16-bit reads and writes will give slightly better utilization percentages. For accesses that occur linearly in memory, “next address” and “previous address” commands can be implemented, saving the transfer of many address bits. That will improve things a little more. It may be possible for the emulator software to “pipeline” these memory operations, enqueuing another operation before the first is complete. Or maybe we can use two qSPI buses in parallel to realize a moderate increase in performance. I dunno.

 

This really only presents a problem for I/O. ROM is not a problem, since it never changes and we just need to read it into the emulator’s memory once. RAM is not a problem, since the emulator has its own RAM and we don’t need to mess with the Mac’s RAM except for the video portions. Video performance will be great, since instructions will never need to be fetched over the PDS bus and video memory is usually written linearly, so the accesses can be accelerated by the “next” and “previous” commands. The problem is I/O performance. I/O registers are not read or written linearly, and the registers of the NCR 5380 SCSI chip, for example, are only 8 bits wide, which really bogs down the throughput. Indeed, users of the IIci and faster Macintosh models may see slightly decreased SCSI and NuBus performance with this design. The IWM and the other peripherals are not fast enough to be negatively impacted. Of course, the rest of the system will be much faster, so it should be okay. You can turn on a huge disk cache and that will accelerate read operations.

 

So this problem with the qSPI kinda sucks but it’s not a big deal since every other aspect of the system is pretty fast.

 

 

Now, the big question: Does anyone actually want this? Each of the boards will probably cost $100 if they're produced in quantities of 25 or so each. The processor board might be a little more, the I/O boards, less. There are also a few hundred dollars of tooling fees that would have to be paid just once for the processor board in particular.

 

I can design the boards for the SE and SE/30 and the processor board, and I can write the software for the main processor and the system-controller MCU, and also the FPGA bus interface. I will license my work under some kind of "open-source" license; I haven't decided which one I think is best yet. My hope is that, by releasing the design files for the accelerator board in particular, I can inspire others in the community to adapt the board to other models of Macintosh beyond the SE and SE/30. I was gonna use KiCad because it's free; Eagle's free versions are slightly not good enough for our purposes.

 

I am hoping someone else will design the I/O board(s) and their software. There are some mechanical problems to be solved there about how to mount the board in the hole of the SE and SE/30, and how to mount the SD card slot perpendicular to the board.

 

My next steps are:

  • Obtain full manual (probably 3000 pages) for the NXP QorIQ LS1012A processor (this information is "embargoed" and so I must sign an NDA to receive it)
  • Create schematic symbols and footprints for all components
  • Create the basic board shapes for the SE and SE/30 main boards, and the processor board (to get a better idea of the required PCB size and cost)
  • Start designing schematics for one of the accelerator boards (SE or SE/30) and the processor board (according to the block diagrams given in this post)
Edited by ZaneKaminski


So. I really didn't read the whole thing, but this is... a WiFi, RAM, etc. etc. board all in one card for an SE?

 

If so, I don't even have one, but I know if it does the above I'll buy one.

 

I'm going to read the whole thing now...


This is awesome. :D I would buy one.

 

Does the QorIQ have a faster I/O interface but we're limited by the FPGA interface? If the QorIQ has a faster interface, is there an FPGA in the same price ballpark that can make use of it (let's assume we're ok with BGA for a moment)? Can we put the FPGA on the processor board (since BGA will probably require 4 layers)? I know the processor board is supposed to be interchangeable, but the FPGA could be reprogrammed for different hosts.

 

I question the value of designing for home assembly, especially when it hinders the design. Automated assembly should probably be built into the cost, including for the PDS board. It's a lot of components, and just soldering 96 or 120 pins for the PDS slot is going to be a huge pain. Your time is more valuable doing things besides hand-assembling PCBs. Consider a group buy or Kickstarter funding model.


Here's the block diagram of the QorIQ LS1012A:

post-6543-0-67106800-1477522427_thumb.jpg

 

Kinda odd to use such a network-y processor, but it's cheap and it only has 211 pins, so it'll be easy to "fan out" all of the signals when it's on a board.

 

All of the boards have to have at least 4 layers, by the way. Two-layer boards would not hold up to an 800 MHz processor and DDR memory. You need a reference plane adjacent to each signal layer, so the 4-layer board typically has internal power and ground planes, and then two signal layers on the outside.

 

The SERDES (short for serializer-deserializer) peripheral is intended for this type of low-latency, high-bandwidth application. The problem is that an FPGA that supports this type of I/O would be $45 or so instead of $12.

 

8-bit SDIO was also a possibility but I rejected it because there are these really long commands that have to go over a 1-bit bus, so the command latency is higher than qSPI.

 

Another option for the FPGA interface would be to bit-bang signals out on the GPIO of the main processor. I will see about that once I get the manual.

 

One of the other problems is that it may take a long time to generate the ARMv8-A code corresponding to a given MC68k routine. The ROM and OS routines will be fully translated quickly after boot, and we can provide a set of "hints" that can allow this process to go more efficiently. The hints would basically list the points in the ROM at which to begin code generation. In particular, we want to make sure the disk routines in the ROM are fully translated before the system boots, otherwise there will be a big delay and then the disk timing will be wrong and it will fail to read the disk. We can translate anything in the ROM ahead of time, but when loading a new application from disk, it may run slowly at first as the translation occurs during execution.
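The hints could be as simple as a table of ROM offsets handed to the translator at startup; here's a sketch with made-up names (the actual hint format, and the offsets themselves, are machine- and ROM-version-specific and undecided):

#include <stdint.h>
#include <stddef.h>

void translate_block_at(uint32_t m68k_addr);  /* placeholder for the translator */

/* Eagerly translate a list of known entry points (e.g. the ROM's disk
   routines) before letting the Mac boot, so nothing timing-critical hits
   an untranslated block later. */
void pretranslate_rom(uint32_t rom_base, const uint32_t *hint_offsets, size_t n)
{
    for (size_t i = 0; i < n; i++)
        translate_block_at(rom_base + hint_offsets[i]);
}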

 

A dual-core processor would be helpful in solving this problem, since one core could be always speculatively translating the code in memory.

 

Another interesting detail with the translation is that the 68000 has variable-length instructions: one instruction word may be followed by many extension words of data. So not only do you not know whether some memory is code or data, but even if you know it's code, you don't know how it lines up, i.e. which pieces are instruction words and which are extension words of data.
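A concrete example of the ambiguity (hand-assembled, so treat the exact encodings as something to double-check against the 68000 manual):

#include <stdint.h>

/* MOVE.L #$12345678,(A0): one instruction word followed by two extension
   words of immediate data. */
static const uint16_t code[] = { 0x20BC, 0x1234, 0x5678 };

/* A translator that enters at code[0] sees a single three-word instruction.
   One that enters at code[1] sees 0x1234, which also looks like a valid
   MOVE opcode with its own extension word, and decodes garbage from there.
   Without knowing every entry point, the instruction/extension alignment
   is ambiguous. */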

 

NXP has a new series of processors, i.MX8, which will come out soon. They're ARMv8-A and have like 4+ cores each. I'll check them out as well, once more information is available.

Edited by ZaneKaminski


There is a PDS slot on the Portable, and I think someone is working on that already, so I would say there are people who would like to have these, depending on cost of course.


The latest in a flurry of acquisitions. Hopefully it won't affect distributors' willingness to stock NXP's products. Microchip-Atmel is probably gonna suck in the long-term as well.

 

Maybe I can eliminate the FPGA entirely and get a more expensive processor chip that has enough GPIO (and CPU power to waste) to bit-bang the 68k bus. I think the practical limit for most GPIO ports of that type is 50 MHz or so. Conveniently, that's as fast as any 68k Mac's bus.

 

The lowest-end NXP i.MX8 has four Cortex-A53 cores and two (much slower) Cortex-M4 cores. That would be great, but who knows how pricey the part is. One of the M4s could bit-bang the bus, and then the four larger cores could do the emulation, translation, and MMU work.

 

One slow operation is particularly parallelizable. I'll explain. The translator software is supposed to translate M68k code to ARMv8-A code. This translation is cached, which is the key to achieving high performance; the code only needs to be translated once. The problem is that, when the memory containing the originally translated code is overwritten, the cached translation of that code must be "invalidated." This is a really time-consuming operation, closely related to the MMU functionality, which is itself complicated and slow. The good part is that multiple cores can speed this process up substantially. Sometime I'll write a spec about the entire thing, including the tree which the emulator must traverse to perform a memory access. But first, some more hardware work must be done.
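One simple-minded way to detect when invalidation is needed (a sketch; the page size, the bitmap, and all of the names are placeholders): keep one bit per page of the emulated address space saying "this page has cached translations," test it on every emulated write, and only then do the expensive work.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                         /* 4 KB granularity, for example  */
#define NUM_PAGES  (1u << (32 - PAGE_SHIFT))  /* whole 32-bit 68k address space */

/* One bit per page: set when any code in that page has been translated. */
static uint8_t translated_page[NUM_PAGES / 8];

void invalidate_translations_in_page(uint32_t addr);  /* the expensive part */

static bool page_has_translation(uint32_t addr)
{
    uint32_t p = addr >> PAGE_SHIFT;
    return translated_page[p >> 3] & (1u << (p & 7));
}

/* Hook called on every emulated write. The common case costs one load and
   a test; the rare case triggers invalidation, which is the part that
   multiple cores could chew on in the background. */
void notify_write(uint32_t addr)
{
    if (page_has_translation(addr))
        invalidate_translations_in_page(addr);
}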

 

Also, about home assembly: I can purchase 4-layer boards from OSH Park in multiples of 3 and assemble them myself if they're "easy" (surface-mount whatever is fine, as long as it's not BGA). This is much faster and cheaper than sending three boards off for assembly. So I wanna keep the main board free of BGA parts.

I don't think this product will sell in quantities of hundreds, which is when professional assembly becomes halfway cheap. I imagined personally making 10-20 prototypes and then selling them to some sponsors of the project. So if I have to pay for the assembly of 25 processor boards to do that (it will probably be convenient for me to have spares for future development), fine. Just any more professional assembly than required will make the prototype development too costly, I think.

Edited by ZaneKaminski


 

 

Quoting ZaneKaminski: "One of the other problems is that it may take a long time to generate the ARMv8-A code corresponding to a given MC68k routine. [...] We can translate anything in the ROM ahead of time, but when loading a new application from disk, it may run slowly at first as the translation occurs during execution."

 

 

 

This is why you need a JIT runtime engine. As I suggested before, if you're using a different CPU to execute instructions on a 68K bus, I would look at some of the open-source 68K emulator code out there, like Mini vMac, and see if there is something you can adapt.

 

You can't limit your system to launching specific areas of code into translation in a hierarchical manner, because it will make the card very machine-specific: the addresses vary from machine to machine. Not a good idea IMHO.

 

My goal, once/if this gets completed, is to adapt it to the Macintosh Portable, which is a machine that sits between an SE and an SE/30. It has a lot of the logic to adapt an '030 natively, but not all of it, and it has a 16 MHz 68000 sitting in there instead.

 

Also, if you adapt it to an SE, Portable, or other base 68000 machine, you have to generate the synchronous bus and its timings (E/VPA/VMA), because it's used for the VIAs, SCC, etc. That has to be emulated as well as the asynchronous bus.

Edited by techknight


Well, I think what I described is technically a JIT, but the problem is that I haven't thought about adding the ability to interpret portions of code that have not yet been translated, for when the emulator encounters an untranslated jump destination or the end of a translated block. In that case, it oughta switch to interpreting the instructions (while also translating them) so there isn't a huge latency before the result. That latency is what would ruin the disk accesses.
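Revising the earlier loop sketch, the fallback path would look roughly like this (again, all names invented):

#include <stdint.h>

typedef void (*host_block_fn)(void);

host_block_fn tcache_lookup(uint32_t pc);      /* NULL if not yet translated   */
void          queue_translation(uint32_t pc);  /* translate in the background  */
uint32_t      interpret_one(uint32_t pc);      /* execute one 68k instruction,
                                                  returning the next PC        */
uint32_t      current_68k_pc(void);

void run_from(uint32_t pc)
{
    for (;;) {
        host_block_fn blk = tcache_lookup(pc);
        if (blk != NULL) {
            blk();                     /* fast path: cached translation          */
            pc = current_68k_pc();
        } else {
            queue_translation(pc);     /* don't stall waiting on the translator; */
            pc = interpret_one(pc);    /* interpret until the block is ready, so */
        }                              /* things like disk timing stay sane      */
    }
}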


The tree structure is supposed to generalize it to multiple models and memory maps. It can also be constructed in a way that doubles as the structure used for the 68030 MMU emulation. There has to be at least some MMU emulation (forget the 68020 + "Apple HMMU"; we can build that on top of the '030 MMU stuff). A full implementation of the MMU functionality may be quite slow or memory-consuming.

 

The system controller is a pretty robust chip and it can generate the E clock. I thought that was generated externally to the 68000 (so I marked it as an input in my diagram), but it would make sense that it's not. No problem generating it.

 

About the synchronous peripherals generally, my design solves that problem entirely. Right, a faster 68000 won't want to talk to the synchronous peripherals, but an emulator of my design has no problem waiting 10 Macintosh clocks.

 

Thanks for the recommendation about assembly. I'll look into it.

Edited by ZaneKaminski


Quoting ZaneKaminski: "Just any more professional assembly than required will make the prototype development too costly, I think."

 

Very true, I overlooked the prototyping part. Assembly for prototypes would be very expensive.


Quoting ZaneKaminski: "The system controller is a pretty robust chip and it can generate the E clock. I thought that was generated externally to the 68000 (so I marked it as an input in my diagram), but it would make sense that it's not. [...] About the synchronous peripherals generally, my design solves that problem entirely."

 

I would read the 68000 manual if I were you; I studied it a while back. The E clock is generated by the CPU, and VPA/VMA are used with the CPU, in conjunction with the E clock and a few other signals, to communicate synchronously with peripherals on the bus.

 

Here is the thing: synchronous communication on the bus has to come from your board, because you will take over the bus as the bus master. Returning control back to the onboard CPU for synchronous communication is not an efficient way of doing it. But in "theory" you could.


Sure, I'm interested; hell, I'd easily pay $300. I could always meddle around with programming the FPGA myself.

 

I like the idea of using cheap off-the-shelf stuff like the ESP, because even for an accelerator board like this, pretty much anything modern is faster than a 68k processor. :p

Tbh, if you can get over the hassle of emulating an '030 core, having a Macintosh as a server would be a lot more useful.

Looking forward to seeing where this one goes; as an EE, this makes me want to do more FPGA stuff.


I've decided that the FPGA isn't really necessary if we bit-bang the signals out via a wide GPIO port. That should be faster, since the data doesn't need to first be transferred over a slow connection from the processor to the FPGA. Since the processors in question run at basically GHz speeds, they are certainly fast enough to bit-bang the 68000 and '020/'030 buses.

 

The only problem is that the QorIQ LS1012A doesn't have enough pins to talk directly to the PDS, even if we multiplex address and data. There are some faster members of the QorIQ family, but they're basically non-options because of their steep price. The only reason I'm looking at the LS1012A is how cheap it is to integrate into a system.

 

Qualcomm has started selling their Snapdragon 410E through resellers, which is a very capable chip (4 x 1.2 GHz Cortex-A53). The chip itself is only $18 or so, less than the QorIQ, but it will be difficult and expensive to integrate into a board. I think much of the hardware information about this chip is "secret" as well, so that may make it difficult to develop "bare-on-the-metal" software for it.

 

There are some Snapdragon 410E "system-on-a-module" units for sale, but they don't break out enough consecutively-numbered GPIO pins to make them worth using without the FPGA. We want something that exposes, for example, GPIO 0-31, so we can hook those straight up to the address or data bus of the Mac. If, for example, GPIO16 were missing, then we would have to spend some CPU cycles rearranging the data before we put it on the bus. The pre-built Snapdragon 410E modules available were not necessarily designed with this requirement in mind.
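To illustrate what that rearranging costs (a sketch; the register addresses and the particular missing pin are made up):

#include <stdint.h>

/* Hypothetical memory-mapped GPIO output registers. */
#define GPIO_OUT   (*(volatile uint32_t *)0x40000000)  /* pins 0-31 (invented address) */
#define GPIO2_OUT  (*(volatile uint32_t *)0x40000100)  /* a second bank                */

/* Ideal case: GPIO 0-31 wired straight to D0-D31 -- one register write. */
static inline void drive_data_contiguous(uint32_t value)
{
    GPIO_OUT = value;
}

/* Suppose GPIO16 is unavailable, so D16-D30 shift up to pins 17-31 and D31
   spills into the second bank. Every single bus access now pays for the
   masking, shifting, and extra write. */
static inline void drive_data_with_hole(uint32_t value)
{
    uint32_t low = value & 0x0000FFFFu;         /* D0-D15  -> pins 0-15  */
    uint32_t mid = (value & 0x7FFF0000u) << 1;  /* D16-D30 -> pins 17-31 */
    GPIO_OUT  = low | mid;
    GPIO2_OUT = (value >> 31) & 1u;             /* D31     -> other bank */
}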

 

On the upside, the Snapdragon 410E is way fast, and multiple cores can allow us to parallelize translation of the Mac opcodes and some of the MMU housekeeping so as to make the system even faster.

 

Also, the DragonBoard 410c, a $75 development board integrating the Snapdragon 410E, is available, and it's got USB and an HDMI port. Combined with a little Mac peripheral emulation software (for the VIA, IWM, display, etc.), that takes us very close to a super-Macintosh on a board for $75. Just add monitor, keyboard, and mouse. The primary difficulty in adapting an emulator to standalone operation on the DragonBoard is the lack of documentation on the USB and HDMI stuff of the 410E.

 

Also the 410E can run Android hahahah.

 

Again, the major problem with all this is that the processor board with the 410E will cost a lot more than the one for the QorIQ. Even if we eliminate the FPGA and shrink the main board as small as we can, the Snapdragon will still mean greater cost and difficulty. If I get far with the Snapdragon, I'll launch a kickstarter for the processor board. It should be useful for many other applications. If a lot of people want the processor board (for use in the "Maccelerator" or otherwise), then the unit cost will be lower.

 

I think I will try both the QorIQ and the Snapdragon. Still haven't received the QorIQ datasheet from the sales rep. Hopefully it'll arrive soon.


I wonder, though, if you could use the AM335x chips? At least you'd have a couple of PRUs available to play with the bus while the main ARM core stays available for whatever usage.


Well, the problem with those chips is that they're ARMv7-A. We could use them, but I have a preference for ARMv8-A since it has 31 general-purpose registers to ARMv7-A's 15. That's convenient because M68k has 16 registers, so on ARMv8-A, we can fit D0-D7, A0-A7, PC, status register stuff (X bit will be tricky), and still have room for scratch registers. Without as many registers, an ARMv7-A implementation will have to swap the stored M68k registers out to memory or something. That will slow things down.

 

However, I acknowledge that my preference for ARMv8-A could be driving up the system cost. I will look into the possibility of storing the 16 M68k general-purpose registers in ARMv7-A's NEON registers.

 

Those AM335x chips have around the pin count I'm looking for, though. The Snapdragon 410E has 760 balls at 0.4 mm pitch, compared to 211 for the QorIQ. Qualcomm says you need at least two, but preferably four, types of microvias to route all of the signals. Ugh. That's expensive. I'm trying to get the full pinout and footprint for the Snapdragon, and then I'll see how hard it will be to do the board.

 

Something around 300-400 BGA balls sounds like it'll have the right amount of I/O but still be easy to break out the signals.


The PRU-ICSS system in those TI chips is really cool though. I just figured that I could do the bus communication on the main processor, but the timing predictability of the PRU system is attractive.

 

I figured the strategy for performing a bus access on the main ARM chip would be to poll the clock, /DTACK, etc., and once we observe the right edge on the right signal, drive the data we want onto the bus, take it off, whatever. If the goal is to support a IIci, however, that means bit-banging at 25 MHz, which I've never done, so we've gotta make sure that'll actually work. I'm concerned that, at 25 MHz, there may be a big delay between when an edge occurs and when the GPIO system of the processor recognizes that edge.
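Concretely, "polling the clock" means something like the following (a sketch; the register names and the bus handshake are drastically simplified and hypothetical, not real LS1012A or 68030 bus code):

#include <stdint.h>

/* Hypothetical memory-mapped GPIO registers. */
#define GPIO_IN   (*(volatile uint32_t *)0x40000000)
#define GPIO_OUT  (*(volatile uint32_t *)0x40000004)

#define CLK_BIT    (1u << 0)  /* Mac bus clock, sampled on a GPIO input */
#define DTACK_BIT  (1u << 1)  /* /DTACK from the Mac, active low        */

/* Spin until the bus clock makes a low-to-high transition. At 25 MHz the
   whole clock period is 40 ns, so the open question is whether the GPIO
   input path plus this loop can actually catch the edge in time. */
static void wait_rising_clk(void)
{
    while (GPIO_IN & CLK_BIT) { }     /* wait for the clock to go low */
    while (!(GPIO_IN & CLK_BIT)) { }  /* then wait for it to go high  */
}

/* Grossly simplified flavor of a bit-banged write cycle: line up with the
   clock, drive the bus, then wait for /DTACK before ending the cycle. */
void bitbang_write_cycle(uint32_t bus_bits)
{
    wait_rising_clk();
    GPIO_OUT = bus_bits;             /* drive address/data/strobes (simplified) */
    while (GPIO_IN & DTACK_BIT) { }  /* /DTACK is active low: wait for it       */
    /* ...negate strobes and release the bus here (omitted)                     */
}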

 

It would be much nicer to clock the chip at a multiple of the Mac's clock and in phase with it, and write code with exact timing to access the bus. Then we would never have to poll the clock. That's not gonna happen on one of these big fancy ARM microprocessors. Some ARM microcontrollers have a mode that makes instruction timing completely predictable, but setting that mode negates all of the benefits of the cache and branch prediction and all that.

 

I've gotta examine the timing margins for the M68k family buses (especially the synchronous accesses, since they are gonna be tighter and obviously will have a defined relationship with the clock), and then I'll look at how long input and output signals take to propagate in these GPIO systems.

 

In the AM335x, the fact that the PRUs are separate from the ARM CPU seems good, but I can't think of a purpose for parallelizing the execution of translated instructions with accessing the bus. There's no point unless we decide to "pipeline" the operations in software, speculatively executing future code while a bus access is pending. Pipelining works great in hardware, but in software it would probably be a mess and slow things down more than it would help. Maybe while the PRU is busy accessing the bus, the main processor can do housekeeping functions or speculatively translate more code.

Edited by ZaneKaminski


TI AM437x may also be a good choice. Only $20 for a 1 GHz one with a Cortex-A9 and four PRU cores. I understand the Cortex-A9 in these chips to be a good bit faster than the Cortex-A8 in the AM335x series.

 

There's also NXP's i.MX6 series, which goes down to around $30 for 2x ARM Cortex-A9 at 1GHz or so. No fancy PRU though.

 

i.MX7 has more of a low-power focus... no reason to spend money on that when we could be spending money on making the system faster. The older i.MX chips are basically irrelevant. i.MX8 is coming as well, but I think those chips will only support DDR4 memory, which is sort of a bummer. DDR4 is so distant from the "makerspace" that having it in the design would certainly make it impossible to produce for a reasonable price.

