Serious proposal: accelerator and peripheral expansion system

ZaneKaminski · Nov 22, 2016

The other part of the bad news is that there are currently 120 components which must be placed on the board...

Most of them cost between one and ten cents, so the problem is strictly assembly cost, not parts cost.

joethezombie · Nov 22, 2016

Well, my opinion is the more options the better. It would be a dream to have a super Mac SE/300, and if the cost is a couple hundred dollars and some change, so be it. The current offering is pretty much the cost of a TwinSpark, PowerCache, and Micron XCEED, which unless you are very lucky is unobtanium, well into a couple thousand dollars, and nets you 50MHz.

A project like yours only comes around so often, and probably never as ambitious as the goals you have set. I say let the feature creep creep on!

K55 · Nov 22, 2016

Tbh I think that if you're using a classic Mac you're not really looking for VGA video. I mean the display is there. It'd make a difference if it was the only video out but ehhh (bit banging VGA isn't that hard though)

I'd focus on getting the entire thing working as an accelerator then worry about video.

Others are free to chime in of course.

Scott Squires · Nov 22, 2016

I'm not sure how popular this opinion is. But personally, I don't see the attraction of adding external displays to a compact Mac. At that point, one might as well just use an LC or Quadra.

Maybe replacing the internal CRT with a color LCD. Maaaybe. Probably not. At least not until all the flyback transformers are burnt out. But I doubt there are panels available with the appropriate specs anyway.

Scott Squires · Nov 22, 2016

joethezombie said:
A project like yours only comes around so often, and probably never as ambitious as the goals you have set. I say let the feature creep creep on!

Feature creep often results in vaporware. I suggest reining in the features.

ZaneKaminski · Nov 22, 2016

Yeah, the ideas for features are out of control at the moment. I don't plan to implement all of the features I've been planning for. I just think it would be a shame to sell this $150 thing and for it to be impossible to upgrade it to add some feature like video output. So I'm trying to strike a balance where I have a minimum set of features that I will implement while reserving hardware interfaces for others to be creative with and add something.

Here are the pieces I plan to make:

Schematics for SE and SE/30 (almost done)
Board design for SE and SE/30 (will begin soon)
FPGA programming for 68000 and 68030 buses
MC68000 emulation engine for ARMv8-A
GUI front-end for emulator (for use on desktop ARMv8-A PCs, e.g. Raspberry Pi 3. This will help with testing independently from the Maccelerator hardware.)
At least a little Mac Plus/SE peripheral emulation when running the emulator on a PC (again, makes the emulator easy to test without the Maccelerator hardware. Maybe this portion can come from Mini vMac)
Integration of Linux system for the Maccelerator

I do agree, though, that VGA output is not the most important thing since the Mac already has a screen. Hmm. I think this image of the Radius Full-Page Display was what got me onto the idea of an external display:

[Image: Radius Full-Page Display]

Maybe I'll place a footprint and route the traces for the display sub-board but not actually solder on the connector. Adventurous owners can reflow the connector on with hot air.

Before I finish the schematic and begin the board layout, however, I need to talk to Intrinsyc and buy their Open-Q 410 module. They won't give me any technical information (even the pinout... they will answer my questions about capabilities but not give me the actual info) unless I own the thing already.

EvilCapitalist · Nov 22, 2016

anthon said:
Feature creep often results in vaporware. I suggest reining in the features.

Seconding this right here. At what point is the line drawn between "accelerator" and "essentially a new computer shoehorned into a classic Mac"? Right now the only thing this doesn't have (potentially) that a new computer would is a sound card. Don't get me wrong, I like the idea of an affordable accelerator for a system where current options are near unobtanium (either by scarcity, cost, or some combination of the two) but I think the feature list should be pared down somewhat. An upgraded processor, RAM, and upgrading the internal video from black and white to grayscale is a good stopping point IMHO. Replacement storage and the addition of external video, WiFi, USB seem a bit excessive, unless they are left as potential future features to be implemented using an expansion card.

K55 · Nov 22, 2016

I agree with evil. Just to show it can be done doing a "basic" version with the things EC said would be more than enough for me to pay $200+ for. It would make the design easier also.

Maybe make a poll?

ZaneKaminski · Nov 22, 2016

I wanna stress that I don't plan to implement grayscale video or video output myself, or any of the peripheral boards I've given rough designs for. (I may become frustrated and design the USB/JTAG header board while debugging the Maccelerator though. It's pretty easy.) I really am just trying to do the things in the list I posted above, but I'm trying to make sure that with my hardware, there exist plausible implementation strategies for certain desirable features.

In my mind, video is one of the features that should be possible and about which I should have a clear implementation strategy; I just won't actually implement it. Someone else can do it if they want, and hopefully the plan (which I have/am detailing now) will be helpful or serve as a starting point.

Since I now plan to run Linux on the Snapdragon 410, features relating to storage, internet connectivity, and USB come free. (Well, no, you've gotta pay $79 to get the Snapdragon 410 module lol, but at least Linux is free in some sense.) Let me explain.

In my mind, the emulator software has to be prototyped on a Raspberry Pi 3, DragonBoard 410c, something like that. That's the easiest way to get it running quickly and get feedback when debugging. Obviously then I'll have to develop a desktop GUI front-end for the emulator. So it will be natural for the emulator to use USB keyboards, mice, get files and disk images from the filesystem, etc. So in that sense, those features are free because they come naturally as a consequence of a seemingly unrelated detail of the implementation. Indeed, I don't want to debug the emulator software while connected to the Macintosh, so this is all the easiest way.

All of the Snapdragon 410 SoMs I've seen also have WiFi, Bluetooth, and GPS built in. You just need to connect an antenna. Now, I can't sell this thing with antennas, since it would have to be certified and I don't know how to achieve compliance with that kind of thing, but I'm sure individuals can hook up their own (well, subject to FCC regulations). If a Raspberry Pi can make a PPP or SLIP connection with a Mac and get it online, I'm sure we can do the same in software. So that's not that hard.

joethezombie · Nov 22, 2016

ZaneKaminski said:
I wanna stress that I don't plan to implement grayscale video or video output myself, or any of the peripheral boards I've given rough designs for. (I may become frustrated and design the USB/JTAG header board while debugging the Maccelerator though. It's pretty easy.) I really am just trying to do the things in the list I posted above, but I'm trying to make sure that with my hardware, there exist plausible implementation strategies for certain desirable features.

In my mind, video is one of the features that should be possible and about which I should have a clear implementation strategy; I just won't actually implement it. Someone else can do it if they want, and hopefully the plan (which I have/am detailing now) will be helpful or serve as a starting point.

That's what I had gathered when reading the thread. The question seemed more like "should I put a $1 connector on the board or not?" Well, my opinion is, for $1 why not? Then someone like a future, more experienced me could interface with that and maybe get something working. Also, as for the usefulness of an external display with a compact Mac, it is just awesome having a portrait display next to my SE/30!

ZaneKaminski · Nov 22, 2016

Hmm but maybe some of the peripheral stuff isn't as free as I may have thought. For example, emulating the VIA and other members of the chipset is different from adding more I/O devices to a system which already has a physically working chipset.

However, this work shouldn't be too terrible. The chipset emulation, used when the Maccelerator software is running on a standalone ARMv8-A chip, can be taken from Mini vMac. (Hopefully their license permits it.) Some drivers for the Mac will need to be written, but they can be just brain-dead simple. I mean, a virtual disk driver would be so easy. It's just moving memory around, with no actual control of the drive or anything like what is done with the Sony IWM driver.

Knez · Nov 23, 2016

Why not make something like the Vampire II for Amiga work with the 68k Macs? Instead of reinventing the wheel with some kind of emulation running on ARM.

It's 68000, 68020, 68030, 68040, 68060 and ColdFire compatible and is almost twice as fast as the fastest real 68060 out there.

I know that they are making a stand-alone version and are thinking about adding the interrupts to make it work in Ataris. Why not add some 68k Mac support as well? I'm not that into 68k stuff, so I don't really know what's needed.

About the Vampire FPGA core for those of you not familiar with it: http://wiki.apollo-accelerators.com/doku.php?id=apollo_core

techknight · Nov 23, 2016

What I would like is not only having the 68K acceleration, but also having a webkit module in the CPU that is addressable in slot-space. So a modern browser could be written to take advantage of the hardware-accelerated webkit engine

ZaneKaminski · Nov 23, 2016

Knez said:
Why not make something like the Vampire II for Amiga work with the 68k Macs? Instead of reinventing the wheel with some kind of emulation running on ARM.

It's 68000, 68020, 68030, 68040, 68060 and ColdFire compatible and is almost twice as fast as the fastest real 68060 out there.

I know that they are making a stand-alone version and are thinking about adding the interrupts to make it work in Ataris. Why not add some 68k Mac support as well? I'm not that into 68k stuff, so I don't really know what's needed.

About the Vampire FPGA core for those of you not familiar with it: http://wiki.apollo-accelerators.com/doku.php?id=apollo_core

I tried to talk to the developer of the Apollo core. Originally, I wanted to license his core and synthesize it in a Cyclone III or IV FPGA. He never answered. I thought I could maybe develop a 68k-compatible core and synthesize it in an FPGA, but I'm not really interested in that.

So the disadvantage of doing the emulation in software is that the cost of the accelerator is higher (maybe even $50 higher), but the advantage is that running Linux on a modern processor allows a lot of flexibility. Moreover, the emulator, when finished, will be way faster than the Apollo core, especially for operations that don't touch main memory. Register-register operations, once translated, will happen almost as fast as native ARM code doing the same. Way faster than a 68060.

techknight said:
What I would like is not only having the 68K acceleration, but also having a webkit module in the CPU that is addressable in slot-space. So a modern browser could be written to take advantage of the hardware-accelerated webkit engine

Don't worry lol, I've been considering that. It would be great with the 8-bit grayscale adapter on the SE and SE/30. 512x342 is plenty of pixels to display Android, and it's very well supported on the Snapdragon 410. Now, I don't plan to get the thing running Android myself (barring some kind of a change in plans), but it is very possible to get on the web like that.

Gorgonops · Nov 23, 2016

I've been vaguely following this for a while, and I have to admit I'm still sort of stuck questioning the viability of doing the emulation in software on top of Linux, at least with any level of performance. Essentially what you're building here is an in-circuit emulator for a Motorola 68000-family chip, and while I guess I see you have abandoned the idea of literally bit-banging the individual lines with a Linux process I do wonder exactly how much "administrivia" you intend to have the main CPU do behind the FPGAs you're adding on. General-purpose Linux is sort of terrible at real-time work and according to the couple of white papers I've googled up even RT Linux typically has thread-switching latency in the "tens of (edit) microseconds" ballpark. (By comparison it looks like a memory read cycle on an 8mhz 68000 takes about... 290 nanoseconds?) Seems to me those FPGAs are going to have to be doing a *lot* of work to abstract the bus so the CPU won't be spending massive amounts of time thrashing about watching for handshaking signals, etc.

Also, practically speaking, once you have this CPU bus glue how do you intend to structure the communication not only between the motherboard and the emulator, but *generally*? I mean, when a user is running a MacOS program on this setup will the code still reside in the memory on the motherboard and for every word executed the emulator will be hitting this bus bridge, pulling the instructions (presumably into a cache), hopefully executing it faster than the original CPU would, and writing any changed results back to the board (will the cache always be write-through to ensure you don't get tripped by self-modifying code, or are you going to try to be smarter about it?)? Or is your intention to use the Mac's motherboard *just* for I/O and the MacOS container is going to reside entirely on the onboard RAM on your little Linux core, with *Linux itself* handling all the I/O through the bus bridge? It's an important distinction, because I seriously doubt you're going to see much in the way of a performance boost unless you do the latter, and doing the latter is going to mean you're going to have to, in essence, create Linux drivers for every hunk of hardware in the "Zombiefied" Mac (a Linux computer ate its brain), which is going to be really entertaining for picky timing-critical pieces like the IWM.

Also have to admit I'm sort of skeptical about your performance estimates for how fast of a Motorola 68000 emulator you can do on your ARM SoC, but if you have some code you're just waiting to demo I'd be happy to be proved wrong. So far as I'm aware there's no JIT for BasiliskII on ARM yet but apparently there's some JIT support in UAE. Don't see anyone saying it's way faster than a 68060, though.

To be clear, if you're sure that you've got all these issues worked out then by all means don't let me stop you. I guess I'm just vaguely wondering if there's some precedent to a project like this? I mean, I've seen the Vampire accelerator, and there's also a project out there that adds features (and I *think* some level of acceleration but I could be wrong about that, I'm too lazy to find the links right now) to 6502-based computers via an FPGA packaged in board that fits a 40 pin DIP socket, but I don't believe I've seen anything that tries to use a general purpose CPU to try to replace(*), let alone accelerate, a system.

(* Some In-Circuit emulator systems might do this, I guess, but they generally target microcontrollers.)

On the topic of using this to add all sorts of enhanced features, including web access via the Linux you've put on board, well, I'm going to toss this out there: Have you by any chance heard of the Apple II Pi? It's a devious bundle of software that lets you take a Raspberry Pi, cable it up to an Apple II (or in fact plug it into a slot using a little interposer card that's just a really brain-dead serial port), and boot up a software disk which redirects all keyboard/joystick/mouse I/O to the Pi and lets them act as those peripherals for the Linux machine. It even allows access to the floppy drives attached to the host so disks can be read and used for the Apple IIgs emulator that runs on the Pi side, so overall it makes the setup look like a massively expanded Apple IIgs (IE, 15MB of RAM, hard disks, network card, the works). The illusion is so convincing that many of the people who've bought the serial card are convinced that it must be doing a lot more with their Apple II than it really does. So in that vein: every Macintosh has two high-speed serial ports built into it, have you considered the idea of just interfacing your parasite Linux computer to that and creating a little extension that turns the Mac into an I/O slave? For every Mac that uses an external monitor the job is pretty much done at that point; just pair up with an HDMI to VGA adapter if you want to drive the original monitor and you're good to go.

For the toaster macs there are various ways to drive the original CRT from a modern ARM SoC; the link shows a sort of software-intensive way of doing it but I'm relatively sure I've seen other methods that can leverage LCD interfaces or even HDMI to do it with less overhead. How about for those you just put an interposer board between the analog board plug and the motherboard that can switch on a software command between the motherboard output and the video generated by your "accelerator"? Connectivity to the host Mac would consist of said connector and a little plug that snaked out of the case and plugged into one of the serial ports. On bootup the extension to slave-ify the Mac would load, it'd establish handshaking with the Linux computer (which itself might need some time to boot), and then the screen goes *blink* briefly as control is handed off. For all practical purposes you've accomplished the same thing but you don't have to target the CPU socket.

ZaneKaminski · Nov 23, 2016

Gorgonops said:
I've been vaguely following this for a while, and I have to admit I'm still sort of stuck questioning the viability of doing the emulation in software on top of Linux, at least with any level of performance. Essentially what you're building here is an in-circuit emulator for a Motorola 68000-family chip, and while I guess I see you have abandoned the idea of literally bit-banging the individual lines with a Linux process I do wonder exactly how much "administrivia" you intend to have the main CPU do behind the FPGAs you're adding on. General-purpose Linux is sort of terrible at real-time work and according to the couple of white papers I've googled up even RT Linux typically has thread-switching latency in the "tens of (edit) microseconds" ballpark. (By comparison it looks like a memory read cycle on an 8mhz 68000 takes about... 290 nanoseconds?) Seems to me those FPGAs are going to have to be doing a *lot* of work to abstract the bus so the CPU won't be spending massive amounts of time thrashing about watching for handshaking signals, etc.

Also, practically speaking, once you have this CPU bus glue how do you intend to structure the communication not only between the motherboard and the emulator, but *generally*? I mean, when a user is running a MacOS program on this setup will the code still reside in the memory on the motherboard and for every word executed the emulator will be hitting this bus bridge, pulling the instructions (presumably into a cache), hopefully executing it faster than the original CPU would, and writing any changed results back to the board (will the cache always be write-through to ensure you don't get tripped by self-modifying code, or are you going to try to be smarter about it?)? Or is your intention to use the Mac's motherboard *just* for I/O and the MacOS container is going to reside entirely on the onboard RAM on your little Linux core, with *Linux itself* handling all the I/O through the bus bridge? It's an important distinction, because I seriously doubt you're going to see much in the way of a performance boost unless you do the latter, and doing the latter is going to mean you're going to have to, in essence, create Linux drivers for every hunk of hardware in the "Zombiefied" Mac (a Linux computer ate its brain), which is going to be really entertaining for picky timing-critical pieces like the IWM.

Also have to admit I'm sort of skeptical about your performance estimates for how fast of a Motorola 68000 emulator you can do on your ARM SoC, but if you have some code you're just waiting to demo I'd be happy to be proved wrong. So far as I'm aware there's no JIT for BasiliskII on ARM yet but apparently there's some JIT support in UAE. Don't see anyone saying it's way faster than a 68060, though.

To be clear, if you're sure that you've got all these issues worked out then by all means don't let me stop you. I guess I'm just vaguely wondering if there's some precedent to a project like this? I mean, I've seen the Vampire accelerator, and there's also a project out there that adds features (and I *think* some level of acceleration but I could be wrong about that, I'm too lazy to find the links right now) to 6502-based computers via an FPGA packaged in board that fits a 40 pin DIP socket, but I don't believe I've seen anything that tries to use a general purpose CPU to try to replace(*), let alone accelerate, a system.

Let me explain how I will structure the software. This should address the concerns you have raised, Gorgonops.

The emulator engine itself should be a chunk of C code with few, if any, external dependencies, in terms of libraries and the OS it must run on.
Of course, the entire RAM and ROM of the Macintosh should be cached by the emulator. RAM writes don’t even need to be written back to the Mac except in the case of VRAM/video region writes. (Therefore other DMA devices on the PDS bus are not supported.)

The emulator can keep a big table, the “address space table,” of 4kbyte pages of the entire 4Gbyte memory space of the Macintosh (can maintain different tables per function code as well). These tables will be a bit big, 1M x (size of entry), but that’s okay when you have 1G of RAM. Each entry in the address space table will tell how to access the given page of memory. So in changing the address space table, you can change whether an access by the emulator to a certain region of memory results in an access of the RAM cache, ROM cache, or an actual bus access (such as for VRAM or an I/O device). (The translator software itself can also use this table to store flags about the translation status of a page, but I have determined it would be better to use a different type of structure.)
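A minimal sketch of such an address space table, assuming one table (no per-function-code copies) and byte reads only. The names `page_kind`, `page_entry`, `addr_space`, and `read8` are hypothetical, and the bus case is a stub that only counts accesses rather than talking to any real hardware:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12u                          /* 4 KiB pages */
#define PAGE_COUNT (1u << (32u - PAGE_SHIFT))   /* 1M entries cover 4 GiB */

/* How an emulated access to a page gets serviced (hypothetical names). */
typedef enum { PAGE_UNMAPPED = 0, PAGE_RAM_CACHE, PAGE_ROM_CACHE, PAGE_BUS } page_kind;

typedef struct {
    uint8_t *host;  /* backing host memory for cached pages, NULL otherwise */
    uint8_t  kind;  /* one of page_kind */
} page_entry;

typedef struct {
    page_entry *pages;      /* PAGE_COUNT entries */
    unsigned    bus_reads;  /* accesses that would go out to the Mac's bus */
} addr_space;

/* Allocate the table; calloc leaves every page PAGE_UNMAPPED. */
static int addr_space_init(addr_space *as) {
    as->pages = calloc(PAGE_COUNT, sizeof *as->pages);
    as->bus_reads = 0;
    return as->pages ? 0 : -1;
}

/* Point a page-aligned guest range at a cache buffer or at the real bus. */
static void addr_space_map(addr_space *as, uint32_t base, uint32_t len,
                           page_kind kind, uint8_t *host) {
    assert((base & 0xFFFu) == 0 && (len & 0xFFFu) == 0);
    for (uint32_t off = 0; off < len; off += 1u << PAGE_SHIFT) {
        page_entry *e = &as->pages[(base + off) >> PAGE_SHIFT];
        e->kind = (uint8_t)kind;
        e->host = host ? host + off : NULL;
    }
}

/* One table lookup decides where the read is serviced. */
static uint8_t read8(addr_space *as, uint32_t addr) {
    page_entry *e = &as->pages[addr >> PAGE_SHIFT];
    switch (e->kind) {
    case PAGE_RAM_CACHE:
    case PAGE_ROM_CACHE:
        return e->host[addr & 0xFFFu];
    case PAGE_BUS:
        as->bus_reads++;  /* stub: a real build would request an M68k bus cycle */
        return 0xFF;
    default:
        return 0xFF;      /* stub: a real build would likely raise a bus error */
    }
}
```

With a pointer plus a kind byte per entry, the table in this sketch comes to about 16 MiB on a 64-bit host, which fits the "a bit big, but okay with 1G of RAM" estimate. Remapping a region (say, switching VRAM from cached to bus access) is just another `addr_space_map` call.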

The emulator engine must be statically linked to a certain front-end in order to actually do anything. (I specify static linking because some external segment loader or whatever must do the dynamic linking, and I want to avoid any external dependencies. No need for dynamic linking.) So I plan for two front-ends.
A desktop front-end would support debug and test of the emulator engine in one of the typical Linux desktop environments.
An in-circuit emulator (ICE) front-end would support use of the emulator as an accelerator or coprocessor in an existing M68k system like the Macintosh.

The ICE front-end would be constructed as a kernel module, in order to eliminate the context-switch overhead. Now, I have totally eliminated any operation of the M68k bus by the Snapdragon itself. The Snapdragon talks to 3 FPGAs over the multiplexed, 32-bit-wide “FPGA bus,” and then they coordinate among themselves the M68k bus access. (Enabling this was my discovery of the very affordable MachXO FPGA family.) Therefore, I have made the FPGA-Snapdragon bus design fully asynchronous. So you can, from userspace, operate the FPGA bus and command an M68k bus operation, but it would be slow, and you probably can't operate the IWM, as you say Gorgonops. So for practical purposes, it has to be done from a kernel module.

Now, there still is a bit of a bandwidth problem, but maybe not the one you would expect. The Snapdragon 410 has 122 GPIO pins and 122 GPIO pin registers. What I mean is that you can’t do a 32-bit-wide write to a register and set 32 GPIOs at once. No, you have to do 32 separate accesses. To command a bus access for a 68020+ system, some 32 pins need to be set 3 times each, plus a few control signals. So that’s 100 memory operations, plus some waiting because of rise/fall times. I dunno how fast the Snapdragon’s GPIO I/O interconnect is, but these 100 writes to I/O memory couldn’t take more than a half of a microsecond.
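The write count above can be sketched as follows, using a plain array as a stand-in for the per-pin registers. Everything here is illustrative: the pin assignment (bus on pins 0..31, control on the next pins), the figure of 4 control writes for "a few control signals," and the function names are assumptions, and what the three 32-bit phases actually carry is left open:

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_COUNT  122  /* from the post: one register per pin */
#define BUS_WIDTH    32
#define BUS_PHASES    3  /* from the post: 32 pins set 3 times each */
#define CTRL_WRITES   4  /* assumption standing in for "a few control signals" */

static uint32_t gpio_reg[GPIO_COUNT];  /* stand-in for memory-mapped pin registers */
static unsigned gpio_writes;           /* counts individual register writes */

/* One register write sets exactly one pin. */
static void gpio_set(unsigned pin, unsigned level) {
    assert(pin < GPIO_COUNT);
    gpio_reg[pin] = level & 1u;
    gpio_writes++;
}

/* Driving a 32-bit value therefore costs 32 separate writes. */
static void fpga_bus_put32(uint32_t value) {
    for (unsigned bit = 0; bit < BUS_WIDTH; bit++)
        gpio_set(bit, (value >> bit) & 1u);
}

/* Command one bus access and return how many register writes it took. */
static unsigned command_bus_access(const uint32_t phase_words[BUS_PHASES]) {
    unsigned before = gpio_writes;
    for (unsigned p = 0; p < BUS_PHASES; p++)
        fpga_bus_put32(phase_words[p]);
    for (unsigned c = 0; c < CTRL_WRITES; c++)
        gpio_set(BUS_WIDTH + c, 1u);
    return gpio_writes - before;
}
```

Under these assumptions a single commanded access costs 3 x 32 + 4 = 100 register writes, which is where the round figure of 100 comes from.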

By the way, it’s about 1 microsecond per word access on Mac Plus, average of 3/4 us (I think) on SE, and 1/4 us or 3/16 us on SE/30 (don’t know if the glue logic imposes a wait-state).

So in conclusion, the main hit will be in I/O performance to the peripherals and video memory, but I don’t think it’ll be too bad, since everything else will be faster. Write operations can also be sort of pipelined to improve performance, which will benefit video writes. Video reads can come from a cache in the Snapdragon's memory. So the worst case is that the I/O performance of an SE/30 or IIci will be as good as a Mac SE. But again, I think I can make it faster than that by pipelining the write operations.

Gorgonops said:
On the topic of using this to add all sorts of enhanced features, including web access via the Linux you've put on board, well, I'm going to toss this out there: Have you by any chance heard of the Apple II Pi? It's a devious bundle of software that lets you take a Raspberry Pi, cable it up to an Apple II (or in fact plug it into a slot using a little interposer card that's just a really brain-dead serial port), and boot up a software disk which redirects all keyboard/joystick/mouse I/O to the Pi and lets them act as those peripherals for the Linux machine. It even allows access to the floppy drives attached to the host so disks can be read and used for the Apple IIgs emulator that runs on the Pi side, so overall it makes the setup look like a massively expanded Apple IIgs (IE, 15MB of RAM, hard disks, network card, the works). The illusion is so convincing that many of the people who've bought the serial card are convinced that it must be doing a lot more with their Apple II than it really does. So in that vein: every Macintosh has two high-speed serial ports built into it, have you considered the idea of just interfacing your parasite Linux computer to that and creating a little extension that turns the Mac into an I/O slave? For every Mac that uses an external monitor the job is pretty much done at that point; just pair up with an HDMI to VGA adapter if you want to drive the original monitor and you're good to go.

For the toaster macs there are various ways to drive the original CRT from a modern ARM SoC; the link shows a sort of software-intensive way of doing it but I'm relatively sure I've seen other methods that can leverage LCD interfaces or even HDMI to do it with less overhead. How about for those you just put an interposer board between the analog board plug and the motherboard that can switch on a software command between the motherboard output and the video generated by your "accelerator"? Connectivity to the host Mac would consist of said connector and a little plug that snaked out of the case and plugged into one of the serial ports. On bootup the extension to slave-ify the Mac would load, it'd establish handshaking with the Linux computer (which itself might need some time to boot), and then the screen goes *blink* briefly as control is handed off. For all practical purposes you've accomplished the same thing but you don't have to target the CPU socket.

That stuff is cool but it's not really the direction I want to go in. My aim in making the Maccelerator is to personally learn something. The added benefit is that there is something tangible to show off, so maybe you can kinda get the same effect by taking over the screen and I/O, but that's not my aim.

Let me say a little more about the performance.

I think 50x faster than SE for register-register operations is easy. Do I need to say any more? You just write the shortest ARMv8-A algorithm in assembly to implement the given 68000 instruction. Even if it takes 10 instructions, the Snapdragon's Cortex-A53 runs at 1.2 GHz, with a 533 MHz memory bus, and can dispatch multiple instructions in one clock. 68030 takes at least two clocks to execute an instruction, and it has an unsophisticated instruction cache compared to the Cortex-A53's cache system.
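As a rough back-of-the-envelope check of that claim, here is a sketch that compares instruction rates. It assumes the SE's 68000 runs at 7.8336 MHz and needs 4 clocks for its fastest register-register instructions, and it conservatively assumes the ARM core retires one instruction per clock. It ignores memory access, condition-code bookkeeping, and translation overhead, so it is an upper-bound style estimate, not a benchmark:

```c
#include <assert.h>

/* Emulated 68k instructions per second, given how many ARM instructions
 * each one costs and how many ARM instructions retire per clock. */
static double emulated_rate(double arm_hz, double arm_insns_per_68k, double arm_ipc) {
    assert(arm_insns_per_68k > 0.0);
    return arm_hz * arm_ipc / arm_insns_per_68k;
}

/* Native 68k instructions per second at a fixed clocks-per-instruction. */
static double native_rate(double m68k_hz, double clocks_per_insn) {
    assert(clocks_per_insn > 0.0);
    return m68k_hz / clocks_per_insn;
}

/* Ratio of the two rates. */
static double speedup_estimate(double arm_hz, double arm_insns_per_68k, double arm_ipc,
                               double m68k_hz, double clocks_per_insn) {
    return emulated_rate(arm_hz, arm_insns_per_68k, arm_ipc) /
           native_rate(m68k_hz, clocks_per_insn);
}
```

Calling `speedup_estimate(1.2e9, 10.0, 1.0, 7.8336e6, 4.0)` gives roughly 61x, so the 50x figure is plausible for register-register work even at 10 ARM instructions per 68000 instruction, as long as the stated assumptions hold.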

Instructions which access memory will be harder to get to be 50x faster, but I'm sure it's doable when the emulator runs as a kernel module, basically hogging a whole core.

So I am ignoring the speed at which the chipset can be accessed in this speed analysis, but I don't think it will be a huge deal, especially if write operations can be pipelined to accelerate video writes.

Gorgonops · Nov 23, 2016

ZaneKaminski said:
Now, there still is a bit of a bandwidth problem, but maybe not the one you would expect. The Snapdragon 410 has 122 GPIO pins and 122 GPIO pin registers. What I mean is that you can’t do a 32-bit-wide write to a register and set 32 GPIOs at once. No, you have to do 32 separate accesses. To command a bus access for a 68020+ system, some 32 pins need to be set 3 times each, plus a few control signals. So that’s 100 memory operations, plus some waiting because of rise/fall times. I dunno how fast the Snapdragon’s GPIO I/O interconnect is, but these 100 writes to I/O memory couldn’t take more than a half of a microsecond.

To be frank this sounds like a deal-breaker to me if it's true that you can't group the GPIO pins together into broader busses that would let you send a word at a time. (Some microcontrollers have the ability, for their GPIO pins or some subset of them to alternately act like a general purpose or memory bus.) The Raspberry Pi benchmarks at just under 42mhz as the absolute maximum speed it can toggle a single GPIO pin into a square wave and I have no reason to believe the Snapdragon is going to be much different than that. If you have to do 100 operations for every bus cycle that's going to put your effective bus speed at, what, around 400khz? That sounds like a pretty serious performance hit to me.

For larfs I downloaded the Qualcomm Dragonboard 410c "Peripherals Programming Guide" and while I won't pretend to understand much of it it seems like the fastest I/O method it documents is SPI, which the "BLSP" that controls the GPIO pins and whatnot can handle at up to 50mhz. (That is of course 50mhz *serial*.) Kind of seems to me that to really make this work you want an ARM chip that exposes the AHB and you interface to that.(*)

(* EDIT: Reading a little more this sounds like nonsense to me, it sounds like AHB is a "bus" that's used for interconnecting *on die* peripherals? Never mind, I have no idea what I'm talking about. That said, if that *is* true then perhaps SPI is in fact the best, most performant bus protocol that SoCs like this offer. And SPI doesn't seem to quite cut the mustard, performancewise.)

By the way, it’s about 1 microsecond per word access on Mac Plus, average of 3/4 us (I think) on SE, and 1/4 us or 3/16 us on SE/30 (don’t know if the glue logic imposes a wait-state).

That 290ns figure I found was in an extremely low-level discussion of the bus timing requirements of the 68000, IE, the speed at which you'd have to handle signal transitions in software if you were actually trying to "bit-bang" it with a raw GPIO port. In real terms, yes, the access speed of a Plus is far slower because, among other reasons, the hardware dedicates a non-negotiable half of the bus' available bandwidth to video refresh and of course the 68000 itself can't do a memory operation on every clock cycle. If your FPGA handles all that handshaking then, yes, the *effective* speed is what you need to target to be as good as a Plus. But it still sounds a bit ambitious to me to do in userland, at least.

ZaneKaminski said:
I think 50x faster than SE for register-register operations is easy. Do I need to say any more? You just write the shortest ARMv8-A algorithm in assembly to implement the given 68000 instruction. Even if it takes 10 instructions, the Snapdragon's Cortex-A53 runs at 1.2 GHz, with a 533 MHz memory bus, and can dispatch multiple instructions in one clock. 68030 takes at least two clocks to execute an instruction, and it has an unsophisticated instruction cache compared to the Cortex-A53's cache system.

I'm not going to disagree that, sure, it's undoubtedly *possible* for a hand-tuned assembly language emulator core running on a 1.2 GHz CPU to manage "50 times better" than the 8 MHz original on a totally synthetic test. I just feel like there is a *lot* of devil in the details here in terms of translating this into real performance, particularly with the overhead of this bus bridge.
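For a sense of scale, the synthetic register-to-register case can be put into numbers. This is only a sketch under loud assumptions: the emulator spends 10 ARM instructions per 68000 instruction (the figure quoted above), the Cortex-A53 retires just one instruction per clock (conservative, since it can dual-issue), the original runs at a nominal 8 MHz, and a simple 68000 register-to-register instruction takes 4 clocks. Bus accesses, cache misses and decode overhead are ignored entirely.

```python
# Synthetic register-to-register throughput estimate.
# Every constant here is an assumption for illustration, not a measurement.
ARM_CLOCK_HZ = 1.2e9          # Cortex-A53 core clock
ARM_INSNS_PER_68K_INSN = 10   # emulation cost quoted in the thread
ARM_INSNS_PER_CLOCK = 1       # conservative: ignores dual issue

M68K_CLOCK_HZ = 8e6           # nominal clock of the original machine
M68K_CLOCKS_PER_INSN = 4      # simple register-to-register instruction

emulated_rate = ARM_CLOCK_HZ * ARM_INSNS_PER_CLOCK / ARM_INSNS_PER_68K_INSN
native_rate = M68K_CLOCK_HZ / M68K_CLOCKS_PER_INSN
speedup = emulated_rate / native_rate

print(f"emulated: {emulated_rate / 1e6:.0f} M insn/s")
print(f"native:   {native_rate / 1e6:.0f} M insn/s")
print(f"speedup:  {speedup:.0f}x")
```

Under those assumptions the ratio is 60x, so "50 times" is plausible for this narrow case. The estimate says nothing about code that touches the bus bridge, which is the concern raised here.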

ZaneKaminski · Nov 24, 2016

Well, the bus speed doesn't work like you're saying.

There is the bit transition time, which can be specified in terms of the time it takes to rise from 10% to 90% or fall from 90% to 10% (of the logic level in question). That's what limits single-ended, non-terminated I/O like the GPIO pins to 40 MHz or so. It takes maybe 12 nanoseconds for the signal to rise and fall to the appropriate logic levels, and that limits how fast a pin can toggle.
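The link between transition time and maximum toggle rate can be sketched numerically. This assumes that one period of a square wave needs one rising and one falling edge, that each edge takes about 12 ns, and that the pin spends no time settled at either level, so it is an upper bound and not a spec value.

```python
# Upper bound on square-wave frequency when each edge takes t_edge seconds.
# Assumes a period is just one rise plus one fall, with no settled time.
def max_toggle_hz(t_edge_s: float) -> float:
    period_s = 2 * t_edge_s
    return 1.0 / period_s

f_max = max_toggle_hz(12e-9)
print(f"max toggle rate: {f_max / 1e6:.1f} MHz")
```

With 12 ns edges the bound comes out a little under 42 MHz, in line with the "40 MHz or so" figure.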

The limitation I was referring to earlier was basically the speed of the AHB bus (yes, it's an on-die interconnect) and anything else in between the processor core and the GPIO system... how long does it take to set a bit in a GPIO register, not how long it takes for the signal to transition from high to low or low to high after setting the bit.

Yes, we will have to wait 25 ns or so (1/40 MHz) for the bits to transition, but we will only have to do that a few times, not serially for each bit in the 32-bit quantities transmitted. The amount we have to wait when we set each bit is the delay imposed by the on-die bus between the GPIO and ARM core. At best, only one clock will be required to write to a GPIO register, but in practice, on average, it will be more. I don't know the exact figure, but I don't think it'll be more than 6 clocks.

So the total delay for a write operation, for example, is how long it takes to perform these operations in series:

Let t_GPIO be the time it takes to write to a GPIO register, let t_bit be the bit transition time, and let t_latch be the time required for the glue FPGAs to latch the value on the FPGA bus.

set 32 address bits (= 32 * t_GPIO)

set R/W signal (not M68k R/W, FPGA bus R/W) (= t_GPIO)

wait for bits to transition (= t_bit)

set A signal (= t_GPIO)

wait for A signal to rise (= t_bit)

wait for address bits latched (= t_latch)

unset A signal (tells address glue to latch) (= t_GPIO)

wait for A signal to fall (= t_bit)

set 32 data bits (= 32 * t_GPIO)

set R/W signal (= t_GPIO)

wait for bits to transition (= t_bit)

set D signal (= t_GPIO)

wait for D signal to rise (= t_bit)

wait for data bits latched (= t_latch)

unset D signal (tells data glue to latch) (= t_GPIO)

wait for D signal to fall (= t_bit)

set 32 control bits (to command control glue to perform the bus access) (= 32 * t_GPIO)

set R/W signal (= t_GPIO)

wait for bits to transition (= t_bit)

set C signal (= t_GPIO)

wait for C signal to rise (= t_bit)

wait for control bits latched (= t_latch)

unset C signal (this executes the command latched in the control glue) (= t_GPIO)

wait for C signal to fall (= t_bit)

Bus access is then performed by the control glue, in conjunction with the address glue and data glue.

So the total delay before that can happen is 105 * t_GPIO + 9*t_bit + 3*t_latch.
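The tally can be checked mechanically. The sketch below writes out the three phases (address, data, control) as lists of steps tagged with their cost symbol and counts them. The function names are made up for illustration; only the step order and cost symbols come from the list above.

```python
from collections import Counter

def latch_phase(strobe: str) -> list[tuple[str, str]]:
    """Steps to load one 32-bit value into a glue FPGA, tagged by cost symbol."""
    steps = [(f"set ADC bit {i}", "t_GPIO") for i in range(32)]
    steps += [
        ("set R/W signal", "t_GPIO"),
        ("wait for bits to transition", "t_bit"),
        (f"set {strobe} signal", "t_GPIO"),
        (f"wait for {strobe} signal to rise", "t_bit"),
        ("wait for bits latched", "t_latch"),
        (f"unset {strobe} signal", "t_GPIO"),
        (f"wait for {strobe} signal to fall", "t_bit"),
    ]
    return steps

def write_access() -> list[tuple[str, str]]:
    """Address, data, then control phase, in the order described."""
    steps = []
    for strobe in ("A", "D", "C"):
        steps += latch_phase(strobe)
    return steps

totals = Counter(cost for _, cost in write_access())
print(dict(totals))
```

Each phase contributes 35 GPIO writes, 3 transition waits and 1 latch wait, which gives the 105, 9 and 3 in the formula.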

For t_GPIO = 6 cycles @ 1.2 GHz = 5 ns, t_bit = 12 ns, t_latch = 5 ns, that's 648 nanoseconds required to command a write access. Not great but not that terrible. It may be faster than I estimate. I think 6 cycles might be an overestimate for how long the GPIO accesses take.
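Since the GPIO write cost is the uncertain term, it helps to see how the total moves with it. A minimal sketch, using the formula and the t_bit and t_latch guesses above; the cycle counts tried are only illustrative, not measured.

```python
CORE_HZ = 1.2e9  # Cortex-A53 clock from the thread

def write_setup_ns(t_gpio_ns: float, t_bit_ns: float = 12.0,
                   t_latch_ns: float = 5.0) -> float:
    """Delay before the control glue can start the bus access."""
    return 105 * t_gpio_ns + 9 * t_bit_ns + 3 * t_latch_ns

# Try a few guesses for how many core clocks one GPIO register write costs.
for cycles in (1, 3, 6):
    t_gpio_ns = cycles / CORE_HZ * 1e9
    print(f"{cycles} cycle(s) per GPIO write: "
          f"{write_setup_ns(t_gpio_ns):.1f} ns")
```

At 6 cycles this reproduces the 648 ns figure. The t_GPIO term dominates (525 of the 648 ns), so the real GPIO register write latency matters far more than the pin transition time.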

Earlier in this thread, I considered quad-SPI to connect the processor and the FPGA, but I didn't like the solution for various reasons. This way is better.

Anyway, I will release my schematic later tonight and you can see how I do it. None of the 68000 bus control signals are manipulated directly by the Snapdragon anymore. It will all be handled by the control glue FPGA.

Combined with running the whole thing in kernel mode, I think the GPIO and bus access speed will be adequate. Not great, but adequate.

We will see about the performance in a few months when I implement the emulator. I've got a fair amount of experience with ARMv6-M assembly, so ARMv8-A shouldn't be hard.

Gorgonops · Nov 24, 2016

Just out of curiosity, do you have a link to the particular documents you used to come up with your "I don't think it will be worse than six (1.2 GHz) clock ticks" estimate for what sort of signalling speed you're going to be able to get out through the GPIO pins? Frankly, this plan seems to rely heavily on an assumption that the source of delay is the signal transition time for the glue between the register and the pins rather than between the register and the CPU, and thus it follows that you can pound on the register at 1.2 GHz 100 times at light speed and it'll all come out in one sort of parallel 32-bit blob when it hits the FPGA... or maybe that's not what you're saying at all? (I.e., you seem to be arguing that you can make your ad-hoc parallel bus perform much better than the transition speed of a single pin because setting a pin isn't an atomic operation: you can do it "instantly" at the register and move on to the next while the slow-as-molasses downstream hardware propagates the state change, instead of the limiting factor being the speed of the GPIO controller itself.) Does the hardware manual actually *say* anywhere how fast the connection is between the GPIO glue and the core? I would bet a shiny nickel it's not at 1.2 GHz.

ZaneKaminski · Nov 24, 2016

Well, the bits of the address-data-control bus need not be valid all at the same time. Here's a timing diagram for an address loading operation, where the 32 address bits are latched into the address glue FPGA:

(this diagram strays from established conventions in some ways but whatever)

The data on the ADC bus need only be valid when A has a rising edge or maybe when it's high. Haven't decided yet. There will be a latch or flip-flop array in each FPGA that will store the 32 bits on the bus at the moment of ADC[31:0] capture.

(Edit: actually the input should be captured during the rising edge strictly, not just latched while A is high. That would make things faster, long story short.)
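To make the edge-versus-level distinction concrete, here is a small behavioral model of edge-triggered capture. It is a sketch of the idea only, not the FPGA design: the class name and the sample values are invented, and real hardware would also have setup and hold requirements that this ignores.

```python
class EdgeCapture:
    """Holds a 32-bit value sampled only on a 0 -> 1 strobe transition."""

    def __init__(self) -> None:
        self.prev_strobe = 0
        self.value = 0

    def sample(self, strobe: int, bus: int) -> int:
        if strobe and not self.prev_strobe:   # rising edge of the strobe
            self.value = bus & 0xFFFFFFFF
        self.prev_strobe = strobe
        return self.value

reg = EdgeCapture()
reg.sample(0, 0x00000000)
reg.sample(1, 0x00400000)   # rising edge: bus value is captured
reg.sample(1, 0xDEADBEEF)   # strobe still high, bus changing: ignored
reg.sample(0, 0x12345678)   # falling edge: ignored
print(hex(reg.value))
```

Because the stored value ignores the bus after the edge, the processor could start driving the next word while the strobe is still high. That is one plausible reason edge capture would be faster than a transparent latch.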

So yeah, as you're saying, we are not limited by n * (bit transition time) for an n-bit parallel bus. The limitation is n * (GPIO access time) + (bit transition time), and for almost any system, (GPIO access time) < (bit transition time).
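The two bounds can be compared directly. The sketch below uses the guesses from the earlier estimate in this thread (5 ns per GPIO register write, 12 ns per bit transition), which are assumptions until someone measures the real hardware.

```python
def serial_bound_ns(n_bits: int, t_bit_ns: float) -> float:
    """Cost if every bit had to finish its transition before the next write."""
    return n_bits * t_bit_ns

def overlapped_bound_ns(n_bits: int, t_gpio_ns: float, t_bit_ns: float) -> float:
    """Cost if the transitions overlap and only the last one is waited for."""
    return n_bits * t_gpio_ns + t_bit_ns

N_BITS = 32
T_GPIO_NS = 5.0   # assumed GPIO register write time
T_BIT_NS = 12.0   # assumed pin transition time

serial = serial_bound_ns(N_BITS, T_BIT_NS)
overlapped = overlapped_bound_ns(N_BITS, T_GPIO_NS, T_BIT_NS)
print(f"serial: {serial:.0f} ns, overlapped: {overlapped:.0f} ns")
```

With these numbers the overlapped case is 172 ns against 384 ns. The advantage disappears if the GPIO access time turns out to be as large as the transition time.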

Now, this operation has to be done 3 times in order to command a write operation on the PDS bus. Once to set the address, once to set the data, and then one final access to give the control glue the command. Reading is similar but then the data needs to be gotten out of the data glue FPGA after the bus operation is complete, instead of being set before the operation.

The overall speed of the interconnect between the GPIO and the core is probably not 1.2 GHz, but I would be very surprised if it weren't in the hundreds of millions of transfers per second. The information to figure it out exists in some document, but it's likely privileged information that I can't access. Next week, my DragonBoard 410c will arrive and I will start screwing around and testing.
