
Serious proposal: accelerator and peripheral expansion system



Why not make something like the Vampire II for Amiga work with the 68k Macs, instead of reinventing the wheel with some kind of emulation running on ARM?

It's 68000, 68020, 68030, 68040, 68060 and ColdFire compatible and is almost twice as fast as the fastest real 68060 out there.

 

I know that they are making a standalone version and are thinking about adding the interrupts to make it work in Ataris. Why not add some 68k Mac support as well? I'm not that into 68k stuff, so I don't really know what's needed.

 

About the Vampire FPGA core, for those of you not familiar with it: http://wiki.apollo-accelerators.com/doku.php?id=apollo_core



 

I tried to talk to the developer of the Apollo core. Originally, I wanted to license his core and synthesize it in a Cyclone III or IV FPGA. He never answered. I thought I could maybe develop a 68k-compatible core myself and synthesize it in an FPGA, but I'm not really interested in that.

 

So the disadvantage of doing the emulation in software is that the cost of the accelerator is higher (maybe even $50 higher), but the advantage is that running Linux on a modern processor allows a lot of flexibility. Moreover, the emulator, when finished, will be way faster than the Apollo core, especially for operations that don't touch main memory. Register-register operations, once translated, will happen almost as fast as native ARM code doing the same. Way faster than a 68060.

 

What I would like is not only having the 68K acceleration, but also having a WebKit module in the CPU that is addressable in slot-space. So a modern browser could be written to take advantage of the hardware-accelerated WebKit engine ;)

 

Don't worry lol, I've been considering that. It would be great with the 8-bit grayscale adapter on the SE and SE/30. 512x342 is plenty of pixels to display Android, and it's well supported on the Snapdragon 410. Now, I don't plan to get the thing running Android myself (barring some kind of change in plans), but it is very possible to get on the web like that.

Edited by ZaneKaminski

I've been vaguely following this for a while, and I have to admit I'm still sort of stuck questioning the viability of doing the emulation in software on top of Linux, at least with any level of performance. Essentially what you're building here is an in-circuit emulator for a Motorola 68000-family chip, and while I see you have abandoned the idea of literally bit-banging the individual lines with a Linux process, I do wonder exactly how much "administrivia" the main CPU will have to do behind the FPGAs you're adding on. General-purpose Linux is sort of terrible at real-time work, and according to the couple of white papers I've googled up even RT Linux typically has thread-switching latency in the "tens of microseconds" ballpark. (By comparison it looks like a memory read cycle on an 8 MHz 68000 takes about... 290 nanoseconds?) Seems to me those FPGAs are going to have to be doing a *lot* of work to abstract the bus so the CPU won't be spending massive amounts of time thrashing about watching for handshaking signals, etc.

Also, practically speaking, once you have this CPU bus glue how do you intend to structure the communication not only between the motherboard and the emulator, but *generally*? I mean, when a user is running a MacOS program on this setup will the code still reside in the memory on the motherboard and for every word executed the emulator will be hitting this bus bridge, pulling the instructions (presumably into a cache), hopefully executing it faster than the original CPU would, and writing any changed results back to the board (will the cache always be write-through to ensure you don't get tripped by self-modifying code, or are you going to try to be smarter about it?)? Or is your intention to use the Mac's motherboard *just* for I/O and the MacOS container is going to reside entirely on the onboard RAM on your little Linux core, with *Linux itself* handling all the I/O through the bus bridge? It's an important distinction, because I seriously doubt you're going to see much in the way of a performance boost unless you do the latter, and doing the latter is going to mean you're going to have to, in essence, create Linux drivers for every hunk of hardware in the "Zombiefied" Mac (a Linux computer ate its brain), which is going to be really entertaining for picky timing-critical pieces like the IWM.

Also have to admit I'm sort of skeptical about your performance estimates for how fast of a Motorola 68000 emulator you can do on your ARM SoC, but if you have some code you're just waiting to demo I'd be happy to be proved wrong. So far as I'm aware there's no JIT for BasiliskII on ARM yet but apparently there's some JIT support in UAE. Don't see anyone saying it's way faster than a 68060, though.

To be clear, if you're sure that you've got all these issues worked out then by all means don't let me stop you. I guess I'm just vaguely wondering if there's some precedent to a project like this? I mean, I've seen the Vampire accelerator, and there's also a project out there that adds features (and I *think* some level of acceleration, but I could be wrong about that; I'm too lazy to find the links right now) to 6502-based computers via an FPGA packaged in a board that fits a 40-pin DIP socket, but I don't believe I've seen anything that tries to use a general-purpose CPU to try to replace(*), let alone accelerate, a system.

 

(* Some In-Circuit emulator systems might do this, I guess, but they generally target microcontrollers.)

On the topic of using this to add all sorts of enhanced features, including web access via the Linux you've put on board, well, I'm going to toss this out there: Have you by any chance heard of the Apple II Pi? It's a devious bundle of software that lets you take a Raspberry Pi, cable it up to an Apple II (or in fact plug it into a slot using a little interposer card that's just a really brain-dead serial port), and boot up a software disk which redirects all keyboard/joystick/mouse I/O to the Pi and lets them act as those peripherals for the Linux machine. It even allows access to the floppy drives attached to the host so disks can be read and used for the Apple IIgs emulator that runs on the Pi side, so overall it makes the setup look like a massively expanded Apple IIgs (IE, 15MB of RAM, hard disks, network card, the works). The illusion is so convincing that many of the people who've bought the serial card are convinced that it must be doing a lot more with their Apple II than it really does. So in that vein: every Macintosh has two high-speed serial ports built into it, have you considered the idea of just interfacing your parasite Linux computer to that and creating a little extension that turns the Mac into an I/O slave? For every Mac that uses an external monitor the job is pretty much done at that point; just pair up with an HDMI to VGA adapter if you want to drive the original monitor and you're good to go.

For the toaster macs there are various ways to drive the original CRT from a modern ARM SoC; the link shows a sort of software-intensive way of doing it but I'm relatively sure I've seen other methods that can leverage LCD interfaces or even HDMI to do it with less overhead. How about for those you just put an interposer board between the analog board plug and the motherboard that can switch on a software command between the motherboard output and the video generated by your "accelerator"? Connectivity to the host Mac would consist of said connector and a little plug that snaked out of the case and plugged into one of the serial ports. On bootup the extension to slave-ify the Mac would load, it'd establish handshaking with the Linux computer (which itself might need some time to boot), and then the screen goes *blink* briefly as control is handed off. For all practical purposes you've accomplished the same thing but you don't have to target the CPU socket.

Edited by Gorgonops


 

Let me explain how I will structure the software. This should address the concerns you have raised, Gorgonops.
 
The emulator engine itself should be a chunk of C code with few, if any, external dependencies, in terms of libraries and the OS it must run on.
Of course, the entire RAM and ROM of the Macintosh should be cached by the emulator. RAM writes don’t even need to be written back to the Mac except in the case of VRAM/video region writes. (Therefore other DMA devices on the PDS bus are not supported.)
 
The emulator can keep a big table, the “address space table,” of 4 KB pages of the entire 4 GB memory space of the Macintosh (can maintain different tables per function code as well). These tables will be a bit big, 1M x (size of entry), but that’s okay when you have 1 GB of RAM. Each entry in the address space table will tell how to access the given page of memory. So by changing the address space table, you can change whether an access by the emulator to a certain region of memory results in an access of the RAM cache, ROM cache, or an actual bus access (such as for VRAM or an I/O device). (The translator software itself could also use this table to store flags about the translation status of a page, but I have determined it would be better to use a different type of structure.)
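Here's roughly what I mean, as a C sketch (the names and exact entry layout here are just illustrative, not the final design):

```c
#include <stdint.h>

/* One entry per 4 KB page of the 68k's 4 GB address space: 1M entries. */
enum page_kind { PAGE_BUS = 0, PAGE_RAM_CACHE, PAGE_ROM_CACHE };

typedef struct {
    uint8_t  kind;  /* how accesses to this page are serviced */
    uint8_t *host;  /* cached copy in Snapdragon RAM (NULL for PAGE_BUS) */
} aspace_entry;

#define PAGE_SHIFT 12
#define NUM_PAGES  (1u << (32 - PAGE_SHIFT))   /* 1,048,576 pages */

static aspace_entry aspace[NUM_PAGES]; /* zero-init: everything hits the bus */

/* Stand-in for the slow path that forwards the access to the FPGA glue. */
static uint8_t bus_read8(uint32_t addr) { (void)addr; return 0xFF; }

static uint8_t emu_read8(uint32_t addr)
{
    const aspace_entry *e = &aspace[addr >> PAGE_SHIFT];
    if (e->kind == PAGE_BUS)
        return bus_read8(addr);       /* VRAM or an I/O device */
    return e->host[addr & 0xFFFu];    /* RAM/ROM cache hit, no bus traffic */
}
```

Zero-initializing the table makes every page default to a real bus access; pages get switched to the RAM/ROM cache entries as they're set up.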
 
The emulator engine must be statically linked to a certain front-end in order to actually do anything. (I specify static linking because some external segment loader or whatever must do the dynamic linking, and I want to avoid any external dependencies. No need for dynamic linking.) So I plan for two front-ends.
A desktop front-end would support debug and test of the emulator engine in one of the typical Linux desktop environments.
An in-circuit emulator (ICE) front-end would support use of the emulator as an accelerator or coprocessor in an existing M68k system like the Macintosh.
 
The ICE front-end would be constructed as a kernel module, in order to eliminate the context-switch overhead. Now, I have totally eliminated any operation of the M68k bus by the Snapdragon itself. The Snapdragon talks to 3 FPGAs over the multiplexed, 32-bit-wide “FPGA bus,” and they coordinate among themselves the M68k bus access. (What enabled this was my discovery of the very affordable MachXO FPGA family.) Therefore, I have made the FPGA-Snapdragon bus design fully asynchronous. So you could, from userspace, operate the FPGA bus and command an M68k bus operation, but it would be slow, and you probably couldn't operate the IWM, as you say, Gorgonops. So for practical purposes, it has to be done from a kernel module.
 
Now, there still is a bit of a bandwidth problem, but maybe not the one you would expect. The Snapdragon 410 has 122 GPIO pins and 122 GPIO pin registers. What I mean is that you can’t do a 32-bit-wide write to one register and set 32 GPIOs at once. No, you have to do 32 separate accesses. To command a bus access for a 68020+ system, some 32 pins need to be set 3 times each, plus a few control signals. So that’s 100 memory operations, plus some waiting because of rise/fall times. I dunno how fast the Snapdragon’s GPIO I/O interconnect is, but these 100 writes to I/O memory shouldn't take more than half a microsecond.
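To illustrate the per-pin limitation (the register layout here is made up for the sketch; the real Snapdragon GPIO map will differ):

```c
#include <stdint.h>

/* Hypothetical memory-mapped GPIO block: one register per pin, as
 * described above -- there is no grouped 32-bit "port" write. */
static volatile uint32_t *gpio_out;

/* Drive a 32-bit value onto the FPGA bus pins, one register write per bit. */
static void fpga_bus_put32(unsigned base_pin, uint32_t value)
{
    for (unsigned i = 0; i < 32; i++)
        gpio_out[base_pin + i] = (value >> i) & 1u;  /* 32 separate stores */
}
```

Calling something like this three times per bus access (address, data, control), plus the strobe writes, is where the ~100 register operations come from.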
 
By the way, it’s about 1 microsecond per word access on Mac Plus, average of 3/4 us (I think) on SE, and 1/4 us or 3/16 us on SE/30 (don’t know if the glue logic imposes a wait-state).
 

 

So in conclusion, the main hit will be in I/O performance to the peripherals and video memory, but I don’t think it’ll be too bad, since everything else will be faster. Write operations can also be sort of pipelined to improve performance, which will benefit video writes. Video reads can come from a cache in the Snapdragon's memory. So the worst case is that the I/O performance of an SE/30 or IIci will only be as good as a Mac SE's. But again, I think I can make it faster than that by pipelining the write operations.

 



 

 

That stuff is cool but it's not really the direction I want to go in. My aim with the Maccelerator is for me to personally learn something, with the added benefit that there's something tangible to show off. Maybe you could kinda achieve the same effect by taking over the screen and I/O, but that's not my aim.

 

 

Let me say a little more about the performance.

 

I think 50x faster than SE for register-register operations is easy. Do I need to say any more? You just write the shortest ARMv8-A algorithm in assembly to implement the given 68000 instruction. Even if it takes 10 instructions, the Snapdragon's Cortex-A53 runs at 1.2 GHz, with a 533 MHz memory bus, and can dispatch multiple instructions in one clock. 68030 takes at least two clocks to execute an instruction, and it has an unsophisticated instruction cache compared to the Cortex-A53's cache system.
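For example, here's a C model of what a single register-register instruction like ADD.W D1,D0 costs the emulator (my own simplified sketch, not the actual translator; the X flag and instruction dispatch are omitted):

```c
#include <stdint.h>

/* Simplified model of the 68000 "ADD.W D1,D0" on a register file. */
typedef struct {
    uint32_t d[8];  /* data registers D0-D7 */
    uint8_t  ccr;   /* condition codes, simplified to the low bits here */
} m68k_regs;

enum { CCR_C = 1, CCR_V = 2, CCR_Z = 4, CCR_N = 8 };

static void add_w_d1_d0(m68k_regs *r)
{
    uint16_t a = (uint16_t)r->d[0], b = (uint16_t)r->d[1];
    uint16_t sum = (uint16_t)(a + b);

    /* .W size: only the low 16 bits of D0 are written */
    r->d[0] = (r->d[0] & 0xFFFF0000u) | sum;

    r->ccr = 0;
    if ((uint32_t)a + b > 0xFFFFu)       r->ccr |= CCR_C;  /* carry out */
    if (~(a ^ b) & (a ^ sum) & 0x8000u)  r->ccr |= CCR_V;  /* signed overflow */
    if (sum == 0)                        r->ccr |= CCR_Z;
    if (sum & 0x8000u)                   r->ccr |= CCR_N;
}
```

A handful of ALU operations and flag tests, all register-file traffic and no memory accesses, which is why translated register-register code can run at close to native ARM speed.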

 

Instructions which access memory will be harder to get to 50x faster, but I'm sure it's doable when the emulator runs as a kernel module, basically hogging a whole core.

 

So I am ignoring the speed at which the chipset can be accessed in this speed analysis, but I don't think it will be a huge deal, especially if write operations can be pipelined to accelerate video writes.

Edited by ZaneKaminski

 

To be frank this sounds like a deal-breaker to me if it's true that you can't group the GPIO pins together into broader buses that would let you send a word at a time. (Some microcontrollers have the ability for their GPIO pins, or some subset of them, to alternately act as a general-purpose or memory bus.) The Raspberry Pi benchmarks at just under 42 MHz as the absolute maximum speed at which it can toggle a single GPIO pin into a square wave, and I have no reason to believe the Snapdragon is going to be much different. If you have to do 100 operations for every bus cycle that's going to put your effective bus speed at, what, around 400 kHz? That sounds like a pretty serious performance hit to me.
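As a sanity check on that figure, taking the numbers above at face value (~42 MHz per GPIO operation, ~100 operations per bus cycle):

```c
/* Back-of-envelope: effective 68k bus rate when every bus cycle costs
 * ops_per_cycle individual GPIO register operations. */
static double effective_bus_hz(double gpio_op_hz, double ops_per_cycle)
{
    return gpio_op_hz / ops_per_cycle;
}
```

That works out to about 420 kHz.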

 

For larfs I downloaded the Qualcomm Dragonboard 410c "Peripherals Programming Guide" and while I won't pretend to understand much of it, it seems like the fastest I/O method it documents is SPI, which the "BLSP" that controls the GPIO pins and whatnot can handle at up to 50 MHz. (That is of course 50 MHz *serial*.) Kind of seems to me that to really make this work you want an ARM chip that exposes the AHB and you interface to that.(*)

 

(* EDIT: Reading a little more this sounds like nonsense to me; it sounds like AHB is a "bus" that's used for interconnecting *on-die* peripherals? Never mind, I have no idea what I'm talking about. That said, if that *is* true then perhaps SPI is in fact the best, most performant bus protocol that SoCs like this offer. And SPI doesn't seem to quite cut the mustard, performance-wise.)

 

 


 

That 290ns figure I found was in an extremely low-level discussion of the bus timing requirements of the 68000, IE, the speed at which you'd have to handle signal transitions in software if you were actually trying to "bit-bang" it with a raw GPIO port. In real terms, yes, the access speed of a Plus is far slower because, among other reasons, the hardware dedicates a non-negotiable half of the bus' available bandwidth to video refresh and of course the 68000 itself can't do a memory operation on every clock cycle. If your FPGA handles all that handshaking then, yes, the *effective* speed is what you need to target to be as good as a Plus. But it still sounds a bit ambitious to me to do in userland, at least.

 


I'm not going to disagree that, sure, it's undoubtedly *possible* for a hand-tuned assembly language emulator core running on a 1.2 GHz CPU to manage "50 times better" than the 8 MHz original on a totally synthetic test. I'm just feeling like there is a *lot* of devil in the details here in terms of translating this into real performance, particularly with the overhead of this bus bridge.

Edited by Gorgonops

Well, the bus speed doesn't work like you're saying.

 

There is the bit transition time, which can be specified in terms of the time it takes to rise from 10% to 90% or fall from 90% to 10% (of the logic level in question). That's what limits the GPIO pin performance to 40 MHz or so for single-ended, non-terminated I/O like the GPIO pins. It takes maybe 12 nanoseconds for the signal to rise and fall to the appropriate logic levels, and that limits the speed of the bus or whatever.

 

The limitation I was referring to earlier was basically the speed of the AHB bus (yes, it's an on-die interconnect) and anything else in between the processor core and the GPIO system... how long does it take to set a bit in a GPIO register, not how long it takes for the signal to transition from high to low or low to high after setting the bit.

 

Yes, we will have to wait 25 ns or so (1/40 MHz) for the bits to transition, but we will only have to do that a few times, not serially for each bit in the 32-bit quantities transmitted. The amount we have to wait when we set each bit is the delay imposed by the on-die bus between the GPIO and ARM core. At best, only one clock will be required to write to a GPIO register, but it will, in practice and on average, be more. I don't know what, but I don't think it'll be more than 6 clocks.

 

So the total delay for a write operation, for example, is how long it takes to perform these operations in series:

 

Let t_GPIO be the time it takes to write to a GPIO register, let t_bit be the bit transition time, and let t_latch be the time required for the glue FPGAs to latch the value on the FPGA bus.

set 32 address bits (= 32 * t_GPIO)

set R/W signal (not M68k R/W, FPGA bus R/W) (= t_GPIO)

wait for bits to transition (= t_bit)

set A signal (= t_GPIO)

wait for A signal to rise (= t_bit)

wait for address bits latched (= t_latch)

unset A signal (tells address glue to latch) (= t_GPIO)

wait for A signal to fall (= t_bit)

 

set 32 data bits (= 32 * t_GPIO)

set R/W signal (= t_GPIO)

wait for bits to transition (= t_bit)

set D signal (= t_GPIO)

wait for D signal to rise (= t_bit)

wait for data bits latched (= t_latch)

unset D signal (tells data glue to latch) (= t_GPIO)

wait for D signal to fall (= t_bit)

 

set 32 control bits (to command control glue to perform the bus access) (= 32 * t_GPIO)

set R/W signal (= t_GPIO)

wait for bits to transition (= t_bit)

set C signal (= t_GPIO)

wait for C signal to rise (= t_bit)

wait for control bits latched (= t_latch)

unset C signal (this executes the command latched in the control glue) (= t_GPIO)

wait for C signal to fall (= t_bit)

 

Bus access is then performed by the control glue, in conjunction with the address glue and data glue.

 

So the total delay before that can happen is 105 * t_GPIO + 9*t_bit + 3*t_latch.

 

For t_GPIO = 6 cycles @ 1.2 GHz = 5 ns, t_bit = 12 ns, t_latch = 5 ns, that's 648 nanoseconds required to command a write access. Not great but not that terrible. It may be faster than I estimate. I think 6 cycles might be an overestimate for how long the GPIO accesses take.
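Working that arithmetic out explicitly (same three-phase sequence as above):

```c
/* 3 phases x (32 bit writes + R/W + strobe set + strobe clear) = 105 GPIO
 * register writes; 3 transition waits per phase; 1 latch wait per phase. */
static double write_cmd_ns(double t_gpio_ns, double t_bit_ns, double t_latch_ns)
{
    return 105.0 * t_gpio_ns + 9.0 * t_bit_ns + 3.0 * t_latch_ns;
}
```

With t_GPIO = 5 ns, t_bit = 12 ns, t_latch = 5 ns, that's 525 + 108 + 15 = 648 ns.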

 

Earlier in this thread, I considered quad-SPI to connect the processor and the FPGA, but I didn't like the solution for various reasons. This way is better.

 

Anyway, I will release my schematic later tonight and you can see how I do it. None of the 68000 bus control signals are manipulated directly by the Snapdragon anymore. It will all be handled by the control glue FPGA.

 

Combined with running the whole thing in kernel mode, I think the GPIO and bus access speed will be adequate. Not great, but adequate.

 

We will see about the performance in a few months when I implement the emulator. I've got a fair amount of experience with ARMv6-M assembly, so ARMv8-A shouldn't be hard.

Edited by ZaneKaminski

Just out of curiosity, do you have a link to the particular documents which you used to come up with your "I don't think it will be worse than six (1.2 GHz) clock ticks" estimate for what sort of signalling speed you're going to be able to get out through the GPIO pins? Frankly this plan seems to rely heavily on an assumption that the source of delay is the signal transition time for the glue between the register and the pins rather than between the register and the CPU, and thus it follows that you can pound on the register at 1.2 GHz 100 times at light speed and it'll all come out in one sort of parallel 32-bit blob when it hits the FPGA... or maybe that's not what you're saying at all? (IE, you seem to be arguing that you can make your ad-hoc parallel bus perform much better than the transition speed of a single pin because setting a pin isn't an atomic operation, IE, you can do it "instantly" at the register and move onto the next while the slow-as-molasses downstream hardware propagates the state change, instead of the limiting factor being the speed of the GPIO controller itself.) Does the hardware manual actually *say* anywhere how fast the connection is between the GPIO glue and the core? I would bet a shiny nickel it's not 1.2 GHz.

Edited by Gorgonops

Well, the bits of the address-data-control bus need not be valid all at the same time. Here's a timing diagram for an address loading operation, where the 32 address bits are latched into the address glue FPGA:

post-6543-0-72556400-1479956070_thumb.jpeg

(this diagram strays from established conventions in some ways but whatever)

 

The data on the ADC bus need only be valid when A has a rising edge or maybe when it's high. Haven't decided yet. There will be a latch or flip-flop array in each FPGA that will store the 32 bits on the bus at the moment of ADC[31:0] capture.

(Edit: actually the input should be captured during the rising edge strictly, not just latched while A is high. That would make things faster, long story short.)

 

So yeah, it's as you're saying: we are not limited by n * (bit transition time) for an n-bit parallel bus. The limitation is n * (GPIO access time) + (bit transition time), and for almost any system, (GPIO access time) < (bit transition time).
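To put numbers on the two limits (using rough figures from earlier in the thread, t_GPIO around 5 ns and t_bit around 25 ns, as examples only):

```c
/* Per-bit serial limit: pay the transition time for every bit. */
static double per_bit_serial_ns(int n, double t_bit_ns)
{
    return n * t_bit_ns;
}

/* Latched parallel bus: pay the GPIO access per bit, the transition once. */
static double latched_parallel_ns(int n, double t_gpio_ns, double t_bit_ns)
{
    return n * t_gpio_ns + t_bit_ns;
}
```

For a 32-bit transfer that's 800 ns versus 185 ns.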
 

Now, this operation has to be done 3 times in order to command a write operation on the PDS bus: once to set the address, once to set the data, and then one final access to give the control glue the command. Reading is similar, but then the data needs to be read back out of the data glue FPGA after the bus operation is complete, instead of being set before the operation.

 

The overall speed of the interconnect between the GPIO and the core is probably not 1.2 GHz, but I would be very surprised if it weren't in the hundreds of millions of transfers per second. The information to figure it out exists in some document, but it's likely privileged information that I can't access. Next week, my DragonBoard 410c will arrive and I will start screwing around and testing.

Edited by Bunsen
https://wiki.68kmla.org/index.php/68kMLA:Forum_Rules#No_excessive_quoting

Version 2 of the Mac SE accelerator schematic is done. I've attached it to this post. SE/30 will be done tomorrow or Friday.

 

Maccelerator-SE.pdf

 

This version is a bit transitional, in that it uses the Variscite Dart 410 module, when I actually plan to use the Intrinsyc Open-Q 410. I just can't see the pinout of the Open-Q 410 until I receive mine. Only customers with a serial number can see that info. Ugh.

Edited by ZaneKaminski

Thank you for the encouraging words, sstaylor.

 

The coolest feature I've added in this version is certainly the display sub-board connector. I've added a low-profile, shielded, 30-pin board-to-board connector that breaks out the MIPI-DSI display interface. Looks like this:

post-6543-0-93911400-1479832475.jpg

 

Very nice, and I'd like to ditto sstaylor's encouragement.

 

Not to derail things, but linkage to the source of those cables and the matching board connectors would be greatly appreciated. They'd make a couple or three of my own development projects a lot easier than originally planned! :approve:

Edited by Trash80toHP_Mini

The picture I posted wasn't of the exact series I'm using. What I plan to use is from the Hirose DF40 series. They're also used in the Snapdragon modules I'm looking at.

post-6543-0-03367500-1480012110_thumb.jpg

 

Here's a link to the series on Digi-Key:

http://www.digikey.com/product-search/en/connectors-interconnects/rectangular-board-to-board-connectors-arrays-edge-type-mezzanine/1442154?FV=ffec4097%2Cfff40016%2Cfff8016a&mnonly=0&newproducts=0&ColumnSort=0&page=1&stock=0&pbfree=0&rohs=0&quantity=0&ptm=0&fid=0&pageSize=500

 

There are shielded and unshielded models available. The shielded ones are better for high-speed signal applications, but they are available in fewer sizes. If your signals are in the tens of MHz or less it should be fine to use an unshielded one as long as you sprinkle around enough ground reference pins.

Edited by Bunsen

Thanks much, Zane, very helpful. Not to be too lazy, but if you have a chance, a direct link to the appropriate cables page would be greatly appreciated. :)

 

LOVE your project, way over my head, but fascinating all the same. One suggestion: you might look into building your card for the IIci Cache Slot interface. There are plenty of adapters available to reverse-engineer for the PowerCache accelerators in the 68k Mac lineup, including a fairly simple one for the SE/30 and IIsi.

 

Cloning the Daystar PDS Adapter for the SE/30 - Take 2

 

Targeting the IIci cache slot interface would open up your user base to all who already have PowerCache accelerators on hand for all their machines, including PDS slotless Macs like the Mac II, IIx and IIcx.

 

If I can find a link to the PowerCache Adapter tree graphic or find it on disk, I'll post that here if you'd like.

 


I think it should be very straightforward to port the schematic for the SE/30 to the IIci's cache slot. It has basically the same signals as the SE/30 PDS. Of course, someone else has to design the board... I just don't have time to do it, nor do I have a IIci.

 

For the rest of the Mac II series and any other NuBus-equipped Macs, the design of the bus interface FPGAs has to be redone to talk to the Mac's hardware over NuBus, but everything above the bus interface (Snapdragon, emulator, Linux system, etc.) should work without modification, as long as the FPGA-Snapdragon interface remains the same.


Oh, cables. You mean the flat-flex kind? I dunno much about those.

post-6543-0-98971700-1480016316.jpg

 

I was just going to mount the connector onto the accelerator board, and then the display sub-board would mate to it and be secured with a few screws. No cables, just the connector sockets mating the boards directly.

 

My understanding with the flat-flex cables is that you get the cables and then solder on the connector. I think the cables can be custom-made, similar to how a PCB is designed and produced.

 

There are also rigid-flex PCBs, where the internal layers are basically a flat-flex cable, and you can decide which portions are rigid and which are flexible.

post-6543-0-94199800-1480016546_thumb.jpg

These are really expensive though, totally out of range for hobby stuff.


Thanks for the info, I'll have to look into flex cable possibilities.

 

I think it should be very straightforward to port the schematic for SE/30 to the IIci's cache slot. It has basically the same signals as the SE/30 PDS. Of course, someone else has to design the board... I just don't have time to do it, nor do I have a IIci.

 

The PowerCache adapter for the LCIII is completely passive, just resistors and caps on the board. If you've got an LCIII, I've got a couple of those boards on hand, so if you're in the US and you'd like one, it's yours for the asking. That combo adds up to a IIci.


Thanks, but no thanks. Like I said, I don't have time to do a board for IIci, but anyone is welcome to try to adapt what I have once I've made more progress. It's also a little early in the project for me to be accepting hardware donations lol, maybe once I have some kind of proof-of-concept working. That's months away though.

 

First, I will do the Mac SE board, and then move on to the SE/30. SE is easier since the 68000, compared to the 68030, has fewer instructions, a smaller address space, and no MMU. The MMU emulation especially will be complicated.

 

The 68000 has 24 address bits plus 3 function code bits, so that's a 27-bit address space at maximum, or 128 Mbytes. Dividing that into regions as fine as 256 bytes, that's only 512k entries in the address space table. So the overall address space table size will be small, and the immediacy of checking a region in the address space table lends to good performance for memory accesses.
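A minimal sketch of such a flat table in C, assuming one byte per entry; the region kinds and function names are illustrative, not the actual emulator design:

```c
#include <stdint.h>

/* 24 address bits plus 3 function-code bits give a 27-bit space
 * (128 MB); at 256-byte granularity that is 2^27 / 2^8 = 2^19 = 512K
 * entries, so one byte per entry keeps the whole table at 512 KB. */
enum region_kind { REGION_UNMAPPED = 0, REGION_RAM, REGION_ROM, REGION_IO };

uint8_t region_table[1u << 19]; /* 512K entries */

/* Index = the full 27-bit (FC, A23..A0) address divided by the
 * 256-byte region size. */
unsigned region_index(uint32_t addr, unsigned fc) {
    uint32_t full = ((uint32_t)(fc & 7u) << 24) | (addr & 0xFFFFFFu);
    return full >> 8;
}

/* One array load classifies any access -- no tree walk needed. */
enum region_kind classify(uint32_t addr, unsigned fc) {
    return (enum region_kind)region_table[region_index(addr, fc)];
}
```

The immediacy is the point: every emulated memory access costs a single indexed load to find out whether it hits RAM, ROM, or hardware.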

 

A 68030-compatible emulator for SE/30 has to implement the 68030 MMU, which has this tree structure that it traverses to find out how to translate a given address. Not that hard to implement or anything, but it may impose an interesting performance penalty... cached translations of the M68k code may have to be purged from memory when the translation tree is changed. Not sure about this part yet. Let me tackle Mac SE and MC68000 first.
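As an illustration of the tree-walk idea only: the real 68030 has configurable table levels and index widths (set by the translation control register), so this fixed two-level, 7+7-bit layout with an 18-bit page offset is a simplification, not the actual format:

```c
#include <stdint.h>
#include <stddef.h>

#define ENTRIES 128 /* 2^7 entries per level (illustrative) */

struct leaf { uint32_t phys_base[ENTRIES]; }; /* page frame addresses */
struct root { struct leaf *child[ENTRIES]; };

/* Walk the tree; a missing descriptor is reported as a fault (the
 * real chip would signal a bus error after the table search). */
uint32_t mmu_translate(const struct root *r, uint32_t vaddr, int *fault) {
    unsigned i0 = (vaddr >> 25) & (ENTRIES - 1); /* top 7 index bits  */
    unsigned i1 = (vaddr >> 18) & (ENTRIES - 1); /* next 7 index bits */
    const struct leaf *l = r->child[i0];
    if (l == NULL) { *fault = 1; return 0; }
    *fault = 0;
    return l->phys_base[i1] | (vaddr & 0x3FFFFu); /* add page offset */
}
```

The cache-invalidation worry follows directly from this: translated M68k code baked against one tree state becomes stale the moment the OS rewrites a descriptor.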

Edited by Bunsen

Gotcha! One other thing, though: did you know that when you're designing for the SE, you're also designing for the Plus and the Classic? If you have room on your board to parallel your SE Euro-DIN PDS connector with through-holes for a 68000 socket as shown, you're also tripling your user base. Now as then, in terms of development it's all about economies of scale.

 

micromac030160502.jpg

 

Since your board's profile will undoubtedly be far lower than that of this MicroMac accelerator, the CPU/PDS interface adaptation could probably be done with a directly soldered, header-interconnected daughtercard underneath your PDS card. It'll be interesting to hear from techknight whether something similar might be possible for the Portable's CPU interface.

 

One wonders if the marvels of rapid prototyping might pave the way for a reproduction of the Killy Klip interconnect hardware for this and many other 68000 board installations?

Edited by Trash80toHP_Mini

No idea. I have decided to put SE/30 on hold for now as well. The SE design is too young and unstable to turn it into a "master schematic" and fork that for each supported machine. That's what I was trying to do in the past few days, but when the schematic is changing so fast, that's a pain.
 
 
 
 
The only commitment I can make regarding the schedule is that I'll spend a lot of my spare time on the project haha. This is a learning experience for me, so I may make a ton of mistakes and it may take a while. I think the only way these types of projects get completed is by someone who is in it more for the learning than for the final product, y'know, the super Mac or whatever.
 
About the boards, I'll definitely release all of the source files eventually. So you could go to OSH Park or some other board fab and purchase the board yourself, solder the components on, program the FPGAs over JTAG, put the Snapdragon in the socket, and boot it from a microSD card with the software I'll provide. (The current design has a system controller that can bit-bang JTAG to update the FPGA configuration, but I don't plan on implementing that, so you'll need an external programmer.)
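For the curious, bit-banged JTAG boils down to something like this sketch, with plain variables standing in for real GPIO pin writes; all pin and function names here are placeholders, not the actual controller firmware:

```c
#include <stdint.h>

/* Mock GPIO pins; on real hardware these would be register writes. */
int pin_tck, pin_tms, pin_tdi, pin_tdo;

void set_tck(int v) { pin_tck = v; }
void set_tms(int v) { pin_tms = v; }
void set_tdi(int v) { pin_tdi = v; }
int  get_tdo(void)  { return pin_tdo; }

/* One JTAG clock: drive TMS and TDI, sample TDO, then pulse TCK.
 * The target samples TMS/TDI on the rising edge of TCK. */
int jtag_clock(int tms, int tdi) {
    set_tms(tms);
    set_tdi(tdi);
    int sampled = get_tdo();
    set_tck(1);
    set_tck(0);
    return sampled;
}

/* Shift n bits LSB-first through the selected register, raising TMS
 * on the final bit to leave the Shift state, and return what came
 * back on TDO. */
uint32_t jtag_shift(uint32_t out, int n) {
    uint32_t in = 0;
    for (int i = 0; i < n; i++)
        in |= (uint32_t)jtag_clock(i == n - 1, (out >> i) & 1) << i;
    return in;
}
```

An FPGA configuration load is then just a long sequence of these shifts driven through the TAP state machine, which is exactly why an external programmer is the easy road.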
 
Now, for at least SE, maybe more models, I will probably sell fully assembled units myself. I have explicitly avoided BGA parts, so I will be able to reflow the boards myself or hire a friend skilled in that stuff to do it. In order to do this, however, I will have to have a Kickstarter or something to raise the funds required to build the boards. If more than 25 or 40 boards are in demand, I will outsource production to China, since that's around when it becomes affordable.
 

 

Gotcha! One other thing though, was wondering if you knew that when you're designing for the SE, you're also designing for the Plus and the Classic. [...]

It's tempting, but there are a few notable electrical differences.
 
Firstly, the SE has this scheme where the ~8 MHz clock is something like 90 degrees out of phase with the ~16 MHz clock from which it is generated. I don't think the 128k through Plus do that.
 
If we replaced the 68000, we would also have to generate ECLK. It's only 783 kHz, so in no way fast, but clock routing in general is a pain, and doing it improperly is a common source of problems. In the last schematic I posted, I had the clocks connected to the system controller, thinking it might be useful to measure their frequency or run synchronously with them; I have since removed that feature. Maybe in the SE/30 design the system controller will have to take in the 16 MHz clock and make 32 MHz from it for the FPGA control logic, but there's no need to complicate things for the SE.
 
It is tempting to try to copy the shape of that MicroMac card, though... hmm. The mechanical engineering part, making sure it fits properly, is really not one of my strong areas. Also, mine may be thicker than the traditional accelerators; the Snapdragon module is a few millimeters tall, just like the 68030 on that board you posted.
 
Maybe once I have it all together for SE, I can try adding those extra pins, but I'm already wrestling with a lot of complexity as it is. The cost of the boards is also not going to be that much if I can manage to buy 25 or so boards for each supported machine. See my reply to asaggynoodle above about how I plan to build them.


Since I'm halting work on the SE/30 schematic for now, I'm moving on to the board for the SE. Maybe it will pick up 128k-Plus and/or Classic compatibility eventually, but I'm counting more on just doing another board for those. More boards means more work, but routing a board with a bunch of differently shaped footprints (DIP 68000, PGA 68000, and SE PDS) is harder too.


Actually, I've changed my mind. Supporting the 128k through Plus is more important than supporting the SE/30, right? So maybe I should drop the SE/30 entirely and support all the 68000 compacts. That's easier than supporting both the 68000 and the 68030.

 

Probably not, regarding your comment on the importance of support (more on that later). The SE/30 gang is probably the most invested segment of the 68k community in terms of interest, and CERTAINLY in terms of spending, to amp the little beast up.

 

Dunno if you can use the same SE/Plus/Classic setup for the 128k/512k/512ke. I had a board that worked on 128k through 512ke back in the day and now I've got this one for SE/Plus/Classic, but I don't know if the two fairly distinct generations are compatible or not. IMO, anything useful for the earlier models would need to have the added complication of RAM expansion on board to be of much use for anything practical. Such was the case with the NewLife(?) board for my 512k. Playing with those early models seems to me to be a joyous celebration of the impractical. So the generational compatibility issue would be moot from my point of view.

 

The beauty of work based on the MicroMac model and board layout is that it yields three-machine compatibility (at least) without doing different boards. Work based on the IIci Cache Slot model yields compatibility with dozens of 68030-based Macintosh models, using original DayStar units or redesigned versions of their PowerCache adapters, with a single board design.

 

You'd be doing one board with either/or connector installation for a minimum of three 68000-based Macs, and a single board for nearly the entire 68030-based family of the Macintosh.

 

Keeping those two board designs within the unofficial (68000) design constraints and Apple's spec for the IIci Cache Slot seems the obvious way to go to me. The notion of doing different boards for specific models, as opposed to just two boards for two entire families of models, gives me the heebie-jeebies.

 

But that's just me, so I'll be quiet now. :I


I know I said I'd be quiet, but a couple of things jumped out at me.

 

.  .  .  but there are a few notable electrical differences.

 

Firstly, the SE has this scheme where the ~8 MHz clock is like 90 degrees out of phase of the ~16 MHz clock from which it is generated. I don't think the 128k - Plus do that.

 

If we replaced the 68000  .  .  .

 

On the former, the MicroMac board would seem to indicate that the Plus handles things like the later models; that could be why I've seen the division of generations as I stated above.

 

In no case are we replacing the 68000: the SE has its Euro-DIN PDS connector, and the other board clips to all of the legs of the existing 68000  .  .  .  how's that for a direct slot? So much for Steve's slotless design edict! HEH! :D

 

OK, now I will be quiet  .  .  . ::)

