
Serious proposal: accelerator and peripheral expansion system



The issue is that on the Plus, there isn't that much room between the 68000 and the power/video connector.

 

But also, the problem is that SE users will want to mount theirs with the PDS, not on the 68000, even with the Killy Klip. And it's not like you can mount a horizontal SE-style card in the PDS with the accelerator in there.

 

Yep, clearance conflict between Apple's horizontal PDS card spec. and a Killy Klip card was exactly what I tried to explain. Not so good with words, I figured the pic would let others visualize the problem.

 

As for a NIC solution, you're already putting USB on the card. IIRC, I've got an inexpensive USB WiFi dongle the size of a wireless mouse nubbin knocking around the joint somewhere. Breaking USB out to a pair of connectors (one for the nubbin) on the expansion card coverplate might be an elegant solution for users to add networking to your accelerator.


The little WiFi dongle approach seems easy, but it's only easy when we run Linux, since the dongle probably has a Linux driver. But for my latest design, which is not supposed to run Linux, it's going to be too hard to develop a driver for the dongle. So we need something easier to work with, like the ESP8266 module.

Edited by ZaneKaminski

Ah! Like I said, this stuff is way over my pay grade to really be of help. Notions pop up in the noggin' and I just bounce 'em off you as I think of them.

 

Another silly question: would it be any easier if you could hand off the "slow" I/O bus to a co-processor in the manner Apple used a pair of 6502s in the IIfx for that task?

 

I'm thinking along the lines of getting the rPi crowd interested in taking on that end of the project to make entry level pricing for your basic Accelerator as inexpensive as possible? As I understand it, that gang has TONs of hardware and software hackage available for uses of all sorts.

 

 

 

edit: the thought here would be to allow you to grip the KISS principle tenaciously while opening the project up to all manner of feature creepage.

Edited by Trash80toHP_Mini

If your FPGAs require external configuration, you can probably load those up using the external ARM chip.

 

Just hold the RESET line low, to keep the machine stuck in reset while this task completes. Once the ARM loads the FPGAs, and the ARM is "booted" and ready to go, release the RESET line and let everything take off as it should. 

 

I would still love to see some sort of WebKit integration/acceleration to make web browsing actually "usable" on the SE and SE/30; that would really make my day. Otherwise, just running existing software faster than the stock machine does isn't useful to me, UNLESS it's emulating a 68020+ and can handle CFM68K. Then it becomes useful for app development for some of the things that I really want to do.

Edited by techknight

If your FPGAs require external configuration, you can probably load those up using the external ARM chip.

 

Just hold the RESET line low, to keep the machine stuck in reset while this task completes. Once the ARM loads the FPGAs, and the ARM is "booted" and ready to go, release the RESET line and let everything take off as it should. 

Yeah, that's easy with /RESET. In the old design, I had the System Controller in charge of pulling reset low, and it would time the boot sequence between the FPGAs and Snapdragon. Actually, it turns out that the Lattice iCE40 series of FPGAs does have internal configuration flash, but is more accommodating to external programming than the MachXO.

 

Only CPLDs are truly instant-on. FPGAs that don't require external configuration memory just have some internal flash and they usually take a few hundred microseconds to a few milliseconds to load that into their SRAM.

I would still love to see some sort of WebKit integration/acceleration to make web browsing actually "usable" on the SE and SE/30; that would really make my day. Otherwise, just running existing software faster than the stock machine does isn't useful to me, UNLESS it's emulating a 68020+ and can handle CFM68K. Then it becomes useful for app development for some of the things that I really want to do.

I really want WebKit too, but I'm afraid that'll have to be on a higher-end product, not a 68000 accelerator.

 

This setup isn't really capable of accelerating 68020+ systems. They have a larger address space, 68030 has internal MMU, more bus throughput, and users of these systems expect other features like greater color depth, etc. Trying to do all of that on this microcontroller would be too hard. As it is, the performance of a Macintosh SE with the accelerator will be faster than an SE/30, hopefully faster than a IIci, but probably not as fast as a system with a 40 MHz 68040.

 

An accelerator for a 68020+ system would have to have a microprocessor, DDR2/3 memory, etc. A system of that class has enough horsepower to run WebKit. This little STM32H7, without a real OS? I don't know, maybe there's some browser that will run on such a system, but I don't think it's WebKit.

 

I was trying to do this unified, overdesigned system giving certain (fancy) capabilities to any 680x0 Macintosh, but that just drives the price up for 68000 systems.

 

So in this latest iteration I'm working on, the 68000 systems are intentionally slower than any 68020+ design that might follow, which should use the Snapdragon 410 module, etc. 

 

However, you can figure out how to send over USB entire 512x342x8bit frames, maybe some keyboard and mouse data too, and then stick some other system running WebKit onto the USB bus to get online. It just can't be integrated for 68000 systems. Too expensive.

Edited by ZaneKaminski

Well what I meant was on a base 68000 system, the accelerator emulating a 68020+ on a 68000.

Nooo, unfortunately I never planned for that. Emulating a different CPU complicates things. I planned to use the ROM from the machine as-is, and the software in the ROM only works under the assumption of the presence of some particular chipset.

 

68030 systems all have, for example, the Apple Sound Chip, which is absent from 68000 systems. So that would need to be emulated, and accurate peripheral emulation is a project in and of itself. My aim is to emulate the exact same processor as comes in the machine, but seemingly executing many more instructions per clock.

 

What machine in particular did you want these more advanced features (web browser and 68020+) for?

 

For Plus and SE, if you want 68030, the solution is to upgrade to SE/30 and get the accelerator for that. For your Portable... there is no solution in the exact same form-factor. PB140/170 is the closest.

Edited by ZaneKaminski

I planned to use the ROM from the machine as-is, and the software in the ROM only works under the assumption of the presence of some particular chipset.

 

That just reminded me, dunno about other accelerators, but the Radius had a license for the Rocket to copy the crown jewels into its onboard memory to avoid the bottleneck of ROM access over NuBus. You'd mentioned maybe doing so in your project, dunno if you've made a decision yet, but I'd say it would speed things up considerably to have the Toolbox on board.


As it is, the performance of a Macintosh SE with the accelerator will be faster than an SE/30, hopefully faster than a IIci, but probably not as fast as a system with a 40 MHz 68040.

Just tossing this out there since I vaguely recall a target price for this thing in the one hundred dollar ballpark: the 68000 softcore in the MIST reconfigurable computer, which sells for about $200, is capable of speeds up to about 48 MHz. (And it seems to also have at least partial 68020 compatibility.) I have no idea how much just the FPGA in the MIST costs, or if there might be a smaller/cheaper one that has sufficient capacity to run the softcore and some bus glue while dispensing with the capacity to emulate the rest of the computer, but... if the performance bar is this low and you're using FPGAs for bus glue anyway, I can't help but wonder if an all-FPGA approach might be simpler and end up costing around the same.


have the Toolbox on board

Haha I'm way ahead of you on caching the ROM.

 

The STM32H7 has a "scattered memory architecture," with a bit over 1 Mbyte of SRAM but split into different banks and sizes to optimize throughput. And then there's the 32 Mbyte external SDRAM.

 

Firstly, the emulator software has to be stored in the 64 kbytes of "instruction-tightly-coupled-memory" (ITCM).

 

Any data structures, like trees, linked-lists, tables of pointers, etc. need to be stored in the first 64 kbyte bank of "data-tightly-coupled-memory" (DTCM). There is a second 64 kbyte bank of DTCM too.

 

The -H7 has a main SRAM of 512 kbytes, in which I planned to store the 171 kbyte (512x342x8bit) screen buffer for 8-bit grayscale on the compacts.

 

Other memories on the -H7 include two 128 kbyte banks and another 32 kbyte bank of SRAM in the "connectivity domain," another 64 kbytes in the "batch acquisition domain," and 4K of battery backup SRAM, like the Mac's clock chip has.

 

Then in the external memory will be stored the up to 9 Mbytes of RAM, up to 256 kbyte ROM, and then the full decoded cache for each, so 300% of 9.25 Mbytes, which is 27.75 Mbytes. The ROM will be loaded from the machine and fully decoded before boot. RAM is never written back to the Macintosh except VRAM, the cache of which would be in a write-through configuration.
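
A quick check of the external-memory arithmetic above, as a minimal sketch. The sizes come from the post; how the "300%" splits between the raw image and its decoded cache is an assumption, and the names are only illustrative.

```python
# Sizes in kbytes, taken from the figures quoted in the post.
RAM_KB = 9 * 1024      # up to 9 Mbytes of emulated RAM
ROM_KB = 256           # up to 256 kbytes of ROM
SDRAM_KB = 32 * 1024   # external SDRAM

# Assumption: "300%" means the raw image plus a decoded cache twice its size.
FOOTPRINT_FACTOR = 3


def sdram_budget_kb(ram_kb=RAM_KB, rom_kb=ROM_KB, factor=FOOTPRINT_FACTOR):
    """Return (used, free) external SDRAM in kbytes."""
    used = (ram_kb + rom_kb) * factor
    return used, SDRAM_KB - used


used_kb, free_kb = sdram_budget_kb()
print(used_kb / 1024, "Mbytes used,", free_kb / 1024, "Mbytes free")
```

Under those assumptions the 32 Mbyte part would be left with a little over 4 Mbytes of headroom.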

 

There is a lot of internal memory left, some of which will have to be devoted to a USB buffer. What's left, however, can be used to move select areas of the RAM, ROM, and their instruction word decodings into on-chip SRAM. This is much faster than external SDRAM.

 

So for a given ROM, some "hints" should be supplied that tell what to store in internal SRAM and what to store in external SDRAM.

 

Memory accesses are going to go through a doubly indirect tree-type structure. What I mean is that there are 256 64kbyte chunks in the 24-bit address space of the 68000, and within each 64 kbyte chunk, another 256 chunks of 256 bytes. So you can sorta make a tree (actually it's not technically a tree) out of this, where first you look at A[23:16] and take that offset into a table of pointers, and then use A[15:8] to do it again from there. That can lead to a routine to access memory at that location.

 

If I didn't explain it well enough, the gist is that each 256-byte region of memory can be put in a different place or accessed with a different method. So that's how addresses accessed by the virtual 68000 will be resolved into an actual 68000 bus access, an SRAM access, SDRAM access, etc.
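
For anyone who wants to see the two-level lookup spelled out, here is a rough Python sketch. The real thing would be tight code on the STM32H7, so this only shows the table layout; the 1 Mbyte RAM size, the handler names, and the 0xFF placeholder for unmapped pages are all made up for illustration.

```python
RAM_SIZE = 0x100000  # hypothetical: 1 Mbyte of cached RAM starting at address 0
ram = bytearray(RAM_SIZE)


def make_ram_reader(base):
    """Handler for one 256-byte page that is backed by the cached RAM."""
    def read(offset):
        return ram[base + offset]
    return read


def read_unmapped(offset):
    """Placeholder; a real design might start a 68000 bus cycle here."""
    return 0xFF


# Second-level table shared by every 64-kbyte chunk with nothing mapped.
UNMAPPED_CHUNK = [read_unmapped] * 256

# First level is indexed by A[23:16], second level by A[15:8].
top_table = [UNMAPPED_CHUNK] * 256
for chunk in range(RAM_SIZE >> 16):
    top_table[chunk] = [
        make_ram_reader((chunk << 16) | (page << 8)) for page in range(256)
    ]


def read_byte(addr):
    """Resolve a 24-bit address through both tables, then call the handler."""
    addr &= 0xFFFFFF
    handler = top_table[addr >> 16][(addr >> 8) & 0xFF]
    return handler(addr & 0xFF)
```

Because every 256-byte page has its own handler, moving a page from SDRAM to on-chip SRAM, or routing it to the real bus, would just mean swapping one table entry.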

 

The hints supplied to the emulator would basically break the ROM down into 256-byte chunks and tell which should be in SRAM and which should be in SDRAM.

 

Tuning the emulator will be a problem in and of itself though lol, lots of tweaking can be done to improve performance, especially for a specific application.


all-FPGA approach

I'm just not a CPU designer is the thing. I designed a toy CPU this semester for my computer architecture class. Not something I really wanna do again. My strength is mainly as a programmer.

 

The key problem in making this kind of low-cost, high-performance, eccentric thing is choosing which parts to be hardened and which parts to be softer, i.e. programmed, microcoded, indirect, etc. The STM32H7 has a hardened cache, for example. So even though memory accesses are gonna be doubly-indirect, the fact that it's done on a hard processor is advantageous.

 

The easy case is that processors are always hard and glue logic is implemented in CPLDs or something. The hard case is when you're trying to process this instruction stream for an outdated processor faster than it ever ran. How to balance the hard and soft aspects is a very complicated problem, dependent not just on the MC68000 ISA, but also the very specific details of the products available.

 

Just one tiny difference, for example, whether the qSPI interface supports transmission of the command code over all four data lines or just one, makes or breaks a certain design.

 

So this device you mention, if it supports, in theory, any CPU architecture (that can fit), then the trade-offs are different and so the fully-programmable FPGA approach is better for them. But again, they don't get hardened execution units and caches and all that.

 

So I think that this architecture is now adequately cheap ($100) and performs faster than a similarly priced accelerator with a better FPGA.

Edited by ZaneKaminski

Just for larfs I did a little digging regarding that "Vampire" accelerator for the Amiga 600, and for whatever it's worth the schematics/BOM/whatnot of the "prototype" version of it are on the web. (The website is pretty horribly organized and all the latest content is about the "V2" version of the accelerator, but the "About/Schematics/etc" entries on the sidebar are about said prototype.) Even if you stick with this using-a-microcontroller-for-the-actual-CPU idea I wonder if there might be something in there that could be useful to you. This early version didn't use the proprietary "Apollo" core (IE, the one that claims 3x the performance of the best 68060, which I honestly suspect is the sort of performance it would be difficult to match with that Snapdragon module you were looking at), it used the TG68 core from OpenCores running on a Cyclone II FPGA that seems to sell for about $25 on Mouser.

The best performance he claimed from this combination was only about two and a half times the native speed of the 68000 (versus the, what, 150+ times faster they're claiming from the Apollo core version), so I'm willing to grant that if you can work out the bus interface issues your HW+emulation hybrid might be capable of beating that. Nonetheless, well, I do wonder if there might be something there you can reuse; after all, he *did* successfully work out the bus logic to integrate an "alien" core and a 64MB RAM expansion into an accelerator in a 68000 socket using a relatively small and cheap FPGA. He has his home-grown work open-sourced here, perhaps there's something in there that might save you some work designing your own bus logic.

 

Of course, maybe you noticed that yourself already and are already leveraging it and I failed to see while scanning the thread, in which case... never mind. :)

 

It is a shame that the Apollo Core people seem to be so difficult to work with. I do gather from some of the chatter on the forums for their product it's in part because they have "investors" they're working with that might be trying to turn their work into a commercial ASIC, which means all sorts of disclosure/need-to-know issues crop up.

Edited by Gorgonops

Nooo, unfortunately I never planned for that. Emulating a different CPU complicates things. I planned to use the ROM from the machine as-is, and the software in the ROM only works under the assumption of the presence of some particular chipset.

 

68030 systems all have, for example, the Apple Sound Chip, which is absent from 68000 systems. So that would need to be emulated, and accurate peripheral emulation is a project in and of itself. My aim is to emulate the exact same processor as comes in the machine, but seemingly executing many more instructions per clock.

 

What machine in particular did you want these more advanced features (web browser and 68020+) for?

 

For Plus and SE, if you want 68030, the solution is to upgrade to SE/30 and get the accelerator for that. For your Portable... there is no solution in the exact same form-factor. PB140/170 is the closest.

 

Darnit.... Well.. that blows my plans to bits. 

 

I need a 68020+ accelerator/CPU for the Portable, which is a 16 MHz base 68000. You need a 68020 or higher to support the CFM68K runtime.

 

Reason I need a CFM68K runtime is so I can run RealBASIC applications which compile using that runtime. No, I cannot program in C or any other language very well except BASIC. I was going to write an MP3 player software that runs on my portable which controls a decoder card that I designed. 

 

I built a decoder card with an MP3 DSP and an Atmel MCU that I was going to interface with the system using the PDS or even the serial port. The one I have right now is serial based, and the storage is actually an SD card on the decoder itself.

 

Problem is, I can't write the software for the Mac to control the card because, well, it won't run CFM68K apps.

 

So I need 68020 emulation on the 68000 hardware. With a real CPU, you need to do some SIZ0, SIZ1, A0, A1 decoding to LDS/UDS, and of course the E clock and VPA/VMA logic has to be created so a 68020+ CPU can run on 68000 hardware. Those two things are all it takes to make a 68020 run in a 68000 board.
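
The strobe decoding mentioned above can be sketched roughly like this, assuming the 68000 board is treated as a 16-bit port and using the 68020's SIZ1:SIZ0 encoding (01 byte, 10 word, 11 three bytes, 00 long word). The function name is hypothetical, real glue would live in a CPLD or FPGA rather than Python, and this sketch leaves A1 out because it only models the two strobes.

```python
def data_strobes(siz1, siz0, a0):
    """Return (uds, lds) for a 16-bit port; True means the strobe is asserted.

    UDS selects the even byte and LDS the odd byte, as on the 68000.
    """
    byte_transfer = (siz1 == 0 and siz0 == 1)
    uds = (a0 == 0)                        # even address always uses the upper lane
    lds = (a0 == 1) or not byte_transfer   # odd byte, or anything wider than a byte
    return uds, lds


# A byte write to an even address only strobes the upper byte.
print(data_strobes(0, 1, 0))
# A word access at an even address strobes both bytes.
print(data_strobes(1, 0, 0))
```

This covers only the strobe half of the job; the E clock and VPA/VMA timing is a separate piece of logic and is not modeled here.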

Edited by techknight

68020+ accelerator/CPU for the Portable

Well maybe it wouldn't be too hard. There are still unsolved problems though. Let me explain.

 

The differences between the 68000 and 68020 in terms of the programmer's model are an extra 8 address bits, longword ALU operations, and the cache control registers. So all that needs to be done to add 68020 support is:

  • Either do another tree lookup or ignore top 8 address bits (as the Mac II's "Apple HMMU chip" does)
  • Longword ALU operations should be trivial once word-wide operations are implemented
  • Cache is irrelevant, just have to implement dummy cache control registers.

So it's not too hard to add 68020 support. 68030 is harder since it always comes with the MMU.
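
The "ignore top 8 address bits" option from the list above amounts to a single mask before the address lookup. A minimal sketch, with a made-up function name:

```python
ADDRESS_MASK_24BIT = 0x00FFFFFF


def to_24bit(addr32):
    """Drop A[31:24] so a 32-bit 68020 address lands in the 24-bit space."""
    return addr32 & ADDRESS_MASK_24BIT


print(hex(to_24bit(0x40812345)))
```

The alternative (a third table level indexed by the top 8 bits) would cost one more lookup per access but would allow a true 32-bit map.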

Now, what 68020 system to emulate? Either Mac II or LC. Mac II sounds better since it's older.

Anyway, the relevant differences between an SE and Mac II, in terms of chipset, are:

  • Different memory map, even in 24-bit mode
  • NuBus
  • No onboard video
  • Has 68881
  • Has Apple Sound Chip

These need to be translated into stuff on the Mac SE in order to emulate the Mac II with its hardware peripherals. Memory map, NuBus, and no onboard video can be resolved just by setting up the address space tree (in terms of what to cache and what to go over the bus to get). 68881 emulation shouldn't be too hard, since most operations can just be translated into equivalent ARMv7-M FPU operations. Apple Sound Chip emulation may be difficult, since the emulated output of the ASC has to be mapped to sounds that the 68000 Macs can produce through their more limited setup. Maybe it’s impossible and the accelerator should have a DAC and a little amplifier.

Sound aside, the conclusion is that it shouldn't be too hard to emulate a Mac II on top of SE hardware.

 

But for Portable? You need to cobble together a ROM that runs on a Macintosh II with a 68020, but supports the Power Manager bus communication through the VIA. Then it can work.

The other option is, if the PB140+ 68030 models use the same Power Manager chip as the Portable, to use a PB140/170 ROM and do a dummy 68030 MMU implementation.

I'll check it out as I progress further. The TG68 core's bus interface implementation may be particularly helpful to me, since I have to implement basically the same thing. But I'm curious to see how they've implemented the Amiga peripherals. The Amiga has such a great design, with its blitter and the bitplanes and the two banks of RAM.

 

I've still gotta finish the schematic for this latest revision (with the microcontroller), and then I want to write the framework for the emulator and implement an instruction or two to see how good the performance can be. I'm also behind on my "real" product, this thing that adds Bluetooth A2DP audio to various Honda and Acura models with non-standard sized stereo head units. My finals are this week, otherwise there would be more progress.

Edited by ZaneKaminski

Umm why do we need to emulate anything? The only thing that needs to be done, IMHO, is emulate the CPU. Maybe I missed something, but the machine is already there. The bus and motherboard do everything that needs to be done. The CPU just executes code the ROM feeds it.

 

Oh, and the Portable uses an ASC on-board with a pair of Sony chips. Just like the Mac II and the SE/30. 

Edited by techknight

Umm why do we need to emulate anything?

Well the problem I'm seeing is with the ROM. If you emulate the same CPU as in the machine, just faster, then you can use its ROM and everything works since all of the chipset hardware expected to exist by the software in ROM is present in the machine.

 

You could run the Portable from its stock ROM using an accelerated 68020+ CPU (real 68020 or emulator), but then it's unclear if 68020 software can be used, since the ROM and OS are unaware that the CPU has 68020 capability. So then you'd probably have to do some ROM patching, and that might be a never-ending battle, trudging through tens of kbytes of disassembly to fix an endless stream of error messages or something.

 

So the other solution is to run a Mac II ROM and re-map the address space in the emulator (or in the case of a physical processor, in the glue logic) to line up with the Mac II's address space. If the Portable has the ASC (I didn't realize it did), then that aspect should be easy, but then the Mac II's ROM is missing the software for the Power Manager chip, and the VIA GPIO bits will certainly be different, too, so those will need to be remapped or else the software in ROM, when using the VIA, will be manipulating the wrong signals.

Edited by ZaneKaminski

The qSPI interface of the STM32F7 supports SDR operation up to 100 MHz and DDR operation up to 80 MHz. Specs for the -H7 are not available yet but they'll surely be the same or better. The -H7 is built on a new 40nm process. I think the -F7 is built at 65nm or something.

 

Now the qSPI clock should be synchronous to the Mac's clock, just several times faster. For all but the Portable and PB100, these machines run at 7.8336 MHz. So SDR operation maxes out at 94.0032 MHz (x12) and DDR operation at 78.336 MHz (x10).
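
The multiplier arithmetic above can be reproduced with a few lines. This is just a sketch: the 100 MHz and 80 MHz limits are the -F7 figures quoted earlier in the post, and the function name is made up.

```python
import math

MAC_CLOCK_MHZ = 7.8336  # 68000 clock on the compacts


def max_synchronous_clock(limit_mhz, base_mhz=MAC_CLOCK_MHZ):
    """Largest integer multiple of the Mac clock that stays within the limit."""
    multiplier = math.floor(limit_mhz / base_mhz)
    return multiplier, multiplier * base_mhz


sdr_mult, sdr_mhz = max_synchronous_clock(100)  # SDR limit
ddr_mult, ddr_mhz = max_synchronous_clock(80)   # DDR limit
print(sdr_mult, sdr_mhz, ddr_mult, ddr_mhz)
```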

 

Now, obviously 78.336 MHz DDR is the fastest option, but that would possibly require a 156.672 MHz internal clock in the iCE40 FPGA. Here's what the iCE40's datasheet says about its performance and latency:

[Attached image: excerpt from the iCE40 datasheet on performance and latency]

 

So hopefully it's possible, since the qSPI interface shouldn't be very "deep" in terms of the longest path through the piece of work that has to be done on every clock edge. Also the iCE40 series has DDR registers in all of its I/O pins, so it can sort of take double the data and clock out in an alternating fashion. Hopefully that will halve the clock speed requirement.

 

The iCE40HX4K, which I plan to use, has two internal phase-locked loops, which can be used to multiply the input clock from the Mac. Hopefully I'll only use one, and so we can downgrade to the iCE40HX1K, which has only one PLL and a logic capacity of about 3x less.

 

The other benefit in multiplying the Macintosh's clock is that we can run the two completely in-phase. The BBU in the Macintosh SE, for example, is purposefully run out of phase with the 68000. The 68000's clock is generated from the BBU's clock, but with a 30ns delay. Therefore if the BBU places data on the bus according to a certain clock edge, the 68000 will always be delayed enough for the signals to stabilize before it samples them. With the fast qSPI clock synchronous to the Mac's clock, we don't need to do any of this delay business and instead can time an event to occur at one of 12 (at the most) points in the 68000's cycle.

 

So now that I'm doing it this way, I've gotta go back to designing the qSPI command set, which is what I started doing when I began this thread hahah.

Edited by ZaneKaminski

I'd rather do it with DDR since that's less demanding in terms of clock signal bandwidth. Y'know, in SDR, the clock oscillates at double the frequency of alternating (1010...) data bits. Whereas in DDR, the frequency of alternating data bits is the same as the clock frequency. But then again, I've never designed a state machine so complex, let alone with DDR.

 

About qSPI, it's mainly for interfacing with flash memory, but the qSPI command sequence is fully programmable on most chips supporting it, so it can be used as a simple, pretty fast FPGA interface.

 

The STM32F7 supports dual qSPI, where you hook both qSPI interfaces up to separate flash memory chips, and then they're accessed in parallel to double the flash memory throughput. I've determined it's not necessary for 68000 systems at 7.8336 MHz. The Portable, at twice that speed, may need both interfaces. Surely the -H7 supports something similar.

 

Now, the Snapdragon doesn't have qSPI, so if that's ever used as a processor for 68020+ systems, none of this applies. But the maximum qSPI bandwidth (in dual configuration) on a IIfx is actually equal to the maximum throughput of its bus (80MHz DDR x 8 qSPI bits in dual configuration = 1280 Mbit/sec), but the qSPI interface has more overhead and has to send the command bits twice, once to each chip. So it's not quite up to full-speed operation on a IIfx. But on a IIci, I think it can manage 80% bus utilization or so. So that's adequate. But again, I dunno what processor I would use for those systems anymore.
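
The raw qSPI figure above works out as follows. A small sketch with a hypothetical helper; it counts raw data-line bandwidth only and ignores the command and address overhead the post mentions.

```python
def qspi_mbit_per_s(clock_mhz, ddr, dual):
    """Raw qSPI bandwidth in Mbit/sec, before any command overhead."""
    edges = 2 if ddr else 1   # DDR moves data on both clock edges
    lanes = 8 if dual else 4  # dual mode drives two 4-bit interfaces in parallel
    return clock_mhz * edges * lanes


print(qspi_mbit_per_s(80, ddr=True, dual=True))     # the dual DDR case quoted above
print(qspi_mbit_per_s(100, ddr=False, dual=False))  # single interface, SDR
```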

Edited by ZaneKaminski

I think that we can avoid needing an INIT or special boot floppy or anything else like that by constantly hogging the bus, and never letting the 68000 get a chance to fetch an instruction. The Bus Glue FPGA, if it has nothing to do, will have to constantly issue meaningless reads to addresses known to be in the RAM or ROM.

 

Older hardware must have used latches and control signal translation logic (as thetechknight has described), so it wasn't really capable of performing such dummy bus operations. Maybe that's why early accelerators required the ROM board or a special floppy or something.

 

In addition to the bus operations, the FPGA could also be programmed to decode instructions (in the class-and-parameters scheme I described a few days ago), delivering the result in as little as 0.1 microseconds. I don't think it will be useful though. 0.1 microseconds is 40 clocks of the STM32H7, and I'm hoping to perform the instruction decoding faster than that. (It will be cached though, so it only has to be decoded once. This is the key enabler of high performance.)

 

Edit: also ST is apparently announcing higher-end members of the STM32H7 family, in Q1 of 2017. So I'll stay tuned to see if there's anything that can offer more performance. More internal memory and more L1 cache would be particularly helpful, but another 100MHz would be great too.

 

Edit again: The Bus Controller should have a queue of up to, say, 7 bus operations which have not yet completed. For 68000 systems, it should take less than one 7.8336 MHz cycle (1/4 of a bus cycle) to transmit the write byte/word command (including address, data) to the FPGA.

 

The way it would work is you would enqueue a read or write command and then check later (with a status command) to see if the operation was successful or not (and also read the data read, for read operations).

 

Each command can be assigned a sequence number from 0 to 7. The first command enqueued is numbered 0 and it goes from there, wrapping around to 0 after 7.

 

An asynchronous output of the FPGA should give the sequence number (as a gray code) of the command currently in progress (i.e. one after the number of the most recently finished.) That way, the STM32 can just check that output to know how many queued operations have been completed.

 

The idea is kind of like sliding-window flow control, if anyone is familiar.
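
Here is a rough sketch of the sequence-number bookkeeping described above, seen from the STM32 side. The Gray code conversion is the standard one; the function names and the way the pins are read are assumptions for illustration.

```python
SEQ_MODULUS = 8      # sequence numbers run 0..7, then wrap
MAX_OUTSTANDING = 7  # queue depth suggested in the post


def to_gray(n):
    """Binary to Gray code; adjacent values differ in exactly one bit."""
    return n ^ (n >> 1)


def from_gray(g):
    """Gray code back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def completed_since(last_in_progress, gray_pins):
    """How many queued operations finished since the previous check.

    last_in_progress is the in-progress sequence number seen last time,
    gray_pins is the value currently read from the FPGA's output pins.
    """
    in_progress = from_gray(gray_pins)
    return (in_progress - last_in_progress) % SEQ_MODULUS


# Previously operation 6 was in progress, now the pins show operation 1.
print(completed_since(6, to_gray(1)))
```

Keeping the queue depth (7) below the number of sequence values (8) means the modular difference is never ambiguous, and the Gray code means only one pin changes per step, so an asynchronous read cannot catch a half-updated value.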

Edited by ZaneKaminski

Most accelerators needed a ROM as a DeclROM to the OS, to tell the machine that it has X-Y-Z features, including the processor type.

 

Also, I have an accelerator board without a ROM for the plus, and the Plus sees it as a base 68020 without RAM expansion, or FPU. But it has an onboard FPU, and it is actually a 68030. 

 

So the OS has its own way to tell whether the CPU is a 68020 or not, and it's up to the DeclROM to tell the OS that it's "really" an 030, that it has x-y-z features, and that it contains drivers for those x-y-z features, IF needed.

Edited by techknight
