
SE/30 DiiMO accelerator cloning

The logic is based on the Diimo GALs, adapted to fit on a single PLD, right? So I take it you had to understand enough of it to be able to combine it, but other parts of the code you haven’t needed to analyse?

And so in adapting it to a single logic device new issues have arisen, perhaps because it was running marginally to begin with?
Truthfully, at the coarsest level there isn't much understanding required. The fundamental equations from eqn2jed are already normalized in a way that agrees with CUPL (order of operations, etc.). If you want to stay sane, the equations from eqn2jed should have their variables renamed to match net names gleaned from what they connect to (e.g. connects to the CPU AS pin? Then it's AS.LOCAL).
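To make that concrete, here's a minimal sketch of the renaming pass; the pin numbers and net names are invented for illustration, not taken from the real DiiMO equations:
Code:
/* Raw eqn2jed output: anonymous pin-numbered variables              */
pin19 = !pin2 & pin3
      # pin4 & !pin6;

/* Same equation after tracing the board and renaming to net names.  */
/* AS_LOCAL, AS_CPU, CACHE_HIT and BG_LB are hypothetical nets here. */
AS_LOCAL = !AS_CPU & CACHE_HIT
         # BG_LB & !AS_CPU;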

Next, the registered and OE logic needs to be distributed to each equation in a GAL that uses it, any inverted signals (usually just /OE if used as an input) need to be made consistent, the syntax translated, and various clean-up steps applied, like eliminating redundant feedback declarations. These are largely mechanical steps and don't require any particular understanding of the equations being worked on, as it's "just" translation and clean-up; a rough sketch follows.
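Roughly what that looks like in practice, again with made-up names (the real equations are considerably more involved):
Code:
/* Per-output declarations as they come out of one GAL (illustrative) */
pin15.d  = pin2 & pin3;
pin15.oe = !pin11;            /* board /OE net is active low           */

/* After merging into the single CPLD source: every registered output */
/* carries its own clock and output-enable, the /OE polarity is made  */
/* consistent, and the GAL's explicit feedback declarations go away.  */
TA_LOCAL.d  = AS_LOCAL & CACHE_HIT;
TA_LOCAL.ck = BCLK_LB;
TA_LOCAL.oe = !BG_LB;         /* only drive /TA while we own the bus   */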

By the end of it you've got a pile of equations in a single file and you need to either assign pins yourself or let the fitter do it for you; hypothetically it can optimize placements, though I'm not sure how true that is, and I find manual placement according to whatever makes routing easier is usually fine. Conceptually, at this point you can make a board and it might just work. I recently did exactly this with the Formac design Bolle uses for his Quadra 700 clock doublers, and after correcting a single typo in an equation the resulting board worked perfectly. I even snuck in a version that swapped the unidirectional latches for bidirectional registered transceivers.
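Pin assignment itself is just PIN statements in CUPL; leave the number off and the fitter picks one for you. The numbers and names below are placeholders, not the real layout:
Code:
/* Manual placement: pick pins that make the PCB easy to route.      */
/* Pin numbers here are placeholders, not the real assignment.       */
PIN 43 = BCLK_LB;        /* clock input                              */
PIN 4  = !AS_CPU;        /* active-low input from the CPU            */
PIN 37 = !TA_LOCAL;

/* Or leave the pin number off and let the fitter place it:          */
PIN    = CACHE_HIT;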

[attached image]

So you could stop here, but that's no fun, and potentially risky if you don't test every single possible scenario. So let's continue.

Alright, you've labeled the outputs and inputs according to what they connect to, but what is being done in the middle, in these mystery equations? You've still got a lot of intermediate nodes left whose function isn't clear, and that means potential edge cases. For example:
Code:
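/* CUPL notation: & is AND, # is OR, ! is NOT; .ck and .d are the     */
/* register's clock and D input. The gal3_nc* names are intermediate  */
/* nodes whose purpose isn't known yet.                                */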
gal3_nc14.ck = BCLK_LB;
!gal3_nc14.d = !RSTI
    # !gal3_nc16 & !gal3_nc15 & !gal3_nc14
    # !BB_LB & gal3_nc16 & gal3_nc15 & !gal3_nc14 & !gal3_nc13
    # gal3_nc16 & gal3_nc15 & !gal3_nc14 & !gal3_nc13
    # gal3_nc16 & gal3_nc15 & !gal3_nc14 & BG_LB_REG
    # BG_LB & gal3_nc16 & gal3_nc15 & !gal3_nc14
    # gal3_nc16 & !gal3_nc15 & !gal3_nc14
    # gal3_nc19 & !gal3_nc16 & !gal3_nc15 & gal3_nc14;
Due to experience I can now tell you this looks like a counter, and I'm thinking it's actually 3 counters sharing this single node, all having something to do with arbitrating bus ownership. But actually disentangling that, where you have intermediates depending on intermediates... good luck!
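For comparison, here's what a plain 2-bit counter looks like once it's been flattened into D equations; that same flavor of self-referencing product terms is what gives the block above away. This is just a reference example, not part of the DiiMO code:
Code:
/* A plain 2-bit up-counter flattened into D equations, for reference. */
/* Each bit's next state depends on itself and the bits below it,      */
/* which is the same pattern as the mystery block above.               */
Q0.d  = !Q0;
Q1.d  = Q1 & !Q0
      # !Q1 & Q0;            /* i.e. Q1 XOR Q0 */
Q0.ck = BCLK_LB;
Q1.ck = BCLK_LB;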

Working from the edges in, you can start to assign probable functions based on a best guess at the referenced variables. For example, it's looking for SIZ0 & SIZ1? OK, it cares about line-size accesses, and this equation identifies that state; therefore we have a new name for this variable instead of the useless temporary name it had before. From there it's possible to slowly gain further insight, as we now know that anything referencing it cares about line accesses... and so it goes. Sometimes probing with a logic analyzer can also help.
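Something like this, where the node name on the left is the only thing that changes (the equation itself is invented for illustration):
Code:
/* Before: meaningless temporary straight out of eqn2jed              */
gal5_nc17 = SIZ0 & SIZ1 & !AS_LOCAL;

/* After: the name now says what the state detects, so every equation */
/* that references it becomes that much more readable.                */
LINE_XFER = SIZ0 & SIZ1 & !AS_LOCAL;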

If you're lucky you'll eventually meet in the middle and have a "full" understanding, and perhaps even be able to optimize. I was able to significantly redesign the Formac, increasing CPU performance by 10%, memory bandwidth by 20%, and video performance by over 30%! This was a mixture of resolving some bugs/special cases in the original code, and, since I know I've got a fixed PLL-locked clock phase relationship, optimizing around that. I even managed to spot a minor tracing error in the schematic Bolle had been so kind as to send me, because the equation didn't make any sense as written and wired. Eventually four GAL16V8s turned into a single ATF1502 CPLD that is only about half full.

The thing is, the Formac (and the Booster before it) have relatively simplistic logic governed almost entirely by the CPU's timing requirements. That makes it pretty clear what your intermediate states need to be, and from that you can figure out what the equations must be doing. Plus, there isn't any state carried over between bus cycles at all.

The DiiMO, however, is another beast entirely; it originally had twelve heavily utilized GAL16V8s, up to 96 macrocells, which were replaced by an ATF1508 with 128. From a "subsystem" perspective, it has both internal and external bus strobes, data muxing/register control, cache control (including a cache snoop case for bus masters!), and internal bus cycle management. Around 36 of the macrocells in use were internal feedback or inter-PLD feedback only! All of these work together to make the thing work.

Have a gander at that visualization from earlier... look for the big light-green AS node, click on it, then check the nodes it depends on... look how deep that rabbit hole goes. Eeeek! 😨

You're also right that there's room for timing issues. At a high level, the ATF150x CPLD has a similar propagation delay to a GAL: 10ns is the guaranteed combinatorial (non-clocked) input-to-output time on both. In my experience the CPLDs are very stable and will hit their timings without much variance; GALs have a little more wiggle room. For example, a globally clocked output on one of these CPLDs will be valid within 5ns max (and that's basically a constant), and as long as you meet 3ns of setup time you're good. A 10ns GAL, however, is rated anywhere from 2-7ns, and it wants a 10ns setup time before the clock! Put another way, going from those numbers, a register-to-register hop through the CPLD needs roughly 8ns of clock budget, while the same hop through the GAL can need up to 17ns.

In general, an ATF15xx CPLD of the same speed grade as the GAL should be as fast or faster, but under the hood the timing models are very different, and the more intermediate nodes there are between input and output, the more those timing differences can add up. The architecture of the CPLD is much more flexible, but with that comes much more intricate timing. Things can be tight enough that even the act of observing an intermediate (by either tying it to an output pin, or using its feedback to drive an output pin) can cause behavior to change because the internal layout of the CPLD shifts. Since I don't have a way to simulate the thing as a whole and analyze timings, you can see how it turns into a real mess!
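For what it's worth, bringing a buried node out for the logic analyzer is only a couple of lines, which is exactly why it's so tempting, and exactly how the fit gets perturbed. Pin number and names below are placeholders:
Code:
/* Mirror an internal node onto a spare pin so the logic analyzer can */
/* see it. Adding this can itself change the fit, so compare behavior */
/* with and without the probe.                                        */
PIN 27 = DEBUG0;            /* spare test point, placeholder pin no.  */
DEBUG0 = gal3_nc14;         /* combinatorial copy of the buried node  */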

Currently, for example, I've got a case where the local bus acknowledge code will deadlock under certain circumstances, and I only have 4 test points with which to examine internal state. Tweaking the clock used by one of the intermediates seemed to do the trick for the first issue... but it broke synchronous termination when the cache is off. And the data buffer clock on reads doesn't make sense at all; it doesn't seem to latch data at the point where an '030 normally would. I have no idea how it's getting valid data!

And this is hypothetically one of the easier cached accelerators; I've already switched GALs all around on a finished board and only a couple of them are really sensitive to propagation delay. On cards like the GAL version of the Turbo 040, which is known to be fickle at the best of times and needs a specific mix of GAL speed grades for certain speeds: godspeed.

Hope this sheds a little light :)
 
The clock doublers look like they’re based on your QFP2PGA design. With 25MHz rated Freescale chips under there?

The GAL code is a real jigsaw puzzle. What does all this say about the engineers who originally created the code?

I’m using AI (ChatGPT) more and more to help me advance my projects, it’s amazing how far it helps to understand issues and solve problems. I wonder if it could help with that code at all.
 
I’m using AI (ChatGPT) more and more to help me advance my projects, it’s amazing how far it helps to understand issues and solve problems. I wonder if it could help with that code at all.
That would depend on the volume and quality of GAL code that's been added to the GPT training set. Something I've found is that all the current models from different vendors have a really deep understanding of Python as long as you have a large enough context window. So I've recently written a bunch of transformers to represent different flows as Python code, and get the models to explain THAT to me. It's been a lot more dependable not only in doing logic drill-downs and feature hunts, but also in finding gaps and suggesting approaches to resolve them. Then I can use that information to manually implement the solutions (because that's a step the models still tend to flub too regularly).

The big bit though is the context window (which recent models greatly expand, if you've got access).
 
The clock doublers look like they’re based on your QFP2PGA design. With 25MHz rated Freescale chips under there?

The GAL code is a real jigsaw puzzle. What does all this say about the engineers who originally created the code?

I’m using AI (ChatGPT) more and more to help me advance my projects, it’s amazing how far it helps to understand issues and solve problems. I wonder if it could help with that code at all.

Yep, there's a QFP full 040 under the heatsink :) Similar appearance, but electrically it's descended from one of these. Bolle reverse engineered this design and used it to produce his doublers for the Q700 and, I think, a few other things. Originally it had an asynchronous 45MHz clock, but I've put a PLL on there for 50MHz operation, similar to what Bolle did; since I can guarantee clock alignment, the fast /TA signal can have a fixed relationship to the slow logic board signal, and less time is wasted. Altogether, there's now an average 20-40ns "penalty" on bus accesses, down from 100-140ns, which does good things for memory throughput. I'm using this design as a basis since it uses the same concepts as the boosters: massaging strobes and acknowledge signals between fast and slow clock domains. Unlike the janky clock manipulation done by the quad doublers, there's a possibility this could work in a NeXTstation; I intend to find out once the last order I have in at JLC arrives.

What it says about the engineers who created the code is that, I think, they likely used a higher-level tool to generate that counter. CUPL will happily turn a sequence/state machine description into registered logic for you, rather than you needing to come up with all of the equations to make the counter count. Not impossible to do by hand, but annoying.
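For anyone who hasn't seen it, this is roughly what that higher-level form looks like in CUPL; the compiler expands it into flattened D equations much like the mystery block earlier. The states and signals here are invented, not from the DiiMO:
Code:
/* A tiny 3-state machine written with CUPL's SEQUENCE syntax. On a   */
/* GAL16V8 the registers are clocked from pin 1 automatically; the    */
/* compiler turns this into D equations like the ones shown earlier.  */
FIELD state = [Q1..0];

$DEFINE IDLE  'b'00
$DEFINE WAIT  'b'01
$DEFINE ACK   'b'10

SEQUENCE state {
    PRESENT IDLE
        IF !AS_LOCAL NEXT WAIT;    /* cycle started                   */
        DEFAULT      NEXT IDLE;
    PRESENT WAIT
        NEXT ACK;                  /* one wait state, then acknowledge */
    PRESENT ACK
        NEXT IDLE OUT TA_LOCAL;    /* assert /TA for one clock        */
}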

I would be surprised if AI could do anything productive with PLD logic... I think the interactions would be too nuanced. As a point of comparison, it did surprisingly adequately when I prompted it to access a CF card in 68k assembly. Not functional, but sort of in the right ballpark anyway.
 
@David Cook was so kind as to share his cache testing utility and with it I was able to confirm the design change I made to enable the 64K of cache RAM that was unused in the original design. I'd found it a bit hard to pin down the effect in benchmarks, but Doom for example liked the additional cache quite a bit.

As expected, there's a sharp fall-off after 128K. I do find it a little odd that the non-cache performance is as poor as it is at size=256, since at 50MHz this should still be running from the internal CPU caches. The booster should post the same score, just a smidge lower due to its 3MHz-slower core clock. So we can posit that something is causing external bus cycles, probably because the test routines are larger than the internal CPU caches. Presumably that is why the cache-on case posts faster speeds after exceeding the cache size, as the bus cycle timing remains the same cache-on vs. cache-off. Interesting data!

The booster's significantly faster bus access can also be seen here vs. the no-cache case.

[attached image: cache benchmark results]
 