
Floppy / Floppy Controller SWIM Issues

pgreenland

Well-known member
Hi All,

I've been working on resurrecting an SE/30 with some minor battery leakage damage.

Re-capped and fixed a few address lines near the RTC. It now passes all the tests in Snooper, with lots of RAM test cycles passed. Boots fine from a ZuluSCSI. I've been using it for several hours and it appears stable.

Having a problem with the floppy drive though.

Rebuilt, cleaned, greased etc the floppy drive previously and tested with another SE/30 - all was looking good, could read and write files to it, with matching checksums when reading back.

The recapped SE/30 is having lots of problems with the drive. It will boot from a floppy, getting to the desktop but loading an app typically results in a bus fault or address error message.

Running from SCSI, it mounts disks OK. Reads seem OK; writes sometimes succeed (to a DOS-formatted disk at least), although the checksums failed on read-back.

A disk erase operation, however, locks the machine up every time. On System 6 I get a bus or address error; on System 7 there's no message, just a lock-up.

Has anyone seen this sort of issue before?

Where would be good to start looking?

Thanks,

Phil
 

pgreenland

Well-known member
Progress report for anyone experiencing similar issues.

Used DiskDup to image a disk (it claims to make a sector-level copy), then took an image of the same disk on a PC using dd. The MD5s of the two images match, so reading doesn't seem to be an issue.

Attempted to have DiskDup write the image to another disk, which it completed and verified without issue. The PC didn't agree on the MD5 though.

Connected the drive to my working SE/30 and repeated the same test, with the same hard disk image.

Read and write again worked fine. However, this time the MD5s matched both when reading a disk on the Mac that was written on the PC and when reading a disk on the PC that was written on the Mac.

I was additionally able to use the erase disk functionality in System 6 to format the disk ready for use, which I'm not able to do without a bus or address error on the other machine.

The mystery continues....
 

pgreenland

Well-known member
More progress....more weirdness.

Looking a bit deeper at the original vs the copy generated by DiskDup.

It seems that a good chunk of the data made it from disk to disk.

However there are byte errors throughout, and they appear in blocks of 4 bytes. In each instance the value that's been written appears to be 51c8 fff2.

Screenshot 2022-10-12 at 21.50.21.png

Repeated the test again having formatted the disk, checking the value before and after writing with DiskDup. It seems the disk is being written to in the incorrect areas (the incorrect bytes differ from the post-formatted value).
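For anyone wanting to repeat the comparison, here is a minimal sketch in portable C (run on the PC side, not on the Mac). It assumes both images are already loaded into memory and are the same length; the names `compare_images` and `diff_summary` are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* The four bytes that keep turning up in the bad copies. */
static const unsigned char ROGUE[4] = { 0x51, 0xC8, 0xFF, 0xF2 };

/* Summary of a byte-wise comparison of two equally sized images. */
struct diff_summary {
    size_t runs;        /* contiguous runs of mismatching bytes */
    size_t rogue_runs;  /* runs that are exactly the 4 ROGUE bytes in the copy */
    size_t first_off;   /* offset of the first mismatch, or len if none */
};

struct diff_summary compare_images(const unsigned char *orig,
                                   const unsigned char *copy, size_t len)
{
    struct diff_summary s = { 0, 0, len };
    size_t i = 0;

    while (i < len) {
        size_t start;

        if (orig[i] == copy[i]) {
            i++;
            continue;
        }

        /* Walk to the end of this run of differing bytes. */
        start = i;
        while (i < len && orig[i] != copy[i])
            i++;

        if (s.runs == 0)
            s.first_off = start;
        s.runs++;
        if (i - start == 4 && memcmp(copy + start, ROGUE, 4) == 0)
            s.rogue_runs++;
    }
    return s;
}
```

If `rogue_runs` equals `runs`, every difference is a 4-byte block of the same value. A run can come out shorter than 4 bytes when the original data happens to share a byte with the rogue value, so a small shortfall isn't conclusive on its own.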
 

pgreenland

Well-known member
It seems the SWIM may not be to blame after all.

I did what I was initially avoiding and connected a logic analyser to address, data and control lines used by the SWIM and captured a floppy being written by DiskDup....250MB of capture file for a 1.5MB floppy :p

After producing a floppy, I imaged it on the PC and found the location of the first erroneous word of 51c8 fff2.

I had the logic analyser decode the address and data buses, and used Python to extract all SWIM FIFO writes to a file.

To my surprise I see the start of the disk, and each sector after it. Each sector appears to be prepended with some sort of header, seemingly 17 bytes long (4E4E4E4E 00000000 00000000 00000000 FB), which vaguely resembles the sector header data.
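Splitting the extracted FIFO stream into sectors comes down to scanning for that 17-byte pattern. A rough sketch in portable C, assuming the FIFO writes have already been dumped into a flat byte array; `next_header` is a made-up helper name, and the pattern is only the one observed in this capture, not taken from any SWIM documentation.

```c
#include <stddef.h>

#define HDR_LEN 17  /* 4 x 0x4E, 12 x 0x00, 1 x 0xFB, as seen in the capture */

/* Returns 1 if the 17 bytes at p match the observed header pattern. */
static int is_header(const unsigned char *p)
{
    size_t i;

    for (i = 0; i < 4; i++)
        if (p[i] != 0x4E)
            return 0;
    for (i = 4; i < 16; i++)
        if (p[i] != 0x00)
            return 0;
    return p[16] == 0xFB;
}

/* Returns the offset of the next header at or after `from`, or len if none. */
size_t next_header(const unsigned char *stream, size_t len, size_t from)
{
    size_t i;

    for (i = from; i + HDR_LEN <= len; i++)
        if (is_header(stream + i))
            return i;
    return len;
}
```

Calling `next_header` repeatedly, each time starting just past the previous hit, gives the offset of every sector's data in the stream, which makes it easier to line a bad word on the disk up with the bus write that produced it.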

1665699649471.png

The smoking gun so to speak is here:

1665699824986.png

The mystery rogue value appears not only on the floppy disk but on the data bus. The SWIM appears to have faithfully written what it was asked to by the host.

The question now then....did the processor manage to write the wrong data onto the bus, or could another device be getting addressed at the same time and driving the bus? Why is it the same value every time? It's only using 8 bits of a 32 bit bus, so if it was another device being addressed how is the value changing?
 

Corgi

Well-known member
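Just a silly theory, but `51c8 fff2` could be 68K code: read big-endian, `51c8` is the opcode for `dbf %d0` (a.k.a. `dbra`), and `fff2` would be its 16-bit branch displacement of -14. (Byte-swapped, `c851` would be `andw %a1@,%d4`, but the 68K is big-endian.)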
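A decode like this can be checked mechanically. A minimal sketch in portable C, assuming the two words are read big-endian as the 68030 would fetch them; the helper names are made up for illustration, and the bit layout is the standard DBcc encoding (0101 cccc 1100 1rrr followed by a 16-bit displacement).

```c
#include <stdint.h>

/* Returns 1 if op matches the DBcc pattern 0101 cccc 1100 1rrr. */
int is_dbcc(uint16_t op)
{
    return (op & 0xF0F8u) == 0x50C8u;
}

/* Condition field: 0 = T, 1 = F (never true, i.e. DBF/DBRA), and so on. */
int dbcc_condition(uint16_t op)
{
    return (op >> 8) & 0xF;
}

/* Data register used as the loop counter (0..7). */
int dbcc_register(uint16_t op)
{
    return op & 0x7;
}

/*
 * Branch target relative to the address of the opcode word.
 * The displacement is sign-extended and applies from the address
 * of the extension word, which is the opcode address + 2.
 */
int32_t dbcc_target_offset(uint16_t ext)
{
    int32_t disp = (ext >= 0x8000u) ? (int32_t)ext - 0x10000 : (int32_t)ext;
    return 2 + disp;
}
```

For `51c8 fff2` this gives DBF on D0 with a branch target 12 bytes before the opcode, which is the shape of the tail of a short countdown loop. Nothing in the thread establishes where the value actually comes from, so treat this as a clue rather than a diagnosis.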
 

pgreenland

Well-known member
@Corgi Haha I was wondering if it might be.

I had a tinker with MacsBug last night, stopping DiskDup after it loaded the image, and searched for the image in RAM. Before writing it looked OK, but afterwards it had the garbage in it. Was hoping I might be able to set a watchpoint but alas them be modern features....unless I'm mistaken? Best I could do was the single-step checksum command, which seemed to catch the modification. It pointed at _vSyncWait being the culprit, although the instruction it was pointing at had nothing to do with the memory address concerned as far as I could see.

The machine had battery acid damage around the RTC and some of the vias leading to the F258s, where a few address lines were broken. Having fixed them, it passes Snooper's RAM tests with the walking ones option selected. I'm wondering if that test isn't quite good enough to catch what's happening here.
 

Corgi

Well-known member
MacsBug SS (Step Spy) is mostly equivalent to gdb's watch, though SS takes a range while watch only takes a single address. I'd try a few different times and see if the address changes each time. If it does, we know it's some sort of noise on the address line. If it doesn't, perhaps it's graphics related?
 

pgreenland

Well-known member
@Corgi I figured as much, the docs hinted at an automated single step with checksumming to detect changes. Pretty neat for the classic rogue pointer blatting over things but maybe not enough to catch the mess I've got going on :p

The corruption on the floppies changes each time: it's always the same value, but at different and multiple locations. I've only checked a few locations with MacsBug so far, but the corruption on the disk matched corruption in the RAM on those occasions.

I've got 8MB of RAM in the machine atm....as 8 x 1MB modules. I was considering removing the 4MB in bank B, see if anything changes.

I was going to try to take a look at the memory addresses of the corruption, see if there's a correlation. Then comes the memory map of the SE/30, which I'm not familiar with, and the address decoding for RAM, where from a high level I can roughly see what's going on in the schematic.....but I couldn't explain it.....so I probably don't quite get it yet :ROFLMAO: .

Have you got any tips for tracking down the rogue address line?
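One cheap way to look for a correlation is to collect the addresses of the corrupted longwords and see which address bits never vary across them. A minimal sketch in portable C; `common_bits` and `addr_bits` are made-up names, and the input is assumed to be a list of addresses noted down from MacsBug or from the test program.

```c
#include <stddef.h>
#include <stdint.h>

/* Address bits that are constant across every corrupted location. */
struct addr_bits {
    uint32_t always_set;    /* bits that are 1 in every address */
    uint32_t always_clear;  /* bits that are 0 in every address */
};

struct addr_bits common_bits(const uint32_t *addrs, size_t n)
{
    /* Start by assuming every bit is constant, then knock bits out. */
    struct addr_bits r = { 0xFFFFFFFFu, 0xFFFFFFFFu };
    size_t i;

    for (i = 0; i < n; i++) {
        r.always_set   &= addrs[i];
        r.always_clear &= ~addrs[i];
    }
    return r;
}
```

With only a handful of samples many bits will look constant by chance (the top bits, and the low two bits for longword-aligned hits), so this is only suggestive once there are dozens of addresses. A bit that stays fixed across a large sample would be a candidate for a closer look on the schematic.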
 

Crutch

Well-known member
VSyncWait is just a tight loop that polls the ioResult of a device’s IOParamBlkRec waiting for a synchronous operation (in your case a disk write) to be completed by the driver which is running at interrupt time.

So unfortunately MacsBug isn’t really telling you anything here. It’s dutifully checking your location in RAM after each instruction reachable from the PC as of whenever you enable the Step-Spy, but it doesn’t step-spy through the interrupting DRVR (since the 68030 just jumps to the DRVR code when it gets an interrupt without telling MacsBug), so after the next iteration through the vSyncWait loop it suddenly notices that the DRVR changed your RAM location. VSyncWait itself isn’t doing anything.
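The effect can be modelled in a few lines of portable C. This is only a toy model of checksum-after-each-step watching, not how MacsBug is actually implemented; every name here is made up, and the "interrupt" is simulated by a step function that modifies the watched range as a side effect.

```c
#include <stddef.h>

typedef void (*step_fn)(void);

unsigned char watched[4];   /* stand-in for the RAM range being spied on */

/* One pass round the polling loop with nothing else happening. */
void poll_once(void)
{
}

/* One pass round the loop during which an "interrupt" writes to the range. */
void poll_once_with_interrupt(void)
{
    watched[0] = 0x51;      /* done by the simulated driver, not by the loop */
}

/* Simple checksum over the watched range. */
unsigned long range_checksum(const unsigned char *p, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/*
 * Runs the steps one at a time, re-checking the range after each.
 * Returns the index of the first step after which the range changed,
 * or -1 if it never changed.
 */
int first_changed_step(const step_fn *steps, int nsteps,
                       const unsigned char *range, size_t len)
{
    unsigned long before = range_checksum(range, len);
    int i;

    for (i = 0; i < nsteps; i++) {
        steps[i]();
        if (range_checksum(range, len) != before)
            return i;
    }
    return -1;
}
```

Given the steps `poll_once`, `poll_once_with_interrupt`, `poll_once`, the watcher reports step 1, i.e. it blames the loop iteration that happened to be running, even though the write came from elsewhere. That is the same reason the report points at the wait loop rather than at the driver.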
 

pgreenland

Well-known member
@Crutch Thanks for the explanation! - I wondered how it would behave if the target area was changed by an interrupt.

Bit more tinkering with MacsBug. I wondered at what point the RAM gets messed up.

I started DiskDup and loaded the original image to be written to disk.

Then jumped into MacsBug and searched for the value that always appears:

1665769033460.png
It seems that the image is located around 0x47103, based on an ASCII search for a known string.

There are 6 occurrences of the value that managed to wangle its way into RAM.

Returning to DiskDup and inserting a floppy, without pressing anything else another search result arrives.

1665769268143.png
The additional result being at 0x264C8.

Returning to DiskDup, allowing it to write the floppy and checking again. Now there are lots of instances within the disk image area that have incorrect values in:

1665769626140.png

I had a little go at writing an app that allocated a big chunk (6MB) of memory and cycled different values through it, but didn't see any issues while running for a hundred or so cycles. It looks as if whatever goes wrong is related somehow to disk access.
 

pgreenland

Well-known member
Added a disk access step to my noddy little test program, asking the floppy disk driver (via the Device Manager) to write the first 512 bytes of my test buffer to the first sector of a floppy disk.

As soon as that write completes, the disk and buffer both have the identical corruption in them.

Interestingly the corruption in RAM is limited to the first 512 bytes of the buffer, aka what was written. Again it seems different locations within the buffer are hit each time.

Test program for reference (excuse the magic numbers....I got excited that I was writing a program for the SE/30):

C:
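#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <Disks.h>
#include <Devices.h>
#include <Files.h>

#define BUF_BYTES    (2UL * 1024UL * 1024UL)              /* 2MB test buffer */
#define BUF_WORDS    (BUF_BYTES / sizeof(unsigned long))  /* 524288 longwords */
#define FLOPPY_DRIVE 1      /* drive number of the internal floppy */
#define SONY_REFNUM  (-5)   /* driver reference number of the floppy driver */
#define ITERATIONS   1

int main(void)
{
    unsigned long *buf;
    unsigned long i;
    unsigned long j;
    DrvSts status;
    OSErr errStatus;
    OSErr errWrite;
    ParamBlockRec iorec;

    /* Poll the drive until it reports a disk in place. */
    printf("Please insert disk\n");
    fflush(stdout);
    do
    {
        errStatus = DriveStatus(FLOPPY_DRIVE, &status);
    } while (errStatus != noErr || status.diskInPlace <= 0);
    printf("Disk inserted!\n");
    fflush(stdout);

    printf("Starting...\n");
    fflush(stdout);
    buf = malloc(BUF_BYTES);
    if (buf == NULL)
    {
        printf("Out of memory\n");
        return 1;
    }
    printf("Addr: %lu\n", (unsigned long)buf);
    fflush(stdout);

    for (i = 0; i < ITERATIONS; i++)
    {
        printf("Iter: %lu\n", i);
        fflush(stdout);

        /* Fill the buffer with a known pattern: each longword holds its own index. */
        for (j = 0; j < BUF_WORDS; j++)
        {
            buf[j] = j;
        }

        /* Synchronously write the first 512 bytes of the buffer to the first sector. */
        memset(&iorec, 0x00, sizeof(iorec));
        iorec.ioParam.ioVRefNum = FLOPPY_DRIVE;
        iorec.ioParam.ioRefNum = SONY_REFNUM;
        iorec.ioParam.ioBuffer = (Ptr)buf;
        iorec.ioParam.ioReqCount = 512;
        iorec.ioParam.ioPosMode = fsFromStart;
        iorec.ioParam.ioPosOffset = 0;
        errWrite = PBWrite(&iorec, false);
        printf("errWrite: %i\n", errWrite);
        fflush(stdout);

        /* Re-check the whole buffer; any mismatch means RAM changed during the write. */
        for (j = 0; j < BUF_WORDS; j++)
        {
            if (buf[j] != j) printf("Bad val at %lu\n", j);
        }
    }

    /* Overwrite the buffer before freeing it. */
    for (j = 0; j < BUF_WORDS; j++)
    {
        buf[j] = 0xdeadbeef;
    }

    free(buf);
    buf = NULL;

    printf("Done!\n");
    return 0;
}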

Any pointers on how to track down what's causing this would be amazing....atm I'm thinking of taping off the floppy drive and marking it as a crime scene :p
 

pgreenland

Well-known member
Tried my little test program with System 7.5.5 (I was using System 6.0.8 until now for its faster boot time following all my crashes :-( )

Exactly the same problem...almost.

With the default 24 bit addressing, as soon as I kick off a disk write the system locks up. Mouse and all, doesn't respond to programmer button. Only way out I could find was a reset.

With MODE32 installed, and 32-bit addressing enabled the program runs successfully and writes a disk, but the same memory corruption is detected following the write.

Tried removing the 4MB of RAM from Bank B - same behaviour with System 6 and a crash due to out of RAM this time in System 7.

Dumped the ROM and compared both its checksum and MD5 to a second ROM and ROM off the internet....just in case I've got some funky copy of the Floppy Driver in this machine....everything checks out there.
 

techknight

Well-known member
Read through this thread, definitely taking an insane approach to try and locate an issue related to buffer corruption.

I have a feeling it's RAM related, and by that I mean those F258s do go bad. Funnily enough it's getting past the RAM tests.
 

pgreenland

Well-known member
The bit that's been getting me is that the buffer corruption only appears following a disk write. It's possible to write a pattern into RAM and read it back with no issues, but writing the same pattern followed by a disk write leads to corruption.

I haven't removed them yet as I don't have any replacements kicking around, aside from some on an even more battery damaged board :-(.

I attempted to check each of them with the logic analyser, having spotted your suggestion in another post (that it might be F258 related). They were a prime suspect, being near to the original battery damage I was fixing.

They all seemed to be behaving as expected: I captured their inputs and outputs for several minutes and checked them with a dodgy Python script.
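For reference, the kind of check involved can be sketched as below. This is in portable C rather than the Python actually used, and it assumes the part behaves as a standard 74F258 (quad 2-to-1 multiplexer, inverting 3-state outputs, select low picks the A inputs, active-low output enable). The `f258_sample` layout is hypothetical and would need to match however the capture is exported.

```c
#include <stddef.h>
#include <stdint.h>

/* One logic analyser sample of a single F258 (hypothetical layout). */
struct f258_sample {
    uint8_t a;       /* A inputs, low 4 bits */
    uint8_t b;       /* B inputs, low 4 bits */
    uint8_t select;  /* 0 selects A, 1 selects B */
    uint8_t oe_n;    /* output enable, active low */
    uint8_t y;       /* observed outputs, low 4 bits */
};

/* Expected outputs while enabled: the selected input nibble, inverted. */
uint8_t f258_expected(uint8_t a, uint8_t b, uint8_t select)
{
    uint8_t chosen = select ? b : a;
    return (uint8_t)(~chosen & 0x0F);
}

/* Counts samples where the enabled outputs don't match the expected value. */
size_t count_f258_violations(const struct f258_sample *s, size_t n)
{
    size_t bad = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (s[i].oe_n)
            continue;   /* outputs are high-impedance, nothing to check */
        if ((s[i].y & 0x0F) != f258_expected(s[i].a, s[i].b, s[i].select))
            bad++;
    }
    return bad;
}
```

Samples that land inside the propagation delay after an input or select change will show up as false violations, so a small count that clusters around transitions doesn't necessarily mean a bad chip. A check like this also only sees logic levels, not marginal voltages or slow edges.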

Discovered TechTool Pro over the weekend, which looked to have more comprehensive RAM tests, in terms of the amount of RAM tested. Again they all passed just fine. The floppy disk write test however failed, with it suggesting I clean the heads. It didn't give me any more info and I haven't checked the disk....but I'd bet it's that corruption sneaking in again.
 

techknight

Well-known member
Don't disk writes require interrupts to be completely disabled during the write for timing reasons? Wonder if something's going wrong there.
 

pgreenland

Well-known member
Possibly?....I feel like I may have fallen down the rabbit hole on this one....upside down....and possibly backwards.

Would love to get it fixed not that I plan to use the floppy much, although having a fully functional machine would be good.

At the same time I've been staring at it so long I've passed the point of wanting the floppy working and moved onto just wanting to know what the *beep* is going on.

Any suggestions of things to check or try, how I might check whether interrupts etc were upsetting it I'd follow. Pretty much out of ideas myself.

Aside from coming up with some sort of potentially useful expansion slot adapter to help monitor the data and address busses somehow without requiring a 64 channel + control signals logic analyser :-(

I was hoping to just set a memory watchpoint like I'd do with a modern micro....but I don't think the 68k has anything fancy like that?....unless, maybe I could mark the block of memory containing my test data as read-only with the MMU, then catch the write fault.....but that's another rabbit hole.

It's gotta be a hardware issue like you suggested, but trying to find it without throwing more than its value of parts at it is starting to feel unachievable :-(
 

pgreenland

Well-known member
Still working on this.

Found that if I connected the logic board to the analog board of my working machine, with the help of an ATX PSU extension lead, thereby using its PSU and display all is well.

Three floppies written without any errors.

I recapped both the analog board and the PSU of the broken machine, which were next on the agenda anyway.

Now I've got a perfect solid image (which used to have the occasional wobble) and a solid 5v adjusted supply.

Still facing the same floppy problem.

Thought it might be the PSU so swapped that out for another and....yep.....same problem.

Is there anything on the analog board that could be interfering with the logic board?

Aside from power distribution it's just handling the video signal, isn't it? Which appears to go straight into a load of AND gates being used as buffers.
 