Any news? Can I remove the code from the NanoMac repository?
I guess that's a question for @Mu0n. I'm using their repo.
In the meantime (to go off-topic) I was musing about how well a PC at the time could handle MOD playback, assuming an equivalent sample playback to a Mac Plus (which is reasonable, since it's pretty simple hardware). My inner loop code would look like this:
Code:
;8086 version.
;ES:BX^volume table, BH=volume table page.
;DS:SI^waveform.
;ES:DI^samples buffer (byte)
;CX=fractional frequency
;DX=whole frequency
;AX=Fractional position
;SI=whole position.
;BP^ Stack frame?
.rept SAMPLES
add ax,cx ;FracPos+=FracFreq 2c:2b, so 3c
adc si,dx ;WholePos+=WholeFreq 2c:2b, so 3c
mov bl,[si] ;Waveform sample 5c:2b, so 6c (2x memory fetches)
mov bl,es:[bx] ;volume adjusted 5c (0c for seg override) 7.5c (2.5 mem fetch)
add es:[di],bl ;add to sample buffer 7c (3.5 mem fetches) 10.5c
inc di ;6 instructions 2c:1b =2c.
.endr
;2*3+10+7=23c.
;However, because it's an unrolled loop the BIU will empty and so real cycle counts should be
;based on bus cycles.
;So, real calculation is 32c per loop per track.
;So, an 8MHz 80286 with a 22050Hz hardware sample buffer would use
;32x22050=705,600 cycles per track, 2.8MCycles for 4 tracks or
;35% of CPU at 8MHz.
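For anyone who doesn't read 8086 assembly, here is a rough Python model of what one unrolled pass of that loop does. It's only a sketch under my own assumptions: the function and parameter names are made up for illustration, the position is held as a single 16.16 fixed-point number (whole part standing in for SI, fraction for AX), and the volume table is a plain 256-entry lookup.

```python
def mix_track(buffer, waveform, volume_table, pos, step, count):
    """Mix `count` samples of one track into `buffer` (a bytearray).

    pos and step are 16.16 fixed-point: the high 16 bits model SI/DX
    (whole part), the low 16 bits model AX/CX (fractional part).
    Returns the updated position so the next tick can carry on from it.
    """
    for i in range(count):
        # add ax,cx / adc si,dx: one 32-bit add split across two registers
        pos = (pos + step) & 0xFFFFFFFF
        sample = waveform[pos >> 16]             # mov bl,[si]
        scaled = volume_table[sample]            # mov bl,es:[bx]
        buffer[i] = (buffer[i] + scaled) & 0xFF  # add es:[di],bl (8-bit wrap)
    return pos
```

Calling it once per track on the same `buffer` gives the 4-track mix. Like the assembly, the model steps the position before fetching and does no clipping: the byte add simply wraps.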
The initial cycle calculations are for an 8MHz PC/AT, the fastest PC at the time of the Mac Plus. Although the 8086 architecture has a lot less register state than a 68000, and its address registers can only reach 64kB at a time in Real mode, a PC/AT would easily outperform a Mac for the same task, using only 35% of CPU at 8MHz. There are a few reasons for this:
- The 80286 on a PC/AT is a much bigger CPU than the 68000, with 134K transistors, roughly twice as many.
- The 80286 only requires 3 cycles for a bus transfer. In an unrolled loop this effectively determines the speed of the algorithm. The BIU's buffer only fills up while instructions are spending internal clock cycles, and most of the instructions in this loop don't. This means that the buffer is usually empty, with the CPU waiting for the next instruction to be loaded.
- The 80286 has a dedicated Effective Address ALU which can calculate any effective address combination in a fixed time (I assume 1 cycle based on the Intel document I used). However, even if EA calculations were somewhat slower it wouldn't affect the timing much, because the EA calculation cycles could be filled with BIU fetches.
- The algorithm needs nothing wider than 16-bit calculations, so the 68000's 32-bit capability offers no advantage here.
- Clever use of segment registers gives us enough registers to play with (and in fact BP is still free for use as the frame pointer).
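The bus-bound cycle count can be reproduced with a little arithmetic. This is a sketch of how I read the comments in the listing, not an official timing model: each instruction costs the larger of its nominal clock count and 3 clocks per 16-bit bus transfer, counting half a transfer per instruction byte plus one per data access. The table layout and function names are just for illustration.

```python
BUS_CYCLE = 3  # clocks per 16-bit bus transfer, as stated in the post

# (mnemonic, nominal clocks, instruction bytes, data memory accesses)
INNER_LOOP = [
    ("add ax,cx",      2, 2, 0),
    ("adc si,dx",      2, 2, 0),
    ("mov bl,[si]",    5, 2, 1),
    ("mov bl,es:[bx]", 5, 3, 1),
    ("add es:[di],bl", 7, 3, 2),  # read-modify-write: two data accesses
    ("inc di",         2, 1, 0),
]

def effective_clocks(nominal, code_bytes, data_accesses):
    """Clocks for one instruction when the prefetch buffer is empty."""
    bus = BUS_CYCLE * (code_bytes / 2 + data_accesses)
    return max(nominal, bus)

def clocks_per_sample(loop=INNER_LOOP):
    """Total bus-bound clocks for one pass of the unrolled loop."""
    return sum(effective_clocks(n, b, d) for _, n, b, d in loop)

def cpu_load(clocks, sample_rate=22050, tracks=4, clock_hz=8_000_000):
    """Fraction of the CPU spent mixing `tracks` tracks."""
    return clocks * sample_rate * tracks / clock_hz
```

With these figures `clocks_per_sample()` comes to 32 against a nominal total of 23, and `cpu_load(32)` is about 0.35, matching the 35% quoted above.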
It's worth thinking about the fifth point a bit more. In this code, both the volume adjustment tables and the sample buffer use the same segment (ES). How does this work? There are only 64 volume levels and 256 source sample values, so the volume tables need 16kB. The sample buffer sits at the start of an allocated segment P and takes 370 bytes, rounded up to 512; the volume tables follow from offset 512 (512/16 = 32 paragraphs in). On entry to the routine, ES=P and BX=512+VolumeLevel*256, i.e. BH=2+VolumeLevel, with BL then free to receive the sample value.
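The shared-segment layout can be sanity-checked with a few lines of Python. This is my reading of the scheme rather than code from the tracker: the buffer occupies offsets 0 to 511 of the segment, table N starts at 512+N*256, and the constant and function names are invented for the example.

```python
BUFFER_BYTES = 512    # 370-byte sample buffer rounded up, per the post
TABLE_BYTES = 256     # one entry per possible 8-bit source sample
VOLUME_LEVELS = 64

def volume_table_bh(level):
    """Value to load into BH so that ES:BX points into table `level`."""
    if not 0 <= level < VOLUME_LEVELS:
        raise ValueError("volume level out of range")
    return (BUFFER_BYTES + level * TABLE_BYTES) >> 8

def lookup_offset(level, sample):
    """Offset within ES that `mov bl,es:[bx]` reads once BL holds the sample."""
    return (volume_table_bh(level) << 8) | (sample & 0xFF)
```

The highest offset used is `lookup_offset(63, 255)`, which is 16895, so the buffer and all 64 tables fit comfortably inside one 64kB segment. Because every table is page-aligned, selecting a volume is just a matter of loading BH.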
Similarly, if a given waveform starts at 32-bit address W and we are M samples into the waveform, then at the start DS=(W+M)>>4 and SI=(W+M)&15. This ensures that 0<=SI<16 on entry, so the routine can generate a full buffer's worth of samples by the end of the tick even though a full waveform can be well over 64kB (and over 128kB including the repeat sections).
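The DS:SI renormalisation is easy to check in Python. A minimal sketch, assuming W+M is a real-mode linear address below 1MB; `fits_in_tick` and its `max_step` parameter (whole source samples consumed per output sample) are hypothetical helpers added only to illustrate the headroom argument.

```python
def normalise(linear_address):
    """Split a real-mode linear address into (segment, offset), offset < 16."""
    return linear_address >> 4, linear_address & 0xF

def physical(segment, offset):
    """Real-mode address formed from segment:offset (20-bit wrap)."""
    return ((segment << 4) + offset) & 0xFFFFF

def fits_in_tick(offset, samples_per_tick, max_step):
    """True if SI cannot run past 0xFFFF before the tick's buffer is full."""
    return offset + samples_per_tick * max_step <= 0xFFFF
```

For example, `normalise(0x12345)` gives segment 0x1234 and offset 5, and `physical` maps that pair back to 0x12345. Starting from an offset of at most 15, a 370-sample buffer leaves SI plenty of room even at steps of several source samples per output sample.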
We can compare this with a PC of a similar era, using 8086 or 8088 timings. Here, EA calculations take 5 cycles for [SI] and [BX]. Reg,Reg and INC reg operations take 3 cycles; ADD [DI],BL is 24; MOV BL,[SI] is 12+EA; and an ES override is 2 cycles. So we have 3+3+(12+5)+(12+5+2)+(24+5+2)+3 = 76 cycles per loop. At 22050Hz that's about 1.68M cycles per second per track, roughly 2.4x slower than the 80286. An 8MHz 8086 PC (such as the Olivetti M24) could perhaps just manage 4 tracks at 84% of CPU, assuming that the rest of the playback engine could run in 15% of CPU.
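The same sum in Python, for anyone who wants to play with the numbers. The per-instruction figures are the ones quoted above (I haven't re-derived them from a data sheet), 32 is the 80286 estimate from earlier in the post, and the names are only illustrative.

```python
EA = 5            # effective-address cost for [SI], [BX] and [DI]
SEG_OVERRIDE = 2  # extra clocks for the ES: prefix

LOOP_8086 = {
    "add ax,cx":      3,
    "adc si,dx":      3,
    "mov bl,[si]":    12 + EA,
    "mov bl,es:[bx]": 12 + EA + SEG_OVERRIDE,
    "add es:[di],bl": 24 + EA + SEG_OVERRIDE,
    "inc di":         3,
}

CLOCKS_8086 = sum(LOOP_8086.values())  # clocks per sample per track
CLOCKS_80286 = 32                      # bus-bound estimate from the listing

def cycles_per_second(clocks, sample_rate=22050, tracks=1):
    """CPU clocks consumed per second of audio."""
    return clocks * sample_rate * tracks

def cpu_load(clocks, clock_hz=8_000_000, sample_rate=22050, tracks=4):
    """Fraction of the CPU spent mixing `tracks` tracks."""
    return cycles_per_second(clocks, sample_rate, tracks) / clock_hz
```

`CLOCKS_8086` comes to 76, one track costs 1,675,800 clocks per second, and `cpu_load(CLOCKS_8086)` is about 0.84 at 8MHz. The ratio `CLOCKS_8086 / CLOCKS_80286` is 2.375.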
Conclusion
Motorola 68000 fans (like me) deride (and derided) the 16-bit PC and PC/AT architectures of the day, but Intel was able to make up for their deficiencies in many practical applications. In this case, that's because it's surprisingly easy to work around segment limitations.
Anyway, back to getting the MOD Tracker to not crash loading arbitrary .MOD files!