MODTracker audio replay on early 68k macs

Snial

Well-known member
Yup, was thinking that one could have a VBL interrupt routine solely used for counting tick samples, ticking the replayer after N samples (for supporting MOD BPM), mixing audio, and transferring the block to the correct location in the hardware audio buffer.
As @MIST says, it'd be really helpful to look at the published code and read the previous postings. Firstly, I think the 'tick' or 'ticking' terminology is a bit unhelpful, because it appears to refer to an interrupt period, but in fact there's only ever one interrupt and it's the VBL one, which is at 60.15Hz.

The VBL interrupt always needs to copy the whole of a pre-prepared 370-byte buffer, containing the mix-down of the 4 channels, to the hardware audio buffer, because hardware audio buffer samples are synchronous with the video scan line (1 sample per scan line, i.e. per 45µs). So, unlike on an Amiga, it isn't possible to copy into the "right place" in the hardware buffer, because you'd have to copy into the right place at the right time too. It's much like 'racing the beam' on an Atari 2600, except that you're racing the beam to generate audio samples rather than video. Instead, the VBL copies the next 370 bytes of generated samples (the pre-prepared buffer) to the hardware audio buffer and then calculates the mix-down of the next 370 samples. It needs to do that faster than the beam, because otherwise there's no CPU left. The way it does it is to calculate 370 samples for the first two channels, then mix-in 370 samples for the last two channels as that's the optimum given the number of registers on the 68000.
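As a rough illustration of the copy step only (a C sketch, not the actual assembly): it assumes the hardware buffer holds one sound byte followed by one disk PWM byte in each 16-bit word, which is the layout implied by the "skip the disk PWM byte" remark elsewhere in this thread. The names CopyFrameToHw and kFrameSamples are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kFrameSamples = 370 };  /* samples per VBL, one per scan line */

/* Copy one pre-mixed frame into an interleaved hardware-style buffer.
   Assumption: each 16-bit word is a sound byte followed by a disk PWM
   byte that must be left alone, so only even byte offsets are written. */
void CopyFrameToHw(volatile uint8_t *hwBuf, const uint8_t *mixBuf)
{
    assert(hwBuf != NULL && mixBuf != NULL);
    for (size_t i = 0; i < kFrameSamples; i++) {
        hwBuf[2 * i] = mixBuf[i];
    }
}
```

The whole frame is always copied in one go, which matches the point above that there is no "right place" to write into part-way through a frame.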

What is possible is to have an algorithm that schedules the Replayer which looks like this:

C:
uint16_t GenFrameSamples(uint16_t aStepSamples)
{
    uint16_t tickSamples=370;
    while(tickSamples>0) { // tickSamples is unsigned, so >=0 would never terminate.
        uint16_t clipSamples;
        if(aStepSamples>tickSamples) {
            clipSamples=tickSamples;
        } else {
            clipSamples=aStepSamples;
        }
        GenSamples(clipSamples); // actually vectors into an unrolled loop to gen clipSamples samples.
        tickSamples-=clipSamples; // play up to 370 samples this tick.
        aStepSamples-=clipSamples; // and recalc after aStepSamples
        if(aStepSamples==0) { // no more Step Samples, so
            aStepSamples=Replayer(); // calc next pattern step.
        }
    }
    return aStepSamples;
}

So, Replayer() is scheduled within a tick for the right number of samples in a given step. Effectively, this is what @MIST 's code does now, except aStepSamples is always 444.
 

MIST

Well-known member
The way it does it is to calculate 370 samples for the first two channels, then mix-in 370 samples for the last two channels as that's the optimum given the number of registers on the 68000.
This routine originates from the Atari STE which has stereo capabilities. That may also be the reason why the four channels are being computed in pairs. I am now scaling and summing the result into one channel. I might save a few microseconds by scaling the samples beforehand and not having to scale while summing them. Also, the Mac has two hardware audio buffers, and the audio could be rendered directly into them, saving the entire initial buffer-to-hardware copy. But I don't know how to properly allocate that second buffer. Its memory seems to be in use by MacOS as regular RAM, and using it for audio interferes with that.

But it works as it is and there's currently no need for such optimizations as there's nothing else to be computed while audio is playing. That may change if this is being used in a game or with some eye candy for audio visualization or the like.
 

Snial

Well-known member
This routine originates from the Atari STE which has stereo capabilities.
I thought it was from the Amiga which had 4 audio channels: "The Amiga has four audio channels. Channels 0 and 3 are connected to the left-side stereo output jack. Channels 1 and 2 are connected to the right-side output jack. Select a channel on the side from which the output is to appear."


I guess the STE code is trying to emulate that.
That may also be the reason why the four channels are being computed in pairs.
It's also the maximum number of channels that can be computed given the 8 available data registers and 5 or so address registers. Well, that's not quite true. I think I can work out a routine which can manage 3 voices in one pass, but it's slower than the Wiz code.

I am now scaling and summing the result into one channel.
OK
I might save a few microseconds by scaling the samples beforehand and not having to scale while summing them.
Volume levels can be adjusted on the fly using the Effects bits in each channel data per pattern step, so I think it does need some scaling. What is the case is that samples at full volume must be in the range -64 to 63 (a 7-bit range) rather than -128 to 127 (an 8-bit range), because when the code adds both channels you'll get a 9-bit result, which would cause glitches whenever they had an 8-bit overflow.

Starting with 7-bit sample values, then adding and shifting, means you have an effective range of -32 to 31. However, it's still not quite the same as pre-shifting the volume-adjusted values into a -32 to 31 range, because the sum of two 6-bit values will have more noise. For example, imagine the two initial 7-bit samples were 1 and 1. If you add, then >>1, you get 1. But the 6-bit versions would be 0 and 0, so if you add, you get 0.
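The 1-and-1 example can be checked with a small C sketch. The function names are invented for illustration; the real mixer does this with volume tables in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 7-bit samples (-64..63) first, then halve the sum.
   Note: >> on a negative int is implementation-defined in C; on the
   usual compilers it is an arithmetic shift, like ASR on the 68000. */
int8_t MixThenShift(int8_t a, int8_t b)
{
    assert(a >= -64 && a <= 63 && b >= -64 && b <= 63);
    return (int8_t)((a + b) >> 1);
}

/* Halve each sample to 6 bits first, then add. The low bit of each
   input is thrown away before the sum, which is the extra noise. */
int8_t ShiftThenMix(int8_t a, int8_t b)
{
    assert(a >= -64 && a <= 63 && b >= -64 && b <= 63);
    return (int8_t)((a >> 1) + (b >> 1));
}
```

With inputs 1 and 1 the first form gives 1 and the second gives 0, so pre-shifted samples lose up to one step of resolution per pair.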

Also, the Mac has two hardware audio buffers, and the audio could be rendered directly into them, saving the entire initial buffer-to-hardware copy. But I don't know how to properly allocate that second buffer. Its memory seems to be in use by MacOS as regular RAM, and using it for audio interferes with that.
I suspect it would make things more awkward and hardly faster (maybe 0.5%). To save time you'd have to compute mixed values into the alternative buffer directly. I don't think you'd be able to use movep, you'd have to use a move.b samp,(a6)+ or add.b samp,(a6)+ followed by an addq.w #1,a6 to skip the disk PWM byte. This would add: 8*2*370=5,920 cycles per frame. You'd save 4618 cycles from not having to do the movep code, so you'd lose more than you'd gain.
But it works as it is and there's currently no need for such optimizations as there's nothing else to be computed while audio is playing. That may change if this is being used in a game or with some eye candy for audio visualization or the like.
I can't yet improve on the performance of the Wiz code.
 

MIST

Well-known member
Here's a version that behaves by default like the previous one and plays axel-f. But if you put a "song.mod" into the image then it will play that instead.
 

Attachments

  • NanoMacTracker.zip
    68.7 KB · Views: 4

Snial

Well-known member
Onto the data structures.

Part info starts with:

Code:
count:    DC.W PARTS
    
wiz1lc:    DC.L sample1 #I think "lc" means "Location".
wiz1len:DC.L 0 #Actually end address for wiz1.
wiz1rpt:DC.W 0
wiz1pos:DC.W 0
wiz1frc:DC.W 0

#The same structure repeated for wiz2.. wiz4.

aud1lc:    DC.L dummy #I think this means Audio 1 Location.
aud1len:DC.W 0
aud1per:DC.W 0 #Current period modified by portamento.
aud1vol:DC.W 0 #Fine tune at [8].b and Current Volume at [9].b modified by volSlide & volDown.
    DS.W 3 #6 bytes of space

#The same structure repeated for aud2..aud4.
/* one 8 bit mono buffer for Mac */
samp1:    DC.L sample1
sample1:DS.B LEN

#Sometime later in the code:
voice1:
    # Pattern Step at [0].L; 20B Effect at [2]<11:0>, SampleStart at [4], TotalLength(words) at [8],  Repeat Addr at [10].L
    #Repeat Len at [14].w, CurrPitch at [16].w,
    # FineTune at [18].b, CurrVol at [18].w[.b?]. Yet also, repeat end at [8].w
    DS.W 10
    DC.W 0x01
    DS.W 3 #6b Vibrato amp at [4].b<3:0>; phase at [5].b.
#Repeated for voice2..voice4.

Pattern data is at offset 1084 from the beginning of a MOD song in the Wiz player, which means it expects a 31 instrument format (rather than the older 15 instrument format).
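The 1084 figure follows from the standard MOD header layout (20-byte title, 30 bytes per sample header, song length and restart bytes, a 128-byte order table, and a 4-byte signature such as "M.K." in the 31-instrument format). A small C sketch of that arithmetic, with made-up names:

```c
#include <assert.h>
#include <stdint.h>

enum {
    kTitleBytes = 20,
    kSampleHeaderBytes = 30,       /* name, length, finetune, volume, loop */
    kSongLenAndRestartBytes = 2,
    kOrderTableBytes = 128,
    kSignatureBytes = 4            /* e.g. "M.K.", 31-instrument files only */
};

/* Byte offset of the first pattern for a 15- or 31-instrument MOD. */
uint32_t ModPatternOffset(unsigned instruments)
{
    uint32_t off;
    assert(instruments == 15 || instruments == 31);
    off = kTitleBytes + instruments * kSampleHeaderBytes
        + kSongLenAndRestartBytes + kOrderTableBytes;
    if (instruments == 31) {
        off += kSignatureBytes;
    }
    return off;
}
```

For 31 instruments this gives 1084; for the older 15-instrument format it gives 600, so a player hard-wired to 1084 would read pattern data from the wrong place in such files.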

On each frame, music increments the counter and if it's less than the 'speed', it chooses the same audio data pointers as before (nonew); else it clears the counter and chooses new audio data (getnew).

nonew sets a4=voice1..4(pc); and a3=aud1..4lc(pc), and calls checkcom for each voice. Checkcom tests the bottom 12-bits of the word at 02(a4), that is, it processes the effects: arpeggio; portup, portdown & myport (portamento), vib (vibrato); port_tones slide, vib_tones slide and volslide. I'm not very interested in the details of these, except to note the memory locations they affect (which I'm noting in the code section above).

getnew also sets a4=voice1..4(pc); and a3=aud1..4lc(pc), but calls playvoice for each voice. The current pattern step is copied into 0(a4).L and then the other fields in voice1 are calculated.
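For readers who think better in C, here is a hedged sketch of the two ideas above: the speed counter that chooses between nonew and getnew, and how the 4-byte pattern step splits into fields (the bottom 12 bits of the second word being the effect). The struct and function names are hypothetical and are not taken from the Wiz source.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t  sample;  /* sample number, 0 = none */
    uint16_t period;  /* 12-bit period */
    uint8_t  effect;  /* effect number 0x0..0xF */
    uint8_t  param;   /* effect parameter byte */
} Cell;

/* Split one 4-byte pattern step (standard MOD packing) into fields. */
Cell DecodeCell(const uint8_t b[4])
{
    Cell c;
    c.sample = (uint8_t)((b[0] & 0xF0) | (b[2] >> 4));
    c.period = (uint16_t)(((b[0] & 0x0F) << 8) | b[1]);
    c.effect = (uint8_t)(b[2] & 0x0F);
    c.param  = b[3];
    return c;
}

/* Returns 1 when a new row is due (the getnew path), else 0 (nonew). */
int TickNeedsNewRow(uint8_t *counter, uint8_t speed)
{
    assert(counter != 0 && speed > 0);
    if (++*counter < speed) {
        return 0;
    }
    *counter = 0;
    return 1;
}
```

With speed 6, five calls in a row take the nonew path and the sixth takes getnew.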

One of the things that keeps puzzling me about the code is that sample lengths are represented in words and should be multiplied by 2 to give the number of bytes, but in the sample-generation code, the position is represented by a 16-bit byte offset, which limits sample lengths to 64kB regardless of whether the code multiplies lengths by 2 or not.

Yet the code seems to confuse word lengths for repeats with byte lengths. Thus, for example:

Code:
playvoice:
    #(loop handling code, d3=repeat offset in words.)
    move.l    0x04(%a4),%d2 #Sample start.
    add.w    %d3,%d3    #Repeat offset *2 (words => bytes)
    add.l    %d3,%d2    #+Sample start.
    move.l    %d2,0x0A(%a4)    #Repeat address.
    move.w    0x04(%a2,%d4.l),%d0 #Repeat offset (words).
    add.w    0x06(%a2,%d4.l),%d0 #+Length of sample repeat.
    move.w    %d0,8(%a4)            #End of repeat.
    move.w    0x06(%a2,%d4.l),0x0E(%a4) #Length of repeat.

The byte offset for the repeat offset in words is doubled, but it's still 16-bits. The end of the sample repeat is calculated in words 4(a2,d4.l)+6(a2,d4.l) and stored in the total length offset in voice. It would be possible to keep the main playback code if the wizxlc addresses were updated to represent the current absolute address of the sample on each sample generation, but that doesn't happen. As far as I can tell, the wizxlc addresses are only ever copied from audxlc.

Some of these word/byte confusions probably explain why not all supposedly suitable MOD files play correctly on this tracker.
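To make the word/byte issue concrete, here is a C sketch (an assumption about how a fix could look, not what the Wiz code does) that converts the MOD header's repeat offset and repeat length from words to byte offsets using 32-bit arithmetic throughout, next to a helper that mimics the 16-bit doubling done by add.w.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t loopStart;  /* byte offset from sample start */
    uint32_t loopEnd;    /* byte offset one past the loop */
} LoopBytes;

/* Mimics "add.w d3,d3": doubling in 16 bits wraps above 0x7FFF words. */
uint16_t DoubleWords16(uint16_t words)
{
    return (uint16_t)(words + words);
}

/* Convert repeat offset/length (both in words) to byte offsets,
   widening to 32 bits before multiplying so nothing wraps. */
LoopBytes LoopInBytes(uint16_t repOffsetWords, uint16_t repLenWords)
{
    LoopBytes lb;
    lb.loopStart = 2u * (uint32_t)repOffsetWords;
    lb.loopEnd   = lb.loopStart + 2u * (uint32_t)repLenWords;
    assert(lb.loopEnd >= lb.loopStart);
    return lb;
}
```

A repeat offset of 0x9000 words is 0x12000 bytes, but the 16-bit doubling yields 0x2000, which is the kind of mismatch that would make long looped samples play from the wrong place.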

In many respects, it would be better to keep the central Sample generation code as it is, but rewrite the rest in 'C' (though still within the VBL). It won't matter that 'C' is slower than assembler, because the sequencer code is at least an order of magnitude less work than sample generation. It would also make it far easier to integrate with a classic Macintosh application.
 

MIST

Well-known member
If you start commenting the code, then I'd ask you to do this in the repository, so your comments don't get lost.

I've just tried to make this into a more "macos" like application. It actually works even with the interrupts still enabled. And I've added the musoff function, so the player can now be exited and returns to MacOS.
 

Snial

Well-known member
<snip> sample generation for 74 samples and six times per VBL <snip>
OK, I've finally had a look at the newer code and I can see how you're splitting it up [I've responded to your latest comment a bit below]. One of the upshots of splitting the "Sample" routine into 74 samples at a time (for 4 channels) is that looping is more finely grained. On the old code it would only ever loop waveforms every 16.7ms (or 33ms 1/6 of the time), but on the new code it'll loop samples within 3.3ms of the correct period, which is defo a good thing.

Nevertheless, you could generate the same effect without repeating the calls for each pass:

Code:
 /* from line 349 in your current revision */
    /* count repeatedly runs from 5 down to 0 */
    move.w count,%d0
    subq.w #5,%d0 #count-5
    add.w %d0,%d0 #count*2-10
    jsr CalcdSamps-.+4(%pc,%d0.w)
    subq.w #1,count #dec count and test.
    bmi.s skpMus #count was 0, so no music & reset.
    bsr music
    moveq #-1,%d0
    sub.w count,%d0 #-1-count
    add.w %d0,%d0
    jsr CalcdSamps-.+4(%pc,%d0.w)
    bra.s    nomus #same as s_done
skpMus:
    move.w #PARTS,count #back to 5.
 
nomus:
    movem.l    (%sp)+,%d0-%a6
    rts

    bsr.s sample    #1st pass count=0, 2nd pass count=5
    bsr.s sample    #1st pass count=1, 2nd pass count=4
    bsr.s sample    #1st pass count=2, 2nd pass count=3
    bsr.s sample    #1st pass count=3, 2nd pass count=2
    bra.s sample    #1st pass count=4, 2nd pass count=1
CalcdSamps:         #Label at end of subroutine.
    rts    #1st pass count=5, 2nd pass count=0

sample:
    #as before.

This is an assembly version of the kind of thing I was talking about earlier. It's still position-independent despite the JSR, because it's a PC-relative JSR. In every sample generation phase we need a set of bsr.s sample calls, followed by an optional bsr music, then the remaining set of bsr.s sample calls. We can still do this using an unrolled loop by generating the full unrolled loop and calculating backwards from CalcdSamps to where we need to start. By placing an RTS at the end, we can deal with the cases where no calls to sample are needed (which is the first case when count=5 on entry and the second case when count was 0 on entry).

Note, also, the code doesn't pre-decrement count. This is because we need a code path for all 6 states, which is what your code does too, but you don't label the first path (the one before nr0:).

This code might be slightly slower, because of the calculations, but it's more deterministic, because it doesn't have to perform multiple compares and branches to test for each pass. For a more finely-grained set of calls to sample, this approach would be faster.
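The same idea can be written in C as a switch that falls through an unrolled run of calls, which is the closest C analogue of jumping backwards from CalcdSamps. This is only a sketch of the control flow: Sample and Music are stand-ins that count calls, and the global names are invented.

```c
#include <assert.h>

enum { kParts = 5 };

int gSampleCalls;       /* how many 74-sample passes have run */
int gMusicCalls;        /* how many replayer calls have run */
int gCount = kParts;    /* plays the role of "count" in the assembly */

void Sample(void) { gSampleCalls++; }
void Music(void)  { gMusicCalls++; }

/* Run Sample() n times (0..5) by entering an unrolled sequence part-way. */
void RunSamples(int n)
{
    assert(n >= 0 && n <= kParts);
    switch (n) {
    case 5: Sample(); /* fall through */
    case 4: Sample(); /* fall through */
    case 3: Sample(); /* fall through */
    case 2: Sample(); /* fall through */
    case 1: Sample(); /* fall through */
    case 0: break;
    }
}

/* One VBL: some passes, an optional replayer call, the remaining passes. */
void Frame(void)
{
    int count = gCount;
    RunSamples(kParts - count);
    if (count == 0) {       /* sixth frame: no replayer call, reset */
        gCount = kParts;
        return;
    }
    gCount = count - 1;
    Music();
    RunSamples(count);
}
```

Over six frames this makes 30 sample passes and 5 replayer calls, i.e. 444 samples per replayer call, with a single copy of the call sequence.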

I still haven't got Retro68 working on my MacBook M2 (https://68kmla.org/bb/index.php?threads/retro68-build-issues.50537) @bribri has kindly provided some help, but I haven't found time to follow up on it yet.

If you start commenting the code, then I'd ask you to do this in the repository, so your comments don't get lost.
OK. I don't think I have access to your repo, but I'm: https://github.com/Snial
I've just tried to make this into a more "macos" like application. It actually works even with the interrupts still enabled. And I've added the musoff function, so the player can now be exited and returns to MacOS.
OK, that's certainly a big improvement - no need to reset the Mac on each playback!
 

MIST

Well-known member
This is an assembly version of the kind of thing I was talking about earlier.
I just don't see the point of doing this. I've never felt the urge to be able to adjust playback speed at all. And it sounds pretty good the way it is.
OK. I don't think I have access to your repo, but I'm: https://github.com/Snial
You can submit a PR from a fork of yours.
OK, that's certainly a big improvement - no need to reset the Mac on each playback!
It's not perfect. The mouse is sluggish at best while the player runs, and I was unable to set the StaticText's title properly, which is why the song title does not display properly. But it's a start.
 

Snial

Well-known member
I just don't see the point of doing this. I've never felt the urge to be able to adjust playback speed at all. And it sounds pretty good the way it is.
It wouldn't affect the playback speed, it just means code doesn't get duplicated.

One thing I was concerned about, and I think I still am, is the fact that the position: wizxpos is a 16-bit byte offset, so it looks like the sample code can't properly handle waveforms larger than 64kB (128kB is the defined maximum). Instead they're wrapped back to 0. In addition, the sample routine can process up to 73 samples after the end of a waveform. This means that in theory, it can play into the next waveform and I believe that would cause glitches.

The way I'd partially tackle this is to move wizxlc on by wizxpos bytes after every call to sample and then clear wizxpos, so that wizxpos never exceeds 64kB. It slightly complicates the loop code, but not by much. It means this kind of code:

Code:
    cmp.l    wiz2len(%pc),%d0
    blt.s    ok2
    sub.w    wiz2rpt(%pc),%d0

ok2:
    move.w    %d0,wiz2pos
    move.w    %d1,wiz2frc

Becomes:

Code:
    move.w    %d1,wiz2frc            #Always update the fraction.
    add.l %d0,%a0                #Move waveform pointer
    move.l %a0,%a2                #(don't need a2 now)
    suba.l aud2lc,%a2            #actual length so far..
  
    cmpa.l    wiz2len(%pc),%a2    #has it passed the loop length?
    blt.s    ok2
    suba.w    wiz2rpt(%pc),%a0    #adjust waveform pointer by -rpt len.
                                #Note: repLen still limited to 64kB.

ok2:
    move.l %a0,wiz2lc        #save waveform pointer (wiz2pos always 0)

Even though waveforms can be longer than 64kB now and the loop start + loop length can be >64kB, loop lengths are still limited to 64kB. Fixing that requires calculation changes elsewhere and wizxrpt to become a long. OTOH, wizxpos is no longer needed, so the wiz data structures are still the same size.
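Here is the same change expressed as a C sketch, using 32-bit byte offsets in place of the address registers. The Voice struct and its field names are hypothetical; the comments say which assembly variable each one stands in for.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: a 16-bit byte position silently wraps at 64kB. */
uint16_t AdvancePos16(uint16_t pos, uint16_t step)
{
    return (uint16_t)(pos + step);
}

typedef struct {
    uint32_t start;  /* sample start (stands in for audxlc) */
    uint32_t cur;    /* current position (stands in for wizxlc) */
    uint32_t len;    /* bytes from start at which to loop (wizxlen) */
    uint16_t rpt;    /* loop length in bytes, still 16-bit (wizxrpt) */
} Voice;

/* New scheme: advance the 32-bit position by the bytes just consumed,
   then step back by the loop length once the end has been passed. */
void AdvanceVoice(Voice *v, uint16_t consumed)
{
    assert(v != 0);
    v->cur += consumed;
    if (v->cur - v->start >= v->len) {
        v->cur -= v->rpt;
    }
}
```

The position no longer wraps for samples above 64kB, while the loop length stays limited by the 16-bit rpt field, as noted above.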

You can submit a PR from a fork of yours.
OK. That means I need to go back to working on getting Retro68 to compile on my MacBook M2.
It's not perfect. The mouse is sluggish at best while the player runs, and I was unable to set the StaticText's title properly, which is why the song title does not display properly. But it's a start.
Indeed, I'd expect that. The Mac is down to 20% of performance, maybe less. It'd be useful to add some profiling code to the main application's loop to figure out how much.
 

MIST

Well-known member
Imho it makes sense to optimize the loops/unrolled loops. I just improved the hardware copy loop somewhat, and the unrolled generator loops also look like they could be optimized, as there are data registers updated there which are not used inside the loop and could likely be processed once outside the loop. I'll have a look at that tomorrow.

And yes. You should get retro68 running. It's just too odd that it's easier to do development for classic Macs on a Linux PC than it is on a recent Mac.
 

MIST

Well-known member
Indeed, I'd expect that. The Mac is down to 20% of performance, maybe less. It'd be useful to add some profiling code to the main application's loop to figure out how much.
I can do basic profiling inside the hardware simulation, and the MOD replay actually uses over 90% of the CPU. But hey, that means that this is actually something that really shows what the old machines can do and where their limits are.
 

8bitbubsy

Well-known member
So I checked the code, and as expected I couldn't find any BPM handling! I'll try one last time to explain what is missing in this player, which is really important for a huge amount of MODs to play at the correct speed:

Most MODs (which means ProTracker MODs) have two values that control how fast the song is played. First is called "speed". This controls the amount of replayer ticks per song row, i.e. how many replayer ticks before you advance to the next song row. This is handled fine in the player already (range F00..F1F for the Fxx command).

The second value is tempo (BPM). This is adjusted with command Fxx in the range F20..FFF. F20 = BPM 32, FFF = BPM 255. This controls how often the replayer ticks, i.e. the duration of a single replayer tick. BPM should be initialized to 125 on MOD load, which is the default value. It translates to a tick rate of 50Hz.

BPM formulas:
tickRateInHz = (BPM * 50) / 125
tickDurationInMillisecs = 2500 / BPM

This is what I meant by setting a samplesPerTick variable and counting it down to zero before ticking the player. Sorry for all the confusion, I'm not trying to be arrogant, but not supporting MOD BPM is a huge issue, and it should be simple to implement (if not, then sorry for all the nagging!). It's used in thousands of MODs, even on the Amiga where it all came from. :)
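A possible way to turn the BPM formula into a samplesPerTick value, sketched in C. It assumes the output rate discussed earlier in this thread (370 samples per VBL at 60.15Hz, so about 22255.5Hz) and uses 16.16 fixed point so the fractional part is carried from tick to tick. The constant and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* samplesPerTick = rate * 2.5 / BPM. With rate = 370 * 60.15 = 22255.5 Hz
   the numerator is 55638.75, stored here as 16.16 fixed point
   (55638.75 * 65536 = 3646341120, which still fits in 32 bits). */
#define kSamplesPerTickNumer 3646341120u

/* Samples per replayer tick as a 16.16 fixed-point value. */
uint32_t SamplesPerTickFixed(uint8_t bpm)
{
    assert(bpm >= 32);   /* Fxx values below 0x20 set speed, not BPM */
    return kSamplesPerTickNumer / bpm;
}

/* Whole samples for the next tick; *frac keeps the leftover fraction. */
uint16_t NextTickSamples(uint8_t bpm, uint16_t *frac)
{
    uint32_t total = SamplesPerTickFixed(bpm) + *frac;
    *frac = (uint16_t)(total & 0xFFFFu);
    return (uint16_t)(total >> 16);
}
```

At the default BPM of 125 this yields 445 samples for most ticks, with an occasional 446 as the fraction accumulates, and that count is what gets counted down before ticking the replayer.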
 

Snial

Well-known member
<snip> "speed". This controls the amount of replayer ticks per song row
You mean how many replayer ticks per pattern row? (where a song row is 64 pattern rows)?
The second value is tempo (BPM).<snip> how often the replayer ticks, i.e. the duration of a single replayer tick: tickDurationInMillisecs = 2500 / BPM

<snip> I'm not trying to be arrogant <snip>
I didn't think you were trying to be arrogant, I was just trying to reconcile the fixed VBL tick rate with your description of the replayer tick rate.

From my current reading then.. I've just had an a-ha moment! So, if I understand you correctly, the duration of a pattern row is tickDurationInMillisecs x ticksPerPatternRow?

But also, because BPM can vary all the way from 32 to 255, with a default of 125 (tickDurationInMillisecs = 2500/125 = 20ms), the technique of replicating code to perform 5 calls to Sample (74 samples each) per call to the replayer can't work if BPM is supported. Instead, the computed jump technique has to be used, because there could be a wide range of samples per replayer tick, and that has to fit in with the 370 samples per VBL tick.

Is that correct then?
 

8bitbubsy

Well-known member
You mean how many replayer ticks per pattern row? (where a song row is 64 pattern rows)?

I didn't think you were trying to be arrogant, I was just trying to reconcile the fixed VBL tick rate with your description of the replayer tick rate.

From my current reading then.. I've just had an a-ha moment! So, if I understand you correctly, the duration of a pattern row is tickDurationInMillisecs x ticksPerPatternRow?

But also, because BPM can vary all the way from 32 to 255, with a default of 125 (tickDurationInMillisecs = 2500/125 = 20ms), the technique of replicating code to perform 5 calls to Sample (74 samples each) per call to the replayer can't work if BPM is supported. Instead, the computed jump technique has to be used, because there could be a wide range of samples per replayer tick, and that has to fit in with the 370 samples per VBL tick.

Is that correct then?
Yup and yes. So this poses a challenge then, since the number of output samples per tick is variable when supporting BPM. I still think it should be possible to do it, but it may require more logic in the VBL interrupt. BPM is really important though, so if some performance can be sacrificed to support it, then it should be well worth it (in my opinion).

I actually don't understand why you are explaining this in such detail instead of just implementing it.
Because I lack the needed knowledge to implement it for a player like this.
 

MIST

Well-known member
Because I lack the needed knowledge to implement it for a player like this.
That's actually rather simple. There are plenty of books online available and software like retro68 comes free of charge with everything including several examples. It's a matter of minutes for a complete apple noob like me to get a demo compiled and running.
 

Snial

Well-known member
Yup and yes. So this poses a challenge then, since the number of output samples per tick is variable when supporting BPM. I still think it should be possible to do it, but it may require more logic in the VBL interrupt. BPM is really important though, so if some performance can be sacrificed to support it, then it should be well worth it (in my opinion).
We shouldn't lose any performance!
Because I lack the needed knowledge to implement it for a player like this.
It's OK, the code snippets I've written earlier can implement it at the output stage. It's a version of this algorithm, from this comment.

C:
uint16_t GenFrameSamples(uint16_t aStepSamples)
{
    uint16_t tickSamples=370;
    while(tickSamples>0) { // tickSamples is unsigned, so >=0 would never terminate.
        uint16_t clipSamples;
        if(aStepSamples>tickSamples) {
            clipSamples=tickSamples;
        } else {
            clipSamples=aStepSamples;
        }
        GenSamples(clipSamples); // actually vectors into an unrolled loop to gen clipSamples samples.
        tickSamples-=clipSamples; // play up to 370 samples this tick.
        aStepSamples-=clipSamples; // and recalc after aStepSamples
        if(aStepSamples==0) { // no more Step Samples, so
            aStepSamples=Replayer(); // calc next pattern step.
        }
    }
    return aStepSamples;
}

aStepSamples is the number of samples per row (tickDurationInMillisecs x ticksPerPatternRow), but at the compact Mac sample rate, which is 370*60.15/1000≈22.26 samples per millisecond.

So, to implement it, we need to:
  1. Substitute that algorithm for the code in stereo and sample, but converted into assembly. The core logic in sample would remain, especially the bit @MIST has improved, which eliminated the LSR.B and EOR.B, because the volume tables incorporate both the shifting and the sign fix now.
  2. Add the handler for BPM. BPM and ticks per row need to generate aStepSamples, with fractional calculations too (probably). Only the whole number part is passed to the function. aStepSamples could be a global, since stereo has to pick it up from some state that's held across VBL frames (i.e. VBL interrupts).
  3. Correct the waveform length, the wizxlc waveform start and the corresponding loop start and length, as per my earlier comments. There doesn't seem to be consistency when calculating the lengths and starts in the replayer code (sometimes 32-bit calculations are used, sometimes 16-bit calculations).
We know that we're currently using 90% of CPU on a Mac Plus (it'd be slightly less on a Mac SE and Classic). The main .rept LEN is now 88c*2=176 cycles per sample, which amounts to 176*370*60=3,907,200 cycles per second or 60% of CPU by itself, so we might be able to improve things a bit.
 

Snial

Well-known member
@MIST , @8bitbubsy . In case you're not convinced the algorithm above works, we can write a simple 'C' program to test it:

C:
#include <stdio.h>
#include <stdint.h>

// Or whatever TicksPerMs*TicksPerRow*SamplesPerMsIs
#define kSamplesPerVbl 444

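void GenSamples(uint16_t aClipSamples)
{
    printf("\tSamplesChans23: # %d samples\n", aClipSamples);
    printf("\tSamplesChans01: # %d samples\n", aClipSamples);
}

uint16_t Replayer(void)
{
    printf("\tMusic:\n");
    return kSamplesPerVbl; // In reality could be modified by player.
}

uint16_t GenFrameSamples(uint16_t aStepSamples)
{
    uint16_t tickSamples=370;
    while(tickSamples>0) { // unsigned, so test >0 rather than >=0.
        uint16_t clipSamples;
        if(aStepSamples>tickSamples) {
            clipSamples=tickSamples;
        } else {
            clipSamples=aStepSamples;
        }
        GenSamples(clipSamples);
        tickSamples-=clipSamples; // play up to 370 samples this tick.
        aStepSamples-=clipSamples; // and recalc after aStepSamples
        if(aStepSamples==0) { // no more Step Samples, so
            aStepSamples=Replayer(); // calc next pattern step.
        }
    }
    return aStepSamples;
}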

int main(void)
{
    int tick, clipSamples=kSamplesPerVbl;
    for(tick=0; tick<30; tick++) {
        printf("VBL: #%d\n", tick);
        clipSamples=GenFrameSamples(clipSamples);
        printf("\n");
    }
}

It might be a bit buggy, but the principle is sound. Converting GenFrameSamples to assembly should be fairly simple: it's just comparisons, copies and subtractions. GenSamples could do the actual computed jump, but the Samples code needs to be split in 2 because a computed jump is needed for each pair of channels (hence my edit in the past few minutes to denote that).
 

Snial

Well-known member
@MIST , @8bitbubsy. This is GenFrameSamples in assembly code:

Code:
GenFrameSamples: 
    move.w gStepSamples(%pc),%d0     #d0.w=aStepSamples
    move.w #LEN,%d1                    #d1.w=tickSamples=370;
GenFrameSamples10:                    #can be do loop while(tickSamples>=0) {
    move.w %d1,%d2                #clipSamples=tickSamples;
    cmp.w %d0,%d2
    ble.s GenFrameSamples20        #keep tickSamples unless aStepSamples<clipSamples
    move.w %d0,%d2                #clipSamples=aStepSamples;
GenFrameSamples20:
    move.w %d2,%d3
    lsl.w #3,%d3                #*8.
    sub.w %d2,%d3                #*7
    sub.w %d2,%d3                #*6
    add.w %d3,%d3                #*12
    add.w %d2,%d3                #*13
    add.w %d3,%d3                #*26
    neg.w %d3                    #-26
    add.w #Samples23End-GenFrameSamples22,%d3
    movem.w %d0-%d3,-(%sp)        #save regs.
    jsr 0(%pc,%d3.w)            #GenSamples23
GenFrameSamples22:
    move.w 6(%sp),%d3            #reload saved d3 (d0 is at 0(sp), d3 at 6(sp)) and adjust offset.
    add.w #Samples01End-GenFrameSamples24-(Samples23End-GenFrameSamples22),%d3                    
    jsr 0(%pc,%d3.w)            #GenSamples01
GenFrameSamples24:
    movem.w (%sp)+,%d0-%d3        #restore regs.
    sub.w %d2,%d1                #tickSamples-=clipSamples; // play up to 370 samples this tick.
    sub.w %d2,%d0                #aStepSamples-=clipSamples; // and recalc after aStepSamples
    bgt.s GenFrameSamples30        #if aStepSamples<=0.
    bsr RePlayer                #Returns aStepSamples in d0.w
GenFrameSamples30:
    tst.w %d0
    bgt.s GenFrameSamples10        #if more tickSamples, jump back.
    move.w %d0,gStepSamples(%pc)

Again, it's probably a bit buggy, but it illustrates that it's not too complex and it's also fast. The most complex bit is multiplying by -26, which is 8*4+6=38 cycles, about 2x faster than a MULU; though a table-driven multiply (370 word table) would be a bit faster.
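The shift-and-add sequence for the multiply can be checked in C. This sketch mirrors the steps in the assembly comments (*8, *7, *6, *12, *13, *26) and leaves out the final negate; Times26 is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 26 using only a shift, adds and subtracts, following the
   same chain as the assembly: 8n, 7n, 6n, 12n, 13n, 26n. */
uint16_t Times26(uint16_t n)
{
    uint16_t r;
    assert(n <= 370);            /* at most one frame of samples */
    r = (uint16_t)(n << 3);      /* *8  */
    r = (uint16_t)(r - n);       /* *7  */
    r = (uint16_t)(r - n);       /* *6  */
    r = (uint16_t)(r + r);       /* *12 */
    r = (uint16_t)(r + n);       /* *13 */
    r = (uint16_t)(r + r);       /* *26 */
    return r;
}
```

The largest result, 370*26=9620, fits comfortably in 16 bits, so word-sized arithmetic is enough for the jump offset.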
 

Snial

Well-known member
Note: When it comes to optimisation, every 1083 cycles we save on the VBL (including the Hardware buffer copying, the sample generation or replayer), leaves 1% more CPU for the application.

OK, I found two bugs in my code. Firstly, tst.w %d0 near the end should be tst.w %d1, because we need to keep looping if there's more tickSamples, not aStepSamples. Secondly, the computed jsrs fail to set up all the variables for the unrolled loops so this should be added. Also, the JSRs can be JMPs, because the exits always return to a specific destination for each one. This will save a few cycles.

At the standard BPM, we have 445 samples to process per tick, which means that 5/6 of the time we go round the loop twice and 1/6 of the time we go round once. So, in this code we have 12+8+12 = 32 cycles as a fixed overhead + 13*4+8+10+3*8+42+22*2+10*2=200 cycles per loop. So, it's 232 cycles 1/6 of the time and 432+18 (bsr)=450c 5/6 of the time. 413.7 cycles average or 0.38% CPU.

By comparison, the current cost is (18+16)*6 for the BSRs alone = 204 cycles, but each pass through sample has an overhead of about 760.8 cycles. There's always 5 of them, so that's 4008 cycles or 3.7% of CPU.

However, with GenFrameSamples we'd go through the sample overhead 2x 5/6 of the time and 1x 1/6 of the time, which adds 1394.8 cycles, or 1.29%. Thus GenFrameSamples saves 3.7-(1.29+0.38)=2.03 % of CPU.
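The cycle estimates above can be reproduced with a few lines of C. All the inputs are the figures quoted in this post (1083 cycles per 1% of CPU, 232 and 450 cycles for the scheduler, 760.8 cycles of per-pass overhead); the function names are invented.

```c
#include <assert.h>

#define kCyclesPerPercent 1083.0   /* cycles per VBL that equal 1% CPU */

/* Convert a per-VBL cycle count into a percentage of CPU. */
double PercentOfCpu(double cyclesPerVbl)
{
    assert(cyclesPerVbl >= 0.0);
    return cyclesPerVbl / kCyclesPerPercent;
}

/* GenFrameSamples itself: one loop 1/6 of the time, two loops 5/6. */
double SchedulerCycles(void)
{
    return (232.0 + 5.0 * 450.0) / 6.0;
}

/* Current code: BSR overhead plus five passes through sample. */
double OldOverheadCycles(void)
{
    return (18.0 + 16.0) * 6.0 + 5.0 * 760.8;
}

/* New code: sample overhead once 1/6 of the time, twice 5/6. */
double NewSampleOverheadCycles(void)
{
    return 760.8 * (1.0 + 5.0 * 2.0) / 6.0;
}

/* Net saving in percent of CPU. */
double PercentSaved(void)
{
    return PercentOfCpu(OldOverheadCycles())
         - PercentOfCpu(NewSampleOverheadCycles() + SchedulerCycles());
}
```

These come out at roughly 413.7, 4008 and 1394.8 cycles, and a saving of about 2.03% of CPU, matching the numbers above.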

I thought we might be able to save cycles on the Sample23 and Samples01 overheads, by saving the context on exiting and simply reloading the context when we need to call it again. But this is not true, because in the majority of cases we call Replayer and this can change all the parameters. Hence they need to be recalculated.

Another approach that might save a few cycles is to vector to each effect rather than testing for each effect in turn. This would involve taking the top effect nybble, shifting it left by 1 (<< 1) to form a word index, loading the vector offset and then performing a jmp x(pc,dn.w). Some effects use the top two nybbles, so maybe a second vector table could be used, or a wasteful 8-bit table could be used.
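In C the vectoring idea is a table of function pointers indexed by the effect nybble. This is a sketch only: the handlers are stubs that record which one ran, their names are invented, and only a few of the standard MOD effect numbers (0 arpeggio, 1 and 2 portamento, 4 vibrato, A volume slide) are filled in.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*EffectFn)(uint8_t param);

int gLastHandler = -2;   /* records which stub ran, for illustration */

void FxArpeggio(uint8_t p) { (void)p; gLastHandler = 0x0; }
void FxPortUp(uint8_t p)   { (void)p; gLastHandler = 0x1; }
void FxPortDown(uint8_t p) { (void)p; gLastHandler = 0x2; }
void FxVibrato(uint8_t p)  { (void)p; gLastHandler = 0x4; }
void FxVolSlide(uint8_t p) { (void)p; gLastHandler = 0xA; }
void FxIgnore(uint8_t p)   { (void)p; gLastHandler = -1; }

/* One entry per effect nybble; unhandled effects share FxIgnore. */
const EffectFn kEffectTable[16] = {
    FxArpeggio, FxPortUp, FxPortDown, FxIgnore,
    FxVibrato,  FxIgnore, FxIgnore,   FxIgnore,
    FxIgnore,   FxIgnore, FxVolSlide, FxIgnore,
    FxIgnore,   FxIgnore, FxIgnore,   FxIgnore
};

/* effectWord holds the 12-bit effect: nybble in bits 11..8, param below. */
void DispatchEffect(uint16_t effectWord)
{
    uint8_t fx = (uint8_t)((effectWord >> 8) & 0x0F);
    assert(kEffectTable[fx] != 0);
    kEffectTable[fx]((uint8_t)(effectWord & 0xFF));
}
```

Every effect then costs the same single indexed jump, instead of later effects paying for all the compares that precede them.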
 