
Faster MacOS by patching NewPtrClear?

David Cook

Well-known member
Read this story to discover lies, bugs, and 68K variations.

I ran across an interesting note in Inside Macintosh: Memory about the performance of NewPtrClear. It is unusual for documentation to say "Currently" instead of "As of System 7.1". It is also unusual that it says "large block" instead of something specific like "blocks larger than 64 bytes".

Inside-Macintosh-Memory-NewPtrClear.PNG

This is a great opportunity to patch the OS to boost the speed for all programs! Or is it?

First, let's see what the ROM source code looks like. Wait a minute. It is clearing two bytes (Clr.W) at a time. Inside Macintosh claims "one byte" (Clr.B) at a time.
NewPtrClear.png
Note: Apple's code relies on the fact that handles and pointers are always allocated by the OS on even address boundaries and with an even number of bytes. The OS always returns a pointer on an even address, and if you ask for 17 bytes, you'll receive a 17-byte pointer but an 18th byte will be allocated and remain unused. This is because the 68000 CPU will crash if you attempt to write a word to an odd address. And, in Apple's code, writing only words (not bytes) means an even number of bytes will always be cleared. So, Apple's code is not safe to use for clearing arbitrary addresses and sizes -- it is only safe for this specific purpose.
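To make the distinction concrete, here is a minimal sketch (NewPtrClear is the real Toolbox call; the rest is my example):
Code:
Ptr   p   = NewPtrClear(17);  /* starts even, physically padded to 18 bytes */
char  buf[32];
char *odd = buf + 1;          /* an arbitrary, possibly odd address */

/* Clearing 'p' a word at a time is safe: even start, even padded size. */
/* Aiming the same word-wide loop at 'odd' raises an address error      */
/* (a crash) on a 68000.                                                */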

Let's allocate a large pointer on a Macintosh SE to give us time to manually interrupt into MacsBug to check if this is the actual shipping implementation. Yup. As of System 7.0.1 and the SE SuperDrive ROM, the code definitely clears a word at a time and the ROM code is being used (not an improved System routine).

NewPtrClear-SE-ROM.jpg

Let's check that on a IIsi with System 7.6.1. Yup. Same code. So, as early as 1989 (SE FDHD ROM) and as late as 1997 (7.6.1) Apple was clearing a word (16 bits) at a time on 68K Macs.

NewPtrClear-IIsi-ROM.png

I searched online for better clearing routines in C. They follow this pattern:
1. Align the pointer to a 4-byte boundary, clearing the initial bytes one at a time as necessary. Aligning the address prevents crashes on some CPUs, but more importantly lets each write use the entire 32-bit bus at once, hitting all parallel memory chips. Addresses not on 4-byte boundaries split 32-bit writes into 24/8, 16/16, or 8/24. We want 32/0.
2. As much as possible, write a bunch of longs in a row (like 8) before looping to decrease the loop cost. For example, 8x4bytes=32 bytes per loop. This is called "loop unrolling".
3. As much as possible, write a single long (4 bytes) per loop with whatever remains.
4. Write a single byte per loop for the remainder.
Metrowerks code will be presented later to demonstrate this algorithm.
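Until then, here is a rough C sketch of steps 1 through 4 (my own illustration, not the Metrowerks source, with no nil or size checks):
Code:
/* A rough sketch of the four steps above -- my illustration, not the
   Metrowerks source. Assumes dest != nil and numberOfBytes > 0. */
void MemoryClearFancy(char *dest, unsigned long numberOfBytes)
{
    unsigned long *longPtr;

    /* 1. Clear leading bytes until dest sits on a 4-byte boundary. */
    while (((unsigned long)dest & 3) && numberOfBytes) {
        *dest++ = 0;
        numberOfBytes--;
    }
    longPtr = (unsigned long *)dest;

    /* 2. Unrolled loop: 8 longs (32 bytes) per pass to cut loop cost. */
    while (numberOfBytes >= 32) {
        longPtr[0] = 0; longPtr[1] = 0; longPtr[2] = 0; longPtr[3] = 0;
        longPtr[4] = 0; longPtr[5] = 0; longPtr[6] = 0; longPtr[7] = 0;
        longPtr += 8;
        numberOfBytes -= 32;
    }

    /* 3. One long (4 bytes) per pass for what remains. */
    while (numberOfBytes >= 4) {
        *longPtr++ = 0;
        numberOfBytes -= 4;
    }

    /* 4. Trailing bytes, one at a time. */
    dest = (char *)longPtr;
    while (numberOfBytes--)
        *dest++ = 0;
}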

FYI. Most clearing code is actually based on the memset() standard C library code, where we set the value to zero.

For fun, let's compare the performance of writing a single byte at a time (decrementing byte count), a single byte at a time (looking for the final pointer address), Apple's code, Metrowerks standard library code (decrementing byte count), and my own code (looking for the final pointer address).

For the sake of speed, we are going to assume no nil pointers and no numberOfBytes <= 0. Normally we would check more carefully at the start of the routine. The major difference between the two methods below is that one decrements the number of bytes remaining and the other compares two addresses. I was raised on address comparisons.

Clear-by-bytes-algorithms.PNG
Note: Yes, the second routine could be slightly faster with a do-while. But I just saw that and I'm not going to rerun all my tests.
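For reference, the two routines in the image boil down to the following (my reconstruction from the graph labels; the attachment above is authoritative, and the exact signatures are my assumption):
Code:
/* Reconstruction of the two byte-at-a-time routines pictured above. */
void MemorySetByteCounter(char *dest, char value, long numberOfBytes)
{
    while (numberOfBytes--)          /* decrement the bytes remaining */
        *dest++ = value;
}

void MemorySetByteEndPtr(char *dest, char value, long numberOfBytes)
{
    char *endPtr = dest + numberOfBytes;

    while (dest != endPtr)           /* compare two addresses */
        *dest++ = value;
}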

Graphs
For all graphs presented, smaller values are better as the Y axis is time taken.

Here are the results for a 68000 SE SuperDrive with System 7.0.1. A single buffer is allocated, and then each routine is called 1000 times for a given length, the time is recorded, and the length is increased. The upper far right of this graph says that the MemorySetByteCounter algorithm ("Byte") took almost 4.5 seconds (4.5 million microseconds) to clear a 99-byte buffer 1000 times in a row. Or, about 4.5 milliseconds per call.
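A sketch of that measurement loop (my reconstruction, not the actual harness; Microseconds() and NewPtr() are real Toolbox calls, header names per Universal Interfaces, the rest is assumed):
Code:
#include <MacMemory.h>
#include <Timer.h>

extern void MemorySetByteCounter(char *dest, char value, long numberOfBytes);

/* My reconstruction of the measurement described above, timing the
   MemorySetByteCounter routine from the earlier sketch. */
void TimeByteCounter(void)
{
    UnsignedWide  before, after;
    Ptr           buffer = NewPtr(1000);   /* one buffer, reused */
    long          length, i;

    for (length = 1; length <= 99; length++) {
        Microseconds(&before);
        for (i = 0; i < 1000; i++)
            MemorySetByteCounter(buffer, 0, length);
        Microseconds(&after);
        /* record after - before (a 64-bit subtraction) for this length */
    }
    DisposePtr(buffer);
}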

NewPtrClear-68000.png
A) For very small pointer lengths, the simple byte and word routines are much faster than the fancy Metrowerks and David routines. Less setup overhead per call.
B) David's routine switches to long writes at 19 bytes. Metrowerks waits until 32 bytes.
C) Writing a byte and decrementing a counter ("Byte") is slightly slower than writing a byte and doing an address comparison ("ByteEndPtr"). My teachers would be proud.
D) Apple's word clearing algorithm is definitely an improvement over a single byte. It is faster than the fancy algorithms until about 32 bytes.

Let's try longer pointer lengths.
NewPtrClear-68000-at-Scale.png
A) The distance between the "Byte" and "ByteEndPtr" algorithms is widening. That means we are definitely saving cycles in the loop, rather than in setup/overhead portions of the call that would stay static as the pointer size changes.
B) The fancy algorithms are almost three times faster than Apple's NewPtrClear ("AppleWord"), and the gap over the byte-based algorithms is as large again.

I cannot test misaligned memory clearing performance on the 68000, because Apple's algorithm crashes. Spoiler alert: so does Metrowerks!

But, we can test aligned and misaligned memory on a 68030. In this case, a 68030 IIsi with System 7.6.1.
NewPtrClear-68030.png
A) What??? My end-pointer check ("ByteEndPtr") is significantly slower than a bytes-remaining check ("Byte") on a 68030. This is the opposite of the 68000. My teachers would be confused.
B) Aligned and misaligned word writes ("AppleWord", "AppleWord+1") are basically identical on a 68030. No performance hit. That's unexpected.
C) Word writing algorithms ("AppleWord", "AppleWord+1") are relatively faster on a 68030. That is, the algorithm is now faster than the fancy version up to around 64 bytes, as opposed to 32 bytes on the 68000.

So, if you performance tuned assembly language code for a 68000, you were frustrated when the 68030 came out.

Let's try longer pointers.
NewPtrClear-68030-at-Scale.png
A) Hmmmm. Metrowerks' memset performance on misaligned memory ("Metrowerks+") is slower than aligned ("Metrowerks"). But, their code aligns memory at the start. How could it not be the same performance?

CodeWarrior-FillMem-Bug.PNG
Look at my comments ("DAC") above. Although the Metrowerks code aligns memory at the start, it is missing the line, on the 68K path, that copies the aligned pointer back. As a result, the Metrowerks memset() routine (and other code based on it), when given an odd memory address on a 68K CPU, does the following (sketched after the list below):
1. Crashes on a 68000
2. Doesn't clear the end byte(s), which can result in a crash or bad behavior on all 68K CPUs due to lack of initialized structures.
3. Is up to 50% slower on a 68020/68030.
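To make the failure mode concrete, here is a hypothetical reduction in C. This is NOT the actual Metrowerks source -- just the shape of the defect, with the missing line shown commented out:
Code:
/* Hypothetical reduction of the bug described above. */
void *memset_sketch(void *dst, int val, unsigned long len)
{
    char          *p    = (char *)dst;
    unsigned long  fill = (unsigned char)val * 0x01010101UL;

    if (len >= 4) {
        char *aligned = p;

        /* Set the leading bytes up to the next 4-byte boundary... */
        while ((unsigned long)aligned & 3) {
            *aligned++ = (char)val;
            len--;
        }
        /* p = aligned;   <-- the missing 68K copy-back line */

        /* ...but the long loop still starts at the stale 'p'. On a
           68000 the 32-bit write to an odd address crashes; on a
           68020/030 it runs misaligned and slow; and on every 68K the
           final byte(s) of the buffer are left unset, because 'len'
           was already reduced by the leading loop. */
        while (len >= 4) {
            *(unsigned long *)p = fill;
            p += 4;
            len -= 4;
        }
    }
    while (len--)
        *p++ = (char)val;
    return dst;
}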

This bug exists from at least Gold 11 up to Pro 5. I did not check later versions.

What about earlier versions of CodeWarrior? Source code does not appear to be included in Gold 7, but the disassembly is significantly different. I didn't step through the code, because the first thing I noticed is that it is slower due to poor compiler optimization. The left side (Gold 7) shows two instructions (MOVE.L D2,(A0) and ADDQ.L #$4,A0) for every single instruction (MOVE.L D3,(A0)+) on the right side (Gold 11). The ADDQ.L consumes 8 extra cycles for each clear. Blech.
CW7-Suboptimal-compile-or-code.PNG

It seems 68K performance and bugs were not a focus of Metrowerks as early as 1995. And a nasty 68K bug was introduced as early as 1996 and never(?) fixed.

Enough of that. Let's move on to clearing performance on a 68040. Specifically, a Quadra 650 with System 7.6.1.

NewPtrClear-68040.png
A) The two methods of checking for the end of a byte loop now perform the same (apart from variable-initialization overhead) on a 68040.
B) Apple's word clearing routine now sits in the middle of the pack, breaking even with the fancy algorithms around 40 bytes, rather than 32 or 64.

For bigger pointers?
NewPtrClear-68040-at-Scale.png
A) Address memory alignment no longer matters. Something about the 68040 allows 32-bit writes on uneven addresses to not slow down the processor. Cool.
B) The fancy algorithm has improved to 4.4x speed over the Apple word algorithm at lengths of 1000 bytes.
C) The fancy algorithm is 10x faster than writing a byte at a time at lengths of 1000 bytes.

Conclusions
1. Contrary to Inside Macintosh, the Apple NewPtrClear algorithm writes two bytes at a time. It is the most efficient or competitive algorithm on all 68K CPU variants up to around 64 bytes. It turns out that most NewPtrClear calls erase small structs (instead of programs initializing member variables one at a time), so a patch would not likely improve many programs.

2. If your code needs to routinely clear blocks larger than 100 bytes, then writing 32-bits at a time is definitely worthwhile.

3. The 68K CPUs went through substantial performance changes in instructions and memory access, such that code optimizations for one could hurt others. The 68040 seems to have made everything performant. Therefore, except under very specialized circumstances, you should write stable maintainable C code rather than risking defects with overly optimized algorithms.

4. Except for the earliest releases, do not trust Metrowerks support for 68K. Double check their code. Other than a handful of LC/Performa models, Apple had discontinued all 68K computers by 1996 (CodeWarrior 8) and PowerPC models were in their second generation. Metrowerks was looking toward the future rather than testing for backwards compatibility.

- David
 

demik

Well-known member
Now that's a very interesting deep dive. Good job. I wonder if there is a similar trick with calloc() inside the libc
 

David Cook

Well-known member
Thank you @cheesestraws @cy384 @akator70 @ymk and @demik for the kind words.

if there is a similar trick with calloc()
At least for Gold 11, they implement their own memory management. That is, calloc does not return an official OS "Ptr".
* Although a host operating system will usually provide memory management
* primitives, they are often not very efficient at the management of many
* small to medium sized blocks of nonrelocatable storage. These routines
* provide for an efficient system of memory pool management on top of such
* host primitives.
*
* Implementation
* --------------
*
* The dynamic allocation scheme used here is based in large part on
* techniques presented in Section 2.5 of Knuth's "The Art of Computer
* Programming, Vol. 1".
They always allocate in 4-byte increments on 4-byte boundaries for 68K, and 8-byte for PowerPC. That makes their clearing code simple:
Code:
for (size = (size / sizeof(long)) + 1, p = ptr; --size; )
    *p++ = 0;
(The +1 offsets the pre-decrement test, so the body runs exactly size/4 times.)

They don't unroll the loop. So, for larger allocations (>128 bytes?), performance will be slower than the fancy routines presented earlier in this thread. But, it should be faster than Apple's routine nearly all the time.
 

8bitbubsy

Well-known member
Nice investigation! :)

One could have a clear length threshold, f.ex. 128 bytes, where anything higher than that would enter a highly optimized 32-bit aligner+clearer, while anything shorter or equal would use the original low-overhead version.
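A minimal sketch of that dispatch (ClearWithThreshold is a hypothetical name, and 128 is just the example cutoff above, not a measured optimum):
Code:
/* Sketch of the threshold idea: big blocks get the aligned 32-bit
   clearer, small blocks get a simple low-overhead loop. */
void ClearWithThreshold(char *dest, unsigned long numberOfBytes)
{
    if (numberOfBytes > 128) {
        /* Large block: align, then clear 32 bits at a time
           (unrolling omitted for brevity). */
        unsigned long *longPtr;

        while (((unsigned long)dest & 3) && numberOfBytes) {
            *dest++ = 0;
            numberOfBytes--;
        }
        longPtr = (unsigned long *)dest;
        while (numberOfBytes >= 4) {
            *longPtr++ = 0;
            numberOfBytes -= 4;
        }
        dest = (char *)longPtr;
    }
    /* Small block, or the tail of a large one. A word loop, as in
       Apple's ROM, would also work here for even-sized blocks. */
    while (numberOfBytes--)
        *dest++ = 0;
}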

EDIT: movem.l d0-d7,-(An) comes to mind for longer lengths where the setup overhead doesn't matter. It's the fastest way to do 32-bit clearing on a 68000.

As far as I know, unaligned memory access even on a 68040 should have a speed penalty as it has to use micro-code. Maybe the speed difference would be quantifiable if you tested much longer clearing lengths?
 

Crutch

Well-known member
Interesting write up, thanks for sharing this.

My assumption is that nobody particularly thought NewPtrClear needed to be a super optimized routine, because continually allocating cleared memory in a tight loop on the heap is probably not something one does very much?

Compare with _BlockMove, which clearly is a potential tight-loop speed bottleneck, and for which maximalist loop unrolling and CPU special-casing was used to the utmost:


1690046309855.png
 

8bitbubsy

Well-known member
Interesting write up, thanks for sharing this.

My assumption is that nobody particularly thought NewPtrClear needed to be a super optimized routine, because continually allocating cleared memory in a tight loop on the heap is probably not something one does very much?
Well, if you allocate a massive chunk (a few megs), then it could take a noticeable amount of waiting time. I assume NewPtrClear is the same as calloc, right?
 

Crutch

Well-known member
I don’t think so … if I’m reading the chart right, allocating 1K a thousand times (so, call it allocating 1MB once) takes about 2 seconds using Apple’s implementation on an 030? (Nobody was allocating megabytes on a 68000.) Again unless you’re doing this in a loop (which presumably nobody normally would be doing?), I’m not sure that’s enough time to matter.
 

ajp9

New member
Overhead is definitely something to consider when doing this in asm. Clearing memory is something that doesn't need any additional registers. The routine wasn't even really 68000-optimized if fewer code bytes are what made DBRA preferable to SUBQ.L/BGT... it was just small overall to serve a purpose and not take up more ROM. Preserving A0 could be done faster by putting it in D1 or D2 instead of on the stack, assuming no program expects them preserved, although I do recall the trap manager saves registers in certain cases.

Doing performance tests years ago, I know unaligned writes on a 68040 with move are still some 20% slower in RAM because both 4-byte lines need to be in cache for no penalty; the routine you used probably doesn't exhaust the cache. RAM wasn't cheap back then, but alignment was kind of critical in some of the hardware because unaligned writes to video memory were SLOW.
 

bigmessowires

Well-known member
Interesting stuff. Regarding whether to decrement a byte counter or compare a pointer, I would always write code to decrement a byte counter on 68000 because it should result in one fewer instruction in assembly code. 68000 has a DBRA (decrement and branch unless the result is -1) instruction that can be used to decrement the counter, test the new counter value, and branch, all in a single instruction. If comparing a current pointer to an end pointer, that optimization's not possible. I'm surprised that your version with pointer comparisons was actually faster in some cases. Did you look at the generated assembly code for both versions?

I used godbolt to compile the following code with gcc 13.2.0 for M68K:

Code:
void MemorySetByteCounter(char* dest, char val, char count)
{
    do {
        *dest++ = val;
    } while (--count);
}

I specified -O2 and also had to add -fno-builtin, or else the compiler would detect what I was trying to do and simply call memset(). Here's the generated assembly, with apologies for the x86-style syntax:

Code:
MemorySetByteCounter:
        move.l 4(%sp),%a0
        move.b 11(%sp),%d0
        move.b 15(%sp),%d1
        subq.b #1,%d1
        and.l #255,%d1
        lea 1(%a0,%d1.l),%a1
.L2:
        move.b %d0,(%a0)+
        cmp.l %a0,%a1
        jne .L2
        rts

The first section of code computes endPtr = dest + count (as dest + (count - 1) + 1, so that a count of 0 wraps to 256). The second part beginning at L2 is basically MemorySetByteEndPtr, so the compiler is generating a preamble that turns the byte counter method into the pointer comparison method.

Yet I don't understand why the compiler did that. Isn't code like this faster for the main loop?

Code:
loop:
    move.b D0, (A0)+
    dbra D1, loop
    rts

I guess we'd have to look up the instruction timings in the 68000 data book, but I would expect the single instruction DBRA to be faster than the pair of instructions CMP.L, JNE. The only drawback is that the byte counter in D1 must be a word (2 byte) size, so you'll need an outer loop if you want to clear more than 65535 bytes.
 

Snial

Well-known member
Code:
loop:
    move.b D0, (A0)+
    dbra D1, loop
    rts

I guess we'd have to look up the instruction timings in the 68000 data book, but I would expect the single instruction DBRA to be faster than the pair of instructions CMP.L, JNE. The only drawback is that the byte counter in D1 must be a word (2 byte) size, so you'll need an outer loop if you want to clear more than 65535 bytes.
One of the clever things about the 68K is that DBRA counts to -1 instead of counting to 0. This means it's possible to swap the words and just do another outer dbra. [Note: I've just seen I've replied to BMOW, who must know this, but maybe not everyone else does.]

Code:
    bra.s loop0Chk ;
loopOuter:
    swap d1
loopInner:
    move.b D0, (A0)+
loop0Chk:
    dbra D1, loopInner
    swap d1
    dbra d1,loopOuter ;d1=-1 at end.
    rts

This means that a 32-bit dbra on a 68000 is only about 0.002% slower than a 16-bit dbra.

Why does it work? Let's consider the case where d1.l=0x10000. We jump to the 0Chk, which makes d1.l=0x1ffff so the loop falls through; swaps d1 => 0xffff0001; performs dbra d1 => 0xffff0000, so it jumps back around; swaps d1 => 0x0000ffff and will then do 65536 inner loops, again leaving d1.l == 0x0000ffff when the inner dbra d1 fails. It then swaps d1 => 0xffff0000; decrements d1.w => 0xffffffff and because that word is 0xffff, exits the loop. We don't need to swap round this time, because the low word = the high word = 0xffff.

The same rationale works for d1.l = 0x10001. Jump to 0Chk => d1.l = 0x10000, so we do one jump back and on the next dbra it's the same case as for the above. We know it works for 0x00000000 as we've covered that above and now we know it will work for 0x10002 etc, and all cases.
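The trick can also be sanity-checked with a small C model of DBRA's semantics (my sketch; the gotos deliberately mirror the branch structure of the assembly above):
Code:
#include <stdint.h>
#include <stdio.h>

/* C model of the 32-bit DBRA word-counting trick: returns how many
   inner-loop writes happen for a 32-bit count in d1. */
static uint32_t dbra_writes(uint32_t d1)
{
    uint32_t writes = 0;

    goto check;                                   /* bra.s loop0Chk    */
outer:
    d1 = (d1 >> 16) | (d1 << 16);                 /* swap d1           */
inner:
    writes++;                                     /* move.b d0,(a0)+   */
check:
    /* dbra: decrement the low word; branch unless it became 0xFFFF */
    d1 = (d1 & 0xFFFF0000u) | ((d1 - 1) & 0xFFFFu);
    if ((d1 & 0xFFFFu) != 0xFFFFu) goto inner;    /* dbra d1,loopInner */
    d1 = (d1 >> 16) | (d1 << 16);                 /* swap d1           */
    d1 = (d1 & 0xFFFF0000u) | ((d1 - 1) & 0xFFFFu);
    if ((d1 & 0xFFFFu) != 0xFFFFu) goto outer;    /* dbra d1,loopOuter */
    return writes;
}

int main(void)
{
    /* prints 65536 65537 0 -- matching the traces above */
    printf("%lu %lu %lu\n",
           (unsigned long)dbra_writes(0x10000),
           (unsigned long)dbra_writes(0x10001),
           (unsigned long)dbra_writes(0));
    return 0;
}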

Now let's compare it with what happens with a z80's djnz instruction, which decrements the 8-bit b register and jumps if the result isn't 0. The equivalent technique for the z80 would be to use it to do a 16-bit loop using bc. We'd swap b and c (djnz should have worked for c instead) and then have 2 loops as in the above:

Code:
    ;Assuming not 0 this time.
loopOuter:
loopInner:
    ld (hl),a
    inc hl
loop0Chk:
    djnz loopInner
    dec c
    jr nz,loopOuter ;bc has run out at the end
    ret

We couldn't use dec bc, because it doesn't affect the flags. The problem with the loop is that it doesn't work for corner cases. If the swapped bc=0x100, which should loop once, then after the first djnz b=0, so it falls through to the dec c, which makes c=0xff, so it jumps back and then it would do another 256 inner loops and repeats that for the next 255 decrements of c, which is wrong. Similarly, if swapped bc=0x101, instead of doing 257 loops, it does 1 loop. Oddly enough though if bc had been 0x100, swapped bc=0x0001, which indeed does 256 loops!

The issue is that on a Z80 (and on the 8086) decrementing the lower byte until it's 0 then decrementing the upper byte doesn't work the way subtracting 1 borrows from higher bits. That is, we only borrow from higher bits when all the lower bits were 0 before the decrement: 0101000 -1 => 0100111. On a 68K, dbra does work the same way subtracting 1 works. We could in fact extend dbra to 64-bits using two data registers and it would also work.

As a consequence Z80 programmers usually do this:
Code:
    ;Assuming not 0 this time.
loop:
    ld (hl),0   ;write the zero directly -- the flag test below clobbers a
    inc hl
    dec bc
    ld a,b      ;dec bc doesn't set flags, so test b|c by hand
    or c
    jr nz,loop
    ret

Which is short, but slow. To compensate for the problem on a Z80, you need to pre-increment c if b (the low byte after the swap) isn't 0.

Code:
    ;Assuming not 0 this time.
    ld a,b      ;a = high byte of the count
    ld b,c      ;b = low byte (the inner djnz counter)
    inc b
    dec b       ;test the low byte without changing it
    jr z,Skip
    inc a       ;low byte nonzero: round the outer count up
Skip:
    ld c,a      ;c = outer counter
    xor a       ;restore the fill value (0, since we're clearing)
loopOuter:
loopInner:
    ld (hl),a
    inc hl
    djnz loopInner
    dec c
    jr nz,loopOuter
    ret

Clumsy, but for large loops this makes 16-bit djnz's about 1% slower than 8-bit djnz (instead of nearly 2x slower).
 

bigmessowires

Well-known member
It sounds like you agree that for 68000 MemClr, it ought to be faster to decrement the counter and branch with DBRA than to compare the pointer to an end pointer? This is what I expected, but is the opposite of what @David Cook found in his results. Here are assembly versions of both approaches:

Code:
MemClr_DecrementCounter:
    ; on entry A0 is the destination pointer and D0 is the 32-bit byte count
    bra.s L3
L1:
    swap d0
L2:
    clr.b (a0)+    ; moveq can only target a data register, so clear directly
L3:
    dbra d0,L2
    swap d0
    dbra d0,L1
    rts

MemClr_ComparePtr:
    ; on entry A0 is the destination pointer and D0 is the 32-bit byte count
    ; A1 used as temporary for end pointer
    lea 0(a0,d0.l),a1
L1:
    clr.b (a0)+
    cmp.l a0,a1
    jne L1
    rts

The second version with pointer compare is more compact, but its inner loop is three instructions versus only two instructions for decrementing a counter. I'm not sure whether Easy68K is cycle accurate for code timing, but I'll try these two on some real hardware when I get a chance.
 

bigmessowires

Well-known member
I read elsewhere that 68010 and later CPUs have a hardware optimization to make DBRA even faster. And I think the absolute fastest method (at least for larger blocks) would use MOVEM rather than MOVE.L, but you're not likely to get that from any compiler and would have to hand-write the assembly code.
 

Snial

Well-known member
I read elsewhere that 68010 and later CPUs have a hardware optimization to make DBRA even faster. And I think the absolute fastest method (at least for larger blocks) would use MOVEM rather than MOVE.L, but you're not likely to get that from any compiler and would have to hand-write the assembly code.
Yes, MOVEM can be much faster - for aligned data:
Code:
    movem.l (a0)+,d0-d6/a2-a4   ;load 10 registers = 40 bytes
    movem.l d0-d6/a2-a4,-(a1)   ;store them (movem stores can't use (An)+)
    ;(2+20)*2*4 cycles + 2*2 cycles for the addressing modes = 180 cycles
    ;to move 40 bytes of RAM, 4.5 cycles per byte.
 

eharmon

Well-known member
I happened to read these old BMUG notes today which had a tangential reference: https://32by32.com/bmug-meeting-minutes-1988-01-21/
  • Radius product (?): by Andy Hertzfeld: QuickerDraw
  • rewrote tight loops in the ROMs. Makes color QuickDraw(tm) almost as fast as B&W. Will Apple steal it after it's released?
I guess that became QuickerGraf in System 6 (per Wikipedia). So it was licensed, if you want to call that stealing ;)

The connection being patching tight loops in the ROMs to optimize for newer, faster machines. I wonder what other hot loops are lurking.

Has anyone ever built a profiler for any emulators? It might be interesting to see what areas are extra hot.
 

olePigeon

Well-known member
I feel like with a lot of these posts, you could reformat them just a tiny bit and publish a quarterly Vintage Macazine. :)
 