Nice investigation! :)
One could pick a clear-length threshold, e.g. 128 bytes: anything longer than that would go to a highly optimized routine that first aligns to 32 bits and then clears in longwords, while anything at or below the threshold would keep using the original low-overhead version.
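To make the idea concrete, here is a rough sketch of that dispatch in C rather than assembly. The names (`clear_mem`, `clear_small`, `clear_large`, `CLEAR_THRESHOLD`) are made up for illustration, the 128-byte cutoff is just the example figure from above and not a measured optimum, and the "optimized" path is only a plain aligned 32-bit loop standing in for whatever hand-tuned routine would really go there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff: the example figure, not a benchmarked value. */
#define CLEAR_THRESHOLD 128u

/* Short path: plain byte loop with no setup cost. */
static void clear_small(unsigned char *p, size_t n)
{
    while (n--)
        *p++ = 0;
}

/* Long path: byte stores up to a 4-byte boundary, then 32-bit stores,
 * then byte stores for the remaining 0..3 bytes.
 * Assumes the buffer may legally be written as uint32_t
 * (e.g. memory obtained from malloc). */
static void clear_large(unsigned char *p, size_t n)
{
    /* Head: advance until p is 32-bit aligned. */
    while (n && ((uintptr_t)p & 3u)) {
        *p++ = 0;
        n--;
    }

    /* Body: one 32-bit store per iteration. */
    uint32_t *w = (uint32_t *)(void *)p;
    for (size_t words = n / 4; words; words--)
        *w++ = 0;

    /* Tail: leftover bytes after the last full word. */
    p = (unsigned char *)w;
    n &= 3u;
    while (n--)
        *p++ = 0;
}

/* Dispatcher: longer than the threshold takes the aligned path,
 * shorter or equal takes the low-overhead path. */
void clear_mem(void *dst, size_t n)
{
    if (n > CLEAR_THRESHOLD)
        clear_large((unsigned char *)dst, n);
    else
        clear_small((unsigned char *)dst, n);
}
```

Calling `clear_mem(buf + 1, 200)` would take the aligned path (some head bytes, then longwords, then a short tail), while `clear_mem(buf, 5)` stays on the byte loop. Where the break-even point actually sits would have to be measured on the target.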
EDIT: movem.l d0-d7,-(An) comes to mind for...