The old tonclib's memset16() was a Thumb/ROM function
that called memset32() when that was worthwhile. Below you can
find an ARM-only version. This time round, I chose to do all the real work
inside the 16-bit version, and let memset32() jump into the
middle of memset16(). Combined, these are fewer than 32
instructions, which should keep the cache happy as well. The main difference
in performance is in the lower counts: the function overhead is about 100
cycles lower than before.
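@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset32(void *dst, u32 fill, size_t wcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset32
memset32:
    @ NB: unlike memset16, a zero count is not filtered out here.
    mov     r2, r2, lsl #1          @ Word count -> halfword count
    cmp     r2, #16
    bhs     .Lms16_entry32          @ 8+ words: join the block run
    b       .Lms16_word_loop        @ Fewer: straight to the word run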
@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset16(void *dst, u16 fill, size_t hwcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset16
memset16:
    cmp     r2, #0                  @ if(count != 0)
    movnes  r3, r0, ror #2          @   if(dst && (dst&2))
    strmih  r1, [r0], #2            @   { *dst++= fill; count--; }
    submis  r2, r2, #1              @ if(count == 0 || dst == NULL)
    bxeq    lr                      @   return;
    orr     r1, r1, lsl #16         @ Prep for word fills.
    cmp     r2, #16                 @ Fewer than 16 hwords left?
    blo     .Lms16_word_loop        @   Then skip the block run
.Lms16_entry32:
    @ --- Block run ---
    stmfd   sp!, {r4-r8}
    mov     r3, r1
    mov     r4, r1
    mov     r5, r1
    mov     r6, r1
    mov     r7, r1
    mov     r8, r1
    mov     r12, r1
.Lms16_block_loop:
    subs    r2, r2, #16             @ C -= 16 hwords (8 words)
    stmhsia r0!, {r1, r3-r8, r12}   @ Fill 8 words if C was >= 16
    bhi     .Lms16_block_loop       @ Loop while C was > 16
    ldmfd   sp!, {r4-r8}            @ (ldm leaves the flags alone)
    bxeq    lr                      @ No residuals: done
    addne   r2, r2, #16             @ Correct for overstep in loop
    @ --- Word run (+ trailing halfword) ---
.Lms16_word_loop:
    subs    r2, r2, #2              @ C -= 2 hwords (1 word)
    strhs   r1, [r0], #4            @ Fill 1 word if C was >= 2
    bhi     .Lms16_word_loop        @ Loop while C was > 2
    strneh  r1, [r0], #2            @ r2 != 0 means spare hword left
    bx      lr
@ EOF
As usual, I'm being somewhat dirty with how the assembly works. In the
first five instructions of memset16(), I'm doing several things
in one go: testing the destination (and count) for 0, doing a single halfword
write for non-word-aligned destinations, and returning from the routine if
the count is 0 afterwards. I can do all this in five instructions
through clever manipulation of conditionals.
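For those who would rather read C, here is a rough sketch of what those five instructions amount to. It is only a model: the name ms16_prologue() and its "return what's left" convention are made up for illustration, and it assumes dst is at least halfword-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the memset16() prologue (not part of tonclib).
   Returns the number of halfwords still to fill; 0 means "return now".
   *pdst is advanced if the alignment halfword was written. */
static size_t ms16_prologue(uint16_t **pdst, uint16_t fill, size_t count)
{
    uint16_t *dst = *pdst;

    /* Assumption of this sketch: dst is halfword-aligned (or NULL). */
    assert(((uintptr_t)dst & 1) == 0);

    if (count == 0)                 /* cmp r2, #0 ... bxeq lr            */
        return 0;

    if ((uintptr_t)dst & 2) {       /* movnes r3, r0, ror #2: N = bit 1  */
        *dst++ = fill;              /* strmih r1, [r0], #2               */
        count--;                    /* submis r2, r2, #1                 */
        *pdst = dst;
        return count;               /* 0 here triggers the bxeq as well  */
    }

    if (dst == NULL)                /* Z from movnes: dst was 0          */
        return 0;

    return count;                   /* word-aligned, carry on            */
}
```

The point of the assembly version is that the three early exits and the alignment write share one set of flags, so none of these branches actually exists as a branch.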
The instructions that make up the main loop are a little non-standard as well. Here's how it works and why:
- Reduce fill-count, C, by N hwords. Note that C need not be a multiple of N. This is important.
- Fill N halfwords if C>=N. It's `>=', not `>', because C==N indicates the last stretch. It's also not `!=', because C need not be a multiple of N.
- Loop as long as C>N. In this case it is `>', because C==N indicates the last full stretch.
- Now it's time for the residuals. If C%N==0, then we're finished, so it's time to return.
- However, if there were residuals, then C<0, thanks to the last subtraction inside the loop. So we have to correct for it by adding N again.
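The steps above can be sketched in C as follows. This is a model for checking the logic, not a replacement for the assembly: model_fill16() and its block-size parameter n are hypothetical names, the count is kept in a signed variable so the overstep is visible as a negative value, and a word store is written out as two halfword stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the block run and the word run.
   n is the block size in halfwords (16 for the 8-register stmia). */
static void model_fill16(uint16_t *dst, uint16_t fill, size_t count, long n)
{
    long c = (long)count;           /* signed: an overstep shows as c < 0 */
    long i;

    assert(n >= 2 && n % 2 == 0);   /* a block is a whole number of words */
    if (c == 0)                     /* memset16() has returned by now     */
        return;

    /* --- Block run --- (the asm skips this for small counts; here one
       empty pass plus the correction gives the same result) */
    do {
        c -= n;                     /* subs: C -= N                       */
        if (c >= 0)                 /* stmhsia: fill N hwords if C >= N   */
            for (i = 0; i < n; i++)
                *dst++ = fill;
    } while (c > 0);                /* bhi: loop while C > N              */
    if (c == 0)                     /* bxeq: C was a multiple of N        */
        return;
    c += n;                         /* addne: undo the overstep           */

    /* --- Word run (+ trailing halfword) --- */
    do {
        c -= 2;                     /* subs: C -= 2                       */
        if (c >= 0) {               /* strhs: one word = two halfwords    */
            *dst++ = fill;
            *dst++ = fill;
        }
    } while (c > 0);                /* bhi                                */
    if (c != 0)                     /* strneh: one spare halfword left    */
        *dst = fill;
}
```

Nothing in the model relies on n being a power of two; only the subtraction, the comparison against zero and the correction matter.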
The standard method is splitting possible residuals first, but this version
is shorter and allows for earlier escaping. A second benefit is that you can
use non-power-of-two values for N as well. It is possible, for
example, to use a 12-fold stmia here with only a few
changes. The lower number of loops means that this would be ~10%
faster … eventually. It really depends on things like memory
waitstates whether the 12-fold version is worthwhile.
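To put a number on "lower number of loops", here is a small sketch that just counts passes through the subs/stm/bhi loop. block_passes() is a hypothetical helper, and the 38400-halfword size used below (a 240x160 16-bit screen) is only an illustrative choice. It counts loop passes, not cycles, so it says nothing about the actual speed gain.

```c
#include <assert.h>

/* Hypothetical helper: number of passes through the block loop for
   c halfwords when each pass covers n halfwords. */
static unsigned block_passes(long c, long n)
{
    unsigned passes = 0;

    assert(c > 0 && n > 0);
    do {
        c -= n;                     /* subs r2, r2, #N */
        passes++;
    } while (c > 0);                /* bhi */
    return passes;
}
```

For 38400 halfwords, block_passes(38400, 16) gives 2400 passes and block_passes(38400, 24) gives 1600, so the 12-fold version runs the loop overhead a third less often. How much of that survives as real time depends on the waitstates, as said above.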
Oh, the highlighting was done by GeSHi as well. Making that arm-asm highlighter turned out to be very easy indeed.
EDIT, 2007-12-07
There was a small bug in the version above. r12 was used but never initialized. I know I had it in there when I tested it, but somehow it got lost.