The old tonclib's memset16() was a Thumb/ROM function, but called memset32()
when that was worthwhile. Below you can find an ARM-only version. This time
round, I chose to do all the real work inside the 16-bit version, and let
memset32() jump into the middle of memset16(). Combined, these are fewer than
32 instructions, which should keep the cache happy as well. The main difference
in performance is in the lower counts: the function overhead is about 100
cycles lower than before.
@ CODE_IN_IWRAM void memset32(void *dst, u32 fill, size_t wcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset32
memset32:
    mov     r2, r2, lsl #1          @ Word count -> halfword count.
    cmp     r2, #16
    bhs     .Lms16_entry32
    b       .Lms16_word_loop
@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset16(void *dst, u16 fill, size_t hwcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset16
memset16:
    cmp     r2, #0                  @ if(count != 0)
    movnes  r3, r0, ror #2          @   if(dst && (dst&2))  (bit 1 -> N)
    strmih  r1, [r0], #2            @   { *dst++= fill; count--; }
    submis  r2, r2, #1              @ if(count == 0 || dst == NULL)
    bxeq    lr                      @   return;
    orr     r1, r1, lsl #16         @ Prep for word fills.
    cmp     r2, #16
    blo     .Lms16_word_loop
.Lms16_entry32:
    @ --- Block run ---
    stmfd   sp!, {r4-r8}
    mov     r3, r1
    mov     r4, r1
    mov     r5, r1
    mov     r6, r1
    mov     r7, r1
    mov     r8, r1
    mov     r12, r1
.Lms16_block_loop:
    subs    r2, r2, #16
    stmhsia r0!, {r1, r3-r8, r12}
    bhi     .Lms16_block_loop
    ldmfd   sp!, {r4-r8}
    bxeq    lr
    addne   r2, r2, #16             @ Correct for overstep in loop
    @ --- Word run (+ trailing halfword) ---
.Lms16_word_loop:
    subs    r2, r2, #2
    strhs   r1, [r0], #4
    bhi     .Lms16_word_loop
    strneh  r1, [r0], #2            @ r2 != 0 means spare hword left
    bx      lr
@ EOF
As usual, I'm being somewhat dirty with how the assembly works. In the
first five instructions of memset16(), I'm doing several things in one go:
testing the destination (and count) for 0, doing a single halfword write for
non-word-aligned destinations, and returning if the count is 0 afterwards.
I can do all this in five instructions through clever manipulation of
conditionals.
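For anyone who doesn't read ARM conditionals fluently, here is a rough C model of what those five instructions appear to do. The helper name `memset16_prologue` and its pointer-to-pointer interface are made up for illustration and are not part of tonclib. It also assumes the `ror #2` is read as moving bit 1 of the address into the sign flag, so the extra halfword write happens when the destination is halfword-aligned but not word-aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the memset16() prologue only.
 * Advances *pdst past an optional leading halfword and returns the number
 * of halfwords still to fill; 0 corresponds to the early `bxeq lr`. */
static size_t memset16_prologue(uint16_t **pdst, uint16_t fill, size_t count)
{
    uint16_t *dst = *pdst;

    if (count == 0 || dst == NULL)   /* cmp / movnes leave Z set        */
        return 0;                    /* bxeq lr                          */

    if ((uintptr_t)dst & 2) {        /* bit 1 of dst lands in N          */
        *dst++ = fill;               /* strmih r1, [r0], #2              */
        count--;                     /* submis; Z set if nothing is left */
    }

    *pdst = dst;
    return count;
}
```

The assembly gets away without branches because each conditional instruction either updates the flags for the next one or is skipped and leaves them alone.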
The instructions that make up the main loop are a little non-standard as well. Here's how it works and why:
- Reduce fill-count, C, by N hwords. Note that C need not be a multiple of N. This is important.
- Fill N halfwords if C>=N. It's `>=', not `>', because C==N indicates the last stretch. It's also not `!=', because C need not be a multiple of N.
- Loop as long as C>N. In this case it is `>', because C==N indicates the last full stretch.
- Now it's time for the residuals. If C%N==0, then we're finished, so it's time to return.
- However, if there were residuals, then C<0, thanks to the last subtraction inside the loop. So we have to correct for it by adding N again.
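The steps above can be sketched in C roughly as follows. This is only a model of the block run under a few assumptions: the destination is already word-aligned, the fill value has already been doubled up to 32 bits, and the name `fill_blocks` is invented for the example. A signed counter is used so the overstep below zero is visible, where the assembly reads the carry and zero flags instead.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_HWORDS 16   /* N: one 8-register stmia = 16 halfwords */

/* Hypothetical C model of the block run.
 * Returns the residual halfword count (0..N-1) left for the word loop. */
static size_t fill_blocks(uint32_t **pdst, uint32_t fill32, size_t count)
{
    uint32_t *dst = *pdst;
    ptrdiff_t c = (ptrdiff_t)count;

    do {
        c -= BLOCK_HWORDS;                 /* subs r2, r2, #16           */
        if (c >= 0) {                      /* stmhsia: had at least N    */
            for (int i = 0; i < BLOCK_HWORDS / 2; i++)
                *dst++ = fill32;
        }
    } while (c > 0);                       /* bhi: more than N were left */

    *pdst = dst;
    if (c == 0)                            /* bxeq lr: no residuals      */
        return 0;
    return (size_t)(c + BLOCK_HWORDS);     /* addne r2, r2, #16          */
}
```

With a count of 40 halfwords, for instance, two full blocks are written, the third pass only detects the overstep, and 8 halfwords are handed on to the word loop.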
The standard method is splitting off possible residuals first, but this version
is shorter and allows for earlier escaping. A second benefit is that you can
use non-power-of-two values for N as well. It is possible, for example, to use
a 12-fold stmia here with only a few changes. The lower number of loops means
that this would be ~10% faster … eventually. It really depends on things like
memory waitstates whether the 12-fold version is worthwhile.
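To put a number on "lower number of loops", a small helper can count the passes through a subtract-first block loop. This is plain arithmetic rather than a timing model: the name `block_loop_passes` is hypothetical, it assumes the loop is only entered with at least one full block, and it says nothing about cycles per pass or waitstates.

```c
#include <stddef.h>

/* Hypothetical helper: passes through the block loop for `count` halfwords
 * at `n` halfwords per pass, assuming count >= n. The final pass that only
 * detects the overstep (when count is not a multiple of n) is included. */
static size_t block_loop_passes(size_t count, size_t n)
{
    return (count + n - 1) / n;
}
```

For a 38400-halfword fill this gives 2400 passes with the 8-word block (n = 16) and 1600 with a 12-word block (n = 24). How much of that one-third reduction shows up as speed is exactly the part that depends on the memory being written to.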
Oh, the highlighting was done by GeSHi as well. Making that ARM-asm highlighter turned out to be very easy indeed.
EDIT, 2007-12-07
There was a small bug in the version above. r12 was used but never initialized. I know I had it in there when I tested it, but somehow it got lost.
Heh, I know this is old but...
When I wrote my 32-bit memcpy functions, I happened to use an identical method xP
I copied in 32-byte chunks and ldmfd,bxeq'd immediately after, and added the remainder.
I just didn't handle 16-bit things cos I keep everything aligned anyway.
Great minds think alike, huh ;D