New memset16 routine

tonclib's old memset16() was a Thumb/ROM function that called memset32() when that was worthwhile. Below you can find an ARM-only version. This time round, I chose to do all the real work inside the 16-bit version, and let memset32() jump into the middle of memset16(). Combined, these are fewer than 32 instructions, which should make the cache happy as well. The main difference in performance is in the lower counts: the function overhead is about 100 cycles lower than before.

@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset32(void *dst, u32 fill, size_t wcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset32
memset32:
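    mov     r2, r2, lsl #1      @ Word count -> hword count
    cmp     r2, #16
    bhs     .Lms16_entry32      @ >= 16 hwords: block run
    b       .Lms16_word_loop    @ else word run (NB: wcount == 0 is not filtered out)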

@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset16(void *dst, u16 fill, size_t hwcount)
@ ---------------------------------------------------------------------
    .section .iwram, "ax", %progbits
    .arm
    .align
    .global memset16
memset16:
    cmp     r2, #0              @ if(count != 0)
    movnes  r3, r0, ror #2      @   if(dst && (dst&2))
    strmih  r1, [r0], #2        @   {   *dst++= fill;   count--;    }
    submis  r2, r2, #1          @ if(count == 0 || dst == NULL)
    bxeq    lr                  @   return;

    orr     r1, r1, lsl #16     @ Prep for word fills.
    cmp     r2, #16
    blo     .Lms16_word_loop
.Lms16_entry32:

    @ --- Block run ---
    stmfd   sp!, {r4-r8}    
    mov     r3, r1
    mov     r4, r1
    mov     r5, r1
    mov     r6, r1
    mov     r7, r1
    mov     r8, r1
    mov     r12, r1
.Lms16_block_loop:
        subs    r2, r2, #16
        stmhsia r0!, {r1, r3-r8, r12}
        bhi     .Lms16_block_loop
    ldmfd   sp!, {r4-r8}
    bxeq    lr
    addne   r2, r2, #16         @ Correct for overstep in loop

    @ --- Word run (+ trailing halfword) ---
.Lms16_word_loop:
        subs    r2, r2, #2
        strhs   r1, [r0], #4
        bhi     .Lms16_word_loop
    strneh  r1, [r0], #2        @ r2 != 0 means spare hword left
    bx  lr

@ EOF

As usual, I'm being somewhat dirty with how the assembly works. In the first five instructions of memset16(), I'm doing several things in one go: testing the destination (and count) for 0, doing a single halfword write for non-word-aligned destinations, and returning from the routine if the count is 0 afterwards. I can do all this in five instructions through clever manipulation of conditionals.
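
For readers who find C easier to follow than condition codes, here is a rough C model of what those five instructions do. It is only a sketch: the name `ms16_prologue` and its pointer-to-pointer interface are made up for illustration, and it assumes the destination is at least halfword-aligned, as the assembly does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the five-instruction prologue.
   Returns the number of halfwords still to fill, or 0 if the routine
   should return at once. *pdst is advanced past the single alignment
   store when one was needed. */
static size_t ms16_prologue(uint16_t **pdst, uint16_t fill, size_t hwcount)
{
    uint16_t *dst = *pdst;

    if (hwcount == 0 || dst == NULL)    /* cmp / movnes leave Z set     */
        return 0;                       /* bxeq lr                      */

    if ((uintptr_t)dst & 2) {           /* address bit 1 ends up in N   */
        *dst++ = fill;                  /* strmih                       */
        hwcount--;                      /* submis; Z set if this hits 0 */
    }

    *pdst = dst;
    return hwcount;                     /* 0 here also means "return"   */
}
```

The assembly gets the same effect without branches because each instruction only updates the flags when the previous condition held, so the final `bxeq` sees whichever test ran last.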

The instructions that make up the main loop are a little non-standard as well. Here's how it works and why:

  1. Reduce fill-count, C, by N hwords. Note that C need not be a multiple of N. This is important.
  2. Fill N halfwords if C>=N. It's `>=', not `>', because C==N indicates the last stretch. It's also not `!=', because C need not be a multiple of N.
  3. Loop as long as C>N. In this case it is `>', because C==N indicates the last full stretch.
  4. Now it's time for the residuals. If C is now 0 (the original count was a multiple of N), then we're finished, so it's time to return.
  5. However, if there were residuals, then C<0, thanks to the last subtraction inside the loop. So we have to correct for it by adding N again.
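
The five steps can be sketched in C as follows. This is a model of the idea, not a translation of the routine: `fill_chunked` is a hypothetical name, a signed counter stands in for the carry and zero flags, and the residual run is a plain halfword loop rather than the word loop of the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `count` halfwords in chunks of `n` (n > 0), mirroring steps 1-5.
   Hypothetical helper; works for any n, not just powers of two. */
static void fill_chunked(uint16_t *dst, uint16_t fill, size_t count, size_t n)
{
    ptrdiff_t c = (ptrdiff_t)count;
    ptrdiff_t step = (ptrdiff_t)n;

    do {
        c -= step;                      /* 1. subs: reduce C by N      */
        if (c >= 0) {                   /* 2. stmhs: old C >= N        */
            for (size_t i = 0; i < n; i++)
                *dst++ = fill;
        }
    } while (c > 0);                    /* 3. bhi: old C > N           */

    if (c == 0)                         /* 4. bxeq: nothing left over  */
        return;

    c += step;                          /* 5. addne: undo the overstep */
    while (c-- > 0)                     /* residual run, 0 <= C < N    */
        *dst++ = fill;
}
```

Calling it with `n` set to 16 or to 12 gives the same result for any count, which is the point of testing with `>=` and `>` instead of `!=`.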

The standard method is to split off possible residuals first, but this version is shorter and allows for earlier escaping. A second benefit is that you can use non-power-of-two values for N as well. It is possible, for example, to use a 12-fold stmia here with only a few changes. The lower number of loops means that this would be ~10% faster … eventually. It really depends on things like memory waitstates whether the 12-fold version is worthwhile.
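
A back-of-the-envelope check of that figure, as a sketch only. It assumes zero waitstates for both code and destination, the usual ARM7TDMI costs (1 cycle for `subs`, words + 1 cycles for `stmia`, 3 cycles for a taken branch), and counts every branch as taken. The names `loop_cycles` and `fill_cycles` are made up for this estimate, and the extra push/pop cost of the additional registers is ignored.

```c
/* Cycles for one pass of the block loop storing `words` registers.
   ASSUMPTION: zero waitstates; subs (1) + stmia (words + 1) + branch (3). */
static unsigned loop_cycles(unsigned words)
{
    return 1u + (words + 1u) + 3u;
}

/* Loop cycles to fill `bytes` bytes; `bytes` must be a multiple of
   4 * words so that no residual run is needed. */
static unsigned fill_cycles(unsigned bytes, unsigned words)
{
    return (bytes / (4u * words)) * loop_cycles(words);
}
```

Under these assumptions, 96 bytes take 3 × 13 = 39 loop cycles with the 8-fold store and 2 × 17 = 34 with a 12-fold one, roughly 13% fewer. Slower memory would shift the balance, since the store itself then dominates.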


Oh, the highlighting was done by geshi as well. Making that arm-asm highlighter turned out very easy indeed.

EDIT, 2007-12-07

There was a small bug in the version above. r12 was used but never initialized. I know I had it in there when I tested it, but somehow it got lost.

One thought on “New memset16 routine”

  1. Heh, I know this is old but...
    When I wrote my 32-bit memcpy functions, I happened to use an identical method xP
    I copied in 32-byte chunks and ldmfd,bxeq'd immediately after, and added the remainder.
    I just didn't handle 16-bit things cos I keep everything aligned anyway.

    Great minds think alike, huh ;D
