Define: overthinking

<ramble>

I think too much. Or so people keep telling me. And they may have a point.

Anyway, so I'm working on this assembly version of toncset for, well, just because I guess. A fill routine has 3 parts: a head, a main run and a tail. The main part fills 32-bit chunks (words) when there's more than 4 bytes left and when the destination is word-aligned. This part is easy, because you can just dump words with stmia or str. For an example of this, see my older memset16 post.

The tail is for the remaining bytes after the main part. Under normal circumstances you could do this byte for byte, but some sections of GBA/NDS memory do not take kindly to byte-writes, so you'd have to read the word, mask the appropriate bytes out/in and then write it back. This is mostly just annoying, but still very doable.

The head part, however, is both annoying and tricky. It consist of filling the unaligned bytes in the first word of a run so that the main part can do its thing. This is similar to the tail part, in that it requires bitmasks. However, it's also possible that both the beginning and and of a run occur in the same word, effectively making the head the tail as well, so you'd have to apply a double mask … somehow. In C, it looks something like this:

/* Input:
  void *dstv;   // potentially non-aligned
  u32 fill;     // Already extended to full 32-bit if appropriate.
  uint size;    // > 0
*/


u32 addr= (u32)dstv;
int left= addr&3;
if(left != 0)
{
    u32 *dst= (u32*)(addr&~3);
    int right= left+size;
    u32 mask= 0xFFFFFFFF;

    if(right < 4)   // Everything in a single word: head and tail.
    {
        mask= ~(mask<<8*size);          // Create right-mask 000F
        mask <<= 8*left;                // Create mid mask 00F0
        *dst = (*dst &~ mask) | (fill & mask);
        return;
    }
    else
    {
        mask <<= 8*left;                // Create left mask FFF0
        *dst = (*dst &~ mask) | (fill & mask);
        size= right-4;
        *dst++;
    }
}

This bit of C translates roughly into the following ARM code:

    @ Reglist
    @ r0 : dst
    @ r1 : src
    @ r2 : size
    @ r3 : left (dst&3) / right (left+size) / data
    @ r4 : lshift (left*8)
    @ r5 : rshift (right*8) / mask
    @ ip : maskBase (0xFFFFFFFF)
.Lfgset_head:
    bic     r0, r0,#3               @ Align dst
    mvn     ip, #0                  @ Set-up mask (0xFFFFFFFF)
    mov     r4, r3, lsl #3          @ (left*=8) != 0 : right-side only
    add     r3, r3, r2              @ right= left+size
    cmp     r3, #4

    @ <= 4 : single-word. Shrink mask, do usual and quit early
    movlo   r5, r2, lsl #3          @ \.
    mvnlo   ip, ip, lsl r5          @ - r5= ((1<<8*size)-1)<<(8*left);
    mov     r5, ip, lsl r4          @ /

    subhi   r2, r3, #4              @ Adjust size for follow-ups

    @ Mask in r1 and write back to dst
    ldr     r3, [r0]
    bic     r3, r3, r5
    and     r5, r1, r5
    orr     r3, r5
    str     r3, [r0], #4

    bhi     .Lfgset_main            @ Longer stretch : go back.
    bx      lr                      @ Single-word fill : finished.

The main thing I thought when I'd written this down was “meh”. You will note that registers r4 and r5 are used here, which means stack-work (omitted here for brevity). The positioning of pushing and popping makes everything a little awkward, so I was off looking for something else.

The essence of the problem here is that you can only use five registers without touching the stack: r0-r3 and ip (r12). Now, r0-r2 are taken by the destination, fill and size, so I can't do anything with those. I also need one to store the left-edge (r3 in this case), leaving us with one for the right edge, the left and right shift intermediaries, and the left and right masks. Right! Crap -_-.


So; strategies. Well, one of the reasons I need to use so many registers is because the lifetimes overlap. For example, I still need left for a while because shifting the mask up comes last here. I can't use r2 for multiple purposes either because I'll need it for the size. Now, I could free up r3 by making the left-mask first, but then I might get in trouble when creating the right-mask. Also, right-4 is actually what size wants to be when it grows up in the long-run case, so I can use that there as well. I'd just have to undo that for the short case, or perhaps even create the right-mask from negative numbers.

At this point I figured it would be helpful to take a look at the various ways of creating the masks I needed. The standard form for a 0x000000FF-style bitmask is (1<<x)-1, but there are others as well. The following list holds a few examples.

// Bitmask examples. Assume x=8 (in r1)

 ( 1<<x)-1;     // 000000FF.    mov r0, #1; rsb r0, r0, r0, lsl r1;
~(-1<<x);       // 000000FF.    mvn r0, #0; mvn r0, r0, lsl r1;
  -1<<x;        // FFFFFF00.    mvn r0, #0; mov r0, r0, lsl r1;
-( 1<<x);       // FFFFFF00.    mov r0, #1; sub r0, r0, r0 lsl r1; sub r0, #1
  -1>>x;        // 00FFFFFF.    mvn r0, #0; mov r0, r0, lsr r1;
~(-1>>x);       // FF000000.    mvn r0, #0; mvn r0, r0, lsl r1;

All of these use +1 or −1 as their base, and all but one is a two-instruction affair. The left-mask looks like 0xFFFFFF00, so the most obvious one to pick here is -1<<x. Technically, the right-mask is 0x000000FF, using x = 8*right = 8*(left+size). However, you can also see it as a 0x00FFFFFF-style mask if you use 4-right. This solves two problems at once. First, it is the negative of the new size, so the value is readily available. Second, this mask is a right-shifted 0xFFFFFFFF, but as the lower bits are shifted out anyway, it doesn't actually have to be a proper 0xFFFFFFFF; it can be a 0xFFFFFF00 as well, which we have in the form of the left-mask. In other words, we don't require as many registers for temps because we already have everything we need. The resulting code is this:

    add     r2, r3, r2              @ (1a) right := left+size.
    movs    r3, r3, lsl #3
    mvn     ip, #0                  @ ip= -1
    mov     r3, ip, lsl r3          @ lmask := (-1)<<(8*left)
    subs    r2, r2, #4              @ (1b) aligned size= right-4

    @ new size < 0 : single-word fill
        rsblo   r2, r2, #0          @ - (2) r3= -8*size
        movlo   r2, r2, lsl #3      @ /
        andlo   r3, r3, r3, lsr r2  @ (3) mask = lmask & rmask
   
    @# inserts and jumps

See? No r4 and r5 anywhere. The key is toying around with r2 and r3. While r2 is reserved for the size, it needs to modified anyway to account for the work done here. In the end, size should be right−4, which is what points (1a) and (1b) do. Since right-4 is a right<4 as well, we can use its result as the condition for the special case; the result being the negative distance from the word-edge. As explained above, the right-mask can be constructed from the left-mask by lmask>>(-8*size), which is done at points (2) and (3).

It's a little hairy, but it works. And yet, it still evoked a feeling of “meh” like before. It's the two instructions at point (2) that annoyed me. The reason it's two instructions and not one is because you can't multiply by −8 in one go. By +8, yes: that's a shifted-mov; −1, yes: that's rsb, r2, #0; but the combination is difficult because the shift only applies to the second operand. A sub r2, #0, r2, lsl #3 would do it, but the first operand needs to be a register and I don't have a spare one with zero in it. I could make one, that just means I have an extra instruction somewhere else. I do, however, have either a +1 or −1 in ip, maybe I can use that somehow. And then it hits me: the carry flag!

There are adc, sbc and rsb instructions that add C, C−1 and C−1 to the result, respectively. Setting or clearing the carry flag is easy, so that's not a problem. All I need now is to start the flag using +1 instead of −1 to cancel out the −1 in sbc or rsc. As it turns out, I can do that I use this format for the left-mask: -(1<<x). In the mask overview above I listed thi as a 3-instruction gig, but as it turns out I can use the carry-trick here to for one instruction less. The final version (for just the head part) looks like this:

    ands    r3, r0, #3
    beq     .Lfgset_main            @ Jump to main stint is aligned

.Lfgset_head:
    bic     r0, r0, #3
    add     r2, r3, r2
    movs    r3, r3, lsl #3          @ left*8 ; clear carry
    mov     ip, #1
    sbc     r3, ip, ip, lsl r3      @ -(1<<8*left) +1-1
    subs    r2, r2, #4              @ size= right-4

    @ If negative (==carry clear), this is a single-word fill
    @ This requires a truncated mask (like 0x0000FF00)
        sbclo   r2, ip, r2, lsl #3      @ x= -8*size +1-1
        andlo   r3, r3, r3, lsr r2      @ mask= mask & (mask>>-8*size);

    @ Insert and jump to main stint if available.
.Lfgset_insert:
    ldr     ip, [r0]
    bic     ip, ip, r3
    and     r3, r1, r3
    orr     ip, ip, r3
    str     ip, [r0], #4

    bhi     .Lfgset_main        @ Longer stretch
    bx      lr                  @ Single-word fill : finished.

Sweeeet :). I was happy with this, until I realized what I'd been working on: an exception of an exception. This would definitely not be part of the 20% of the code that uses 80% of the runtime, so it's really not something one should worry about. Interesting, yes, and I learned a few new tricks, but perhaps time would have been better spent on getting 5% extra out of the main loop. The only problem there is that that is just boring old unrolling a bit, whereas the head presented a more ‘interesting’ problem so I went for that instead.

So yeah; I think too much >_>

</ramble>

5 thoughts on “Define: overthinking

  1. Ahh, _that_ habit. Very insidious and captivating. And the death of many perfectly good evenings. Did I mention this post is giving me brainfaults? @_@

    In the spirit of Not Helping(tm), I'll throw in my favorite ARM perversions:

    mov timers, timer_1
    orr timers, timer_2, lsl #16
    mov r0, shiftCount
    orr r0, someSpareFlags, lsl #8
    orr r0, counter, lsl #12
    ...
    loop:
    rsbs timers, dt, timers, ror #16 ; decrement and test high half < 0
    blt infrequent_complex_event_1
    rsbs timers, dt, timers, ror #16 ; decrement and test low half < 0
    blt pesky_distraction_2
    ...
    mov foo, bar, lsl r0 ; uses low 8 bits of r0
    ...
    orr/bic../tsts r0, someBit ; toy with spare flags
    ...
    subs r0, #(1<<12) ; decrement and test counter
    blt loop

    This came from some Adlib/OPL2 emulation. Desparate enough? Well, that wasn't the end of it, and there are _still_ LDM's in there for lack of registers. Though not all of this was used in the end. I ditched sample-accurate envelopes, and I still can't break ~80 ARM9 cycles/sample. :|

  2. mov timers, timer_1
    orr timers, timer_2, lsl #16
    mov r0, shiftCount
    orr r0, someSpareFlags, lsl #8
    orr r0, counter, lsl #12
    ...
    loop:
    rsbs timers, dt, timers, ror #16 ; decrement and test high half < 0
    blt infrequent_complex_event_1
    rsbs timers, dt, timers, ror #16 ; decrement and test low half < 0
    blt pesky_distraction_2
    ...
    mov foo, bar, lsl r0 ; uses low 8 bits of r0
    ...
    orr/bic../tsts r0, someBit ; toy with spare flags
    ...
    subs r0, #(1<<12) ; decrement and test counter
    blt loop

    Niiiice. I never had to cram several variables in a single word like you're doing for r0 here, but I've always been fond of abusing ROR. For example, on some occasions you can split a coordinate into the divident and the modulo by using ROR 3 instead of LSR 3. In later instructions you can use either part with further shifts, usually without getting into trouble.

    I also came across a nice bit for the tail-end of a large fill routine. When you're down to less than 32 bytes, you can do a word-loop, unroll the word-loop, or go for a nifty stmvariant like this:

        @ Assume r0:dst ; r3-r7:fill ; r2:remaining bytes
        tst     r2, #16
        stmneia r0!, {r4-r7}
        tst     r2, #8
        stmneia r0!, {r4-r5}
        tst     r2, #4
        ... etc

    But you can actually test two bits at once if you shift them into the sign and carry bit:

        @ Assume r0:dst ; r3-r7:fill ; r2:remaining bytes
        mov     ip, r2, lsl #(32-4)     @ test for bits 4 and 3
        stmcsia r0!, {r4-r7}
        stmmiia r0!, {r4-r5}
        mov     ip, r2, lsl #(32-2)     @ test for bits 2 and 1
        stmcsia r0!, {r4}
        strmih  r4, [r0], #2

    Don't you just love that barrel shifter? ^_^.

    This came from some Adlib/OPL2 emulation. Desparate enough? Well, that wasn't the end of it, and there are _still_ LDM's in there for lack of registers. Though not all of this was used in the end. I ditched sample-accurate envelopes, and I still can't break ~80 ARM9 cycles/sample. :|

    And now I'm curious about that whole thing too, but I'm fighting the urge to ask for details.

  3. Don't you just love that barrel shifter? ^_^.

    I feel so weird working with x86 anymore! I guess it's partly a matter of perspective. (e.g.: Memory operands and register renaming as a form of code compression..)

    6502 is my second favorite to ARM. Didn't try much, but I always found the simplicity refreshing. Though a funny thing about it is that if you set the low two opcode bits, supposedly you get an "ALU + shifter" op, like ORA + ASL ("ASO"). Sound familiar? These are the "illegal" ops the chip is so infamous for. I couldn't say whether that's where ARM got the idea, but it's very nifty indeed.

    And now I'm curious about that whole thing too, but I'm fighting the urge to ask for details.

    Er, I'll try to be brief then. (It's not my strong point.)

    Some background: Yamaha's OPL2/3 series cropped up in tons of 1990s-era sound boards. So many, that I wonder if lame OPL instruments are THE reason some folks associate MIDI with "crap music." But used on children of the 80's, good "FM" tunes--like the NES and C64--should induce spine-shivering waves of nostalgia. ;-)

    Other excuses include a burning curiosity regarding MAME's spaghettiesque OPL driver (from which I got the algorithms), and a nice, complete set of instrument patches (ah, wasted youth!).

     while(numSamples--) {
         // Envelopes (among other settings) modulate the wave amplitude.    
         if((envTimer0 -= dt) < 0) updateEnv0();  // Branchy state machine
         if((envTimer1 -= dt) >N] wraps properly.	
         feedb   = out0 + out0_pr;
         out0_pr = out0;             // wave0 can modulate its own phase.
         
         // wave0 modulates wave1 phase (or can just be mixed--not shown)	
         // note: amp and wave tables use log values (hence the adds).
         out0    = wave0[ (phase0 + (feedb <> F_BITS ];
         out0    = expTable[ amp0 + out0 ];
         out1    = wave1[ (phase1 + (out0  <> F_BITS ];
         out1    = expTable[ amp1 + out1 ];
         
         *pMix++ += out1;  // 32 bits (clamping and dither done later)	
     }

    This is pseudocode for the meat of it. As you can see, it's not truly FM, but it's close. Topping out (at 18 voices, 32Khz stereo) uses ~75% of the ARM9.

    Right now it's playing MIDIs, but I'd love to see a Mario Paint sorta thing...

  4. Erk! Maimed by the sanitizer. It actually even ate a few lines. Sorry about that. :(

    You've got those beautiful code blocks up there--now I have to go reading... Is there a good reference on what WordPress will tolerate? Or at least a well-hidden preview? ^_^;

  5. Yeah, the sanitizer's a bit of a bitch. It's a battle between the standard WP styles, my own styles and several plugins. I've been toying with some of the settings to make sure the comments behave as sensible as possible, but as you can see it can still be a little messy. I can fix the code for you, but I'd need the stuff that was deleted.

    The code blocks are done with code snippet, which uses geshi and [code] tags. For pretty colors, you can use lang="..."; cpp for C/C++ and gccarm for ARM assembly are the most common for me.

    I'd love to have a preview, but my webscript-fu is still somewhat weak, so I'm not sure how to implement it ... yet. Of course, if one should suddenly find itself in my mailbox, I would consider it (hint, hint).

Leave a Reply

Your email address will not be published. Required fields are marked *