Define: overthinking

<ramble>

I think too much. Or so people keep telling me. And they may have a point.

Anyway, so I'm working on this assembly version of toncset for, well, just because I guess. A fill routine has 3 parts: a head, a main run and a tail. The main part fills 32-bit chunks (words) when there's more than 4 bytes left and when the destination is word-aligned. This part is easy, because you can just dump words with stmia or str. For an example of this, see my older memset16 post.

The tail is for the remaining bytes after the main part. Under normal circumstances you could do this byte for byte, but some sections of GBA/NDS memory do not take kindly to byte-writes, so you'd have to read the word, mask the appropriate bytes out/in and then write it back. This is mostly just annoying, but still very doable.

The head part, however, is both annoying and tricky. It consist of filling the unaligned bytes in the first word of a run so that the main part can do its thing. This is similar to the tail part, in that it requires bitmasks. However, it's also possible that both the beginning and and of a run occur in the same word, effectively making the head the tail as well, so you'd have to apply a double mask … somehow. In C, it looks something like this:

/* Input:
  void *dstv;   // potentially non-aligned
  u32 fill;     // Already extended to full 32-bit if appropriate.
  uint size;    // > 0
*/


u32 addr= (u32)dstv;
int left= addr&3;
if(left != 0)
{
    u32 *dst= (u32*)(addr&~3);
    int right= left+size;
    u32 mask= 0xFFFFFFFF;

    if(right < 4)   // Everything in a single word: head and tail.
    {
        mask= ~(mask<<8*size);          // Create right-mask 000F
        mask <<= 8*left;                // Create mid mask 00F0
        *dst = (*dst &~ mask) | (fill & mask);
        return;
    }
    else
    {
        mask <<= 8*left;                // Create left mask FFF0
        *dst = (*dst &~ mask) | (fill & mask);
        size= right-4;
        *dst++;
    }
}

This bit of C translates roughly into the following ARM code:

    @ Reglist
    @ r0 : dst
    @ r1 : src
    @ r2 : size
    @ r3 : left (dst&3) / right (left+size) / data
    @ r4 : lshift (left*8)
    @ r5 : rshift (right*8) / mask
    @ ip : maskBase (0xFFFFFFFF)
.Lfgset_head:
    bic     r0, r0,#3               @ Align dst
    mvn     ip, #0                  @ Set-up mask (0xFFFFFFFF)
    mov     r4, r3, lsl #3          @ (left*=8) != 0 : right-side only
    add     r3, r3, r2              @ right= left+size
    cmp     r3, #4

    @ <= 4 : single-word. Shrink mask, do usual and quit early
    movlo   r5, r2, lsl #3          @ \.
    mvnlo   ip, ip, lsl r5          @ - r5= ((1<<8*size)-1)<<(8*left);
    mov     r5, ip, lsl r4          @ /

    subhi   r2, r3, #4              @ Adjust size for follow-ups

    @ Mask in r1 and write back to dst
    ldr     r3, [r0]
    bic     r3, r3, r5
    and     r5, r1, r5
    orr     r3, r5
    str     r3, [r0], #4

    bhi     .Lfgset_main            @ Longer stretch : go back.
    bx      lr                      @ Single-word fill : finished.

The main thing I thought when I'd written this down was “meh”. You will note that registers r4 and r5 are used here, which means stack-work (omitted here for brevity). The positioning of pushing and popping makes everything a little awkward, so I was off looking for something else.

The essence of the problem here is that you can only use five registers without touching the stack: r0-r3 and ip (r12). Now, r0-r2 are taken by the destination, fill and size, so I can't do anything with those. I also need one to store the left-edge (r3 in this case), leaving us with one for the right edge, the left and right shift intermediaries, and the left and right masks. Right! Crap -_-.


So; strategies. Well, one of the reasons I need to use so many registers is because the lifetimes overlap. For example, I still need left for a while because shifting the mask up comes last here. I can't use r2 for multiple purposes either because I'll need it for the size. Now, I could free up r3 by making the left-mask first, but then I might get in trouble when creating the right-mask. Also, right-4 is actually what size wants to be when it grows up in the long-run case, so I can use that there as well. I'd just have to undo that for the short case, or perhaps even create the right-mask from negative numbers.

At this point I figured it would be helpful to take a look at the various ways of creating the masks I needed. The standard form for a 0x000000FF-style bitmask is (1<<x)-1, but there are others as well. The following list holds a few examples.

// Bitmask examples. Assume x=8 (in r1)

 ( 1<<x)-1;     // 000000FF.    mov r0, #1; rsb r0, r0, r0, lsl r1;
~(-1<<x);       // 000000FF.    mvn r0, #0; mvn r0, r0, lsl r1;
  -1<<x;        // FFFFFF00.    mvn r0, #0; mov r0, r0, lsl r1;
-( 1<<x);       // FFFFFF00.    mov r0, #1; sub r0, r0, r0 lsl r1; sub r0, #1
  -1>>x;        // 00FFFFFF.    mvn r0, #0; mov r0, r0, lsr r1;
~(-1>>x);       // FF000000.    mvn r0, #0; mvn r0, r0, lsl r1;

All of these use +1 or −1 as their base, and all but one is a two-instruction affair. The left-mask looks like 0xFFFFFF00, so the most obvious one to pick here is -1<<x. Technically, the right-mask is 0x000000FF, using x = 8*right = 8*(left+size). However, you can also see it as a 0x00FFFFFF-style mask if you use 4-right. This solves two problems at once. First, it is the negative of the new size, so the value is readily available. Second, this mask is a right-shifted 0xFFFFFFFF, but as the lower bits are shifted out anyway, it doesn't actually have to be a proper 0xFFFFFFFF; it can be a 0xFFFFFF00 as well, which we have in the form of the left-mask. In other words, we don't require as many registers for temps because we already have everything we need. The resulting code is this:

    add     r2, r3, r2              @ (1a) right := left+size.
    movs    r3, r3, lsl #3
    mvn     ip, #0                  @ ip= -1
    mov     r3, ip, lsl r3          @ lmask := (-1)<<(8*left)
    subs    r2, r2, #4              @ (1b) aligned size= right-4

    @ new size < 0 : single-word fill
        rsblo   r2, r2, #0          @ - (2) r3= -8*size
        movlo   r2, r2, lsl #3      @ /
        andlo   r3, r3, r3, lsr r2  @ (3) mask = lmask & rmask
   
    @# inserts and jumps

See? No r4 and r5 anywhere. The key is toying around with r2 and r3. While r2 is reserved for the size, it needs to modified anyway to account for the work done here. In the end, size should be right−4, which is what points (1a) and (1b) do. Since right-4 is a right<4 as well, we can use its result as the condition for the special case; the result being the negative distance from the word-edge. As explained above, the right-mask can be constructed from the left-mask by lmask>>(-8*size), which is done at points (2) and (3).

It's a little hairy, but it works. And yet, it still evoked a feeling of “meh” like before. It's the two instructions at point (2) that annoyed me. The reason it's two instructions and not one is because you can't multiply by −8 in one go. By +8, yes: that's a shifted-mov; −1, yes: that's rsb, r2, #0; but the combination is difficult because the shift only applies to the second operand. A sub r2, #0, r2, lsl #3 would do it, but the first operand needs to be a register and I don't have a spare one with zero in it. I could make one, that just means I have an extra instruction somewhere else. I do, however, have either a +1 or −1 in ip, maybe I can use that somehow. And then it hits me: the carry flag!

There are adc, sbc and rsb instructions that add C, C−1 and C−1 to the result, respectively. Setting or clearing the carry flag is easy, so that's not a problem. All I need now is to start the flag using +1 instead of −1 to cancel out the −1 in sbc or rsc. As it turns out, I can do that I use this format for the left-mask: -(1<<x). In the mask overview above I listed thi as a 3-instruction gig, but as it turns out I can use the carry-trick here to for one instruction less. The final version (for just the head part) looks like this:

    ands    r3, r0, #3
    beq     .Lfgset_main            @ Jump to main stint is aligned

.Lfgset_head:
    bic     r0, r0, #3
    add     r2, r3, r2
    movs    r3, r3, lsl #3          @ left*8 ; clear carry
    mov     ip, #1
    sbc     r3, ip, ip, lsl r3      @ -(1<<8*left) +1-1
    subs    r2, r2, #4              @ size= right-4

    @ If negative (==carry clear), this is a single-word fill
    @ This requires a truncated mask (like 0x0000FF00)
        sbclo   r2, ip, r2, lsl #3      @ x= -8*size +1-1
        andlo   r3, r3, r3, lsr r2      @ mask= mask & (mask>>-8*size);

    @ Insert and jump to main stint if available.
.Lfgset_insert:
    ldr     ip, [r0]
    bic     ip, ip, r3
    and     r3, r1, r3
    orr     ip, ip, r3
    str     ip, [r0], #4

    bhi     .Lfgset_main        @ Longer stretch
    bx      lr                  @ Single-word fill : finished.

Sweeeet :). I was happy with this, until I realized what I'd been working on: an exception of an exception. This would definitely not be part of the 20% of the code that uses 80% of the runtime, so it's really not something one should worry about. Interesting, yes, and I learned a few new tricks, but perhaps time would have been better spent on getting 5% extra out of the main loop. The only problem there is that that is just boring old unrolling a bit, whereas the head presented a more ‘interesting’ problem so I went for that instead.

So yeah; I think too much >_>

</ramble>

tonc 1.4 official release

The files have need downloadable for a while now as a preview, but I finally put the text up on the main site as well so I guess that makes it official. Tonc is now at version 1.4. As mentioned before, the main new thing is TTE, a system for text for all occasions. I've also used grit in some of the advanced demos, so if you want to see how you can do advanced work with it, check out the mode 7 demos and the tte demo.

This will be the last version of Tonc. It's really gone on long enough now.


Files and linkies :


Right! Now what …