Bit-tricks and other nifty little snippets

1 Introduction

Here are a few nifty little items involving bit operations for various purposes like branchless signs and bit conversions. These things should be faster than more standard algorithms for the same thing. Because they came from GBA-experience, they're particularly apt for ARM cores, though they should work for other platforms as well. Routines are given in C or ARM or Thumb assembly. I'm assuming bits per word (BPW) here (BPW = 32), but code can be extended

Just a bit of warning, though: I am not saying these tricks should be used; I'm saying these can be used. They're mostly just for fun and to show alternative solutions. Whether they're faster than the regular implementations depends on many things, including processor and compiler. I would recommend some of these if you're already working in assembly though – once you're doing that, you probably have found that you need more speed than the compiler can give you (if not, you're doing something wrong), at which time these entries can actually help you.

2 Math items

A few simply math functions like abs() and sign() use branched code that could just as well be done with bit operations, which are often faster than branching. Most of the items here (alright, pretty much all of them) come from S.E. Anderson's bithacks; a collection that I discovered just after I figured out a few of them myself.

2.1 Branchless 2-way signs

There are three ways of indicating a 2-way sign: negative-based (−1,0), positive-based (0,+1) and sign-based (−1,+1). The basic form is negative-based, which in 2s complement is a one-instruction algorithm: simply sign-extend into all bits by right-shifting by BPW−1. For a positive-based sign, add one. The last one is a little trickier, but can be calculated by multiplying the negative-sign by 2 and adding one, or shift-right by BPW−1 and setting bit 0 (Note using BPW−2 will work as well).

Here are all three:

//# in C
// Negative sign    (x>=0 ?  0 : -1)
sn = x>>31;

// Positive sign    (x>=0 ? +1 :  0)
sp = (x>>31)+1;

// Sign-sign        (x>=0 ? +1 : -1)
ss = (x>>31) | 1;
@# in Thumb asm

@ Negative sign (x>=0 ?  0 : -1)
    asr     r0, r0, #31

@ Positive sign (x>=0 ? +1 :  0)
    asr     r0, r0, #31
    add     r0, #1

@ Sign-sign     (x>=0 ? +1 : -1)
    asr     r0, r0, #31
    lsl     r0, r0, #1
    add     r0, #1
@ or
    asr     r0, r0, #31
    mov     r1, #1
    orr     r0, r1

Most of the routines in this section use the `asr 31' trick in some way or another. It's amazing how much you can do with it.

2.2 Branchless 3-way signs

This can also be used for a 3-way sign: ±1 for positive/negative and 0 for zero. The negative and zero parts can be foind from the 2-way negative sign. for the positive part, shift −x, rather than x. This will give −1 for positive numbers, yet (−0)>>31 will still be 0. Subtracting that from the earlier sign gives the whole thing.

// Alternative for sign= x>0 ? +1 : (x<0 ? -1 : 0)

//# C version
sign= (x>>31) - (-x>>31);

//# ARM
    rsb     r1, r0, #0
    mov     r0, r0, asr #31
    sub     r0, r0, r1, asr #31

//# Thumb
    neg     r1, r0
    asr     r1, r1, #31
    asr     r0, r0, #31
    sub     r0, r1

2.3 Branchless abs(x)

The standard is `abs(x) = x>=0 ? x : -x'. Since −x = ~x+1 = x XOR (−1) − (−1), you can do this without branches too.

// Alternative for abs= x>=0 ? x : -x

//# C version
sign= (x>>31);
abs = (x^sign) - sign;

//# ARM
    eor     r1, r0, r0, asr #31
    sub     r0, r1, r0, asr #31

//# Thumb
    asr     r1, r0, #31
    eor     r0, r1
    sub     r0, r1

Of course you don't need a branchless abs() in ARM because you can do the same thing with a rsblt, but it's nice that it exists.

Reference: S.E. Anderson : abs. He also points out that apparently this approach is patented >_>.

2.4 Branchless min/max

For min/maxing, you have to compare two values, say a and b. This generally takes the form of a subtraction: ab. The point is that whether a>b or vice versa depends on the sign of the difference. If you then mask this difference with the −1-sign, you have either 0 (if ab) or the signed difference (if a<b, so this will always be negative), depending on which variable was bigger, and this value can be used to ‘correct’ the compared variables.

//# C versions
min = b + ((a-b) & (a-b)>>31);
max = a - ((a-b) & (a-b)>>31);
@# Thumb versions. r0:a; r1:b

@ min
    sub     r0, r1
    asr     r2, r0, #31
    and     r0, r2
    add     r0, r1

@ max
    sub     r1, r0, r1
    asr     r2, r1, #31
    and     r1, r2
    sub     r0, r1

Whether these are really much use is debatable, though: they use 3 registers and are one instruction longer than the standard min/max routines, which use a compare, a branch and a move. That said, the branch is quite costly. EDIT, 20071110: but not costly enough. the standard routine costs, on average, 3S+½N cycles; the branchless version costs 4S. Thanks Dennis.

There is, however, one special case where this version is definitely better: max(a,0), (and presumably min(a,0) too, but that should be rare). In this case, the thing basically reduces to `a-(a&a>>31)'. This can be simplified to `a&~(a>>31)' which is two instructions in Thumb and only one in ARM code.

2.5 A very sneaky bit-field clamp

It is sometimes necessary to keep values within a certain range, like [0, A⟩. The standard approach would be to check for min/max and act accordingly: `y = min(max(x,A−1),0)' Normally, this would cost 4 ARM instructions, or about 6 Thumb (with branches, ick); however, if A is a power of two, you can use this rather shifty technique to do it faster.

First, define A = 2n. In other words, we want the result to be confined to n bits. If x is outside the desired range (either too large or too small), then some of the bits outside the range will be non-zero. This gives us our test condition. Even better, the specific bits set indicate on which side we are and how to correct for it. If x<0, then x is negative (well, duh), meaning that the sign-bit is set. If xA, then the sign-bit is clear. Now, the values to clamp to are (all) 0 and A−1 (all ones for n bits); which are essentially the complements of the extended signbit. In other words, once you know you're outside the valid range, sign-extend (x>>31), flip all bits and mask by A−1 (or shift-right by 32−n).

Interestingly, you can combine a few of these operations. The range-test can be performed by a right-shift, which also acts as a sign-extension. The flip and shift as usual. Using a shift-test, rather than a mask-test also clears up a potential problem, namely that the out-of-range values could have extended into the very high bitrange (right underneath the sign-bit), and so complicating the final shift.

// Bit-field clamp;
// Alternative for (x<0 ? 0 : (x>=(1<<n) ? (1<<n)-1 : x)
// NOTE: n is considered constant.

//# C version:
int x;                      // NOTE: signed!
u32 y;                      // NOTE: unsigned!
if( y=x>>n )
    x= ~y >> (32-n);

@# ARM. r0:x
    movs    r1, r0, asr #n
    mvnne   r1, r1
    movne   r0, r1, lsr #(32-n)
@# Thumb. r0:x
    asr     r1, r0, #n
    beq     .Linrange
    mvn     r1, r1
    lsr     r0, r1, #(32-n)

If the clamp is in a loop, you can actually cut the ARM version down to two instructions. Instead of `~y>>(32−n)', pre-load B = A−1 and use `B^(y>>(32−n))'. One extra note on the C version: currently, gcc still uses an extra (superfluous) cmp, so compiled code would not be optimal, but it'd still be faster than the standard version.

    ldr     ip,=(1<<n)-1   @ prep mask before loop

    @ in the loop:
    movs    r1, r0, asr #n
    eorne   r0, ip, r1, lsr #(32-n)

2.6 Cheap power-of-two math

You must know these already, but I'm going to point them out again anyway. For powers-of-two (PoT), you can replace multiplies, divisions and modulos by shifts.

Base form Bitop version
y= x*2n y= x<<n
y= x/2n y= x >>n
y= x%2n y= x & (2n−1)

I should point out, though that the division and modulo aren't quite correct. The problem is that a division rounds towards 0 and a right-shift rounds towards −infinity. Sometimes this is what you want anyway, but for an exact signed-division replacement you'd need to do a correction.

3 Bit unpacking

Fair enough, bit-unpacking doesn't happen too often anymore, but it may be useful for creating masks for bitmaps or collision detection, or for font rendering, if the font is initially stored as a 1bpp bitmap – and even when it isn't you can still do ... interesting things with bit patterns.

3.1 8-fold 1→4 bitunpack via look-up

There are two ways of doing this: putting the masks in a table and look them up, or use straight bit-manipulation. Version A is the pussy version, but in Thumb seems to work best. What it does is read a source byte, use the nybbles of that byte to look-up the two halves of the word from a 16-entry LUT, then string them together to make the final mask.

// Required LUT, with masks for values 0-15.
const u16 pxlut[16] =
    0x0000, 0x0001, 0x0010, 0x0011,  0x0100, 0x0101, 0x0110, 0x0111,
    0x1000, 0x1001, 0x1010, 0x1011,  0x1100, 0x1101, 0x1110, 0x1111
//# C version
u8 *srcL;


u32 raw= *srcL;
    raw= pxlut[raw&15] | pxlut[raw>>4];
@# Thumb. Time without/with branch taken: 2Nc+1Nd+3S+I / 3Nc+3Nd+6S+3I
@# r1:srcL;  r3/r4:raw; r6:pxlut; r7:tmpmask
    @ Before loop
    ldr     r6,=pxlut
    mov     r7, #0x1E       @ =15*2
    ldrb    r3, [r1]
    lsl     r3, r3, #1      @ *2 because pxlut = 16-bit
    beq     .Lnopx          @ Zero byte == Zero mask : skip work.
    lsr     r4, r3, #4
    ldrh    r4, [r6, r4]    @ Load HI mask
    and     r3, r7
    ldrh    r3, [r6, r3]    @ Load LO mask
    lsl     r4, r4, #16     @ - Combine
    orr     r3, r4          @ /

Note that there is a check to see whether raw is zero. If it is, then you know the line is empty and you don't need to render anything. If your font has lower-case letters, they will often have non- zero lines (‘e’, for example). Skipping the render can make a big difference. If not, the one extra S-cycle isn't too much compared to the rest.

3.2 8-fold 1→4 bitunpack via bitops

You can also AND/OR/SHIFT your way to the solution, which is actually faster in ARM code because shifts are free. With shifted ORRs, you can double the number of bits processed per iteration, giving something like log2(M) speed. The only problem is that at some point you'll start overwriting previous results, so you have to clear bits from time to time as well.

The basic algorithm is as follows. The changes are given in red, and the ‘correct’ bits are bold.

0000 0000  0000 0000  0000 0000  hgfe dcba    p (start)
0000 0000  0000 hgfe  dcba 0000  hgfe dcba    p |= p<<12
0000 00hg  fedc xxfe  dcxx fedc  xxfe dcba    p |= p<< 6
0000 00h0  0000 00f0  0000 00d0  0000 00b0    q  = p&0x02020202
0000 000g  0000 000e  0000 000c  0000 000a    p &=   0x01010101 
000h 000g  000f 000e  000d 000c  000b 000a    p |= q<<3

Each of these corresponds to one ARM instruction. Aside from using a 256-entry LUT, this should be the fastest way to expand 8 bits into 8 nybbles.

@# ARM code. r1:src; r3:p; r4:q; r5:mask

    ldr     r5,=0x01010101          @ Pre-loop mask load
    ldrb    r3, [r1], #1
    orr     r3, r3, r3, lsl #12     @ - -1 Staggered shift into higher nybbles.
    orr     r3, r3, r3, lsl #6      @ /
    and     r4, r3, r5, lsl #1      @ Clean for even nybbles.
    and     r3, r3, r5              @ Clean for odd nybbles.
    orr     r3, r3, r4, lsl #3      @ Combine odd and even nybbles.

3.3 8-fold 1→4 bitunpack with reverse

The previous subsection was about 1 to 4 bitunpacking where the bits were were indexed little-endian inside the bytes: the left-most bit at bit 0 and so on. This is nice, because both bits and bytes are then arranged in a left-to-right reading order. Many graphics formats, however, have the left-most pixel in the high-bit of a byte, which to some is also nice, because it corresponds to a more conventional number-reading order.

# 1 to 4 bitunpack with and without reverse.
without: hgfe dcba -> 000h 000g  000f 000e  000d 000c  000b 000a
with   : hgfe dcba -> 000a 000b  000c 000d  000e 000f  000g 000h

The case with reversing the bits/nybbles is not that different from the one without. In fact, it's only one step longer than without the reverse. Here's the procedure for it.

0000 0000  0000 0000  0000 0000  hgfe dcba      p
0000 hgfe  dcba 0000  0000 0000  hgfe dcba      p |= p<<20;
ba00 hgfe  dcba 00hg  fedc ba00  hgfe dcba      p |= p<<10;
b000 000e  d000 000g  f000 0a00  h000 0c00      p &= 0x81818484
b0c0 0b0e  d000 ed0g  f000 gx00  h0a0 0x00      p |= p ROR 5;
b0x0 cb0x  ded0 edex  fgf0 gxgx  h0x0 ax0x      p |= p>>2;
000a 000b  000c 000d  000e 000f  000g 000h      p  = 0x11111111 & (p ROR 7);

As before, each of these steps corresponds to one ARM instruction. It does require two registers to store the masks in, though.

@# ARM code. r1:src; r3:p; r4:maskA; r5:maskB
    ldr     r4,=0x81818484
    ldr     r5,=0x11111111

    ldrb    r3, [r1], #1
    orr     r3, r3, r3, lsl #20     @ - +1 Staggered shift to higher nybbles.
    orr     r3, r3, r3, lsl #10     @ /
    and     r3, r3, r4              @ Clean some parts.
    orr     r3, r3, ror #5          @ - More shifting and positioning.
    orr     r3, r3, lsr #2          @ /
    and     r3, r5, r3, ror #7      @ Final cleanup + rotating into place.

Yes, those are RORs. They're needed because some bits have to move up, while others move down. I think some other variations are possible as well, but finding those is left as an exercise for the reader.

4 Misc

4.1 Writing bytes into an 16-bit bus

This is intended for GBA-VRAM-like memory, which doesn't allow byte writes. Well, it does, but it'll fill both bytes of that particular halfword with the byte you write, which is generally undesirable.

The standard solution is to see is to read the halfword, see which byte you need, then bic/orr the appropriate bits and write the halfword back. However, there is a slightly faster way. It starts by not reading the whole halfword, but by reading only the byte that should not be written to. The rest is fairly simple.

//# C version
u8 *ptr, value;
u32 addr= (u32)ptr;
    *(u16*)(addr-1) = *(u8*)(addr-1)     | value<<8;
    *(u16*)(addr  ) = *(u8*)(addr+1)<<8  | value;
@# ARM version
r0:ptr; r1:byte-value
    eor     r2, r0, #1
    ldrb    r3, [r2]            @ Read the other byte
    ands    r2, r0, #1          @ Test and prep for alignment
    orreq   r3, r1, r3, lsl #8  @ Even: src-byte needs shift
    orrne   r3, r3, r1, lsl #8  @ Odd : value needs shift
    strh    r3, [r0,-r2]        @ Align and write

Yes, I know how awful the C version looks, but that cannot be helped – you're not supposed to do stuff like that to pointers. That's what assembly is for :P. It really is a little faster than the standard version though. Of course, this is only beneficial if you require random-access. For sequential accesses, there are better ways.

4.2 Bit reversing

There is a really nifty method of reversing all bits (or power-of-two clumps of bits) of a byte/word. The standard method will have you loop over all pixels for O(N) time, but it's also possible to do it in O(logN). The basic trick is to first flip all single bits, then groups of 2, then 4 and so on. Here's an example of a byte-reverse, but it works for longer pieces as well.

# Bit-reverse procedure for a byte.
abcd efgh   p
badc fehg   p= ((p&0xAA)>>1) | ((p<<1) & 0xAA)
dcba hgfe   p= ((p&0x33)>>2) | ((p<<2) & 0x33)
hgfe dcba   p= ((p&0x0F)>>4) | ((p<<4) & 0xF0)  # ANDs may be omitted

But wait there's more! The example above is bit reversing all bits. But if you look carefully, you'll see that reversing groups of bits can be done by skipping the earlier reverses. This works out to a switch-block with fallthroughs.

The following is a C routine to reverse 1, 2, 4, 8 and 16 bits within a 32-bit word. It probably won't be compiled quite as it should, but it's a good start. The cases are correctly compiled to a jump-table, the ROR macro works. If you're working on ARMv5 or higher, you might be able to use a CLZ here as well somehow.

For those wondering about the const volatile: no I have not gone insane; it's just that the GCC version tested with cannot AND literals properly. It'll use 4 byte-sized ANDs, rather than a single word-sized AND, which completely kills the routine. The awkward formulation here is the only way I've found that ‘fixes’ this. It probably (hopefully) only applies to GCC for ARM, and only current versions (at time of writing, GCC 4.3.3). For reversing inside bytes, this is not an issue.

#define ROR(data, ror) ( ((data)<<(32-(ror))) | ((data)>>(ror) )

//# Reverse bits within a word. C version.
void data_bit_rev(u32 *data, u32 size, u32 bpp)
    u32 i, tmp, mask;

    const volatile u32 masks[4]=
    {   0xAAAAAAAA, 0x33333333, 0x0F0F0F0F, 0x00FF00FF  };

    for(i=0; i<size; i++)
        tmp= data[i];

        // All fall throughs
        case 1:
            mask= masks[0];
            tmp= (mask & (tmp>>1)) | ((tmp&mask)<<1);
        case 2:
            mask= masks[1];
            tmp= (mask & (tmp>>2)) | ((tmp&mask)<<2);
        case 4:
            mask= masks[2];
            tmp= (mask & (tmp>>4)) | ((tmp&mask)<<4);
        case 8:
            mask= masks[3];
            tmp= (mask & (tmp>>8)) | ((tmp&mask)<<8);
        case 16:
            tmp= ROR(tmp, 16);         
        data[i]= tmp;

Modified Sep, 2009, J Vijn.

15 thoughts on “Bit-tricks and other nifty little snippets

  1. Great tips!
    Also, I really like your tutorial of the Mode7. =)

    Keep doing great work like this. =)


  2. The C comment for the very sneaky bitfield clamp is wrong. ( ... ? ... : 0) should be (...? ... : x)

  3. Just thought I'd add this:

    In "2.5 A very sneaky bit-field clamp", you may be able to remove the temp. value if the mask fits in an immediate and save a register. ie.

    tst req, #IMM
    eorne reg, msk, reg, lsr #32 - bit_limit

    It's what I've done with my individual RGB brightness adjuster =3

  4. there is probably mistake in "Bit-reverse procedure for a byte.", where the second >> whould be actually <>1) | ((p<>2) | ((p<>4) | ((p<<4) & 0xF0) # ANDs may be omitted

  5. For sign-sign the shift count should be 31 instead of 30.

  6. As it happens, both will work. with >>30, you'll get 00 or 01 for positive numbers, and FF or FE for negatives. ORRing the 1 gives +1 or −1, as requested.

    But yeah, for consistency's sake, I'll change it to >>31 like the rest.

  7. I was playing with implementing the CLAMP routine under android today (ndk r8b). The code is compiled as Thumb-2, so it is not as restrictive as Thumb. I found that the "asr" instruction misbehaves when Rd != Rs. The workaround was to replace the "asr" with a "movs" like this:

    @# Thumb. r0:x
    //asr r1, r0, #n
    movs r1, r0, asr #n
    beq .Linrange
    mvn r1, r1
    lsr r0, r1, #(32-n)

  8. Should it be "(-x>>31) - (x>>31)" instead ?

    (x>>31) - (-x>>31); will produce -1 for 1, 1 for -1.

  9. Nope. Just tested it again, works as it should.

    // x= -1
    (-1>>31) - (1>>31) = (-1) - (0) = -1

    // x= 1
    (1>>31) - (-1>>31) = (0) - (-1) = 1

  10. Maybe I'm mistaken, but it seems that 3-way sign doesn't work for INT32_MIN.

  11. In 2.5, on the line that reassigns
    x = ~y >> (32 - n)
    I am running into a theoretical misunderstanding when x >= 32. For simplification, if n = 5 and integers are 8-bit, suppose
    int x = 44 // 000101100
    int y = x >> n // 00000001
    //now we convert to ~y = 11111110 and shift to the right (8-3) = 5 bits for 11111111 = -1

    Interestingly, the code implementation and proper substitutions yield this exact result on console (regardless of bit-length since they're all going to be filled in this case.)

    Now if I was not sign-extending, the answer would be 00011111 or 31, which is exactly 2^n - 1. Maybe I'm missing something? Are there instances where shifting bits implements sign-extension and... no sign extension? Or is there something missing in the code?

    SOLVED AS I TYPED: As I wrote this out, I suddenly understood it. But I figured it might be important for people who read this and fall into the trap I did. y is unsigned. Meaning logical shifts only (i.e. no sign-extending since sign is irrelevant, thus my theoretical solution). x is signed and in the context of gcc implements arithmetic shifts. Yikes, me. The two shifts here are of two different hierarchies. Okay, great. Solved. Hope that helped you, future me.

Leave a Reply

Your email address will not be published. Required fields are marked *