## 1 Introduction

Here are a few nifty little items involving bit operations for various
purposes, like branchless signs and bit conversions. These should be
faster than the more standard algorithms for the same tasks. Because
they came from GBA experience, they're particularly apt for ARM cores,
though they should work on other platforms as well. Routines are
given in C, ARM or Thumb assembly. I'm assuming a 32-bit word size
(*BPW* = 32) here, but the code can be extended to other word sizes.

Just a bit of warning, though: I am not saying these tricks *should* be
used; I'm saying these *can* be used. They're mostly just for fun and
to show alternative solutions. Whether they're faster than the regular
implementations depends on many things, including processor and compiler.
I would recommend some of these if you're already working in assembly
though – once you're doing that, you probably have found that you
need more speed than the compiler can give you (if not, you're doing
something wrong), at which time these entries can actually help you.

## 2 Math items

A few simple math functions like `abs()` and `sign()` use branched code
that could just as well be done with bit operations, which are often
faster than branching. Most of the items here (alright, pretty much all
of them) come from S.E. Anderson's bithacks, a collection that I
discovered just *after* I figured out a few of them myself.

### 2.1 Branchless 2-way signs

There are three ways of indicating a 2-way sign:
negative-based (−1, 0), positive-based (0, +1) and sign-based
(−1, +1). The basic form is negative-based, which in two's
complement is a one-instruction algorithm: simply sign-extend
into all bits by right-shifting by *BPW*−1. For a
positive-based sign, add one. The last one is a little trickier, but
can be calculated by multiplying the negative sign by 2 and adding
one, or shifting right by *BPW*−1 and setting bit 0 (note that
*BPW*−2 will work as well).

Here are all three:

```c
// Negative sign (x>=0 ? 0 : -1)
sn = x>>31;

// Positive sign (x>=0 ? +1 : 0)
sp = (x>>31)+1;

// Sign-sign (x>=0 ? +1 : -1)
ss = (x>>31) | 1;
```

```
@ Negative sign (x>=0 ? 0 : -1)
asr     r0, r0, #31

@ Positive sign (x>=0 ? +1 : 0)
asr     r0, r0, #31
add     r0, #1

@ Sign-sign (x>=0 ? +1 : -1)
asr     r0, r0, #31
lsl     r0, r0, #1
add     r0, #1

@ or
asr     r0, r0, #31
mov     r1, #1
orr     r0, r1
```

Most of the routines in this section use the `asr #31` trick in
some way or another. It's amazing how much you can do with it.
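As a sanity check, the three forms can be wrapped in plain C functions. This is a sketch: the function names are mine, and it assumes the compiler implements `>>` on signed values as an arithmetic shift (GCC does).

```c
#include <stdint.h>

// Negative-based sign: -1 or 0. The arithmetic shift smears the
// sign bit across the whole word.
static int32_t sign_neg(int32_t x)  { return x >> 31; }

// Positive-based sign: 0 or +1.
static int32_t sign_pos(int32_t x)  { return (x >> 31) + 1; }

// Sign-based sign: -1 or +1. Note that 0 counts as positive here.
static int32_t sign_sign(int32_t x) { return (x >> 31) | 1; }
```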

### 2.2 Branchless 3-way signs

This can also be used for a 3-way sign: ±1 for positive/negative
and 0 for zero. The negative and zero parts can be found from the
2-way negative sign. For the positive part, shift −*x*,
rather than *x*. This will give −1 for positive numbers, yet
(−0)>>31 will still be 0. Subtracting that from the
earlier sign gives the whole thing.

```c
//# C version
sign= (x>>31) - (-x>>31);
```

```
@# ARM
rsb     r1, r0, #0
mov     r0, r0, asr #31
sub     r0, r0, r1, asr #31

@# Thumb
neg     r1, r0
asr     r1, r1, #31
asr     r0, r0, #31
sub     r0, r1
```
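In C-function form for testing (a sketch; same arithmetic-shift assumption as before, and note that `INT32_MIN` is an edge case because `-x` overflows there):

```c
#include <stdint.h>

// Three-way sign: -1, 0 or +1.
static int32_t sign3(int32_t x)
{
    // (-x)>>31 is -1 for positive x and 0 for x <= 0,
    // since -0 is still 0.
    return (x >> 31) - (-x >> 31);
}
```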

### 2.3 Branchless abs(x)

The standard is `abs(x) = x>=0 ? x : -x`. Since
−*x* = ~*x*+1 = (*x* XOR −1) − (−1),
you can do this without branches too.

```c
//# C version
sign= (x>>31);
abs = (x^sign) - sign;
```

```
@# ARM
eor     r1, r0, r0, asr #31
sub     r0, r1, r0, asr #31

@# Thumb
asr     r1, r0, #31
eor     r0, r1
sub     r0, r1
```

Of course you don't need a branchless `abs()` in ARM because
you can do the same thing with a `rsblt`, but it's nice that
it exists.

**Reference**:
S.E. Anderson : abs.
He also points out that apparently this approach is patented `>_>`.
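As a C function for testing (a sketch; the name is mine):

```c
#include <stdint.h>

// Branchless absolute value: flip all bits and add 1, but only if x < 0.
static int32_t abs32(int32_t x)
{
    int32_t sign = x >> 31;      // 0 for x >= 0, -1 for x < 0
    return (x ^ sign) - sign;    // x^0 - 0 = x;  x^-1 + 1 = ~x + 1 = -x
}
```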

### 2.4 Branchless min/max

For min/maxing, you have to compare two values, say *a* and
*b*. This generally takes the form of a subtraction:
*a*−*b*. The point is that whether
*a*>*b* or vice versa depends on the sign of the
difference. If you then mask this difference with the −1-sign,
you have either 0 (if *a*≥*b*) or the signed
difference (if *a*<*b*, so this will always be
negative), depending on which variable was bigger, and this
value can be used to ‘correct’ the compared variables.

```c
min = b + ((a-b) & (a-b)>>31);
max = a - ((a-b) & (a-b)>>31);
```

```
@ min
sub     r0, r1
asr     r2, r0, #31
and     r0, r2
add     r0, r1

@ max
sub     r1, r0, r1
asr     r2, r1, #31
and     r1, r2
sub     r0, r1
```
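Wrapped as C functions for testing (a sketch; the names are mine, and like all `a-b` sign tests it assumes the subtraction doesn't overflow):

```c
#include <stdint.h>

// Branchless min: adds the (negative) difference to b only when a < b.
static int32_t min32(int32_t a, int32_t b)
{
    int32_t d = a - b;
    return b + (d & (d >> 31));
}

// Branchless max: subtracts the (negative) difference from a only when a < b.
static int32_t max32(int32_t a, int32_t b)
{
    int32_t d = a - b;
    return a - (d & (d >> 31));
}
```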

Whether these are really much use is debatable, though: they use
3 registers and are one instruction longer than the standard min/max
routines, which use a compare, a branch and a move. That said, the
branch is quite costly. **EDIT, 20071110**: but not costly
enough. The standard routine costs, on average, 3S+½N
cycles; the branchless version costs 4S. Thanks Dennis.

There is, however, one special case where this version is definitely
better: `max(a,0)` (and presumably `min(a,0)` too, but that should be
rare). In this case, the thing basically reduces to `a-(a&a>>31)`.
This can be simplified to `a&~(a>>31)`, which is two instructions in
Thumb and only one in ARM code.
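A quick sketch of that special case in C (the function name is mine):

```c
#include <stdint.h>

// max(a,0): ~(a>>31) is all zeroes for negative a and all ones otherwise,
// so the AND either clears a completely or passes it through.
static int32_t max0(int32_t a)
{
    return a & ~(a >> 31);
}
```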

### 2.5 A very sneaky bit-field clamp

It is sometimes necessary to keep values within a certain range, like
[0, *A*⟩. The standard approach would be to check for
min/max and act accordingly: `y = max(min(x, A-1), 0)`.
Normally, this would cost 4 ARM instructions, or about 6 in Thumb (with branches, ick); however, if *A* is a power of two, you can use this rather shifty
technique to do it faster.

First, define *A* = 2^{n}; in other words, we
want the result to be confined to *n* bits. If *x* is
outside the desired range (either too large or too small), then some
of the bits outside those *n* bits will be non-zero. This gives us our test
condition. Even better, the specific bits set indicate on which side
we are *and* how to correct for it.
If *x*<0, then *x* is negative (well, duh), meaning that
the sign-bit is set. If *x*≥*A*, then the sign-bit is
clear, but at least one bit between bit *n* and bit 30 is set. Now, the
values to clamp to are 0 (all zeroes) and
*A*−1 (all ones for *n* bits), which are essentially
the complements of the extended sign-bit. In other words, once you
know you're outside the valid range, sign-extend
(*x*>>31), flip all bits and mask by *A*−1
(or shift right by 32−*n*).

Interestingly, you can combine a few of these operations. The range-test can be performed by a right-shift, which doubles as the sign-extension; the flip and final shift then proceed as usual. Using a shift-test rather than a mask-test also clears up a potential problem, namely that out-of-range values could have bits set in the very high range (right underneath the sign-bit), which would complicate the final shift.

```c
// Alternative for (x<0 ? 0 : (x>=(1<<n) ? (1<<n)-1 : x))
// NOTE: n is considered constant.

//# C version:
int x;      // NOTE: signed!
u32 y;      // NOTE: unsigned!

if( y=x>>n )
    x= ~y >> (32-n);
```

```
@# ARM. r0:x
movs    r1, r0, asr #n
mvnne   r1, r1
movne   r0, r1, lsr #(32-n)

@# Thumb. r0:x
asr     r1, r0, #n      @ asr sets flags in Thumb
beq     .Linrange
mvn     r1, r1
lsr     r0, r1, #(32-n)
.Linrange:
...
```

If the clamp is in a loop, you can actually cut the ARM version down to two instructions. Instead of `~y>>(32-n)`, pre-load *B* = *A*−1 and use
`B^(y>>(32-n))`. One extra note on the
C version: currently, GCC still uses an extra (superfluous) `cmp`, so the
compiled code would not be optimal, but it'd still be faster than the standard version.

```
@ In the loop; ip holds B = A-1:
movs    r1, r0, asr #n
eorne   r0, ip, r1, lsr #(32-n)
```
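For testing, here's the clamp as a C function with `n` as a parameter instead of a constant (a sketch; in the real use case `n` would be fixed at compile time, and the name is mine):

```c
#include <stdint.h>

// Clamp x to [0, 2^n), for 1 <= n <= 31.
static uint32_t clamp2n(int32_t x, unsigned n)
{
    uint32_t y = (uint32_t)(x >> n);    // nonzero iff x is out of range
    if (y)
        x = (int32_t)(~y >> (32 - n));  // 0 if x < 0, 2^n - 1 if x >= 2^n
    return (uint32_t)x;
}
```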

### 2.6 Cheap power-of-two math

You must know these already, but I'm going to point them out again anyway. For powers of two (PoT), you can replace multiplications, divisions and modulos by shifts and masks.

Base form | Bitop version |
---|---|
y= x*2^{n} | y= x<<n |
y= x/2^{n} | y= x>>n |
y= x%2^{n} | y= x & (2^{n}−1) |

I should point out, though, that the division and modulo aren't
*quite* correct. The problem is that a division rounds towards 0
and a right-shift rounds towards −infinity. Sometimes this is
what you want anyway, but for an exact signed-division replacement you'd
need to do a correction.
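The correction is to add 2^{n}−1 to negative values before shifting. A sketch in C (the function name is mine):

```c
#include <stdint.h>

// Signed division by 2^n that rounds toward zero, like x / (1<<n).
static int32_t div_pot(int32_t x, unsigned n)
{
    int32_t bias = (x >> 31) & ((1 << n) - 1);  // 2^n - 1 if x < 0, else 0
    return (x + bias) >> n;
}
```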

## 3 Bit unpacking

Fair enough, bit-unpacking doesn't happen too often anymore, but it may be useful for creating masks for bitmaps or collision detection, or for font rendering, if the font is initially stored as a 1bpp bitmap – and even when it isn't you can still do ... interesting things with bit patterns.

### 3.1 8-fold 1→4 bitunpack via look-up

There are two ways of doing this: putting the masks in a table and looking them up, or using straight bit-manipulation. Version A is the pussy version, but in Thumb seems to work best. What it does is read a source byte, use the nybbles of that byte to look up the two halves of the word from a 16-entry LUT, then string them together to make the final mask.

```c
const u16 pxlut[16]=
{
    0x0000, 0x0001, 0x0010, 0x0011, 0x0100, 0x0101, 0x0110, 0x0111,
    0x1000, 0x1001, 0x1010, 0x1011, 0x1100, 0x1101, 0x1110, 0x1111
};

u8 *srcL;
...
u32 raw= *srcL;
if(raw)
{
    raw= pxlut[raw&15] | pxlut[raw>>4]<<16;
    ...
}
```

```
@# r1:srcL; r3/r4:raw; r6:pxlut; r7:tmpmask
@ Before loop
ldr     r6,=pxlut
mov     r7, #0x1E       @ =15*2
...
ldrb    r3, [r1]
lsl     r3, r3, #1      @ *2 because pxlut = 16-bit
beq     .Lnopx          @ Zero byte == zero mask : skip work.
lsr     r4, r3, #4
and     r4, r7          @ Mask to even offset (hi nybble *2)
ldrh    r4, [r6, r4]    @ Load HI mask
and     r3, r7
ldrh    r3, [r6, r3]    @ Load LO mask
lsl     r4, r4, #16     @ - Combine
orr     r3, r4          @ /
...
.Lnopx:
```

Note that there is a check to see whether `raw` is zero. If
it is, then you know the line is empty and you don't need to render
anything. If your font has lower-case letters, they will often have
empty lines (‘e’, for example). Skipping the render can make
a big difference. If not, the one extra S-cycle isn't too much compared
to the rest.

### 3.2 8-fold 1→4 bitunpack via bitops

You can also AND/OR/SHIFT your way to the solution, which is actually
faster in ARM code because shifts are free. With shifted ORRs, you can
double the number of bits processed per iteration, giving something like
log_{2}(*M*) speed. The only problem is that at some
point you'll start overwriting previous results, so you have to clear
bits from time to time as well.

The basic algorithm is as follows. Each line shows the bit pattern and the operation that produced it; garbage bits are marked ‘x’.

```
0000 0000 0000 0000 0000 0000 hgfe dcba    p                    (start)
0000 0000 0000 hgfe dcba 0000 hgfe dcba    p |= p<<12
0000 00hg fedc xxfe dcxx fedc xxfe dcba    p |= p<<6
0000 00h0 0000 00f0 0000 00d0 0000 00b0    q  = p & 0x02020202
0000 000g 0000 000e 0000 000c 0000 000a    p &= 0x01010101
000h 000g 000f 000e 000d 000c 000b 000a    p |= q<<3
```

Each of these corresponds to one ARM instruction. Aside from using a 256-entry LUT, this should be the fastest way to expand 8 bits into 8 nybbles.

```
ldr     r5,=0x01010101      @ Pre-loop mask load
...
ldrb    r3, [r1], #1
orr     r3, r3, r3, lsl #12 @ - Staggered shift into higher nybbles.
orr     r3, r3, r3, lsl #6  @ /
and     r4, r3, r5, lsl #1  @ Keep the odd source bits.
and     r3, r3, r5          @ Keep the even source bits.
orr     r3, r3, r4, lsl #3  @ Combine odd and even nybbles.
```
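The same sequence in C, if you want to check it on a PC (a sketch; the name is mine):

```c
#include <stdint.h>

// Expand bit k of a byte into nybble k of a word (little-endian bit order).
static uint32_t unpack_1to4(uint8_t raw)
{
    uint32_t p = raw, q;
    p |= p << 12;            // staggered copies
    p |= p << 6;
    q  = p & 0x02020202;     // odd source bits (b, d, f, h)
    p &= 0x01010101;         // even source bits (a, c, e, g)
    return p | (q << 3);     // combine
}
```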

### 3.3 8-fold 1→4 bitunpack with reverse

The previous subsection was about 1 to 4 bit-unpacking where the bits were indexed little-endian inside the bytes: the left-most bit at bit 0 and so on. This is nice, because both bits and bytes are then arranged in a left-to-right reading order. Many graphics formats, however, have the left-most pixel in the high bit of a byte, which to some is also nice, because it corresponds to a more conventional number-reading order.

```
# 1 to 4 bitunpack with and without reverse.
without: hgfe dcba -> 000h 000g 000f 000e 000d 000c 000b 000a
with : hgfe dcba -> 000a 000b 000c 000d 000e 000f 000g 000h
```

The case with reversing the bits/nybbles is not that different from the one without. In fact, it's only one step longer than without the reverse. Here's the procedure for it.

```
0000 0000 0000 0000 0000 0000 hgfe dcba    p                        (start)
0000 hgfe dcba 0000 0000 0000 hgfe dcba    p |= p<<20
ba00 hgfe dcba 00hg fedc ba00 hgfe dcba    p |= p<<10
b000 000e d000 000g f000 0a00 h000 0c00    p &= 0x81818484
b0c0 0b0e d000 ed0g f000 gx00 h0a0 0x00    p |= p ROR 5
b0x0 cb0x ded0 edex fgf0 gxgx h0x0 ax0x    p |= p>>2
000a 000b 000c 000d 000e 000f 000g 000h    p  = 0x11111111 & (p ROR 7)
```

As before, each of these steps corresponds to one ARM instruction. It does require two registers to store the masks in, though.

```
ldr     r4,=0x81818484
ldr     r5,=0x11111111
...
ldrb    r3, [r1], #1
orr     r3, r3, r3, lsl #20 @ - Staggered shift to higher nybbles.
orr     r3, r3, r3, lsl #10 @ /
and     r3, r3, r4          @ Clean some parts.
orr     r3, r3, r3, ror #5  @ - More shifting and positioning.
orr     r3, r3, r3, lsr #2  @ /
and     r3, r5, r3, ror #7  @ Final cleanup + rotating into place.
```

Yes, those are RORs. They're needed because some bits have to move up, while others move down. I think some other variations are possible as well, but finding those is left as an exercise for the reader.
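And a C equivalent of the reversing version, with the ROR spelled out (a sketch; the names are mine):

```c
#include <stdint.h>

// Rotate right by s bits; s must be 1..31 here.
static uint32_t ror32(uint32_t x, unsigned s)
{
    return (x >> s) | (x << (32 - s));
}

// Expand bit k of a byte into nybble 7-k: the high bit (the
// left-most pixel) ends up in the lowest nybble.
static uint32_t unpack_1to4_rev(uint8_t raw)
{
    uint32_t p = raw;
    p |= p << 20;                       // staggered copies
    p |= p << 10;
    p &= 0x81818484;                    // keep one copy of each bit
    p |= ror32(p, 5);                   // more shifting and positioning
    p |= p >> 2;
    return 0x11111111 & ror32(p, 7);    // final cleanup + rotate into place
}
```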

## 4 Misc

### 4.1 Writing bytes into a 16-bit bus

This is intended for GBA-VRAM-like memory, which doesn't allow byte
writes. Well, it *does*, but it'll fill both bytes of that
particular halfword with the byte you write, which is generally
undesirable.

The standard solution is to read the halfword, see which byte
you need, then bic/orr the appropriate bits and write the halfword
back. However, there is a slightly faster way. It starts by not reading
the whole halfword, but by reading only the byte that should *not*
be written to. The rest is fairly simple.

```c
u8 *ptr, value;
u32 addr= (u32)ptr;

if(addr&1)
    *(u16*)(addr-1) = *(u8*)(addr-1) | value<<8;
else
    *(u16*)(addr  ) = *(u8*)(addr+1)<<8 | value;
```

```
@ r0:ptr; r1:byte-value
eor     r2, r0, #1
ldrb    r3, [r2]            @ Read the other byte
ands    r2, r0, #1          @ Test and prep for alignment
orreq   r3, r1, r3, lsl #8  @ Even: src-byte needs shift
orrne   r3, r3, r1, lsl #8  @ Odd : value needs shift
strh    r3, [r0, -r2]       @ Align and write
```

Yes, I know how awful the C version looks, but that cannot be helped
– you're not *supposed* to do stuff like that to pointers.
That's what assembly is for `:P`. It really is a little faster
than the standard version though. Of course, this is only beneficial
if you require random-access. For sequential accesses, there are
better ways.
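A testable C model of the trick (a sketch; the names are mine, and it assumes a little-endian host, like the GBA itself):

```c
#include <stdint.h>

// Write 'value' to byte offset 'addr' of a halfword-only memory 'mem',
// reading only the byte that must be preserved.
static void write_byte16(uint16_t *mem, uint32_t addr, uint8_t value)
{
    uint8_t other = ((uint8_t*)mem)[addr ^ 1];   // the byte we keep
    mem[addr >> 1] = (addr & 1) ? (uint16_t)((value << 8) | other)
                                : (uint16_t)((other << 8) | value);
}

// Small harness: write into a 2-halfword buffer and return the result.
static uint16_t write_byte16_test(uint32_t addr, uint8_t value)
{
    uint16_t buf[2] = { 0x2211, 0x4433 };
    write_byte16(buf, addr, value);
    return buf[addr >> 1];
}
```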

### 4.2 Bit reversing

There is a really nifty method of reversing all bits (or power-of-two clumps
of bits) of a byte/word. The standard method will have you loop over all
bits for O(*N*) time, but it's also possible to do it in O(log *N*).
The basic trick is to first swap all single bits, then groups of 2, then 4 and
so on. Here's an example of a byte-reverse, but it works for longer
pieces as well.

```
abcd efgh    p
badc fehg    p= ((p&0xAA)>>1) | ((p<<1) & 0xAA)
dcba hgfe    p= ((p&0xCC)>>2) | ((p<<2) & 0xCC)
hgfe dcba    p= ((p&0xF0)>>4) | ((p<<4) & 0xF0)   # ANDs may be omitted
```

But wait, there's more! The example above reverses all the bits. But if you look carefully, you'll see that reversing larger groups of bits can be done by skipping the earlier swap steps. This works out to a switch-block with fall-throughs.

The following is a C routine to reverse groups of 1, 2, 4, 8 or 16 bits
within a 32-bit word. It probably won't be compiled quite as it should
be, but it's a good start: the cases are correctly compiled to a
jump-table, and the ROR macro works. If you're working on ARMv5 or
higher, you might be able to use a `CLZ` here as well somehow.

For those wondering about the `const volatile`: no, I have not gone
insane; it's just that the GCC version tested with cannot AND literals
properly. It'll use 4 byte-sized ANDs rather than a single word-sized
AND, which completely kills the routine. The awkward formulation here
is the only way I've found that ‘fixes’ this. It probably (hopefully)
only applies to GCC for ARM, and only to current versions (at time of
writing, GCC 4.3.3). For reversing inside bytes, this is not an issue.

```c
//# Reverse bits within a word. C version.
// Assumed defined elsewhere:
// #define ROR(x, s)  ( ((u32)(x)>>(s)) | ((u32)(x)<<(32-(s))) )

void data_bit_rev(u32 *data, u32 size, u32 bpp)
{
    u32 i, tmp, mask;
    const volatile u32 masks[4]=
    { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };

    for(i=0; i<size; i++)
    {
        tmp= data[i];
        // All fall-throughs.
        switch(bpp)
        {
        case 1:
            mask= masks[0];
            tmp= (mask & (tmp>>1)) | ((tmp&mask)<<1);
        case 2:
            mask= masks[1];
            tmp= (mask & (tmp>>2)) | ((tmp&mask)<<2);
        case 4:
            mask= masks[2];
            tmp= (mask & (tmp>>4)) | ((tmp&mask)<<4);
        case 8:
            mask= masks[3];
            tmp= (mask & (tmp>>8)) | ((tmp&mask)<<8);
        case 16:
            tmp= ROR(tmp, 16);
        }
        data[i]= tmp;
    }
}
```
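For reference, the fully unrolled bpp = 1 path (a complete 32-bit reverse) looks like this in portable C (a sketch; the name is mine):

```c
#include <stdint.h>

// Reverse all 32 bits: swap single bits, then pairs, nybbles, bytes,
// and finally halfwords -- O(log N) steps.
static uint32_t bit_reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1);
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2);
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4);
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8);
    return (x >> 16) | (x << 16);
}
```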


The C comment for the very sneaky bitfield clamp is wrong. ( ... ? ... : 0) should be (...? ... : x)

So it was. Fixed now. Also fixed one or two other small things.

Just thought I'd add this:

In "2.5 A very sneaky bit-field clamp", you may be able to remove the temp. value if the mask fits in an immediate and save a register. ie.

```
tst     reg, #IMM
eorne   reg, msk, reg, lsr #32 - bit_limit
```

It's what I've done with my individual RGB brightness adjuster =3

there is probably mistake in "Bit-reverse procedure for a byte.", where the second >> should actually be <<: `((p&0xAA)>>1) | ((p<<1) & 0xAA)`, `((p&0xCC)>>2) | ((p<<2) & 0xCC)`, `((p&0xF0)>>4) | ((p<<4) & 0xF0) # ANDs may be omitted`

Yup, there is. Was. Fixed now, thank you.

For sign-sign the shift count should be 31 instead of 30.

As it happens, both will work. With >>30, you'll get `00` or `01` for positive numbers, and `FF` or `FE` for negatives. ORRing the 1 gives +1 or −1, as requested. But yeah, for consistency's sake, I'll change it to >>31 like the rest.

I was playing with implementing the CLAMP routine under android today (ndk r8b). The code is compiled as Thumb-2, so it is not as restrictive as Thumb. I found that the "asr" instruction misbehaves when Rd != Rs. The workaround was to replace the "asr" with a "movs" like this:

```
@# Thumb. r0:x
//asr   r1, r0, #n
movs    r1, r0, asr #n
beq     .Linrange
mvn     r1, r1
lsr     r0, r1, #(32-n)
.Linrange:
```

Should it be "(-x>>31) - (x>>31)" instead? (x>>31) - (-x>>31) will produce -1 for 1, 1 for -1.

Nope. Just tested it again, works as it should.

```
// x= -1
(-1>>31) - (1>>31) = (-1) - (0) = -1

// x= 1
(1>>31) - (-1>>31) = (0) - (-1) = 1
```