1 Introduction
Here are a few nifty little items involving bit operations for various purposes like branchless signs and bit conversions. These should be faster than the more standard algorithms for the same thing. Because they came from GBA experience, they're particularly apt for ARM cores, though they should work on other platforms as well. Routines are given in C, ARM or Thumb assembly. I'm assuming 32 bits per word here (BPW = 32), but the code can be extended to other word sizes.
Just a bit of warning, though: I am not saying these tricks should be used; I'm saying they can be used. They're mostly just for fun and to show alternative solutions. Whether they're faster than the regular implementations depends on many things, including processor and compiler. I would recommend some of these if you're already working in assembly, though – once you're doing that, you've probably found that you need more speed than the compiler can give you (if not, you're doing something wrong), at which time these entries can actually help you.
2 Math items
A few simple math functions like abs() and sign() use branched code that could just as well be done with bit operations, which are often faster than branching. Most of the items here (alright, pretty much all of them) come from S.E. Anderson's bithacks, a collection that I discovered just after I figured out a few of them myself.
2.1 Branchless 2-way signs
There are three ways of indicating a 2-way sign: negative-based (−1, 0), positive-based (0, +1) and sign-based (−1, +1). The basic form is negative-based, which in 2s complement is a one-instruction algorithm: simply sign-extend into all bits by right-shifting by BPW−1. For a positive-based sign, add one. The last one is a little trickier, but can be calculated by multiplying the negative sign by 2 and adding one, or by shifting right by BPW−1 and setting bit 0 (note that BPW−2 will work as well).
Here are all three:
// Negative sign (x>=0 ? 0 : -1)
sn = x>>31;
// Positive sign (x>=0 ? +1 : 0)
sp = (x>>31)+1;
// Sign-sign (x>=0 ? +1 : -1)
ss = (x>>31) | 1;
@ Negative sign (x>=0 ? 0 : -1)
asr r0, r0, #31
@ Positive sign (x>=0 ? +1 : 0)
asr r0, r0, #31
add r0, #1
@ Sign-sign (x>=0 ? +1 : -1)
asr r0, r0, #31
lsl r0, r0, #1
add r0, #1
@ or
asr r0, r0, #31
mov r1, #1
orr r0, r1
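For reference, here are ARM (as opposed to Thumb) versions of the same three signs. These are not from the original listings, just the obvious translation of the C lines above; the barrel shifter lets the asr ride along with the add/orr.
@ Negative sign (x>=0 ? 0 : -1)
mov r0, r0, asr #31
@ Positive sign (x>=0 ? +1 : 0)
mov r1, #1
add r0, r1, r0, asr #31
@ Sign-sign (x>=0 ? +1 : -1)
mov r1, #1
orr r0, r1, r0, asr #31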
Most of the routines in this section use the `asr 31' trick in some way or another. It's amazing how much you can do with it.
2.2 Branchless 3-way signs
This can also be used for a 3-way sign: ±1 for positive/negative and 0 for zero. The negative and zero parts can be found from the 2-way negative sign. For the positive part, shift −x rather than x. This will give −1 for positive numbers, yet (−0)>>31 will still be 0. Subtracting that from the earlier sign gives the whole thing.
//# C version
sign= (x>>31) - (-x>>31);
//# ARM
rsb r1, r0, #0
mov r0, r0, asr #31
sub r0, r0, r1, asr #31
//# Thumb
neg r1, r0
asr r1, r1, #31
asr r0, r0, #31
sub r0, r1
2.3 Branchless abs(x)
The standard is `abs(x) = x>=0 ? x : -x'. Since −x = ~x+1 = (x XOR −1) − (−1), and x itself is just (x XOR 0) − 0, both cases reduce to (x^s) − s with s the 2-way negative sign from above, so you can do this without branches too.
//# C version
sign= (x>>31);
abs = (x^sign) - sign;
//# ARM
eor r1, r0, r0, asr #31
sub r0, r1, r0, asr #31
//# Thumb
asr r1, r0, #31
eor r0, r1
sub r0, r1
Of course you don't need a branchless abs() in ARM because you can do the same thing with a rsblt, but it's nice that it exists.
Reference: S.E. Anderson : abs. He also points out that apparently this approach is patented >_>.
2.4 Branchless min/max
For min/maxing, you have to compare two values, say a and b. This generally takes the form of a subtraction: a−b. The point is that whether a>b or vice versa depends on the sign of the difference. If you then mask this difference with the −1-sign, you have either 0 (if a≥b) or the difference a−b (if a<b, in which case it is always negative), and this value can be used to ‘correct’ the compared variables.
min = b + ((a-b) & (a-b)>>31);
max = a - ((a-b) & (a-b)>>31);
@ min
sub r0, r1
asr r2, r0, #31
and r0, r2
add r0, r1
@ max
sub r1, r0, r1
asr r2, r1, #31
and r1, r2
sub r0, r1
Whether these are really of much use is debatable, though: they use 3 registers and are one instruction longer than the standard min/max routines, which use a compare, a branch and a move. That said, the branch is quite costly. EDIT, 20071110: but not costly enough. The standard routine costs, on average, 3S+½N cycles; the branchless version costs 4S. Thanks Dennis.
There is, however, one special case where this version is definitely better: max(a,0) (and presumably min(a,0) too, but that should be rare). In this case, the thing basically reduces to `a-(a&a>>31)'. This can be simplified to `a&~(a>>31)', which is two instructions in Thumb and only one in ARM code.
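Spelled out (my sketch, not one of the original listings):
// max(x, 0): clear x when it is negative.
y = x & ~(x>>31);

@ Thumb: two instructions.
asr r1, r0, #31
bic r0, r1

@ ARM: one instruction.
bic r0, r0, r0, asr #31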
2.5 A very sneaky bit-field clamp
It is sometimes necessary to keep values within a certain range, like [0, A⟩. The standard approach would be to check for min/max and act accordingly: `y = min(max(x,0), A−1)'. Normally, this would cost 4 ARM instructions, or about 6 Thumb (with branches, ick); however, if A is a power of two, you can use this rather shifty technique to do it faster.
First, define A = 2ⁿ. In other words, we want the result to be confined to n bits. If x is outside the desired range (either too large or too small), then some of the bits above those n bits will be non-zero. This gives us our test condition. Even better, the specific bits set indicate on which side of the range we are and how to correct for it. If x<0, then x is negative (well, duh), meaning that the sign-bit is set. If x≥A, then the sign-bit is clear. Now, the values to clamp to are (all) 0 and A−1 (all ones for n bits), which are essentially the complements of the extended sign-bit. In other words, once you know you're outside the valid range, sign-extend (x>>31), flip all bits and mask by A−1 (or shift right by 32−n).
Interestingly, you can combine a few of these operations. The range-test can be performed by a right-shift, which also acts as the sign-extension. Then flip and shift as usual. Using a shift-test rather than a mask-test also clears up a potential problem, namely that out-of-range values could have bits set right up to just below the sign-bit, which would complicate the final shift.
// Alternative for (x<0 ? 0 : (x>=(1<<n) ? (1<<n)-1 : x))
// NOTE: n is considered constant.
//# C version:
int x;     // NOTE: signed!
u32 y;     // NOTE: unsigned!
if( y = x>>n )
    x= ~y >> (32-n);
@# ARM. r0:x
movs r1, r0, asr #n
mvnne r1, r1
movne r0, r1, lsr #(32-n)
@# Thumb. r0:x
asr r1, r0, #n
beq .Linrange
mvn r1, r1
lsr r0, r1, #(32-n)
.Linrange:
...
If the clamp is in a loop, you can actually cut the ARM version down to two instructions. Instead of `~y>>(32−n)', pre-load B = A−1 and use `B^(y>>(32−n))'. One extra note on the C version: currently, gcc still uses an extra (superfluous) cmp, so compiled code would not be optimal, but it'd still be faster than the standard version.
@ in the loop; ip holds B = A-1, pre-loaded outside the loop:
movs r1, r0, asr #n
eorne r0, ip, r1, lsr #(32-n)
2.6 Cheap power-of-two math
You must know these already, but I'm going to point them out again anyway. For powers-of-two (PoT), you can replace multiplies, divisions and modulos by shifts.
Base form    | Bitop version
-------------|-----------------
y= x*2ⁿ      | y= x<<n
y= x/2ⁿ      | y= x>>n
y= x%2ⁿ      | y= x & (2ⁿ−1)
I should point out, though, that the division and modulo versions aren't quite correct. The problem is that a division rounds towards 0, while a right-shift rounds towards −infinity. Sometimes this is what you want anyway, but for an exact signed-division replacement you'd need a correction.
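For the record, the usual correction is to add 2ⁿ−1 to negative values before shifting; a small sketch (assuming 32-bit ints and a constant n), not part of the original text:
// Exact replacement for signed division by 2^n (rounds towards 0 like '/').
y= (x + ((x>>31) & ((1<<n)-1))) >> n;
// Matching remainder (sign follows x, like '%').
r= x - (y<<n);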
3 Bit unpacking
Fair enough, bit-unpacking doesn't happen too often anymore, but it may be useful for creating masks for bitmaps or collision detection, or for font rendering, if the font is initially stored as a 1bpp bitmap – and even when it isn't you can still do ... interesting things with bit patterns.
3.1 8-fold 1→4 bitunpack via look-up
There are two ways of doing this: putting the masks in a table and looking them up, or using straight bit-manipulation. Version A is the pussy version, but in Thumb it seems to work best. What it does is read a source byte, use the nybbles of that byte to look up the two halves of the word from a 16-entry LUT, then string them together to make the final mask.
const u16 pxlut[16] =
{
0x0000, 0x0001, 0x0010, 0x0011, 0x0100, 0x0101, 0x0110, 0x0111,
0x1000, 0x1001, 0x1010, 0x1011, 0x1100, 0x1101, 0x1110, 0x1111
};
u8 *srcL;
...
u32 raw= *srcL;
if(raw)
{
raw= pxlut[raw&15] | pxlut[raw>>4];
...
}
@# r1:srcL; r3/r4:raw; r6:pxlut; r7:tmpmask
@ Before loop
ldr r6,=pxlut
mov r7, #0x1E @ =15*2
...
ldrb r3, [r1]
lsl r3, r3, #1 @ *2 because pxlut = 16-bit
beq .Lnopx @ Zero byte == Zero mask : skip work.
lsr r4, r3, #4 @ Hi nybble *2 (bit 0 may still be set here)
and r4, r7 @ Clear bit 0 to keep the halfword offset even
ldrh r4, [r6, r4] @ Load HI mask
and r3, r7
ldrh r3, [r6, r3] @ Load LO mask
lsl r4, r4, #16 @ - Combine
orr r3, r4 @ /
...
.Lnopx:
Note that there is a check to see whether raw is zero. If it is, then you know the line is empty and you don't need to render anything. If your font has lower-case letters, they will often have empty lines (‘e’, for example), and skipping the render can make a big difference. If not, the one extra S-cycle isn't too much compared to the rest.
3.2 8-fold 1→4 bitunpack via bitops
You can also AND/OR/SHIFT your way to the solution, which is actually faster in ARM code because shifts are free. With shifted ORRs, you can double the number of bits processed per iteration, so the whole thing takes roughly log2(M) steps. The only problem is that at some point you'll start overwriting previous results, so you have to clear bits from time to time as well.
The basic algorithm is as follows; each line shows the bit pattern so far on the left and the operation that produced it on the right (‘x’ marks don't-care garbage bits).
0000 0000 0000 0000 0000 0000 hgfe dcba    p (start)
0000 0000 0000 hgfe dcba 0000 hgfe dcba    p |= p<<12
0000 00hg fedc xxfe dcxx fedc xxfe dcba    p |= p<<6
0000 00h0 0000 00f0 0000 00d0 0000 00b0    q  = p & 0x02020202
0000 000g 0000 000e 0000 000c 0000 000a    p &= 0x01010101
000h 000g 000f 000e 000d 000c 000b 000a    p |= q<<3
Each of these corresponds to one ARM instruction. Aside from using a 256-entry LUT, this should be the fastest way to expand 8 bits into 8 nybbles.
ldr r5,=0x01010101 @ Pre-loop mask load
...
ldrb r3, [r1], #1
orr r3, r3, r3, lsl #12 @ - -1 Staggered shift into higher nybbles.
orr r3, r3, r3, lsl #6 @ /
and r4, r3, r5, lsl #1 @ Clean for even nybbles.
and r3, r3, r5 @ Clean for odd nybbles.
orr r3, r3, r4, lsl #3 @ Combine odd and even nybbles.
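For those who'd rather read it in C, here is a sketch of the same steps. The function name and the u32/u8 typedefs are my own additions, not from the original.
u32 bit_unpack_1to4(u8 raw)
{
    u32 p= raw, q;

    p |= p<<12;             // Staggered shift into higher nybbles.
    p |= p<< 6;
    q  = p & 0x02020202;    // Bits headed for nybbles 1,3,5,7 (b,d,f,h).
    p &= 0x01010101;        // Bits headed for nybbles 0,2,4,6 (a,c,e,g).
    return p | (q<<3);      // Combine the two halves.
}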
3.3 8-fold 1→4 bitunpack with reverse
The previous subsection was about 1 to 4 bitunpacking where the bits were indexed little-endian inside the bytes: the left-most bit at bit 0 and so on. This is nice, because both bits and bytes are then arranged in a left-to-right reading order. Many graphics formats, however, have the left-most pixel in the high bit of a byte, which to some is also nice, because it corresponds to a more conventional number-reading order.
# 1 to 4 bitunpack with and without reverse.
without: hgfe dcba -> 000h 000g 000f 000e 000d 000c 000b 000a
with : hgfe dcba -> 000a 000b 000c 000d 000e 000f 000g 000h
The case with reversing the bits/nybbles is not that different from the one without. In fact, it's only one step longer than without the reverse. Here's the procedure for it.
0000 0000 0000 0000 0000 0000 hgfe dcba    p
0000 hgfe dcba 0000 0000 0000 hgfe dcba    p |= p<<20;
ba00 hgfe dcba 00hg fedc ba00 hgfe dcba    p |= p<<10;
b000 000e d000 000g f000 0a00 h000 0c00    p &= 0x81818484
b0c0 0b0e d000 ed0g f000 gx00 h0a0 0x00    p |= p ROR 5;
b0x0 cb0x ded0 edex fgf0 gxgx h0x0 ax0x    p |= p>>2;
000a 000b 000c 000d 000e 000f 000g 000h    p = 0x11111111 & (p ROR 7);
As before, each of these steps corresponds to one ARM instruction. It does require two registers to store the masks in, though.
ldr r4,=0x81818484
ldr r5,=0x11111111
ldrb r3, [r1], #1
orr r3, r3, r3, lsl #20 @ - +1 Staggered shift to higher nybbles.
orr r3, r3, r3, lsl #10 @ /
and r3, r3, r4 @ Clean some parts.
orr r3, r3, r3, ror #5 @ - More shifting and positioning.
orr r3, r3, r3, lsr #2 @ /
and r3, r5, r3, ror #7 @ Final cleanup + rotating into place.
Yes, those are RORs. They're needed because some bits have to move up, while others move down. I think some other variations are possible as well, but finding those is left as an exercise for the reader.
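Equally unofficial, a C sketch of the reversing unpack. ROR32 is a stand-in for ARM's rotate-right; the macro and function names are assumptions, not from the original.
#define ROR32(x, s)   ( ((u32)(x)>>(s)) | ((u32)(x)<<(32-(s))) )

u32 bit_unpack_1to4_rev(u8 raw)
{
    u32 p= raw;

    p |= p<<20;                         // Staggered shift to higher nybbles.
    p |= p<<10;
    p &= 0x81818484;                    // Clean some parts.
    p |= ROR32(p, 5);                   // More shifting and positioning.
    p |= p>>2;
    return 0x11111111 & ROR32(p, 7);    // Final cleanup + rotate into place.
}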
4 Misc
4.1 Writing bytes into a 16-bit bus
This is intended for GBA-VRAM-like memory, which doesn't allow byte writes. Well, it does, but it'll fill both bytes of that particular halfword with the byte you write, which is generally undesirable.
The standard solution is to read the halfword, see which byte you need, then bic/orr the appropriate bits and write the halfword back. However, there is a slightly faster way. It starts by not reading the whole halfword, but only the byte that should not be written to. The rest is fairly simple.
u8 *ptr, value;
u32 addr= (u32)ptr;

if(addr&1)
    *(u16*)(addr-1) = *(u8*)(addr-1) | value<<8;
else
    *(u16*)(addr  ) = *(u8*)(addr+1)<<8 | value;
@# ARM. r0:ptr; r1:byte-value
eor r2, r0, #1
ldrb r3, [r2] @ Read the other byte
ands r2, r0, #1 @ Test and prep for alignment
orreq r3, r1, r3, lsl #8 @ Even: src-byte needs shift
orrne r3, r3, r1, lsl #8 @ Odd : value needs shift
strh r3, [r0,-r2] @ Align and write
Yes, I know how awful the C version looks, but that cannot be helped – you're not supposed to do stuff like that to pointers. That's what assembly is for :P. It really is a little faster than the standard version though. Of course, this is only beneficial if you require random-access. For sequential accesses, there are better ways.
4.2 Bit reversing
There is a really nifty method of reversing all bits (or power-of-two clumps of bits) of a byte/word. The standard method will have you loop over all the bits for O(N) time, but it's also possible to do it in O(log N). The basic trick is to first swap adjacent bits, then adjacent groups of 2, then 4 and so on. Here's an example of a byte-reverse, but it works for longer pieces as well.
abcd efgh    p
badc fehg    p= ((p&0xAA)>>1) | ((p<<1) & 0xAA)
dcba hgfe    p= ((p&0xCC)>>2) | ((p<<2) & 0xCC)
hgfe dcba    p= ((p&0xF0)>>4) | ((p<<4) & 0xF0)   # ANDs may be omitted
But wait, there's more! The example above reverses all the bits. If you look carefully, though, you'll see that reversing in larger groups of bits can be done by skipping the earlier steps. This works out to a switch-block with fallthroughs.
The following is a C routine to reverse a 32-bit word in chunks of 1, 2, 4, 8 or 16 bits. It probably won't be compiled quite as it should, but it's a good start: the cases are correctly compiled to a jump-table, and the ROR macro works. If you're working on ARMv5 or higher, you might be able to use a CLZ here as well somehow.
For those wondering about the const volatile: no, I have not gone insane; it's just that the GCC version tested with cannot AND literals properly. It'll use 4 byte-sized ANDs rather than a single word-sized AND, which completely kills the routine. The awkward formulation here is the only way I've found that ‘fixes’ this. It probably (hopefully) only applies to GCC for ARM, and only current versions (at time of writing, GCC 4.3.3). For reversing inside bytes, this is not an issue.
//# Reverse bits within a word. C version.
// ROR is assumed to rotate right; a typical definition would be:
// #define ROR(x, s)   ( ((u32)(x)>>(s)) | ((u32)(x)<<(32-(s))) )
void data_bit_rev(u32 *data, u32 size, u32 bpp)
{
    u32 i, tmp, mask;
    const volatile u32 masks[4]=
    { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };

    for(i=0; i<size; i++)
    {
        tmp= data[i];

        // All fall throughs
        switch(bpp)
        {
        case 1:
            mask= masks[0];
            tmp= (mask & (tmp>>1)) | ((tmp&mask)<<1);
        case 2:
            mask= masks[1];
            tmp= (mask & (tmp>>2)) | ((tmp&mask)<<2);
        case 4:
            mask= masks[2];
            tmp= (mask & (tmp>>4)) | ((tmp&mask)<<4);
        case 8:
            mask= masks[3];
            tmp= (mask & (tmp>>8)) | ((tmp&mask)<<8);
        case 16:
            tmp= ROR(tmp, 16);
        }
        data[i]= tmp;
    }
}
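As a usage example (mine, not from the original): a 4bpp tile row is eight nybbles in one word, so reversing in 4-bit clumps mirrors the row horizontally.
u32 row= 0x87654321;        // pixels 1..8, left to right
data_bit_rev(&row, 1, 4);   // row is now 0x12345678: the mirrored line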
Great tips!
Also, I really like your tutorial of the Mode7. =)
Keep doing great work like this. =)
Regards,
Awesome. Thanks!
The C comment for the very sneaky bitfield clamp is wrong. ( ... ? ... : 0) should be (...? ... : x)
So it was. Fixed now. Also fixed one or two other small things.
Just thought I'd add this:
In "2.5 A very sneaky bit-field clamp", you may be able to remove the temp. value if the mask fits in an immediate and save a register. ie.
tst reg, #IMM
eorne reg, msk, reg, lsr #32 - bit_limit
It's what I've done with my individual RGB brightness adjuster =3
there is probably mistake in "Bit-reverse procedure for a byte.", where the second >> should actually be <<, i.e. ((p&0xAA)>>1) | ((p<<1) & 0xAA), ((p&0xCC)>>2) | ((p<<2) & 0xCC), ((p&0xF0)>>4) | ((p<<4) & 0xF0) # ANDs may be omitted
Yup, there is. Was. Fixed now, thank you.
For sign-sign the shift count should be 31 instead of 30.
As it happens, both will work. With >>30, you'll get 00 or 01 for positive numbers, and FF or FE for negatives. ORRing the 1 gives +1 or −1, as requested. But yeah, for consistency's sake, I'll change it to >>31 like the rest.
I was playing with implementing the CLAMP routine under android today (ndk r8b). The code is compiled as Thumb-2, so it is not as restrictive as Thumb. I found that the "asr" instruction misbehaves when Rd != Rs. The workaround was to replace the "asr" with a "movs" like this:
@# Thumb. r0:x
//asr r1, r0, #n
movs r1, r0, asr #n
beq .Linrange
mvn r1, r1
lsr r0, r1, #(32-n)
.Linrange:
Should it be "(-x>>31) - (x>>31)" instead ?
(x>>31) - (-x>>31); will produce -1 for 1, 1 for -1.
Nope. Just tested it again, works as it should.
// x= -1
(-1>>31) - (1>>31) = (-1) - (0) = -1
// x= 1
(1>>31) - (-1>>31) = (0) - (-1) = 1
Maybe I'm mistaken, but it seems that 3-way sign doesn't work for INT32_MIN.
In 2.5, on the line that reassigns
x = ~y >> (32 - n)
I am running into a theoretical misunderstanding when x >= 32. For simplification, if n = 5 and integers are 8-bit, suppose
int x = 44 // 00101100
Then
int y = x >> n // 00000001
//now we convert to ~y = 11111110 and shift to the right (8-5) = 3 bits for 11111111 = -1
Interestingly, the code implementation and proper substitutions yield this exact result on console (regardless of bit-length since they're all going to be filled in this case.)
Now if I was not sign-extending, the answer would be 00011111 or 31, which is exactly 2^n - 1. Maybe I'm missing something? Are there instances where shifting bits implements sign-extension and... no sign extension? Or is there something missing in the code?
SOLVED AS I TYPED: As I wrote this out, I suddenly understood it. But I figured it might be important for people who read this and fall into the trap I did. y is unsigned. Meaning logical shifts only (i.e. no sign-extending since sign is irrelevant, thus my theoretical solution). x is signed and in the context of gcc implements arithmetic shifts. Yikes, me. The two shifts here are of two different hierarchies. Okay, great. Solved. Hope that helped you, future me.