# computing, costs and caching, oh my

Via Coding Horror, I stumbled upon a simply wonderful talk by Herb Sutter about various performance issues, like how much operations cost. It also discusses how memory, latency and machine architecture can affect that cost, and how this has changed over the years. You can find the slides and a video of the presentation at http://nwcpp.org/Meetings/2007/09.html.

Be prepared for a total geek-out. This is highly technical (and awesome, but that's bordering on a tautology) stuff and probably not for the faint of heart. Slides 6 and 7, for example (around the 23m mark), show the value of cache compared to getting something from RAM, and just how bad retrieval from disk is. Later (slides 13 and on; around 55m in the video), when it comes to threads and how a compiler or even the hardware may screw you over and not do what you want it to do, or even what you tell it to do, people who still have their moms are allowed to run to them for safety. By Patina, that is just nasty.

Near the end Sutter discusses the differences between using vectors, lists and sets, and what the penalties for the latter are for something as simple as adding up all the values in them. This starts at around slide 22, or 1h40m. Even if the rest is gobbledygook, this part is easy to understand. Basically, low footprint and sequential accesses are Good Things, even if you have cache and stuff. Especially when you have cache and stuff.
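To make that concrete, here's a small C sketch (my own illustration, not code from the talk) that sums the same values stored as a plain array and as a linked list. The results are identical; the difference is that the array walk is sequential while the list walk chases pointers, which is exactly the access-pattern difference the slides are about.

```c
typedef struct node { int value; struct node *next; } node;

// Sum a plain array: sequential access, exactly what caches like.
static int sum_array(const int *a, int n)
{
    int s= 0;
    for(int i=0; i<n; i++)
        s += a[i];
    return s;
}

// Sum a linked list: every step chases a pointer, which in general
// can land anywhere in memory.
static int sum_list(const node *p)
{
    int s= 0;
    for(; p; p= p->next)
        s += p->value;
    return s;
}

// Chain a caller-provided node array into a list holding a[].
static node* build_list(node *nodes, const int *a, int n)
{
    for(int i=0; i<n; i++)
    {
        nodes[i].value= a[i];
        nodes[i].next= (i+1<n) ? &nodes[i+1] : 0;
    }
    return n>0 ? &nodes[0] : 0;
}
```

On real hardware the array version wins even though both loops do the same arithmetic; a list whose nodes happen to be scattered across the heap does far worse than this contiguous demo suggests.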

# memcpy and memset replacements for GBA/NDS

The standard C functions for copying and filling are memcpy() and memset(). They're part of the standard library, are easy to use and are often implemented with some optimizations so that they're usually faster than manual looping. The DKA (devkitARM) version, for example, will fill as words if the alignments and sizes allow for it. This can be much faster than doing the loops yourself.

There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.

The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.

#### The code

My main goal here was to create smallish and portable replacements, not to have the greatest and fastestest code around, because that's rather platform-dependent. Yes, even the difference between GBA and NDS should matter, because of the differences in ldr/str times and caching.

There are 5 functions here. The main ones are tonccpy and __toncset, for copying and filling words, respectively. The other 3 are interfaces to __toncset for filling with 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color instead of 8-bit data. For the rest of the discussion, I will use the name “toncset” for the internal routine for convenience.
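As a rough sketch of that layering (the `my_` names below are hypothetical stand-ins, and this portable C version only mimics the interface, not the real VRAM-optimized routine): the 8-bit and 16-bit interfaces replicate the fill value up to a full word and hand it to a single internal core.

```c
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

// Portable stand-in for the internal fill routine. The real thing
// does aligned word writes; this version fills halfwords just to
// show the layering. 'size' is in bytes, assumed even and the
// destination halfword-aligned. Assumes little-endian, like the GBA.
static void my_toncset(void *dst, u32 fill, u32 size)
{
    u16 *d= dst;
    for(u32 i=0; i<size/2; i++)
        d[i]= (i&1) ? (u16)(fill>>16) : (u16)fill;
}

// 8-bit interface: quadruple the byte into a word first.
static void my_toncset8(void *dst, u8 src, u32 count)
{
    u32 fill= src;
    fill |= fill<<8;
    fill |= fill<<16;
    my_toncset(dst, fill, count);       // count in bytes
}

// 16-bit interface: duplicate the halfword; count in halfwords.
static void my_toncset16(void *dst, u16 src, u32 count)
{
    u32 fill= src | ((u32)src<<16);
    my_toncset(dst, fill, count*2);
}

// 32-bit interface: pass through; count in words.
static void my_toncset32(void *dst, u32 src, u32 count)
{
    my_toncset(dst, src, count*4);
}
```

The point of the wrappers is only the value replication: once the fill value occupies a full word, one core routine can serve all three sizes.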

# Hello my name is cearn, and I'm an ARMaholic.

Aw crap. I told myself I wasn't going to do this:

#### bmp16_floodfill in ARM assembly

@ ---------------------------------------------------------------------
@ void bmp16_floodfill_asm(int x, int y, u16 clrNew, void *dstBase,
@   u16 dstW, u16 dstH);
@ ---------------------------------------------------------------------

.extern memset16

.text
.arm
.align 2
.global bmp16_floodfill_asm
bmp16_floodfill_asm:
stmfd   sp!, {r4-r11,lr}

@ NOTE: all x-values are doubled for use as halfword offsets
/* Reglist:
r0: x, dstL
r1: ppx (current pixel pointer)
r2: x*2 (right, left, count)
r3: yx / tmp / clr
r4: clrNew
r5: clrOld
r6: dstW*2
r7: dstH
r8: dstBase
r9: top of span-stack
r10: clrUp
r11: clrDown
ip: y
*/

@ Prep ppx, clrOld, stack
ldrh    r6, [sp, #36]
ldrh    r7, [sp, #40]
mov     r6, r6, lsl #1
mov     r0, r0, lsl #1
mla     r4, r6, r1, r3
ldrh    r5, [r4, r0]
cmp     r5, r2                  @ Old color== new color: nevermind
beq     .Lbmp16_flood_end

mov     r8, r3
mov     r9, sp
mov     r4, r2

@ Initial push (merge with prologue?)
add     r3, r0, r1, lsl #16
stmfd   sp!, {r3}

@ While stack not empty, get new span
.Lbmp16_flood_pop:
ldmfd   sp!, {r3}
mov     ip, r3, lsr #16     @ y
eor     r2, r3, ip, lsl #16 @ x
mla     r0, r6, ip, r8      @ dstL= &dstBase[y*dstW]

@ Find left (can be optimized more)
.Lbmp16_flood_left:
ldrh    r3, [r0,r2]
cmp     r3, r5          @ clr==clrOld
subeq   r2, r2, #2
moveqs  r3, r2, asr #31 @ Special for x>=0
beq     .Lbmp16_flood_left

@ Prepare for moving right
add     r2, r2, #2          @ Step back into span: r2= left, as hw offset
add     r1, r0, r2          @ ppx= &dstL[left]
sub     r1, r1, #2          @ (-2 to balance the pre-indexed load below)
mvn     r10, #0             @ clrUp= clrDown= -1
mvn     r11, #0

@# Move right, checking up and down as we go
.Lbmp16_flood_right:
ldrh    r3, [r1,#2]!
cmp     r3, r5
bne     .Lbmp16_flood_fill_line

@ Check for new upper span
ldrh    r3, [r1,-r6]
cmp     r3, r10
mov     r10, r3
beq     .Lbmp16_flood_down
cmp     r3, r5
bne     .Lbmp16_flood_down
subs    r3, ip, #1
@ y-1>=0: we have a new up-span: push it. Push it good
subge   lr, r1, r0          @ lr= current x as halfword offset
addge   r3, lr, r3, lsl #16
stmgefd sp!, {r3}

@ Check for new down span
.Lbmp16_flood_down:
ldrh    r3, [r1,+r6]
cmp     r3, r11
mov     r11, r3
beq     .Lbmp16_flood_right_chk
cmp     r3, r5
bne     .Lbmp16_flood_right_chk
add     r3, ip, #1
cmp     r3, r7
@ y+1<dstH: we have a new down-span: push it.
sublt   lr, r1, r0
addlt   r3, lr, r3, lsl #16
stmltfd sp!, {r3}

.Lbmp16_flood_right_chk:
sub     r3, r1, r0          @ current x offset
add     r3, r3, #2
cmp     r3, r6              @ keep going while next x < dstW
blo     .Lbmp16_flood_right
add     r1, r1, #2          @ hit the right edge: step past the span

.Lbmp16_flood_fill_line:
add     r0, r0, r2          @ r0= &dstL[left]
sub     r2, r1, r0          @ span length in bytes
movs    r2, r2, lsr #1      @ ... and in halfwords
mov     r1, r4
ldrne   r3,=memset16
movne   lr, pc
bxne    r3

cmp     r9, sp
bne     .Lbmp16_flood_pop

.Lbmp16_flood_end:
ldmfd   sp!, {r4-r11,lr}
bx      lr

That would be a stack-based 16-bit floodfill. Although it's slower, I prefer this to the recursive floodfill from last week because there's no chance of overflowing the stack. Well, actually, there is that risk, but it's a very low one, because for each new span the routine only uses a single extra word of stack-space instead of about 8. Only in really complex fills should you worry about stack overflow. I also prefer this over the C equivalent (given below) because C doesn't allow you to use the stack-pointer in this way; you'd have to allocate a complete stack for it (and hope that it doesn't overflow).

I've set the section for this routine to .text here (the standard code-section). This hurts the routine's speed something fierce because of the 16-bit bus of GBA-ROM, but I'm not sure if it's worth putting it in IWRAM just yet. Feel free to put it there if you need the speed, though. For NDS it doesn't really matter because it'd be cached anyway.

Using the actual stack here for keeping track of span coordinates is the first trick I used here. The second would be to double the horizontal coordinates so that they can be used directly as offsets from the scanline start (given here by dstL). The math stays the same throughout, so it doesn't really complicate matters. As far as I know, GCC can't do something like that, which is partially why the stack-based 16-bit floodfill is slower than the recursive one in C. This would also explain why on 8-bit canvasses, where the doubling isn't necessary, the stack-based version is faster. I think it's possible to shave off one or two instructions here and there, but that'd probably be somewhat of a wasted effort. Yes, you may argue that writing this in assembly was a waste of effort in the first place, but my main goal here was stack safety; not necessarily the speed of the routine.
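In C terms, the doubled-coordinate trick looks something like this (illustrative only; hypothetical helper names). Both forms read the same pixel, but the second matches what the assembly does with its ldrh [r0, r2] loads:

```c
#include <stdint.h>

typedef uint16_t u16;

// Read a pixel by index, the usual way: the compiler must scale
// x by 2 to form the address.
static u16 get_pixel_index(const u16 *line, int x)
{
    return line[x];
}

// Read a pixel via a doubled x kept as a byte offset: the offset
// is added to the line pointer with no further scaling.
static u16 get_pixel_offset(const u16 *line, int x2)
{
    return *(const u16*)((const uint8_t*)line + x2);
}
```

Keeping x2 = x*2 everywhere means the scaling happens once, when the coordinate is created, instead of at every access.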

Note the use of memset16() here. You could, of course, use DMA instead, although that'll only be faster for longer spans.

For completeness, here's the C version as well. There's almost a one-to-one correspondence between the C and assembly versions.

typedef struct POINT16 { s16 x, y; } POINT16;

#define PUSH_EA(_sp, _x, _y)    do { _sp->x= _x; _sp->y= _y; _sp++; } while(0)
#define POP_EA(_sp, _x, _y)     do { _sp--; _x= _sp->x; _y= _sp->y; } while(0)

void bmp16_floodfill(int x, int y, COLOR clrNew, void *dstBase,
u32 dstW, u32 dstH)
{
u16 *dstD= dstBase, *dstL= &dstD[y*dstW];
u16 clrOld= dstL[x];

if(clrOld == clrNew)
return;

int left, right;
POINT16 stack[1024], *sp= stack;
PUSH_EA(sp, x, y);

while(sp != stack)
{
POP_EA(sp, left, y);
dstL= &dstD[y*dstW];

// Find left edge
while(dstL[left]==clrOld && left>=0)
left--;
left++;

u32 clr, clrUp= -1, clrDown= -1;

// Trace line until right-edge
for(right= left; dstL[right]==clrOld && right<dstW; right++)
{
clr= dstL[right-dstW];

// Test up-span
if(clr!=clrUp && clr==clrOld && y>0)
PUSH_EA(sp, right, y-1);
clrUp= clr;

// Test down-span
clr= dstL[right+dstW];
if(clr!=clrDown && clr==clrOld && y+1 <  dstH)
PUSH_EA(sp, right, y+1);
clrDown= clr;
}
memset16(&dstL[left], clrNew, right-left);
}
}

# FloodFill and adventures in optimization

One of the things I've been curious about for some time but never really looked up is the floodfill. Thanks to a certain recent post on the PALib forum, I finally started looking for implementations. There's a list of them at Lode Vandevenne's site. The implementation I started with came from the Code Codex, which gives a version of the recursive scanline floodfill, one that was originally written by tepples (it can be found in the ToD source). Converted to a form usable for any-sized 16-bit canvas, it looks a little like this:

//! Struct for semi-constant canvas and floodfill info.
typedef struct flood_t
{
void *dstBase;
u16 dstW;
u16 dstH;
COLOR clrNew;
COLOR clrOld;
} flood_t;

//! Basic floodfill wrapper
void bmp16_flood_fill(int x, int y, COLOR clr, void *dstBase,
u32 dstW, u32 dstH)
{
u16 *dstL= (u16*)(dstBase + (y*dstW + x)*2);

if(*dstL == clr)
return;

flood_t flood= { dstBase, dstW, dstH, clr, *dstL };
bmp16_flood_fill_internal(x, y, &flood);
}

//! Main floodfill code (called recursively)
void bmp16_flood_fill_internal(int x, int y, const flood_t *flood)
{
u16 *dstD= flood->dstBase, *dstL= &dstD[y*flood->dstW];

// Find horz edges
int ii, left=x, right=x;

while(dstL[left-1]==flood->clrOld && left>0)
{
dstL[left]= flood->clrNew;
left--;
}
dstL[left]= flood->clrNew;

while(dstL[right+1]==flood->clrOld && right<flood->dstW-1)
{
right++;
dstL[right]= flood->clrNew;
}

// Recursive *ick*. All the stack-work's making GBA's head spin.
for(ii=left; ii<=right; ii++)
{
if(y>0 && dstL[ii-flood->dstW] == flood->clrOld)
bmp16_flood_fill_internal(ii, y-1, flood);

if(y+1<flood->dstH && dstL[ii+flood->dstW] == flood->clrOld)
bmp16_flood_fill_internal(ii, y+1, flood);
}

}

I've split the code into two parts: bmp16_flood_fill(), which is merely an interface function to get the color to fill over and set-up some variables, and bmp16_flood_fill_internal(), which is the workhorse here.

My first reaction to the workings was “huh!”, in surprise of how simple it is: you look left and right for the length of the span in the current scanline, fill it, then search in the lines above and below for places to fill (i.e., if the color is clrOld) and call bmp16_flood_fill_internal() recursively to do so.
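As a host-side miniature of that description (my own sketch on a tiny byte canvas, not the GBA code): find the span, fill it, then recurse on matching pixels in the rows above and below.

```c
// Miniature of the recursive scanline fill on a W*H byte canvas.
enum { W= 8, H= 8 };

static void mini_flood(unsigned char *c, int x, int y,
                       unsigned char clrOld, unsigned char clrNew)
{
    unsigned char *line= &c[y*W];
    int left= x, right= x;

    // Find the extent of the span on this line.
    while(left>0 && line[left-1]==clrOld)
        left--;
    while(right<W-1 && line[right+1]==clrOld)
        right++;

    // Fill it.
    for(int i=left; i<=right; i++)
        line[i]= clrNew;

    // Recurse on old-colored pixels directly above and below.
    for(int i=left; i<=right; i++)
    {
        if(y>0 && c[(y-1)*W+i]==clrOld)
            mini_flood(c, i, y-1, clrOld, clrNew);
        if(y+1<H && c[(y+1)*W+i]==clrOld)
            mini_flood(c, i, y+1, clrOld, clrNew);
    }
}
```

Because the span is recolored before recursing, already-filled pixels no longer match clrOld and the recursion terminates (provided clrOld != clrNew, which the wrapper checks).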

The code looks understandable, works nicely and is moderately fast considering all the work that needs to be done. That said, you can almost double its speed with just a few simple alterations.

#### Datatypes

The first optimization is using the correct datatypes. By which I of course mean int or u32. Using chars and shorts for my local variables and parameters here costs me roughly 25%.

You may also notice the flood_t datatype. This is an optimization too, in a way: the more parameters a function has, the more stack-work needs to be done to set up the arguments and pass them on to the function. Because I only need to read those values, not write to them, once inside bmp16_flood_fill_internal(), I'm passing a struct pointer instead of separate parameters. The actual gain in speed turned out to be negligible, but I did find out something else: the amount of stack it cost. I had expected that the recursive nature would require an awful lot of stack, but what I hadn't counted on was that in my test fill, the separate-parameter version took all the stack. When I tested it on no$gba, I got a warning that a bx r0 jumped to an invalid address. When I looked at the address, it turned out the stack pointer was in the early 0300:xxxx's: it had completely clobbered IWRAM.

#### Dedicated scanline fill

The datatype improvements were things that I basically started out with in the first place, because those are rather obvious. The first real optimization was to use a fast scanline fill instead of setting pixels individually – something like DMA or memset16(). Using memset16() saved roughly 15%. This optimization will only work if you have longer spans, of course.

The second optimization is also somewhat obvious if you know a little about how compilers do their work. This optimization has to do with variable scope. Local variables and parameters only exist inside the functions they're used in. This means that the compiler knows exactly when they are changed and can use that knowledge to optimize certain things. In contrast, globals and pointer-parameters can be changed everywhere. In particular, they can be changed in the functions that are called by a given routine. This means that after every function-call, globals and pointer-dereferences are considered ‘dirty’ and they have to be reloaded to be sure of their values.

This affects the dereferences of the flood struct here: all the members used in the last loop of bmp16_flood_fill_internal() have to be reloaded at every iteration. There is an easy way of avoiding the reloads: create a local variable for these items and use that. This also increased the speed by about 25%.
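Isolated into a toy example (hypothetical names), the pattern looks like this; both versions return the same result, but the hoisted one gives the compiler license to keep the member in a register across the call.

```c
typedef struct config { int step; } config;

static int g_calls= 0;
static void do_work(void) { g_calls++; }

// Reloads cfg->step every iteration: after each call to do_work(),
// the compiler must assume the member could have changed.
static int sum_reloading(const config *cfg, int n)
{
    int s= 0;
    for(int i=0; i<n; i++)
    {
        s += cfg->step;
        do_work();
    }
    return s;
}

// Hoist the member into a local: one load up front, then pure
// register use inside the loop.
static int sum_hoisted(const config *cfg, int n)
{
    int step= cfg->step, s= 0;
    for(int i=0; i<n; i++)
    {
        s += step;
        do_work();
    }
    return s;
}
```

The transformation is only valid if you know the called functions don't actually modify the member, which is exactly the knowledge the compiler lacks.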

#### Changing the branch condition orders

The last optimization I did turned out to be a major one. The loop- and if-conditions consist of two parts: a color check, and a clipping check because we don't want to fill outside the screen. The code above does the clipping check first.

And this is where some knowledge of assembly or compilers comes in again. Because a logical AND requires both expressions to be true, you can quit as soon as one of them is false. Because the conditions are evaluated sequentially, it pays to have the one most likely to fail first. This happens to be the color check, not the clipping check. You might think that this shouldn't matter too much, but it turns out that simply changing the order of the && expressions saves another 25%. Note that this only really matters for Thumb code, since ARM code has conditional opcodes: it doesn't have to use branches to escape early.
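Here's a contrived little counter experiment (not the floodfill code) showing the short-circuit effect: with the likely-false test first, the second test never needs to run.

```c
static int n_first, n_second;

// A test that is almost always false in this data set.
static int likely_false(int x) { n_first++;  return x==42; }
// A test that is almost always true (like the clipping check).
static int likely_true(int x)  { n_second++; return x>=0;  }

// Count how often each sub-expression is evaluated for both orders.
static void scan(const int *a, int n, int false_first)
{
    n_first= n_second= 0;
    for(int i=0; i<n; i++)
    {
        if(false_first)
        {
            if(likely_false(a[i]) && likely_true(a[i]))
                { /* hit */ }
        }
        else
        {
            if(likely_true(a[i]) && likely_false(a[i]))
                { /* hit */ }
        }
    }
}
```

Same truth table, very different amount of work: the false-first order does one test per pixel, the true-first order does two.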

#### Final version

void bmp16_flood_fill_internal(int x, int y, const flood_t *flood)
{
u16 clrOld= flood->clrOld, clrNew= flood->clrNew;
u32 dstW= flood->dstW, dstH= flood->dstH;
u16 *dstD= flood->dstBase, *dstL= &dstD[y*dstW];

int ii, left=x, right=x;

// Find horz edges, then fill
while(dstL[left-1]==clrOld && left>0)
left--;
while(dstL[right+1]==clrOld && right<dstW-1)
right++;
memset16(&dstL[left], clrNew, right+1-left);

// Recursive *ick*. All the stackwork's making GBA's head spin.
for(ii=left; ii<=right; ii++)
{
// Do color check first.
if(dstL[ii-dstW]==clrOld && y>0)
bmp16_flood_fill_internal(ii, y-1, flood);

if(dstL[ii+dstW]==clrOld && y+1<dstH)
bmp16_flood_fill_internal(ii, y+1, flood);
}
}

Above you can see the final version of bmp16_flood_fill_internal(). Note that it is not harder to read at all; in fact, I'd say it's actually easier to read than the first version. At the same time, we've sped up the function by a factor of 1.15*1.25*1.25 ≈ 1.8. Nearly 80% faster just by making some very elementary alterations. I don't really think these can be considered “premature optimizations”, because you could have easily designed and written the algorithm the faster way in the first place. To recap:

• Use words (int/s32/u32) for local variables and parameters, not bytes (char/s8/u8) or halfwords (short/s16/u16).
• Use a fast-fill routine instead of manual looping for longer spans.
• Load up pointers, struct members and global variables into locals if they're used frequently and/or you call other functions. Aside from being potentially faster, it also keeps the worker code cleaner.
• Order the expressions in compound conditions so that the decision can be reached as early as possible. For ANDs put the expressions most likely to be false first; for ORs, put these last.

None of those points are specific to floodfills; they should work for just about any part of code. I suppose you could call them good habits.

(Images: first floodfill results; second floodfill results.)

One word of caution if you want to use the recursive floodfill routine: because of the amount of stack it uses, you'll have to be careful where you apply it. There is also a non-recursive version using a local stack, but you'd still run the risk of overflowing that, and it doesn't seem to be faster anyway. Maybe I'll make an asm version later, but preferably not if I don't have to.

# memset16 fix

Although I know I tested the memset16() presented earlier, one instruction miraculously disappeared when posting it here. There should have been a 'mov r12, r1' in the initializing of the registers for the stmia. I have no idea where it disappeared to, but with this fix the routine should run properly.

I've been having a number of errors of this sort lately, actually. I've also recently found out that the shared-tileset functionality in the current distribution is broken because of a single character in the code that handles it in grit_prep_map. This:

// in grit_prep.cpp, grit_prep_map
if(gr->img_flags & GRIT_IMG_SHARED)
rdxDib= grit_tile_reduce(&mapRec, tileDib, flags, NULL);
else
rdxDib= grit_tile_reduce(&mapRec, tileDib, flags, gr->shared->dib);

should have been this:

// in grit_prep.cpp, grit_prep_map
if(~gr->img_flags & GRIT_IMG_SHARED)
rdxDib= grit_tile_reduce(&mapRec, tileDib, flags, NULL);
else
rdxDib= grit_tile_reduce(&mapRec, tileDib, flags, gr->shared->dib);

The missing ~ means that the code did exactly the wrong thing. And again, during testing I had it right (cue Twilight Zone theme).
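For a single flag bit, ~flags & BIT and !(flags & BIT) agree on truth value, since ~ binds tighter than &: both are nonzero exactly when the bit is clear. A tiny check with a made-up stand-in flag value:

```c
#define IMG_SHARED_DEMO 0x04    // made-up stand-in flag value

// Nonzero iff the shared-bit is clear ('~' binds tighter than '&').
static int not_shared_tilde(unsigned flags)
{
    return ~flags & IMG_SHARED_DEMO;
}

// The more conventional spelling of the same test.
static int not_shared_bang(unsigned flags)
{
    return !(flags & IMG_SHARED_DEMO);
}
```

Note that the equivalence only holds for testing a single bit; for multi-bit masks, ~flags & MASK is nonzero when any masked bit is clear, while !(flags & MASK) requires all of them to be.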

# New memset16 routine

The old tonclib memset16() was a Thumb/ROM function that called memset32() when it could. Below you can find an ARM-only version. This time round, I chose to do all the real work inside the 16-bit version, and let memset32() jump into the middle of memset16(). Combined, these are less than 32 instructions, which should make the cache happy as well. The main difference in performance is in the lower counts: the function overhead is about 100 cycles lower than before.

@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset32(void *dst, u32 fill, size_t wcount)
@ ---------------------------------------------------------------------
.section .iwram, "ax",%progbits
.arm
.align
.global memset32
memset32:
mov     r2, r2, lsl #1
cmp     r2, #16
bhs     .Lms16_entry32
b       .Lms16_word_loop

@ ---------------------------------------------------------------------
@ CODE_IN_IWRAM void memset16(void *dst, u16 fill, size_t hwcount)
@ ---------------------------------------------------------------------
.section .iwram, "ax", %progbits
.arm
.align
.global memset16
memset16:
cmp     r2, #0              @ if(count != 0)
movnes  r3, r0, ror #2      @   if(dst && (dst&2))
strmih  r1, [r0], #2        @   {   *dst++= fill;   count--;    }
submis  r2, r2, #1          @ if(count == 0 || dst == NULL)
bxeq    lr                  @   return;

orr     r1, r1, r1, lsl #16 @ Prep for word fills.
cmp     r2, #16
blo     .Lms16_word_loop
.Lms16_entry32:

@ --- Block run ---
stmfd   sp!, {r4-r8}
mov     r3, r1
mov     r4, r1
mov     r5, r1
mov     r6, r1
mov     r7, r1
mov     r8, r1
mov     r12, r1
.Lms16_block_loop:
subs    r2, r2, #16
stmhsia r0!, {r1, r3-r8, r12}
bhi     .Lms16_block_loop
ldmfd   sp!, {r4-r8}
bxeq    lr
addne   r2, r2, #16         @ Correct for overstep in loop

@ --- Word run (+ trailing halfword) ---
.Lms16_word_loop:
subs    r2, r2, #2
strhs   r1, [r0], #4
bhi     .Lms16_word_loop
strneh  r1, [r0], #2        @ r2 != 0 means spare hword left
bx  lr

@ EOF

As usual, I'm being somewhat dirty with how the assembly works. In the first 5 instructions of memset16(), I'm doing several things in one go: testing the destination (and count) for 0, doing a single halfword write for non-word-aligned destinations, and returning if the count is 0 afterwards. I can do all this in five instructions through clever manipulation of conditionals.

The instructions that make up the main loop are a little non-standard as well. Here's how it works and why:

1. Reduce the fill-count, C, by N halfwords. Note that C need not be a multiple of N. This is important.
2. Fill N halfwords if C>=N. It's '>=', not '>', because C==N indicates the last stretch. It's also not '!=', because C need not be a multiple of N.
3. Loop as long as C>N. In this case it is '>', because C==N indicates the last full stretch.
4. Now it's time for the residuals. If C%N==0, then we're finished, so it's time to return.
5. However, if there were residuals, then C went negative in the last subtraction inside the loop, so we have to correct for it by adding N again.

The standard method is splitting possible residuals first, but this version is shorter and allows for earlier escaping. A second benefit is that you can use non-power of two values for N as well. It is possible, for example, to use a 12-fold stmia here with only a few changes. The lower number of loops means that this would be ~10% faster … eventually. It really depends on things like memory waitstates whether the 12-fold version is worthwhile.
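Modeled in C (a sketch of the control flow, not the assembly itself), the subtract-first structure looks like this:

```c
enum { N= 16 };   // halfwords per block, as in the stmia version

// Fill 'count' halfwords: N-sized blocks first, residuals after.
static void block_fill(unsigned short *dst, unsigned short fill,
                       int count)
{
    int c= count;
    if(c >= N)
    {
        for(;;)
        {
            c -= N;                     // 1: subtract first
            if(c >= 0)                  // 2: a full block still fit
                for(int i=0; i<N; i++)
                    *dst++= fill;
            if(c <= 0)                  // 3: loop while c was > N
                break;
        }
        if(c != 0)
            c += N;                     // 5: undo the overstep
    }
    for(int i=0; i<c; i++)              // 4: residual halfwords
        *dst++= fill;
}
```

Because only the final subtraction can overstep, a single correction at the end suffices, and N is free to be any block size, not just a power of two.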

Oh, the highlighting was done by GeSHi as well. Making that arm-asm highlighter turned out very easy indeed.

#### EDIT, 2007-12-07

There was a small bug in the version above. r12 was used but never initialized. I know I had it in there when I tested it, but somehow it got lost.