FloodFill and adventures in optimization

One of the things I've been curious about for sometime but never really looked up is the FloodFill. Thanks to a certain recent post in the PALib forum, I finally started looking for implementations. There's a list of them at Lode Vandevenne's site. The actual implementation I start with actually came from the code codex, which gives an version of the recursive scanline floodfill. One that was originally written by tepples (it can be found in the ToD source). Converted to a form usable for any sized 16bit canvas, it looks a little like this:

//! Struct for semi-constant canvas and floodfill info.
typedef struct flood_t
{
    void *dstBase;
    u16 dstW;
    u16 dstH;
    COLOR clrNew;
    COLOR clrOld;
} flood_t;

//! Basic floodfill wrapper
void bmp16_flood_fill(int x, int y, COLOR clr, void *dstBase,
    u32 dstW, u32 dstH)
{
    u16 *dstL= (u16*)(dstBase + (y*dstW + x)*2);

    if(*dstL == clr)
        return;
       
    flood_t flood= { dstBase, dstW, dstH, clr, *dstL };
    bmp16_flood_fill_internal(x, y, &flood);    
}

//! Main floodfill code (called recursively)
void bmp16_flood_fill_internal(int x, int y, const flood_t *flood)
{
    u16 *dstD= flood->dstBase, *dstL= &dstD[y*flood->dstW];
   
    // Find horz edges
    int ii, left=x, right=x;

    while(dstL[left-1]==flood->clrOld && left>0)
    {
        dstL[left]= flood->clrNew;
        left--;
    }
    dstL[left]= flood->clrNew;

    while(dstL[right+1]==flood->clrOld && right<flood->dstW-1)
    {
        right++;
        dstL[right]= flood->clrNew;
    }

    // Recursive *ick*. All the stack-work's making GBA's head spin.
    for(ii=left; ii<=right; ii++)
    {
        if(y>0 && dstL[ii-flood->dstW] == flood->clrOld)
            bmp16_flood_fill_internal(ii, y-1, flood);
       
        if(y+1<flood->dstH && dstL[ii+flood->dstW] == flood->clrOld)
            bmp16_flood_fill_internal(ii, y+1, flood);
    }

}

I've split the code into two parts: bmp16_flood_fill(), which is merely an interface function to get the color to fill over and set-up some variables, and bmp16_flood_fill_internal(), which is the workhorse here.

My first reaction to the workings was “huh!”, in surprise of how simple it is: you look left and right for the length of the span in the current scanline, fill it, then search in the lines above and below for places to fill (i.e., if the color is clrOld) and call bmp16_flood_fill_internal() recursively to do so.

The code looks understandable, works nicely and is moderately fast considering all the work that needs to be done. That said, you can almost double its speed with just a few simple alterations.

Datatypes

The first optimization is using the correct datatypes. By which I of course mean int or u32. Using chars and shorts for my local variables and parameters here costs me roughly 25%.

You may also notice the flood_t datatype. This is an optimization too in a way: the more parameters a function has, the more stackwork needs to be done to set-up the arguments and pass them on to the function. Because I only need to read those values and not write to them once inside bmp16_flood_fill_internal(), I'm passing a struct pointer instead of separate parameters. The actual gain in speed turned out to be negligible, but I did find out something else: the amount of stack it cost. I had expected that the recursive nature would require an awful lot of stack, but what I hadn't counted on was that in my test fill, the separate-parameter version took all the stack. When I tested it on no$gba, I got an warning that that a bx r0 jumped to an invalid address. When I looked at the address it turned out the stack pointer was in the early 0300:xxxx's : it had completely clobbered IWRAM.

Dedicated scanline fill

The datatype improvements were things that I basically started out with in the first place, because those are rather obvious. The first real optimization was to use a fast scanlinefill instead of setting pixels individually – something like DMA or memset16(). Using memset16() saved roughly 15%. This optimization will only work if you have longer spans, of course.

Preload flood_t members

The second optimization is also somewhat obvious if you know a little about how compilers do their work. This optimization has to do with variable scope. Local variables and parameters only exist inside the functions they're used in. This means that the compiler knows exactly when they are changed and can use that knowledge to optimize certain things. In contrast, globals and pointer-parameters can be changed everywhere. In particular, they can be changed in the functions that are called by a given routine. This means that after every function-call, globals and pointer-dereferences are considered ‘dirty’ and they have to be reloaded to be sure of their values.

This affects the dereferences of the flood struct here: all the members used in the last loop of bmp16_flood_fill_internal() have to be reloaded at every iteration. There is an easy way of avoiding the reloads: create a local variable for these items and use that. This also increased the speed by about 25%.

Changing the branch condition orders

The last optimization I did turned out to be a major one. The loop- and if-conditions consist of two parts: a color check but also a clipping check because we don't want to fill outside the screen. The code above does the clipping check first.

And this is where some knowledge of assembly or compilers comes in again. Because logical ANDs require both expressions to be true, you can quit as soon as one of them is false. Because the conditions are evaluated sequentially, it pays to have the one most likely to fail first. This happens to be the color check, not the clipping check. You might think that this shouldn't matter too much, but it turns out that simply by changing the order of the `&&' expressions saves another 25%. Note that this only really matters for Thumb code, since ARM code has conditional opcodes: it doesn't have to use branches to escape early.

Final version

void bmp16_flood_fill_internal(int x, int y, const flood_t *flood)
{
    // Preloading flood variables
    u16 clrOld= flood->clrOld, clrNew= flood->clrNew;
    u32 dstW= flood->dstW, dstH= flood->dstH;
    u16 *dstD= flood->dstBase, *dstL= &dstD[y*dstW];

    int ii, left=x, right=x;

    // Find horz edges, then fill
    while(dstL[left -1]==clrOld && left>0)
        left--;
    while(dstL[right+1]==clrOld && right<dstW-1)
        right++;
    memset16(&dstL[left], clrNew, right+1-left);

    // Recursive *ick*. All the stackwork's making GBA's head spin.
    for(ii=left; ii<=right; ii++)
    {
        // Do color check first.
        if(dstL[ii-dstW]==clrOld && y>0)
            bmp16_flood_fill_internal(ii, y-1, flood);
       
        if(dstL[ii+dstW]==clrOld && y+1<dstH)
            bmp16_flood_fill_internal(ii, y+1, flood);      
    }
}

Above you can see the final version of bmp16_flood_fill_internal(). Note that it is not harder to read at all; in fact, I'd say it's actually easier to read than the first version. At the same time we've sped up the function by a factor of 1.15*1.25*1.25 ≈ 1.75. 75% faster just by making some very elementary alterations. I don't really think these can be considered “premature optimizations”, because you could have easily designed and written the algorithm the faster way in the first place. To recap:

  • Use words (int/s32/u32) for local variables and parameters, not bytes (char/s8/u8) or halfwords (short/s16/u16).
  • Use a fast-fill routine instead of manual looping for longer spans.
  • Load up pointers, struct members and global variables into locals if they're used frequently and/or you call other functions. Aside from being potentially faster, it also keeps the worker code cleaner.
  • Order the expressions in compound conditions so that the decision can be reached as early as possible. For ANDs put the expressions most likely to be false first; for ORs, put these last.

None of those points are specific to floodfills; they should work for just about any part of code. I suppose you could call them good habits.


First floodfill results .

Second floodfill results.

One word of caution if you want to use the recursive floodfill routine. Because of the amount of stack it uses, you'll have to be careful where you apply it. There is also non-recursive version using a local stack, but you'd still run the risk of overflowing that and it doesn't seem to be faster anyway. Maybe I'll make an asm version later, but preferably not if I don't have to.

Leave a Reply

Your email address will not be published. Required fields are marked *