Coranac

The standard C functions for copying and filling are memcpy() and memset(). They're part of the standard library, are easy to use and are often implemented with some optimizations so that they're usually faster than manual looping. The DKA version, for example will fill as words if the alignments and sizes allow for it. This can be much faster than doing the loops yourself.

There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.

The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.

The code

My main goal here was to create smallish and portable replacements, not to have the greatest and fastestest code around because that's rather platform dependent. Yes, even the difference between GBA and NDS should matter, because of the differences in ldr/str times and caching.

There are 5 functions here. The main functions here are tonccpy and __toncset for copying and filling words, respectively. The other 3 are interfaces for __toncset for filling 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color instead of 8-bit data. For the rest of the discussion, I will use the name “toncset” for the internal routine for convenience.

//# Stuff you may not have yet.
typedef unsigned int uint;
#define BIT_MASK(len)       ( (1<<(len))-1 )
static inline u32 quad8(u8 x)   {   x |= x<<8; return x | x<<16;    }


//# Declarations and inlines.

void tonccpy(void *dst, const void *src, uint size);

void __toncset(void *dst, u32 fill, uint size);
static inline void toncset(void *dst, u8 src, uint size);
static inline void toncset16(void *dst, u16 src, uint size);
static inline void toncset32(void *dst, u32 src, uint size);


//! VRAM-safe memset, byte version. Size in bytes.
static inline void toncset(void *dst, u8 src, uint size)
{   __toncset(dst, quad8(src), size);               }

//! VRAM-safe memset, halfword version. Size in hwords.
static inline void toncset16(void *dst, u16 src, uint size)
{   __toncset(dst, src|src<<16, size*2);            }

//! VRAM-safe memset, word version. Size in words.
static inline void toncset32(void *dst, u32 src, uint size)
{   __toncset(dst, src, size*4);                    }
//# tonccpy.c

//! VRAM-safe cpy.
/*! This version mimics memcpy in functionality, with
    the benefit of working for VRAM as well. It is also
    slightly faster than the original memcpy, but faster
    implementations can be made.
    \param dst  Destination pointer.
    \param src  Source pointer.
    \param size Fill-length in bytes.
    \note   The pointers and size need not be word-aligned.
*/

void tonccpy(void *dst, const void *src, uint size)
{
    if(size==0 || dst==NULL || src==NULL)
        return;

    uint count;
    u16 *dst16;     // hword destination
    u8  *src8;      // byte source

    // Ideal case: copy by 4x words. Leaves tail for later.
    if( ((u32)src|(u32)dst)%4==0 && size>=4)
    {
        u32 *src32= (u32*)src, *dst32= (u32*)dst;

        count= size/4;
        uint tmp= count&3;
        count /= 4;

        // Duff, bitch!
        switch(tmp) {
            do {    *dst32++ = *src32++;
        case 3:     *dst32++ = *src32++;
        case 2:     *dst32++ = *src32++;
        case 1:     *dst32++ = *src32++;
        case 0:     ; } while(count--);
        }

        // Check for tail
        size &= 3;
        if(size == 0)
            return;

        src8= (u8*)src32;
        dst16= (u16*)dst32;
    }
    else        // Unaligned.
    {
        uint dstOfs= (u32)dst&1;
        src8= (u8*)src;
        dst16= (u16*)(dst-dstOfs);

        // Head: 1 byte.
        if(dstOfs != 0)
        {
            *dst16= (*dst16 & 0xFF) | *src8++<<8;
            dst16++;
            if(--size==0)
                return;
        }
    }

    // Unaligned main: copy by 2x byte.
    count= size/2;
    while(count--)
    {
        *dst16++ = src8[0] | src8[1]<<8;
        src8 += 2;
    }

    // Tail: 1 byte.
    if(size&1)
        *dst16= (*dst16 &~ 0xFF) | *src8;
}
//# toncset.c

//! VRAM-safe memset, internal routine.
/*! This version mimics memset in functionality, with
    the benefit of working for VRAM as well. It is also
    slightly faster than the original memset.
    \param dst  Destination pointer.
    \param fill Word to fill with.
    \param size Fill-length in bytes.
    \note   The \a dst pointer and \a size need not be
        word-aligned. In the case of unaligned fills, \a fill
        will be masked off to match the situation.
*/

void __toncset(void *dst, u32 fill, uint size)
{
    if(size==0 || dst==NULL)
        return;

    uint left= (u32)dst&3;
    u32 *dst32= (u32*)(dst-left);
    u32 count, mask;

    // Unaligned head.
    if(left != 0)
    {
        // Adjust for very small stint.
        if(left+size<4)
        {
            mask= BIT_MASK(size*8)<<(left*8);   
            *dst32= (*dst32 &~ mask) | (fill & mask);
            return;
        }

        mask= BIT_MASK(left*8);
        *dst32= (*dst32 & mask) | (fill&~mask);
        dst32++;
        size -= 4-left;
    }

    // Main stint.
    count= size/4;
    uint tmp= count&3;
    count /= 4;

    switch(tmp) {
        do {    *dst32++ = fill;
    case 3:     *dst32++ = fill;
    case 2:     *dst32++ = fill;
    case 1:     *dst32++ = fill;
    case 0:     ; } while(count--);
    }

    // Tail
    size &= 3;
    if(size)
    {
        mask= BIT_MASK(size*8);
        *dst32= (*dst32 &~ mask) | (fill & mask);
    }
}

Discussion

Both tonccpy and toncset have the following structure: the destination memory is divided into an unaligned head, followed by an aligned main stint and then a tail for any trailing bytes. In the optimal case there is no head (i.e., the destination (and source) are word-aligned) and perhaps no tail either (dstv+size is aligned). If that happens you're in luck: a 4x unrolled word copier will blaze through memory. And yes, that is a Duff's Device I'm using there; I know it's evil, but I've always wanted to try one. Interestingly, it is also the only way I've been able to get DKA to create the optimal output for an unrolled loop.

When you aren't lucky, you'll have to go through the motions of masking off halfwords and/or words to insert the bytes. Technically it's faster to use words rather than halfwords, but this can be extremely unpleasant because it's also possible for the transfer to be in the middle of a word. For example, dst%4== 1 and size = 2 would require a dst-mask of 0xFF0000FF. I still have this in toncset, but for the copier you have to deal with such annoying cases that I preferred to go with the simplicity of halfwords for now.

Tests

I've tested the routines with all alignment variations and many different sizes and found them to work accurately. I've also tested them for speed. The following graphics show the results of the tests; while reading, remember that all items have been compiled with -O2 -mthumb and are in ROM. The source data for the copiers was in EWRAM and the destination was VRAM.

Fig 1a and Fig 1b show tonccpy(), memcpy() and something called vramcpy. vramcpy() is a more optimized version of tonccpy(), but it's not ready yet. It uses memcpy32() for the aligned main stint, so that should represent optimal copying speed. The “a” and “u” affixes mean aligned and unaligned pointers, respectively.

As you can see, even though tonccpy() is more complicated than memcpy(), it's a little faster in almost all cases. Only for short stints is memcpy() faster because tonccpy() has a larger overhead. The vramcpy() lines show that there is much room for optimization for longer stints. Using a dedicated word copier (memcpy32(), DMA) would help, but I wanted tonccpy() to be portable and self-sufficient.

You can also see the incredible differences between when the source and destination have equal alignments (a) and when they don't (u). I know it is possible to speed up the unaligned parts by 50% to 100% or so as well, but I haven't quite zeroed in on the right solution yet.


Fig 1a: copier comparisons, long stints. a: src=dst alignment; u: unaligned.

Fig 1b: copier comparisons, short stints. a: src=dst alignment; u: unaligned.

The results of the fillers are in fig 2a and fig 2b. These pictures show toncset(), memset() and toncset2(). toncset2() is essentially toncset() using memset32() for the main stint. Because there's no source alignment to worry about, there is no chance for src-dst misalignments. This is why there is little difference between the aligned and unaligned cases for the toncset() variants; only memset() is very slow in the unaligned case.


Fig 2a: fill comparisons, long stints. a: dst = aligned; u: unaligned.

Fig 2b: fill comparisons, short stints. a: dst = aligned; u: unaligned.

Lastly, the numbers for the transfer-rate and the overhead for calling them. Please remember that the actual numbers depend very much on what the waitstates are for the source and destination and where the code resides. That said, the figures should be useful for relative comparisons. Also, these are GBA timings; the NDS figures will be different, but I'll test for them when I get round to it.

Table 1: copier results in cycles.
  rate [c/byte] overhead [c]
vramcpy, a 2.219 270
vramcpy, u 19.520 165
memcpy, a 6.207 121
memcpy, u 35.00 84
tonccpy, a 5.83 167
tonccpy, u 25.020 151
 
Table 2: filler results in cycles.
  rate [c/byte] overhead [c]
toncset2, a 0.656 222
toncset2, u 0.656 334
memset, a 3.000 158
memset, u 23.000 93
toncset, a 2.81 166
toncset, u 2.81 266

Conclusions

tonccpy() and toncset() are essentially variations of memcpy() and memset() that also work for GBA/NDS VRAM even in the worse circumstances. They're actually a little faster as well, so I can recommend using them instead of memcpy/set in all cases. This is not to say that they are the optimal solutions: faster general solutions certainly exist, but they will be longer, hairier and probably less portable. Faster non-general solutions exist as well, of course. If you know your pointers and sizes will be word-aligned, consider DMA32, CpuFastSet or the memcpy32/set32 routines from tonclib.


No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment




XHTML: You can use these tags:<a href=""> <b> <blockquote cite=""> <br> <cite> <code> <div align="" class=""> <em> <i> <li> <ol> <p> <q cite=""> <sub> <sup> <u> <ul>
Others: [br], [code lang='*'], [color], [url], [wiki]




Powered by WordPress