memcpy and memset replacements for GBA/NDS

The standard C functions for copying and filling are memcpy() and memset(). They're part of the standard library, are easy to use and are often implemented with some optimizations so that they're usually faster than manual looping. The DKA version, for example will fill as words if the alignments and sizes allow for it. This can be much faster than doing the loops yourself.

There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.

The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.

The code

My main goal here was to create smallish and portable replacements, not to have the greatest and fastestest code around because that's rather platform dependent. Yes, even the difference between GBA and NDS should matter, because of the differences in ldr/str times and caching.

There are 5 functions here. The main functions here are tonccpy and __toncset for copying and filling words, respectively. The other 3 are interfaces for __toncset for filling 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color instead of 8-bit data. For the rest of the discussion, I will use the name “toncset” for the internal routine for convenience.

//# Stuff you may not have yet.
typedef unsigned int uint;
#define BIT_MASK(len)       ( (1<<(len))-1 )
static inline u32 quad8(u8 x)   {   x |= x<<8; return x | x<<16;    }


//# Declarations and inlines.

void tonccpy(void *dst, const void *src, uint size);

void __toncset(void *dst, u32 fill, uint size);
static inline void toncset(void *dst, u8 src, uint size);
static inline void toncset16(void *dst, u16 src, uint size);
static inline void toncset32(void *dst, u32 src, uint size);


//! VRAM-safe memset, byte version. Size in bytes.
static inline void toncset(void *dst, u8 src, uint size)
{   __toncset(dst, quad8(src), size);               }

//! VRAM-safe memset, halfword version. Size in hwords.
static inline void toncset16(void *dst, u16 src, uint size)
{   __toncset(dst, src|src<<16, size*2);            }

//! VRAM-safe memset, word version. Size in words.
static inline void toncset32(void *dst, u32 src, uint size)
{   __toncset(dst, src, size*4);                    }
//# tonccpy.c

//! VRAM-safe cpy.
/*! This version mimics memcpy in functionality, with
    the benefit of working for VRAM as well. It is also
    slightly faster than the original memcpy, but faster
    implementations can be made.
    \param dst  Destination pointer.
    \param src  Source pointer.
    \param size Fill-length in bytes.
    \note   The pointers and size need not be word-aligned.
*/

void tonccpy(void *dst, const void *src, uint size)
{
    if(size==0 || dst==NULL || src==NULL)
        return;

    uint count;
    u16 *dst16;     // hword destination
    u8  *src8;      // byte source

    // Ideal case: copy by 4x words. Leaves tail for later.
    if( ((u32)src|(u32)dst)%4==0 && size>=4)
    {
        u32 *src32= (u32*)src, *dst32= (u32*)dst;

        count= size/4;
        uint tmp= count&3;
        count /= 4;

        // Duff, bitch!
        switch(tmp) {
            do {    *dst32++ = *src32++;
        case 3:     *dst32++ = *src32++;
        case 2:     *dst32++ = *src32++;
        case 1:     *dst32++ = *src32++;
        case 0:     ; } while(count--);
        }

        // Check for tail
        size &= 3;
        if(size == 0)
            return;

        src8= (u8*)src32;
        dst16= (u16*)dst32;
    }
    else        // Unaligned.
    {
        uint dstOfs= (u32)dst&1;
        src8= (u8*)src;
        dst16= (u16*)(dst-dstOfs);

        // Head: 1 byte.
        if(dstOfs != 0)
        {
            *dst16= (*dst16 & 0xFF) | *src8++<<8;
            dst16++;
            if(--size==0)
                return;
        }
    }

    // Unaligned main: copy by 2x byte.
    count= size/2;
    while(count--)
    {
        *dst16++ = src8[0] | src8[1]<<8;
        src8 += 2;
    }

    // Tail: 1 byte.
    if(size&1)
        *dst16= (*dst16 &~ 0xFF) | *src8;
}
//# toncset.c

//! VRAM-safe memset, internal routine.
/*! This version mimics memset in functionality, with
    the benefit of working for VRAM as well. It is also
    slightly faster than the original memset.
    \param dst  Destination pointer.
    \param fill Word to fill with.
    \param size Fill-length in bytes.
    \note   The \a dst pointer and \a size need not be
        word-aligned. In the case of unaligned fills, \a fill
        will be masked off to match the situation.
*/

void __toncset(void *dst, u32 fill, uint size)
{
    if(size==0 || dst==NULL)
        return;

    uint left= (u32)dst&3;
    u32 *dst32= (u32*)(dst-left);
    u32 count, mask;

    // Unaligned head.
    if(left != 0)
    {
        // Adjust for very small stint.
        if(left+size<4)
        {
            mask= BIT_MASK(size*8)<<(left*8);  
            *dst32= (*dst32 &~ mask) | (fill & mask);
            return;
        }

        mask= BIT_MASK(left*8);
        *dst32= (*dst32 & mask) | (fill&~mask);
        dst32++;
        size -= 4-left;
    }

    // Main stint.
    count= size/4;
    uint tmp= count&3;
    count /= 4;

    switch(tmp) {
        do {    *dst32++ = fill;
    case 3:     *dst32++ = fill;
    case 2:     *dst32++ = fill;
    case 1:     *dst32++ = fill;
    case 0:     ; } while(count--);
    }

    // Tail
    size &= 3;
    if(size)
    {
        mask= BIT_MASK(size*8);
        *dst32= (*dst32 &~ mask) | (fill & mask);
    }
}

Discussion

Both tonccpy and toncset have the following structure: the destination memory is divided into an unaligned head, followed by an aligned main stint and then a tail for any trailing bytes. In the optimal case there is no head (i.e., the destination (and source) are word-aligned) and perhaps no tail either (dstv+size is aligned). If that happens you're in luck: a 4x unrolled word copier will blaze through memory. And yes, that is a Duff's Device I'm using there; I know it's evil, but I've always wanted to try one. Interestingly, it is also the only way I've been able to get DKA to create the optimal output for an unrolled loop.

When you aren't lucky, you'll have to go through the motions of masking off halfwords and/or words to insert the bytes. Technically it's faster to use words rather than halfwords, but this can be extremely unpleasant because it's also possible for the transfer to be in the middle of a word. For example, dst%4== 1 and size = 2 would require a dst-mask of 0xFF0000FF. I still have this in toncset, but for the copier you have to deal with such annoying cases that I preferred to go with the simplicity of halfwords for now.

Tests

I've tested the routines with all alignment variations and many different sizes and found them to work accurately. I've also tested them for speed. The following graphics show the results of the tests; while reading, remember that all items have been compiled with -O2 -mthumb and are in ROM. The source data for the copiers was in EWRAM and the destination was VRAM.

Fig 1a and Fig 1b show tonccpy(), memcpy() and something called vramcpy. vramcpy() is a more optimized version of tonccpy(), but it's not ready yet. It uses memcpy32() for the aligned main stint, so that should represent optimal copying speed. The “a” and “u” affixes mean aligned and unaligned pointers, respectively.

As you can see, even though tonccpy() is more complicated than memcpy(), it's a little faster in almost all cases. Only for short stints is memcpy() faster because tonccpy() has a larger overhead. The vramcpy() lines show that there is much room for optimization for longer stints. Using a dedicated word copier (memcpy32(), DMA) would help, but I wanted tonccpy() to be portable and self-sufficient.

You can also see the incredible differences between when the source and destination have equal alignments (a) and when they don't (u). I know it is possible to speed up the unaligned parts by 50% to 100% or so as well, but I haven't quite zeroed in on the right solution yet.


Fig 1a: copier comparisons, long stints. a: src=dst alignment; u: unaligned.

Fig 1b: copier comparisons, short stints. a: src=dst alignment; u: unaligned.

The results of the fillers are in fig 2a and fig 2b. These pictures show toncset(), memset() and toncset2(). toncset2() is essentially toncset() using memset32() for the main stint. Because there's no source alignment to worry about, there is no chance for src-dst misalignments. This is why there is little difference between the aligned and unaligned cases for the toncset() variants; only memset() is very slow in the unaligned case.


Fig 2a: fill comparisons, long stints. a: dst = aligned; u: unaligned.

Fig 2b: fill comparisons, short stints. a: dst = aligned; u: unaligned.

Lastly, the numbers for the transfer-rate and the overhead for calling them. Please remember that the actual numbers depend very much on what the waitstates are for the source and destination and where the code resides. That said, the figures should be useful for relative comparisons. Also, these are GBA timings; the NDS figures will be different, but I'll test for them when I get round to it.

Table 1: copier results in cycles.
  rate [c/byte] overhead [c]
vramcpy, a 2.219 270
vramcpy, u 19.520 165
memcpy, a 6.207 121
memcpy, u 35.00 84
tonccpy, a 5.83 167
tonccpy, u 25.020 151
 
Table 2: filler results in cycles.
  rate [c/byte] overhead [c]
toncset2, a 0.656 222
toncset2, u 0.656 334
memset, a 3.000 158
memset, u 23.000 93
toncset, a 2.81 166
toncset, u 2.81 266

Conclusions

tonccpy() and toncset() are essentially variations of memcpy() and memset() that also work for GBA/NDS VRAM even in the worse circumstances. They're actually a little faster as well, so I can recommend using them instead of memcpy/set in all cases. This is not to say that they are the optimal solutions: faster general solutions certainly exist, but they will be longer, hairier and probably less portable. Faster non-general solutions exist as well, of course. If you know your pointers and sizes will be word-aligned, consider DMA32, CpuFastSet or the memcpy32/set32 routines from tonclib.

3 thoughts on “memcpy and memset replacements for GBA/NDS

  1. I've once made C++ templated functions for blitting. One of my goals was to train myself a bit with templates which I rarely use for my own code. This is quite different from your approach though, because it was designed around the blitting features more than the speed. But we share one purpose: being able to blit into VRAM for 8bpp sources. I know a lot about several assemblers mechanics (ALU, registers, adressing modes, flags, etc... I used to spend lot of time to gain some few cycles), but I lack knowledge about ARM op-codes cycles. Since I don't really want to the take time to profile my code (it works and I don't feel it necessary, I'll talk about it below), I'd like to see these templates optimized by someone who run after the cycles... and I'm of the ones who still believe that a human programmer can do a lot better than a compiler, even with today CPU with all their pipelines and stuffs, simply because... we're still smarter than the machines! Other arguments are: gaining few cycles will pay more on a 67Mhz CPU than a multi-Ghz-monster, and the actual optimization is essentially about taking care of that damn critical inner loop.

    I've read another article of you on this site about your hesitations to use C++ instead of C. You have good reasons, but I think the performance one is no longer a threat. Effectively, there was a time where C++ compilers performed poorly compared to C compilers. This was not because of the language itself (C++ can be as close to the machine as C can), but because of the compilers themselves, and often because of the programmers who merely know what is costly and what is not once translated into op-codes (those programmers have more reasons to fall in C++ performance traps than others). My point is that templates are not of the features that slow things down. At times, they can become hard to master, must be planned carefully, but their tremendous power is worth thoses prices.

    Now about the blitters. The core blitter logic is centralized. Only when it needs to update the actual pixel a function (inlined) is used. This function takes the form of a functor (another C++ feature with no cycle cost, and even potentially faster than function pointers which cannot be inlined). Then you only have to define a new functor to perform special effects at the pixel level (transparency, combininations with source... etc).

    Actually, I use 2 blitter templates for everything:

    - one for generic blitting from X bpp source to Y bpp destination (usually both are the same and are 16bpp on DS). Note that I also use that for blitting tile-maps on BGs screens, after all, this is just the same as if you had a 16bpp 32x32 pixels screen. The core implementation is pretty simple and I don't think it needs optimizations for the inner loop.

    - the other template is intended for paletted-source blits (the specific problem you are talking about). You're right this needs some special processing, because the usual approach is wrong in terms of performance, and even it won't work well if destination is VRAM. This template can blit X bpp paletted images (where X is a templated int of 1, 2, 4 or 8) to a 16bpp bitmap memory (could be templated for 32bpp, not needed on DS though). It reads by chunks of 16bpp for efficiency, taking line bounds alignment into account. It also uses a functor, with slightly different parameters (source is a color index in palette, not the actual color). The generic inner loop is undoubtly slower than yours (mainly due to X bpp source and because I did not really care about performance), but classic head/body/tail methods like yours would speed things up. Of course, template specialization is also possible. I did not put much effort on optimization for this stuff because the DS horsepower definitely doesn't lie in the area of old-school bitmap blitting techniques. Bitmap mode is a commodity on this platform: you can't count on the hardware to help you there (in fact, it will even slows you down more than 90 % of the time since the video needs the data without delay while rendering). If I want moving stuffs, I use OBJs, BGs or 3D polygons, that's it. Blitting is mostly about dealing with big bunches of memory, and a 67Mhz ARM or DMA won't do miracles whatever you do. The DS can make a far better use of its time than blitting bitmaps... so my point is that if I use blitting techniques, I need features over speed. The functor can do this with advanced pixel operations.

    [[format fairie was here]]

    // Here is a sample of the generic simple blit functor:

    /** Simple copy from source to dest.
     */

    template<typename T_SRC, typename T_DST>
    struct simpleCopyFunc : std::binary_function
    {
        void operator() (T_SRC src, T_DST *dst) const
        {
            *dst = src;
        }
    };

    // And the generic transparent blit functor:

    /** Writes source to destination if source is not of a given value.
     */

    template<typename T_SRC, typename T_DST>
    struct transparentCopyFunc : std::binary_function
    {
        T_SRC trans_;
        transparentCopyFunc(T_SRC trans=0) :
                trans_(trans)
        {}
        void inline operator() (T_SRC src, T_DST *dst) const
        {
            if (src != trans_)
                *dst = src;
        }
    };

    // The paletted transparent blit functor (no need to be templated actually):

    struct palettedTransparentCopyFunc
    {
        const u16 *pal_;
        u8 trans_;
        palettedTransparentCopyFunc(const u16 *pal, u8 trans=0) :
                pal_(pal),
                trans_(trans)
        {}

        void operator() (u8 src, u16 *&amp;dst) const
        {
            if (src != trans_)
                *dst = 0x8000 | pal_[src]; // Note: the 'OR' can be avoided if palette has all hi-order bits set.
        }
    };

    // sample call for paletted blitting:
    blit(srcAddr, srcPitch,
        dstAddr, dstPitch,
        srcX, srcY, width, height,
        dstX, dstY,
        palettedTransparentCopyFunc(palAddr, transColor));

    These are just samples... but you can see that adding effects is as simple as adding a new functor (just copy-paste the simple functor, rename it, and do what is needed for each pixel). Then when you need to blit anything, you simply call the templated blitter function with appropriate arguments (source & destination addresses, source rectangle, destination position (upper-left), source & dest pitches, and functor). Let's be clear: no need for templates to do that, function pointers would work as well (no inline though). I used templates mainly for genericity, so to centralize the blitter logic, nothing else.

    The overall code is fairly portable (only u8, u16 should be replaced by uint16_t etc... for better portability). I can share the template code with you if you have an interest. I would be great to have the power of templates alongside well optimized inner loops. So if someday you wanna cross the C++ lines for your blitting stuffs (besides simply being able to overload functions), do not hesitate to contact me. I'd be happy to contribute.

  2. Gah, it looks like the WP sanitizer managed some of the code there (removing the template arguments and such); If you don't mind, I'll try to clean it up a little.

    I must say that this is a very interesting approach, one that I probably should look into in the future. With the right kinds of templates, this could make blitting much easier.

    I do think optimization has to be considered for large-scale copies such as blits, though. Pixel for pixel copying can be much slower than a hand-assembled asm block transfer routine and because so many pixels have to be considered it can become a bottleneck. However, for more complex functionality like pal->>16bpp and transparent blits, templates can indeed be very helpful.

  3. Damn, you're right, HTML messed with template syntax (due to ''). Well, I should have used tags for the code, but TBH, I don't know much about HTML :) But you can easily restore them: T_SRC & T_DST are just 'typename' arguments (or 'class', as you wish), and the std::binary_function inheritance can simply be removed (they don't hurt and they're a good habit for STL algorithms which I don't use here anyway). Still, it's not really worth it, since I did not post the core templates code. My first main goal was for the reader to get the point and have some feedbacks.

    Let's just say we agree for speed. After all, this is the main purpose of your article and I wouldn't have commented here if I felt it was not that significant.

Leave a Reply

Your email address will not be published. Required fields are marked *