The standard C functions for copying and filling are memcpy()
and memset(). They're part of the standard library, are easy
to use and are often implemented with some optimizations so that they're
usually faster than manual looping. The DKA version, for example will fill as
words if the alignments and sizes allow for it. This can be much
faster than doing the loops yourself.
There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.
The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.
The code
My main goal here was to create smallish and portable replacements, not to
have the greatest and fastestest code around because that's rather platform
dependent. Yes, even the difference between GBA and NDS should matter,
because of the differences in ldr/str times and caching.
There are 5 functions here. The main functions here are
tonccpy and __toncset for copying and
filling words, respectively. The other 3 are interfaces for __toncset
for filling 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color
instead of 8-bit data. For the rest of the discussion, I will use the name
“toncset” for the internal routine for convenience.
typedef unsigned int uint;
#define BIT_MASK(len) ( (1<<(len))-1 )
static inline u32 quad8(u8 x) { x |= x<<8; return x | x<<16; }
//# Declarations and inlines.
void tonccpy(void *dst, const void *src, uint size);
void __toncset(void *dst, u32 fill, uint size);
static inline void toncset(void *dst, u8 src, uint size);
static inline void toncset16(void *dst, u16 src, uint size);
static inline void toncset32(void *dst, u32 src, uint size);
//! VRAM-safe memset, byte version. Size in bytes.
static inline void toncset(void *dst, u8 src, uint size)
{ __toncset(dst, quad8(src), size); }
//! VRAM-safe memset, halfword version. Size in hwords.
static inline void toncset16(void *dst, u16 src, uint size)
{ __toncset(dst, src|src<<16, size*2); }
//! VRAM-safe memset, word version. Size in words.
static inline void toncset32(void *dst, u32 src, uint size)
{ __toncset(dst, src, size*4); }
//! VRAM-safe cpy.
/*! This version mimics memcpy in functionality, with
the benefit of working for VRAM as well. It is also
slightly faster than the original memcpy, but faster
implementations can be made.
\param dst Destination pointer.
\param src Source pointer.
\param size Fill-length in bytes.
\note The pointers and size need not be word-aligned.
*/
void tonccpy(void *dst, const void *src, uint size)
{
if(size==0 || dst==NULL || src==NULL)
return;
uint count;
u16 *dst16; // hword destination
u8 *src8; // byte source
// Ideal case: copy by 4x words. Leaves tail for later.
if( ((u32)src|(u32)dst)%4==0 && size>=4)
{
u32 *src32= (u32*)src, *dst32= (u32*)dst;
count= size/4;
uint tmp= count&3;
count /= 4;
// Duff, bitch!
switch(tmp) {
do { *dst32++ = *src32++;
case 3: *dst32++ = *src32++;
case 2: *dst32++ = *src32++;
case 1: *dst32++ = *src32++;
case 0: ; } while(count--);
}
// Check for tail
size &= 3;
if(size == 0)
return;
src8= (u8*)src32;
dst16= (u16*)dst32;
}
else // Unaligned.
{
uint dstOfs= (u32)dst&1;
src8= (u8*)src;
dst16= (u16*)(dst-dstOfs);
// Head: 1 byte.
if(dstOfs != 0)
{
*dst16= (*dst16 & 0xFF) | *src8++<<8;
dst16++;
if(--size==0)
return;
}
}
// Unaligned main: copy by 2x byte.
count= size/2;
while(count--)
{
*dst16++ = src8[0] | src8[1]<<8;
src8 += 2;
}
// Tail: 1 byte.
if(size&1)
*dst16= (*dst16 &~ 0xFF) | *src8;
}
//! VRAM-safe memset, internal routine.
/*! This version mimics memset in functionality, with
the benefit of working for VRAM as well. It is also
slightly faster than the original memset.
\param dst Destination pointer.
\param fill Word to fill with.
\param size Fill-length in bytes.
\note The \a dst pointer and \a size need not be
word-aligned. In the case of unaligned fills, \a fill
will be masked off to match the situation.
*/
void __toncset(void *dst, u32 fill, uint size)
{
if(size==0 || dst==NULL)
return;
uint left= (u32)dst&3;
u32 *dst32= (u32*)(dst-left);
u32 count, mask;
// Unaligned head.
if(left != 0)
{
// Adjust for very small stint.
if(left+size<4)
{
mask= BIT_MASK(size*8)<<(left*8);
*dst32= (*dst32 &~ mask) | (fill & mask);
return;
}
mask= BIT_MASK(left*8);
*dst32= (*dst32 & mask) | (fill&~mask);
dst32++;
size -= 4-left;
}
// Main stint.
count= size/4;
uint tmp= count&3;
count /= 4;
switch(tmp) {
do { *dst32++ = fill;
case 3: *dst32++ = fill;
case 2: *dst32++ = fill;
case 1: *dst32++ = fill;
case 0: ; } while(count--);
}
// Tail
size &= 3;
if(size)
{
mask= BIT_MASK(size*8);
*dst32= (*dst32 &~ mask) | (fill & mask);
}
}
Discussion
Both tonccpy and toncset have the
following structure: the destination memory is divided into an
unaligned head, followed by an aligned main stint
and then a tail for any trailing bytes. In the optimal case there
is no head (i.e., the destination (and source) are word-aligned) and perhaps
no tail either (dstv+size is aligned). If that happens you're in
luck: a 4x unrolled word copier will blaze through memory. And yes, that is a
Duff's Device I'm
using there; I know it's evil, but I've always wanted to try one. Interestingly,
it is also the only way I've been able to get DKA to create the optimal output
for an unrolled loop.
When you aren't lucky, you'll have to go through the motions of masking off
halfwords and/or words to insert the bytes. Technically it's faster to use
words rather than halfwords, but this can be extremely unpleasant because
it's also possible for the transfer to be in the middle of a word. For example,
dst%4== 1 and size = 2 would require a
dst-mask of 0xFF0000FF. I still have this in
toncset, but for the copier you have to deal with such
annoying cases that I preferred to go with the simplicity of halfwords for now.
Tests
I've tested the routines with all alignment variations and many different sizes
and found them to work accurately. I've also tested them for speed. The
following graphics show the results of the tests; while reading, remember
that all items have been compiled with -O2 -mthumb and are
in ROM. The source data for the copiers was in EWRAM and the destination
was VRAM.
Fig 1a and Fig 1b show
tonccpy(), memcpy() and something called
vramcpy. vramcpy() is a more optimized
version of tonccpy(), but it's not ready yet. It uses
memcpy32() for the aligned main stint, so that should
represent optimal copying speed. The “a” and “u”
affixes mean aligned and unaligned pointers, respectively.
As you can see, even though tonccpy() is more complicated
than memcpy(), it's a little faster in almost all cases. Only for short stints is memcpy() faster because
tonccpy() has a larger overhead. The vramcpy()
lines show that there is much room for optimization for longer stints. Using
a dedicated word copier (memcpy32(), DMA) would help,
but I wanted tonccpy() to be portable and self-sufficient.
You can also see the incredible differences between when the source and destination have equal alignments (a) and when they don't (u). I know it is possible to speed up the unaligned parts by 50% to 100% or so as well, but I haven't quite zeroed in on the right solution yet.

Fig 1a: copier comparisons, long stints. a: src=dst alignment; u: unaligned.

Fig 1b: copier comparisons, short stints. a: src=dst alignment; u: unaligned.
The results of the fillers are in fig 2a and
fig 2b. These pictures show
toncset(), memset() and
toncset2(). toncset2() is essentially
toncset() using memset32() for the
main stint. Because there's no source alignment to worry about,
there is no chance for src-dst misalignments. This is why there
is little difference between the aligned and unaligned cases for the
toncset() variants; only memset()
is very slow in the unaligned case.

Fig 2a: fill comparisons, long stints. a: dst = aligned; u: unaligned.

Fig 2b: fill comparisons, short stints. a: dst = aligned; u: unaligned.
Lastly, the numbers for the transfer-rate and the overhead for calling them. Please remember that the actual numbers depend very much on what the waitstates are for the source and destination and where the code resides. That said, the figures should be useful for relative comparisons. Also, these are GBA timings; the NDS figures will be different, but I'll test for them when I get round to it.
|
|
Conclusions
tonccpy() and toncset() are essentially
variations of memcpy() and memset() that
also work for GBA/NDS VRAM even in the worse circumstances. They're
actually a little faster as well, so I can recommend using them instead of
memcpy/set in all cases. This is not to say that they are
the optimal solutions: faster general solutions certainly exist, but they
will be longer, hairier and probably less portable. Faster non-general
solutions exist as well, of course. If you know your pointers and sizes
will be word-aligned, consider DMA32, CpuFastSet or the memcpy32/set32
routines from tonclib.