# GCC 4.7 vs ARM assembly

In a recent bout of ‘They Changed It, Now It Sucks’, GCC 4.7 won't add interworking instructions to ARM assembly code unless you add a .type directive to the function boilerplate. Following some code from devkitARM 38 announcement:

@ section directive goes here.
@ code dirctive goes here.
.align  2
.global foo
.type   foo STT_FUNC                @ Add this line right here
foo:
@ etc

Though the change is simple, it does effectively mean that basically all older assembly code is now b0rked, including tonc and many of the ASM functions I've written about here. I hope I can assess the damage this weekend and fix everything.

#### UPDATE 2012-05-20

Had some time to work on this. When I first saw that I was going to have to do this, I though ‘Oh naw, I'm going to have to update every single ASM function I have?’ But then I noticed that I had already done so, more or less. One of the things I realized some time after the 1.4 release was that copy-pasting the standard boiler-plate code for ASM functions was stupid, so I created macros to do that instead. I was pleasantly surprised to see that I had already upgraded all tonclib functions with that macro, so all I had to do was add the single line in a single place and recompile. Fuck yeah, macros.

The only troublesome thing was that the inline ASM I used in swi_demo failed horrifically, so I just removed those.

The current status is that all the code's okay again, but the text parts still need to be updated.

# DMA vs ARM9, round 2 : invalidate considered harmful

It would seem these two aren't finished with each other yet.

A while ago, I wrote an article about NDS caching , how it can interfere with DMA transfers and what you can do about them. A little later I got a pingback from ant512, who had tried the “safe” DMA routines I made and found they weren't nearly as safe as I'd hoped. I'm still not sure what the actual problem was, but this incident did make me think about one possible reason, namely the one that will be discussed in this post: problematic cache invalidation.

### 1 Test base

But first things first. Let's start with some simple test code, see below. We have a simple struct definition, two arrays using this struct, and some default data for both arrays that we'll use later.

// A random struct, 32-bits in size.
struct Foo
{
u8  type;
u8  id;
u16 data;
} ALIGN(4);

// Define some globals. We only use 4 of each.
Foo g_src[16] ALIGN(32);
Foo g_dst[16] ALIGN(32);

const Foo c_fooIn[2][4]=
{
{   // Initial source data.
{ 0x55, 0, 0x5111 },
{ 0x55, 1, 0x5111 },
{ 0x55, 2, 0x5111 },
{ 0x55, 3, 0x5111 }
},
{   // Initial destination data.
{ 0xDD, 0, 0xD111 },
{ 0xDD, 1, 0xD111 },
{ 0xDD, 2, 0xD111 },
{ 0xDD, 3, 0xD111 }
},
};

And now we're going to do some simple things with these arrays that we always do: some reads, some writes, and a struct copy. And for the copying, I'm going to use DMA, because DMA-transfers are fast, amirite(1)? The specific actions I will do are the following:

##### Initialization
• Zero out g_src and g_dst.
• Initialize the arrays with some data, in this case from c_fooIn.
• Cache-Flush both arrays to ensure they're uncached.
##### Testing
• Modify element in g_src, namely g_src[0].
• Modify an element in g_dst, namely g_dst[3].
• DMA-copy g_src[0] to g_dst[3].

In other words, this:

void test_init()
{
// Zero out everything
memset(g_src, 0, sizeof(g_src));
memset(g_dst, 0, sizeof(g_dst));

// Fill 4 of each.
for(int i=0; i<4; i++)
{
g_src[i]= c_fooIn[0][i];
g_dst[i]= c_fooIn[1][i];
}

// Flush data to be sure.
DC_FlushRange(g_src, sizeof(g_src));
DC_FlushRange(g_dst, sizeof(g_dst));
}

void test_dmaCopy()
{
test_init();

// Change g_src[0] and g_dst[3]
g_src[0].id += 0x10;
g_src[0].data= 0x5222;

g_dst[3].id += 0x10;
g_dst[3].data= 0xD333;

// DMA src[0] into dst[0];
dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));
}

Note that there is nothing spectacularly interesting going on here. There's just your average struct definition, run of the mill array definitions, and boring old accesses without even any pointer magic that might hint at something tricky going on. Yes, alignment is forced, but that just makes the test more reliable. Also, the fact that I'm incrementing Foo.id rather than just reading from it is only because ARM9 cache is read-allocate, and I need to have these things end up in cache. The main point is that the actions in test_dmaCopy() are very ordinary.

### 2 Results

It should be obvious what the outcome of the test should be. However, when you run the test (on hardware! not emulator), the result seems to be something different.

// Just dmaCopy.

// Result           // Expected:
// Source (hex)
55, 10, 5222        // 55, 10, 5222
55, 01, 5111        // 55, 01, 5111
55, 02, 5111        // 55, 02, 5111
55, 03, 5111        // 55, 03, 5111

// Destination (hex)
DD, 00, D111        // 55, 10, 5222 (bad!)
DD, 01, D111        // DD, 01, D111
DD, 02, D111        // DD, 02, D111
DD, 13, D333        // DD, 13, D333

Notice that the changed values of g_src[0] never end up in g_dst[0]. Not only that, not even the original values g_src[0] have been copied. It's as if the transfer never happened at all.

The reason for this was covered in detail in the original article. Basically, it's because cache is invisible to DMA. Once a part of memory is cached, the CPU only looks to the contents of the cache and not the actual addresses, meaning that DMA not only reads out-of-date (stale) source data, but also puts it where the CPU won't look. Two actions allow you to remedy this. The first is the cache flush, which write the cache-lines back to RAM and frees the cache-line. Then there's cache invalidate, which just frees the cache-line. Note that in both cases, the cache is dissociated from memory.

With this information, it should be obvious what to do. When DMA-ing from RAM, you need to flush the cache before the transfer to update the source's memory. When DMA-ing to RAM, you need to invalidate after the transfer because now it's actually the cache's data that's stale. When you think about it a little this makes perfect sense, and it's easy enough to implement:

// New DMA-code:
DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.
DC_InvalidateRange(&g_dst[0], sizeof(Foo));     // Invalidate destination.

Unfortunately, this doesn't work right either. And if you think about it a lot instead of merely a little, you'll see why. Maybe showing the results will make you see what I mean. The transfer seems to work now, but the earlier changes to g_dst[3] have been erased. How come?

// Result:          // Expected:
// Source (hex)
55, 10, 5222        // 55, 10, 5222
55, 01, 5111        // 55, 01, 5111
55, 02, 5111        // 55, 02, 5111
55, 03, 5111        // 55, 03, 5111

// Destination (hex)
55, 10, D222        // 55, 10, 5222
DD, 01, D111        // DD, 01, D111
DD, 02, D111        // DD, 02, D111
DD, 13, D111        // DD, 13, D333 (wut?)

The problem is that a cache-invalidate invalidates entire cache-lines, not just the range you supply. If the start or end of the data you want invalidate does not align to a cache-line, the adjacent data contained in that line is also thrown away. I hope you can see that this is bad.

This is exactly what's happening here. The ARM9's cache-lines are 32 bytes in size. Because of the alignment I gave the arrays, elements 0 through 3 lie on the same cache-line. The changes to g_dst[3] occur in cache (but only because I read from it through +=). The invalidate of g_dst[0] also invalidates g_dst[3], which throws out the perfectly legit data and you're left in a bummed state. And again, I've done nothing spectacularly interesting here; all I did was modify something and then invalidated data that just happened to be adjacent to it. But that was enough. Very, very bad.

Just to be sure, this is not due to a bad implementation of DC_InvalidateRange(). The function does exactly what it's supposed to do. The problem is inherent in the hardware. If your data does not align correctly to cache-lines, an invalidate will apply to the adjacent data as well. If you do not want that to happen, do not invalidate.

### 3 Solutions

So what to do? Well, there is one thing, but I'm not sure how foolproof this is, but instead of invalidating the destination afterwards, you can also flush it before the transfer. This frees up the cache-lines without loss of data, and then it should be safe to DMA-copy to it.

DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
DC_FlushRange(&g_dst[0], sizeof(Foo));          // Flush destination.
dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.
// Result:          // Expected:
// Source (hex)
55, 10, 5222        // 55, 10, 5222
55, 01, 5111        // 55, 01, 5111
55, 02, 5111        // 55, 02, 5111
55, 03, 5111        // 55, 03, 5111

// Destination (hex)
55, 10, D222        // 55, 10, 5222
DD, 01, D111        // DD, 01, D111
DD, 02, D111        // DD, 02, D111
DD, 13, D333        // DD, 13, D333

// Yay \o/

Alternatively, you can also disable the underlying reason behind the problem with invalidation: the write-buffer. The ARM9 cache allows two modes for writing: write-through, which also updates the memory related to the cache-line; and write-back, which doesn't. Obviously, the write-back is faster, so that's how libnds sets things up. I know that putting the cache in write-through mode fixes this problem, because in libnds 1.4.0 the write-buffer had been accidentally disabled and my test cases didn't fail. This is probably not the route you want to take, though.

### 4 Conclusions

So what have we learned?

• Cache - DMA interactions suck and can cause really subtle bugs. Ones that will only show up on hardware too.
• Cache-flushes and invalidates cover the cache-lines of the requested ranges, which exceed the range you actually wanted.
• To safely DMA from cachable memory, flush the source range first.
• Contrary to what I wrote earlier, to DMA to cachable memory, do not cache-invalidate – at least not when the range is not properly aligned to cache-lines. Instead, flush the destination range before the transfer (at which time invalidation should be unnecessary). That said, invalidate should still be safe if the write-buffer is disabled.

##### Notes:
1. No I'm not. For NDS WRAM-WRAM copies, DMA is actually slow as hell and outperformed by every other method. But hopefully more on that later. For now, though, I need the DMA for testing purposes.

# Some new notes on NDS code size

When I discussed the memory footprints of several C/C++ elements, I apparently missed a very important item: operator new and related functions. I assumed new shouldn't increase the binary that much, but boy was I wrong.

The short story is that officially new should throw an exception when it can't allocate new memory. Exceptions come with about 60 kb worth of baggage. Yes, this is more or less the same stuff that goes into vector and string.

The long story, including a detailed look at a minimal binary, a binary that uses new and a solution to the exception overhead (in this particular case anyway) can be read below the fold.

# Signs from Hell

The integer datatypes in C can be either signed or unsigned. Sometimes, it's obvious which should be used; for negative values you clearly should use signed types, for example. In many cases there is no obvious choice – in that case it usually doesn't matter which you use. Usually, but not always. Sometimes, picking the wrong kind can introduce subtle bugs in your program that, unless you know what to look out for, can catch you off-guard and have you searching for the problem for hours.

I've mentioned a few of these occasions in Tonc here and there, but I think it's worth going over them again in a little more detail. First, I'll explain how signed integers work and what the difference between signed and unsigned and where potential problems can come from. Then I'll discuss some common pitfalls so you know what to expect.

# Another fast fixed-point sine approximation

So here I am, looking forward to a nice quiet weekend; hang back, watch some telly and maybe read a bit – but NNnnneeeEEEEEUUUuuuuuuuu!! Someone had to write an interesting article about sine approximation. With a challenge at the end. And using an inefficient kind of approximation. And so now, instead of just relaxing, I have to spend my entire weekend and most of the week figuring out a better way of doing it. I hate it when this happens >_<.

Okay, maybe not.

Sarcasm aside, it is an interesting read. While the standard way of calculating a sine – via a look-up table – works and works well, there's just something unsatisfying about it. The LUT-based approach is just … dull. Uninspired. Cowardly. Inelegant. In contrast, finding a suitable algorithm for it requires effort and a modicum of creativity, so something like that always piques my interest.

In this case it's sine approximation. I'd been wondering about that when I did my arctan article, but figured it would require too many terms to really be worth the effort. But looking at Mr Schraut's post (whose site you should be visiting from time to time too; there's good stuff there) it seems you can get a decent version quite rapidly. The article centers around the work found at devmaster thread 5784, which derived the following two equations:

 (1) \begin{array}{} S_2(x) &=& \frac4\pi x - \frac4{\pi^2} x^2 \\ \\ S_{4d}(x) &=& (1-P)S_2(x) + P S_2^2(x) \end{array}

These approximations work quite well, but I feel that it actually uses the wrong starting point. There are alternative approximations that give more accurate results at nearly no extra cost in complexity. In this post, I'll derive higher-order alternatives for both. In passing, I'll also talk about a few of the tools that can help analyse functions and, of course, provide some source code and do some comparisons.

# DMA vs ARM9 - fight!

DMA, or Direct Memory Access, is a hardware method for transferring data. As it's hardware-driven, it's pretty damn fast(1). As such, it's pretty much the standard method for copying on the NDS. Unfortunately, as many people have noticed, it doesn't always work.

There are two principle reasons for this: cache and TCM. These are two memory regions of the ARM9 that DMA is unaware of, which can lead to incorrect transfers. In this post, I'll discuss the cache, TCM and their interactions (or lack thereof) with DMA.

The majority of the post is actually about cache. Cache basically determines the speed of your app, so it's worth looking into in more detail. Why it and DMA don't like each other much will become clear along the way. I'll also present a number of test cases that show the conflicting areas, and some functions to deal with these problems.

##### Notes:
1. Well, quite fast anyway. In some circumstances CPU-based transfers are faster, but that's a story for another day.

Okay. Apparently, I am an idiot who can't do math.

One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. The money shot for in terms of code is the following functions, which calculates the affine parameters of the background for each scanline in section 21.7.3.

IWRAM_CODE void m7_prep_affines(M7_LEVEL *level)
{
if(level->horizon >= SCREEN_HEIGHT)
return;

int ii, ii0= (level->horizon>=0 ? level->horizon : 0);

M7_CAM *cam= level->camera;
FIXED xc= cam->pos.x, yc= cam->pos.y, zc=cam->pos.z;

BG_AFFINE *bga= &level->bgaff[ii0];

FIXED yb, zb;           // b' = Rx(theta) *  (L, ys, -D)
FIXED cf, sf, ct, st;   // sines and cosines
FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)
cf= cam->u.x;      sf= cam->u.z;
ct= cam->v.y;      st= cam->w.y;
for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
{
yb= (ii-M7_TOP)*ct + M7_D*st;
lam= DivSafe( yc<<12,  yb);     // .12f    <- OI!!!

lcf= lam*cf>>8;                 // .12f
lsf= lam*sf>>8;                 // .12f

bga->pa= lcf>>4;                // .8f
bga->pc= lsf>>4;                // .8f

// lambda·Rx·b
zb= (ii-M7_TOP)*st - M7_D*ct;   // .8f
bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f

// hack that I need for fog. pb and pd are unused anyway
bga->pb= lam;
bga++;
}
level->bgaff[SCREEN_HEIGHT]= level->bgaff[0];
}

For details on what all the terms mean, go the page in question. For now, just note that call to DivSafe() to calculate the scaling factor λ and recall that division on the GBA is pretty slow. In Mode 7 part 1, I used a LUT, but here I figured that since the yb term can be anything thanks to the pitch you can't do that. After helping Ruben with his mode 7 demo, it turns out that you can.

Fig 1. Sideview of the camera and floor. The camera is tilted slightly down by angle θ.

Fig 1 shows the situation. There is a camera (the black triangle) that is tilted down by pitch angle θ. I've put the origin at the back of the camera because it makes things easier to read. The front of the camera is the projection plane, which is essentially the screen. A ray is cast from the back of the camera on to the floor and this ray intersects the projection plane. The coordinates of this point are xp = (yp, D) in projection plane space, which corresponds to point (yb, zb) in world space. This is simply rotating point xp by θ. The scaling factor is the ratio between the y or z coordinates of the points on the floor and on the projection plane, so that's:

 $\lambda = y_c / y_b,$

and for yb the rotation gives us:

 $y_b = y_p cos \theta + D sin \theta,$

where yc is the camera height, yp is a scanline offset (measured from the center of the screen) and D is the focus length.

Now, the point is that while yb is variable and non-integral when θ ≠ 0, it is still bounded! What's more, you can easily calculate its maximum value, since it's simply the maximum length of xp. Calling this factor R, we get:

 $R = \sqrt{max(y_p)^2 + D^2}$

This factor R, rounded up, is the size of the required LUT. In my particular case, I've used yp= scanline−80 and D = 256, which gives R = sqrt((160−80)² + 256²) = 268.2. In other words, I need a division LUT with 269 entries. Using .16 fixed point numbers for this LUT, the replacement code is essentially:

// The new division LUT. For 1/0 and 1/1, 0xFFFF is used.
u16 m7_div_lut[270]=
{
0xFFFF, 0xFFFF, 0x8000, 0x5556, ...
};

// Inside the function
for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
{
yb= (ii-M7_TOP)*ct + M7_D*st;           // .8
lam= (yc*m7_div_lut[yb>>8])>>12;        // .8*.16/.12 = .12

}

At this point, several questions may arise.

• What about negative yb? The beauty here is that while yb may be negative in principle, such values would correspond to lines above the horizon and we don't calculate those anyway.
• Won't non-integral yb cause inaccurate look-ups? True, yb will have a fractional part that is simply cut off during a simple look-up and some sort of interpolation would be better. However, in testing there were no noticeable differences between direct look-up, lerped look-up or using Div(), so the simplest method suffices.
• Are .16 fixed point numbers enough?. Yes, apparently so.
• ZOMG OVERFLOW! Are .16 fixed point numbers too high? Technically, yes, there is a risk of overflow when the camera height gets too high. However, at high altitudes the map is going to look like crap anyway due to the low resolution of the screen. Furthermore, the hardware only uses 8.8 fixeds, so scales above 256.0 wouldn't work anyway.

And finally:

• What do I win? With Div() m7_prep_affines() takes about 51k cycles. With the direct look-up this reduces to about 13k: a speed increase by a factor of 4.

So yeah, this is what I should have figured out years ago, but somehow kept overlooking it. I'm not sure if I'll add this whole thing to Tonc's text and code, but I'll at least put up a link to here. Thanks Ruben, for showing me how to do this properly.

# To C or not to C

Tonclib is coded mostly in C. The reason for this was twofold. First, I still have it in my head that C is lower level than C++, and that the former would compile to faster code; and faster is good. Second, it's easier for C++ to call C than the other way around so, for maximum compatibility, it made sense to code it in C. But these arguments always felt a little weak and now that I'm trying to port tonclib's functions to the DS, the question pops up again.

On many occasions, I just hated not going for C++. Not so much for its higher-level functionality like classes, inheritance and other OOPy goodness (or badness, some might say), but more because I would really, really like to make use of things like function overloading, default parameters and perhaps templates too.

For example, say you have a blit routine. You can implement this in multiple ways: with full parameters (srcX/Y, dstX/Y, width/height), using Point and Rect structs (srcRect, dstPoint) or perhaps just a destination point, using the full source-bitmap. In other words:

void blit(Surface *dst, int dstX, int dstY, int srcW, int srcH, Surface *src, int srcX, int srcY);
void blit(Surface *dst, Point *dstPoint, Surface *src, Rect *srcRect);
void blit(Surface *dst, Point *dstPoint, Surface *src);

In C++, this would be no problem. You just declare and define the functions and the compiler mangles the names internally to avoid naming conflicts. You can even make some of the functions inline facades that morphs the arguments for the One True Implementation. In C, however, this won't work. You have to do the name mangling yourself, like blit, blit2, blit3, or blitEx or blitRect, and so on and so forth. Eeghh, that is just ugly.

Speaking of points and rectangles, that's another thing. Structs for points and rects are quite useful, so you make one using int members (you should always start with ints). But sometimes it's better to have smaller versions, like shorts. Or maybe unsigned variations. And so you end up with:

struct point8_t   { s8  x, y; };   // Point as signed char
struct point16_t  { s16 x, y; };   // Point as signed short
struct point32_t  { s32 x, y; };   // Point as signed int

struct upoint8_t  { u8  x, y; };   // Point as unsigned char
struct upoint16_t { u16 x, y; };   // Point as unsigned short
struct upoint32_t { u32 x, y; };   // Point as unsigned int

And then that for rects too. And perhaps 3D vectors. And maybe add floats to the mix as well. This all requires that you make structs which are identical except for the primary datatype. That just sounds kinda dumb to me.

But wait, it gets even better! You might like to have some functions to go with these structs, so now you have to create different sets (yes, sets) of functions that differ only by their parameter types too! AAAARGGGGHHHHH, for the love of IPU, NOOOOOOOOOOOOOO!!! Neen, neen; driewerf neen! >_<

That's how it would be in C. In C++, you can just use a template like so:

template<class T>
struct point_t  { T x, y; };    // Point via templates

typedef point_t<u8> point8_t;   // And if you really want, you can
// typedef for specific types.

and be done with it. And then you can make a single template function (or was it function template, I always forget) that works for all the datatypes and let the compiler work it out. Letting the computer do the work for you, hah! What will they think of next.

Oh, and there's namespaces too! Yes! In C, you always have to worry about if some other library has something with the same name as you're thinking of using. This is where all those silly prefixes come from (oh hai, FreeImage!). With C++, there's a clean way out of that: you can encapsulate them in a namespace and when a conflict arises you can use mynamespace::foo to get out of it. And if there's no conflicts, use using namespace mynamespace; and type just plain foo. None of that FreeImage_foo causing you to have more prefix than genuine function identifier.

And [i]then[/i] there's C++ benefits like classes and everything that goes with it. Yes, classes can become fiendishly difficult if pushed too far(1), but inheritance and polymorphism are nice when you have any kind of hierarchy in your program. All Actors have positions, velocities and states. But a PlayerActor also needs input; and an NpcActor has AI. And each kind of NPC has different methods for behaviour and capabilities, and different Items have different effects and so on. It's possible to do this in just C (hint: unioned-structs and function-tables and of course state engines), but whether you'd want to is another matter. And there's constructors for easier memory management, STL and references. And, yes, streams, exceptions and RTTI too if you want to kill your poor CPU (regarding GBA/DS I mean), but nobody's forcing you to use those.

So why the hell am I staying with C again? Oh right, performance!

Performance, really? I think I heard this was a valid point a long time ago, but is it still true now? To test this, I turned all tonclib's C files into C++ files, compiled again and compared the two. This is the result:

Difference in function size between C++ and C in bytes.

That graph shows the difference in the compiled function size. Positive means C++ uses more instructions. In nearly 300 functions, the only differences are minor variations in irq_set(), some of the rendering routines and TTE parsers, and neither language is the clear winner. Overall, C++ seems to do a little bit better, but the difference is marginal.

I've also run a diff between the generated assembly. There are a handful of functions where the order of instructions are different, or different registers are used, or a value is placed in a register instead of on the stack. That's about it. In other words, there is no significant difference between pure C code and its C++ equivalent. Things will probably be a little different when OOP features and exceptions enter the fray, but that's to be expected. But if you stay close to C-like C++, the only place you'll notice anything is in the name-mangling. Which you as a programmer won't notice anyway because it all happens behind the scenes.

So that strikes performance off my list, leaving only wider compatibility. I suppose that has still some merit, but considering you can turn C-code into valid C++ by changing the extension(2), this is sound more and more like an excuse instead of a reason.

##### Notes:
1. As the saying goes: C++ makes it harder to shoot yourself in the foot, but when you do, you blow off your whole leg.
2. and clean up the type issues that C allows but C++ doesn't, like void* arithmetic and implicit pointer casts from void*.

# Fiddle to the bittle

I've added two new routines to the bit-trick page: 1→4 bit-unpack with reverse and bit reversals. This last one is elegant … except for one bit of C tomfoolery that is required to get GCC to produce the right ARM code. I hope to discuss this in more detail later.

I've also added a new document about dealing with bitfields. It explains what to do with them, gives a few useful functions to get and set bitfields, and demonstrates how to use the C construct for bitfields. It also touches briefly on a nasty detail in the way GCC implements bitfield that can cause them to fail in certain GBA/NDS memory sections. If you're using bitfields to map VRAM or OAM, please read.

# Surface drawing routines.

I've been building a basic interface for dealing with graphic surfaces lately. I already had most of the routines for 16bpp and 8bpp bitmaps in older Toncs, but but their use was still somewhat awkward because you had to provide some details of the destination manually; most notably a base pointer and the pitch. This got more than a little annoying, especially when trying to make blitters as well. So I made some changes.

typedef struct TSurface
{
u8  *data;      //!< Surface data pointer.
u32 pitch;      //!< Scanline pitch in bytes (PONDER: alignment?).
u16 width;      //!< Image width in pixels.
u16 height;     //!< Image width in pixels.
u8  bpp;        //!< Bits per pixel.
u8  type;       //!< Surface type (not used that much).
u16 palSize;    //!< Number of colors.
u16 *palData;   //!< Pointer to palette.
} TSurface;

I've rebuilt the routines around a surface description struct called TSurface (see above). This way, I can just initialize the surface somewhere and just pass the pointer to that surface around. There are a number of different kinds of surfaces. The most important ones are these three:

• bmp16. 16bpp bitmap surfaces.
• bmp8. 8bpp bitmap surfaces.
• chr4c. 4bpp tiled surfaces, in column-major order (i.e., tile 1 is under tile 0 instead of to the right). Column-major order may seem strange, but it actually simplifies the code considerably. There is also a chr4r mode for normal, row-major tiling, but that's unfinished and will probably remain so.

Demonstrating surface routines for 4bpp tiles.

For each of these three, I have the most important rendering functions: plotting pixels, lines, rectangles and blits. Yes, blits too. Even for chr4c-mode. There are routines for frames (empty rectangles) and floodfill as well. The functions have a uniform interface with respect to surface-type, so switching between them should be easy were it necessary. There are also tables with function pointers to these routines, so by using those you need not really care about the details of the surface after its creation. I'll probably add a pointer to such a table in TSurface in the future.

The image on the right is the result of the following routine. Turret pic semi-knowingly provided by Kawa.

void test_surface_procs(const TSurface *src, TSurface *dst,
const TSurfaceProcTab *procs, u16 colors[])
{
// Init object text
tte_init_obj(&oam_mem[127], ATTR0_TALL, ATTR1_SIZE_8, 512,
CLR_YELLOW, 0, &vwf_default, NULL);
tte_init_con();
tte_set_margins(8, 140, 160, 152);

// And go!
tte_printf("#{es;P}%s surface primitives#{w:60}", procs->name);

tte_printf("#{es;P}Rect#{w:20}");
procs->rect(dst, 20, 20, 100, 100, colors[0]);

tte_printf("#{w:30;es;P}Frame#{w:20}");
procs->frame(dst, 21, 21, 99, 99, colors[1]);

tte_printf("#{w:30;es;P}Hlines#{w:20}");

procs->hline(dst, 23, 23, 96, colors[2]);
procs->hline(dst, 23, 96, 96, colors[2]);

tte_printf("#{w:30;es;P}Vlines#{w:20}");
procs->vline(dst, 23, 25, 94, colors[3]);
procs->vline(dst, 96, 25, 94, colors[3]);

tte_printf("#{w:30;es;P}Lines#{w:20}");
procs->line(dst, 25, 25, 94, 40, colors[4]);
procs->line(dst, 94, 25, 79, 94, colors[4]);
procs->line(dst, 94, 94, 25, 79, colors[4]);
procs->line(dst, 25, 94, 40, 25, colors[4]);

tte_printf("#{w:30;es;P}Full blit#{w:20}");
procs->blit(dst, 120, 16, src->width, src->height, src, 0, 0);

tte_printf("#{w:30;es;P}Partial blit#{w:20}");
procs->blit(dst, 40, 40, 40, 40, src, 12, 8);

tte_printf("#{w:30;es;P}Floodfill#{w:20}");
procs->flood(dst, 40, 32, colors[5]);
tte_printf("#{w:30;es;P}Again !#{w:20}");
procs->flood(dst, 40, 32, colors[6]);

tte_printf("#{w:30;es;P;w:30}Ta-dah!!!#{w:20}");

key_wait_till_hit(KEY_ANY);
}

// Test 4bpp tiled, column-major surfaces
void test_chr4c_procs()
{
TSurface turret, dst;

// Init turret for blitting.
srf_init(&turret, SRF_CHR4C, turretChr4cTiles, 128, 128, 4, NULL);

// Init destination surface
srf_init(&dst, SRF_CHR4C, tile_mem[0], 240, 160, 4, pal_bg_mem);
schr4c_prep_map(&dst, se_mem[31], 0);
GRIT_CPY(pal_bg_mem, turretChr4cPal);

// Set video stuff
REG_DISPCNT= DCNT_MODE0 | DCNT_BG2 | DCNT_OBJ | DCNT_OBJ_1D;
REG_BG2CNT= BG_CBB(0)|BG_SBB(31);

u16 colors[8]= { 6, 13, 1, 14, 15, 0, 14, 0 };

// Run internal tester
test_surface_procs(&turret, &dst, &chr4c_tab, colors);
}