Another fast fixed-point sine approximation

Gaddammit!

 

So here I am, looking forward to a nice quiet weekend; hang back, watch some telly and maybe read a bit – but NNnnneeeEEEEEUUUuuuuuuuu!! Someone had to write an interesting article about sine approximation. With a challenge at the end. And using an inefficient kind of approximation. And so now, instead of just relaxing, I have to spend my entire weekend and most of the week figuring out a better way of doing it. I hate it when this happens >_<.

 

Okay, maybe not.

 

Sarcasm aside, it is an interesting read. While the standard way of calculating a sine – via a look-up table – works and works well, there's just something unsatisfying about it. The LUT-based approach is just … dull. Uninspired. Cowardly. Inelegant. In contrast, finding a suitable algorithm for it requires effort and a modicum of creativity, so something like that always piques my interest.

In this case it's sine approximation. I'd been wondering about that when I did my arctan article, but figured it would require too many terms to really be worth the effort. But looking at Mr Schraut's post (whose site you should be visiting from time to time too; there's good stuff there) it seems you can get a decent version quite rapidly. The article centers around the work found at devmaster thread 5784, which derived the following two equations:

(1) \begin{array}{} S_2(x) &=& \frac4\pi x - \frac4{\pi^2} x^2 \\ \\ S_{4d}(x) &=& (1-P)S_2(x) + P S_2^2(x) \end{array}

These approximations work quite well, but I feel that it actually uses the wrong starting point. There are alternative approximations that give more accurate results at nearly no extra cost in complexity. In this post, I'll derive higher-order alternatives for both. In passing, I'll also talk about a few of the tools that can help analyse functions and, of course, provide some source code and do some comparisons.

Continue reading

DMA vs ARM9 - fight!

DMA, or Direct Memory Access, is a hardware method for transferring data. As it's hardware-driven, it's pretty damn fast(1). As such, it's pretty much the standard method for copying on the NDS. Unfortunately, as many people have noticed, it doesn't always work.

There are two principle reasons for this: cache and TCM. These are two memory regions of the ARM9 that DMA is unaware of, which can lead to incorrect transfers. In this post, I'll discuss the cache, TCM and their interactions (or lack thereof) with DMA.

The majority of the post is actually about cache. Cache basically determines the speed of your app, so it's worth looking into in more detail. Why it and DMA don't like each other much will become clear along the way. I'll also present a number of test cases that show the conflicting areas, and some functions to deal with these problems.

Continue reading
Notes:
  1. Well, quite fast anyway. In some circumstances CPU-based transfers are faster, but that's a story for another day.

mode 7 addendum

Okay. Apparently, I am an idiot who can't do math.

 

One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. The money shot for in terms of code is the following functions, which calculates the affine parameters of the background for each scanline in section 21.7.3.

IWRAM_CODE void m7_prep_affines(M7_LEVEL *level)
{
    if(level->horizon >= SCREEN_HEIGHT)
        return;

    int ii, ii0= (level->horizon>=0 ? level->horizon : 0);

    M7_CAM *cam= level->camera;
    FIXED xc= cam->pos.x, yc= cam->pos.y, zc=cam->pos.z;

    BG_AFFINE *bga= &level->bgaff[ii0];

    FIXED yb, zb;           // b' = Rx(theta) *  (L, ys, -D)
    FIXED cf, sf, ct, st;   // sines and cosines
    FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)
    cf= cam->u.x;      sf= cam->u.z;
    ct= cam->v.y;      st= cam->w.y;
    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
    {
        yb= (ii-M7_TOP)*ct + M7_D*st;
        lam= DivSafe( yc<<12,  yb);     // .12f    <- OI!!!

        lcf= lam*cf>>8;                 // .12f
        lsf= lam*sf>>8;                 // .12f

        bga->pa= lcf>>4;                // .8f
        bga->pc= lsf>>4;                // .8f

        // lambda·Rx·b
        zb= (ii-M7_TOP)*st - M7_D*ct;   // .8f
        bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
        bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f

        // hack that I need for fog. pb and pd are unused anyway
        bga->pb= lam;
        bga++;
    }
    level->bgaff[SCREEN_HEIGHT]= level->bgaff[0];
}

For details on what all the terms mean, go the page in question. For now, just note that call to DivSafe() to calculate the scaling factor λ and recall that division on the GBA is pretty slow. In Mode 7 part 1, I used a LUT, but here I figured that since the yb term can be anything thanks to the pitch you can't do that. After helping Ruben with his mode 7 demo, it turns out that you can.

 

Fig 1. Sideview of the camera and floor. The camera is tilted slightly down by angle θ.

Fig 1 shows the situation. There is a camera (the black triangle) that is tilted down by pitch angle θ. I've put the origin at the back of the camera because it makes things easier to read. The front of the camera is the projection plane, which is essentially the screen. A ray is cast from the back of the camera on to the floor and this ray intersects the projection plane. The coordinates of this point are xp = (yp, D) in projection plane space, which corresponds to point (yb, zb) in world space. This is simply rotating point xp by θ. The scaling factor is the ratio between the y or z coordinates of the points on the floor and on the projection plane, so that's:

\lambda = y_c / y_b,

and for yb the rotation gives us:

y_b = y_p cos \theta + D sin \theta,

where yc is the camera height, yp is a scanline offset (measured from the center of the screen) and D is the focus length.

Now, the point is that while yb is variable and non-integral when θ ≠ 0, it is still bounded! What's more, you can easily calculate its maximum value, since it's simply the maximum length of xp. Calling this factor R, we get:

R = \sqrt{max(y_p)^2 + D^2}

This factor R, rounded up, is the size of the required LUT. In my particular case, I've used yp= scanline−80 and D = 256, which gives R = sqrt((160−80)² + 256²) = 268.2. In other words, I need a division LUT with 269 entries. Using .16 fixed point numbers for this LUT, the replacement code is essentially:

// The new division LUT. For 1/0 and 1/1, 0xFFFF is used.
u16 m7_div_lut[270]=
{
    0xFFFF, 0xFFFF, 0x8000, 0x5556, ...
};


// Inside the function
    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
    {
        yb= (ii-M7_TOP)*ct + M7_D*st;           // .8
        lam= (yc*m7_div_lut[yb>>8])>>12;        // .8*.16/.12 = .12
       
        ... // business as usual
    }

At this point, several questions may arise.

  • What about negative yb? The beauty here is that while yb may be negative in principle, such values would correspond to lines above the horizon and we don't calculate those anyway.
  • Won't non-integral yb cause inaccurate look-ups? True, yb will have a fractional part that is simply cut off during a simple look-up and some sort of interpolation would be better. However, in testing there were no noticeable differences between direct look-up, lerped look-up or using Div(), so the simplest method suffices.
  • Are .16 fixed point numbers enough?. Yes, apparently so.
  • ZOMG OVERFLOW! Are .16 fixed point numbers too high? Technically, yes, there is a risk of overflow when the camera height gets too high. However, at high altitudes the map is going to look like crap anyway due to the low resolution of the screen. Furthermore, the hardware only uses 8.8 fixeds, so scales above 256.0 wouldn't work anyway.

And finally:

  • What do I win? With Div() m7_prep_affines() takes about 51k cycles. With the direct look-up this reduces to about 13k: a speed increase by a factor of 4.
 

So yeah, this is what I should have figured out years ago, but somehow kept overlooking it. I'm not sure if I'll add this whole thing to Tonc's text and code, but I'll at least put up a link to here. Thanks Ruben, for showing me how to do this properly.

Some interesting numbers on NDS code size

Even though the total size of code is usually small compared to assets, there are still some concerns about a number of systems. Most notably among these are stdio, iostream and several STL components like vectors and strings. I've seen people voice concerns about these items, but I don't think I've ever seen any measurements of them. So here are some.

Table 1 : Memory footprint of some C/C++ components in bytes. Items may not be strictly cumulative.
Barebones: just VBlank code14516
base+printf71148
base+iprintf54992
base+iostream266120
base+fopen56160
base+fstream260288
base+<string>59384
base+<vector>59624
base+<string>+<vector>59648

The sizes in Table 1 are for a bare source file with just the VBlank initialization and swiWaitForVBlank() plus whatever's necessary to use a particular component. For the IO parts this means a call to consoleDemoInit(); for vectors and strings, it means defining a variable.

 

Even an empty project is already 15k in size. Almost all of this is FIFO code, which is required for the ARM9 and ARM7 to communicate. Adding consoleDemoInit() and a printf() call adds roughly 71k. Printf has a lot of bagage: you have to have basic IO hooks, character type functions, allocations, decimal and floating point routines and more.

Because printf() uses the usually unnecessary floating point routines for float conversions, it is often suggested to use the integer-only variant iprintf(). In that case, it comes down to 55k. The difference is mostly due to two functions: _vfprintf_r() and _dtoa_r(), for 5.8k and 3.6k, respectively. The rest is made up of dozens of smaller functions. While the difference is relatively large, considering the footprint of the other components, the extra 16k is probably not that big of a deal. For the record, there is no difference in speed between the two. Well, almost: if the format string doesn't contain formatting parts, printf() is actually considerably faster. Another point of note is that the 55k for iprintf() is actually already added just by using consoleDemoInit().

And now the big one. People have said that C++ iostream was heavy, and indeed it is. 266k! That's a quite a lot, especially since the benefits of using iostream over stdio is rather slim if not actually negative(1). Don't use iostream in NDS projects. Don't even #include <iostream>, as that seems enough to link the whole thing in.

Related to iosteam is fstream. This also is about a quarter MB. I haven't checked too carefully, but I think the brunt of these parts are shared, so it won't combine to half a Meg if you use both. Something similar is true for the stdio file routines.

Why are the C++ streams so large? Well, lots of reasons, apparently. One of which is actually its potential for extensibility. Because it doesn't work via formatting flags, none of those can be excluded like in iprintf()'s case. Then there's exceptions, which adds a good deal of code as well. There also seems to be tons of stuff for character traits, numerical traits, money traits (wtf?!?) and iosbase stuff. These items seem small, say 4 to 40 bytes, but when there are over a thousand it adds up. Then there's all the stuff from STL strings and allocators, type info, more exception stuff, error messages for the exceptions, date/time routines, locale settings and more. I tell you, looking at the mapfile for this is enough to give me a headache. And worst of all, you'll probably use next to none of it, but it's all linked in anyway.

Finally, some STL. This is also said to be somewhat big-boned, and yes it isn't light. Doing anything non-trivial with either a vector or string seems to add about 60k. Fortunately, though, this is mostly the same 60k, so there are not bad effects from using both. Unfortunately, I can't really tell where it's all going. About 10k is spent on several d_*() routines like d_print(), which is I assume debug code. Another 10k is exceptions, type info and error messages and 10 more for. But after that it gets a little murky. In any case, even though adding STL strings and vectors includes more that necessary, 60k is a fair price for what these components give you.

 

Conclusions

The smallest size for an NDS binary is about 14k. While printf() is larger than iprintf(), it's probably not enough to worry about, so just use printf() until it becomes a pressing matter (and even then you could probably shrink down another part more easily anyway). There is no speed difference between the two. The C++ iostream and fstream components are not worth it. Their added value over printf and FILE routines are small when it comes to basic IO functionality. STL containers do cost a bit, but are probably worth the effort. If you need more than simple text handling or dynamic arrays and lists, I'd say go for it. But that's just my opinion.

Please note, the tests I did for this were fairly roughly. Your mileage may vary.

 

Lastly. The nm tool (or arm-eabi-nm for DKA) is my new best friend for executable analysis. Unlike the linker's mapfile, nm can sort addresses and show symbol sizes, and doesn't include tons of crap used for glue.


Notes:
  1. Yes, I said negative. Even though it has the benefit of being typesafe and extensible, the value of these advantages are somewhat exaggerated, especially since it has several drawback as well (see FQA:ch 15 for details). For one thing, try finding the iostream equivalent of "%08X", or formatted anything for that matter. For early grit code I was actually using iostream until I got to actually writing the export code. I couldn't move back to stdio fast enough.

On arctangent.

The arctangent is one of the more interesting trigonometry functions – and by “interesting” I of course mean a bitch to get right. I've been meaning to write something about the various methods of calculating it for a while now and finally got round to it recently.

I, ah, uhm, may have gotten a little carried away with it though ... but that's okay,! At least now I while one to recommend.

To C or not to C

Tonclib is coded mostly in C. The reason for this was twofold. First, I still have it in my head that C is lower level than C++, and that the former would compile to faster code; and faster is good. Second, it's easier for C++ to call C than the other way around so, for maximum compatibility, it made sense to code it in C. But these arguments always felt a little weak and now that I'm trying to port tonclib's functions to the DS, the question pops up again.

 

On many occasions, I just hated not going for C++. Not so much for its higher-level functionality like classes, inheritance and other OOPy goodness (or badness, some might say), but more because I would really, really like to make use of things like function overloading, default parameters and perhaps templates too.

 

For example, say you have a blit routine. You can implement this in multiple ways: with full parameters (srcX/Y, dstX/Y, width/height), using Point and Rect structs (srcRect, dstPoint) or perhaps just a destination point, using the full source-bitmap. In other words:

void blit(Surface *dst, int dstX, int dstY, int srcW, int srcH, Surface *src, int srcX, int srcY);
void blit(Surface *dst, Point *dstPoint, Surface *src, Rect *srcRect);
void blit(Surface *dst, Point *dstPoint, Surface *src);

In C++, this would be no problem. You just declare and define the functions and the compiler mangles the names internally to avoid naming conflicts. You can even make some of the functions inline facades that morphs the arguments for the One True Implementation. In C, however, this won't work. You have to do the name mangling yourself, like blit, blit2, blit3, or blitEx or blitRect, and so on and so forth. Eeghh, that is just ugly.

 

Speaking of points and rectangles, that's another thing. Structs for points and rects are quite useful, so you make one using int members (you should always start with ints). But sometimes it's better to have smaller versions, like shorts. Or maybe unsigned variations. And so you end up with:

struct point8_t   { s8  x, y; };   // Point as signed char
struct point16_t  { s16 x, y; };   // Point as signed short
struct point32_t  { s32 x, y; };   // Point as signed int

struct upoint8_t  { u8  x, y; };   // Point as unsigned char
struct upoint16_t { u16 x, y; };   // Point as unsigned short
struct upoint32_t { u32 x, y; };   // Point as unsigned int

And then that for rects too. And perhaps 3D vectors. And maybe add floats to the mix as well. This all requires that you make structs which are identical except for the primary datatype. That just sounds kinda dumb to me.

But wait, it gets even better! You might like to have some functions to go with these structs, so now you have to create different sets (yes, sets) of functions that differ only by their parameter types too! AAAARGGGGHHHHH, for the love of IPU, NOOOOOOOOOOOOOO!!! Neen, neen; driewerf neen! >_<

That's how it would be in C. In C++, you can just use a template like so:

template<class T>
struct point_t  { T x, y; };    // Point via templates

typedef point_t<u8> point8_t;   // And if you really want, you can
                                // typedef for specific types.

and be done with it. And then you can make a single template function (or was it function template, I always forget) that works for all the datatypes and let the compiler work it out. Letting the computer do the work for you, hah! What will they think of next.

 

Oh, and there's namespaces too! Yes! In C, you always have to worry about if some other library has something with the same name as you're thinking of using. This is where all those silly prefixes come from (oh hai, FreeImage!). With C++, there's a clean way out of that: you can encapsulate them in a namespace and when a conflict arises you can use mynamespace::foo to get out of it. And if there's no conflicts, use using namespace mynamespace; and type just plain foo. None of that FreeImage_foo causing you to have more prefix than genuine function identifier.

 

And [i]then[/i] there's C++ benefits like classes and everything that goes with it. Yes, classes can become fiendishly difficult if pushed too far(1), but inheritance and polymorphism are nice when you have any kind of hierarchy in your program. All Actors have positions, velocities and states. But a PlayerActor also needs input; and an NpcActor has AI. And each kind of NPC has different methods for behaviour and capabilities, and different Items have different effects and so on. It's possible to do this in just C (hint: unioned-structs and function-tables and of course state engines), but whether you'd want to is another matter. And there's constructors for easier memory management, STL and references. And, yes, streams, exceptions and RTTI too if you want to kill your poor CPU (regarding GBA/DS I mean), but nobody's forcing you to use those.


So why the hell am I staying with C again? Oh right, performance!

Performance, really? I think I heard this was a valid point a long time ago, but is it still true now? To test this, I turned all tonclib's C files into C++ files, compiled again and compared the two. This is the result:


Difference in function size between C++ and C in bytes.

That graph shows the difference in the compiled function size. Positive means C++ uses more instructions. In nearly 300 functions, the only differences are minor variations in irq_set(), some of the rendering routines and TTE parsers, and neither language is the clear winner. Overall, C++ seems to do a little bit better, but the difference is marginal.

I've also run a diff between the generated assembly. There are a handful of functions where the order of instructions are different, or different registers are used, or a value is placed in a register instead of on the stack. That's about it. In other words, there is no significant difference between pure C code and its C++ equivalent. Things will probably be a little different when OOP features and exceptions enter the fray, but that's to be expected. But if you stay close to C-like C++, the only place you'll notice anything is in the name-mangling. Which you as a programmer won't notice anyway because it all happens behind the scenes.

 

So that strikes performance off my list, leaving only wider compatibility. I suppose that has still some merit, but considering you can turn C-code into valid C++ by changing the extension(2), this is sound more and more like an excuse instead of a reason.


Notes:
  1. As the saying goes: C++ makes it harder to shoot yourself in the foot, but when you do, you blow off your whole leg.
  2. and clean up the type issues that C allows but C++ doesn't, like void* arithmetic and implicit pointer casts from void*.

Define: overthinking

<ramble>

I think too much. Or so people keep telling me. And they may have a point.

Anyway, so I'm working on this assembly version of toncset for, well, just because I guess. A fill routine has 3 parts: a head, a main run and a tail. The main part fills 32-bit chunks (words) when there's more than 4 bytes left and when the destination is word-aligned. This part is easy, because you can just dump words with stmia or str. For an example of this, see my older memset16 post.

The tail is for the remaining bytes after the main part. Under normal circumstances you could do this byte for byte, but some sections of GBA/NDS memory do not take kindly to byte-writes, so you'd have to read the word, mask the appropriate bytes out/in and then write it back. This is mostly just annoying, but still very doable.

The head part, however, is both annoying and tricky. It consist of filling the unaligned bytes in the first word of a run so that the main part can do its thing. This is similar to the tail part, in that it requires bitmasks. However, it's also possible that both the beginning and and of a run occur in the same word, effectively making the head the tail as well, so you'd have to apply a double mask … somehow. In C, it looks something like this:

/* Input:
  void *dstv;   // potentially non-aligned
  u32 fill;     // Already extended to full 32-bit if appropriate.
  uint size;    // > 0
*/


u32 addr= (u32)dstv;
int left= addr&3;
if(left != 0)
{
    u32 *dst= (u32*)(addr&~3);
    int right= left+size;
    u32 mask= 0xFFFFFFFF;

    if(right < 4)   // Everything in a single word: head and tail.
    {
        mask= ~(mask<<8*size);          // Create right-mask 000F
        mask <<= 8*left;                // Create mid mask 00F0
        *dst = (*dst &~ mask) | (fill & mask);
        return;
    }
    else
    {
        mask <<= 8*left;                // Create left mask FFF0
        *dst = (*dst &~ mask) | (fill & mask);
        size= right-4;
        *dst++;
    }
}

This bit of C translates roughly into the following ARM code:

    @ Reglist
    @ r0 : dst
    @ r1 : src
    @ r2 : size
    @ r3 : left (dst&3) / right (left+size) / data
    @ r4 : lshift (left*8)
    @ r5 : rshift (right*8) / mask
    @ ip : maskBase (0xFFFFFFFF)
.Lfgset_head:
    bic     r0, r0,#3               @ Align dst
    mvn     ip, #0                  @ Set-up mask (0xFFFFFFFF)
    mov     r4, r3, lsl #3          @ (left*=8) != 0 : right-side only
    add     r3, r3, r2              @ right= left+size
    cmp     r3, #4

    @ <= 4 : single-word. Shrink mask, do usual and quit early
    movlo   r5, r2, lsl #3          @ \.
    mvnlo   ip, ip, lsl r5          @ - r5= ((1<<8*size)-1)<<(8*left);
    mov     r5, ip, lsl r4          @ /

    subhi   r2, r3, #4              @ Adjust size for follow-ups

    @ Mask in r1 and write back to dst
    ldr     r3, [r0]
    bic     r3, r3, r5
    and     r5, r1, r5
    orr     r3, r5
    str     r3, [r0], #4

    bhi     .Lfgset_main            @ Longer stretch : go back.
    bx      lr                      @ Single-word fill : finished.

The main thing I thought when I'd written this down was “meh”. You will note that registers r4 and r5 are used here, which means stack-work (omitted here for brevity). The positioning of pushing and popping makes everything a little awkward, so I was off looking for something else.

The essence of the problem here is that you can only use five registers without touching the stack: r0-r3 and ip (r12). Now, r0-r2 are taken by the destination, fill and size, so I can't do anything with those. I also need one to store the left-edge (r3 in this case), leaving us with one for the right edge, the left and right shift intermediaries, and the left and right masks. Right! Crap -_-.


So; strategies. Well, one of the reasons I need to use so many registers is because the lifetimes overlap. For example, I still need left for a while because shifting the mask up comes last here. I can't use r2 for multiple purposes either because I'll need it for the size. Now, I could free up r3 by making the left-mask first, but then I might get in trouble when creating the right-mask. Also, right-4 is actually what size wants to be when it grows up in the long-run case, so I can use that there as well. I'd just have to undo that for the short case, or perhaps even create the right-mask from negative numbers.

At this point I figured it would be helpful to take a look at the various ways of creating the masks I needed. The standard form for a 0x000000FF-style bitmask is (1<<x)-1, but there are others as well. The following list holds a few examples.

// Bitmask examples. Assume x=8 (in r1)

 ( 1<<x)-1;     // 000000FF.    mov r0, #1; rsb r0, r0, r0, lsl r1;
~(-1<<x);       // 000000FF.    mvn r0, #0; mvn r0, r0, lsl r1;
  -1<<x;        // FFFFFF00.    mvn r0, #0; mov r0, r0, lsl r1;
-( 1<<x);       // FFFFFF00.    mov r0, #1; sub r0, r0, r0 lsl r1; sub r0, #1
  -1>>x;        // 00FFFFFF.    mvn r0, #0; mov r0, r0, lsr r1;
~(-1>>x);       // FF000000.    mvn r0, #0; mvn r0, r0, lsl r1;

All of these use +1 or −1 as their base, and all but one is a two-instruction affair. The left-mask looks like 0xFFFFFF00, so the most obvious one to pick here is -1<<x. Technically, the right-mask is 0x000000FF, using x = 8*right = 8*(left+size). However, you can also see it as a 0x00FFFFFF-style mask if you use 4-right. This solves two problems at once. First, it is the negative of the new size, so the value is readily available. Second, this mask is a right-shifted 0xFFFFFFFF, but as the lower bits are shifted out anyway, it doesn't actually have to be a proper 0xFFFFFFFF; it can be a 0xFFFFFF00 as well, which we have in the form of the left-mask. In other words, we don't require as many registers for temps because we already have everything we need. The resulting code is this:

    add     r2, r3, r2              @ (1a) right := left+size.
    movs    r3, r3, lsl #3
    mvn     ip, #0                  @ ip= -1
    mov     r3, ip, lsl r3          @ lmask := (-1)<<(8*left)
    subs    r2, r2, #4              @ (1b) aligned size= right-4

    @ new size < 0 : single-word fill
        rsblo   r2, r2, #0          @ - (2) r3= -8*size
        movlo   r2, r2, lsl #3      @ /
        andlo   r3, r3, r3, lsr r2  @ (3) mask = lmask & rmask
   
    @# inserts and jumps

See? No r4 and r5 anywhere. The key is toying around with r2 and r3. While r2 is reserved for the size, it needs to modified anyway to account for the work done here. In the end, size should be right−4, which is what points (1a) and (1b) do. Since right-4 is a right<4 as well, we can use its result as the condition for the special case; the result being the negative distance from the word-edge. As explained above, the right-mask can be constructed from the left-mask by lmask>>(-8*size), which is done at points (2) and (3).

It's a little hairy, but it works. And yet, it still evoked a feeling of “meh” like before. It's the two instructions at point (2) that annoyed me. The reason it's two instructions and not one is because you can't multiply by −8 in one go. By +8, yes: that's a shifted-mov; −1, yes: that's rsb, r2, #0; but the combination is difficult because the shift only applies to the second operand. A sub r2, #0, r2, lsl #3 would do it, but the first operand needs to be a register and I don't have a spare one with zero in it. I could make one, that just means I have an extra instruction somewhere else. I do, however, have either a +1 or −1 in ip, maybe I can use that somehow. And then it hits me: the carry flag!

There are adc, sbc and rsb instructions that add C, C−1 and C−1 to the result, respectively. Setting or clearing the carry flag is easy, so that's not a problem. All I need now is to start the flag using +1 instead of −1 to cancel out the −1 in sbc or rsc. As it turns out, I can do that I use this format for the left-mask: -(1<<x). In the mask overview above I listed thi as a 3-instruction gig, but as it turns out I can use the carry-trick here to for one instruction less. The final version (for just the head part) looks like this:

    ands    r3, r0, #3
    beq     .Lfgset_main            @ Jump to main stint is aligned

.Lfgset_head:
    bic     r0, r0, #3
    add     r2, r3, r2
    movs    r3, r3, lsl #3          @ left*8 ; clear carry
    mov     ip, #1
    sbc     r3, ip, ip, lsl r3      @ -(1<<8*left) +1-1
    subs    r2, r2, #4              @ size= right-4

    @ If negative (==carry clear), this is a single-word fill
    @ This requires a truncated mask (like 0x0000FF00)
        sbclo   r2, ip, r2, lsl #3      @ x= -8*size +1-1
        andlo   r3, r3, r3, lsr r2      @ mask= mask & (mask>>-8*size);

    @ Insert and jump to main stint if available.
.Lfgset_insert:
    ldr     ip, [r0]
    bic     ip, ip, r3
    and     r3, r1, r3
    orr     ip, ip, r3
    str     ip, [r0], #4

    bhi     .Lfgset_main        @ Longer stretch
    bx      lr                  @ Single-word fill : finished.

Sweeeet :). I was happy with this, until I realized what I'd been working on: an exception of an exception. This would definitely not be part of the 20% of the code that uses 80% of the runtime, so it's really not something one should worry about. Interesting, yes, and I learned a few new tricks, but perhaps time would have been better spent on getting 5% extra out of the main loop. The only problem there is that that is just boring old unrolling a bit, whereas the head presented a more ‘interesting’ problem so I went for that instead.

So yeah; I think too much >_>

</ramble>

tonc 1.4 official release

The files have need downloadable for a while now as a preview, but I finally put the text up on the main site as well so I guess that makes it official. Tonc is now at version 1.4. As mentioned before, the main new thing is TTE, a system for text for all occasions. I've also used grit in some of the advanced demos, so if you want to see how you can do advanced work with it, check out the mode 7 demos and the tte demo.

This will be the last version of Tonc. It's really gone on long enough now.


Files and linkies :


Right! Now what …

Fiddle to the bittle

I've added two new routines to the bit-trick page: 1→4 bit-unpack with reverse and bit reversals. This last one is elegant … except for one bit of C tomfoolery that is required to get GCC to produce the right ARM code. I hope to discuss this in more detail later.

I've also added a new document about dealing with bitfields. It explains what to do with them, gives a few useful functions to get and set bitfields, and demonstrates how to use the C construct for bitfields. It also touches briefly on a nasty detail in the way GCC implements bitfield that can cause them to fail in certain GBA/NDS memory sections. If you're using bitfields to map VRAM or OAM, please read.

Surface drawing routines.

I've been building a basic interface for dealing with graphic surfaces lately. I already had most of the routines for 16bpp and 8bpp bitmaps in older Toncs, but but their use was still somewhat awkward because you had to provide some details of the destination manually; most notably a base pointer and the pitch. This got more than a little annoying, especially when trying to make blitters as well. So I made some changes.


typedef struct TSurface
{
    u8  *data;      //!< Surface data pointer.
    u32 pitch;      //!< Scanline pitch in bytes (PONDER: alignment?).
    u16 width;      //!< Image width in pixels.
    u16 height;     //!< Image width in pixels.
    u8  bpp;        //!< Bits per pixel.
    u8  type;       //!< Surface type (not used that much).
    u16 palSize;    //!< Number of colors.
    u16 *palData;   //!< Pointer to palette.
} TSurface;

I've rebuilt the routines around a surface description struct called TSurface (see above). This way, I can just initialize the surface somewhere and just pass the pointer to that surface around. There are a number of different kinds of surfaces. The most important ones are these three:

  • bmp16. 16bpp bitmap surfaces.
  • bmp8. 8bpp bitmap surfaces.
  • chr4c. 4bpp tiled surfaces, in column-major order (i.e., tile 1 is under tile 0 instead of to the right). Column-major order may seem strange, but it actually simplifies the code considerably. There is also a chr4r mode for normal, row-major tiling, but that's unfinished and will probably remain so.
surface.gba movie
Demonstrating surface routines for 4bpp tiles.

For each of these three, I have the most important rendering functions: plotting pixels, lines, rectangles and blits. Yes, blits too. Even for chr4c-mode. There are routines for frames (empty rectangles) and floodfill as well. The functions have a uniform interface with respect to surface-type, so switching between them should be easy were it necessary. There are also tables with function pointers to these routines, so by using those you need not really care about the details of the surface after its creation. I'll probably add a pointer to such a table in TSurface in the future.


Linkies


The image on the right is the result of the following routine. Turret pic semi-knowingly provided by Kawa.

void test_surface_procs(const TSurface *src, TSurface *dst,
    const TSurfaceProcTab *procs, u16 colors[])
{
    // Init object text
    tte_init_obj(&oam_mem[127], ATTR0_TALL, ATTR1_SIZE_8, 512,
        CLR_YELLOW, 0, &vwf_default, NULL);
    tte_init_con();
    tte_set_margins(8, 140, 160, 152);

    // And go!
    tte_printf("#{es;P}%s surface primitives#{w:60}", procs->name);

    tte_printf("#{es;P}Rect#{w:20}");
    procs->rect(dst, 20, 20, 100, 100, colors[0]);

    tte_printf("#{w:30;es;P}Frame#{w:20}");
    procs->frame(dst, 21, 21, 99, 99, colors[1]);

    tte_printf("#{w:30;es;P}Hlines#{w:20}");

    procs->hline(dst, 23, 23, 96, colors[2]);
    procs->hline(dst, 23, 96, 96, colors[2]);

    tte_printf("#{w:30;es;P}Vlines#{w:20}");
    procs->vline(dst, 23, 25, 94, colors[3]);
    procs->vline(dst, 96, 25, 94, colors[3]);

    tte_printf("#{w:30;es;P}Lines#{w:20}");
    procs->line(dst, 25, 25, 94, 40, colors[4]);
    procs->line(dst, 94, 25, 79, 94, colors[4]);
    procs->line(dst, 94, 94, 25, 79, colors[4]);
    procs->line(dst, 25, 94, 40, 25, colors[4]);

    tte_printf("#{w:30;es;P}Full blit#{w:20}");
    procs->blit(dst, 120, 16, src->width, src->height, src, 0, 0);

    tte_printf("#{w:30;es;P}Partial blit#{w:20}");
    procs->blit(dst, 40, 40, 40, 40, src, 12, 8);

    tte_printf("#{w:30;es;P}Floodfill#{w:20}");
    procs->flood(dst, 40, 32, colors[5]);
    tte_printf("#{w:30;es;P}Again !#{w:20}");
    procs->flood(dst, 40, 32, colors[6]);

    tte_printf("#{w:30;es;P;w:30}Ta-dah!!!#{w:20}");

    key_wait_till_hit(KEY_ANY);
}

// Test 4bpp tiled, column-major surfaces
void test_chr4c_procs()
{
    TSurface turret, dst;

    // Init turret for blitting.
    srf_init(&turret, SRF_CHR4C, turretChr4cTiles, 128, 128, 4, NULL);

    // Init destination surface
    srf_init(&dst, SRF_CHR4C, tile_mem[0], 240, 160, 4, pal_bg_mem);
    schr4c_prep_map(&dst, se_mem[31], 0);
    GRIT_CPY(pal_bg_mem, turretChr4cPal);

    // Set video stuff
    REG_DISPCNT= DCNT_MODE0 | DCNT_BG2 | DCNT_OBJ | DCNT_OBJ_1D;
    REG_BG2CNT= BG_CBB(0)|BG_SBB(31);

    u16 colors[8]= { 6, 13, 1, 14, 15, 0, 14, 0 };

    // Run internal tester
    test_surface_procs(&turret, &dst, &chr4c_tab, colors);
}