tonc updates (finally!)

And at long last, I get off my ass and fix tonc's code for the arm-eabiarm-none-eabi change. Also fixed some other things in html and css, and updated the PDF. I had to export the PDF via chrome this time, though, so it'll look a little different. It also lost me about 60 pages, now where did those go all of a sudden …

But yeah … tonc is at 1.4.2 now.

GCC 4.7 vs ARM assembly

In a recent bout of ‘They Changed It, Now It Sucks’, GCC 4.7 won't add interworking instructions to ARM assembly code unless you add a .type directive to the function boilerplate. Following some code from devkitARM 38 announcement:

    @ section directive goes here.
    @ code dirctive goes here.
    .align  2
    .global foo
    .type   foo STT_FUNC                @ Add this line right here
    @ etc

Though the change is simple, it does effectively mean that basically all older assembly code is now b0rked, including tonc and many of the ASM functions I've written about here. I hope I can assess the damage this weekend and fix everything.

UPDATE 2012-05-20

Had some time to work on this. When I first saw that I was going to have to do this, I though ‘Oh naw, I'm going to have to update every single ASM function I have?’ But then I noticed that I had already done so, more or less. One of the things I realized some time after the 1.4 release was that copy-pasting the standard boiler-plate code for ASM functions was stupid, so I created macros to do that instead. I was pleasantly surprised to see that I had already upgraded all tonclib functions with that macro, so all I had to do was add the single line in a single place and recompile. Fuck yeah, macros.

The only troublesome thing was that the inline ASM I used in swi_demo failed horrifically, so I just removed those.


The current status is that all the code's okay again, but the text parts still need to be updated.

mode 7 addendum

Okay. Apparently, I am an idiot who can't do math.


One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. The money shot for in terms of code is the following functions, which calculates the affine parameters of the background for each scanline in section 21.7.3.

IWRAM_CODE void m7_prep_affines(M7_LEVEL *level)
    if(level->horizon >= SCREEN_HEIGHT)

    int ii, ii0= (level->horizon>=0 ? level->horizon : 0);

    M7_CAM *cam= level->camera;
    FIXED xc= cam->pos.x, yc= cam->pos.y, zc=cam->pos.z;

    BG_AFFINE *bga= &level->bgaff[ii0];

    FIXED yb, zb;           // b' = Rx(theta) *  (L, ys, -D)
    FIXED cf, sf, ct, st;   // sines and cosines
    FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)
    cf= cam->u.x;      sf= cam->u.z;
    ct= cam->v.y;      st= cam->w.y;
    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
        yb= (ii-M7_TOP)*ct + M7_D*st;
        lam= DivSafe( yc<<12,  yb);     // .12f    <- OI!!!

        lcf= lam*cf>>8;                 // .12f
        lsf= lam*sf>>8;                 // .12f

        bga->pa= lcf>>4;                // .8f
        bga->pc= lsf>>4;                // .8f

        // lambda·Rx·b
        zb= (ii-M7_TOP)*st - M7_D*ct;   // .8f
        bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
        bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f

        // hack that I need for fog. pb and pd are unused anyway
        bga->pb= lam;
    level->bgaff[SCREEN_HEIGHT]= level->bgaff[0];

For details on what all the terms mean, go the page in question. For now, just note that call to DivSafe() to calculate the scaling factor λ and recall that division on the GBA is pretty slow. In Mode 7 part 1, I used a LUT, but here I figured that since the yb term can be anything thanks to the pitch you can't do that. After helping Ruben with his mode 7 demo, it turns out that you can.


Fig 1. Sideview of the camera and floor. The camera is tilted slightly down by angle θ.

Fig 1 shows the situation. There is a camera (the black triangle) that is tilted down by pitch angle θ. I've put the origin at the back of the camera because it makes things easier to read. The front of the camera is the projection plane, which is essentially the screen. A ray is cast from the back of the camera on to the floor and this ray intersects the projection plane. The coordinates of this point are xp = (yp, D) in projection plane space, which corresponds to point (yb, zb) in world space. This is simply rotating point xp by θ. The scaling factor is the ratio between the y or z coordinates of the points on the floor and on the projection plane, so that's:

\lambda = y_c / y_b,

and for yb the rotation gives us:

y_b = y_p cos \theta + D sin \theta,

where yc is the camera height, yp is a scanline offset (measured from the center of the screen) and D is the focus length.

Now, the point is that while yb is variable and non-integral when θ ≠ 0, it is still bounded! What's more, you can easily calculate its maximum value, since it's simply the maximum length of xp. Calling this factor R, we get:

R = \sqrt{max(y_p)^2 + D^2}

This factor R, rounded up, is the size of the required LUT. In my particular case, I've used yp= scanline−80 and D = 256, which gives R = sqrt((160−80)² + 256²) = 268.2. In other words, I need a division LUT with 269 entries. Using .16 fixed point numbers for this LUT, the replacement code is essentially:

// The new division LUT. For 1/0 and 1/1, 0xFFFF is used.
u16 m7_div_lut[270]=
    0xFFFF, 0xFFFF, 0x8000, 0x5556, ...

// Inside the function
    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
        yb= (ii-M7_TOP)*ct + M7_D*st;           // .8
        lam= (yc*m7_div_lut[yb>>8])>>12;        // .8*.16/.12 = .12
        ... // business as usual

At this point, several questions may arise.

  • What about negative yb? The beauty here is that while yb may be negative in principle, such values would correspond to lines above the horizon and we don't calculate those anyway.
  • Won't non-integral yb cause inaccurate look-ups? True, yb will have a fractional part that is simply cut off during a simple look-up and some sort of interpolation would be better. However, in testing there were no noticeable differences between direct look-up, lerped look-up or using Div(), so the simplest method suffices.
  • Are .16 fixed point numbers enough?. Yes, apparently so.
  • ZOMG OVERFLOW! Are .16 fixed point numbers too high? Technically, yes, there is a risk of overflow when the camera height gets too high. However, at high altitudes the map is going to look like crap anyway due to the low resolution of the screen. Furthermore, the hardware only uses 8.8 fixeds, so scales above 256.0 wouldn't work anyway.

And finally:

  • What do I win? With Div() m7_prep_affines() takes about 51k cycles. With the direct look-up this reduces to about 13k: a speed increase by a factor of 4.

So yeah, this is what I should have figured out years ago, but somehow kept overlooking it. I'm not sure if I'll add this whole thing to Tonc's text and code, but I'll at least put up a link to here. Thanks Ruben, for showing me how to do this properly.

tonc 1.4 official release

The files have need downloadable for a while now as a preview, but I finally put the text up on the main site as well so I guess that makes it official. Tonc is now at version 1.4. As mentioned before, the main new thing is TTE, a system for text for all occasions. I've also used grit in some of the advanced demos, so if you want to see how you can do advanced work with it, check out the mode 7 demos and the tte demo.

This will be the last version of Tonc. It's really gone on long enough now.

Files and linkies :

Right! Now what …

tonc 1.4 preview

I'm close to releasing the latest (and probably last; this really has gone on long enough) version of Tonc. As a preview, I'm releasing the PDF a little early in the hope that someone may take a look and offer some feedback before the official release (aw, c'mon, it's only 400 pages).

The changes mostly relate to the new Tonc Text Engine, a text system for all occasions. There's a new chapter describing how TTE works, how to write general character printers for (almost) for arbitrary sized fonts and every type of graphics, and a few other things. It's fairly long and could use sanity checking from someone else.

Also, many of the older demos now use TTE for their text as well. As a result they look cleaner and prettier, but it's possible there are some left-overs from older versions. So have at it.

Surface drawing routines.

I've been building a basic interface for dealing with graphic surfaces lately. I already had most of the routines for 16bpp and 8bpp bitmaps in older Toncs, but but their use was still somewhat awkward because you had to provide some details of the destination manually; most notably a base pointer and the pitch. This got more than a little annoying, especially when trying to make blitters as well. So I made some changes.

typedef struct TSurface
    u8  *data;      //!< Surface data pointer.
    u32 pitch;      //!< Scanline pitch in bytes (PONDER: alignment?).
    u16 width;      //!< Image width in pixels.
    u16 height;     //!< Image width in pixels.
    u8  bpp;        //!< Bits per pixel.
    u8  type;       //!< Surface type (not used that much).
    u16 palSize;    //!< Number of colors.
    u16 *palData;   //!< Pointer to palette.
} TSurface;

I've rebuilt the routines around a surface description struct called TSurface (see above). This way, I can just initialize the surface somewhere and just pass the pointer to that surface around. There are a number of different kinds of surfaces. The most important ones are these three:

  • bmp16. 16bpp bitmap surfaces.
  • bmp8. 8bpp bitmap surfaces.
  • chr4c. 4bpp tiled surfaces, in column-major order (i.e., tile 1 is under tile 0 instead of to the right). Column-major order may seem strange, but it actually simplifies the code considerably. There is also a chr4r mode for normal, row-major tiling, but that's unfinished and will probably remain so.
surface.gba movie
Demonstrating surface routines for 4bpp tiles.

For each of these three, I have the most important rendering functions: plotting pixels, lines, rectangles and blits. Yes, blits too. Even for chr4c-mode. There are routines for frames (empty rectangles) and floodfill as well. The functions have a uniform interface with respect to surface-type, so switching between them should be easy were it necessary. There are also tables with function pointers to these routines, so by using those you need not really care about the details of the surface after its creation. I'll probably add a pointer to such a table in TSurface in the future.


The image on the right is the result of the following routine. Turret pic semi-knowingly provided by Kawa.

void test_surface_procs(const TSurface *src, TSurface *dst,
    const TSurfaceProcTab *procs, u16 colors[])
    // Init object text
    tte_init_obj(&oam_mem[127], ATTR0_TALL, ATTR1_SIZE_8, 512,
        CLR_YELLOW, 0, &vwf_default, NULL);
    tte_set_margins(8, 140, 160, 152);

    // And go!
    tte_printf("#{es;P}%s surface primitives#{w:60}", procs->name);

    procs->rect(dst, 20, 20, 100, 100, colors[0]);

    procs->frame(dst, 21, 21, 99, 99, colors[1]);


    procs->hline(dst, 23, 23, 96, colors[2]);
    procs->hline(dst, 23, 96, 96, colors[2]);

    procs->vline(dst, 23, 25, 94, colors[3]);
    procs->vline(dst, 96, 25, 94, colors[3]);

    procs->line(dst, 25, 25, 94, 40, colors[4]);
    procs->line(dst, 94, 25, 79, 94, colors[4]);
    procs->line(dst, 94, 94, 25, 79, colors[4]);
    procs->line(dst, 25, 94, 40, 25, colors[4]);

    tte_printf("#{w:30;es;P}Full blit#{w:20}");
    procs->blit(dst, 120, 16, src->width, src->height, src, 0, 0);

    tte_printf("#{w:30;es;P}Partial blit#{w:20}");
    procs->blit(dst, 40, 40, 40, 40, src, 12, 8);

    procs->flood(dst, 40, 32, colors[5]);
    tte_printf("#{w:30;es;P}Again !#{w:20}");
    procs->flood(dst, 40, 32, colors[6]);



// Test 4bpp tiled, column-major surfaces
void test_chr4c_procs()
    TSurface turret, dst;

    // Init turret for blitting.
    srf_init(&turret, SRF_CHR4C, turretChr4cTiles, 128, 128, 4, NULL);

    // Init destination surface
    srf_init(&dst, SRF_CHR4C, tile_mem[0], 240, 160, 4, pal_bg_mem);
    schr4c_prep_map(&dst, se_mem[31], 0);
    GRIT_CPY(pal_bg_mem, turretChr4cPal);

    // Set video stuff
    REG_BG2CNT= BG_CBB(0)|BG_SBB(31);

    u16 colors[8]= { 6, 13, 1, 14, 15, 0, 14, 0 };

    // Run internal tester
    test_surface_procs(&turret, &dst, &chr4c_tab, colors);

Artsy fartsy

I've been working on a few functions for rendering onto tiles recently. Yesterday was the turn of a rectangle filler. The traditional routine of double-looping over a pixel-plotter would be slow in every case, but for tiled surfaces it's positively evil, so I made something that divides the rectangle in 5 areas and fills them using by words or better. Yes, this is a little tricky but I figured the speed increase of up to 300 would be worth it.

For testing purposes, I filled each region with a different color so that ifwhen something went wrong, I could easily identify the problem. When playing around with the test app, I more or less accidentally came up with this:

accidental mondriaan

Hmmm ... Mondriaany.

Anyway, it seems that this thing went alright. So now tonclib also has plot, hline, vline, line, rect and frame functions for 4bpp tiled modes. No, there's no blitting yet. In anyone wants that, I'm going to insist on some mental hazard pay.

Tonc:setup update

Finally got round to updating Tonc's dev setup page. It finally mentions devkitPro's template makefiles and the basics of how to use them. I've also added a list of potential problems you may encounter when installing/upgrading devkitARM or just building projects. I have not updated the downloadables yet because there's still a few unfinished edits there. I just wanted to get this one out of the way because it's so very, very overdue.

memcpy and memset replacements for GBA/NDS

The standard C functions for copying and filling are memcpy() and memset(). They're part of the standard library, are easy to use and are often implemented with some optimizations so that they're usually faster than manual looping. The DKA version, for example will fill as words if the alignments and sizes allow for it. This can be much faster than doing the loops yourself.

There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.

The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.

The code

My main goal here was to create smallish and portable replacements, not to have the greatest and fastestest code around because that's rather platform dependent. Yes, even the difference between GBA and NDS should matter, because of the differences in ldr/str times and caching.

There are 5 functions here. The main functions here are tonccpy and __toncset for copying and filling words, respectively. The other 3 are interfaces for __toncset for filling 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color instead of 8-bit data. For the rest of the discussion, I will use the name “toncset” for the internal routine for convenience.

Continue reading