I've been working on a few functions for rendering onto tiles recently. Yesterday was the turn of a rectangle filler. The traditional routine of double-looping over a pixel-plotter would be slow in every case, but for tiled surfaces it's positively evil, so I made something that divides the rectangle into 5 areas and fills them using word-sized writes or better. Yes, this is a little tricky, but I figured the speed increase of up to 300 would be worth it.
For testing purposes, I filled each region with a different color so that if (or rather, when) something went wrong, I could easily identify the problem. When playing around with the test app, I more or less accidentally came up with this:
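To give an idea of why words help, here is a minimal sketch. It is not the 5-area routine from tonclib but a simpler variant that handles each row of the rectangle one 32-bit word at a time, masking only the left and right edges. The names (`surf`, `row_word`, `plot4`, `get4`, `rect4`), the surface size and the row-major tile layout are all assumptions made for illustration. It does assume the usual GBA 4bpp format, where one word holds one 8-pixel tile row with the leftmost pixel in the lowest nibble.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical surface: 4x4 tiles, row-major tile order (an assumption). */
#define SURF_TILES_W 4
#define SURF_TILES_H 4

/* Each 8x8 4bpp tile is 8 words; one word is one row of 8 pixels. */
static uint32_t surf[SURF_TILES_W * SURF_TILES_H * 8];

/* Address of the word holding row y within tile column tx. */
static uint32_t *row_word(int tx, int y)
{
    return &surf[((y >> 3) * SURF_TILES_W + tx) * 8 + (y & 7)];
}

/* Naive plotter: one read-modify-write per pixel. */
static void plot4(int x, int y, uint32_t clr)
{
    uint32_t *w = row_word(x >> 3, y);
    int shift = (x & 7) * 4;
    *w = (*w & ~(0xFu << shift)) | ((clr & 0xFu) << shift);
}

/* Read back a single pixel. */
static uint32_t get4(int x, int y)
{
    return (*row_word(x >> 3, y) >> ((x & 7) * 4)) & 0xFu;
}

/* Fill the half-open rectangle [left,right) x [top,bottom) word by word. */
static void rect4(int left, int top, int right, int bottom, uint32_t clr)
{
    if (left >= right || top >= bottom)
        return;
    assert(left >= 0 && top >= 0);
    assert(right <= SURF_TILES_W * 8 && bottom <= SURF_TILES_H * 8);

    /* Replicate the 4-bit color into all 8 nibbles of a word. */
    uint32_t fill = (clr & 0xFu) * 0x11111111u;

    int tx0 = left >> 3;          /* first tile column touched */
    int tx1 = (right - 1) >> 3;   /* last tile column touched  */

    /* Pixels kept at the left edge and at the right edge. */
    uint32_t lmask = 0xFFFFFFFFu << ((left & 7) * 4);
    uint32_t rmask = 0xFFFFFFFFu >> ((7 - ((right - 1) & 7)) * 4);

    for (int y = top; y < bottom; y++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            uint32_t mask = 0xFFFFFFFFu;
            if (tx == tx0) mask &= lmask;
            if (tx == tx1) mask &= rmask;
            uint32_t *w = row_word(tx, y);
            *w = (*w & ~mask) | (fill & mask);
        }
    }
}
```

Calling `rect4(3, 2, 21, 13, 5)` touches three words per row instead of eighteen separate pixels. A full version would go further and treat the fully covered middle columns as one block with no masking at all, which is roughly where splitting the rectangle into separate areas starts to pay off.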

Hmmm ... Mondriaany.
Anyway, it seems that this thing went alright. So now tonclib also has plot, hline, vline, line, rect and frame functions for 4bpp tiled modes. No, there's no blitting yet. If anyone wants that, I'm going to insist on some mental hazard pay.
The download link for the new tonclib seems to be broken (or at least, it doesn't work for me).
I'm really amazed by the awesome optimized routines you keep writing for the GBA.
For a project I'm writing, I make heavy use of plot on a 4bpp tiled background (for a variable-width text system), and since my functions are horribly slow, I'm really looking forward to an optimized blitting function! I really can't think of a way to optimize that potentially unaligned 4-bit to 4-bit transfer, so I'll be happy to see how you've done it! :)
I fixed the link; it should work now.
About 4bpp blitting: I know the basic technique, but the process really is quite evil. For text purposes you can take a few shortcuts if you order your glyphs properly. The latest tonclib includes a text system that has a few renderers for 4bpp tiled text, so you could take a look at that. You can find a chapter preview with explanations and demos here.