Okay. Apparently, I am an idiot who can't do math.
One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. In terms of code, the money shot is the following function from section 21.7.3, which calculates the affine parameters of the background for each scanline.
    IWRAM_CODE void m7_prep_affines(M7_LEVEL *level)
    {
        if(level->horizon >= SCREEN_HEIGHT)
            return;

        int ii, ii0= (level->horizon>=0 ? level->horizon : 0);

        M7_CAM *cam= level->camera;
        FIXED xc= cam->pos.x, yc= cam->pos.y, zc=cam->pos.z;

        BG_AFFINE *bga= &level->bgaff[ii0];

        FIXED yb, zb;           // b' = Rx(theta) * (L, ys, -D)
        FIXED cf, sf, ct, st;   // sines and cosines
        FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)
        cf= cam->u.x;      sf= cam->u.z;
        ct= cam->v.y;      st= cam->w.y;
        for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
        {
            yb= (ii-M7_TOP)*ct + M7_D*st;
            lam= DivSafe( yc<<12, yb);      // .12f    <- OI!!!

            lcf= lam*cf>>8;                 // .12f
            lsf= lam*sf>>8;                 // .12f

            bga->pa= lcf>>4;                // .8f
            bga->pc= lsf>>4;                // .8f

            // lambda·Rx·b
            zb= (ii-M7_TOP)*st - M7_D*ct;   // .8f
            bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
            bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f

            // hack that I need for fog. pb and pd are unused anyway
            bga->pb= lam;
            bga++;
        }
        level->bgaff[SCREEN_HEIGHT]= level->bgaff[0];
    }
For details on what all the terms mean, go to the page in question. For now, just note the call to DivSafe() that calculates the scaling factor λ, and recall that division on the GBA is pretty slow. In Mode 7 part 1, I used a LUT, but here I figured that since the yb term can be anything thanks to the pitch, you can't do that. After helping Ruben with his mode 7 demo, it turns out that you can.
Fig 1 shows the situation. There is a camera (the black triangle) that is tilted down by pitch angle θ. I've put the origin at the back of the camera because it makes things easier to read. The front of the camera is the projection plane, which is essentially the screen. A ray is cast from the back of the camera onto the floor and this ray intersects the projection plane. The coordinates of this point are x_{p} = (y_{p}, D) in projection plane space, which corresponds to point (y_{b}, z_{b}) in world space. This is simply rotating point x_{p} by θ. The scaling factor is the ratio between the y or z coordinates of the points on the floor and on the projection plane, so that's:
    λ = y_{c} / y_{b}
and for y_{b} the rotation gives us:
    y_{b} = y_{p}·cos θ + D·sin θ
where y_{c} is the camera height, y_{p} is a scanline offset (measured from the center of the screen) and D is the focal length.
Now, the point is that while y_{b} is variable and nonintegral when θ ≠ 0, it is still bounded! What's more, you can easily calculate its maximum value, since it's simply the maximum length of x_{p}. Calling this factor R, we get:
    R = max|x_{p}| = sqrt(max(y_{p})² + D²)
This factor R, rounded up, is the size of the required LUT. In my particular case, I've used y_{p}= scanline−80 and D = 256, which gives R = sqrt((160−80)² + 256²) = 268.2. In other words, I need a division LUT with 269 entries. Using .16 fixed point numbers for this LUT, the replacement code is essentially:
    u16 m7_div_lut[270]=
    {
        0xFFFF, 0xFFFF, 0x8000, 0x5556, ...
    };

    // Inside the function
    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
    {
        yb= (ii-M7_TOP)*ct + M7_D*st;           // .8
        lam= (yc*m7_div_lut[yb>>8])>>12;        // .8*.16/.12 = .12

        ... // business as usual
    }
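The table contents are elided above, so here is a minimal sketch of how such a LUT could be generated. It assumes each entry is 1/i in .16 fixed point, rounded up and clamped to 16 bits, which reproduces the four entries shown but may not be exactly how the real table was built. The names isqrt_ceil, m7_div_lut_fill, M7_YP_MAX and M7_LUT_SIZE are made up for this example, and M7_D is simply set to the 256 used in the text.

```c
#include <stdint.h>

#define M7_D        256     /* focal length D, as used in the text */
#define M7_YP_MAX   80      /* largest |y_p| for y_p = scanline-80 (hypothetical name) */
#define M7_LUT_SIZE 270     /* same size as the array above (hypothetical name) */

/* Smallest integer r with r*r >= n, i.e. an integer ceil(sqrt(n)). */
static uint32_t isqrt_ceil(uint32_t n)
{
    uint32_t r = 0;
    while (r * r < n)
        r++;
    return r;
}

/* Fill lut[i] with 1/i in .16 fixed point.
   Assumption: values are rounded up and clamped to 0xFFFF.
   Entry 0 has no meaningful reciprocal, so it is clamped as well. */
static void m7_div_lut_fill(uint16_t *lut, uint32_t size)
{
    uint32_t i;
    for (i = 0; i < size; i++)
    {
        uint32_t v = (i == 0) ? 0xFFFF : (0x10000 + i - 1) / i;
        lut[i] = (uint16_t)(v > 0xFFFF ? 0xFFFF : v);
    }
}
```

With these numbers, isqrt_ceil(M7_YP_MAX*M7_YP_MAX + M7_D*M7_D) comes out at 269, the rounded-up R from the text. On the GBA itself you would probably run something like this offline and paste the result in as a const array rather than fill the table at runtime.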
At this point, several questions may arise.
What about negative y_{b}? The beauty here is that while y_{b} may be negative in principle, such values would correspond to lines above the horizon and we don't calculate those anyway.
Won't nonintegral y_{b} cause inaccurate lookups? True, y_{b} will have a fractional part that is simply cut off during a simple lookup, and some sort of interpolation would be better. However, in testing there were no noticeable differences between direct lookup, lerped lookup or using Div(), so the simplest method suffices.
Are .16 fixed point numbers enough? Yes, apparently so.
ZOMG OVERFLOW! Are .16 fixed point numbers too high? Technically, yes, there is a risk of overflow when the camera height gets too high. However, at high altitudes the map is going to look like crap anyway due to the low resolution of the screen. Furthermore, the hardware only uses 8.8 fixeds, so scales above 256.0 wouldn't work anyway.
And finally:
What do I win? With Div(), m7_prep_affines() takes about 51k cycles. With the direct lookup this reduces to about 13k: a speed increase by a factor of 4.
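For reference, this is roughly what the difference between a direct and a lerped lookup looks like. It is only a sketch under my own assumptions: yb is a non-negative .8 fixed point value, the table is non-increasing (true for a reciprocal table), and the table has one entry past the largest index used. The names lut_direct and lut_lerp are made up for this example.

```c
#include <stdint.h>

/* Direct lookup: drop the 8 fractional bits of yb (.8) and index the table. */
static uint32_t lut_direct(const uint16_t *lut, uint32_t yb)
{
    return lut[yb >> 8];
}

/* Lerped lookup: blend entries i and i+1 using the fractional bits of yb.
   Assumption: lut[i] >= lut[i+1], so the difference stays unsigned. */
static uint32_t lut_lerp(const uint16_t *lut, uint32_t yb)
{
    uint32_t i    = yb >> 8;        /* integer part: table index */
    uint32_t frac = yb & 0xFF;      /* fractional part, 0..255   */
    uint32_t a    = lut[i];
    uint32_t b    = lut[i + 1];
    return a - (((a - b) * frac) >> 8);
}
```

For yb = 2.5 (0x280 in .8) and the first four table entries from above, the direct lookup returns 0x8000 while the lerped one lands halfway between 0x8000 and 0x5556. The lerp costs an extra load, a multiply and a shift per scanline, which is why the direct version is attractive when the results look the same.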
So yeah, this is what I should have figured out years ago, but somehow kept overlooking it. I'm not sure if I'll add this whole thing to Tonc's text and code, but I'll at least put up a link to here. Thanks Ruben, for showing me how to do this properly.