Okay. Apparently, I am an idiot who can't do math.
One of the longer chapters in Tonc is Mode 7 part 2, which covers
pretty much all the hairy details of producing mode 7 effects on the
GBA. The money shot, in terms of code, is the following function,
which calculates the affine parameters of the background for each
scanline in section 21.7.3.
IWRAM_CODE void m7_prep_affines(M7_LEVEL *level)
{
    if(level->horizon >= SCREEN_HEIGHT)
        return;

    int ii, ii0= (level->horizon>=0 ? level->horizon : 0);

    M7_CAM *cam= level->camera;
    FIXED xc= cam->pos.x, yc= cam->pos.y, zc= cam->pos.z;

    BG_AFFINE *bga= &level->bgaff[ii0];

    FIXED yb, zb;           // b' = Rx(theta) * (L, ys, -D)
    FIXED cf, sf, ct, st;   // sines and cosines
    FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)

    cf= cam->u.x;   sf= cam->u.z;
    ct= cam->v.y;   st= cam->w.y;

    for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
    {
        yb= (ii-M7_TOP)*ct + M7_D*st;
        lam= DivSafe( yc<<12, yb);      // .12f   <- OI!!!

        lcf= lam*cf>>8;                 // .12f
        lsf= lam*sf>>8;                 // .12f

        bga->pa= lcf>>4;                // .8f
        bga->pc= lsf>>4;                // .8f

        // lambda·Rx·b
        zb= (ii-M7_TOP)*st - M7_D*ct;                   // .8f
        bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
        bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f

        // hack that I need for fog. pb and pd are unused anyway
        bga->pb= lam;
        bga++;
    }
    level->bgaff[SCREEN_HEIGHT]= level->bgaff[0];
}
For details on what all the terms mean, go to the page in question.
For now, just note the call to DivSafe() that calculates
the scaling factor λ, and recall that division on the GBA is
pretty slow. In Mode 7 part 1,
I used a LUT, but here I figured that, since the yb term
can be anything thanks to the pitch, you can't do that. After helping
Ruben with his mode 7 demo, it turns out that you can.
Fig 1. Side view of the camera and floor. The camera is tilted slightly
down by angle θ.
Fig 1 shows the situation. There is a camera
(the black triangle) that is tilted down by pitch angle θ. I've
put the origin at the back of the camera because it makes things
easier to read. The
front of the camera is the projection plane, which is essentially
the screen. A ray is cast from the back of the camera onto the floor,
and this ray intersects the projection plane. The coordinates
of this point are x_{p} = (y_{p}, D) in projection-plane space, which
corresponds to point (y_{b}, z_{b}) in
world space. This is simply rotating point x_{p} by
θ. The scaling factor is the ratio between the y or
z coordinates of the points on the floor and on the projection
plane, so that's:

λ = y_{c} / y_{b}

and for y_{b} the rotation gives us:

y_{b} = y_{p} cos θ + D sin θ

where y_{c} is the camera height,
y_{p} is a scanline offset (measured from the center of the screen) and D is the focus
length.
Now, the point is that while y_{b} is variable
and nonintegral when θ ≠ 0, it is still bounded! What's more,
you can easily calculate its maximum value, since it's simply the
maximum length of x_{p}. Calling this factor R,
we get:

R = max |x_{p}| = sqrt(y_{p,max}² + D²)

This factor R, rounded up, is the size of the required LUT.
In my particular case, I've used y_{p}= scanline−80
and D = 256, which gives
R = sqrt((160−80)² + 256²)
= 268.2. In other words, I need a division LUT with 269 entries. Using .16
fixed point numbers for this LUT, the replacement code is essentially:
// The new division LUT. For 1/0 and 1/1, 0xFFFF is used.
u16 m7_div_lut[270]=
{
    0xFFFF, 0xFFFF, 0x8000, 0x5556, ...
};

// Inside the function
for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
{
    yb= (ii-M7_TOP)*ct + M7_D*st;       // .8
    lam= (yc*m7_div_lut[yb>>8])>>12;    // .8 * .16 >> 12 = .12

    ... // business as usual
}
At this point, several questions may arise.

What about negative y_{b}? The beauty here
is that while y_{b} may be negative in principle,
such values would correspond to lines above the horizon and we don't
calculate those anyway.

Won't nonintegral y_{b} cause inaccurate lookups?
True, y_{b} will have a fractional part that
is simply cut off during a direct lookup, and some sort of
interpolation would be better. However, in testing there were no
noticeable differences between a direct lookup, a lerped lookup, or
using Div(), so the simplest method suffices.

Are .16 fixed point numbers enough? Yes, apparently so.

ZOMG OVERFLOW! Are .16 fixed point numbers too high?
Technically, yes, there is a risk of overflow when the camera height
gets too high. However, at high altitudes the map is going to look
like crap anyway due to the low resolution of the screen.
Furthermore, the hardware only uses 8.8 fixeds, so scales above
256.0 wouldn't work anyway.
And finally:

What do I win?
With Div(), m7_prep_affines() takes
about 51k cycles. With the direct lookup this reduces to about 13k:
a speed increase by a factor of 4.
So yeah, this is what I should have figured out years ago, but
somehow kept overlooking it. I'm not sure if I'll add this whole thing to
Tonc's text and code, but I'll at least put up a link to here. Thanks
Ruben, for showing me how to do this properly.