Okay. Apparently, I am an idiot who can't do math.
One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. The money shot in terms of code is the following function from section 21.7.3, which calculates the affine parameters of the background for each scanline.
if(level->horizon >= SCREEN_HEIGHT)
    return;
int ii, ii0= (level->horizon>=0 ? level->horizon : 0);
M7_CAM *cam= level->camera;
FIXED xc= cam->pos.x, yc= cam->pos.y, zc=cam->pos.z;
BG_AFFINE *bga= &level->bgaff[ii0];
FIXED yb, zb;           // b' = Rx(theta) * (L, ys, -D)
FIXED cf, sf, ct, st;   // sines and cosines
FIXED lam, lcf, lsf;    // scale and scaled (co)sine(phi)
cf= cam->u.x; sf= cam->u.z;
ct= cam->v.y; st= cam->w.y;
for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
{
    yb= (ii-M7_TOP)*ct + M7_D*st;
    lam= DivSafe( yc<<12, yb);      // .12f <- OI!!!
    lcf= lam*cf>>8;                 // .12f
    lsf= lam*sf>>8;                 // .12f
    bga->pa= lcf>>4;                // .8f
    bga->pc= lsf>>4;                // .8f
    zb= (ii-M7_TOP)*st - M7_D*ct;   // .8f
    bga->dx= xc + (lcf>>4)*M7_LEFT - (lsf*zb>>12);  // .8f
    bga->dy= zc + (lsf>>4)*M7_LEFT + (lcf*zb>>12);  // .8f
    // hack that I need for fog. pb and pd are unused anyway
}
For details on what all the terms mean, go to the page in question. For now, just note the call to DivSafe() to calculate the scaling factor λ, and recall that division on the GBA is pretty slow. In Mode 7 part 1, I used a LUT, but here I figured that since the divisor yb can be anything thanks to the pitch, you can't do that. After helping Ruben with his mode 7 demo, it turns out that you can.
Fig 1 shows the situation. There is a camera (the black triangle) that is tilted down by pitch angle θ. I've put the origin at the back of the camera because it makes things easier to read. The front of the camera is the projection plane, which is essentially the screen. A ray is cast from the back of the camera onto the floor and this ray intersects the projection plane. The coordinates of this point are xp = (yp, D) in projection plane space, which corresponds to point (yb, zb) in world space. This is simply rotating point xp by θ. The scaling factor is the ratio between the y or z coordinates of the points on the floor and on the projection plane, so that's:
λ = yc / yb
and for yb the rotation gives us:
yb = yp·cos θ + D·sin θ
where yc is the camera height, yp is a scanline offset (measured from the center of the screen) and D is the focal length.
Now, the point is that while yb is variable and non-integral when θ ≠ 0, it is still bounded! What's more, you can easily calculate its maximum value, since it's simply the maximum length of xp. Calling this factor R, we get:
R = max|xp| = sqrt(max(yp)² + D²)
This factor R, rounded up, is the size of the required LUT. In my particular case, I've used yp= scanline−80 and D = 256, which gives R = sqrt((160−80)² + 256²) = 268.2. In other words, I need a division LUT with 269 entries. Using .16 fixed point numbers for this LUT, the replacement code is essentially:
0xFFFF, 0xFFFF, 0x8000, 0x5556, ...
// Inside the function
for(ii= ii0; ii<SCREEN_HEIGHT; ii++)
{
    yb= (ii-M7_TOP)*ct + M7_D*st;       // .8
    lam= (yc*m7_div_lut[yb>>8])>>12;    // .8*.16/.12 = .12
    ... // business as usual
}
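For what it's worth, a table like this can be generated offline. The following is only a sketch of how that might be done, not the code that produced the actual table: the names `M7_LUT_SIZE`, `m7_div_lut_gen`, `isqrt_ceil` and `m7_build_div_lut` are made up for illustration, and the round-up rule is an assumption that happens to reproduce the four entries shown above (0xFFFF, 0xFFFF, 0x8000, 0x5556).

```c
#include <stdint.h>

/* Hypothetical names, for illustration only; none of these are part of Tonc. */
#define M7_YP_MAX   80      /* largest |yp|: 160 - 80 */
#define M7_FOCAL    256     /* D */
#define M7_LUT_SIZE 269     /* ceil(sqrt(80*80 + 256*256)) = ceil(268.2) */

static uint16_t m7_div_lut_gen[M7_LUT_SIZE];

/* Smallest n with n*n >= v, i.e. ceil(sqrt(v)) using integers only. */
static uint32_t isqrt_ceil(uint32_t v)
{
    uint32_t n = 0;
    while (n * n < v)
        n++;
    return n;
}

/* Fill the table with 1/n in .16 fixed point, rounded up (assumed rule). */
static void m7_build_div_lut(void)
{
    for (uint32_t n = 0; n < M7_LUT_SIZE; n++) {
        /* 1/0 is undefined and 1/1 = 0x10000 does not fit in 16 bits,
           so both entries are clamped to 0xFFFF. */
        uint32_t v = (n < 2) ? 0xFFFF : (0x10000 + n - 1) / n;
        m7_div_lut_gen[n] = (uint16_t)v;
    }
}
```

Here `isqrt_ceil(M7_YP_MAX*M7_YP_MAX + M7_FOCAL*M7_FOCAL)` is the R calculation from the text in integer form. The highest index that `yb>>8` can reach is 268, so 269 entries are enough for a direct look-up.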
At this point, several questions may arise.
- What about negative yb? The beauty here is that while yb may be negative in principle, such values would correspond to lines above the horizon and we don't calculate those anyway.
- Won't non-integral yb cause inaccurate look-ups? True, yb will have a fractional part that is simply cut off during a simple look-up and some sort of interpolation would be better. However, in testing there were no noticeable differences between direct look-up, lerped look-up or Div(), so the simplest method suffices.
- Are .16 fixed point numbers enough? Yes, apparently so.
- ZOMG OVERFLOW! Are .16 fixed point numbers too high? Technically, yes, there is a risk of overflow when the camera height gets too high. However, at high altitudes the map is going to look like crap anyway due to the low resolution of the screen. Furthermore, the hardware only uses 8.8 fixeds, so scales above 256.0 wouldn't work anyway.
- What do I win? With the division, m7_prep_affines() takes about 51k cycles. With the direct look-up this reduces to about 13k: a speed increase by a factor of roughly 4.
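To make the look-up accuracy question above a bit more concrete, here is a rough sketch of the two look-up variants side by side. The function names `lam_direct` and `lam_lerp` are hypothetical, `FIXED` is assumed to be a 32-bit signed integer as in Tonc, and the lerped version assumes a table in which entry n+1 exists and is never larger than entry n.

```c
#include <stdint.h>

typedef int32_t FIXED;  /* assumption: matches Tonc's 32-bit FIXED */

/* Direct look-up: drop the fraction of yb (.8) and index the table.
   yc is .8 and the table is .16, so the product is .24; >>12 leaves .12. */
static FIXED lam_direct(const uint16_t *lut, FIXED yc, FIXED yb)
{
    return (yc * lut[yb >> 8]) >> 12;
}

/* Lerped look-up: blend entries n and n+1 by the 8-bit fraction of yb.
   The caller must make sure n+1 is still inside the table. */
static FIXED lam_lerp(const uint16_t *lut, FIXED yc, FIXED yb)
{
    int32_t n    = yb >> 8;
    int32_t frac = yb & 0xFF;
    int32_t a = lut[n], b = lut[n + 1];
    /* The table is non-increasing, so a - b >= 0 and the shift stays
       on non-negative values. */
    int32_t inv = a - (((a - b) * frac) >> 8);  /* .16 */
    return (yc * inv) >> 12;
}
```

The lerped version costs an extra table read, a multiply and a few shifts per scanline. Given that no visible difference showed up in testing, the direct version is the obvious pick.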
So yeah, this is what I should have figured out years ago, but somehow kept overlooking it. I'm not sure if I'll add this whole thing to Tonc's text and code, but I'll at least put up a link to here. Thanks Ruben, for showing me how to do this properly.