Or: everything you never wanted to know about arctangent implementations and weren't afraid to not ask.
1 Introduction
Whenever rotations or directions are required in games, the usual way to describe them is with an angle. Together with a magnitude, r, the angle φ can be converted into x and y coordinates through the sine and cosine functions (see Eq 1). Since the process is so common, for games, pretty much every document dealing with trigonometry explain how this works and give a decent way of implementing these functions (i.e., look-up tables or LUTs).
Something usually not covered is the inverse operation: turning Cartesian coordinates into angles. This requires the arctangent function (see Eq 1, right-hand side). Even though this type of conversion not as necessary as the angle→coordinate-pair conversion (and in fact less necessary than most people think; much of the calculations can be done with vector math instead), it is still somewhat unfortunate that this is neglected. Not just unfortunate for the lack of symmetry, but also because the process is rather more tricksy than using the standard trigonometry functions.
In this document, I will discuss several (yes, there is more than one) implementations for the arctangent function. My main targets here are GBA and NDS so these will be integer-based, but the methods are general enough to work for floating-point as well. I'll discuss the advantages and disadvantages of each in terms of accuracy, speed and size so that people can see the different options, make an informed decision what method will work best for them and won't have to worry about it again. Ever.
(1) |
In Fig 1 are shown the standard trigonometry functions and their relation to the angle φ. If you travel along the unit circle, cos(φ) marks the x-coordinate, sin(φ) marks the y-coordinate and tan(φ) is the y-coordinate is you extend the radial line to x=1 (which effectively scales it by 1/cos(φ). In other words, tan(φ) = sin(φ)/cos(φ) ). The arctangent is the inverse of the tangent function.
Fig 2 plots these as functions of φ. Note that all of them are periodic; this is nice because functions with bound domains are perfect for LUTs^{(1)}.
However, the arctangent, plotted in Fig 2, has several issues. Firstly, it goes on forever in both x directions. Since an infinite amount of memory can't be bought in the shops just yet, a trivial LUT-based solution would seem out of the question. Secondly, the argument of the arctan isn't just x or y; it's y/x. Depending on the hardware, divisions can be a little slow. Thirdly, the arctan's range is only [−½π, ½π] : half a circle and not a full one. This is also a consequence of the division, since (−y)/(−x) = y/x.
Of course, none of these problems is insurmountable, but it does mean that the solution will involve a little more than just lookup.
2 Arctangent methods.
There are either methods that I'd like to discuss. These fall into three broad categories: LUT-based, series expansion, and a cute little method known as CORDIC.
- Basic lookup.
- Lookup + interpolation.
- Inverse-lookup + interpolation.
- Standard Taylor series.
- GBA series expansion.
- Home-grown series expansion.
- Approximation by sines.
- CORDIC.
Before discussing the specifics, let's see what the interface for the arctangent will be like. Now, because I intend the function to be used in FPU-less environments, the parameters and the return value will be integers; preferably scaled to some convenient fixed point number for accuracy.
There will be two parameters, an x and an y value. The
standard name for a two-argument arctan is atan2()
, so I'll
use that name as well. As a division of the two coordinates is involved,
the position of the fixed point is actually not important, but let's
take it as 12, so that 1.0 corresponds to 0x1000
. The
return value is an integer with a 2π range. Whether you use
[0, 2π) or something like [−π, π) doesn't
really matter thanks to the periodicity of angles, but because it'll
make the calculations easier, I will use the former. Also, since the
number of angles in a circle is essentially arbitrary as well, we will
choose something convenient for computers, like π≡0x4000; in
other words, 0x8000=32768 ‘degrees’ for a full circle. The units
for this power-of-two circle division is often called a brad,
which stands for binary radian. In this case we have
0x4000 brad = π rad = 180 degrees.
With these things in mind, we have the following constants for our code:
static const uint BRAD_HPI= BRAD_PI/2, BRAD_2PI= BRAD_PI*2;
static const uint ATAN_ONE = 0x1000, ATAN_FP= 12;
The fact that these constants correspond to libnds' units for trig is, of course, entirely accidental.
Next up: getting a full circle's worth of angles from a function that technically only returns half a circle. This is done by using the symmetry properties of the circle; most notably that a positive y uses the range [0, π) and negative y means [π, 2π) angles.
Two other axes of symmetry can be used as well: the x=0 and y=x lines. Through these, you can restrict the actual you need to calculate to a single octant (see Fig 4). The nice thing about bringing it to a single octant is that you will end up at a place where quotient of x and y is below 1.0 – in other words, get the argument of the actual arctan down to a finite domain instead of an infinite one.
The process is illustrated by Fig 5. Requested is the angle to point p=p_{0}. Successive rotations can bring this point into the first quadrant (Fig 5a and b). If necessary, rotate the point by π to bring it into the top hemisphere, then rotate by ½π to get to the first quadrant. We need to keep track of the performed rotations in the form of φ_{i}.
Once inside the first quadrant, two procedures can be followed (see Fig 5c and d). Starting with the latter, notice that for calculating angle β, you can use the quotient x/y rather than y/x if the point is in the second octant. The angle α follows from α = ½π−β. The other method (Fig 5c) is to continue rotating: if x>y, rotate by −¼π. Now, if you look at Eq 2, you might think this is a bad idea thanks to the division by √2. However, this scaling doesn't matter, because the coordinate division will remove any trace of it anyway!
Because it will reduce the complexity of the code, I will use the method from Fig 5c.
(2) |
The basic framework for the atan2()
functions will use the
following scheme. First, return early if x and y are both 0.
If not, the coordinates will be rotated in till the fall into the first
octant and a preliminary angle is constructed as an offset to the real
octant. After this initial set-up, the different functions will calculate
the arctangent according to their specific methods, which is added to the
earlier angle and then returned. A template for the atan2()
is shown below.
#define OCTANTIFY(_x, _y, _o) do { \
int _t; _o= 0; \
if(_y< 0) { _x= -_x; _y= -_y; _o += 4; } \
if(_x<= 0) { _t= _x; _x= _y; _y= -_t; _o += 2; } \
if(_x<=_y) { _t= _y-_x; _x= _x+_y; _y= _t; _o += 1; } \
} while(0);
// QDIV stands for the fixed-point division method most appropriate for
// your system. Modify where appropriate.
// This would be for NDS.
static inline int QDIV(int num, int den, const int bits)
{
while(REG_DIVCNT & DIV_BUSY);
REG_DIVCNT = DIV_64_32;
REG_DIV_NUMER = ((int64)num)<<bits;
REG_DIV_DENOM_L = den;
while(REG_DIVCNT & DIV_BUSY);
return (REG_DIV_RESULT_L);
}
// Basic setup for atan2. MY_ARCTAN handles the specifics of each method.
uint atan2Base(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi, dphi;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
// --- <specifics> ---
t= QDIV(y, x, ATAN_FP);
dphi= MY_ARCTAN(t);
// --- </specifics> ---
return phi + dphi;
}
2.1 atanLookup : simple lookup
Since by reducing the region of calculation to the first octant we do have a bound domain, look-up tables may be of use after all. This particular LUT needs to have values for t = y/x∈[0, 1] (i.e, φ∈[0, ¼π]) Note that 1.0 is included in this range, as this will required for the next method. I'll put the entry for 1.0 at position 128, so that the indices are effectively Q7 numbers. For accuracy, I'll scale each entry so that as far as the LUT is concerned, ¼π ≡ 0x8000. This will be shifted down at the end of the calculations.
The LUT and the code for atan2Lookup()
are given below.
I've also added some other useful constants indicating the
‘stride’ of the LUT entries; that is, the effective width of
each LUT-entry. Note that I'm using an actual division for the
final transition to the BRAD_PI
scale. As long as the
numerator is unsigned, this won't give any trouble.
static const uint ATANLUT_STRIDE = ATAN_ONE / 0x80, ATANLUT_STRIDE_SHIFT= 5;
// Arctangens LUT. Interval: [0, 1] (one=128); PI=0x20000
const unsigned short atanLUT[130] ALIGN(4)=
{
0x0000,0x0146,0x028C,0x03D2,0x0517,0x065D,0x07A2,0x08E7,
0x0A2C,0x0B71,0x0CB5,0x0DF9,0x0F3C,0x107F,0x11C1,0x1303,
0x1444,0x1585,0x16C5,0x1804,0x1943,0x1A80,0x1BBD,0x1CFA,
0x1E35,0x1F6F,0x20A9,0x21E1,0x2319,0x2450,0x2585,0x26BA,
0x27ED,0x291F,0x2A50,0x2B80,0x2CAF,0x2DDC,0x2F08,0x3033,
0x315D,0x3285,0x33AC,0x34D2,0x35F6,0x3719,0x383A,0x395A,
0x3A78,0x3B95,0x3CB1,0x3DCB,0x3EE4,0x3FFB,0x4110,0x4224,
0x4336,0x4447,0x4556,0x4664,0x4770,0x487A,0x4983,0x4A8B,
// 64
0x4B90,0x4C94,0x4D96,0x4E97,0x4F96,0x5093,0x518F,0x5289,
0x5382,0x5478,0x556E,0x5661,0x5753,0x5843,0x5932,0x5A1E,
0x5B0A,0x5BF3,0x5CDB,0x5DC1,0x5EA6,0x5F89,0x606A,0x614A,
0x6228,0x6305,0x63E0,0x64B9,0x6591,0x6667,0x673B,0x680E,
0x68E0,0x69B0,0x6A7E,0x6B4B,0x6C16,0x6CDF,0x6DA8,0x6E6E,
0x6F33,0x6FF7,0x70B9,0x717A,0x7239,0x72F6,0x73B3,0x746D,
0x7527,0x75DF,0x7695,0x774A,0x77FE,0x78B0,0x7961,0x7A10,
0x7ABF,0x7B6B,0x7C17,0x7CC1,0x7D6A,0x7E11,0x7EB7,0x7F5C,
// 128
0x8000,0x80A2
};
// Basic lookup for atan2. Returns [0,2pi), where pi ~ 0x4000.
uint atan2Lookup(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t = QDIV(y, x, ATAN_FP);
return phi + atanLUT[t/ATANLUT_STRIDE]/8;
}
2.2 atanLerp : LUT + linear interpolation
The LUT is fairly coarse. The gaps between the LUT entries are roughly 0x100 in size, corresponding to about 0.7 of a degree. We can get considerably better accuracy (in the order of 2% of a degree) by using linear interpolation (lerp for short) between the LUT entries surrounding the quotient t (see tonc:lerp for the basic process). The essence is that t lies between LUT-points (t_{a}, φ_{a}) and (t_{b}, φ_{b}). From the basic description of a line, we can derive that
(3) |
The cute thing about lerping for LUTs is that, if one assumes t is a fixed-point number, then t_{b} = t_{a}+1.0 and the operations with t in them boil down to a division and modulo by a power of two. The code is then becomes as follows.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Lerp(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi;
uint t, fa, fb, h;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, ATAN_FP);
h= t % ATANLUT_STRIDE;
fa= atanLUT[t/ATANLUT_STRIDE ];
fb= atanLUT[t/ATANLUT_STRIDE+1];
return phi + ( fa + ((fb-fa)*h >> ATANLUT_STRIDE_SHIFT) )/8;
}
2.3 atanInvLerp : Inverse lookup
If you're strapped for space but already have a tangent LUT available,
you can also try to retrieve the arctangent by a sort of inverse lookup.
The parameter of the arctangent is actually the tangent value of the
angle we're looking for, so by searching through the tangent LUT for
this value, we can find the two LUT-entries that bracket
our angle. Because the tangent is a monotonously increasing function, the
bracketing can be done via a binary search. The lutBracket()
presented below does just this, and returns the LUT-index of the
lower bound. This value and the next are used to once again linearly
interpolate to the desired angle.
This will probably not be as fast as the direct lerp. Firstly, the bracketing itself takes O(log N) time, which is probably slower than the O(1) that direct look-up uses. Also, it requires an extra division because the range to divide by is the distance between the LUT-values (φ_{b}−φ_{a}) and not the LUT-indices.
static const uint TANLUT_STRIDE= BRAD_PI/TANLUT_PI, TANLUT_STRIDE_SHIFT= 6;
static const uint TANLUT_SIZE= 129, TANLUT_FP= 16;
// Tangens LUT, domain: [0, PI/2]; PI= 0x100, Q16 values.
// tan(PI/2) set to 400.0.
static const unsigned int tanLUT[129]=
{
0x000000,0x000324,0x000649,0x00096E,0x000C94,0x000FBA,0x0012E2,0x00160C,
0x001937,0x001C64,0x001F93,0x0022C5,0x0025F9,0x002931,0x002C6C,0x002FAA,
0x0032EC,0x003632,0x00397D,0x003CCC,0x004020,0x004379,0x0046D8,0x004A3D,
0x004DA8,0x00511A,0x005492,0x005812,0x005B99,0x005F28,0x0062C0,0x006660,
0x006A0A,0x006DBD,0x00717A,0x007542,0x007914,0x007CF2,0x0080DC,0x0084D2,
0x0088D6,0x008CE7,0x009106,0x009534,0x009971,0x009DBE,0x00A21C,0x00A68C,
0x00AB0E,0x00AFA3,0x00B44C,0x00B909,0x00BDDD,0x00C2C7,0x00C7C9,0x00CCE3,
0x00D218,0x00D768,0x00DCD4,0x00E25E,0x00E806,0x00EDD0,0x00F3BB,0x00F9CB,
// 64
0x010000,0x01065D,0x010CE3,0x011394,0x011A74,0x012184,0x0128C6,0x01303F,
0x0137F0,0x013FDD,0x014809,0x015077,0x01592D,0x01622E,0x016B7E,0x017523,
0x017F22,0x018980,0x019445,0x019F76,0x01AB1C,0x01B73F,0x01C3E7,0x01D11F,
0x01DEF1,0x01ED6A,0x01FC96,0x020C84,0x021D44,0x022EE9,0x024187,0x025534,
0x026A0A,0x028026,0x0297A8,0x02B0B5,0x02CB79,0x02E823,0x0306EC,0x032816,
0x034BEB,0x0372C6,0x039D11,0x03CB48,0x03FE02,0x0435F7,0x047405,0x04B940,
0x050700,0x055EF9,0x05C35D,0x063709,0x06BDD0,0x075CE6,0x081B98,0x09046E,
0x0A2736,0x0B9CC6,0x0D8E82,0x1046EA,0x145B00,0x1B2672,0x28BC49,0x517BB6,
// 128
0x01900000
};
//! Get a tangent value as a Q12 fixed-point number.
int itan(int x)
{
uint ux= (uint)x % BRAD_PI;
uint xa= ux / TANLUT_STRIDE;
int ya, yb, h= ux % TANLUT_STRIDE;
if(ux <= BRAD_HPI)
{
ya= tanLUT[xa ];
yb= tanLUT[xa+1];
}
else
{
ya= -tanLUT[TANLUT_PI-xa ];
yb= -tanLUT[TANLUT_PI-xa-1];
}
return (ya + ((yb-ya)*h >> TANLUT_STRIDE_SHIFT))>>4;
}
//! Function to bracket a value in LUT.
/*! Consider a monotonously increasing LUT (or sorted array), with
two adjacent value form a bin. This function will find the
bin in which \a key can be found.
\param key Value to bracket.
\param array Array to search.
\param size Size of the array.
\return The low bracket of the bin.
*/
static inline uint lutBracket32(int key, const int array[], uint size)
{
uint i, low=0, high= size-1;
while(low+1<high)
{
i= (low+high)/2;
if(key < array[i])
high= i;
else
low= i;
}
return i;
}
// atan2 using inverse lookup and linear interpolation.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2InvLerp(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi, fa, dphi, ta, tb;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, TANLUT_FP);
fa= lutBracket(t, tanLUT, TANLUT_SIZE);
ta= tanLUT[fa ];
tb= tanLUT[fa+1];
dphi= fa*TANLUT_STRIDE + QDIV((t-ta)*TANLUT_STRIDE, tb-ta, 0);
return phi + dphi;
}
2.4 atanTaylor : Taylor series expansion
A method that does not rely on look-up tables at all is series expansion. There is a mathematical process known as Taylor Series expansion, by which you can approximate functions adding the right sum of its derivatives. When in games you extrapolate the current position, velocity and acceleration to the next position, you are in a way making use of this process. I'll spare you the derivation of the Taylor series for the arctangent and just write it down.
(4) |
Now, although Eq 4 is true, for calculation purposes it's better to write it as a backwards recurrence relation; one that ends at i=n. This is better for the stability of the algorithm comes down to a simpler for-loop.
(5) |
Now, this equation only really works for t≤1; above it things
will diverge. This is alright though, because that's always true for
the first octant. As is, Eq 4 and
Eq 5, t ∈ [0, 1] will translate
to domain φ ∈ [0, ¼π], but this is with the real
value for π; not the one we've adopted for our integer-based
trigonometry. to account for that, we need to scale things up by
BRAD_PI/π. For the accuracy of the calculations, we can even scale it
up a little further and shift back down after everything's done. This
gives us the following code. The constant base
is
8*BRAD_PI/π.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Taylor(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
static const int fixShift= 15, base=0xA2F9;
int i, phi, t, t2, dphi;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, fixShift);
t2= -t*t>>fixShift;
dphi= 0;
for(i=15; i>=1; i -= 2)
{
dphi = dphi*t2>>fixShift;
dphi += QDIV(base, i, 0);
}
return phi + (dphi*t >> (fixShift+3));
}
In this particular case, I've chosen for a 15th (n=7) order approximation. This works out quite well for most of the range, but not near t=1! And unless you really use an infinite number of terms, it never will be; the error will be around the next term in the expansion, i.e. 1/(2n+3). This is something you just cannot avoid.
2.5 atan2Gba : taylor series, GBA implementation
Or can you?
While Eq 4 is indeed the correct Taylor expansion for the arctangent, it requires an infinite number of terms; something we simply cannot have in programming. This causes the problems around t=1. However, there's no rule against breaking it off early and adjusting the terms a little to compensate for the error. Finding the best terms for the job is more of an art than a science, because many different combinations will work.
One of these sets can be found in the GBA implementation of ArcTan. It basically uses the process of Eq 5 with n=7, but with the following sequence of multipliers instead of 1/(2i+1).
order | scaled | approximation |
---|---|---|
1 | 0xA2F9 | 0.99999 (1/1) |
3 | 0x3651 | 0.33328 (1/3) |
5 | 0x2081 | 0.19944 (1/5) |
7 | 0x16AA | 0.13906 (5/36) |
9 | 0x0FB6 | 0.09640 (8/83) |
11 | 0x091C | 0.05589 (9/161) |
13 | 0x0390 | 0.02186 (4/183) |
15 | 0x00A9 | 0.00405 (8/1974) |
Interestingly, this works very, very well. It usually underestimates the real value by only 1 on a 0x8000 range. The function itself is given below. Because the number of terms is known and only 7, the loop may just as well be unrolled. Again, the calculation scale is 8*BRAD_PI/π, hence the additional shift by 3 at the end.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Gba(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
static const int fixShift= 15;
int phi, t, t2, dphi;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, fixShift);
t2= -t*t>>fixShift;
dphi= 0x00A9;
dphi= 0x0390 + (t2*dphi>>fixShift);
dphi= 0x091C + (t2*dphi>>fixShift);
dphi= 0x0FB6 + (t2*dphi>>fixShift);
dphi= 0x16AA + (t2*dphi>>fixShift);
dphi= 0x2081 + (t2*dphi>>fixShift);
dphi= 0x3651 + (t2*dphi>>fixShift);
dphi= 0xA2F9 + (t2*dphi>>fixShift);
return phi + (dphi*t >> (fixShift+3));
}
2.6 atanTonc : homegrown Taylor expansion
The function above works out really well, but can be cut down on the terms even further? Yes, we can. I modeled these equations in Excel (interactive graphing FTW!), fiddled around for a while and came up with this list.
order | scaled | approximation |
---|---|---|
1 | 0xA2FC | 1.00006 (1/1) |
3 | 0x364C | 0.33316 (1/3) |
5 | 0x1F0B | 0.19048 (4/21) |
7 | 0x1029 | 0.09916 (12/121) |
9 | 0x0470 | 0.02723 (7/257) |
Excel is a very nice platform for finding out if your mathematical algorithms work and getting the optimal terms to use for them. What you do is set-up a decent number of points to evaluate and plot the difference between the real numbers and your model and try to minimize. Create a target value (like sum of squares or errors or min/avg/max data) and fiddle with points till you get something worthwhile. If the case of floating point values, you can even let Excel do it for you by using the Solver Addin. Because that doesn't work for integers, I just plotted the errors and fiddled until the errors for nearly all points were between ±0.5.
These terms result in roughly the same kind of error profile as the GBA
method – somewhat fewer errors even. And it uses only 5 terms
instead of 8 :). The code for it is below. I'm calling it
atan2Tonc()
because, well, I have to call it something. Also
note the +4 at the end; this is used for rounding off to the nearest
integer. This step is not essential, but does help reduce the average error.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Tonc(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
static const int fixShift= 15;
int phi, t, t2, dphi;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, fixShift);
t2= -t*t>>fixShift;
dphi= 0x0470;
dphi= 0x1029 + (t2*dphi>>fixShift);
dphi= 0x1F0B + (t2*dphi>>fixShift);
dphi= 0x364C + (t2*dphi>>fixShift);
dphi= 0xA2FC + (t2*dphi>>fixShift);
dphi= dphi*t>>fixShift;
return phi + ((dphi+4)>>3);
}
2.7 atan2Sin : sine approximation
This next method isn't actually as accurate under the present circumstances as the other ones, but it's just too skanky to be left unmentioned. If you look at the arctangent over the interval [−1, 1], you get an anti-symmetric curve with a slope that is +1 at t=0 and decreases with increasing t. You know what else has those characteristics? A sine wave. By using the right amplitude and frequency of a sine, you can actually approximate an arctangent to within about two degrees. Adding another sine with carefully chosen terms gets you an accuracy of about 3% of a degree.
(6) |
term | int value |
---|---|
A_{1} | 0x14FF |
k_{1} | 9/8 |
A_{2} | 0x007D |
k_{2} | 37/8 |
Finding the right terms is mostly a matter of trial and error. Once
you have reasonably numbers, you should be able to use any sine
generator to get the approximation. Below you can find both the
atan2Sin()
function and the sine generator (which in
this case is a little slow, but has a low memory footprint).
const unsigned short sinLUT[130] ALIGN(4)=
{
0x0000,0x0192,0x0324,0x04B6,0x0648,0x07D9,0x096B,0x0AFB,
0x0C8C,0x0E1C,0x0FAB,0x113A,0x12C8,0x1455,0x15E2,0x176E,
0x18F9,0x1A83,0x1C0C,0x1D93,0x1F1A,0x209F,0x2224,0x23A7,
0x2528,0x26A8,0x2827,0x29A4,0x2B1F,0x2C99,0x2E11,0x2F87,
0x30FC,0x326E,0x33DF,0x354E,0x36BA,0x3825,0x398D,0x3AF3,
0x3C57,0x3DB8,0x3F17,0x4074,0x41CE,0x4326,0x447B,0x45CD,
0x471D,0x486A,0x49B4,0x4AFB,0x4C40,0x4D81,0x4EC0,0x4FFB,
0x5134,0x5269,0x539B,0x54CA,0x55F6,0x571E,0x5843,0x5964,
// 64
0x5A82,0x5B9D,0x5CB4,0x5DC8,0x5ED7,0x5FE4,0x60EC,0x61F1,
0x62F2,0x63EF,0x64E9,0x65DE,0x66D0,0x67BD,0x68A7,0x698C,
0x6A6E,0x6B4B,0x6C24,0x6CF9,0x6DCA,0x6E97,0x6F5F,0x7023,
0x70E3,0x719E,0x7255,0x7308,0x73B6,0x7460,0x7505,0x75A6,
0x7642,0x76D9,0x776C,0x77FB,0x7885,0x790A,0x798A,0x7A06,
0x7A7D,0x7AEF,0x7B5D,0x7BC6,0x7C2A,0x7C89,0x7CE4,0x7D3A,
0x7D8A,0x7DD6,0x7E1E,0x7E60,0x7E9D,0x7ED6,0x7F0A,0x7F38,
0x7F62,0x7F87,0x7FA7,0x7FC2,0x7FD9,0x7FEA,0x7FF6,0x7FFE,
// 128
0x8000,0x7FFE
};
//! Get a sine value as a Q12 fixed-point number.
int isin(int x)
{
int h, ya, yb, y;
uint ux, quad;
ux= (uint)x%BRAD_2PI;
h= ux % SINLUT_STRIDE;
quad= ux/SINLUT_STRIDE / SINLUT_HPI;
ux= ux/SINLUT_STRIDE % SINLUT_HPI;
if(quad & 1)
{
ya= sinLUT[SINLUT_HPI-ux ];
yb= sinLUT[SINLUT_HPI-ux-1];
}
else
{
ya= sinLUT[ux ];
yb= sinLUT[ux+1];
}
y=(ya + ((yb-ya)*h >> SINLUT_STRIDE_SHIFT))>>3;
return (quad & 2) ? -y : +y;
}
//! Get a cosine value as a Q12 fixed-point number.
int icos(int x)
{
return isin(x + BRAD_HPI);
}
// atan2 via added sines.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Sin(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi, dphi;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, ATAN_FP);
dphi= 0x14FF*isin(9*t/8) + 0x7D*isin(37*t/8);
return phi + (dphi>>12);
}
2.8 atan2Cordic : CORDIC
CORDIC stands for COordinate Rotation DIgital Computer. This is a really nifty procedure that allows you to calculate trig functions through successive coordinate rotations. It is effectively a search routine through angular space, as illustrated by Fig 6. You start at some point p and towards the positive x-axis by some angle α. This means rotate (and scale; see Eq 7) by +α if p_{y}≥0, or by −α if p_{y} is negative. Repeat this process with increasingly smaller angles. The full angle between p and the x-axis is the sum of the αs.
The real beauty in the CORDIC procedure, however, is in the way
the angles are chosen. The first guess for such a procedure would
be bisection: divide angles by two. This is what the
OCTANTIFY()
macro did. However, once you get down to
π/8 you run into the problems that the elements of the rotation
matrix (sines and cosines) don't give neat binary fractions anymore.
Instead of bisection, the CORDIC procedure divides the angles so
that the elements of the matrix do give nice powers of two.
The angles themselves are then simply looked up from a table.
Eq 7 shows the derivation of the process. The cosines are taken out of the matrix itself, reducing the matrix to ones and tangents. Using tan(α) = 2^{-i} for the iterations, this further reduces to shifts and adds and a few angle lookups. Note: there are no divisions here! Now, technically you'd still have to multiply the cosine scalings but, like before, those would be divided out anyway and so can be safely ignored.
(7) |
For the implementation, you need to use a lookup table with the arctan
terms for α_{i}. Since we're working with 15bit angles,
we will need about 15 terms for it – though note that the first 3
terms have been taken care of already though OCTANTIFY()
.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Cordic(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
// Scale up a bit for greater accuracy.
if(x < 0x10000)
{
x *= 0x1000;
y *= 0x1000;
}
// atan(2^-i) terms using PI=0x10000 for accuracy
const u16 list[]=
{
0x4000, 0x25C8, 0x13F6, 0x0A22, 0x0516, 0x028C, 0x0146, 0x00A3,
0x0051, 0x0029, 0x0014, 0x000A, 0x0005, 0x0003, 0x0001, 0x0001
};
int i, tmp, dphi=0;
for(i=1; i<12; i++)
{
if(y>=0)
{
tmp= x + (y>>i);
y = y - (x>>i);
x = tmp;
dphi += list[i];
}
else
{
tmp= x - (y>>i);
y = y + (x>>i);
x = tmp;
dphi -= list[i];
}
}
return phi + (dphi>>2);
}
There are two additional points of note for this routine. Firstly, notice that I've scaled the sub-angles a bit for a higher accuracy. More importantly, though, I've scaled the coordinates up before going into the loop. This is necessary for accuracy, otherwise you might hit the y=0 line too soon.
Finally, ecurtz presented a nice, unrolled ARM assembly version on the GBAdev forum in thread 3840. It's designed for π≡0x0400 instead of 0x4000 so it won't be as accurate, but it's still quite pretty.
3 Results and discussion
3.1 Accuracy
The tests are based on Eq 1. I start with all
angles from 0 to 0x8000, convert those to coordinates using convenient
(co)sine routines designated icos()
and isin()
,
and see how far away we are from the original angle. As could be
expected, the errors repeat after ¼π, so I'll only present
the first octant.
Fig 7 shows the errors E for a selection of angles. The vertical axis shows the numbers as Q12 numbers, so a 1 would correspond to 1/4096^{th}. Because the lines are so jumpy and overlap often, I've added a small offset to keep all the lines visible. For the real errors, round to the nearest integer.
There is a huge jumble around E=0 and E=−1. This is
a good thing, because it means that we either got the original angle
back exactly or are off by one – something which is entirely
understandable considering the nature of arctan and the method of
testing. The functions atan2Gba()
and
atan2Tonc()
look particularly good, with
atan2Cordic()
being only slightly behind because it has
a few 2s in some places. The linear interpolators,
atan2Lerp()
and atan2InvLerp()
are only
slightly behind. The former is nearly always 1 or 2 below what it
should be, while the latter hovers around 0 except for near the
φ=¼π mark, where it overshoots by 2 at times.
The other three methods have visibly higher inaccuracies. The standard
Taylor method, atan2Taylor()
, is alright until about
t≈¾π, but then goes horribly
wrong, taking a nosedive to −110 (≈2°). Without
interpolation, atan2Lut()
underestimates the angle by about
16 (≈0.4°). The error for the sine-approximation method,
atan2Sin()
starts off at roughly −4 and then
approaches zero after φ≈π/8. Adding a bias of 2 might
benefit this method, making it center around 0. I think it might be
further improved by using a higher fixed-point fraction for the sines.
lut | lerp | ilerp | taylor | gba | tonc | sin | cordic | |
---|---|---|---|---|---|---|---|---|
max | 1 | 2 | 3 | 1 | 1 | 2 | 3 | 2 |
avg | -17.083 | -0.506 | 0.078 | -6.989 | -0.614 | -0.129 | -1.950 | -0.502 |
min | -43 | -3 | -2 | -163 | -2 | -2 | -6 | -3 |
stdev | 10.309 | 0.855 | 1.064 | 21.857 | 0.799 | 0.787 | 1.637 | 0.889 |
#off | 32443 | 20277 | 20827 | 22982 | 20161 | 18627 | 28213 | 20545 |
Table 4 shows some additional data about errors of each
method, based on all available angles. The max, avg and min rows
contain the maximum, average and minimum errors. Zero would be the
preferred result here, but small deviations are acceptable. For those
methods where the average is not between 0.5 and −0.5, you can
add a bias to make it so. The standard deviation (stdev, σ) is a
measure of how far from the average the set of points are (the
range ±1σ contains 68% of the data points). With the
exception of atan2Lut()
and atan2Taylor()
,
all routines do pretty well. The final row (#off) shows how many
points do not give the original angle back. Even in the best case,
atan2Tonc()
, you're still off a little half the time.
3.2 Costs in time and space
Next, we'll take a look at how much each method costs in terms of
processor time and their memory footprints. The sizes can simply be
read from the mapfile generated by the linker. For the time spend
calculating, I took a selection of angles from the whole circle,
converted these to coordinates and used the various arctan2 method
to calculate the angle again. The total time for everything is was
then divided by the number of iterations to give an average time
for each method. To reduce the measured time to just that of the
arctan code itself, the time for an empty arctan procedure
(atan2Null()
) was also taken and subtracted from the
others.
The most important part of the test can be found in the snippet below.
// Main part of test procedure:
// --------------------------------------------------------------------
// Find the testing overhead.
profileStart();
for(phi=0; pihi<BRAD_2PI; phi += 32)
atan2Null(icos(phi), isin(phi));
nullTime= profileStop() - procNullTime;
// Get the actual time for each method, sans overhead.
for(i=0; i<N_METHODS; i++)
{
profileStart();
for(phi=0; phi<BRAD_2PI; pi += 32)
atan2s[i].proc(icos(phi), isin(phi));
atan2s[i].time= (profileStop() - nullTime)/(BRAD_2PI/32);
}
// --------------------------------------------------------------------
// Some more versions for analysis' sake
// --------------------------------------------------------------------
// Just the form
uint atan2Null(int x, int y)
{
return x;
}
// With octants
uint atan2Oct(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
return phi;
}
// With octants+div
uint atan2OctDiv(int x, int y)
{
if(y==0) return (x>=0 ? 0 : BRAD_PI);
int phi;
uint t;
OCTANTIFY(x, y, phi);
phi *= BRAD_PI/4;
t= QDIV(y, x, ATAN_FP);
return phi+t;
}
This test was run on both the GBA and NDS in two ways. As ARM or Thumb code
on the NDS, and ARM-in-IWRAM or Thumb-in-ROM on the GBA. The other relevant
factors were -O2
with devkitARM r24 and default wait-states. For
the division, the hardware 64/32 div was used for NDS and
Table 5 lists the results. As mentioned above, an empty
function called atan2Null()
was used to calculate the test
overhead, which was subtracted from the times of the others. This is why
the time atan2Null()
's cycle time can be higher than the
rest. Two nearly empty functions, atan2Oct()
and
atan2OctDiv()
were also used to analyze where most of the
time is spent inside each routine. For the GBA, the BIOS call
arcTan2()
has also been included for comparison.
time (NDS) | time (GBA) | size (NDS) | ||||
---|---|---|---|---|---|---|
Thumb | Arm | ROM/Thumb | IWRAM/ARM | Thumb | ARM | |
atan2Null | 60 | 44 | 448 | 204 | 2 | 4 |
atan2Oct | 13 | 7 | 89 | 15 | 62 | 76 |
atan2OctDiv | 94 | 81 | 336 | 258 | 140 | 188 |
atan2Lookup | 95 (1) | 82 (2) | 370 (34) | 270 (12) | 152+260 | 208+260 |
atan2Lerp | 100 (1) | 86 (6) | 417 (81) | 288 (30) | 168+260 | 236+260 |
atan2InvLerp | 226 (132) | 198 (118) | 1030 (694) | 621 (363) | 252+516 | 344+516 |
atan2Taylor | 725 (631) | 626 (546) | 2961 (2625) | 2467 (2209) | 272 | 296 |
atan2Gba | 123 (29) | 105 (25) | 584 (248) | 348 (90) | 232 | 328 |
atan2Tonc | 114 (20) | 97 (17) | 520 (184) | 332 (74) | 204 | 288 |
atan2Sin | 155 (61) | 126 (46) | 820 (484) | 383 (125) | 184+260 | 264+260 |
atan2Cordic | 123 (29) | 120 (40) | 877 (541) | 312 (54) | 152 | 188 |
ArcTan2 | - | - | 352 | 360 | - | - |
Speed
As was to be expected, the lookup method is the fastest, closely
followed by the linear interpolator atan2Lerp()
and then
atan2Gba()
and atan2Tonc()
.
Doing a proper
Taylor expansion (atan2Taylor()
) is very slow,
and completely not worth the effort – especially considering it
has accuracy problems as well. Using an inverse lookup like
atan2InvLerp()
does doesn't seem to be that great either
in terms of speed.
I'm somewhat disappointed by the performance of
atan2Sin()
, actually. I'd expected it to be quicker.
the reason for this is that approximation by sines lives or dies by how
well fast your sine function is. This can mean a 60/200 difference
per sine for NDS/GBA, which is a sizable chunk of the total time. The
implementation I've used in section 2.7 is
okay, but using a full LUT instead of a partial one would be
considerably faster still. Or you can use assembly for math routines,
because the compiler just doesn't seem to understand how to set-up a
fast indexed memory access. You could probably near
atan2Lerp()
speeds if you use the most simple sine
implementation (i.e., a single lookup), but you'd lose a fair bit of
accuracy when doing so.
The results for the CORDIC case are interesting, especially when
compared to te other methods. On the NDS, this is a fast method and there
is next to no difference between Thumb and ARM code. The GBA times,
however, are very different. In ARM/IWRAM it's pretty fast; in ROM/Thumb,
it's very slow. The reason for it's relative quickness is the absence of
divisions. As you can see from the difference between
atan2Oct()
and atan2OctDiv()
the division
(which is done via the BIOS Div()
accounts for about 240
cycles in the GBA's case. Most GBA implementations are in fact
dominated by the division. The reason for ROM/Thumb's ghastly performance
is ... well I don't know actually. But it's probably related to the
many jumps and loads, which are both killers if done in/from ROM.
Speed per system and code/memory type
Before discussing the different columns representing the different
platform specs, it's instructive to look at the three overhead functions,
atan2Null()
, atan2Oct()
, and
atan2OctDiv()
. These represent how much time is spend in
the non-arctan part of the loop, the OCTANTIFY()
macro
and the macro+division, respectively. because the sine/cosine functions
aren't that fast, the test overhead is substantial. While this is
subtracted from the other results, there's always the chance that it
messes up some of the timings so consider yourself warned.
The difference
between atan2Oct()
and atan2OctDiv()
gives
you an estimate on how long the division takes; this comes down to
about 70-80 cycles for the NDS and 240 for the GBA. This is one
of the main differences between the two systems and illustrates the
importance of a good divider. Incidentally,
GBATek:math
lists the 64/32 division time as being 34 cycles. The reason it's more
than double that here is probably due to the overhead of setting it
up in the first place. The higher figure is the one you'd have to
consider when using the hardware division, though.
Now, back to the different platforms. For a decent comparison, you
should take the factors discussed above out of the equation. The times
between parentheses are the differences with the
atan2OctDiv()
results and should represent these values for
the heart of each method. In most cases, it's pretty much what you'd
expect, but there are also some nice differences. The Thumb/ROM results
nicely show the wait-states that ARM/IWRAM doesn't have, causing a rough
3-4× speed difference. Where it's less, it's probably due to
ROM lookups and extra divisions. atan2Cordic()
stands
out as weird, though. It may be because the load are
done from IWRAM in the ARM/IWRAM case, but that's about all I can say.
Because of the NDS-cache, NDS-Arm and GBA-ARM/IWRAM should have identical wait-states and, unless someone is playing trick on me, identical code as well. So that there are differences here is may seem somewhat odd. However, there are two factors to consider here. The first is, as mentioned, that the GBA LUTs were still in ROM. The second is that the ARM9 that the NDS uses should have faster memory access^{(2)}. Lastly, NDS Thumb vs ARM. In general, ARM will be a faster, but use more memory. The division seems to be the main factor for the NDS as well, so code-type is not that important. I will say that manual ARM assembly should be faster still, and possibly use less memory than compiled Thumb code.
Size
In terms of size, obviously ARM code will be larger than Thumb, though
under proper optimization I believe that the difference should be smaller
than these numbers indicate. The smallest routine is CORDIC because
its loop isn't unrolled like atan2Gba()
and
atan2Tonc()
. The code for the LUT-based routine is small
as well, but adding the LUT size, these are effectively take up the most
space. These things considered, size for individual routines probably
don't matter too much anyway. As with speed, if you want the best,
do it in assembly.
Other considerations
If you need the routines faster, there are three things you can do.
First, get a better divider. This is the deciding factor in all
methods except CORDIC. If you have a really sucky DIV, use CORDIC.
Second, improve octant finding. All methods shown here use
OCTANTIFY()
to reduce to a single octant and
have the core method in one place. The alternative is to consider each
octant individually, with 8 cores. The code will be 8 times larger
and more complex, but should be a little faster – but not
that much faster, as atan2Oct()
shows that
the octant search is rather quick. Lastly, there's assembly.
For some reason, the compiler is rather sloppy for tight
mathematical algorithms, spilling registers and instructions that
could have been easily avoided. I estimate the core and octant
search can be sped up considerably; but because it's the division
that matters, the overall gain will probably be low.
For floating-point equivalents, the accuracy for the non-LUT based
routines can be improved by using different constants.
atan2Sin()
in particular can become better by not using
integers. Because the Excel Solver Add-in works for floating-point
units, finding the right coefficients shouldn't take more than a few
seconds once you have a suitable test environment set up.
4 Conclusions
The basic way to do an arctan2 function is to reduce the coordinates to a single octant and one of several methods to calculate the angle for that last octant. Two basic methods exist for this: CORDIC, which just keeps rotating, and everything else, which requires a division for y/x. If your division is really slow (which doesn't seem the case for NDS or even GBA), use CORDIC. If not, other methods tend to produce better results and be quicker about it.
Leaving CORDIC out of it, the fastest method is using a LUT – preferably a LUT combined with (linear) interpolation, because the gain in accuracy can be substantial. Do not use a pure Taylor approximation, because it's both slow and has accuracy issues. Modified truncated Taylor methods are alright, though; even with only five terms you can get a 15-bit accurate angle. Sines can approximate an arctan as well, but is ultimately too slow for a decent accuracy.
The routines presented here are fairly fast, but can be improved by hand-assembling the code, though ultimately the speed of division will dominate anyway. The only place where it may be of worth is the division-less CORDIC method, which if unrolled (see forum:3840) should be very fast indeed.
Notes:
- Well, almost perfect. The asymptote for tan(φ) at φ =π·(½+n) can be problematic, but for the most part it's not really a problem.
- The docs say 1N and N+S+I for ARM9 and ARM7, respectively, but I still have some doubts about that
Have you ever considered a Pade approximant ---it is used widely in practice. Mathematica has a nice function for calculation any order of Pade approximants, here's what I get for a (2,2) order expansion
atan[x_] = PadeApproximant[ArcTan[x], {x, 0, {2, 2}}]
x/(1+x^2/3)
In involves a division and a multiplication, aside from division by 3, which, for fixed point arithmetic, is a multiplication by a constant anyways.
The accuracy of the (2,2) expansion in the first octant is 2.1% in the worst case (whereas 7th order Taylor expansion is %0.65)
Plot[100 (atan[x] - ArcTan[x])/ArcTan[x], {x, 0, Pi/4}]
Here's the plot http://img18.imageshack.us/my.php?image=atanpade22.jpg
A higher order expansion yields (x + 4x^3)/(1 + 3x^2/5) which yields %0.25 at worst, but it seems quite costy.
By the way, if you can live with a minor discontinuity in your atan, you can patch together various Pade approximants centered differently, all of order (2,2). For instance, you can use zero-centered approximant x/(1+x^2/3) for x<0.23 and 0.5 centered approximant
(0.463648 + 1.17092(-0.5 + x) + 0.493095 (-0.5 + x)^2) / (1.0000000000000000 + 0.8(-0.5 + x) + 0.373333(-0.5 + x)^2) which comes with slight overhead (5 mul, 1 div whereas the previous one was 2 mul, 1 div). Note that 0.493095 can be approximated to 1/2, reducing a multiplication to bitshift.
With this, your worst error is 0.03%. But of course, a discontinuity in the function at x=0.23 comes along with the ride (which is ~1.3e-4; actual value of atan(0.23) is ~0.227 so the discontinuity is 0.05% of the value there). The point ~0.23 was to minimize the maximum error, but you can choose another value to minimize discontinuity. If this bothers you, you can reduce this amount by patching 3 (or even more) Pade approximants.
No I hadn't considered Padè approximants, but only because I wasn't aware of their existence. They do look interesting, but I think the extra div may cost too much here compared to other methods. Also note that the methods you've described here all use floats. Transforming to integer-only math is possible, but will affect the accuracy as well. x/(1+x²/3) only uses a single mul, as it can be converted into 3x/(3+x²).
I've looked in my books to see a few words about Padè, but it was rather sketchy on how to calculate the coefficients. Do yo uknow how Mathematica does it?
The basic idea is to get the approximant agree with the function (conveniently, with it's Taylor expansion) at all derivatives. I don't know how Mathematica actually handles it, but I'm pretty sure it has a pretty cool way to do it fast. There's an actual implementation, however, in Numerical Recipes in C, chapter 5.12.
Floats? Why can't you do them with fixed point math?
Cool! Indeed, one mul can be converted to a bitshift+addition :) Would you care to benchmark an actual implementation --it might turn out to be OK.
BTW, I see one more benefit in this 2nd order expansion, atan2 (which would be 3y/(3x+y^{2}/x)) should introduce less rounding errors, while keeping the number of div/mul operations same.
Regarding the atanTonc constants. The following constants were published in a book. They provide 1/4 of the error, at values close to 1, than the ones you derived. Your ones were OK for the fixed point precision you had. But for floating point the extra precision at no cost is welcome.
0.9998660
0.3302995
0.1801410
0.0851330
0.0208351
P.S. Thanks for the great articles. I have just found your website today and am reading like crazy. I have already speed up a Sin approximation I was using.
Ok ,but what should be the required hardware system for the implementation of look up method ?
That's really hard to say, actually. If you have memory to spare and memory-access is fast, the LUT method is a good option. Or if instructions are expensive (the ARM processor does shifts for free, so that's a big part of why series expansion works wel here). Simply put, what's "best" is both CPU and memory dependent, and the only way to know what works for you is to test it yourself.
It appears you have an error in your explaination of how to rotate the angle to get from octant 1 to octact 0. You stated:
"The other method (Fig 5c) is to continue rotating: if x>y, rotate by −¼π."
However, I think you got the comparison backwards, and it should be:
The other method (Fig 5c) is to continue rotating: if x<=y, rotate by −¼π.
Pingback: OlliW's Bastelseiten » Fast Computation of Functions on Microcontrollers