Off on a tangent : a look at arctangent implementations

Or: everything you never wanted to know about arctangent implementations and weren't afraid to not ask.

1 Introduction

Whenever rotations or directions are required in games, the usual way to describe them is with an angle. Together with a magnitude, r, the angle φ can be converted into x and y coordinates through the sine and cosine functions (see Eq 1). Since the process is so common, for games, pretty much every document dealing with trigonometry explain how this works and give a decent way of implementing these functions (i.e., look-up tables or LUTs).

Something usually not covered is the inverse operation: turning Cartesian coordinates into angles. This requires the arctangent function (see Eq 1, right-hand side). Even though this type of conversion not as necessary as the angle→coordinate-pair conversion (and in fact less necessary than most people think; much of the calculations can be done with vector math instead), it is still somewhat unfortunate that this is neglected. Not just unfortunate for the lack of symmetry, but also because the process is rather more tricksy than using the standard trigonometry functions.


In this document, I will discuss several (yes, there is more than one) implementations for the arctangent function. My main targets here are GBA and NDS so these will be integer-based, but the methods are general enough to work for floating-point as well. I'll discuss the advantages and disadvantages of each in terms of accuracy, speed and size so that people can see the different options, make an informed decision what method will work best for them and won't have to worry about it again. Ever.

(1) \begin{pmatrix}x \\ y \end{pmatrix} = \begin{pmatrix}r cos \phi \\ r sin \phi \end{pmatrix} \longleftrightarrow \begin{pmatrix}r \\ \phi \end{pmatrix} = \begin{pmatrix}\sqrt{x^2 + y^2} \\ arctan(y/x) \end{pmatrix}

Fig 1. Unit circle and trig function definitions.

In Fig 1 are shown the standard trigonometry functions and their relation to the angle φ. If you travel along the unit circle, cos(φ) marks the x-coordinate, sin(φ) marks the y-coordinate and tan(φ) is the y-coordinate is you extend the radial line to x=1 (which effectively scales it by 1/cos(φ). In other words, tan(φ) = sin(φ)/cos(φ) ). The arctangent is the inverse of the tangent function.

Fig 2 plots these as functions of φ. Note that all of them are periodic; this is nice because functions with bound domains are perfect for LUTs(1).

However, the arctangent, plotted in Fig 2, has several issues. Firstly, it goes on forever in both x directions. Since an infinite amount of memory can't be bought in the shops just yet, a trivial LUT-based solution would seem out of the question. Secondly, the argument of the arctan isn't just x or y; it's y/x. Depending on the hardware, divisions can be a little slow. Thirdly, the arctan's range is only [−½π, ½π] : half a circle and not a full one. This is also a consequence of the division, since (−y)/(−x) = y/x.

Of course, none of these problems is insurmountable, but it does mean that the solution will involve a little more than just lookup.

Fig 2. Sine, cosine and tangent, with φ in fractions of π.

Fig 3. Arctangent, in fractions of π.

2 Arctangent methods.

There are either methods that I'd like to discuss. These fall into three broad categories: LUT-based, series expansion, and a cute little method known as CORDIC.

  1. Basic lookup.
  2. Lookup + interpolation.
  3. Inverse-lookup + interpolation.
  4. Standard Taylor series.
  5. GBA series expansion.
  6. Home-grown series expansion.
  7. Approximation by sines.
  8. CORDIC.

Before discussing the specifics, let's see what the interface for the arctangent will be like. Now, because I intend the function to be used in FPU-less environments, the parameters and the return value will be integers; preferably scaled to some convenient fixed point number for accuracy.

There will be two parameters, an x and an y value. The standard name for a two-argument arctan is atan2(), so I'll use that name as well. As a division of the two coordinates is involved, the position of the fixed point is actually not important, but let's take it as 12, so that 1.0 corresponds to 0x1000. The return value is an integer with a 2π range. Whether you use [0, 2π) or something like [−π, π) doesn't really matter thanks to the periodicity of angles, but because it'll make the calculations easier, I will use the former. Also, since the number of angles in a circle is essentially arbitrary as well, we will choose something convenient for computers, like π≡0x4000; in other words, 0x8000=32768 ‘degrees’ for a full circle. The units for this power-of-two circle division is often called a brad, which stands for binary radian. In this case we have 0x4000 brad = π rad = 180 degrees.

With these things in mind, we have the following constants for our code:

static const uint BRAD_PI_SHIFT=14,   BRAD_PI = 1<<BRAD_PI_SHIFT;
static const uint BRAD_HPI= BRAD_PI/2, BRAD_2PI= BRAD_PI*2;

static const uint ATAN_ONE = 0x1000, ATAN_FP= 12;

The fact that these constants correspond to libnds' units for trig is, of course, entirely accidental.


Fig 4. Octants. Counting goes at counter-clockwise and starts at the positive x-axis.

Next up: getting a full circle's worth of angles from a function that technically only returns half a circle. This is done by using the symmetry properties of the circle; most notably that a positive y uses the range [0, π) and negative y means [π, 2π) angles.

Two other axes of symmetry can be used as well: the x=0 and y=x lines. Through these, you can restrict the actual you need to calculate to a single octant (see Fig 4). The nice thing about bringing it to a single octant is that you will end up at a place where quotient of x and y is below 1.0 – in other words, get the argument of the actual arctan down to a finite domain instead of an infinite one.

The process is illustrated by Fig 5. Requested is the angle to point p=p0. Successive rotations can bring this point into the first quadrant (Fig 5a and b). If necessary, rotate the point by π to bring it into the top hemisphere, then rotate by ½π to get to the first quadrant. We need to keep track of the performed rotations in the form of φi.

Once inside the first quadrant, two procedures can be followed (see Fig 5c and d). Starting with the latter, notice that for calculating angle β, you can use the quotient x/y rather than y/x if the point is in the second octant. The angle α follows from α = ½π−β. The other method (Fig 5c) is to continue rotating: if x>y, rotate by −¼π. Now, if you look at Eq 2, you might think this is a bad idea thanks to the division by √2. However, this scaling doesn't matter, because the coordinate division will remove any trace of it anyway!

Because it will reduce the complexity of the code, I will use the method from Fig 5c.

Fig 5. Reducing the angle to octant 0. The true angle for point p is φi+α. The successive rotations in (parts a, b and c) reduce p to the first quadrant. Or use a and b, and then use the y/x vs x/y symmetry (see d) to get β, and then α=½π−β.
(2) \begin{eqnarray} \bold{R}\(-\pi\) &=& \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix} \; &:& \; \begin{eqnarray} x = -x \\ y= -y \end{eqnarray} \\ \vspace{10} \\ \bold{R}\(-\frac12\pi\) &=& \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix} \; &:& \; \begin{eqnarray} x = -y \\ y= -x \end{eqnarray}\\ \vspace{10} \\ \bold{R}\(-\frac14\pi\) &=& \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}\cdot {1\over\sqrt2} \; &:& \; \begin{eqnarray} x = (x+y)/\sqrt2 \\ y= (x-y)/\sqrt2 \end{eqnarray} \\ \end{eqnarray}

The basic framework for the atan2() functions will use the following scheme. First, return early if x and y are both 0. If not, the coordinates will be rotated in till the fall into the first octant and a preliminary angle is constructed as an offset to the real octant. After this initial set-up, the different functions will calculate the arctangent according to their specific methods, which is added to the earlier angle and then returned. A template for the atan2() is shown below.

// Get the octant a coordinate pair is in.
#define OCTANTIFY(_x, _y, _o)   do {                            \
    int _t; _o= 0;                                              \
    if(_y<  0)  {            _x= -_x;   _y= -_y; _o += 4; }     \
    if(_x<= 0)  { _t= _x;    _x=  _y;   _y= -_t; _o += 2; }     \
    if(_x<=_y)  { _t= _y-_x; _x= _x+_y; _y=  _t; _o += 1; }     \
} while(0);

// QDIV stands for the fixed-point division method most appropriate for
// your system. Modify where appropriate.
// This would be for NDS.
static inline int QDIV(int num, int den, const int bits)
    while(REG_DIVCNT & DIV_BUSY);
    REG_DIVCNT = DIV_64_32;

    REG_DIV_NUMER = ((int64)num)<<bits;
    REG_DIV_DENOM_L = den;

    while(REG_DIVCNT & DIV_BUSY);

    return (REG_DIV_RESULT_L);

// Basic setup for atan2. MY_ARCTAN handles the specifics of each method.
uint atan2Base(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi, dphi;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;
    // --- <specifics> ---
    t= QDIV(y, x, ATAN_FP);
    dphi= MY_ARCTAN(t);
    // --- </specifics> ---
    return phi + dphi;

2.1 atanLookup : simple lookup

Since by reducing the region of calculation to the first octant we do have a bound domain, look-up tables may be of use after all. This particular LUT needs to have values for t = y/x∈[0, 1] (i.e, φ∈[0, ¼π]) Note that 1.0 is included in this range, as this will required for the next method. I'll put the entry for 1.0 at position 128, so that the indices are effectively Q7 numbers. For accuracy, I'll scale each entry so that as far as the LUT is concerned, ¼π ≡ 0x8000. This will be shifted down at the end of the calculations.

The LUT and the code for atan2Lookup() are given below. I've also added some other useful constants indicating the ‘stride’ of the LUT entries; that is, the effective width of each LUT-entry. Note that I'm using an actual division for the final transition to the BRAD_PI scale. As long as the numerator is unsigned, this won't give any trouble.

// Some constants for dealing with atanLUT.
static const uint ATANLUT_STRIDE = ATAN_ONE / 0x80, ATANLUT_STRIDE_SHIFT= 5;

// Arctangens LUT. Interval: [0, 1] (one=128); PI=0x20000
const unsigned short atanLUT[130] ALIGN(4)=
// 64
// 128

// Basic lookup for atan2. Returns [0,2pi), where pi ~ 0x4000.
uint atan2Lookup(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;
    t = QDIV(y, x, ATAN_FP);

    return phi + atanLUT[t/ATANLUT_STRIDE]/8;

2.2 atanLerp : LUT + linear interpolation

The LUT is fairly coarse. The gaps between the LUT entries are roughly 0x100 in size, corresponding to about 0.7 of a degree. We can get considerably better accuracy (in the order of 2% of a degree) by using linear interpolation (lerp for short) between the LUT entries surrounding the quotient t (see tonc:lerp for the basic process). The essence is that t lies between LUT-points (ta, φa) and (tb, φb). From the basic description of a line, we can derive that

(3) \phi = \phi_a + \frac{\phi_b - \phi_a}{t_b-t_a}\cdot(t - t_a) .

The cute thing about lerping for LUTs is that, if one assumes t is a fixed-point number, then tb = ta+1.0 and the operations with t in them boil down to a division and modulo by a power of two. The code is then becomes as follows.

// Basic lookup+linear interpolation for atan2.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Lerp(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi;
    uint t, fa, fb, h;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, ATAN_FP);
    h= t % ATANLUT_STRIDE;
    fa= atanLUT[t/ATANLUT_STRIDE  ];
    fb= atanLUT[t/ATANLUT_STRIDE+1];

    return phi + ( fa + ((fb-fa)*h >> ATANLUT_STRIDE_SHIFT) )/8;

2.3 atanInvLerp : Inverse lookup

If you're strapped for space but already have a tangent LUT available, you can also try to retrieve the arctangent by a sort of inverse lookup. The parameter of the arctangent is actually the tangent value of the angle we're looking for, so by searching through the tangent LUT for this value, we can find the two LUT-entries that bracket our angle. Because the tangent is a monotonously increasing function, the bracketing can be done via a binary search. The lutBracket() presented below does just this, and returns the LUT-index of the lower bound. This value and the next are used to once again linearly interpolate to the desired angle.

This will probably not be as fast as the direct lerp. Firstly, the bracketing itself takes O(log N) time, which is probably slower than the O(1) that direct look-up uses. Also, it requires an extra division because the range to divide by is the distance between the LUT-valuesb−φa) and not the LUT-indices.

static const uint TANLUT_PI_SHIFT= 8, TANLUT_PI= 0x100;

static const uint TANLUT_SIZE= 129, TANLUT_FP= 16;

// Tangens LUT, domain: [0, PI/2]; PI= 0x100, Q16 values.
// tan(PI/2) set to 400.0.
static const unsigned int tanLUT[129]=
// 64
// 128

//! Get a tangent value as a Q12 fixed-point number.
int itan(int x)
    uint ux= (uint)x % BRAD_PI;
    uint xa= ux / TANLUT_STRIDE;
    int ya, yb, h= ux % TANLUT_STRIDE;

    if(ux <= BRAD_HPI)
        ya= tanLUT[xa  ];
        yb= tanLUT[xa+1];
        ya= -tanLUT[TANLUT_PI-xa  ];
        yb= -tanLUT[TANLUT_PI-xa-1];

    return (ya + ((yb-ya)*h >> TANLUT_STRIDE_SHIFT))>>4;

//! Function to bracket a value in LUT.
/*! Consider a monotonously increasing LUT (or sorted array), with  
    two adjacent value form a bin. This function will find the
    bin in which \a key can be found.
    \param key      Value to bracket.
    \param array    Array to search.
    \param size     Size of the array.
    \return         The low bracket of the bin.

static inline uint lutBracket32(int key, const int array[], uint size)
    uint i, low=0, high= size-1;

        i= (low+high)/2;
        if(key < array[i])
            high= i;
            low= i;

    return i;

// atan2 using inverse lookup and linear interpolation.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2InvLerp(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi, fa, dphi, ta, tb;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, TANLUT_FP);
    fa= lutBracket(t, tanLUT, TANLUT_SIZE);

    ta= tanLUT[fa  ];
    tb= tanLUT[fa+1];

    dphi= fa*TANLUT_STRIDE + QDIV((t-ta)*TANLUT_STRIDE, tb-ta, 0);
    return phi + dphi;

2.4 atanTaylor : Taylor series expansion

A method that does not rely on look-up tables at all is series expansion. There is a mathematical process known as Taylor Series expansion, by which you can approximate functions adding the right sum of its derivatives. When in games you extrapolate the current position, velocity and acceleration to the next position, you are in a way making use of this process. I'll spare you the derivation of the Taylor series for the arctangent and just write it down.

(4) arctan(t) = \sum_{i=1}^\infty (-1)^{i+1} \, \frac{t^{2i+1}}{2i+1} = t - \frac{t^3}3 + \frac{t^5}5 - \frac{t^7}7 + ...

arctan via taylor expansion. Shown are the true arctan and Taylor approximations to the 7th order.

Now, although Eq 4 is true, for calculation purposes it's better to write it as a backwards recurrence relation; one that ends at i=n. This is better for the stability of the algorithm comes down to a simpler for-loop.

(5) \begin{eqnarray} q_n &=& \frac1{2n+1} \\ q_{i} &=& \frac1{2i+1} + (-t^2)q_{i+1} \\ arctan(t) &=& t q_1 = t (1-t^2(\frac13-t^2(\frac15-t^2 ...))) \end{eqnarray}

Now, this equation only really works for t≤1; above it things will diverge. This is alright though, because that's always true for the first octant. As is, Eq 4 and Eq 5, t ∈ [0, 1] will translate to domain φ ∈ [0, ¼π], but this is with the real value for π; not the one we've adopted for our integer-based trigonometry. to account for that, we need to scale things up by BRAD_PI/π. For the accuracy of the calculations, we can even scale it up a little further and shift back down after everything's done. This gives us the following code. The constant base is 8*BRAD_PI/π.

// Basic Taylor series (15th order) for atan2.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Taylor(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    static const int fixShift= 15, base=0xA2F9;
    int i, phi, t, t2, dphi;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, fixShift);
    t2= -t*t>>fixShift;

    dphi= 0;
    for(i=15; i>=1; i -= 2)
        dphi  = dphi*t2>>fixShift;
        dphi += QDIV(base, i, 0);

    return phi + (dphi*t >> (fixShift+3));

In this particular case, I've chosen for a 15th (n=7) order approximation. This works out quite well for most of the range, but not near t=1! And unless you really use an infinite number of terms, it never will be; the error will be around the next term in the expansion, i.e. 1/(2n+3). This is something you just cannot avoid.

2.5 atan2Gba : taylor series, GBA implementation

Or can you?


While Eq 4 is indeed the correct Taylor expansion for the arctangent, it requires an infinite number of terms; something we simply cannot have in programming. This causes the problems around t=1. However, there's no rule against breaking it off early and adjusting the terms a little to compensate for the error. Finding the best terms for the job is more of an art than a science, because many different combinations will work.

One of these sets can be found in the GBA implementation of ArcTan. It basically uses the process of Eq 5 with n=7, but with the following sequence of multipliers instead of 1/(2i+1).

Table 1 : multipliers used in GBA's ArcTan. Listed are the order, fixed-point term and a rough (ir)rational equivalent.
order scaled approximation
1 0xA2F9 0.99999 (1/1)
3 0x3651 0.33328 (1/3)
5 0x2081 0.19944 (1/5)
7 0x16AA 0.13906 (5/36)
9 0x0FB6 0.09640 (8/83)
11 0x091C 0.05589 (9/161)
13 0x0390 0.02186 (4/183)
15 0x00A9 0.00405 (8/1974)

Interestingly, this works very, very well. It usually underestimates the real value by only 1 on a 0x8000 range. The function itself is given below. Because the number of terms is known and only 7, the loop may just as well be unrolled. Again, the calculation scale is 8*BRAD_PI/π, hence the additional shift by 3 at the end.

// Approximate Taylor series for atan2, GBA implementation.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Gba(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    static const int fixShift= 15;
    int phi, t, t2, dphi;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, fixShift);
    t2= -t*t>>fixShift;

    dphi= 0x00A9;
    dphi= 0x0390 + (t2*dphi>>fixShift);
    dphi= 0x091C + (t2*dphi>>fixShift);
    dphi= 0x0FB6 + (t2*dphi>>fixShift);
    dphi= 0x16AA + (t2*dphi>>fixShift);
    dphi= 0x2081 + (t2*dphi>>fixShift);
    dphi= 0x3651 + (t2*dphi>>fixShift);
    dphi= 0xA2F9 + (t2*dphi>>fixShift);

    return phi + (dphi*t >> (fixShift+3));

2.6 atanTonc : homegrown Taylor expansion

The function above works out really well, but can be cut down on the terms even further? Yes, we can. I modeled these equations in Excel (interactive graphing FTW!), fiddled around for a while and came up with this list.

Table 2 : my own atan multipliers. Listed are the order, fixed-point term and a rough (ir)rational equivalent.
order scaled approximation
1 0xA2FC 1.00006 (1/1)
3 0x364C 0.33316 (1/3)
5 0x1F0B 0.19048 (4/21)
7 0x1029 0.09916 (12/121)
9 0x0470 0.02723 (7/257)
Excel for prototyping algorithms

Excel is a very nice platform for finding out if your mathematical algorithms work and getting the optimal terms to use for them. What you do is set-up a decent number of points to evaluate and plot the difference between the real numbers and your model and try to minimize. Create a target value (like sum of squares or errors or min/avg/max data) and fiddle with points till you get something worthwhile. If the case of floating point values, you can even let Excel do it for you by using the Solver Addin. Because that doesn't work for integers, I just plotted the errors and fiddled until the errors for nearly all points were between ±0.5.

These terms result in roughly the same kind of error profile as the GBA method – somewhat fewer errors even. And it uses only 5 terms instead of 8 :). The code for it is below. I'm calling it atan2Tonc() because, well, I have to call it something. Also note the +4 at the end; this is used for rounding off to the nearest integer. This step is not essential, but does help reduce the average error.

// Approximate Taylor series for atan2, home-grown implementation.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Tonc(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    static const int fixShift= 15;
    int  phi, t, t2, dphi;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, fixShift);
    t2= -t*t>>fixShift;

    dphi= 0x0470;
    dphi= 0x1029 + (t2*dphi>>fixShift);
    dphi= 0x1F0B + (t2*dphi>>fixShift);
    dphi= 0x364C + (t2*dphi>>fixShift);
    dphi= 0xA2FC + (t2*dphi>>fixShift);
    dphi= dphi*t>>fixShift;

    return phi + ((dphi+4)>>3);

2.7 atan2Sin : sine approximation

This next method isn't actually as accurate under the present circumstances as the other ones, but it's just too skanky to be left unmentioned. If you look at the arctangent over the interval [−1, 1], you get an anti-symmetric curve with a slope that is +1 at t=0 and decreases with increasing t. You know what else has those characteristics? A sine wave. By using the right amplitude and frequency of a sine, you can actually approximate an arctangent to within about two degrees. Adding another sine with carefully chosen terms gets you an accuracy of about 3% of a degree.

(6) atan(t) \approx A_1 sin(k_1t) + A_2 sin(k_2t) \;\; t\in[0,1]
Table 3 : Terms used for Eq 6.
term int value
A1 0x14FF
k1 9/8
A2 0x007D
k2 37/8

Arctan versus sine waves. A single sine (solid magenta) is already close to the arctan. The difference is the dashed cyan line, which is approximated by another sine (dashed magenta). Adding these gives the yellow dotted line.

Finding the right terms is mostly a matter of trial and error. Once you have reasonably numbers, you should be able to use any sine generator to get the approximation. Below you can find both the atan2Sin() function and the sine generator (which in this case is a little slow, but has a low memory footprint).

// Sine LUT. Interval: [0, PI/2]; PI= 0x100, Q15 values.
const unsigned short sinLUT[130] ALIGN(4)=
// 64
// 128

//! Get a sine value as a Q12 fixed-point number.
int isin(int x)
    int h, ya, yb, y;
    uint ux, quad;

    ux= (uint)x%BRAD_2PI;
    h= ux % SINLUT_STRIDE;
    if(quad & 1)
        ya= sinLUT[SINLUT_HPI-ux  ];
        yb= sinLUT[SINLUT_HPI-ux-1];
        ya= sinLUT[ux  ];
        yb= sinLUT[ux+1];

    y=(ya + ((yb-ya)*h >> SINLUT_STRIDE_SHIFT))>>3;
    return (quad & 2) ? -y : +y;

//! Get a cosine value as a Q12 fixed-point number.
int icos(int x)
    return isin(x + BRAD_HPI);

// atan2 via added sines.
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Sin(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi, dphi;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    t= QDIV(y, x, ATAN_FP);
    dphi= 0x14FF*isin(9*t/8) + 0x7D*isin(37*t/8);

    return phi + (dphi>>12);

2.8 atan2Cordic : CORDIC

Fig 6. Iteratively rotate p by ±αi in the direction of y=0. The full angle is the sum of αs.

CORDIC stands for COordinate Rotation DIgital Computer. This is a really nifty procedure that allows you to calculate trig functions through successive coordinate rotations. It is effectively a search routine through angular space, as illustrated by Fig 6. You start at some point p and towards the positive x-axis by some angle α. This means rotate (and scale; see Eq 7) by +α if py≥0, or by −α if py is negative. Repeat this process with increasingly smaller angles. The full angle between p and the x-axis is the sum of the αs.

The real beauty in the CORDIC procedure, however, is in the way the angles are chosen. The first guess for such a procedure would be bisection: divide angles by two. This is what the OCTANTIFY() macro did. However, once you get down to π/8 you run into the problems that the elements of the rotation matrix (sines and cosines) don't give neat binary fractions anymore. Instead of bisection, the CORDIC procedure divides the angles so that the elements of the matrix do give nice powers of two. The angles themselves are then simply looked up from a table.

Eq 7 shows the derivation of the process. The cosines are taken out of the matrix itself, reducing the matrix to ones and tangents. Using tan(α) = 2-i for the iterations, this further reduces to shifts and adds and a few angle lookups. Note: there are no divisions here! Now, technically you'd still have to multiply the cosine scalings but, like before, those would be divided out anyway and so can be safely ignored.

(7) \begin{eqnarray} R(\alpha_i) &=& \begin{bmatrix} cos(\alpha_i) & -sin(\alpha_i) \\ sin(\alpha_i) & cos(\alpha_i) \end{bmatrix} \\ &=& \begin{bmatrix} 1 & -tan(\alpha_i) \\ tan(\alpha_i) & 1 \end{bmatrix} cos(\alpha_i) \\ &=& \begin{bmatrix} 1 & -2^{-i} \\ 2^{-i} & 1 \end{bmatrix} cos(\alpha_i) \;\;\; \alpha_i = arctan(2^{-i}) \\ \end{eqnarray}

For the implementation, you need to use a lookup table with the arctan terms for αi. Since we're working with 15bit angles, we will need about 15 terms for it – though note that the first 3 terms have been taken care of already though OCTANTIFY().

// atan via CORDIC (coordinate rotations).
// Returns [0,2pi), where pi ~ 0x4000.
uint atan2Cordic(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    // Scale up a bit for greater accuracy.
    if(x < 0x10000)
        x *= 0x1000;
        y *= 0x1000;

    // atan(2^-i) terms using PI=0x10000 for accuracy
    const u16 list[]=
        0x4000, 0x25C8, 0x13F6, 0x0A22, 0x0516, 0x028C, 0x0146, 0x00A3,
        0x0051, 0x0029, 0x0014, 0x000A, 0x0005, 0x0003, 0x0001, 0x0001

    int i, tmp, dphi=0;
    for(i=1; i<12; i++)
            tmp= x + (y>>i);
            y  = y - (x>>i);
            x  = tmp;
            dphi += list[i];
            tmp= x - (y>>i);
            y  = y + (x>>i);
            x  = tmp;
            dphi -= list[i];
    return phi + (dphi>>2);

There are two additional points of note for this routine. Firstly, notice that I've scaled the sub-angles a bit for a higher accuracy. More importantly, though, I've scaled the coordinates up before going into the loop. This is necessary for accuracy, otherwise you might hit the y=0 line too soon.

Finally, ecurtz presented a nice, unrolled ARM assembly version on the GBAdev forum in thread 3840. It's designed for π≡0x0400 instead of 0x4000 so it won't be as accurate, but it's still quite pretty.

3 Results and discussion

3.1 Accuracy

The tests are based on Eq 1. I start with all angles from 0 to 0x8000, convert those to coordinates using convenient (co)sine routines designated icos() and isin(), and see how far away we are from the original angle. As could be expected, the errors repeat after ¼π, so I'll only present the first octant.

Fig 7 shows the errors E for a selection of angles. The vertical axis shows the numbers as Q12 numbers, so a 1 would correspond to 1/4096th. Because the lines are so jumpy and overlap often, I've added a small offset to keep all the lines visible. For the real errors, round to the nearest integer.

There is a huge jumble around E=0 and E=−1. This is a good thing, because it means that we either got the original angle back exactly or are off by one – something which is entirely understandable considering the nature of arctan and the method of testing. The functions atan2Gba() and atan2Tonc() look particularly good, with atan2Cordic() being only slightly behind because it has a few  2s in some places. The linear interpolators, atan2Lerp() and atan2InvLerp() are only slightly behind. The former is nearly always 1 or 2 below what it should be, while the latter hovers around 0 except for near the φ=¼π mark, where it overshoots by 2 at times.

The other three methods have visibly higher inaccuracies. The standard Taylor method, atan2Taylor(), is alright until about t≈¾π, but then goes horribly wrong, taking a nosedive to −110 (≈2°). Without interpolation, atan2Lut() underestimates the angle by about 16 (≈0.4°). The error for the sine-approximation method, atan2Sin() starts off at roughly −4 and then approaches zero after φ≈π/8. Adding a bias of 2 might benefit this method, making it center around 0. I think it might be further improved by using a higher fixed-point fraction for the sines.

Fig 7. Errors for the first octant of all methods. A small vertical offset is used to keep the lines from overlapping.
Table 4: statistics for the errors of all atan methods, over all angles. #off refers to the number of incorrectly estimated points.
lut lerp ilerp taylor gba tonc sin cordic
max 1 2 3 1 1 2 3 2
avg -17.083 -0.506 0.078 -6.989 -0.614 -0.129 -1.950 -0.502
min -43 -3 -2 -163 -2 -2 -6 -3
stdev 10.309 0.855 1.064 21.857 0.799 0.787 1.637 0.889
#off 32443 20277 20827 22982 20161 18627 28213 20545

Table 4 shows some additional data about errors of each method, based on all available angles. The max, avg and min rows contain the maximum, average and minimum errors. Zero would be the preferred result here, but small deviations are acceptable. For those methods where the average is not between 0.5 and −0.5, you can add a bias to make it so. The standard deviation (stdev, σ) is a measure of how far from the average the set of points are (the range ±1σ contains 68% of the data points). With the exception of atan2Lut() and atan2Taylor(), all routines do pretty well. The final row (#off) shows how many points do not give the original angle back. Even in the best case, atan2Tonc(), you're still off a little half the time.

3.2 Costs in time and space

Next, we'll take a look at how much each method costs in terms of processor time and their memory footprints. The sizes can simply be read from the mapfile generated by the linker. For the time spend calculating, I took a selection of angles from the whole circle, converted these to coordinates and used the various arctan2 method to calculate the angle again. The total time for everything is was then divided by the number of iterations to give an average time for each method. To reduce the measured time to just that of the arctan code itself, the time for an empty arctan procedure (atan2Null()) was also taken and subtracted from the others. The most important part of the test can be found in the snippet below.

// --------------------------------------------------------------------
// Main part of test procedure:
// --------------------------------------------------------------------

// Find the testing overhead.
for(phi=0; pihi<BRAD_2PI; phi += 32)
    atan2Null(icos(phi), isin(phi));
nullTime= profileStop() - procNullTime;

// Get the actual time for each method, sans overhead.
for(i=0; i<N_METHODS; i++)
    for(phi=0; phi<BRAD_2PI; pi += 32)
        atan2s[i].proc(icos(phi), isin(phi));
    atan2s[i].time= (profileStop() - nullTime)/(BRAD_2PI/32);

// --------------------------------------------------------------------
// Some more versions for analysis' sake
// --------------------------------------------------------------------

// Just the form
uint atan2Null(int x, int y)
    return x;

// With octants
uint atan2Oct(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;

    return phi;

// With octants+div
uint atan2OctDiv(int x, int y)
    if(y==0)    return (x>=0 ? 0 : BRAD_PI);

    int phi;
    uint t;

    OCTANTIFY(x, y, phi);
    phi *= BRAD_PI/4;
    t= QDIV(y, x, ATAN_FP);

    return phi+t;

This test was run on both the GBA and NDS in two ways. As ARM or Thumb code on the NDS, and ARM-in-IWRAM or Thumb-in-ROM on the GBA. The other relevant factors were -O2 with devkitARM r24 and default wait-states. For the division, the hardware 64/32 div was used for NDS and

Table 5 lists the results. As mentioned above, an empty function called atan2Null() was used to calculate the test overhead, which was subtracted from the times of the others. This is why the time atan2Null()'s cycle time can be higher than the rest. Two nearly empty functions, atan2Oct() and atan2OctDiv() were also used to analyze where most of the time is spent inside each routine. For the GBA, the BIOS call arcTan2() has also been included for comparison.

Table 5 : cost in time (in cycles) and space (in bytes) depending on platform (NDS or GBA), code-type (ARM or Thumb) and memory section (ROM or IWRAM for GBA). Notes: atan2Null() is the overhead for testing. This time has been subtracted for the other methods. The times in parentheses are the difference with atan2OctDiv(). The sizes in parentheses are the LUT sizes.
time (NDS) time (GBA) size (NDS)
Thumb Arm ROM/Thumb IWRAM/ARM Thumb ARM
atan2Null 60 44 448 204 2 4
atan2Oct 13 7 89 15 62 76
atan2OctDiv 94 81 336 258 140 188
atan2Lookup 95 (1) 82 (2) 370 (34) 270 (12) 152+260 208+260
atan2Lerp 100 (1) 86 (6) 417 (81) 288 (30) 168+260 236+260
atan2InvLerp 226 (132) 198 (118) 1030 (694) 621 (363) 252+516 344+516
atan2Taylor 725 (631) 626 (546) 2961 (2625) 2467 (2209) 272 296
atan2Gba 123 (29) 105 (25) 584 (248) 348 (90) 232 328
atan2Tonc 114 (20) 97 (17) 520 (184) 332 (74) 204 288
atan2Sin 155 (61) 126 (46) 820 (484) 383 (125) 184+260 264+260
atan2Cordic 123 (29) 120 (40) 877 (541) 312 (54) 152 188
ArcTan2 - - 352 360 - -


As was to be expected, the lookup method is the fastest, closely followed by the linear interpolator atan2Lerp() and then atan2Gba() and atan2Tonc(). Doing a proper Taylor expansion (atan2Taylor()) is very slow, and completely not worth the effort – especially considering it has accuracy problems as well. Using an inverse lookup like atan2InvLerp() does doesn't seem to be that great either in terms of speed.

I'm somewhat disappointed by the performance of atan2Sin(), actually. I'd expected it to be quicker. the reason for this is that approximation by sines lives or dies by how well fast your sine function is. This can mean a 60/200 difference per sine for NDS/GBA, which is a sizable chunk of the total time. The implementation I've used in section 2.7 is okay, but using a full LUT instead of a partial one would be considerably faster still. Or you can use assembly for math routines, because the compiler just doesn't seem to understand how to set-up a fast indexed memory access. You could probably near atan2Lerp() speeds if you use the most simple sine implementation (i.e., a single lookup), but you'd lose a fair bit of accuracy when doing so.

The results for the CORDIC case are interesting, especially when compared to te other methods. On the NDS, this is a fast method and there is next to no difference between Thumb and ARM code. The GBA times, however, are very different. In ARM/IWRAM it's pretty fast; in ROM/Thumb, it's very slow. The reason for it's relative quickness is the absence of divisions. As you can see from the difference between atan2Oct() and atan2OctDiv() the division (which is done via the BIOS Div() accounts for about 240 cycles in the GBA's case. Most GBA implementations are in fact dominated by the division. The reason for ROM/Thumb's ghastly performance is ... well I don't know actually. But it's probably related to the many jumps and loads, which are both killers if done in/from ROM.

Speed per system and code/memory type

Before discussing the different columns representing the different platform specs, it's instructive to look at the three overhead functions, atan2Null(), atan2Oct(), and atan2OctDiv(). These represent how much time is spend in the non-arctan part of the loop, the OCTANTIFY() macro and the macro+division, respectively. because the sine/cosine functions aren't that fast, the test overhead is substantial. While this is subtracted from the other results, there's always the chance that it messes up some of the timings so consider yourself warned.

The difference between atan2Oct() and atan2OctDiv() gives you an estimate on how long the division takes; this comes down to about 70-80 cycles for the NDS and 240 for the GBA. This is one of the main differences between the two systems and illustrates the importance of a good divider. Incidentally, GBATek:math lists the 64/32 division time as being 34 cycles. The reason it's more than double that here is probably due to the overhead of setting it up in the first place. The higher figure is the one you'd have to consider when using the hardware division, though.

Now, back to the different platforms. For a decent comparison, you should take the factors discussed above out of the equation. The times between parentheses are the differences with the atan2OctDiv() results and should represent these values for the heart of each method. In most cases, it's pretty much what you'd expect, but there are also some nice differences. The Thumb/ROM results nicely show the wait-states that ARM/IWRAM doesn't have, causing a rough 3-4× speed difference. Where it's less, it's probably due to ROM lookups and extra divisions. atan2Cordic() stands out as weird, though. It may be because the load are done from IWRAM in the ARM/IWRAM case, but that's about all I can say.

Because of the NDS-cache, NDS-Arm and GBA-ARM/IWRAM should have identical wait-states and, unless someone is playing trick on me, identical code as well. So that there are differences here is may seem somewhat odd. However, there are two factors to consider here. The first is, as mentioned, that the GBA LUTs were still in ROM. The second is that the ARM9 that the NDS uses should have faster memory access(2). Lastly, NDS Thumb vs ARM. In general, ARM will be a faster, but use more memory. The division seems to be the main factor for the NDS as well, so code-type is not that important. I will say that manual ARM assembly should be faster still, and possibly use less memory than compiled Thumb code.


In terms of size, obviously ARM code will be larger than Thumb, though under proper optimization I believe that the difference should be smaller than these numbers indicate. The smallest routine is CORDIC because its loop isn't unrolled like atan2Gba() and atan2Tonc(). The code for the LUT-based routine is small as well, but adding the LUT size, these are effectively take up the most space. These things considered, size for individual routines probably don't matter too much anyway. As with speed, if you want the best, do it in assembly.

Other considerations

If you need the routines faster, there are three things you can do. First, get a better divider. This is the deciding factor in all methods except CORDIC. If you have a really sucky DIV, use CORDIC. Second, improve octant finding. All methods shown here use OCTANTIFY() to reduce to a single octant and have the core method in one place. The alternative is to consider each octant individually, with 8 cores. The code will be 8 times larger and more complex, but should be a little faster – but not that much faster, as atan2Oct() shows that the octant search is rather quick. Lastly, there's assembly. For some reason, the compiler is rather sloppy for tight mathematical algorithms, spilling registers and instructions that could have been easily avoided. I estimate the core and octant search can be sped up considerably; but because it's the division that matters, the overall gain will probably be low.

For floating-point equivalents, the accuracy for the non-LUT based routines can be improved by using different constants. atan2Sin() in particular can become better by not using integers. Because the Excel Solver Add-in works for floating-point units, finding the right coefficients shouldn't take more than a few seconds once you have a suitable test environment set up.

4 Conclusions

The basic way to do an arctan2 function is to reduce the coordinates to a single octant and one of several methods to calculate the angle for that last octant. Two basic methods exist for this: CORDIC, which just keeps rotating, and everything else, which requires a division for y/x. If your division is really slow (which doesn't seem the case for NDS or even GBA), use CORDIC. If not, other methods tend to produce better results and be quicker about it.

Leaving CORDIC out of it, the fastest method is using a LUT – preferably a LUT combined with (linear) interpolation, because the gain in accuracy can be substantial. Do not use a pure Taylor approximation, because it's both slow and has accuracy issues. Modified truncated Taylor methods are alright, though; even with only five terms you can get a 15-bit accurate angle. Sines can approximate an arctan as well, but is ultimately too slow for a decent accuracy.


The routines presented here are fairly fast, but can be improved by hand-assembling the code, though ultimately the speed of division will dominate anyway. The only place where it may be of worth is the division-less CORDIC method, which if unrolled (see forum:3840) should be very fast indeed.

  1. Well, almost perfect. The asymptote for tan(φ) at φ =π·(½+n) can be problematic, but for the most part it's not really a problem.
  2. The docs say 1N and N+S+I for ARM9 and ARM7, respectively, but I still have some doubts about that

8 thoughts on “Off on a tangent : a look at arctangent implementations

  1. Have you ever considered a Pade approximant ---it is used widely in practice. Mathematica has a nice function for calculation any order of Pade approximants, here's what I get for a (2,2) order expansion

    atan[x_] = PadeApproximant[ArcTan[x], {x, 0, {2, 2}}]


    In involves a division and a multiplication, aside from division by 3, which, for fixed point arithmetic, is a multiplication by a constant anyways.
    The accuracy of the (2,2) expansion in the first octant is 2.1% in the worst case (whereas 7th order Taylor expansion is %0.65)

    Plot[100 (atan[x] - ArcTan[x])/ArcTan[x], {x, 0, Pi/4}]

    Here's the plot

    A higher order expansion yields (x + 4x^3)/(1 + 3x^2/5) which yields %0.25 at worst, but it seems quite costy.

  2. By the way, if you can live with a minor discontinuity in your atan, you can patch together various Pade approximants centered differently, all of order (2,2). For instance, you can use zero-centered approximant x/(1+x^2/3) for x<0.23 and 0.5 centered approximant
    (0.463648 + 1.17092(-0.5 + x) + 0.493095 (-0.5 + x)^2) / (1.0000000000000000 + 0.8(-0.5 + x) + 0.373333(-0.5 + x)^2) which comes with slight overhead (5 mul, 1 div whereas the previous one was 2 mul, 1 div). Note that 0.493095 can be approximated to 1/2, reducing a multiplication to bitshift.

    With this, your worst error is 0.03%. But of course, a discontinuity in the function at x=0.23 comes along with the ride (which is ~1.3e-4; actual value of atan(0.23) is ~0.227 so the discontinuity is 0.05% of the value there). The point ~0.23 was to minimize the maximum error, but you can choose another value to minimize discontinuity. If this bothers you, you can reduce this amount by patching 3 (or even more) Pade approximants.

  3. No I hadn't considered Padè approximants, but only because I wasn't aware of their existence. They do look interesting, but I think the extra div may cost too much here compared to other methods. Also note that the methods you've described here all use floats. Transforming to integer-only math is possible, but will affect the accuracy as well. x/(1+x²/3) only uses a single mul, as it can be converted into 3x/(3+x²).

    I've looked in my books to see a few words about Padè, but it was rather sketchy on how to calculate the coefficients. Do yo uknow how Mathematica does it?

  4. The basic idea is to get the approximant agree with the function (conveniently, with it's Taylor expansion) at all derivatives. I don't know how Mathematica actually handles it, but I'm pretty sure it has a pretty cool way to do it fast. There's an actual implementation, however, in Numerical Recipes in C, chapter 5.12.

    Floats? Why can't you do them with fixed point math?
    Cool! Indeed, one mul can be converted to a bitshift+addition :) Would you care to benchmark an actual implementation --it might turn out to be OK.
    BTW, I see one more benefit in this 2nd order expansion, atan2 (which would be 3y/(3x+y2/x)) should introduce less rounding errors, while keeping the number of div/mul operations same.

  5. Regarding the atanTonc constants. The following constants were published in a book. They provide 1/4 of the error, at values close to 1, than the ones you derived. Your ones were OK for the fixed point precision you had. But for floating point the extra precision at no cost is welcome.






    P.S. Thanks for the great articles. I have just found your website today and am reading like crazy. I have already speed up a Sin approximation I was using.

  6. Ok ,but what should be the required hardware system for the implementation of look up method ?

  7. That's really hard to say, actually. If you have memory to spare and memory-access is fast, the LUT method is a good option. Or if instructions are expensive (the ARM processor does shifts for free, so that's a big part of why series expansion works wel here). Simply put, what's "best" is both CPU and memory dependent, and the only way to know what works for you is to test it yourself.

  8. It appears you have an error in your explaination of how to rotate the angle to get from octant 1 to octact 0. You stated:

    "The other method (Fig 5c) is to continue rotating: if x>y, rotate by −¼π."

    However, I think you got the comparison backwards, and it should be:

    The other method (Fig 5c) is to continue rotating: if x<=y, rotate by −¼π.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>