Another fast fixed-point sine approximation

So here I am, looking forward to a nice quiet weekend; hang back, watch some telly and maybe read a bit – but NNnnneeeEEEEEUUUuuuuuuuu!! Someone had to write an interesting article about sine approximation. With a challenge at the end. And using an inefficient kind of approximation. And so now, instead of just relaxing, I have to spend my entire weekend and most of the week figuring out a better way of doing it. I hate it when this happens >_<.

Okay, maybe not.

Sarcasm aside, it is an interesting read. While the standard way of calculating a sine – via a look-up table – works and works well, there's just something unsatisfying about it. The LUT-based approach is just … dull. Uninspired. Cowardly. Inelegant. In contrast, finding a suitable algorithm for it requires effort and a modicum of creativity, so something like that always piques my interest.

In this case it's sine approximation. I'd been wondering about that when I did my arctan article, but figured it would require too many terms to really be worth the effort. But looking at Mr Schraut's post (whose site you should be visiting from time to time too; there's good stuff there) it seems you can get a decent version quite rapidly. The article centers around the work found at devmaster thread 5784, which derived the following two equations:

 (1) $\begin{eqnarray} S_2(x) &=& \frac4\pi x - \frac4{\pi^2} x^2 \\ \\ S_{4d}(x) &=& (1-P)S_2(x) + P S_2^2(x) \end{eqnarray}$

These approximations work quite well, but I feel that it actually uses the wrong starting point. There are alternative approximations that give more accurate results at nearly no extra cost in complexity. In this post, I'll derive higher-order alternatives for both. In passing, I'll also talk about a few of the tools that can help analyse functions and, of course, provide some source code and do some comparisons.

1 Theory

1.1 Symmetry

The first analytical tool is symmetry. Symmetry is actually one of the most powerful concepts ever conceived. Symmetry of time leads to the conservation of energy; symmetry of space leads to conservation of momentum; in a 3D world, symmetry of direction gives rise to the inverse square law. In many cases, symmetry basically defines the kinds of functions you're looking for.

One kind of symmetry is parity, and functions can have parity as well. Take any function f(x). A function is even if f(−x) = f(x); it is odd if f(−x) = −f(x).

This may not sound impressive, but a function's parity can be a great source of information and a way of error checking. For example, the product of two odd or even functions is an even function, and an odd-even product is odd (compare positive/negative number products). If in a calculation you notice this doesn't hold true, then you know there's an error somewhere.

Symmetry can also significantly reduce the amount of work you need to do. Take the next sum, for example.

 (2) $y = \int_{-N}^N sin^7(x^3) + \frac{x^5}{x^2+1} - x e^{\frac{x^2}{2\sigma^2}} dx$

If you find something like this in the wild on on a test, your first thought might be “WTF?!?” (assuming you don't run away screaming). As it happens, y = 0, for reasons of symmetry. The function is odd, so the parts left and right of x = 0 cancel out. Instead of actually trying to do the whole calculation, you can just write down the answer in one line: “0, cuz of symmetry”.

Another property of symmetrical functions is that, if you break them down into series expansions, odd functions will only have odd terms, and even functions only have even terms. This becomes important in the next subsection.

1.2 Polynomial and Taylor expansions

Every function can be broken down into a sum of more manageable functions. One fairly obvious choice for these sub-functions is increasing powers of x: polynomials. The most common of these is Taylor series, which uses a reference point (af(a)) and extrapolates to another point some distance h away by using the derivatives of f at the reference point. In equation form, it looks like this:

 (3) $\begin{eqnarray} f(a+h) &=& f(a) + f'(a) h + \frac{f''(a)}{2}h^2 + \frac{f'''(a)}{6} h^3 + ... \\ \\ \\ &=& \sum_{n=0} \frac{f^{(n)}(a)}{n!}h^n \end{eqnarray}$

Chances are you've actually used part of the Taylor series in game programming. On implementing movement with acceleration, you'll often see something like Eq 4. These are the first three terms of the Taylor expansion.

 (4) $x_{new} = x_{old} \:+\: v \Delta t \:+\: \frac12 a (\Delta t)^2$

The step-size (h in Eq 3 and Δt in Eq 4) is small, the higher-order terms will have less effect on the end result. This allows you to cut the expansion short at some point. This leaves you with a short equation that you do the calculations with and some sort of error term, composed of the part you have removed. The error term is usually linked to the order you've truncated the series at; the higher the order, the more accurate the approximation.

 (5) $f(a+h) = f(a) + f'(a) h + \frac{f''(a)}{2}h^2 + \frac{f'''(a)}{6} h^3 + O(h^4)$

If you work out the math for a sine Taylor series, with a = 0 as the reference point, you end up with Eq 6.

 (6) $sin(h) = h \,-\, \frac16 h^3 \,+\, \frac1{5!} h^5 \,-\, \frac1{7!} h^7 \,+\, ...$

Note that all the even powers are conspicuously absent. This is what I meant by symmetry being useful: a sine function is odd, therefore only odd terms are needed in the expansion. But there's more to it than that. The accuracy is given by the highest order in the approximating polynomial. This shows that there's just no point in even starting with any even-powered polynomial, because you can get one extra order basically for free!

This is why using a quadratic approximation for a sine is somewhat useless; a cubic will have two terms as well, and be more accurate to boot. Just because it's curved doesn't mean a parabola is the most suitable approximation.

1.3 Curve fitting (and a 3rd order example)

Using the Taylor series as a basis for a sine approximation is nice, but it also has a problem. The series is meant to have an infinite number of terms and when you truncate the series, you will lose some accuracy. Of course, this was to be expected, but this isn't the real problem; the real problem is that if your function has some crucial points it must pass through (which is certainly true for trigonometry functions), the truncation will move the curve away from those points. Example for a sine, it's really important that sin(π/2) = 1. With Taylor, that simply isn't the case (see Fig 1).

To fix this, you need to use a polynomial with as-yet unknown coefficients (that is, multipliers to the powers) and a set of conditions that need to be satisfied. These conditions will determine the exact value of the coefficients. The Taylor expansion can serve as the basic for your initial approximation, and the final terms should be pretty close to the Taylor coefficients.

Let's try this for a third-order (cubic) sine approximation. Technically, a third-order polynomial means four unknowns, but, since the sine is odd, all the coefficients for the even powers are zero. That takes care of half the coefficients already. I told you symmetry was useful :). The starting polynomial is reduced to Eq 7, which has two coefficients a and b that have to be determined. For good measure I've also added the derivative, as that's often useful to have as well.

 (7) $\begin{eqnarray} S_3(x) &=& ax - b x^3 &=& x (a - b x^2) \\ S_3'(x) &=& a - 3bx^2 \end{eqnarray}$

Two unknowns means we need two conditional to solve the system. The most useful conditions are usually the behaviour at the boundaries. In the case of a sine, that means look at x = 0 and/or x = ½π. The latter happens to be more useful here, so let's look at that. First, sin(½π) = 1, so that's a good one. Also, we know that at ½π a sine is flat (a derivative of 0). This is the second condition.

The conditions are listed in Eq 8. Solving this system is rather straightforward and will give you values for a and b, which are also given in Eq 8. Notice that the values are roughly 5% and 30% away from the pure Taylor coefficients.

 (8) $\left. \begin{eqnarray} S_3(\frac{\pi}2) &=& 1 &=& \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3'(\frac{\pi}2) &=& 0 &=& a - 3(\frac{\pi}2)^2 b \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &=& \frac3\pi &\approx& 0.955 \\ \\ b &=& \frac4{\pi^3} &\approx& 0.129 \end{eqnarray}$

The final equation is then:

 (9) $S_3(x) = \frac3\pi x - \frac4{\pi^3} x^3$

In Fig 1 you can see a number of different approximations to the sine. Note that I've done a little coordinate transformation for the x-axis: z = x/(½π), so z = 1 means x = ½π. The benefit of this will become clear later.

As you can see, the third order Taylor expansion starts out all-right, but veers off course near the end. In contrast, the third-order fit matches the sine at both end points. There is also the second-order fit from the devmaster site. As you can see, the third-order approximation is closer.

Fig 1. Sine approximations using 3rd order Taylor and parabolic cubic polynomials for first quadrant. z = x / ½π

Now, please remember that coefficients from Eq 8 are not the only ones you can use. The conditions define what the values will be: different conditions lead to different values. For example, instead using the derivative at ½π, I could have used it at x = 0. This forms the set of equations of Eq 10 and, as you can see, the coefficients are now different. This set is actually more accurate (a 0.6% average error instead of 1.1%), but it also has some rather unsavoury characteristics of having a maximum that's not at ½π and goes over 1.0; this can be really unsettling if you intend to use the sine in something like rotation.

 (10) $\left. \begin{eqnarray} S_3(\frac{\pi}2) &=& 1 &=& \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3'(0) &=& 1 &=& a \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &=& 1 \\ \\ b &=& \frac4{\pi^2}(1-\frac2\pi) \approx 0.147 \end{eqnarray}$

1.4 Dimensionless variables and coordinate transformations

For higher accuracy, a higher-order polynomial should be used. Before doing that, though, I'd like to mention one more trick that can make your mathematical analysis considerably easier: dimensionless variables.

The problem with most quantities and equations is units. Metres, feet, litres, gallons; those kinds of units. Units suck. For one, there are different units for identical quantities which can be a total pain to convert and can sometimes lead to disaster. Literally. Then there's the fact that the unit sizes are basically picked at random and have nothing to do with the physical situation they're used for. So you have weird values for constants like G in Newton's law of universal gravitation, the speed of light c and the Planck constant, h. Keeping track of these things in equations is annoying, especially since they tend to pile up and everybody would rather that they'd just go away!

Enter dimensionless variables. The idea here is that instead of using standard units, you express quantities as ratios to some meaningful size. For example, in relativity you often get v/c : velocity over speed of light. Equations become much simpler if you just denote velocities as fractions of the speed of light: β = v/c. Using β in the equations simplifies them immensely and has the bonus that you're not tied to any specific speed-unit anymore.

The dimensionless variable is a type of coordinate transformation. In particular, it's a scaling of the original variable into something more useful. Another useful transformation is translation: moving the variable to a more suitable position. We will come accross this later; but first: an example of dimensionless variables.

A sine wave has lots of symmetry lines, all revolving around the quarter-circles. Because of this, the term that keeps showing up everywhere is ½π. This is the characteristic size of the wave. By using z = x/(½π), all those important points are now at integral z values. Having ones in your equations is generally a good thing because they tend to disappear in multiplications. Look at what Eq 9 becomes when expressed in terms of z

 (11) $\begin{eqnarray} S_3(x) &=& \frac3\pi x - \frac4{\pi^3} x^3 \\ \\ &=& \frac32 \frac{2x}\pi - \frac12 (\frac{2x}\pi)^3 \\ \\ &=& \frac32 z - \frac12 z^3 \\ \\ S_3(z) &=& \frac12z (3 - z^2) \end{eqnarray}$

Doesn't that look a lot nicer? It goes deeper than that though. With dimensionless units, the units your measurements are in simply cease to matter! For angles, this means that whether you're working in radians, degrees or brads, they'll all result in the same circle-fraction, z. This makes converting algorithms to fixed-point notation considerably easier.

2 Derivations and implementations

In the section above, I discussed the tools used for analysis and gave an example of a cubic approximation. In this section I'll also derive high-accuracy fourth and fifth order approximations and show some implementations. Before that, though, there's some terminology to go through.

Since multiple different approximations will be covered, there needs to be a way to separate all of them. In principle, the sine approximation will be named Sn, where n is the order of the polynomial. So that'll give S2 to S5. I will also use S4d for the fourth-order approximation from devmaster. In the derivation of my own fourth-order function, I'll use Cn, because what will actually be derived is a cosine.

Third-order implementation

Let's start with finishing up the story of the third-order approximation. The main equation for this is Eq 11. Because this equation is still rather simple, I'll make this a fixed-point implementation. The main problem with turning a floating-point function into a fixed-point one is keeping track of the fixed-point during the calculations, always making sure there's no overflow, but no underflow either. This is one of the reasons why I wrote Eq 11 like it is: by using nested parentheses you can maximize the accuracy of intermediate calculations and possibly minimize the number of of intermediate calculations and possibly minimize the number of operations to boot.

To correctly account for the fixed-point positions, you need to be aware of the following factors:

• The scale of the outcome (i.e., the amplitude): 2A
• The scale on the inside the parentheses: 2p. This is necessary to keep the multiplications from overflowing.
• The angle-scale: 2n. This is basically the value of ½π in the fixed-point system. Using x for the angle, you have z = x/2n.

Filling this into Eq 11 will give the following:

 (12) $\begin{eqnarray} S_3(z) &=& \frac12 z (3 - z^2) 2^A \\ &=& z (3 - z^2) 2^{A-1} \\ &=& x\cdot2^{-n} (3 - x^2\cdot2^{-2n}) 2^{A-1} \\ &=& x (3 - x^2\cdot2^{-2n}) 2^{A-1-n} \\ &=& x (3\cdot2^p - x^2 2^{p-2n}) 2^{A-n-1-p} \\ S_3(x) &=& x (3\cdot2^p - x^2 / 2^r ) \middle/ 2^s, \end{eqnarray}$

with r = 2np and s = n+p+1−A. These represent the fixed-point shifts you need to apply to keep everything on the level. With p as high as multiplication with x will allow and the standard libnds units leads to the following numbers.

A n p r s
12 13 15 11 17

Fig 2. Mirror domain to the right quadrants.

That's the calculation necessary for the first quadrant, but the domain of a sine is infinite. To get the rest of the domain, you can use the symmetries of the sine: the 2π periodicity and the ½π mirror symmetries. The first is taken care of by doing z % 4. This reduces the domain to the four quadrants of a circle. The next part is somewhat tricky, so pay attention.

Look at Fig 2. S3 works for quadrant 0. Because it's antisymmetric, it will also correctly calculate quadrant 3, which is equivalent to quadrant −1. Quadrants 1 and 2 are the problem. As you can see in Fig 2, what needs to happen is for those quadrants to mirror onto quadrants 0 and −1. A reflection of x at D is defined by Eq 13. In this case, that means that z = 2 − z

 (13) $x = D - (x-D) = 2D-x$

Some test need to be done to see when the reflection should take place. The quadrant numbers in binary are 00, 01, 10, 11. If you build a truth-table around that, you'll see that a XOR of the two bits will do the trick. If you really want to show off, you can combine the periodicity modulo and the quadrant test by doing the arithmetic in the top bits. The implementation is now complete.

/// A sine approximation via a third-order approx.
/// @param x    Angle (with 2^15 units/circle)
/// @return     Sine value (Q12)
s32 isin_S3(s32 x)
{
// S(x) = x * ( (3<<p) - (x*x>>r) ) >> s
// n : Q-pos for quarter circle             13
// A : Q-pos for output                     12
// p : Q-pos for parentheses intermediate   15
// r = 2n-p                                 11
// s = A-1-p-n                              17

static const int qN = 13, qA= 12, qP= 15, qR= 2*qN-qP, qS= qN+qP+1-qA;

x= x<<(30-qN);          // shift to full s32 range (Q13->Q30)

if( (x^(x<<1)) < 0)     // test for quadrant 1 or 2
x= (1<<31) - x;

x= x>>(30-qN);

return x * ( (3<<qP) - (x*x>>qR) ) >> qS;
}

And, of course, there's an assembly version as well. It's only ten instructions, which I think is actually shorter than a LUT+lerp implementation.

@ ARM assembly version, using n=13, p=15, A=12

@ A sine approximation via a third-order approx.
@ @param r0   Angle (with 2^15 units/circle)
@ @return     Sine value (Q12)
.arm
.align
.global isin_S3a
isin_S3a:
mov     r0, r0, lsl #(30-13)
teq     r0, r0, lsl #1
rsbmi   r0, r0, #1<<31
mov     r0, r0, asr #(30-13)
mul     r1, r0, r0
mov     r1, r1, asr #11
rsb     r1, r1, #3<<15
mul     r0, r1, r0
mov     r0, r0, asr #17
bx      lr

Oh wait, the requirement was for the input to be in Q12 radians, right? Weeell, that's no biggy. You just have to do the x → z conversion yourself. Take, say, 220/(2π). Multiply x by this gives z as a Q30 number; exactly what the first line in the C code resulted in. This means that all you have to do is change the first line to x *= 166886;'.

NDS special

The assembly version given above uses standard ARM instructions, but one of the interesting things is that the NDS' ARM9 core has special multiplication instructions. In particular, there is the SMULWx instruction, which does a word*halfword multiplication, where the halfword can be either the top or bottom halfword of operand 2.The main result is 32×16→48 bits long, of which only the top 32 bits are put in the destination register. Effectively it's like a*b>>16 without overflow problems. As a bonus, it's also slightly faster than the standard MUL. By slightly changing the parameters, the down-shift factors r and s can be made 16, fitting perfectly with this instruction, although the internal accuracy is made slightly worse. Additionally, careful placement of each instruction can avoid the interlock cycle that happens for multiplications.

The alternate isin_S3a() becomes:

@ Special ARM assembly version, using n=13 and lots of Q14

@ A sine approximation via a third-order sine, using special ARM9 instructions
@ @param r0   Angle (with 2^15 units/circle)
@ @return     Sine value (Q12)
.arm
.align
.global isin_S3a9
isin_S3a9:
mov     r0, r0, lsl #(30-13)    @ x                 ; Q30
teq     r0, r0, lsl #1
rsbmi   r0, r0, #1<<31

smulwt  r1, r0, r0              @ y=x*x             ; Q30*Q14/Q16 = Q28
mov     r2, #3<<13              @ B_14=3/2
sub     r1, r2, r1, asr #15     @ 3/2-y/2           ; Q14+Q28/Q14/2
smulwt  r0, r1, r0              @                   ; Q14*Q14/Q16 = Q12

bx      lr

Technically it's only two instructions less, but is quite a bit faster due to the difference in speed between MUL and SMULWx.

2.1 High-precision, fifth order

The third order approximation actually still has a substantial error, so it may be useful to use an additional term. This would be the fifth-order approximation, S5. It and its derivative are given in Eq 14.

 (14) $\begin{eqnarray} S_5(x) &=& ax - b x^3 + c x^5 \\ \\ \\ S_5'(x) &=& a - 3b x^2 + 5c x^4 \end{eqnarray}$

To find the terms, I will again use z instead of x. The conditions of note are the position and derivative at z = 1 and the derivative at 0. With these conditions the approximation should behave amicably at both edges.

 (15) $\begin{eqnarray} S_5(z=1) &=& 1 &=& a &-& b &+& c \\ \\ \\ S'_5(z=1) &=& 0 &=& a &-& 3b &+& 5c \\ \\ \\ S'_5(z=0) &=& \frac\pi2 &=& a \end{eqnarray}$

Notice that these equations are linear with respect to a, b and c, which means that it can be solved via matrices. Technically this system of equations forms a 3×3 matrix, but since a is already immediately known it can be reduced to a 2×2 system. I'll spare you the details, but it leads to the coefficients of Eq 16. Note the complete absence of any horrid π5 terms that would have appeared if you had decided not to use dimensionless terms.

 (16) $\begin{eqnarray} a &=& \pi/2 \\ \\ b &=& \pi - 5/2 \\ \\ c &=& \pi/2 - 3/2 \end{eqnarray}$
 (17) $S_5(z) = \frac12 z (\pi - z^2 [ (2\pi-5) - z^2 (\pi - 3) ] )$

Eq 17 is the final quintic approximation in the form that's most accurate and easiest to implement. The implementation is basically an extension of the S3 function and left as an exercise for the reader.

2.2 High precision, fourth order

Fig 3. Sine and cosine.

Lastly, a fourth-order approximation. Normally, I wouldn't even consider this for a sine (odd function == odd power series and all that), but since the devmaster post uses them and they even seem to work, there seems to be something to them after all.

The reason those approximations work is simple: they don't actually approximate a sine at all; they approximate a cosine. And, because of all the symmetries and parallels with sines and cosines, one can be used to implement the other.

 (18) $\begin{eqnarray} sin(x) &=& cos(x - \pi/2) \\ \\ sin(z) &=& cos(z - 1) \end{eqnarray}$

Eq 18 is the transformation you need to perform to turn a cosine into a sine wave. This can be easily done in at the start of an algorithm. What's left is to derive a cosine approximation. Because a cosine is even, only even powers will be needed. The base form and its derivative are given in Eq 19.

 (19) $\begin{eqnarray} C_4 (x) &=& a - b x^2 + c x^4 \\ \\ \\ C_4'(x) &=& - 2b x + 4c x^3 \end{eqnarray}$

For the conditions, we once again look at z = 0 and z = 1, which comes down to the eqt of equations in Eq 20. One of the interesting thing about even functions is that the derivative at 0 is zero, so that's a freebie. A very important freebie, as it means that one of the required symmetries happens automatically.

 (20) $\begin{eqnarray} C_4(z=0) &=& 1 &=& a \\ \\ \\ C_4(z=1) &=& 0 &=& a &-& b &+& c \\ \\ \\ C'_4(z=1) &=& -\frac\pi2 &=& &-& 2b &+& 4c \end{eqnarray}$

The resulting set of coefficients are listed in Eq 21. Note that b = c+1, which may be of use later. The final equation for the fourth order cosine approximation is Eq 22. Only three MULs and two SUBs; nice.

 (21) $\begin{eqnarray} a &=& 1 \\ \\ b &=& 2 - \pi/4 \\ \\ c &=& 1 - \pi/4 \end{eqnarray}$
 (22) $C_4(z) = 1 - z^2 [ (2-\pi/4) - z^2 (1-\pi/4) ]$

Implementation

The floating-point implementation of Eq 22 is again too easy to mention here, so I'll focus on fixed-point variations. Like with S3, you can mix and match fixed-point positions until you get something you like. In this case I'll stick to Q14 for almost everything to keep things simple.

The real trick here is to find out what you need to do about all the other quadrants. Cutting down to four quadrants is, again, easy. For the rest, remember that the cosine approximation calculates the top quadrants and you need to flip the sign for the bottom quadrants. If you think in terms of the parameter that a sine gets, you see that only for odd semi-circles the sign needs to change. Tracing this can be done with a single bitwise AND or a clever shift.

/// A sine approximation via a fourth-order cosine approx.
/// @param x   angle (with 2^15 units/circle)
/// @return     Sine value (Q12)
s32 isin_S4(s32 x)
{
int c, x2, y;
static const int qN= 13, qA= 12, B=19900, C=3516;

c= x<<(30-qN);              // Semi-circle info into carry.
x -= 1<<qN;                 // sine -> cosine calc

x= x<<(31-qN);              // Mask with PI
x= x>>(31-qN);              // Note: SIGNED shift! (to qN)
x= x*x>>(2*qN-14);          // x=x^2 To Q14

y= B - (x*C>>14);           // B - x^2*C
y= (1<<qA)-(x*y>>16);       // A - x^2*(B-x^2*C)

return c>=0 ? y : -y;
}

And an ARM9 assembly version too. As it happens, it's only two instuctions longer than isin_S3a9().

@ ARM assembly version of S4 = C4(gamma-1), using n=13, A=12 and ... miscellaneous.

@ A sine approximation via a fourth-order cosine
@ @param r0   Angle (with 2^15 units/circle)
@ @return     Sine value (Q12)
.arm
.align
.global isin_S4a9
isin_S4a9:
movs    r0, r0, lsl #(31-13)    @ r0=x%2 <<31       ; carry=x/2
sub     r0, r0, #1<<31          @ r0 -= 1.0         ; sin <-> cos
smulwt  r1, r0, r0              @ r1 = x*x          ; Q31*Q15/Q16=Q30

ldr     r2,=14016               @ C = (1-pi/4)<<16
smulwt  r0, r2, r1              @ C*x^2>>16         ; Q16*Q14/Q16 = Q14
add     r2, r2, #1<<16          @ B = C+1
rsb     r0, r0, r2, asr #2      @ B - C*x^2         ; Q14
smulwb  r0, r1, r0              @ x^2 * (B-C*x^2)   ; Q30*Q14/Q16 = Q28
mov     r1, #1<<12
sub     r0, r1, r0, asr #16     @ 1 - x^2 * (B-C*x^2)
rsbcs   r0, r0, #0              @ Flip sign for odd semi-circles.

bx      lr

3 Testing

Deriving approximations is nice and all, but there's really no point unless you do some sort of test to see how well they perform. I'll look at two things: accuracy and some speed-tests. For the speed-test, I'll only consider the functions given here along with some traditional ones. The accuracy test is done only for the first quadrant and in floating-point, but the results should carry over well to a fixed-point case. Finally, I'll show how you can optimize the functions for accuracy.

3.1 Third and fourth-order speed

For the speed test I calculated the sine at 256 points for x ∈ [0, 2π). There will be some loop-overhead in the numbers, but it should be small. Tests were performed on the NDS.

Functions under investigation are the three S3 and two S4 functions given earlier. I've also tested the standard floating-point sin() library function, the libnds sinLerp() and my own isin() function that you can find in arctan:sine. The cumulative and average times can be found in Table 1.

Table 1: sine cycle-times (roughly).
Function (thumb/ARM) Total cycles average cycles
sin (F) 300321 1175.1
sinLerp (T) 10051 39.2
isin (T) 7401 28.9
isin_S3 (T) 5267 20.5
isin_S4 (T) 6456 25.2
isin_S3a (A) 3438 13.4
isin_S3a9 (A) 2591 10.1
isin_S4a9 (A) 3123 12.1

The first thing that should be clear is just why we don't use the floating-point sine. I mean, seriously. There is also a clear difference between the Thumb-compiled and ARM assembly versions, the latter being significantly faster.

Within the compiled versions, I find it interesting to see that the algorithmic calculations are actually faster than the LUT+lerp-based implementations. I guess loading all those numbers from memory really does suck.

And then there's the assembly versions. Wow. Compared to the compiled version they're twice as fast, and up to four times as fast as the LUT-based functions.

NDS timers measure half-cycles

The cycle-times from Table 1 do not make sense if you count instruction cycles. For example, for isin_S3a the function overhead alone should already be around 10 cycles. The thing here is that the numbers are taken from the hardware timers, which use the bus-frequency (33 MHz) rather than the ARM9 cpu (66 MHz). As such, it measures in half-cycles. For details, see gbatek:nds-timings.

3.2 Accuracy

Fig 4 shows all the approximations in one graph. It only shows one quadrant because the rest can be retrieved by symmetry. I've also scaled the sine and its approximations by 212 because that's the scale that usual fixed-point scale right now. And to be sure, yes, this is a different chart than Fig 1; it's just hard to tell because the fourth and fifth order functions are virtually identical to the real sine line.

Fig 4. Taylor and second to fifth order approximations to 212sin(x).

For the high-accuracy approximations, it's better to look at Fig 5, which shows the errors. Here you can clearly see a difference between S4d and S5, the latter is roughly 3 times better.

There's also a large difference between the devmaster fourth-order sine and my own. The reason behind this is a difference in conditions. In my case, I've fixed the derivatives at both end-points, which always results in an over- or underestimate. The devmaster's S4d let go of those conditions and minimized the error. I'll also do this in the next sub-section.

Fig 5. Errors for Fig 4.

Table 2 and Table 3 list some interesting statistics about the various approximations, namely the minimum, average and maximum errors. It also contains a Root Mean Square Deviation (RMSD), which is a special kind of distance. If you consider the data-points as a vector, the RMSD is the average Pythagorean length for each point. Table 2 is normed to 212, whereas Table 3 is table for the traditional floating-point sine scale.

The RMSD values are probably the most useful to look at. From them you can see that there is a huge gap between the low-accuracy and high-accuracy functions of about a factor 60. And if you do your math right, all it costs is one multiplication and one addition, and maybe some extra shifts in the fixed-point case. That's quite a bargain. Compared to that, the difference between the odd and even functions is somewhat meager: only a factor three or so. Still, it is something.

If you look at the fixed-point table, you can see that the error you make with S4d and S5 is in the single digits. This means that this is probably accurate enough for practical purposes. Combined with the fact that even fifth order polynomials can be made pretty fast, this makes them worth considering over LUTs.

Table 2: error statistics for 212sin(x) approx.
min avg max rms
Taylor3 -302.1 -51.5 0 92.7
S2 0 123.1 229.4 146.8
S3 -82.0 -47.6 0 55.0
S4d -4.47 0.19 3.11 2.44
S4 0 5.87 11.4 7.11
S5 0 0.74 1.62 0.94
Table 3: error statistics in percentages.
min% avg% max% rms%
Taylor3 -7.37 -1.26 0 2.26
S2 0 3 5.6 3.58
S3 -2 -1.16 0 1.34
S4d -0.11 0.0047 0.076 0.06
S4 0 0.143 0.278 0.174
S5 0 0.018 0.039 0.023

3.3 Optimizing higher-order approximations

From the charts, you can see that S4 and S5 all err on the same side of the sine line. You can increase the accuracy of the approximation by tweaking the coefficients in such a way that the errors are redistributed in a preferable way. Two methods are possible here: shoot for a zero error average, or minimize the RMSD. Technically minimizing the RMSD is standard (it comes down to least-squares optimization), but because a zero-average allows for an analytical solution, I'll use that. In any case, the differences in outcomes will be small.

First, think of what an average of a function means. The average of a set of numbers is the sum divided by the size of the set. For functions, it's the integral of that function divided by the interval. When you want a zero-average for an approximation, the integral of the function and that of the approximation should be equal. With a polynomial approximation to a sine, we get:

 (23) $\begin{eqnarray} \int_0^1 \sum_n a_n x^n dx &=& \int_0^1 sin(x\pi/2) dx \;\; \rightarrow \\ \\ \sum_n \frac{a_n}{n+1} &=& 2/\pi , \end{eqnarray}$

with an reducing to the coefficients of the polynomials we had before. This can be used as an alternate condition to the derivative at 0. For S4 and S5, you'll end up with the following coefficients.

 (24) $\begin{eqnarray} a_4 &=& 1 \\ b_4 &=& c_4+1 \\ c_4 &=& 5(1-\frac3\pi) \approx 0.225351707 \end{eqnarray}$
 (25) $\begin{eqnarray} a_5 &=& 4(\frac3\pi - \frac9{16}) \approx 1.569718634 \\ b_5 &=& 2 a_5 - 5/2 \\ c_5 &=& a_5 - 3/2 \end{eqnarray}$

If you're still awake and remember the devmaster S4d coefficients, there should be something familiar about a4. Yes, they're practically identical. If you optimize S4 for the RMSD, you actually get the exact same function as S4d.

Table 4 shows the statistics for the original approximations and the new optimized versions, S4o and S5o. The numbers for S4o are basically those from S4d seen earlier. More interesting are the details for S5o. The maximum and minimum errors are now within ±1. That is to say, this approximation gives values that are at most 1 off from the proper Q12 sine. This is about as good as any Q12 approximation is able to get.

Table 4: Optimized Q12 S4 and S5.
min avg max rmsd
S4 0 5.87 11.4 7.11
S5 0 0.74 1.616 0.94
S4o -4.72 0 2.89 2.47
S5o -0.73 0 0.79 0.52

Fig 6. Errors for optimized S4o and S5o.

4 Summary and final thoughts

Here's a few things to take from all this.

• When constructing a polynomial approximation, more terms mean higher accuracy. Symmetry properties of the function approximated allow you to remove terms from consideration, simplifying the equation.
• Coordinate transformations are your friends too. Sometimes it's much easier to work on a scaled or moved version of the original problem. If your situation has a characteristic length (or time, velocity, whatever) consider using dimensionless variables: expressing parameters as ratios of the characteristic length. This makes the initial units pretty much irrelevant. For angles, think circle-fractions.
• Zero and one (0 and 1) are the best values to have in your equations, as they tend to vanish to easily.
• Any approximation formula will have coefficients to be determined. In general, the Taylor series terms are not the best set; values slightly offset from these terms will be better as they can correct for the truncation. To determine the values of the coefficients, define some conditions that need to be satisfied. Examples of conditions are values of the function and its derivative at the boundaries, or its integrals. Or you can wuss out and just dump the thing in the Excel Solver.
• When converting to fixed-point, accuracy and overflow comes into the fray. If you know the domain of the function beforehand, you can optimize for accuracy. Also, it helps if you construct the algorithm in a sort of recursive form instead of a pure polynomial: not ax + bx2 but x(a + xb). Ordered like this, each new additional term only requires one multiplication and one addition extra.
• For fixed-point work, SMULWx` is teh awesome.
• Even a fourth order (and presumably fifth order as well) polynomial implementation in C is faster than the LUT-based sines on the NDS. And specialized assembly versions are considerably faster still.
• The difference in accuracy of S4 vs S2 or S5 vs S3 is huge: a factor of 60. Going from an even to the next odd approximation only gains you a factor 3. Shame; I'd hoped it'd be more.
• Unlike I initially thought, the even-powered polynomials work out quite well. This is because they're actually modified cosine approximations.

1. Express the parabolic approximation S2(x) of Eq 1 in terms of z. 's Not hard, I promise.
2. Implement the fixed-point version of the fifth-order sine approximation, S5(x).
3. For the masochists: derive the coefficients for S5(x) without dimensionless variables. That is to say, with the conditions at x = ½π instead of z = 1.
4. Solve Eq 24 and Eq 25 for minimal RMDS. Also, try to derive an analytical form for minimal RMDS; I think it's exists, but it may be tricky to come up with the right form.

88 thoughts on “Another fast fixed-point sine approximation”

1. Okay, I've uploaded two files in a zip that you can take a look at. You can find them here.

• sine-aprx.xls is my original filed during writing the article. Despair all ye who enter here
• isin_s3.xlsx specifically covers isin_s3() as it's given above. There's some magic in column K and L to emulate integer overflow, but the rest is pretty straightforward. The sheet is set up for quadrant 0, but you can fiddle with cells B2 and B3 to change the interval and angle scale.

1) ...
sin(18°) = 0.309, which translates to 4F1h which is a better match to 4BCh"
It is not clear to me how you got 4F1? which equation is been used to get 4F1h.
sin(18°) = 0.3090170273 is clear to me.

Fixed-point just means : apply a scaling factor to everything. A Q12 (12-bit fixed-point number) value means : scale everything by 212. So sin(18°) * 4096 = 1265 = 04F1h. 18° is 0.05 circle. Look up that value in the spreadsheet to see the real sin value and compare it with isin_s3.

2) I have another question which needs clarification, isn't x u16 when unit circle is been divided into Q15(32768)
I answered to myself this way and I want to cross check with you,
-32768 to 0 -- gives sine on negative x-axis(-2pi, -3pi/2, -pi, -pi/2, 0)
0 to 32768 -- gives sine on positive x-axis(0, pi/2, pi, 3pi/2, 2pi)

just because s16 range is -32768 to 32767, just for one bit on the positive side, you want to use s32, which accomodates -32768 to 32768(do you agree that, this should be mentioned as Q16 instead of Q15) and similarly for -65536 to 65536(Q17).

Is my interpretation correct?

No, this isn't how it works. You have a circle. What you want as the domain is a full run around the circle. For radians, that'd be [0, 2π). Or [−π, π), it doesn't really matter. What matters is: you want one run around the circle. Everything else just maps back to that.

When you normalize that run, you get a natural [0, 1) domain, where 0 means the starting point, ¼ means ¼ circle, etc. In Q15, that [0, 1) domain translates to [0, 215).

What you're proposing is a [-1, 1) domain, which covers two circles. Because everything just repeats, there's no need to cover positive and negative directions separately, because the sine over [-1, 0) is identical to that over [0, 1) (and [1, 2), etc). This periodicity is automatic if you use 16-bit variables and a 216-unit circle. -¼-circle then automatically maps to +¾-circle, or vice versa via integer overflow.

2. Wow!

Been looking for something like this for a long time. And it seems pretty clear to a neophyte like me. But can this be generalized to add the variables amplitude, phase and period?

What if I have is data that looks like a sine wave, one with a slowly changing phase, and period, and amplitude, but the ratio of amplitude to period is more or less constant over time. How do I fit this data with a dynamic sine wave [so I can project the last quarter cycle for my purposes]?

I want to do this in Excel with cell formulae if possible, else macro or VB code. I understand how to use the LINEST function to do a cubic fit, where LINEST generates the coefficients dynamically as the data changes. That works fine, but does not project well with the last set of coefficients.

Can I use LINEST to do a sine fit?

I also can approximate, dynamically via calculations on the data the amplitude, period and phase but the values lag the data too much.

Any ideas/links elsewhere I can examine?

Thanks.

Don

3. Been looking for something like this for a long time. And it seems pretty clear to a neophyte like me. But can this be generalized to add the variables amplitude, phase and period?

What if I have is data that looks like a sine wave, one with a slowly changing phase, and period, and amplitude, but the ratio of amplitude to period is more or less constant over time. How do I fit this data with a dynamic sine wave [so I can project the last quarter cycle for my purposes]?

Well, if the amplitude, period or phase changes dynamically, technically you don't have a sinewave anymore :P

If I'm reading you right, what you have is the following:
$y(x) = A(x) \cdot sin( \frac{x-x_0(x)}{T(x)} )$
To get an approximation in the way I'm doing for the sine wave, you'll have to differentiate/integrate that to x, which will be a bit of a bitch to say the least.

I want to do this in Excel with cell formulae if possible, else macro or VB code. I understand how to use the LINEST function to do a cubic fit, where LINEST generates the coefficients dynamically as the data changes. That works fine, but does not project well with the last set of coefficients.

Can I use LINEST to do a sine fit?

Not LINEST, no. LINEST is the LINe ESTimator, which only works for straight lines. the reason you can also use them for power-series and exponentials is because there ware ways of turning those into straight lines, but that doesn't work for a wave.

What you might be able to do is use GoalSeek or the Solver Add-in. How that works is as follows: you start with something like the equation given above, and define models for A, T, t0. I don't know what your data looks like, so just try linear functions first. This will give you 6 coefficients -- or 5 if A and T not independent. In your excel sheet, you have x and y data in columns, right? Now,

1. Put the coefficients in cells somewhere.
2. Add another column with your estimator function, using the coefficient cells.
3. Next, have an error column, denoting the difference between the actual data and the model function.
4. Then put the sum of squares in a cell. By tweaking the coefficients, you can try to minimize this value to get a Least Squares fit.
5. Alternatively, you can make excel do this for you, using Goalseek or Solver.

Does that help?

I have done a nice 5th order sine approximation with a full circle equivalent to 65536 and an amplitude of 32768 which is especially nice for 16bit computations. The maximum error is 0.04% which seem to be what is to be expected. I will do an optimised 68k assembly implementation one of these days (yes, there are still people coding for 68k processors as a pasttime...).

Now I wondered whether doing a sincos function which returns both the sine and the cosine of a given angle could save some computational effort. This is why I'm looking into approximating cosine now.

What I don't get is your comment about sine giving an extra order of precision for the same computational effort when compared to a cosine. In my understanding it should be the other way round. If I start off with a 6th order approximation of cos I will get

C6(x)=a-bx^2-cx^4+dx^6

I can write this as follows:

C6(x)=a - x^2*(b-x^2*(c-d*x^2))

This means that I can do a 6th order approximation of a cosine with four muls which is the same number of muls I needed for the 5th order sine approximation (unless I'm missing something which may be obvious to you).

For the 6th order cosine approximation (which I haven't implemented) I have used C(z=0.5)=1/sqrt(2) as an additional condition for solving the equation system. I was hoping to get rid of some odd numbers while solving the equation system but admittedly the coefficients I got don't look all that nice. After turning them into integers they look ok, though:

A16 = 32768
B16 = 40419
C16 = 8270
D16 = 619

I hope I haven't made any mistakes while deriving these values. I will know when I have implemented my cosine approximation.

Sorry if any of this is nonsense, it's been decades since I learned my math...

5. Just wanted to add that I have implemented my 6th order cosine approximation in both C and 68k asm and it works fine. Precision is four times better than for the 5th order sine but requires the same number of muls (1 16x16 and 3 32x32 bit muls).

The constants I gave above work fine but with A=32768 there is an overflow for cos(0) (and near that) when using a 16bit result. Not very surprising if you think about it. One solution is to set A=32767 but then I also lose -32768 which is a valid number in 16bits. I found this annoying and solved this by approximating -cos() in the quadrants left and right of x=0. Since we have to deal with quadrants anyway I simply to a NOT operation (bit complement) .for inverting the -cos back to +cos rather than doing NEG (negate). This turns -32768 into +32767 and gives the -1 for positive results for free. Of course, this introduces a very small systematic error into the approximated cosine but I think it won't matter.

I'm very happy with my fast 68k asm sin/cos... :)

6. Hello Philipp, good to see there are still a few people enjoying the good ole 68k :)

Any chance of sharing the 6th order code? it would be great to compare with my similar arm code :) that and I love to read optimised code... sad isnt it.

7. I uploaded my code here:

I haven't given the quadrant code much thought so chances are that it may be optimised, e.g. the negate followed by the add/sub 32768 could probably be replaced by some clever bit magic.

As I wrote, icos() really approximates -cos() which leads to negated constants. However, since the 68k hasn't got an rsub instruction like the ARM, I had to negate some of the constants again to save some additional move.

The isincos() should execute quite well on the superscalar 68060 but the instruction order could probably be enhanced a little to save another cycle or two on an 060 (two subsequent muls cannot be executed in parallel on an 060).

If a sin/cos table were to be built using isincos(), the code could be modified to load the constants into registers and using those instead of the immediate values as this would save some cycles for the repetitive calculations. In such a case the icos() should be used for both sine and cosine as it is a little more precise and almost as fast as the isin().

8. A nice and very illustrative read is "Approximations for Digital Computers"
by Cecil Hastings, Jr., Princeton University Press 1955 (!).
Can be found in the internet, e. g. PDF_Txt_Hstngs_Aprx_Comp.pdf.
See p. 138.
Regards, Hartmut

9. Hey, thank you so much for this. Yesterday I wrote a program for ARM using normal sine, but it was too slow. This morning I woke up thinking about Taylor series, but fixed point numbers broke me up. I did my homework for 5th order, but I can't quite figure out the bit shifts.

Here are the amplitude parameters I found: https://www.overleaf.com/read/hnrfwhshhkxs

You kind of glossed over the details for doing the matix math(I'm taking your results for granted) and calculating the shifts. How do you go from a desired output or input type to those?

I want to generate 16bit audio, but from your code it seems your output is Q13, and that is bounded by the 32bit intermediate values. You write that you derive it from the bits needed to multiply x, so for 32bit that would mean x can only be 16bit, correct?

10. On 05/13/15 you replied to some questions I asked about an approximate sine fit. First, thank you so much for your reply, unfortunately I got involved in other things, but just tonight remembered this reply and just read it again.

I believe this can be simplified by holding the Period [PER] and the Amplitude [A] constant for the duration of any quarter cycle, then allowing them to be slightly changed at the quarter cycle points of 0, Pi/2, Pi, 3Pi/2. I also could only change these variables at 0 and Pi if that seems better. Either would work with my data, I think.

The value of x = 2*Pi*t/PER where "t" is the day variable and PER is the tunable number of days for the current cycle. Note that for t=PER/4, this is the same as x=Pi/2.

The formulae are;

S(x) = A*[a*x - b*x*3] = A and
S'(x) = A*[a - 3*b*x^2] dx/dt = 0 where x = 2*Pi*t/PER = (2*Pi*PER/4)/PER = Pi/2

a = 3/Pi and b = 4/Pi^3 compute to the same values you came up with IF PER and A are held constant for any given quarter cycle.

So do you agree with this? Oh am I missing something?

Don

11. P.S. I forgot to mention that Phase can be zero since I always will recalculate the Period at the end of the fourth quarter cycle, which will be the start of the next cycle at Phase zero.

12. Also, if you have any thoughts on calculating the minimum lag detrending line through the data to get the data into its sine form, let me know. I can dynamically detrend with a Centered Moving average using PER data points, but this does not work for the last half cycle of data, which is the most critical to being able to project forward. I have tried the Andrews Pitchfork approach to projecting the next apex and trendline point, but with limited success.

13. Trying to make a 64 bit 48/16 FP version of the 4th order approx, but not having much luck:

It only puts out the Fixedpoint value equal to 0.25.

14. Matthias, Getting the shifts right is alwayas a little tricky, I'll see if I can figure it out when I have the time for it. That said, remember that the 4th order approx is only good for about 10 bits. While you can add more bits (i.e., Q16), those later bits will be inaccurate anyway. With that in mind, you can also just use the 32bit Q12 result and shift up by 4.

Please check my script which tries to generate polynomial approximations base on method from this article. This script is using SymPy so the resulting polynomial has infinite precision =)
Of course adapting to fixed point is another story...
still modern MCU have float hardware which can give fast, precise polynomial approximations using just few Multiply accumulate instructions (1 float MAC = 1 cycle on cortex M4) using methods described here. I really want to benchmark these against CMSIS trigonometric functions.

output of script for example from article (2.1 High-precision, fifth order):
a + b + c - 1 = 0
a + 3*b + 5*c = 0
a - pi/2 = 0

x**5*(-3/2 + pi/2) + x**3*(-pi + 5/2) + pi*x/2

SNR: 71.0 [dB]
ENOB: 11.8

16. To obtain polynomial you just need to provide a prototype and list of data points for function value and derviative optional args are 'odd' 'even' debug'.

make_poly(sin(pi*x/2),[1],[1,0],'odd','debug')

17. @Matthias:

Here you go: https://pastebin.com/WNb67ivf
One of the problems in your initial code was things like 2<<16 instead of 1<<16, which skews everything. Also, operator precedence.

This particular version uses Q24 for the intermediates to make things easier. You can gain a higher precision by shifting things about, but ultimately since the C4 approximation is only accurate to about 10 bits, that hardly matters. A few simple tests are included as well.

For stuff like this, it's probably easier (well, relatively easier) to always shift the intermediary results down to a fixed fixed-point. Only after that should you think about moving the fixed-point about. This stuff is hard enough as it is, don't you think?

18. Yes, you are right, that was the issue, and I already have funcs for elementary dealings with Q16 numbers (mult, divd). The 5th order approx is more accurate, too. I have gotten a version of the 5th order func to work relatively well, but I'm not getting the accuracy indicated here now. Furthermore, the sign carryover is not reflecting correctly:

I also have a func for doing square roots with fixed point numbers that is fairly interesting, I would like to learn about using the sort of analytical graphs you have here to evaluate it's output over a range, do you have a link to a good guide?

19. Addendum: If the 4th order provides 10 binary digits of accuracy, and 5th order gives me around 12 - 13 (I think), then would I really need a 6th or (most likely) 7th order approximation to get sub-quantum accuracy relative to Q16 precision? That's really my goal, and a couple extra Q16 MULT()s is not terribly much overhead.

20. Addendum 2: After further review, I just realized that the masking process for the even ordered approx uses a different truth table for reflection than the odd-ordered ones. Substituting the branch from the 3rd order example fixes things, but the accuracy is still lower than I would like. What constraint would I used to figure out what the Dx^7 constant coefficient is?

Here is a pastebin of the corrected code: https://pastebin.com/TXF9V2GK

21. I see in the addended pastebin that one of the parentheses for B is wrong. Did you correct for that already?

I've done a quick test with my own implementation of sine5 (https://pastebin.com/0x70eeQa). Notice that I do most of the math in Q24, because why not.
At α = π/4 I get 0xB518. The true value should be 0xB504, which makes my S5 approximation off by roughly 0.05%. This is about the accuracy I expected from Fig 5.

For true Q16 precision, it's necessary to do the intermediary calculations at higher precision. The range of calculations is around [-4.0, +4.0] I think, so you should have bits to spare.

As for the constraints for D: they can be whatever you want :D No really, you can choose the constraints yourself. Suggestions are:

• sin(π/4) = 1/&sqrt;2 (extra intermediary point)
• sin(π/6) = ½ (extra intermediary point)
• ∫ sin dx (0, π/2) = 2/π (area under curve; minimizes average error)
• Minimize the root-mean square. For this you'll need something like Excel's GoalSeek or Solver.

I discuss some of these options in section 3.3. The point is that you can decide on what the characteristics of the curve. I kinda like an extra intermediate point or average error, but for true minimal error, RMS is probably the best option. However, you can't solve for that analytically.

22. Cearn,

Parentheses? Do you mean at line 17 or line 28? Also, I'm still new to the curve fitting thing, so I haven't "got" what makes a good target value to plug into the target method. Is it any value that intersects the x axis on the normal sine func? Right now I'm not using the optimized version as I'm still trying to wrap my head around the basics.

RMS is probably my end method for figuring out the optimized constants, but I have no experience with Goalseek or Solver, I will check those out.

As far as precision goes, my target from the beginning was a complete set of elementary trig functions that use Q16 nums inside of int64_t for all inputs and outputs. 48 bits above the radix should be more than enough scratchspace for the intermediary calcs, and I dislike factoring out the shifts for what should be a separate func/concern. Do you used Q24 for the extra precision on the bottom end? Additionally, both Gcc and Clang should be able to do the final bitshift optimization/algebraic reduction during compile if I set MULT() and DIVD() to allow for inlining.

In fact, here is a paste of my MULT() and DIVD() func for Q16 numbers stored in 64 bit registers: https://pastebin.com/2mC2GxJx

23. When the 3rd order function is calculated S3(x)=3x/PI – 4x³/PI³, there are 5 constants A, n, p, r, s with: r = 2n−p and s = n+p+1−A
I do not understand exactly how A, n, p are chosen.
If I understand right:
- “The scale of the outcome (i.e., the amplitude): 2^A”
The A is chosen A=12, because the sine value return is Q12.
- “The scale on the inside the parentheses: 2^p. This is necessary to keep the multiplications from overflowing.”
The p is chosen p=15, because the final sine function isin_S3 is on 32bit and has the argument on 32bit and when e multiply 2 operands, in order to avoid overflow, then each operand must be 15 binary digits after the decimal point plus the sign.
- “The angle-scale: 2^n. This is basically the value of ½π in the fixed-point system. Using x for the angle, you have z = x/2^n.”
The n is chosen n=13, because if full circle (0...2PI) uses 2^15 angles for resolution, then for 1 quarter of circle (0..PI/2) we need only (2^15)/4=(2^15)/(2^2)=2^13, so n=13.
Is my interpretation right/proper? Please correct me if I am wrong.

I would like to port and test these calculations with a 16bit PIC24 microcontroller.

24. Hi!

Sorry, for the delay. So much to do all the time ...

Anyway, about your questions. A and n are indeed chosen based on the user's parameters. I'm wondering now if I should have let n simply be the full-circle instead of a quarter-circle.

I think you're right about p as well. This is a tricky constant, because it's there's no real "best" value for it. In my case, I believe 15 worked, and 16 would send it over the edge of overflow somewhere along the line. You just need to have something to scale the mulitplications with, and I just named it p.

So yeah, I think you interpreted it correctly. (Which is amazing, as even I have difficulty when reading it these days >_>)

25. In mean time I solved your change with S5(x) and I implemented in ASM for PIC24 microcontroller S3(x) and S5(x).
The next Word file to preserve nice format of seeing the polynomial, you have to download it to see the polynomials in a nice format:
https://goo.gl/HCc9JT
PIC24HJ64GP202 on 16bit with internal clock 80MHz, 25ns for one instruction:
x=0…16383 and S(x)=0…4095.

S3(x) takes 250ns.

//multiply x*x, the result is W2 and W3
//only content of W3 is used
MOV x,W0
MOV x,W1
MUL.UU W0,W1,W2

//shift logical right with 15 (x*x>>15)
//by multiplyinh with 2 and shift right with 16
//the result is W4 and W5, only content of W4 is used
MUL.UU W3,#0x2,W4

//substract (3*2^13) - (x*x>>15), the reuslt is in W4
MOV #0x6000,W6
SUB W6,W4,W4

//multiply x*((3*2^13) - (x*x>>15)), result in W2 and W3
//only W3 is used, being equivalent with shift right 16bit
MUL.UU W1,W4,W2

//get the result from W3 in S3x
MOV W3, S3x

S5(x) takes 425ns.

//multiply x*x and divide by 2^16, the result is in W3
MOV x, W0
MOV x, W1
MUL.UU W0, W1, W2

//multiply by 9279=0x243F and divide by 2^16, the result is in W7
MOV #0x243F, W4
MUL.UU W3, W4, W6

//substract from 5256=0x1488, the result is in W6
MOV #0x1488, W6
SUB W6, W7, W6

//multiply with x and divide by 2^16, the result is in W9
MUL.UU W6, W0, W8

//multiply by 2^5=32=0x20, the result is in W10
MOV #0x20, W5
MUL.UU W5, W9, W10

//multiply with x and divide by 2^16, the result is in W7
MUL.UU W0, W10, W6

//substract from 25736=0x6488, the result is in W7
MOV #0x6488, W6
SUB W6, W7, W7

//multiply with x, which is in W0 or W1 and divide by 2^16, the result is in W3
MUL.UU W0, W7, W2

//get the result from W3 in S5x
MOV W3, S5x

(cearn: edited for code blocks)

I would have enjoyed it more if additional explanations were given, but never the less it opened my eyes and I thank you that you took time to write it.

Microchip offers fixed point math libraries for their XC compilers, but is embedded, you cannot see the code and use with other compilers, except if it is stripped out from .lst file. I think they use also polynomial approximation approach, maybe higher order polynomial.

Page 232 of the next file - see _Q15sinPI for example:
Here is the stripped code for PIC24, in case someone is interested.
It takes 1.9us at 80MHz internal clock of PIC24HJ64GP202.

int fast_sin(int an) {
#asm
mov      an, w0
mov      w0, w9
mov      w0, w2
mov      #0x8000, w1
clr      w0
cpsne    w2, w1
return
mov      w8, [w15++]
mov      #0x1, w8
cpsgt    w2, w0
mov      #0xffff, w8
mul.ss   w2, w8, w0
mov      #0x4001, w2
mov      #0x8000, w3
cpslt    w0, w2
sub      w3, w0, w0
mov      #0x28bf, w2
cp       w0, w2
bra      GE, SinePI_CosCall
mov      #0x6488, w2
mul.ss   w0, w2, w2
mov      #0x1000, w4
lsr      w2, #0xd, w2
sl       w3, #0x3, w4
ior      w2, w4, w0
mov      w0, w2
mov      #0x6bb5, w0
mov      #0x7fff, w1
cpsne    w2, w1
bra      L_SIN_PI_RETURN
mov      #0x944b, w0
mov      #0x8001, w1
cpsgt    w2, w1
bra      L_SIN_PI_RETURN
mul.ss   w2, w2, w6
lsr      w6, #0xf, w6
sl       w7, #0x1, w7
ior      w6, w7, w3
sl       w3, #0x1, w3
mul.su   w2, w3, w4
mov      #0x5555, w1
mul.ss   w5, w1, w6
asr      w7, #0x1, w7
sub      w2, w7, w0
mul.su   w5, w3, w4
mov      #0x4444, w1
mul.ss   w5, w1, w6
asr      w7, #0x5, w7
mul.su   w5, w3, w4
mov      #0x6807, w1
mul.ss   w5, w1, w6
mov      #0x400, w5
asr      w7, #0xb, w7
sub      w0, w7, w0

L_SIN_PI_RETURN:
bra      SIN_PI_END

SinePI_CosCall:
mov      #0x4000, w3
sub      w3, w0, w0
mov      #0x6488, w2
mul.ss   w0, w2, w2
mov      #0x1000, w4
lsr      w2, #0xd, w2
sl       w3, #0x3, w4
ior      w2, w4, w0
mov      #0xff01, w1
mov      #0xff, w2
cp       w0, w1
bra      LT, SIN_PI_END
cp       w0, w2
bra      GT, L_SIN_PI_Cos_Else
mov      #0x7fff, w0
bra      SIN_PI_END

L_SIN_PI_Cos_Else:
mov      w0, w2
mov      #0x8000, w1
mov      #0x4529, w0
cpsne    w2, w1
bra      SIN_PI_END
mov      #0x7fff, w1
mov      #0x7fff, w0
mul.ss   w2, w2, w4
lsr      w4, #0xf, w4
sl       w5, #0x1, w5
ior      w4, w5, w4
sl       w4, #0x1, w4
mul.us   w4, w1, w2
mov      #0x8000, w7
mul.su   w3, w7, w6
sub      w0, w7, w0
mul.us   w4, w3, w2
mov      #0x5555, w7
mul.su   w3, w7, w6
asr      w7, #0x3, w7
mul.us   w4, w3, w2
mov      #0x2d83, w7
mul.su   w3, w7, w6
asr      w7, #0x7, w7
sub      w0, w7, w0
mul.us   w4, w3, w2
mov      #0xd0, w7
mul.su   w3, w7, w6
asr      w7, #0x7, w7

SIN_PI_END:
mul.ss   w0, w8, w0
mov      [--w15], w8
mov      w0, _RETURN_
#endasm
}

(cearn: edited for code blocks)

I would have enjoyed it more if additional explanations were given, but never the less it opened my eyes and I thank you that you took time to write it.

Thanks :D

On the "additional information", could you elaborate on that? Over the years, I've gotten more than a few comments on this article. Whenever I have to re-read things, I also feel like I'm missing a few things.

I'm toying with the idea of revising it a little, and would appreciate your feedback.

28. I have noticed you don't monetize your website,
don't waste your traffic, you can earn extra cash every month because you've got
hi quality content. If you want to know how to make extra money,
search for: Mrdalekjd methods for \$

29. I have checked your site and i have found some duplicate content, that's why you don't rank high in google, but there is
content, search for; Boorfe's tips unlimited content

30. Really, in doing the actions, they'll not even understand that they're learning the
essential mathematical skills at the identical time https://math-problem-solver.com/ .
I commented on the portrayal of mathematics in that new series in a latest commentary within the Huffington Put up.

31. This website is completely gret. I've researched these
stuffs a whole lot and I view it that is good written, easy to understand.

I congratulate you for this article that I am going to recommend to the people
friends. I ask you to visit the gpa-calculator.co
site where each scholar or learner can find results
gpa marks. Success!

32. Thank you for this great article.
I am learning low level programming and assembly language on a 8 bit micro-controller.
In the post, OK for the third order polynomial approximation of the sinus and the change of variable.
But then in section 2 "Derivations and implementations", subsection "Third-order implementation", I don't understand how the constants A, p and n are found.
I actually don't know much concerning fixed-point calculation.
Can someone help me to understand how these constants are calculated or give me a link where I can find some material on this subject?
Another thing I don't understand is the exclusive or trick: "if( (x^(x<<1)) < 0) ..." (below equation (13)).
Hope someone might help me out!

33. This was exactly what I needed for doing fast, fixed-point trig on 32-bit M0 ARM microcontrollers like the BBC micro:bit. Your step-by-step development of the techniques taught me a lot. Thank you taking the time to write this up!

34. Thanks for the interesting article. I made some observations and changes and have a variation to present. First, the output is specified in the original as Q12. That is only partially true, because for the full unit circle it will go negative which means 13 bit signed. Also, as mentioned, Z goes from 0 to 1 over angle 0 to 90, so two extra bits are in the argument for a full unit circle. Also when working with integers, sin and cos are problematic in that 1.0 is not generally a valid number, and shows up at 90 and 0 degrees respectively. Finally, a good fit to computer architecture would suggest that a unit circle have an x value of 0 to 65535 and yield a result over the range of -32767 to 32767. In fact one criteria for a successful set of coefficients has to be that no output exceed 32767, so there are no values that overflow. This is more important than any of the 'fit' criteria.

So I set out to come up with a variation that accomplishes all of these. Also taking a cue from the optimized coefficients, I decided to optimize them further experimentally, rather than theoretically. I should add that rounding is a factor here as well. Every time a right shift occurs, one should add 1/2 the resulting LSB before shifting to cause rounding instead of truncating. The a and b coefficients have inner values conveniently subtracted from them, so can be adjusted to include a rounding factor. Rather than explicitly doing so, I let the optimization just factor that into the constant.

The code below contains several parts. First, I implemented the basic algorithm in its original form in floating point as a sanity check. The optimized constants from the article were then used instead. No additional fine tuning was attempted on this. It is called sin5(). The second part is isin() which is where the optimizations were done. This is designed for first quadrant 14 bit argument and 15 bit unsigned result. One difference is that I subtracted from the constants before shifting in each case, which facilitated the 'built in rounding capability'. The algorithm is otherwise similar to several above.

Finally, a main() program wraps this all together. It also includes a call to the standard C library sin() function which served as a reference. The result of that is multiplied by 32767 (not 32768 to avoid the overflow case) and rounded. An error term is created and used for both 'binning' the resulting values, computing the mean and computing the RMS error.

The ai, bi and ci constants were arrived at by minimizing the errors. It turns out that it is not possible to fully optimize the results and still not exceed 32767. The first concession made in that regard is that the mean value is negative. This does not impact the signal qualities. The second concession was that without worrying about range, it was possible to get the RMS error down to 1.414, but the best RMS error without exceeding 32767 was 1.732. All errors fell within the range of -4 to +4 with the vast majority in the range of -3 to +3. Minimizing this range was another optimization objective.

Because ai and bi are used before shifting, LSB changes in them are not significant, so are zero filled in decimal. "Rounded" hex values would have been equally appropriate.

As a final check on the results, the data was imported into an analysis program and an FFT transform performed on it. The 3rd and 7th harmonics were at -99 dBc, and the 5th harmonic was at -82 dBc. all other responses were more that 100 dB below the fundamental. For 16 bit resolution, -96 dB is the theoretically achievable result.

Here is the program used. It was written for a 32 bit processor, but could easily be adapted to other architectures, including FPGA/DSP.

// 16 bit integer sine function
#include
#include

#define pi 3.14159265359
#define a (4.0 * (3.0 / pi - 9.0/16.0))
#define b (2.0 * a - 5.0/2.0)
#define c (a - 3.0/2.0)
double sin5(double z) // computes sin 0.0 ~ 1.0 (first quadrant) for z over 0.0 ~ 1.0
{ // i.e. z = x / (pi / 2) where x is in radians
double sq = z * z;
double qd = sq * (b - sq * c);
return z * (a - qd);
}

#define ai 421495000
#define bi 2756132000
#define ci 74931

int isin(int x)
{
unsigned int sq = (x * x) >> 16;
unsigned int sq2 = (bi - (sq * ci)) >> 16;
unsigned int cu = (sq2 * x) >> 16;
unsigned int qu = (ai - (cu * x)) >> 11;
unsigned int s5 = (qu * x + 32768) >> 16;
return s5;
}
// at this point it works but runs high with errors of up to 15 or 20
// possibly due to rounding - apparently not

void main(void)
{
int i;
int sum = 0;
int sumsq = 0;
int bins[20];
int maxsum = 0;
int oversum = 0;
#define step 1
#define count 16384
for (i = 0; i < 20; ++i) {
bins[i] = 0;
}
for (i = 0; i = 32768) ++ oversum;
++bins[e + 10];
printf ("%5d, %5d, %5d, %5d\n", i, r, f, t);
}
printf("%6d %f %2d %d\n ", sum, sqrt(sumsq/(count/step)), maxsum, oversum);
for (i = 0; i < 20; ++i) {
printf ("%3d %4d\n", i - 10, bins[i]);
}
}

35. Follow up to the last article.

Firs, there was a bug in the RMS calculation the line should read
printf("%6d %f %2d %d\n ", sum, sqrt((1.0 * sumsq)/(count/step)), maxsum, oversum);
subtle difference, but it forces the divide to be done in floating point, without which the square root becomes granular and not as useful.

In thinking about the optimization, I realized that at 0, the approximation is exact (0) by the nature of the formula (true of almost all sine approximations). But such an inherent relationship does not exist at the opposite end (1). Indeed, the approximation is too large at that end, and constraining it to < 32768 to prevent overflow, prevents full optimization. So I changed the reference calculation to multiply the reference sine wave by 32762 instead (the largest value that prevented overflow when optimized). This introduces a 0.018% gain error.

Almost all applications of sine and cosine are in signal processing, either for audio or for RF. Limiting the range to 32767 instead of 32768 already introduces a 0.003% gain error, and the 0.018% error is still negligible. In both Audio and RF, unwanted frequency products are far more significant than a small gain error. In fact if it matters, the gain error can easily be corrected in a subsequent filter or other stage. For both Audio and RF, the most troublesome artifacts are intermodulation distortion, because they produce products unrelated to the signal. At audio these can be heard by an observant listener down to at least -90 dB. A sine wave generator isn't generally capable of generating intermodulation products, but harmonics can mix with other things to produce them. The second most troublesome artifacts are harmonic distortion. For Audio, odd harmonics are more objectionable, because they introduce new sounds that add harshness, while even harmonics are octaves that simply brighten the sound, and have to be fairly strong to even be noticed.

So back to the formula. By lowering the "expectation" of amplitude, I found I was able to reduce the RMS error to < 1.5 while still keeping the range under 32768. My goal was to lower the one artifact that was above -90 dBc in the spectrum. With the RMS calculation being done properly in floating point, I was able to minimize the error fairly quickly by adjusting the coefficients, resulting in the following, which I submit as the optimum possible:
#define ai 421468797
#define bi 2758881220
#define ci 75676
They were refined until no further improvement was possible. ai can vary +/- 1 without changing the outcome, and bi by +/- about 20. I was surprised that such small changes in these two were significant since after using them, the results are shifted right 16 places. This is undoubtedly related to the rounding factor built into them as discussed in my previous post, which is why I do the subtract before shifting.

Running this result through an FFT revealed that the worst artifact was still the fifth harmonic but it was now -86.2 dBc., about a 4 dB improvement and closer to my goal of all artifacts below -90 dBc. The bottom line is that this approximation fall slightly short of delivering 16 bit performance, but is very close. The presence of fifth harmonic suggests that the x^5 term isn't perfect, which fits with the fact that the output oscillates near 90 degrees, not being fully monotonic. A 7th order term would undoubtedly fix this, but at additional computational cost (3 more multiplies).

It should be noted that the FFT, since it is computing at the sample rate, cannot show artifacts related to the interaction of the sample rate and the sine wave. Typically the function would be called with successive phase increments to generate an output at a desired frequency, rather than a power of 2 fraction of the sample rate. This process can introduce intermodulation products, although they should generally be very low in amplitude and much higher in frequency, such that they can easily be filtered to an acceptable level. Unfortunately, the nature of FFTs is that they only work optimally for frequencies that have a power of 2 relationship to the sample rate, so even setting up a test at a different sampling rate introduces its own artifacts that limit the usefulness of the results.