<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Coranac &#187; code</title>
	<atom:link href="http://www.coranac.com/category/code/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.coranac.com</link>
	<description>my own little world</description>
	<lastBuildDate>Sat, 19 Nov 2011 16:43:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<item>
		<title>DMA vs ARM9, round 2 : invalidate considered harmful</title>
		<link>http://www.coranac.com/2010/03/dma-vs-arm9-round-2/</link>
		<comments>http://www.coranac.com/2010/03/dma-vs-arm9-round-2/#comments</comments>
		<pubDate>Sun, 28 Mar 2010 17:29:08 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[nds]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=175</guid>
		<description><![CDATA[It would seem these two aren&#8217;t finished with each other yet. &#160; A while ago, I wrote an article about NDS caching, how it can interfere with DMA transfers and what you can do about them. A little later I got a pingback from ant512, who had tried the &#8220;safe&#8221; DMA routines I made [...]]]></description>
			<content:encoded><![CDATA[<p>
It would seem these two aren&#8217;t finished with each other yet.
</p>
<p><div>&nbsp;</div></p>
<p>
A while ago, I wrote<br />
<a href="http://www.coranac.com/dma-vs-arm9-fight/">an article about NDS caching</a>,<br />
how it can interfere with DMA transfers, and what you can do about them.<br />
A little later I got a<br />
<a href="http://ant.simianzombie.com/?p=1114">pingback from ant512</a>,<br />
who had tried the &ldquo;safe&rdquo; DMA routines I made and found they<br />
weren&#8217;t nearly as safe as I&#8217;d hoped. I&#8217;m still not sure what the actual<br />
problem was, but this incident did make me think about one possible<br />
reason, namely the one that will be discussed in this post: problematic<br />
cache invalidation.
</p>
<p><h3 id="sec-tests">1
Test base
</h3>
</p>
<p>
But first things first. Let&#8217;s start with some simple test code, see below.<br />
We have a simple struct definition, two arrays using this struct, and<br />
some default data for both arrays that we&#8217;ll use later.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">// A random struct, 32 bits in size.</span><br />
<span class="kw1">struct</span> Foo<br />
{<br />
&nbsp; &nbsp; u8&nbsp; type;<br />
&nbsp; &nbsp; u8&nbsp; id;<br />
&nbsp; &nbsp; u16 data;<br />
} ALIGN(<span class="nu0">4</span>);</p>
<p><span class="co1">// Define some globals. We only use 4 of each.</span><br />
Foo g_src[<span class="nu0">16</span>] ALIGN(<span class="nu0">32</span>);<br />
Foo g_dst[<span class="nu0">16</span>] ALIGN(<span class="nu0">32</span>);</p>
<p><span class="kw1">const</span> Foo c_fooIn[<span class="nu0">2</span>][<span class="nu0">4</span>]= <br />
{<br />
&nbsp; &nbsp; { &nbsp; <span class="co1">// Initial source data.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0&#215;55</span>, <span class="nu0">0</span>, <span class="nu0">0&#215;5111</span> }, <br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0&#215;55</span>, <span class="nu0">1</span>, <span class="nu0">0&#215;5111</span> },<br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0&#215;55</span>, <span class="nu0">2</span>, <span class="nu0">0&#215;5111</span> }, <br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0&#215;55</span>, <span class="nu0">3</span>, <span class="nu0">0&#215;5111</span> } <br />
&nbsp; &nbsp; }, <br />
&nbsp; &nbsp; { &nbsp; <span class="co1">// Initial destination data.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0xDD</span>, <span class="nu0">0</span>, <span class="nu0">0xD111</span> },<br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0xDD</span>, <span class="nu0">1</span>, <span class="nu0">0xD111</span> },<br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0xDD</span>, <span class="nu0">2</span>, <span class="nu0">0xD111</span> },<br />
&nbsp; &nbsp; &nbsp; &nbsp; { <span class="nu0">0xDD</span>, <span class="nu0">3</span>, <span class="nu0">0xD111</span> } <br />
&nbsp; &nbsp; }, <br />
};</div>
</div>
<p>
And now we&#8217;re going to do some simple things with these arrays that<br />
we always do: some reads, some writes, and a struct copy. And for the<br />
copying, I&#8217;m going to use DMA, because DMA-transfers are fast,<br />
amirite<span class="fnote"><a href="#ft-nr1" title="No I&#8217;m not. For NDS WRAM-WRAM copies, DMA is actually
slow as hell and outperformed by every other method. But hopefully more
on that later. For now, though, I need the DMA for testing
purposes.">(1)</a></span>? The specific actions I will do are the following:
</p>
<h5>Initialization</h5>
<ul>
<li>Zero out <code>g_src</code> and <code>g_dst</code>.</li>
<li>Initialize the arrays with some data, in this case from<br />
    <code>c_fooIn</code>.</li>
<li>Cache-Flush both arrays to ensure they&#8217;re uncached.</li>
</ul>
<h5>Testing</h5>
<ul>
<li>Modify an element in <code>g_src</code>, namely <code>g_src[0]</code>.</li>
<li>Modify an element in <code>g_dst</code>, namely <code>g_dst[3]</code>.</li>
<li>DMA-copy <code>g_src[0]</code> to <code>g_dst[0]</code>.</li>
</ul>
<p>
In other words, this:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">void</span> test_init()<br />
{<br />
&nbsp; &nbsp; <span class="co1">// Zero out everything</span><br />
&nbsp; &nbsp; <span class="kw3">memset</span>(g_src, <span class="nu0">0</span>, <span class="kw3">sizeof</span>(g_src));<br />
&nbsp; &nbsp; <span class="kw3">memset</span>(g_dst, <span class="nu0">0</span>, <span class="kw3">sizeof</span>(g_dst));</p>
<p>&nbsp; &nbsp; <span class="co1">// Fill 4 of each.</span><br />
&nbsp; &nbsp; <span class="kw1">for</span>(<span class="kw1">int</span> i=<span class="nu0">0</span>; i&lt;<span class="nu0">4</span>; i++)<br />
&nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; g_src[i]= c_fooIn[<span class="nu0">0</span>][i];<br />
&nbsp; &nbsp; &nbsp; &nbsp; g_dst[i]= c_fooIn[<span class="nu0">1</span>][i];<br />
&nbsp; &nbsp; }</p>
<p>&nbsp; &nbsp; <span class="co1">// Flush data to be sure.</span><br />
&nbsp; &nbsp; DC_FlushRange(g_src, <span class="kw3">sizeof</span>(g_src));<br />
&nbsp; &nbsp; DC_FlushRange(g_dst, <span class="kw3">sizeof</span>(g_dst));<br />
}</p>
<p><span class="kw1">void</span> test_dmaCopy()<br />
{<br />
&nbsp; &nbsp; test_init();</p>
<p>&nbsp; &nbsp; <span class="co1">// Change g_src[0] and g_dst[3]</span><br />
&nbsp; &nbsp; g_src[<span class="nu0">0</span>].id += <span class="nu0">0&#215;10</span>;<br />
&nbsp; &nbsp; g_src[<span class="nu0">0</span>].data= <span class="nu0">0&#215;5222</span>;</p>
<p>&nbsp; &nbsp; g_dst[<span class="nu0">3</span>].id += <span class="nu0">0&#215;10</span>;<br />
&nbsp; &nbsp; g_dst[<span class="nu0">3</span>].data= <span class="nu0">0xD333</span>;</p>
<p>&nbsp; &nbsp; <span class="co1">// DMA src[0] into dst[0];</span><br />
&nbsp; &nbsp; dmaCopy(&amp;g_src[<span class="nu0">0</span>], &amp;g_dst[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo));<br />
}</div>
</div>
<p>
Note that there is nothing spectacularly interesting going on here.<br />
There&#8217;s just your average struct definition, run of the mill array<br />
definitions, and boring old accesses without even any pointer magic<br />
that might hint at something tricky going on. Yes, alignment is forced,<br />
but that just makes the test more reliable. Also, the fact that I&#8217;m<br />
incrementing <code>Foo.id</code> rather than just reading from it is<br />
only because ARM9 cache is read-allocate, and I need to have these<br />
things end up in cache. The main point is that the actions in<br />
<code>test_dmaCopy()</code> are very ordinary.
</p>
<p><h3 id="sec-results">2
Results
</h3>
</p>
<p>
It should be obvious what the outcome of the test should be. However,<br />
when you run the test (on hardware! not an emulator), the result seems to<br />
be something different.
</p>
<div class="none">
<div class="none proglist" style=" "><span class="co1">// Just dmaCopy.</span></p>
<p>&nbsp; &nbsp; <span class="co1">// Result &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Expected:</span><br />
&nbsp; &nbsp; <span class="co1">// Source (hex)</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, <span class="nu0">10</span>, <span class="nu0">5222</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 10, 5222</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 01, <span class="nu0">5111</span>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 01, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 02, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 02, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 03, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 03, 5111</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="co1">// Destination (hex)</span><br />
&nbsp; &nbsp; DD, 00, D111&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 10, 5222 (bad!)</span><br />
&nbsp; &nbsp; DD, 01, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 01, D111</span><br />
&nbsp; &nbsp; DD, 02, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 02, D111</span><br />
&nbsp; &nbsp; DD, <span class="nu0">13</span>, D333 &nbsp; &nbsp; &nbsp; &nbsp;<span class="co1">// DD, 13, D333</span></div>
</div>
<p>
Notice that the changed values of <code>g_src[0]</code> never<br />
end up in <code>g_dst[0]</code>. Not only that, not even the<br />
<i>original</i> values of <code>g_src[0]</code> have been copied.<br />
It&#8217;s as if the transfer never happened at all.
</p>
<p>
The reason for this was covered in detail in the original article.<br />
Basically, it&#8217;s because cache is invisible to DMA. Once a part of<br />
memory is cached, the CPU only looks to the contents of the cache and<br />
not the actual addresses, meaning that DMA not only reads out-of-date<br />
(stale) source data, but also puts it where the CPU won&#8217;t look.<br />
Two actions allow you to remedy this. The first is the<br />
<dfn>cache flush</dfn>, which writes the cache-line back to RAM and<br />
then frees it. Then there&#8217;s <dfn>cache invalidate</dfn>, which<br />
just frees the cache-line without writing anything back. Note that in both cases, the cache is<br />
dissociated from memory.
</p>
<p>
With this information, it should be obvious what to do. When DMA-ing<br />
from RAM, you need to flush the cache before the transfer to update<br />
the source&#8217;s memory. When DMA-ing to RAM, you need to invalidate<br />
after the transfer because now it&#8217;s actually the cache&#8217;s data that&#8217;s<br />
stale.<br />
When you think about it a little this makes perfect sense, and it&#8217;s easy<br />
enough to implement:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">// New DMA-code:</span><br />
&nbsp; &nbsp; DC_FlushRange(&amp;g_src[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo));&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Flush source.</span><br />
&nbsp; &nbsp; dmaCopy(&amp;g_src[<span class="nu0">0</span>], &amp;g_dst[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo)); &nbsp; &nbsp; <span class="co1">// Transfer.</span><br />
&nbsp; &nbsp; DC_InvalidateRange(&amp;g_dst[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo)); &nbsp; &nbsp; <span class="co1">// Invalidate destination.</span></div>
</div>
<p>
Unfortunately, this doesn&#8217;t work right either. And if you think about<br />
it a lot instead of merely a little, you&#8217;ll see why. Maybe showing the<br />
results will make you see what I mean. The transfer seems to work now,<br />
but the earlier changes to <code>g_dst[3]</code> have been erased. How<br />
come?
</p>
<div class="none">
<div class="none proglist" style=" ">&nbsp; &nbsp; <span class="co1">// Result:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Expected:</span><br />
&nbsp; &nbsp; <span class="co1">// Source (hex)</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, <span class="nu0">10</span>, <span class="nu0">5222</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 10, 5222</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 01, <span class="nu0">5111</span>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 01, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 02, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 02, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 03, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 03, 5111</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="co1">// Destination (hex)</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, <span class="nu0">10</span>, D222&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 10, 5222</span><br />
&nbsp; &nbsp; DD, 01, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 01, D111</span><br />
&nbsp; &nbsp; DD, 02, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 02, D111</span><br />
&nbsp; &nbsp; DD, <span class="nu0">13</span>, D111 &nbsp; &nbsp; &nbsp; &nbsp;<span class="co1">// DD, 13, D333 (wut?)</span></div>
</div>
<p>
The problem is that a cache-invalidate invalidates entire<br />
<i>cache-lines</i>, not just the range you supply. If the start or<br />
end of the data you want to invalidate does not align to a cache-line,<br />
the adjacent data contained in that line is also thrown away. I hope<br />
you can see that this is bad.
</p>
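<p>
To get a feel for how far an invalidate can reach, here is a minimal sketch that rounds a range outward to whole cache-lines. It assumes the 32-byte line size of the ARM9; <code>LineSpan</code>, <code>cacheLineSpan</code> and <code>isLineAligned</code> are hypothetical helper names for illustration, not part of libnds.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 32-byte cache-lines, as on the NDS's ARM9.
const std::uintptr_t kCacheLine = 32;

// Hypothetical helper type: the address span that is really affected.
struct LineSpan
{
    std::uintptr_t begin;   // First affected address (inclusive).
    std::uintptr_t end;     // One past the last affected address.
};

// Round the range [addr, addr+size) outward to whole cache-lines.
inline LineSpan cacheLineSpan(std::uintptr_t addr, std::size_t size)
{
    assert(size > 0);
    LineSpan span;
    span.begin = addr & ~(kCacheLine - 1);
    span.end   = (addr + size + kCacheLine - 1) & ~(kCacheLine - 1);
    return span;
}

// True if invalidating [addr, addr+size) touches nothing outside the range.
inline bool isLineAligned(std::uintptr_t addr, std::size_t size)
{
    return addr % kCacheLine == 0 && size % kCacheLine == 0;
}
```

<p>
For a 4-byte <code>Foo</code> at the start of a 32-byte aligned array, the span covers the whole first line, which is eight elements: <code>g_dst[0]</code> through <code>g_dst[7]</code>.
</p>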
<p>
This is exactly what&#8217;s happening here. The ARM9&#8217;s cache-lines are 32<br />
bytes in size. Because of the alignment I gave the arrays, elements<br />
0 through 3 lie on the same cache-line. The changes to<br />
<code>g_dst[3]</code> occur in cache (but only because I read from it<br />
through <code>+=</code>). The invalidate of <code>g_dst[0]</code><br />
<i>also</i> invalidates <code>g_dst[3]</code>, which throws out the<br />
perfectly legit data and you&#8217;re left in a bummed state. And again,<br />
I&#8217;ve done nothing spectacularly interesting here; all I did was<br />
modify something and then invalidate data that just happened to be<br />
adjacent to it. But that was enough. Very, <i>very</i> bad.
</p>
<p>
Just to be sure, this is <i>not</i> due to a bad implementation of<br />
<code>DC_InvalidateRange()</code>. The function does exactly what it&#8217;s<br />
supposed to do. The problem is inherent in the hardware. If your<br />
data does not align correctly to cache-lines, an invalidate will apply<br />
to the adjacent data as well. If you do not want that to happen, do<br />
<i>not</i> invalidate.
</p>
<p><h3 id="sec-solution">3
Solutions
</h3>
</p>
<p>
So what to do? Well, there is one thing you can try, although I&#8217;m not sure how foolproof<br />
it is: instead of invalidating the destination afterwards, you<br />
can flush it before the transfer. This frees up the cache-lines<br />
without loss of data, and then it should be safe to DMA-copy to it.
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; DC_FlushRange(&amp;g_src[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo));&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Flush source.</span><br />
&nbsp; &nbsp; DC_FlushRange(&amp;g_dst[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo));&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Flush destination.</span><br />
&nbsp; &nbsp; dmaCopy(&amp;g_src[<span class="nu0">0</span>], &amp;g_dst[<span class="nu0">0</span>], <span class="kw3">sizeof</span>(Foo)); &nbsp; &nbsp; <span class="co1">// Transfer.</span></div>
</div>
<div class="none">
<div class="none proglist" style=" ">&nbsp; &nbsp; <span class="co1">// Result:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Expected:</span><br />
&nbsp; &nbsp; <span class="co1">// Source (hex)</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, <span class="nu0">10</span>, <span class="nu0">5222</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 10, 5222</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 01, <span class="nu0">5111</span>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 01, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 02, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 02, 5111</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, 03, <span class="nu0">5111</span> &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// 55, 03, 5111</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="co1">// Destination (hex)</span><br />
&nbsp; &nbsp; <span class="nu0">55</span>, <span class="nu0">10</span>, D222&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 55, 10, 5222</span><br />
&nbsp; &nbsp; DD, 01, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 01, D111</span><br />
&nbsp; &nbsp; DD, 02, D111 &nbsp; &nbsp; &nbsp;&nbsp; <span class="co1">// DD, 02, D111</span><br />
&nbsp; &nbsp; DD, <span class="nu0">13</span>, D333 &nbsp; &nbsp; &nbsp; &nbsp;<span class="co1">// DD, 13, D333</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="co1">// Yay \o/</span></div>
</div>
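<p>
The difference between the two orderings can be reproduced off-target with a toy model. The sketch below is not how the ARM9 cache is actually implemented: it assumes a single 32-byte line, write-back and read-allocate behaviour, and byte-sized elements. <code>ToyCache</code> and <code>runScenario</code> are made-up names for illustration.
</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model (an assumption-laden sketch, not real hardware): 32 bytes of
// RAM plus at most one write-back, read-allocate cache-line covering them.
struct ToyCache
{
    std::uint8_t ram[32];    // What DMA sees.
    std::uint8_t line[32];   // What the CPU sees once the line is cached.
    bool cached = false;
    bool dirty  = false;

    // CPU read: allocates the line on a miss (read-allocate).
    std::uint8_t read(int i)
    {
        assert(i >= 0 && i < 32);
        if(!cached) { std::memcpy(line, ram, sizeof(line)); cached = true; }
        return line[i];
    }

    // CPU write: a hit only changes the cache (write-back); a miss hits RAM.
    void write(int i, std::uint8_t value)
    {
        assert(i >= 0 && i < 32);
        if(cached) { line[i] = value; dirty = true; }
        else       { ram[i] = value; }
    }

    // Flush: write a dirty line back to RAM, then free the line.
    void flush()
    {
        if(cached && dirty) std::memcpy(ram, line, sizeof(ram));
        cached = dirty = false;
    }

    // Invalidate: free the line and discard whatever it held.
    void invalidate() { cached = dirty = false; }

    // DMA write: bypasses the cache and goes straight to RAM.
    void dmaWrite(int i, std::uint8_t value) { ram[i] = value; }
};

// The test from this post, shrunk to bytes: the CPU modifies byte 3 (think
// g_dst[3]), then DMA overwrites byte 0 (think g_dst[0]).
// Returns what the CPU reads back from byte 3 afterwards.
std::uint8_t runScenario(bool flushBefore)
{
    ToyCache c;
    std::memset(c.ram, 0xD1, sizeof(c.ram));

    // Read-modify-write, like `+=`: the line is now cached and dirty.
    const std::uint8_t old = c.read(3);
    c.write(3, static_cast<std::uint8_t>(old + 0x22));

    if(flushBefore)
    {
        c.flush();              // The modified byte 3 reaches RAM first.
        c.dmaWrite(0, 0x55);
    }
    else
    {
        c.dmaWrite(0, 0x55);
        c.invalidate();         // The modified byte 3 is thrown away.
    }
    return c.read(3);
}
```

<p>
In this model, <code>runScenario(false)</code> (DMA, then invalidate) reads back the stale <code>0xD1</code>, while <code>runScenario(true)</code> (flush, then DMA) keeps the modified <code>0xF3</code>.
</p>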
<p>
Alternatively, you can also disable the underlying reason behind the<br />
problem with invalidation: the write-buffer. The ARM9 cache allows<br />
two modes for writing: <dfn>write-through</dfn>, which also updates<br />
the memory related to the cache-line; and <dfn>write-back</dfn>, which<br />
doesn&#8217;t (the memory is only updated when the line is flushed or evicted). Write-back is faster, so that&#8217;s how libnds<br />
sets things up. I know that putting the cache in write-through mode<br />
fixes this problem, because in libnds 1.4.0 the write-buffer had been<br />
accidentally disabled and my test cases didn&#8217;t fail. This is probably<br />
not the route you want to take, though.
</p>
<p><h3 id="sec-conc">4
Conclusions
</h3>
</p>
<p>
So what have we learned?
</p>
<ul>
<li>
    Cache &#8211; DMA interactions suck and can cause really subtle bugs.<br />
	Ones that will only show up on hardware too.
  </li>
<li>
    Cache-flushes and invalidates cover the cache-lines of the requested<br />
	ranges, which can exceed the range you actually wanted.
  </li>
<li>
    To safely DMA from cachable memory, flush the source range first.
  </li>
<li>
    Contrary to what I wrote earlier, to DMA to cachable memory,<br />
	do <i>not</i> cache-invalidate &ndash; at least not when<br />
	the range is not properly aligned to cache-lines. Instead, flush<br />
	the destination range before the transfer (at which time<br />
	invalidation should be unnecessary). That said, invalidate should<br />
	still be safe if the write-buffer is disabled.
  </li>
</ul>
<p><a href="http://www.coranac.com/files/nds/invalidate.zip">Link to test code.</a></p>
<p><div>&nbsp;</div><br />
<!-- EOF --></p>
<hr /><div class="footnotes">
<h5>Notes:</h5>
<ol>
<li id="ft-nr1"> 
  No I&#8217;m not. For NDS WRAM-WRAM copies, DMA is actually<br />
slow as hell and outperformed by every other method. But hopefully more<br />
on that later. For now, though, I need the DMA for testing<br />
purposes.
</li>
</ol>
</div>
<hr />
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2010/03/dma-vs-arm9-round-2/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Some new notes on NDS code size</title>
		<link>http://www.coranac.com/2009/11/sizeof-new/</link>
		<comments>http://www.coranac.com/2009/11/sizeof-new/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 15:31:31 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[nds]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=133</guid>
		<description><![CDATA[When I discussed the memory footprints of several C/C++ elements, I apparently missed a very important item: operator new and related functions. I assumed new shouldn&#8217;t increase the binary that much, but boy was I wrong. The short story is that officially new should throw an exception when it can&#8217;t allocate new memory. Exceptions come [...]]]></description>
			<content:encoded><![CDATA[<p>
When I discussed the<br />
<a href="http://www.coranac.com/2009/02/some-interesting-numbers-on-nds-code-size/"><br />
memory footprints of several C/C++ elements</a>, I apparently missed a<br />
very important item: <code>operator new</code> and related functions. I<br />
assumed <code>new</code> shouldn&#8217;t increase the binary that much,<br />
but boy was I wrong.
</p>
<p>
The short story is that officially <code>new</code> should throw an<br />
exception when it can&#8217;t allocate new memory. Exceptions come with about<br />
60 kb worth of baggage. Yes, this is more or less the same stuff that<br />
goes into <code>vector</code> and <code>string</code>.
</p>
<p>
The long story, including a detailed look at a minimal binary,<br />
a binary that uses <code>new</code> and a solution to the exception overhead (in this particular case anyway) can be read below the fold.
</p>
<p><span id="more-133"></span></p>
<p><div>&nbsp;</div><ul>
  <li> <a href="#sec-base">1
Minimal project
</a> </li>
  <li> <a href="#sec-std-new">2
Standard C++ new/delete
</a> </li>
  <li> <a href="#sec-own-new">3
Custom new/delete
</a> </li>
  <li> <a href="#sec-conc">4
Other considerations and conclusions.
</a> </li>
</ul>
</p>
<p><h2 id="sec-base">1
Minimal project
</h2>
</p>
<p>
The following is essentially an empty project. It should represent<br />
the smallest binary you can get with the current DKA (r26) and<br />
libnds (1.3.7). This is the primary reference case.
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">#include &lt;nds.h&gt;</p>
<p><span class="kw1">int</span> main()<br />
{<br />
&nbsp; &nbsp; <span class="kw1">while</span>() ;<br />
}</div>
</div>
<p>
This actually already leads to a binary of 53.5 kb. To analyze what<br />
goes on in there, we can look at the mapfile. <i>Not</i> the mapfile<br />
generated by the linker, mind you, but the one produced by the <tt>arm-eabi-nm</tt> tool,<br />
whose output is considerably easier to read. To use this tool,<br />
add the <tt>arm-eabi-nm</tt> line to the <code>$(BUILD)</code> rule in the makefile,<br />
so that the rule looks like the one below. If you want to know what the flags mean,<br />
please <a href="http://sourceware.org/binutils/docs/binutils/nm.html">RTFM</a>.
</p>
<div class="make">
<div class="make proglist" style=" ">$(<span class="re2">BUILD</span>):<br />
&nbsp; &nbsp; @[ -d <span class="re0">$@</span> ] || mkdir -p <span class="re0">$@</span><br />
&nbsp; &nbsp; @make --no-print-directory -C $(<span class="re2">BUILD</span>) -f $(<span class="re2">CURDIR</span>)/Makefile<br />
&nbsp; &nbsp; arm-eabi-nm -Sn $(<span class="re2">OUTPUT</span>).elf &gt; $(<span class="re2">BUILD</span>)/$(<span class="re2">TARGET</span>).map</div>
</div>
<p>
And this is the resulting mapfile, in full.
</p>
<div class="none">
<div class="none proglist" style=" ">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w _Jv_RegisterClasses<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w __deregister_frame_info<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w __register_frame_info<br />
00080000 N _stack<br />
01000000 A __vectors_end<br />
01000000 A __vectors_start<br />
01000100 A __itcm_start<br />
01000100 000000c8 T irqTable<br />
010001c8 T IntrMain<br />
010001fc t findIRQ<br />
01000218 t no_handler<br />
01000228 t jump_intr<br />
0100023c t got_handler<br />
0100025c t IntrRet<br />
01000290 A __itcm_end<br />
02000000 T __text_start<br />
02000000 T _start<br />
02000194 t ILoop<br />
02000198 t checkARGV<br />
020001dc t .copyforward<br />
020001f0 t .copybackward<br />
02000200 t .copydone<br />
02000214 t ClearMem<br />
02000228 t ClrLoop<br />
02000238 t CopyMemCheck<br />
0200023c t CopyMem<br />
0200024c t CIDLoop<br />
02000300 T _init<br />
02000310 t __do_global_dtors_aux<br />
0200033c t frame_dummy<br />
0200037c 00000004 T main<br />
02000380 000000ec T initSystem<br />
0200046c 00000012 T ledBlink<br />
02000480 0000002c T powerOff<br />
020004ac 00000030 T powerOn<br />
020004dc 00000018 T systemSleep<br />
020004f4 00000010 T powerValueHandler<br />
02000504 00000044 T systemMsgHandler<br />
02000548 00000164 t fifoInternalSend<br />
020006ac 00000038 T fifoSendAddress<br />
020006e4 00000048 T fifoSendValue32<br />
0200072c 00000070 T fifoGetAddress<br />
0200079c 00000074 T fifoSetAddressHandler<br />
02000810 00000070 T fifoGetValue32<br />
02000880 00000074 T fifoSetValue32Handler<br />
020008f4 00000024 T fifoCheckAddress<br />
02000918 00000024 T fifoCheckDatamsg<br />
0200093c 00000024 T fifoCheckValue32<br />
02000960 00000094 t fifoInternalSendInterrupt<br />
020009f4 00000010 t __timeoutvbl<br />
02000a04 000001b8 T fifoInit<br />
02000bbc 00000100 T fifoGetDatamsg<br />
02000cbc 0000040c t fifoInternalRecvInterrupt<br />
020010c8 000000a8 T fifoSetDatamsgHandler<br />
02001170 00000070 T fifoSendDatamsg<br />
020011e0 00000002 T irqDummy<br />
020011e4 0000006c T irqSet<br />
02001250 0000004c T irqInit<br />
0200129c 00000030 T irqInitHandler<br />
020012cc 00000060 T irqEnable<br />
0200132c 00000060 T irqDisable<br />
0200138c 0000002c T irqClear<br />
020013c0 T swiSoftReset<br />
020013c4 T swiDelay<br />
020013c8 T swiIntrWait<br />
020013cc T swiWaitForVBlank<br />
020013d0 T swiSleep<br />
020013d4 T swiChangeSoundBias<br />
020013d8 T swiDivide<br />
020013dc T swiRemainder<br />
020013e2 T swiDivMod<br />
020013ee T swiCopy<br />
020013f2 T swiFastCopy<br />
020013f6 T swiSqrt<br />
020013fa T swiCRC16<br />
020013fe T swiIsDebugger<br />
02001402 T swiUnpackBits<br />
02001406 T swiDecompressLZSSWram<br />
0200140a T swiDecompressLZSSVram<br />
0200140e T swiDecompressHuffman<br />
02001412 T swiDecompressRLEWram<br />
02001416 T swiDecompressRLEVram<br />
0200141a T swiWaitForIRQ<br />
0200141e T swiDecodeDelta8<br />
02001422 T swiDecodeDelta16<br />
02001426 T swiSetHaltCR<br />
02001430 00000030 T __libc_fini_array<br />
02001460 00000050 T __libc_init_array<br />
020014b4 00000080 T memcpy<br />
02001534 00000006 T _times_r<br />
0200153c 0000002c T _gettimeofday_r<br />
02001568 00000014 T _times<br />
0200157c 00000052 T build_argv<br />
020015d0 0000000c T __errno<br />
020015dc T _fini<br />
020015e8 A __text_end<br />
020015e8 00000004 R _global_impure_ptr<br />
020015f0 A __exidx_end<br />
020015f0 A __exidx_start<br />
020015f0 t __frame_dummy_init_array_entry<br />
020015f0 A __init_array_start<br />
020015f0 A __preinit_array_end<br />
020015f0 A __preinit_array_start<br />
020015f4 t __do_global_dtors_aux_fini_array_entry<br />
020015f4 A __fini_array_start<br />
020015f4 A __init_array_end<br />
020015f8 r __EH_FRAME_BEGIN__<br />
020015f8 r __FRAME_END__<br />
020015f8 A __fini_array_end<br />
020015fc d __JCR_END__<br />
020015fc d __JCR_LIST__<br />
02001600 A __data_start<br />
02001600 D __dso_handle<br />
02001600 A __ewram_start<br />
02001604 00000004 D fifo_freewords<br />
02001608 00000004 D fifo_send_queue<br />
0200160c 00000004 D fifo_buffer_free<br />
02001610 00000004 D fifo_receive_queue<br />
02001618 00000004 D _impure_ptr<br />
02001620 00000428 d impure_data<br />
02001a48 A __bss_start<br />
02001a48 A __bss_start__<br />
02001a48 A __bss_vma<br />
02001a48 A __data_end<br />
02001a48 A __dtcm_lma<br />
02001a48 A __itcm_lma<br />
02001a48 b completed.2775<br />
02001a4c b object.2787<br />
02001a64 00000004 b __timeout<br />
02001a68 00000004 B processing<br />
02001a6c 00000004 B fake_heap_end<br />
02001a70 00000004 B fake_heap_start<br />
02001a74 00000004 B theTime<br />
02001a78 00000040 B fifo_datamsg_data<br />
02001ab8 00000800 B fifo_buffer<br />
02001bd8 A __vectors_lma<br />
020022b8 00000040 B fifo_value32_func<br />
020022f8 00000040 B fifo_address_func<br />
02002338 00000040 B fifo_value32_data<br />
02002378 00000040 B fifo_value32_queue<br />
020023b8 00000040 B fifo_data_queue<br />
020023f8 00000040 B fifo_address_data<br />
02002438 00000040 B fifo_datamsg_func<br />
02002478 00000040 B fifo_address_queue<br />
020024b8 00000004 B punixTime<br />
020024bc A __bss_end<br />
020024bc A __bss_end__<br />
020024bc A __end__<br />
020024bc A _end<br />
023ff000 A __eheap_end<br />
023ff000 A __ewram_end<br />
027fff70 a _libnds_argv<br />
0b000000 A __dtcm_end<br />
0b000000 A __dtcm_start<br />
0b000000 A __sbss_end<br />
0b000000 A __sbss_start<br />
0b000000 A __sbss_start__<br />
0b003d00 A __sp_usr<br />
0b003e00 A __sp_irq<br />
0b003f00 A __sp_svc<br />
0b003ff8 A __irq_flags<br />
0b003ffc A __irq_vector<br />
0b004000 A __dtcm_top</div>
</div>
<p>
Now, I expect you can&#8217;t really tell much from this, so here&#8217;s a summary.
</p>
<div class="none">
<div class="none proglist" style=" ">[map]<br />
begin &nbsp; &nbsp; &nbsp; end &nbsp; &nbsp; &nbsp; &nbsp; size&nbsp; &nbsp; &nbsp; Description<br />
02000000 &#8211; 0200033c &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; : crt0.S (roughly)<br />
0200037c &#8211; 02000380 &nbsp; &nbsp; 0004 &nbsp; &nbsp;: main.c<br />
02000380 &#8211; 02000548 &nbsp; &nbsp; 01C8&nbsp; &nbsp; : libnds system init/handlers<br />
02000548 &#8211; 020011e0 &nbsp; &nbsp; 0C98&nbsp; &nbsp; : libnds fifo routines<br />
020011e0 &#8211; 020013c0 &nbsp; &nbsp; 01E0&nbsp; &nbsp; : libnds interrupt.c<br />
020013c0 &#8211; 02001430 &nbsp; &nbsp; 0070&nbsp; &nbsp; : libnds bios.s<br />
02001430 &#8211; 020015e8 &nbsp; &nbsp; 01B8&nbsp; &nbsp; : libc misc<br />
020015e8 A __text_end<br />
020015e8 &#8211; 02001600 &nbsp; &nbsp; 0018&nbsp; &nbsp; : C/C++ ctor/dtor overhead, etc?<br />
02001600 &#8211; 02001618 &nbsp; &nbsp; 0018&nbsp; &nbsp; : libnds fifo data<br />
02001618 &#8211; 02001a48 &nbsp; &nbsp; 0430&nbsp; &nbsp; : impure ?!?<br />
02001a48 &#8211; 02001a78 &nbsp; &nbsp; 0030&nbsp; &nbsp; : misc bookkeeping</p>
<p>02001a78 &#8211; 020024b8 &nbsp; &nbsp; 0A40&nbsp; &nbsp; : libnds fifo data + pointers<br />
020024bc A _end</p>
<p>000024bc &#8211; 0000D630 0000B174&nbsp; &nbsp; : ???<br />
[/map]</div>
</div>
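<p>
Totals like these can also be pulled out of the nm listing mechanically, as a rough cross-check. The sketch below adds up symbol sizes per nm type letter. The function name <code>sumSizesByType</code> is made up here, and it only counts symbols for which nm actually printed a size, so it will come out lower than the address ranges in the summary.
</p>

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Sum symbol sizes per nm type letter from `nm -S` style text.
// Expected line layout: "<address> <size> <type> <name>", with address and
// size in hex. Lines without a size field (e.g. "02000000 T _start") have
// fewer than four fields and are skipped.
std::map<char, std::uint32_t> sumSizesByType(const std::string &nmText)
{
    std::map<char, std::uint32_t> totals;
    std::istringstream lines(nmText);
    std::string line;

    while(std::getline(lines, line))
    {
        std::istringstream fields(line);
        std::string addr, size, type, name;

        if(!(fields >> addr >> size >> type >> name))
            continue;           // No size listed for this symbol.
        if(type.size() != 1)
            continue;           // Not a layout this sketch understands.

        const char letter = type[0];
        totals[letter] += static_cast<std::uint32_t>(
            std::stoul(size, nullptr, 16));
    }
    return totals;
}
```

<p>
Feeding it the contents of the generated <tt>.map</tt> file gives one total per letter (<code>T</code> for code, <code>D</code> for initialized data, <code>B</code> for bss, and so on).
</p>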
<p>
The <code>0100:xxxx</code> and <code>0B00:xxxx</code> ranges belong to<br />
ITCM and DTCM, so those are irrelevant when looking at main RAM size.<br />
The libc, impure and misc bookkeeping sections are stuff related to the<br />
C library and C overhead, accounting for about 1.5 kb. The boot code,<br />
<tt>crt0.S</tt>, also covers close to 1.0 kb. As expected, the code for<br />
<code>main.c</code> &ndash;the actual project&ndash; is more or less<br />
nothing.
</p>
<p>
The rest, about 7 kb, is libnds. Now, you may say that this is quite a bit<br />
of overhead, but it really isn&#8217;t. Pretty much all of it relates to<br />
interrupts and the fifo system, which takes care of ARM7-ARM9<br />
communication. You <i>need</i> to have these parts. Okay, you could try<br />
to roll your own to shrink this down to the bare essentials, but in all<br />
likelihood that&#8217;s more trouble than it&#8217;s worth.
</p>
<p><div>&nbsp;</div></p>
<p>
The observant among you should have noticed something: we&#8217;re only at 9.5 kb,<br />
but the file size is 53.5 kb. So what the hell happened to the other 44 kb?<br />
Well, I don&#8217;t know, to be honest. It doesn&#8217;t appear in MWRAM to be sure.<br />
It&#8217;s probably the stuff <tt>ndstool</tt> adds. My guess is that that&#8217;s<br />
where the ARM7 binary goes, along with the icon, titles and possibly<br />
DLDI interfaces, but I really can&#8217;t say right now.
</p>
<p><h2 id="sec-std-new">2
Standard C++ new/delete
</h2>
</p>
<p>
And now, let&#8217;s look at what happens when you invoke <code>new</code>.
</p>
<div class="none">
<div class="none proglist" style=" ">void test_std_new()<br />
{<br />
&nbsp; &nbsp; u8 *ptr= new u8[<span class="nu0">8</span>];<br />
&nbsp; &nbsp; delete[] ptr;<br />
}<br />
<br />
int main()<br />
{<br />
&nbsp; &nbsp; while(<span class="nu0">1</span>) ;<br />
}</div>
</div>
<p>
Just this small thing increases the file size to 117 kb! And remember,<br />
that&#8217;s not merely a doubling of the size, as 44 kb of the binary is not<br />
put in memory. The memory load has gone from about 10 kb to over 70 kb.<br />
What causes this increase? Well, let&#8217;s see:
</p>
<div class="none">
<div class="none proglist" style=" ">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w _Jv_RegisterClasses<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w __deregister_frame_info<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w __gnu_Unwind_Find_exidx<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;w __register_frame_info<br />
00080000 N _stack<br />
01000000 A __vectors_end<br />
01000000 A __vectors_start<br />
01000100 A __itcm_start<br />
01000100 000000c8 T irqTable<br />
010001c8 T IntrMain<br />
010001fc t findIRQ<br />
01000218 t no_handler<br />
01000228 t jump_intr<br />
0100023c t got_handler<br />
0100025c t IntrRet<br />
01000290 A __itcm_end<br />
02000000 T __text_start<br />
02000000 T _start<br />
02000194 t ILoop<br />
02000198 t checkARGV<br />
020001dc t .copyforward<br />
020001f0 t .copybackward<br />
02000200 t .copydone<br />
02000214 t ClearMem<br />
02000228 t ClrLoop<br />
02000238 t CopyMemCheck<br />
0200023c t CopyMem<br />
0200024c t CIDLoop<br />
02000300 T _init<br />
02000310 t __do_global_dtors_aux<br />
0200033c t frame_dummy<br />
0200037c 00000004 T main<br />
02000380 00000012 T _Z12test_std_newv<br />
02000394 000000ec T initSystem<br />
02000480 00000012 T ledBlink<br />
02000494 0000002c T powerOff<br />
020004c0 00000030 T powerOn<br />
020004f0 00000018 T systemSleep<br />
02000508 00000010 T powerValueHandler<br />
02000518 00000044 T systemMsgHandler<br />
0200055c 00000164 t fifoInternalSend<br />
020006c0 00000038 T fifoSendAddress<br />
020006f8 00000048 T fifoSendValue32<br />
02000740 00000070 T fifoGetAddress<br />
020007b0 00000074 T fifoSetAddressHandler<br />
02000824 00000070 T fifoGetValue32<br />
02000894 00000074 T fifoSetValue32Handler<br />
02000908 00000024 T fifoCheckAddress<br />
0200092c 00000024 T fifoCheckDatamsg<br />
02000950 00000024 T fifoCheckValue32<br />
02000974 00000094 t fifoInternalSendInterrupt<br />
02000a08 00000010 t __timeoutvbl<br />
02000a18 000001b8 T fifoInit<br />
02000bd0 00000100 T fifoGetDatamsg<br />
02000cd0 0000040c t fifoInternalRecvInterrupt<br />
020010dc 000000a8 T fifoSetDatamsgHandler<br />
02001184 00000070 T fifoSendDatamsg<br />
020011f4 00000002 T irqDummy<br />
020011f8 0000006c T irqSet<br />
02001264 0000004c T irqInit<br />
020012b0 00000030 T irqInitHandler<br />
020012e0 00000060 T irqEnable<br />
02001340 00000060 T irqDisable<br />
020013a0 0000002c T irqClear<br />
020013d0 T swiSoftReset<br />
020013d4 T swiDelay<br />
020013d8 T swiIntrWait<br />
020013dc T swiWaitForVBlank<br />
020013e0 T swiSleep<br />
020013e4 T swiChangeSoundBias<br />
020013e8 T swiDivide<br />
020013ec T swiRemainder<br />
020013f2 T swiDivMod<br />
020013fe T swiCopy<br />
02001402 T swiFastCopy<br />
02001406 T swiSqrt<br />
0200140a T swiCRC16<br />
0200140e T swiIsDebugger<br />
02001412 T swiUnpackBits<br />
02001416 T swiDecompressLZSSWram<br />
0200141a T swiDecompressLZSSVram<br />
0200141e T swiDecompressHuffman<br />
02001422 T swiDecompressRLEWram<br />
02001426 T swiDecompressRLEVram<br />
0200142a T swiWaitForIRQ<br />
0200142e T swiDecodeDelta8<br />
02001432 T swiDecodeDelta16<br />
02001436 T swiSetHaltCR<br />
02001440 00000054 t d_make_comp<br />
02001494 0000003a t d_make_name<br />
020014d0 00000058 t d_number<br />
02001528 0000004c t d_call_offset<br />
02001574 00000096 t d_cv_qualifiers<br />
0200160c 00000060 t d_template_param<br />
0200166c 00000160 t d_substitution<br />
020017cc 00000050 t d_append_char<br />
0200181c 00000084 t d_find_pack<br />
020018a0 00000090 t d_source_name<br />
02001930 00000240 t d_expression<br />
02001b70 0000056c t d_type<br />
020020dc 0000009a t d_bare_function_type<br />
02002178 000000ec t d_operator_name<br />
02002264 00000136 t d_unqualified_name<br />
0200239c 000000ca t d_expr_primary<br />
02002468 000000aa t d_template_args<br />
02002514 0000022c t d_name<br />
02002740 0000039c t d_encoding<br />
02002adc 00000060 t d_exprlist<br />
02002b3c 0000008a t d_growable_string_callback_adapter<br />
02002bc8 00000098 t d_append_buffer<br />
02002c60 000000a0 t d_append_string<br />
02002d00 000001f8 t d_print_array_type<br />
02002ef8 00000108 t d_print_mod_list<br />
02003000 00000234 t d_print_function_type<br />
02003234 00000ba0 t d_print_comp<br />
02003dd4 000001c0 t d_demangle_callback<br />
02003f94 0000002e T __gcclibcxx_demangle_callback<br />
02003fc4 000000c0 T __cxa_demangle<br />
02004084 000000c8 t d_print_mod<br />
0200414c 00000104 t d_print_cast<br />
02004250 0000009c t d_print_expr_op<br />
020042ec 000000a8 t d_print_subexpr<br />
02004398 T __cxa_end_cleanup<br />
020043a4 T __aeabi_uidiv<br />
020043a4 0000007a T __udivsi3<br />
02004420 0000000e T __aeabi_uidivmod<br />
02004430 00000002 T __aeabi_idiv0<br />
02004430 00000002 T __aeabi_ldiv0<br />
02004430 00000002 T __div0<br />
02004434 00000010 t _Unwind_decode_target2<br />
02004444 0000002a T _Unwind_VRS_Get<br />
02004470 0000001a t _Unwind_GetGR<br />
0200448c 0000002a T _Unwind_VRS_Set<br />
020044b8 0000001c t _Unwind_SetGR<br />
020044d4 00000020 t selfrel_offset31<br />
020044f4 00000074 t search_EIT_table<br />
02004568 00000004 T _Unwind_GetCFA<br />
0200456c 00000002 T _Unwind_Complete<br />
02004570 00000016 T _Unwind_DeleteException<br />
02004588 000002bc t __gnu_unwind_pr_common<br />
02004844 0000000e W __aeabi_unwind_cpp_pr2<br />
02004854 0000000e W __aeabi_unwind_cpp_pr1<br />
02004864 0000000e T __aeabi_unwind_cpp_pr0<br />
02004874 000000d0 t get_eit_entry<br />
02004944 0000005a t restore_non_core_regs<br />
020049a0 00000080 T __gnu_Unwind_Backtrace<br />
02004a20 000000e4 t unwind_phase2_forced<br />
02004b04 00000018 T __gnu_Unwind_ForcedUnwind<br />
02004b1c 00000034 t unwind_phase2<br />
02004b50 00000060 T __gnu_Unwind_RaiseException<br />
02004bb0 0000001e T __gnu_Unwind_Resume_or_Rethrow<br />
02004bd0 00000040 T __gnu_Unwind_Resume<br />
02004c10 00000268 T _Unwind_VRS_Pop<br />
02004e80 0000001c T __restore_core_regs<br />
02004e80 0000001c T restore_core_regs<br />
02004e9c T __gnu_Unwind_Restore_VFP<br />
02004ea4 T __gnu_Unwind_Save_VFP<br />
02004eac T __gnu_Unwind_Restore_VFP_D<br />
02004eb4 T __gnu_Unwind_Save_VFP_D<br />
02004ebc T __gnu_Unwind_Restore_VFP_D_16_to_31<br />
02004ec4 T __gnu_Unwind_Save_VFP_D_16_to_31<br />
02004ecc T __gnu_Unwind_Restore_WMMXD<br />
02004f10 T __gnu_Unwind_Save_WMMXD<br />
02004f54 T __gnu_Unwind_Restore_WMMXC<br />
02004f68 T __gnu_Unwind_Save_WMMXC<br />
02004f7c 0000002a T _Unwind_RaiseException<br />
02004f7c 0000002a T ___Unwind_RaiseException<br />
02004fa8 0000002a T _Unwind_Resume<br />
02004fa8 0000002a T ___Unwind_Resume<br />
02004fd4 0000002a T _Unwind_Resume_or_Rethrow<br />
02004fd4 0000002a T ___Unwind_Resume_or_Rethrow<br />
02005000 0000002a T _Unwind_ForcedUnwind<br />
02005000 0000002a T ___Unwind_ForcedUnwind<br />
0200502c 0000002a T _Unwind_Backtrace<br />
0200502c 0000002a T ___Unwind_Backtrace<br />
02005058 00000036 t next_unwind_byte<br />
02005090 00000006 T _Unwind_GetTextRelBase<br />
02005098 00000006 T _Unwind_GetDataRelBase<br />
020050a0 0000001a t _Unwind_GetGR<br />
020050bc 0000000e t unwind_UCB_from_context<br />
020050cc 00000018 T _Unwind_GetLanguageSpecificData<br />
020050e4 0000000e T _Unwind_GetRegionStart<br />
020050f4 000002e8 T __gnu_unwind_execute<br />
020053dc 0000002a T __gnu_unwind_frame<br />
02005408 0000000e T abort<br />
02005418 0000002c T fputc<br />
02005444 00000026 T _fputc_r<br />
0200546c 0000005c T _fputs_r<br />
020054c8 0000001c T fputs<br />
020054e4 00000324 T __sfvwrite_r<br />
0200580c 0000007c T _fwrite_r<br />
02005888 00000028 T fwrite<br />
020058b0 00000030 T __libc_fini_array<br />
020058e0 00000050 T __libc_init_array<br />
02005934 00000018 T free<br />
0200594c 00000018 T malloc<br />
02005964 00000504 T _malloc_r<br />
02005e68 00000080 T memchr<br />
02005ee8 00000058 T memcmp<br />
02005f40 00000080 T memcpy<br />
02005fc0 000000a0 T memmove<br />
02006060 00000094 T memset<br />
020060f4 00000002 T __malloc_lock<br />
020060f8 00000002 T __malloc_unlock<br />
020060fc 00000064 T putc<br />
02006160 0000005e T _putc_r<br />
020061c0 0000001c T realloc<br />
020061dc 00000360 T _realloc_r<br />
0200653c 0000005c T _raise_r<br />
02006598 00000018 T raise<br />
020065b0 00000036 T _init_signal_r<br />
020065e8 00000014 T _init_signal<br />
020065fc 00000056 T __sigtramp_r<br />
02006654 00000018 T __sigtramp<br />
0200666c 00000040 T _signal_r<br />
020066ac 0000001c T signal<br />
020066cc 00000044 T sprintf<br />
02006710 00000040 T _sprintf_r<br />
02006750 0000005c T strcmp<br />
020067ac 0000004c T strcpy<br />
020067f8 0000006c T strlen<br />
02006864 000000ac T strncmp<br />
02006910 00000134 t __sprint_r<br />
02006a44 000015d6 T _svfprintf_r<br />
02008020 00000020 T write<br />
02008040 000000c4 T __swbuf_r<br />
02008104 0000001c T __swbuf<br />
02008120 00000042 T _wcrtomb_r<br />
02008164 00000020 T wcrtomb<br />
02008184 000000da T _wcsrtombs_r<br />
02008260 00000028 T wcsrtombs<br />
02008288 000002c8 T _wctomb_r<br />
02008550 000000d0 T __swsetup_r<br />
02008620 00000154 t quorem<br />
02008774 00000e9c T _dtoa_r<br />
02009610 00000114 T _fflush_r<br />
02009724 00000030 T fflush<br />
02009758 00000002 T __sfp_lock_acquire<br />
0200975c 00000002 T __sfp_lock_release<br />
02009760 00000002 T __sinit_lock_acquire<br />
02009764 00000002 T __sinit_lock_release<br />
02009768 00000004 t __fp_lock<br />
0200976c 00000004 t __fp_unlock<br />
02009770 0000001c T __fp_unlock_all<br />
0200978c 0000001c T __fp_lock_all<br />
020097a8 00000014 T _cleanup_r<br />
020097bc 00000014 T _cleanup<br />
020097d0 0000004c t std<br />
0200981c 0000005c T __sinit<br />
02009878 00000030 T __sfmoreglue<br />
020098a8 00000090 T __sfp<br />
02009938 000000a4 T _malloc_trim_r<br />
020099dc 000001ac T _free_r<br />
02009b88 00000064 T _fwalk_reent<br />
02009bec 0000005c T _fwalk<br />
02009c4c 0000000c T __locale_charset<br />
02009c58 00000008 T _localeconv_r<br />
02009c60 00000008 T localeconv<br />
02009c68 00000254 T _setlocale_r<br />
02009ebc 0000001c T setlocale<br />
02009ed8 000000e8 T __smakebuf_r<br />
02009fc0 0000065e T _mbtowc_r<br />
0200a620 00000016 T _Bfree<br />
0200a638 00000054 T __hi0bits<br />
0200a68c 00000068 T __lo0bits<br />
0200a6f4 00000042 T __mcmp<br />
0200a738 00000050 T __ulp<br />
0200a788 0000009c T __b2d<br />
0200a824 00000064 T __ratio<br />
0200a888 00000044 T _mprec_log10<br />
0200a8cc 00000048 T __copybits<br />
0200a914 00000054 T __any_on<br />
0200a968 00000052 T _Balloc<br />
0200a9bc 000000e4 T __d2b<br />
0200aaa0 00000120 T __mdiff<br />
0200abc0 000000c4 T __lshift<br />
0200ac84 00000164 T __multiply<br />
0200ade8 00000016 T __i2b<br />
0200ae00 000000a4 T __multadd<br />
0200aea4 000000b8 T __pow5mult<br />
0200af5c 0000009c T __s2b<br />
0200aff8 00000024 T __isinfd<br />
0200b01c 00000020 T __isnand<br />
0200b03c 00000010 T __sclose<br />
0200b04c 00000030 T __sseek<br />
0200b07c 0000003c T __swrite<br />
0200b0b8 0000002c T __sread<br />
0200b0e4 0000005c T _calloc_r<br />
0200b140 000000a2 T _fclose_r<br />
0200b1e4 00000018 T fclose<br />
0200b200 0000004c T _close_r<br />
0200b250 00000054 T _fstat_r<br />
0200b2a8 0000000a T _getpid_r<br />
0200b2b4 00000004 T _isatty_r<br />
0200b2b8 0000000a T _kill_r<br />
0200b2c4 0000004c T _lseek_r<br />
0200b314 0000004c T _read_r<br />
0200b364 00000054 T _sbrk_r<br />
0200b3b8 00000006 T _times_r<br />
0200b3c0 0000002c T _gettimeofday_r<br />
0200b3ec 00000014 T _times<br />
0200b400 0000004c T _write_r<br />
0200b450 00000014 T _exit<br />
0200b468 00000052 T build_argv<br />
0200b4bc 00000020 T __get_handle<br />
0200b4dc 0000003c T __alloc_handle<br />
0200b518 0000002c T __release_handle<br />
0200b544 00000014 T setDefaultDevice<br />
0200b558 0000007c T AddDevice<br />
0200b5d4 00000068 T FindDevice<br />
0200b63c 00000020 T GetDeviceOpTab<br />
0200b65c 00000024 T RemoveDevice<br />
0200b680 T __aeabi_idiv<br />
0200b680 00000094 T __divsi3<br />
0200b714 0000000e T __aeabi_idivmod<br />
0200b724 T __aeabi_drsub<br />
0200b72c 00000314 T __aeabi_dsub<br />
0200b72c 00000314 T __subdf3<br />
0200b730 00000310 T __adddf3<br />
0200b730 00000310 T __aeabi_dadd<br />
0200ba40 00000024 T __aeabi_ui2d<br />
0200ba40 00000024 T __floatunsidf<br />
0200ba64 00000028 T __aeabi_i2d<br />
0200ba64 00000028 T __floatsidf<br />
0200ba8c 00000040 T __aeabi_f2d<br />
0200ba8c 00000040 T __extendsfdf2<br />
0200bacc 00000074 T __aeabi_ul2d<br />
0200bacc 00000074 T __floatundidf<br />
0200bae0 00000060 T __aeabi_l2d<br />
0200bae0 00000060 T __floatdidf<br />
0200bb40 00000290 T __aeabi_dmul<br />
0200bb40 00000290 T __muldf3<br />
0200bdd0 0000020c T __aeabi_ddiv<br />
0200bdd0 0000020c T __divdf3<br />
0200bfdc 00000094 T __gedf2<br />
0200bfdc 00000094 T __gtdf2<br />
0200bfe4 0000008c T __ledf2<br />
0200bfe4 0000008c T __ltdf2<br />
0200bfec 00000084 T __cmpdf2<br />
0200bfec 00000084 T __eqdf2<br />
0200bfec 00000084 T __nedf2<br />
0200c070 00000034 T __aeabi_cdrcmple<br />
0200c08c 00000018 T __aeabi_cdcmpeq<br />
0200c08c 00000018 T __aeabi_cdcmple<br />
0200c0a4 00000018 T __aeabi_dcmpeq<br />
0200c0bc 00000018 T __aeabi_dcmplt<br />
0200c0d4 00000018 T __aeabi_dcmple<br />
0200c0ec 00000018 T __aeabi_dcmpge<br />
0200c104 00000018 T __aeabi_dcmpgt<br />
0200c11c 0000005c T __aeabi_d2iz<br />
0200c11c 0000005c T __fixdfsi<br />
0200c178 0000000c T __errno<br />
0200c184 0000000c T _ZdaPv<br />
0200c190 0000004c t _ZL21base_of_encoded_valuehP15_Unwind_Context<br />
0200c1dc 0000016c t _ZL17parse_lsda_headerP15_Unwind_ContextPKhP16lsda_header_info<br />
0200c348 0000073a T __gxx_personality_v0<br />
0200ca84 00000010 T _ZSt13set_terminatePFvvE<br />
0200ca94 00000010 T _ZSt14set_unexpectedPFvvE<br />
0200caa4 00000020 T _ZN10__cxxabiv111__terminateEPFvvE<br />
0200cac4 00000010 T _ZSt9terminatev<br />
0200cad4 0000000c T _ZN10__cxxabiv112__unexpectedEPFvvE<br />
0200cae0 00000010 T _ZSt10unexpectedv<br />
0200caf0 00000018 T _Znaj<br />
0200cb08 0000010e T _ZN9__gnu_cxx27__verbose_terminate_handlerEv<br />
0200cc18 00000010 T _ZdlPv<br />
0200cc28 000000f8 T __cxa_type_match<br />
0200cd20 00000062 T __cxa_begin_cleanup<br />
0200cd84 0000006a T __gnu_end_cleanup<br />
0200cdf0 00000020 T __cxa_bad_typeid<br />
0200ce10 00000020 T __cxa_bad_cast<br />
0200ce30 00000048 T __cxa_call_terminate<br />
0200ce78 00000122 T __cxa_call_unexpected<br />
0200cf9c 00000004 T __cxa_get_exception_ptr<br />
0200cfa0 00000012 T _ZSt18uncaught_exceptionv<br />
0200cfb4 00000086 T __cxa_end_catch<br />
0200d03c 00000086 T __cxa_begin_catch<br />
0200d0c4 0000000c T _ZNSt9exceptionD2Ev<br />
0200d0d0 0000000c T _ZNSt9exceptionD1Ev<br />
0200d0dc 0000000c T _ZNSt13bad_exceptionD2Ev<br />
0200d0e8 0000000c T _ZNSt13bad_exceptionD1Ev<br />
0200d0f4 0000000c T _ZN10__cxxabiv115__forced_unwindD2Ev<br />
0200d100 0000000c T _ZN10__cxxabiv115__forced_unwindD1Ev<br />
0200d10c 0000000c T _ZN10__cxxabiv119__foreign_exceptionD2Ev<br />
0200d118 0000000c T _ZN10__cxxabiv119__foreign_exceptionD1Ev<br />
0200d124 00000008 T _ZNKSt9exception4whatEv<br />
0200d12c 00000008 T _ZNKSt13bad_exception4whatEv<br />
0200d134 0000001c T _ZN10__cxxabiv119__foreign_exceptionD0Ev<br />
0200d150 0000001c T _ZN10__cxxabiv115__forced_unwindD0Ev<br />
0200d16c 0000001c T _ZNSt9exceptionD0Ev<br />
0200d188 0000001c T _ZNSt13bad_exceptionD0Ev<br />
0200d1a4 00000008 T __cxa_get_globals_fast<br />
0200d1ac 00000008 T __cxa_get_globals<br />
0200d1b4 00000068 T __cxa_rethrow<br />
0200d21c 0000005c T __cxa_throw<br />
0200d278 00000034 t _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP21_Unwind_Control_Block<br />
0200d2ac 00000026 T __cxa_current_exception_type<br />
0200d2d4 0000001c T _ZN10__cxxabiv123__fundamental_type_infoD1Ev<br />
0200d2f0 0000001c T _ZN10__cxxabiv123__fundamental_type_infoD2Ev<br />
0200d30c 00000020 T _ZN10__cxxabiv123__fundamental_type_infoD0Ev<br />
0200d32c 00000010 T _ZSt15set_new_handlerPFvvE<br />
0200d33c 00000008 T _ZNKSt9bad_alloc4whatEv<br />
0200d344 0000001c T _ZNSt9bad_allocD1Ev<br />
0200d360 0000001c T _ZNSt9bad_allocD2Ev<br />
0200d37c 00000020 T _ZNSt9bad_allocD0Ev<br />
0200d39c 0000006a T _Znwj<br />
0200d408 00000004 T _ZNK10__cxxabiv119__pointer_type_info14__is_pointer_pEv<br />
0200d40c 0000004c T _ZNK10__cxxabiv119__pointer_type_info15__pointer_catchEPKNS_17__pbase_type_infoEPPvj<br />
0200d458 0000001c T _ZN10__cxxabiv119__pointer_type_infoD1Ev<br />
0200d474 0000001c T _ZN10__cxxabiv119__pointer_type_infoD2Ev<br />
0200d490 00000020 T _ZN10__cxxabiv119__pointer_type_infoD0Ev<br />
0200d4b0 00000014 T __cxa_pure_virtual<br />
0200d4c4 0000002e T _ZNK10__cxxabiv120__si_class_type_info11__do_upcastEPKNS_17__class_type_infoEPKvRNS1_15__upcast_resultE<br />
0200d4f4 00000096 T _ZNK10__cxxabiv120__si_class_type_info12__do_dyncastEiNS_17__class_type_info10__sub_kindEPKS1_PKvS4_S6_RNS1_16__dyncast_resultE<br />
0200d58c 00000048 T _ZNK10__cxxabiv120__si_class_type_info20__do_find_public_srcEiPKvPKNS_17__class_type_infoES2_<br />
0200d5d4 0000001c T _ZN10__cxxabiv120__si_class_type_infoD1Ev<br />
0200d5f0 0000001c T _ZN10__cxxabiv120__si_class_type_infoD2Ev<br />
0200d60c 00000020 T _ZN10__cxxabiv120__si_class_type_infoD0Ev<br />
0200d62c 0000000c T _ZNSt9type_infoD2Ev<br />
0200d638 0000000c T _ZNSt9type_infoD1Ev<br />
0200d644 0000000c T _ZNKSt9type_infoeqERKS_<br />
0200d650 00000004 T _ZNKSt9type_info14__is_pointer_pEv<br />
0200d654 00000004 T _ZNKSt9type_info15__is_function_pEv<br />
0200d658 0000000c T _ZNKSt9type_info10__do_catchEPKS_PPvj<br />
0200d664 00000004 T _ZNKSt9type_info11__do_upcastEPKN10__cxxabiv117__class_type_infoEPPv<br />
0200d668 0000001c T _ZNSt9type_infoD0Ev<br />
0200d684 00000008 T _ZNKSt8bad_cast4whatEv<br />
0200d68c 0000001c T _ZNSt8bad_castD1Ev<br />
0200d6a8 0000001c T _ZNSt8bad_castD2Ev<br />
0200d6c4 00000020 T _ZNSt8bad_castD0Ev<br />
0200d6e4 00000008 T _ZNKSt10bad_typeid4whatEv<br />
0200d6ec 0000001c T _ZNSt10bad_typeidD1Ev<br />
0200d708 0000001c T _ZNSt10bad_typeidD2Ev<br />
0200d724 00000020 T _ZNSt10bad_typeidD0Ev<br />
0200d744 0000003e T _ZNK10__cxxabiv117__class_type_info11__do_upcastEPKS0_PPv<br />
0200d784 00000012 T _ZNK10__cxxabiv117__class_type_info20__do_find_public_srcEiPKvPKS0_S2_<br />
0200d798 00000020 T _ZNK10__cxxabiv117__class_type_info11__do_upcastEPKS0_PKvRNS0_15__upcast_resultE<br />
0200d7b8 0000004a T _ZNK10__cxxabiv117__class_type_info12__do_dyncastEiNS0_10__sub_kindEPKS0_PKvS3_S5_RNS0_16__dyncast_resultE<br />
0200d804 00000034 T _ZNK10__cxxabiv117__class_type_info10__do_catchEPKSt9type_infoPPvj<br />
0200d838 0000001c T _ZN10__cxxabiv117__class_type_infoD1Ev<br />
0200d854 0000001c T _ZN10__cxxabiv117__class_type_infoD2Ev<br />
0200d870 00000020 T _ZN10__cxxabiv117__class_type_infoD0Ev<br />
0200d890 00000002 t _GLOBAL__I___cxa_allocate_exception<br />
0200d894 0000003c T __cxa_free_dependent_exception<br />
0200d8d0 0000003c T __cxa_free_exception<br />
0200d90c 00000084 T __cxa_allocate_dependent_exception<br />
0200d990 00000088 T __cxa_allocate_exception<br />
0200da18 00000018 W _ZNK10__cxxabiv117__pbase_type_info15__pointer_catchEPKS0_PPvj<br />
0200da30 00000064 T _ZNK10__cxxabiv117__pbase_type_info10__do_catchEPKSt9type_infoPPvj<br />
0200da94 0000001c T _ZN10__cxxabiv117__pbase_type_infoD1Ev<br />
0200dab0 0000001c T _ZN10__cxxabiv117__pbase_type_infoD2Ev<br />
0200dacc 00000020 T _ZN10__cxxabiv117__pbase_type_infoD0Ev<br />
0200daec T _fini<br />
0200daf8 A __text_end<br />
0200e1dc 000000c4 r standard_subs<br />
0200e2a0 00000280 r cplus_demangle_builtin_types<br />
0200e520 00000350 r cplus_demangle_operators<br />
0200e884 00000004 R _global_impure_ptr<br />
0200e9ec 00000010 r blanks.3548<br />
0200e9fc 00000010 r zeroes.3549<br />
0200ea94 00000030 r lconv<br />
0200eadc 00000048 r JIS_state_table<br />
0200eb24 00000048 r JIS_action_table<br />
0200eb70 000000c8 R __mprec_tens<br />
0200ec38 0000000c r p05.2435<br />
0200ec48 00000028 R __mprec_bigtens<br />
0200ec70 00000028 R __mprec_tinytens<br />
0200ec98 0000005c R dotab_stdnull<br />
0200f4f8 00000014 R _ZTVN10__cxxabiv115__forced_unwindE<br />
0200f510 00000008 R _ZTISt9exception<br />
0200f518 00000014 R _ZTVSt9exception<br />
0200f530 00000008 R _ZTIN10__cxxabiv115__forced_unwindE<br />
0200f538 00000012 R _ZTSSt13bad_exception<br />
0200f54c 00000024 R _ZTSN10__cxxabiv119__foreign_exceptionE<br />
0200f594 00000008 R _ZTIN10__cxxabiv119__foreign_exceptionE<br />
0200f5a0 00000014 R _ZTVSt13bad_exception<br />
0200f5b8 0000000d R _ZTSSt9exception<br />
0200f5c8 00000014 R _ZTVN10__cxxabiv119__foreign_exceptionE<br />
0200f5e0 00000020 R _ZTSN10__cxxabiv115__forced_unwindE<br />
0200f600 0000000c R _ZTISt13bad_exception<br />
0200f60c 00000010 V _ZTIPKe<br />
0200f61c 00000010 V _ZTIPe<br />
0200f62c 00000008 V _ZTIe<br />
0200f634 00000010 V _ZTIPKd<br />
0200f644 00000010 V _ZTIPd<br />
0200f654 00000008 V _ZTId<br />
0200f65c 00000010 V _ZTIPKf<br />
0200f66c 00000010 V _ZTIPf<br />
0200f67c 00000008 V _ZTIf<br />
0200f684 00000010 V _ZTIPKy<br />
0200f694 00000010 V _ZTIPy<br />
0200f6a4 00000008 V _ZTIy<br />
0200f6ac 00000010 V _ZTIPKx<br />
0200f6bc 00000010 V _ZTIPx<br />
0200f6cc 00000008 V _ZTIx<br />
0200f6d4 00000010 V _ZTIPKm<br />
0200f6e4 00000010 V _ZTIPm<br />
0200f6f4 00000008 V _ZTIm<br />
0200f6fc 00000010 V _ZTIPKl<br />
0200f70c 00000010 V _ZTIPl<br />
0200f71c 00000008 V _ZTIl<br />
0200f724 00000010 V _ZTIPKj<br />
0200f734 00000010 V _ZTIPj<br />
0200f744 00000008 V _ZTIj<br />
0200f74c 00000010 V _ZTIPKi<br />
0200f75c 00000010 V _ZTIPi<br />
0200f76c 00000008 V _ZTIi<br />
0200f774 00000010 V _ZTIPKt<br />
0200f784 00000010 V _ZTIPt<br />
0200f794 00000008 V _ZTIt<br />
0200f79c 00000010 V _ZTIPKs<br />
0200f7ac 00000010 V _ZTIPs<br />
0200f7bc 00000008 V _ZTIs<br />
0200f7c4 00000010 V _ZTIPKh<br />
0200f7d4 00000010 V _ZTIPh<br />
0200f7e4 00000008 V _ZTIh<br />
0200f7ec 00000010 V _ZTIPKa<br />
0200f7fc 00000010 V _ZTIPa<br />
0200f80c 00000008 V _ZTIa<br />
0200f814 00000010 V _ZTIPKc<br />
0200f824 00000010 V _ZTIPc<br />
0200f834 00000008 V _ZTIc<br />
0200f83c 00000010 V _ZTIPKDi<br />
0200f84c 00000010 V _ZTIPDi<br />
0200f85c 00000008 V _ZTIDi<br />
0200f864 00000010 V _ZTIPKDs<br />
0200f874 00000010 V _ZTIPDs<br />
0200f884 00000008 V _ZTIDs<br />
0200f88c 00000010 V _ZTIPKw<br />
0200f89c 00000010 V _ZTIPw<br />
0200f8ac 00000008 V _ZTIw<br />
0200f8b4 00000010 V _ZTIPKb<br />
0200f8c4 00000010 V _ZTIPb<br />
0200f8d4 00000008 V _ZTIb<br />
0200f8dc 00000010 V _ZTIPKv<br />
0200f8ec 00000010 V _ZTIPv<br />
0200f8fc 00000008 V _ZTIv<br />
0200f904 00000004 V _ZTSPKe<br />
0200f908 00000003 V _ZTSPe<br />
0200f90c 00000002 V _ZTSe<br />
0200f910 00000004 V _ZTSPKd<br />
0200f914 00000003 V _ZTSPd<br />
0200f918 00000002 V _ZTSd<br />
0200f91c 00000004 V _ZTSPKf<br />
0200f920 00000003 V _ZTSPf<br />
0200f924 00000002 V _ZTSf<br />
0200f928 00000004 V _ZTSPKy<br />
0200f92c 00000003 V _ZTSPy<br />
0200f930 00000002 V _ZTSy<br />
0200f934 00000004 V _ZTSPKx<br />
0200f938 00000003 V _ZTSPx<br />
0200f93c 00000002 V _ZTSx<br />
0200f940 00000004 V _ZTSPKm<br />
0200f944 00000003 V _ZTSPm<br />
0200f948 00000002 V _ZTSm<br />
0200f94c 00000004 V _ZTSPKl<br />
0200f950 00000003 V _ZTSPl<br />
0200f954 00000002 V _ZTSl<br />
0200f958 00000004 V _ZTSPKj<br />
0200f95c 00000003 V _ZTSPj<br />
0200f960 00000002 V _ZTSj<br />
0200f964 00000004 V _ZTSPKi<br />
0200f968 00000003 V _ZTSPi<br />
0200f96c 00000002 V _ZTSi<br />
0200f970 00000004 V _ZTSPKt<br />
0200f974 00000003 V _ZTSPt<br />
0200f978 00000002 V _ZTSt<br />
0200f97c 00000004 V _ZTSPKs<br />
0200f980 00000003 V _ZTSPs<br />
0200f984 00000002 V _ZTSs<br />
0200f988 00000004 V _ZTSPKh<br />
0200f98c 00000003 V _ZTSPh<br />
0200f990 00000002 V _ZTSh<br />
0200f994 00000004 V _ZTSPKa<br />
0200f998 00000003 V _ZTSPa<br />
0200f99c 00000002 V _ZTSa<br />
0200f9a0 00000004 V _ZTSPKc<br />
0200f9a4 00000003 V _ZTSPc<br />
0200f9a8 00000002 V _ZTSc<br />
0200f9ac 00000005 V _ZTSPKDi<br />
0200f9b4 00000004 V _ZTSPDi<br />
0200f9b8 00000003 V _ZTSDi<br />
0200f9bc 00000005 V _ZTSPKDs<br />
0200f9c4 00000004 V _ZTSPDs<br />
0200f9c8 00000003 V _ZTSDs<br />
0200f9cc 00000004 V _ZTSPKw<br />
0200f9d0 00000003 V _ZTSPw<br />
0200f9d4 00000002 V _ZTSw<br />
0200f9d8 00000004 V _ZTSPKb<br />
0200f9dc 00000003 V _ZTSPb<br />
0200f9e0 00000002 V _ZTSb<br />
0200f9e4 00000004 V _ZTSPKv<br />
0200f9e8 00000003 V _ZTSPv<br />
0200f9ec 00000002 V _ZTSv<br />
0200f9f0 0000000c R _ZTIN10__cxxabiv123__fundamental_type_infoE<br />
0200f9fc 00000028 R _ZTSN10__cxxabiv123__fundamental_type_infoE<br />
0200fa28 00000020 R _ZTVN10__cxxabiv123__fundamental_type_infoE<br />
0200fa48 00000014 R _ZTVSt9bad_alloc<br />
0200fa60 0000000d R _ZTSSt9bad_alloc<br />
0200fa70 0000000c R _ZTISt9bad_alloc<br />
0200fa8c 00000001 R _ZSt7nothrow<br />
0200fa90 00000024 R _ZTSN10__cxxabiv119__pointer_type_infoE<br />
0200fab4 0000000c R _ZTIN10__cxxabiv119__pointer_type_infoE<br />
0200fac0 00000024 R _ZTVN10__cxxabiv119__pointer_type_infoE<br />
0200fb08 0000002c R _ZTVN10__cxxabiv120__si_class_type_infoE<br />
0200fb38 0000000c R _ZTIN10__cxxabiv120__si_class_type_infoE<br />
0200fb44 00000025 R _ZTSN10__cxxabiv120__si_class_type_infoE<br />
0200fb6c 00000008 R _ZTISt9type_info<br />
0200fb74 0000000d R _ZTSSt9type_info<br />
0200fb88 00000020 R _ZTVSt9type_info<br />
0200fba8 0000000c R _ZTISt8bad_cast<br />
0200fbb4 0000000c R _ZTSSt8bad_cast<br />
0200fbc0 00000014 R _ZTVSt8bad_cast<br />
0200fbe8 00000014 R _ZTVSt10bad_typeid<br />
0200fc00 0000000c R _ZTISt10bad_typeid<br />
0200fc1c 0000000f R _ZTSSt10bad_typeid<br />
0200fc30 0000002c R _ZTVN10__cxxabiv117__class_type_infoE<br />
0200fc60 0000000c R _ZTIN10__cxxabiv117__class_type_infoE<br />
0200fc6c 00000022 R _ZTSN10__cxxabiv117__class_type_infoE<br />
0200fc90 0000000c R _ZTIN10__cxxabiv117__pbase_type_infoE<br />
0200fc9c 00000022 R _ZTSN10__cxxabiv117__pbase_type_infoE<br />
0200fcc0 00000024 R _ZTVN10__cxxabiv117__pbase_type_infoE<br />
0200ff44 A __exidx_start<br />
02010364 A __exidx_end<br />
02010364 t __frame_dummy_init_array_entry<br />
02010364 A __init_array_start<br />
02010364 A __preinit_array_end<br />
02010364 A __preinit_array_start<br />
0201036c t __do_global_dtors_aux_fini_array_entry<br />
0201036c A __fini_array_start<br />
0201036c A __init_array_end<br />
02010370 r __EH_FRAME_BEGIN__<br />
02010370 A __fini_array_end<br />
02011114 r __FRAME_END__<br />
02011118 d __JCR_END__<br />
02011118 d __JCR_LIST__<br />
0201111c A __data_start<br />
0201111c D __dso_handle<br />
0201111c A __ewram_start<br />
02011120 00000004 D fifo_freewords<br />
02011124 00000004 D fifo_send_queue<br />
02011128 00000004 D fifo_buffer_free<br />
0201112c 00000004 D fifo_receive_queue<br />
02011130 00000004 D _impure_ptr<br />
02011138 00000428 d impure_data<br />
02011560 00000408 D __malloc_av_<br />
02011968 00000004 D __malloc_sbrk_base<br />
0201196c 00000004 D __malloc_trim_threshold<br />
02011970 00000004 d charset<br />
02011974 0000000c d last_lc_ctype.1268<br />
02011980 0000000c D __lc_ctype<br />
0201198c 0000000c d last_lc_messages.1270<br />
02011998 0000000c d lc_messages.1269<br />
020119a4 00000004 D __mb_cur_max<br />
020119a8 00000004 d defaultDevice<br />
020119ac 00000040 D devoptab_list<br />
020119ec 00000004 D _ZN10__cxxabiv119__terminate_handlerE<br />
020119f0 00000004 D _ZN10__cxxabiv120__unexpected_handlerE<br />
020119f4 A __bss_start<br />
020119f4 A __bss_start__<br />
020119f4 A __bss_vma<br />
020119f4 A __data_end<br />
020119f4 A __dtcm_lma<br />
020119f4 A __itcm_lma<br />
020119f4 b completed.2775<br />
020119f8 b object.2787<br />
02011a10 00000004 b __timeout<br />
02011a14 00000004 B processing<br />
02011a18 00000001 b _ZZN9__gnu_cxx27__verbose_terminate_handlerEvE11terminating<br />
02011a1c 0000000c b _ZL10eh_globals<br />
02011a28 00000004 B __new_handler<br />
02011a2c 00000004 b _ZL15dependents_used<br />
02011a30 000001e0 b _ZL17dependents_buffer<br />
02011b84 A __vectors_lma<br />
02011c10 00000004 b _ZL14emergency_used<br />
02011c18 00000800 b _ZL16emergency_buffer<br />
02012418 00000004 B __malloc_top_pad<br />
0201241c 00000028 B __malloc_current_mallinfo<br />
02012444 00000004 B __malloc_max_sbrked_mem<br />
02012448 00000004 B __malloc_max_total_mem<br />
0201244c 00000004 B __nlocale_changed<br />
02012450 00000004 B __mlocale_changed<br />
02012454 00000004 B _PathLocale<br />
02012458 00000004 b heap_start.2602<br />
0201245c 00000004 B fake_heap_end<br />
02012460 00000004 B fake_heap_start<br />
02012464 00000008 B __syscalls<br />
0201246c 00001000 b handles<br />
0201346c 00000004 B theTime<br />
02013470 00000040 B fifo_datamsg_data<br />
020134b0 00000800 B fifo_buffer<br />
02013cb0 00000040 B fifo_value32_func<br />
02013cf0 00000040 B fifo_address_func<br />
02013d30 00000040 B fifo_value32_data<br />
02013d70 00000040 B fifo_value32_queue<br />
02013db0 00000040 B fifo_data_queue<br />
02013df0 00000040 B fifo_address_data<br />
02013e30 00000040 B fifo_datamsg_func<br />
02013e70 00000040 B fifo_address_queue<br />
02013eb0 00000004 B punixTime<br />
02013eb4 A __bss_end<br />
02013eb4 A __bss_end__<br />
02013eb4 A __end__<br />
02013eb4 A _end<br />
023ff000 A __eheap_end<br />
023ff000 A __ewram_end<br />
027fff70 a _libnds_argv<br />
0b000000 A __dtcm_end<br />
0b000000 A __dtcm_start<br />
0b000000 A __sbss_end<br />
0b000000 A __sbss_start<br />
0b000000 A __sbss_start__<br />
0b003d00 A __sp_usr<br />
0b003e00 A __sp_irq<br />
0b003f00 A __sp_svc<br />
0b003ff8 A __irq_flags<br />
0b003ffc A __irq_vector<br />
0b004000 A __dtcm_to</div>
</div>
<p>
Well, I did say this was going to be the <i>long</i> story, didn&#8217;t I?<br />
Everything that was in the base project is in here as well. The<br />
additional parts can be summarized as follows.
</p>
<div class="none">
<div class="none proglist" style=" "><span class="co2"># Additions w.r.t the base case.</span><br />
02001440 &#8211; 02004398 &nbsp; &nbsp; 2F58&nbsp; &nbsp; : d_* routines<br />
020043a4 &#8211; 02004434 &nbsp; &nbsp; 0090&nbsp; &nbsp; : software div (__aeabi_uidiv etc)<br />
02004434 &#8211; 02005408 &nbsp; &nbsp; 0FD4&nbsp; &nbsp; : exception unwind routines<br />
02005418 &#8211; 0200b680 &nbsp; &nbsp; <span class="nu0">6268</span>&nbsp; &nbsp; : various libc : printf et al, malloc,mem*,locale, Device, etc<br />
0200b680 &#8211; 0200c184 &nbsp; &nbsp; 0B04&nbsp; &nbsp; : div and FP math routines. (for printf)<br />
0200c190 &#8211; 0200daf8 &nbsp; &nbsp; <span class="nu0">1968</span>&nbsp; &nbsp; : exception/typeinfo routines.<br />
0200daf8 A __text_end<br />
0200e1dc &#8211; 0200ff44 &nbsp; &nbsp; 1D68&nbsp; &nbsp; : exception/typeinfo strings and pointers.<br />
02013eb4 A _end</div>
</div>
<p>
There are three main areas to discern:
</p>
<ul>
<li>
    <code>d_*()</code> routines, presumably for debug printing.<br />
	(size: 12k)
  </li>
<li>
    Stdio formatting and related. This includes file handling, device<br />
	handling and many forms of <code>printf</code>, which brings a<br />
	whole lot of baggage (some allocation, format parsing and<br />
	math/floating point routines). There&#8217;s also some abort and<br />
	signalling routines. (size: 28k).
  </li>
<li>
    Exception handling. Not just routines for handling them, but also<br />
	the typeinfo stuff required, the output strings and the output<br />
	string pointers. (size: 18k)
  </li>
</ul>
<p>
This roughly 60k of stuff is the overhead of exceptions &ndash;<br />
any <i>potential</i> exception. In this case, it&#8217;s because<br />
<code>new</code> requires a <code>bad_alloc</code> exception when it&#8217;s<br />
unable to allocate more.
</p>
<p>
The problem is that exceptions have<br />
many dependencies: to do exception handling, you need to keep track of and unwind<br />
the stack. You also need to be able to tell the type of exception<br />
thrown, which requires RTTI. And then you have to say which exception was<br />
thrown, so you need error messages, <i>and</i> a list of pointers to<br />
those messages, <i>and</i> a way to format and write those messages,<br />
hence the <code>d_*()</code> routines and all the stdio stuff.
</p>
<p><h2 id="sec-own-new">3
Custom new/delete
</h2>
</p>
<p>
There is a way around this, though: redefine <code>new</code> and<br />
related functions. Technically speaking, this is a <i>bad</i> idea<br />
if you don&#8217;t know what you&#8217;re doing, but it can be done. Note that<br />
you would need to overload four operators: <code>new</code>,<br />
<code>delete</code> and their array counterparts.
</p>
<div class="none">
<div class="none proglist" style=" ">void* operator new(size_t size) &nbsp; &nbsp; { &nbsp; return malloc(size);&nbsp; &nbsp; }</p>
<p>void operator delete(void *p) &nbsp; &nbsp; &nbsp; { &nbsp; free(p);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }</p>
<p>void* operator new[](size_t size) &nbsp; { &nbsp; return malloc(size);&nbsp; &nbsp; }</p>
<p>void operator delete[](void *p) &nbsp; &nbsp; { &nbsp; free(p);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; }</p></div>
</div>
<p>
This way, you just incur the cost of <code>malloc()</code> and<br />
<code>free()</code>, which are only about 3k. But again, this is going<br />
against the standard and you&#8217;ll really have to ask yourself if the<br />
(at best) 2% of main RAM you save with this is really worth it.
</p>
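<p>
If replacing the global operators feels too drastic, a class-scoped variant<br />
is one alternative. The sketch below is only an illustration under a few<br />
assumptions: <code>Particle</code> and <code>make_particle()</code> are made-up<br />
names that are not in the test project, and it needs a C++11 compiler for<br />
<code>noexcept</code>. Whether it trims as much library code as the global<br />
version depends on the toolchain and has not been measured here.
</p>

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical example class; not part of the test project.
struct Particle
{
    int x, y;

    // noexcept means allocation failure is reported by returning a null
    // pointer instead of throwing std::bad_alloc.
    static void* operator new(std::size_t size) noexcept   { return std::malloc(size); }
    static void  operator delete(void* p) noexcept         { std::free(p); }
    static void* operator new[](std::size_t size) noexcept { return std::malloc(size); }
    static void  operator delete[](void* p) noexcept       { std::free(p); }
};

// Allocate one Particle. The caller has to check for a null pointer,
// because no exception will be thrown on failure.
Particle* make_particle(int x, int y)
{
    Particle* p = new Particle;
    if (p != nullptr)
    {
        p->x = x;
        p->y = y;
    }
    return p;
}
```

<p>
The point of <code>noexcept</code> here is that returning a null pointer<br />
becomes a legal way to signal failure, so every <code>new Particle</code><br />
has to be followed by a null check.
</p>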
<p><div>&nbsp;</div></p>
<p>More on this can be read at<br />
<a href="http://brewforums.qualcomm.com/showthread.php?t=2033">http://brewforums.qualcomm.com/showthread.php?t=2033</a>.</p>
<p><h2 id="sec-conc">4
Other considerations and conclusions.
</h2>
</p>
<p>
The binary size is <i>not</i> the same as the main RAM footprint.<br />
There is about 44 kb of other stuff.
</p>
<p>
The overhead of the standard <code>new</code> is 60 kb, which is all<br />
due to exceptions. You <i>cannot</i> remove it by<br />
using the compiler options <tt>-fno-exceptions</tt> and<br />
<tt>-fno-rtti</tt>, because that only affects your own code, not the<br />
standard libraries. You can remove this overhead by overloading<br />
<code>new</code> and related functions, but you have to be really<br />
careful with this.
</p>
<p>
I&#8217;ve also done a little bit of testing with <code>vector</code>, and<br />
it seems that <code>vector</code>&#8217;s overhead also comes from<br />
<code>new</code> and can be removed the same way. However, other parts<br />
of <code>vector</code> (and STL) may use other exceptions, so it&#8217;s<br />
quite possible it won&#8217;t work in all cases.
</p>
<p>
Note that roughly 28 kb of the exception overhead is actually<br />
stdio related &ndash; specifically formatted printing:<br />
<code>*printf</code>. If you&#8217;re using <code>printf</code> anyway, the<br />
effective overhead of exceptions is reduced considerably.
</p>
<p>
Finally, remember that the exception overhead amounts to roughly 2% of<br />
main RAM at most. In most homebrew cases it won&#8217;t matter that much.<br />
When it does start to affect your app, you will likely have other parts<br />
that are easier and safer to optimize out.
</p>
<p><div>&nbsp;</div></p>
<p><a href="/files/misc/minimal.zip">Test project + notes.</a><div>&nbsp;</div></p>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2009/11/sizeof-new/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Signs from Hell</title>
		<link>http://www.coranac.com/2009/08/signs-from-hell/</link>
		<comments>http://www.coranac.com/2009/08/signs-from-hell/#comments</comments>
		<pubDate>Mon, 03 Aug 2009 19:42:24 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=100</guid>
		<description><![CDATA[
The integer datatypes in C can be either signed or unsigned. Sometimes, it&#8217;s obvious which should be used; for negative values you clearly should use signed types, for example. In many cases there is no obvious choice &#8211; in that case it usually doesn&#8217;t matter which you use. Usually, but not always. Sometimes, picking the [...]]]></description>
			<content:encoded><![CDATA[
<p></p>
<p>
The integer datatypes in C can be either signed or unsigned. Sometimes,<br />
it&#8217;s obvious which should be used; for negative values you clearly<br />
should use signed types, for example. In many cases there is no obvious<br />
choice &ndash; in that case it usually doesn&#8217;t matter which you use.<br />
Usually, but not always. <i>Sometimes</i>, picking the wrong kind<br />
can introduce subtle bugs in your program that, unless you know what<br />
to look out for, can catch you off-guard and have you searching for<br />
the problem for hours.
</p>
<p>
I&#8217;ve mentioned a few of these occasions in Tonc<br />
<a href="/tonc/text/affine.htm#ssec-fin-type">here</a> and<br />
<a href="/tonc/text/numbers.htm#ssec-bits-int">there</a>, but I think<br />
it&#8217;s worth going over them again in a little more detail. First, I&#8217;ll<br />
explain how signed integers work, what the difference between signed<br />
and unsigned is, and where potential problems can come from. Then I&#8217;ll<br />
discuss some common pitfalls so you know what to expect.
</p>
<p><ul>
  <li> <a href="#sec-basics">1
Basics
</a> </li>
  <li> <a href="#sec-prob">2
Potential problems
</a> </li>
  <li> <a href="#sec-summary">3
Summary
</a> </li>
</ul>
</p>
<p><h2 id="sec-basics">1
Basics
</h2>
</p>
<p>
The <dfn>signedness</dfn> of a variable refers to whether it can be<br />
used to represent negative values or not. Unsigned variables can only<br />
hold zero and positive values; signed variables can hold both positive and negative ones.
</p>
<p>
In the computer world, signedness is mostly a matter of interpretation.<br />
Say you have a variable that is <i>N</i> bits long. This is enough<br />
room for 2<sup>N</sup> distinct numbers, but it says nothing about<br />
which range of numbers you should be using them for. Interpreted as<br />
unsigned integers, its range would be [0,&nbsp;2<sup>N</sup>&minus;1].<br />
Under a signed interpretation, you&#8217;d use some bit-patterns for negative<br />
numbers. There are actually several ways of doing this, but the most<br />
commonly used is known as 
<a href="http://en.wikipedia.org/wiki/Two%27s_complement">two&#8217;s complement</a>, which leads to<br />
a [&minus;2<sup>N&minus;1</sup>,&nbsp;2<sup>N&minus;1</sup>&minus;1] range:<br />
half positive and half negative.
</p>
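<p>
To make the two ranges concrete, here is a minimal sketch. It assumes the<br />
fixed-width types from <code>&lt;cstdint&gt;</code> and a C++11 compiler; the<br />
helper names <code>unsigned_max()</code>, <code>signed_min()</code> and<br />
<code>signed_max()</code> are made up for the occasion and only valid for<br />
widths up to 62 bits.
</p>

```cpp
#include <cstdint>
#include <limits>

// Range limits of an N-bit integer, straight from the formulas in the text.
constexpr long long unsigned_max(int n) { return (1LL << n) - 1; }        // 2^N - 1
constexpr long long signed_min(int n)   { return -(1LL << (n - 1)); }     // -2^(N-1)
constexpr long long signed_max(int n)   { return (1LL << (n - 1)) - 1; }  // 2^(N-1) - 1

// The 8-bit case checked against what the compiler itself reports.
static_assert(std::numeric_limits<std::uint8_t>::max() == unsigned_max(8), "u8 max is 255");
static_assert(std::numeric_limits<std::int8_t>::min()  == signed_min(8),   "s8 min is -128");
static_assert(std::numeric_limits<std::int8_t>::max()  == signed_max(8),   "s8 max is 127");
```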
<p><h3 id="ssec-base-twos">1.1
Two&#8217;s complement theory
</h3>
</p>
<p>
Two&#8217;s complement is sometimes seen as an awkward system, but it<br />
actually follows quite naturally when you only have a fixed number<br />
of digits to write down numbers with.<br />
Consider the whole line of positive and negative integers. As you<br />
move away from zero, the numbers will grow larger and larger.<br />
Now suppose you have a<br />
<a href="http://en.wikipedia.org/wiki/Counter#Mechanical_counters">counting device</a><br />
composed of a limited number of digits, each of which can only display<br />
numbers 0 through 10&minus;1. With <i>N</i> digits, you only have<br />
room for 10<sup>N</sup> different numbers, and once those are used up<br />
(at 10<sup>N</sup>&minus;1), the counter returns to 0 and counting<br />
effectively resets. In essence, the number on the counter works in<br />
modulo 10<sup>N</sup>.
</p>
<p>
The key is that this works in both positive and negative directions.<br />
As far as the counter is concerned, 0 and 10<sup>N</sup> are the<br />
same thing. This being the case, you can argue that &minus;1<br />
(that is, the number before zero) is equivalent to 10<sup>N</sup>&minus;1;<br />
and &minus;2&nbsp;&equiv;&nbsp;10<sup>N</sup>&minus;2, and so on.<br />
Note that this works regardless of what 10 actually is; it can be<br />
ten (decimal), two (binary) or sixteen (hexadecimal).
</p>
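<p>
The wrapping counter can be sketched in a few lines. This is just an<br />
illustration with made-up helper names (<code>ipow()</code> and<br />
<code>counter_display()</code>); it computes what an <i>N</i>-digit counter<br />
in a given base would show for any integer, negative ones included.
</p>

```cpp
// base^digits, computed recursively so it can be a C++11 constexpr.
constexpr long long ipow(long long base, int digits)
{
    return digits == 0 ? 1 : base * ipow(base, digits - 1);
}

// What an N-digit counter displays for `value`: value modulo base^N,
// shifted into the range [0, base^N) so negative inputs wrap around too.
constexpr long long counter_display(long long value, long long base, int digits)
{
    return ((value % ipow(base, digits)) + ipow(base, digits)) % ipow(base, digits);
}

// -1 is the number before zero, so it shows up as the highest symbol.
static_assert(counter_display(-1, 10, 3) == 999,  "decimal, 3 digits");
static_assert(counter_display(-1, 16, 2) == 0xFF, "hex, 2 digits");
static_assert(counter_display(-1,  2, 8) == 255,  "binary, 8 digits");
```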
<p>
The 10<sup>N</sup> possible numbers form a window over the number line,<br />
but where the window starts is up to the user. For signed numbers,<br />
you can move the window so that the upper half of the 10<sup>N</sup><br />
range is interpreted as negative numbers.
</p>
<p><div>&nbsp;</div></p>
<p>
Fig&nbsp;1 shows how this works for 8-bit numbers<br />
(written in hex for convenience). The black numbers represent the<br />
entire number line, where numbers can have as many digits as you<br />
need. With only two 
<a href="http://en.wikipedia.org/wiki/nybble">nybble</a>s, the counter repeats every<br />
100h&nbsp;=&nbsp;256 values. FFh, 1FFh, but also &minus;1 all reduce to the same<br />
symbol, namely FFh. In Fig&nbsp;2 you can see<br />
how the available symbols are mapped to either signed or unsigned<br />
values. In the unsigned case, numbers simply count from 0 to FFh;<br />
for signed, the top half of the symbol range is put on the left side<br />
of zero and is used for negative numbers.
</p>
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-numline"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;1. </b>
</div>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-signedness"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;2. </b>
</div>

</p></div>
<p><div>&nbsp;</div></p>
<p>
The mathematical reasoning behind all this goes like this. Assume for<br />
convenience that <i>N</i>&nbsp;=&nbsp;1, so that 0 is equivalent to 10 and<br />
in fact every multiple of 10. By definition, subtracting a value<br />
from itself gives 0. Because subtraction is merely addition of the<br />
negative value, you get the following:
</p>
<p><table class="eqtbl" id="eq-complement-def">
<tr>
<td class="eqnrcell">(1)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20x%20%26-%26%20x%20%26%3D%26%200%26%20%5C%5C%20x%20%26%2B%26%20%28-x%29%20%26%3D%26%200%20%26%20%5C%5C%20x%20%26%2B%26%20%28-x%29%20%26%5Cequiv%26%2010%20%26%20%5C%5C%20%26%20%26%20%28-x%29%20%26%3D%26%2010%20%26-%20x%20%5Cend%7Beqnarray%7D'
	title="\begin{eqnarray} x &amp;-&amp; x &amp;=&amp; 0&amp; \\ x &amp;+&amp; (-x) &amp;=&amp; 0 &amp; \\ x &amp;+&amp; (-x) &amp;\equiv&amp; 10 &amp; \\ &amp; &amp; (-x) &amp;=&amp; 10 &amp;- x \end{eqnarray}"
	alt="\begin{eqnarray} x &amp;-&amp; x &amp;=&amp; 0&amp; \\ x &amp;+&amp; (-x) &amp;=&amp; 0 &amp; \\ x &amp;+&amp; (-x) &amp;\equiv&amp; 10 &amp; \\ &amp; &amp; (-x) &amp;=&amp; 10 &amp;- x \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
The term &minus;<i>x</i> in the last step should be seen as a unit,<br />
call it <i>C</i>. Numerically, <i>C</i> is the number that, when added<br />
to <i>x</i>, gives 10. In decimal, if <i>x</i>&nbsp;=&nbsp;1, then <i>C</i>&nbsp;=&nbsp;9.<br />
<i>C</i> is called the 10&#8217;s <dfn>complement</dfn> of <i>x</i>, because<br />
it&#8217;s what&#8217;s needed to complete the 10. It&#8217;s called the two&#8217;s<br />
complement in binary, because then 10 equals two.
</p>
<p>
In binary, there is an alternative way to calculate the two&#8217;s complement<br />
of a number. Subtracting a number from 2<sup>N</sup>&minus;1 (all ones) is equivalent to<br />
inverting all its bits, so you get:
</p>
<p><table class="eqtbl" id="eq-complement-bin">
<tr>
<td class="eqnrcell">(2)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20%28-x%29%20%26%3D%26%202%5EN%20-%20x%20%5C%5C%20%26%3D%26%202%5EN%20-1%20-%20x%20%2B%201%20%5C%5C%20%28-x%29%20%26%3D%26%20%5Csim%20x%20%2B%201%20%5C%5C%20%5Cend%7Beqnarray%7D'
	title="\begin{eqnarray} (-x) &amp;=&amp; 2^N - x \\ &amp;=&amp; 2^N -1 - x + 1 \\ (-x) &amp;=&amp; \sim x + 1 \\ \end{eqnarray}"
	alt="\begin{eqnarray} (-x) &amp;=&amp; 2^N - x \\ &amp;=&amp; 2^N -1 - x + 1 \\ (-x) &amp;=&amp; \sim x + 1 \\ \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
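<p>
As a quick sanity check of the invert-and-add-one rule, here is a small<br />
sketch. The function names <code>negate32()</code> and <code>negate8()</code><br />
are mine; everything is done on unsigned types so that wrap-around is<br />
well defined.
</p>

```cpp
#include <cstdint>

// Two's complement negation of a 32-bit pattern: invert all bits, add one.
constexpr std::uint32_t negate32(std::uint32_t x)
{
    return ~x + 1u;
}

// Same thing for 8 bits. ~x is computed as an int, so the result has to be
// truncated back to 8 bits explicitly.
constexpr std::uint8_t negate8(std::uint8_t x)
{
    return static_cast<std::uint8_t>(~x + 1);
}

static_assert(negate32(1) == 0xFFFFFFFFu, "-1 as a 32-bit pattern");
static_assert(negate8(1)  == 0xFF,        "-1 as an 8-bit pattern");
static_assert(negate8(7)  == 0xF9,        "-7 as an 8-bit pattern");
```

<p>
Adding a value to its negation wraps around to zero, which is the whole<br />
point of the complement.
</p>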
<p>
Using two&#8217;s complement<span class="fnote"><a href="#ft-nr1" title="Or any 10&#8217;s complement, really.">(1)</a></span><br />
for negative numbers has some interesting<br />
properties. First, subtraction and addition are basically the same<br />
thing. This is nice for arithmetic implementers for two reasons:<br />
the same hardware can be used for both operations, and it can be used<br />
for both positive and negative numbers.
</p>
<p>
Second, because the top half<br />
is now used for negative numbers, the most significant bit can be<br />
seen as a sign bit. Note: <i>a</i> sign bit, not <i>the</i> sign bit.<br />
There is a subtle linguistic difference here. When talking about<br />
<i>the</i><br />
sign bit, one may think of it as a single bit that indicates the sign.<br />
For example, 8-bit +1 and &minus;1 could be &lsquo;<code><b>0</b>000&nbsp;0001</code>&rsquo;<br />
and &lsquo;<code><b>1</b>000&nbsp;0001</code>&rsquo;, respectively. In two&#8217;s complement,<br />
however, +1 and &minus;1 are actually &lsquo;<code>0000&nbsp;0001</code>&rsquo;<br />
and &lsquo;<code>1111&nbsp;1111</code>&rsquo; (the sum of which is<br />
&lsquo;<code>1,00000000</code>&rsquo;&nbsp;&equiv;&nbsp;<code>0</code>, as<br />
it should be).
</p>
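<p>
The difference between the two readings of the top bit can be sketched<br />
like this. Both decoder functions are hypothetical helpers written for<br />
this example; the first treats the top bit as a separate sign flag<br />
(sign-magnitude), the second uses two&#8217;s complement.
</p>

```cpp
#include <cstdint>

// Sign-magnitude: top bit is a flag, the lower 7 bits are the magnitude.
constexpr int decode_sign_magnitude(std::uint8_t bits)
{
    return (bits & 0x80) ? -(bits & 0x7F) : (bits & 0x7F);
}

// Two's complement: patterns 0x80..0xFF sit 256 below their unsigned value.
constexpr int decode_twos_complement(std::uint8_t bits)
{
    return (bits & 0x80) ? bits - 0x100 : bits;
}

// The same pattern means different things in the two systems.
static_assert(decode_sign_magnitude(0x81)  == -1,   "1000 0001 as sign-magnitude");
static_assert(decode_twos_complement(0x81) == -127, "1000 0001 as two's complement");
static_assert(decode_twos_complement(0xFF) == -1,   "1111 1111 as two's complement");
```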
<p><h3 id="ssec-base-decl">1.2
Declaring signed or unsigned
</h3>
</p>
<p>
In the end, whether a particular group of bits is signed or unsigned is<br />
a matter of interpretation. For example, the 8-bit group<br />
&lsquo;<code>1111&nbsp;1111</code>&rsquo; can be either 255 or &minus;1, depending on<br />
how you <i>want</i> to look at it. You can&#8217;t determine the signedness<br />
from just the bits themselves.
</p>
<p>
Also, when you&#8217;ve decided you&#8217;re going to use a signed interpretation,<br />
whether the group forms a negative number or not depends on the size of<br />
the group. For example, consider the two bytes &lsquo;<code>01 FF</code>&rsquo;.<br />
As separate bytes, these would form +1 and &minus;1, respectively.<br />
However, if you view them as a single 16-bit integer<br />
(&lsquo;short&rsquo;), it forms 0x01FF, which is a positive number.
</p>
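<p>
Here is that example as a sketch. <code>as_signed()</code> is a made-up<br />
helper that interprets the lowest <i>width</i> bits of a value as a two&#8217;s<br />
complement number; it assumes the bits above <i>width</i> are zero and that<br />
<i>width</i> is between 1 and 32.
</p>

```cpp
#include <cstdint>

// Interpret the lowest `width` bits of `bits` as a signed number.
// If the top bit of the group is set, the value sits 2^width below
// its unsigned reading.
constexpr std::int64_t as_signed(std::uint32_t bits, int width)
{
    return ((bits >> (width - 1)) & 1)
        ? static_cast<std::int64_t>(bits) - (std::int64_t(1) << width)
        : static_cast<std::int64_t>(bits);
}

// The bytes 01 and FF on their own ...
static_assert(as_signed(0x01, 8) == 1,  "byte 01 is +1");
static_assert(as_signed(0xFF, 8) == -1, "byte FF is -1");
// ... and glued together into one 16-bit group.
static_assert(as_signed(0x01FF, 16) == 511, "halfword 01FF is positive");
```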
<p><div>&nbsp;</div></p>
<p>
In C, you specify signedness when you declare a variable. The general<br />
rule is that an integer is signed unless the keyword<br />
&lsquo;<code>unsigned</code>&rsquo; is used. The exception to the rule is<br />
&lsquo;<code>char</code>&rsquo;, whose default signedness is platform- and<br />
compiler-dependent! Be careful with this particular datatype.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">int</span> ia; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Signed integer.</span><br />
<span class="kw1">unsigned</span> <span class="kw1">int</span> ib;&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Unsigned integer.</span></p>
<p><span class="kw1">short</span> sa; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Signed 16-bit integer.</span><br />
<span class="kw1">unsigned</span> <span class="kw1">short</span> sb;&nbsp; &nbsp; &nbsp; <span class="co1">// Unsigned 16-bit integer.</span></p>
<p><span class="kw1">char</span> ca;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// ??-signed 8-bit integer.</span><br />
<span class="kw1">signed</span> <span class="kw1">char</span> cb; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// signed 8-bit integer.</span><br />
<span class="kw1">unsigned</span> <span class="kw1">char</span> cc; &nbsp; &nbsp; &nbsp; <span class="co1">// unsigned 8-bit integer.</span></div>
</div>
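<p>
If you want to know what plain <code>char</code> does on your platform, you<br />
can simply ask the compiler. The sketch below uses made-up function names<br />
and assumes a C++11 compiler. As far as I know, GCC for ARM normally treats<br />
plain <code>char</code> as unsigned while most x86 compilers treat it as<br />
signed, but check your own toolchain rather than taking that on faith.
</p>

```cpp
#include <climits>
#include <type_traits>

// True if plain char behaves like signed char on this compiler.
constexpr bool char_is_signed()
{
    return std::is_signed<char>::value;
}

// Widening a plain char: the result for values above 0x7F depends on the platform.
int widen_default(char c)
{
    return c;
}

// Going through unsigned char first always gives 0..255.
int widen_unsigned(char c)
{
    return static_cast<unsigned char>(c);
}
```

<p>
When a byte is meant as a small number rather than a character, spelling<br />
out <code>signed char</code> or <code>unsigned char</code> avoids the<br />
question altogether.
</p>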
<p>
Because they&#8217;re shorter and more descriptive, the following typedefs<br />
are often used for variable declarations. Basically, it&#8217;s<br />
&lsquo;<code>s</code>&rsquo; or &lsquo;<code>u</code>&rsquo; for signed<br />
or unsigned, respectively, followed by the size of the type in bits.<br />
Unsigned variants are also sometimes indicated by<br />
&lsquo;u&rsquo;+<i>typename</i>.
</p>
<div class=lblock>
<table id="tbl-data-typedefs" border=1 cellpadding=2 cellspacing=0 width=200>
<caption align=bottom>
  <b>Table&nbsp;1</b>: common short (un)signed typedefs.<br />
</caption>
<tr>
<th>Base type</th>
<th>Signed</th>
<th colspan=2>Unsigned</th>
</tr>
<tr class=rnum>
<th>char</th>
<td>s8</td>
<td>u8</td>
<td>uchar</td>
</tr>
<tr class=rnum>
<th>short</th>
<td>s16</td>
<td>u16</td>
<td>ushort</td>
</tr>
<tr class=rnum>
<th>int/long</th>
<td>s32</td>
<td>u32</td>
<td>uint</td>
</tr>
<tr class=rnum>
<th>long long</th>
<td>s64</td>
<td>u64</td>
<td>&nbsp;</td>
</tr>
</table>
</div>
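<p>
For reference, one possible way to define the typedefs from the table.<br />
Building them on <code>&lt;cstdint&gt;</code> is my assumption here; existing<br />
headers may well define them from the base types directly.
</p>

```cpp
#include <cstdint>

// Sized signed and unsigned types.
typedef std::int8_t   s8;
typedef std::int16_t  s16;
typedef std::int32_t  s32;
typedef std::int64_t  s64;
typedef std::uint8_t  u8;
typedef std::uint16_t u16;
typedef std::uint32_t u32;
typedef std::uint64_t u64;

// 'u' + typename variants.
typedef unsigned char  uchar;
typedef unsigned short ushort;
typedef unsigned int   uint;

// The names promise a size in bits, so check that they deliver.
static_assert(sizeof(s8)  == 1 && sizeof(u8)  == 1, "8-bit types");
static_assert(sizeof(s16) == 2 && sizeof(u16) == 2, "16-bit types");
static_assert(sizeof(s32) == 4 && sizeof(u32) == 4, "32-bit types");
static_assert(sizeof(s64) == 8 && sizeof(u64) == 8, "64-bit types");
```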
<p><div>&nbsp;</div></p>
<p>
In assembly, you can&#8217;t declare the signedness of variables, because<br />
there&#8217;s no such thing as variables. There&#8217;s only labels and how you<br />
use those labels determines what the related data are. Technically,<br />
there is only one datatype: the 32-bit word, corresponding to C&#8217;s<br />
int or long. The other datatypes are essentially emulated, or<br />
defined by which memory instructions you use:<br />
<code>LDRB/LDRSB/STRB</code> for bytes and<br />
<code>LDRH/LDRSH/STRH</code> for halfwords. For most data<br />
operations, signedness is irrelevant and as such mostly ignored.<br />
Only in a few cases does the sign actually matter and as these are<br />
essentially the topic of the rest of the article, we&#8217;ll get<br />
to those eventually.
</p>
<p><h2 id="sec-prob">2
Potential problems
</h2>
</p>
<p>
The following sections are cases where signedness may become<br />
problematic. I say &ldquo;may&rdquo;, because often it just works<br />
out. But that&#8217;s just the thing: it can work most of the time and then<br />
things can go horribly wrong all of a sudden. The root of the problem<br />
comes down to one thing: negative numbers; usually, negative numbers<br />
becoming large positive numbers when interpreted as unsigned values.
</p>
<p>
For example, 32-bit signed &minus;1 = 0xFFFFFFFF = unsigned 4294967295<br />
(= 2<sup>32</sup>&minus;1). If nothing else, remember that part.
</p>
<p><h3 id="ssec-prob-extend">2.1
Sign extension, casting and shifting
</h3>
</p>
<p>
When you go from a small datatype to a larger one, you&#8217;re essentially<br />
adding a new set of bits at the top, and these bits have to be<br />
initialized in a meaningful way. The addition of these bits should have<br />
no effect on the value itself. For example, +1 should remain +1 and<br />
&minus;1 should remain &minus;1. What this boils down to for two&#8217;s<br />
complement is that the new bits need to be filled with the sign-bit<br />
of the old value. This is called <dfn>sign extension</dfn>, because<br />
the top-bit (the sign-bit) is extended into all the higher bits. There<br />
is also <dfn>zero-extension</dfn>, which is when the higher bits are<br />
zeroed out. These two forms effectively correspond to signed<br />
and unsigned casting<span class="fnote"><a href="#ft-nr2" title="One could say that zero-extension is just a
form of sign-extension; it&#8217;s just that the sign for an unsigned number
is always positive.">(2)</a></span>.
</p>
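<p>
Written out by hand, the two kinds of extension from 8 to 32 bits look<br />
roughly like this. The function names are made up for this sketch;<br />
normally the compiler does this for you when you cast.
</p>

```cpp
#include <cstdint>

// Zero-extension: the new upper 24 bits are simply cleared.
constexpr std::uint32_t zero_extend8(std::uint8_t b)
{
    return b;
}

// Sign-extension: the new upper 24 bits are copies of bit 7.
constexpr std::uint32_t sign_extend8(std::uint8_t b)
{
    return (b & 0x80) ? (b | 0xFFFFFF00u) : b;
}

static_assert(zero_extend8(0xFF) == 0x000000FFu, "255 stays 255");
static_assert(sign_extend8(0xFF) == 0xFFFFFFFFu, "-1 stays -1");
static_assert(sign_extend8(0x01) == 0x00000001u, "+1 stays +1");
```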
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-extend"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;3. </b>
</div>

</div>
<p>
Conversions of this kind actually happen <b>all the time</b>,<br />
without any kind of direct intervention from the programmer. Data<br />
operations are always done in CPU words and any time you use a smaller<br />
datatype, there is the need to sign- or zero-extend.<br />
This also brings forth the question of which type of extension will be<br />
used: sign- or zero-extension. As the following bit of code shows, it<br />
depends on the signedness of the variable you&#8217;re converting <i>from</i>.<br />
8-bit variables <code>sc</code> and <code>uc</code> are both initialized<br />
with 0xFF, which is either &minus;1 or 255 (you can use either of those too,<br />
by the way). After that, these are used to initialize signed or unsigned<br />
words.
</p>
<p>
As you can see from the output, the values in the words correspond<br />
to the signedness of the bytes, not the words. Also note that printing<br />
<code>sc</code> (the signed byte) gives 0xFFFFFFFF and not the 0xFF you<br />
initialized it with. Its actual content is still 0xFF, since<br />
0xFFFFFFFF is too large to fit into a byte, but whenever you use it in<br />
an expression, it&#8217;s automatically extended to word-size. This becomes great<br />
fun when you later compare it to 0xFF again.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">// Testing implicit conversions.</span><br />
<span class="kw1">void</span> test_conversion()<br />
{<br />
&nbsp; &nbsp; s8 sc= <span class="nu0">0xFF</span>;&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 8-bit -1 (and 255) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </span><br />
&nbsp; &nbsp; u8 uc= <span class="nu0">0xFF</span>;&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// 8-bit 255 (and -1)</span></p>
<p>&nbsp; &nbsp; s32 sisc= sc, siuc= uc;<br />
&nbsp; &nbsp; u32 uisc= sc, uiuc= uc;</p>
<p>&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot; &nbsp;sc: %4d=%08X ; &nbsp; uc:%4d=%08X<span class="es1">\n</span>&quot;</span>, sc, sc, uc, uc);<br />
&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot;sisc: %4d=%08X ; siuc:%4d=%08X<span class="es1">\n</span>&quot;</span>, sisc, sisc, siuc, siuc);<br />
&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot;uisc: %4d=%08X ; uiuc:%4d=%08X<span class="es1">\n</span>&quot;</span>, uisc, uisc, uiuc, uiuc);<br />
&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot;sc==0xFF : %s<span class="es1">\n</span>&quot;</span>, (sc==<span class="nu0">0xFF</span> ? <span class="st0">&quot;true&quot;</span> : <span class="st0">&quot;false&quot;</span>) );</p>
<p>&nbsp; &nbsp; <span class="coMULTI">/* Output:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sc: &nbsp; -1=FFFFFFFF ; &nbsp; uc: 255=000000FF<br />
&nbsp; &nbsp; &nbsp; &nbsp; sisc: &nbsp; -1=FFFFFFFF ; siuc: 255=000000FF<br />
&nbsp; &nbsp; &nbsp; &nbsp; uisc: &nbsp; -1=FFFFFFFF ; uiuc: 255=000000FF<br />
&nbsp; &nbsp; &nbsp; &nbsp; sc==0xFF : false</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; Warnings issued (for sc=0xFF):<br />
&nbsp; &nbsp; &nbsp; &nbsp; &#8211; warning C4305: &#8216;initializing&#8217; : truncation from &#8216;const int&#8217; to &#8216;signed char&#8217;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &#8211; warning C4309: &#8216;initializing&#8217; : truncation of constant value<br />
&nbsp; &nbsp; */</span><br />
}</div>
</div>
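<p>
If you really do need to compare a signed byte against 0xFF, you have to<br />
undo the sign-extension first. A minimal sketch with made-up function<br />
names; the first function spells out what the plain comparison effectively<br />
does, the other two are possible fixes.
</p>

```cpp
#include <cstdint>

typedef std::int8_t  s8;
typedef std::uint8_t u8;

// What `sc == 0xFF` boils down to: sc is promoted to int first.
bool equals_ff_promoted(s8 sc)
{
    int promoted = sc;          // sign-extended, so -1 stays -1
    return promoted == 0xFF;    // compares -1 with 255
}

// Fix 1: mask off the extended bits before comparing.
bool equals_ff_masked(s8 sc)
{
    return (sc & 0xFF) == 0xFF;
}

// Fix 2: cast to the unsigned byte type, which zero-extends instead.
bool equals_ff_cast(s8 sc)
{
    return static_cast<u8>(sc) == 0xFF;
}
```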
<p><div>&nbsp;</div></p>
<p>
Sign- and zero-extension also play a role in right-shifts. When using<br />
shifts for arithmetic (shift-right is short-hand for a division by<br />
a power of two), you want the sign preserved. For example, when dividing<br />
&minus;16&nbsp;=&nbsp;0xFFFF:FFF0 by 16 (shift-right by 4), you want the result<br />
to be &minus;1 (=0xFFFF:FFFF), and not 268435455 (=0x0FFF:FFFF).<br />
The right-shift that preserves the sign is the<br />
<dfn>arithmetic right-shift</dfn>, and is used for signed numbers.<br />
For unsigned numbers, or if the variable is considered a set of bits<br />
instead of a single number, a <dfn>logical right-shift</dfn> is<br />
appropriate, since that uses zero-extension.
</p>
<p>
In assembly, arithmetic and logical right-shift are called<br />
<code>ASR</code> and <code>LSR</code>, respectively. In Java and other<br />
languages where the keyword <code>unsigned</code> does not exist,<br />
the difference is indicated by <code>&gt;&gt;</code> (sign-extend)<br />
and <code>&gt;&gt;&gt;</code> (zero-extend). In C, however,<br />
both types use the same symbol: <code>&gt;&gt;</code>. As such,<br />
you cannot tell which type of extension is used from just the<br />
expression; you&#8217;d have to look at the signedness of the operands<br />
(including temporaries) to see if it&#8217;s a logical or arithmetic<br />
right-shift.
</p>
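<p>
The two functions below have the exact same body; only the types differ.<br />
This is a sketch with made-up names. Strictly speaking, the result of<br />
right-shifting a negative signed value is implementation-defined before<br />
C++20, but mainstream compilers use an arithmetic shift.
</p>

```cpp
#include <cstdint>

// Signed operand: the compiler picks an arithmetic shift (sign-extend).
std::int32_t shift_signed(std::int32_t x, int n)
{
    return x >> n;
}

// Unsigned operand: the compiler picks a logical shift (zero-extend).
std::uint32_t shift_unsigned(std::uint32_t x, int n)
{
    return x >> n;
}
```

<p>
Feeding both the bit-pattern 0xFFFFFFF0 and shifting by 4 gives &minus;1 for<br />
the signed version and 268435455 for the unsigned one.
</p>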
<div class=cblock>
<table border=0>
<tr>
<td>
<table id="tbl-shift" border=1 cellpadding=2 cellspacing=0>
<caption align=bottom>
  <b>Table&nbsp;2</b>: Right-shifts for different languages.<br />
</caption>
<tr>
<th>Language</th>
<th>Signed</th>
<th>Unsigned</th>
</tr>
<tr>
<th>ARM asm</th>
<td>asr</td>
<td>lsr</td>
</tr>
<tr>
<th>C</th>
<td>&gt;&gt;</td>
<td>&gt;&gt;</td>
</tr>
<tr>
<th>Java(script)</th>
<td>&gt;&gt;</td>
<td>&gt;&gt;&gt;</td>
</tr>
</table>
</td>
<td width=32></td>
<td>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-rshift"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;4. </b>
</div>

</td>
</tr>
</table>
</div>
<p>
This ambiguity of shift symbols in C can be a major source of pain in<br />
fixed-point calculations. Since unsigned takes precedence over signed when the two are mixed, if<br />
you have an unsigned variable at <i>any</i> point of the calculation,<br />
all subsequent calculations are unsigned too and you can kiss negative<br />
numbers goodbye. If everything starts going wrong as soon as you move in<br />
another direction or if rotations aren&#8217;t calculated properly, this will<br />
be the cause.
</p>
<p><div>&nbsp;</div></p>
<p>
The code below illustrates the problem in a very common situation. You<br />
have a position <b>p</b>, and a directional vector for movement,<br />
<b>u</b>. Since you want sub-pixel control of these, you use<br />
fixed-point notation for both (I&#8217;m assuming a non-FPU system<br />
here). The <b>u</b> vector is a unit vector<br />
(say, cos(&alpha;),&nbsp;sin(&alpha;)); to get to the full velocity vector,<br />
we have to multiply <b>u</b> by some speed. The procedure comes down to<br />
something like this:
</p>
<p><table class="eqtbl">
<tr>
<td class="eqnrcell"></td>
  <td class="eqcell"><br />
<b>p</b><sub>new</sub>&nbsp;=&nbsp;<b>p</b><sub>old</sub>&nbsp;+&nbsp;<i>speed</i>&middot;<b>u</b><br />
</td>
</tr>
</table></p>
<p>
In the example, I&#8217;m only considering the <i>x</i>-component for<br />
convenience. Now, because position and direction can have negative<br />
components, those would be signed. The speed, however, is a length<br />
and therefore always positive, so it makes sense to make it unsigned,<br />
right? Well, yes and no. As you can see from the result, mostly no.
</p>
<p>
With <i>speed</i>&nbsp;=&nbsp;+1 and <i>u</i><sub>x</sub>&nbsp;=&nbsp;&minus;1, the<br />
end result should be +1*&minus;1&nbsp;=&nbsp;&minus;1, which would be 0xFFFFFF00<br />
in Q8 fixed-point notation. However, it <i>isn&#8217;t</i>, thanks to the<br />
unsignedness of <code>speed</code>, which makes subsequent arithmetic<br />
unsigned so the right-shift does not sign-extend. So instead of the<br />
small step you intended, you get a giant leap into no man&#8217;s land.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">void</span> test_right_shift()<br />
{<br />
&nbsp; &nbsp; <span class="co1">// Assume movement for 2 directions, with Q8 for everything.</span><br />
&nbsp; &nbsp; <span class="co1">// a = look direction. &nbsp;</span><br />
&nbsp; &nbsp; <span class="co1">// p = (px, py) = position.</span><br />
&nbsp; &nbsp; <span class="co1">// u = (ux, uy) = ( cos(a), sin(a) )</span></p>
<p>&nbsp; &nbsp; <span class="kw1">int</span> &nbsp;px= <span class="nu0">0</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Starting position.</span><br />
&nbsp; &nbsp; <span class="kw1">int</span> &nbsp;ux= -<span class="nu0">1</span>&lt;&lt;<span class="nu0">8</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Moving backwards.</span><br />
&nbsp; &nbsp; uint speed= <span class="nu0">1</span>&lt;&lt;<span class="nu0">8</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Unsigned as speed&#8217;s always &gt;= 0, right?</span></p>
<p>&nbsp; &nbsp; px = px + (speed*ux&gt;&gt;<span class="nu0">8</span>);&nbsp; &nbsp; <span class="co1">// Fixed point motion. Result should be -1&lt;&lt;8.</span></p>
<p>&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot;px : %d=%08X<span class="es1">\n</span>&quot;</span>, px, px);</p>
<p>&nbsp; &nbsp; <span class="coMULTI">/* Result: <br />
&nbsp; &nbsp; &nbsp; &nbsp; px : 16776960=00FFFF00</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; In other words: NOT the -1&lt;&lt;8 = 0xFFFFFF00 you were after.<br />
&nbsp; &nbsp; */</span><br />
}</div>
</div>
<p>
This mistake is depressingly easy to make, even for those who generally<br />
think about which datatype to use. <i>Especially</i> those people, as<br />
they&#8217;re prone to optimize prematurely and automatically pick unsigned<br />
for a variable that will never be negative. The danger is that<br />
unsigned arithmetic takes precedence, which can screw up later<br />
right-shifts.
</p>
<p>
Bottom line: variables used in fixed-point calculations should be<br />
signed. Always.
</p>
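<p>
For comparison, here is a minimal sketch of the same movement step done both ways. The function names are made up for this illustration, the fixed-width types assume a 32-bit <code>int</code>, and the signed version assumes that right-shifting a negative value is an arithmetic shift (true for GCC, but implementation-defined in C).
</p>

```c
#include <stdint.h>

/* Hypothetical helper: move a Q8 position by speed*ux, all signed.
 * Assumes >> on a negative value is an arithmetic shift. */
int32_t q8_move_signed(int32_t px, int32_t ux, int32_t speed)
{
    return px + ((speed * ux) >> 8);
}

/* The mixed variant from the test above. The casts only spell out
 * the conversions the compiler performs implicitly when an unsigned
 * speed meets a signed direction. */
int32_t q8_move_mixed(int32_t px, int32_t ux, uint32_t speed)
{
    uint32_t step = (speed * (uint32_t)ux) >> 8;   /* logical shift */
    return (int32_t)((uint32_t)px + step);
}
```

<p>
With <code>px</code>&nbsp;=&nbsp;0, <code>ux</code>&nbsp;=&nbsp;&minus;256 and <code>speed</code>&nbsp;=&nbsp;256, the signed version returns &minus;256 and the mixed version returns 16776960, the same value the test printed.
</p>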
<p><h3 id="ssec-prob-div">2.2
Division
</h3>
</p>
<p>
This isn&#8217;t really a signed-vs-unsigned item per se, but integer<br />
division behaves in a peculiar way for negative numbers. It becomes<br />
one, however, when you throw right-shift into the fray, as a shift no longer<br />
works as a division equivalent for negative numbers.<br />
To discriminate between integer and normal division, I will use<br />
&lsquo;\&rsquo; for integer division in this section. Note also that the<br />
modulo operation is intimately tied to division, so this section applies<br />
to that as well.
</p>
<p>
What integer division comes down to is taking a normal division and throwing<br />
away the remaining fraction. For example, 7&nbsp;/&nbsp;4&nbsp;=&nbsp;1&frac34;. The<br />
integer division is just 1. This is also true for negative numbers:<br />
&minus;7&nbsp;/&nbsp;4&nbsp;=&nbsp;&minus;1&frac34;, so &minus;7&nbsp;\&nbsp;4&nbsp;=&nbsp;&minus;1. In short,<br />
integer division rounds towards zero. With bit-shifting, however, you get<br />
something slightly different. Theoretically, <i>x</i>&gt;&gt;<i>n</i> is<br />
equivalent to <i>x</i>&nbsp;\&nbsp;2<sup>n</sup>. For positive numbers, this is<br />
true: 7&gt;&gt;2 in binary is<br />
<code><b>000001</b>11</code>&gt;&gt;2&nbsp;=&nbsp;<code>00<b>000001</b></code>.<br />
But with &minus;7&gt;&gt;2 you get<br />
<code>11111001</code>&gt;&gt;2&nbsp;=&nbsp;<code>11<b>111110</b></code>&nbsp;=&nbsp;&minus;2.<br />
Division-by-right-shift always rounds towards negative infinity, at least<br />
with the arithmetic shift that GCC uses for signed values.
</p>
<p>
The upshot of this difference is that for negative numbers, the results<br />
of <i>x</i>&nbsp;\&nbsp;2<sup>n</sup> and <i>x</i>&gt;&gt;<i>n</i> will be out<br />
of sync, as Table&nbsp;3 illustrates. They still give<br />
identical results for positive numbers though.
</p>
<div class=cblock>
<table id="tbl-div-shift" border=1 cellpadding=2 cellspacing=0>
<caption align=bottom>
  <b>Table&nbsp;3</b>: integer and by-shift division by four.<br />
</caption>
<tbody align="right">
<tr>
<th>x (dec)</th>
<th>x \ 4</th>
<th>x&gt;&gt;2 (dec)</th>
<th rowspan=20>&nbsp;</th>
<th>x (bin)</th>
<th>x&gt;&gt;2 (bin)</th>
</tr>
<tr>
<td>-9</td>
<td class=bg0>-2</td>
<td class=bg1>-3</td>
<td>11110111</td>
<td>11111101</td>
</tr>
<tr>
<td>-8</td>
<td class=bg0>-2</td>
<td class=bg0>-2</td>
<td>11111000</td>
<td>11111110</td>
</tr>
<tr>
<td>-7</td>
<td class=bg1>-1</td>
<td class=bg0>-2</td>
<td>11111001</td>
<td>11111110</td>
</tr>
<tr>
<td>-6</td>
<td class=bg1>-1</td>
<td class=bg0>-2</td>
<td>11111010</td>
<td>11111110</td>
</tr>
<tr>
<td>-5</td>
<td class=bg1>-1</td>
<td class=bg0>-2</td>
<td>11111011</td>
<td>11111110</td>
</tr>
<tr>
<td>-4</td>
<td class=bg1>-1</td>
<td class=bg1>-1</td>
<td>11111100</td>
<td>11111111</td>
</tr>
<tr>
<td>-3</td>
<td class=bg0> 0</td>
<td class=bg1>-1</td>
<td>11111101</td>
<td>11111111</td>
</tr>
<tr>
<td>-2</td>
<td class=bg0> 0</td>
<td class=bg1>-1</td>
<td>11111110</td>
<td>11111111</td>
</tr>
<tr>
<td>-1</td>
<td class=bg0> 0</td>
<td class=bg1>-1</td>
<td>11111111</td>
<td>11111111</td>
</tr>
<tr>
<td> 0</td>
<td class=bg0> 0</td>
<td class=bg0> 0</td>
<td>00000000</td>
<td>00000000</td>
</tr>
<tr>
<td> 1</td>
<td class=bg0> 0</td>
<td class=bg0> 0</td>
<td>00000001</td>
<td>00000000</td>
</tr>
<tr>
<td> 2</td>
<td class=bg0> 0</td>
<td class=bg0> 0</td>
<td>00000010</td>
<td>00000000</td>
</tr>
<tr>
<td> 3</td>
<td class=bg0> 0</td>
<td class=bg0> 0</td>
<td>00000011</td>
<td>00000000</td>
</tr>
<tr>
<td> 4</td>
<td class=bg1> 1</td>
<td class=bg1> 1</td>
<td>00000100</td>
<td>00000001</td>
</tr>
<tr>
<td> 5</td>
<td class=bg1> 1</td>
<td class=bg1> 1</td>
<td>00000101</td>
<td>00000001</td>
</tr>
<tr>
<td> 6</td>
<td class=bg1> 1</td>
<td class=bg1> 1</td>
<td>00000110</td>
<td>00000001</td>
</tr>
<tr>
<td> 7</td>
<td class=bg1> 1</td>
<td class=bg1> 1</td>
<td>00000111</td>
<td>00000001</td>
</tr>
<tr>
<td> 8</td>
<td class=bg0> 2</td>
<td class=bg0> 2</td>
<td>00001000</td>
<td>00000010</td>
</tr>
<tr>
<td> 9</td>
<td class=bg0> 2</td>
<td class=bg0> 2</td>
<td>00001001</td>
<td>00000010</td>
</tr>
</tbody>
</table>
</div>
<p>
There are some other consequences besides the obvious difference in<br />
results. First, there&#8217;s how compilers deal with it. Compilers are very<br />
well aware that a bit-shift is faster than division and one of the<br />
optimizations they perform is replacing divisions by shifts<br />
where appropriate<span class="fnote"><a href="#ft-nr3" title="And please let the compiler do its job in this
regard: the low operator-precedence of shifts makes their use awkward and
error-prone. If you mean division, then use division.">(3)</a></span>. For<br />
unsigned numbers the division will be replaced by a single shift.<br />
However, for signed variables some extra instructions have to be added to<br />
correct the difference in rounding.
</p>
<p>
Second, note that the standard integer division does not give an equal<br />
distribution of results: there are more results in the zero-bin. In<br />
Table&nbsp;3, seven inputs (&minus;3 to 3) map to 0, while every other result<br />
gets only four. Shift-division spreads the results around evenly. In some<br />
cases, you will want to use the shift version for that reason. One clear<br />
example of this would be tiling: using the &lsquo;proper&rsquo; integer<br />
division would give you odd-looking results.
</p>
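<p>
To make the tiling case concrete, here is a small sketch that maps a pixel coordinate to a tile index both ways. The 8-pixel tile size and the function names are assumptions for this example only, and the shift version again relies on an arithmetic right-shift.
</p>

```c
/* Hypothetical 8-pixel tiles; TILE_SHIFT is log2 of the tile size. */
#define TILE_SHIFT 3
#define TILE_SIZE  (1 << TILE_SHIFT)

/* Integer division: rounds towards zero, so pixels -7..7 all
 * land in tile 0 and that tile ends up nearly twice as wide. */
int tile_index_div(int x)
{
    return x / TILE_SIZE;
}

/* Shift: rounds towards negative infinity, so every tile covers
 * exactly TILE_SIZE pixels, also left of the origin. */
int tile_index_shift(int x)
{
    return x >> TILE_SHIFT;
}
```

<p>
Pixel &minus;1 belongs to tile &minus;1 according to the shift, but to tile 0 according to the division, which is where the odd-looking results come from.
</p>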
<p><div class=note id="nt-div-shift">
<div  class=nh>Negative number division / right-shift equivalents</div>
</p>
<p>
Table&nbsp;3 shows that for negative numbers, integer<br />
division and right-shift don&#8217;t give the same results. If you do want<br />
the same results, the following equations can be used. Given<br />
<i>x</i>&nbsp;&lt;&nbsp;0 and <i>N</i>&nbsp;=&nbsp;2<sup>n</sup>, then
</p>
<p><table class="eqtbl">
<tr>
<td class="eqnrcell"></td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20x%20%5Cbackslash%20N%20%26%3D%26%20%28x%20%2B%20%28N-1%29%29%20%3E%3E%20n%20%5C%5C%20%5C%5C%20x%3E%3En%20%26%3D%26%20%28x%20-%20%28N-1%29%29%20%5Cbackslash%20N%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} x \backslash N &amp;=&amp; (x + (N-1)) &gt;&gt; n \\ \\ x&gt;&gt;n &amp;=&amp; (x - (N-1)) \backslash N \end{eqnarray}"<br />
	alt="\begin{eqnarray} x \backslash N &amp;=&amp; (x + (N-1)) &gt;&gt; n \\ \\ x&gt;&gt;n &amp;=&amp; (x - (N-1)) \backslash N \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>GCC will use the <i>x</i>\<i>N</i> identity to implement signed<br />
integer division by a power of two with shifts where possible.</p>
<p></div>
</p>
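<p>
If you would rather not take the two identities in the note on faith, a brute-force check is easy to write. The sketch below is only a sanity test under the same assumptions as before (arithmetic right-shift, division truncating towards zero as in C99); the function name is made up.
</p>

```c
/* Count how often the two identities from the note fail for
 * negative x in [-range, -1] and N = 2^n with n in [1, max_n]. */
int count_identity_mismatches(int range, int max_n)
{
    int mismatches = 0;

    for (int n = 1; n <= max_n; n++) {
        int N = 1 << n;
        for (int x = -range; x < 0; x++) {
            /* x \ N  ==  (x + (N-1)) >> n */
            if (x / N != ((x + (N - 1)) >> n))
                mismatches++;
            /* x >> n  ==  (x - (N-1)) \ N */
            if ((x >> n) != (x - (N - 1)) / N)
                mismatches++;
        }
    }
    return mismatches;
}
```

<p>
A call like <code>count_identity_mismatches(1000, 8)</code> should come back with zero.
</p>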
<p><h3 id="ssec-prob-cmp">2.3
Comparisons
</h3>
</p>
<p>
The last area where signedness can be a factor is comparisons. The<br />
next bit of code is from my implementation of a filled circle renderer<br />
with boundary clipping. The circle is centered on<br />
(<i>x</i><sub>0</sub>,&nbsp;<i>y</i><sub>0</sub>). Variables<br />
<code>x</code> and <code>y</code> are local variables that<br />
keep track of where we are on the circle; because these<br />
can be negative, they must be signed. Variables <code>dstW</code><br />
and <code>dstH</code> are the destination image&#8217;s width and<br />
height. Since width and height are unsigned by definition,<br />
it&#8217;d make sense to make these unsigned, right? Right?
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">//# Part of a clipped filled circle renderer that didn&#8217;t quite work.</span></p>
<p>&nbsp; &nbsp; <span class="kw1">int</span> dstP= srf-&gt;pitch/<span class="nu0">2</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// used in arithmetic, so signed.</span><br />
&nbsp; &nbsp; uint dstW= srf-&gt;width, dstH= srf-&gt;height; &nbsp; <span class="co1">// Unsigned by definition.</span><br />
&nbsp; &nbsp; u16 *dstD= ((u16*)srf-&gt;data)+(y0*dstP);<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="kw1">int</span> x=<span class="nu0">0</span>, y= rad, d= <span class="nu0">1</span>-rad, left, right;</p>
<p>&nbsp; &nbsp; &#8230;<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Side octants</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; left= x0-y;<br />
&nbsp; &nbsp; &nbsp; &nbsp; right= x0+y;<br />
&nbsp; &nbsp; &nbsp; &nbsp; <b><span class="kw1">if</span>(right&gt;=<span class="nu0">0</span> &amp;&amp; left&lt;=dstW)</b> &nbsp; &nbsp; &nbsp; <span class="co1">// Fully out of bounds</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span>(left&lt;<span class="nu0">0</span>)&nbsp; &nbsp; &nbsp; left= <span class="nu0">0</span>;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Clip left</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span>(right&gt;=dstW) right= dstW-<span class="nu0">1</span>;&nbsp; &nbsp; &nbsp; <span class="co1">// Clip right</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Render at scanlines y0-x and y0+x</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span>(inRange(y0-x, <span class="nu0">0</span>, dstH))<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; armset16(color, &amp;dstD[-x*dstP+left], <span class="nu0">2</span>*(right-left+<span class="nu0">1</span>));<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span>(inRange(y0+x, <span class="nu0">0</span>, dstH))<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; armset16(color, &amp;dstD[+x*dstP+left], <span class="nu0">2</span>*(right-left+<span class="nu0">1</span>));<br />
&nbsp; &nbsp; &nbsp; &nbsp; }<br />
&nbsp; &nbsp; &#8230;</div>
</div>
<p>
Well, apparently not. When I tested this, right and bottom edge<br />
clipping went fine, but when the circle went over the top or<br />
left edge, it disappeared completely.
</p>
<p>
The problem lies with the line in bold, which does the trivial rejection<br />
test. Variables <code>left</code> and <code>right</code> are the left and<br />
right-most edges of the scanline of the circle. If this is completely<br />
to the left of the screen (<code>right</code>&nbsp;&lt;&nbsp;0) or to<br />
the right of the screen (<code>left</code>&nbsp;&ge;&nbsp;<code>dstW</code>)<br />
then there&#8217;s nothing to do. </p>
<p>Technically, the tests on that line are correct, so the code<br />
<i>should</i> work.<br />
The reason it doesn&#8217;t actually occurs a few lines earlier: the<br />
definition of <code>dstW</code> as an unsigned variable. Because of<br />
this, the second condition is an unsigned comparison. Now think of<br />
what happens when <code>left</code> moves over the left of the<br />
screen. <code>left</code> becomes a (small) negative number,<br />
which is converted to a positive number for the comparison.<br />
A <i>large positive</i> number for that matter &ndash; one that&#8217;s<br />
quite a bit larger than the width of the image and as a result<br />
the routine thinks the circle is out of bounds.
</p>
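<p>
Stripped of the circle code, the rejection test boils down to something like the sketch below. The function names are hypothetical; the explicit cast in the first version only writes out what the compiler does silently when a signed <code>left</code> meets an unsigned <code>dstW</code>.
</p>

```c
/* Hypothetical visibility tests for a span [left, right] against
 * an image that is dstW pixels wide. Returns 1 if (partly) visible. */

/* Mixed version: left is compared as an unsigned number, so a
 * small negative left turns into a huge positive one. */
int span_visible_mixed(int left, int right, unsigned int dstW)
{
    return right >= 0 && (unsigned int)left <= dstW;
}

/* Signed version: convert the width once, then compare as ints. */
int span_visible_signed(int left, int right, unsigned int dstW)
{
    return right >= 0 && left <= (int)dstW;
}
```

<p>
For a span from &minus;5 to 10 on a 256-pixel-wide image, the mixed version reports &lsquo;not visible&rsquo; while the signed one correctly reports that part of it is on screen.
</p>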
<p>
So again, a routine went all wonky because I assumed that, since<br />
a width is always positive, using an unsigned variable would be<br />
a good idea.
</p>
<p><div>&nbsp;</div></p>
<p>
The worst part of this particular bit, however, is that I should have<br />
known this. The compiler actually issues a warning for this type of<br />
thing:
</p>
<p><blockquote>
<br />
warning: comparison between signed and unsigned integer expressions<br />

</blockquote>
</p>
<p>
Or at least it <i>would</i> have if I hadn&#8217;t disabled the warning<br />
because the message was cropping up everywhere in my normal and sign-safe<br />
for-loops. Let this be a lesson: disable warnings at your own risk<br />
and for Offler&#8217;s sake do <i>not</i> ignore them.
</p>
<p><h3 id="ssec-prob-duh">2.4
Well, duh
</h3>
</p>
<p>
The problems covered above are the subtle ones, where you have to be<br />
aware of some of the details that go into the C language itself. There<br />
are also a few issues where the programmer really should have known<br />
they were going to be a problem from the start.
</p>
<p><div>&nbsp;</div></p>
<p>
The first example is, again, one that can occur when optimizing<br />
prematurely. You may have heard that loops work better when you count<br />
down instead of count up, because in machine code a subtraction is an<br />
automatic comparison to zero. So, a clever programmer may turn this:
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">uint i; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Unsigned, since it&#8217;s always positive.</span><br />
<span class="kw1">for</span>(i=<span class="nu0">0</span>; i&lt;size; i++)<br />
{<br />
&nbsp; &nbsp; <span class="co1">// Do whatever</span><br />
}</div>
</div>
<p>into this:</p>
<div class="cpp">
<div class="cpp proglist" style=" ">uint i; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Unsigned, since it&#8217;s always positive. Right?</span><br />
<span class="kw1">for</span>(i=size-<span class="nu0">1</span>; i&gt;=<span class="nu0">0</span>; i--)<br />
{<br />
&nbsp; &nbsp; <span class="co1">// Do whatever</span><br />
}</div>
</div>
<p>
There are two problems with this code. First, the change probably will<br />
not matter with modern compilers because they are aware of the<br />
equivalence and can do this conversion themselves<span class="fnote"><a href="#ft-nr4" title="Although they
may well do it
incorrectly: turning the decrementing loop into an incrementing one.
Point is, the compiler may not follow exactly what you&#8217;re doing
anyway.">(4)</a></span>, so there&#8217;s nothing to gain from this.
</p>
<p>
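The real problem, however, is the terminating condition:<br />
&lsquo;<code>i&gt;=0</code>&rsquo;. Since <code>i</code> is unsigned, it can<br />
never be negative, and therefore the condition is always true: decrementing<br />
past zero wraps around to the largest unsigned value and the loop never ends.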
</p>
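<p>
If you do want a count-down loop with an unsigned index, one common way to write it is to test <i>before</i> decrementing. This is just a sketch with a made-up function name, not something the compiler needs from you.
</p>

```c
#include <stddef.h>

/* Sum an array back to front using an unsigned index.
 * 'i-- > 0' compares the old value and then decrements, so inside
 * the body i runs from size-1 down to 0 and the loop stops cleanly.
 * With size == 0 the body is skipped altogether. */
long sum_reversed(const int *data, size_t size)
{
    long total = 0;

    for (size_t i = size; i-- > 0; )
        total += data[i];

    return total;
}
```

<p>
The index does still wrap around once the loop exits, but by then nothing uses it anymore.
</p>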
<p><div>&nbsp;</div></p>
<p>
The second example involves bitfields. As it happens, bitfields can be<br />
signed or unsigned as well. For the most part, handling this is like<br />
handling normal signedness, but there is one situation where you have<br />
to be careful.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">void</span> test_bitfield()<br />
{<br />
&nbsp; &nbsp; <span class="kw1">struct</span> Foo {<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">int</span> &nbsp; &nbsp; s7 : <span class="nu0">7</span>; &nbsp; &nbsp; <span class="co1">// 7-bit signed</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; uint&nbsp; &nbsp; u7 : <span class="nu0">7</span>; &nbsp; &nbsp; <span class="co1">// 7-bit unsigned</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">int</span> &nbsp; &nbsp; s1 : <span class="nu0">1</span>; &nbsp; &nbsp; <span class="co1">// 1-bit signed</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; uint&nbsp; &nbsp; u1 : <span class="nu0">1</span>; &nbsp; &nbsp; <span class="co1">// 1-bit unsigned</span><br />
&nbsp; &nbsp; };</p>
<p>&nbsp; &nbsp; Foo f= { -<span class="nu0">1</span>, -<span class="nu0">1</span>, <span class="nu0">1</span>, <span class="nu0">1</span> };</p>
<p>&nbsp; &nbsp; <span class="kw3">printf</span>(<span class="st0">&quot;s7: %3d<span class="es1">\n</span>u7: %3d<span class="es1">\n</span>s1: %3d<span class="es1">\n</span>u1: %3d<span class="es1">\n</span><span class="es1">\n</span>&quot;</span>, f.s7, f.u7, f.s1, f.u1);<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="coMULTI">/*&nbsp; Results:<br />
&nbsp; &nbsp; &nbsp; &nbsp; s7: &nbsp;-1 &nbsp; &nbsp; // Inited to -1<br />
&nbsp; &nbsp; &nbsp; &nbsp; u7: 127 &nbsp; &nbsp; // Inited to -1<br />
&nbsp; &nbsp; &nbsp; &nbsp; <b>s1: &nbsp;-1&nbsp; &nbsp; &nbsp;// Inited to &nbsp;1</b><br />
&nbsp; &nbsp; &nbsp; &nbsp; u1: &nbsp; 1 &nbsp; &nbsp; // Inited to &nbsp;1<br />
&nbsp; &nbsp; */</span><br />
}</div>
</div>
<p>
In the code above I&#8217;ve created a bit-fielded struct with both<br />
signed and unsigned members. There are two 7-bit fields and two<br />
1-bit fields, and these are initialized to &minus;1 and +1,<br />
respectively. The values are then printed.
</p>
<p>
The 7-bit fields work as you might expect. <code>f.s7</code> is<br />
&minus;1, as it&#8217;s signed, and <code>f.u7</code> is 127, which is the<br />
7-bit equivalent of &minus;1. The interesting case is for<br />
<code>f.s1</code>. This is initialized to 1, but comes out as<br />
&minus;1, because for a single signed bit the possibilities<br />
are 0 and &minus;1, and <i>not</i> 0 and +1! Without this knowledge,<br />
a later test like &lsquo;<code>f.s1==1</code>&rsquo; will quietly always be false.
</p>
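<p>
One way to stay out of trouble with single-bit flags is to only ever compare them against zero. The struct and helper names below are hypothetical, and I&#8217;ve spelled out <code>signed int</code> because whether a plain <code>int</code> bitfield is signed is up to the compiler.
</p>

```c
/* Hypothetical flags struct with one signed and one unsigned bit. */
struct Flags {
    signed int   s1 : 1;    /* can hold  0 or -1 */
    unsigned int u1 : 1;    /* can hold  0 or  1 */
};

/* Testing against zero works for both kinds of field;
 * testing against 1 would only ever work for u1. */
int s1_is_set(const struct Flags *f)
{
    return f->s1 != 0;
}

int u1_is_set(const struct Flags *f)
{
    return f->u1 != 0;
}
```

<p>
Making the field unsigned in the first place is, of course, the simpler fix for a flag.
</p>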
<p><h2 id="sec-summary">3
Summary
</h2>
</p>
<p>
So, summarizing:
</p>
<ul>
<li>
    Unsigned variables only represent non-negative numbers; signed ones<br />
	can have positive or negative values.<br />
    Negative numbers are usually represented via two&#8217;s complement,<br />
	which is based on the cyclical nature of counters when you have<br />
	a limited number of digits.
  </li>
<li>
    In C, integers are signed unless specified otherwise, except<br />
	for <code>char</code>, whose signedness is compiler dependent.
  </li>
<li>
    Careless use of signed and unsigned types can result in subtle<br />
	runtime bugs with not-so-subtle results. Usually, what happens<br />
	is that a negative number is reinterpreted as a very large<br />
	positive number and everything goes banana-shaped.
  </li>
<li>
    <b>Unsigned wins over signed</b>. This is a matter of type<br />
	conversion, not operator precedence: if a signed operand meets an<br />
	unsigned one of the same or larger size, it is converted and the<br />
	operation will use unsigned arithmetic. This can cause problems for<br />
	divisions, modulos, right-shifts <i>and</i> comparisons.
  </li>
<li>
    For negative numbers, division/modulo by 2<sup>n</sup> is not<br />
	quite the same as right-shifts/ANDs. Analyse which is best for<br />
    your situation, then act accordingly.
  </li>
<li>
    Ignore compiler warnings at your own peril.
  </li>
<li>
    The place where a bug manifests is not always the place where it<br />
	originates. The declaration of variables matters! Do not forget<br />
	this when debugging or when asking for assistance.
  </li>
</ul>
<p><div>&nbsp;</div></p>
<p>
There isn&#8217;t really a hard rule on when to use which signedness, but<br />
here are a few guidelines nonetheless.
</p>
<ul>
<li>
    If a variable can, in principle, have negative values, make it<br />
	signed. If it represents a physical quantity (position, velocity,<br />
	mass, etc), make it signed.
  </li>
<li>
    A variable that represents logical values (bools, pixels, colors,<br />
	raw data) should probably be unsigned.
  </li>
<li>
    And now the big one: just because a variable will always be<br />
	positive doesn&#8217;t mean it should be unsigned. Yes, you may waste<br />
	half the range, but using signed variables is usually safer. If<br />
	you must have the larger range (for the smaller datatypes, for<br />
	example), consider defining the storage variables unsigned, but<br />
	convert them to local signed ints when you&#8217;re really going to<br />
	use them.
  </li>
<li>
	If mathematical symbols were gods, the minus sign would be<br />
	
<a href="http://en.wikipedia.org/wiki/Loki">Loki</a>. Be extra careful when you encounter one.<br />
	If there are minus signs <i>anywhere</i> in the algorithm, or even<br />
	the potential for negative numbers, <i>everything</i> should be<br />
	done with signed numbers.</p>
</li>
</ul>
<p> <!--</p>
<ul>
<li>
    A computer will do what you <i>tell</i> it to do. Make sure this<br />
	corresponds with what you <i>want</i> it to do.
  </li>
</ul>
<p>--></p>
<hr /><div class="footnotes">
<h5>Notes:</h5>
<ol>
<li id="ft-nr1"> 
  Or any 10&#8217;s complement, really.
</li>
<li id="ft-nr2"> 
  One could say that zero-extension is just a<br />
form of sign-extension; it&#8217;s just that the sign for an unsigned number<br />
is always positive.
</li>
<li id="ft-nr3"> 
  And please let the compiler do its job in this<br />
regard: the low operator-precedence of shifts makes their use awkward and<br />
error-prone. If you mean division, then use division.
</li>
<li id="ft-nr4"> 
  Although they<br />
may well do it<br />
incorrectly: turning the decrementing loop into an incrementing one.<br />
Point is, the compiler may not follow exactly what you&#8217;re doing<br />
anyway.
</li>
</ol>
</div>
<hr />
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2009/08/signs-from-hell/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Another fast fixed-point sine approximation</title>
		<link>http://www.coranac.com/2009/07/sines/</link>
		<comments>http://www.coranac.com/2009/07/sines/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 20:19:05 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[fixed point]]></category>
		<category><![CDATA[nds]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[sine]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=87</guid>
		<description><![CDATA[
Gaddammit! &#160; So here I am, looking forward to a nice quiet weekend; hang back, watch some telly and maybe read a bit &#8211; but NNnnneeeEEEEEUUUuuuuuuuu!! Someone had to write an interesting article about sine approximation. With a challenge at the end. And using an inefficient kind of approximation. And so now, instead of just [...]]]></description>
			<content:encoded><![CDATA[
<p></p>
<p>
Gad<i>dammit</i>!
</p>
<p><div>&nbsp;</div></p>
<p>
So here I am, looking forward to a nice quiet weekend; hang<br />
back, watch some telly and maybe read a bit &ndash; but<br />
<i>NNnnneeeEEEEEUUUuuuuuuuu!!</i> <i>Someone</i> had to write an interesting<br />
<a href="http://www.console-dev.de/2009/07/06/sine-approximation-with-fixed-point-math/" rel="pingback">article about sine approximation</a>.<br />
With a <i>challenge</i> at the end. <i>And</i> using an inefficient kind<br />
of approximation. And so now, instead of just relaxing, I have to spend<br />
my entire weekend <i>and</i> most of the week figuring out a better way<br />
of doing it. I hate it when this happens <kbd>&gt;_&lt;</kbd>.
</p>
<p><div>&nbsp;</div></p>
<p>
Okay, maybe not.
</p>
<p><div>&nbsp;</div></p>
<p>
Sarcasm aside, it is an interesting read. While the standard way of<br />
calculating a sine &ndash; via a look-up table &ndash; works and<br />
works well, there&#8217;s just something unsatisfying about it. The<br />
LUT-based approach is just &hellip; dull.<br />
Uninspired. Cowardly. <i>Inelegant</i>.<br />
In contrast, finding a suitable algorithm for it requires effort and a<br />
modicum of creativity, so something like that always piques my interest.
</p>
<p>
In this case it&#8217;s sine approximation. I&#8217;d been wondering about that<br />
when I did my <a href="http://www.coranac.com/documents/arctangent">arctan article</a>,<br />
but figured it would require too many terms to really be worth<br />
the effort. But looking at Mr Schraut&#8217;s post (whose site you should be<br />
visiting from time to time too; there&#8217;s good stuff there) it seems<br />
you can get a decent version quite rapidly. The article centers around<br />
the work found at<br />
<a href="http://www.devmaster.net/forums/showthread.php?t=5784">devmaster thread<br />
5784</a>, which derived the following two equations:
</p>
<p><table class="eqtbl" id="eq-lab">
<tr>
<td class="eqnrcell">(1)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_2%28x%29%20%26%3D%26%20%5Cfrac4%5Cpi%20x%20-%20%5Cfrac4%7B%5Cpi%5E2%7D%20x%5E2%20%5C%5C%20%5C%5C%20S_%7B4d%7D%28x%29%20%26%3D%26%20%281-P%29S_2%28x%29%20%2B%20P%20S_2%5E2%28x%29%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_2(x) &amp;=&amp; \frac4\pi x - \frac4{\pi^2} x^2 \\ \\ S_{4d}(x) &amp;=&amp; (1-P)S_2(x) + P S_2^2(x) \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_2(x) &amp;=&amp; \frac4\pi x - \frac4{\pi^2} x^2 \\ \\ S_{4d}(x) &amp;=&amp; (1-P)S_2(x) + P S_2^2(x) \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
These approximations work quite well, but I feel that they actually<br />
use the wrong starting point. There are alternative approximations<br />
that give more accurate results at nearly no extra cost in<br />
complexity. In this post, I&#8217;ll derive higher-order alternatives for<br />
both. In passing, I&#8217;ll also talk about a few of the tools that can<br />
help analyse functions and, of course, provide some source code and<br />
do some comparisons.
</p>
<p><span id="more-87"></span></p>
<p><ul>
  <li> <a href="#sec-theory">1
Theory
</a> </li>
  <li> <a href="#sec-prod">2
Derivations and implementations
</a> </li>
  <li> <a href="#sec-test">3
Testing
</a> </li>
  <li> <a href="#sec-summary">4
Summary and final thoughts
</a> </li>
</ul>
</p>
<p><h2 id="sec-theory">1
Theory
</h2>
</p>
<p><h3 id="ssec-try-symmetry">1.1
Symmetry
</h3>
</p>
<p>
The first analytical tool is symmetry. Symmetry is actually one of the<br />
most powerful concepts ever conceived. Symmetry of time leads to the<br />
conservation of energy; symmetry of space leads to conservation of<br />
momentum; in a 3D world, symmetry of direction gives rise to the<br />
inverse square law. In many cases, symmetry basically defines the kinds<br />
of functions you&#8217;re looking for.
</p>
<p>
One kind of symmetry is parity, and functions can have parity as well.<br />
Take any function <i>f</i>(<i>x</i>). A function is <dfn>even</dfn> if<br />
<i>f</i>(&minus;<i>x</i>)&nbsp;=&nbsp;<i>f</i>(<i>x</i>); it is <dfn>odd</dfn><br />
if <i>f</i>(&minus;<i>x</i>)&nbsp;=&nbsp;&minus;<i>f</i>(<i>x</i>).
</p>
<p>
This may not sound impressive, but a function&#8217;s parity can be a great<br />
source of information and a way of error checking. For example, the<br />
product of two odd or even functions is an even function, and an<br />
odd-even product is odd (compare positive/negative number products).<br />
If in a calculation you notice this doesn&#8217;t hold true, then you know<br />
there&#8217;s an error somewhere.
</p>
<p>
Symmetry can also significantly reduce the amount of work you need<br />
to do. Take the next integral, for example.
</p>
<p><table class="eqtbl" id="eq-sym">
<tr>
<td class="eqnrcell">(2)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?y%20%3D%20%5Cint_%7B-N%7D%5EN%20sin%5E7%28x%5E3%29%20%2B%20%5Cfrac%7Bx%5E5%7D%7Bx%5E2%2B1%7D%20-%20x%20e%5E%7B%5Cfrac%7Bx%5E2%7D%7B2%5Csigma%5E2%7D%7D%20dx'<br />
	title="y = \int_{-N}^N sin^7(x^3) + \frac{x^5}{x^2+1} - x e^{\frac{x^2}{2\sigma^2}} dx"<br />
	alt="y = \int_{-N}^N sin^7(x^3) + \frac{x^5}{x^2+1} - x e^{\frac{x^2}{2\sigma^2}} dx" /><br />
</td>
</tr>
</table></p>
<p>
If you find something like this in the wild or on a test, your first<br />
thought might be &ldquo;WTF?!?&rdquo; (assuming you don&#8217;t run away<br />
screaming). As it happens, <i>y</i>&nbsp;=&nbsp;0, for reasons of symmetry. The<br />
function is odd, so the parts left and right of <i>x</i>&nbsp;=&nbsp;0 cancel out.<br />
Instead of actually trying to do the whole calculation, you can just<br />
write down the answer in one line: &ldquo;0, cuz of symmetry&rdquo;.
</p>
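<p>
If you don&#8217;t trust the symmetry argument, you can always let the computer grind through it. The sketch below numerically integrates one of the odd terms from Eq&nbsp;2 over a symmetric interval with a plain midpoint rule; the function names and the choice of term are mine, purely for illustration.
</p>

```c
/* One of the odd terms from Eq 2: x^5 / (x^2 + 1). */
double odd_term(double x)
{
    return (x * x * x * x * x) / (x * x + 1.0);
}

/* Midpoint-rule integral of f over [-N, N] using 'steps' slices.
 * The sample points are mirrored around x = 0, so for an odd f
 * the left and right halves cancel up to rounding error. */
double integrate_symmetric(double (*f)(double), double N, int steps)
{
    double h = 2.0 * N / steps;
    double sum = 0.0;

    for (int i = 0; i < steps; i++)
        sum += f(-N + (i + 0.5) * h) * h;

    return sum;
}
```

<p>
For something like <code>integrate_symmetric(odd_term, 3.0, 1000)</code> the result should be zero to within floating-point noise, whereas either half on its own is nowhere near zero.
</p>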
<p>
Another property of symmetrical functions is that, if you break them<br />
down into series expansions, odd functions will only have odd terms,<br />
and even functions only have even terms. This becomes important in<br />
the next subsection.
</p>
<p><h3 id="ssec-try-series">1.2
Polynomial and Taylor expansions
</h3>
</p>
<p>
Every reasonably well-behaved function can be broken down into a sum of<br />
more manageable functions. One fairly obvious choice for these<br />
sub-functions is increasing powers of <i>x</i>: polynomials. The most<br />
common of these is the 
<a href="http://en.wikipedia.org/wiki/Taylor%20series">Taylor series</a>, which uses<br />
a reference point (<i>a</i>,&nbsp;<i>f</i>(<i>a</i>)) and extrapolates<br />
to another point some distance <i>h</i> away by using the<br />
derivatives of <i>f</i> at the reference point. In equation form,<br />
it looks like this:
</p>
<p><table class="eqtbl" id="eq-taylor-def">
<tr>
<td class="eqnrcell">(3)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20f%28a%2Bh%29%20%26%3D%26%20f%28a%29%20%2B%20f%27%28a%29%20h%20%2B%20%5Cfrac%7Bf%27%27%28a%29%7D%7B2%7Dh%5E2%20%2B%20%5Cfrac%7Bf%27%27%27%28a%29%7D%7B6%7D%20h%5E3%20%2B%20...%20%5C%5C%20%5C%5C%20%5C%5C%20%26%3D%26%20%5Csum_%7Bn%3D0%7D%20%5Cfrac%7Bf%5E%7B%28n%29%7D%28a%29%7D%7Bn%21%7Dh%5En%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} f(a+h) &amp;=&amp; f(a) + f&#039;(a) h + \frac{f&#039;&#039;(a)}{2}h^2 + \frac{f&#039;&#039;&#039;(a)}{6} h^3 + ... \\ \\ \\ &amp;=&amp; \sum_{n=0} \frac{f^{(n)}(a)}{n!}h^n \end{eqnarray}"<br />
	alt="\begin{eqnarray} f(a+h) &amp;=&amp; f(a) + f&#039;(a) h + \frac{f&#039;&#039;(a)}{2}h^2 + \frac{f&#039;&#039;&#039;(a)}{6} h^3 + ... \\ \\ \\ &amp;=&amp; \sum_{n=0} \frac{f^{(n)}(a)}{n!}h^n \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
Chances are you&#8217;ve actually used part of the Taylor series in game<br />
programming. On implementing movement with acceleration, you&#8217;ll<br />
often see something like Eq&nbsp;4. These are the<br />
first three terms of the Taylor expansion.
</p>
<p><table class="eqtbl" id="eq-taylor-xva">
<tr>
<td class="eqnrcell">(4)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?x_%7Bnew%7D%20%3D%20x_%7Bold%7D%20%5C%3A%2B%5C%3A%20v%20%5CDelta%20t%20%5C%3A%2B%5C%3A%20%5Cfrac12%20a%20%28%5CDelta%20t%29%5E2'<br />
	title="x_{new} = x_{old} \:+\: v \Delta t \:+\: \frac12 a (\Delta t)^2"<br />
	alt="x_{new} = x_{old} \:+\: v \Delta t \:+\: \frac12 a (\Delta t)^2" /><br />
</td>
</tr>
</table></p>
<p>
If the step-size (<i>h</i> in Eq&nbsp;3 and<br />
&Delta;<i>t</i> in Eq&nbsp;4) is small, the<br />
higher-order terms will have less effect on the end result. This<br />
allows you to cut the expansion short at some point. This leaves<br />
you with a short equation that you do the calculations with and<br />
some sort of error term, composed of the part you have removed.<br />
The error term is usually linked to the order you&#8217;ve truncated<br />
the series at; the higher the order, the more accurate the<br />
approximation.
</p>
<p><table class="eqtbl" id="eq-taylor-error">
<tr>
<td class="eqnrcell">(5)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?f%28a%2Bh%29%20%3D%20f%28a%29%20%2B%20f%27%28a%29%20h%20%2B%20%5Cfrac%7Bf%27%27%28a%29%7D%7B2%7Dh%5E2%20%2B%20%5Cfrac%7Bf%27%27%27%28a%29%7D%7B6%7D%20h%5E3%20%2B%20O%28h%5E4%29'<br />
	title="f(a+h) = f(a) + f&#039;(a) h + \frac{f&#039;&#039;(a)}{2}h^2 + \frac{f&#039;&#039;&#039;(a)}{6} h^3 + O(h^4)"<br />
	alt="f(a+h) = f(a) + f&#039;(a) h + \frac{f&#039;&#039;(a)}{2}h^2 + \frac{f&#039;&#039;&#039;(a)}{6} h^3 + O(h^4)" /><br />
</td>
</tr>
</table><div>&nbsp;</div></p>
<p>
If you work out the math for a sine Taylor series, with <i>a</i>&nbsp;=&nbsp;0<br />
as the reference point, you end up with Eq&nbsp;6.
</p>
<p><table class="eqtbl" id="eq-taylor-sine">
<tr>
<td class="eqnrcell">(6)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?sin%28h%29%20%3D%20h%20%5C%2C-%5C%2C%20%5Cfrac16%20h%5E3%20%5C%2C%2B%5C%2C%20%5Cfrac1%7B5%21%7D%20h%5E5%20%5C%2C-%5C%2C%20%5Cfrac1%7B7%21%7D%20h%5E7%20%5C%2C%2B%5C%2C%20...'<br />
	title="sin(h) = h \,-\, \frac16 h^3 \,+\, \frac1{5!} h^5 \,-\, \frac1{7!} h^7 \,+\, ..."<br />
	alt="sin(h) = h \,-\, \frac16 h^3 \,+\, \frac1{5!} h^5 \,-\, \frac1{7!} h^7 \,+\, ..." /><br />
</td>
</tr>
</table></p>
<p>
Note that all the even powers are conspicuously absent. This is what<br />
I meant by symmetry being useful: a sine function is odd, therefore<br />
only odd terms are needed in the expansion. But there&#8217;s more to it<br />
than that. The accuracy is given by the highest order in the<br />
approximating polynomial. This shows that there&#8217;s just no point in even<br />
starting with any even-powered polynomial, because you can get one extra<br />
order basically for free!
</p>
<p>
This is why using a quadratic approximation for a sine is somewhat<br />
useless; a cubic will have two terms as well, and be more accurate to<br />
boot. Just because it&#8217;s curved doesn&#8217;t mean a parabola is the most<br />
suitable approximation.
</p>
<p><h3 id="ssec-try-fit">1.3
Curve fitting (and a 3rd order example)
</h3>
</p>
<p>
Using the Taylor series as a basis for a sine approximation is nice,<br />
but it also has a problem. The series is meant to have an infinite<br />
number of terms and when you truncate the series, you will lose<br />
some accuracy. Of course, this was to be expected, but this isn&#8217;t<br />
the real problem; the real problem is that if your function<br />
has some crucial points it <i>must</i> pass through (which is<br />
certainly true for trigonometry functions), the truncation will<br />
move the curve away from those points.
</p>
<p>
To fix this, you need to use a polynomial with as-yet unknown<br />
coefficients (that is, multipliers to the powers) and a set of<br />
conditions that need to be satisfied. These conditions will determine<br />
the exact value of the coefficients. The Taylor expansion can serve<br />
as the basis for your initial approximation, and the final terms<br />
should be pretty close to the Taylor coefficients.
</p>
<p><div>&nbsp;</div></p>
<p>
Let&#8217;s try this for a third-order (cubic) sine approximation.<br />
Technically, a third-order polynomial means four unknowns, <i>but</i>,<br />
since the sine is odd, all the coefficients for the even powers<br />
are zero. That takes care of half the coefficients already. I told<br />
you symmetry was useful <kbd>:)</kbd>. The starting polynomial is<br />
reduced to Eq&nbsp;7, which has two coefficients<br />
<i>a</i> and <i>b</i> that have to be determined. For good measure<br />
I&#8217;ve also added the derivative, as that&#8217;s often useful to have as<br />
well.
</p>
<p><table class="eqtbl" id="eq-s3-base">
<tr>
<td class="eqnrcell">(7)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_3%28x%29%20%26%3D%26%20ax%20-%20b%20x%5E3%20%26%3D%26%20x%20%28a%20-%20b%20x%5E2%29%20%5C%5C%20S_3%27%28x%29%20%26%3D%26%20a%20-%203bx%5E2%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_3(x) &amp;=&amp; ax - b x^3 &amp;=&amp; x (a - b x^2) \\ S_3&#039;(x) &amp;=&amp; a - 3bx^2 \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_3(x) &amp;=&amp; ax - b x^3 &amp;=&amp; x (a - b x^2) \\ S_3&#039;(x) &amp;=&amp; a - 3bx^2 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
Two unknowns means we need two conditions to solve the system.<br />
The most useful conditions are usually the behaviour at the<br />
boundaries. In the case of a sine, that means look at <i>x</i>&nbsp;=&nbsp;0<br />
and/or <i>x</i>&nbsp;=&nbsp;&frac12;&pi;. The latter happens to be more<br />
useful here, so let&#8217;s look at that. First, sin(&frac12;&pi;)&nbsp;=&nbsp;1,<br />
so that&#8217;s a good one. Also, we know that at &frac12;&pi; a sine is<br />
flat (a derivative of 0). This is the second condition.
</p>
<p>
The conditions are listed in Eq&nbsp;8. Solving this<br />
system is rather straightforward and will give you values for<br />
<i>a</i> and <i>b</i>, which are also given in Eq&nbsp;8.<br />
Notice that the values are roughly 5% and 23% below the pure<br />
Taylor coefficients of 1 and 1/6.
</p>
<p><table class="eqtbl" id="eq-s3-cnd">
<tr>
<td class="eqnrcell">(8)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cleft.%20%5Cbegin%7Beqnarray%7D%20S_3%28%5Cfrac%7B%5Cpi%7D2%29%20%26%3D%26%201%20%26%3D%26%20%5Cfrac%7B%5Cpi%7D2%20a%20-%20%28%5Cfrac%7B%5Cpi%7D2%29%5E3%20b%20%5C%5C%20S_3%27%28%5Cfrac%7B%5Cpi%7D2%29%20%26%3D%26%200%20%26%3D%26%20a%20-%203%28%5Cfrac%7B%5Cpi%7D2%29%5E2%20b%20%5Cend%7Beqnarray%7D%20%5C%3B%20%5C%3B%20%5Crightarrow%20%5C%3B%20%5C%3B%20%5Cbegin%7Beqnarray%7D%20a%20%26%3D%26%20%5Cfrac3%5Cpi%20%26%5Capprox%26%200.955%20%5C%5C%20%5C%5C%20b%20%26%3D%26%20%5Cfrac4%7B%5Cpi%5E3%7D%20%26%5Capprox%26%200.129%20%5Cend%7Beqnarray%7D'<br />
	title="\left. \begin{eqnarray} S_3(\frac{\pi}2) &amp;=&amp; 1 &amp;=&amp; \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3&#039;(\frac{\pi}2) &amp;=&amp; 0 &amp;=&amp; a - 3(\frac{\pi}2)^2 b \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &amp;=&amp; \frac3\pi &amp;\approx&amp; 0.955 \\ \\ b &amp;=&amp; \frac4{\pi^3} &amp;\approx&amp; 0.129 \end{eqnarray}"<br />
	alt="\left. \begin{eqnarray} S_3(\frac{\pi}2) &amp;=&amp; 1 &amp;=&amp; \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3&#039;(\frac{\pi}2) &amp;=&amp; 0 &amp;=&amp; a - 3(\frac{\pi}2)^2 b \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &amp;=&amp; \frac3\pi &amp;\approx&amp; 0.955 \\ \\ b &amp;=&amp; \frac4{\pi^3} &amp;\approx&amp; 0.129 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
The final equation is then:
</p>
<p><table class="eqtbl" id="eq-s3">
<tr>
<td class="eqnrcell">(9)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?S_3%28x%29%20%3D%20%5Cfrac3%5Cpi%20x%20-%20%5Cfrac4%7B%5Cpi%5E3%7D%20x%5E3'<br />
	title="S_3(x) = \frac3\pi x - \frac4{\pi^3} x^3"<br />
	alt="S_3(x) = \frac3\pi x - \frac4{\pi^3} x^3" /><br />
</td>
</tr>
</table></p>
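<p>
As a quick numerical sanity check, the sketch below evaluates the fitted cubic<br />
of Eq&nbsp;9 next to the plain third-order Taylor polynomial. The names<br />
<code>s3_fit</code>, <code>s3_fit_deriv</code> and <code>taylor3</code> are made up<br />
for this illustration, and floating point is used purely for readability.
</p>

```c
/* Minimal sketch: fitted cubic (Eq 9) versus the truncated Taylor series.
 * All names are hypothetical; x is in radians, 0 <= x <= pi/2. */
#define PI 3.14159265358979323846

/* S3(x) = (3/pi) x - (4/pi^3) x^3 */
double s3_fit(double x)
{
    const double a = 3.0 / PI;
    const double b = 4.0 / (PI * PI * PI);
    return x * (a - b * x * x);
}

/* S3'(x) = a - 3 b x^2 */
double s3_fit_deriv(double x)
{
    const double a = 3.0 / PI;
    const double b = 4.0 / (PI * PI * PI);
    return a - 3.0 * b * x * x;
}

/* First two non-zero Taylor terms around 0: x - x^3/6 */
double taylor3(double x)
{
    return x - x * x * x / 6.0;
}
```

<p>
At <i>x</i>&nbsp;=&nbsp;&frac12;&pi; the fit returns 1 with a zero slope (up to rounding),<br />
whereas the Taylor cubic lands at roughly 0.925.
</p>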
<p>
In Fig&nbsp;1 you can see a number of different<br />
approximations to the sine. Note that I&#8217;ve done a little coordinate<br />
transformation for the <i>x</i>-axis: <i>z</i>&nbsp;=&nbsp;<i>x</i>/(&frac12;&pi;),<br />
so <i>z</i>&nbsp;=&nbsp;1 means <i>x</i>&nbsp;=&nbsp;&frac12;&pi;. The benefit of this<br />
will become clear later.
</p>
<p>
As you can see, the third-order Taylor expansion starts out all right,<br />
but veers off course near the end. In contrast, the third-order fit<br />
matches the sine at both end points. There is also the second-order<br />
fit from the devmaster site. As you can see, the third-order approximation<br />
is closer.
</p>
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-sine-t23"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;1. </b>
</div>

</div>
<p><div>&nbsp;</div></p>
<p>
Now, please remember that the coefficients from Eq&nbsp;8<br />
are not the only ones you can use. The conditions define what the<br />
values will be; different conditions lead to different values. For<br />
example, instead of using the derivative at &frac12;&pi;, I could have<br />
used it at <i>x</i>&nbsp;=&nbsp;0. This forms the<br />
set of equations of Eq&nbsp;10 and, as you can see,<br />
the coefficients are now different. This set is actually more accurate<br />
(a 0.6% average error instead of 1.1%), but it also has the rather<br />
unsavoury characteristic of having a maximum that&#8217;s not at<br />
&frac12;&pi; and that goes over 1.0; this can be <i>really</i> unsettling<br />
if you intend to use the sine in something like rotation.
</p>
<p><table class="eqtbl" id="eq-s3-cnd-alt">
<tr>
<td class="eqnrcell">(10)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cleft.%20%5Cbegin%7Beqnarray%7D%20S_3%28%5Cfrac%7B%5Cpi%7D2%29%20%26%3D%26%201%20%26%3D%26%20%5Cfrac%7B%5Cpi%7D2%20a%20-%20%28%5Cfrac%7B%5Cpi%7D2%29%5E3%20b%20%5C%5C%20S_3%27%280%29%20%26%3D%26%201%20%26%3D%26%20a%20%5Cend%7Beqnarray%7D%20%5C%3B%20%5C%3B%20%5Crightarrow%20%5C%3B%20%5C%3B%20%5Cbegin%7Beqnarray%7D%20a%20%26%3D%26%201%20%5C%5C%20%5C%5C%20b%20%26%3D%26%20%5Cfrac4%7B%5Cpi%5E2%7D%281-%5Cfrac2%5Cpi%29%20%5Capprox%200.147%20%5Cend%7Beqnarray%7D'<br />
	title="\left. \begin{eqnarray} S_3(\frac{\pi}2) &amp;=&amp; 1 &amp;=&amp; \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3&#039;(0) &amp;=&amp; 1 &amp;=&amp; a \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &amp;=&amp; 1 \\ \\ b &amp;=&amp; \frac4{\pi^2}(1-\frac2\pi) \approx 0.147 \end{eqnarray}"<br />
	alt="\left. \begin{eqnarray} S_3(\frac{\pi}2) &amp;=&amp; 1 &amp;=&amp; \frac{\pi}2 a - (\frac{\pi}2)^3 b \\ S_3&#039;(0) &amp;=&amp; 1 &amp;=&amp; a \end{eqnarray} \; \; \rightarrow \; \; \begin{eqnarray} a &amp;=&amp; 1 \\ \\ b &amp;=&amp; \frac4{\pi^2}(1-\frac2\pi) \approx 0.147 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
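<p>
To see that overshoot in numbers, here is a small brute-force sketch that samples<br />
the cubic with the Eq&nbsp;10 coefficients over the first quadrant and records<br />
where it peaks. The type <code>peak_t</code> and the function <code>s3_alt_peak</code><br />
are hypothetical names, and the 10000-step grid is an arbitrary choice.
</p>

```c
/* Sketch: locate the peak of S3(x) = a x - b x^3 with the Eq 10
 * coefficients by plain sampling. Names are hypothetical. */
#define PI 3.14159265358979323846

typedef struct { double x; double value; } peak_t;

peak_t s3_alt_peak(void)
{
    const double a = 1.0;
    const double b = 4.0 / (PI * PI) * (1.0 - 2.0 / PI);
    peak_t peak = { 0.0, 0.0 };

    /* Sample 0 <= x <= pi/2 in 10000 equal steps and keep the largest value. */
    for (int i = 0; i <= 10000; i++) {
        double x = (PI / 2) * i / 10000.0;
        double y = x * (a - b * x * x);
        if (y > peak.value) {
            peak.x = x;
            peak.value = y;
        }
    }
    return peak;
}
```

<p>
With these coefficients the maximum sits a little before &frac12;&pi; (near<br />
<i>x</i>&nbsp;&asymp;&nbsp;1.50) and reaches roughly 1.003.
</p>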
<p><h3 id="ssec-try-dimless">1.4
Dimensionless variables and coordinate transformations
</h3>
</p>
<p>
For higher accuracy, a higher-order polynomial should be used. Before<br />
doing that, though, I&#8217;d like to mention one more trick that can make your<br />
mathematical analysis considerably easier: dimensionless variables.
</p>
<p><div>&nbsp;</div></p>
<p>
The problem with most quantities and equations is units. Metres, feet, litres,<br />
gallons; those kinds of units. Units suck. For one, there are different<br />
units for identical quantities which can be a total pain to convert<br />
and can sometimes lead to disaster.<br />

<a href="http://en.wikipedia.org/wiki/Gimli_Glider">Literally</a>.</p>
<p>Then there&#8217;s the fact that the unit sizes are basically picked at random<br />
and have nothing to do with the physical situation they&#8217;re used for.<br />
So you have weird values for constants like <i>G</i> in<br />

<a href="http://en.wikipedia.org/wiki/Newton%26%238217%3Bs%20law%20of%20universal%20gravitation">Newton&#8217;s law of universal gravitation</a>, the speed of<br />
light <i>c</i> and the 
<a href="http://en.wikipedia.org/wiki/Planck%20constant">Planck constant</a>, <i>h</i>. Keeping<br />
track of these things in equations is annoying, especially since they<br />
tend to pile up and everybody would rather that they&#8217;d just <i>go<br />
away</i>!
</p>
<p>
Enter dimensionless variables. The idea here is that instead of using<br />
standard units, you express quantities as ratios to some meaningful<br />
size. For example, in relativity you often get <i>v</i>/<i>c</i> :<br />
velocity over speed of light. Equations become much simpler if you<br />
just denote velocities as fractions of the speed of light:<br />
&beta;&nbsp;=&nbsp;<i>v</i>/<i>c</i>. Using &beta; in the equations simplifies<br />
them immensely and has the bonus that you&#8217;re not tied to any<br />
specific speed-unit anymore.
</p>
<p>
The dimensionless variable is a type of coordinate transformation.<br />
In particular, it&#8217;s a scaling of the original variable into something<br />
more useful. Another useful transformation is translation: moving<br />
the variable to a more suitable position. We will come across this<br />
later; but first: an example of dimensionless variables.
</p>
<p><div>&nbsp;</div></p>
<p>
A sine wave has lots of symmetry lines, all revolving around the<br />
quarter-circles. Because of this, the term that keeps showing up<br />
everywhere is &frac12;&pi;. This is the characteristic size of<br />
the wave. By using <i>z</i>&nbsp;=&nbsp;<i>x</i>/(&frac12;&pi;), all those<br />
important points are now at integral <i>z</i> values. Having ones<br />
in your equations is generally a good thing because they tend to<br />
disappear in multiplications. Look at what Eq&nbsp;9<br />
becomes when expressed in terms of <i>z</i>:
</p>
<p><table class="eqtbl" id="eq-s3-dimless">
<tr>
<td class="eqnrcell">(11)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_3%28x%29%20%26%3D%26%20%5Cfrac3%5Cpi%20x%20-%20%5Cfrac4%7B%5Cpi%5E3%7D%20x%5E3%20%5C%5C%20%5C%5C%20%26%3D%26%20%5Cfrac32%20%5Cfrac%7B2x%7D%5Cpi%20-%20%5Cfrac12%20%28%5Cfrac%7B2x%7D%5Cpi%29%5E3%20%5C%5C%20%5C%5C%20%26%3D%26%20%5Cfrac32%20z%20-%20%5Cfrac12%20z%5E3%20%5C%5C%20%5C%5C%20S_3%28z%29%20%26%3D%26%20%5Cfrac12z%20%283%20-%20z%5E2%29%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_3(x) &amp;=&amp; \frac3\pi x - \frac4{\pi^3} x^3 \\ \\ &amp;=&amp; \frac32 \frac{2x}\pi - \frac12 (\frac{2x}\pi)^3 \\ \\ &amp;=&amp; \frac32 z - \frac12 z^3 \\ \\ S_3(z) &amp;=&amp; \frac12z (3 - z^2) \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_3(x) &amp;=&amp; \frac3\pi x - \frac4{\pi^3} x^3 \\ \\ &amp;=&amp; \frac32 \frac{2x}\pi - \frac12 (\frac{2x}\pi)^3 \\ \\ &amp;=&amp; \frac32 z - \frac12 z^3 \\ \\ S_3(z) &amp;=&amp; \frac12z (3 - z^2) \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
Doesn&#8217;t that look a lot nicer? It goes deeper than that though.<br />
With dimensionless units, the units your measurements are in simply<br />
cease to matter! For angles, this means that whether you&#8217;re working in<br />
radians, degrees or brads, they&#8217;ll all result in the same circle-fraction,<br />
<i>z</i>. This makes converting algorithms to fixed-point notation<br />
considerably easier.
</p>
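<p>
A small floating-point sketch of that idea: once the angle is expressed in<br />
quarter-turns, one and the same cubic serves every angle unit. The helper names<br />
<code>quarter_turns</code> and <code>s3_z</code> are assumptions made for this example,<br />
as is the choice of 256 brads per circle in the comment.
</p>

```c
/* Sketch: unit-independent evaluation of Eq 11. Names are hypothetical. */

/* Convert an angle to z, given the size of a full circle in the same unit
 * (360 for degrees, 2*pi for radians, 256 if a circle is split into 256 brads). */
double quarter_turns(double angle, double full_circle)
{
    return angle / (full_circle / 4.0);
}

/* Eq 11: S3(z) = z (3 - z^2) / 2, meant for -1 <= z <= 1 */
double s3_z(double z)
{
    return 0.5 * z * (3.0 - z * z);
}
```

<p>
For instance, <code>s3_z(quarter_turns(30.0, 360.0))</code> approximates sin(30&deg;)<br />
and comes out at about 0.48, which also shows the size of the error still left<br />
in the cubic.
</p>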
<p><h2 id="sec-prod">2
Derivations and implementations
</h2>
</p>
<p>
In the section above, I discussed the tools used for analysis and<br />
gave an example of a cubic approximation. In this section I&#8217;ll also<br />
derive high-accuracy fourth and fifth order approximations and<br />
show some implementations. Before that, though, there&#8217;s some<br />
terminology to go through.
</p>
<p>
Since multiple different approximations will be covered, there needs<br />
to be a way to tell them apart. In principle, the sine<br />
approximation will be named <i>S</i><sub>n</sub>, where <i>n</i> is<br />
the order of the polynomial. So that&#8217;ll give <i>S</i><sub>2</sub> to<br />
<i>S</i><sub>5</sub>. I will also use <i>S</i><sub>4d</sub> for the<br />
fourth-order approximation from devmaster. In the derivation of my<br />
own fourth-order function, I&#8217;ll use <i>C</i><sub>n</sub>, because<br />
what will actually be derived is a cosine.
</p>
<p><h3>
Third-order implementation
</h3>
</p>
<p>
Let&#8217;s start with finishing up the story of the third-order<br />
approximation. The main equation for this is Eq&nbsp;11.<br />
Because this equation is still rather simple, I&#8217;ll make this a fixed-point<br />
implementation. The main problem with turning a floating-point function<br />
into a fixed-point one is keeping track of the fixed-point during the<br />
calculations, always making sure there&#8217;s no overflow, but no underflow<br />
either. This is one of the reasons why I wrote Eq&nbsp;11<br />
like it is: by using nested parentheses you can maximize the accuracy<br />
of intermediate calculations and possibly minimize the number of<br />
operations to boot.
</p>
<p>
To correctly account for the fixed-point positions, you need to<br />
be aware of the following factors:
</p>
<ul>
<li>
    The scale of the outcome (i.e., the amplitude): 2<sup>A</sup>
  </li>
<li>
    The scale inside the parentheses: 2<sup>p</sup>. This is<br />
    necessary to keep the multiplications from overflowing.
  </li>
<li>
    The angle-scale: 2<sup>n</sup>. This is basically the value of<br />
	&frac12;&pi; in the fixed-point system. Using <i>x</i> for the<br />
	angle, you have <i>z</i>&nbsp;=&nbsp;<i>x</i>/2<sup>n</sup>.
  </li>
</ul>
<p>
Filling this into Eq&nbsp;11 will give the following:
</p>
<p><table class="eqtbl" id="eq-s3-fp">
<tr>
<td class="eqnrcell">(12)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_3%28x%29%20%26%3D%26%20%5Cfrac12%20z%20%283%20-%20z%5E2%29%202%5EA%20%5C%5C%20%26%3D%26%20z%20%283%5Ccdot2%5Ep%20-%20z%5E2%202%5Ep%29%202%5E%7BA-p-1%7D%20%5C%5C%20%26%3D%26%20x%20%283%5Ccdot2%5Ep%20-%20x%5E2%202%5E%7Bp-2n%7D%29%202%5E%7BA-p-n-1%7D%20%5C%5C%20%26%3D%26%20x%20%283%5Ccdot2%5Ep%20-%20x%5E2%20%2F%202%5Er%20%29%20%5Cmiddle%2F%202%5Es%2C%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_3(x) &amp;=&amp; \frac12 z (3 - z^2) 2^A \\ &amp;=&amp; z (3\cdot2^p - z^2 2^p) 2^{A-p-1} \\ &amp;=&amp; x (3\cdot2^p - x^2 2^{p-2n}) 2^{A-p-n-1} \\ &amp;=&amp; x (3\cdot2^p - x^2 / 2^r ) \middle/ 2^s, \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_3(x) &amp;=&amp; \frac12 z (3 - z^2) 2^A \\ &amp;=&amp; z (3\cdot2^p - z^2 2^p) 2^{A-p-1} \\ &amp;=&amp; x (3\cdot2^p - x^2 2^{p-2n}) 2^{A-p-n-1} \\ &amp;=&amp; x (3\cdot2^p - x^2 / 2^r ) \middle/ 2^s, \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
with <i>r</i>&nbsp;=&nbsp;2<i>n</i>&minus;<i>p</i> and<br />
<i>s</i>&nbsp;=&nbsp;<i>n</i>+<i>p</i>+1&minus;<i>A</i>. These represent the<br />
fixed-point shifts you need to apply to keep everything on the level.<br />
Taking <i>p</i> as high as multiplication with <i>x</i> will allow and using the<br />
standard libnds units leads to the following numbers.
</p>
<div class=lblock>
<table border=1 cellpadding=2 cellspacing=0>
<tr>
<th> A</th>
<th> n</th>
<th> p</th>
<th> r</th>
<th> s</th>
</tr>
<tr>
<td>12</td>
<td>13</td>
<td>15</td>
<td>11</td>
<td>17</td>
</tr>
</table>
</div>
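<p>
The shift amounts in the table follow directly from <i>r</i>&nbsp;=&nbsp;2<i>n</i>&minus;<i>p</i><br />
and <i>s</i>&nbsp;=&nbsp;<i>n</i>+<i>p</i>+1&minus;<i>A</i>. The sketch below recomputes them and<br />
uses 64-bit arithmetic to check that the intermediates stay inside a signed<br />
32-bit word for first-quadrant inputs. All function names are hypothetical, and<br />
the check assumes <i>r</i>&nbsp;&ge;&nbsp;0.
</p>

```c
#include <stdint.h>

/* Sketch: derive the shifts of Eq 12 and verify the 32-bit headroom.
 * Function names are hypothetical; assumes 2n >= p so that r >= 0. */

int shift_r(int n, int p)        { return 2 * n - p; }
int shift_s(int A, int n, int p) { return n + p + 1 - A; }

/* Returns 1 if x*x, the parenthesised term and their product with x all fit
 * in an int32 for every first-quadrant input 0 <= x <= 2^n, else 0. */
int s3_fits_in_32_bits(int n, int p)
{
    const int r = shift_r(n, p);
    for (int64_t x = 0; x <= ((int64_t)1 << n); x++) {
        int64_t sq    = x * x;
        int64_t inner = ((int64_t)3 << p) - (sq >> r);
        int64_t prod  = x * inner;
        if (sq > INT32_MAX || inner > INT32_MAX ||
            prod > INT32_MAX || prod < INT32_MIN)
            return 0;
    }
    return 1;
}
```

<p>
With <i>A</i>&nbsp;=&nbsp;12, <i>n</i>&nbsp;=&nbsp;13 and <i>p</i>&nbsp;=&nbsp;15 this reproduces<br />
<i>r</i>&nbsp;=&nbsp;11 and <i>s</i>&nbsp;=&nbsp;17, and the headroom check passes.
</p>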
<p><div class="cptfr" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-quadrants"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;2. </b>
</div>
</p>
<p>
That&#8217;s the calculation necessary for the first quadrant, but the domain<br />
of a sine is infinite. To get the rest of the domain, you can use<br />
the symmetries of the sine: the 2&pi; periodicity and the<br />
&frac12;&pi; mirror symmetries. The first is taken care of by doing<br />
<i>z</i>&nbsp;%&nbsp;4. This reduces the domain to the four quadrants of a<br />
circle. The next part is somewhat tricky, so pay attention.
</p>
<p>
Look at Fig&nbsp;2. <i>S</i><sub>3</sub> works for<br />
quadrant 0. Because it&#8217;s antisymmetric, it will also correctly<br />
calculate quadrant 3, which is equivalent to quadrant &minus;1.<br />
Quadrants 1 and 2 are the problem. As you can see in<br />
Fig&nbsp;2, what needs to happen is for those<br />
quadrants to mirror onto quadrants 0 and &minus;1. A reflection<br />
of <i>x</i> at <i>D</i> is defined by Eq&nbsp;13.<br />
In this case, that means that <i>z</i>&nbsp;=&nbsp;2&nbsp;&minus;&nbsp;<i>z</i>.
</p>
<p><table class="eqtbl" id="eq-reflect">
<tr>
<td class="eqnrcell">(13)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?x%20%3D%20D%20-%20%28x-D%29%20%3D%202D-x'<br />
	title="x = D - (x-D) = 2D-x"<br />
	alt="x = D - (x-D) = 2D-x" /><br />
</td>
</tr>
</table></p>
<p>
A test needs to be done to see when the reflection should take<br />
place. The quadrant numbers in binary are 00, 01, 10, 11. If you<br />
build a truth-table around that, you&#8217;ll see that an XOR of the<br />
two bits will do the trick. If you really want to show off,<br />
you can combine the periodicity modulo and the quadrant test by<br />
doing the arithmetic in the top bits. The implementation is<br />
now complete.
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">s32 isin_S3(s32 x)<br />
{<br />
&nbsp; &nbsp; <span class="co1">// S(x) = x * ( (3&lt;&lt;p) &#8211; (x*x&gt;&gt;r) ) &gt;&gt; s</span><br />
&nbsp; &nbsp; <span class="co1">// n : Q-pos for quarter circle &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 13</span><br />
&nbsp; &nbsp; <span class="co1">// A : Q-pos for output &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 12</span><br />
&nbsp; &nbsp; <span class="co1">// p : Q-pos for parentheses intermediate &nbsp; 15</span><br />
&nbsp; &nbsp; <span class="co1">// r = 2n-p &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 11</span><br />
&nbsp; &nbsp; <span class="co1">// s = A-1-p-n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 17</span></p>
<p>&nbsp; &nbsp; <span class="kw1">static</span> <span class="kw1">const</span> <span class="kw1">int</span> qN = <span class="nu0">13</span>, qA= <span class="nu0">12</span>, qP= <span class="nu0">15</span>, qR= <span class="nu0">2</span>*qN-qP, qS= qN+qP+<span class="nu0">1</span>-qA;</p>
<p>&nbsp; &nbsp; x= x&lt;&lt;(<span class="nu0">30</span>-qN);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// shift to full s32 range (Q13-&gt;Q30)</span></p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>( (x^(x&lt;&lt;<span class="nu0">1</span>)) &lt; <span class="nu0">0</span>) &nbsp; &nbsp; <span class="co1">// test for quadrant 1 or 2</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; x= (<span class="nu0">1</span>&lt;&lt;<span class="nu0">31</span>) - x;</p>
<p>&nbsp; &nbsp; x= x&gt;&gt;(<span class="nu0">30</span>-qN);</p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> x * ( (<span class="nu0">3</span>&lt;&lt;qP) &#8211; (x*x&gt;&gt;qR) ) &gt;&gt; qS;<br />
}</div>
</div>
<p>
And, of course, there&#8217;s an assembly version as well. It&#8217;s only ten<br />
instructions, which I think is actually shorter than a LUT+lerp<br />
implementation.
</p>
<div class="gccarm">
<div class="gccarm proglist" style=" "><span class="co1">@ ARM assembly version, using n=13, p=15, A=12</span><br />
<span class="co1">@ Input: gamma in Q13</span><br />
&nbsp; &nbsp; <span class="kw4">.arm</span><br />
&nbsp; &nbsp; <span class="kw4">.align</span><br />
&nbsp; &nbsp; <span class="kw4">.global</span> isin_S3a<br />
isin_S3a:<br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">lsl</span> #(<span class="nu0">30</span>-<span class="nu0">13</span>)<br />
&nbsp; &nbsp; <span class="re1">teq</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">lsl</span> #<span class="nu0">1</span><br />
&nbsp; &nbsp; <span class="re1">rsbmi</span> &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, #<span class="nu0">1</span>&lt;&lt;<span class="nu0">31</span><br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">asr</span> #(<span class="nu0">30</span>-<span class="nu0">13</span>)<br />
&nbsp; &nbsp; <span class="re1">mul</span> &nbsp; &nbsp; <span class="kw2">r1</span>, <span class="kw2">r0</span>, <span class="kw2">r0</span><br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r1</span>, <span class="kw2">r1</span>, <span class="kw1">asr</span> #<span class="nu0">11</span><br />
&nbsp; &nbsp; <span class="re1">rsb</span> &nbsp; &nbsp; <span class="kw2">r1</span>, <span class="kw2">r1</span>, #<span class="nu0">3</span>&lt;&lt;<span class="nu0">15</span><br />
&nbsp; &nbsp; <span class="re1">mul</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r1</span>, <span class="kw2">r0</span><br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">asr</span> #<span class="nu0">17</span><br />
&nbsp; &nbsp; <span class="re2">bx</span>&nbsp; &nbsp; &nbsp; <span class="kw2">lr</span></div>
</div>
<h4>Radians?</h4>
<p>
Oh wait, the requirement was for the input to be in Q12 radians,<br />
right? Weeell, that&#8217;s no biggy. You just have to do the<br />
<i>x</i>&nbsp;&rarr;&nbsp;<i>z</i> conversion yourself. Take, say,<br />
2<sup>20</sup>/(2&pi;). Multiplying <i>x</i> by this gives <i>z</i><br />
as a Q30 number; exactly what the first line in the C code resulted in.<br />
This means that all you have to do is change the first line to<br />
<code>x *= 166886;</code>.
</p>
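<p>
Put together, a radians variant could look like the sketch below. The name<br />
<code>isin_S3_rad</code> is hypothetical, and I&#8217;ve switched the multiply and<br />
reflection to unsigned arithmetic so that the wraparound (which is what handles<br />
the 2&pi; periodicity) is well-defined C. The conversion back to signed and the<br />
right-shift of negative values are still implementation-defined, so this assumes<br />
a two&#8217;s complement compiler with arithmetic shifts, such as GCC.
</p>

```c
#include <stdint.h>

/* Sketch: third-order sine with Q12 radians in, Q12 out.
 * Hypothetical name; assumes two's complement and arithmetic right shift. */
int32_t isin_S3_rad(int32_t x)
{
    enum { qN = 13, qA = 12, qP = 15,
           qR = 2 * qN - qP, qS = qN + qP + 1 - qA };

    /* x * 2^20/(2*pi): Q12 radians -> quarter-turns in Q30.
     * Unsigned multiplication wraps modulo 2^32, i.e. modulo a full circle. */
    uint32_t u = (uint32_t)x * 166886u;

    /* Quadrants 1 and 2 (top two bits differ): reflect, z = 2 - z. */
    if ((u ^ (u << 1)) & 0x80000000u)
        u = 0x80000000u - u;

    /* Back to a signed Q13 value in [-2^13, 2^13]. */
    int32_t z = (int32_t)u >> (30 - qN);

    return (z * ((3 << qP) - ((z * z) >> qR))) >> qS;
}
```

<p>
A quarter circle is about 6434 in Q12 radians, so <code>isin_S3_rad(6434)</code><br />
should land at or just under 4096 (1.0 in Q12).
</p>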
<h4>NDS special</h4>
<p>
The assembly version given above uses standard ARM instructions, but<br />
one of the interesting things is that the NDS&#8217; ARM9 core has special<br />
multiplication instructions. In particular, there is the<br />
<code>SMULWx</code> instruction, which does a word*halfword<br />
multiplication, where the halfword can be either the top or bottom<br />
halfword of operand 2. The main result is 32&times;16&rarr;48 bits<br />
long, of which only the top 32 bits are put in the destination<br />
register. Effectively it&#8217;s like <i>a</i>*<i>b</i>&gt;&gt;16 without<br />
overflow problems. As a bonus, it&#8217;s also slightly faster than the<br />
standard <code>MUL</code>. By slightly changing the parameters,<br />
the down-shift factors <i>r</i> and <i>s</i> can be made 16, fitting<br />
perfectly with this instruction, although the internal accuracy is<br />
made slightly worse. Additionally, careful placement of each<br />
instruction can avoid the interlock cycle that happens for<br />
multiplications.
</p>
<p>
The alternate <code>isin_S3a()</code> becomes:
</p>
<div class="gccarm">
<div class="gccarm proglist" style=" "><span class="co1">@ Special ARM assembly version, using n=13 and lots of Q14</span><br />
<span class="co1">@ Input: gamma in Q13</span><br />
&nbsp; &nbsp; <span class="kw4">.arm</span><br />
&nbsp; &nbsp; <span class="kw4">.align</span><br />
&nbsp; &nbsp; <span class="kw4">.global</span> isin_S3a9<br />
isin_S3a9:<br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">lsl</span> #(<span class="nu0">30</span>-<span class="nu0">13</span>)&nbsp; &nbsp; <span class="co1">@ x &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ; Q30</span><br />
&nbsp; &nbsp; <span class="re1">teq</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">lsl</span> #<span class="nu0">1</span><br />
&nbsp; &nbsp; <span class="re1">rsbmi</span> &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, #<span class="nu0">1</span>&lt;&lt;<span class="nu0">31</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="re2">smulwt</span>&nbsp; <span class="kw2">r1</span>, <span class="kw2">r0</span>, <span class="kw2">r0</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ y=x*x &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ; Q30*Q14/Q16 = Q28</span><br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r2</span>, #<span class="nu0">3</span>&lt;&lt;<span class="nu0">13</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ B_14=3/2</span><br />
&nbsp; &nbsp; <span class="re1">sub</span> &nbsp; &nbsp; <span class="kw2">r1</span>, <span class="kw2">r2</span>, <span class="kw2">r1</span>, <span class="kw1">asr</span> #<span class="nu0">15</span> &nbsp; &nbsp; <span class="co1">@ 3/2-y/2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ; Q14+Q28/Q14/2</span><br />
&nbsp; &nbsp; <span class="re2">smulwt</span>&nbsp; <span class="kw2">r0</span>, <span class="kw2">r1</span>, <span class="kw2">r0</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ; Q14*Q14/Q16 = Q12</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="re2">bx</span>&nbsp; &nbsp; &nbsp; <span class="kw2">lr</span></div>
</div>
<p>
Technically it&#8217;s only two instructions shorter, but it is quite a bit<br />
faster due to the difference in speed between <code>MUL</code><br />
and <code>SMULWx</code>.
</p>
<p><h3 id="ssec-prod-s5">2.1
High-precision, fifth order
</h3>
</p>
<p>
The third-order approximation actually still has a substantial error,<br />
so it may be useful to use an additional term. This would be<br />
the fifth-order approximation, <i>S</i><sub>5</sub>. It and its<br />
derivative are given in Eq&nbsp;14.
</p>
<p><table class="eqtbl" id="eq-s5-base">
<tr>
<td class="eqnrcell">(14)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_5%28x%29%20%26%3D%26%20ax%20-%20b%20x%5E3%20%2B%20c%20x%5E5%20%5C%5C%20%5C%5C%20%5C%5C%20S_5%27%28x%29%20%26%3D%26%20a%20-%203b%20x%5E2%20%2B%205c%20x%5E4%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_5(x) &amp;=&amp; ax - b x^3 + c x^5 \\ \\ \\ S_5&#039;(x) &amp;=&amp; a - 3b x^2 + 5c x^4 \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_5(x) &amp;=&amp; ax - b x^3 + c x^5 \\ \\ \\ S_5&#039;(x) &amp;=&amp; a - 3b x^2 + 5c x^4 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
To find the terms, I will again use <i>z</i> instead of <i>x</i>.<br />
The conditions of note are the position and derivative at <i>z</i>&nbsp;=&nbsp;1<br />
and the derivative at 0. With these conditions the approximation<br />
should behave amicably at both edges.
</p>
<p><table class="eqtbl" id="eq-s5-cnd">
<tr>
<td class="eqnrcell">(15)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20S_5%28z%3D1%29%20%26%3D%26%201%20%26%3D%26%20a%20%26-%26%20b%20%26%2B%26%20c%20%5C%5C%20%5C%5C%20%5C%5C%20S%27_5%28z%3D1%29%20%26%3D%26%200%20%26%3D%26%20a%20%26-%26%203b%20%26%2B%26%205c%20%5C%5C%20%5C%5C%20%5C%5C%20S%27_5%28z%3D0%29%20%26%3D%26%20%5Cfrac%5Cpi2%20%26%3D%26%20a%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} S_5(z=1) &amp;=&amp; 1 &amp;=&amp; a &amp;-&amp; b &amp;+&amp; c \\ \\ \\ S&#039;_5(z=1) &amp;=&amp; 0 &amp;=&amp; a &amp;-&amp; 3b &amp;+&amp; 5c \\ \\ \\ S&#039;_5(z=0) &amp;=&amp; \frac\pi2 &amp;=&amp; a \end{eqnarray}"<br />
	alt="\begin{eqnarray} S_5(z=1) &amp;=&amp; 1 &amp;=&amp; a &amp;-&amp; b &amp;+&amp; c \\ \\ \\ S&#039;_5(z=1) &amp;=&amp; 0 &amp;=&amp; a &amp;-&amp; 3b &amp;+&amp; 5c \\ \\ \\ S&#039;_5(z=0) &amp;=&amp; \frac\pi2 &amp;=&amp; a \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
Notice that these equations are linear with respect to <i>a</i>,<br />
<i>b</i> and <i>c</i>, which means that they can be solved via matrices.<br />
Technically this system of equations forms a 3&times;3 matrix, but since<br />
<i>a</i> is already immediately known it can be reduced to a<br />
2&times;2 system. I&#8217;ll spare you the details, but it leads to the<br />
coefficients of Eq&nbsp;16. Note the complete absence of<br />
any horrid &pi;<sup>5</sup> terms that would have appeared if you had<br />
decided <i>not</i> to use dimensionless terms.
</p>
<p><table class="eqtbl" id="eq-s5-coef">
<tr>
<td class="eqnrcell">(16)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20a%20%26%3D%26%20%5Cpi%2F2%20%5C%5C%20%5C%5C%20b%20%26%3D%26%20%5Cpi%20-%205%2F2%20%5C%5C%20%5C%5C%20c%20%26%3D%26%20%5Cpi%2F2%20-%203%2F2%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} a &amp;=&amp; \pi/2 \\ \\ b &amp;=&amp; \pi - 5/2 \\ \\ c &amp;=&amp; \pi/2 - 3/2 \end{eqnarray}"<br />
	alt="\begin{eqnarray} a &amp;=&amp; \pi/2 \\ \\ b &amp;=&amp; \pi - 5/2 \\ \\ c &amp;=&amp; \pi/2 - 3/2 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
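<p>
For those who do want the details, here is a sketch of the elimination, written
in the same LaTeX notation as the other equations. It only uses the three
conditions of Eq&nbsp;15; with <i>a</i> known, the first two conditions are
rearranged and subtracted from each other.
</p>

```latex
\begin{eqnarray}
a &=& \pi/2 \\
b - c &=& a - 1 \\
3b - 5c &=& a \\
3(b - c) - (3b - 5c) &=& 2c \;=\; 2a - 3 \\
c &=& a - 3/2 \;=\; \pi/2 - 3/2 \\
b &=& c + a - 1 \;=\; 2a - 5/2 \;=\; \pi - 5/2
\end{eqnarray}
```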
<p><table class="eqtbl" id="eq-s5-final">
<tr>
<td class="eqnrcell">(17)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?S_5%28z%29%20%3D%20%5Cfrac12%20z%20%28%5Cpi%20-%20z%5E2%20%5B%20%282%5Cpi-5%29%20-%20z%5E2%20%28%5Cpi%20-%203%29%20%5D%20%29'<br />
	title="S_5(z) = \frac12 z (\pi - z^2 [ (2\pi-5) - z^2 (\pi - 3) ] )"<br />
	alt="S_5(z) = \frac12 z (\pi - z^2 [ (2\pi-5) - z^2 (\pi - 3) ] )" /><br />
</td>
</tr>
</table></p>
<p>
Eq&nbsp;17 is the final quintic approximation in the<br />
form that&#8217;s most accurate and easiest to implement. The implementation<br />
is basically an extension of the <i>S</i><sub>3</sub> function<br />
and left as an exercise for the reader.
</p>
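<p>
As a starting point for that exercise, a floating-point version of Eq&nbsp;17
could look like the sketch below. The name <code>sin_s5</code> and the
<code>PI</code> constant are my own choices for illustration, and the function
only covers <i>z</i>&nbsp;&isin;&nbsp;[&minus;1,&nbsp;1]; folding the other
quadrants into that range is left out.
</p>

```c
/* Floating-point sketch of Eq 17 (hypothetical helper, not from the article).
   z is the angle as a fraction of a quarter circle: z = x / (pi/2).
   Only valid for z in [-1, 1]. */
static const double PI = 3.14159265358979323846;

double sin_s5(double z)
{
    double z2 = z * z;
    /* 1/2 * z * (pi - z^2 * [(2pi - 5) - z^2 * (pi - 3)]) */
    return 0.5 * z * (PI - z2 * ((2.0 * PI - 5.0) - z2 * (PI - 3.0)));
}
```

<p>
Because the polynomial is written in nested form, the whole thing costs four
multiplications after <i>z</i><sup>2</sup> has been computed.
</p>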
<p><h3 id="ssec-prod-s4">2.2
High precision, fourth order
</h3>
</p>
<p><div class="cptfr" style="width:192px;">
  <a href="" target="_blank">  <img src="" id="img-sincos"
    alt="" width="192" /></a><br />
  <b>Fig&nbsp;3. </b>
</div>
</p>
<p>
Lastly, a fourth-order approximation. Normally, I wouldn&#8217;t even consider<br />
this for a sine (odd function == odd power series and all that), but<br />
since the devmaster post uses them and they even seem to work, there<br />
seems to be something to them after all.
</p>
<p>
The reason those approximations work is simple: they don&#8217;t actually<br />
approximate a sine at all; they approximate a <b>co</b>sine. And,<br />
because of all the symmetries and parallels with sines and cosines,<br />
one can be used to implement the other.
</p>
<p><table class="eqtbl" id="eq-sincos">
<tr>
<td class="eqnrcell">(18)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20sin%28x%29%20%26%3D%26%20cos%28x%20-%20%5Cpi%2F2%29%20%5C%5C%20%5C%5C%20sin%28z%29%20%26%3D%26%20cos%28z%20-%201%29%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} sin(x) &amp;=&amp; cos(x - \pi/2) \\ \\ sin(z) &amp;=&amp; cos(z - 1) \end{eqnarray}"<br />
	alt="\begin{eqnarray} sin(x) &amp;=&amp; cos(x - \pi/2) \\ \\ sin(z) &amp;=&amp; cos(z - 1) \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
Eq&nbsp;18 is<br />
the transformation you need to perform to turn a cosine into a sine<br />
wave. This can easily be done at the start of an algorithm.<br />
What&#8217;s left is to derive a cosine approximation. Because a cosine<br />
is even, only even powers will be needed. The base form and its<br />
derivative are given in Eq&nbsp;19.
</p>
<p><table class="eqtbl" id="eq-c4-base">
<tr>
<td class="eqnrcell">(19)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20C_4%20%28x%29%20%26%3D%26%20a%20-%20b%20x%5E2%20%2B%20c%20x%5E4%20%5C%5C%20%5C%5C%20%5C%5C%20C_4%27%28x%29%20%26%3D%26%20-%202b%20x%20%2B%204c%20x%5E3%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} C_4 (x) &amp;=&amp; a - b x^2 + c x^4 \\ \\ \\ C_4&#039;(x) &amp;=&amp; - 2b x + 4c x^3 \end{eqnarray}"<br />
	alt="\begin{eqnarray} C_4 (x) &amp;=&amp; a - b x^2 + c x^4 \\ \\ \\ C_4&#039;(x) &amp;=&amp; - 2b x + 4c x^3 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
For the conditions, we once again look at <i>z</i>&nbsp;=&nbsp;0 and <i>z</i>&nbsp;=&nbsp;1,<br />
which comes down to the set of equations in Eq&nbsp;20.<br />
One of the interesting things about even functions is that the<br />
derivative at 0 is zero, so that&#8217;s a freebie. A very important<br />
freebie, as it means that one of the required symmetries happens<br />
automatically.
</p>
<p><table class="eqtbl" id="eq-c4-cnd">
<tr>
<td class="eqnrcell">(20)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20C_4%28z%3D0%29%20%26%3D%26%201%20%26%3D%26%20a%20%5C%5C%20%5C%5C%20%5C%5C%20C_4%28z%3D1%29%20%26%3D%26%200%20%26%3D%26%20a%20%26-%26%20b%20%26%2B%26%20c%20%5C%5C%20%5C%5C%20%5C%5C%20C%27_4%28z%3D1%29%20%26%3D%26%20-%5Cfrac%5Cpi2%20%26%3D%26%20%26-%26%202b%20%26%2B%26%204c%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} C_4(z=0) &amp;=&amp; 1 &amp;=&amp; a \\ \\ \\ C_4(z=1) &amp;=&amp; 0 &amp;=&amp; a &amp;-&amp; b &amp;+&amp; c \\ \\ \\ C&#039;_4(z=1) &amp;=&amp; -\frac\pi2 &amp;=&amp; &amp;-&amp; 2b &amp;+&amp; 4c \end{eqnarray}"<br />
	alt="\begin{eqnarray} C_4(z=0) &amp;=&amp; 1 &amp;=&amp; a \\ \\ \\ C_4(z=1) &amp;=&amp; 0 &amp;=&amp; a &amp;-&amp; b &amp;+&amp; c \\ \\ \\ C&#039;_4(z=1) &amp;=&amp; -\frac\pi2 &amp;=&amp; &amp;-&amp; 2b &amp;+&amp; 4c \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
The resulting set of coefficients is listed in Eq&nbsp;21.<br />
Note that <i>b</i>&nbsp;=&nbsp;<i>c</i>+1, which may be of use later. The final<br />
equation for the fourth order cosine approximation is<br />
Eq&nbsp;22. Only three MULs and two SUBs; nice.
</p>
<p><table class="eqtbl" id="eq-c4-coef">
<tr>
<td class="eqnrcell">(21)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20a%20%26%3D%26%201%20%5C%5C%20%5C%5C%20b%20%26%3D%26%202%20-%20%5Cpi%2F4%20%5C%5C%20%5C%5C%20c%20%26%3D%26%201%20-%20%5Cpi%2F4%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} a &amp;=&amp; 1 \\ \\ b &amp;=&amp; 2 - \pi/4 \\ \\ c &amp;=&amp; 1 - \pi/4 \end{eqnarray}"<br />
	alt="\begin{eqnarray} a &amp;=&amp; 1 \\ \\ b &amp;=&amp; 2 - \pi/4 \\ \\ c &amp;=&amp; 1 - \pi/4 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p><table class="eqtbl" id="eq-c4-final">
<tr>
<td class="eqnrcell">(22)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?C_4%28z%29%20%3D%201%20-%20z%5E2%20%5B%20%282-%5Cpi%2F4%29%20-%20z%5E2%20%281-%5Cpi%2F4%29%20%5D'<br />
	title="C_4(z) = 1 - z^2 [ (2-\pi/4) - z^2 (1-\pi/4) ]"<br />
	alt="C_4(z) = 1 - z^2 [ (2-\pi/4) - z^2 (1-\pi/4) ]" /><br />
</td>
</tr>
</table></p>
<h4>Implementation</h4>
<p>
The floating-point implementation of Eq&nbsp;22 is<br />
again too easy to mention here, so I&#8217;ll focus on fixed-point<br />
variations. Like with <i>S</i><sub>3</sub>, you can mix and match<br />
fixed-point positions until you get something you like. In this<br />
case I&#8217;ll stick to Q14 for almost everything to keep things simple.
</p>
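<p>
For completeness, this is roughly what the floating-point form amounts to. It
is a minimal sketch: the names <code>cos_c4</code> and <code>sin_s4</code> are
made up for this example, and the sine wrapper only handles
<i>z</i>&nbsp;&isin;&nbsp;[0,&nbsp;2] (the top semi-circle), using the shift
from Eq&nbsp;18.
</p>

```c
/* Floating-point sketch of Eq 22 and Eq 18 (hypothetical helper names). */
static const double PI = 3.14159265358979323846;

/* Fourth-order cosine approximation, z in [-1, 1] (quarter-circle units). */
double cos_c4(double z)
{
    double z2 = z * z;
    /* 1 - z^2 * [(2 - pi/4) - z^2 * (1 - pi/4)] */
    return 1.0 - z2 * ((2.0 - PI / 4.0) - z2 * (1.0 - PI / 4.0));
}

/* sin(z) = cos(z - 1), so this is valid for z in [0, 2]. */
double sin_s4(double z)
{
    return cos_c4(z - 1.0);
}
```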
<p>
The real trick here is to find out what you need to do about all the<br />
other quadrants. Cutting down to four quadrants is, again, easy.<br />
For the rest, remember that the cosine approximation calculates the top<br />
quadrants and you need to flip the sign for the bottom quadrants.<br />
If you think in terms of the parameter that a sine gets, you see that<br />
only for odd semi-circles the sign needs to change. Tracing this<br />
can be done with a single bitwise AND or a clever shift.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">//! A sine approximation via a fourth-order cosine approx.</span><br />
s32 isin_S4(s32 x)<br />
{<br />
&nbsp; &nbsp; <span class="kw1">int</span> c, x2, y;<br />
&nbsp; &nbsp; <span class="kw1">static</span> <span class="kw1">const</span> <span class="kw1">int</span> qN= <span class="nu0">13</span>, qA= <span class="nu0">12</span>, B=<span class="nu0">19900</span>, C=<span class="nu0">3516</span>;</p>
<p>&nbsp; &nbsp; c= x&lt;&lt;(<span class="nu0">30</span>-qN);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Semi-circle info into carry.</span><br />
&nbsp; &nbsp; x -= <span class="nu0">1</span>&lt;&lt;qN; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// sine -&gt; cosine calc</span></p>
<p>&nbsp; &nbsp; x= x&lt;&lt;(<span class="nu0">31</span>-qN);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Mask with PI</span><br />
&nbsp; &nbsp; x= x&gt;&gt;(<span class="nu0">31</span>-qN);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Note: SIGNED shift! (to qN)</span><br />
&nbsp; &nbsp; x= x*x&gt;&gt;(<span class="nu0">2</span>*qN-<span class="nu0">14</span>);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// x=x^2 To Q14</span></p>
<p>&nbsp; &nbsp; y= B - (x*C&gt;&gt;<span class="nu0">14</span>); &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// B - x^2*C</span><br />
&nbsp; &nbsp; y= (<span class="nu0">1</span>&lt;&lt;qA)-(x*y&gt;&gt;<span class="nu0">16</span>); &nbsp; &nbsp; &nbsp; <span class="co1">// A - x^2*(B-x^2*C)</span></p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> c&gt;=<span class="nu0">0</span> ? y : -y;<br />
}</div>
</div>
<p>
And an ARM9 assembly version too. As it happens, it&#8217;s only two<br />
instructions longer than <code>isin_S3a9()</code>.
</p>
<div class="gccarm">
<div class="gccarm proglist" style=" "><span class="co1">@ ARM assembly version of S4 = C4(gamma-1), using n=13, A=12 and &#8230; miscellaneous.</span><br />
<span class="co1">@ Input: gamma in Q13</span><br />
&nbsp; &nbsp; <span class="kw4">.arm</span><br />
&nbsp; &nbsp; <span class="kw4">.align</span><br />
&nbsp; &nbsp; <span class="kw4">.global</span> isin_S4a9<br />
isin_S4a9:<br />
&nbsp; &nbsp; <span class="re1">movs</span>&nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw1">lsl</span> #(<span class="nu0">31</span>-<span class="nu0">13</span>)&nbsp; &nbsp; <span class="co1">@ r0=x%2 &lt;&lt;31 &nbsp; &nbsp; &nbsp; ; carry=x/2</span><br />
&nbsp; &nbsp; <span class="re1">sub</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, #<span class="nu0">1</span>&lt;&lt;<span class="nu0">31</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ r0 -= 1.0 &nbsp; &nbsp; &nbsp; &nbsp; ; sin &lt;-&gt; cos</span><br />
&nbsp; &nbsp; <span class="re2">smulwt</span>&nbsp; <span class="kw2">r1</span>, <span class="kw2">r0</span>, <span class="kw2">r0</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ r1 = x*x&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ; Q31*Q15/Q16=Q30</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="re2">ldr</span> &nbsp; &nbsp; <span class="kw2">r2</span>,=<span class="nu0">14016</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ C = (1-pi/4)&lt;&lt;16</span><br />
&nbsp; &nbsp; <span class="re2">smulwt</span>&nbsp; <span class="kw2">r0</span>, <span class="kw2">r2</span>, <span class="kw2">r1</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ C*x^2&gt;&gt;16 &nbsp; &nbsp; &nbsp; &nbsp; ; Q16*Q14/Q16 = Q14</span><br />
&nbsp; &nbsp; <span class="re1">add</span> &nbsp; &nbsp; <span class="kw2">r2</span>, <span class="kw2">r2</span>, #<span class="nu0">1</span>&lt;&lt;<span class="nu0">16</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ B = C+1</span><br />
&nbsp; &nbsp; <span class="re1">rsb</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, <span class="kw2">r2</span>, <span class="kw1">asr</span> #<span class="nu0">2</span>&nbsp; &nbsp; &nbsp; <span class="co1">@ B - C*x^2 &nbsp; &nbsp; &nbsp; &nbsp; ; Q14</span><br />
&nbsp; &nbsp; <span class="re2">smulwb</span>&nbsp; <span class="kw2">r0</span>, <span class="kw2">r1</span>, <span class="kw2">r0</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ x^2 * (B-C*x^2) &nbsp; ; Q30*Q14/Q16 = Q28</span><br />
&nbsp; &nbsp; <span class="re1">mov</span> &nbsp; &nbsp; <span class="kw2">r1</span>, #<span class="nu0">1</span>&lt;&lt;<span class="nu0">12</span><br />
&nbsp; &nbsp; <span class="re1">sub</span> &nbsp; &nbsp; <span class="kw2">r0</span>, <span class="kw2">r1</span>, <span class="kw2">r0</span>, <span class="kw1">asr</span> #<span class="nu0">16</span> &nbsp; &nbsp; <span class="co1">@ 1 - x^2 * (B-C*x^2)</span><br />
&nbsp; &nbsp; <span class="re1">rsbcs</span> &nbsp; <span class="kw2">r0</span>, <span class="kw2">r0</span>, #<span class="nu0">0</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">@ Flip sign for odd semi-circles.</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="re2">bx</span>&nbsp; &nbsp; &nbsp; <span class="kw2">lr</span></div>
</div>
<p><h2 id="sec-test">3
Testing
</h2>
</p>
<p>
Deriving approximations is nice and all, but there&#8217;s really no point<br />
unless you do some sort of test to see how well they perform. I&#8217;ll<br />
look at two things: accuracy and some speed-tests. For the speed-test,<br />
I&#8217;ll only consider the functions given here along with some traditional<br />
ones. The accuracy test is done only for the first quadrant and in<br />
floating-point, but the results should carry over well to a fixed-point<br />
case. Finally, I&#8217;ll show how you can optimize the functions for accuracy.
</p>
<p><h3 id="ssec-test-speed">3.1
Third and fourth-order speed
</h3>
</p>
<p>
For the speed test I calculated the sine at 256 points for<br />
<i>x</i>&nbsp;&isin;&nbsp;[0,&nbsp;2&pi;). There will be some loop-overhead<br />
in the numbers, but it should be small. Tests were performed on the<br />
NDS.
</p>
<p>
Functions under investigation are the three <i>S</i><sub>3</sub> and<br />
two <i>S</i><sub>4</sub> functions given earlier. I&#8217;ve also tested<br />
the standard floating-point <code>sin()</code> library function,<br />
the libnds <code>sinLerp()</code> and my own <code>isin()</code><br />
function that you can find in<br />
<a href="http://www.coranac.com/documents/arctangent#ssec-atan-sin">arctan:sine</a>.<br />
The cumulative and average times can be found in<br />
Table&nbsp;1.
</p>
<div class=lblock>
<table id="tbl-isin-speed"<br />
  border=1 cellpadding=3 cellspacing=0><br />
<caption align=bottom>
  <b>Table&nbsp;1</b>: sine cycle-times (roughly).<br />
</caption>
<tr>
<th>Function (Thumb/ARM)</th>
<th>Total cycles</th>
<th>Average cycles</th>
</tr>
<tr class=rnum>
<th>sin (F)</th>
<td>300321</td>
<td>1175.1</td>
</tr>
<tr class=rnum>
<th>sinLerp (T)</th>
<td>10051</td>
<td>39.2</td>
</tr>
<tr class=rnum>
<th>isin (T)</th>
<td>7401</td>
<td>28.9</td>
</tr>
<tr class=rnum>
<th>isin_S3 (T)</th>
<td>5267</td>
<td>20.5</td>
</tr>
<tr class=rnum>
<th>isin_S4 (T)</th>
<td>6456</td>
<td>25.2</td>
</tr>
<tr class=rnum>
<th>isin_S3a (A)</th>
<td>3438</td>
<td>13.4</td>
</tr>
<tr class=rnum>
<th>isin_S3a9 (A)</th>
<td>2591</td>
<td>10.1</td>
</tr>
<tr class=rnum>
<th>isin_S4a9 (A)</th>
<td>3123</td>
<td>12.1</td>
</tr>
</table>
</div>
<p>
The first thing that should be clear is just why we don&#8217;t use the<br />
floating-point sine. I mean, seriously. There is also a clear difference<br />
between the Thumb-compiled and ARM assembly versions, the latter being<br />
significantly faster.
</p>
<p>
Within the compiled versions, I find it interesting to see that the<br />
algorithmic calculations are actually faster than the LUT+lerp-based<br />
implementations. I guess loading all those numbers from memory<br />
really does suck.
</p>
<p>
And <i>then</i> there are the assembly versions. Wow. Compared to the<br />
compiled versions they&#8217;re twice as fast, and up to four times as fast<br />
as the LUT-based functions.
</p>
<p><div class=note>
<div  class=nhcare>NDS timers measure half-cycles</div>
</p>
<p>
The cycle-times from Table&nbsp;1 do not make sense<br />
if you count instruction cycles. For example, for <code>isin_S3a</code><br />
the function overhead alone should already be around 10 cycles. The<br />
thing here is that the numbers are taken from the hardware timers,<br />
which run at the bus frequency (33&nbsp;MHz) rather than the ARM9 CPU frequency (66&nbsp;MHz).<br />
As such, each timer tick covers two CPU cycles, so the table shows half the number of CPU cycles. For details, see<br />
<a href="http://nocash.emubase.de/gbatek.htm/#dsmemorytimings">gbatek:nds-timings</a>.
</p>
<p></div>
</p>
<p><h3 id="ssec-test-acc">3.2
Accuracy
</h3>
</p>
<p>
Fig&nbsp;4 shows all the approximations in one graph.<br />
It only shows one quadrant because the rest can be retrieved by<br />
symmetry. I&#8217;ve also scaled the sine and its approximations by<br />
2<sup>12</sup> because that&#8217;s the usual fixed-point scale<br />
right now. And to be sure, yes, this is a different chart than<br />
Fig&nbsp;1; it&#8217;s just hard to tell because the<br />
fourth and fifth order functions are virtually identical to the<br />
real sine line.
</p>
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-sine-all"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;4. </b>
</div>

</div>
<p>
For the high-accuracy approximations, it&#8217;s better to look at<br />
Fig&nbsp;5, which shows the errors. Here you can<br />
clearly see a difference between <i>S</i><sub>4d</sub> and<br />
<i>S</i><sub>5</sub>: the latter is roughly 3 times better.
</p>
<p>
There&#8217;s also a large difference between the devmaster fourth-order<br />
sine and my own. The reason behind this is a difference in conditions.<br />
In my case, I&#8217;ve fixed the derivatives at both end-points, which<br />
always results in an over- or underestimate. The devmaster&#8217;s<br />
<i>S</i><sub>4d</sub> let go of those conditions and minimized the<br />
error. I&#8217;ll also do this in the next sub-section.
</p>
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-sine-all-err"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;5. </b>
</div>

</div>
<p><div>&nbsp;</div></p>
<p>
Table&nbsp;2 and Table&nbsp;3<br />
list some interesting statistics about<br />
the various approximations, namely the minimum, average and maximum<br />
errors. It also contains a 
<a href="http://en.wikipedia.org/wiki/Root%20Mean%20Square%20Deviation">Root Mean Square Deviation</a> (RMSD), which is a<br />
special kind of distance. If you consider the errors at the data-points as a<br />
vector, the RMSD is the Pythagorean length of that vector divided by the<br />
square root of the number of points.<br />
Table&nbsp;2 is normed to 2<sup>12</sup>, whereas<br />
Table&nbsp;3 gives the same errors as percentages of the traditional<br />
floating-point sine scale.
</p>
<p>
The RMSD values are probably the most useful to look at. From them<br />
you can see that there is a huge gap between the low-accuracy and<br />
high-accuracy functions of about a factor 60. And if you do your<br />
math right, all it costs is one multiplication and one addition,<br />
and maybe some extra shifts in the fixed-point case. That&#8217;s quite<br />
a bargain. Compared to that, the difference between the odd and<br />
even functions is somewhat meager: only a factor three or so.<br />
Still, it is something.
</p>
<p>
If you look at the fixed-point table, you can see that the error<br />
you make with <i>S</i><sub>4d</sub> and<br />
<i>S</i><sub>5</sub> is in the single digits. This means<br />
that this is probably accurate enough for practical purposes.<br />
Combined with the fact that even fifth order polynomials can be<br />
made pretty fast, this makes them worth considering over LUTs.
</p>
<div class=cblock>
<table width=80%>
<tr>
<td>
<table id="tbl-stats" border=1 cellpadding=3 cellspacing=0>
<caption align=bottom>
  <b>Table&nbsp;2</b>: error statistics for 2<sup>12</sup>sin(x) approx.<br />
</caption>
<tr class=top>
<th></th>
<th>min</th>
<th>avg</th>
<th>max</th>
<th>rms</th>
</tr>
<tr class=rnum>
<th>Taylor3</th>
<td>-302.1</td>
<td>-51.5</td>
<td>0</td>
<td>92.7</td>
</tr>
<tr class=rnum>
<th>S2</th>
<td>0</td>
<td>123.1</td>
<td>229.4</td>
<td>146.8</td>
</tr>
<tr class=rnum>
<th>S3</th>
<td>-82.0</td>
<td>-47.6</td>
<td>0</td>
<td>55.0</td>
</tr>
<tr class=rnum>
<th>S4d</th>
<td>-4.47</td>
<td>0.19</td>
<td>3.11</td>
<td>2.44</td>
</tr>
<tr class=rnum>
<th>S4</th>
<td>0</td>
<td>5.87</td>
<td>11.4</td>
<td>7.11</td>
</tr>
<tr class=rnum>
<th>S5</th>
<td>0</td>
<td>0.74</td>
<td>1.62</td>
<td>0.94</td>
</tr>
</table>
</td>
<td>
<table id="tbl-statsp" border=1 cellpadding=3 cellspacing=0>
<caption align=bottom>
  <b>Table&nbsp;3</b>: error statistics in percentages.<br />
</caption>
<tr class=top>
<th></th>
<th>min%</th>
<th>avg%</th>
<th>max%</th>
<th>rms%</th>
</tr>
<tr class=rnum>
<th>Taylor3</th>
<td>-7.37</td>
<td>-1.26</td>
<td>0</td>
<td>2.26</td>
</tr>
<tr class=rnum>
<th>S2</th>
<td>0</td>
<td>3</td>
<td>5.6</td>
<td>3.58</td>
</tr>
<tr class=rnum>
<th>S3</th>
<td>-2</td>
<td>-1.16</td>
<td>0</td>
<td>1.34</td>
</tr>
<tr class=rnum>
<th>S4d</th>
<td>-0.11</td>
<td>0.0047</td>
<td>0.076</td>
<td>0.06</td>
</tr>
<tr class=rnum>
<th>S4</th>
<td>0</td>
<td>0.143</td>
<td>0.278</td>
<td>0.174</td>
</tr>
<tr class=rnum>
<th>S5</th>
<td>0</td>
<td>0.018</td>
<td>0.039</td>
<td>0.023</td>
</tr>
</table>
</td>
</tr>
</table>
</div>
<p><h3 id="ssec-test-opt">3.3
Optimizing higher-order approximations
</h3>
</p>
<p>
From the charts, you can see that <i>S</i><sub>4</sub> and<br />
<i>S</i><sub>5</sub> both err on the same side of the sine line. You<br />
can increase the accuracy of the approximation by tweaking the<br />
coefficients in such a way that the errors are redistributed in<br />
a preferable way. Two methods are possible here: shoot for a zero<br />
error average, or minimize the RMSD. Technically minimizing the<br />
RMSD is standard (it comes down to least-squares optimization), but<br />
because a zero-average allows for an analytical solution, I&#8217;ll use<br />
that. In any case, the differences in outcomes will be small.
</p>
<p><div>&nbsp;</div></p>
<p>
First, think of what an average of a function means. The average<br />
of a set of numbers is the sum divided by the size of the set. For<br />
functions, it&#8217;s the integral of that function divided by the<br />
interval. When you want a zero-average for an approximation, the<br />
integral of the function and that of the approximation should<br />
be equal. With a polynomial approximation to a sine, we get:
</p>
<p><table class="eqtbl" id="eq-cnd-avg0">
<tr>
<td class="eqnrcell">(23)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20%5Cint_0%5E1%20%5Csum_n%20a_n%20x%5En%20dx%20%26%3D%26%20%5Cint_0%5E1%20sin%28x%5Cpi%2F2%29%20dx%20%5C%3B%5C%3B%20%5Crightarrow%20%5C%5C%20%5C%5C%20%5Csum_n%20%5Cfrac%7Ba_n%7D%7Bn%2B1%7D%20%26%3D%26%202%2F%5Cpi%20%2C%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} \int_0^1 \sum_n a_n x^n dx &amp;=&amp; \int_0^1 sin(x\pi/2) dx \;\; \rightarrow \\ \\ \sum_n \frac{a_n}{n+1} &amp;=&amp; 2/\pi , \end{eqnarray}"<br />
	alt="\begin{eqnarray} \int_0^1 \sum_n a_n x^n dx &amp;=&amp; \int_0^1 sin(x\pi/2) dx \;\; \rightarrow \\ \\ \sum_n \frac{a_n}{n+1} &amp;=&amp; 2/\pi , \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p>
with <i>a</i><sub>n</sub> reducing to the coefficients of the<br />
polynomials we had before. This can be used as an alternate condition<br />
to the derivative at 0. For <i>S</i><sub>4</sub> and<br />
<i>S</i><sub>5</sub>, you&#8217;ll end up with the following coefficients.
</p>
<p><!--</p>
<p>
The simplest way to do this is to just dump the data into Excel<br />
and let the Solver do its magic on all the coefficients. A better<br />
way is first see what actually needs to change. We still have some<br />
conditions that need to be satisfied: 0 and 1 at the boundaries<br />
and a maximum at <i>z</i>&nbsp;=&nbsp;1. Only the derivative at 0 is flexible,<br />
and this leads to some restrictions in the search. In fact, it<br />
turns out that only one coefficient needs to be minimized for, and<br />
the rest follow from its value.
</p>
<p>
Taking all that into account, you get the following<br />
coefficients for <i>S</i><sub>4</sub> and<br />
<i>S</i><sub>5</sub>.
</p>
<p>--></p>
<p><table class="eqtbl" id="eq-s4-opt-coef">
<tr>
<td class="eqnrcell">(24)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20a_4%20%26%3D%26%201%20%5C%5C%20b_4%20%26%3D%26%20c_4%2B1%20%5C%5C%20c_4%20%26%3D%26%205%281-%5Cfrac3%5Cpi%29%20%5Capprox%200.225351707%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} a_4 &amp;=&amp; 1 \\ b_4 &amp;=&amp; c_4+1 \\ c_4 &amp;=&amp; 5(1-\frac3\pi) \approx 0.225351707 \end{eqnarray}"<br />
	alt="\begin{eqnarray} a_4 &amp;=&amp; 1 \\ b_4 &amp;=&amp; c_4+1 \\ c_4 &amp;=&amp; 5(1-\frac3\pi) \approx 0.225351707 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
<p><table class="eqtbl" id="eq-s5-opt-coef">
<tr>
<td class="eqnrcell">(25)</td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Cbegin%7Beqnarray%7D%20a_5%20%26%3D%26%204%28%5Cfrac3%5Cpi%20-%20%5Cfrac9%7B16%7D%29%20%5Capprox%201.569718634%20%5C%5C%20b_5%20%26%3D%26%202%20a_5%20-%205%2F2%20%5C%5C%20c_5%20%26%3D%26%20a_5%20-%203%2F2%20%5Cend%7Beqnarray%7D'<br />
	title="\begin{eqnarray} a_5 &amp;=&amp; 4(\frac3\pi - \frac9{16}) \approx 1.569718634 \\ b_5 &amp;=&amp; 2 a_5 - 5/2 \\ c_5 &amp;=&amp; a_5 - 3/2 \end{eqnarray}"<br />
	alt="\begin{eqnarray} a_5 &amp;=&amp; 4(\frac3\pi - \frac9{16}) \approx 1.569718634 \\ b_5 &amp;=&amp; 2 a_5 - 5/2 \\ c_5 &amp;=&amp; a_5 - 3/2 \end{eqnarray}" /><br />
</td>
</tr>
</table></p>
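<p>
A quick way to convince yourself that these coefficients do what they should
is to plug them back into the condition of Eq&nbsp;23. The sketch below does
that; the <code>Coefs</code> struct and all function names are invented for
this example. Note that <i>S</i><sub>4</sub> is handled in its cosine form
(Eq&nbsp;19), whose integral over [0,&nbsp;1] is also 2/&pi;.
</p>

```c
/* Sanity check of Eq 24 and Eq 25 against the zero-average condition of
   Eq 23 (hypothetical helper names, floating point only). */
static const double PI = 3.14159265358979323846;

typedef struct { double a, b, c; } Coefs;

/* Eq 24: cosine form C4(z) = a - b z^2 + c z^4. */
Coefs c4_zero_avg(void)
{
    Coefs k;
    k.c = 5.0 * (1.0 - 3.0 / PI);
    k.b = k.c + 1.0;
    k.a = 1.0;
    return k;
}

/* Eq 25: sine form S5(z) = a z - b z^3 + c z^5. */
Coefs s5_zero_avg(void)
{
    Coefs k;
    k.a = 4.0 * (3.0 / PI - 9.0 / 16.0);
    k.b = 2.0 * k.a - 2.5;
    k.c = k.a - 1.5;
    return k;
}

/* Integral over [0, 1] of a - b z^2 + c z^4. */
double integral_even(Coefs k)
{
    return k.a - k.b / 3.0 + k.c / 5.0;
}

/* Integral over [0, 1] of a z - b z^3 + c z^5. */
double integral_odd(Coefs k)
{
    return k.a / 2.0 - k.b / 4.0 + k.c / 6.0;
}
```

<p>
Both integrals should come out as 2/&pi; up to rounding, and the end-point
conditions still hold: <i>a</i>&minus;<i>b</i>+<i>c</i> is 0 for the cosine
form and 1 for the sine form.
</p>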
<p>
If you&#8217;re still awake and remember the devmaster <i>S</i><sub>4d</sub><br />
coefficients, there should be something familiar about<br />
<i>c</i><sub>4</sub>. Yes, they&#8217;re practically identical. If you<br />
optimize <i>S</i><sub>4</sub> for the RMSD, you actually get the exact<br />
same function as <i>S</i><sub>4d</sub>.
</p>
<p>
Table&nbsp;4 shows the statistics for the original<br />
approximations and the new optimized versions, <i>S</i><sub>4o</sub><br />
and <i>S</i><sub>5o</sub>. The numbers for <i>S</i><sub>4o</sub><br />
are basically those from <i>S</i><sub>4d</sub> seen earlier. More<br />
interesting are the details for <i>S</i><sub>5o</sub>. The maximum<br />
and minimum errors are now within &plusmn;1. That is to say,<br />
this approximation gives values that are at most 1 off from the<br />
proper Q12 sine. This is about as good as any Q12<br />
approximation is able to get.
</p>
<div class=lblock>
<table id="tbl-stats-opt" border=1 cellpadding=3 cellspacing=0>
<caption align=bottom>
  <b>Table&nbsp;4</b>: Optimized Q12 <i>S</i><sub>4</sub> and <i>S</i><sub>5</sub>.<br />
</caption>
<tr>
<th class=top></th>
<th class=top>min</th>
<th class=top>avg</th>
<th class=top>max</th>
<th class=top>rmsd</th>
</tr>
<tr class=rnum>
<th>S4</th>
<td>0</td>
<td>5.87</td>
<td>11.4</td>
<td>7.11</td>
</tr>
<tr class=rnum>
<th>S5</th>
<td>0</td>
<td>0.74</td>
<td>1.616</td>
<td>0.94</td>
</tr>
<tr class=rnum>
<th>S4o</th>
<td>-4.72</td>
<td>0</td>
<td>2.89</td>
<td>2.47</td>
</tr>
<tr class=rnum>
<th>S5o</th>
<td>-0.73</td>
<td>0</td>
<td>0.79</td>
<td>0.52</td>
</tr>
</table>
</div>
<div class=cblock>
<div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-sine-err45-opt"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;6. </b>
</div>

</div>
<p><h2 id="sec-summary">4
Summary and final thoughts
</h2>
</p>
<p>
Here&#8217;s a few things to take from all this.
</p>
<ul>
<li>Symmetry is your friend.</li>
<li>
    When constructing a polynomial approximation, more terms mean<br />
    higher accuracy. Symmetry properties of the function approximated<br />
    allow you to remove terms from consideration, simplifying the<br />
    equation.
  </li>
<li>
    Coordinate transformations are your friends too. Sometimes<br />
    it&#8217;s much easier to work on a scaled or moved version of the<br />
    original problem.<br />
    If your situation has a characteristic length (or time,<br />
	velocity, whatever) consider using dimensionless variables:<br />
	expressing parameters as ratios of the characteristic length.<br />
	This makes the initial units pretty much irrelevant. For<br />
	angles, think circle-fractions.
  </li>
<li>
    Zero and one (0 and 1) are the best values to have in your<br />
	equations, as they tend to vanish so easily.
  </li>
<li>
    Any approximation formula will have coefficients to be determined.<br />
    In general, the Taylor series terms are <i>not</i> the best<br />
    set; values slightly offset from these terms will be better as<br />
    they can correct for the truncation. To determine the values of<br />
    the coefficients, define some conditions that need to be<br />
	satisfied. Examples of conditions are values of the function and<br />
	its derivative at the boundaries, or its integrals. Or you can<br />
	wuss out and just dump the thing in the Excel Solver.
  </li>
<li>
    When converting to fixed-point, accuracy and overflow come into<br />
    the fray. If you know the domain of the function beforehand, you can<br />
    optimize for accuracy. Also, it helps if you construct the<br />
	algorithm in a sort of recursive form instead of a pure<br />
	polynomial: not <i>a</i><i>x</i>&nbsp;+&nbsp;<i>b</i><i>x</i><sup>2</sup><br />
	but	<i>x</i>(<i>a</i>&nbsp;+&nbsp;<i>x</i><i>b</i>). Ordered like this,<br />
	each new additional term only requires one multiplication and<br />
	one addition extra.
  </li>
<li>
    For fixed-point work, <code>SMULWx</code> is teh awesome.
  </li>
<li>
    Even a fourth order (and presumably fifth order as well) polynomial<br />
    implementation in C is faster than the LUT-based sines on the NDS.<br />
	And specialized assembly versions are considerably faster still.
  </li>
<li>
    The difference in accuracy of<br />
    <i>S</i><sub>4</sub> vs <i>S</i><sub>2</sub> or<br />
    <i>S</i><sub>5</sub> vs <i>S</i><sub>3</sub> is huge: a factor of<br />
    60. Going from an even to the next odd approximation only gains you<br />
	a factor 3. Shame; I&#8217;d hoped it&#8217;d be more.
  </li>
<li>
    Contrary to what I initially thought, the even-powered polynomials work<br />
	out quite well. This is because they&#8217;re actually modified cosine<br />
	approximations.
  </li>
</ul>
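<p>
To illustrate the point about the nested form in fixed-point, here is a
minimal sketch of the quintic from Eq&nbsp;17 for the first quadrant only. The
function name, the Q14 input format and the rounded Q14 constants are my own
assumptions, not code from the earlier sections; handling the other quadrants
is still part of the exercise below.
</p>

```c
#include <stdint.h>

/* First-quadrant fifth-order sine in fixed point (hypothetical sketch).
   z: Q14 fraction of a quarter circle, 0 <= z <= 1<<14.
   Returns an approximation of sin(z*pi/2) in Q12. */
int32_t isin_S5_q1(int32_t z)
{
    /* Eq 16 coefficients rounded to Q14; note A - B + C == 1<<14. */
    static const int32_t A = 25736;     /* pi/2       */
    static const int32_t B = 10512;     /* pi - 5/2   */
    static const int32_t C = 1160;      /* pi/2 - 3/2 */

    int32_t z2 = (z * z) >> 14;         /* z^2                   ; Q14 */
    int32_t y  = B - ((z2 * C) >> 14);  /* B - z^2*C             ; Q14 */
    y = A - ((z2 * y) >> 14);           /* A - z^2*(B - z^2*C)   ; Q14 */
    return (z * y) >> 16;               /* z*(...) ; Q14*Q14/Q16 = Q12 */
}
```

<p>
Each extra term adds one multiply, one shift and one subtract, and all
intermediate products stay well inside 32 bits for inputs in the stated range.
</p>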
<h4>Exercises for the reader</h4>
<ol>
<li>
    Express the parabolic approximation <i>S</i><sub>2</sub>(<i>x</i>)<br />
	of Eq&nbsp;1 in terms of <i>z</i>. It&#8217;s not hard, I promise.
  </li>
<li>
    Implement the fixed-point version of the fifth-order sine<br />
	approximation, <i>S</i><sub>5</sub>(<i>x</i>).
  </li>
<li>
    For the masochists: derive the coefficients for <i>S</i><sub>5</sub>(<i>x</i>)<br />
	<i>without</i> dimensionless variables. That is to say, with<br />
	the conditions at <i>x</i>&nbsp;=&nbsp;&frac12;&pi; instead of <i>z</i>&nbsp;=&nbsp;1.
  </li>
<li>
    Solve Eq&nbsp;24 and Eq&nbsp;25 for<br />
    minimal RMSD. Also, try to derive an analytical form for minimal RMSD;<br />
    I think one exists, but it may be tricky to come up with the right form.
  </li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2009/07/sines/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>DMA vs ARM9 &#8211; fight!</title>
		<link>http://www.coranac.com/2009/05/dma-vs-arm9-fight/</link>
		<comments>http://www.coranac.com/2009/05/dma-vs-arm9-fight/#comments</comments>
		<pubDate>Thu, 28 May 2009 21:07:10 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[nds]]></category>
		<category><![CDATA[cache]]></category>
		<category><![CDATA[dma]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=68</guid>
		<description><![CDATA[DMA, or Direct Memory Access, is a hardware method for transferring data. As it&#8217;s hardware-driven, it&#8217;s pretty damn fast(1). As such, it&#8217;s pretty much the standard method for copying on the NDS. Unfortunately, as many people have [...]]]></description>
			<content:encoded><![CDATA[
<p></p>
<p><ul>
  <li> <a href="#sec-arm9">1
The ARM 9 core
</a> </li>
  <li> <a href="#sec-example">2
Cache example
</a> </li>
  <li> <a href="#sec-solu">3
Cache vs DMA solution
</a> </li>
  <li> <a href="#sec-tests">4
Test cases
</a> </li>
  <li> <a href="#sec-conc">5
Conclusions
</a> </li>
</ul>
</p>
<p>
DMA, or Direct Memory Access, is a hardware method for transferring<br />
data. As it&#8217;s hardware-driven, it&#8217;s pretty damn fast<span class="fnote"><a href="#ft-nr1" title="Well,
quite fast anyway. In some circumstances CPU-based transfers
are faster, but that&#8217;s a story for another day.">(1)</a></span>. As such,<br />
it&#8217;s pretty much the standard method for copying on the NDS.<br />
Unfortunately, as many people have noticed, it doesn&#8217;t always work.
</p>
<p>
There are two principal reasons for this: cache and TCM. These are<br />
two memory regions of the ARM9 that DMA is unaware of, which can lead<br />
to incorrect transfers. In this post, I&#8217;ll discuss the cache, TCM and<br />
their interactions (or lack thereof) with DMA.
</p>
<p>
The majority of the post is actually about cache. Cache basically<br />
determines the speed of your app, so it&#8217;s worth looking into in more<br />
detail. Why it and DMA don&#8217;t like each other much will become clear<br />
along the way. I&#8217;ll also present a number of test cases that show<br />
the conflicting areas, and some functions to deal with these problems.
</p>
<p><span id="more-68"></span></p>
<p><h2 id="sec-arm9">1
The ARM 9 core
</h2>
</p>
<p><div class="cptfr" style="width:200px;">
  <a href="" target="_blank">  <img src="" id="img-arm9"
    alt="" width="200" /></a><br />
  <b>Fig&nbsp;1. </b>
</div>
</p>
<p>
The first thing to know is that the DMA trouble only relates to the<br />
ARM9 processor of the NDS. Work with the ARM7 should be fine. The most<br />
relevant items of the ARM9 are illustrated by Fig&nbsp;1.<br />
The processor consists of the actual logic unit, two caches and two<br />
<dfn>Tightly Coupled Memory</dfn> (TCM) units. Of each pair, one is<br />
for data and one for instructions. The point here is<br />
that (as far as I know), these areas are <i>on the chip</i>, and<br />
as such accessible only by the CPU itself. CPU-only, as in not the<br />
DMA controller.
</p>
<p><h3 id="ssec-arm9-tcm">1.1
ITCM and DTCM
</h3>
</p>
<p>
The Instruction and Data TCM areas (ITCM and DTCM) are basically<br />
fast-RAM areas. Technically the addresses of these sections are<br />
arbitrary, but set to the 0100:0000 and 0B00:0000 ranges,<br />
respectively, in libnds. Exactly which addresses they use isn&#8217;t<br />
important though, since that&#8217;s all taken care of by the linker anyway.<br />
What <i>is</i> important is that the stack (where local variables<br />
and function arguments go<span class="fnote"><a href="#ft-nr2" title="Well, sometimes. Usually these go in
CPU registers, but this is not the right place for that discussion
either.">(2)</a></span>) is also put in DTCM. This means that you can&#8217;t<br />
use DMA with local arrays. It also means that you can&#8217;t use a local<br />
variable as a source for a DMA-fill. This is why the NDS ARM9 has<br />
special DMA registers for these, called<br />
<a href="http://nocash.emubase.de/gbatek.htm#dsdmatransfers">REG_DMA<i>n</i>FILL</a>.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">#define</span> DMA_FILL16&nbsp; (DMA_ENABLE | DMA_START_NOW | DMA_SRC_FIX)</p>
<p><span class="co1">// This doesn&#8217;t work on ARM9: `fill&#8217; is in unreachable DTCM.</span><br />
<span class="co1">// Should work on ARM7 though.</span><br />
<span class="kw1">void</span> dmaFill_bad()<br />
{<br />
&nbsp; &nbsp; <span class="kw1">volatile</span> u16 fill= <span class="nu0">0</span>;<br />
&nbsp; &nbsp; REG_DMA3SAD= &amp;fill;<br />
&nbsp; &nbsp; REG_DMA3DAD= VRAM_A;<br />
&nbsp; &nbsp; REG_DMA3CNT= DMA_FILL16 | <span class="nu0">256</span>*<span class="nu0">192</span>;<br />
}</p>
<p><span class="co1">// This does work on ARM9, </span><br />
<span class="co1">// but not ARM7 which has no REG_DMAnFILL registers.</span><br />
<span class="kw1">void</span> dmaFill_good()<br />
{<br />
&nbsp; &nbsp; REG_DMA3FILL= <span class="nu0">0</span>;<br />
&nbsp; &nbsp; REG_DMA3SAD= &amp;REG_DMA3FILL;<br />
&nbsp; &nbsp; REG_DMA3DAD= VRAM_A;<br />
&nbsp; &nbsp; REG_DMA3CNT= DMA_FILL16 | <span class="nu0">256</span>*<span class="nu0">192</span>;<br />
}</div>
</div>
<p>
The DMA-fill routines in libnds correctly use <code>REG_DMAnFILL</code><br />
so fortunately you don&#8217;t have to worry about that. However, copying from<br />
local arrays is still impossible, and there&#8217;s just no way around that<br />
using just DMA.
</p>
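<p>
What you can do is let the CPU copy the local data to a buffer in main RAM
first, and DMA from there. The following is a minimal host-side sketch of
that idea, not libnds code. It assumes the libnds DTCM base mentioned above
and the 16 KB DTCM size listed by GBATEK; <code>in_dtcm</code>,
<code>stage_for_dma</code> and <code>g_stage</code> are hypothetical names
I made up for the illustration.
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Assumed DTCM placement: base as set by libnds (see text), 16 KB in size.
#define DTCM_BASE  0x0B000000u
#define DTCM_SIZE  0x00004000u

// Hypothetical helper: nonzero if [addr, addr+size) overlaps DTCM,
// i.e. if DMA cannot reach (part of) the range.
static int in_dtcm(uint32_t addr, uint32_t size)
{
    uint32_t end = addr + size;              // one past the last byte
    return addr < DTCM_BASE + DTCM_SIZE && end > DTCM_BASE;
}

// Staging buffer. It is static, so it does not live on the stack;
// the assumption is that the linker places it in main RAM.
static uint16_t g_stage[256];

// Hypothetical helper: copy local data into the staging buffer with the
// CPU and return the address to hand to the DMA source register.
static const uint16_t *stage_for_dma(const uint16_t *src, uint32_t count)
{
    assert(count <= sizeof(g_stage) / sizeof(g_stage[0]));
    memcpy(g_stage, src, count * sizeof(uint16_t));
    return g_stage;
}
```

<p>
Note that the staged copy is written by the CPU, so it can end up in the data
cache. That means the cache problem discussed in the rest of this post still
applies to the staging buffer.
</p>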
<p><h3 id="ssec-arm9-cache">1.2
Cache
</h3>
</p>
<p>
The cache for the ARM9 is a little more complicated. Or perhaps<br />
&ldquo;obscured&rdquo; is a better term here. The TCMs have<br />
addresses, so you can toy with them yourself. The cache, however,<br />
is completely hidden from the view of the user.<br />
Before going into detail about how the cache works and why DMA<br />
and cache hate each other, let&#8217;s look at why cache is useful,<br />
especially in light of you not being able to use it directly.
</p>
<p><div>&nbsp;</div></p>
<p>
Where there is memory, there are waitstates. Generally, a CPU can handle<br />
data faster than the RAM can supply it, so the CPU will have to wait<br />
until it can continue. The slowdown can easily be a factor of 100 on<br />
PCs, or even millions if you include disk memory. Fortunately, it&#8217;s<br />
only about ten for the NDS, I think, but that&#8217;s still quite a bit.
</p>
<p>
Cache is one method of getting around memory waitstates. Instead of<br />
having to go to RAM all the time for something, you store recently used<br />
data in an area that the CPU can have faster access to. Then next<br />
time it needs that data, it can retrieve it from there instead of<br />
going to RAM again. For good measure, the area around the data is also<br />
cached, because that might be accessed soon as well. Since<br />
closely-related data is often stored closely together as well, the<br />
caching process can significantly increase the overall speed of an<br />
application.
</p>
<p><h3 id="ssec-arm9-ndscache">1.3
The NDS ARM9 cache
</h3>
</p>
<p>
<a href="http://nocash.emubase.de/gbatek.htm#dsmemorycontrolcacheandtcm">GBATEK</a> gives us the<br />
following information about the cache that the NDS has:
</p>
<blockquote><p>
  Data Cache 4KB, Instruction Cache 8KB<br />
  4-way set associative method<br />
  Cache line 8 words (32 bytes)<br />
  Read-allocate method (ie. writes are not allocating cache lines)<br />
  Round-robin and Pseudo-random replacement algorithms selectable<br />
  Cache Lockdown, Instruction Prefetch, Data Preload<br />
  Data write-through and write-back modes selectable
</p></blockquote>
<p>
Which to most people will probably mean absolutely nothing. Now, I&#8217;m<br />
not exactly an expert in all things cache, but I&#8217;ll try to explain what<br />
it all means.
</p>
<p>
First, as noted earlier, there are actually two caches: one for<br />
data and one for instructions. Having a separate instruction cache is nice<br />
because then you can be sure that a function that processes a lot of<br />
data won&#8217;t push that function itself out of the cache. Instruction cache<br />
also means that loops in code will be in cache except for perhaps the<br />
first iteration. Effectively, all the code that really matters (i.e.,<br />
inner loops that do most of the work) will always be in fast memory<br />
automatically.
</p>
<p>
It is common that cache works in groups of bytes instead of individual<br />
bytes. These groups are the <dfn>cache lines</dfn>. A cache line maps<br />
onto a RAM chunk of the same size, and if anything within a chunk<br />
is to be put in cache, the whole line will be filled. The NDS cache<br />
lines are 32 bytes long.
</p>
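<p>
To make the line business a bit more concrete, here is a small sketch of
how an address splits up for a cache with the GBATEK numbers quoted above.
The helper names are mine, and the set calculation assumes the usual
set-associative indexing scheme; I have not verified that part on hardware.
</p>

```c
#include <assert.h>
#include <stdint.h>

// NDS data cache parameters, taken from the GBATEK list in the text.
#define CACHE_SIZE   4096u                      // 4 KB data cache
#define LINE_SIZE    32u                        // 8 words per line
#define NUM_WAYS     4u                         // 4-way set associative
#define NUM_LINES    (CACHE_SIZE / LINE_SIZE)   // 128 lines in total
#define NUM_SETS     (NUM_LINES / NUM_WAYS)     // 32 sets of 4 lines

// Start address of the 32-byte RAM chunk that contains addr.
static uint32_t line_base(uint32_t addr)
{
    return addr & ~(LINE_SIZE - 1u);
}

// Byte offset of addr within its chunk.
static uint32_t line_offset(uint32_t addr)
{
    return addr & (LINE_SIZE - 1u);
}

// Set that the chunk maps to, assuming standard indexing:
// consecutive chunks go to consecutive sets, wrapping after NUM_SETS.
static uint32_t set_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}
```

<p>
For example, address 0200:0047 sits at offset 7 in the chunk starting at
0200:0040. Reading that one byte on a cache miss pulls in all 32 bytes from
0200:0040 up to 0200:005F.
</p>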
<p>
This is probably a good time to introduce two important terms:<br />
cache hit and cache miss. A <dfn>cache hit</dfn> is when the data<br />
you&#8217;re looking for is already in cache and so access is fast. A<br />
<dfn>cache miss</dfn> is when it&#8217;s not in cache. This means two things.<br />
First, the access will be slow thanks to the memory waitstates.<br />
Second, if this triggers a cache-line fill, you&#8217;ll have to wait for<br />
the <i>entire cache line</i> to be read. While this block read will<br />
be faster than if you were reading the block without cache, it&#8217;ll<br />
still take longer than getting just the byte or so you were looking<br />
for. Moral of the story: cache hit good, cache miss bad.
</p>
<p>
Cache hits and misses add a consequence to how your data is stored.<br />
If data is tightly packed and sequential (think structs/arrays), you&#8217;re<br />
more likely to have cache hits and work will be fast. If the data is<br />
all over the place (linked lists for example), the chance of cache<br />
misses increases dramatically.
</p>
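<p>
A rough way to see the difference is to count how many cache lines a series
of accesses touches. The sketch below does just that for items spaced a fixed
number of bytes apart. <code>lines_touched</code> is a hypothetical helper,
and it only counts lines; it says nothing about real timings.
</p>

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 32u   // NDS cache line size in bytes

// Count how many different cache lines are touched when reading `count`
// items spaced `stride` bytes apart, starting at `base`.
// Every newly touched line is a potential cache miss.
static uint32_t lines_touched(uint32_t base, uint32_t count, uint32_t stride)
{
    uint32_t lines = 0;
    uint32_t last = 0xFFFFFFFFu;             // sentinel: no line seen yet
    for (uint32_t i = 0; i < count; i++) {
        uint32_t line = (base + i * stride) / LINE_SIZE;
        if (line != last) {                  // addresses only ever go up
            lines++;
            last = line;
        }
    }
    return lines;
}
```

<p>
With these numbers, 64 consecutive words (stride 4) touch only 8 lines,
while 64 list nodes that happen to be 64 bytes apart touch 64 lines: one
potential miss per node.
</p>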
<p><div>&nbsp;</div></p>
<p>
That is how cached or non-cached data operates. What&#8217;s also important<br />
is when data is put in cache in the first place &ndash; when cache<br />
allocation occurs. There are two types here: <dfn>read-allocate</dfn><br />
or <dfn>write-allocate</dfn>. These terms refer to whether a<br />
cache-line will be tied to a memory block when it&#8217;s read from, or when<br />
it&#8217;s written to, respectively. As you can see from the GBATEK data,<br />
the NDS cache is read-allocate. A memory-write will not require a new<br />
cache line.
</p>
<p>
Now, suppose a block is in cache and something is written to that block.<br />
This write will update the data in cache, but what about the RAM it&#8217;s<br />
tied to? The process dealing with this is called the <dfn>write policy</dfn>,<br />
and two options exist. There&#8217;s <dfn>write-through</dfn>, which means that<br />
both cache and RAM are written to. In <dfn>write-back</dfn> mode, only the<br />
cached data is changed; RAM is <i>not</i> updated! This is the main<br />
cause of trouble with DMA. Apparently, the write policy is selectable,<br />
but it&#8217;ll usually be write-back.
</p>
<p><div>&nbsp;</div></p>
<p>
Lastly, there&#8217;s the <dfn>replacement policy</dfn>, which stipulates<br />
how cache lines relate to RAM, and when to kick data out of cache.<br />
Unfortunately I don&#8217;t know much about this part, but it&#8217;s of lesser<br />
importance anyway. The rest of the terms do not affect the potential<br />
cache-DMA conflict either. For more details, visit the Wikipedia page<br />
on 
<a href="http://en.wikipedia.org/wiki/CPU_cache">CPU cache</a>.
</p>
<p><h2 id="sec-example">2
Cache example
</h2>
</p>
<p>
At this point I think it&#8217;s useful to give an example of how it works<br />
in practice. For this, I will use a fictional CPU that uses a cache with<br />
the following properties.
</p>
<ul>
<li>2 cache lines, 4 bytes each.</li>
<li>Read-allocate and write-back.</li>
<li>Direct mapped cache: RAM-block <i>n</i> goes into cache-line<br />
    <i>n</i>%2. In other words, even blocks go to line 0, odd blocks<br />
    to line 1.
  </li>
</ul>
<p>
The next set of pictures illustrates what happens when you do a<br />
number of reads and writes. Fig&nbsp;2 shows the basic<br />
system in the initial state. There is the CPU on the left with the<br />
core and two cache lines. I&#8217;ve also included a register called <i>x</i><br />
here for convenience. On the right there are 16 bytes of RAM,<br />
distributed over four 4-byte blocks (the equivalents of the cache<br />
lines). RAM is already initialized; cache is still empty. In the<br />
figures, green is used to indicate reads from RAM (loads) and purple<br />
for writes (stores).
</p>
<div class="cblock">
<table cellpadding=4>
<tbody valign=top>
<tr>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg00"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;2. </b>
</div>

  </td>
<td>
    <div class="cpt" style="width:300px;">
  <a href="http://www.coranac.com/img/post/arm9vsdma/cache_eg01.png" target="_blank">  <img src="http://www.coranac.com/img/post/arm9vsdma/cache_eg01.png" id="img-eg01"
    alt="" width="300" /></a><br />
  <b>Fig&nbsp;3. </b>RAM[0] is written to. No change to cache.
</div>

  </td>
</tr>
<tr>
<td>
    <div class="cpt" style="width:300px;">
  <a href="http://www.coranac.com/img/post/arm9vsdma/cache_eg02.png" target="_blank">  <img src="http://www.coranac.com/img/post/arm9vsdma/cache_eg02.png" id="img-eg02"
    alt="" width="300" /></a><br />
  <b>Fig&nbsp;4. </b>Read from RAM[1]. Cache-line 0 = RAM Block 0.
</div>

  </td>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg03"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;5. </b>
</div>

  </td>
</tr>
<tr>
<td>
    <div class="cpt" style="width:300px;">
  <a href="http://www.coranac.com/img/post/arm9vsdma/cache_eg04.png" target="_blank">  <img src="http://www.coranac.com/img/post/arm9vsdma/cache_eg04.png" id="img-eg04"
    alt="" width="300" /></a><br />
  <b>Fig&nbsp;6. </b>Read from RAM[3]. This was in cache, so data&#8217;s read from there, not RAM.
</div>

  </td>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg05"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;7. </b>
</div>

  </td>
</tr>
<tr>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg06"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;8. </b>
</div>

  </td>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg07a"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;9. </b>
</div>

  </td>
</tr>
</tbody>
</table>
</div>
<ol start=0>
<li>
    Initial state.
  </li>
<li>
    <code>RAM[0]= 'R'</code>. A write to RAM does not trigger cache<br />
    allocation, so it goes straight to RAM. Slowly.
  </li>
<li>
    <code>x= RAM[1]</code>. The read from RAM[1] causes a cache<br />
	allocation. RAM[1] is part of block 0 (even), so that goes into<br />
	cache line 0. Line 0 and block 0 are identical.<br />
	Cache miss + allocation; very slow.
  </li>
<li>
    <code>RAM[2]= RAM[3]= 'S'</code>. Two writes; this is where it<br />
    gets tricky. These addresses are in cache (block 0), and I<br />
	said this cache was in write-back mode. This means that the writes<br />
	go to the cache, but <i>NOT</i> the actual RAM. So now the data<br />
	in cache is different from the equivalent block in RAM: the RAM&#8217;s<br />
	gone <dfn>stale</dfn>. This is a cache hit; a fast write.
  </li>
<li>
    <code>x= RAM[3]</code>. Again, RAM[3] is cached, so data is taken<br />
    from cache instead of from RAM. <i>x</i> is now <code>'S'</code>,<br />
    as was expected from the last statement. The fact the real RAM[3]<br />
    is different doesn&#8217;t matter, because the CPU doesn&#8217;t look there<br />
	anyway. If something <i>other</i> than the CPU (like, say, DMA)<br />
	reads from RAM[3], though, chaos ensues. Cache hit, fast read.
  </li>
<li>
    <code>x= RAM[4]</code>. RAM[4] is in block 1, which wasn&#8217;t cached<br />
    yet, so a new line is allocated. Block 1 goes into line 1, because<br />
    it&#8217;s an odd-numbered block. Very slow operation.
  </li>
<li>
    <code>RAM[4]= RAM[5]= 'T'</code>. Much like before, the writes go<br />
    into cache rather than the real RAM. Another cache hit.
  </li>
<li>
    <code>x= RAM[8]</code>. This is also an interesting case. Two<br />
	things happen here. RAM[8] belongs to block 2 (even), which hadn&#8217;t<br />
	been cached yet. It&#8217;s supposed to go into line 0, but that&#8217;s<br />
	already filled. The new data will replace the old data. Cache line 0 is<br />
	tied to block 0, so addresses 0 through 3 will be filled with the<br />
	data from line 0; this block is now up to date again. After that,<br />
	line 0 receives the data from block 2. Cache write-out + new<br />
	allocation; this should be awful.
  </li>
</ol>
<p>
This should cover all important cases: reads/writes to non-cached<br />
addresses, to cached addresses and a little bit about allocation<br />
and replacements. At some points, cache and RAM start to disagree. This<br />
wouldn&#8217;t be a problem if RAM was only accessed by the CPU, but<br />
unfortunately it isn&#8217;t.
</p>
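<p>
For those who prefer code over pictures, the fictional CPU can be modelled
in a few lines of C. This is only a sketch of the toy machine described
above, not of the real ARM9 cache. All names are made up, and the
<code>dirty</code> flag is my own addition to track when a line differs
from RAM.
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  4    // bytes per cache line
#define NUM_LINES  2    // direct mapped: block n goes to line n%2
#define RAM_SIZE   16

typedef struct {
    int     valid;              // line currently holds a block
    int     dirty;              // line differs from RAM (RAM is stale)
    int     block;              // which RAM block the line is tied to
    uint8_t data[LINE_SIZE];
} CacheLine;

typedef struct {
    uint8_t   ram[RAM_SIZE];
    CacheLine line[NUM_LINES];
} Machine;

// Returns the line that caches addr, or NULL on a cache miss.
static CacheLine *lookup(Machine *m, int addr)
{
    int block = addr / LINE_SIZE;
    CacheLine *cl = &m->line[block % NUM_LINES];
    return (cl->valid && cl->block == block) ? cl : NULL;
}

// CPU read: a miss allocates a line (read-allocate).
static uint8_t cpu_read(Machine *m, int addr)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    CacheLine *cl = lookup(m, addr);
    if (!cl) {
        int block = addr / LINE_SIZE;
        cl = &m->line[block % NUM_LINES];
        if (cl->valid && cl->dirty)      // write the old block back first
            memcpy(&m->ram[cl->block * LINE_SIZE], cl->data, LINE_SIZE);
        memcpy(cl->data, &m->ram[block * LINE_SIZE], LINE_SIZE);
        cl->valid = 1;
        cl->dirty = 0;
        cl->block = block;
    }
    return cl->data[addr % LINE_SIZE];
}

// CPU write: a hit only changes the cache (write-back);
// a miss goes straight to RAM without allocating a line.
static void cpu_write(Machine *m, int addr, uint8_t value)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    CacheLine *cl = lookup(m, addr);
    if (cl) {
        cl->data[addr % LINE_SIZE] = value;
        cl->dirty = 1;
    } else {
        m->ram[addr] = value;
    }
}

// DMA read: bypasses the cache and sees RAM as it is, stale or not.
static uint8_t dma_read(const Machine *m, int addr)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    return m->ram[addr];
}
```

<p>
Replaying steps 1 to 4 on this model gives the same picture as the figures:
after the writes of step 3, <code>cpu_read(&amp;m, 3)</code> returns
<code>'S'</code>, while <code>dma_read(&amp;m, 3)</code> still returns
whatever was in RAM[3] before.
</p>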
<p><div class=note>
<div  class=nhcare>On cache timings</div>
</p>
<p>
To be completely honest, I have not really tested the cycle-times<br />
for the various cases. All I have to go on is<br />
<a href="http://nocash.emubase.de/gbatek.htm#dsmemorytimings">gbatek:memory timings</a> and<br />
educated guesswork. The estimates <i>should</i> make sense, but I<br />
don&#8217;t have much in the way of evidence at present, nor would I know<br />
exactly how to get that in the first place, as experimenting with<br />
cache can be tricky.
</p>
<p></div>
</p>
<p><h2 id="sec-solu">3
Cache vs DMA solution
</h2>
</p>
<p>
Fig&nbsp;5 and Fig&nbsp;8 illustrate the main<br />
problem. The data in RAM is out of date and when DMA tries to read<br />
it, it actually uses the wrong data. The reverse is also possible.<br />
DMA could write to RAM that had been cached; in this case it&#8217;s actually<br />
the cache that&#8217;s out of date.
</p>
<p>
The solution to this is to align cache and RAM manually. The two<br />
actions involved are called flushing and invalidating. A<br />
<dfn>cache flush</dfn> dumps the contents of cache back into RAM. Now<br />
that cache and RAM contain the same data again, it&#8217;s safe to DMA-read<br />
from. An <dfn>invalidate</dfn> tells the CPU to simply delete cache<br />
lines, because its assumptions regarding the contents of the original<br />
RAM have become invalid. The next CPU-read would come from RAM again.<br />
This is what you need after DMA writes to RAM.
</p>
<p>
Fig&nbsp;10 and Fig&nbsp;11 pick up from<br />
case 6 (Fig&nbsp;8). They show what happens when you<br />
flush or invalidate a cache line. You actually supply a RAM block<br />
number because the cache lines themselves are completely hidden from<br />
view.
</p>
<ol start=7>
<li>
    <b>Flush block 0</b>. In this case, you want to synchronize RAM<br />
  block 0 to the cache. Block 0 is indeed in cache, using line 0,<br />
  so the contents of line 0 are written back to block 0.
  </li>
<li>
    <b>Invalidate block 1</b>. Suppose that previously, some contents of<br />
    block 1 had been written to without cache&#8217;s knowledge, so that cache<br />
    is out of date. The invalidate throws away the line related to<br />
	block 1, making RAM the primary source for the block again.
  </li>
</ol>
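<p>
In code, the two operations could look something like this for the fictional
two-line cache. Again, this is a sketch with made-up names, not what libnds
or the hardware does internally. One assumption to flag: in this model a
flushed line stays in cache afterwards, which is a detail the description
above doesn&#8217;t settle.
</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  4    // bytes per cache line
#define NUM_LINES  2    // direct mapped: block n goes to line n%2
#define RAM_SIZE   16

typedef struct {
    int     valid;              // line currently holds a block
    int     dirty;              // line has writes that RAM hasn't seen
    int     block;              // which RAM block the line is tied to
    uint8_t data[LINE_SIZE];
} CacheLine;

static uint8_t   ram[RAM_SIZE];
static CacheLine cache[NUM_LINES];

// Find the line holding `block`, or NULL if the block is not cached.
static CacheLine *find_line(int block)
{
    CacheLine *cl = &cache[block % NUM_LINES];
    return (cl->valid && cl->block == block) ? cl : NULL;
}

// Flush: write the cached copy back to RAM, so RAM is up to date again.
// Needed BEFORE something other than the CPU reads that RAM.
static void flush_block(int block)
{
    CacheLine *cl = find_line(block);
    if (cl && cl->dirty) {
        memcpy(&ram[block * LINE_SIZE], cl->data, LINE_SIZE);
        cl->dirty = 0;
    }
}

// Invalidate: drop the line without writing anything back, so the next
// CPU read comes from RAM. Needed AFTER something else wrote to that RAM.
// Any CPU writes that only lived in the cache line are lost.
static void invalidate_block(int block)
{
    CacheLine *cl = find_line(block);
    if (cl)
        cl->valid = 0;
}
```

<p>
Note the last comment: an invalidate throws away whatever was in the line,
including data that had not made it to RAM yet. Flushing or invalidating a
block that isn&#8217;t cached simply does nothing.
</p>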
<div class="cblock">
<table cellpadding=4>
<tr>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg07b"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;10. </b>
</div>

  </td>
<td>
    <div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-eg08"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;11. </b>
</div>

  </td>
</tr>
</table>
</div>
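<p>
The two operations can also be played through in a toy model. The sketch below is plain C for a desktop compiler, not NDS code. It assumes a single RAM block shadowed by a single write-back cache line, and every name in it (<code>ToyCache</code>, <code>cpu_read()</code>, <code>dma_write()</code> and so on) is made up for illustration. Its <code>flush()</code> only writes back; whether a real flush also drops the line is left out of the model.
</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, hypothetical names: one RAM block and one cache line. */
typedef struct {
    uint32_t ram;     /* what DMA sees                    */
    uint32_t line;    /* what the CPU sees while 'valid'  */
    bool     valid;   /* the line currently holds data    */
    bool     dirty;   /* the line is newer than RAM       */
} ToyCache;

/* CPU read: allocate the line on a miss (read-allocate). */
static uint32_t cpu_read(ToyCache *c)
{
    if (!c->valid) { c->line = c->ram; c->valid = true; c->dirty = false; }
    return c->line;
}

/* CPU write: goes to the line if cached, otherwise straight to RAM. */
static void cpu_write(ToyCache *c, uint32_t v)
{
    if (c->valid) { c->line = v; c->dirty = true; }
    else          { c->ram = v; }
}

/* DMA bypasses the cache entirely. */
static uint32_t dma_read(const ToyCache *c)        { return c->ram; }
static void     dma_write(ToyCache *c, uint32_t v) { c->ram = v; }

/* Flush: write a dirty line back to RAM. */
static void flush(ToyCache *c)
{
    if (c->valid && c->dirty) { c->ram = c->line; c->dirty = false; }
}

/* Invalidate: drop the line without writing it back. */
static void invalidate(ToyCache *c)
{
    c->valid = false;
    c->dirty = false;
}

static void toy_cache_demo(void)
{
    ToyCache c = { .ram = 1 };      /* nothing cached yet             */

    (void)cpu_read(&c);             /* miss: line allocated, holds 1  */
    cpu_write(&c, 2);               /* hit: only the line changes     */
    assert(dma_read(&c) == 1);      /* DMA still sees the old value   */
    flush(&c);                      /* flush BEFORE the DMA read      */
    assert(dma_read(&c) == 2);

    dma_write(&c, 3);               /* RAM changes behind the cache   */
    assert(cpu_read(&c) == 2);      /* CPU still sees the old line    */
    invalidate(&c);                 /* invalidate AFTER the DMA write */
    assert(cpu_read(&c) == 3);      /* miss: re-read from RAM         */
}
```

<p>
Calling <code>toy_cache_demo()</code> walks through both halves: a CPU write followed by a DMA read needs the flush, and a DMA write followed by a CPU read needs the invalidate.
</p>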
<p><div class=note>
<div  class=nhgood>When to Flush/Invalidate</div>
</p>
<ul>
<li>
    A cache flush writes cached data back to RAM.<br />
	This is required <i>before</i> DMA-reads.
  </li>
<li>
    A cache invalidate frees cache lines, causing the next read<br />
	to be from RAM. This is required <i>after</i> DMA-writes.
  </li>
</ul>
<p>
Get the operation or the timing wrong,<br />
<a href="http://www.youtube.com/watch?v=NjLUMG4Kpf4"><i>and<br />
they dock ya</i></a>!</p>
<p><small>I mean, uhm &hellip; and you get memory corruption. Yeah.</small>
</p>
<p></div>
</p>
<p><h3 id="ssec-solu-libnds">3.1
libnds cache functions
</h3>
</p>
<p>
libnds contains functions to<br />
<a href="http://libnds.devkitpro.org/a00065.html">flush or invalidate</a>.<br />
They can either affect the whole cache, or just certain address ranges.<br />
I am unsure of the timings of these functions, but I expect there will<br />
be a cost. Invalidating itself could be fast, but it&#8217;d make all<br />
subsequent reads cache misses. A flush would require a large number of<br />
writes to memory. To keep these costs down, use the ranged versions<br />
as much as possible.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">//! Flush the entire data cache to memory.</span><br />
<span class="kw1">void</span> DC_FlushAll()<br />
<span class="co1">//! Flush the data cache for a range of addresses to memory.</span><br />
<span class="kw1">void</span> DC_FlushRange(<span class="kw1">const</span> <span class="kw1">void</span> *base, u32 size)</p>
<p><span class="co1">//! Invalidate the entire data cache.</span><br />
<span class="kw1">void</span> DC_InvalidateAll()<br />
<span class="co1">//! Invalidate the data cache for a range of addresses.</span><br />
<span class="kw1">void</span> DC_InvalidateRange(<span class="kw1">const</span> <span class="kw1">void</span> *base, u32 size)</p>
<p><span class="co1">//! Invalidate entire instruction cache.</span><br />
<span class="kw1">void</span> IC_InvalidateAll()<br />
<span class="co1">//! Invalidate the instruction cache for a range of addresses. </span><br />
<span class="kw1">void</span> IC_InvalidateRange(<span class="kw1">const</span> <span class="kw1">void</span> *base, u32 size)</div>
</div>
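<p>
To get a rough feel for what a ranged call has to work through, you can count the cache lines a range overlaps. This is a minimal sketch in plain C, assuming the 32-byte line size; <code>cache_lines_touched()</code> is a hypothetical helper and not part of libnds, and it says nothing about actual cycle counts.
</p>

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* data cache line size in bytes */

/* Number of cache lines that overlap [base, base+size).
   Hypothetical helper for estimating the work of a ranged operation. */
static uint32_t cache_lines_touched(uint32_t base, uint32_t size)
{
    if (size == 0)
        return 0;
    assert(base + size - 1u >= base);   /* range must not wrap around */

    uint32_t first = base & ~(CACHE_LINE_SIZE - 1u);               /* line of first byte */
    uint32_t last  = (base + size - 1u) & ~(CACHE_LINE_SIZE - 1u); /* line of last byte  */
    return (last - first) / CACHE_LINE_SIZE + 1u;
}
```

<p>
A line-aligned 512-byte buffer, such as a 16&times;16 bitmap of 16-bit pixels, covers 16 lines. The same buffer starting 4 bytes later straddles 17.
</p>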
<p><h3 id="ssec-solu-proc">3.2
Some safe DMA functions
</h3>
</p>
<p>
To guard against potential DMA failures, it&#8217;s useful to have a few<br />
functions that take care of those themselves. <code>dmaCopySafe()</code><br />
and <code>dmaFillSafe()</code> check if DMA can reach the source and<br />
destination regions and will return <code>false</code> if not. They<br />
also check what chunk-size is appropriate by looking at the source,<br />
destination and size. Odd addresses fail completely; word-aligned addresses<br />
and sizes use 32-bit transfers, and everything else uses 16-bit transfers.<br />
They also flush and invalidate where appropriate.
</p>
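<p>
The chunk-size rule on its own looks roughly like this. It is a sketch in plain C that only restates the alignment tests; <code>dma_unit()</code> is a made-up name, and returning 0 for &ldquo;refuse&rdquo; is a convention chosen for this example.
</p>

```c
#include <assert.h>
#include <stdint.h>

/* Pick the DMA transfer unit in bytes: 4 (words), 2 (half-words),
   or 0 when an address is odd and the transfer must be refused.
   Hypothetical helper, not part of libnds. */
static unsigned dma_unit(uint32_t srca, uint32_t dsta, uint32_t size)
{
    if ((srca | dsta) & 1u)          /* odd address: no byte-wise DMA      */
        return 0;
    if ((srca | dsta | size) & 3u)   /* something is not a multiple of 4   */
        return 2;
    return 4;                        /* everything is word-aligned         */
}

static void dma_unit_examples(void)
{
    assert(dma_unit(0x02000000u, 0x06000000u, 512u) == 4);  /* all aligned        */
    assert(dma_unit(0x02000002u, 0x06000000u, 512u) == 2);  /* half-word source   */
    assert(dma_unit(0x02000000u, 0x06000000u, 510u) == 2);  /* size not mult of 4 */
    assert(dma_unit(0x02000001u, 0x06000000u, 512u) == 0);  /* odd source         */
}
```

<p>
OR-ing the addresses and the size together means a single mask test covers all three at once.
</p>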
<p>
Note that for a completely safe version, you&#8217;d need much more checking.<br />
For example, each region has its own size that would have to be looked<br />
at, and some sections are read-only like ROM. Checking for all<br />
possibilities, however, would make the functions too unwieldy, so those<br />
checks have been omitted.
</p>
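<p>
For the curious, a fuller check could be table-driven, along these lines. This is only a sketch in plain C: the table is illustrative and deliberately incomplete, the two entries assume a retail unit (4 MB of main RAM at 0x02000000, up to 32 MB of read-only GBA-slot ROM at 0x08000000), mirroring is ignored, and <code>Region</code> and <code>range_ok()</code> are hypothetical names.
</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t base;
    uint32_t size;
    bool     writable;
} Region;

/* Illustrative, incomplete table; a real one would list every region. */
static const Region regions[] = {
    { 0x02000000u, 0x00400000u, true  },   /* main RAM, 4 MB                 */
    { 0x08000000u, 0x02000000u, false },   /* GBA-slot ROM, 32 MB, read-only */
};

/* True if [addr, addr+size) lies inside one listed region and, when
   'writing' is set, that region is writable. */
static bool range_ok(uint32_t addr, uint32_t size, bool writing)
{
    assert(size > 0);
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++) {
        const Region *r = &regions[i];
        if (addr < r->base)
            continue;
        uint32_t off = addr - r->base;
        /* Compared this way round to avoid overflow in addr + size. */
        if (off < r->size && size <= r->size - off)
            return r->writable || !writing;
    }
    return false;
}
```

<p>
Even this small version needs a table and a loop per pointer, which is the kind of bulk the functions below avoid.
</p>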
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">//! Copy data from a src to a dst via DMA in a cache/section safe manner.</span><br />
<span class="coMULTI">/*! The ARM9&#8217;s DMA doesn&#8217;t play well with the cache and can&#8217;t access <br />
&nbsp; &nbsp; ITCM or DTCM. This means that a basic DMA copy may not work as <br />
&nbsp; &nbsp; expected. This function flushes or invalidates cache if necessary and <br />
&nbsp; &nbsp; will only copy if the ranges are accessible.<br />
&nbsp; &nbsp; param src &nbsp; Source pointer.<br />
&nbsp; &nbsp; param dst &nbsp; Destination pointer.<br />
&nbsp; &nbsp; param size&nbsp; Size (in bytes) to copy.<br />
&nbsp; &nbsp; return&nbsp; &nbsp; &nbsp; True if the copy succeeded.<br />
&nbsp; &nbsp; note&nbsp; &nbsp; &nbsp; &nbsp; It&#8217;s possible I missed some invalid cases, YHBW.<br />
*/</span><br />
<span class="kw1">bool</span> dmaCopySafe(<span class="kw1">const</span> <span class="kw1">void</span> *src, <span class="kw1">void</span> *dst, u32 size)<br />
{<br />
&nbsp; &nbsp; u32 srca= (u32)src, dsta= (u32)dst;</p>
<p>&nbsp; &nbsp; <span class="co1">// Check TCMs and BIOS (0&#215;01000000, 0x0B000000, 0xFFFF0000).</span><br />
&nbsp; &nbsp; <span class="co1">//# NOTE: probably incomplete checks.</span><br />
&nbsp; &nbsp; <span class="kw1">if</span>((srca&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;01</span> || (srca&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0x0B</span> || (srca&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0xFF</span>)<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">false</span>;<br />
&nbsp; &nbsp; <span class="kw1">if</span>((dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;01</span> || (dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0x0B</span> || (dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0xFF</span>)<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">false</span>;</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((srca|dsta) &amp; <span class="nu0">1</span>) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Fail on byte copy.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">false</span>;<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="kw1">while</span>(REG_DMA3CNT &amp; DMA_BUSY) ;</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((srca&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;02</span>)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Write cache back to memory.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; DC_FlushRange(src, size);</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((srca|dsta|size) &amp; <span class="nu0">3</span>)<br />
&nbsp; &nbsp; &nbsp; &nbsp; dmaCopyHalfWords(<span class="nu0">3</span>, src, dst, size);<br />
&nbsp; &nbsp; <span class="kw1">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; dmaCopyWords(<span class="nu0">3</span>, src, dst, size);</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;02</span>)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Set cache of dst range to &#8216;dirty&#8217;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; DC_InvalidateRange(dst, size);</p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">true</span>;<br />
}</p>
<p><span class="co1">//! Fill a dst with a fill via DMA in a cache/section safe manner.</span><br />
<span class="coMULTI">/*! The ARM9&#8217;s DMA doesn&#8217;t play well with the cache and can&#8217;t access <br />
&nbsp; &nbsp; ITCM or DTCM. This means that a basic DMA fill may not work as <br />
&nbsp; &nbsp; expected. This function flushes or invalidates cache if necessary and <br />
&nbsp; &nbsp; will only fill if the ranges are accessible.<br />
&nbsp; &nbsp; param fill&nbsp; Fill value.<br />
&nbsp; &nbsp; param dst &nbsp; Destination pointer.<br />
&nbsp; &nbsp; param size&nbsp; Size (in bytes) to fill.<br />
&nbsp; &nbsp; return&nbsp; &nbsp; &nbsp; True if the fill succeeded.<br />
&nbsp; &nbsp; note&nbsp; &nbsp; &nbsp; &nbsp; It&#8217;s possible I missed some invalid cases, YHBW.<br />
*/</span><br />
<span class="kw1">bool</span> dmaFillSafe(u32 fill, <span class="kw1">void</span> *dst, u32 size)<br />
{<br />
&nbsp; &nbsp; u32 dsta= (u32)dst;</p>
<p>&nbsp; &nbsp; <span class="co1">// Check TCMs and BIOS (0&#215;01000000, 0x0B000000, 0xFFFF0000).</span><br />
&nbsp; &nbsp; <span class="co1">//# NOTE: probably incomplete checks.</span><br />
&nbsp; &nbsp; <span class="kw1">if</span>((dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;01</span> || (dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0x0B</span> || (dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0xFF</span>)<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">false</span>;</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>(dsta &amp; <span class="nu0">1</span>)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Fail on byte fill.</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">false</span>;<br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp; <span class="kw1">while</span>(REG_DMA3CNT &amp; DMA_BUSY) ;</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((dsta|size) &amp; <span class="nu0">3</span>)<br />
&nbsp; &nbsp; &nbsp; &nbsp; dmaFillHalfWords(fill, dst, size);<br />
&nbsp; &nbsp; <span class="kw1">else</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; dmaFillWords(fill, dst, size);</p>
<p>&nbsp; &nbsp; <span class="kw1">if</span>((dsta&gt;&gt;<span class="nu0">24</span>)==<span class="nu0">0&#215;02</span>)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// Set cache of dst range to &#8216;dirty&#8217;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; DC_InvalidateRange(dst, size);</p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> <span class="kw1">true</span>;<br />
}</div>
</div>
<p><h2 id="sec-tests">4
Test cases
</h2>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-test-scheme"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;12. </b>
</div>
</p>
<p>
All this talk about what to do and when is nice and all, but it&#8217;s<br />
always best to run a few tests to see if everything happens like you<br />
expected. Fig&nbsp;12 illustrates how the tests operate.<br />
There are two source bitmaps of the letters &lsquo;A&rsquo; and<br />
&lsquo;B&rsquo;. In each test case, one of these is copied into<br />
a secondary buffer by either a CPU- or DMA-based copy. This<br />
second buffer is blitted to VRAM in two different places via<br />
a CPU- or DMA-based blit. At various points, a flush or invalidate<br />
may be inserted to see the effects.
</p>
<p>
In terms of code, every case is split into two parts. First, there&#8217;s a<br />
setup that initializes each case. This clears the console, prints<br />
a description of the case and erases the RAM buffer and the VRAM<br />
rectangles. The second part copies one of the letters into the<br />
buffer, does cache operations, and blits. Since the first part is<br />
boring, only the case-specific part will be given here.
</p>
<p><h3 id="ssec-test-01">4.1
<br />
Direct blit: &lsquo;A&rsquo; &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; cpuBlit(&amp;bmpA, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpA, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: the data in the source buffer never changes,<br />
so this should always be okay.
</p>
<p><h3 id="ssec-test-02">4.2
<br />
Indirect Blit I: &lsquo;B&rsquo; &rarr; buffer &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpB.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: buffer used for the first time, so no cache<br />
incoherency possible. Yet.
</p>
<p><h3 id="ssec-test-03">4.3
<br />
Indirect Blit II: &lsquo;A&rsquo; &rarr; buffer &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpA.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: CPU okay, DMA blit corrupted.<br />
<b>Explanation</b>: The buffered data is in cache, so the<br />
<code>memcpy()</code> to the buffer goes to cache as well. Meanwhile,<br />
the actual buffer (in RAM) still holds (parts of) &lsquo;B&rsquo;,<br />
which is where DMA gets its data from. The result is a mix of<br />
&lsquo;A&rsquo; and &lsquo;B&rsquo; for <code>dmaBlit()</code>. It&#8217;s<br />
a mix because apparently some cache lines have already been flushed<br />
out by the replacement policy.
</p>
<p><h3 id="ssec-test-04">4.4
<br />
Indirect Blit + flush: &lsquo;B&rsquo; &rarr; buffer, flush &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpB.data, SIZE);<br />
&nbsp; &nbsp; DC_FlushRange(bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: again, the <code>memcpy()</code> will leave<br />
stale data in RAM, but this time we force the up-to-date cache lines<br />
back to RAM to get cache and RAM in sync again. At this point, both<br />
the CPU and DMA-based blits will use the correct data.
</p>
<p><h3 id="ssec-test-05">4.5
<br />
Indirect Blit + invalidate: &lsquo;A&rsquo; &rarr; buffer, invalidate &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpA.data, SIZE);<br />
&nbsp; &nbsp; DC_InvalidateRange(bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both screwed in the same way.<br />
<b>Explanation</b>: as said in the note, you need to do the right<br />
operation at the right time. This is an example of what happens if you<br />
don&#8217;t. A cache invalidate simply erases cache lines. The data in RAM is<br />
now considered valid. However, after the <code>memcpy()</code>, RAM<br />
<i>isn&#8217;t</i> valid, as one can see from case 3. This is why now both<br />
blits fail.<br />
<a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/NiceJobBreakingItHero">Nice<br />
job breaking it, hero.</a>
</p>
<p><h3 id="ssec-test-06">4.6
<br />
Indirect Blit III: &lsquo;A&rsquo; (dma)&rarr; buffer &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; dmaCopy(bmpA.data, bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: CPU blit fail.<br />
<b>Explanation</b>: in this case, the source&rarr;buffer transfer is done<br />
via DMA rather than the CPU-based <code>memcpy()</code>. This time the<br />
CPU doesn&#8217;t know the contents of RAM have been altered. At this point,<br />
the cache will contain mostly the cleared data from the<br />
<code>memset()</code> done in the case set-up; <code>cpuBlit()</code><br />
will pick up some straggling lines from RAM during blit, resulting in<br />
the image you see here. Naturally, <code>dmaBlit()</code> works<br />
properly.
</p>
<p><h3 id="ssec-test-07">4.7
<br />
Indirect Blit + invalidate II: &lsquo;B&rsquo;<br />
  (dma)&rarr; buffer, invalidate &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; dmaCopy(bmpB.data, bmpBuf.data, SIZE);<br />
&nbsp; &nbsp; DC_InvalidateRange(bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: as case 6, but with an invalidate after the<br />
transfer to the buffer. The invalidate removes the allocated cache<br />
lines, so that the next CPU-reads come from RAM. Unlike case 5,<br />
RAM <i>is</i> the most up-to-date area, so the invalidate works as<br />
it&#8217;s supposed to.
</p>
<p><h3 id="ssec-test-08">4.8
<br />
Indirect Blit + invalidate/flush: &lsquo;A&rsquo;<br />
  &rarr; buffer, invalidate/flush &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpA.data, SIZE);<br />
&nbsp; &nbsp; DC_InvalidateRange(bmpBuf.data, SIZE);<br />
&nbsp; &nbsp; DC_FlushRange(bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both fail.<br />
<b>Explanation</b>: The invalidate frees cache lines (see case 5).<br />
The subsequent flush does nothing, because no lines are associated<br />
with those addresses anymore.
</p>
<p>
This is an example of what I like to call LOL-type programming.<br />
As in &ldquo;What are you doing?!?&rdquo; &#8211; &ldquo;I dunno lol&rdquo;.<br />
This type of coding generally indicates a very confused mind that<br />
hopes that if you throw enough shit against a wall maybe something<br />
will hold up. He may have heard terms like flush and invalidate<br />
used around DMA and decided to try them randomly. This never works.<br />
If you find code like this, be afraid; very, <i>very</i> afraid.<br />
This will likely not be the only instance of cargo-cult programming<br />
in the code-base and it&#8217;d be best to consider the whole thing suspect.
</p>
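<p>
Why the order matters can be shown with a toy model of a single cache line. The sketch below is plain C for a desktop compiler, with made-up names throughout (<code>ToyBuf</code>, <code>toy_flush()</code> and friends). On hardware the result is a mix of both letters because some lines have already been evicted; with only one line, the model shows the all-or-nothing version.
</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model with hypothetical names: one RAM block and the single
   write-back cache line that shadows it. */
typedef struct {
    char ram;     /* what DMA reads                  */
    char line;    /* what the CPU reads while valid  */
    bool valid;
    bool dirty;
} ToyBuf;

static void toy_cpu_write(ToyBuf *b, char v)
{
    if (b->valid) { b->line = v; b->dirty = true; }  /* cached: RAM untouched */
    else          { b->ram = v; }
}

static void toy_flush(ToyBuf *b)        /* write back, if there is anything */
{
    if (b->valid && b->dirty) { b->ram = b->line; b->dirty = false; }
}

static void toy_invalidate(ToyBuf *b)   /* drop the line, dirty or not */
{
    b->valid = false;
    b->dirty = false;
}

/* Case 8 in miniature: invalidate, then flush. Returns what is in RAM. */
static char toy_invalidate_then_flush(void)
{
    ToyBuf b = { 'B', 'B', true, false };  /* buffer holds a cached, clean 'B' */
    toy_cpu_write(&b, 'A');                /* memcpy(): only the line gets 'A' */
    toy_invalidate(&b);                    /* 'A' is thrown away               */
    toy_flush(&b);                         /* no line left, so nothing happens */
    return b.ram;
}

/* The same sequence with only the flush, as in case 4. */
static char toy_flush_only(void)
{
    ToyBuf b = { 'B', 'B', true, false };
    toy_cpu_write(&b, 'A');
    toy_flush(&b);
    return b.ram;
}

static void toy_order_examples(void)
{
    assert(toy_invalidate_then_flush() == 'B');  /* stale for CPU and DMA */
    assert(toy_flush_only() == 'A');             /* up to date            */
}
```

<p>
After the invalidate the CPU reads from RAM as well, so both blits see the same stale data.
</p>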
<p><h3 id="ssec-test-09">4.9
<br />
dmaSafe I: &lsquo;B&rsquo; &rarr; buffer (safe)&rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; <span class="kw3">memcpy</span>(bmpBuf.data, bmpB.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlitSafe(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: this uses the <code>dmaCopySafe()</code> function given<br />
previously in the DMA blitter. Since that function checks whether a<br />
flush is appropriate, everything should be fine. And it is.
</p>
<p><h3 id="ssec-test-10">4.10
<br />
dmaSafe II: &lsquo;A&rsquo; (safedma)&rarr; buffer &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; dmaCopySafe(bmpA.data, bmpBuf.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpBuf, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlitSafe(&amp;bmpBuf, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: both correct.<br />
<b>Explanation</b>: <code>dmaCopySafe()</code> also performs an<br />
invalidate if necessary so, again, it all works.
</p>
<p><h3 id="ssec-test-11">4.11
<br />
Buffer in stack: &lsquo;B&rsquo; &rarr; local buffer &rarr; VRAM
</h3>
</p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" 
    alt="" width="" /></a><br />
  
</div>
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">&nbsp; &nbsp; u16 localBuf[<span class="nu0">16</span>*<span class="nu0">16</span>];<br />
&nbsp; &nbsp; MiniBmp bmpLocal= { <span class="nu0">16</span>, <span class="nu0">16</span>, localBuf };</p>
<p>&nbsp; &nbsp; <span class="kw3">memcpy</span>(localBuf, bmpB.data, SIZE);</p>
<p>&nbsp; &nbsp; cpuBlit(&amp;bmpLocal, X_CPU, Y_CPU);<br />
&nbsp; &nbsp; dmaBlit(&amp;bmpLocal, X_DMA, Y_DMA);</div>
</div>
<p class=ni>
<b>Result</b>: DMA fail.<br />
<b>Explanation</b>: in this case, I&#8217;m using a local buffer for the<br />
bitmap instead of a global one. The difference here is that a local<br />
buffer goes onto the stack, which is in DTCM rather than RAM. As<br />
discussed earlier, DTCM is invisible to DMA, so <code>dmaBlit()</code><br />
doesn&#8217;t work. Note that using <code>dmaBlitSafe()</code> wouldn&#8217;t<br />
work either, but at least you&#8217;d get a return value indicating failure<br />
instead of nothing at all.
</p>
<p><div>&nbsp;</div></p>
<p><div class=note>
<div  class=nhcare>Hardware vs emulator</div>
</p>
<p>
As far as I know, <i>none</i> of the current NDS emulators emulate<br />
cache properly, so these tests would actually seem to produce correct<br />
results there: &ldquo;correct&rdquo; in the sense that they reproduce the<br />
target image, not that they give the same results as hardware.
</p>
<p></div>
</p>
<p><h2 id="sec-conc">5
Conclusions
</h2>
</p>
<ul>
<li>
    The ARM9 has DTCM and ITCM sections that DMA can&#8217;t access. DMA<br />
    transfers to and from there will fail. Because the stack is in<br />
    DTCM, this includes transfers to/from (non-static) local variables.
  </li>
<li>
    The ARM9 has cache that DMA can&#8217;t see either. If DMA tries to<br />
	read from or write to RAM blocks that have been cached, the wrong data<br />
	<i>may</i> be transferred.
  </li>
<li>
    The ARM9 uses 32-byte cache lines that are allocated when addresses<br />
    are read from, but not when written to (read-allocate). It also uses a<br />
    write-back policy: cached data is written back to RAM only when the<br />
    cache line is replaced.
  </li>
<li>
    Cache-miss bad; cache-hit good. Stale cache also bad, since it&#8217;s the cause<br />
    of incorrect DMA transfers.
  </li>
<li>
    DMA-reads from RAM should be preceded by a cache flush, which writes<br />
    cache lines back to RAM.
  </li>
<li>
    DMA-writes to RAM should be preceded or followed by a cache<br />
	invalidate, which clears cache lines so that the next CPU-read will be<br />
	from RAM again.
  </li>
<li>
    As far as I know, most emulators do not handle DMA-DTCM correctly.<br />
	None emulate cache. If you suddenly find corrupted data after<br />
	copying on hardware but not emulators, look at your DMA calls.
  </li>
<li>
    Making graphics with rounded corners and intricate wide lines takes<br />
    <i>forever</i> to get right.
  </li>
</ul>
<p><div>&nbsp;</div></p>
<p>
Related test project: <a href="/files/nds/arm9vsdma.zip">arm9vsdma.zip</a></p>
<hr /><div class="footnotes">
<h5>Notes:</h5>
<ol>
<li id="ft-nr1"> 
  Well,<br />
<i>quite</i> fast anyway. In some circumstances CPU-based transfers<br />
are faster, but that&#8217;s a story for another day.
</li>
<li id="ft-nr2"> 
  Well, sometimes. Usually these go in<br />
CPU registers, but this is not the right place for that discussion<br />
either.
</li>
</ol>
</div>
<hr />
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2009/05/dma-vs-arm9-fight/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>mode 7 addendum</title>
		<link>http://www.coranac.com/2009/04/mode-7-addendum/</link>
		<comments>http://www.coranac.com/2009/04/mode-7-addendum/#comments</comments>
		<pubDate>Sun, 19 Apr 2009 16:32:53 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[tonc]]></category>
		<category><![CDATA[fix]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[mode7]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=67</guid>
		<description><![CDATA[
Okay. Apparently, I am an idiot who can&#8217;t do math. &#160; One of the longer chapters in Tonc is Mode 7 part 2, which covers pretty much all the hairy details of producing mode 7 effects on the GBA. The money shot in terms of code is the following function, which calculates the affine [...]]]></description>
			<content:encoded><![CDATA[
<p>
Okay. Apparently, I am an idiot who can&#8217;t do math.
</p>
<p><div>&nbsp;</div></p>
<p>
One of the longer chapters in Tonc is<br />
<a href="/tonc/text/mode7ex.htm">Mode 7 part 2</a>, which covers<br />
pretty much all the hairy details of producing mode 7 effects on the<br />
GBA. The money shot in terms of code is the following function from<br />
section <a href="/tonc/text/mode7ex.htm#ssec-code-bg">21.7.3</a>, which<br />
calculates the affine parameters of the background for each scanline.
</p>
<div class="cpp">
<div class="cpp proglist" style=" ">IWRAM_CODE <span class="kw1">void</span> m7_prep_affines(M7_LEVEL *level)<br />
{<br />
&nbsp; &nbsp; <span class="kw1">if</span>(level-&gt;horizon &gt;= SCREEN_HEIGHT)<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span>;</p>
<p>&nbsp; &nbsp; <span class="kw1">int</span> ii, ii0= (level-&gt;horizon&gt;=<span class="nu0">0</span> ? level-&gt;horizon : <span class="nu0">0</span>);</p>
<p>&nbsp; &nbsp; M7_CAM *cam= level-&gt;camera;<br />
&nbsp; &nbsp; FIXED xc= cam-&gt;pos.x, yc= cam-&gt;pos.y, zc=cam-&gt;pos.z;</p>
<p>&nbsp; &nbsp; BG_AFFINE *bga= &amp;level-&gt;bgaff[ii0];</p>
<p>&nbsp; &nbsp; FIXED yb, zb; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// b&#8217; = Rx(theta) * &nbsp;(L, ys, -D)</span><br />
&nbsp; &nbsp; FIXED cf, sf, ct, st; &nbsp; <span class="co1">// sines and cosines</span><br />
&nbsp; &nbsp; FIXED lam, lcf, lsf; &nbsp; &nbsp;<span class="co1">// scale and scaled (co)sine(phi)</span><br />
&nbsp; &nbsp; cf= cam-&gt;u.x; &nbsp; &nbsp; &nbsp;sf= cam-&gt;u.z;<br />
&nbsp; &nbsp; ct= cam-&gt;v.y; &nbsp; &nbsp; &nbsp;st= cam-&gt;w.y;<br />
&nbsp; &nbsp; <span class="kw1">for</span>(ii= ii0; ii&lt;SCREEN_HEIGHT; ii++)<br />
&nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; yb= (ii-M7_TOP)*ct + M7_D*st;<br />
&nbsp; &nbsp; &nbsp; &nbsp; lam= DivSafe( yc&lt;&lt;<span class="nu0">12</span>, &nbsp;yb); &nbsp; &nbsp; <span class="co1">// .12f &nbsp; &nbsp;&lt;- OI!!!</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; lcf= lam*cf&gt;&gt;<span class="nu0">8</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// .12f</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; lsf= lam*sf&gt;&gt;<span class="nu0">8</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// .12f</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; bga-&gt;pa= lcf&gt;&gt;<span class="nu0">4</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="co1">// .8f</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; bga-&gt;pc= lsf&gt;&gt;<span class="nu0">4</span>; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="co1">// .8f</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// lambda·Rx·b</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; zb= (ii-M7_TOP)*st - M7_D*ct; &nbsp; <span class="co1">// .8f</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; bga-&gt;dx= xc + (lcf&gt;&gt;<span class="nu0">4</span>)*M7_LEFT - (lsf*zb&gt;&gt;<span class="nu0">12</span>); &nbsp;<span class="co1">// .8f</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; bga-&gt;dy= zc + (lsf&gt;&gt;<span class="nu0">4</span>)*M7_LEFT + (lcf*zb&gt;&gt;<span class="nu0">12</span>); &nbsp;<span class="co1">// .8f</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// hack that I need for fog. pb and pd are unused anyway</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; bga-&gt;pb= lam;<br />
&nbsp; &nbsp; &nbsp; &nbsp; bga++;<br />
&nbsp; &nbsp; }<br />
&nbsp; &nbsp; level-&gt;bgaff[SCREEN_HEIGHT]= level-&gt;bgaff[<span class="nu0">0</span>];<br />
}</div>
</div>
<p>
For details on what all the terms mean, go to the page in question.<br />
For now, just note the call to <code>DivSafe()</code> to calculate<br />
the scaling factor &lambda; and recall that division on the GBA is<br />
pretty slow. In <a href="/tonc/text/mode7.htm">Mode 7 part 1</a>,<br />
I used a LUT, but here I figured that since the <code>yb</code> term<br />
can be anything thanks to the pitch you can&#8217;t do that. After helping<br />
Ruben with his mode 7 demo, it turns out that you can.
</p>
<p><div>&nbsp;</div></p>
<p><div class="cpt" style="width:px;">
  <a href="" target="_blank">  <img src="" id="img-crd-c2p2"
    alt="" width="" /></a><br />
  <b>Fig&nbsp;1. </b>
</div>
</p>
<p>
Fig&nbsp;1 shows the situation. There is a camera<br />
(the black triangle) that is tilted down by pitch angle &theta;. I&#8217;ve<br />
put the origin at the back of the camera because it makes things<br />
easier to read. The<br />
front of the camera is the projection plane, which is essentially<br />
the screen. A ray is cast from the back of the camera on to the floor<br />
and this ray intersects the projection plane. The coordinates<br />
of this point are <b>x</b><sub>p</sub> =<br />
(<i>y</i><sub>p</sub>, <i>D</i>) in projection plane space, which<br />
corresponds to point (<i>y</i><sub>b</sub>, <i>z</i><sub>b</sub>) in<br />
world space. This is simply rotating point <b>x</b><sub>p</sub> by<br />
&theta;. The scaling factor is the ratio between the <i>y</i> or<br />
<i>z</i> coordinates of the points on the floor and on the projection<br />
plane, so that&#8217;s:
</p>
<p><table class="eqtbl">
<tr>
<td class="eqnrcell"></td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?%5Clambda%20%3D%20y_c%20%2F%20y_b%2C'<br />
	title="\lambda = y_c / y_b,"<br />
	alt="\lambda = y_c / y_b," /><br />
</td>
</tr>
</table></p>
<p>
and for <i>y</i><sub>b</sub> the rotation gives us:
</p>
<p><table class="eqtbl">
<tr>
<td class="eqnrcell"></td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?y_b%20%3D%20y_p%20cos%20%5Ctheta%20%2B%20D%20sin%20%5Ctheta%2C'<br />
	title="y_b = y_p cos \theta + D sin \theta,"<br />
	alt="y_b = y_p cos \theta + D sin \theta," /><br />
</td>
</tr>
</table></p>
<p>
where <i>y</i><sub>c</sub> is the camera height,<br />
<i>y</i><sub>p</sub> is a scanline offset (measured from the center of the screen) and <i>D</i> is the focal<br />
length.
</p>
<p>
Now, the point is that while <i>y</i><sub>b</sub> is variable<br />
and non-integral when &theta; &ne; 0, it is still bounded! What&#8217;s more,<br />
you can easily calculate its maximum value, since it&#8217;s simply the<br />
maximum length of <b>x</b><sub>p</sub>. Calling this factor <i>R</i>,<br />
we get:
</p>
<p><table class="eqtbl">
<tr>
<td class="eqnrcell"></td>
  <td class="eqcell"><br />
<img src='http://www.coranac.com/cgi-bin/mimetex.cgi?R%20%3D%20%5Csqrt%7Bmax%28y_p%29%5E2%20%2B%20D%5E2%7D'<br />
	title="R = \sqrt{max(y_p)^2 + D^2}"<br />
	alt="R = \sqrt{max(y_p)^2 + D^2}" /><br />
</td>
</tr>
</table></p>
<p>
This factor <i>R</i>, rounded up, is the size of the required LUT.<br />
In my particular case, I&#8217;ve used <i>y</i><sub>p</sub>= scanline&minus;80<br />
and <i>D</i> = 256, which gives<br />
<i>R</i>&nbsp;=&nbsp;sqrt((160&minus;80)&sup2;&nbsp;+&nbsp;256&sup2;)<br />
= 268.2. In other words, I need a division LUT with 269 entries. Using .16<br />
fixed point numbers for this LUT, the replacement code is essentially:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">// The new division LUT. For 1/0 and 1/1, 0xFFFF is used.</span><br />
u16 m7_div_lut[<span class="nu0">270</span>]= <br />
{<br />
&nbsp; &nbsp; <span class="nu0">0xFFFF</span>, <span class="nu0">0xFFFF</span>, <span class="nu0">0x8000</span>, <span class="nu0">0x5556</span>, &#8230;<br />
};</p>
<p>
<span class="co1">// Inside the function</span><br />
&nbsp; &nbsp; <span class="kw1">for</span>(ii= ii0; ii&lt;SCREEN_HEIGHT; ii++)<br />
&nbsp; &nbsp; {<br />
&nbsp; &nbsp; &nbsp; &nbsp; yb= (ii-M7_TOP)*ct + M7_D*st; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// .8</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; lam= (yc*m7_div_lut[yb&gt;&gt;<span class="nu0">8</span>])&gt;&gt;<span class="nu0">12</span>;&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// .8*.16/.12 = .12</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <br />
&nbsp; &nbsp; &nbsp; &nbsp; &#8230; <span class="co1">// business as usual</span><br />
&nbsp; &nbsp; }</div>
</div>
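<p>
For reference, here is a minimal sketch of how such a table could be generated at start-up instead of typed in by hand. It assumes the entries are 2<sup>16</sup>/<i>n</i> rounded up and clamped to 0xFFFF, which reproduces the four values shown above; whether the original table was built exactly this way is a guess. The names <code>m7_lut_size()</code>, <code>m7_make_div_lut()</code> and <code>M7_YP_MAX</code> are made up for this example.
</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical names; only the first four table values come from the post.
constexpr int M7_D      = 256;  // focus length
constexpr int M7_YP_MAX = 80;   // largest scanline offset: 160-80

// R = sqrt(max(y_p)^2 + D^2), rounded up: the number of LUT entries.
inline int m7_lut_size()
{
    double r = std::sqrt(double(M7_YP_MAX*M7_YP_MAX + M7_D*M7_D));
    return int(std::ceil(r));
}

// Fill lut[0..count-1] with 1/n as .16 fixed point, rounded up.
// 1/0 and 1/1 do not fit in 16 bits, so they are clamped to 0xFFFF.
inline void m7_make_div_lut(std::uint16_t *lut, int count)
{
    assert(lut != nullptr && count > 0);
    for (int n = 0; n < count; n++)
    {
        std::uint32_t val = (n < 2) ? 0xFFFF : (0x10000u + n - 1) / n;
        lut[n] = std::uint16_t(val);
    }
}
```

<p>
With the numbers used here, <code>m7_lut_size()</code> gives 269, and the table starts with 0xFFFF, 0xFFFF, 0x8000, 0x5556.
</p>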
<p>
At this point, several questions may arise.
</p>
<ul>
<li>
    <b>What about negative <i>y</i><sub>b</sub>?</b> The beauty here<br />
    is that while <i>y</i><sub>b</sub> may be negative in principle,<br />
    such values would correspond to lines above the horizon and we don&#8217;t<br />
    calculate those anyway.
  </li>
<li>
    <b>Won&#8217;t non-integral <i>y</i><sub>b</sub> cause inaccurate look-ups?</b><br />
    True, <i>y</i><sub>b</sub> will have a fractional part that<br />
    is simply cut off during a simple look-up and some sort of<br />
    interpolation would be better. However, in testing there were no<br />
    noticeable differences between direct look-up, lerped look-up or<br />
    using <code>Div()</code>, so the simplest method suffices.
  </li>
<li>
    <b>Are .16 fixed point numbers enough?</b> Yes, apparently so.
  </li>
<li>
    <b>ZOMG OVERFLOW! Are .16 fixed point numbers too high?</b><br />
    Technically, yes, there is a risk of overflow when the camera height<br />
    gets too high. However, at high altitudes the map is going to look<br />
    like crap anyway due to the low resolution of the screen.<br />
    Furthermore, the hardware only uses 8.8 fixeds, so scales above<br />
    256.0 wouldn&#8217;t work anyway.
  </li>
</ul>
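<p>
To make the difference between the direct and the lerped look-up concrete, here is a rough sketch of both. This is not the code from the test, just one plausible way to write it: <code>yb</code> is taken to be a non-negative .8 fixed point number, and the function names are invented for the example. Note that the lerped version reads <code>lut[idx+1]</code>, so the table needs one entry past the largest index used.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Direct look-up: drop the fraction of yb (.8 fixed point).
inline std::uint32_t div_lut_direct(const std::uint16_t *lut, std::uint32_t yb)
{
    return lut[yb >> 8];
}

// Lerped look-up: blend lut[idx] and lut[idx+1] by the 8 fraction bits.
// 1/n never increases with n, so (a - b) cannot go negative.
inline std::uint32_t div_lut_lerp(const std::uint16_t *lut, std::uint32_t yb)
{
    std::uint32_t idx = yb >> 8, frac = yb & 0xFF;
    std::uint32_t a = lut[idx], b = lut[idx + 1];
    assert(a >= b);
    return a - (((a - b) * frac) >> 8);
}
```

<p>
For <i>y</i><sub>b</sub> = 2.5 (0x280), the direct version returns 0x8000 (i.e., 1/2) and the lerped version returns 27307 (about 0.417), against a true value of 0.4. The error is largest for small <i>y</i><sub>b</sub> and shrinks quickly further up the table.
</p>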
<p>
And finally:
</p>
<ul>
<li>
  <b>What do I win?</b><br />
  With <code>Div()</code>, <code>m7_prep_affines()</code> takes<br />
  about 51k cycles. With the direct look-up this reduces to about 13k:<br />
  a speed increase by a factor of 4.
  </li>
</ul>
<p><div>&nbsp;</div></p>
<p>
So yeah, this is what I <i>should</i> have figured out years ago, but<br />
somehow kept overlooking. I&#8217;m not sure if I&#8217;ll add this whole thing to<br />
Tonc&#8217;s text and code, but I&#8217;ll at least put up a link to here. Thanks,<br />
Ruben, for showing me how to do this properly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2009/04/mode-7-addendum/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>To C or not to C</title>
		<link>http://www.coranac.com/2008/09/to-c-or-not-to-c/</link>
		<comments>http://www.coranac.com/2008/09/to-c-or-not-to-c/#comments</comments>
		<pubDate>Tue, 02 Sep 2008 23:14:15 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[research]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=56</guid>
		<description><![CDATA[Tonclib is coded mostly in C. The reason for this was twofold. First, I still have it in my head that C is lower level than C++, and that the former would compile to faster code; and faster is good. Second, it&#8217;s easier for C++ to call C than the other way around so, for [...]]]></description>
			<content:encoded><![CDATA[<p>
Tonclib is coded mostly in C. The reason for this was twofold. First, I still have it in my head that C is lower level than C++, and that the former would compile to faster code; and faster is good. Second, it&#8217;s easier for C++ to call C than the other way around so, for maximum compatibility, it made sense to code it in C. But these arguments always felt a little weak and now that I&#8217;m trying to port tonclib&#8217;s functions to the DS, the question pops up again.
</p>
<p><div>&nbsp;</div></p>
<p>
On many occasions, I just <i>hated</i> not going for C++. Not so much for its higher-level functionality like classes, inheritance and other OOPy goodness (or badness, some might say), but more because I would really, really like to make use of things like function overloading, default parameters and perhaps templates too.
</p>
<p><div>&nbsp;</div></p>
<p>
For example, say you have a blit routine. You can implement this in multiple ways: with full parameters (srcX/Y, dstX/Y, width/height), using Point and Rect structs (srcRect, dstPoint) or perhaps just a destination point, using the full source-bitmap. In other words:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">void</span> blit(Surface *dst, <span class="kw1">int</span> dstX, <span class="kw1">int</span> dstY, <span class="kw1">int</span> srcW, <span class="kw1">int</span> srcH, Surface *src, <span class="kw1">int</span> srcX, <span class="kw1">int</span> srcY);<br />
<span class="kw1">void</span> blit(Surface *dst, Point *dstPoint, Surface *src, Rect *srcRect);<br />
<span class="kw1">void</span> blit(Surface *dst, Point *dstPoint, Surface *src);</div>
</div>
<p>
In C++, this would be no problem. You just declare and define the functions and the compiler mangles the names internally to avoid naming conflicts. You can even make some of the functions inline facades that morph the arguments for the One True Implementation. In C, however, this won&#8217;t work. You have to do the 
<a href="http://en.wikipedia.org/wiki/name%20mangling">name mangling</a> yourself, like blit, blit2, blit3, or blitEx or blitRect, and so on and so forth. Eeghh, that is just ugly.
</p>
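<p>
A minimal sketch of what those inline facades could look like. The <code>Surface</code>, <code>Point</code> and <code>Rect</code> structs here are bare-bones stand-ins, and the full-parameter <code>blit()</code> only records its arguments instead of copying pixels, so the forwarding can be checked.
</p>

```cpp
#include <cassert>

// Hypothetical minimal types, just enough to make the example compile.
struct Point   { int x, y; };
struct Rect    { int x, y, w, h; };
struct Surface { int width, height; };

// Records the last call so the forwarding can be inspected.
struct BlitCall { int dstX, dstY, srcW, srcH, srcX, srcY; };
static BlitCall g_lastBlit;

// The One True Implementation (the actual pixel copying is left out).
void blit(Surface *dst, int dstX, int dstY, int srcW, int srcH,
          Surface *src, int srcX, int srcY)
{
    assert(dst != nullptr && src != nullptr);
    g_lastBlit = BlitCall{ dstX, dstY, srcW, srcH, srcX, srcY };
}

// Facade: Point/Rect version.
inline void blit(Surface *dst, Point *dstPoint, Surface *src, Rect *srcRect)
{
    blit(dst, dstPoint->x, dstPoint->y, srcRect->w, srcRect->h,
         src, srcRect->x, srcRect->y);
}

// Facade: blit the whole source bitmap.
inline void blit(Surface *dst, Point *dstPoint, Surface *src)
{
    Rect full = { 0, 0, src->width, src->height };
    blit(dst, dstPoint, src, &full);
}
```

<p>
Calling <code>blit(&amp;screen, &amp;pos, &amp;sprite)</code> ends up in the eight-parameter version with the full source size filled in, and since the facades are inline the compiler is free to fold the forwarding away.
</p>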
<p><div>&nbsp;</div></p>
<p>
Speaking of points and rectangles, that&#8217;s another thing. Structs for points and rects are quite useful, so you make one using <code>int</code> members (you should always start with ints). But sometimes it&#8217;s better to have smaller versions, like shorts. Or maybe unsigned variations. And so you end up with:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">struct</span> point8_t &nbsp; { s8 &nbsp;x, y; }; &nbsp; <span class="co1">// Point as signed char</span><br />
<span class="kw1">struct</span> point16_t &nbsp;{ s16 x, y; }; &nbsp; <span class="co1">// Point as signed short</span><br />
<span class="kw1">struct</span> point32_t &nbsp;{ s32 x, y; }; &nbsp; <span class="co1">// Point as signed int</span></p>
<p><span class="kw1">struct</span> upoint8_t &nbsp;{ u8 &nbsp;x, y; }; &nbsp; <span class="co1">// Point as unsigned char</span><br />
<span class="kw1">struct</span> upoint16_t { u16 x, y; }; &nbsp; <span class="co1">// Point as unsigned short</span><br />
<span class="kw1">struct</span> upoint32_t { u32 x, y; }; &nbsp; <span class="co1">// Point as unsigned int</span></div>
</div>
<p>
And then that for rects too. And perhaps 3D vectors. And maybe add floats to the mix as well. This all requires that you make structs which are identical except for the primary datatype. That just sounds kinda dumb to me.
</p>
<p>
But wait, it gets even better! You might like to have some functions<br />
to go with these structs, so now you have to create different sets (yes, <i>sets</i>) of functions that differ only by their parameter types too! AAAARGGGGHHHHH, for the love of IPU, NOOOOOOOOOOOOOO!!! Neen, neen; driewerf neen! <kbd>&gt;_&lt;</kbd>
</p>
<p>
That&#8217;s how it would be in C. In C++, you can just use a template like so:
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">template</span>&lt;<span class="kw1">class</span> T&gt;<br />
<span class="kw1">struct</span> point_t&nbsp; { T x, y; }; &nbsp; &nbsp;<span class="co1">// Point via templates</span></p>
<p><span class="kw1">typedef</span> point_t&lt;s8&gt; point8_t; &nbsp; <span class="co1">// And if you really want, you can </span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">// typedef for specific types.</span></div>
</div>
<p>
and be done with it. And then you can make a single template function (or was it function template, I always forget) that works for all the datatypes and let the compiler work it out. Letting the computer do the work for you, hah! What will they think of next.
</p>
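<p>
As an illustration of that single function template, here is a small sketch. <code>pt_add()</code> and <code>pt_demo()</code> are made-up names, not tonclib functions.
</p>

```cpp
#include <cassert>

template<class T>
struct point_t { T x, y; };

// One function template covers every point_t<T>.
template<class T>
inline point_t<T> pt_add(const point_t<T> &a, const point_t<T> &b)
{
    // The T() casts undo the promotion to int for small integer types.
    point_t<T> sum = { T(a.x + b.x), T(a.y + b.y) };
    return sum;
}

// Usage: the compiler works out T from the arguments.
inline void pt_demo()
{
    point_t<short> a = { 3, 4 }, b = { 10, 20 };
    point_t<float> c = { 0.5f, 1.5f }, d = { 1.0f, 2.0f };
    assert(pt_add(a, b).x == 13 && pt_add(a, b).y == 24);
    assert(pt_add(c, d).x == 1.5f && pt_add(c, d).y == 3.5f);
}
```

<p>
The compiler only generates code for the types that are actually used, so unused variations cost nothing.
</p>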
<p><div>&nbsp;</div></p>
<p>
Oh, and there&#8217;s namespaces too! Yes! In C, you always have to worry about whether some other library has something with the same name as you&#8217;re thinking of using. This is where all those silly prefixes come from (oh hai, FreeImage!). With C++, there&#8217;s a clean way out of that: you can encapsulate them in a namespace and when a conflict arises you can use <code>mynamespace::foo</code> to get out of it. And if there are no conflicts, use <code>using namespace mynamespace;</code> and type just plain <code>foo</code>. None of that <code>FreeImage_foo</code> causing you to have more prefix than genuine function identifier.
</p>
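<p>
In code, that comes down to something like the following sketch, where <code>otherlib</code> is an imaginary second library that also wants a function called <code>foo</code>.
</p>

```cpp
#include <cassert>

// Two hypothetical libraries that both define foo().
namespace mynamespace { inline int foo() { return 1; } }
namespace otherlib    { inline int foo() { return 2; } }

inline void ns_demo()
{
    // Only one namespace is pulled in here, so plain foo() is unambiguous.
    using namespace mynamespace;
    assert(foo() == 1);

    // Explicit qualification always works, conflict or not.
    assert(otherlib::foo() == 2);
    assert(mynamespace::foo() == 1);
}
```

<p>
Pulling in both namespaces with <code>using</code> would make a plain <code>foo()</code> call ambiguous, at which point the compiler complains and you fall back on the qualified names.
</p>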
<p><div>&nbsp;</div></p>
<p>
And <i>then</i> there&#8217;s C++ benefits like classes and everything that goes with it. Yes, classes can become fiendishly difficult if pushed too far<span class="fnote"><a href="#ft-nr1" title="As the saying goes: C++ makes it harder to shoot yourself in the foot, but when you do, you blow off your whole leg.">(1)</a></span>, but inheritance and polymorphism are nice when you have any kind of hierarchy in your program. All Actors have positions, velocities and states. But a PlayerActor also needs input; and an NpcActor has AI. And each kind of NPC has different<br />
methods for behaviour and capabilities, and different Items have different effects and so on. It&#8217;s possible to do this in just C (hint: unioned-structs and function-tables and of course state engines), but whether you&#8217;d want to is another matter. And there&#8217;s constructors for easier memory management, STL and references. And, yes, streams, exceptions and RTTI too if you want to kill your poor CPU (regarding GBA/DS I mean), but nobody&#8217;s forcing you to use those.
</p>
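<p>
To give an idea of what the &#8220;just C&#8221; route looks like, here is a rough sketch of the unioned-struct plus function-table approach, written in the C-like subset of C++. All names and the toy update rules are invented for the example.
</p>

```cpp
#include <cassert>

struct Actor;

// Per-kind behaviour table: the C stand-in for virtual functions.
struct ActorProcs
{
    const char *name;
    void (*update)(Actor *actor);
};

struct Actor
{
    int x, y, vx, vy;               // every Actor has these
    const ActorProcs *procs;        // what kind of Actor this is
    union                           // kind-specific data shares storage
    {
        struct { unsigned keys; } player;   // input state
        struct { int target_x; }  npc;      // (very) simple AI state
    } u;
};

static void player_update(Actor *a)
{
    a->vx = (a->u.player.keys & 1) ? 1 : 0;         // bit 0: "right" is held
    a->x += a->vx;
}

static void npc_update(Actor *a)
{
    a->vx = (a->u.npc.target_x > a->x) ? 1 : -1;    // walk toward the target
    a->x += a->vx;
}

static const ActorProcs player_procs = { "player", player_update };
static const ActorProcs npc_procs    = { "npc",    npc_update };

// The caller never needs to know which kind it is dealing with.
inline void actor_update(Actor *a)
{
    assert(a != nullptr && a->procs != nullptr);
    a->procs->update(a);
}
```

<p>
It works, but everything a C++ compiler would do for you (filling in the table, keeping the union and the table in sync) is now your job.
</p>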
<p></p>
<p>
So why the hell am I staying with C again? Oh right, performance!
</p>
<p>
Performance, really? I think I heard this was a valid point a long time ago, but is it still true now? To test this, I turned all tonclib&#8217;s C files into C++ files, compiled again and compared the two. This is the result:
</p>
<div class="cblock">
<div class="cpt" style="width:480px;">
  <a href="http://www.coranac.com/img/post/cpp-vs-c.jpg" target="_blank">  <img src="http://www.coranac.com/img/post/cpp-vs-c.jpg" 
    alt="" width="480" /></a><br />
  Difference in function size between C++ and C in bytes.
</div>

</div>
<p>
That graph shows the difference in the compiled function size. Positive means C++ uses more instructions. In nearly 300 functions, the only differences are minor variations in irq_set(), some of the rendering routines and TTE parsers, and neither language is the clear winner. Overall, C++ seems to do a little bit better, but the difference is marginal.
</p>
<p>
I&#8217;ve also run a diff between the generated assembly. There are a handful of functions where the order of instructions is different, or different registers are used, or a value is placed in a register instead of on the stack. That&#8217;s about it. In other words, there is no significant difference between pure C code and its C++ equivalent. Things will probably be a little different when OOP features and exceptions enter the fray, but that&#8217;s to be expected. But if you stay close to C-like C++, the only place you&#8217;ll notice anything is in the name-mangling. Which you as a programmer won&#8217;t notice anyway because it all happens behind the scenes.
</p>
<p><div>&nbsp;</div></p>
<p>
So that strikes performance off my list, leaving only wider compatibility. I suppose that still has some merit, but considering you can turn C-code into valid C++ by changing the extension<span class="fnote"><a href="#ft-nr2" title="and clean up the type issues that C allows but C++ doesn&#8217;t, like void* arithmetic and implicit pointer casts from void*.">(2)</a></span>, this is sounding more and more like an excuse instead of a reason.</p>
<hr /><div class="footnotes">
<h5>Notes:</h5>
<ol>
<li id="ft-nr1"> 
  As the saying goes: C++ makes it harder to shoot yourself in the foot, but when you do, you blow off your whole leg.
</li>
<li id="ft-nr2"> 
  and clean up the type issues that C allows but C++ doesn&#8217;t, like void* arithmetic and implicit pointer casts from void*.
</li>
</ol>
</div>
<hr />
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2008/09/to-c-or-not-to-c/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Fiddle to the bittle</title>
		<link>http://www.coranac.com/2008/05/fiddle-to-the-bittle/</link>
		<comments>http://www.coranac.com/2008/05/fiddle-to-the-bittle/#comments</comments>
		<pubDate>Fri, 30 May 2008 17:34:34 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[documents]]></category>
		<category><![CDATA[bit fiddling]]></category>
		<category><![CDATA[bitops]]></category>

		<guid isPermaLink="false">http://www.coranac.com/?p=51</guid>
		<description><![CDATA[I&#8217;ve added two new routines to the bit-trick page: 1&#8594;4 bit-unpack with reverse and bit reversals. This last one is elegant &#8230; except for one bit of C tomfoolery that is required to get GCC to produce the right ARM code. I hope to discuss this in more detail later. I&#8217;ve also added a new [...]]]></description>
			<content:encoded><![CDATA[<p>
I&#8217;ve added two new routines to the bit-trick page:<br />
<a href="http://www.coranac.com/documents/bittrick/#ssec-bup-1x4-rev"><br />
1&rarr;4 bit-unpack with reverse</a> and<br />
<a href="http://www.coranac.com/documents/bittrick/#ssec-misc-rev"><br />
bit reversals</a>. This last one is elegant &hellip; except for one bit of C tomfoolery<br />
that is required to get GCC to produce the right ARM code. I hope to discuss this in<br />
more detail later.
</p>
<p>
I&#8217;ve also added a new document about dealing with<br />
<a href="http://www.coranac.com/documents/working-with-bits-and-bitfields/"><br />
bitfields</a>. It explains what to do with them, gives a few useful functions to<br />
get and set bitfields, and demonstrates how to use the C construct for bitfields.<br />
It also touches briefly on a nasty detail in the way GCC implements bitfields<br />
that can cause them to fail in certain GBA/NDS memory sections. If you&#8217;re using<br />
bitfields to map VRAM or OAM, please read.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2008/05/fiddle-to-the-bittle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Surface drawing routines.</title>
		<link>http://www.coranac.com/2008/05/surface-drawing-routines/</link>
		<comments>http://www.coranac.com/2008/05/surface-drawing-routines/#comments</comments>
		<pubDate>Wed, 14 May 2008 16:19:28 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[tonc]]></category>
		<category><![CDATA[gba]]></category>
		<category><![CDATA[rendering]]></category>
		<category><![CDATA[surface]]></category>

		<guid isPermaLink="false">http://www.coranac.com/2008/05/14/48/</guid>
		<description><![CDATA[I&#8217;ve been building a basic interface for dealing with graphic surfaces lately. I already had most of the routines for 16bpp and 8bpp bitmaps in older Toncs, but their use was still somewhat awkward because you had to provide some details of the destination manually; most notably a base pointer and the pitch. This [...]]]></description>
			<content:encoded><![CDATA[<p>
I&#8217;ve been building a basic interface for dealing with graphic surfaces lately. I<br />
already had most of the routines for 16bpp and 8bpp bitmaps in older Toncs, but<br />
their use was still somewhat awkward because you had to provide some<br />
details of the destination manually; most notably a base pointer and the pitch.<br />
This got more than a little annoying, especially when trying to make blitters as<br />
well. So I made some changes.
</p>
<p></p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">typedef</span> <span class="kw1">struct</span> TSurface<br />
{<br />
&nbsp; &nbsp; u8&nbsp; *data;&nbsp; &nbsp; &nbsp; <span class="co1">//!&lt; Surface data pointer.</span><br />
&nbsp; &nbsp; u32 pitch;&nbsp; &nbsp; &nbsp; <span class="co1">//!&lt; Scanline pitch in bytes (PONDER: alignment?).</span><br />
&nbsp; &nbsp; u16 width;&nbsp; &nbsp; &nbsp; <span class="co1">//!&lt; Image width in pixels. </span><br />
&nbsp; &nbsp; u16 height; &nbsp; &nbsp; <span class="co1">//!&lt; Image height in pixels.</span><br />
&nbsp; &nbsp; u8&nbsp; bpp;&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">//!&lt; Bits per pixel.</span><br />
&nbsp; &nbsp; u8&nbsp; type; &nbsp; &nbsp; &nbsp; <span class="co1">//!&lt; Surface type (not used that much).</span><br />
&nbsp; &nbsp; u16 palSize;&nbsp; &nbsp; <span class="co1">//!&lt; Number of colors.</span><br />
&nbsp; &nbsp; u16 *palData; &nbsp; <span class="co1">//!&lt; Pointer to palette.</span><br />
} TSurface;</div>
</div>
<p>
I&#8217;ve rebuilt the routines around a surface description struct called<br />
<code>TSurface</code> (see above). This way, I can just initialize the<br />
surface somewhere and just pass the pointer to that surface around.<br />
There are a number of different kinds of surfaces. The most important<br />
ones are these three:
</p>
<ul>
<li><b>bmp16</b>. 16bpp bitmap surfaces.</li>
<li><b>bmp8</b>. 8bpp bitmap surfaces.</li>
<li><b>chr4c</b>. 4bpp tiled surfaces, in column-major order (i.e., tile 1 is<br />
<i>under</i> tile 0 instead of to the right). Column-major order may seem<br />
strange, but it actually simplifies the code considerably. There is also a<br />
<code>chr4<b>r</b></code> mode for normal, row-major tiling, but that&#8217;s unfinished<br />
and will probably remain so.</li>
</ul>
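<p>
To show why column-major tiling keeps the code simple, here is a sketch of a pixel plotter for such a surface. This is not the tonclib routine, only an illustration under a few assumptions: the height is a multiple of 8, tiles are the usual 8&times;8 at 4bpp (one 32-bit word per tile row, leftmost pixel in the lowest nibble), and the tiles of one 8-pixel-wide column follow each other in memory. <code>chr4c_plot()</code> is a made-up name.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Plot one pixel on a 4bpp, column-major tiled surface.
inline void chr4c_plot(std::uint32_t *data, int width, int height,
                       int x, int y, std::uint32_t clr)
{
    assert(x >= 0 && x < width && y >= 0 && y < height);

    // A whole column is 'height' consecutive words, so y indexes it directly
    // and tile boundaries never show up in the vertical direction.
    std::uint32_t *row = &data[(x/8)*height + y];
    int shift = (x & 7)*4;

    *row = (*row & ~(15u << shift)) | ((clr & 15u) << shift);
}
```

<p>
With row-major tiles, stepping down a pixel means jumping to a different tile every 8 lines; here a vertical step is always just the next word, so each 8-pixel-wide column behaves like a narrow bitmap.
</p>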
<div class="cptfr" style="width:240px;">
  <img src="http://www.coranac.com/img/post/surface.gif"
    alt="surface.gba movie" /><br />
  Demonstrating surface routines for 4bpp tiles.
</div>
<p>
For each of these three, I have the most important rendering functions:<br />
plotting pixels, lines, rectangles and <b>blits</b>. Yes, blits too. Even for<br />
<code>chr4c</code>-mode. There are routines for frames (empty<br />
rectangles) and floodfill as well. The functions have a uniform interface<br />
with respect to surface-type, so switching between them should be<br />
easy were it necessary. There are also tables with function pointers<br />
to these routines, so by using those you need not really care about<br />
the details of the surface after its creation. I&#8217;ll probably add a<br />
pointer to such a table in <code>TSurface</code> in the future.
</p>
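<p>
The idea behind those function-pointer tables can be sketched as follows. The structs are cut-down, hypothetical versions (the real <code>TSurface</code> and <code>TSurfaceProcTab</code> have more members), and only a <code>plot</code> entry is shown.
</p>

```cpp
#include <cassert>
#include <cstdint>

// Cut-down stand-ins for the real structs: just enough for plot().
struct Surface
{
    std::uint8_t *data;             // Surface data pointer
    std::uint32_t pitch;            // Scanline pitch in bytes
    std::uint16_t width, height;    // Size in pixels
};

struct SurfaceProcTab
{
    const char *name;
    void (*plot)(const Surface *dst, int x, int y, std::uint32_t clr);
};

static void bmp8_plot(const Surface *dst, int x, int y, std::uint32_t clr)
{
    dst->data[y*dst->pitch + x] = std::uint8_t(clr);
}

static void bmp16_plot(const Surface *dst, int x, int y, std::uint32_t clr)
{
    // Assumes the buffer really holds 16-bit pixels, 2-byte aligned.
    std::uint16_t *row = reinterpret_cast<std::uint16_t*>(dst->data + y*dst->pitch);
    row[x] = std::uint16_t(clr);
}

static const SurfaceProcTab bmp8_tab  = { "bmp8",  bmp8_plot  };
static const SurfaceProcTab bmp16_tab = { "bmp16", bmp16_plot };

// Draw a diagonal without knowing what kind of surface 'dst' is.
inline void draw_diagonal(const Surface *dst, const SurfaceProcTab *procs,
                          int len, std::uint32_t clr)
{
    assert(len <= dst->width && len <= dst->height);
    for (int i = 0; i < len; i++)
        procs->plot(dst, i, i, clr);
}
```

<p>
The price is an indirect call per primitive, which is usually small next to the cost of the rendering itself.
</p>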
<p></p>
<p>Linkies</p>
<ul>
<li>Demo project: <a href="http://www.coranac.com/files/misc/surface.zip">surface.zip</a>.</li>
<li>Tonclib: <a href="http://www.coranac.com/files/misc/tonclib-20080514.zip">tonclib</a>. </li>
<li><a href="http://www.coranac.com/man/tonclib/group__grpSurface.htm">Tonclib manual, TSurface module</a>.</li>
</ul>
<p>The image on the right is the result of the following routine.<br />
Turret pic semi-knowingly provided by<br />
<a href="http://helmetedrodent.kickassgamers.com/Pika/blog/">Kawa</a>.
</p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="kw1">void</span> test_surface_procs(<span class="kw1">const</span> TSurface *src, TSurface *dst, <br />
&nbsp; &nbsp; <span class="kw1">const</span> TSurfaceProcTab *procs, u16 colors[])<br />
{<br />
&nbsp; &nbsp; <span class="co1">// Init object text</span><br />
&nbsp; &nbsp; tte_init_obj(&amp;oam_mem[<span class="nu0">127</span>], ATTR0_TALL, ATTR1_SIZE_8, <span class="nu0">512</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; CLR_YELLOW, <span class="nu0">0</span>, &amp;vwf_default, <span class="kw2">NULL</span>);<br />
&nbsp; &nbsp; tte_init_con();<br />
&nbsp; &nbsp; tte_set_margins(<span class="nu0">8</span>, <span class="nu0">140</span>, <span class="nu0">160</span>, <span class="nu0">152</span>);</p>
<p>&nbsp; &nbsp; <span class="co1">// And go!</span><br />
&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{es;P}%s surface primitives#{w:60}&quot;</span>, procs-&gt;name);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{es;P}Rect#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;rect(dst, <span class="nu0">20</span>, <span class="nu0">20</span>, <span class="nu0">100</span>, <span class="nu0">100</span>, colors[<span class="nu0">0</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Frame#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;frame(dst, <span class="nu0">21</span>, <span class="nu0">21</span>, <span class="nu0">99</span>, <span class="nu0">99</span>, colors[<span class="nu0">1</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Hlines#{w:20}&quot;</span>);</p>
<p>&nbsp; &nbsp; procs-&gt;hline(dst, <span class="nu0">23</span>, <span class="nu0">23</span>, <span class="nu0">96</span>, colors[<span class="nu0">2</span>]);<br />
&nbsp; &nbsp; procs-&gt;hline(dst, <span class="nu0">23</span>, <span class="nu0">96</span>, <span class="nu0">96</span>, colors[<span class="nu0">2</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Vlines#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;vline(dst, <span class="nu0">23</span>, <span class="nu0">25</span>, <span class="nu0">94</span>, colors[<span class="nu0">3</span>]);<br />
&nbsp; &nbsp; procs-&gt;vline(dst, <span class="nu0">96</span>, <span class="nu0">25</span>, <span class="nu0">94</span>, colors[<span class="nu0">3</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Lines#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;line(dst, <span class="nu0">25</span>, <span class="nu0">25</span>, <span class="nu0">94</span>, <span class="nu0">40</span>, colors[<span class="nu0">4</span>]);<br />
&nbsp; &nbsp; procs-&gt;line(dst, <span class="nu0">94</span>, <span class="nu0">25</span>, <span class="nu0">79</span>, <span class="nu0">94</span>, colors[<span class="nu0">4</span>]);<br />
&nbsp; &nbsp; procs-&gt;line(dst, <span class="nu0">94</span>, <span class="nu0">94</span>, <span class="nu0">25</span>, <span class="nu0">79</span>, colors[<span class="nu0">4</span>]);<br />
&nbsp; &nbsp; procs-&gt;line(dst, <span class="nu0">25</span>, <span class="nu0">94</span>, <span class="nu0">40</span>, <span class="nu0">25</span>, colors[<span class="nu0">4</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Full blit#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;blit(dst, <span class="nu0">120</span>, <span class="nu0">16</span>, src-&gt;width, src-&gt;height, src, <span class="nu0">0</span>, <span class="nu0">0</span>);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Partial blit#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;blit(dst, <span class="nu0">40</span>, <span class="nu0">40</span>, <span class="nu0">40</span>, <span class="nu0">40</span>, src, <span class="nu0">12</span>, <span class="nu0">8</span>);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Floodfill#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;flood(dst, <span class="nu0">40</span>, <span class="nu0">32</span>, colors[<span class="nu0">5</span>]);<br />
&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P}Again !#{w:20}&quot;</span>);<br />
&nbsp; &nbsp; procs-&gt;flood(dst, <span class="nu0">40</span>, <span class="nu0">32</span>, colors[<span class="nu0">6</span>]);</p>
<p>&nbsp; &nbsp; tte_printf(<span class="st0">&quot;#{w:30;es;P;w:30}Ta-dah!!!#{w:20}&quot;</span>);</p>
<p>&nbsp; &nbsp; key_wait_till_hit(KEY_ANY);<br />
}</p>
<p><span class="co1">// Test 4bpp tiled, column-major surfaces</span><br />
<span class="kw1">void</span> test_chr4c_procs()<br />
{<br />
&nbsp; &nbsp; TSurface turret, dst;</p>
<p>&nbsp; &nbsp; <span class="co1">// Init turret for blitting.</span><br />
&nbsp; &nbsp; srf_init(&amp;turret, SRF_CHR4C, turretChr4cTiles, <span class="nu0">128</span>, <span class="nu0">128</span>, <span class="nu0">4</span>, <span class="kw2">NULL</span>);</p>
<p>&nbsp; &nbsp; <span class="co1">// Init destination surface</span><br />
&nbsp; &nbsp; srf_init(&amp;dst, SRF_CHR4C, tile_mem[<span class="nu0">0</span>], <span class="nu0">240</span>, <span class="nu0">160</span>, <span class="nu0">4</span>, pal_bg_mem);<br />
&nbsp; &nbsp; schr4c_prep_map(&amp;dst, se_mem[<span class="nu0">31</span>], <span class="nu0">0</span>);<br />
&nbsp; &nbsp; GRIT_CPY(pal_bg_mem, turretChr4cPal);</p>
<p>&nbsp; &nbsp; <span class="co1">// Set video stuff</span><br />
&nbsp; &nbsp; REG_DISPCNT= DCNT_MODE0 | DCNT_BG2 | DCNT_OBJ | DCNT_OBJ_1D;<br />
&nbsp; &nbsp; REG_BG2CNT= BG_CBB(<span class="nu0">0</span>)|BG_SBB(<span class="nu0">31</span>);</p>
<p>&nbsp; &nbsp; u16 colors[<span class="nu0">8</span>]= { <span class="nu0">6</span>, <span class="nu0">13</span>, <span class="nu0">1</span>, <span class="nu0">14</span>, <span class="nu0">15</span>, <span class="nu0">0</span>, <span class="nu0">14</span>, <span class="nu0">0</span> };</p>
<p>&nbsp; &nbsp; <span class="co1">// Run internal tester</span><br />
&nbsp; &nbsp; test_surface_procs(&amp;turret, &amp;dst, &amp;chr4c_tab, colors);<br />
}</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2008/05/surface-drawing-routines/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>computing, costs and caching, oh my</title>
		<link>http://www.coranac.com/2008/02/computing-costs-and-caching-oh-my/</link>
		<comments>http://www.coranac.com/2008/02/computing-costs-and-caching-oh-my/#comments</comments>
		<pubDate>Thu, 21 Feb 2008 13:36:09 +0000</pubDate>
		<dc:creator>cearn</dc:creator>
				<category><![CDATA[code]]></category>

		<guid isPermaLink="false">http://www.coranac.com/2008/02/21/computing-costs-and-caching-oh-my/</guid>
		<description><![CDATA[Via coding horror, I stumbled upon a simply wonderful talk by Herb Sutter about various performance issues like how much operations cost. It also discusses how memory, latency and machine architecture can affect that cost and how this has changed over the years. You can find the slides and a video of the presentation at http://nwcpp.org/Meetings/2007/09.html. [...]]]></description>
			<content:encoded><![CDATA[<p>
Via <a href="http://www.codinghorror.com/blog/archives/001061.html">coding horror</a>, I stumbled upon a simply <i>wonderful</i> talk by Herb Sutter about various performance issues like how much operations cost. It also discusses how memory, latency and machine architecture can affect that cost and how this has changed over the years. You can find the slides and a video of the presentation at <a href="http://nwcpp.org/Meetings/2007/09.html">http://nwcpp.org/Meetings/2007/09.html</a>.
</p>
<p>
Be prepared for a total geek-out. This is highly technical (and awesome, but that&#8217;s bordering on a tautology) stuff and probably not for the faint of heart. Slides 6 and 7, for example (around the 23m mark), show the value of cache compared to getting something from RAM, and just <i>how bad</i> retrieval from disk is. Later (slides 13 and on; around 55m in the video), when it comes to threads and how a compiler or even hardware may <span style="text-decoration:line-through;">screw you over</span> not do what you want to do, or even what you <i>tell</i> it to do, people who still have them are allowed to run to their moms for safety. By Patina, that is just nasty.
</p>
<p>
Near the end Sutter discusses the differences between using vectors, lists and sets and what the penalties for the latter are for something as simple as adding all the values in them. This starts at around slide 22, or 1h40m. Even if the rest is gobbledygook, this part is easy to understand. Basically, low footprint and sequential accesses are Good Things, even if you have cache and stuff. <i>Especially</i> when you have cache and stuff.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.coranac.com/2008/02/computing-costs-and-caching-oh-my/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

