<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: memcpy and memset replacements for GBA/NDS</title>
	<atom:link href="http://www.coranac.com/2008/01/tonccpy/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.coranac.com/2008/01/tonccpy/</link>
	<description>my own little world</description>
	<lastBuildDate>Mon, 30 Aug 2010 15:42:13 -0400</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Radamanthe</title>
		<link>http://www.coranac.com/2008/01/tonccpy/comment-page-1/#comment-2096</link>
		<dc:creator>Radamanthe</dc:creator>
		<pubDate>Fri, 20 Feb 2009 11:37:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.coranac.com/2008/01/25/tonccpy/#comment-2096</guid>
		<description>Damn, you&#039;re right, HTML messed with the template syntax (due to the &#039;&lt;&gt;&#039;). Well, I should have used tags for the code, but TBH, I don&#039;t know much about HTML :) But you can easily restore it: T_SRC &amp; T_DST are just &#039;typename&#039; arguments (or &#039;class&#039;, as you wish), and the std::binary_function inheritance can simply be removed (it doesn&#039;t hurt and it&#039;s a good habit for STL algorithms, which I don&#039;t use here anyway). Still, it&#039;s not really worth it, since I did not post the core template code. My main goal was for the reader to get the point, and to get some feedback.

Let&#039;s just say we agree on speed. After all, this is the main purpose of your article and I wouldn&#039;t have commented here if I felt it was not that significant.</description>
		<content:encoded><![CDATA[<p>Damn, you&#8217;re right, HTML messed with the template syntax (due to the &#8216;&lt;&gt;&#8217;). Well, I should have used tags for the code, but TBH, I don&#8217;t know much about HTML :) But you can easily restore it: T_SRC &amp; T_DST are just &#8216;typename&#8217; arguments (or &#8216;class&#8217;, as you wish), and the std::binary_function inheritance can simply be removed (it doesn&#8217;t hurt and it&#8217;s a good habit for STL algorithms, which I don&#8217;t use here anyway). Still, it&#8217;s not really worth it, since I did not post the core template code. My main goal was for the reader to get the point, and to get some feedback.</p>
<p> Let&#8217;s just say we agree on speed. After all, this is the main purpose of your article and I wouldn&#8217;t have commented here if I felt it was not that significant.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: cearn</title>
		<link>http://www.coranac.com/2008/01/tonccpy/comment-page-1/#comment-2095</link>
		<dc:creator>cearn</dc:creator>
		<pubDate>Thu, 19 Feb 2009 20:45:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.coranac.com/2008/01/25/tonccpy/#comment-2095</guid>
		<description>Gah, it looks like the WP sanitizer mangled some of the code there (removing the template arguments and such); if you don&#039;t mind, I&#039;ll try to clean it up a little.

I must say that this is a very interesting approach, one that I probably should look into in the future. With the right kinds of templates, this could make blitting much easier.

I do think optimization has to be considered for large-scale copies such as blits, though. Pixel-by-pixel copying can be much slower than a hand-assembled asm block transfer routine, and because so many pixels have to be processed it can become a bottleneck. However, for more complex functionality like pal-&gt;16bpp and transparent blits, templates can indeed be very helpful.</description>
		<content:encoded><![CDATA[<p>Gah, it looks like the WP sanitizer mangled some of the code there (removing the template arguments and such); if you don&#8217;t mind, I&#8217;ll try to clean it up a little.</p>
<p> I must say that this is a very interesting approach, one that I probably should look into in the future. With the right kinds of templates, this could make blitting much easier.</p>
<p> I do think optimization has to be considered for large-scale copies such as blits, though. Pixel-by-pixel copying can be much slower than a hand-assembled asm block transfer routine, and because so many pixels have to be processed it can become a bottleneck. However, for more complex functionality like pal-&gt;16bpp and transparent blits, templates can indeed be very helpful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Radamanthe</title>
		<link>http://www.coranac.com/2008/01/tonccpy/comment-page-1/#comment-2094</link>
		<dc:creator>Radamanthe</dc:creator>
		<pubDate>Thu, 19 Feb 2009 17:48:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.coranac.com/2008/01/25/tonccpy/#comment-2094</guid>
		<description>I once wrote C++ templated functions for blitting. One of my goals was to train myself a bit with templates, which I rarely use in my own code. This is quite different from your approach though, because it was designed around blitting features more than speed. But we share one purpose: being able to blit into VRAM from 8bpp sources. I know a lot about the mechanics of several assembly languages (ALU, registers, addressing modes, flags, etc... I used to spend a lot of time gaining a few cycles), but I lack knowledge about ARM opcode cycle counts. Since I don&#039;t really want to take the time to profile my code (it works and I don&#039;t feel it&#039;s necessary, I&#039;ll talk about it below), I&#039;d like to see these templates optimized by someone who chases cycles... and I&#039;m one of those who still believe that a human programmer can do a lot better than a compiler, even with today&#039;s CPUs with all their pipelines and stuff, simply because... we&#039;re still smarter than the machines! Other arguments are: gaining a few cycles pays off more on a 67MHz CPU than on a multi-GHz monster, and the actual optimization is essentially about taking care of that damn critical inner loop.

I&#039;ve read another article of yours on this site about your hesitation to use C++ instead of C. You have good reasons, but I think the performance one is no longer a threat. Indeed, there was a time when C++ compilers performed poorly compared to C compilers. This was not because of the language itself (C++ can be as close to the machine as C can), but because of the compilers themselves, and often because of programmers who barely know what is costly and what is not once translated into opcodes (those programmers have more reasons to fall into C++ performance traps than others). My point is that templates are not among the features that slow things down. At times they can be hard to master and must be planned carefully, but their tremendous power is worth that price.

Now about the blitters. The core blitter logic is centralized. Only when it needs to update the actual pixel is an (inlined) function used. This function takes the form of a functor (another C++ feature with no cycle cost, and potentially even faster than function pointers, which generally cannot be inlined). Then you only have to define a new functor to perform special effects at the pixel level (transparency, combinations with the source... etc).

Actually, I use 2 blitter templates for everything:

- one for generic blitting from an X bpp source to a Y bpp destination (usually both are the same and are 16bpp on DS). Note that I also use it for blitting tile-maps onto BG screens; after all, this is just the same as if you had a 16bpp 32x32 pixel screen. The core implementation is pretty simple and I don&#039;t think the inner loop needs optimization.

- the other template is intended for paletted-source blits (the specific problem you are talking about). You&#039;re right that this needs some special processing, because the usual approach is wrong in terms of performance, and it won&#039;t even work well if the destination is VRAM. This template can blit X bpp paletted images (where X is a templated int of 1, 2, 4 or 8) to 16bpp bitmap memory (it could be templated for 32bpp, not needed on DS though). It reads in 16-bit chunks for efficiency, taking line bounds alignment into account. It also uses a functor, with slightly different parameters (the source is a color index into the palette, not the actual color). The generic inner loop is undoubtedly slower than yours (mainly due to the X bpp source and because I did not really care about performance), but classic head/body/tail methods like yours would speed things up. Of course, template specialization is also possible. I did not put much effort into optimizing this stuff because the DS horsepower definitely doesn&#039;t lie in the area of old-school bitmap blitting techniques. Bitmap mode is a convenience on this platform: you can&#039;t count on the hardware to help you there (in fact, it will even slow you down more than 90% of the time, since the video hardware needs the data without delay while rendering). If I want moving stuff, I use OBJs, BGs or 3D polygons, that&#039;s it. Blitting is mostly about dealing with big chunks of memory, and a 67MHz ARM or DMA won&#039;t do miracles whatever you do. The DS can make far better use of its time than blitting bitmaps... so my point is that if I use blitting techniques, I need features over speed. The functor can provide this with advanced pixel operations.

&lt;b&gt;[[format fairie was here]]&lt;/b&gt;
[code lang=&quot;cpp&quot;]
// Here is a sample of the generic simple blit functor:

/** Simple copy from source to dest.
 */
template&lt;typename T_SRC, typename T_DST&gt;
struct simpleCopyFunc : std::binary_function&lt;T_SRC, T_DST*, void&gt;
{
	void operator() (T_SRC src, T_DST *dst) const
	{
		*dst = src;
	}
};

// And the generic transparent blit functor:

/** Writes source to destination if source is not of a given value.
 */
template&lt;typename T_SRC, typename T_DST&gt;
struct transparentCopyFunc : std::binary_function&lt;T_SRC, T_DST*, void&gt;
{
	T_SRC trans_;
	transparentCopyFunc(T_SRC trans=0) :
			trans_(trans)
	{}
	void inline operator() (T_SRC src, T_DST *dst) const
	{
		if (src != trans_)
			*dst = src;
	}
};

// The paletted transparent blit functor (no need to be templated actually):

struct palettedTransparentCopyFunc
{
	const u16 *pal_;
	u8 trans_;
	palettedTransparentCopyFunc(const u16 *pal, u8 trans=0) :
			pal_(pal),
			trans_(trans)
	{}

	void operator() (u8 src, u16 *&amp;dst) const
	{
		if (src != trans_)
			*dst = 0x8000 &#124; pal_[src]; // Note: the &#039;OR&#039; can be avoided if palette has all hi-order bits set.
	}
};

// sample call for paletted blitting:
blit(srcAddr, srcPitch,
	dstAddr, dstPitch,
	srcX, srcY, width, height,
	dstX, dstY,
	palettedTransparentCopyFunc(palAddr, transColor));
[/code]
These are just samples... but you can see that adding effects is as simple as adding a new functor (just copy-paste the simple functor, rename it, and do what is needed for each pixel). Then when you need to blit anything, you simply call the templated blitter function with the appropriate arguments (source &amp; destination addresses, source rectangle, destination position (upper-left), source &amp; dest pitches, and functor). Let&#039;s be clear: templates aren&#039;t needed to do that, function pointers would work as well (no inlining though). I used templates mainly for genericity, to centralize the blitter logic, nothing else.

The overall code is fairly portable (only u8, u16 should be replaced by uint8_t, uint16_t etc... for better portability). I can share the template code with you if you are interested. It would be great to have the power of templates alongside well-optimized inner loops. So if someday you want to cross the C++ line for your blitting stuff (besides simply being able to overload functions), do not hesitate to contact me. I&#039;d be happy to contribute.</description>
		<content:encoded><![CDATA[<p>I once wrote C++ templated functions for blitting. One of my goals was to train myself a bit with templates, which I rarely use in my own code. This is quite different from your approach though, because it was designed around blitting features more than speed. But we share one purpose: being able to blit into VRAM from 8bpp sources. I know a lot about the mechanics of several assembly languages (ALU, registers, addressing modes, flags, etc&#8230; I used to spend a lot of time gaining a few cycles), but I lack knowledge about ARM opcode cycle counts. Since I don&#8217;t really want to take the time to profile my code (it works and I don&#8217;t feel it&#8217;s necessary, I&#8217;ll talk about it below), I&#8217;d like to see these templates optimized by someone who chases cycles&#8230; and I&#8217;m one of those who still believe that a human programmer can do a lot better than a compiler, even with today&#8217;s CPUs with all their pipelines and stuff, simply because&#8230; we&#8217;re still smarter than the machines! Other arguments are: gaining a few cycles pays off more on a 67MHz CPU than on a multi-GHz monster, and the actual optimization is essentially about taking care of that damn critical inner loop.</p>
<p> I&#8217;ve read another article of yours on this site about your hesitation to use C++ instead of C. You have good reasons, but I think the performance one is no longer a threat. Indeed, there was a time when C++ compilers performed poorly compared to C compilers. This was not because of the language itself (C++ can be as close to the machine as C can), but because of the compilers themselves, and often because of programmers who barely know what is costly and what is not once translated into opcodes (those programmers have more reasons to fall into C++ performance traps than others). My point is that templates are not among the features that slow things down. At times they can be hard to master and must be planned carefully, but their tremendous power is worth that price.</p>
<p> Now about the blitters. The core blitter logic is centralized. Only when it needs to update the actual pixel is an (inlined) function used. This function takes the form of a functor (another C++ feature with no cycle cost, and potentially even faster than function pointers, which generally cannot be inlined). Then you only have to define a new functor to perform special effects at the pixel level (transparency, combinations with the source&#8230; etc).</p>
<p> Actually, I use 2 blitter templates for everything:</p>
<p> &#8211; one for generic blitting from an X bpp source to a Y bpp destination (usually both are the same and are 16bpp on DS). Note that I also use it for blitting tile-maps onto BG screens; after all, this is just the same as if you had a 16bpp 32&#215;32 pixel screen. The core implementation is pretty simple and I don&#8217;t think the inner loop needs optimization.</p>
<p> &#8211; the other template is intended for paletted-source blits (the specific problem you are talking about). You&#8217;re right that this needs some special processing, because the usual approach is wrong in terms of performance, and it won&#8217;t even work well if the destination is VRAM. This template can blit X bpp paletted images (where X is a templated int of 1, 2, 4 or 8) to 16bpp bitmap memory (it could be templated for 32bpp, not needed on DS though). It reads in 16-bit chunks for efficiency, taking line bounds alignment into account. It also uses a functor, with slightly different parameters (the source is a color index into the palette, not the actual color). The generic inner loop is undoubtedly slower than yours (mainly due to the X bpp source and because I did not really care about performance), but classic head/body/tail methods like yours would speed things up. Of course, template specialization is also possible. I did not put much effort into optimizing this stuff because the DS horsepower definitely doesn&#8217;t lie in the area of old-school bitmap blitting techniques. Bitmap mode is a convenience on this platform: you can&#8217;t count on the hardware to help you there (in fact, it will even slow you down more than 90% of the time, since the video hardware needs the data without delay while rendering). If I want moving stuff, I use OBJs, BGs or 3D polygons, that&#8217;s it. Blitting is mostly about dealing with big chunks of memory, and a 67MHz ARM or DMA won&#8217;t do miracles whatever you do. The DS can make far better use of its time than blitting bitmaps&#8230; so my point is that if I use blitting techniques, I need features over speed. The functor can provide this with advanced pixel operations.</p>
<p> <b>[[format fairie was here]]</b></p>
<div class="cpp">
<div class="cpp proglist" style=" "><span class="co1">// Here is a sample of the generic simple blit functor:</span></p>
<p> <span class="coMULTI">/** Simple copy from source to dest.<br /> &nbsp;*/</span><br /> <span class="kw1">template</span>&lt;<span class="kw1">typename</span> T_SRC, <span class="kw1">typename</span> T_DST&gt;<br /> <span class="kw1">struct</span> simpleCopyFunc : std::binary_function&lt;T_SRC, T_DST*, <span class="kw1">void</span>&gt;<br /> {<br /> &nbsp; &nbsp; <span class="kw1">void</span> operator() (T_SRC src, T_DST *dst) <span class="kw1">const</span><br /> &nbsp; &nbsp; {<br /> &nbsp; &nbsp; &nbsp; &nbsp; *dst = src;<br /> &nbsp; &nbsp; }<br /> };</p>
<p> <span class="co1">// And the generic transparent blit functor:</span></p>
<p> <span class="coMULTI">/** Writes source to destination if source is not of a given value.<br /> &nbsp;*/</span><br /> <span class="kw1">template</span>&lt;<span class="kw1">typename</span> T_SRC, <span class="kw1">typename</span> T_DST&gt;<br /> <span class="kw1">struct</span> transparentCopyFunc : std::binary_function&lt;T_SRC, T_DST*, <span class="kw1">void</span>&gt;<br /> {<br /> &nbsp; &nbsp; T_SRC trans_;<br /> &nbsp; &nbsp; transparentCopyFunc(T_SRC trans=<span class="nu0">0</span>) :<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; trans_(trans)<br /> &nbsp; &nbsp; {}<br /> &nbsp; &nbsp; <span class="kw1">void</span> <span class="kw1">inline</span> operator() (T_SRC src, T_DST *dst) <span class="kw1">const</span><br /> &nbsp; &nbsp; {<br /> &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> (src != trans_)<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; *dst = src;<br /> &nbsp; &nbsp; }<br /> };</p>
<p> <span class="co1">// The paletted transparent blit functor (no need to be templated actually):</span></p>
<p> <span class="kw1">struct</span> palettedTransparentCopyFunc<br /> {<br /> &nbsp; &nbsp; <span class="kw1">const</span> u16 *pal_;<br /> &nbsp; &nbsp; u8 trans_;<br /> &nbsp; &nbsp; palettedTransparentCopyFunc(<span class="kw1">const</span> u16 *pal, u8 trans=<span class="nu0">0</span>) :<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; pal_(pal),<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; trans_(trans)<br /> &nbsp; &nbsp; {}</p>
<p> &nbsp; &nbsp; <span class="kw1">void</span> operator() (u8 src, u16 *&amp;amp;dst) <span class="kw1">const</span><br /> &nbsp; &nbsp; {<br /> &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> (src != trans_)<br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; *dst = <span class="nu0">0&#215;8000</span> | pal_[src]; <span class="co1">// Note: the &#8216;OR&#8217; can be avoided if palette has all hi-order bits set.</span><br /> &nbsp; &nbsp; }<br /> };</p>
<p> <span class="co1">// sample call for paletted blitting:</span><br /> blit(srcAddr, srcPitch,<br /> &nbsp; &nbsp; dstAddr, dstPitch,<br /> &nbsp; &nbsp; srcX, srcY, width, height,<br /> &nbsp; &nbsp; dstX, dstY,<br /> &nbsp; &nbsp; palettedTransparentCopyFunc(palAddr, transColor));</div>
</div>
<p> These are just samples&#8230; but you can see that adding effects is as simple as adding a new functor (just copy-paste the simple functor, rename it, and do what is needed for each pixel). Then when you need to blit anything, you simply call the templated blitter function with the appropriate arguments (source &amp; destination addresses, source rectangle, destination position (upper-left), source &amp; dest pitches, and functor). Let&#8217;s be clear: templates aren&#8217;t needed to do that, function pointers would work as well (no inlining though). I used templates mainly for genericity, to centralize the blitter logic, nothing else.</p>
<p> The overall code is fairly portable (only u8, u16 should be replaced by uint8_t, uint16_t etc&#8230; for better portability). I can share the template code with you if you are interested. It would be great to have the power of templates alongside well-optimized inner loops. So if someday you want to cross the C++ line for your blitting stuff (besides simply being able to overload functions), do not hesitate to contact me. I&#8217;d be happy to contribute.</p>
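<p> The core blit template itself is not posted in this thread. Purely as an illustration, here is a minimal sketch of how a functor-driven blitter matching the sample call could look. Everything in it is an assumption: the parameter order is copied from the sample call, pitches are taken to be in elements rather than bytes, no clipping is done, and palettedTransparentCopy is a hypothetical variant of the posted functor that takes a plain pointer. It is the naive per-pixel loop, not the chunked reader described above.</p>

```cpp
#include <cstdint>

typedef std::uint8_t  u8;
typedef std::uint16_t u16;

// Hypothetical core blitter (not the original code): walks the source
// rectangle row by row and hands every pixel to the functor.
// Assumption: pitches are counted in elements, not bytes; no clipping.
template<typename T_SRC, typename T_DST, typename T_FUNC>
void blit(const T_SRC *src, int srcPitch,
          T_DST *dst, int dstPitch,
          int srcX, int srcY, int width, int height,
          int dstX, int dstY,
          T_FUNC func)
{
    for (int y = 0; y < height; ++y)
    {
        const T_SRC *s = src + (srcY + y) * srcPitch + srcX;
        T_DST *d = dst + (dstY + y) * dstPitch + dstX;
        for (int x = 0; x < width; ++x, ++s, ++d)
            func(*s, d);    // the functor decides what happens to the pixel
    }
}

// Hypothetical paletted functor in the style of the samples above:
// looks up the color and skips the transparent index.
struct palettedTransparentCopy
{
    const u16 *pal_;
    u8 trans_;

    palettedTransparentCopy(const u16 *pal, u8 trans = 0)
        : pal_(pal), trans_(trans) {}

    void operator()(u8 src, u16 *dst) const
    {
        if (src != trans_)
            *dst = static_cast<u16>(0x8000 | pal_[src]);  // set the top bit, as in the sample
    }
};
```

<p> Called as blit(src, srcPitch, dst, dstPitch, srcX, srcY, width, height, dstX, dstY, palettedTransparentCopy(pal, 0)), the compiler deduces all three template arguments, so a new effect only needs a new functor type.</p>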
]]></content:encoded>
	</item>
</channel>
</rss>
