Probabilities in simple random samples

1 Introduction

When it comes to math, one of the things that people are phenomenally bad at – aside from the obvious everything – is probabilities. Specifically, calculating probabilities. People are bad enough at estimating chances, but properly calculating them has its own hurdles.

In this particular case, I'd like to take a look at the probability of a simple random sample: getting a particular distribution of results from a number of samples, like getting 3 sixes in a certain number of rolls. Specifically, I'll look at the case with replacement. In this case, the sample pool remains the same after each individual sample. The case without replacement (like picking colored marbles from a vase without putting them back) is a bit trickier, so I won't cover that now.

Continue reading

DMA vs ARM9, round 2 : invalidate considered harmful

It would seem these two aren't finished with each other yet.

 

A while ago, I wrote an article about NDS caching , how it can interfere with DMA transfers and what you can do about them. A little later I got a pingback from ant512, who had tried the “safe” DMA routines I made and found they weren't nearly as safe as I'd hoped. I'm still not sure what the actual problem was, but this incident did make me think about one possible reason, namely the one that will be discussed in this post: problematic cache invalidation.

1 Test base

But first things first. Let's start with some simple test code, see below. We have a simple struct definition, two arrays using this struct, and some default data for both arrays that we'll use later.

// A random struct, 32-bits in size.
struct Foo
{
    u8  type;
    u8  id;
    u16 data;
} ALIGN(4);

// Define some globals. We only use 4 of each.
Foo g_src[16] ALIGN(32);
Foo g_dst[16] ALIGN(32);

const Foo c_fooIn[2][4]=
{
    {   // Initial source data.
        { 0x55, 0, 0x5111 },
        { 0x55, 1, 0x5111 },
        { 0x55, 2, 0x5111 },
        { 0x55, 3, 0x5111 }
    },
    {   // Initial destination data.
        { 0xDD, 0, 0xD111 },
        { 0xDD, 1, 0xD111 },
        { 0xDD, 2, 0xD111 },
        { 0xDD, 3, 0xD111 }
    },
};

And now we're going to do some simple things with these arrays that we always do: some reads, some writes, and a struct copy. And for the copying, I'm going to use DMA, because DMA-transfers are fast, amirite(1)? The specific actions I will do are the following:

Initialization
  • Zero out g_src and g_dst.
  • Initialize the arrays with some data, in this case from c_fooIn.
  • Cache-Flush both arrays to ensure they're uncached.
Testing
  • Modify element in g_src, namely g_src[0].
  • Modify an element in g_dst, namely g_dst[3].
  • DMA-copy g_src[0] to g_dst[0].

In other words, this:

void test_init()
{
    // Zero out everything
    memset(g_src, 0, sizeof(g_src));
    memset(g_dst, 0, sizeof(g_dst));

    // Fill 4 of each.
    for(int i=0; i<4; i++)
    {
        g_src[i]= c_fooIn[0][i];
        g_dst[i]= c_fooIn[1][i];
    }

    // Flush data to be sure.
    DC_FlushRange(g_src, sizeof(g_src));
    DC_FlushRange(g_dst, sizeof(g_dst));
}

void test_dmaCopy()
{
    test_init();

    // Change g_src[0] and g_dst[3]
    g_src[0].id += 0x10;
    g_src[0].data= 0x5222;

    g_dst[3].id += 0x10;
    g_dst[3].data= 0xD333;

    // DMA src[0] into dst[0];
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));
}

Note that there is nothing spectacularly interesting going on here. There's just your average struct definition, run of the mill array definitions, and boring old accesses without even any pointer magic that might hint at something tricky going on. Yes, alignment is forced, but that just makes the test more reliable. Also, the fact that I'm incrementing Foo.id rather than just reading from it is only because ARM9 cache is read-allocate, and I need to have these things end up in cache. The main point is that the actions in test_dmaCopy() are very ordinary.

2 Results

It should be obvious what the outcome of the test is supposed to be. However, when you run the test (on hardware! not an emulator), the result is something different.

// Just dmaCopy.

    // Result           // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    DD, 00, D111        // 55, 10, 5222 (bad!)
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D333        // DD, 13, D333

Notice that the changed values of g_src[0] never end up in g_dst[0]. Not only that, not even the original values of g_src[0] have been copied. It's as if the transfer never happened at all.

The reason for this was covered in detail in the original article. Basically, it's because the cache is invisible to DMA. Once a part of memory is cached, the CPU only looks at the contents of the cache and not the actual addresses, meaning that DMA not only reads out-of-date (stale) source data, but also puts its results where the CPU won't look. Two actions allow you to remedy this. The first is the cache flush, which writes the cache-lines back to RAM and frees them. Then there's the cache invalidate, which just frees the cache-lines. Note that in both cases, the cache is dissociated from memory.

With this information, it should be obvious what to do. When DMA-ing from RAM, you need to flush the cache before the transfer to update the source's memory. When DMA-ing to RAM, you need to invalidate after the transfer because now it's actually the cache's data that's stale. When you think about it a little this makes perfect sense, and it's easy enough to implement:

// New DMA-code:
    DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.
    DC_InvalidateRange(&g_dst[0], sizeof(Foo));     // Invalidate destination.

Unfortunately, this doesn't work right either. And if you think about it a lot instead of merely a little, you'll see why. Maybe showing the results will make you see what I mean. The transfer seems to work now, but the earlier changes to g_dst[3] have been erased. How come?

    // Result:          // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    55, 10, 5222        // 55, 10, 5222
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D111        // DD, 13, D333 (wut?)

The problem is that a cache-invalidate invalidates entire cache-lines, not just the range you supply. If the start or end of the data you want to invalidate does not align to a cache-line, the adjacent data contained in that line is also thrown away. I hope you can see that this is bad.

This is exactly what's happening here. The ARM9's cache-lines are 32 bytes in size. Because of the alignment I gave the arrays, elements 0 through 3 lie on the same cache-line. The changes to g_dst[3] occur in cache (but only because I read from it through +=). The invalidate of g_dst[0] also invalidates g_dst[3], which throws out the perfectly legit data and you're left in a bummed state. And again, I've done nothing spectacularly interesting here; all I did was modify something and then invalidated data that just happened to be adjacent to it. But that was enough. Very, very bad.

Just to be sure, this is not due to a bad implementation of DC_InvalidateRange(). The function does exactly what it's supposed to do. The problem is inherent in the hardware. If your data does not align correctly to cache-lines, an invalidate will apply to the adjacent data as well. If you do not want that to happen, do not invalidate.

3 Solutions

So what to do? Well, there is one option, though I'm not sure how foolproof it is: instead of invalidating the destination afterwards, you can also flush it before the transfer. This frees up the cache-lines without loss of data, after which it should be safe to DMA-copy to it.

    DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
    DC_FlushRange(&g_dst[0], sizeof(Foo));          // Flush destination.
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.
    // Result:          // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    55, 10, 5222        // 55, 10, 5222
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D333        // DD, 13, D333
   
    // Yay \o/

Alternatively, you can also disable the underlying reason behind the problem with invalidation: the write-buffer. The ARM9 cache allows two modes for writing: write-through, which also updates the memory behind the cache-line on every write; and write-back, which doesn't (the line is only written back when flushed or evicted). Obviously, write-back is faster, so that's how libnds sets things up. I know that putting the cache in write-through mode fixes this problem, because in libnds 1.4.0 the write-buffer had accidentally been disabled and my test cases didn't fail. This is probably not the route you want to take, though.

4 Conclusions

So what have we learned?

  • Cache - DMA interactions suck and can cause really subtle bugs. Ones that will only show up on hardware too.
  • Cache-flushes and invalidates cover the full cache-lines of the requested ranges, which may exceed the range you actually wanted.
  • To safely DMA from cachable memory, flush the source range first.
  • Contrary to what I wrote earlier, to DMA to cachable memory, do not cache-invalidate – at least not when the range is not properly aligned to cache-lines. Instead, flush the destination range before the transfer (at which time invalidation should be unnecessary). That said, invalidate should still be safe if the write-buffer is disabled.

Link to test code.

 

Notes:
  1. No I'm not. For NDS WRAM-WRAM copies, DMA is actually slow as hell and outperformed by every other method. But hopefully more on that later. For now, though, I need the DMA for testing purposes.

Grit 0.8.6 : synchronization update

This is just an update to synchronize what I have with devkitPro's distribution of grit. This includes updates to the makefile, and turning the way the size-constant is defined back into a #define. Apparently, consts aren't constanty enough for C compilers for use in array declarations. Shame, I would have liked to get rid of macros as much as possible :(.


In any case, the two versions should be identical again (with one small exception, namely that my version emits a .size directive for assembly, but that's a minor something that should not affect anyone.)

grit 0.8.4 (out with the old bugs, in with the new)

Okay, so it's been a while, but there's finally a new version for grit.

 

First of all, the vector::insert bug should finally be fixed. And there was much rejoicing. I've also added an option for forcing the map palette-index (-mp <num>), which should help with NDS backgrounds that use ext-palettes.

 

Also – and this one is pretty big – I've completely replaced the tile-mapping routines for something more general. The new method should be able to handle variable-sized tiles (-tw <n> and -th <n>) and is mostly independent of bitdepth. Specifically, bitdepths over 8 bpp can be handled as well, at least in principle. It also means that the external tileset can be a metatile-set as well now, which is good if you're using metatiles.

With this new method also comes a way to create a custom bitformat for maps (-mB flag). I'm not entirely sure how this can be used yet, but using more than 10 bits for the tile-index, or a 1bpp collision map should be possible now.

Since this is a fairly major change, I kinda expect there's still some bugs in the system. I have tested it for a number of options, but you know how it is with multi-platform stuff. In particular, if any of you big-endian-system users have trouble now, this will probably be the cause.

And now I will leave you with a …

Some new notes on NDS code size

When I discussed the memory footprints of several C/C++ elements, I apparently missed a very important item: operator new and related functions. I assumed new shouldn't increase the binary that much, but boy was I wrong.

The short story is that officially new should throw an exception when it can't allocate new memory. Exceptions come with about 60 kb worth of baggage. Yes, this is more or less the same stuff that goes into vector and string.

The long story, including a detailed look at a minimal binary, a binary that uses new and a solution to the exception overhead (in this particular case anyway) can be read below the fold.

Continue reading

Filter juggling and comment preview

One of the nice features of WordPress is that it already has a lot of functionality built-in. The whole thing is set up so that normal people can just install and start writing posts immediately, with WordPress taking care of all the details like converting HTML entities and adding newlines where appropriate.

Of course, for those that aren't normal and that would like to write in raw HTML, these things are somewhat annoying. Fortunately, though, WordPress allows you to disable these kinds of filters. The catch is that you need to find out which filters to disable, namely, wptexturize (which converts HTML entities) and wpautop (which does newline control). WordPress also makes it easy to add additional filters, like the CodeSnippet plugin that I use for code highlighting.

However, with the number of filters available, sometimes things will clash. A good example of this is comments that have source code in them. Part of what CodeSnippet does is convert certain characters (specifically: ‘<’, ‘>’, ‘&’) to printable entities (&lt;, &gt;, &amp;) so that they aren't considered special HTML characters anymore. However, there are several other filters that have a similar task, so that when you write this:

Oh hai! This is a useful bitfield function.
 
[code lang="cpp"]
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1) << start;
   y &= ~mask;
   y |= (x<<start) & mask;
}
[/code]

what it becomes is:

Oh hai! This is a useful bit function.
template
inline void bfInsert(T &amp;y, u32 x, int start, int len)
{
    u32 mask= (1&lt;&lt;len) &lt;&lt; start;
    y &amp;= ~mask;
    y |= (x&lt;&lt;start) &amp; mask;
}

Not exactly pretty. Note that the template class is simply removed because it's seen as an illicit HTML tag, and all the special characters are doubly converted. This is still a mild example; I think if you place the angle brackets wrong, whole swaths of code can simply be eaten by the sanitizer.

Unfortunately, finding out where the problem lies is tricky. Not only are there dozens of potential functions doing the conversion, they can be called from anywhere, and PHP isn't exactly rich in the debugger department. Worse still, in this particular case the mangling actually happens before the comment is even saved to the database (but only for unregistered visitors; for me the code comments would work fine), and because comments are handled on a page that you don't actually ever see, random echo/print statements are useless as well.

But I think I finally got it: it was wp_kses() using (in a roundabout way) wp_specialchars() in the wp-includes/kses.php file. The culprit is actually wp_filter_comment() from wp-includes/comment.php, using the pre_comment_content filter as a middleman.

The trick now is to keep it from happening. What I've done is define not one but two pre_comment_content filters: one that pre-mangles the brackets and ampersand before wp_kses, and one that de-mangles them afterwards. Of course, this will only be of importance between [code] tags. Exactly how to do this will depend on the plugin you're using, but in the case of CodeSnippet it goes like this:

//# Put this along with the other add_filter() calls.

// Ensure in-[code] entities ('<>&') work out right in the end.
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterDeEntity'), 1);
add_filter('pre_comment_content', array(&$CodeSnippet, 'filterReEntity'), 50);

...

//# Add these methods to the CodeSnippet class.
    /**
     * Pre-encode HTML entities. Should come \e before wp_kses.
     */

    function filterDeEntity($content)
    {
        $content=  preg_replace(
            '#(\[code.*?\])(.*?)(\[/code\])#msie',
            '"\\1" . str_replace(
                array("<", ">", "&"),
                array("[|LT|]", "[|GT|]", "[|AMP|]"),
                \'\\2\') . "\\3";',
            $content);
        $content= str_replace('\"', '"', $content);
       
        return $content;
    }
    /**
     * Decode HTML entities. Should come \e after wp_kses.
     */

    function filterReEntity($content)
    {
        if(strstr($content, "[|"))
        {
            $content= preg_replace(
                '#(\[code.*?\])(.*?)(\[/code\])#msie',
                '"\\1" . str_replace(
                    array("[|LT|]", "[|GT|]", "[|AMP|]"),
                    array("<", ">", "&"),
                    \'\\2\') . "\\3";',
                $content);
            $content= str_replace('\"', '"', $content);
        }
       
        return $content;
    }

Notice that both methods are under the same filter group. The trick is that they have different priorities, which makes one act before wp_kses(), and one after. Also note how the regexps work in the replacement part of preg_replace(). This particular feature of preg_replace() allows for shorter code, but is very fragile; it may be better to use preg_replace_callback() instead. In any case, written like this it seems to work:

Oh hai! This is a useful bit function.
template<class T>
inline void bfInsert(T &y, u32 x, int start, int len)
{
   u32 mask= ((1<<len)-1)<<start;
   y &= ~mask;
   y |= (x<<start) & mask;
}

Comment preview

The code-comment mangling is just part of the issues one can encounter in blog comments. It's usually impossible to see beforehand what will be accepted and what won't. Is HTML allowed? Are all tags allowed, or just some, or none at all? What about whitespace? Or BB-like tags? Basically, you'll never know what a comment will look like until you've submitted it, and by then it's too late to change it.

You know what'd be really helpful? A comment preview!

You'd think this'd be a fairly obvious feature for a blogging system to have, but apparently not. I was thinking of making my own preview functionality, but when attempting to do so, several items within WP thwarted my efforts. Fortunately, it seems plugins of this sort exist already. The plugin I'm now using is ajax-comment-preview, which works pretty darn well.

 

So anyway, comments should be able to handle code properly now and there's a comment-preview to show you what the comment will look like in the end. And there was much rejoicing.

Signs from Hell

The integer datatypes in C can be either signed or unsigned. Sometimes, it's obvious which should be used; for negative values you clearly should use signed types, for example. In many cases there is no obvious choice – in that case it usually doesn't matter which you use. Usually, but not always. Sometimes, picking the wrong kind can introduce subtle bugs in your program that, unless you know what to look out for, can catch you off-guard and have you searching for the problem for hours.

I've mentioned a few of these occasions in Tonc here and there, but I think it's worth going over them again in a little more detail. First, I'll explain how signed integers work, what the difference between signed and unsigned is, and where potential problems can come from. Then I'll discuss some common pitfalls so you know what to expect.

Continue reading

Another fast fixed-point sine approximation

Gaddammit!

 

So here I am, looking forward to a nice quiet weekend; hang back, watch some telly and maybe read a bit – but NNnnneeeEEEEEUUUuuuuuuuu!! Someone had to write an interesting article about sine approximation. With a challenge at the end. And using an inefficient kind of approximation. And so now, instead of just relaxing, I have to spend my entire weekend and most of the week figuring out a better way of doing it. I hate it when this happens >_<.

 

Okay, maybe not.

 

Sarcasm aside, it is an interesting read. While the standard way of calculating a sine – via a look-up table – works and works well, there's just something unsatisfying about it. The LUT-based approach is just … dull. Uninspired. Cowardly. Inelegant. In contrast, finding a suitable algorithm for it requires effort and a modicum of creativity, so something like that always piques my interest.

In this case it's sine approximation. I'd been wondering about that when I did my arctan article, but figured it would require too many terms to really be worth the effort. But looking at Mr Schraut's post (whose site you should be visiting from time to time too; there's good stuff there) it seems you can get a decent version quite rapidly. The article centers around the work found at devmaster thread 5784, which derived the following two equations:

(1) \begin{eqnarray} S_2(x) &=& \frac{4}{\pi} x - \frac{4}{\pi^2} x^2 \\ S_{4d}(x) &=& (1-P)\,S_2(x) + P\,S_2^2(x) \end{eqnarray}

These approximations work quite well, but I feel that it actually uses the wrong starting point. There are alternative approximations that give more accurate results at nearly no extra cost in complexity. In this post, I'll derive higher-order alternatives for both. In passing, I'll also talk about a few of the tools that can help analyse functions and, of course, provide some source code and do some comparisons.

Continue reading