DMA vs ARM9, round 2 : invalidate considered harmful

It would seem these two aren't finished with each other yet.

 

A while ago, I wrote an article about NDS caching, how it can interfere with DMA transfers and what you can do about it. A little later I got a pingback from ant512, who had tried the “safe” DMA routines I made and found they weren't nearly as safe as I'd hoped. I'm still not sure what the actual problem was, but this incident did make me think about one possible reason, namely the one that will be discussed in this post: problematic cache invalidation.

1 Test base

But first things first. Let's start with some simple test code, see below. We have a simple struct definition, two arrays using this struct, and some default data for both arrays that we'll use later.

// A random struct, 32-bits in size.
struct Foo
{
    u8  type;
    u8  id;
    u16 data;
} ALIGN(4);

// Define some globals. We only use 4 of each.
Foo g_src[16] ALIGN(32);
Foo g_dst[16] ALIGN(32);

const Foo c_fooIn[2][4]=
{
    {   // Initial source data.
        { 0x55, 0, 0x5111 },
        { 0x55, 1, 0x5111 },
        { 0x55, 2, 0x5111 },
        { 0x55, 3, 0x5111 }
    },
    {   // Initial destination data.
        { 0xDD, 0, 0xD111 },
        { 0xDD, 1, 0xD111 },
        { 0xDD, 2, 0xD111 },
        { 0xDD, 3, 0xD111 }
    },
};

And now we're going to do some simple things with these arrays that we always do: some reads, some writes, and a struct copy. And for the copying, I'm going to use DMA, because DMA-transfers are fast, amirite(1)? The specific actions I will do are the following:

Initialization
  • Zero out g_src and g_dst.
  • Initialize the arrays with some data, in this case from c_fooIn.
  • Cache-Flush both arrays to ensure they're uncached.
Testing
  • Modify an element in g_src, namely g_src[0].
  • Modify an element in g_dst, namely g_dst[3].
  • DMA-copy g_src[0] to g_dst[0].

In other words, this:

void test_init()
{
    // Zero out everything
    memset(g_src, 0, sizeof(g_src));
    memset(g_dst, 0, sizeof(g_dst));

    // Fill 4 of each.
    for(int i=0; i<4; i++)
    {
        g_src[i]= c_fooIn[0][i];
        g_dst[i]= c_fooIn[1][i];
    }

    // Flush data to be sure.
    DC_FlushRange(g_src, sizeof(g_src));
    DC_FlushRange(g_dst, sizeof(g_dst));
}

void test_dmaCopy()
{
    test_init();

    // Change g_src[0] and g_dst[3]
    g_src[0].id += 0x10;
    g_src[0].data= 0x5222;

    g_dst[3].id += 0x10;
    g_dst[3].data= 0xD333;

    // DMA src[0] into dst[0];
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));
}

Note that there is nothing spectacularly interesting going on here. There's just your average struct definition, run of the mill array definitions, and boring old accesses without even any pointer magic that might hint at something tricky going on. Yes, alignment is forced, but that just makes the test more reliable. Also, the fact that I'm incrementing Foo.id rather than just reading from it is only because ARM9 cache is read-allocate, and I need to have these things end up in cache. The main point is that the actions in test_dmaCopy() are very ordinary.

2 Results

It should be obvious what the outcome of the test should be. However, when you run the test (on hardware! not emulator), the result seems to be something different.

// Just dmaCopy.

    // Result           // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    DD, 00, D111        // 55, 10, 5222 (bad!)
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D333        // DD, 13, D333

Notice that the changed values of g_src[0] never end up in g_dst[0]. Not only that, not even the original values of g_src[0] have been copied. It's as if the transfer never happened at all.

The reason for this was covered in detail in the original article. Basically, it's because cache is invisible to DMA. Once a part of memory is cached, the CPU only looks at the contents of the cache and not the actual addresses, meaning that DMA not only reads out-of-date (stale) source data, but also puts it where the CPU won't look. Two actions allow you to remedy this. The first is the cache flush, which writes the cache-lines back to RAM and frees the cache-line. Then there's cache invalidate, which just frees the cache-line. Note that in both cases, the cache is dissociated from memory.
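To make the flush/invalidate distinction concrete, here is a minimal sketch that can be run on a PC. It is a toy model of a single cached word, not NDS code: `ToyCachedWord` and all of its members are made-up names, and the model assumes a write-back, read-allocate cache as described in this post.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of ONE cached 16-bit word. Hypothetical names; not part of libnds.
struct ToyCachedWord
{
    uint16_t ram   = 0;      // What DMA sees.
    uint16_t cache = 0;      // What the CPU sees while 'valid' is set.
    bool     valid = false;  // The cache currently holds this word.
    bool     dirty = false;  // The cache holds data not yet written to RAM.

    // CPU read: allocates the line on a miss (read-allocate).
    uint16_t cpuRead()
    {
        if(!valid) { cache= ram; valid= true; dirty= false; }
        return cache;
    }

    // CPU write: on a hit only the cache changes (write-back).
    void cpuWrite(uint16_t value)
    {
        if(valid) { cache= value; dirty= true; }
        else      { ram= value; }
    }

    // DMA bypasses the cache completely.
    uint16_t dmaRead() const { return ram; }

    // Flush: write back if dirty, then free the line.
    void flush()      { if(valid && dirty) ram= cache; valid= dirty= false; }
    // Invalidate: free the line and discard whatever it held.
    void invalidate() { valid= dirty= false; }
};

// What a DMA read sees after a cached CPU write, with or without a flush first.
uint16_t dmaSeesAfterCpuWrite(bool flushFirst)
{
    ToyCachedWord w;
    w.ram= 0x5111;
    w.cpuRead();            // Pull the word into cache.
    w.cpuWrite(0x5222);     // Hit: only the cache is updated.
    if(flushFirst)
        w.flush();
    return w.dmaRead();
}

// Quick self-check of the two cases.
void demoStaleSource()
{
    assert(dmaSeesAfterCpuWrite(false) == 0x5111);  // Stale: DMA reads old RAM.
    assert(dmaSeesAfterCpuWrite(true)  == 0x5222);  // Flushed: DMA reads new data.
}
```

Without the flush, the DMA side of the model still reads the old RAM contents; with it, the new value has been written back before the read.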

With this information, it should be obvious what to do. When DMA-ing from RAM, you need to flush the cache before the transfer to update the source's memory. When DMA-ing to RAM, you need to invalidate after the transfer because now it's actually the cache's data that's stale. When you think about it a little this makes perfect sense, and it's easy enough to implement:

// New DMA-code:
    DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.
    DC_InvalidateRange(&g_dst[0], sizeof(Foo));     // Invalidate destination.

Unfortunately, this doesn't work right either. And if you think about it a lot instead of merely a little, you'll see why. Maybe showing the results will make you see what I mean. The transfer seems to work now, but the earlier changes to g_dst[3] have been erased. How come?

    // Result:          // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    55, 10, 5222        // 55, 10, 5222
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D111        // DD, 13, D333 (wut?)

The problem is that a cache-invalidate invalidates entire cache-lines, not just the range you supply. If the start or end of the data you want to invalidate does not align to a cache-line, the adjacent data contained in that line is also thrown away. I hope you can see that this is bad.

This is exactly what's happening here. The ARM9's cache-lines are 32 bytes in size. Because of the alignment I gave the arrays, elements 0 through 3 lie on the same cache-line. The changes to g_dst[3] occur in cache (but only because I read from it through +=). The invalidate of g_dst[0] also invalidates g_dst[3], which throws out the perfectly legit data and you're left in a bummed state. And again, I've done nothing spectacularly interesting here; all I did was modify something and then invalidate data that just happened to be adjacent to it. But that was enough. Very, very bad.

Just to be sure, this is not due to a bad implementation of DC_InvalidateRange(). The function does exactly what it's supposed to do. The problem is inherent in the hardware. If your data does not align correctly to cache-lines, an invalidate will apply to the adjacent data as well. If you do not want that to happen, do not invalidate.
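The range a flush or invalidate really covers is plain arithmetic, so it can be worked out ahead of time. The sketch below assumes the 32-byte line size mentioned above; `LineSpan`, `affectedSpan()` and `spillsOver()` are hypothetical helper names, and the address used in the usage note is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kCacheLineSize = 32;   // ARM9 data cache line size in bytes.

struct LineSpan
{
    uint32_t start;   // First byte affected (line-aligned).
    uint32_t end;     // One past the last byte affected (line-aligned).
};

// Range actually touched by a cache operation requested on [addr, addr+size).
inline LineSpan affectedSpan(uint32_t addr, uint32_t size)
{
    assert(size > 0);
    LineSpan span;
    span.start = addr & ~(kCacheLineSize - 1);                          // Round down.
    span.end   = (addr + size + kCacheLineSize - 1) & ~(kCacheLineSize - 1);  // Round up.
    return span;
}

// True if the operation also hits bytes outside the requested range.
inline bool spillsOver(uint32_t addr, uint32_t size)
{
    const LineSpan span = affectedSpan(addr, size);
    return span.start != addr || span.end != addr + size;
}
```

For a 32-byte-aligned array at a made-up address of 0x02000000, `affectedSpan(0x02000000, 4)` covers 0x02000000 up to 0x02000020. That includes offset 12, which is where element 3 of a 4-byte struct array lives, so `spillsOver()` returns true for that request.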

3 Solutions

So what to do? Well, there is one thing you can try, although I'm not sure how foolproof it is: instead of invalidating the destination afterwards, you can flush it before the transfer. This frees up the cache-lines without loss of data, and then it should be safe to DMA-copy to it.

    DC_FlushRange(&g_src[0], sizeof(Foo));          // Flush source.
    DC_FlushRange(&g_dst[0], sizeof(Foo));          // Flush destination.
    dmaCopy(&g_src[0], &g_dst[0], sizeof(Foo));     // Transfer.

    // Result:          // Expected:
    // Source (hex)
    55, 10, 5222        // 55, 10, 5222
    55, 01, 5111        // 55, 01, 5111
    55, 02, 5111        // 55, 02, 5111
    55, 03, 5111        // 55, 03, 5111
                                 
    // Destination (hex)
    55, 10, 5222        // 55, 10, 5222
    DD, 01, D111        // DD, 01, D111
    DD, 02, D111        // DD, 02, D111
    DD, 13, D333        // DD, 13, D333
   
    // Yay \o/
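The difference between invalidating afterwards and flushing beforehand can be mimicked on a PC with a toy model of one 32-byte cache-line. This is only a sketch under the assumptions stated in this post (write-back, read-allocate); `ToyLine`, `Strategy` and `runDestinationTest()` are invented names, and each Foo is packed into one little-endian 32-bit word for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of one write-back cache-line holding eight 32-bit words.
// Purely illustrative; none of these names exist in libnds.
struct ToyLine
{
    std::array<uint32_t, 8> ram{};     // What DMA reads and writes.
    std::array<uint32_t, 8> cache{};   // What the CPU sees while 'valid' is set.
    bool valid = false;
    bool dirty = false;

    uint32_t cpuRead(int i)
    {
        assert(i >= 0 && i < 8);
        if(!valid) { cache= ram; valid= true; dirty= false; }   // Read-allocate.
        return cache[i];
    }
    void cpuWrite(int i, uint32_t v)
    {
        assert(i >= 0 && i < 8);
        if(valid) { cache[i]= v; dirty= true; }   // Hit: cache only.
        else      { ram[i]= v; }                  // Miss: straight to RAM.
    }
    void dmaWrite(int i, uint32_t v) { ram[i]= v; }   // DMA never touches the cache.

    void flush()      { if(valid && dirty) ram= cache; valid= dirty= false; }
    void invalidate() { valid= dirty= false; }
};

enum class Strategy { InvalidateAfter, FlushBefore };

// Replays the destination side of test_dmaCopy() and returns the line as the
// CPU sees it afterwards. Words 0 and 3 stand in for g_dst[0] and g_dst[3].
std::array<uint32_t, 8> runDestinationTest(Strategy strategy)
{
    ToyLine line;
    line.ram[0]= 0xD11100DD;            // { 0xDD, 0, 0xD111 }
    line.ram[3]= 0xD11103DD;            // { 0xDD, 3, 0xD111 }

    // g_dst[3].id += 0x10; g_dst[3].data= 0xD333;  (the read caches the line)
    line.cpuRead(3);
    line.cpuWrite(3, 0xD33313DD);       // { 0xDD, 0x13, 0xD333 }

    if(strategy == Strategy::FlushBefore)
        line.flush();
    line.dmaWrite(0, 0x52221055);       // DMA { 0x55, 0x10, 0x5222 } into g_dst[0].
    if(strategy == Strategy::InvalidateAfter)
        line.invalidate();

    std::array<uint32_t, 8> seen{};
    for(int i=0; i<8; i++)
        seen[i]= line.cpuRead(i);
    return seen;
}
```

In this model both strategies make the transferred word visible to the CPU, but only `Strategy::FlushBefore` keeps the pending change to word 3. Note that the model assumes nothing touches the line between the flush and the end of the transfer.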

Alternatively, you can also disable the underlying reason behind the problem with invalidation: the write-buffer. The ARM9 cache allows two modes for writing: write-through, which also updates the memory related to the cache-line; and write-back, which doesn't. Obviously, the write-back is faster, so that's how libnds sets things up. I know that putting the cache in write-through mode fixes this problem, because in libnds 1.4.0 the write-buffer had been accidentally disabled and my test cases didn't fail. This is probably not the route you want to take, though.

4 Conclusions

So what have we learned?

  • Cache - DMA interactions suck and can cause really subtle bugs. Ones that will only show up on hardware too.
  • Cache-flushes and invalidates cover the whole cache-lines of the requested ranges, which may exceed the range you actually wanted.
  • To safely DMA from cacheable memory, flush the source range first.
  • Contrary to what I wrote earlier, to DMA to cacheable memory, do not cache-invalidate, at least not when the range is not properly aligned to cache-lines. Instead, flush the destination range before the transfer (at which time invalidation should be unnecessary). That said, invalidate should still be safe if the write-buffer is disabled.
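One way to get ranges that are properly aligned to cache-lines is to give DMA destinations a type that owns whole lines. The following is a sketch of that idea, not something from the test code: `DmaBuffer` and `isLineAligned()` are hypothetical names, the line size is the 32 bytes mentioned earlier, and the `Foo` here is a stand-in with the same layout as the one in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 32;   // ARM9 data cache line size in bytes.

// Hypothetical wrapper: a DMA destination that owns whole cache-lines.
// alignas pins the start to a line boundary, and sizeof() of a type is always
// a multiple of its alignment, so the tail is padded out to a full line.
template <typename T, std::size_t N>
struct alignas(kCacheLineSize) DmaBuffer
{
    T items[N];
};

// True if [ptr, ptr+size) starts and ends on cache-line boundaries.
inline bool isLineAligned(const void *ptr, std::size_t size)
{
    assert(ptr != nullptr);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
    return addr % kCacheLineSize == 0 && size % kCacheLineSize == 0;
}

// Stand-in for the 32-bit struct used in the test.
struct Foo { uint8_t type; uint8_t id; uint16_t data; };

static_assert(sizeof(DmaBuffer<Foo, 4>) == 32, "16 bytes of data padded to one line");
static_assert(sizeof(DmaBuffer<Foo, 9>) == 64, "36 bytes of data padded to two lines");
```

A cache operation on the whole buffer then cannot spill into unrelated variables. Operating on a single element inside it still affects its neighbours in the same buffer, so a check like `isLineAligned(&buf.items[1], sizeof(Foo))` returns false.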

Link to test code.

 

Notes:
  1. No I'm not. For NDS WRAM-WRAM copies, DMA is actually slow as hell and outperformed by every other method. But hopefully more on that later. For now, though, I need the DMA for testing purposes.

9 thoughts on “DMA vs ARM9, round 2 : invalidate considered harmful”

  1. Pingback: Coranac » DMA vs ARM9, round 2 : invalidate considered harmful Zero Me

  2. When Flush(destination) is believed to take too much time, we could replace it with

    if (unaligned(dst.start))
        flush(dst.start & alignment, (dst.start & alignment) + cache_line);
    if (unaligned(dst.end))
        flush(dst.end & alignment, (dst.end & alignment) + cache_line);
    flush(src)
    dma_copy(...)
    invalidate(dst)

    couldn't we ?

    Btw, I'd suspect that RAM-to-RAM DMA copies being slower than RAM-to-VRAM could be due to the fact that RAM and VRAM are actually using distinct buses. When moving data through separated buses, reads and writes don't have to fight for bus bandwidth and can happen in parallel.

  3. When Flush(destination) is believed to take too much time, we could replace it with

    if (unaligned(dst.start))
        flush(dst.start & alignment, (dst.start & alignment) + cache_line);
    if (unaligned(dst.end))
        flush(dst.end & alignment, (dst.end & alignment) + cache_line);
    flush(src)
    dma_copy(...)
    invalidate(dst)

    couldn't we ?

    Argh, I forgot about this option. Yes, this is also possible. And probably the best solution, although it's a little awkward to write it out. I think it'd be something like this:

    #define CACHE_LINE_SIZE 32

    // Assuming cached regions. Add tests for that yourself.
    void dmaCopySafish(const void *src, void *dst, u32 size)
    {
        DC_FlushRange(src, size);                       // Flush source.
       
        u32 addr= (u32)dst;
        if(addr % CACHE_LINE_SIZE)                      // Check head
            DC_FlushRange((void*)(addr), 1);
           
        if((addr+size) % CACHE_LINE_SIZE)               // Check tail.
            DC_FlushRange((void*)(addr+size), 1);

        dmaCopy(src, dst, size);                        // Actual copy.
        DC_InvalidateRange(dst, size);                  // Final invalidate.
    }

    Btw, I'd suspect that RAM-to-RAM DMA copies being slower than RAM-to-VRAM could be due to the fact that RAM and VRAM are actually using distinct buses. When moving data through separated buses, reads and writes don't have to fight for bus bandwidth and can happen in parallel.

    According to gbatek, it's this:

    NDS Sequential Main Memory DMA
    Main RAM has different access time for sequential and non-sequential access. Normally DMA uses sequential access (except for the first word), however, if the source and destination addresses are both in Main RAM, then all accesses become non-sequential. In that case it would be faster to use two DMA transfers, one from Main RAM to a scratch buffer in WRAM, and one from WRAM to Main RAM.

    I just noticed this thread, where simonjhall warns against exactly this type of behaviour: http://forum.gbadev.org/viewtopic.php?t=15294.

  4. According to gbatek, it's this:
    Ooh. Yes, of course. Row-preload latency and all those RAS-CAS things. I tend to overlook it. And it basically explains the role of the "cache lines" size mentioned above.

    Thanks for the time you invest in talking about the issue. It's been a while since I found someone to have a DMA-related discussion with ;)

  5. Pingback: cache link « Corey's Journal

  6. Hi,

    // Assuming cached regions. Add tests for that yourself.
    void dmaCopySafish(const void *src, void *dst, u32 size)
    {
        DC_FlushRange(src, size);                       // Flush source.

        u32 addr= (u32)dst;
        if(addr % CACHE_LINE_SIZE)                      // Check head
            DC_FlushRange((void*)(addr), 1);

        if((addr+size) % CACHE_LINE_SIZE)               // Check tail.
            DC_FlushRange((void*)(addr+size), 1);

        // If some other task interrupts the current one at this point and accesses
        // data on the boundary cache-lines (the destination is not aligned to
        // cache-lines, so those lines also hold other data), those lines may get
        // cached again. If the CPU then modifies that memory ...

        dmaCopy(src, dst, size);                        // Actual copy.
        DC_InvalidateRange(dst, size);                  // Final invalidate.

        // ... the modification is lost by the final invalidate.
    }

  7. Just to check, has anyone ever managed to use the stack as source/destination of a DMA transfer, or is that an awful idea per se?

  8. The stack is in DTCM, which is invisible to the DMA controller. This is pretty much why REG_DMAnFILL exists, and why you can't DMA a local array.
