# DMA vs ARM9 - fight!

DMA, or Direct Memory Access, is a hardware method for transferring data. As it's hardware-driven, it's pretty damn fast(1). As such, it's pretty much the standard method for copying on the NDS. Unfortunately, as many people have noticed, it doesn't always work.

There are two principal reasons for this: cache and TCM. These are two memory regions of the ARM9 that DMA is unaware of, which can lead to incorrect transfers. In this post, I'll discuss the cache, TCM and their interactions (or lack thereof) with DMA.

The majority of the post is actually about cache. Cache basically determines the speed of your app, so it's worth looking into in more detail. Why it and DMA don't like each other much will become clear along the way. I'll also present a number of test cases that show the conflicting areas, and some functions to deal with these problems.

## 1 The ARM 9 core

Fig 1. ARM 9 schematic. Core + TCMs and caches.

The first thing to know is that the DMA trouble only relates to the ARM9 processor of the NDS. Work with the ARM7 should be fine. The most relevant items of the ARM9 are illustrated by Fig 1. The processor consists of the actual logic unit, plus caches and Tightly Coupled Memory (TCM) units. There are two of each: one for data and one for instructions. The point here is that (as far as I know) these areas are on the chip, and as such accessible only by the CPU itself. CPU-only, as in not the DMA controller.

### 1.1 ITCM and DTCM

The Instruction and Data TCM areas (ITCM and DTCM) are basically fast-RAM areas. Technically the addresses of these sections are arbitrary, but set to the 0100:0000 and 0B00:0000 ranges, respectively, in libnds. Exactly which addresses they use isn't important though, since that's all taken care of by the linker anyway. What is important is that the stack (where local variables and function arguments go(2)) is also put in DTCM. This means that you can't use DMA with local arrays. It also means that you can't use a local variable as a source for a DMA-fill. This is why the NDS ARM9 has special DMA registers for these, called REG_DMAnFILL.

#define DMA_FILL16  (DMA_ENABLE | DMA_START_NOW | DMA_SRC_FIX)

// This doesn't work on ARM9: 'fill' is in unreachable DTCM.
// Should work on ARM7 though.
void dmaFill_bad()
{
    volatile u16 fill= 0;
    REG_DMA3CNT= DMA_FILL16 | 256*192;
}

// This does work on ARM9,
// but not ARM7 which has no REG_DMAnFILL registers.
void dmaFill_good()
{
    REG_DMA3FILL= 0;
    REG_DMA3CNT= DMA_FILL16 | 256*192;
}

The DMA-fill routines in libnds correctly use REG_DMAnFILL so fortunately you don't have to worry about that. However, copying from local arrays is still impossible, and there's just no way around that using just DMA.
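As a rough illustration of the TCM restriction, here is a minimal sketch of an address test one could run before handing a pointer to DMA. It assumes the libnds default placement mentioned above (ITCM in the 0100:0000 range, DTCM in the 0B00:0000 range); `isInTcm()` is a hypothetical helper, not a libnds function, and a different linker setup would need different constants.

```c
#include <stdbool.h>
#include <stdint.h>

// Top address bytes of the TCM regions, assuming the libnds defaults
// quoted in the text. Other setups may map the TCMs elsewhere.
#define ITCM_REGION 0x01u
#define DTCM_REGION 0x0Bu

// Hypothetical helper: true if addr lies in a region DMA cannot reach.
bool isInTcm(uint32_t addr)
{
    uint32_t region = addr >> 24;
    return region == ITCM_REGION || region == DTCM_REGION;
}
```

With these constants, the address of a local array on the DTCM stack (say 0x0B000100) is reported as unreachable, while main RAM at 0x02000000 is not.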

### 1.2 Cache

The cache for the ARM9 is a little more complicated. Or perhaps “obscured” is a better term here. The TCMs have addresses, so you can toy with them yourself. The cache, however, is completely hidden from the view of the user. Before going into detail about how the cache works and why DMA and cache hate each other, let's look at why cache is useful, especially in light of you not being able to use it directly.

Where there is memory, there are waitstates. Generally, a CPU can handle data faster than the RAM can supply it, so the CPU will have to wait until it can continue. The slowdown can easily be a factor of 100 on PCs, or even millions if you include disk memory. Fortunately, it's only about ten for the NDS, I think, but that's still quite a bit.

Cache is one method of getting around memory waitstates. Instead of having to go to RAM all the time for something, you store recently used data in an area that the CPU can have faster access to. Then next time it needs that data, it can retrieve it from there instead of going to RAM again. For good measure, the area around the data is also cached, because that might be accessed soon as well. Since closely-related data is often stored closely together as well, the caching process can significantly increase the overall speed of an application.

### 1.3 The NDS ARM9 cache

GBATEK gives us the following information about the cache that the NDS has:

Data Cache 4KB, Instruction Cache 8KB
4-way set associative method
Cache line 8 words (32 bytes)
Read-allocate method (ie. writes are not allocating cache lines)
Round-robin and Pseudo-random replacement algorithms selectable
Cache Lockdown, Instruction Prefetch, Data Preload
Data write-through and write-back modes selectable

Which to most people will probably mean absolutely nothing. Now, I'm not exactly an expert in all things cache, but I'll try to explain what it all means.

First, as noted earlier, there are actually two caches: one for data and one for instructions. Having a separate instruction cache is nice because then you can be sure that a function that processes a lot of data won't push its own code out of the cache. Instruction cache also means that loops in code will be in cache except for perhaps the first iteration. Effectively, all the code that really matters (i.e., inner loops that do most of the work) will always be in fast memory automatically.

It is common that cache works in groups of bytes instead of individual bytes. These groups are the cache lines. A cache line maps onto a RAM chunk of the same size, and if anything within a chunk is to be put in cache, the whole line will be filled. The NDS cache lines are 32 bytes long.
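To make the line arithmetic concrete, here is a small sketch that works out which line an address belongs to and how many lines a byte range overlaps. The 32-byte line size comes from the GBATEK summary; the function names are mine, chosen only for illustration.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   // bytes per line, from the GBATEK summary

// Address of the first byte of the line containing addr.
uint32_t cacheLineBase(uint32_t addr)
{
    return addr & ~(CACHE_LINE_SIZE - 1u);
}

// Number of cache lines overlapped by the byte range [addr, addr+size).
uint32_t cacheLinesTouched(uint32_t addr, uint32_t size)
{
    if (size == 0)
        return 0;
    uint32_t first = cacheLineBase(addr);
    uint32_t last  = cacheLineBase(addr + size - 1u);
    return (last - first) / CACHE_LINE_SIZE + 1u;
}
```

A 32-byte read that starts on a line boundary touches one line, but the same read starting 16 bytes later straddles two, so alignment affects how many line fills a miss can cost.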

This is probably a good time to introduce two important terms: cache hit and cache miss. A cache hit is when the data you're looking for is already in cache and so access is fast. A cache miss is when it's not in cache. This means two things. First, the access will be slow thanks to the memory waitstates. Second, if this triggers a cache-line fill, you'll have to wait for the entire cache line to be read. While this block read will be faster than if you were reading the block without cache, it'll still take longer than getting just the byte or so you were looking for. Moral of the story: cache hit good, cache miss bad.

Cache hits and misses add a consequence to how your data is stored. If data is tightly packed and sequential (think structs/arrays), you're more likely to have cache hits and work will be fast. If the data is all over the place (linked lists for example), the chance of cache misses increases dramatically.

That is how cached or non-cached data operates. What's also important is when data is put in cache in the first place – when cache allocation occurs. There are two types here: read-allocate or write-allocate. These terms refer to whether a cache line will be tied to a memory block when it's read from, or when it's written to, respectively. As you can see from the GBATEK data, the NDS cache is read-allocate. A memory-write will not require a new cache line.

Now, suppose a block is in cache and something is written to that block. This write will update the data in cache, but what about the RAM it's tied to? The process dealing with this is called the write policy, and two options exist. There's write-through, which means that both cache and RAM are written to. In write-back mode, only the cached data is changed; RAM is not updated! This is the main cause of trouble with DMA. Apparently, the write policy is selectable, but it'll usually be write-back.

Lastly, there's the replacement policy, which stipulates how cache lines relate to RAM, and when to kick data out of cache. Unfortunately I don't know much about this part, but it's of lesser importance anyway. The rest of the terms do not affect the potential cache-DMA conflict either. For more details, visit the Wikipedia page on CPU cache.

## 2 Cache example

At this point I think it's useful to give an example of how it works in practice. For this, I will use a fictional CPU that uses a cache with the following properties.

• 2 cache lines, 4 bytes each.
• Read-allocate and write-back.
• Direct mapped cache: RAM-block n goes into cache-line n%2. In other words, even blocks go to line 0, odd blocks to line 1.

The next set of pictures illustrates what happens when you do a number of reads and writes. Fig 2 shows the basic system in the initial state. There is the CPU on the left with the core and two cache lines. I've also included a register called x here for convenience. On the right there are 16 bytes of RAM, distributed over four 4-byte blocks (the equivalents of the cache lines). RAM is already initialized; cache is still empty. In the figures, green is used to indicate reads from RAM (loads) and purple for writes (stores).

• Fig 2. Initial state. Cache is empty. Nothing's happening.
• Fig 3. RAM[0] is written to. No change to cache.
• Fig 4. Read from RAM[1]. Cache-line 0 = RAM Block 0.
• Fig 5. Writes to cached addresses. Go to cache; not to RAM.
• Fig 6. Read from RAM[3]. This was in cache, so data's read from there, not RAM.
• Fig 7. Read from block 1; new allocation to line 1.
• Fig 8. Another cache hit.
• Fig 9. A cache miss with replacement. Block 2 replaces block 0's cached data. Block 0 receives line 0's data before replacement, and is now non-cached again.

1. Initial state.
2. RAM[0]= 'R'. A write to RAM does not trigger cache allocation, so it goes straight to RAM. Slowly.
3. x= RAM[1]. The read from RAM[1] causes a cache allocation. RAM[1] is part of block 0 (even), so that goes into cache line 0. Line 0 and block 0 are identical. Cache miss + allocation; very slow.
4. RAM[2]= RAM[3]= 'S'. Two writes; this is where it gets tricky. These addresses are in cache (block 0), and I said this cache was in write-back mode. This means that the writes go to the cache, but NOT the actual RAM. So now the data in cache is different from the equivalent block in RAM: the RAM's gone stale. This is a cache hit; a fast write.
5. x= RAM[3]. Again, RAM[3] is cached, so data is taken from cache instead of from RAM. x is now 'S', as was expected from the last statement. The fact the real RAM[3] is different doesn't matter, because the CPU doesn't look there anyway. If something other than the CPU (like, say, DMA) reads from RAM[3], though, chaos ensues. Cache hit, fast read.
6. x= RAM[4]. RAM[4] is in block 1, which wasn't cached yet, so a new line is allocated. Block 1 goes into line 1, because it's an odd-numbered block. Very slow operation.
7. RAM[4]= RAM[5]= 'T'. Much like before, the writes go into cache rather than the real RAM. Another cache hit.
8. x= RAM[8]. This is also an interesting case. Two things happen here. RAM[8] belongs to block 2 (even), which hadn't been cached yet. It's supposed to go into line 0, but that's already filled. The new data will replace the old data. Line 0 is currently tied to block 0, so addresses 0 through 3 are first filled with the data from line 0; this block is now up to date again. After that, line 0 receives the data from block 2. Cache write-out + new allocation; this should be awful.

This should cover all important cases: reads/writes to non-cached addresses, to cached addresses and a little bit about allocation and replacements. At some points, cache and RAM start to disagree. This wouldn't be a problem if RAM was only accessed by the CPU, but unfortunately it isn't.
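The walkthrough can also be replayed in code. What follows is a minimal host-side sketch of the fictional CPU (two 4-byte lines, direct mapped, read-allocate, write-back). Every name in it (`ToySystem`, `toyRead()`, `toyWrite()`, `toyDmaRead()`) is made up for illustration; it models the example, not the real ARM9 cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_SIZE = 4, NUM_LINES = 2, RAM_SIZE = 16 };

typedef struct {
    bool    valid;              // line is tied to a RAM block
    bool    dirty;              // line was written to; RAM is stale
    int     block;              // which RAM block the line is tied to
    uint8_t data[LINE_SIZE];
} CacheLine;

typedef struct {
    uint8_t   ram[RAM_SIZE];
    CacheLine lines[NUM_LINES];
} ToySystem;

// Fill RAM with 16 bytes of initial contents; the cache starts empty.
void toyInit(ToySystem *sys, const char *contents)
{
    memset(sys, 0, sizeof(*sys));
    memcpy(sys->ram, contents, RAM_SIZE);
}

// Return the line caching addr, or NULL on a cache miss.
CacheLine *toyLookup(ToySystem *sys, int addr)
{
    int block = addr / LINE_SIZE;
    CacheLine *line = &sys->lines[block % NUM_LINES];   // direct mapped
    return (line->valid && line->block == block) ? line : NULL;
}

// CPU read: a miss allocates a line, writing out the old one if dirty.
uint8_t toyRead(ToySystem *sys, int addr)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    CacheLine *line = toyLookup(sys, addr);
    if (line == NULL) {
        int block = addr / LINE_SIZE;
        line = &sys->lines[block % NUM_LINES];
        if (line->valid && line->dirty)                 // write-out of old block
            memcpy(&sys->ram[line->block * LINE_SIZE], line->data, LINE_SIZE);
        memcpy(line->data, &sys->ram[block * LINE_SIZE], LINE_SIZE);
        line->valid = true;
        line->dirty = false;
        line->block = block;
    }
    return line->data[addr % LINE_SIZE];
}

// CPU write: a hit only changes the cache; a miss goes straight to RAM.
void toyWrite(ToySystem *sys, int addr, uint8_t value)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    CacheLine *line = toyLookup(sys, addr);
    if (line != NULL) {
        line->data[addr % LINE_SIZE] = value;
        line->dirty = true;
    } else {
        sys->ram[addr] = value;
    }
}

// What a DMA controller would read: RAM only, never the cache.
uint8_t toyDmaRead(const ToySystem *sys, int addr)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    return sys->ram[addr];
}
```

Starting from RAM holding "abcdefghijklmnop" and replaying the steps, `toyRead(&sys, 3)` returns 'S' after the two writes while `toyDmaRead(&sys, 3)` still returns 'd'. Only the read of RAM[8] pushes line 0 back out and brings RAM[2] and RAM[3] up to date.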

On cache timings

To be completely honest, I have not really tested the cycle-times for the various cases. All I have to go on is GBATEK's memory timings and educated guesswork. The estimates should make sense, but I don't have much in the way of evidence at present, nor would I know exactly how to get that in the first place, as experimenting with cache can be tricky.

## 3 Cache vs DMA solution

Fig 5 and Fig 8 illustrate the main problem. The data in RAM is out of date and when DMA tries to read it, it actually uses the wrong data. The reverse is also possible. DMA could write to RAM that had been cached; in this case it's actually the cache that's out of date.

The solution to this is to align cache and RAM manually. The two actions involved are called flushing and invalidating. A cache flush dumps the contents of cache back into RAM. Now that cache and RAM contain the same data again, it's safe to DMA-read from. An invalidate tells the CPU to simply delete cache lines, because its assumptions regarding the contents of the original RAM have become invalid. The next CPU-read would come from RAM again. This is what you need after DMA writes to RAM.

Fig 10 and Fig 11 pick up from case 7 (Fig 8). They show what happens when you flush or invalidate a cache line. You actually supply a RAM block number because the cache lines themselves are completely hidden from view.

1. Flush block 0. In this case, you want to synchronize RAM block 0 to the cache. Block 0 is indeed in cache and using line 0, so the contents of line 0 are written back to block 0.
2. Invalidate block 1. Suppose that previously, some contents of block 1 had been written to without cache's knowledge, so that cache is out of date. The invalidate throws away the line related to block 1, making RAM the primary source for the block again.

• Fig 10. Flush the cache line related to block 0 (== RAM 0-3).
• Fig 11. Invalidate the cache line related to block 1 (== RAM 4-7).

When to Flush/Invalidate
• A cache flush writes cached data back to RAM. This is required before DMA-reads.
• A cache invalidate frees cache lines, causing the next read to be from RAM. This is required after DMA-writes.

Get the operation or the timing wrong, and they dock ya!

I mean, uhm … and you get memory corruption. Yeah.
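The two rules can be checked against a very small model: a single word of RAM with a one-entry write-back, read-allocate cache in front of it. This is only a sketch; the `CachedWord` type and the `word*()` functions are invented for this illustration and do not correspond to any libnds call.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t ram;      // what DMA sees
    uint32_t cached;   // what the CPU sees while the entry is valid
    bool     valid;
    bool     dirty;
} CachedWord;

// CPU read: allocate on a miss, then serve from cache.
uint32_t wordCpuRead(CachedWord *w)
{
    if (!w->valid) {
        w->cached = w->ram;
        w->valid  = true;
        w->dirty  = false;
    }
    return w->cached;
}

// CPU write: write-back on a hit, straight to RAM on a miss.
void wordCpuWrite(CachedWord *w, uint32_t value)
{
    if (w->valid) {
        w->cached = value;
        w->dirty  = true;
    } else {
        w->ram = value;
    }
}

// Flush: push a dirty entry back to RAM.
void wordFlush(CachedWord *w)
{
    if (w->valid && w->dirty) {
        w->ram   = w->cached;
        w->dirty = false;
    }
}

// Invalidate: forget the entry without writing anything back.
void wordInvalidate(CachedWord *w)
{
    w->valid = false;
    w->dirty = false;
}

// DMA bypasses the cache in both directions.
uint32_t wordDmaRead(const CachedWord *w)        { return w->ram; }
void     wordDmaWrite(CachedWord *w, uint32_t v) { w->ram = v; }
```

In this model a DMA read after a CPU write returns the old value until `wordFlush()` is called, and a CPU read after a DMA write returns the old value until `wordInvalidate()` is called. Invalidating while the entry is still dirty silently discards the CPU's write, which is the wrong-operation case from the box above.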

### 3.1 libnds cache functions

libnds contains functions to flush or invalidate. They can either affect the whole cache, or just certain address ranges. I am unsure of the timings of these functions, but I expect there will be a cost. Invalidating itself could be fast, but it'd make all subsequent reads cache misses. A flush would require a large number of writes to memory. To keep these costs down, use the ranged versions as much as possible.

//! Flush the entire data cache to memory.
void DC_FlushAll()
//! Flush the data cache for a range of addresses to memory.
void DC_FlushRange(const void *base, u32 size)

//! Invalidate the entire data cache.
void DC_InvalidateAll()
//! Invalidate the data cache for a range of addresses.
void DC_InvalidateRange(const void *base, u32 size)

//! Invalidate entire instruction cache.
void IC_InvalidateAll()
//! Invalidate the instruction cache for a range of addresses.
void IC_InvalidateRange(const void *base, u32 size)

### 3.2 Some safe DMA functions

To guard against potential DMA failures, it's useful to have a few functions that take care of those themselves. dmaCopySafe() and dmaFillSafe() check if DMA can reach the source and destination regions and will return false if not. They also check what chunk-size is appropriate by looking at the source, destination and size. Odd alignments fail completely; word-aligned addresses and sizes use 32-bit transfers and everything else uses 16-bit transfers. They also flush and invalidate where appropriate.

Note that for a completely safe version, you'd need much more checking. For example, each region has its own size that would have to be looked at, and some sections are read-only like ROM. Checking for all possibilities, however, would just make the functions too unwieldy, so those checks have been omitted.
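The chunk-size rule described above can be isolated into a pure function, which makes it easy to test off-target. This is a sketch only; `dmaUnitSize()` is a hypothetical helper and is not part of libnds or of the functions given here.

```c
#include <stdint.h>

// Hypothetical helper mirroring the rule in the text: returns 4 for
// 32-bit transfers, 2 for 16-bit transfers, 0 if DMA can't be used.
int dmaUnitSize(uint32_t srca, uint32_t dsta, uint32_t size)
{
    if ((srca | dsta) & 1u)         // odd address: would need byte copies
        return 0;
    if ((srca | dsta | size) & 3u)  // something is only halfword-aligned
        return 2;
    return 4;
}
```

Note that the rule as written does not reject an odd `size`. A halfword transfer would presumably leave the last byte uncopied in that case, so callers may want to check for it separately.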

//! Copy data from a src to a dst via DMA in a cache/section safe manner.
/*! The ARM9's DMA doesn't play well with the cache and can't access
ITCM or DTCM. This means that a basic DMA copy may not work as
expected. This function flushes or invalidates cache if necessary and
will only copy if the ranges are accessible.
param src   Source pointer.
param dst   Destination pointer.
param size  Size (in bytes) to copy.
return      True if the copy succeeded.
note        It's possible I missed some invalid cases, YHBW.
*/

bool dmaCopySafe(const void *src, void *dst, u32 size)
{
    u32 srca= (u32)src, dsta= (u32)dst;

    // Check TCMs and BIOS (0x01000000, 0x0B000000, 0xFFFF0000).
    //# NOTE: probably incomplete checks.
    if((srca>>24)==0x01 || (srca>>24)==0x0B || (srca>>24)==0xFF)
        return false;
    if((dsta>>24)==0x01 || (dsta>>24)==0x0B || (dsta>>24)==0xFF)
        return false;

    if((srca|dsta) & 1)                 // Fail on byte copy.
        return false;

    while(REG_DMA3CNT & DMA_BUSY) ;

    if((srca>>24)==0x02)                // Write cache back to memory.
        DC_FlushRange(src, size);

    if((srca|dsta|size) & 3)
        dmaCopyHalfWords(3, src, dst, size);
    else
        dmaCopyWords(3, src, dst, size);

    if((dsta>>24)==0x02)                // Drop cached dst lines; next CPU read comes from RAM.
        DC_InvalidateRange(dst, size);

    return true;
}

//! Fill a dst with a fill via DMA in a cache/section safe manner.
/*! The ARM9's DMA doesn't play well with the cache and can't access
ITCM or DTCM. This means that a basic DMA fill may not work as
expected. This function flushes or invalidates cache if necessary and
will only fill if the ranges are accessible.
param fill  Fill value.
param dst   Destination pointer.
param size  Size (in bytes) to copy.
return      True if the fill succeeded.
note        It's possible I missed some invalid cases, YHBW.
*/

bool dmaFillSafe(u32 fill, void *dst, u32 size)
{
    u32 dsta= (u32)dst;

    // Check TCMs and BIOS (0x01000000, 0x0B000000, 0xFFFF0000).
    //# NOTE: probably incomplete checks.
    if((dsta>>24)==0x01 || (dsta>>24)==0x0B || (dsta>>24)==0xFF)
        return false;

    if(dsta & 1)                        // Fail on byte fill.
        return false;

    while(REG_DMA3CNT & DMA_BUSY) ;

    if((dsta|size) & 3)
        dmaFillHalfWords(fill, dst, size);
    else
        dmaFillWords(fill, dst, size);

    if((dsta>>24)==0x02)                // Drop cached dst lines; next CPU read comes from RAM.
        DC_InvalidateRange(dst, size);

    return true;
}

## 4 Test cases

Fig 12. Test procedure. Copy either letter into buffer, then blit via CPU or DMA.

All this talk about what to do and when is nice and all, but it's always best to run a few tests to see if everything happens like you expected. Fig 12 illustrates how the tests operate. There are two source bitmaps of the letters ‘A’ and ‘B’. In each test case, one of these is copied into a secondary buffer by either a CPU- or DMA-based copy. This second buffer is blitted to VRAM in two different places via a CPU- or DMA-based blit. At various points, a flush or invalidate may be inserted to see the effects.

In terms of code, every case is split into two parts. First, there's a setup that initializes each case. This clears the console, prints a description of the case and erases the RAM buffer and the VRAM rectangles. The second part alternately copies the letters into the buffer, does cache operations, and blits. Since the first part is boring, only the case-specific part will be given here.
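For readers who want to follow the cases without the test project at hand, here is a guess at what the CPU-side blitter could look like. The `MiniBmp` layout is inferred from the initializer `{ 16, 16, localBuf }` used in one of the cases, and `fakeVram` is a host-side stand-in for the real framebuffer; both are assumptions, and the project's actual code may well differ.

```c
#include <stdint.h>

#define SCREEN_W 256
#define SCREEN_H 192

// Assumed layout: width, height, pointer to 16-bit pixels.
typedef struct {
    int       width;
    int       height;
    uint16_t *data;
} MiniBmp;

// Host-side stand-in for a 16-bit VRAM framebuffer.
uint16_t fakeVram[SCREEN_W * SCREEN_H];

// Copy the bitmap to position (x, y) with plain CPU halfword writes.
// No clipping: the caller must keep the rectangle on screen.
void cpuBlit(const MiniBmp *bmp, int x, int y)
{
    for (int row = 0; row < bmp->height; row++) {
        const uint16_t *src = &bmp->data[row * bmp->width];
        uint16_t *dst = &fakeVram[(y + row) * SCREEN_W + x];
        for (int col = 0; col < bmp->width; col++)
            dst[col] = src[col];
    }
}
```

Because every pixel passes through the CPU, this version reads the buffer through the data cache. That is why the CPU and DMA blits can disagree in the cases that follow.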

### 4.1 Direct blit: ‘A’ → VRAM

cpuBlit(&bmpA, X_CPU, Y_CPU);
dmaBlit(&bmpA, X_DMA, Y_DMA);

Result: both correct.
Explanation: the data in the source buffer never changes, so this should always be okay.

### 4.2 Indirect Blit I: ‘B’ → buffer → VRAM

memcpy(bmpBuf.data, bmpB.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: both correct.
Explanation: buffer used for the first time, so no cache incoherency possible. Yet.

### 4.3 Indirect Blit II: ‘A’ → buffer → VRAM

memcpy(bmpBuf.data, bmpA.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: CPU okay, DMA blit corrupted.
Explanation: The buffered data is in cache, so the memcpy() to the buffer goes to cache as well. Meanwhile, the actual buffer (in RAM) still holds (parts of) ‘B’, which is where DMA gets its data from. The result is a mix of ‘A’ and ‘B’ for dmaBlit(). It's a mix because apparently some cache lines have already been flushed out by the replacement policy.

### 4.4 Indirect Blit + flush: ‘B’ → buffer, flush → VRAM

memcpy(bmpBuf.data, bmpB.data, SIZE);
DC_FlushRange(bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: both correct.
Explanation: again, the memcpy() will leave stale data in RAM, but this time we force the up-to-date cache lines back to RAM to get cache and RAM in sync again. At this point, both the CPU and DMA-based blits will use the correct data.

### 4.5 Indirect Blit + invalidate: ‘A’ → buffer, invalidate → VRAM

memcpy(bmpBuf.data, bmpA.data, SIZE);
DC_InvalidateRange(bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: both screwed in the same way.
Explanation: as said in the note, you need to do the right operation at the right time. This is an example of what happens if you don't. A cache invalidate simply erases cache lines. The data in RAM is now considered valid. However, after the memcpy(), RAM isn't valid, as one can see from case 3. This is why now both blits fail. Nice job breaking it, hero.

### 4.6 Indirect Blit III: ‘A’ (dma)→ buffer → VRAM

dmaCopy(bmpA.data, bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: CPU blit fail.
Explanation: in this case, the source→buffer transfer is done via DMA rather than the CPU-based memcpy(). This time the CPU doesn't know the contents of RAM have been altered. At this point, the cache will contain mostly the cleared data from the memset() done in the case set-up; cpuBlit() will pick up some straggling lines from RAM during blit, resulting in the image you see here. Naturally, dmaBlit() works properly.

### 4.7 Indirect Blit + invalidate II: ‘B’ (dma)→ buffer, invalidate → VRAM

dmaCopy(bmpB.data, bmpBuf.data, SIZE);
DC_InvalidateRange(bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: both correct.
Explanation: as case 6, but with an invalidate after the transfer to the buffer. The invalidate removes the allocated cache lines, so that the next CPU-reads come from RAM. Unlike case 5, RAM is the most up-to-date area, so the invalidate works as it's supposed to.

### 4.8 Indirect Blit + invalidate/flush: ‘A’ → buffer, invalidate/flush → VRAM

memcpy(bmpBuf.data, bmpA.data, SIZE);
DC_InvalidateRange(bmpBuf.data, SIZE);
DC_FlushRange(bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlit(&bmpBuf, X_DMA, Y_DMA);

Result: both fail.
Explanation: The invalidate frees cache lines (see case 5). The subsequent flush does nothing, because no lines are associated with those addresses anymore.

This is an example of what I like to call LOL-type programming. As in “What are you doing?!?” - “I dunno lol”. This type of coding generally indicates a very confused mind that hopes that if you throw enough shit against a wall maybe something will hold up. He may have heard terms like flush and invalidate used around DMA and decided to try them randomly. This never works. If you find code like this, be afraid; very, very afraid. This will likely not be the only instance of cargo-cult programming in the code-base and it'd be best to consider the whole thing suspect.

### 4.9 dmaSafe I: ‘B’ → buffer (safe)→ VRAM

memcpy(bmpBuf.data, bmpB.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlitSafe(&bmpBuf, X_DMA, Y_DMA);

Result: both correct.
Explanation: this uses the dmaCopySafe() function given previously in the DMA blitter. Since that function checks whether a flush is appropriate, everything should be fine. And it is.

### 4.10 dmaSafe II: ‘A’ (safedma)→ buffer → VRAM

dmaCopySafe(bmpA.data, bmpBuf.data, SIZE);

cpuBlit(&bmpBuf, X_CPU, Y_CPU);
dmaBlitSafe(&bmpBuf, X_DMA, Y_DMA);

Result: both correct.
Explanation: dmaCopySafe() also performs an invalidate if necessary so, again, it all works.

### 4.11 Buffer in stack: ‘B’ → local buffer → VRAM

u16 localBuf[16*16];
MiniBmp bmpLocal= { 16, 16, localBuf };

memcpy(localBuf, bmpB.data, SIZE);

cpuBlit(&bmpLocal, X_CPU, Y_CPU);
dmaBlit(&bmpLocal, X_DMA, Y_DMA);

Result: DMA fail.
Explanation: in this case, I'm using a local buffer for the bitmap instead of a global one. The difference here is that a local buffer goes onto the stack, which is in DTCM rather than RAM. As discussed earlier, DTCM is invisible to DMA, so dmaBlit() doesn't work. Note that using dmaBlitSafe() wouldn't work either, but at least you'd get a return value indicating failure back instead of nothing at all.

Hardware vs emulator

As far as I know, no current NDS emulator emulates cache properly, so these tests would actually seem to produce correct results. “Correct” in the sense that they reproduce the target image, not that they give similar results as hardware.

## 5 Conclusions

• The ARM9 has DTCM and ITCM sections that DMA can't access. DMA transfers to and from there will fail. Because the stack is in DTCM, this includes transfers to/from (non-static) local variables.
• The ARM9 has cache that DMA can't see either. If DMA tries to read from or write to RAM blocks that have been cached, the wrong data may be transferred.
• The ARM9 uses 32-byte cache lines that are allocated when addresses are read from, but not when written to (read-allocate). It also uses a write-back policy: cached data is written back to RAM only when the cache line is replaced.
• Cache-miss bad; cache-hit good. Stale cache also bad, since it's the cause of incorrect DMA transfers.
• DMA-reads from RAM should be preceded by a cache flush, which writes cache lines back to RAM.
• DMA-writes to RAM should be preceded or followed by a cache invalidate, which clears cache lines so that the next CPU-read will be from RAM again.
• As far as I know, most emulators do not handle DMA-DTCM correctly. None emulate cache. If you suddenly find corrupted data after copying on hardware but not emulators, look at your DMA calls.
• Making graphics with rounded corners and intricate wide lines takes forever to get right.

Related test project: arm9vsdma.zip

##### Notes:
1. Well, quite fast anyway. In some circumstances CPU-based transfers are faster, but that's a story for another day.
2. Well, sometimes. Usually these go in CPU registers, but this is not the right place for that discussion either.

## 7 thoughts on “DMA vs ARM9 - fight!”

1. Thanks for the tip. I already had noticed some oddities with DMA transfers a while ago, but i re-parsed my code after reading your document and fixed a couple of other errors (such as flushing target after a dmacopy rather than invalidating it) in my project.

Live long and prosper.

3. In section 4.5, shouldn't `DC_FlushRange(...);` be replaced with `DC_InvalidateRange(...);`?

4. ChaimLeib, yes it should. C&P error, it happens sometimes :(

For those interested, also read ant's post regarding my 'safe' functions. There could be more at play here than I initially thought.
