It seems that a bug in the hiscore entry code that I thought I'd fixed over a month ago got resurrected in setds v1.0. So I fixed it – refixed it – today and we're now at setds v1.01.
Sigh. Gaddammit >_<
I've finally taken the time to add proper saving that works on hardware (at least on my R4, which is the only card I have right now). I've also fixed a very annoying game-hanging bug and done some tweaking here and there, like correcting the decay constant so that I can finally get the scores that I was used to on tatset (2300+, wooo!).
This is pretty much where I wanted to go with this game, so it's now version 1.0. I may add some other things later, but for now I guess it's done.
It would seem these two aren't finished with each other yet.
A while ago, I wrote an article about NDS caching, how it can interfere with DMA transfers and what you can do about it. A little later I got a pingback from ant512, who had tried the “safe” DMA routines I made and found they weren't nearly as safe as I'd hoped. I'm still not sure what the actual problem was, but this incident did make me think about one possible reason, namely the one that will be discussed in this post: problematic cache invalidation.
But first things first. Let's start with some simple test code, see below. We have a simple struct definition, two arrays using this struct, and some default data for both arrays that we'll use later.
And now we're going to do some simple things with these arrays that we always do: some reads, some writes, and a struct copy. And for the copying, I'm going to use DMA, because DMA-transfers are fast, amirite(1)? The specific actions I will do are the following:
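- Zero out g_src and g_dst.
- Fill both arrays with the default data, c_fooIn.
- Modify an element of g_src, namely g_src[0].
- Modify an element of g_dst, namely g_dst[3].
- DMA-copy g_src[0] to g_dst[0].

In other words, this: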
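A minimal sketch of that sequence, under a few assumptions: only the `id` field of `Foo` is named in the text, so the other fields, the 8-byte layout and the default values are hypothetical, and `copy_stub()` is a `memcpy` stand-in for the DMA transfer so that the sketch also runs off-target. On a machine without the cache problem, this shows the outcome you would expect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte layout; only `id` is named in the text. */
typedef struct Foo {
    uint8_t  type;
    uint8_t  id;
    uint16_t data;
    uint32_t extra;
} Foo;

/* 32-byte alignment puts elements 0..3 on a single ARM9 cache line. */
Foo g_src[4] __attribute__((aligned(32)));
Foo g_dst[4] __attribute__((aligned(32)));

/* Default data for both arrays (values are made up for illustration). */
const Foo c_fooIn[4] = {
    { 0x10, 0, 0x1000, 0x10000000 },
    { 0x11, 1, 0x1111, 0x11111111 },
    { 0x12, 2, 0x1222, 0x12222222 },
    { 0x13, 3, 0x1333, 0x13333333 },
};

/* Stand-in for the DMA copy; on the NDS this would be a DMA transfer. */
static void copy_stub(const void *src, void *dst, size_t size)
{
    memcpy(dst, src, size);
}

void test_dmaCopy(void)
{
    assert(((uintptr_t)g_src % 32) == 0 && ((uintptr_t)g_dst % 32) == 0);

    /* Zero out both arrays. */
    memset(g_src, 0, sizeof(g_src));
    memset(g_dst, 0, sizeof(g_dst));

    /* Fill both arrays with the default data. */
    memcpy(g_src, c_fooIn, sizeof(g_src));
    memcpy(g_dst, c_fooIn, sizeof(g_dst));

    /* Modify g_src[0] and g_dst[3]; += reads first, then writes. */
    g_src[0].id += 0x10;
    g_dst[3].id += 0x10;

    /* Copy g_src[0] to g_dst[0]. */
    copy_stub(&g_src[0], &g_dst[0], sizeof(Foo));
}
```

With a plain `memcpy`, `g_dst[0]` ends up equal to the modified `g_src[0]` and `g_dst[3]` keeps its incremented `id`.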
Note that there is nothing spectacularly interesting going on here.
There's just your average struct definition, run of the mill array
definitions, and boring old accesses without even any pointer magic
that might hint at something tricky going on. Yes, alignment is forced,
but that just makes the test more reliable. Also, the fact that I'm
incrementing Foo.id rather than just reading from it is
only because ARM9 cache is read-allocate, and I need to have these
things end up in cache. The main point is that the actions in
test_dmaCopy() are very ordinary.
It should be obvious what the outcome of the test should be. However, when you run the test (on hardware! not emulator), the result seems to be something different.
Notice that the changed values of g_src[0] never
end up in g_dst[0]. Not only that, not even the
original values of g_src[0] have been copied.
It's as if the transfer never happened at all.
The reason for this was covered in detail in the original article. Basically, it's because cache is invisible to DMA. Once a part of memory is cached, the CPU only looks to the contents of the cache and not the actual addresses, meaning that DMA not only reads out-of-date (stale) source data, but also puts it where the CPU won't look. Two actions allow you to remedy this. The first is the cache flush, which writes the cache-lines back to RAM and frees the cache-line. Then there's cache invalidate, which just frees the cache-line. Note that in both cases, the cache is dissociated from memory.
With this information, it should be obvious what to do. When DMA-ing from RAM, you need to flush the cache before the transfer to update the source's memory. When DMA-ing to RAM, you need to invalidate after the transfer because now it's actually the cache's data that's stale. When you think about it a little this makes perfect sense, and it's easy enough to implement:
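A minimal sketch of such a wrapper, using the libnds calls `DC_FlushRange()`, `DC_InvalidateRange()` and `dmaCopy()`. The helper name `dmaCopyFlushInvalidate()` is hypothetical, and the `ARM9` macro is assumed to be defined by the build (as the standard libnds makefiles do); the `#else` branch only provides do-nothing stand-ins so the sketch compiles on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef ARM9
#include <nds.h>   /* DC_FlushRange, DC_InvalidateRange, dmaCopy */
#else
/* Off-target stand-ins: there is no data cache to maintain here. */
static void DC_FlushRange(const void *base, uint32_t size)      { (void)base; (void)size; }
static void DC_InvalidateRange(const void *base, uint32_t size) { (void)base; (void)size; }
static void dmaCopy(const void *src, void *dst, uint32_t size)  { memcpy(dst, src, size); }
#endif

/* Hypothetical helper: flush the source, copy, invalidate the destination. */
void dmaCopyFlushInvalidate(const void *src, void *dst, uint32_t size)
{
    assert(src != NULL && dst != NULL);

    DC_FlushRange(src, size);       /* write cached source data back to RAM */
    dmaCopy(src, dst, size);        /* DMA only reads and writes RAM        */
    DC_InvalidateRange(dst, size);  /* drop the now-stale destination lines */
}
```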
Unfortunately, this doesn't work right either. And if you think about
it a lot instead of merely a little, you'll see why. Maybe showing the
results will make you see what I mean. The transfer seems to work now,
but the earlier changes to g_dst[3] have been erased. How
come?
The problem is that a cache-invalidate invalidates entire cache-lines, not just the range you supply. If the start or end of the data you want to invalidate does not align to a cache-line, the adjacent data contained in that line is also thrown away. I hope you can see that this is bad.
This is exactly what's happening here. The ARM9's cache-lines are 32
bytes in size. Because of the alignment I gave the arrays, elements
0 through 3 lie on the same cache-line. The changes to
g_dst[3] occur in cache (but only because I read from it
through +=). The invalidate of g_dst[0]
also invalidates g_dst[3], which throws out the
perfectly legit data and you're left in a bummed state. And again,
I've done nothing spectacularly interesting here; all I did was
modify something and then invalidate data that just happened to be
adjacent to it. But that was enough. Very, very bad.
Just to be sure, this is not due to a bad implementation of
DC_InvalidateRange(). The function does exactly what it's
supposed to do. The problem is inherent in the hardware. If your
data does not align correctly to cache-lines, an invalidate will apply
to the adjacent data as well. If you do not want that to happen, do
not invalidate.
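The arithmetic behind this can be sketched in a few lines. The function names are hypothetical; the only fact taken from the text is the 32-byte line size. An invalidate of any range effectively covers everything from the start of its first cache-line to the end of its last one.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u  /* ARM9 data cache line size, per the text */

/* Start of the cache line that contains addr. */
uint32_t line_floor(uint32_t addr)
{
    return addr & ~(CACHE_LINE_SIZE - 1u);
}

/* One past the end of the last cache line touched by [addr, addr+size). */
uint32_t line_ceil(uint32_t addr, uint32_t size)
{
    assert(size > 0);
    return (addr + size + CACHE_LINE_SIZE - 1u) & ~(CACHE_LINE_SIZE - 1u);
}

/* Nonzero if invalidating [inv, inv+inv_size) also throws away cached
   bytes that belong to [other, other+other_size). */
int invalidate_hits(uint32_t inv, uint32_t inv_size,
                    uint32_t other, uint32_t other_size)
{
    uint32_t lo = line_floor(inv);
    uint32_t hi = line_ceil(inv, inv_size);
    return other < hi && other + other_size > lo;
}
```

For an 8-byte element at a line-aligned address `base`, `invalidate_hits(base, 8, base + 24, 8)` is nonzero: the element at offset 24 (the `g_dst[3]` of the test) shares the line. An element at `base + 32` sits on the next line and is left alone.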
So what to do? Well, there is one thing, though I'm not sure how foolproof it is: instead of invalidating the destination afterwards, you can flush it before the transfer. This frees up the cache-lines without loss of data, and then it should be safe to DMA-copy to it.
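To make the difference concrete, here is a toy model of a read-allocate, write-back cache, not the real hardware: 64 bytes of "RAM", two 32-byte lines, and hypothetical names throughout. It replays the test twice, once with invalidate-after and once with flush-before.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model: 64 bytes of RAM covered by two 32-byte cache lines. */
enum { LINE = 32, NLINES = 2, RAM_SIZE = LINE * NLINES };
/* Hypothetical layout: g_src[0] at 0; g_dst[0] and g_dst[3] share line 1. */
enum { SRC0 = 0, DST0 = LINE, DST3 = LINE + 24, ELEM = 8 };

static uint8_t ram[RAM_SIZE];        /* what DMA sees */
static uint8_t cache[NLINES][LINE];  /* what the CPU sees on a hit */
static int valid[NLINES], dirty[NLINES];

/* CPU read: read-allocate, so a miss pulls the whole line into cache. */
uint8_t cpu_read(unsigned addr)
{
    unsigned l = addr / LINE;
    assert(addr < RAM_SIZE);
    if (!valid[l]) {
        memcpy(cache[l], &ram[l * LINE], LINE);
        valid[l] = 1;
        dirty[l] = 0;
    }
    return cache[l][addr % LINE];
}

/* CPU write: write-back on a hit, straight to RAM on a miss. */
void cpu_write(unsigned addr, uint8_t value)
{
    unsigned l = addr / LINE;
    assert(addr < RAM_SIZE);
    if (valid[l]) {
        cache[l][addr % LINE] = value;
        dirty[l] = 1;
    } else {
        ram[addr] = value;
    }
}

/* Flush: write dirty lines back to RAM, then free them. */
void dc_flush(unsigned addr, unsigned size)
{
    assert(size > 0 && addr + size <= RAM_SIZE);
    for (unsigned l = addr / LINE; l <= (addr + size - 1) / LINE; l++) {
        if (valid[l] && dirty[l])
            memcpy(&ram[l * LINE], cache[l], LINE);
        valid[l] = dirty[l] = 0;
    }
}

/* Invalidate: free every line the range touches, dirty or not. */
void dc_invalidate(unsigned addr, unsigned size)
{
    assert(size > 0 && addr + size <= RAM_SIZE);
    for (unsigned l = addr / LINE; l <= (addr + size - 1) / LINE; l++)
        valid[l] = dirty[l] = 0;
}

/* DMA: RAM to RAM, bypassing the cache entirely. */
void dma_copy(unsigned src, unsigned dst, unsigned size)
{
    memmove(&ram[dst], &ram[src], size);
}

/* Reset, then modify g_src[0] and g_dst[3] through the cache. */
static void setup(void)
{
    memset(ram, 0, sizeof(ram));
    memset(valid, 0, sizeof(valid));
    memset(dirty, 0, sizeof(dirty));
    cpu_write(SRC0, (uint8_t)(cpu_read(SRC0) + 0x10));
    cpu_write(DST3, (uint8_t)(cpu_read(DST3) + 0x13));
}

void copy_invalidate_after(void)
{
    setup();
    dc_flush(SRC0, ELEM);
    dma_copy(SRC0, DST0, ELEM);
    dc_invalidate(DST0, ELEM);   /* also drops the dirty g_dst[3] */
}

void copy_flush_before(void)
{
    setup();
    dc_flush(SRC0, ELEM);
    dc_flush(DST0, ELEM);        /* g_dst[3] reaches RAM first */
    dma_copy(SRC0, DST0, ELEM);
}
```

In this model, `cpu_read(DST0)` gives 0x10 after either variant, but `cpu_read(DST3)` gives 0 after `copy_invalidate_after()` (the modification is lost) and 0x13 after `copy_flush_before()`. The model says nothing about a line being pulled back into cache while the transfer is still running, which is one reason to stay cautious about how foolproof the flush-first approach is.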
Alternatively, you can also disable the underlying reason behind the problem with invalidation: the write-buffer. The ARM9 cache allows two modes for writing: write-through, which also updates the memory related to the cache-line; and write-back, which doesn't. Obviously, the write-back is faster, so that's how libnds sets things up. I know that putting the cache in write-through mode fixes this problem, because in libnds 1.4.0 the write-buffer had been accidentally disabled and my test cases didn't fail. This is probably not the route you want to take, though.
So what have we learned?
Okay, so it's only a card game; but a game nonetheless.
The game in question is an NDS implementation of SET. Set is a card-matching game with 81 cards (see below). The figures on the cards have four properties and 3 possibilities for each property. The key is to find three cards for which the values of each property are either all equal or all different. Looking at the color property, for example, a "Red Red Red" combination could (yes "could"; there are still three other properties to consider) form a set. "Red Green Blue" would also work, but "Red Green Green" would not.
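As an illustration of the rule (this is not the game's actual code), a set check can be sketched as follows. The `Card` type and `is_set()` are hypothetical names; each property is assumed to be stored as 0, 1 or 2. For three values in that range, "all equal or all different" is the same as "the sum is divisible by 3".

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_PROPS = 4, NUM_VALUES = 3 };  /* 3^4 = 81 distinct cards */

/* Hypothetical card representation: one value (0..2) per property. */
typedef struct Card {
    unsigned char prop[NUM_PROPS];
} Card;

/* True if, for every property, the three values are all equal or all
   different. With values in 0..2 that holds exactly when their sum is
   a multiple of 3. */
bool is_set(Card a, Card b, Card c)
{
    for (int i = 0; i < NUM_PROPS; i++) {
        assert(a.prop[i] < NUM_VALUES && b.prop[i] < NUM_VALUES &&
               c.prop[i] < NUM_VALUES);
        if ((a.prop[i] + b.prop[i] + c.prop[i]) % NUM_VALUES != 0)
            return false;
    }
    return true;
}
```

Taking property 0 as the color with Red = 0, Green = 1 and Blue = 2, "Red Green Green" sums to 2 and is rejected on that property alone, while "Red Red Red" and "Red Green Blue" pass it and go on to be checked against the other three.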
Further details can be found in the readme and the game itself.
The game is mostly finished. There may be some tweaking to do here and there, but right now I don't want to get bogged down in a massive fine-tuning-fest – especially since I'm not sure what parts need fine-tuning … and because there's other stuff I really should get back to.
That said, all important aspects work … with one exception: hiscore saving. Yes, that. I've seen the multitude of threads on the subject but so far I'm unsure of what would work on both hardware and emulator, so I'm leaving it as is for now. If anyone has a tidy hw+emu solution, please do tell.
Oh, and merry Christmas everybody.
When I discussed the
memory footprints of several C/C++ elements, I apparently missed a
very important item: operator new and related functions. I
assumed new shouldn't increase the binary that much,
but boy was I wrong.
The short story is that officially new should throw an
exception when it can't allocate new memory. Exceptions come with about
60 kb worth of baggage. Yes, this is more or less the same stuff that
goes into vector and string.
The long story, including a detailed look at a minimal binary,
a binary that uses new and a solution to the exception overhead (in this particular case anyway) can be read below the fold.
DMA, or Direct Memory Access, is a hardware method for transferring data. As it's hardware-driven, it's pretty damn fast(1). As such, it's pretty much the standard method for copying on the NDS. Unfortunately, as many people have noticed, it doesn't always work.
There are two principal reasons for this: cache and TCM. These are two memory regions of the ARM9 that DMA is unaware of, which can lead to incorrect transfers. In this post, I'll discuss the cache, TCM and their interactions (or lack thereof) with DMA.
The majority of the post is actually about cache. Cache basically determines the speed of your app, so it's worth looking into in more detail. Why it and DMA don't like each other much will become clear along the way. I'll also present a number of test cases that show the conflicting areas, and some functions to deal with these problems.
Even though the total size of code is usually small compared to assets, there are still some concerns about a number of systems. Most notable among these are stdio, iostream and several STL components like vectors and strings. I've seen people voice concerns about these items, but I don't think I've ever seen any measurements of them. So here are some.
| Component | Size (bytes) |
| --- | --- |
| Barebones: just VBlank code | 14516 |
| base+printf | 71148 |
| base+iprintf | 54992 |
| base+iostream | 266120 |
| base+fopen | 56160 |
| base+fstream | 260288 |
| base+<string> | 59384 |
| base+<vector> | 59624 |
| base+<string>+<vector> | 59648 |
The sizes in Table 1 are for a bare source file with just the VBlank initialization and swiWaitForVBlank() plus whatever's necessary to use a particular component. For the IO parts this means a call to consoleDemoInit(); for vectors and strings, it means defining a variable.
Even an empty project is already 15k in size. Almost all of this is FIFO code, which is required for the ARM9 and ARM7 to communicate. Adding consoleDemoInit() and a printf() call brings the total to roughly 71k. Printf has a lot of baggage: you have to have basic IO hooks, character type functions, allocations, decimal and floating point routines and more.
Because printf() uses the usually unnecessary floating point routines for float conversions, it is often suggested to use the integer-only variant iprintf(). In that case, it comes down to 55k. The difference is mostly due to two functions: _vfprintf_r() and _dtoa_r(), for 5.8k and 3.6k, respectively. The rest is made up of dozens of smaller functions. While the difference is relatively large, considering the footprint of the other components, the extra 16k is probably not that big of a deal. For the record, there is no difference in speed between the two. Well, almost: if the format string doesn't contain formatting parts, printf() is actually considerably faster. Another point of note is that the 55k for iprintf() is actually already added just by using consoleDemoInit().
And now the big one. People have said that C++ iostream was heavy, and indeed it is. 266k! That's quite a lot, especially since the benefits of using iostream over stdio are rather slim if not actually negative(1). Don't use iostream in NDS projects. Don't even #include <iostream>, as that seems enough to link the whole thing in.
Related to iostream is fstream. This also is about a quarter MB. I haven't checked too carefully, but I think the brunt of these parts are shared, so it won't combine to half a Meg if you use both. Something similar is true for the stdio file routines.
Why are the C++ streams so large? Well, lots of reasons, apparently. One of which is actually their potential for extensibility. Because they don't work via formatting flags, none of those can be excluded like in iprintf()'s case. Then there are exceptions, which add a good deal of code as well. There also seems to be tons of stuff for character traits, numerical traits, money traits (wtf?!?) and iosbase stuff. These items seem small, say 4 to 40 bytes, but when there are over a thousand it adds up. Then there's all the stuff from STL strings and allocators, type info, more exception stuff, error messages for the exceptions, date/time routines, locale settings and more. I tell you, looking at the mapfile for this is enough to give me a headache. And worst of all, you'll probably use next to none of it, but it's all linked in anyway.
Finally, some STL. This is also said to be somewhat big-boned, and yes it isn't light. Doing anything non-trivial with either a vector or string seems to add about 60k. Fortunately, though, this is mostly the same 60k, so there are no bad effects from using both. Unfortunately, I can't really tell where it's all going. About 10k is spent on several d_*() routines like d_print(), which I assume is debug code. Another 10k is exceptions, type info and error messages and 10 more for. But after that it gets a little murky. In any case, even though adding STL strings and vectors includes more than necessary, 60k is a fair price for what these components give you.
The smallest size for an NDS binary is about 14k. While printf() is larger than iprintf(), it's probably not enough to worry about, so just use printf() until it becomes a pressing matter (and even then you could probably shrink down another part more easily anyway). There is no speed difference between the two. The C++ iostream and fstream components are not worth it. Their added value over printf and FILE routines is small when it comes to basic IO functionality. STL containers do cost a bit, but are probably worth the effort. If you need more than simple text handling or dynamic arrays and lists, I'd say go for it. But that's just my opinion.
Please note, the tests I did for this were fairly rough. Your mileage may vary.
Lastly. The nm tool (or arm-eabi-nm for DKA) is my new best friend for executable analysis. Unlike the linker's mapfile, nm can sort addresses and show symbol sizes, and doesn't include tons of crap used for glue.
libnds has fixed the datatypes of pretty much all registers and has moved to the GBATek nomenclature for the BG-related registers. The list has been updated to match the libnds v1.3.1. of
The state of register names for NDS homebrew is a bit of a mess. First, there are the GBATek names. Since GBATek is considered the source of GBA/NDS information, it would make sense to adhere to those names pretty closely. But, of course, that's not how it actually is in the de facto library for NDS homebrew, libnds.
libnds has two sets of names. This probably is a result of serving
different masters in its early days. One set uses
Mappy's nomenclature.
That's the one without the REG_ in front of it, and
uses things like _CR and _SR. This is the
one you're most likely to see in the current NDS tutorials.
The second set uses GBATek's names (mostly) plus a REG_
prefix. If you've done GBA programming, these should feel quite
familiar.