| TTE | Contents | Lab |
Very broadly speaking, you can divide programming languages into 4 classes. At the lowest level is machine code: raw numbers that the CPU decodes into instructions to execute. One step up is assembly. This is essentially machine code in words: each assembly instruction corresponds to one machine code instruction. Above this are compiled languages like C, which use structured language element to read more like English, but need to be compiled to machine code to be able to run. Finally, there are scripted languages like PHP (and usually VB and Java) which are run through interpreters configured to run the right kinds of machine code for the desired effects.
Every step up the ladder increases the human readability factor and portability, at the cost of runtime speed and program size. In the old days, programmers were Real Programmers and did their work in machine code or assembly because of clock speed and/or memory constraints. For PCs, these days are long gone and most work is done in the higher level languages. This, admittedly, is a good thing: code can be written faster and maintained more easily. However, there are still a few instances where the higher languages are insufficient. The GBA, with its 16.7Mhz CPU and less than 1 MB or work RAM is one of them. Here the inefficiency of the highest languages will cost you dearly, if it'd run at all. This is why most GBA work is done in C/C++, sometimes affectionately nicknamed ‘portable assembly’, because it still has the capability of working with memory directly. But sometimes even that isn't enough. Sometimes you really have to make every cycle count. And for this, you need assembly.
Now, in some circles the word “assembly” can be used to frighten small programmers. Because it is so closely tied to the CPU, you can make it do everything; but that also means you have to do everything. Being close to hardware also means you're bypassing all the safety features that higher languages may have, so that it's much easier to break things. So yeah, it is harder and more dangerous. Although some may prefer the term ‘adventurous’.
To program in assembly, you need to know how a processor
actually works and write in a way it can understand, rather than rely
on a compiler or interpreter to do it for you. There are no structured
for- or while- loops or even if/else branches, just
goto; no structs or classes with inheritance, and even
datatypes are mostly absent. It's anarchy, but the lack of bureaucracy
is exactly what makes fast code possible.
Speed/size issues aside, there are other reasons why learning assembly might be a good idea. Like I said, it forces you to actually understand how the CPU functions, and you can use that knowledge in your C code as well. A good example of this is the ‘best’ datatype for variables. Because the ARM processor is 32bit, it will prefer ints for most things, and other types will be slower, sometimes much slower. And while this is obvious from the description of the processor itself, knowledge of assembly will show you why they are slower.
A third reason, and not an inconsiderable one, is just for general coolness =B). The very fact that it is harder than higher languages should appeal to your inner geek, who relishes such challenges. The simplicity of the statements themselves have an aesthetic quality as well: no messing about with classes, different loop styles, operator precedence, etc – it's one line, one opcode and never more than a handful of parameters.
Anyway, about this chapter. A complete document on assembly is nothing less than a full user's manual for a CPU. This would require an entire book in itself, which is not something I'm aiming at. My intention here is to give you an introduction (but a thorough one) to ARM assembly. I'll explain the most important instructions of the ARM and THUMB instruction sets, what you can and cannot do with them (and a little bit about why). I'll also cover how to use GCC's assembler to actually assemble the code and how to make your assembly and C files work together. Lastly, I'll give an example of a fast memory copier as an illustration of both ARM and Thumb code.
With that information, you should be able to do a lot of stuff, or at least know how to make use of the various reference documents out there. This chapter is not an island, I am assuming you have some or all of the following documents:
If you want more ARM/THUMB guides, you'll have to find them yourself.
Assembly is little more than a glorified macro language for machine code. There is a one-to-one relationship between the assembly instructions and the actual machine code and assembly uses mnemonics for the operations the processor is capable of, which are much easier to remember than the raw binary. The tool that converts the asm code into machine code is the assembler.
Every processor must be able to do basic data processing: arithmetic and bit manipulation. They should also have instructions to access memory, and be able to jump from one place in the code to another of conditionals and loops and such. However, different processors will have different ways of doing these things, and some operations of one set might not be present in another. For example, ARM lacks a division instruction, and can't perform data processing on memory directly. However, the ARM instruction set has some benefits too, like a fair amount of general-purpose registers and a simple instruction set, for the proper definition of “simple”. And it has a very nifty way of dealing with bit-shifts.
In the snippet below you can find a few examples of additions and
memory reads in three different assembly languages: x86 (Intel),
68000, and ARM. The basic format is usually something like
‘operation operand1, operand2, ...’,
though there are always exceptions. Note that where x86 and ARM put
the destination in Op1, 68000 asm puts it in the last. The
terminology of the registers is also different. Some semantics are
pretty universal, the addition ‘x += y is found
in all three, for example, but x86 also has a special instruction for
increments by one, and in ARM the result register can be
different from the two operands. These differences correspond to
how the processors actually work! Higher languages allow you
to use operations that do not seem present in the instruction set,
but in fact they only appear to do so: the compiler/interpreter
will convert it to a form the processor can actually handle.
Another point I must make here is that even for a given processor, there can be differences in how you write assembly. Assemblers aren't difficult to write, and there's nothing to stop you from using a different kind of syntax. Apart from the wrath of other programmers, of course.
// Some examples // Addition and memory loads in different assemblies // === x68 asm ======================================================== add eax, #2 // Add immediate: eax += 2; add eax, ebx // Add register: eax += ebx; add eax, [ebx] // Add from memory: eax += ebx[0]; inc eax // Increment: eax++; mov eax, DWORD PTR [ebx] // Load int from memory: eax= ebx[0]; mov eax, DWORD PTR [ebx+4] // Load next int: eax= ebx[1]; // === 68000 asm ====================================================== ADD #2, D0 // Add immediate: D0 += 2; ADD D1, D0 // Add register: D0 += D1; ADD (A0), D0 // Add from memory: D0 += A0[0]; MOVE.L (A0), D0 // Load int from memory: D0= A0[0]; MOVE.L 4(A0), D0 // Load next int: D0= A0[1]; // === ARM asm ======================================================== add r0, r0, #2 // Add immediate: r0 += 2; add r0, r0, r1 // Add register: r0 += r1; add r0, r1, r2 // Add registers: r0= r1 + r2; ldr r0, [r2] // Load int from memory: r0= r2[0]; ldr r0, [r2, #4] // Load int from memory: r0= r2[1]; ldmia r2, {r0, r1} // Load multiple: r0= r2[0]; r1= r2[1];
In HLLs you have variables to work on, in assembly you can have registers, variables (that is, specific ranges in memory), and the stack. A register is essentially a variable inside the chip itself, and can be accessed quickly. The downside is that there are usually only a few of them, from just one to perhaps a few dozen. Most programs will require a lot more, which is why you can put variables in addressable memory as well. There's a lot more bytes in memory than in registers, but it'll also be slower to use. Note that both registers and memory are essentially global variables, change them in one function and you'll have changed them for the rest of the program. For local variables, you can use the stack.
The stack is a special region of memory used as, well, a stack: a Last-In, First-Out mechanism. There will be a special register called the stack pointer (SP for short) which contains the address of the top of the stack. You can push variables onto the top of the stack for safe keeping, and then pop them off once you're done with them, restoring the registers to their original values. The address of the stack (that is, the top of the stack, the contents of SP) is not fixed: it grows as you move deeper in the code's hierarchy, and shrinks as you move out again. The point is that each block of code should clean up after itself so that the stack pointer is the same before and after it. If not, prepare for a spectacular failure of the rest of the program.
For example, suppose you have functions foo() and which
uses registers A, B, C and D. Function foo() calls
function bar(), which also uses A, B and C, but in a
different context than foo(). To make sure
foo() would still work, bar() pushes A,
B and C onto the stack at its start, then uses them the way it
wants, and then pops them off the stack into A, B and C again when it
ends. In pseudo code:
// Use of stack in pseudo-asm // Function foo foo: // Push A, B, C, D onto the stack, saving their original values push {A, B, C, D} // Use A-D mov A, #1 // A= 1 mov B, #2 // B= 2 mov C, #3 // well, you get the idea call bar mov D, global_var0 // global_var1 = A+B+C+D add A, B add A, C add A, D mov global_var1, A // Pop A-D, restoring then to their original values pop {A-D} return // Function bar bar: // push A-C: stack now holds 1, 2, 3 at the top push {A-C} // A=2; B=5; C= A+B; mov A, #2 mov B, #5 mov C, A add C, B // global_var0= A+B+C (is 2*C) add C, C mov global_var, C // A=2, B=5, C=14 here, which would be bad when we // return to foo. So we restore A-C to original values. // In this case to: A=1, B=2, C=3 pop {A-C} return
While the syntax above is asm-like, it's not actually part of any assembly – at least not as far as I know. It is also particularly bad assembly, because it's inefficient in its use of registers, for one. If you were to write the corresponding C code and compile it (with optimizations, mind you), you get better code. But the point was here stack-use, not efficiency.
What you see here is that foo()
sets A, B and C to 1, 2 and 3, respectively (mov stands
for ‘move’, which usually comes down to assignment), and
then calls bar(), which sets them to something else and
sets a global variable called global_var0 to A+B+C.
Because A, B and C are now different from what they were
foo(), that function would use the wrong values in later
calculations. To counter that, bar() uses the
stack to save and restore A, B and C so that functions that call
bar() still work. Note that foo() also
uses the stack for A, B, C and D, because the function calling
foo() may want to use them as well.
Stacking registers inside the called function is only a
guideline, not a law. You could make the caller save/restore
the variables that it uses. You could even not use the stack at all,
as if you meant A, B and C to change and consider them return values
of the function. By not setting the registers manually in
bar(), A and B would effectively be function arguments.
Or you could use the stack for function arguments. And return values.
Or use both registers and the stack. The point is, you are free to
do deal with them in any way you want. At least, in principle.
In practice, there are guidelines written down by the original
manufacturers, and while not written in stone, it can be considered
bad form not to adhere to them. And you can see just how
bad a form if you intend to make the code interface with compiled
code, which does adhere to them.
The normal operation for a computer is to take instructions one by one and execute them. A special register known as the program counter (PC) indicates the address of the next instruction. When it's time, the processor reads that instruction, does its magic and increments the program counter for the next instruction. This is a relatively straightforward process; things start to get interesting when you can set the program counter to a completely different address, redirecting the flow of the program. Some might say that you can really only speak of a computer if such a thing is possible.
The technical term for this redirection is branching,
though the term ‘jump’ is used as well. With branching
you can create things like loops (infinite loops, mind you)
and implement subroutines. The usual mnemonic for branching is
something like b or j(mp)
// Asm version of the while(1) { ... } endless loop // Label for (possible) branching destination endless: ... // stuff b endless // Branch to endless, for an endless loop.
The full power of branching comes from branching only when certain conditions are met. With that, you can perform if-else blocks and loops that can actually end. The conditions allowed depend on the processor, but the most common ones are:
These condition flags are stores in the Program Status Register (PSR), and each data processing instruction will set these one of more of these flags, depending on the outcome of the operation. Special versions of the branch instruction can use these flags to determine whether to make the jump.
Below you can see a simple example of a basic for-loop. The
cmp instruction compares A to 16 and sets
the PSR flags accordingly. The instruction bne stands
for ‘branch if Not Equal’, which corresponds to a clear
Z-flag. the reason for the Zero-flag's involvement is that the
equality of two numbers is indicated by whether the difference
between them is zero or not. So if there's a difference between
A and 16, we jump back to for_start; if
not, then we continue with the rest of the code.
// Asm version of for(A=0; A != 16; A++) mov A, #0 // Start of for-loop. for_start: ... // stuff add A, #1 cmp A, #16 // Compare A to 16 bne for_start // Branch to beginning of loop if A isn't 16
The number and names of the conditional codes depends on the platform. The ARM has 16 of these, but I'll cover these later.
Before getting into ARM assembly itself, I'd like to show you a real-life example it. Assembly is an intermediary step of the build process, and you can capture GCC's assembly output by using the ‘-S’ or ‘-save-temps’ flags. This gives you the opportunity to see what the compiler is actually doing, to compare the C and assembly versions of a given algorithm, and provides quick pointers on how to code non-trivial things in assembly, like function calling, structures, loops etc. This section is optional, and you may not understand all the things here, but it is very educational nonetheless.
# Makefile settings for producing asm output
$(CC) $(RCFLAGS) -S $<
// gen_asm.c : // plotting two horizontal lines using normal and inline functions. #include <tonc.h> void PlotPixel3(int x, int y, u16 clr) { vid_mem[y*240+x]= clr; } int main() { int ii; // --- using function --- ASM_CMT("using function"); for(ii=0; ii<240; ii++) PlotPixel3(ii, 16, CLR_LIME); // --- using inline --- ASM_CMT("using inline"); for(ii=0; ii<240; ii++) m3_plot(ii, 12, CLR_RED); while(1); return 0; }
@@ gen_asm.s : @@ Generated ASM (-O2 -mthumb -mthumb-interwork -S) @@ Applied a little extra formatting and comments for easier reading. @@ Standard comments use by @; my comments use @@ @@ Oh, and DON'T PANIC! :) .code 16 .file "gen_asm.c" @@ - Source filename (not required) .text @@ - Code section (text -> ROM) @@ <function block> .align 2 @@ - 2^n alignment (n=2) .global PlotPixel3 @@ - Symbol name for function .code 16 @@ - 16bit THUMB code (BOTH are required!) .thumb_func @@ / .type PlotPixel3, %function @@ - symbol type (not req) @@ Declaration : void PlotPixel3(int x, int y, u16 clr) @@ Uses r0-r3 for params 0-3, and stack for param 4 and over @@ r0: x @@ r1: y @@ r2: clr PlotPixel3: lsl r3, r1, #4 @@ \ sub r3, r3, r1 @@ - (y*16-1)*16 = y*240 lsl r3, r3, #4 @@ / add r3, r3, r0 @@ - (y*240+x) mov r1, #192 @@ - 192<<19 = 0600:0000 lsl r1, r1, #19 @@ / lsl r3, r3, #1 @@ - *2 for halfword, not byte, offset add r3, r3, r1 @ lr needed for prologue strh r2, [r3] @@ store halfword at vid_mem[y*240+x] @ sp needed for prologue bx lr .size PlotPixel3, .-PlotPixel3 @@ - symbol size (not req) @@ </ function block> .align 2 .global main .code 16 .thumb_func .type main, %function main: push {r4, lr} @@ Save regs r4, lr @ --- using function --- @@ Comment from ASM_CMT, indicating .code 16 @@ the PlotPixel3() loop mov r4, #0 @@ r4: ii=0 .L4: mov r2, #248 @@ - r2: clr= 248*4= 0x03E0= CLR_LIME lsl r2, r2, #2 @@ / mov r0, r4 @@ r0: x= ii mov r1, #16 @@ r1: y= 16 add r4, r4, #1 @@ ii++ bl PlotPixel3 @@ Call PlotPixel3 (params in r0,r1,r2) cmp r4, #240 @@ - loop while(ii<240) bne .L4 @@ / @ --- using inline --- @@ Comment from ASM_CMT, indicating .code 16 @@ the m3_plot() loop ldr r3, .L14 @@ r3: starting/current address (vid_mem[12*240]) ldr r2, .L14+4 @@ r2: terminating address (vid_mem[13*240]) mov r1, #31 @@ r1: clr (CLR_RED) .L6: strh r1, [r3] @@ - *r3++ = clr add r3, r3, #2 @@ / cmp r3, r2 @@ - loop while(r3<r2) bne .L6 @@ / .L12: b .L12 .L15: .align 2 .L14: .word 100669056 @@ 0600:1680 =&vid_mem[12*240] .word 100669536 @@ 0600:1886 =&vid_mem[13*240] .size main, .-main .ident "GCC: (GNU) 4.1.0 (devkitARM release 18)"
After the initial shock of seeing a non-trivial assembly file for the first time, you may be able to notice a few things, even without any prior knowledge of assembly.
The real instructions are usually composed of a mnemonic
(add, ldr, b,
mov) followed by register identifiers, numbers or
labels. With a little thought, you should be able to piece together
what each of these might do. For example, add performs
an addition, ldr read something from memory,
b branches, i.e. jumps to another memory address and
mov does an assignment.
.global’ directive to make it
globally visible. And a label to mark the start of the function of
course. Note that for thumb functions, require a
‘.thumb_func’ directive as well as either
‘.code 16’ or
‘.thumb’. GCC also inserts the symbol-type
and size info, but these are not required.
Calling and returning from functions uses the bl and
bx instructions. What isn't very clear from this code,
except in my added comments, is that the arguments of the functions
are put in registers r0-r3. What you definitely don't see is that
if there are more than 4 parameters, these are put on the stack,
and that the return value is put in r0.
You also don't see that r0-r3 (and r12) are expected to be trashed in each function, so that the functions calling them should save their values if they want to use them after the call. The other registers (r4-r15) should be pushed into the stack by the called function. The standard procedure for function calling can be found in the AAPCS. Failure to adhere to this standard will break your code if you try to combine it with C or other asm.
PlotPixel3() loop versus the m3_plot()
loop, which you can find in the assembly because I've used a macro
that can write asm comments in C. The m3_plot() loop
contains 4 instructions. The PlotPixel3() loop
takes 8, plus an additional 10 from the function itself. So that's
4 instructions against 18 instructions. The C code seems
pretty much the same, so what gives?
Welcome to the wonderful world of function call overhead.
In principle, you only need the instructions of the shorter loop:
a store, an add for the next destination, a compare and a
loop-branch. Because m3_plot() is inlined, the
compiler can see that this is all that's required and optimize the
loop accordingly.
In contrast, because PlotPixel() is a full function,
the caller does not know what its internal code is, hence
no optimizations are possible. The loop should reset the registers
on every iteration because PlotPixel() will
clobber them, making the loop in main() unnecessarily
long. Furthermore, PlotPixel3() doesn't know under
what conditions it will be called, so there are no optimizations
there either. That means piecing together the destination in
every iteration, rather than just incrementing it like the
inline version does. All in all, you get a line plotter that's
nearly 4 times as slow purely because you've used a function
for a single line of code instead of inlining it via a macro
or inline function. While anyone could have told you something
like that would happen, actually looking at the differences leaves
a stronger impression.
There is a lot more that could be learned from this code, but I'll leave it at this for now. The main aim was to show you what assembly (in this case THUMB asm) looks like. In this small piece of code, you can already see many of the elements that go into a full program. Even though the lack of variable identifiers is a bit of a pain, it should be possible to follow along with the code, just as you would with a C program. See, it's not all that bad now, is it?
Looking at other people's code (in this case GCC's assembly) is a nice way of learning how to make things work, it is not a substitute for the manual. It may show you how to get something done, there is always the danger of getting them done wrongly or inefficiently. Programming is hardly ever trivial and you are likely to miss important details: the compiler may not be not optimising correctly, you could misinterpret the data, etc. This kind of learning often leads to cargo-cult programming, which often does more harm than good. If you want examples of these problems, look at nearly all other GBA tutorials and a lot of the available GBA demo code out there.
The assembler of the GNU toolchains is known as the GNU assembler or GAS, and the tool's name is arm-eabi-as. You can call this directly, or you can use the familiar arm-eabi-gcc to act as a gateway. The latter is probably a better choice, as it'll allow the use of the C preprocessor with ‘-x assembler-with-cpp’. That's right, you can then use macros, C-style comments and #include if you wish. A rule for assembling things might look something like this.
AS := arm-eabi-gcc
ASFLAGS := -x assembler-with-cpp
# Rule for assembling .s -> .o files
$(SOBJ) : %.o : %.s
$(AS) $(ASFLAGS) -c $< -o $@
This rule should work on the generated output of gcc -S. Note that it will probably not assemble under other assemblers (ARM SDT, Goldroad) because they have different standards for directives and comments and the like. I'll cover some important directives of GAS later, after we've seen what ARM assembly itself is like.
The ARM core is a RISC (Reduced Instruction Set Computer) processor. Whereas CISC (Complex Instruction Set Computer) chips have a rich instruction set capable of doing complex things with a single instruction, RISC architectures try to go for more generalized instructions and efficiency. They have a comparatively large number of general-purpose registers and data instructions usually use three registers: one destination and two operands. The length of each instruction is the same, easing the decoding process, and RISC processors strive for 1-cycle instructions.
There are actually two instruction sets that the ARM core can use: ARM code with 32bit instructions, and a subset of this called THUMB, which has 16bit long instructions. Naturally, the ARM set is more powerful, but because the most used instructions can be found in both, an algorithm coded in THUMB uses less memory and may actually be faster if the memory buses are 16bit; which is true for GBA ROM and EWRAM and the reason why most of the code is compiled to THUMB. The focus in this section will be the ARM set, to learn THUMB is basically a matter of knowing which things you cannot do anymore.
The GBA processor's full name is ARM7TDMI, meaning it's an ARM 7 code (aka ARM v4), which can read THUMB code, has a Debug mode and a fast Multiplier. This chapter has this processor in mind, but most of it should be applicable to other chips in the ARM family as well.
ARM processors have 16 32bit registers named r0-r15, of which the last three are usually reserved for special purposes: r13 is used as the stack pointer (SP); r14 is the link register (LR), indicating where to return to from a function, and r15 is the program counter (PC).. The rest are free, but there are a few conventions. The first four, r0-r3, are argument and/or scratch registers; function parameters go here (or onto the stack), and these registers are expected to be clobbered by the called function. r12 falls into this category too. The rest, r4-r11, are also known as variable registers.
| std | gcc | arm | description |
|---|---|---|---|
| r0-r3 | r0-r3 | a1-a4 | argument / scratch |
| r4-r7 | r4-r7 | v1-v4 | variable |
| r8 r9 |
r8 r9 | v5 | variable |
| v6/SB | platform specific | ||
| r10 r11 |
sl | v7 | variable |
| fp | v8 | variable / frame pointer | |
| r12 | ip | IP | Intra-Procedure-call scratch |
| r13 | sp | SP | Stack Pointer |
| r14 | lr | LR | Link Register |
| r15 | pc | PC | Program Counter |
Nearly all of the possible instructions fall into the following three classes: data operations, such as arithmetic and bit ops; memory operations, load and store in many guises and branches for jumping around code for loops, ifs and function calls. The speed of instructions almost follows this scheme as well. Data instructions usually happen in a cycle; memory ops uses two or three and branches uses 3 or 4. The whole timing thing is actually a lot more complicated than this, but it's a useful rule of thumb.
On most processors, you can only use branches conditionally, but
on ARM systems you can attach the conditionals to all
instructions. This can be very handy for small if/else blocks, or
compound conditions, which would otherwise require the use of the
more time-consuming branches. The code below
contains asm versions of the familiar max(a, b)
macro. The first one is the traditional version, which requires two
labels, two jumps (although only one of those is executed) and the
two instructions that actually do the work.
The second version just uses two mov's, but
only one of them will actually be executed thanks to the conditionals.
As a result, it is shorter, faster and more readable.
These kinds of conditionals shouldn't be used blindly, though. Even though you won't execute the instruction is the conditional fails, you still need to read it from memory, which costs one cycle. As a rough guideline, after about 3 skipped instructions, the branch would actually be faster.
@ // r2= max(r0, r1): @ r2= r0>=r1 ? r0 : r1; @ Traditional code cmp r0, r1 blt .Lbmax @ r1>r0: jump to r1=higher code mov r2, r0 @ r0 is higher b .Lrest @ skip r1=higher code .Lbmax: mov r2, r1 @ r1 is higher .Lrest: ... @ rest of code @ With conditionals; much cleaner cmp r0, r1 movge r2, r0 @ r0 is higher movlt r2, r1 @ r1 is higher ... @ rest of code
Another optional item is whether or not the status
flags are set. Test instructions like cmp always
set them, but most of the other require an
‘-s’ affix. For example,
sub would not set the flags, but
subs would. Because this kinda clashes with the
plural ‘s’, I'm using adding an apostrophe for the
plural form, so subs means sub with
status flags, but sub's means multiple
sub instructions.
Each instruction of the ARM set can be run conditionally, allowing shorter, cleaner and faster code.
A barrel shifter is a circuit dedicated to performing bit-shifts. Well, shifts and rotations, but I'll use the word ‘shift’ for both here. The barrel shifter is part of the ARM core itself and comes before any arithmetic so that you it can handle shifted numbers very fast. The real value of the barrel shifter comes from the fact that almost all instructions can apply a shift to one of their operands at no extra cost.
There are four barrel-shift operations:
left shift (lsl),
logical right-shift (lsr),
arithmetic right-shift (asr) and
rotate-right (ror). The difference between arithmetic
and logical shift right is one of signed/unsigned numbers; see the
bit ops section for
details.
These operations are attached to the last register in an operation,
followed by an immediate value or a register. For example, instead of
simply Rm you can have
‘Rm, lsl #2’ means Rm<<2
and ‘Rm, lsr Rs’ for Rm>>Rs.
Because shifted registers can apply to almost all instructions and I
don't want to write it in full all the time, I will designate the
shifted register as Op2.
Now this may seem like esoteric functionality, but it's actually
very useful and more common than you think. One application is
multiplications by 2n±1, without resorting to
relatively slow multiplication instructions. For example,
x*9 is the same as x*(1+8) =
x + x*8 = x+(x<<3). This
can be done in a single add. Another use is in loading
values from arrays, for which indices would have to be multiplied by
the size of the elements to get the right addresses.
@ Multiplication by shifted add/sub add r0, r1, r1, lsl #3 @ r0= r1+(r1<<3) = r1*9 rsb r0, r1, r1, lsl #4 @ r0= (r1<<2)-r1 = r1*15 @ word-array lookup: r1= address (see next section) ldr r0, [r1, r2, lsl #2] @ u32 *r1; r0= r1[r2]
Other uses are certainly possible as well. Like the conditional, you
might not really need use of shifted add's and such, but
they allow for some wonderfully optimized code in the hands of
the clever programmer. These are things that make assembly fun.
Or at least very nearly so. Shifts-by-value can be performed at no extra cost, and shifts-by-register cost only one cycle. Actually, bit-rotates behave like this as well, but they're rather rare and since I don't know of the correct term that encompasses both, I'll use the word “shift” for both.
And now for one of the points that makes ARM assembly less fun. As I said, each instruction is 32bits long. Now, there are 16 condition codes, which take up 4 bits of the instruction. Then there's 2x four for the destination and first operand registers, one for the set-status flag, and then an assorted number of bits for other matters like the actual opcodes. The bottom line is that you only have 12 bits left for any immediate value you might like to use.
That's right 12. A whole 12 bits. You may have already figured
out that, since this will only allow for 4096 distinct values, this
presents a bit of a problem. And you'd be right.
This is one of the major points of bad news for RISC processors:
after assigning bits to instruction-type, registers and other fields,
there's very little room for actual numbers left. So how is one
to load a number like 0601:0000 (object VRAM) then?
Well … you can't! At least, not in one go.
So, there is only a limited amount of numbers that can be used directly; the rest must be pieced together from multiple smaller numbers. Instead of just taking the 12 bits for a single integer, what the designers have done is split it into an 8bit number (n) and a 4bit rotation field (r). The barrel shifter will take care of the rest. The full immediate value v is given by:
| (23.1) | v = n ror 2*r. |
This means that you can create values like 255 (n=255, r=0) and 0x06000000 (n=6, r=4 (remember, rotate-right)). However, 511 and 0x06010000 are still invalid because the bit-patterns can't fit into one byte. For these invalid numbers you have two options: construct them in multiple instructions, or load them from memory. Both of these can become quite expensive so if it is possible to avoid them, do so.
The faster method of forming bigger numbers is a matter of debate.
There are many factors involved: the number in mind, memory section,
instruction set and amount of space left, all interacting in nasty
ways. It's probably best not to worry about it too much, but as
a guideline, I'd say if you can do it in two data instructions do
so; if not, use a load. The easiest way of creating big numbers
is with a special form of the ldr instruction:
‘ldr Rd,=num’ (note: no ‘#’!).
The assembler will turn this into a mov if the number allows it, or
an ldr if it doesn't. The space that the number needs
will be created automatically as well.
@ form 511(0x101) with mov's
mov r0, #256 @ 256= 1 ror 24, so still valid
add r0, #255 @ 256+255 = 511
@ Load 511 from memory with special ldr
@ NOTE: no '#' !
ldr r0,=511
That there is only room for an 8bit number + 4bit rotate for immediate operands is something you'll just have to learn to live with. If the assembler occasionally complains about invalid constants, you now know what it means and how you can correct for it. Oh, and if you thought this was bad, think of how it would work for THUMB code, which only has 16 bits to work with.
When instructions allow immediate values, the only permissible values
are those that can be reduced to a byte rotated by an even number.
0xFF and 0x100 are allowed, but 0x101 is not. This has consequences
for data operations, but also for memory addressing since it will
not be possible to load a full 32bit address in one go. You can either
construct the larger value out of smaller parts, or use a
load-assignment construct:
‘ldr Rd,=num’ which the assembler
will convert into a mov if possible, or a PC-relative load
if not.
Is this worth a separate note? Maybe not, but the previous note is important enough to remember. It is not exactly intuitive that code should behave that way and if you found yourself staring at the enigmatic invalid constant error message, you'd probably be lost without this bit of info.
The data operations carry out the calculations of a program, which includes both arithmetic and logical operations. You can find a summary of the data instructions in table 23.2. While this lists them in four groups, the only real division is between the multiplies and the rest. As you can see, there is no division instruction. While this can be considered highly annoying, as it turns out the need for division is actually quite small – small enough to cut it out of the instruction set, anyway.
Unlike some processors, ARM can only perform data processing on
registers, not on memory variables directly. Most data instructions
use one destination register and two operands. The first operand is
always a register, the second can be four things:
an immediate value or register ( #n / Rm) or a
register shifted by an immediate value or register
(‘Rm, lsl #n’,
‘Rm, lsl Rs’, and similar for
lsr, asr and ror). Because
this arrangement is quite common, it is often referred to as simply
Op2, even if it's not actually a second operand.
Like all instructions, data instructions can be executed
conditionally by adding the appropriate affix. They can also alter
the status flags by appending the -s prefix. When using
both, the conditional affix always comes first.
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The first group, arithmetic, only contains variants of addition and
subtraction. add and sub are their base forms.
rsb is a special thing that reverses the operand order;
the difference with the regular sub is that Op2
is now the minuend (the thing subtracted from). Only Op2
is allowed to have immediate values and shifted registers, which allows
you to negate values (0−x) and fast-multiply by
2n−1.
The variants ending in ‘c’ are additions and
subtractions with carry, which allows for arithmetic for values larger
than the register size. For example, consider you have 8bit registers
and want to add 0x00FF and 0x0104. Because the latter doesn't fit into
one register, you have to split it and then add twice, starting with
with the least significant byte. This gives 0xFF+0x04=0x103,
represented by 0x02 in the destination register and a set carry flag.
For the second part you have to add 0x00 and 0x01 from the operands,
and the carry from the lower byte, giving 0x00+0x01+1 = 0x02.
Now string the separate parts together to give 0x0203.
Because ARM registers are 32bit wide you probably won't be using the those instructions much, but you never know.
The second group are the bit operations, most of which you should be
familiar with already. The all have exact matches in C operators,
with the exception of bit-clear. However, the value of such an
instruction should be obvious. You will notice a distinct absence of
shift instructions here, for the simple reason that they're not
really necessary: thanks to the barrel shifter, the
mov instruction can be used for shifts and rotates.
‘r1 = r0<<4’ could be written as
‘mov r1, r0, lsl #4’.
I have mentioned this a couple of times now, but as we're dealing with another language now it bares repeating: there is a difference between right-shifting a signed and unsigned numbers. Right-shifts remove bits from the top; unsigned numbers should be zero-extended (filled with 0), but signed numbers should be sign-extended (filled with the original MSB). This is the difference between a logical shift right and an arithmetic shift right. This doesn't apply to left-shifts, because that fills zeroes either way.
The third group isn't much of a group, really. The status flag
operations set the status bits according to the results of their
functionality. Now, you can do this with the regular instructions
as well; for example, a compare (cmp) is basically
a subtraction that also sets the status flags, i.e., a
subs. The only real difference is that this time there's
no register to hold the result of the operation.
Lastly, the multiplication formats. At the table indicates, you cannot use immediate values; if you want to multiply with a constant you must load it into a register first. Second, no Op2 means no shifted registers. There is also a third rule that Rd and Rm can't use the same register, because of how the multiplication algorithm is implemented. That said, there don't seem to be any adverse effects using Rd=Rm.
The instruction mla stands for ‘multiply with
accumulate’, which can be handy for dot-products and the like.
The mull and mlal instructions are for
64bit arithmetic, useful for when you expect the result not to fit
into 32bit registers.
@ Possible variations of data instructions add r0, r1, #1 @ r0 = r2 + 1 add r0, r1, r2 @ r0 = r1 + r2 add r0, r1, r2, lsl #4 @ r0 = r1 + r2<<4 add r0, r1, r2, lsl r3 @ r0 = r1 + r2<<r3 @ op= variants add r0, r0, #2 @ r0 += 2; add r0, #2 @ r0 += 2; alternative (but not on all assemblers) @ Multiplication via shifted add/sub add r0, r1, r1, lsl #4 @ r0 = r1 + 16*r1 = 17*r1 rsb r0, r1, r1, lsl #4 @ r0 = 16*r1 - r1 = 15*r1 rsb r0, r1, #0 @ r0 = 0 - r1 = -r1 @ Difference between asr and lsr mvn r1, #0 @ r1 = ~0 = 0xFFFFFFFF = -1 mov r0, r1, asr #16 @ r0 = -1>>16 = -1 mov r0, r1, lsr #16 @ r0 = 0xFFFFFFFF>>16 = 0xFFFF = 65535 @ Signed division using shifts. r1= r0/16 @ if(r0<0) @ r0 += 0x0F; @ r1= r0>>4; mov r1, r0, asr #31 @ r0= (r0>=0 ? 0 : -1); add r0, r0, r1, lsr #28 @ += 0 or += (0xFFFFFFFF>>28 = 0xF) mov r1, r0, asr #4 @ r1 = r0>>4;
Because ARM processors can only perform data processing on registers, interactions with memory only come in two flavors: loading values from memory into registers and storing values into memory from registers.
The basic instructions for that are ldr
(LoaD Register) and str (STore Register), which
load and store words. Again, the most general form uses two
registers and an Op2:
| op{cond}{type} Rd, [Rn, Op2] |
Here op is either ldr or str.
Because they're so similar in appearance, I will just use
ldr for the remainder of the discussion on syntax,
except when things are different. The condition flag again goes
directly behind the base opcode. The type refers to the
datatype to load (or store), which can be words, halfwords or
bytes. The word forms do not use any extension, halfwords use
-h or -sh, and bytes use
-b and -sb. The extra s
is to indicate a signed byte or halfword. Because the registers
are 32bit, the top bits need to be sign-extended or zero-extended,
depending on the desired datatype.
The first register here, Rd can be either the
destination or source register. The thing between brackets always
denotes the memory address; ldr means load
from memory, in which case Rd is the
destination, and str means store to memory,
so Rd would be the source there. Rn
is known as the base register, for reasons that we will
go into later, and Op2 often serves as an offset. The
combination works very much like array indexing and pointer
arithmetic.
To make the comparison to C a little easier, I will sometimes indicate
what happens using pointers, but in order to do that I will have to
indicate the type of the pointer somehow. I could use some horrid
casting notation, but it would be easiest to use a form of arrays for
this, and use the register-name + an affix to show the data type. I'll
use ‘_w’ for words, ‘_h’ for halfwords, and
‘_b’ for bytes, and ‘_sw’, etc. for their
signed versions. For example, r0_sh would indicate that
r0 is a signed halfword pointer. This is just a useful
bit of shorthand, not actually part of assembly itself.
@ Basic load/store examples. Assume r1 contains a word-aligned address ldr r0, [r1] @ r0= *(u32*)r1; //or r0= r1_w[0]; str r0, [r1] @ *(u32*)r1= r0; //or r1_w[1]= r0;
There are several ways of interacting, known as addressing modes. The simplest form is direct addressing, where you indicate the address directly via an immediate value. However, that mode is unavailable to ARM systems because full addresses don't fit in the instruction. What we do have is several indirect addressing forms.
The first available form is register indirect addressing,
which gets the address from a register, like
‘ldr Rd, [Rn]’.
An extension of this is pre-indexed addressing,
which adds an offset to the base register before the load. The
base form of this is
‘ldr Rd, [Rn, Op2]’. This is very
much like array accesses. For example
‘ldr r1, [r0, r2, lsl #2]’
corresponds to r0_w[r2]: an word-array load using
r2 as the index.
Another special form of this is PC-relative addressing, which makes up for not having direct addressing. Suppose you have a variable in memory somewhere. While you may not be able to use that variable's address directly, what you can do is store the address close to where you are in the code. That address is at a certain allowed offset from the program counter register (PC), so you could load the variable's address from there and then read the variable's contents. You can also use this to load constants that are too large to fit into a shifted byte.
While it is possible to calculate the required offset manually,
you'll be glad to know you can let the assembler do this for you.
There are two ways of doing this. The first is to create a
data-pool where you intend to put the addresses and
constants, and label it. You can then get its address via
‘ldr Rd, LabelName’ (note the
absence of brackets here). The assembler will turn this into
pc-relative loads. The second method is to let the assembler do
all the work by using
‘ldr Rd,=foo’, where foo
is the variable name or an immediate value. The assembler will
then allocate space for foo itself. Please remember that
using =varname does not load the variable itself,
only its address.
And then there are the so-called write-back modes.
In the pre-index mode, the final address was made up of
Rn+Op2, but that had no effect on
Rn. With write-back, the final address is put
in Rn. This can be useful for walking through
arrays because you won't need an actual index.
There are two forms of write-back, pre-indexing and post-indexing.
Pre-indexing write-back works much like the normal write-back and
is indicated by an exclamation mark after the brackets:
‘ldr Rd, [Rn, Op2]!’. Post-indexing
doesn't add Op2 to the address (and Rn) until
after the memory access; its format is
‘ldr Rd, [Rn], Op2’.
@ Examples of addressing modes @ NOTE: *(u32*)(address+ofs) is the same as ((u32*)address)[ofs/4] @ That's just how array/pointer offsets work mov r1, #4 mov r2, #1 adr r0, fooData @ u32 *src= fooData; @ PC-relative and indirect addressing ldr r3, fooData @ r3= fooData[0]; // PC-relative ldr r3, [r0] @ r3= src[0]; // Indirect addressing ldr r3, fooData+4 @ r3= fooData[1]; // PC-relative ldr r3, [r0, r1] @ r3= src[1]; // Pre-indexing ldr r3, [r0, r2, lsl #2] @ r3= src[1] // Pre-index, via r2 @ Pre- and post-indexing write-back ldr r3, [r0, #4]! @ src++; r3= *src; ldr r3, [r0], #4 @ r3= *src; src++; @ u32 fooData[3]= { 0xF000, 0xF001, 0xF002 }; fooData: .word 0x0000F000 .word 0x0000F001 .word 0x0000F002
PC-relative instructions are common, and have a special
shorthand that is easier and shorter to use than creating a pool of
data to load from. The format for this is
‘ldr Rd,=foo’, where foo is
a label or an immediate value. In both cases, a pool is created to
hold these numbers automatically. Not that the value for the label
is its address, not the address' contents.
If the label is near enough you can also use adr, which
is assembled to a PC-add instruction. This will not create a
pool-entry.
@ Normal pc-relative method: @ create a nearby pool and load from it ldr r0, .Lpool @ Load a value ldr r0, .Lpool+4 @ Load far_var's address ldr r0, [r0] @ Load far_var's contents .Lpool: .word 0x06010000 .word far_var
@ Shorthand: use ldr= and GCC will manage the pool for you ldr r0,=0x06010000 @ Load a value ldr r0,=far_var @ Load far_var's address ldr r0, [r0] @ Load far_var's contents
Note that I'm not actually creating far_var here; just
stoarge room for its address. Creation of variables is covered later.
It is also possible to load/store bytes and halfwords. The opcodes
for loads are ldrb and ldrh for unsigned,
and ldrsb and ldrsh for signed bytes and
halfwords, respectively. The ‘r’ in the signed versions is
actually optional, so you'll also see ldsb and
ldsh now and then. As stores can cast away the more
significant bytes anyway, strb and strh
will work for both signed and unsigned stores..
All the things you can do with ldr/str, you can do with
the byte and halfword versions as well: PC-relative, indirect,
pre/post-indexing it's all there … with one exception. The
signed-byte load (ldsb) and all of the halfword
loads and stores cannot do shifted register-loads. Only
ldrb has the complete functionality of the word
instructions. The consequence is that signed-byte or halfword arrays
may require extra instructions to keep the offset and index in check.
Oh, one more thing: alignment. In C, you could rely on the compiler to align variables to their preferred boundaries. Now that you're taking over from the compiler, it stands to reason that you're also in charge of alignment. This can be done with the ‘.align n’ directive, with aligns the next piece of code or data to a 2n boundary. Actually, you're supposed to properly align code as well, something I'm taking for granted in these snippets because it makes things easier.
mov r2, #1
@ Byte loads
adr r0, bytes
ldrb r3, bytes @ r3= bytes[0]; // r3= 0x000000FF= 255
ldrsb r3, bytes @ r3= (s8)bytes[0]; // r3= 0xFFFFFFFF= -1
ldrb r3, [r0], r2 @ r3= *r0_b++; // r3= 255, r0++;
@ Halfword loads
adr r0, hwords
ldrh r3, hwords+2 @ r3= words[1]; // r3= 0x0000FFFF= 65535
ldrsh r3, [r0, #2] @ r3= (s16)r0_h[1]; // r3= 0xFFFFFFFF= -1
ldrh r3, [r0, r2, lsl #1] @ r3= r0_h[1]? No! Illegal instruction :(
@ Byte array: u8 bytes[3]= { 0xFF, 1, 2 };
bytes:
.byte 0xFF, 1, 2
@ Halfword array u16 hwords[3]= { 0xF001, 0xFFFF, 0xF112 };
.align 1 @ align to even bytes REQUIRED!!!
hwords:
.hword 0xF110, 0xFFFF, 0xF112
Block transfers allow you to load or store multiple successive words
into registers in one instruction. This is useful because it saves on
instructions, but more importantly, it saves time because individual
memory instructions are quite costly and with block transfers you
only have to pay the overhead once. The basic instructions for block
transfers are ldm (LoaD Multiple) and
stm (STore Multiple), and the operands are a base
register (with an optional exclamation mark for Rd
write-back) and a list of registers between braces.
| op{cond}{mode} Rd{!}, {Rlist} |
This register list can be comma separated, or hyphenated to indicate a
range. For example, {r4-r7, lr} means registers r4, r5,
r6, r7 and r14. The order in which the registers are actually loaded
or stored are not based on the order in which they are
specified in this list! Rather, the list indicates the number of words
used (in this case 5), and the order of addresses follows the index of
the registers: the lowest register uses the lowest address in the
block, etc.
The block-transfer opcodes can take a number of affixes that determine
how the block extends from the base register Rd. The four
possibilities are:
-IA/-IB (Increment After/Before) and
-DA/-DB (Decrement After/Before). The
differences are essentially those between pre/post-indexing and
incrementing or decrementing from the base address. It should be noted
that these increments/decrements happen regardless of whether the base
register carries an exclamation mark or not: that thing only indicates
that the base register itself is updated afterwards.
adr r0, words+16 @ u32 *src= &words[4];
@ r4, r5, r6, r7
ldmia r0, {r4-r7} @ *src++ : 0, 1, 2, 3
ldmib r0, {r4-r7} @ *++src : 1, 2, 3, 4
ldmda r0, {r4-r7} @ *src-- : 0, -1, -2, -3
ldmdb r0, {r4-r7} @ *--src : -1, -2, -3, -4
.align 2
words:
.word -4, -3, -2, -1
.word 0, 1, 2, 3, 4
The block transfers are also used for stack-work. There are four types
of stacks, depending on whether the address that sp
points to already has a stacked value or not (Full or Empty), and
whether the stack grows down or up in memory (Descending/Ascending).
These have special affixes (-FD, -FA,
-ED and -EA) because using the standard
affixes would be awkward. For example, the GBA uses an FD-type stack,
which means that pushing is done with stmdb because
decrementing after the store would overwrite an already stacked value
(full stack), but popping requires ldmia for similar
reasons. A stmfd/ldmfd pair is much easier to deal with.
Or you could just use push and pop,
which expand to ‘stmfd sp!,’ and
‘ldmfd sp!,’, respectively.
| Block op | Standard | Stack alt |
|---|---|---|
| Increment After | ldmia / stmia | ldmfd / stmea |
| Increment Before | ldmib / stmib | ldmed / stmfa |
| Decrement After | ldmda / stmda | ldmfa / stmed |
| Decrement Before | ldmdb / stmdb | ldmea / stmfd |
They seem to work for devkitARM r15 and up (haven't checked older versions), but DevKitAdv for example doesn't accept them. Just try and see what happens.
Higher languages typically have numerous methods for implementing choices, loops and function calls. The all come down to the same thing though: the ability to move the program counter and thereby diverting the flow of the program. This procedure is known as branching.
There are three branching instructions in ARM: the simple
branch b for ifs and loops, the branch with
link bl used for function calls, and branch with
exchange bx used for switching between ARM and THUMB
code, returning from functions and out-of-section jumps.
b and bl use a label as their argument,
but bx uses a register with the address to branch
to. Technically there are more ways of branching (PC is just another
register, after all) but these three are the main ones.
I've already mentioned part of this in the introduction, so I'll make this brief. The ARM processor has 4 status flags, (Z)ero, (N)egative, (C)arry and signed o(O)erflow, which can be found in the program status register. There are actually two of these: one for the current status (CPSR) and a saved statur register (SPSR), which is used in interrupt handlers. You won't have to deal with either of these, though, as reacting to status registers usually goes through the conditional codes (table 23.4). But first, a few words about the flags themselves:
Each of the data instructions can set the status flags by appending
-s to the instruction, except for cmp,
cmn, tst and teq, which always
set the flags.
Table 23.4 lists 16 affixes that can be
added to the basic branch instruction. For example,
bne Label would jump to Label if the status
is non-zero, and continue with the next instruction if it isn't.
| Affix | Flags | Description |
|---|---|---|
| eq | Z=1 | Zero (EQual to 0) |
| ne | Z=0 | Not zero (Not Equal to 0) |
| cs / hs | C=1 | Carry Set / unsigned Higher or Same |
| cc / lo | C=0 | Carry Clear / unsigned LOwer |
| mi | N=1 | Negative (MInus) |
| pl | N=0 | Positive or zero (PLus) |
| vs | V=1 | Sign overflow (oVerflow Set) |
| vc | V=0 | No sign overflow (oVerflow Clear) |
| hi | C=1 & Z=0 | Unsigned HIgher |
| ls | C=0 | Z=1 | Unsigned Lower or Same |
| ge | N=V | Signed Greater or Equal |
| lt | N != V | Signed Less Than |
| gt | Z=0 & N=V | Signed Greater Than |
| le | Z=1 | N != V | Signed Less or Equal |
| al | - | ALways (default) |
| nv | - | NeVer |
To use these condition codes properly, you need to know what each stands for, but also how the data operations set the flags. The effect on the status flags depends on the instruction itself, and not all flags are affected by all instructions. For example, overflow only has meaning for arithmetic, not bit operations.
In the case of Z and N, the case is pretty easy. The operation
gives a certain 32bit value as its result; if it's 0, then the
Zero flag is set. Because of two's complement, the Negative flag
is the same as bit 31. The reason -eq and
ne are linked to the zero flags is because a
comparison (cmp) is basically a subtraction: it
looks at the difference between the two numbers and when that's
zero, then the numbers are equal.
For the carry bit it can get a little harder. The best way to see it
is as an extra most significant bit. You can see how this work in
the example of table 23.5. Here we add
two unsigned numbers, 231 = 0x80000000. When adding them,
the result would overflow 32bits, giving 0 and not 232.
However, that overflowed bit will go into the carry. With the
adc instruction you could then go on to build adders
for numbers larger thatn the registers.
|
Bit-operations like and don't affect
it because it operates purely on the lower 32bits. Shifts, however
do.
You may find it odd that -cc is the code for
unsigned higher than. As mentioned, a comparison is essentially
a subtraction, but when you subtract, say 7−1, there
doesn't really seem to be a carry here. The key here is that
subtractions are infact forms of additions: 7−1 is actually
7+0xFFFFFFFF, which would cause an overflow into the carry bit.
You can also thing of subtractions as starting out with the carry
bit set.
The overflow flag indicates signed overflow (the carry bit would be unsigned overflow). Note, this is not merely a sign change, but a sign change the wrong way. For example, an addition of two positive numbers should always be positive, but if the numbers are big enough (say, 230, see table 23.6) then the results of the lower 30 bits may overflow into bit 31, therefore changing the sign and you'll have an incorrect addition. For subtraction, there can be a similar problem. Short of doing the full operation and checking whether the signs are correct, there isn't a simple way of figuring out what counts as overflow, but fortunately you don't have to. Usually overflow is only important for signed comparisons, and the condition mnemonics themselves should provide you with enough information to pick the right one.
|
With these points in mind, the conditional codes shouldn't be too hard to understand. The descriptions tell you what code you should use when. Also, don't forget that any instruction can be conditionally executed, not just a branch.
Let's start with the most basic of branches, b. This is
the most used branch, used to implement normal conditional code and
loops of all kinds. It is most often used in conjunction with one
of the 16 conditional codes of table 23.4.
Most of the times a branch will look something like this:
@ Branch example, pseudo code data-ops, Rd, Rn, Op2 @ Data operation to set the flags bcnd-code .Llabel @ Branch upon certain conditions @ more code A .Llabel: @ Branch goes here @ more code B
First, you have a data processing instruction that sets the status
flags, usually a subs or cmp, but it can
be any one of them. Then a bcond diverts the
flow to .Llabel if the conditions are met. A simple
example of this would be a division routine which checks if the
denominator is zero first. For example, the Div()
routine that uses BIOS Call #6 could be
safeguarded against division by 0 like this:
@ int DivSafe(int num, int den); @ \param num Numerator (in r0) @ \param den Denominator (in r1) @ \return r0= r0/r1, or INT_MAX/INT_MIN if r1 == 0 DivSafe: cmp r1, #0 beq .Ldiv_bad @ Branch on r1 == 0 swi 0x060000 bx lr .Ldiv_bad: mvn r1, #0x80000000 @ \ sub r0, r1, r0, asr #31 @ - r0= r0>=0 ? INT_MAX : INT_MIN; bx lr
The numerator and denominator will be in registers r0 and r1,
respectively. The cmp checks whether the denominator is
zero. If it's not, no branch is taken, the swi 6 is executed and
the function returns afterwards. If it is zero, the beq
will take the code to .Ldiv_bad. The two instructions
there set r0 to either INT_MAX (231−1 = 0x7FFFFFFF)
or INT_MIN (−231 = 0x80000000), depending on whether
r0 is positive or negative. If it's a little hard to see that,
mvn inverts bits, so the first line after
.Ldiv_bad sets r0 to INT_MAX. The second line we've seen
before: ‘r0, asr #31’ does a sign-extension
in to all other bits, giving 0 or −1 for positive and negative
numbers, respectively, giving INT_MAX− −1 = INT_MIN for
negative values of r0. Little optimizing tricks like these decide if
you're fit to be an assembly programmer; if not you could just as well
let the compiler do them, because it does know. (It's where I got the
‘asr #31’ thing from in the first place.)
Now in this case I used a branch, but in truth, it wasn't even necessary. The non-branch part consists of one instruction, and the branched part of two, so using conditional instructions throughout would have been both shorter and faster:
@ Second version using conditionally executed code DivSafe: cmp r1, #0 mvneq r1, #0x80000000 subeq r0, r1, r0, asr #31 swine 0x060000 bx lr
If the denominator is not zero, the mvneq and
subeq are essentially skipped. Actually, not so much
skipped, but turned into nop: non-operations. So is
swine (i.e., swi + ne,
no piggies here) if it is zero. True, the division line has
increased by a cycle, not taking the branch makes the exception line
a little faster and the function itself has shrunk from 7 to 5
instructions.
In the first DivSafe snippet, the internal branch
destination used a .L prefix, while the function
label did not. The .L prefix is used by GCC to indicate
labels for the sake of labels, as opposed to symbol labels like
DivSafe. While not required, it's a useful convention.
Any sort of branch will create a fork in the road and, depending on
the conditions, one road will be taken more often. That would be
the major branch. The other one would be the
minor branch, probably some sort of exception. The
branch instruction, b, represents a deviation from the
normal road and is relatively costly, therefore it pays to have to
branch to the exceptions. Consider these possibilities:
// Basic if statement in C if(r0 == 0) { /* IF clause */ } ...
@ === asm-if v1 : 'bus stop' branch === cmp r0, #0 beq .Lif .Lrest: ... bx lr @ function ends .Lif @ IF clause b .Lrest @ === asm-if v2 : 'skip' branch === cmp r0, #0 bne .Lrest @ IF clause .Lrest: ... bx lr @ function ends
The first version is more like the C version: it splits off for the IF-clause and the returns to the rest of the code. The flow branches twice if the conditions are met, but if they aren't the rest of the code doesn't branch at all. The second version branches if the conditions aren't met: it skips the IF-clause. Overall, the assembly code is simpler and shorter, but the fact that the branch-conditions are inverted with respect to the C version could take some getting used to.
So which to use? Well, that depends actually. All things being equal, the second one is better because it's one instruction and label shorter. As far as I know, this is what GCC uses. The problem is that some things may be more equal than others. If the IF-clause is exceptional (i.e., the minor branch), it'd mean that the second version almost always takes the branch, while the first version would hardly ever branch, so on average the latter would be faster.
Which one you chose is entirely up to you as you know your intentions best. I hope. For the remainder of this chapter I'll use the skip-branch because in demonstrations things usually are equal. Of course, if the clause is small enough you can just use conditional instructions and be done with it :).
Even though all you have now is b, it doesn't mean you
can't implement branching construct found in HLLs in assembly.
After all, the compiler seems to manage. Here's a couple of them.
The if-elseif is an extension of the normal
if-else, and from it you can extend to longer
if-elseif-else chains. In this case I want to look
at a wrapping algorithm for keeping numbers within certain boundaries:
the number x should stay within range
[mn, mx〉, if it exceeds either boundary it should
come out the other end. In C it looks like this:
// wrap(int x, int mn, int mx), C version: int res; if(x >= mx) res= mn + x-mx; else if(x < mn) res= mx + x-mn; else res= x;
The straightforward compilation would be:
@ r0= x ; r1= mn ; r2= mx cmp r0, r2 blt .Lx_lt_mx @ if( x >= mx ) add r3, r0, r1 @ r0= mn + x-mx sub r0, r3, r2 b .Lend .Lx_lt_mx: cmp r0, r1 @ bge .Lend @ if( x < mn ) add r3, r0, r2 @ r0= mx + x-mn; sub r0, r3, r1 .Lend: ...
This is what GCC gives, and it's pretty good. The ordering of the
clauses remained, which means that the condition for the branches
have to be inverted, so
‘if(x >= mx) {}’ becomes
‘skip if NOT x >= mx’. At the end of
each clause, you'd need to skip all the others:
branch to .Lend. The conditional branches mean
‘go to the next branch’.
And now an optimized version. First, a cmp is
equivalent to sub except that it doesn't put the
result in a register. However, as we need the result lateron anyway,
we might as well combine the ‘cmp’
and ‘sub’. Secondly, the clauses are pretty
small, so we can use conditional ops as well. The new version would
be:
@ Optimized wrapper subs r3, r0, r2 @ r3= x-mx addge r0, r3, r1 @ x= x-mx + mn bge .Lend subs r3, r0, r1 @ r3= x-mn addlt r0, r3, r2 @ r0= x-mn + mx; .Lend: ...
Cleans up nicely, wouldn't you say? Less branches, less code and
it matches the C code more closely. We can even get rid of
the last branch too because we can execute the subs
conditionally as well. Because ge and lt
are each others complements there won't be any interference.
So the final version is:
@ Optimized wrapper, version 2 subs r3, r0, r2 @ r3= x-mx addge r0, r3, r1 @ x= x-mx + mn sublts r3, r0, r1 @ r3= x-mn addlt r0, r3, r2 @ r0= x-mn + mx; ...
Of course, it isn't always possible to optimize to such an extent. However, if the clauses have small bodies conditional instructions may become attractive. Also, converting a compare to some form of data operation that you'd need later anyway is common and recommended.
Higher languages often allow you to string multiple conditions
together using
logical AND (&&) and
logical OR (||). What the books often won't say is
that these are merely shorthand notations of a chain of
ifs. Here's what actually happens.
// === if(x && y) { /* clause */ } === if(x) { if(y) { /* clause */ } } // === if(x || y) { /* clause */ } === if(x) { /* clause */ } else if(y) { /* clause */ }
The later terms in the AND are only evaluated if earlier expressions were true. If not, they're simply skipped. Much fun can be had if the second term happens to be a function with side effects. A logical OR is basically an if-else chain with identical clauses; this is just for show of course, in the final version there's one clause which is branched to. In assembly, these would look something like this.
@ if(r0 != 0 && r1 != 0) { /* clause */ } cmp r0, #0 beq .Lrest cmp r1, #0 beq .Lrest @ clause .Lrest: ... @ Alternative cmp r0, #0 cmpne r1, #0 beq .Lrest @ clause .Lrest: ...
@ if( r0 != 0 || r1 != 0 ){ /* clause */ } cmp r0, #0 bne .Ltrue cmp r1, #0 beq .Lrest .Ltrue: @ clause .Lrest: ...
As always, alternative solutions will present themselves for your specific situation. Also note that you can transform ANDs into ORs using De Morgan's Laws.
One of the most important reasons to use assembly is speeding up oft-used code, which will probably involve loops because that's where most of the time will be spent. If you can remove one instruction in a non-loop situation, you'll have won one cycle. If you remove one from a loop, you'll have gained one for every iteration of the loop. For example, saving a cycle in a clear-screen function would save 240*160 = 19200 cycles – more, actually, because of memory wait-states. That one cycle can mean the difference between smooth and choppy animation.
In short, optimization is pretty much all about loops, especially
inner loops. Interestingly, this is where GCC often misses the mark,
either because it adds more stuff than necessary. For example,
in older versions (DKA and DKP r12, something like that) it often
kept constructed memory addresses (VRAM, etc) inside the loop.
Unfortunately, DKP r19 also has major issues with struct copies
and ldm/stm pairs, which are now only give a small
benefit over other methods. Now, before you blame GCC for slow loops,
it's also often the C programmer that forces GCC to produce slow
code. In the introduction, you could see the major difference that
inlining makes, and in the
profiling demo
I showed how much difference using the wrong datatype makes.
Anyway, loops in assembly. Making a loop is the easiest thing in the
world: just branch to a previous label. The differences between
for, do-while and while loops
are a matter of where you increment and test. In C, you usually use
a for-loop with an incrementing index. In assembly, it's customary to
use a while-loop with a decrementing index. Here are two examples of
a word-copy loop that should show you why.
@ Asm equivalents of copying 16 words. @ u32 *dst=..., *src= ..., ii // r0, r1, r2 @ --- Incrementing for-loop --- @ for(ii=0; ii<16; ii++) @ dst[ii]= src[ii]; mov r2, #0 .LabelF: ldr r3, [r1, r2, lsl #2] str r3, [r0, r2, lsl #2] add r2, r2, #1 cmp r2, r2, #16 blt .LabelF @ --- Decrementing while-loop --- @ ii= 16; @ while(ii--) @ *dst++ = *src++; mov r2, #16 .LabelW: ldr r3, [r1], #4 str r3, [r0], #4 subs r2, r2, #1 bne .LabelW
In an incrementing for-loop you need to increment and then compare against the limit. In the decrementing while loop you subtract and test for zero. Because the zero-test is already part of every instruction, you don't need to compare separately. True, it's not much faster, maybe 10% or so, but many 10 percents here and there do add up. There are actually many versions of this kind of loop, here's another one using block-transfers. The benefit of those is that they also work in THUMB:
@ Yet another version, using ldm/stm add r2, r0, #16 .LabelW: ldmia r1!, {r3} stmia r0!, {r3} cmp r2, r0 bne .LabelW
This is one of those occasions where knowing assembly can help you
write efficient C. Using a decrementing counter and pointer arithmetic
will usually be a little faster, but usually GCC willl do this for you
anyway. Another point is using the right datatype. And with
‘right’ I mean int, of course.
Non-ints require implicit casts
(lsl/lsr pairs) after every arithmetic
operation. That's two instructions after every plus, minus or
whatever. While GCC has become quite proficient in converting
non-ints into ints where possible, this has not always been the case,
and it may not always be possible. I've seen the loops above
cost between 600% more because the index and pointer were
u16, I shit you not.
When dealing with loops, be extremely careful with how you start and stop the loop. It's very easy to come up with a loop that runs once too often or too little. I'm pretty sure these two versions are correct. The way I usually check it is to see how it runs when the count should be 1 or 2. If that works out, larger numbers will too.
The Z status flag revolves around the number zero, so if you use 0 to compare to you can often combine the comparison with the data instruction that sets the flags.
This is also true for testing individual bits. The N and
C flags are effectively bits 31 and 32, respectively. If you
can manipulate the algorithm to use those, you don't need a
cmp or tst.
You probably know this already, but this is a good time to repeat the warning: watch out for off-by-one errors (also known as obi-wan errors). It is just way too easy to do one iteration too few or too many, so always check whether the code you have does the right thing. Goes for other programming languages too, of course.
Function calls use a special kind of branching instruction, namely
bl. It works exactly like the normal branch, except that
it saves the address after the bl in the link register
(r14 or lr) so that you know where to return
to after the called function is finished. In principle, you can return
with to the function using ‘mov pc, lr’,
which points the program counter back to the calling function, but in
practice you might be better off with bx (Branch and
eXchange). The difference is that bx can also switch
between ARM and THUMB states, which isn't possible with the
mov return.
Unlike b and bl, bx takes a
register as its argument, instead of a label. This register will
usually be lr, but the others are allowed as well.
There's also the matter of passing parameters to the function and returning values from it. In principle you're free to use any system you like, it is recommended to ARM's own ARM Architecture Procedure Call Standard (AAPCS) for this. For the majority of the work this can be summarized like this.
bl sets
lr, so stack that one too in that case.
Below is a real-world example of function calling, complete with
parameter passing, stackwork and returning from the call. The function
oamcpy() copies OBJ_ATTRs. The function uses the same
argument order as memcpy(), and these need to be set by
the calling function; before and after the call, lr is
pushed and popped. These two things are part of what's called the
function overhead, which can be disastrous for small functions, as
we've already seen. Inside oamcpy() we either jump back
immediately if the count was 0, or proceed with the copies and then
return. Note that r4 is stacked here, because that's
what the caller expects; if I hadn't and the caller used
r4 as well, I'd be screwed and rightly so. I should
probably point out that r12 is usually considered a
scratch register as well, which I could have used here instead of
r4, removing the need for stacking.
@ Function calling example: oamcpy @ void oamcpy(OBJ_ATTR *dst, const OBJ_ATTR *src, u32 nn); @ Parameters: r0= dst; r1= src; r2= nn; .align 2 oamcpy: cmp r2, #0 bxeq lr @ Nothing to do: return early push {r4} @ Put r4 on stack .Lcpyloop: ldmia r1!, {r3, r4} stmia r0!, {r3, r4} subs r2, #1 bne .Lcpyloop pop {r4} @ Restore r4 to its original value bx lr @ Return to calling function @ Using oamcpy. @ Set arguments mov r0, #0x07000000 ldr r1,=obj_buffer mov r2, #128 push {lr} @ Save lr bl oamcpy @ Call oamcpy (clobbers lr; assumes clobbering of r0-r3,r12) pop {lr} @ Restore lr
The bx instruction is what makes interworking between
ARM and THUMB function possible. Interworking is good. Therefore,
bx is good.
This concludes the primary section on ARM assembly. There are more
things like different processor states, and data-swap (swp)
and co-processor instructions, but those are rare.
If you need more information, look
them up in the proper reference guides. The next two subsections
cover instruction speeds and what an instruction actually looks like
in binary, i.e., what the processor actually processes. Neither
section is necessary in the strictest sense, but still informative.
If you do not want to be informed, move on to the next section:
the THUMB instruction set.
Since the whole reason for coding in asm is speed (well, that and space efficiency), it is important to know how fast each instruction is so that you can decide on which one to use where. The term ‘cycle’ actually has two different meanings: there is the clock cycle, which measures the amount of clock ticks, and there's functional cycle (for lack of a better word), which indicates the number of stages in an instruction. In an ideal world these two would be equal. However, this is the real world, where we have to deal with waitstates and buswidths, which make functional cycles cost multiple clock cycles. A wait(state) is the added cost for accessing memory; memory might just not be as fast as the CPU itself. Memory also as a fixed buswidths, indicating the maximum number of bits that can be sent in one cycle: if the data you want to transfer is larger than the memory bus can handle, it has to be cut up into smaller sections and put through, costing additional cycles. For example, ROM has a 16bit bus which is fine for transferring bytes or halfwords, but words are transferred as two halfwords, costing two functional cycles instead of just one. If you hadn't guessed already, this is why THUMB code is recommended for ROM/EWRAM code.
There are three types of functional cycles: the non-sequential (N), the sequential (S) and the internal (I) cycle. There is a fourth, the coprocessor cycle (C), but as the GBA doesn't have a coprocessor I'm leaving that one out.
Anyway, the N- and S-cycles have to do with memory fetches: if the transfer of the current (functional) cycle is not related to the previous cycle, it is non-sequential; otherwise, it's sequential. Most instructions will be sequential (from the instruction fetch), but branches and loads/stores can have non-sequentials as they have to look up another address. Both sequential and non-sequential cycles are affected by section waitstates. The internal cycles is one where the CPU is already doing something else so that even though it's clear what the next action should be, it'll just have to wait. I-cycles do not suffer from waitstates.
|
|
Table 23.7 shows how much the instructions cost in terms of N/S/I cycles. How one can arrive to these cycle times is explained below. Table 23.8 lists the buswidths, the waitstates and the access times in clock cycles for each section. Note that these are the default wait states, which can be altered in REG_WAITCNT.
The data presented here is just an overview of the most important items, for all the gory details you should look them up in GBATek or the official documents.
The cost of an instruction begins with fetching it from memory, which is a 1S operation. For most instructions, it ends there as well.
Memory instructions also have to get data from memory,
which will cost 1Nd; I've added a subscript d here
because this is an access to the section where the data is
kept, whereas other waitstates are taken from the section where the
code resides. This is an important distinction. Also, because the
address of the next instruction won't be related to the current
address, its timing will begin as a 1N instead of a 1S. This
difference is encompassed in the transfer timing. Note however that
most documentation list ldr as 1S+1N+1I, but this is
false! If you actually test it, you'll see that it is really
1N+1Nd+1I.
Block transfer behave like normal transfers, except that all accesses after the first are Sd-cycles.
Branches need an extra 1N+1S for jumping to the new
address and fetching the resetting the pipeline (I think). Anything
that changes pc can be considered a branch. The THUMB
bl is actually two instructions (or, rather, one
instruction and an address), which is why that has an additional
1S.
Register-shifted operations add 1I to the base cost because the value has to be read from the register before it can be applied.
Multiplies are 1S operations, plus 1I for every significant byte
of the second operand. Yes, this does mean that the cost
is asymmetric in the operands. If you can estimate the ranges
of values for the operands, try to put the lower one in the
second operand. Another 1I is required for the add of
mla, and one more for long multiplications.
Official documentation gives 1S+1Nd+1I as the timing of
ldr, but this is not entirely accurate. It is actually
1N+1Nd+1I. The difference is small and only
visible for ROM instructions, but could be annoying if you're
wondering why what you predicted and what you measured for your
routine doesn't match exactly. This applies to ldm
and perhaps swp too.
See forum:9602 for a little more on the subject.
As an example of how instructions are actually formatted, I invite
you to an in-depth look at the add instruction. This
will be the absolute rock bottom in terms of GBA programming, the
lowest level of them all. Understanding this will go a long way
in understanding the hows and whys (and why-nots!) of ARM assembly.
Before I show the bits, I should point out that add
(and in fact all data instructions) come in three forms, depending
on the second operand. This can be an immediate field (numeric
value), an immediate-shifted register or a register-shifted register.
Bits 4 and 19h indicate the type of add, the lower
12 bits describe the second operand; the rest are the same for
all add forms.
| 1F - 1C | 1B 1A | 19 | 18 - 15 | 14 | 13 - 10 | F - C | B - 8 | 7 | 6 5 | 4 | 3 - 0 | |||
| add Rd, Rn, # | cnd | TA | IF | TB | S | Rn | Rd | IR | IN | |||||
| add Rd, Rn, Rm Op # | IS | ST | SF | Rm | ||||||||||
| add Rd, Rn, Rm Op Rs | Rs | 0 | ||||||||||||
| bits | name | description |
|---|---|---|
| C-F | Rd | Destination register. |
| 10-13 | Rn | First operand register. |
| 14 | S | Set Status bits (the -s affix).
|
| 15-18 | TB | Instruction-type field. Must be 4 for add .
|
| 19 | IF | Immediate flag. The second operand is an immediate value if set, and a (shifted) register if clear. |
| 1A-1C | TA | Another instruction-type field. Is zero for all data instructions. |
| 1D-1F | cnd | Condition field. |
| bits | name | description |
|---|---|---|
| 0-7 | IN | Immediate Number field. The second operand is
IN ror 2*IR.
|
| 8-B | IR | Immediate Rotate field. This denotes the rotate-right amount applied to IN. |
| 0-3 | Rm | Second operand register. |
| 4 | SF | Shift-operand flag. If set, the shift is the immediate value in IS; if clear, the shift comes from register Rs. |
| 5-6 | ST | Shift type.
0: lsl, 1: lsr
2: asr, 3: ror
|
| 7-B | IS | Immediate Shift value. Second operand is
Rm Op IS.
|
| 8-B | Rs | Shift Register. Second operand is
Rm Op Rs.
|
These kinds of tables should feel familiar: yes, I've also used them
for IO-registers throughout Tonc. The fact of the matter is that the
instructions are coded in a very similar way. In this case, you
have a 32bit value, with different bitfields describing the type
of instruction (TA=0 and TB=4 indicate
is an add instruction), the registers to use
(Rd, Rd and maybe Rm and
Rs too) and a few others. We have seen this thing a
number of times by now, so there should be no real difficulty in
understanding here. The assembly instructions are little more than
the BUILD macros I've used a
couple of times, only this time it's the assembler that turn them
into raw numbers instead of the preprocessor. Having said that,
it is possible to construct the instructions manually,
even at run-time, but whether you really want to do such a thing is
another matter.
Now, the top 20 bits indicate the kind of instruction it is and
which registers it uses. The bottom 12 are for Op2.
If this involves a shifted register the bottom 4 bits indicate
Rm. Bits 5 and 6 describe the type of shift-operation
(shift-left, shift-right or rotate-right) and depending on bit 4,
bits 7 to 11 form either the register to shift by (Rs)
or a shift-value (5 bits for 0 to 31). And then there's the
immediate operand …
Sigh. Yes, here are the mere twelve bits you can use for an immediate
operand, divided into a 4bit rotate part and 8bit immediate part. The
allowable immediate values are given by IN ror 2*IR.
This seems like a small range, but interestingly enough you can get
quite far with just these. It does mean that you can never
load variable addresses into a register in one go; you have to
get the address first with a PC-relative load and then load its
value.
@ Forming 511(0x101) mov r0, #511 @ I