I think too much. Or so people keep telling me. And they may have a point.
Anyway, so I'm working on this assembly version of
toncset
for, well, just because I guess. A fill
routine has 3 parts: a head, a main run and a tail. The main part
fills 32-bit chunks (words) when there's more than 4 bytes left and
when the destination is word-aligned. This part is easy, because
you can just dump words with stmia
or str
.
For an example of this, see my older
memset16 post.
The tail is for the remaining bytes after the main part. Under normal circumstances you could do this byte for byte, but some sections of GBA/NDS memory do not take kindly to byte-writes, so you'd have to read the word, mask the appropriate bytes out/in and then write it back. This is mostly just annoying, but still very doable.
The head part, however, is both annoying and tricky. It consist of filling the unaligned bytes in the first word of a run so that the main part can do its thing. This is similar to the tail part, in that it requires bitmasks. However, it's also possible that both the beginning and and of a run occur in the same word, effectively making the head the tail as well, so you'd have to apply a double mask … somehow. In C, it looks something like this:
void *dstv; // potentially non-aligned
u32 fill; // Already extended to full 32-bit if appropriate.
uint size; // > 0
*/
u32 addr= (u32)dstv;
int left= addr&3;
if(left != 0)
{
u32 *dst= (u32*)(addr&~3);
int right= left+size;
u32 mask= 0xFFFFFFFF;
if(right < 4) // Everything in a single word: head and tail.
{
mask= ~(mask<<8*size); // Create right-mask 000F
mask <<= 8*left; // Create mid mask 00F0
*dst = (*dst &~ mask) | (fill & mask);
return;
}
else
{
mask <<= 8*left; // Create left mask FFF0
*dst = (*dst &~ mask) | (fill & mask);
size= right-4;
*dst++;
}
}
This bit of C translates roughly into the following ARM code:
@ r0 : dst
@ r1 : src
@ r2 : size
@ r3 : left (dst&3) / right (left+size) / data
@ r4 : lshift (left*8)
@ r5 : rshift (right*8) / mask
@ ip : maskBase (0xFFFFFFFF)
.Lfgset_head:
bic r0, r0,#3 @ Align dst
mvn ip, #0 @ Set-up mask (0xFFFFFFFF)
mov r4, r3, lsl #3 @ (left*=8) != 0 : right-side only
add r3, r3, r2 @ right= left+size
cmp r3, #4
@ <= 4 : single-word. Shrink mask, do usual and quit early
movlo r5, r2, lsl #3 @ \.
mvnlo ip, ip, lsl r5 @ - r5= ((1<<8*size)-1)<<(8*left);
mov r5, ip, lsl r4 @ /
subhi r2, r3, #4 @ Adjust size for follow-ups
@ Mask in r1 and write back to dst
ldr r3, [r0]
bic r3, r3, r5
and r5, r1, r5
orr r3, r5
str r3, [r0], #4
bhi .Lfgset_main @ Longer stretch : go back.
bx lr @ Single-word fill : finished.
The main thing I thought when I'd written this down was “meh”.
You will note that registers r4
and r5
are
used here, which means stack-work (omitted here for brevity). The
positioning of pushing and popping makes everything a little awkward,
so I was off looking for something else.
The essence of the problem here is that you can only use five registers
without touching the stack: r0-r3
and ip
(r12
). Now, r0-r2
are taken
by the destination, fill and size, so I can't do anything with those.
I also need one to store the left-edge (r3
in this case),
leaving us with one for the right edge, the left and right shift
intermediaries, and the left and right masks. Right! Crap -_-.
So; strategies. Well, one of the reasons I need to use so many
registers is because the lifetimes overlap. For example, I still need
left
for a while because shifting the mask up comes last
here. I can't use r2
for multiple purposes either because
I'll need it for the size. Now, I could free up r3
by
making the left-mask first, but then I might get in trouble when creating
the right-mask. Also, right-4
is actually
what size
wants to be when it grows up in the long-run
case, so I can use that there as well. I'd just have to undo that
for the short case, or perhaps even create the right-mask from
negative numbers.
At this point I figured it would be helpful to take a look at the
various ways of creating the masks I needed. The standard form for a
0x000000FF-style bitmask is (1<<x)-1
, but there are
others as well. The following list holds a few examples.
( 1<<x)-1; // 000000FF. mov r0, #1; rsb r0, r0, r0, lsl r1;
~(-1<<x); // 000000FF. mvn r0, #0; mvn r0, r0, lsl r1;
-1<<x; // FFFFFF00. mvn r0, #0; mov r0, r0, lsl r1;
-( 1<<x); // FFFFFF00. mov r0, #1; sub r0, r0, r0 lsl r1; sub r0, #1
-1>>x; // 00FFFFFF. mvn r0, #0; mov r0, r0, lsr r1;
~(-1>>x); // FF000000. mvn r0, #0; mvn r0, r0, lsl r1;
All of these use +1 or −1 as their base, and all but one is
a two-instruction affair. The left-mask looks like 0xFFFFFF00, so
the most obvious one to pick here is -1<<x
.
Technically, the right-mask is 0x000000FF, using
x = 8*right = 8*(left+size)
.
However, you can also see it as a 0x00FFFFFF-style mask if you
use 4-right
. This solves two problems at once. First,
it is the negative of the new size, so the value is readily available.
Second, this mask is a right-shifted 0xFFFFFFFF, but as the lower bits
are shifted out anyway, it doesn't actually have to be a proper
0xFFFFFFFF; it can be a 0xFFFFFF00 as well, which we have in the form
of the left-mask. In other words, we don't require as many registers
for temps because we already have everything we need. The resulting
code is this:
movs r3, r3, lsl #3
mvn ip, #0 @ ip= -1
mov r3, ip, lsl r3 @ lmask := (-1)<<(8*left)
subs r2, r2, #4 @ (1b) aligned size= right-4
@ new size < 0 : single-word fill
rsblo r2, r2, #0 @ - (2) r3= -8*size
movlo r2, r2, lsl #3 @ /
andlo r3, r3, r3, lsr r2 @ (3) mask = lmask & rmask
@# inserts and jumps
See? No r4
and r5
anywhere. The
key is toying around with r2
and r3
. While
r2
is reserved for the size, it needs to modified anyway
to account for the work done here. In the end, size should be
right−4, which is what points (1a) and (1b) do. Since
right-4
is a right<4
as well, we can use
its result as the condition for the special case; the result
being the negative distance from the word-edge. As explained above,
the right-mask can be constructed from the left-mask by
lmask>>(-8*size)
, which is done at points (2) and
(3).
It's a little hairy, but it works. And yet, it still evoked a feeling
of “meh” like before. It's the two instructions at point (2) that annoyed me. The reason it's two instructions and not one is because you
can't multiply by −8 in one go. By +8, yes: that's a
shifted-mov
; −1, yes: that's
rsb, r2, #0
; but the combination is difficult
because the shift only applies to the second operand. A
sub r2, #0, r2, lsl #3
would do it,
but the first operand needs to be a register and I don't have a spare
one with zero in it. I could make one, that just means I have an
extra instruction somewhere else. I do, however, have either
a +1 or −1 in ip
, maybe I can use that somehow.
And then it hits me: the carry flag!
There are adc
, sbc
and rsb
instructions that add C, C−1 and C−1
to the result, respectively. Setting or clearing the carry flag is
easy, so that's not a problem. All I need now is to start the
flag using +1 instead of −1 to cancel out the −1 in sbc
or rsc
. As it turns out, I can do that
I use this format for the left-mask: -(1<<x)
. In the
mask overview above I listed thi as a 3-instruction gig, but as it turns
out I can use the carry-trick here to for one instruction less. The final
version (for just the head part) looks like this:
beq .Lfgset_main @ Jump to main stint is aligned
.Lfgset_head:
bic r0, r0, #3
add r2, r3, r2
movs r3, r3, lsl #3 @ left*8 ; clear carry
mov ip, #1
sbc r3, ip, ip, lsl r3 @ -(1<<8*left) +1-1
subs r2, r2, #4 @ size= right-4
@ If negative (==carry clear), this is a single-word fill
@ This requires a truncated mask (like 0x0000FF00)
sbclo r2, ip, r2, lsl #3 @ x= -8*size +1-1
andlo r3, r3, r3, lsr r2 @ mask= mask & (mask>>-8*size);
@ Insert and jump to main stint if available.
.Lfgset_insert:
ldr ip, [r0]
bic ip, ip, r3
and r3, r1, r3
orr ip, ip, r3
str ip, [r0], #4
bhi .Lfgset_main @ Longer stretch
bx lr @ Single-word fill : finished.
Sweeeet :). I was happy with this, until I realized what I'd been working on: an exception of an exception. This would definitely not be part of the 20% of the code that uses 80% of the runtime, so it's really not something one should worry about. Interesting, yes, and I learned a few new tricks, but perhaps time would have been better spent on getting 5% extra out of the main loop. The only problem there is that that is just boring old unrolling a bit, whereas the head presented a more ‘interesting’ problem so I went for that instead.
So yeah; I think too much >_>
Ahh, _that_ habit. Very insidious and captivating. And the death of many perfectly good evenings. Did I mention this post is giving me brainfaults? @_@
In the spirit of Not Helping(tm), I'll throw in my favorite ARM perversions:
mov timers, timer_1
orr timers, timer_2, lsl #16
mov r0, shiftCount
orr r0, someSpareFlags, lsl #8
orr r0, counter, lsl #12
...
loop:
rsbs timers, dt, timers, ror #16 ; decrement and test high half < 0
blt infrequent_complex_event_1
rsbs timers, dt, timers, ror #16 ; decrement and test low half < 0
blt pesky_distraction_2
...
mov foo, bar, lsl r0 ; uses low 8 bits of r0
...
orr/bic../tsts r0, someBit ; toy with spare flags
...
subs r0, #(1<<12) ; decrement and test counter
blt loop
This came from some Adlib/OPL2 emulation. Desparate enough? Well, that wasn't the end of it, and there are _still_ LDM's in there for lack of registers. Though not all of this was used in the end. I ditched sample-accurate envelopes, and I still can't break ~80 ARM9 cycles/sample. :|
Niiiice. I never had to cram several variables in a single word like you're doing for
r0
here, but I've always been fond of abusing ROR. For example, on some occasions you can split a coordinate into the divident and the modulo by using ROR 3 instead of LSR 3. In later instructions you can use either part with further shifts, usually without getting into trouble.I also came across a nice bit for the tail-end of a large fill routine. When you're down to less than 32 bytes, you can do a word-loop, unroll the word-loop, or go for a nifty
stm
variant like this:tst r2, #16
stmneia r0!, {r4-r7}
tst r2, #8
stmneia r0!, {r4-r5}
tst r2, #4
... etc
But you can actually test two bits at once if you shift them into the sign and carry bit:
mov ip, r2, lsl #(32-4) @ test for bits 4 and 3
stmcsia r0!, {r4-r7}
stmmiia r0!, {r4-r5}
mov ip, r2, lsl #(32-2) @ test for bits 2 and 1
stmcsia r0!, {r4}
strmih r4, [r0], #2
Don't you just love that barrel shifter? ^_^.
And now I'm curious about that whole thing too, but I'm fighting the urge to ask for details.
I feel so weird working with x86 anymore! I guess it's partly a matter of perspective. (e.g.: Memory operands and register renaming as a form of code compression..)
6502 is my second favorite to ARM. Didn't try much, but I always found the simplicity refreshing. Though a funny thing about it is that if you set the low two opcode bits, supposedly you get an "ALU + shifter" op, like ORA + ASL ("ASO"). Sound familiar? These are the "illegal" ops the chip is so infamous for. I couldn't say whether that's where ARM got the idea, but it's very nifty indeed.
Er, I'll try to be brief then. (It's not my strong point.)
Some background: Yamaha's OPL2/3 series cropped up in tons of 1990s-era sound boards. So many, that I wonder if lame OPL instruments are THE reason some folks associate MIDI with "crap music." But used on children of the 80's, good "FM" tunes--like the NES and C64--should induce spine-shivering waves of nostalgia. ;-)
Other excuses include a burning curiosity regarding MAME's spaghettiesque OPL driver (from which I got the algorithms), and a nice, complete set of instrument patches (ah, wasted youth!).
This is pseudocode for the meat of it. As you can see, it's not truly FM, but it's close. Topping out (at 18 voices, 32Khz stereo) uses ~75% of the ARM9.
Right now it's playing MIDIs, but I'd love to see a Mario Paint sorta thing...
Erk! Maimed by the sanitizer. It actually even ate a few lines. Sorry about that. :(
You've got those beautiful code blocks up there--now I have to go reading... Is there a good reference on what WordPress will tolerate? Or at least a well-hidden preview? ^_^;
Yeah, the sanitizer's a bit of a bitch. It's a battle between the standard WP styles, my own styles and several plugins. I've been toying with some of the settings to make sure the comments behave as sensible as possible, but as you can see it can still be a little messy. I can fix the code for you, but I'd need the stuff that was deleted.
The code blocks are done with code snippet, which uses geshi and [code] tags. For pretty colors, you can use lang="..."; cpp for C/C++ and gccarm for ARM assembly are the most common for me.
I'd love to have a preview, but my webscript-fu is still somewhat weak, so I'm not sure how to implement it ... yet. Of course, if one should suddenly find itself in my mailbox, I would consider it (hint, hint).