NEON Tips & Tricks Part 1 - Using Stack Efficiently, Safely |NEON Alchemy Lab

You NEVER Have Enough Registers

I has been and is a very hard struggle living with the limited number of registers on ARM : There are 16 32bit ones, only 14 of them available.
The major difficulty coding in assembly lies in the register management therefore.
It's quite different when coding for NEON : There are 32 64bit registers - all of them free to use - IN ADDITION to the 14 on ARM. Life is beautiful!

Alas, it might be more than sufficient for most relatively simple integer algorithms, but there are situations where high precision is important, forcing you to do 32bit/float arithmetics. Especially those butterfly ones found in DCT/FFT are quite nasty since every element has to be computed with all other ones once each.

Butterfly diagram of LLM iDCT, from the excellent study
"Efficient Fixed-Point Approximations of the 8x8 Inverse

Discrete Cosine Transform" by

Yuriy A. Reznik, Arianne T. Hindsy, Cixun Zhangz, Lu Yuz, and Zhibo Niz

Madame Butterfly
Even the "tiny" 8x8 matrix from jpeg, when pumped up to 32bit each for high precision arithmetics, occupies 4*8*8 = 256 bytes.
NEON can contain 256bytes in its registers(32*8=256), but that's all. NEON cannot do anything further without other registers holding the coefficient scalars and freely available for destination/temporary purpose.
You are more or less forced to compute the matrix in two passes - four lines each in this case, for example.

PUSH & POP

What do we do if no registers are available for next operations? We put those containing values not necessary for the immediate operations onto stack, use them for the operations, and restore them when it's their turn :

.
.
vpush {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
mov r12, sp
.
// compute line4~line7
.
.
vpush {line4a, line4b, line5a, line5b, line6a, line6b, line7a, line7b}
vldmia r12, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line0~line3.
.
.

Now there are some problems here though. In simple situations, PUSHs and POPs are paired like parenthesis, but not in this case.
Since we also have to preserve the results line4~7, we have to resort to an additional ARM register to restore line0~line3 after pushing line4~7, and since you don't pop for this purpose, you have to manage the Stack Pointer manually, called killing the stack pointer :

.
.
add sp, sp, #8*16 // kill the stack pointer
vpop {q4-q7} 
bx lr // return

If you miss this process or miscalculate the killing number, the OS will mercilessly kill your task in return, or even worse, your task will spit out full of nonsenses before it gets killed. - very hard to debug.

Making Life Easier

.
vpush {q4-q7} // according to ATPCS, q4-q7 have to be preserved prior to use.....
mov r3, sp // backing up the sp
sub sp, sp, #16*16 // make room for 16 quad registers
.
.
mov r12, sp
vstmia r12!, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line4~line7
.
.
vstmia r12!, {line4a, line4b, line5a, line5b, line6a, line6b, line7a, line7b}
mov r12, sp
vldmia r12!, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line0~line3.
.
.
mov sp, r3 // restoring the sp
vpop {q4-q7} // .....and restored after use
bx lr // return

It's much more safe now. You don't have to calculate the killing number manually every time vpush/vpop population is changed.
You have to sacrifice an ARM register(r3 here) for this method, but there is no way you can be short on ARM registers when programming for NEON, and executing a few ARM instructions while NEON instructions dominate is COMPLETELY FREE, cycle wise.

Aligned Forces

Vpush and vpop aren't ordinary NEON instructions. They are pseudo-instructions translated by the ARM Assembly into "vstmdb sp!, {register-list}" and "vstmia sp!, {register-list}" each.
Vldm and vstm both are VFP-NEON shared instructions. NEON has its proprietary memory related instructions : vld and vst
In addition to the on-the-fly interleaving/de-interleaving property(more on this in next part), vld and vst both benefit from the right memory alignment in form of faster execution, if you guarantee it to them expressively :

.
vpush {q4-q7} // according to ATPCS, q4-q7 have to be preserved prior to use.....
mov r3, sp // backing up the sp
sub sp, sp, #16*16 // make room for 16 quad registers
bic sp, sp, #0x1f  // equivalent to sp &=0xffffffe0, make sp 32byte aligned
.
.
mov r12, sp
vst1.32 {line0a, line0b}, [r12,:256]!
vst1.32 {line1a, line1b}, [r12,:256]!
vst1.32 {line2a, line2b}, [r12,:256]!
vst1.32 {line3a, line3b}, [r12,:256]!
.
.
// compute line4~line7
.
.
vst1.32 {line4a, line4b}, [r12,:256]!
vst1.32 {line5a, line5b}, [r12,:256]!
vst1.32 {line6a, line6b}, [r12,:256]!
vst1.32 {line7a, line7b}, [r12,:256]!
mov r12, sp
vld1.32 {line0a, line0b}, [r12,:256]!
vld1.32 {line1a, line1b}, [r12,:256]!
vld1.32 {line2a, line2b}, [r12,:256]!
vld1.32 {line3a, line3b}, [r12,:256]!
.
.
// compute line0~line3.
.
.
mov sp, r3 // restoring the sp
vpop {q4-q7} // .....and restored after use
bx lr // return

The major drawback of vld and vst compared to vldm and vstm is that the former can load/store only up to two quad registers(four double registers) while the latter can handle up to eight quad registers.(16 double registers)
How expensive are they on Coretex A8? vldm/vstm cost nine cycles (number of q registers+1) while aligned vldm/vstm cost 2 cycles each.

Well, it might seem not worth all the efforts/increased code length for a mere, single cycle saved, but there are other strong benefits.......

to be continued in next part

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

NEON Alchemy Lab

Wednesday, July 17, 2013