NEON Alchemy Lab: optimization

Wednesday, July 17, 2013

NEON Tips & Tricks Part 1 - Using Stack Efficiently, Safely

You NEVER Have Enough Registers

I has been and is a very hard struggle living with the limited number of registers on ARM : There are 16 32bit ones, only 14 of them available.
The major difficulty coding in assembly lies in the register management therefore.
It's quite different when coding for NEON : There are 32 64bit registers - all of them free to use - IN ADDITION to the 14 on ARM. Life is beautiful!

Alas, it might be more than sufficient for most relatively simple integer algorithms, but there are situations where high precision is important, forcing you to do 32bit/float arithmetics. Especially those butterfly ones found in DCT/FFT are quite nasty since every element has to be computed with all other ones once each.

Butterfly diagram of LLM iDCT, from the excellent study
"Efficient Fixed-Point Approximations of the 8x8 Inverse

Discrete Cosine Transform" by

Yuriy A. Reznik, Arianne T. Hindsy, Cixun Zhangz, Lu Yuz, and Zhibo Niz

Madame Butterfly
Even the "tiny" 8x8 matrix from jpeg, when pumped up to 32bit each for high precision arithmetics, occupies 4*8*8 = 256 bytes.
NEON can contain 256bytes in its registers(32*8=256), but that's all. NEON cannot do anything further without other registers holding the coefficient scalars and freely available for destination/temporary purpose.
You are more or less forced to compute the matrix in two passes - four lines each in this case, for example.

PUSH & POP

What do we do if no registers are available for next operations? We put those containing values not necessary for the immediate operations onto stack, use them for the operations, and restore them when it's their turn :

.
.
vpush {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
mov r12, sp
.
// compute line4~line7
.
.
vpush {line4a, line4b, line5a, line5b, line6a, line6b, line7a, line7b}
vldmia r12, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line0~line3.
.
.

Now there are some problems here though. In simple situations, PUSHs and POPs are paired like parenthesis, but not in this case.
Since we also have to preserve the results line4~7, we have to resort to an additional ARM register to restore line0~line3 after pushing line4~7, and since you don't pop for this purpose, you have to manage the Stack Pointer manually, called killing the stack pointer :

.
.
add sp, sp, #8*16 // kill the stack pointer
vpop {q4-q7} 
bx lr // return

If you miss this process or miscalculate the killing number, the OS will mercilessly kill your task in return, or even worse, your task will spit out full of nonsenses before it gets killed. - very hard to debug.

Making Life Easier

.
vpush {q4-q7} // according to ATPCS, q4-q7 have to be preserved prior to use.....
mov r3, sp // backing up the sp
sub sp, sp, #16*16 // make room for 16 quad registers
.
.
mov r12, sp
vstmia r12!, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line4~line7
.
.
vstmia r12!, {line4a, line4b, line5a, line5b, line6a, line6b, line7a, line7b}
mov r12, sp
vldmia r12!, {line0a, line0b, line1a, line1b, line2a, line2b, line3a, line3b}
.
.
// compute line0~line3.
.
.
mov sp, r3 // restoring the sp
vpop {q4-q7} // .....and restored after use
bx lr // return

It's much more safe now. You don't have to calculate the killing number manually every time vpush/vpop population is changed.
You have to sacrifice an ARM register(r3 here) for this method, but there is no way you can be short on ARM registers when programming for NEON, and executing a few ARM instructions while NEON instructions dominate is COMPLETELY FREE, cycle wise.

Aligned Forces

Vpush and vpop aren't ordinary NEON instructions. They are pseudo-instructions translated by the ARM Assembly into "vstmdb sp!, {register-list}" and "vstmia sp!, {register-list}" each.
Vldm and vstm both are VFP-NEON shared instructions. NEON has its proprietary memory related instructions : vld and vst
In addition to the on-the-fly interleaving/de-interleaving property(more on this in next part), vld and vst both benefit from the right memory alignment in form of faster execution, if you guarantee it to them expressively :

.
vpush {q4-q7} // according to ATPCS, q4-q7 have to be preserved prior to use.....
mov r3, sp // backing up the sp
sub sp, sp, #16*16 // make room for 16 quad registers
bic sp, sp, #0x1f  // equivalent to sp &=0xffffffe0, make sp 32byte aligned
.
.
mov r12, sp
vst1.32 {line0a, line0b}, [r12,:256]!
vst1.32 {line1a, line1b}, [r12,:256]!
vst1.32 {line2a, line2b}, [r12,:256]!
vst1.32 {line3a, line3b}, [r12,:256]!
.
.
// compute line4~line7
.
.
vst1.32 {line4a, line4b}, [r12,:256]!
vst1.32 {line5a, line5b}, [r12,:256]!
vst1.32 {line6a, line6b}, [r12,:256]!
vst1.32 {line7a, line7b}, [r12,:256]!
mov r12, sp
vld1.32 {line0a, line0b}, [r12,:256]!
vld1.32 {line1a, line1b}, [r12,:256]!
vld1.32 {line2a, line2b}, [r12,:256]!
vld1.32 {line3a, line3b}, [r12,:256]!
.
.
// compute line0~line3.
.
.
mov sp, r3 // restoring the sp
vpop {q4-q7} // .....and restored after use
bx lr // return

The major drawback of vld and vst compared to vldm and vstm is that the former can load/store only up to two quad registers(four double registers) while the latter can handle up to eight quad registers.(16 double registers)
How expensive are they on Coretex A8? vldm/vstm cost nine cycles (number of q registers+1) while aligned vldm/vstm cost 2 cycles each.

Well, it might seem not worth all the efforts/increased code length for a mere, single cycle saved, but there are other strong benefits.......

to be continued in next part

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Sunday, June 16, 2013

NEON – A Brief Introduction

- What’s NEON – or What’s NEON capable of?

NEON is an advanced SIMD unit from ARM that resides on almost all modern smart phones ( iPhone 3GS, Galaxy S, Nexus One or later)

NEON is capable of computing demanding operations such like colorspace conversions, image processing and so on within a few miliseconds.

In other words, NEON can deal with large amount of data - especially packed ones - so efficiently that even operations that would drive every modern mobile CPU to its knees are almost cakewalks, being executed at near-memcpy speed.

For example, my fully optimized version of YUV420 to BGRA8888 conversion takes less than 10ms for a full-HD image (1920*1080) on an iPhone4(<1Ghz) while an opensource C version takes almost one whole second.

- What’s so great about NEON?

SIMD – processing up to 16 elements at once
on-the-fly packing & unpacking while accessing memory – ideal for processing multimedia data
powerful instructions with built-in saturating, rounding, typecasting – especially well suited for fixed-point arithmetics
direct access to the L2 cache bypassing L1 cache
cache preload – those painful cache-miss penalties are greatly reduced
lots of registers – NEON features 32 64Bit data registers (all of them can be used) while ARM features only 16 32Bit general purpose registers (and only 14 of them can be used). In addition, NEON can (and must) also use ARM registers as address registers(pointers), constants containers, loop counters and so on.
availability – almost all smart phones from 2010 or later feature NEON, ready for use

- NEON deserves better

If you search the web for “ARM NEON” you’ll probably find many negative postings/QnA’s about NEON like :

NEON versions being not much faster or even much slower than their C counterparts
NEON computed results being inaccurate
complicated to get it to work

NEON is a heavy unit with long pipelines. It has to be handled with care. Most beginners aren’t aware of this and put something “unreasonable” in their codes causing pipeline stalls that waste about 12 cycles each time - which is quite a lot when positioned within a loop.

Even worse, it’s more often the compiler that strongly assists these anomalies.

There are NEON intrinsics for compiler toolchains. These are collections of C macros written in inline-assembly that are meant to enable NEON programming in C. It sounds great at first glance, but in reality, there’s way more pain than gain(if at all) at current state :

It simply follows the compiler’s routine job, and part of this preserves and restores ARM registers onto/from stack that cause pipeline stalls on the NEON side
Arithmetics’ priorities change depending on the compiler options, causing results that are completely off the track

There are also people that complain about NEON’s less-than-expected performance with their hand written assembly benchmarks. They are not wrong, just not realistic :

Typically in benchmarks, test routines are called thousands of times. If the data size isn’t large enough, the whole data is read from the L1 cache from the second time on, thus cache miss penalties are gone which mostly isn’t the case in real world applications. Only the C version compiled to ARM codes benefits from this.
If the arithmetics are too simple (a single integer addition per iteration for example), NEON’s gain in performance by computing multiple elements at once is almost nullified by its longer pipelines. There is simply nothing available to fill NEON’s bigger pipeline interlocks without redesigning the iteration itself.

- Any Examples? Resources?

I hope my blog to become the primary source for studies in NEON with upcoming articles/example codes, but until then, try the following excellent tutorials :

Coding for NEON - Part 1: Load and Stores
Coding for NEON - Part 2: Dealing With Leftovers
Coding for NEON - Part 3: Matrix Multiplication
Coding for NEON - Part 4: Shifting Left and Right
Coding for NEON - Part 5: Rearranging Vectors

I myself started with the tutorials above. Unfortunately, over one year passed since the last part is posted. I’ll probably take the baton with my upcoming articles – Stay tuned!

Below is another excellent article on optimizing NEON that shows how large the performance gain can be, and/or how problematic intrinsics can get :

ARM NEON Optimization. An Example

Last but not least, the necessary reference manuals, listing all NEON instructions and their cycle timings :

NEON and VFP instruction summary (PDF)
Instruction Cycle Timing (PDF, Coretex A8)
Instruction Cycle Timing (PDF, Coretex A9)

See you next time - with my first example codes

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.