- What’s NEON – or What’s NEON capable of?
NEON is an advanced SIMD unit from ARM that is found in almost all modern smartphones (iPhone 3GS, Galaxy S, Nexus One or later). It can handle demanding operations such as colorspace conversion and image processing within a few milliseconds.
In other words, NEON handles large amounts of data, especially packed data, so efficiently that even operations that would bring any modern mobile CPU to its knees become cakewalks, executing at near-memcpy speed.
For example, my fully optimized YUV420 to BGRA8888 conversion takes less than 10 ms for a full-HD image (1920x1080) on an iPhone 4 (<1 GHz), while an open-source C version takes almost a whole second.
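To give a feel for the per-pixel work involved, here is a minimal scalar C sketch of such a conversion. It is not the optimized routine mentioned above. The planar I420 layout, the even image dimensions, the function names and the widely used BT.601 video-range integer coefficients are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Drop the 8 fractional bits and clamp to 0..255.
 * Negative inputs are clamped before shifting, so no signed
 * right shift of a negative value is ever performed. */
static uint8_t descale_clamp_u8(int32_t v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Convert planar YUV420 (assumed I420 layout) to packed BGRA8888.
 * width and height are assumed to be even. Coefficients are the
 * common BT.601 video-range approximation in 8.8 fixed point
 * (an assumption, not taken from the article). */
void yuv420_to_bgra_scalar(const uint8_t *y_plane, const uint8_t *u_plane,
                           const uint8_t *v_plane, uint8_t *bgra,
                           size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row) {
        for (size_t col = 0; col < width; ++col) {
            /* One chroma sample is shared by a 2x2 block of pixels. */
            size_t ci = (row / 2) * (width / 2) + (col / 2);
            int32_t c = (int32_t)y_plane[row * width + col] - 16;
            int32_t d = (int32_t)u_plane[ci] - 128;
            int32_t e = (int32_t)v_plane[ci] - 128;

            uint8_t *px = bgra + 4 * (row * width + col);
            px[0] = descale_clamp_u8(298 * c + 516 * d + 128);           /* B */
            px[1] = descale_clamp_u8(298 * c - 100 * d - 208 * e + 128); /* G */
            px[2] = descale_clamp_u8(298 * c + 409 * e + 128);           /* R */
            px[3] = 255;                                                 /* A */
        }
    }
}
```

Every pixel needs several multiplications, a rounding shift and three clamps. That is the kind of work NEON's saturating and narrowing instructions can do for many pixels per instruction.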
- What’s so great about NEON?
- SIMD – processing up to 16 elements at once
- on-the-fly packing & unpacking while accessing memory – ideal for processing multimedia data
- powerful instructions with built-in saturation, rounding and type conversion – especially well suited for fixed-point arithmetic
- direct access to the L2 cache bypassing L1 cache
- cache preload – those painful cache-miss penalties are greatly reduced
- lots of registers – NEON has 32 64-bit data registers (all of them usable), while ARM has only 16 32-bit general-purpose registers (of which only 14 are usable). In addition, NEON can (and must) use ARM registers as address registers (pointers), constant containers, loop counters and so on.
- availability – almost all smartphones from 2010 or later feature NEON, ready for use
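To make "built-in saturating and rounding" concrete, the following portable C sketch models what two NEON instructions do to a single lane: an unsigned saturating add (as in VQADD.U8) and a rounding shift right with saturating narrowing to an unsigned byte (as in VQRSHRUN.S16). The function names are hypothetical, and this is a scalar model for illustration, not NEON code. The hardware applies the same operation to every lane of a register at once.

```c
#include <stdint.h>

/* Per-lane model of an unsigned saturating 8-bit add:
 * the result sticks at 255 instead of wrapping around. */
uint8_t model_sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Per-lane model of a rounding shift right followed by a
 * saturating narrow from signed 16-bit to unsigned 8-bit.
 * shift is assumed to be in the range 1..8. */
uint8_t model_round_shift_narrow_u8(int16_t x, unsigned shift)
{
    /* Adding half of the divisor turns truncation into rounding. */
    int32_t biased = (int32_t)x + (1 << (shift - 1));
    if (biased < 0)
        return 0;                 /* negative results saturate to 0 */
    int32_t r = biased >> shift;
    return (uint8_t)(r > 255 ? 255 : r);
}
```

In plain C each of these steps costs compares and branches (or extra masking) per element, which is why fixed-point code benefits so much from having them folded into a single instruction.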
- NEON deserves better
If you search the web for “ARM NEON”, you’ll probably find many negative posts and Q&As about NEON, such as:
- NEON versions being not much faster, or even much slower, than their C counterparts
- NEON computed results being inaccurate
- NEON being complicated to get working
NEON is a heavy unit with long pipelines, and it has to be handled with care. Most beginners aren’t aware of this and put something “unreasonable” in their code, causing pipeline stalls that waste about 12 cycles each time, which is quite a lot inside a loop.
Even worse, more often than not it is the compiler that introduces these stalls.
There are NEON intrinsics for compiler toolchains: C-level functions and macros (declared in arm_neon.h) that map to NEON instructions and are meant to enable NEON programming in C. It sounds great at first glance, but in the current state there’s way more pain than gain (if any):
- The compiler still does its routine job, part of which is saving and restoring ARM registers to and from the stack, and this causes pipeline stalls on the NEON side
- The order in which arithmetic is evaluated changes depending on the compiler options, producing results that are completely off
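For readers who have not seen intrinsics before, here is a minimal sketch of what such code looks like. The function name and the brightening use case are hypothetical. The intrinsics themselves (vdupq_n_u8, vld1q_u8, vqaddq_u8, vst1q_u8) come from arm_neon.h. On a target without NEON the sketch falls back to an equivalent scalar loop, so it compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Add `delta` to each of n bytes with saturation (dst may equal src). */
void add_saturate_u8(uint8_t *dst, const uint8_t *src, uint8_t delta, size_t n)
{
    size_t i = 0;
#if SKETCH_HAVE_NEON
    uint8x16_t vdelta = vdupq_n_u8(delta);        /* broadcast to 16 lanes */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);         /* load 16 bytes */
        vst1q_u8(dst + i, vqaddq_u8(v, vdelta));  /* saturating add, store */
    }
#endif
    /* Scalar path: the leftovers, or everything when NEON is absent. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)src[i] + delta;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The source reads nicely, but register allocation and instruction scheduling are left entirely to the compiler, which is where the problems listed above come from. Inspecting the generated assembly is the only way to know what a given toolchain made of it.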
There are also people who complain about NEON’s less-than-expected performance in their hand-written assembly benchmarks. They are not wrong, just not realistic:
- Typically in benchmarks, test routines are called thousands of times. If the data set isn’t large enough, all of it is read from the L1 cache from the second call on, so the cache-miss penalties disappear, which is rarely the case in real-world applications. Only the C version compiled to ARM code benefits from this.
- If the arithmetic is too simple (a single integer addition per iteration, for example), NEON’s performance gain from computing multiple elements at once is almost nullified by its longer pipelines. There is simply nothing available to fill NEON’s longer pipeline interlocks without redesigning the iteration itself.
- Any Examples? Resources?
I hope my blog will become the primary source for learning NEON with upcoming articles and example code, but until then, try the following excellent tutorials:
Coding for NEON - Part 1: Load and Stores
Coding for NEON - Part 2: Dealing With Leftovers
Coding for NEON - Part 3: Matrix Multiplication
Coding for NEON - Part 4: Shifting Left and Right
Coding for NEON - Part 5: Rearranging Vectors
I myself started with the tutorials above. Unfortunately, more than a year has passed since the last part was posted. I’ll probably take up the baton with my upcoming articles – stay tuned!
Below is another excellent article on NEON optimization that shows how large the performance gain can be, and how problematic intrinsics can get:
ARM NEON Optimization. An Example
Last but not least, the necessary reference manuals, listing all NEON instructions and their cycle timings:
NEON and VFP instruction summary (PDF)
Instruction Cycle Timing (PDF, Cortex-A8)
Instruction Cycle Timing (PDF, Cortex-A9)
See you next time, with my first code examples.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
I would like to write a 64-bit xorshift random number generator using NEON. At the moment I am starting with the ARM integer instruction set. I basically need to do three 64-bit shifts. Does NEON have 64-bit shift instructions, or even 128-bit ones? I know I can do what I want in about seven 32-bit ARM integer instructions. It would be great if I could just use three NEON instructions instead. 128-bit shift instructions would allow a very high quality random number generator.