Saturday, July 13, 2013

NEON Tutorial Part 1 - A Simple Function

No "Hello World" here

I don't intend to start this tutorial with the eighty-millionth "Hello World" example. It's boring.

What about some elementary-school stuff instead? A very simple function, literally:

f(x) = coefficient * x + intercept

Too simple? All right, let's extend it with a division, rounding, and saturation:

void sfC(unsigned short * pDst, short * pSrc, short coeff, short intercept, unsigned int count)
{
 int res;
 do {
  res = *pSrc++ * coeff + intercept;
  if (res & 0x80) res += 256;       // round: same as adding 128 before the shift
  res >>= 8;                        // divide by 256
  if (res < 0) res = 0;             // saturate to the unsigned 16-bit range
  if (res > 0xffff) res = 0xffff;
  *pDst++ = (unsigned short) res;
 } while (--count);                 // note: count must be nonzero
}

The C function above is quite self-explanatory: it multiplies, adds, divides by 256 (a right shift) with rounding, and saturates the result to the range 0 to 65535 (unsigned 16-bit).

Say Hello to NEON

Below is the corresponding code in Apple LLVM 4.2 assembly syntax:

// void sfNEONbad(unsigned short * pDst, short * pSrc, short coeff, short intercept, unsigned int count);
.code 32
.globl _sfNEONbad
.private_extern _sfNEONbad

pDst    .req    r0
pSrc    .req    r1
coeff   .req    r2
interc  .req    r3
count   .req    r12

.align 2
_sfNEONbad:
.cfi_startproc
 ldr     count, [sp]
 vdup.16 d0, coeff
 vdup.16 d1, interc

 1:
  vld1.16   {d2}, [pSrc]!
  vmull.s16  q1, d2, d0[0]
  vaddw.s16  q1, q1, d1
  vqrshrun.s32 d2, q1, #8
  vst1.16   {d2}, [pDst]!
  subs   count, count, #4
 bgt     1b
bx      lr
.cfi_endproc

vdup.16 d0, coeff - fill all four 16-bit lanes of d0 with the value of coeff
vdup.16 d1, interc - fill all four 16-bit lanes of d1 with the value of interc
vld1.16 {d2}, [pSrc]! - load planar data from pSrc, equivalent to "d2 = *pSrc++;" in C

NEON can, and must, use ARM registers as pointers, but it cannot use them as arithmetic operands. Therefore, both coeff and interc have to be copied into NEON registers first.
All NEON instructions start with a v (for vector), which makes them easy to distinguish from ARM instructions.

The Long Model and Vector-Scalar Operation
vmull.s16 q1, d2, d0[0] - multiply all 4 signed int16 values in the vector d2 by the signed int16 value in the first lane of d0; the results are signed int32. (the trailing 'l' stands for long)

(Figure: NEON vector-by-scalar multiplication, long)

Now it's SIMD time! d2 is a vector (short array) of 8 bytes: a D register can hold 8 bytes, 4 int16 values, or 2 int32 (or float) values.
NEON computes all the lanes with a single instruction. With 16-bit data, that's 4 at once.
And since the results are 4 int32 values, they don't fit in a D register; the destination therefore has to be a Q register.

The Wide Model and Vector-Vector Operation

vaddw.s16 q1, q1, d1 - add the 4 int16 values of d1 to the 4 int32 values of q1, lane by lane ('w' for wide)
(Figure: NEON vector-by-vector addition, wide)

Now we have to add 16-bit values to 32-bit ones - a D register to a Q register. This is called the Wide Model.
A vector-scalar operation would be the natural fit here, since we add the same value (interc) to all 4 int32 lanes, but multiplications are the only vector-scalar operations NEON supports. Therefore, we have to stick to vector-vector addition.

The Narrow Model and The King of The Hill

vqrshrun.s32 d2, q1, #8
q - saturate
r - round
shr - shift right
u - convert to unsigned
n - narrow

You see? A single instruction that does this much: it shifts right, rounds, saturates, converts to unsigned, and narrows the results to half their width.
Just look at the C code: 4 lines with 3 ifs were necessary.
vqrshrun is probably the most efficient and useful NEON instruction. (q, r, u, and n are all optional modifiers)

Store and Loop

vst1.16 {d2}, [pDst]! - store planar data to pDst, equivalent to "*pDst++ = d2;" in C

subs count, count, #4 - subtract, equivalent to "count -= 4;" in C, and update the CPSR (suffix 's')
The CPSR (Current Program Status Register) is a special register for flow control.
It gets updated by the result of any arithmetic-logic instruction carrying an 's' suffix.
Subsequent instructions can then execute conditionally, meaning they only run when the attached condition code is met.

bgt 1b - Branch (jump) to label 1 backward if Greater Than zero

It's the 'b' (branch) instruction with the condition code 'gt'.
In this example, we compute 4 values per iteration. count is decreased by 4 (subs), and once it reaches zero or below, the 'gt' condition is no longer met and the branch back to label 1 is not taken. Thus, the loop terminates.

bx lr - Branch to the address in lr (the link register), switching mode (ARM<->Thumb) if necessary; equivalent to "return;" in C

Great Expectations

Much easier than you thought, right? If you didn't fully understand the CPSR and condition codes, it doesn't matter much; subs, bgt, and bx are pretty much routine boilerplate.

Well, we are finished here. We now compute 4 values per iteration, and with vqrshrun being so powerful, we can expect it to be more than 4 times as fast as the C version, right?

It's time for benchmarking:
device: iPod touch 4th gen (Cortex-A8)
OS: iOS 6.1.3
data size: 1920*1080*2 bytes
iterations: 1024

function     execution time   speed factor
sfC          1315126107       1
sfNEONbad    337117053        3.901

Wow, we did it. It's actually almost 4 times as fast.
You must be satisfied. (and probably your boss as well)

But there is something very disturbing here: why is the function named sfNEONbad?
Because it's actually very bad.

Let's Count

One of the best things about NEON is that almost all of its instructions consume only 1 cycle - but at a price called latency.
(Figure: NEON cycle counting)
A screen capture from the excellent Cycle Counter for Cortex-A8 by Pulsar, showing how many cycles our loop body consumes.

Both the vld and vst instructions consume 2 cycles each, and the others only 1.
Not bad, considering that even a multiply takes a single cycle - but why does the whole loop consume 15 cycles, as the counter shows? That's due to the latency, marked in red.
Even though vmull issues in just one cycle, that doesn't mean its result is available immediately. It takes time.
The next instruction, vaddw, needs that result, and since it isn't available yet, vaddw has to wait 8 half-cycles (q1:8).
The same applies to vqrshrun and vst. (5 and 7, respectively)
This kind of delay or wait state is called a pipeline interlock, or hazard.

An iteration takes 15 cycles and 10 of them are wasted cycles? You must be kidding.
Can we fix this? Of course. It's actually very simple.

The Way It's Supposed To Be

You cannot simply remove interlocks - that's impossible. Instead, you fill the gaps with other work.

.code 32
.globl _sfNEON
.private_extern _sfNEON

pDst    .req    r0
pSrc    .req    r1
coeff   .req    r2
interc  .req    r3
count   .req    r12

.align 2
_sfNEON:
.cfi_startproc
 ldr     count, [sp]
 vdup.16 d0, r2 //coeff
 vdup.16 d1, r3 //interc

 1:
  vld1.16  {d28-d31}, [pSrc,:128]!
  vmull.s16   q12, d28, d0[0]
  vmull.s16   q13, d29, d0[0]
  vmull.s16   q14, d30, d0[0]
  vmull.s16   q15, d31, d0[0]

  vaddw.s16   q12, q12, d1
  vaddw.s16   q13, q13, d1
  vaddw.s16   q14, q14, d1
  vaddw.s16   q15, q15, d1

  vqrshrun.s32    d24, q12, #8
  vqrshrun.s32    d25, q13, #8
  vqrshrun.s32    d26, q14, #8
  vqrshrun.s32    d27, q15, #8

  subs    count, count, #16
  vst1.16  {d24-d27}, [pDst,:128]!
 bgt     1b

 bx      lr
.cfi_endproc

As you can see, we utilize more registers, loading 4 of them at once.
Then we apply each kind of operation across all the registers before moving on to the next kind.
This way, each dependent operation starts later, so few to no cycles are wasted.
Let's count again:
(Figure: NEON optimization by instruction scheduling)
Much better now. The interlocks are gone or heavily reduced.
One iteration now takes 18 cycles while dealing with 4 times as many bytes! Can we expect the same level of performance boost this time? Let's benchmark again.

Sorry to disappoint you


function     execution time   speed factor
sfC          1315126107       1
sfNEONbad    337117053        3.901
sfNEON       266787748        4.929

The "proper" version is only about 26% faster than the "bad" version, and there are reasons.
In our benchmark, we are dealing with a huge chunk of data: 1920*1080*2 = 4,147,200 bytes. Since we also write the results to a different memory area, that doubles to 8,294,400 bytes, roughly 7.9MB.
What does this mean? A 100% cache miss rate - guaranteed, since even the top-notch Cortex-A15 can be equipped with at most 4MB of L2 cache.
How painful is a cache miss? About 20 memory cycles, varying with the architecture/SoC - and 20 memory cycles are usually 40 CPU cycles!

Something has to be done here, definitely.

The Power Of NEON

In our example we are reading data from memory sequentially. What if we could tell the CPU to read ahead while NEON is computing? Are there instructions for this purpose?
The answer is yes. - PLD - preload data.

.code 32
.globl _sfNEON
.private_extern _sfNEON

pDst    .req    r0
pSrc    .req    r1
coeff   .req    r2
interc  .req    r3
count   .req    r12

.align 2
_sfNEON:
.cfi_startproc
 pld  [pSrc]
 pld  [pSrc, #64*1]
 pld  [pSrc, #64*2]
 pld  [pSrc, #64*3]
 ldr     count, [sp]
 vdup.16 d0, r2 //coeff
 vdup.16 d1, r3 //interc

 1:
  pld     [pSrc, #64*4]
  vld1.16  {d28-d31}, [pSrc,:128]!
  vmull.s16   q12, d28, d0[0]
  vmull.s16   q13, d29, d0[0]
  vmull.s16   q14, d30, d0[0]
  vmull.s16   q15, d31, d0[0]

  vaddw.s16   q12, q12, d1
  vaddw.s16   q13, q13, d1
  vaddw.s16   q14, q14, d1
  vaddw.s16   q15, q15, d1

  vqrshrun.s32    d24, q12, #8
  vqrshrun.s32    d25, q13, #8
  vqrshrun.s32    d26, q14, #8
  vqrshrun.s32    d27, q15, #8

  subs    count, count, #16
  vst1.16  {d24-d27}, [pDst,:128]!
 bgt     1b

 bx      lr
.cfi_endproc

Not much has changed here - we just added a few PLD instructions.
The cache controller doesn't manage the cache byte by byte but in lines, each line consisting of 64 bytes.
ARM recommends reading three lines ahead, but since our loop is relatively short, we read 4 lines ahead (pld [pSrc, #64*4]). On to the next benchmark!

function        execution time   speed factor
sfC             1315126107       1
sfNEONbad       337117053        3.901
sfNEON          266787748        4.929
sfNEON preload  110085265        11.946

Impressed? That's the true power of NEON.

I also wrote a hand-optimized ARM assembly version (attached at the end), with and without PLD, and the chart below shows how each version fares:

(Figure: NEON performance benchmark chart)
Compilers are good enough? Just beat the ARM assembly version first.

Power Consumption

For system developers, reducing power consumption is probably even more important than performance. App developers should care as well, since it carries over to the usability of their products.
How much power does NEON consume? ARM claims that NEON doesn't consume more power than the integer unit. If we trust ARM, using NEON in addition to the integer core will at most double the CPU-side power consumption.
Power consumption also depends heavily on the number of instructions executed. Let's count again:


LBB0_1:                                 @ =>This Inner Loop Header: Depth=1
 ldrsh r12, [r1], #2
 mla r12, r12, r2, r3
Ltmp1:
 tst.w r12, #128
 it ne
 addne.w r12, r12, #256
 asr.w lr, r12, #8
Ltmp2:
 cmp.w lr, #0
 mov.w lr, #0
Ltmp3:
 it ge
 asrge.w lr, r12, #8
 cmp.w lr, #65536
 it ge
 movge.w lr, #-1
Ltmp4:
 subs.w r9, r9, #1
Ltmp5:
 strh lr, [r0], #2
 bne LBB0_1
Above is the disassembly of the inner loop of the C function (Thumb mode).
It contains 16 instructions.

function        instructions per iteration   instructions per data   power consumption factor
sfC             16                           16                      1
sfARM preload   57                           7.125                   0.445
sfNEON preload  17                           1.0625                  0.0664*2 = 0.1328

As you can see, even if you double the NEON version's figure, it remains the lowest by a huge margin.
If you can handle NEON properly, you get an immense performance boost on top of the reduction in power consumption.

Next time, I'll cover further optimization of the code using more practical, real-world examples.

See you next time!

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.


Key Contents

  • Vector-Scalar Operation
  • Vector-Vector Operation
  • The Long Model
  • The Wide Model
  • The Narrow Model
  • Instruction Scheduling

Bonus Stuff

Below is the hand-optimized ARM assembly version used in the benchmark.

.globl _sfARM
.private_extern _sfARM

pDst    .req    r0
pSrc    .req    r1
coeff   .req    r2
interc  .req    r3
count   .req    r12

.align 2
_sfARM:
.cfi_startproc
 pld  [r1, #64*0]
 pld  [r1, #64*1]
 pld  [r1, #64*2]
 pld  [r1, #64*3]
 pld  [r1, #64*4]

 ldr     count, [sp]
 push {r4-r10, lr}

 ldmia r1!, {r8-r10, lr}
 1:
  smulbb r4, r8, r2
  pld  [r1, #64*4]
  smultb r5, r8, r2
  smulbb r6, r9, r2
  smultb r7, r9, r2
  smulbb r8, r10, r2
  add  r4, r4, r3
  smultb r9, r10, r2
  add  r5, r5, r3
  smulbb r10, lr, r2
  add  r6, r6, r3
  smultb lr, lr, r2

   movs r4, r4, asr #8
  add  r7, r7, r3
   addcs r4, r4, #1
  add  r8, r8, r3
   movmi r4, #0
  add  r9, r9, r3
   movs r5, r5, asr #8
  add  r10, r10, r3
   addcs r5, r5, #1
  add  lr, lr, r3
   movmi r5, #0
   movs r6, r6, asr #8
  usat r4, #16, r4
   addcs r6, r6, #1
  usat r5, #16, r5 
   movmi r6, #0
   movs r7, r7, asr #8
   addcs r7, r7, #1
  orr  r4, r4, r5, lsl #16
   movmi r7, #0
   movs r8, r8, asr #8
   addcs r8, r8, #1
  usat r6, #16, r6
   movmi r8, #0
   movs r9, r9, asr #8
   addcs r9, r9, #1
  usat r7, #16, r7
   movmi r9, #0
   movs r10, r10, asr #8
   addcs r10, r10, #1
  usat r8, #16, r8
   movmi r10, #0
   movs lr, lr, asr #8
   addcs lr, lr, #1
  usat r9, #16, r9
   movmi lr, #0
  subs r12, r12, #8

  usat r10, #16, r10
  orr  r5, r6, r7, lsl #16
  usat lr, #16, lr

  orr  r6, r8, r9, lsl #16
  orr  r7, r10, lr, lsl #16

  ldmiagt r1!, {r8-r10, lr}
  stmia r0!, {r4-r7}
 bgt  1b

 pop  {r4-r10, pc}
.cfi_endproc
