No "Hello World" here
I don't intend starting the tutorial with the eighty millionth "Hello World" example. It's boring.
What about some elementary school stuff instead? A very simple function, literally :
f(x) = coefficient * x + intercept
Too simple? D'accord, let's extend this with a division, rounding, and saturation :
void sfC(unsigned short * pDst, short * pSrc, short coeff, short intercept, unsigned int count)
{
int res;
do {
res = *pSrc++ * coeff + intercept;
if (res & 0x80) res += 256;
res >>= 8;
if (res < 0) res = 0;
if (res>0xffff) res = 0xffff;
*pDst++ = (unsigned short) res;
} while (--count);
}
The C function above is quite self explanatory. It does multiplication, addition, division (right shift) with rounding, and saturates to 0 to 65535 (16bit).
Say Hello to NEON
Below is the corresponding code in Apple LLVM 4.2 assembly syntax :
// void sfNEONbad(unsigned short * pDst, short * pSrc, short coeff, short intercept, unsigned int count);
.code 32
.globl _sfNEONbad
.private_extern _sfNEONbad
pDst .req r0
pSrc .req r1
coeff .req r2
interc .req r3
count .req r12
.align 2
_sfNEONbad:
.cfi_startproc
ldr count, [sp]
vdup.16 d0, coeff
vdup.16 d1, interc
1:
vld1.16 {d2}, [pSrc]!
vmull.s16 q1, d2, d0[0]
vaddw.s16 q1, q1, d1
vqrshrun.s32 d2, q1, #8
vst1.16 {d2}, [pDst]!
subs count, count, #4
bgt 1b
bx lr
.cfi_endproc
vdup.16 d0, coeff - duplicate all four 16bit lanes of d0 with the value of coeff
vdup.16 d0, interc - duplicate all four 16bit lanes of d0 with the value of interc
vld1.16 {d2}, [pSrc]! - load planar data from pSrc, equivalent to "d2 = *pSrc++;" in C
NEON can and must use ARM registers as pointers, but it cannot use them for arithmetics. Therefore, both coeff and interc have to be copied to NEON's registers first.
All NEON instructions start with a v (for vector) and are easily distinguished from ARM's thereby.
The Long Model and Vector-Scalar Operation
vmull.s16 q1, d2, d0[0] - multiply all 4 signed int16 data in the vector d2 with the signed int16 data in the first lane of d0, the result is signed int32. (the last l for long)
Now it's SIMD time! d2 is a vector (short array) of 8byte size which means that D registers can contain up to 8 bytes, 4 int16, or 2 int32 (of float)
NEON can compute all the data in the lanes with a single instruction. With 16bit data, it's 4 at once.
And since the results are 4 int32 data, they don't fit to a D register. The destination has to be a Q register therefore.
The Wide Model and Vector-Vector Operation
vaddw.s16 q1, q1, d1 - add 4 int32 data with 4 int16 data in matching lane (w for wide)
Now we have to add 32bit data with 16bit ones - a Q register with a D register. It's called the Wide Model.
Although we don't have to do a vector-vector operation here since we want to add the same value (interc) to the 4 int32 data, multiplications are the only vector-scalar operations supported by NEON. Therefore, we have to stick to vector-vector addition.
The Narrow Model and The King of The Hill
vqrshrun.s32 d2, q1, #8
q - saturate
r - round
shr - shift right
u - convert to unsigned
n - narrow
You see? A single instruction that does this much. It does shift to right, round, saturate, convert to unsigned and shrink down the results to half.
Just look at the C code. 4 lines with 3 if's were necessary.
vqrshrun is probably the most efficient and useful NEON instruction. (q, r, u, n are all optional flags)
Store and Loop
vst1.16 {d2}, [pDst]! - store planar data to pDst, equivalent to "*pDst++ = d2;" in C
subs count, count, #4 - substract, equivalent to "count -= 4;" in C, and updates the CPSR (suffix s)
CPSR(Current Program Status Register) is a special register for flow control.
It gets updated depending on the result of an arithmetic-logic instruction with an 's' suffix attached.
And then, following instruction(s) can be executed conditionally, meaning they are only executed when the condition given with the condition code attached is met.
bgt 1b - Branch(jump) to label 1 backward if Greater Than 0
It's the 'b' (branch) instruction with the condition code 'gt'.
In this example, we compute 4 data per iteration. count gets decreased by 4 (subs), and if count becomes less than zero, the 'gt' condition isn't met and the branch to 1 backward isn't executed. Thus, the loop is terminated.
bx lr - Branch to the address lr(link register) contains changing mode(ARM<->Thumb) if necessary, equivalent to "return;" in C
Great Expectations
Much easier than previously thought, right? If you didn't understand CPSR and condition codes very well, it doesn't matter much. subs, bgt, and bx are pretty much the routine job.
Well we are finished here. We now compute 4 data per iteration and vqrshrun being so powerful we can expect it to be more than 4 times as fast than the C version, right?
It's time for benchmarking :
device : iPod touch 4th gen (Coretex-A8)
OS : iOS 6.13
data size : 1920*1080*2 bytes
iteration : 1024
function |
execution time |
speed factor |
sfC |
1315126107 |
1 |
sfNEONbad |
337117053 |
3.901 |
Wow, we did it. It's actually almost 4 times as fast.
You must be satisfied. (and probably your boss as well)
But there are something very disturbing here : why is the function's name sfNEONbad?
It's actually very bad.
Let's Count
One of the best thing about NEON is that almost all its instructions consume only 1 cycle, but at a price called latency.
Both vld and vst instructions consume 2 cycles each, and the others only 1.
Not bad considering even multiply consumes a single cycle, but why does whole loop consume 15 cycles as indicated left? That's the due to the latency marked in red.
Even though vmull consumes just one cycle, in doesn't mean that the result is available immediately. It takes time.
The next instruction, vaddw needs it, and since it's not available at the time of execution, addw has to wait 8 half-cycles (q1:8).
The same applies to vqrshrun and vst as well. (5 and 7 each)
This kind of delay or wait state is called pipeline interlock or hazard.
An iteration takes 15 cycles and 10 of them are wasted cycles? You must be kidding.
Can we fix this? Of course. It's actually very simple.
The Way It's Supposed To Be
You cannot simply remove interlocks - it's impossible. Instead you do something else filling the gap.
.code 32
.globl _sfNEONbad
.private_extern _sfNEONbad
pDst .req r0
pSrc .req r1
coeff .req r2
interc .req r3
count .req r12
.align 2
_sfNEON:
.cfi_startproc
ldr count, [sp]
vdup.16 d0, r2 //coeff
vdup.16 d1, r3 //value
1:
vld1.16 {d28-d31}, [pSrc,:128]!
vmull.s16 q12, d28, d0[0]
vmull.s16 q13, d29, d0[0]
vmull.s16 q14, d30, d0[0]
vmull.s16 q15, d31, d0[0]
vaddw.s16 q12, q12, d1
vaddw.s16 q13, q13, d1
vaddw.s16 q14, q14, d1
vaddw.s16 q15, q15, d1
vqrshrun.s32 d24, q12, #8
vqrshrun.s32 d25, q13, #8
vqrshrun.s32 d26, q14, #8
vqrshrun.s32 d27, q15, #8
subs count, count, #16
vst1.16 {d24-d27}, [pDst,:128]!
bgt 1b
bx lr
.cfi_endproc
As you can see, we utilize more registers, loading 4 registers at once.
Then we do the same kind of operation through the registers.
This way, the next kind of operation is delayed, but less to none cycles are wasted.
Lets count again :
Much better now. The interlocks are gone or heavily reduced.
One iteration takes 18 cycles now dealing with 4 times as many bytes! Can we expect the same level of performance boost this time? Let's benchmark again
Sorry disappointing you
function |
execution time |
speed factor |
sfC |
1315126107 |
1 |
sfNEONbad |
337117053 |
3.901 |
sfNEON |
266787748 |
4.929 |
The "proper" version is only about 26% faster than the "bad" version, and there are reasons.
In our benchmark, we are dealing with a huge chunk of data : 1920*1080*2 = 4,147,200 bytes. If you consider that we write the results to a different memory area, it doubles to 8,294,400bytes = 8.1MB.
What does this mean? 100% cache miss - guaranteed since even the top notch Coretex A15 can be equipped only up to 4MB L2 cache.
How painful is a cache miss? About 20 memory cycles, varying depending on the architecture/SoC. 20 memory cycles are usually 40 CPU cycles!
Something has to be done here, definitely.
The Power Of NEON
In our example we are reading data from memory sequentially. What if we could tell the CPU to read ahead while NEON is computing? Are there instructions for this purpose?
The answer is yes. - PLD - preload data.
.code 32
.globl _sfNEON
.private_extern _sfNEON
pDst .req r0
pSrc .req r1
coeff .req r2
interc .req r3
count .req r12
.align 2
_sfNEON:
.cfi_startproc
pld [pSrc]
pld [pSrc, #64*1]
pld [pSrc, #64*2]
pld [pSrc, #64*3]
ldr count, [sp]
vdup.16 d0, r2 //coeff
vdup.16 d1, r3 //value
1:
pld [pSrc, #64*4]
vld1.16 {d28-d31}, [pSrc,:128]!
vmull.s16 q12, d28, d0[0]
vmull.s16 q13, d29, d0[0]
vmull.s16 q14, d30, d0[0]
vmull.s16 q15, d31, d0[0]
vaddw.s16 q12, q12, d1
vaddw.s16 q13, q13, d1
vaddw.s16 q14, q14, d1
vaddw.s16 q15, q15, d1
vqrshrun.s32 d24, q12, #8
vqrshrun.s32 d25, q13, #8
vqrshrun.s32 d26, q14, #8
vqrshrun.s32 d27, q15, #8
subs count, count, #16
vst1.16 {d24-d27}, [pDst,:128]!
bgt 1b
bx lr
.cfi_endproc
Not much changed here, just adding a few PLD instructions
The cache controller doesn't manage the cache byte for byte but in lines, each line consisting of 64bytes.
ARM recommends to read three lines ahead, but since our code is relatively short, we will read 4 lines ahead. (pld[pSrc, #64*4]) On to the next benchmark!
function |
execution time |
speed factor |
sfC |
1315126107 |
1 |
sfNEONbad |
337117053 |
3.901 |
sfNEON |
266787748 |
4.929 |
sfNEON preload |
110085265 |
11.946 |
Impressed? That't the true power of NEON.
I also wrote a hand optimized ARM assembly version(attached at the end) with and without PLD, and the chart below shows how each version fares :
|
Compilers are good enough? Just beat the ARM assembly version first. |
Power Consumption
For system developers, reducing power consumption is probably more important than performance. App developers should care about this as well since it would carry to the usability of their products.
How much power does NEON consume? ARM claims that NEON doesn't consume more power than the integer unit. If we can trust ARM, using NEON in addition to the integer core(ARM) will consume twice the power at max on the CPU side.
The power consumption heavily depends on the number of instructions executed. Let's count again
LBB0_1: @ =>This Inner Loop Header: Depth=1
ldrsh r12, [r1], #2
mla r12, r12, r2, r3
Ltmp1:
tst.w r12, #128
it ne
addne.w r12, r12, #256
asr.w lr, r12, #8
Ltmp2:
cmp.w lr, #0
mov.w lr, #0
Ltmp3:
it ge
asrge.w lr, r12, #8
cmp.w lr, #65536
it ge
movge.w lr, #-1
Ltmp4:
subs.w r9, r9, #1
Ltmp5:
strh lr, [r0], #2
bne LBB0_1
Above is the disassembly of the inner loop part of the C function (Thumb mode)
It contains 16 instructions.
function |
instructions per iteration |
instruction per data |
power consumption factor |
sfC |
16 |
16 |
1 |
sfARM preload |
57 |
7.125 |
0.445 |
sfNEON preload |
16 |
1.0625 |
0.0664*2=0.1328 |
As you can see, even if you double the power consumption result of the NEON version, it's still the least by a huge margin.
If you are capable of handling NEON properly, you will get immense performance boosts in addition to the reduction in power consumption.
Next time, I'll address to further optimizing the codes in a more useful examples in real applications.
See you next time!
This work is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Key Contents
- Vector-Scalar Operation
- Vector-Vector Operation
- The Long Model
- The Wide Model
- The Narrow Model
- Instruction Scheduling
Bonus Stuff
Below is the hand optimized ARM assembly version used in benchmark.
.globl _sfARM
.private_extern _sfARM
pDst .req r0
pSrc .req r1
coeff .req r2
interc .req r3
count .req r12
.align 2
_sfARM:
.cfi_startproc
pld [r1, #64*0]
pld [r1, #64*1]
pld [r1, #64*2]
pld [r1, #64*3]
pld [r1, #64*4]
ldr count, [sp]
push {r4-r10, lr}
ldmia r1!, {r8-r10, lr}
1:
smulbb r4, r8, r2
pld [r1, #64*4]
smultb r5, r8, r2
smulbb r6, r9, r2
smultb r7, r9, r2
smulbb r8, r10, r2
add r4, r4, r3
smultb r9, r10, r2
add r5, r5, r3
smulbb r10, r11, r2
add r6, r6, r3
smultb lr, lr, r2
movs r4, r4, asr #8
add r7, r7, r3
addcs r4, r4, #1
add r8, r8, r3
movmi r4, #0
add r9, r9, r3
movs r5, r5, asr #8
add r10, r10, r3
addcs r5, r5, #1
add lr, lr, r3
movmi r5, #0
movs r6, r6, asr #8
usat r4, #16, r4
addcs r6, r6, #1
usat r5, #16, r5
movmi r6, #0
movs r7, r7, asr #8
addcs r7, r7, #1
orr r4, r4, r5, lsl #16
movmi r7, #0
movs r8, r8, asr #8
addcs r8, r8, #1
usat r6, #16, r6
movmi r8, #0
movs r9, r9, asr #8
addcs r9, r9, #1
usat r7, #16, r7
movmi r9, #0
movs r10, r10, asr #8
addcs r10, r10, #1
usat r8, #16, r8
movmi r10, #0
movs lr, lr, asr #8
addcs lr, lr, #1
usat r9, #16, r9
movmi lr, #0
subs r12, r12, #8
usat r10, #16, r10
orr r5, r6, r7, lsl #16
usat lr, #16, lr
orr r6, r8, r9, lsl #16
orr r7, r10, lr, lsl #16
ldmiagt r1!, {r8-r10, lr}
stmia r0!, {r4-r7}
bgt 1b
pop {r4-r10, pc}
.cfi_endproc