# NEON yuv + scale

Well I still haven't checked the jjmpeg code in but I did end up playing with NEON yuv conversion yesterday, and a bit more today.

The YUV conversion alone for a 680x480 frame on the beagleboard-xm is about 4.3ms, which is ok enough. However with bi-linear scaling to 1024x600 as well it blows out somewhat to 28ms or so - which is definitely too slow.

Right now it's doing somewhat more work than it needs to - it's scaling two rows each time in X so it can feed them into the Y scaling. Perhaps this could be reduced by about half (depending on the scaling going on), which might knock about 10ms off the processing time (assuming no funny cache interactions going on), but that is still too slow to be useful. I'm a bit bored with it now and don't really feel like trying it out just yet.

Maybe the YUV-only conversion might still be a win on Android though - if loading an RGB texture (or an RGB 565 one) is significantly faster than the 3x greyscale textures I'm using now. I need to run some benchmarks there to find out how fast each option is, although that will have to wait for another day.

### yuv to rgb

The YUV conversion code is fairly straightforward in NEON, although I used 2.6 fixed-point for the scaling factors so I could multiply the 8 bit pixel values directly. I didn't check to see if it introduces too many errors to be practical, mind you.

I got the constants and the maths from here.

    @ pre-load constants
    vmov.u8     d28,#90     @ 1.402 * 64
    vmov.u8     d29,#113    @ 1.772 * 64
    vmov.u8     d30,#22     @ 0.34414 * 64
    vmov.u8     d31,#46     @ 0.71414 * 64
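The immediates are just the floating-point coefficients multiplied by 64 and rounded to the nearest integer. A small Python sketch can reproduce them and show how much rounding error each one carries. The names `COEFFS`, `to_fixed` and `relative_error` are made up for illustration and are not part of the NEON code.

```python
# Coefficients as they appear in the comments of the NEON constants.
# The dictionary keys are illustrative names only.
COEFFS = {
    "v_to_r": 1.402,
    "u_to_b": 1.772,
    "u_to_g": 0.34414,
    "v_to_g": 0.71414,
}

FRAC_BITS = 6  # 6 fractional bits, so 1.0 is represented as 64


def to_fixed(value, frac_bits=FRAC_BITS):
    """Round a real coefficient to the nearest fixed-point integer."""
    return int(round(value * (1 << frac_bits)))


def relative_error(value, frac_bits=FRAC_BITS):
    """Relative error introduced by rounding the coefficient."""
    fixed = to_fixed(value, frac_bits)
    return abs(fixed / (1 << frac_bits) - value) / value


FIXED = {name: to_fixed(c) for name, c in COEFFS.items()}
```

`FIXED` comes out as 90, 113, 22 and 46, matching the `vmov` immediates, and each rounded coefficient is within 1% of its real value.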

The main calculation is performed using 2.14 fixed-point signed mathematics, with the Y value being pre-scaled before accumulation. For simplicity the code assumes YUV444, with a separate format conversion pass if required; if that pass is executed per row it should be cheap through L1 cache.

    vld1.u8     { d0, d1 }, [r0]!   @ y is 0-255
    vld1.u8     { d2, d3 }, [r1]!   @ u is to be -128-127
    vld1.u8     { d4, d5 }, [r2]!   @ v is to be -128-127

    vshll.u8    q10,d0,#6           @ y * 64
    vshll.u8    q11,d1,#6

    @ q3 is assumed to hold 128 in every byte (not set up in this fragment)
    vsub.s8     q1,q3               @ u -= 128
    vsub.s8     q2,q3               @ v -= 128

    vmull.s8    q12,d29,d2          @ u * 1.772
    vmull.s8    q13,d29,d3
    vmull.s8    q8,d28,d4           @ v * 1.402
    vmull.s8    q9,d28,d5

    vadd.s16    q12,q10             @ y + 1.772 * u
    vadd.s16    q13,q11
    vadd.s16    q8,q10              @ y + 1.402 * v
    vadd.s16    q9,q11

    vmlsl.s8    q10,d30,d2          @ y -= 0.34414 * u
    vmlsl.s8    q11,d30,d3
    vmlsl.s8    q10,d31,d4          @ y -= 0.71414 * v
    vmlsl.s8    q11,d31,d5

And this neatly leaves the 16 RGB result values in order in q8-q13.
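For checking the NEON output against something simpler, the same arithmetic can be modelled one pixel at a time. This is only a scalar sketch in Python: the function name `yuv_to_rgb_fixed` is hypothetical, and it assumes the constants 90, 113, 22 and 46 from the pre-load step.

```python
def yuv_to_rgb_fixed(y, u, v):
    """Scalar model of the NEON arithmetic for a single pixel.

    y, u and v are bytes in 0..255. The result is an unclamped (r, g, b)
    tuple in the same fixed-point scale as the NEON code, i.e. the 8-bit
    value multiplied by 64.
    """
    u -= 128              # vsub.s8: centre u on zero (-128..127)
    v -= 128              # vsub.s8: centre v on zero (-128..127)
    ys = y << 6           # vshll.u8 #6: y * 64
    r = ys + 90 * v       # vmull.s8 + vadd.s16: y + 1.402 * v
    b = ys + 113 * u      # vmull.s8 + vadd.s16: y + 1.772 * u
    g = ys - 22 * u - 46 * v   # two vmlsl.s8: y - 0.34414*u - 0.71414*v
    return r, g, b
```

For example, `yuv_to_rgb_fixed(76, 85, 255)` gives `(16294, -32, 5)`: the green channel has gone slightly negative, which is why the results cannot simply be shifted back down to bytes.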

They still need to be clamped, which is performed in the 2.14 fixed-point scale (i.e. 16383 == 1.0):

    vmov.u8     q0,#0
    vmov.u16    q1,#16383
    vmax.s16    q8,q0
    vmax.s16    q9,q0
    vmax.s16    q10,q0
    vmax.s16    q11,q0
    vmax.s16    q12,q0
    vmax.s16    q13,q0
    vmin.s16    q8,q1
    vmin.s16    q9,q1
    vmin.s16    q10,q1
    vmin.s16    q11,q1
    vmin.s16    q12,q1
    vmin.s16    q13,q1

Then the fixed point values need to be scaled and converted back to byte:

    vshrn.i16   d16,q8,#6
    vshrn.i16   d17,q9,#6
    vshrn.i16   d18,q10,#6
    vshrn.i16   d19,q11,#6
    vshrn.i16   d20,q12,#6
    vshrn.i16   d21,q13,#6
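The reason the clamp has to come before the narrowing shift is that `vshrn` simply drops the high bits, so out-of-range values wrap around rather than saturate. A scalar Python sketch of the two steps makes this visible; the helper names here are hypothetical.

```python
ONE = 16383  # upper clamp value used by the vmin.s16 step


def clamp_fixed(x):
    """Model of vmax.s16 against 0 followed by vmin.s16 against 16383."""
    return max(0, min(ONE, x))


def narrow_to_byte(x):
    """Model of vshrn.i16 #6: shift right by 6 and keep only the low 8 bits."""
    return (x >> 6) & 0xFF


def to_byte(x):
    """Clamp first, then narrow, as the NEON code does."""
    return narrow_to_byte(clamp_fixed(x))
```

Without the clamp, a slightly negative value such as -32 narrows to 255 and a slightly too large value such as 16384 narrows to 0, whereas `to_byte` returns 0 and 255 respectively.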

And finally re-ordered into 3-byte RGB triplets and written to memory. `vst3.u8` does this directly:

    vst3.u8     { d16,d18,d20 },[r3]!
    vst3.u8     { d17,d19,d21 },[r3]!

`vst4.u8` could also be used to write out RGBx, or the planes kept separate if that is more useful.
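The interleaving that the structured stores perform can be modelled in a few lines of Python, which is handy for building expected output when testing. The function names are hypothetical, and the padding byte used for the RGBx case is an assumption, since nothing above says what the fourth byte should contain.

```python
def interleave_rgb(r, g, b):
    """Model of vst3.u8: three separate planes in, packed R,G,B triplets out."""
    out = bytearray()
    for rv, gv, bv in zip(r, g, b):
        out.extend((rv, gv, bv))
    return bytes(out)


def interleave_rgbx(r, g, b, pad=0xFF):
    """Model of vst4.u8 with a fourth plane of constant padding bytes.

    The padding value is an assumption made for this sketch.
    """
    out = bytearray()
    for rv, gv, bv in zip(r, g, b):
        out.extend((rv, gv, bv, pad))
    return bytes(out)
```

With planes `[1, 2]`, `[3, 4]` and `[5, 6]`, `interleave_rgb` produces the bytes 1, 3, 5, 2, 4, 6.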

Again, perhaps the 8x8 bit multiply is pushing it in terms of accuracy, although it's a fairly simple matter to use shorts instead. If shorts were used then perhaps the saturating doubling multiply returning high half instructions (`vqdmulh`) could be used too, to avoid at least the input and output scaling.
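One way to put a number on the accuracy question is to compare the fixed-point path against a floating-point reference over a grid of inputs. The following Python sketch does that. It assumes the constants 90, 113, 22 and 46, a truncating shift by 6, and a rounded floating-point reference; all the function names are hypothetical.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def rgb_fixed(y, u, v):
    """8x8-bit fixed-point path: constants scaled by 64, truncating shift."""
    u -= 128
    v -= 128
    ys = y << 6
    r = ys + 90 * v
    g = ys - 22 * u - 46 * v
    b = ys + 113 * u
    return tuple(clamp(c, 0, 16383) >> 6 for c in (r, g, b))


def rgb_float(y, u, v):
    """Floating-point reference using the unrounded coefficients."""
    u -= 128
    v -= 128
    r = y + 1.402 * v
    g = y - 0.34414 * u - 0.71414 * v
    b = y + 1.772 * u
    return tuple(clamp(int(round(c)), 0, 255) for c in (r, g, b))


def max_channel_error(step=5):
    """Largest per-channel difference over a coarse grid of YUV inputs."""
    worst = 0
    for y in range(0, 256, step):
        for u in range(0, 256, step):
            for v in range(0, 256, step):
                fixed = rgb_fixed(y, u, v)
                ref = rgb_float(y, u, v)
                worst = max(worst, max(abs(a - b) for a, b in zip(fixed, ref)))
    return worst
```

Calling `max_channel_error()` reports the worst case in 8-bit levels. Since the rounded coefficients are each off by less than 0.007 and the shift truncates rather than rounds, the difference should stay within a couple of levels per channel, but running the check is the way to confirm that.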

### Stop Press

As happens when one is writing this kind of thing I noticed that there is a saturating shift instruction - and as it supports signed input and unsigned output, it looks like it should allow me to remove the clamping code entirely if I read it correctly.

This leads to the following combined clamping and scaling stage:

    vqshrun.s16 d16,q8,#6
    vqshrun.s16 d17,q9,#6
    vqshrun.s16 d18,q10,#6
    vqshrun.s16 d19,q11,#6
    vqshrun.s16 d20,q12,#6
    vqshrun.s16 d21,q13,#6

Which appears to work on my small test case. This drops the test case execution time down to about 3.9ms.

And given that replacing the yuv2rgb step with a memcpy of the same data (all else being equal - i.e. yuv420p to yuv444 conversion) still takes over 3.7ms, that isn't too shabby at all.
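That the saturating shift really is a drop-in replacement for clamp-then-shift can be checked exhaustively, since there are only 65536 possible 16-bit inputs. Here is a scalar Python sketch of both versions, with hypothetical function names, modelling `vqshrun.s16` as an arithmetic shift followed by saturation to 0..255.

```python
def vqshrun_s16(x, shift):
    """Scalar model of vqshrun.s16: signed 16-bit in, unsigned 8-bit out."""
    shifted = x >> shift               # arithmetic shift keeps the sign
    return max(0, min(255, shifted))   # saturate to the unsigned byte range


def clamp_then_shift(x):
    """The two-step version: clamp to 0..16383, then shift right by 6."""
    return max(0, min(16383, x)) >> 6


def first_mismatch():
    """Return the first 16-bit input where the two disagree, or None."""
    for x in range(-32768, 32768):
        if vqshrun_s16(x, 6) != clamp_then_shift(x):
            return x
    return None
```

`first_mismatch()` returns `None`, so under this model the two stages agree for every possible signed 16-bit value.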

### RGB 565

An alternative scaling & output stage (after the clamping) could produce RGB 565 directly (I haven't checked this code works yet):

    vshl.i16    q8,#2           @ red in upper 8 bits
    vshl.i16    q9,#2
    vshl.i16    q10,#2          @ green in upper 8 bits
    vshl.i16    q11,#2
    vshl.i16    q12,#2          @ blue in upper 8 bits
    vshl.i16    q13,#2

    vsri.16     q8,q10,#5       @ insert green
    vsri.16     q9,q11,#5
    vsri.16     q8,q12,#11      @ insert blue
    vsri.16     q9,q13,#11

    vst1.u16    { d16,d17,d18,d19 },[r3]!
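Since the listing is untested, a scalar model of the shift-and-insert sequence gives something to compare it against. This Python sketch assumes the inputs are the clamped values in 0..16383 and models `vsri` as keeping the top bits of the destination while inserting the shifted source below them. The function names are hypothetical.

```python
def pack_rgb565(r14, g14, b14):
    """Scalar model of the vshl / vsri sequence for one pixel.

    Inputs are clamped fixed-point values in 0..16383.
    """
    r = (r14 << 2) & 0xFFFF   # vshl.i16 #2: move the 8 useful bits to the top
    g = (g14 << 2) & 0xFFFF
    b = (b14 << 2) & 0xFFFF
    # vsri.16 #5: keep the top 5 bits of red, insert green >> 5 below them
    out = (r & 0xF800) | (g >> 5)
    # vsri.16 #11: keep the top 11 bits, insert blue >> 11 below them
    out = (out & 0xFFE0) | (b >> 11)
    return out


def pack_rgb565_direct(r8, g8, b8):
    """Conventional RGB 565 packing from 8-bit channels, for comparison."""
    return ((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3)
```

Full red, green and blue come out as `0xF800`, `0x07E0` and `0x001F`, and the result matches `pack_rgb565_direct` applied to the same values shifted down to 8 bits.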