About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

Wednesday, 06 May 2015, 21:48

post google code post

Well nobody bothered to comment about the stuff i removed from google code apart from the one lad or lass who lamented the loss of some javafx demos.

I had comments open+moderated for a few weeks but got hit by spammers a couple of days ago so had to go back to id+moderated. Maybe something got lost in those 500 bits of snot but i don't think so. The spam was quite strange; most mentioned web sites but didn't provide links or weren't very readable so i'm not sure what the point was. Perhaps they're just fishing for open sites or naive moderators they can then exploit. Like the "windows computer department" that keeps calling and calling hoping i'll not tell them to fuck off every time (sigh, no i don't normally say that although i would tonight).

I've still got the subversion clones but i'm not inclined to do much with any of it for the foreseeable future and i'm not even sure if i'm going to continue publishing other bits of code i play with going forward.

Desktop Java, OpenCL, ARM assembly language; these things are just not very common in the Free Software world. Server Java is pretty common but that's just, well, `open sauce' companies sharing costs and not hobbyists. So i think all i'm really doing is providing hints or solutions for some student's homework or help for graduate programmers to keep their jobs. And even then it's so niche it wouldn't be many, if any.

As an example of niche, I was looking up some way to communicate with adobe photoshop that doesn't involve psd format and one thing i came across was someone linking to one of my projects for some unfinished experiments with openraster format - on the first page of results. This happens rarely but still too often. Of course it could just be the search engine trying to be smart and tuning results to the user, which is a somewhat terrifying possibility (implications beyond these types of searches of course). FWIW I came to the conclusion photoshop is just one of those proprietary relics from the past which intentionally refuses to support other formats so its idiot users can continue to be arse-reamed by its inflated price.

It's just a hobby

As a hobby i have no desire to work on larger projects of my own or other established projects in my spare time. Occasionally i'll send in a patch to a project but if they want a bunch of fucking around then yeah, ... naah. In hindsight i somewhat regret how we did it on evolution but i think i've mentioned that before. Neither do i need to solicit work or build a portfolio or just gain experience.

I'm not sure how many hobbyists are around; anyone with remotely close to enough skill seems to be jumping into the wild casinos of app-stores or services and expecting to make billion$ and not just doing it for the fun of it. Some of those left over just seem to be arrogant egotistical fuckwits (and some would probably think the same of me). Same as it ever was I guess.

I suppose I will continue to code-drop even if it's just out of habit.

For another hobby I made kumquat marmalade on the weekend. Spent a couple of hours in the sun slicing the tiny fruit and extracting seeds (2-3 cups worth of seeds) and cooked it the next day. Unfortunately after all that effort it looks like it wasn't cooked quite enough and it probably won't set - it's a bit runny but at least it tastes good. Not sure what i'll do with 2-odd litres of the stuff though.

Tagged beagle, code, imagez, jjmpeg, mediaz, parallella, pdfz, puppybits, readerz, socles, videoz.
Friday, 13 March 2015, 13:25

ahh shit, google code is being scrapped

Joy. I guess it's not making google enough money.

I'll make a copy of all my projects and then delete them. Probably sooner rather than later (probably immediately because i'd rather get it out of the way, i have a script going now). I'm keeping all the commit history for now although i never find it particularly useful. Links in old posts may break.

Some of them may or may not appear later somewhere else, or they may just sit on my hdd until i forget about them and delete them by accident or otherwise.

Wherever they end up though it won't be in one of the other commercial services because i'm not interested in doing this again.

I need to do the same to this blog, ... but that can wait.

Update: So ... it seems someone did notice. Don't worry i've got a full backup of everything. In part I was pissed off with google and in part I wanted to make it immediately obvious it was going away to see if anyone actually noticed. Who could know from the years of feedback i've never received.

If you are interested in particular projects then add a comment about them anywhere on this blog. All comments are moderated so they won't appear till i let them through but there is no need to post more than once (i'll delete any nonsense or spam unless it makes you look like a bit of a wanker). I will see about enabling anonymous comments for the moment if they are not already on. I can then decide what to do depending on the interest.

I will not be using github or sourceforge nor the like.

In a follow-up post i'll see where i'm going with them as well as pondering the future location and shape of the blog itself. I've already written some stuff to take the blogger atom stuff, strip out the posts, download images, and fix the urls.

(in resp to Peter below) I knew the writing was on the wall when google code removed binary hosting: it's kinda useless for its purpose without it. This is why all my new projects subsequent to that date are hosted differently.

Tagged android, beagle, code, dusk, games, gloat, hacking, imagez, jjmpeg, mediaz, pdfz, puppybits, readerz, socles.
Tuesday, 20 January 2015, 13:01

Yay, NBN is here at last.

Got the nbn hooked up on Friday - and fortunately i'm still in an area they're doing fibre to the home even if it is coming in over-head. Of course most of the country should be getting that too but the total fuckwits running the country decided to show their complete lack of intelligence and add a ton of future cost and grief by changing it to 'fibre to the node' so they can maintain 50 year old copper pairs until they need to go back to the original plan at double the expense, ... but i digress.

It was something i was kind of excited for a few years ago but then I kind of cooled on the idea since the internet is mostly just a pointless waste of time. But then again so is most human activity isn't it? The NBN is mostly just going to be used for ip-tv i'm sure.

The ADSL I had before wasn't too bad so for the most part there isn't much difference so far since most sites were bandwidth limited their end rather than mine but it should be more reliable during wet weather if nothing else (the old modem was getting flakey too and needed a weekly-or-so reboot). I guess the peak (i'm willing to pay for right now) is 2x download and much higher upload (10x or 20x, not sure what it was before) compared to my ADSL, which is better than a poke in the eye.

I am getting a fixed ip address this time and will be setting up a local webserver - mostly because I can and want to but also to play with some web software and move away from google's advertising-supported services. I'm going to try to write my own cms/server/thing based on some experiments I did a few years ago; but it could be a while before I get anywhere because i'm just not in any rush and the weather is too nice over summer (and it's a fairly large undertaking together with me being a bit rusty on the technologies involved).

I'm also going to try to run it on my beagleboard xm too just for kicks - after a bit of a search i found where i'd left it and it seems to be working fine. It should be fast enough for the "expected load".

I'm too lazy to take a picture of it right now but yesterday I finally worked out how to fit the beagleboard and a usb harddrive into a tidy & compact case. I have an old/dead 3.5" USB HDD enclosure made of extruded aluminium that I cut out and filed down some holes for the usb/network slots and had a portable 2.5" HDD I filed down a bit to slide into the other end (retaining all the shock mounting and external case). I have the connecting cable running externally but since it's already got a network cable coming out the same side it sort of "works" and is a lot neater than anything I could come up with when trying to shoehorn the cables into any other reasonably sized box (usb cables are so bulky in a confined area).

I didn't add any holes for the audio/video panel but there should be room for some right-angle plugs should I need it. I guess if i ever replace my burnt out amp I could set it up near that and use it for a radio too.

Tagged beagle, biographical, wanki.
Monday, 10 June 2013, 13:02

Clamping, scaling, format conversion

Got to spend a few hours poking at the photo-effects app i'm doing in conjunction with 'ffts'. I ended up having to use some NEON for performance.

One interesting solution along the way was code that took 2x2-channel float sequences (i.e. 2xcomplex number arrays) and re-wound them back to 4-channel bytes, including scaling and clamping.

I utilised the fixed-point variant of the VCVT instruction which performs the scaling to 8 bits with clamping below 0. For the high bits I used the saturating VQMOVN variant of move with narrow.
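For anyone who doesn't read NEON, here is a rough scalar model in C of what that conversion chain does to one value. It is only a sketch: the function names are made up for illustration, and it assumes the input has already been scaled so that 1.0 represents full scale.

```c
#include <stdint.h>

/* Model of VCVT.U32.F32 with #8 fraction bits: multiply by 2^8 and convert
 * to unsigned with saturation. Negative inputs (and NaN) come out as 0. */
static uint32_t cvt_u32_fixed8(float v)
{
    float scaled = v * 256.0f;
    if (!(scaled > 0.0f))            /* negative, zero or NaN */
        return 0;
    if (scaled >= 4294967296.0f)     /* too big for 32 bits: saturate */
        return UINT32_MAX;
    return (uint32_t)scaled;         /* truncates toward zero */
}

/* Model of VQMOVN.U32: narrow 32 -> 16 bits, clamping the upper end. */
static uint16_t qmovn_u32(uint32_t x)
{
    return x > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)x;
}

/* Model of VQMOVN.U16: narrow 16 -> 8 bits, clamping the upper end. */
static uint8_t qmovn_u16(uint16_t x)
{
    return x > 0xFFu ? (uint8_t)0xFFu : (uint8_t)x;
}

/* The whole chain for a single (pre-scaled) float. */
static uint8_t float_to_byte(float v)
{
    return qmovn_u16(qmovn_u32(cvt_u32_fixed8(v)));
}
```

So the lower clamp comes from the unsigned conversion and the upper clamp from the two narrowing moves; no explicit min/max is needed anywhere.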

I haven't run it through the cycle counter (or looked the details up) so it could probably do with some jiggling or widening to 32 bytes/iteration but the current main loop is below.

        vld1.32         { d0[], d1[] }, [sp]

        vld1.32         { d16-d19 },[r0]!
        vld1.32         { d20-d23 },[r1]!     
1:
        vmul.f32        q12,q8,q0               @ scale
        vmul.f32        q13,q9,q0
        vmul.f32        q14,q10,q0
        vmul.f32        q15,q11,q0

        vld1.32         { d16-d19 },[r0]!       @ pre-load next iteration
        vld1.32         { d20-d23 },[r1]!

        vcvt.u32.f32    q12,q12,#8              @ to int + clamp lower in one step
        vcvt.u32.f32    q13,q13,#8
        vcvt.u32.f32    q14,q14,#8
        vcvt.u32.f32    q15,q15,#8

        vqmovn.u32      d24,q12                 @ to short, clamp upper
        vqmovn.u32      d25,q13
        vqmovn.u32      d26,q14
        vqmovn.u32      d27,q15

        vqmovn.u16      d24,q12                 @ to byte, clamp upper
        vqmovn.u16      d25,q13

        vst2.16         { d24,d25 },[r3]!

        subs    r12,#1
        bhi     1b

The loading of all elements of q0 from the stack was the first time I've done this:

        vld1.32         { d0[], d1[] }, [sp]

Last time I did this I think I did a load to a single-precision register or an ARM register then moved it across, and I thought that was unnecessarily clumsy. It isn't terribly obvious from the manual how the various versions of VLD1 differentiate themselves unless you look closely at the register lists. d0[],d1[] loads a single 32-bit value to every lane of the two registers, or all lanes of q0.

The VST2 line:

        vst2.16         { d24,d25 },[r3]!

Performs a neat trick of shuffling the 8-bit values back in to the correct order - although it relies on the machine operating in little-endian mode.
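A small C model of the shuffle, as a sketch only (st2_16 is a hypothetical name, and the registers are represented by their little-endian memory image, so byte 0 of the array is the lowest byte of the register). After the narrowing one register holds ABABABAB and the other CDCDCDCD; storing alternating 16-bit elements puts each AB pair in front of its CD pair.

```c
#include <stdint.h>
#include <string.h>

/* Model of VST2.16 {d24,d25}: alternately store one 16-bit element from
 * each register. A 16-bit element here is simply two adjacent bytes. */
static void st2_16(const uint8_t d24[8], const uint8_t d25[8], uint8_t out[16])
{
    for (int i = 0; i < 4; i++) {
        memcpy(out + i * 4,     d24 + i * 2, 2);   /* A,B pair */
        memcpy(out + i * 4 + 2, d25 + i * 2, 2);   /* C,D pair */
    }
}
```

Feeding it "ABABABAB" and "CDCDCDCD" gives "ABCDABCDABCDABCD", which is the pixel order wanted.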

The data flow is something like this:

 input bytes:        ABCD ABCD ABCD
 float AB channel:   AAAA BBBB AAAA BBBB
 float CD channel:   CCCC DDDD CCCC DDDD   
 output bytes:       ABCD ABCD ABCD

As the process of performing a forward then inverse FFT ends up scaling the result by the number of elements (i.e. *(width*height)) the output stage requires scaling by 1/(width*height) anyway. But this routine requires further scaling by (1/255) so that the fixed-point 8-bit conversion works and is performed 'for free' using the same multiplies.

This is the kind of stuff that is much faster in NEON than C, and compilers are a long way from doing it automatically.

The loop in C would be something like:

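#include <complex.h>
#include <stdint.h>

static float clampf(float v, float l, float u) {
   return v < l ? l : (v < u ? v : u);
}

/* Converts one row of 'width' pixels; a and b are the two complex planes,
   d is the 4-channel byte output.  The function name is only illustrative. */
void unpack_row(const complex float *a, const complex float *b, uint8_t *d,
                int width, int height) {
    float scale = 1.0f / (width * height);
    for (int i=0;i<width;i++) {
       complex float A = a[i] * scale;
       complex float B = b[i] * scale;

       float are = clampf(crealf(A), 0, 255);
       float aim = clampf(cimagf(A), 0, 255);
       float bre = clampf(crealf(B), 0, 255);
       float bim = clampf(cimagf(B), 0, 255);

       d[i*4+0] = (uint8_t)are;
       d[i*4+1] = (uint8_t)aim;
       d[i*4+2] = (uint8_t)bre;
       d[i*4+3] = (uint8_t)bim;
    }
}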

And it's interesting to me that the NEON isn't much bulkier than the C - despite performing 4x the amount of work per loop.

I set up a github account today - which was a bit of a pain as it doesn't work properly with my main browser machine - but I haven't put anything there yet. I want to bed down the basic data flow and user-interaction first.

Tagged android, beagle, code, hacking, picfx.
Friday, 24 May 2013, 11:50

on google

So google have decided to disable downloads on google code.

So I have decided to stop using it.

... although as yet I have no concrete plans or timeline for when this decision will take effect.

Whilst they claim it's about abuse, one can only assume that is just a "likely-sounding excuse" for what in reality is just another straight-up lie from the PR department of a supra-national conglomerate, and it's really just a way to cut costs and promote their 'drive' service (a useless microsoft/apple only service as far as i'm concerned).

Nobody seems to have reported that they have also gimped their POP interface to gmail a couple of days ago. No more UID support. This makes POP a lot less reliable/useful as a mail store (although in honesty it was never designed for that purpose). I proceeded to delete all the mail in gmail to help them free up some disk space.

I guess over-all the writing is on the wall. We all know that at some point 'google account' will mean 'google+', and blogger may be retired at any time.

So it seems my on-going-but-totally-lax search for alternatives to 'everything google for convenience' just got another big kick up the rump-side.

As my projects are all pretty small and low-volume I might look at a local solution because every network based solution faces the same problem. I have a couple of beagleboards doing nothing although getting a running and secure-enough system might be more pain than it's worth.

It's a bit of a pain to have to deal with.

Tagged beagle, dusk, hacking, imagez, java, javafx, jjmpeg, mediaz, pdfz, puppybits, rants, readerz, socles, videoz.
Wednesday, 27 March 2013, 12:21

Bummed out, or am i?

This week I've been experimenting with the performance of some NEON code. It is from an algorithm which was developed in OpenCL for desktop GPUs and then downscaled to fit on a beagleboard (only for development purposes). The overall algorithm is identical but the way some of the steps are implemented is different (and for some significant components much less computationally and bandwidth intensive).

The OpenCL code took many months to develop - although that included dead-ends, multiple steps of refinement, and other distractions including completely unrelated work. Even with that, I'd put the effort at around 4-10x that for the NEON code.

The NEON code took a few weeks. It obviously helped immensely that the algorithm was primarily known in advance although the downscaling alterations were not. On the other hand, my total experience with NEON is far less than OpenCL and certainly C or Java in terms of hours.

One reason the NEON code was much easier to write is that, because it is so cheap to invoke, one can just concentrate on the bottlenecks and leave the housekeeping to C. e.g. I can write a routine that processes as little as 16x16 pixels in assembly, and leave the addressing crap to C. There is also no marshalling or other api binding to worry about: the C is plain C, and the assembly is plain assembly, and even though JOCL is far far better than using the C api directly it's still quite a bit of work. As much fun as OpenCL is, it's even more fun hacking NEON because you can concentrate on the fun bits even more.

Although the OpenCL model is also based on simple kernels which should equally be simple and isolated - it isn't really quite like that in practice. All but the simplest of kernels end up turning into 64-way parallel subroutines utilising LDS, barriers, and so on. Without that you end up leaving scads of performance on the floor, so it really is necessary. Not to mention all the marshalling and boilerplate in the host-code to communicate with it. And because of the marshalling and invocation latencies pretty much everything is forced onto the GPU.

So what's the point i'm getting at?

Well after all that, the projected performance on the previously-latest-version of a popular handset is only about 5x slower than a HD7970 on a pretty beefy desktop!

Yes that 5x speedup is important enough that it is worth it and opens it up to more applications, but on a personal level i'm just totally bummed it isn't much more. It's a highly parallel and bandwidth intensive workload which should be well-suited to a GPU. Obviously opencl has the advantage that it isn't tied to a single bit of hardware. It's a pity SSE sucks so much otherwise it would be interesting to see how a desktop cpu fared on its own.

I plan to "back port" the algorithms so it can be improved on the GPU, but I have a fairly educated feeling that another 2000% performance isn't very likely. I will also need to use some AMD proprietary extensions, so the portability will suffer too.

I'm sure I can improve it, but I just take it as a big personal slap in the face for all the effort that's gone into it so far!

Of course, the alternative view is that ARM+NEON is the bee's knees - with less effort i'm getting relatively great performance. But we all knew that so it isn't such a revelation ...

The main bottleneck on ARM cpus at the moment is the memory, and if you can utilise the cache effectively it really flies. I would really like to see how a beagleboard like machine with big/little A15/A7 quad core and much faster memory would fare; all these cheap android dongles are far too constrained by their form-factor.

Update: Well I might need to eat my words here. Today I started to look at GPU optimisations based on what i'd learnt from the ARM experience and trying to reduce the bottlenecks of the GPU code.

The key word of the day: batching.

One reason I wasn't previously batching the processing is because it didn't really fit the data-flow of an earlier application. But I have now achieved something like a 50x boost in one key algorithm by a combination of batching the work more aggressively and some other significant algorithmic changes.

This is more like it. No longer bummed out ...

Tagged beagle, hacking, opencl.
Thursday, 03 January 2013, 13:45

NEON YUV vs GPU

This morning I did some experiments with Android and the YUV code - although patience is wearing thin for such a shitty alternative to GNU/Linux that Android is. As icing on the cake most of the android developer site just doesn't render on most of my browsers anymore - I just get junk. Well I can always go elsewhere with my spare time ...

I changed the code to perform a simple doubling up of the U and V components without a separate pass, and changed to an RGB 565 output stage and embedded it into the code in another mess of crap. Then I did some profiling - comparing mainly to the frame-copying version.

Interestingly it is faster than sending the YUV planes to the GPU and using it to do the YUV conversion - and that is only including the CPU time for the frame copy/conversion, and the texture load. i.e. even using NEON it uses less CPU time (and presumably much less GPU time) even though it's doing more work. The volume of texture memory copied is also 33% more for the RGB565 case vs YUV420p one.
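The 33% figure is just the bytes-per-pixel arithmetic: YUV420p is a full-size Y plane plus two quarter-size chroma planes (1.5 bytes per pixel), RGB565 is 2 bytes per pixel. A throwaway sketch, with made-up helper names and assuming even frame dimensions:

```c
#include <stddef.h>

/* YUV420p: one full-resolution Y plane plus U and V planes at half
 * resolution in both directions (assumes w and h are even). */
static size_t yuv420p_bytes(size_t w, size_t h)
{
    return w * h + 2 * ((w / 2) * (h / 2));
}

/* RGB565: one 16-bit word per pixel. */
static size_t rgb565_bytes(size_t w, size_t h)
{
    return w * h * 2;
}
```

For a 640x480 frame (just an example size) that is 460800 bytes against 614400, a ratio of 4:3.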

Still, 1ms isn't very much out of 10 or so.

The actual YUV420p to RGB565 conversion is only around 1/2 the speed of a simple AVFrame.copy() - ok considering it's writing 33% more data and I didn't try to optimise the scheduling.

Stop press: Whilst writing this I thought i'd look at the scheduling and also using the saturating left shift to clamp the values implicitly. Got the inner loop down from 54 to 35 cycles (according to the cycle counter), although it only runs about 10% faster. Better than a kick in the nuts at any rate. Fortunately due to the way I already used registers I could decouple the input loading/formatting from the calculations, so i simply interleaved the next block of data load within the calculations wherever there were delay slots and only made the data loading conditional.

The (unscheduled) output stage now becomes:

        @ saturating left shift automatically clamps to unsigned [0,0xffff]
        vqshlu.s16      q8,#2           @ red in upper 8 bits
        vqshlu.s16      q9,#2
        vqshlu.s16      q10,#2          @ green in upper 8 bits
        vqshlu.s16      q11,#2
        vqshlu.s16      q12,#2          @ blue in upper 8 bits
        vqshlu.s16      q13,#2

        vsri.16         q8,q10,#5       @ insert green
        vsri.16         q9,q11,#5
        vsri.16         q8,q12,#11      @ insert blue
        vsri.16         q9,q13,#11

        vst1.u16        { d16,d17,d18,d19 },[r3]!

Which saves all those clamps.
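In scalar terms the output stage does roughly the following for each pixel. This is a sketch for illustration, not the code from yuv-neon.s: the function names are invented, and the inputs are the signed 16-bit fixed-point channel values from the conversion (255 maps to 255*64 = 16320).

```c
#include <stdint.h>

/* Model of VQSHLU.S16 #2: signed input, unsigned saturating result. */
static uint16_t qshlu_s16_2(int16_t x)
{
    if (x < 0)
        return 0;                                /* clamp lower */
    uint32_t shifted = (uint32_t)x << 2;
    return shifted > 0xFFFFu ? (uint16_t)0xFFFFu /* clamp upper */
                             : (uint16_t)shifted;
}

/* Model of VSRI.16 d,s,#n: keep the top n bits of d and insert s >> n
 * below them (n must be in 1..16). */
static uint16_t sri_16(uint16_t d, uint16_t s, unsigned n)
{
    uint16_t keep = (uint16_t)(0xFFFFu << (16 - n));
    return (uint16_t)((d & keep) | (s >> n));
}

/* Red in the top 5 bits, green in the next 6, blue in the low 5. */
static uint16_t pack_rgb565(int16_t r, int16_t g, int16_t b)
{
    uint16_t px = qshlu_s16_2(r);
    px = sri_16(px, qshlu_s16_2(g), 5);
    px = sri_16(px, qshlu_s16_2(b), 11);
    return px;
}
```

The second insert only keeps 11 bits of the first result, which is what trims green down to 6 bits without any explicit masking.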

As suspected, the 8 bit arithmetic leads to a fairly low quality result, although the non-dithered RGB565 can't help either. Perhaps using shorts could improve that without much impact on performance. Still, it's passable for a mobile device given the constraints (and source material), but it isn't much chop on a big tv.

Of course, all this wouldn't be necessary if one had access to the overlay framebuffer hardware present on pretty well all ARM SOCs ... but Android doesn't let you do that does it ...

Update: I've checked a couple of variations of this into yuv-neon.s, although i'm not using it in the released JJPlayer yet.

Mele vs Ainol Elf II

The Elf is much faster than the Mele at almost everything - particularly video decoding (which uses multiple threads), but pretty much everything else is faster (Better memory? The Cortex-A9? The GPU?) and with the dual-cores means it just works a lot better. Can't be good for the battery though.

(as an aside, someone who spoke english should've told the guys in China that "anal elf 2" is probably not a good name for a computer!)

But the code is written with multiple cores in mind - demux, decoding of video and audio, and presentation is all executed on separate threads. Having all of the cpu-bound tasks executed in a single thread may help on the Mele, although by how much I will only know if and when I do it ...

Tagged android, beagle, hacking, jjmpeg.
Wednesday, 02 January 2013, 17:01

NEON yuv + scale

Well I still haven't checked the jjmpeg code in but I did end up playing with NEON yuv conversion yesterday, and a bit more today.

The YUV conversion alone for a 680x480 frame on the beagleboard-xm is about 4.3ms, which is ok enough. However with bi-linear scaling to 1024x600 as well it blows out somewhat to 28ms or so - which is definitely too slow.

Right now it's doing somewhat more work than it needs to - it's scaling two rows each time in X so it can feed into the Y scaling. Perhaps this could be reduced by about half (depending on the scaling going on), which might knock about 10ms off the processing time (assuming no funny cache interactions going on) which is still too slow to be useful. I'm a bit bored with it now and don't really feel like trying it out just yet.

Maybe the YUV only conversion might still be a win on Android though - if loading an RGB texture (or an RGB 565 one) is significantly faster than the 3x greyscale textures i'm using now. I need to run some benchmarks there to find out how fast each option is, although that will have to wait for another day.

yuv to rgb

The YUV conversion code is fairly straightforward in NEON, although I used 2:6 fixed-point for the scaling factors so I could multiply the 8 bit pixel values directly. I didn't check to see if it introduces too many errors to be practical mind you.

I got the constants and the maths from here.

        @ pre-load constants
        vmov.u8 d28,#90                 @ 1.402 * 64
        vmov.u8 d29,#113                @ 1.772 * 64
        vmov.u8 d30,#22                 @ 0.34414 * 64
        vmov.u8 d31,#46                 @ 0.71414 * 64

The main calculation is performed using 2.14 fixed-point signed mathematics, with the Y value being pre-scaled before accumulation. For simplification the code assumes YUV444 with a separate format conversion pass if required, and if executed per row should be cheap through L1 cache.

        vld1.u8 { d0, d1 }, [r0]!       @ y is 0-255
        vld1.u8 { d2, d3 }, [r1]!       @ u is to be -128-127
        vld1.u8 { d4, d5 }, [r2]!       @ v is to be -128-127

        vshll.u8        q10,d0,#6       @ y * 64
        vshll.u8        q11,d1,#6

        vsub.s8         q1,q3           @ u -= 128 (q3 assumed to hold 128 in each lane; setup not shown)
        vsub.s8         q2,q3           @ v -= 128
        
        vmull.s8        q12,d29,d2      @ u * 1.772
        vmull.s8        q13,d29,d3

        vmull.s8        q8,d28,d4       @ v * 1.402
        vmull.s8        q9,d28,d5

        vadd.s16        q12,q10         @ y + 1.772 * u
        vadd.s16        q13,q11
        vadd.s16        q8,q10          @ y + 1.402 * v
        vadd.s16        q9,q11

        vmlsl.s8        q10,d30,d2      @ y -= 0.34414 * u
        vmlsl.s8        q11,d30,d3
        vmlsl.s8        q10,d31,d4      @ y -= 0.71414 * v
        vmlsl.s8        q11,d31,d5

And this neatly leaves the 16 RGB result values in order in q8-q13.
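The same arithmetic for a single pixel in C, as a sketch (the struct and function names are hypothetical). It uses the same rounded constants as the pre-load above and everything stays within signed 16 bits, so it should match the NEON lane for lane before clamping.

```c
#include <stdint.h>

/* One pixel's worth of unclamped results, 6 fraction bits. */
typedef struct { int16_t r, g, b; } rgb_fixed;

static rgb_fixed yuv_to_rgb_fixed(uint8_t y, uint8_t u, uint8_t v)
{
    int16_t ys = (int16_t)(y * 64);       /* vshll.u8 #6: y * 64        */
    int16_t us = (int16_t)(u - 128);      /* centre chroma on zero      */
    int16_t vs = (int16_t)(v - 128);
    rgb_fixed out;

    out.r = (int16_t)(ys + 90 * vs);              /* y + 1.402 * v      */
    out.g = (int16_t)(ys - 22 * us - 46 * vs);    /* y - 0.344u - 0.714v */
    out.b = (int16_t)(ys + 113 * us);             /* y + 1.772 * u      */
    return out;
}
```

Mid-grey (128,128,128) comes out as 8192 on all three channels, and the extremes go negative or above 16320, which is why the clamp is still needed afterwards.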

They still need to be clamped which is performed in the 2.14 fixed point scale (i.e. 16383 == 1.0):

        vmov.u8         q0,#0
        vmov.u16        q1,#16383

        vmax.s16        q8,q0
        vmax.s16        q9,q0
        vmax.s16        q10,q0
        vmax.s16        q11,q0
        vmax.s16        q12,q0
        vmax.s16        q13,q0
        
        vmin.s16        q8,q1
        vmin.s16        q9,q1
        vmin.s16        q10,q1
        vmin.s16        q11,q1
        vmin.s16        q12,q1
        vmin.s16        q13,q1

Then the fixed point values need to be scaled and converted back to byte:

        vshrn.i16       d16,q8,#6
        vshrn.i16       d17,q9,#6
        vshrn.i16       d18,q10,#6
        vshrn.i16       d19,q11,#6
        vshrn.i16       d20,q12,#6
        vshrn.i16       d21,q13,#6

And finally re-ordered into 3-byte RGB triplets and written to memory. vst3.u8 does this directly:

        vst3.u8         { d16,d18,d20 },[r3]!
        vst3.u8         { d17,d19,d21 },[r3]!

vst4.u8 could also be used to write out RGBx, or the planes kept separate if that is more useful.

Again, perhaps the 8x8 bit multiply is pushing it in terms of accuracy, although it's a fairly simple matter to use shorts instead. If shorts were used then perhaps the saturating doubling returning high half instructions could be used too, to avoid at least the input and output scaling.

Stop Press

As happens when one is writing this kind of thing I noticed that there is a saturating shift instruction - and as it supports signed input and unsigned output, it looks like it should allow me to remove the clamping code entirely if I read it correctly.

This leads to the following combined clamping and scaling stage:

        vqshrun.s16     d16,q8,#6
        vqshrun.s16     d17,q9,#6
        vqshrun.s16     d18,q10,#6
        vqshrun.s16     d19,q11,#6
        vqshrun.s16     d20,q12,#6
        vqshrun.s16     d21,q13,#6

Which appears to work on my small test case. This drops the test case execution time down to about 3.9ms.
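What the instruction does per lane, modelled in C as a sketch (the function name is made up): negative values saturate to 0, everything else is shifted right by 6 and saturated to 255. Note this form truncates; it is the VQRSHRUN variant that rounds.

```c
#include <stdint.h>

/* Model of VQSHRUN.S16 #6: signed 16-bit in, unsigned saturated byte out. */
static uint8_t qshrun_s16_6(int16_t x)
{
    if (x < 0)
        return 0;                 /* clamp lower */
    int shifted = x >> 6;         /* drop the 6 fraction bits */
    return shifted > 255 ? (uint8_t)255 : (uint8_t)shifted;   /* clamp upper */
}
```

So 8192 gives 128, anything from 16320 upwards gives 255, and the separate vmax/vmin pass becomes redundant.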

And given that replacing the yuv2rgb step with a memcpy of the same data (all else being equal - i.e. yuv420p to yuv444 conversion) still takes over 3.7ms, that isn't too shabby at all.

RGB 565

An alternative scaling & output stage (after the clamping) could produce RGB 565 directly (I haven't checked this code works yet):

        vshl.i16        q8,#2           @ red in upper 8 bits
        vshl.i16        q9,#2
        vshl.i16        q10,#2          @ green in upper 8 bits
        vshl.i16        q11,#2
        vshl.i16        q12,#2          @ blue in upper 8 bits
        vshl.i16        q13,#2

        vsri.16         q8,q10,#5       @ insert green
        vsri.16         q9,q11,#5
        vsri.16         q8,q12,#11      @ insert blue
        vsri.16         q9,q13,#11

        vst1.u16        { d16,d17,d18,d19 },[r3]!

Tagged android, beagle, hacking, jjmpeg.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!