OpenCV, android, java, C++, suckage
So I needed to code up a prototype and I thought it was probably time to use OpenCV rather than write it all from scratch again, since I didn't need OpenCL at this time. I've always avoided OpenCV because the code within it is horrid, the API a bit ugly, and it's almost impossible to build it without the right version of GNU/linux (let alone for another platform).
I think I probably made a mistake here ... because I'm not sure much has changed, or if it has it's only for the worse.
First I was just using the android api - I thought that the same api was available to java generally, but it's pretty hard to find out if that's true. The only README just points to a generic web page, and who knows what cmake is doing to decide it won't build anything Java. Although I have some stuff working on android I wanted to drop back to the desktop to ease some experimentation I need to do.
cmake - boy that sucks total arse. When it does work there is no information on how to control it (e.g. why does it say Unavailable: java), and it's just a shitty piece of crap anyway.
And so does C++. C++ is such a shit language.
OpenCV on android is just slow as fuck anyway. Just a simple live webcam can barely hit 10fps without even doing anything. Changing the camera resolution barely makes much difference, so there's something funky going on. There's so much debug snot in `logcat' that you can't even tell what your own application is doing let alone OpenCV.
The Java API is really pretty horrible anyway. Everything seems to go through an opaque type 'Mat', but then you still need to know what's in it. It seems to be because the C++ API is also horrible. It's kind of like they're trying to make it "matlab for C++", which probably makes sense for their target audience of non-programmers, but otherwise it totally sucks.
I would have been better off just coding up the routines I need from scratch (I already had to do one of them as it was a contrib api which isn't exposed, and I couldn't get custom jni to work to bind it), as they're fairly simple and easy to use once you give them a decent api.
(Yes I could use JavaCV but it still brings along with it a lot of the hassles, and the library it uses is still the same).
So apparently Mozilla have announced that Thunderbird will no longer get active development, and merely security updates.
Well you could have fooled me - I gotta say it's a pretty uninspiring bit of software, and from the way it works I could have sworn they stopped developing it a decade ago.
It's like a reasonably competent email client from the 90s. Sure it does the job, but there's certainly no flair there and it's really just a bit clumsy at doing everything.
I've been using it for about a year since gmail became too slow in firefox and loaded my laptop too much. But for all its faults it's still better than gmail, and I don't have to put up with adverts either. I still can't look at evolution after 7 years ...
Zed's Red Fermented Weird Sour Chilli Sauce
Well i'm back to work next week so i've been taking a break from hacking before getting back into it. Brewed some beer (second wort going now), cleaned the windows (i'm still surprised how much difference it makes), did some more preserving and cooking ...
Many moons ago - last year some time - I fermented some red chillies with the goal of making some tabasco style sauce. I did the same from some green chillies and I just ran out of that (or just can't find it in the cupboard), so I thought it was about time I did something with the red ones before the mould ate it all ...
So I scraped off the white mould, added some vinegar (1/2 a cup or so, didn't measure), lime and lemon juice (4 of each), blended it, sterilised it and bottled it. No sugar or anything else.
If nothing else it has an absolutely amazing colour ...
Flavour is interesting - quite sour. Definitely nothing like tabasco (unlike the green one I made which was much more vinegary and fairly close to green tabasco sauce), but it does emphasise the sour note from the fermentation which is what I wanted - the green version masked it too much in the vinegar. Given it was from cayenne chillies it isn't too spicy either. House-mate thought it was reminiscent of green mango.
I have a small jar of habaneros i fermented too, originally I was intending to make a `super' tabasco sauce with that, but I think I will just leave those as sliced pickled chillies. Damn damn hot too - can't really add more than about 1/2 teaspoon to a bowl of food without suffering too much ...
Oh I tried the salted kumquats i made a few weeks ago. Well, they taste like salty kumquats in lemon juice and a little like lime pickles. Not sure what i'll do with them ...
It's almost time to look at getting some seedlings going again this year.
Update: I really like this sauce, very nice as a dipping sauce for pork or chicken. Very moreish, and not too overpowering in heat. I also found a use for the kumquats - works pretty well in a bean dish i've been making (bacon, beans, green tomatoes, herbs, stir fry, not stewed) to add some depth - although it's easy to over-do the salt.
Your browser is not supported ...
Great, now i get a fucking advert for some shit browser I don't want to use every time i do anything in blogger.
Lime marmalade Incinerade
Still have more fruit than I can use, so I made some lime marmalade. Added a pile of Habanero chillies as well, so it's pretty hot. Not sure what i'll use it for ... it's more like a lime + chilli jelly with a bit of bitterness.
Got off the infernal machines yesterday and got a bit productive in general, also did 3 loads of washing, mowed the lawn, started brewing some beer ...
- Finely sliced ripe limes (mine are yellow on the outside with a very thin skin, lime-green on the inside, and very juicy).
- Finely sliced Habanero chillies (this is a TON of heat).
- Water (i.e. equal weight to fruit).
- Sugar (i.e. equal weight to fruit).
- Pips from some other pippy citrus. I used half a dozen kumquats which are loaded with large seeds.
- Place the lime, chillies, ginger and water in a pot and soak overnight.
- Wrap the pips in some chux and tie up, place in the pot and bring to the boil.
- Simmer for at least 30 minutes.
- Pour in the sugar and stir until dissolved (I initially removed the pips at this point, but as it took forever to set I put them back).
- Simmer until it sets on a plate in the freezer, 30 minutes plus. It's supposed to skin when pushed.
- Pour into sterilised jars and seal while still hot. Makes about 4.5 250ml jars.
I had trouble getting the 'plate set test' to work - and ended up simmering it for a bit over an hour. But when I went to bottle it it started to stick in my funnel after the first jar and it turned solid enough to turn upside-down as soon as it cooled off a bit. In short I think I cooked it a bit longer than I needed to, but not enough to hurt it (made it a bit more orange coloured than it would otherwise have been).
Initially I only put 40g of habaneros, but I thought I may as well make it worth the effort and grabbed a few more from the freezer as I was cooking it.
Has a nice sweet and intensely lime flavour with a generous hint of marmalade bitterness. It set solid - like jelly - although it is cloudy (mostly from the ginger pulp I guess).
The habanero chillies add a big kick - that gets more intense with each drop as they usually do. I had some tiny amount with kabana and cheese on crackers and it worked pretty well. Yet to try it on toast with coffee ...
It looks and smells like a nice sweet marmalade, but a corn-kernel sized piece is enough to set your whole mouth afire.
arm, tegra3, neon vfp, ffmpeg and crashes
So I just did a release of jjmpeg including the android player ... and then a few hours later finally discovered the cause of crashes I've been having ...
Either FFmpeg, the android sdk compiler or the Tegra 3 processor (or the way i'm using any of them) has some sort of issue which causes bus errors fairly regularly but never repeatably. Possibly because of mis-aligned accesses. Unfortunately when I compile without optimisations on - the build fails, which makes it a bit hard to debug ... I got gdb to run (once only though, subsequent runs fail), and got a half-decent backtrace, but optimisations obscured important details.
Anyway i noticed that 0.11.1 has a bunch of ARM work, so I upgraded the FFmpeg build, and mucked about with the build options for an hour trying to suss out the right ones and to see how various ones worked.
Short story is that using armeabi-v7a causes the crash to appear (with any sort of float, vfp, neon, or soft), and dropping back to armeabi fixes everything.
Unless I can get better debugging results I think i'll just stick with armeabi for the foreseeable future. I can't find anything recent about these types of problems, so perhaps it's just my configuration but I really just don't know enough about the ARM specifics at this point to tell either way.
Well, that's enough for today.
Update: I spent another day or so on this and finally nutted it out. It was due to alignment problems - but it was odd that it happened so rarely.
As best I could work out, ARMv6+ allows non-aligned memory accesses, but the standard ARM system control module can be programmed to cause faults. And just to complicate matters the ARM linux kernel has the ability to handle the faults and implement the mis-aligned access manually, and this behaviour can be configured at run-time via /proc. It seems the kernel on my tablet is configured to cause faults, and not having administrative access I am unable to change it ...
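For anyone rusty on what "mis-aligned" means here, a minimal sketch: an access of N bytes is naturally aligned when its address is a multiple of N. The helper name and the byte-then-int record layout are made up for illustration, and this says nothing about how FFmpeg lays out its own data.

```python
import struct


def is_aligned(address: int, access_size: int) -> bool:
    """True if an access of access_size bytes at address is naturally aligned."""
    return address % access_size == 0


# A 4-byte int stored directly after a single byte, with no padding ("=" mode),
# starts at offset 1, which is not a multiple of 4.
packed_offset = struct.calcsize("=B")

# With native padding ("@" mode) the compiler-style layout moves the int to an
# aligned offset: total size minus the size of the int itself.
native_offset = struct.calcsize("@BI") - struct.calcsize("@I")

print("packed int offset:", packed_offset, "aligned:", is_aligned(packed_offset, 4))
print("native int offset:", native_offset, "aligned:", is_aligned(native_offset, 4))
```

Code that reads a 32-bit value from the packed offset is the kind of access that either works, gets fixed up by the kernel, or faults, depending on how the system is configured.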
So the problem is that FFmpeg's configure script assumes mis-aligned memory accesses are safe if you're using armv6 or higher. Anyway I filed a bug, although so far indications are that the bug triager doesn't know what I filed (see update 2) - i'm not fussed as I have a work-around that doesn't require any patch.
I had wasted a lot of time based on thinking it was neon or optimisation related, whereas it was just ARMv5 vs anything else behaviour. When i finally did get it to compile without optimisation turned on, the backtrace I got was still identical and so worrying about getting a good backtrace was pointless. I had wrongly assumed that a modern cpu would handle mis-aligned accesses ok, not working at the assembly language level for a while gets you rusty ...
I suppose the main upshot of posting on the libav-user list ... that mostly just resulted in me wasting a full day of fucking about ... is that I realised my configure invocation was still broken (more problems one got from copying some random configure script from the net) and so I managed to clean it up further.
Bit over it all now.
Update: So the actual fix was to run this sed command over config.h after configure is executed:
sed -i -e 's/ HAVE_FAST_UNALIGNED 1/ HAVE_FAST_UNALIGNED 0/' $buildir/config.h
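If the build is driven from something other than a shell script, the same substitution can be sketched in Python. This assumes config.h contains the text ` HAVE_FAST_UNALIGNED 1` exactly as the sed pattern expects; the function name and the sample header text are hypothetical.

```python
def disable_fast_unaligned(config_text: str) -> str:
    """Flip HAVE_FAST_UNALIGNED from 1 to 0, as the sed expression does."""
    # Same literal match as the sed pattern, including the leading space.
    # str.replace changes every occurrence, whereas sed without 'g' changes the
    # first per line; for a single #define line the result is the same.
    return config_text.replace(" HAVE_FAST_UNALIGNED 1", " HAVE_FAST_UNALIGNED 0")


# Illustrative header fragment, not copied from a real FFmpeg build.
sample = "#define HAVE_SOMETHING_ELSE 1\n#define HAVE_FAST_UNALIGNED 1\n"
patched = disable_fast_unaligned(sample)
print(patched)
```

In a real build the text would be read from and written back to the generated config.h before running make.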
Update 2: Good-o, they've just added a configure option to override it.
Update 3: Can anyone tell me why this post is getting so many hits over the last month or so (June '14)? It's showing up in the access stats but there's no info on why.
"Fuck you Nvidia" (and other news of the day)
So apparently Linus blew his top a bit and gave the bird to Nvidia with a pretty clear verbal message to match. Well if nothing else that'll be a keeper of a picture that will bounce around the internet for years to come ...
Of course, he has only got himself to blame here - if he didn't allow binary blobs to link into the kernel in the first place (choosing to discard rights that copyright gives him) then he wouldn't be in this situation would he? After all, it was his decision alone - he could have gone either way and the rest would follow.
So much for pragmatism ...
So I guess we'll see where it goes in a few years when UEFI tivo-ises every hardware platform you buy, and you can no longer compile your own kernel or write your own operating system on your own computer, even if it is running a 'free' operating system.
Of course industry consortia such as Linaro are right behind UEFI - anyone who sells appliances would love to lock them down giving them forced obsolescence - when in reality hardware is approaching the point where software is taking over many of the functions, and is capable of much more than its original firmware allows it. I find it pretty offensive that the guy in the linked video regards anyone who doesn't like UEFI as a pirate ...
Microsoft laptop and/or tablet
Well things must be in dire straits in microsoft's windows-rt land. One can only guess that the OEMs simply aren't embracing the platform with enough zeal - there seems no other sane reason that they would want to create their own tablet (and/or laptop, or whatever it is).
Unless it's just pure greed - which of course isn't something that can be discounted entirely. At least in part they probably think they can recreate the xbox success story - which given how much it cost, clearly wasn't anywhere near as successful as the internets would have you believe.
I bet the few OEMs even looking at microsoft windows-rt are going to be given some moment for pause with this announcement.
The only `OEM' to embrace microsoft will probably be nokia.
But they're totally fucked and who knows if they'll even see out the calendar year. They only ever made good phones, and now they don't even do that - who is going to buy a PC from them, even if the form-factor is a tablet.
But what has happened to nokia is a rant for another person - I've had a couple of old nokia phones over the years and I thought they were fine, but I don't have any connection to them other than a shared sense of disappointment in what has become of a great company in such an astoundingly short period of time.
Finally got a firmware upgrade to the transformer prime last week. TBH I can't tell any difference - if anything the browser hangs more with 'application not responding' (or whatever it says) than it did before. Not that i've been using it a great deal - it's a pretty clumsy way to do anything.
I hurt my foot again (well, this time it was my other foot - my guess is my overly sedentary work-at-home lifestyle for the last few years is catching up with me and I have to at least start taking regular walks to repair my feet - even when i venture out it's mostly cycling) so I was pretty much immobile for a few days. So I dragged out the tablet and used it for some web reading and even tried a few games.
Even the touchy feely games (I downloaded some 'bridge builder' and 'physics challenge' games) which seem well suited to the tablet are a bit of a pain to control with a fat imprecise finger which obscures what you're doing. Trying to use it in bed is annoying as you need two hands to hold it - the auto-rotate stuff is a pain in the arse too - so that gets turned off anyway. As a web browser it is just ok - portable - but again you need to prop it up to use it, or bend over it, and the fat-finger-mouse can make using any web page frustrating. Not to mention all the annoying adverts I haven't seen for years (well at least the flash stops when you're not looking at it - something i never understood about firefox after it had tabs).
About the only good thing about it is it still has a battery that works - so I can use it without a tether - unlike my laptops whose batteries are all dead now. Those batteries are too expensive to make them worth replacing. But other than the battery having died so it is tethered to my desk, my X61 thinkpad is a much easier to use and much more useful device: I use it for email and forums, and most of my browsing.
Once the tablet battery dies it'll be pretty shit as the connector is in an inconvenient place.
I'm still working on the android jjmpeg stuff though - mostly just for my own entertainment. I have the code back ported to amd64 now, but I haven't seen any crashes - valgrind gives a bunch of hits but it's always hard to tell if that's just the JVM doing funky shit or real problems (none of the stack traces show anything useful, even with a debug build).
Viola & Jones Revisited
So after the last post it got me thinking about just how I did implement the viola-jones haar cascade in socles.
The code runs in a loop, and there is no communication from the CPU to the GPU; it only runs about 10-15 iterations anyway (depending on the settings), so the loop overheads are fairly small. But it still does require a 'scale features' step, which is useful on a CPU to avoid excessive calculations but isn't so important on a GPU.
So I tried a slightly different approach - that is, to perform the scaling inside the detector kernel, which allows each kernel to then work on different scales. i.e. to do all scales in one step.
My first attempt at this wasn't much faster - but that's because I was invoking the kernel for too many probes. So then I tried changing the way it works: each work-group still works together on a single feature test stage. But instead of calculating its location and scale from the 2d work coordinates I create a 4 element descriptor with some of the information required and it just uses that. This gives me a bit more flexibility in the work assignment, e.g. I can utilise persistent work-groups and tune the work size to fit the hardware more directly. It requires less temporary memory since the features are scaled in-situ.
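A rough sketch of the idea of flattening every scale into one list of probe descriptors, so a single launch can cover them all. The 4 fields used here (x, y, window size, scale) are a guess for illustration only, as are the function name, the 24 pixel base window, the 1.25 scale step and the stride; none of it is taken from the socles code.

```python
def build_probes(width, height, base=24, scale_step=1.25, stride_frac=0.1):
    """Return one flat list of (x, y, size, scale) probes covering all scales."""
    probes = []
    scale = 1.0
    # Keep growing the window until it no longer fits in the image.
    while int(base * scale) <= min(width, height):
        size = int(base * scale)
        step = max(1, int(size * stride_frac))  # coarser stride for bigger windows
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                probes.append((x, y, size, scale))
        scale *= scale_step
    return probes


probes = build_probes(64, 48)
sizes = sorted({p[2] for p in probes})
print(len(probes), "probes over window sizes", sizes)
```

Because each work-group just reads its descriptor from the list, the list can be cut up however suits the hardware rather than being tied to a 2d grid per scale.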
This change was definitely worth it, for a given test on the webcamfx code, I got the face detection down to around 13ms total time, vs 19ms - about 4ms overhead is fixed. A stand-alone test of Lenna registers about 8ms vs 19ms, so over 100% improvement.
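The arithmetic behind those figures, as a small sketch. The helper name is made up; the timings are the ones quoted above, and subtracting the roughly 4ms fixed overhead from the webcamfx numbers is only an estimate of the detector-only gain.

```python
def speedup(old_ms, new_ms, fixed_ms=0.0):
    """Ratio of old to new time, optionally ignoring a fixed overhead."""
    return (old_ms - fixed_ms) / (new_ms - fixed_ms)


lenna = speedup(19, 8)                        # stand-alone Lenna test
webcam = speedup(19, 13)                      # webcamfx, total time
webcam_detector = speedup(19, 13, fixed_ms=4) # webcamfx, minus fixed overhead

for name, s in [("lenna", lenna), ("webcam", webcam), ("webcam less overhead", webcam_detector)]:
    print(f"{name}: {s:.2f}x, i.e. {100 * (s - 1):.0f}% more throughput")
```

So "over 100%" for the Lenna case means better than a 2x speedup.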
Comparisons with other hardware are difficult - mostly because it depends a great deal on the subject matter and the settings and i haven't kept track of those - but I was pretty disappointed with the AMD performance up until now and I think this gets it on par with the nvidia hardware at least. Although really the 7970 should do measurably better ...
My guess is that the performance gained is mostly because with the greater amount of work done, it can more efficiently fit the total problem onto the hardware. There is usually a small amount of 'modulus' where a given problem won't fill all hardware units, leaving some idle, and in this newer version it only happens once rather than 10-15 times. Actually I did some more timing (and updated the numbers above), and 100% seems too much for this. Maybe? Oh I also changed the parallel sum mechanism - but I changed it in both implementations and it made no difference anyway. I changed the region description to a float array too, although that only affects the scaling function in the first instance.
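The 'modulus' effect can be put in numbers with a toy model. This assumes every work-group takes the same time and the hardware runs them in full waves; the slot count and the per-scale work-group counts below are invented, not measured.

```python
def idle_slots(work_groups, hardware_slots):
    """Slots left empty in the last wave when work_groups are spread over the hardware."""
    return (-work_groups) % hardware_slots


HARDWARE_SLOTS = 32  # hypothetical number of work-groups the device runs at once
groups_per_scale = [400, 260, 170, 110, 70, 45, 30, 20, 13, 8, 5, 3]  # made-up sizes

# One kernel launch per scale: every launch pays its own partial last wave.
separate = sum(idle_slots(g, HARDWARE_SLOTS) for g in groups_per_scale)

# All scales in one launch: only one partial wave at the very end.
combined = idle_slots(sum(groups_per_scale), HARDWARE_SLOTS)

print("idle slots, one launch per scale:", separate)
print("idle slots, single launch:", combined)
```

In this model the single launch can never waste more slots than the separate launches, and the small late scales are where most of the waste comes from.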
If I run this on a CPU the performance is very poor - around 1.5s for this test case. If I go back to a test CPU version I wrote it's a more reasonable 240ms so i'm still getting a good 30x speedup over an Intel i7 X 980. Given I was getting 90ms before with the cpu driver and the nvidia test case i'm not really sure what's going on there.
I haven't checked the code in yet as it's a bit hacked up.
Update: I checked some stuff in, although left both implementations intact for now.
Update 2: So I did some further analysis of the cascades I have: it turns out the way i'm splitting the work is very wasteful of GPU resources. I'm using at least 64 work items per stage - using one work item per feature. But the earlier stages have only a small number of features to test - and the vast majority of probes don't go past the first few stages. e.g. the default cascade only has 9 tests. I tried a few variations to address this but the overheads of multiple kernel calls and the global communication required outweighed any better utilisation.
Update 3: So curiosity kept me poking. First I realised that using fixed scheduling for persistent kernels might not be ideal. So I use an atomic to dole out work in a first-come-first-served consumer way. Made a tiny difference.
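The first-come-first-served scheme looks roughly like this on a CPU, with threads standing in for persistent work-groups and a lock-protected counter standing in for an atomic increment on a global index (such as OpenCL's atomic_inc). The function and variable names are hypothetical.

```python
import threading


def run_persistent_workers(jobs, n_workers, process):
    """Each worker repeatedly claims the next unclaimed job until none are left."""
    next_index = 0
    lock = threading.Lock()
    results = [None] * len(jobs)

    def worker():
        nonlocal next_index
        while True:
            with lock:  # stands in for an atomic fetch-and-increment
                i = next_index
                next_index += 1
            if i >= len(jobs):
                return  # queue drained, this worker retires
            results[i] = process(jobs[i])

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


squares = run_persistent_workers(list(range(100)), 4, lambda v: v * v)
print(squares[:5])
```

Compared with a fixed split, a worker that draws cheap jobs simply comes back for more instead of sitting idle.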
Then I thought I would try to see if using fewer work-items per feature stage would help. In this case I use 4x16 or 2x32 thread groups to work on 4 or 2 tests concurrently - with all the necessary (messy) logic to ensure all barriers are hit by all threads, etc. This was measurable - the lenna test case I have is now down to around 7ms (unfortunately when using sprofile the algorithm fails for some unknown reason - so this is now time measured with System.nanoTime()).
One big thing left to try is to see if localising the wide work queue would help. e.g. rather than call multiple kernels for each stage and having each work-item busy working on a sub-set of problems, do it within the kernel. e.g. if the stage count is 9, 12, ... do stage 1 with 7 concurrent jobs, if any pass then add them to a local queue. Then do stage 2 with (64/12) = 5 concurrent jobs, if any pass add them to a local queue. etc. Once you get to a stage longer than 32 items, just use 64 threads for all the rest. This way I get good utilisation with small stages as well as with large stages. I'm not sure whether this will be worth all the hassle, and the extra addressing mathematics required (and it's already using a lot of registers); but as i'm really curious to know if it would help I might attempt it.
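The work split described above comes down to an integer division. A minimal sketch, assuming a 64 work-item group and the "longer than 32 means use the whole group" rule from the text; the function name is made up.

```python
def concurrent_jobs(stage_len, group_size=64, wide_threshold=32):
    """How many probes one work-group can test at once for a stage of stage_len features."""
    if stage_len > wide_threshold:
        return 1  # long stage: all work-items cooperate on a single probe
    return group_size // stage_len  # short stage: one work-item per feature, several probes


for stage_len in (9, 12, 16, 32, 33, 50):
    jobs = concurrent_jobs(stage_len)
    busy = min(64, jobs * stage_len)
    print(f"stage of {stage_len:2d} features: {jobs} jobs, {busy}/64 work-items busy")
```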
Given that I now use a work queue, another possibility open is to re-arrange the jobs to see if any locality of reference can be exploited. Given the huge memory load this might help: although the image cache is so small it might not.
Update 4: Curiosity got the better of me, it's been crappy cold weather and I hurt my foot (i don't know how) so I had another look at the complex version this morning ...
Cut a long story short, too many overheads, and although it isn't slow it isn't faster than just using 16 or 32 threads per feature test. Too many dynamic calculations which cannot be optimised out, and so on. It's around 9.5ms on my test case.
Structurally it's quite interesting however ....
1. Find out how many concurrent tests can be executed for stage 0, dequeue that many jobs from the work queue and copy them to a local work queue.
2. If we exceeded the job length, stop.
3. Work out how many jobs can be done for the current stage.
4. Process one batch of jobs.
5. Parallel sum the stage sum.
6. If it advances to the next stage, copy to a next-stage queue.
7. Go back to 4 unless finished the in queue.
8. If any are in the next-stage queue, copy them over, advance to the next stage.
9. Go back to 3 if we had any work remaining.
10. If we ran the full stage count, copy any work jobs remaining in the queue to the result list.
11. Go back to 1.
So each stage is fully processed in lock-step, and then advanced. The DEFAULT cascade starts with 9 feature tests, so it never has more than 7 items in the queue (7 jobs of 9 feature tests = 63 work items). As the stages get deeper the number of work-items assigned to solve the problem widens, up to a limit of 64 - i.e. the work item topology dynamically alters as it runs through the stages in an attempt to keep most work-items busy.
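The control flow of the loop described above can be modelled sequentially. This is only a sketch of the queueing logic: there are no barriers, no parallel sum and no real features, the stage test is a stand-in predicate, and all names are hypothetical.

```python
from collections import deque


def run_cascade(jobs, stage_lengths, passes, group_size=64):
    """passes(job, stage) -> bool. Returns the jobs that survive every stage."""
    work = deque(jobs)
    results = []
    while work:  # stop once the global work queue is exhausted
        # Dequeue as many jobs as stage 0 can test concurrently.
        first_width = max(1, group_size // stage_lengths[0])
        queue = [work.popleft() for _ in range(min(first_width, len(work)))]
        stage = 0
        while queue and stage < len(stage_lengths):
            # Wider stages leave room for fewer concurrent jobs.
            width = max(1, group_size // stage_lengths[stage])
            next_queue = []
            for start in range(0, len(queue), width):  # one batch at a time
                for job in queue[start:start + width]:
                    if passes(job, stage):  # stage sum cleared the threshold
                        next_queue.append(job)
            queue = next_queue  # survivors advance together, in lock-step
            stage += 1
        results.extend(queue)  # empty unless the jobs ran the full stage count
    return results


# Stand-in test: job j passes stage s when j is divisible by s + 2.
hits = run_cascade(range(100), [9, 12, 40], lambda j, s: j % (s + 2) == 0)
print(hits)
```

With a first stage of 9 features the local queue starts with at most 7 jobs and only ever shrinks, which matches the observation above.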
There's a lot of messy logic used to make sure every thread in the workgroup executes every barrier, and there are lots of barriers to make sure everything works properly (i'm using locals a lot to communicate global info like the stage and topology information). So the code runs on a CPU (i.e. I got the barriers correct), although very inefficiently.
As is often the case with GPUs, the simpler version works better even if on paper it is less efficient at filling the ALU slots. Although I haven't confirmed this is the case mathematically: apart from stage 0, the more complex method will also have un-even slot fillage - it's one of those discrete maths/Knuth style problems I simply give up on.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.