Michael Zucchi

 B.E. (Comp. Sys. Eng.)


android (44)
beagle (63)
biographical (82)
business (1)
code (56)
cooking (29)
dez (6)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (414)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (216)
java ee (3)
javafx (48)
jjmpeg (67)
junk (3)
kobo (15)
linux (3)
mediaz (27)
ml (15)
nativez (3)
opencl (117)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
politics (7)
ps3 (12)
puppybits (17)
rants (134)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (2)
zedzone (13)
Sunday, 17 August 2014, 13:29

Another rotation and scale invariant feature descriptor?

Just had an idea whilst waking up this morning so I punched out a quick couple hundred lines of code before lunch.

I guess it works?

This is just the first output, without any tuning or much mucking about.

It uses LOG to determine the scale-invariant feature locations and guassian edge detectors to determine the rotation invariance. A local binary pattern is used for the feature descriptor.

Not sure if someone else has taken the same approach (the scale/rotation variance is a standard technique), possibly worth writing this one up if not.

Worth a beer at any rate. Cheers big ears.

Tagged ml.
Saturday, 16 August 2014, 13:53

A weak face detector?

First, here is the raw output from one of the haar cascades included in OpenCV (frontalface-alt) using the Viola & Jones object detection algorithm. This is a 20x20 face detector which requires 2135 feature tests via 4630 individual rectangles and approximately 200KB of constant data storage at runtime assuming optimal data packing.

It is being executed over a good number of steps over range of scales and tested at every pixel location of the given scale.

To be usable as a robust face detector further post-processing on the raw hit rectangles must be performed and then further work is needed to weed out false positives such as the NASA logo or hands which would still pass this process. Together with a wise choice of search scale.

The following is the raw output from a short training run of the fodolc algorithm over the same scale range. This is a 20x20 face detector which requires exactly 400 tests of a 4 bit local binary pattern (LBP) encoded image and 800 bytes of data storage at runtime. This is using a loose threshold to give comparable results to the haar cascade and to see what sort of features give false positives.

Unlike the haar cascade the output is simply a distance. This can simplify the post-processing to a threshold and basic non-minimal suppression trough detection.

As an example the following the raw output from the same fodolc detector but with a tighter threshold.

There are still some false positives but obviously it is something of an improvement - trivial trough detection would clean it up to a point exceeding the result of the haar cascade.

This is only a weak classifier taken from just the first 15 minutes or so of classifier training on 10 000 training images extracted or synthesised solely from 880 portrait photographs. Longer training always generates a better detector. A higher quality training set should generate a better detector. Training is by a very basic genetic algorithm.

In Java on a beefy workstation the execution time is roughly the same between the two algorithms but the fodolc algorithm can be implemented in efficient SIMD or OpenCL/GPU with very little effort for significant (order of magnitude) gains.

The following is the entire code of the classifier outside of the LBP conversion and of course the classifier table itself.

// Can you copyright something so simple??
public class Classify {
    private final static short[] face = { ... };

    public int score(byte[] lbp, int stride, int xi, int yi) {
        int score = 0;
        for (int y=0,i=0;y<20;y++)
            for (int x=0;x<20;x++,i++)
                score += (face[i] >>> lbp[x+xi+(y+yi)*stride]) & 1;
        return score;

Fast Face Detection in One Line of Code has a link to an unpublished paper with brief overview of the algorithm and local binary pattern used.

Please comment on this post if you think this is interesting. Or even if you're just as dumbfounded as I am that something so simple could possibly work.

The face detector was trained using images from the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office (USA).


Tagged code, hacking, ml.
Friday, 15 August 2014, 15:45

epiphany gpu and bits

Work was a bit too interesting this week to fit much else into my head so I didn't get much time to play with the softgpu until today.

This morning i spent a few hours just filling out a basic GLES2 style frontend (it is not going to ever be real GLES2 because of the shader compiler thing).

I had most of it for the Java SoftGPU code but I wanted to make some improvements and the translation to C always involves a bit of piss farting about fixing compile errors and runtime bugs. Each little bit isn't terribly big but it adds up to quite a collection of code and faffing about - i've got roughly 4KLOC of C and headers just to get to this point and double that in Java that I used to prototype a few times.

But as of an hour or two ago I have just enough to be able to take this code:

int main(int argc, char **argv) {
        int res;
        struct matrix4 m1, m2;

        res = fb_open("/dev/fb0");
        if (res == -1) {
                perror("Unable to open fb");
                return 1;

        pglSetTarget(fb_getFrameBuffer(), fb_getWidth(), fb_getHeight());

        glViewport(0, 0, 512, 512);


        glVertexAttribPointer(0, 4, GL_FLOAT, GL_TRUE, 0, star_vertices);
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_TRUE, 0, star_colours);

        matrix4_setFrustum(&m1, -1, 1, -1, 1, 1, 20);
        matrix4_rotate(&m2, 45, 0, 0, 1);
        matrix4_rotate(&m2, 45, 1, 0, 0);
        matrix4_translate(&m2, 0, 0, -5);
        matrix4_multBy(&m1, &m2);
        glUniformMatrix4fv(0, 1, 0, m1.M);

        glDrawElements(GL_TRIANGLES, 3*8, GL_UNSIGNED_BYTE, star_indices);



        return 0;

And turn it into this:

The vertex shader will run on-host and the code for the one above is:

static void vertexShaderRGB(float *attrib[], float *var, int varStride, int count, const float *uniforms) {
        float *pos = attrib[0];
        float *col = attrib[1];

        for (int i=0;i<count;i++) {
                matrix4_transform4(&uniforms[0], pos, var);
                var[4] = col[0];
                var[5] = col[1];
                var[6] = col[2];

                var += varStride;
                pos += 4;
                col += 3;

I'm passing the vertex arrays as individual elements in the attrib[] array: i.e. array[0] is vertex array 0 and the size matches that set by the client code. For output, var[0] to var[3] is equivalent of "glPosition" and the rest are "user set" varyings. The vertex arrays are being converted to float1/float2/float3/float4 before it being called (actually only GL_FLOAT is implemented anyway) so they are not just the raw arrays.

I'm doing it this way at present as it allows the draw commands to iterate through the arrays in the presumably long dimension of the number of elements rather than packing across the active arrays. It can also allow for NEON-efficient shaders if the data is set-up in a typical way (i.e. float4 data) and because all vertices are processed in a batch.

For glDrawElements() I implemented the obvious optimisation in that it only processes the vertices indexed by the indices array and only once per unique vertex. The processed vertices are then expanded out using the indices before being passed to the primitive assembler. So for the triangular pyramid i'm using 8 input vertices to generate 24 triangle vertices via the indices which are then passed to the primitive assembler. Even a very simple new-happy prototype of this code in my Java SoftGPU led to a 10% performance boost of the blocks-snake demo.

But I got to the point of outputting a triangle with perspective and thought i'd blog about it. Even though it isn't very much work I haven't hooked up the epiphany backend yet, i'm just a bit too bloody tired and hungry right now. For some reason spring means i'm waking up way too early, not sure why i'm so hungry after a big breakfast, and the next bit has been keeping my head busy all week ...


I've been playing quite a bit with my object detector algorithm and I came up with a better genetic algorithm for training it - and it's really working quite well. Mostly because the previous algorithm just wasn't very good and tended to get stuck in a monoculture due to the way it pooled the total population rather than separating the generations. In some cases I'm getting better accuracy and similar robustness as the viola & jones ('haarcascade') detectors, although i haven't tested it widely..

In particular I have a 24x16 object detector which requires only 768 bytes of classifier data (total) and a couple of lines of code to evaluate (sans the local binary pattern setup which isn't much more). It can be trained in a few hours (or much faster with OpenCL/GPU) and whilst not as robust as little as 100 positive images are enough to get a usable result. The equivalent detectors in OpenCV need 300K-400K of tables - at the very least - and that's after a lot of work on packing them down. I'm not employing boosting or validation set feedback yet - mostly because I don't understand it/can't get it to work - so maybe it can be improved.

Unlike other algorithms i'm aware of every stage is parallel-efficient to the instruction level and I have a NEON implementation that classifies more than one pixel per clock cycle. I may port it to parallella, I think across the 16 cores I can beat the per-clock performance of NEON but due to it's simplicity bandwidth will be the limiting factor there (again). At least the classifier data and code can fit entirely on-core and leave a relative-ton of space for image cache. It could probably fit into the FPGA for that matter. I might not either, because I have enough to keep me busy and other unspecified reasons.

Tagged hacking, ml, parallella.
Thursday, 07 August 2014, 20:19

On my god, it's full of local binary patterns!

I recently had need to look into feature descriptors. I've previously played with SURF and looked into others but I wasn't really happy with the complexity of the generator and needed something Java anyway.

A small search turned up FREAK (why the silly 'catchy' acronym names? Maybe it started with S-USANs?) which looked doable so I had a bit of a play. There is code available but it's OpenCV and C++ and the version I saw just wasn't very good code. I wrote up my own because I had some different needs for what I was looking at and porting the C++/OpenCV was going to be a pain.

I guess they work as advertised but i'm not sure they're what I want right now. I tried porting the AGAST detector as well but it wasn't really getting what I was after - i'm after specific features not just 'good features'.

The paper does include this interesting diagram though:

Although the paper doesn't reference them this diagram is pretty much a description of local binary patterns. The FREAK descriptor itself is just a very long local binary pattern with optional orientation normalisation.

Perhaps more interestingly is that this specific diagram could just as well be a description for how my fast object detector works. It is effectively a direct implementation of this diagram.

Assuming the above diagram is representative of human vision I guess one could say that the whole of visual reality is made of local binary patterns.

Tagged ml, philosophy.
Wednesday, 08 January 2014, 21:14

Fast Face Detection in One Line of Code

                  ****    ****    ****
                  *  *    *  *    *  *
                  ****    ****    ****

                  ****    ****    ****
                  *  *    *  *    *  *
                  ****    ****    ****

                  ****    ****    ****
                  *  *    *  *    *  *
                  ****    ****    ****

(blogger is broken: this is supposed to be a pic)

Based on the work of the last couple of weeks I've written up a short article / paper about the algorithm and created a really basic android demo. I think it's something novel and potentially useful but i'm not sure if i'm just suffering from an island effect and it's already been tried and failed. I tried to find similar research but once outside of the software engineering realm the language changes too much to even know if you're looking at the same thing when you are. Statistics gives me the willies.

Since I did this at home and am not aquianted with the academic process I didn't know what else to do. None of my peers, aquaintances or contacts do similar work for a hobby.

I have created a basic 1990s style `home' page on my ISPs web server to store the paper and application and anything else I may want to put there. This may move in the future (and perhaps this blog will too).

And yes, it really does detect faces in a single line of code (well, significant code) - and we're talking C here, not apl - and on SIMD or GPU hardware it is super duper fast. I haven't even looked at optimising the OpenCL code properly yet.

I wasn't going to but I ended up creating an optimised NEON implementation; I kinda needed something to fill the timing table out a bit in the article (at the time; it filled out afterwards) but I guess it was worth it. I also wrote up a NEON implementation of the LBP code i'm using this afternoon and because it is only 4 bits it is quite a bit faster than the LBP8,1u2 code I used last time, although in total it's a pretty insignificant part of the processing pie.

Now perhaps it is time for summer holidays.

And just to help searchers: This is an impossibly simple algorithm for a very fast image classifier and face detector which uses local binary patterns (LBP) and can be implemented directly using single instruction multiple data (SIMD) processors such as ARM/NEON and scales very well to massively multi-core parallel processors including graphics processing units (GPU) and application processing units (APU). OpenCL, CUDA, AMD, NVidia.

Tagged android, hacking, ml, opencl, parallella.
Sunday, 05 January 2014, 21:52

Further beyond the ROC - training a better than perfect classifier.

I added a realtime plot of the population distribution to my training code, and noticed something a little odd. Although the two peaks worked themselves apart they always stayed joined at the midpoint. This is not really strange I guess - the optimiser isn't going to waste any effort trying to do something I didn't tell it to.

So with a bit of experimentation I tweaked the fitness sorting to produce a more desriable result. This allows training to keep improving the classifier even though the training data tells it it is 'perfect'. It was a little tricky to get right because incorrect sorting could lead to the evolution getting stuck in a local minimum but I have something now that seems to work quite well. I did a little tuning of the GA parameters to speed things up a bit and added a bit more randomisation to the mutation step.

The black line is the ROC curve (i.e. it's perfect), green is the positive training set, red is the negative. For the population distrtribution the horizontal range is the full possible range of the classifier score, and vertically it's just scaled to be useful. The score is inverted as part of the ROC curve generation so a high score is on the left.

The new fitness collator helps push the peaks of the bell curves outwards too moving the score distribution that bit closer to the ideal for a classifier.

The above is for a face detector - although I had great success with the eyes I wanted to confirm with another type of data. Eyes are actually a harder problem because of scale and distinguishing signal. Late yesterday I experimented with creating a face classifier using the CBCL data-set but I think either the quality of the images is too low or I broke something as it was abysmal and had me thinking I had hit a dead-end.

One reason I didn't try using the Color FERET data directly is I didn't want to try to create a negative training set to match it. But I figured that since the eye detector seemed to work ok with limited negative samples, so should a face detector so I had a go today. It works amazingly well considering the negative training set contains nothing outside of the Color FERET portraits.

Yes, it is Fantastic.

I suspect the reason the Color FERET data worked better is that due to the image sizes they are being downsampled - with the same algorithm as the probe calculation. So both the training data and test data is being run through the same image scaling algorithms. In effect the scaling is part of the LBP transform on which the processing runs.

This is using a 16x16 classifier with a custom 5-bit LBP code (mostly just LBP 8,1).

The classifier response is strong and location specific as can be seen here for a single scale. This detector here is very size specific but i've had almost as good results from one that synthesized some scaling variation.

I couldn't get the young klingon chick to detect no matter what I tried - it may just be her pose but her prosthetics do make her fall outside of the positive training set so perhaps it's just doing it's job.

Tagged hacking, ml, opencl.
Sunday, 05 January 2014, 06:12

Beyond the ROC

I mentioned a couple of posts ago that i was hitting a wall trying to improve the classifier using a genetic algorithm because the fitness measure i'm using reached 'perfect' ... well I just worked out how to go further.

Here is a plot of the integral of the population density curve (it's just the way it comes out of the code, the reader will have to differentiate this in their head) after 400 and 50K generations of a 16x16 classifier. I now have the full-window classifier working mostly in OpenCL so this only took about 20 minutes.

Although a perfect classifier just has a dividing line separating the two populations, it is clear that these two (near) perfect classifiers are not equal (the above plot was generated from a super-set of the training data, so are not perfect - there should be no overlap at the base of the curves). The wider and deeper the chasm between the positive and negative population, the more robust the classifier is to noise and harder to classify images.

400 generations is the first time it reached a perfect ROC curve on the training data. I just let it run to 50K generations to see how far it would get and although most of the improvement had been reached by about 10K generations it didn't appear to encounter an upper bound by the time I stopped it. Progress is quite slow though and there is probably cause to revisit the genetic algorithm i'm using (might be time to read some books).

This is a very significant improvement and creates much more robust detectors.

Because the genetic algorithm is doing the heavy lifting all I had to do was change the sorting criteria for ranking the population. If the area under the ROC curve is the same for each individual then the distance between the mean positive and mean negative score is used as the sort key instead.

The Work

So i'm kind of not sure where to go with this work. A short search didn't turn up anything on the internets and recent papers are still mucking about with MP-LBP and integral images on GPUs which I found 2 years ago are definitely not a marriage made in heaven. The eye detector result seems remarkable but quite a bit of work is required to create another detector to cross-check the results.

The code is so simple that it's effectiveness defies explanation - until the hidden maths is exposed.

I started writing it up and I've worked out where most of the mathematics behind it come from and it does have a sound basis. Actually I realised the algorithm is just an exisiting common algorithm but with a single specific decision causing almost all of the mathematics to vanish through simplification. I first looked at this about 18 months ago but despite showing some promise it's just been pretty much sitting idle since.

Tagged hacking, ml, opencl.
Friday, 03 January 2014, 16:29

eye detector again

I was playing around with generating ROC curves for various algorithms - first I was trying to determine whether some new LBP codes I came up with worked better in certain cases (not yet, but they have promise, ok i kept playing as below: maybe they do already). For this purpose I implemented a basic LBP histogram matching algorithm, as I figured having baseline would provide a decent comparison.

This just took all eye images as one class and all non-eye images as another and then measured the distance of them all to the classifier. I'm using 20x20 images and the classfier creates 16 histograms from 8x8 tiles overlapping by half in each direction. It isn't terribly valid as a measure because it is using the same set for training and for testing but i'm only after the relative results. There are 10560 positive training examples and 9940 negative examples in the data set, all taken (and synthesised) from Color FERET fa partition. I'm only using abs difference as the distance measure which isn't ideal, and perhaps the 8x8 tile area isn't a large enough sample to build a decent histogram - in short, these histogram results are not tuned for performance.

The Zucchi LBP is using some directional differential filters in order to build the LBP code bit-planes. The idea is that they're tunable to the problem, should be more robust to noise, and generally produce a more noise-free and accurate LBP code. They are much more expensive to calculate however. This tunability offers another possibility for GA optimisation. I guess this is something else I should really write up properly.

Then I revisited the classifier I came up with late 2012 (the one I put into a small android app). Here the data and LBP transform are the same, only the algorithm has changed.

Blew me away a bit, I really didn't expect that much of an improvement. The 'Zucchi' algorithm requires a tiny fraction of the storage and fewer cpu instructions to process. Training takes more memory but fewer instructions per pixel. The original goals in designing it were for it to be SIMD parallelisable (at execution time) so it optimises very well. The first algorithm should be more robust to registration errors, but on this test the differences are remarkable and in many cases you don't want/need such robustness.

Then I wondered if this classifier could be trained using GA instead. Here I truncated the training set to 8192+8192 images to match the OpenCL GA algorithm, so the results are slightly more valid.

Somewhat surprisingly given the size of the problem (each individual is ~6K bits) it works extremely well and attains a result in only 300 generations. Actually the problem I have is that it is almost too good(tm) - before doing much of a search of the problem space the utility of the fitness measure has been exhausted and the population no longer evolves. One might also note that this result is for the relatively information poor 4-bit LBP codes, and in all cases is only a single-stage classifier.

Given the results I should probably look at the data that isn't classifying properly at each end, some of it may just be poor / incorrectly labelled data. Update: I had a look - the false negatives at the upper end are from synthesised images some of which are over-rotated or rotated off-centre. A couple of the false positives are samples taken a bit too close to the eye although most are legitimate and from around under the nose to under the mouth.

The 8-feature test as described in the previous post doesn't fare so well here - even though it is an effective eye detector in practice and requires about 1/3 of the processing.

Here the differential LBP code is working very well too. And it works extremely well as an eye detector in practice.

For this post I wasn't going to run the differential LBP codes through the GA algorithm based on the first plot - it didn't seem that good. But the final plot and this eye detector heat map somewhat validates the initial reasoning behind the algorithm - reduced noise. See the previous post.

Looking closer at the LBP code image I now see there isn't enough similarity between the left and the right eye regions in the LBP domain to try to create a classifier which resolves both at the same time. What I might do instead is create separate classifiers but not include the other eye in the negative training set - this should improve the results of both. I guess i'm near the point where spectacles should be added to the mix.

I've started writing some of this up but i'm a bit lazy to put the effort in required to do a good job. Apart from the fact that i'm only doing this for self education and entertainment purposes in my spare time right now, i've little experience writing papers. I'm purportedly on holidays for that matter - technically i'm even unemployed but literally i'm just between contracts.

Update: Oops, looks like I made a bit of a mistake. I thought I was using a decent resampler whilst generating the scaled training data, but it wasn't - so aliasing was creating noise in the small-scale images and thus in the LBP codes. This is probably why the derivative LBP code managed to win out in the end. This mistake will lead to the plots showing poorer performance than they should be - but i'm not going to re-run the plots right now.

Seeing that the classifier was doing such a good job, I thought i'd push it a bit harder. I have a couple of other ideas too but the first one I tried was to create a classifier for a very small region. This has very large implications on execution time. So whilst working on an 8x8 classifier I noticed the aliasing and realised the scaling problem. I still have an issue with pixel-boundary sampling when extracting the normalised eye so the data is of a lower quality - but it has to deal with these problems in the input data too.

An 8x8 detector is absurdly small. Here's an example of a decent training image from the training set:

Testing on Lenna's photograph gives quite a few false positives - but out of 181 244 tests, 30 or even 100 false positives isn't too bad for a single-stage classifier. An 8x8 classifier requires 16x less total processing from the inner loop vs a 16x16 classifier to detect the same sized features at the same input scale - so even a modest true negative rate could be a big win if it is reliable. And this doesn't count the image scaling and LBP operator both of which also scale at an N*N rate.

This is just one run of a single-pass 8x8 detector generated via a GA using left and right eye images as the positive set. The threshold was chosen roughly. There are a total of 43 hits out of a total of 181 244 probes, with a false acceptance rate of 0.022%. It took about an hour to generate this detector via pure-Java code on a 6-core intel machine.

This is one run of a single-pass 20x20 detector generated via GA using left-eye images as the positive set. It is being executed over a wide range of scales from 0.2x to 1.5x of the original (512x512) source. There are a total of 22 hits out of a 5 507 592 probes, with a false acceptance rate of 0.00016%. It took about 3 minutes to generate the classifier using the same pure-Java code on the 6-core intel at which point evolution stopped due to having a perfect classifier on the training data set. It took under 200 generations on the improved input data vs the numbers in the top-half of this post.

Update 2: I wasn't going to, but I compared it to the raw results from the left-eye cascade from OpenCV - i remembered it not being particularly good but wanted to quantify it. This is showing the raw hits as exit the cascade and do not include grouping which would remove most of the false positives. There are 313 total hits.

For comparison purposes I re-ran the 20x20 classfier with settings that result in about the same number of scans at roughly the same scales. Here there are 64 total hits, although I just chose an arbitrary threshold value (a fairly loose one however). Haar cascades only provide a binary true/false result and have no threshold that can be adjusted - they also provide no quality indicator so there is no way to choose a peak match and one must resort to error prone averaging and merging.

This code executes about 30% faster than the haar cascade although perhaps the looseness of the haar detector would allow it to be run on fewer locations at the cost of accuracy. However; accuracy is often important if not critical. It should be possible to train a better haar-cascade but I haven't had any luck geting even close to the ones supplied in OpenCV. Likewise this is still only a single-stage classifier and can still be improved (I think) quite easily.

One last data point - the 8x8 classifier over similar scales executes 10-20x faster. These classifiers also scale much much better on parallel hardware - all the way from SIMD to multi-stream GPUs, so this 30% figure is a little misleading. It would also take only a tiny bit of FPGA logic ...

Tagged hacking, ml, opencl.
Older Posts
Copyright (C) 2018 Michael Zucchi, All Rights Reserved.Powered by gcc & me!