SURF, RANSAC, OTHERALLCAPSSTUFF
With the prototype out of the way I'm back to looking at algorithms again, at least for a couple of weeks. It seemed a good opportunity to finally try and nut-out RANSAC for building homographies for mapping between images. I've previously looked into it but got no-where because it was all totally matlabotomised code and difficult to decipher.
I thought i'd start with this CUDA RANSAC implementation since at least it had all the right structure in place.
Of course, then i needed to get SURF going ... which I had previously played with (half-heartedly) and failed, but now with clsurf around, I just ported that to JOCL (which was thankfully a simple task as the code structure is similar to the way I write stuff). And that pretty much 'just worked', so all good so far.
Porting the CUDA RANSAC code wasn't much work, but I wasn't getting great results, nor great preformance so I ported that to Java to gain a deeper understanding. After a bit of searching I found EJML which I used for the SVD this time just to make sure.
Another implementation ...
I still wasn't grokking it - although by now RANSAC itself was starting to make some sense - so I kept searching. I found this interesting homework solution which had a complete, readable, and well explained RANSAC algorithm. At last. So finally i was getting reliable results - a bit slow mind you - but reliable if I let it run long enough.
One issue I had is that I seemed to need a great number of iterations to get a quality result - several 10s of thousands. i.e. a bit slow. Something like 20s for one of the data sets I had. I found out that much of the time was taken up with in-lier check, which is quite simple and although it processes many points should've been faster. It seems the E of efficient in EJML doesn't quite stretch to small matrices. Hand-coding that, and simplifying the test to only check the forward transform sped it up considerably. With a bit more tweaking the test-case is down to around 1s - for single-threaded java code.
I also found this site which has a bunch of matlab scripts and one is for a more complex version of the homography calculator which uses less FLOPs (I started with the paper 'Computation of Homographies' by Harker and O'Leary - but for whatever reason the site I downloaded it from seems to have removed it since). That seemed worth looking into, so I created an implementation of that as well. It didn't seem much faster at first but on further investigation it seemed worth looking into. It gets most of it's speed from running two smaller SVD's (4x2 and 8x3) rather than one larger one (8x9).
I was thinking at this point I might just forget about the SVD stuff on a GPU - i can find other work for it to do and the SVD code I was looking at (in EJML) is very complex. But with the fixes to the inlier check now it was taking about 50% of the time; which meant that even with multi-threading I probably couldn't keep the GPU very busy with in-lier checks. My first attempt at this generated a single result on the CPU then double-buffered the matrices to the GPU for it to perform the inlier check. Unfortunately the driver overhead was too great; I got it down to about 4s (for 10K iterations), but most of that time was in the driver synchronisation. After thinking about it a bit more I was going to do a whole batch instead, but then I did some more profiling and realised it was only going to give me at best a 2x performance boost, so I thought i'd go for some higher-hanging fruit first.
So I revisited the thought of creating a fully-GPU implementation. So as it happens I ended up back with the GSL version of the SVD code which was in the CUDA_RANSAC stuff. I dropped that back to Java and got it working. I noticed the code almost always does operations on columns, and indeed, copies the columns around a couple of times just for good measure. So I transposed the data and operate on rows instead: which means I can parallelise some of the internal loops as well. With a few more tweaks this is now the fastest version, 50K iterations take about 0.85s.
I'm still working on implementing a version of the 'optimised' homography algorithm at the moment, as it needs some more work - e.g. an implementation of
pinv. But given I have the basic version working I should probably try to port that to OpenCL first, to see if I will get much benefit.
I probably need to investigate trying to reduce the iterations required. At the moment I am doing a simple nearest-neighbour match using an L2 measure, so this could probably be improved (but given results as below, it is doing a fair job).
This is a display of the inliers. There are 249 inliers from 571 putative matching feature points.
This is the result after running clsurf to find feature points, running a sum of squares match against the feature points to find the putative matches, and then running RANSAC for 50K iterations to find the homography.
SURF and the nearest neighbour match are running on the GPU (in ~50ms total) with the RANSAC entirely on the CPU in single-threaded java - in about 0.95s for the 'unoptimised' homography calculator.
This is the image after the homography is applied to each corresponding picture. i.e. the source right-hand image has the inverse homography applied to generate the output left-hand image. And visa versa.