# oooh fast

Well I had a frustrating yet productive day today following on from yesterday's progress.

I ported the Jacobi SVD code to OpenCL and managed to paralleise most of it. I'm using 9 threads per SVD column which lets me paralleise all of the rotations. The code itself is somewhat simpler than the LU decomposition I worked on earlier, and it parallelises more easily. Actually having 9 threads also allows me to implement a 3x3 matrix multiply in 4 lines of code as well so it all fits nicely. I'm using 63 threads per local work group so the few times I do need barriers are just compiled out anyway. I also removed some of the code I don't need: e.g. fixing up the singular values and the U matrix at the end. The only bit which isn't totally parallel is the error calculation of the columns, but even there I was able to utilise a parallel sum to help.

I'm not really sure of the numerical stability; it works for my test cases.

So yeah, it's pretty fast - I'm currently doing batch sizes of 2000 homographies, and the SVD calculation and generation of the homography matrix is taking about 3.5-4ms - somewhat better than the 200ms or so from the CUDA RANSAC code (although that does a bit more work as well). I'm still doing the randomising and setup of the input matrix on the CPU, but that's the next candidate to move to the GPU. The GPU code to check for inliers is in the range of a 140uS as well. Once I re-arrange the code a little I will be able to keep all the data on the GPU - e.g. from SURF generation all the way through to final homography matrix.

I had thought real-time was going to be impossible, but now i'm not so sure.

The 3x3 matrix multiply fell out very simply - having 9 threads already assigned to each column of the SVD meant I had 1 thread per 3x3 matrix cell as well. I haven't had to look too much into matrix stuff on the GPU so far, so it was satisfying that such a neat solution presented itself when I did.

void mult3x3_gl(uint lid, uint cid, uint ar, uint ac, global float *a11, local float *b11, local float *lwork, local float *r11) { lwork[lid + LSTRIDE * 0] = a11[lid - cid + ac + 0] * b11[lid]; lwork[lid + LSTRIDE * 1] = a11[lid - cid + ac + 3] * b11[lid]; lwork[lid + LSTRIDE * 2] = a11[lid - cid + ac + 6] * b11[lid]; barrier(CLK_LOCAL_MEM_FENCE); r11[lid] = lwork[lid - ac + LSTRIDE * ac + 0] + lwork[lid - ac + LSTRIDE * ac + 1] + lwork[lid - ac + LSTRIDE * ac + 2]; barrier(CLK_LOCAL_MEM_FENCE); }

Here lid = local id, ac = lid % 3, cid = lid % 9 (base of matrix for this work item), LSTRIDE=64. With a worksize of 63 it means this does 7 3x3 multiplies at once and the barriers vanish.

It's nice when everything totally parallelises and there are no special cases.

The most frustrating part in all of this was getting the addressing right. For the SVD I came up with a complex addressing scheme where 9x9 matrices are stored in groups of 7 side-by-side. A few more grey hairs and a hoarse voice over trying to get that to work.

Also spotted a strange problem with the AMD driver: If I use printf, then I cannot use a cached binary or it simply crashes. If I just add a space to the code to force a re-compile (my cache uses the whole source as the key) it works ok.