Michael Zucchi

B.E. (Comp. Sys. Eng.)

also known as zed
& handle of notzed

# LU Decomposition 2

So I finally needed to get that matrix solver working for work ... and revisited the LU decomposition code I had played with a few weeks ago.

It turns out the code was broken: first a simple typo, and then a deeper problem: I hadn't noticed that the loop which multiplies out the rows depends on results from earlier iterations of the same loop. Confusingly, this worked just fine with a CPU driver, since the work items within a work-group are executed serially there.

So I had to come up with another solution.

First I wrote a bit of code that just printed out the calculations actually being performed.

For column 0 this amounted to a no-op.

For column 1, it was some edge cases, then something like:

```
for i : 0 to n
    col 1 [ i ] -= col 0 [ i ] * col 1 [ 0 ]
```

For column 2, edge cases plus:

```
for i : 0 to n
    col 2 [ i ] -= col 0 [ i ] * col 2 [ 0 ] + col 1 [ i ] * col 2 [ 1 ]
```

And so on.

As the highlighted terms (col 1 [ 0 ] in the first case; col 2 [ 0 ] and col 2 [ 1 ] in the second) are themselves results of a previous iteration of the same loop, each column result cannot be calculated independently.
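
As a serial reference, the general update for column k can be written out as follows; this is a plain C sketch of the recurrence above (my reconstruction for illustration, not the actual kernel):

```c
enum { N = 6 };

/* Serial reference: col k[i] -= sum over j < min(i,k) of col j[i] * col k[j].
 * When i reaches a row j < k, the inner loop rewrites a[j][k] (col k[j]);
 * every later iteration of i then reads that freshly written value.
 * That read-after-write across iterations is the loop-carried dependency
 * that breaks a naive one-thread-per-row parallel version. */
static void update_column(double a[N][N], int k)
{
    for (int i = 1; i < N; i++)
        for (int j = 0; j < k && j < i; j++)
            a[i][k] -= a[i][j] * a[j][k];
}
```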

Since the amount of calculation is small, using the shared memory to propagate the partial results didn't seem viable, so instead I simply calculate the required previous values manually for each column for all threads. I fiddled with the code I had which printed out the calculations and turned it into a code generator to expand all loops: it's only a 6x6 matrix so it isn't very much code. For the edge cases I use some logic that zeros out the multiplicands so the entire algorithm is completely branchless.
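
The generator itself only needs a few loops. This C sketch is a reconstruction of the idea, not the original tool; it emits the unrolled, branchless update for one column in the style of the kernel code below (the `getlu`/`putlu`/`cid` names follow the generated code, and barriers are left out for brevity):

```c
#include <stdio.h>

/* Write the unrolled update for column k of a 6x6 in-place LU into buf.
 * Returns the number of characters written. */
static int emit_column(char *buf, size_t size, int k)
{
    int n = 0;
    /* tmp_j holds col k[j] with the earlier iterations' updates applied */
    for (int j = 0; j < k; j++) {
        n += snprintf(buf + n, size - n, "tmp%d = getlu(lu, %d);\n",
                      j, j * 6 + k);
        for (int m = 0; m < j; m++)
            n += snprintf(buf + n, size - n,
                          "tmp%d -= getlu(lu, %d) * tmp%d;\n",
                          j, j * 6 + m, m);
    }
    /* zero the multiplicands for the edge cases instead of branching */
    for (int j = 0; j < k; j++)
        n += snprintf(buf + n, size - n, "tmp%d = cid >= %d ? tmp%d : 0;\n",
                      j, j + 1, j);
    n += snprintf(buf + n, size - n, "v = getlu(lu, cid * 6 + %d);\n", k);
    for (int j = 0; j < k; j++)
        n += snprintf(buf + n, size - n,
                      "v -= getlu(lu, cid * 6 + %d) * tmp%d;\n", j, j);
    n += snprintf(buf + n, size - n, "putlu(lu, cid * 6 + %d, v);\n", k);
    return n;
}
```

Running it for k = 2 reproduces the body of the column 2 code shown next.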

For example, column 2 is calculated using:

```
tmp0 = getlu(lu, 2);                // col 2 [ 0 ]
tmp1 = getlu(lu, 8);                // col 2 [ 1 ]
tmp1 -= getlu(lu, 6) * tmp0;        // the earlier iteration's update, done manually
tmp0 = cid >= 1 ? tmp0 : 0;         // zero the multiplicands for the edge cases
tmp1 = cid >= 2 ? tmp1 : 0;         //  (keeps the code branchless)
v = getlu(lu, cid * 6 + 2);
v -= getlu(lu, cid * 6 + 0) * tmp0;
v -= getlu(lu, cid * 6 + 1) * tmp1;
barrier(CLK_LOCAL_MEM_FENCE);
putlu(lu, cid * 6 + 2, v);
barrier(CLK_LOCAL_MEM_FENCE);
pivotexchange(lu, piv, 2, cid);
barrier(CLK_LOCAL_MEM_FENCE);
```

Where `cid` is the column id (`get_local_id(0) % 6`). I use macros to access the array since I am also flattening it out to avoid (or at least reduce) bank conflicts. The barriers are needed since some of the rows span wave-fronts; in any event they only add around 3% to the processing time.
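
The post doesn't show the actual macros, so purely as an illustration, the flattening could look something like this, with `mat` being the matrix index within the work-group (`get_local_id(0) / 6`); the padded-stride layout is a guess at one plausible scheme, not the real code:

```c
/* Hypothetical layout -- an assumption, not the original getlu/putlu.
 * Each matrix in the work-group gets a padded slot in local memory:
 * bumping the per-matrix stride from 36 to 37 floats (an odd number)
 * is a common trick to spread accesses to the same element of
 * different matrices across more banks. */
#define MAT_STRIDE 37                       /* 6*6 + 1 padding float */
#define getlu(lu, i)    ((lu)[mat * MAT_STRIDE + (i)])
#define putlu(lu, i, v) ((lu)[mat * MAT_STRIDE + (i)] = (v))
```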

I've had to write code more than once to understand the data pattern of an algorithm: multi-dimensional array access is bad enough to visualise without adding multi-dimensional processing elements as well. In some cases I've managed to remove huge amounts of redundant calculations this way - matlab seems to encourage code that has very poor data flow for example.

I otherwise stuck to the same thread topology: 6 threads work on each 6x6 matrix together, and 192 threads in the local work group for a total of 32 matrices.

For 4096 solutions of the same 6x6 matrix (i.e. best-case branch divergence), it takes 26µs on the GTX 480. That seems reasonable enough to me. Although I still have a small number of bank conflicts, I think this result is good enough for an early minute for today.

Tagged hacking, opencl, socles.