Had a bit of a victory today - after a kick in the nuts or two. Finally got some of my OpenCL code running with the correct results at a reasonable clip.
I spent most of the day working out why the results were wrong - partially because of a minor bug or three, but mostly because all of the synchronisation primitives don't work when you call a kernel function from another kernel function (at least in the ATI sdk). Wish I had have known that to start with ...
I think it's roughly 100x faster than the original java or c code (although I should quantify it), so that's a pretty penny in the bank, and I think there's a bit more I can squeeze out of it - let alone using beefier hardware. One of the keys was to use a native format for most operations - I take the input data which is in a packed byte format and convert it to floats, and then operate on those. The other key is to use local memory as a programmed cache to reduce the load on global memory. And finally to utilise registers as much as possible - once i've loaded data from memory re-use the data repeatedly before needing to go back to memory or running out of registers. The OpenCL api also has some nice queuing and job management which makes it easy to let the CPU do other work whilst the GPU is busy, without having to synchronise every operation - which is the real mind killer. And it goes without saying that the data is loaded once to the graphics card memory and all operations operate there until I get a result out (converted to the format I need).
I still haven't managed to get the image datatypes to work but I will keep trying as t should fit this problem well (and nice to see that the JOCL guys were quick to implement the missing api's to support them). Using arrays is a bit of a pita tbh - i've had to split my work 'tile' into multiple slices, and keeping track of where each of the work units (threads) within the work group ('process') gets hairier than a hippies armpits. Using the texture units should let me remove all of the manual cache code and messy address arithmetic - although whether it executes faster is the real test.