Well I finally have a lowly OpenCL capable card to play with. Just an ATI HD 5770, chosen mostly for the power budget. Unfortunately that was still enough to turn an almost silent machine into a loud fan, and there's another card in there which almost completely covers the fan inlet on the graphics card, so it might have to be removed.
I have been playing with a simple bit of code to learn how the device accesses memory and processes jobs, and discovered a few things along the way. The code does a simple debayering, converting 5-channel data into 5 separate data planes.
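As a rough sketch of what that computes - a hypothetical plain-C reference for the data movement only, not the actual OpenCL kernel or the debayer arithmetic - the planar split is just a de-interleave:

```c
#include <stddef.h>

/* Hypothetical CPU reference: de-interleave nchan-channel packed
   pixels into separate planes.  The real kernel also does the
   debayer arithmetic; this only shows the data movement. */
static void split_planes(const unsigned char *src, unsigned char **planes,
                         size_t npixels, size_t nchan)
{
    for (size_t i = 0; i < npixels; i++)
        for (size_t c = 0; c < nchan; c++)
            planes[c][i] = src[i * nchan + c];
}
```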
vloadn seems to be something to avoid. It appears to treat the input as unaligned, even though the documentation implies it should be aligned. Perhaps I need an aligned attribute on the parameters too ... but an easier solution seems to be to change the source datatype to a vector type and just index it as an array.
Just by changing the code to use a vector array type rather than vload I got a 24x speedup(!).
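The difference boils down to what the compiler can assume about the pointer. These two kernels are hypothetical fragments, not the actual decode code - they just show the two load styles side by side:

```c
/* vloadn must assume the pointer is only uchar-aligned, so it may be
   compiled to byte-wise loads and a shuffle. */
__kernel void copy_vload(__global const uchar *src, __global uchar *dst)
{
    size_t i = get_global_id(0);
    uchar16 v = vload16(i, src);   /* offset counts uchar16 elements */
    vstore16(v, i, dst);
}

/* Indexing a uchar16 * promises 16-byte alignment, so the compiler
   can emit a single wide load and store. */
__kernel void copy_vec(__global const uchar16 *src, __global uchar16 *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];               /* single aligned 128-bit load/store */
}
```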
Some other less obvious results ... if I remove the output of one or two of the channels, the code runs nearly 50% slower. Running the same OpenCL code on the CPU (quad-core Phenom II something-or-other) via AMD's CPU backend is about 8x slower than the GPU. I wonder what hand-tuned SIMD code could manage, given they have comparable power profiles.
Splitting the job into multiple parts effectively - both 'horizontally' to allow coalesced memory access, and 'vertically' to allow greater concurrency - seems to be a bit of an art, and no doubt very architecture dependent. Unfortunately it is also critical to getting good throughput.
Horizontally I assign a work-item to each column of 8 output bytes (16 bytes of input), across 128 work-items. I think that should be an optimal memory access pattern.
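With those (illustrative) numbers, the addressing works out so that adjacent work-items touch adjacent, aligned chunks - the pattern the memory controller can coalesce into wide transactions:

```c
/* Illustrative index arithmetic for the horizontal split: work-item i
   in a row reads the 16 input bytes starting at i*16 and writes the 8
   output bytes starting at i*8, so consecutive work-items access
   consecutive, naturally aligned chunks. */
static unsigned long input_offset(unsigned long item)  { return item * 16; }
static unsigned long output_offset(unsigned long item) { return item * 8; }
```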
Vertically I'm sticking to powers of 2 since the algorithm needs it, but arranging the work-groups in sets matching the number of compute units (i.e. 10 on this card) seems to work better. I'm not really clear how the global/local work dimensions map to the hardware once you get beyond the trivial case of single jobs.
I'm not sure if I'm interpreting the disassembly correctly, but the processor appears to be more VLIW than SIMD: each of the 5 slots seems to execute independent instructions on independent data. I guess this should let it execute scalar code better, but it must come at a pretty significant cost in die space and power. I wonder if this is also why they still clock relatively slowly compared to something like the Cell.
My final code manages about 12000 decodes per second of a 1024x768 frame, which is more like what I was expecting - my first cut was doing about 400, which was obviously way out. I'm not sure whether using image accessors rather than arrays would be a win - it's a bit fiddly to fit them in with this code. It might be though, since I think you get format conversion 'for free' rather than requiring a bunch of shifts and fart arsing about.
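For reference, the image-accessor route would look something like this - a hypothetical fragment, not code I've actually benchmarked - where the sampler hardware does the unpacking:

```c
/* Hypothetical sketch of the image2d_t alternative: with a
   CL_RGBA/CL_UNSIGNED_INT8 image, read_imageui returns the channels
   already unpacked into a uint4, with no shifts or masks in the kernel. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void decode_img(__read_only image2d_t src,
                         __global uchar *dst, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    uint4 p = read_imageui(src, smp, (int2)(x, y));  /* unpacked for free */
    dst[y * width + x] = (uchar)p.x;                 /* e.g. one plane */
}
```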