Bit of a dry post coming up, but when you consider I blew my Sunday on this it might make sense. Curiosity got the better of me today and I spent most of it playing with convolutions with OpenCL on a GPU - I had a pretty fast implementation but wanted to compare it with some other ideas I had.
As it turned out, the implementation I had was the fastest after all, although I tweaked it a tiny bit during the process.
For kernels at 3x3 (or 7x7 for 4-channel UBYTE images), a simple 2d implementation is very slightly faster than a more complex algorithm.
For non-separable convolution, a complex implementation which uses a rolling buffer is over 3x faster than a naive implementation, at least up to sizes of 31x31.
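For reference, the "naive" baseline can be sketched roughly as follows. This is a plain-Python CPU illustration, not the OpenCL kernel that was timed: the function name `convolve2d_naive` and the clamp-to-edge border handling are assumptions made for the example, and the kernel is applied without flipping (which makes no difference for symmetric kernels). The point is only that every output pixel re-reads its whole neighbourhood, which is the redundant memory traffic a rolling buffer tries to avoid.

```python
def convolve2d_naive(image, kernel):
    """Naive 2D convolution over a list-of-lists image.

    Assumptions for this sketch: square odd-sized kernel, clamp-to-edge
    borders, and no kernel flip.
    """
    height, width = len(image), len(image[0])
    ksize = len(kernel)
    radius = ksize // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            # Each output pixel touches ksize * ksize input pixels.
            for ky in range(ksize):
                sy = min(max(y + ky - radius, 0), height - 1)
                for kx in range(ksize):
                    sx = min(max(x + kx - radius, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

In a GPU version each (x, y) iteration would typically become one work-item, so neighbouring work-items end up fetching almost the same pixels from global memory.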
For separable convolution, my complex implementation is up to 2.5x faster than a naive implementation.
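As a reminder of why separable kernels are worth special treatment, here is a small plain-Python sketch (again a CPU illustration, not the OpenCL code). It assumes clamp-to-edge borders, and all the function names are made up for the example. A separable 2D kernel is the outer product of a vertical and a horizontal 1D kernel, so a row pass followed by a column pass gives the same result with far fewer multiply-adds per pixel.

```python
def clamp(i, lo, hi):
    """Clamp index i into the inclusive range [lo, hi]."""
    return max(lo, min(i, hi))

def convolve_rows(image, taps):
    """1D pass along X with clamp-to-edge borders (assumed for this sketch)."""
    height, width = len(image), len(image[0])
    radius = len(taps) // 2
    return [[sum(image[y][clamp(x + k - radius, 0, width - 1)] * taps[k]
                 for k in range(len(taps)))
             for x in range(width)]
            for y in range(height)]

def convolve_cols(image, taps):
    """1D pass along Y with clamp-to-edge borders (assumed for this sketch)."""
    height, width = len(image), len(image[0])
    radius = len(taps) // 2
    return [[sum(image[clamp(y + k - radius, 0, height - 1)][x] * taps[k]
                 for k in range(len(taps)))
             for x in range(width)]
            for y in range(height)]

def convolve_separable(image, h_taps, v_taps):
    """Row pass followed by column pass."""
    return convolve_cols(convolve_rows(image, h_taps), v_taps)

def multiply_adds_per_pixel(size):
    """Multiply-adds per output pixel: (full 2D kernel, two 1D passes)."""
    return size * size, 2 * size
```

For a 31x31 kernel, `multiply_adds_per_pixel(31)` gives 961 for the full 2D kernel against 62 for the two passes. The measured speed-up is much smaller than that ratio, which is consistent with memory access rather than arithmetic being the main cost.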
My separable convolution implementation reads 16x16 blocks of the image into local memory, and then each thread generates all of its results from local memory in one pass - e.g. for up to 7x7 convolution it reads 2x16x16 blocks, for up to 15x15 convolution it reads 3x16x16 blocks, and so on, i.e. you need the 16x16 data plus 'kernel radius' pixels on each side. For the Y convolution case it transposes the data while loading and saving, but the processing is otherwise identical. It also uses the trick of offsetting the odd rows of the data so that they avoid local memory contention in the cases where it would otherwise occur - e.g. when the number of blocks being read is even.
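The transpose idea can be illustrated on the CPU with a few lines of plain Python. This is only a sketch of the principle, not the OpenCL implementation: the function names and the clamp-to-edge borders are assumptions for the example. The column (Y) pass is done by transposing the data, running exactly the same row (X) code, and transposing back, so only one processing routine is needed.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]

def convolve_rows(image, taps):
    """1D pass along X with clamp-to-edge borders (assumed for this sketch)."""
    width = len(image[0])
    radius = len(taps) // 2
    return [[sum(row[min(max(x + k - radius, 0), width - 1)] * taps[k]
                 for k in range(len(taps)))
             for x in range(width)]
            for row in image]

def convolve_cols_via_transpose(image, taps):
    """Y pass expressed as: transpose, reuse the X pass, transpose back."""
    return transpose(convolve_rows(transpose(image), taps))
```

In the GPU version described above the transposes are folded into the load and store of the local memory block rather than being separate passes over the image.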
FWIW, for a 640x480 image on a GTX 480, a single-channel FLOAT 31x31 separable convolution takes about 190µs, versus 470µs for the naive version. For UBYTE it is 177µs vs 470µs. For a 4-channel image the timings are 413µs vs 916µs for FLOAT and 389µs vs 465µs for UBYTE. So larger (byte size) images gain more - presumably from the reduction in memory reads and lower cache loading.
Actually - yesterday I started working on a JOCL-based image processing library that I intend to drop on Google Code at some point - and this investigation was part of that. More should be forthcoming on that soon, although right now I just don't have enough put together for it to be much use.