I got a fufut.
Rotated image from previous post with rotated motion blur. The image on the left is used to generate the convolution kernel.
So here's a screenshot of a 'blur tool thing' I came up with. Because of screen size I forced it to process only a 512x512 image, but even at 1024x1024 it does the convolution as fast as the mouse can send events (the raw GPU time for a 1024x1024 convolution is about 1200 µs per plane, excluding data conversion). I had previously written a separable convolution for OpenCL, and this is about on par with its processing time for a 63x63 kernel, but it isn't limited to separable convolutions or small kernels (a separable convolution can't do a rotated blur like the one above, for example). It does take somewhat more processing to build the kernel, since the kernel is the same size as the image, but that's something easily off-loaded to the GPU as well.
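The post doesn't spell out the method, but a kernel the same size as the image suggests the convolution is done in the frequency domain. Assuming that, here is a rough CPU-side sketch of the idea in plain Python (not the OpenCL code behind the screenshot): build an image-sized, rotated motion-blur kernel, multiply the two spectra, and transform back. All the function names (`fft`, `fft2`, `motion_kernel`, `fft_convolve`, `direct_convolve`) are made up for illustration, the image is assumed to be square with a power-of-two side, and the result is a circular (wrap-around) convolution.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2(img, inverse=False):
    """2D FFT: rows first, then columns. The inverse is scaled by 1/(w*h)."""
    rows = [fft(r, inverse) for r in img]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    out = [list(r) for r in zip(*cols)]  # transpose back to row-major
    if inverse:
        scale = len(out) * len(out[0])
        out = [[v / scale for v in r] for r in out]
    return out

def motion_kernel(size, length, angle_deg):
    """Image-sized kernel: a normalised line of `length` taps at `angle_deg`,
    centred on (0, 0) with wrap-around so the output is not shifted."""
    k = [[0.0] * size for _ in range(size)]
    a = math.radians(angle_deg)
    for i in range(length):
        t = i - (length - 1) / 2
        x = round(t * math.cos(a)) % size
        y = round(t * math.sin(a)) % size
        k[y][x] += 1.0 / length
    return k

def fft_convolve(img, kernel):
    """Circular convolution via the convolution theorem."""
    fi = fft2(img)
    fk = fft2(kernel)
    prod = [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(fi, fk)]
    return [[v.real for v in r] for r in fft2(prod, inverse=True)]

def direct_convolve(img, kernel):
    """Brute-force circular convolution, only here as a cross-check."""
    n = len(img)
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            s = 0.0
            for ky in range(n):
                for kx in range(n):
                    if kernel[ky][kx]:
                        s += kernel[ky][kx] * img[(y - ky) % n][(x - kx) % n]
            out[y][x] = s
    return out

if __name__ == "__main__":
    size = 8
    img = [[0.0] * size for _ in range(size)]
    img[3][3] = 1.0  # a single bright pixel gets smeared along the blur line
    blurred = fft_convolve(img, motion_kernel(size, 5, 30))
    for row in blurred:
        print(" ".join(f"{v:5.2f}" for v in row))
```

The point of the sketch is that the cost of the frequency-domain path doesn't depend on the kernel's shape or extent: a rotated 5-tap line and a full-image kernel go through exactly the same transforms and per-element multiply, which is why separability stops mattering.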