Just a quick note, I created a small JNI wrapper using JOCL to access the clAmdFft library from AMD, to see how it compared to the Apple FFT I ported to Java.
Bummer, the Apple FFT is still faster for my problem size - image processing. Kernel time on a 1024x1024 sp fft is ~0.5ms vs ~0.38ms for the Apple one (HD7970). And the apple one only requires 3 passes. Smaller and/or batches seems to scale linearly too.
All that (minor) frustration with the clMiXeDAbbrCamlCase API for nought.
Update: So I actually tried writing a 64-element parallel FFT to see how I could do ... well, not good. Apart from wasting a lot of time wondering why "no matter what I did it took 15ms" because I hadn't allocated a properly sized buffer (out of range access I guess), when I finally got it going it was about 10x slower than the Apple one. Blah. Actually the apple one was about the same speed (if not faster) than a simple memcpy (using float2).
I wasn't being very smart about it, I just broke a 64-element FFT into one-calculation steps and used shared memory to communicate the results. A big overhead was the index twiddling for which I just used lookup-tables, but even the shared memory communication was expensive.
I may have gotten something wrong anyway, the profiler said the apple fft code had many many fewer global read/write operations, and I couldn't work out why. Although I was trying to do it all in LDS in a single kernel.
Since I didn't verify the results - I was just seeing how the memory access patterns would work - I have no real idea whether it worked or not anyway. But I may have another play another day, I wrote some code which outputs the expressions required between two arbitrary "layers" of the calculation so I can use larger kernels than the radix-2 ones I was testing with.