I've been continuing to experiment with FFT code. The indexing for a general-purpose routine got a bit fiddly, so I went back to the Parallella board to see about some basics first.
The news isn't really very good ...
I was looking at a problem of size 2^20, which I was going to split into two passes: 1024 lots of 1024-element FFTs, and then another pass which does 1024 lots of 1024-element FFTs again, but on data strided by 1024. This should mean the minimum external memory accesses - two full read+write passes - with most of the work running on-core. There's no chance of keeping it all on core, and even then, two passes isn't much, is it?
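For reference, here is a minimal sketch of that kind of two-pass split, written in Python at a toy size rather than as board code. It follows the textbook decomposition of an N = n1*n2 transform into short transforms, and it rests on assumptions not spelled out above: the strided pass runs first, a twiddle multiply sits between the two passes, and a naive DFT stands in for the on-core 1024-point FFT. The names `dft` and `two_pass_fft` are made up for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for the short on-core FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def two_pass_fft(x, n1, n2):
    """DFT of length n1*n2 built from two passes of short transforms."""
    n = n1 * n2
    assert len(x) == n

    # Pass 1: n1 transforms of length n2, each over data strided by n1.
    rows = [dft(x[r::n1]) for r in range(n1)]

    # Twiddle factors between the passes: W_N^(r * k2).
    for r in range(n1):
        for k2 in range(n2):
            rows[r][k2] *= cmath.exp(-2j * cmath.pi * r * k2 / n)

    # Pass 2: n2 transforms of length n1, one per column of pass-1 output.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([rows[r][k2] for r in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = col[k1]
    return out


# Toy check: 16 points split as 4 x 4, compared against the direct DFT.
x = [complex(i % 7, -(i % 3)) for i in range(16)]
max_err = max(abs(a - b) for a, b in zip(two_pass_fft(x, 4, 4), dft(x)))
print("max error vs direct DFT:", max_err)
```

For the 2^20 case this would be `n1 = n2 = 1024`. Each pass reads and writes every element exactly once, which is where the two full read+write passes come from.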
But I estimate that a simple radix-2 implementation executing on only two cores will be enough to saturate the external memory interface. FFT is pretty much memory bound even on a typical CPU, so this isn't too surprising.
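A back-of-envelope version of that kind of estimate can be sketched as below. It assumes single-precision complex data (8 bytes per element) and the usual nominal count of 5*N*log2(N) floating-point operations for a radix-2 FFT. It deliberately leaves out any actual bandwidth or clock figures for the board, since none are given here.

```python
N = 1 << 20            # problem size from the post
BYTES_PER_ELEMENT = 8  # assumption: complex value as two 32-bit floats
PASSES = 2             # two full read+write passes over external memory


def traffic_bytes(n, passes=PASSES, elem=BYTES_PER_ELEMENT):
    """External memory traffic: every pass reads and writes each element once."""
    return passes * 2 * n * elem


def radix2_flops(n):
    """Nominal radix-2 operation count, 5 * n * log2(n), for n a power of two."""
    log2n = n.bit_length() - 1
    return 5 * n * log2n


traffic = traffic_bytes(N)
flops = radix2_flops(N)
print(traffic // (1 << 20), "MiB moved")      # 32 MiB moved
print(flops / traffic, "flops per byte")      # 3.125 flops per byte
```

Roughly three floating-point operations per byte of external traffic is not much work to hide the memory transfers behind, which fits the memory-bound picture.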