FFT(N=2^20) parallella/epiphany, preliminary results
I finally (finally) got the fft routine running on the epiphany. It's been a bit of a journey mostly because each time I made a change I broke something and some of the logarithmic maths is a bit fiddly and I kept trying to do it in my head. And then of course was the days-long sojourn into calculating sine and cosine and other low-level issues. I just hope the base implementation I started with (radix-2) was correct.
At this point i'm doing synchronous DMA for the transfers and I think the memory performance is perhaps a little bit better than I thought from a previous post about bandwidth. But the numbers still suggest the epiphany is a bit of an fpu monster which has an impedance mismatch to the memory subsystem.
The code is under 2.5K in total and i've placed the code and stack into the first bank with 3 banks remaining for data and twiddle factors.
First the timing results (all times in seconds):
rows columns noop fft ffts radix-2 1 1 0.234 0.600 2 1 0.379 1 2 0.353 4 1 0.195 0.302 1 4 0.233 0.276 2 2 0.189 0.246 4 4 0.188 0.214 arm - 1.32 0.293 2.17
The test is a single out-of-place FFT of 2^20 elements of complex float values. The arm results are for a single core.
- This just performs the memory transfers but does no work on the data.
- The fft is performed on the data using two passes of a radix-1024 step. The radix-1024 step is implemented as 5 passes of radix-4 calculations. Twiddle factors are calculated on-core for each radix-4 pass.
The first outer pass takes every 1024th element of input, performs a 1024 element FFT on it, and then writes it back sequentially to memory (out-of-place).
The second outer pass takes every 1024th element, performs a 1024 element FFT on it but with a phase shift of 1/(2^20), and then writes it back to the same memory location they came from (in-place).
- Using the Fastest Fourier Transform in the South on an ARM core. This is a code-generator that generates a NEON optimised plan.
- A out-of-place radix-2 implementation which uses a pre-calculated lookup-table for the sin/cos tables. The lookup-table calculation is not included but it takes longer than the FFT itself.
Firstly, it's a bit quicker than I thought it would be. I guess my previous DMA timing tests were out a little bit last time I looked or were broken in some way. The arm results (outside of ffts) are for implementations which have particularly poor memory access patterns which explains the low performance - they are pretty close to worst-case for the ARM data cache.
However as expected the limiting factor is memory bandwidth. Even with simple synchronous DMA transfers diminishing returns are apparent after 2 active cores and beyond 4 cores is pointless. Without the memory transfers the calculation is around 10x faster in the 16-core case which shows how much memory speed is limiting performance.
Compared to complexity of ffts (it was the work of a PhD) the performance isn't too bad. It is executing a good few more flops too because it is calculating the twiddle factors on the fly. 1 396 736 sin/cos pairs are calculated in total.
I've yet to try asynchronous DMA for interleaving flops and mops. Although this would just allow even fewer cores to saturate the memory bus.
Currently it runs out-of-place, which requires 2x8MB buffers for operation. I can't see a way to re-arrange this to become in-place because of the butterfly of the data required between the two outer passes.
The memory transfers aren't ideal for the parallella. Only the write from the first pass is a sequential DMA which increases the write bandwidth. The only thing I can think of here is to group the writes across the epiphany chip so at least some of the writes can be in sequential blocks but this would add complication and overheads.
The inner loops could be optimised further. Unrolling to help with scheduling, assembly versions with hardware loops, or using higher radix steps.
The output of the last stage is written to a buffer which is then written using DMA, it also requires a new pair of twiddle factors calculated for each step of the calculation which passes through an auxiliary table. These calculations could be in-lined and the output written directly to the external memory (although the C compiler only uses 32-bit writes so it would require assembly language).
The code is hard-coded to this size but could be made general purpose.
The implementation is small enough it leaves room for further on-core processing which could help alleviate the bandwidth issue (e.g. convolution).
It should be possible to perform a radix-2048 step with the available memory on-core which would raise the upper limit of a two-pass solution to 2^22 elements, but this would not allow the use of double-buffering.
I haven't implemented an inverse transform.