Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Saturday, 21 June 2014, 11:49


So i've continuted to hack away on a 1M-element fft implementation for epiphany.

The implementation does 2xradix-1024 passes. The first just does 1024x1024-element fft's, and the second does the same but on elements which are strided by 1024 and changes the twiddle factors to account for the frequency changes. This lets it implement all the reads/writes using LDS and it only has to go to global memory twice.

At this point i'm pretty much doing it just for something to do because I did some testing with ffts and that can execute on a single arm core about 2x faster than the epiphany can even read and write the memory twice because all global memory accesses are uncached. According to some benchmarks I ran anyway - assuming the rev0 board is the same. Given that it's about the same speed for accessing the shared memory in the same way from the ARM (it's non-cached) I think the benchmark was valid enough.

Anyway one problem is that it still needs to load the twiddle factors from memory and for the second pass each stage needs it's own set. i.e. more bandwidth consumption. Since there's such a flop v mop impedance mismatch I had a look into generating the twiddle factors on the epiphany core itself.

Two issues arise: generating the frequency offset, and then taking the cosine and sine of this.

The frequency offset involves a floating-point division by a power of two. This could be implemented using fixed-point, pre-calculating the value on the ARM (or at compile time), or a reciprocal. Actually there's one other way - just shift the exponent on the IEEE representation of the floating point value directly. libc has a function for it called scalbnf() but because it has to be standards compliant and handle edge cases and overflows it's pretty bulky. A smaller/simpler version:

float scalbnf(float f, int logN) {
    union {
        float f;
        int i;
    } u;

    u.f = f;
    if (u.i)
        u.i += (logN << 23);

    return u.f;

This shifts f by logN - i.e. multiplies f by 2^N, and allows division by using negative logN. i.e. divide by 1024 by using logN=-10.

The second part is the sine and cosine. In software this is typically computed using enough steps of a Taylor Series to exhaust the floating point precision. I took the version from cephes which does this also but has some pre-processing to increase accuracy with fewer steps; it's calculation only works over over PI/4 and it uses pre and post-processing to map the input angle to this range.

But even then it was a bit bulky - standards compliance/range checking and the fpu mode register adds a bit of code to most simple operations due to hardware limitations. Some of the calculation is shared by sin and cos, and actually they both include the same two expressions which get executed depending on the octant of the input. By combining the two, removing the error handling, replacing some basic flops with custom ones good enough for the limited range and a couple of other tweaks I managed to get a single sincosf() function down to 222 bytes verses 482 for just the sinf() function (it only takes positive angles and does no error checking on range). I tried an even more cut-down version that only worked over a valid range of [0,PI/4] but it was too limited to be very useful for my purposes and it would just meant moving the octant handling to another place so wouldn't have saved any code-space.

I replaced int i = (int)f with a custom truncate function because I know the input is positive. It's a bit of a bummer that FIX changes it's behaviour based on a config register - which takes a long time to modify and restore.

I replaced float f = (float)i with a custom float function because I know the input is valid and in-range.

I replaced most of the octant-based decision making with non-branching arithmetic. For example this is part of the sinf() function which makes sign and pre-processing decisions based on the input octant.

    j = FOPI * x; /* integer part of x/(PI/4) */
    y = j;
    /* map zeros to origin */
    if( j & 1 ) {
        j += 1;
        y += 1.0f;
    j &= 7; /* octant modulo 360 degrees */
    /* reflect in x axis */
    if( j > 3 ) {
        sign = -sign;
        j -= 4;

 ... calculation ...

    if( (j==1) || (j==2) ) {
      // use the sin polynomial
    } else {
      // use the cos polynomial

    if(sign < 0)
        y = -y;
    return( y);

This became:

   j = ztrunc(FOPI * x);
   j += (j & 1);
   y = zfloat(j);
   sign = zbnegate(1.0f, j+1, 2);
   swap = (j+1) & 2;

 ... calculation ...
    if( swap ) {
      // use the sin polynomial
    } else {
      // use the cos polynomial

    return y * sign;

Another function zbnegate negates it's floating point argument if a specific bit of the input value is set. This can be implemented straightforward in C and ends up as three CPU instructions.

static inline float zbnegate(float a, unsigned int x, unsigned int bit) {
    union {
        float f;
        int i;
    } u;

    u.f = a;
    u.i ^= (x >> bit) << 31;

    return u.f;

Haven't done any timing on the chip yet. Even though the original implementation uses constants from the .data section the compiler has converted all the constant loads into literal loads. At first I thought this wasn't really ideal because double-word loads could load the constants from memory in fewer ops but once the code is sequenced there are a lot of execution slots left to fill between the dependent flops and so this ends up scheduling better (otoh such an implementation may be able to sqeeze a few more bytes out of the code).

Each radix-1024 fft needs to calculate 341x2 sin/cos pairs which will add to the processing time but the number of flops should account for the bandwidth limitations and it saves having to keep it around in memory - particularly if the problem gets any larger. The first pass only needs to calculate it once for all batches but the second pass needs to calculate each individually due to the phase shift of each pass. Some trigonometric identities should allow for some of the calculations to be removed too.

Tagged hacking, parallella.
cexpi, power of 2 pi | borkenmacs
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!