About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed

Tags

android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Friday, 17 August 2012, 19:13

parallel 64 bit tests using NEON

So the next bit of code i've been working on requires a lot of 64-bit bit tests. Essentially using the tests as a 64-element 1-bit lookup table.

My first approach was to take an obvious C implementation, and convert it to NEON and perform the bit tests in parallel.

The C is basically like this:

  int i, count=0;
  for (i=0;i<N;i++) {
    uint64_t test = tests[i];
    uint64_t bit = (uint64_t)1 << bits[i];

    count += (test & bit) ? 1 : -1;
  }

It works, and one can perform the tests - but the tests need to be performed using 64-bit arithmetic - which takes a lot of register space, and a large amount of fuffing to get the data in the right format ('fuffing' is my technical term for data conversion/re-arrangement), and then the results must be converted back to 32-bits to form the selection masks.

So it's a great deal of code, much of which isn't doing anything but fuffing.

But the idea I had last night was to perform the testing using 8 bit values, and then scale these up to 32-bits to form the masks.

If one were writing the above code to avoid the 64-bit accesses in C, you could just use addressing to convert the 64-bit test to an 8 bit test - but on the correct byte instead.

  int i, count=0;
  for (i=0;i<N;i++) {
    int bitno = bits[i];
    uint8_t test = testsbyte[i * 8 + (bitno >> 3)];
    uint8_t bit = bitno & 7;

    count += (test & bit) ? 1 : -1;
  }

This requires per-byte addressing ... which you don't get for neon memory accesses per element ... except with vtbl.

So the following performs 8x 64-bit bit tests in only 5 instructions, and the actual tests are only using 8-bit arithmetic.

        @ d0 has 8 indices to test
        @ d1 has the 64-bit mask to test against
        @ d2 is set to 7 in 8 bytes
        @ d3 is set to 1 in 8 bytes
        
        vand.u8         d4,d2,d0        @ test & 7
        vshl.u8         d4,d3,d4        @ (1 << (test & 7))

        vshr.u8         d5,d0,#3        @ test >> 3
        vtbl.u8         d5,{d16},d5     @ data[test >> 3]
        
        vtst.u8         d4,d5           @ data[test >> 3] & (1 << (test & 7))

Changing the classifier to use code similar to this (but unrolled/scheduled better) gave me a 100% speed-up on the beagleboard (still unverified code ...).

Unfortunately, I still need nearly as many instructions just to convert the 8 bit mask into 32-bits for a selection mask (i need 32 bits for what i'm doing). But some mathematics might let me change that.

Tagged beagle, hacking.
Maths to the rescue | mipmap generation with SIMD/NEON
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!