Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


android (44)
beagle (63)
biographical (102)
blogz (9)
business (1)
code (74)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (455)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (24)
Saturday, 21 December 2013, 02:58

e-port revisited

I can't remember if i posted this here or just discussed it on the forums, but here's an update on the 'eport' code. eport is a lightweight 1:1 synchronisation point designed to assign owneship of slots in a cyclic buffer with minimum overhead. This directly allows the utilisation of some of the specific features in the epiphany architecture.

Since I can't find the post (probably just a bad subject) here's an overview of the api:

int eport_reserve(eport *)
Reserves a single slot in the port for the writer. The returned index is the slot number. Once this returns the writer has exclusive access to the given slot.
void eport_post(eport *)
Indicates that a slot previously reserved is now owned by the receiver with the implication that it contains valid data. Slots are automatically cycled in the same order they are reserved.
int eport_wait(eport *)
Waits for data to be ready at the receiving end. The return value indicates the slot number.
void eport_done(eport *)
Marks the next slot free to be re-used by the caller.

Ok so my initial implementation used power of two sizes for the indices for efficiency reasons of the cyclic buffer index calculation (modulo) and it only required 2 'pointer' values. However, because of repeating of the cycles one of the slots always sits unused, and furthermore the forcing of a power-of-two size can easily lead to needing to allocate more space than is necessary to implement an algorithm. There just isn't enough space to waste like this. An additional problem I hit when working on a real application was that the implementation required that every reserve be paired directly with a post, and every wait be paired directly with a done - which precluded the ability to use multiple slots in the buffer at once - which was kind of the whole point of it!

So I had a fiddle and came up with a new implementation which manages to allow all slots to be in use and also allows for arbitrary sizes without an expensive modulo operation. It only needs two extra local 'pointers' which are used to track multiple outstanding reserves/waits properly. If memory is tight the receiver can limit the working-set to exactly the amount it requires, or allow one extra free slot to allow for interleaving of work.

I've only just banged it up so it might not be correct or handle all the edge cases, but here it is anyway.

data structure

// Code Copyright (C) 2013 Michael Zucchi
// Licensed via GNU GPL version 3 or later.
struct eport {
        volatile unsigned int head;
        volatile unsigned int tail;
        unsigned int index;
        unsigned int next;
        unsigned int size;
        struct eport *other;

Each of these is allocated in pairs - one local to the sender and one local to the receiver. other pointers to the other one of the pair and everything else apart from size is initialised to 0. The only cross-communication is that the sender updates head in the receiver, and the receiver updates tail in the sender - obviously an atomic operation with no round-trip required.

index is local to each end and tracks the next slot, modulo the size of the cyclic buffer. Likewise next indicates the next 'actual' slot being requested at each end.


unsigned int eport_reserve(struct eport *port) {
        unsigned int next = port->next;

        // Check for a free slot
        while (next >= port->tail + port->size) {

        port->next = next + 1;

        // Increment index modulo size
        unsigned int r = port->index;
        port->index = (r+1) == port->size ? 0 : r+1;
        return r;

Because the "pointers" run without modulo I can simplify the capacity test. I then manually keep track of the index modulo the size for use by the caller. By using next to track the allocation it detaches the allocation from the publishing so it properly handles multiple outstanding reserves.

EPORT_SLEEP() does nothing on epiphany but might call usleep(x) on GNU/Linux.


void eport_post(struct eport *port) {
        unsigned int h = (port->head + 1);

        port->other->head = h;
        port->head = h;

Post is basically the same as it was before, except now it doesn't need to modulo the pointer. Since the slot is known to be owned by the sender all it has to do is update the pointers at both ends. The sender is the only thread that can write to head so it needs no arbitration.


unsigned int eport_wait(struct eport *port) {
        unsigned int next = port->next;

        while (port->head == next)

        port->next = next + 1;

        // Track tail % size
        unsigned int r = port->index;
        port->index = (r+1) == port->size ? 0 : (r+1);

        return r;

There is some obvious (and nice) symmetry with the reserve function here in the receiver. Again next is used to detatch acceptance from recycling.


void eport_done(struct eport *port) {
        unsigned int t = (port->tail + 1);

        port->other->tail = t;
        port->tail = t;

Done is similarly symmetric to the post function. And here the receiver is the only one who can ever write to tail, so it needs to also needs no arbitration.


So in short ... eport

When coupled with epiphany's blocking-free writes and limited memory it allows one to write streaming processors which write directly to the target core without needing any additional synchronisation and potentially asynchronously.

One drawback of the implementation above is that a given port is limited to 2^32-1 operations at a time before the maths breaks down, although that could be addressed by using 64-bit integers or some more complex pointer arithmetic.

I'm pretty much done for today but this is just another small step toward getting the 1-pass image scaler going. I think it should also end up a reasonable basis for writing a 1-pass 2-d separable convolution, wavelets, and so on. For some reason although i'm just a little bit of work away from finishing it I keep putting it off - today it's that the wind kept me awake all night and i'm too knackered to think straight. e.g. althogh i haven't compiled or debugged it I have the whole 2-d bicubic scaler written, I just need to slot in this updated eport code. Another day.

On another note i'm finally on xmas leave - hopefully for a few months like last year but I have a feeling i'm going to get roped in to doing a bit of other work which might cut it short. At this point i have a head full of ideas to code on but i think for sanity's sake I will need to take a break at some point too.

Tagged code, hacking, parallella.
genetic algorithm experiments | Wondered why the cartoons were on telly so late ...
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!