Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


Saturday, 23 November 2013, 17:47

The joy of segfaults

I had another look at some parallella code today but I didn't get as far as I'd hoped. Being a bit tired and not really into it didn't help, I guess.

The first major problem I hit was that the linker doesn't allocate bss blocks for relocatable files by default, which I only discovered after a lot of faffing about. This and a few other issues made me decide to create a simpler linker script, which I was trying to avoid. Since I have it now, I'm using the linker script to merge some of the c-runtime support sections and epiphany sections with the base sections, and to rename some of the epiphany sections to something I can use more readily in the loader (e.g. IVT_RESET to .ivt0).

It still didn't work, which took a lot of tracking down ... and turned out to be an annoying bug in the way I was resolving the address of a remote-core array. I had defined the weak external reference as a pointer type and was just passing it (that is, its contents) to e_get_global_address - I should have passed the address of the variable instead. Live and learn I suppose, or maybe not. This is the second time I've wasted a good chunk of time on something like this, so it's probably something I need to macro/functionise if I can.
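The kind of macro hinted at above might look something like the following. This is only a host-side sketch: `model_global_address`, `REMOTE_ADDRESS` and `remote_rows` are hypothetical names, and the model is a simplified stand-in for e_get_global_address that takes absolute mesh coordinates. It relies on the Epiphany address layout, with the core id (6-bit row, 6-bit column) in the top 12 bits and the core-local offset in the low 20 bits. The point is that the macro applies `&` itself, so the caller cannot pass the contents of the variable by accident.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified host-side model of an Epiphany global address (an assumption
 * for illustration, not e-lib): core id in the top 12 bits as a 6-bit row
 * and 6-bit column, core-local offset in the low 20 bits. */
static uint32_t model_global_address(unsigned row, unsigned col, const void *ptr)
{
    assert(row < 64 && col < 64);
    uint32_t coreid = ((uint32_t)row << 6) | (uint32_t)col;
    uint32_t local = (uint32_t)((uintptr_t)ptr & 0xfffffu);
    return (coreid << 20) | local;
}

/* Hypothetical helper: takes the symbol and applies & itself, so the
 * variable's address is always what gets translated. */
#define REMOTE_ADDRESS(row, col, sym) \
    model_global_address((row), (col), &(sym))

/* Stand-in for the weak external reference declared as a pointer type.
 * It has static storage, so its contents start out as NULL. */
float *remote_rows;
```

With this, `model_global_address(32, 8, remote_rows)` reproduces the mistake (the NULL contents are translated, giving the base of the remote core's memory), while `REMOTE_ADDRESS(32, 8, remote_rows)` translates the location of the variable itself.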

But once I worked that out it suddenly started working.

Single-pass resampler

I'm working on a single-pass image resampler. It's something I need for the FD code, and a nice parallel problem which should fit a grid of EPUs nicely to boot. It's also a good test case for the relocating elf loader code I have.

                        input rows
       |             |             |             |
 +-----------+ +-----------+ +-----------+ +-----------+
 | scale x 0 | | scale x 1 | | scale x 2 | | scale x 3 |
 +-----------+ +-----------+ +-----------+ +-----------+
       |             |             |             |
 +-----------+ +-----------+ +-----------+ +-----------+
 | scale y 0 | | scale y 1 | | scale y 2 | | scale y 3 |
 +-----------+ +-----------+ +-----------+ +-----------+
       |             |             |             |
                       output rows

                Workgroup topology (transposed)

The input stage comprises 4 cores in a column, each of which loads 1/4 of a row of the input stream at a time and scales it in X - the results are written directly to the next stage in the pipeline.
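As a rough illustration of the input stage, here is a host-side sketch that splits one row into four segments and scales each in X on its own. It assumes nearest-neighbour sampling at the destination pixel centre, 8-bit single-channel pixels, and row widths that are multiples of the core count; `NCORES`, `scale_x_nearest` and `scale_row_in_quarters` are made-up names, and the loop over cores stands in for work that would run on separate cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCORES 4  /* number of x-scaler cores; matches the 4-core column */

/* Nearest-neighbour scale of one segment: each destination pixel takes the
 * source pixel under its centre. */
static void scale_x_nearest(const uint8_t *src, size_t src_width,
                            uint8_t *dst, size_t dst_width)
{
    assert(src_width > 0 && dst_width > 0);
    for (size_t x = 0; x < dst_width; x++) {
        size_t sx = (2 * x + 1) * src_width / (2 * dst_width);
        dst[x] = src[sx];
    }
}

/* Scale a whole row as NCORES independent segments, one per core.
 * Assumes both widths divide evenly by NCORES, in which case the result
 * is identical to scaling the row in one go. */
static void scale_row_in_quarters(const uint8_t *src, size_t src_width,
                                  uint8_t *dst, size_t dst_width)
{
    assert(src_width % NCORES == 0 && dst_width % NCORES == 0);
    size_t sw = src_width / NCORES;
    size_t dw = dst_width / NCORES;
    for (unsigned core = 0; core < NCORES; core++)
        scale_x_nearest(src + core * sw, sw, dst + core * dw, dw);
}
```

A higher-order filter would need a few pixels of overlap between neighbouring segments, which this sketch ignores.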

The y scalers then perform y scaling on the input rows, and output directly to the target.

Because there are a lot of fiddly edge cases I just started with the data-flow code with an X-only scaling case to nearest neighbour (simplifies the y-scaling logic), but the intention is to end up with (at least) bi-cubic resampling. For this reason the Y scalers hold 'some' number of rows, greater than one, organised in a cyclic buffer - so they can double-buffer with the X scaler and support higher-order resampling. I'm only using 4+4 cores mostly for simplicity but I may also have a use for the other 8. I don't know yet if the workload will balance well with a 1:1 mapping like this - in any event it will be dynamic based on the problem (e.g. the x scaler always runs on each input row, but the y scaler only needs to run on each output row), and even if it isn't 100% efficient it should be goodly-efficient[sic].
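The cyclic row buffer in the Y scaler could be organised along these lines. This is a sketch under assumptions, not the actual code: `ROW_SLOTS`, `ROW_WIDTH`, `struct row_ring` and the `ring_*` functions are hypothetical, and the synchronisation between the X and Y stages is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_SLOTS 4  /* hypothetical number of rows held; must exceed one */
#define ROW_WIDTH 8  /* hypothetical width of this core's row segment */

struct row_ring {
    unsigned char rows[ROW_SLOTS][ROW_WIDTH];
    unsigned head;  /* total number of rows committed so far */
};

/* Slot the X scaler should write the next input row into. */
static unsigned char *ring_next_slot(struct row_ring *r)
{
    return r->rows[r->head % ROW_SLOTS];
}

/* Mark the slot returned by ring_next_slot() as complete. */
static void ring_commit(struct row_ring *r)
{
    r->head++;
}

/* Look up input row y. Returns NULL if the row has not arrived yet or has
 * already been overwritten by a newer one. */
static const unsigned char *ring_row(const struct row_ring *r, unsigned y)
{
    if (y >= r->head || r->head - y > ROW_SLOTS)
        return NULL;
    return r->rows[y % ROW_SLOTS];
}
```

With four slots, a bi-cubic Y filter could read its four source rows through `ring_row`, although some extra slack would be needed if the X scaler is to write ahead at the same time.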

So as of now I have the basic data-flow working. I'm using an 'eport' for the throttling/arbitration of the Y buffers and by organising the input stage in a column the DMA reads are fair without further work. This also gives me a simple platform to determine how important write DMA arbitration is, although I haven't included it yet.

As the Y stage can have multiple rows of storage (memory permitting) the same structure can be used for separable convolution, wavelets, etc. It can also be extended to high quality rotation and even to general purpose affine resampling - which I may look at eventually.

Tagged hacking, parallella.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!