A bit more vj progress, thoughts on elf loader
I had a chance to fit in a bit more hacking on the parallella over the weekend, although not as much progress as i'd hoped.
I first created an ARM version of the viola-jones (VJ) code to compare; it's still getting different results to a previous implementation but at least it's consistent with the EPU (epiphany core) version. I tried to work out why it isn't the same but I still haven't fathomed it - it can wait.
But at least with a consistent implementation I can do some comparisions. It's only running on a single core in each case, and it's timing just the processing of every 20x20 window on a single-scale 512x512 image (about 230K probes).
CPU Time Intel core Duo 2Ghz 0.3s Parallella ARM 1.5s Parallella EPU 2.8s Parallella ARM format 2 1.1s (approx) Parallella EPU format 2 2.3s Parallella EPU ASM 1.4s Parallella ARM C* 1.0s Parallella EPU C* 1.8s Parallella EPU ASM* 0.9s
Update 3: I finally got the assembly version working and added 3 more timings. These include C versions that have hand-unrolled inner loops aided by the new format. The ARM version doesn't report very reliable timings (paging on first-run i guess), but it's definitely improved. I've tried to optimise the scheduling on the assembly version although there are still enough stalls to add up - but removing those will require more than just reordering of instructions. These new times also includes some tweaks to the arrangement of the cascade chunks, although i can't recall if it made much difference. Still yet to double-buffer the image DMA.
Update *: Tried a bunch more things like fully double buffering, a more C-friendly format for the C versions, and some hacks which probably aren't safe/aren't really working. But as they still "appear" to work i'm including the timings here listed with the *s. But the C implementation is so tight it only just fits into the on-core memory to such an extent it might be difficult to add the scheduling code for multi-core.
It's also pretty much exhausted my interest at this point so i will probably move onto multi-core as well as a more complete and realistic example (although, tomorrow is another day so i might change my mind if inspiration hits).
(BTW this is not representative of the time it takes to detect a face on a 512x512 image on this hardware, as one doesn't normally need to look for 20x20 pixel faces at that resolution and it will typically be scaled down somewhat first, 1x1 pixel centres are usually not needed either).
To be honest I was hoping the EPU would be a bit closer to the ARM because of the similar clock speed, but I think in part it's suffering from a compiler which doesn't optimise quite as well. And the format I'm using benefits from the more capable instruction set on the ARM (bitfield stuff). However, there's still more that can be done so it isn't over yet - i'm not double buffering the image window loads yet for one.
First I started looking at writing an assembly version of the inner-most routine. But I ran into a problem on that - debugging is a royal pain in the arse. Obviously I had bugs, but the only way to "debug" it is to stare at the code looking for the error. And i'm not yet proficient enough at the dialect to be very good at this yet. The only indication of failure is that the code never gets to the mailbox return message. I don't know how to run gdb on the epu cores, and the computer i'm using as a console doesn't have an amd64 GNU/Linux installed (required for the pre-compiled sdk). A few documentation errors were a bit misleading too (post-increment is really post-increment despite the copy-paste job from the indexed LDR page). After paring things down enough and working out the ABI I at least had something which walked the cascade. As I started filling out the guts I thought I may as well tweak the format for better run-time efficiency too.
The EPU is most efficient when performing 64-bit data loads (aligned to 64-bit); but the format I had was all based on 32-bits for space-efficiency. With a little bit of re-arranging I managed to create a 64-bit format which requires less manipulation whilst only adding 4 bytes to each feature - this adds up but I think it will be worth it. It also lets me unroll the inner loop a bit more easily. One thing I noticed is that the 'weights' of the features in the cascade i'm using only fall into 3 categories; 2 combinations for 2-region features: -0.25 + 0.50, -0.25 + 0.75, and 3-region features are always a fixed -0.25 + 0.50 + 0.50. So I don't have to encode the weights individually for each region and can just either hard-code the weights (for the 3-region features), or store a single float value for each feature instead.
Anyway then my sister dropped around, which means a slight hangover this morning ... so seeing if all this tweaking makes any difference will have to wait.
Update: So i did some evening hacking, and finally got the code working on the epu - kept overwriting the stack and other fun stuff which is a pain to track down. And oh boy, make sure you avoid double and that any shorts in a structure are unsigned - othterwise it really hits performance.
So yeah the new cascade format does increase the performance ... but it just widened the gap! Although I gained a pretty decent 30-50% improvement on ARM execution time (actually its hard to tell, timing is very inconsistent on the arm code), it was pretty much insignificant on epiphany, under 2%. Blah.
The total DMA wait time of loading the cascades went up by 30% which suggests it's being limited by external memory bandwidth and not cpu processing. Some of this is good - indicating the cpu code is better and so there's more waiting around - but as the format is a bit fatter that can't help either and it's not like there's any way to fit any other processing in the dead time.
Will have to re-check the dma code to make sure it's double buffering properly, but i'm probably going to have to try another approach. a) Go for size of the data instead, i can get each feature down to 20/24 bytes (2 region features) + 24 bytes (3 region features) vs 32/40 - if i do the address arithmetic in the inner loop (lucky i can do the multplies with shifts), and/or b) see if fiddling with the sizes of the groups helps, and/or c) Distribute the extra bits of the cascade in chunks across multiple epiphany cores so the cascade remains completely on-chip, and/or d) Try a completely different approach such as a pipeline architecture where different cores process different stages.
Update 2: Actually on second thoughts, i don't think it's the dma, it's just not enough cycles to make a big difference (assuming my timing code is right). And even if i limit it to the local stages and don't dma in any cascade chunks it's still taking about 75% of the total time. So although the C compiler seems to be doing an okish job I'm probably better off spending my time in assembler now.
Based on thoughts on a post on the forums I also had a quick look at how one might write an elf loader for the epu - in a way which makes it more useful than just memcpy. I think it should be possible to create an executable for the epu which allows each core to have custom code, and for each core to be able to resolve locations in other cores. And if the linking is done at run-time it can also re-arrange the code in arbitrary ways - such as to customise a specific work-group topology or even handle differences between 16 or 64-core chips. And I think I can do this without changing any of the tools.
I poked around ld and noticed the -q and -r options which retain the reloc hunks in the linked output. For a test I created a linker script which defines a number of sections each with an origin a megabyte apart - i.e. the spacing of epu cores. This represents the absolute location of each 'virtual' core. Code can be placed on any core using section directives (attributes), and using the -q option the linker outputs both the code and the reloc hunks and so should allow a loader to re-link each core's code to suit any topology ... well in theory.
Before I get anywhere with that I need some doco on the epiphany-specific reloc hunks from the assembler - looks quite straightforward - and finding enough time to look into it.
This might seem a bit weird in terms of GNU/Linux, but i'm familair (although no doubt rather rusty!) with non-pic but relocatable code from my Amiga hacking days. My code back then was assembled to a fixed address, and the OS loader would re-link it to potentially non-contiguous chunks of memory in a global system address space. If it worked then, it should work ok with the parallella too. My only concern is that the GNU linker outputs enough information - using the -r switch (for relocatable code) causes the final link to fail (probably libc/startup related), but the -q switch (which simply retains the reloc hunks) may not be sufficient. It also relies on the fact that any inter-core reference will have to be absolute and thus require a load and a reloc entry of a 32-bit address, but I think that will be the case.
Having a runtime elf loader/relocator allows for some other nifty shit too like thread local storage and so on. Now, if only the epiphany ABI were a bit more sane too! (wrt register conventions)