Michael Zucchi

 B.E. (Comp. Sys. Eng.)


Wednesday, 06 May 2015, 21:48

post google code post

Well nobody bothered to comment about the stuff i removed from google code apart from the one lad or lass who lamented the loss of some javafx demos.

I had comments open+moderated for a few weeks but got hit by spammers a couple of days ago so had to go back to id+moderated. Maybe something got lost in those 500 bits of snot but i don't think so. The spam was quite strange; most mentioned web sites but didn't provide links or weren't very readable so i'm not sure what the point was. Perhaps they're just fishing for open sites or naive moderators they can then exploit. Like the "windows computer department" that keeps calling and calling hoping i'll not tell them to fuck off every time (sigh, no i don't normally say that although i would tonight).

I've still got the subversion clones but i'm not inclined to do much with any of it for the foreseeable future and i'm not even sure if i'm going to continue publishing other bits of code i play with going forward.

Desktop Java, OpenCL, ARM assembly language; these things are just not very common in the Free Software world. Server Java is pretty common but that's just, well, `open sauce' companies sharing costs and not hobbyists. So i think all i'm really doing is providing hints or solutions for some student's homework or help for graduate programmers to keep their jobs. And even then it's so niche it wouldn't be many, if any.

As an example of niche, I was looking up some way to communicate with adobe photoshop that doesn't involve psd format and one thing i came across was someone linking to one of my projects for some unfinished experiments with openraster format - on the first page of results. This happens rarely but still too often. Of course it could just be the search engine trying to be smart and tuning results to the user, which is a somewhat terrifying possibility (with implications beyond these types of searches of course). FWIW I came to the conclusion photoshop is just one of those proprietary relics from the past which intentionally refuses to support other formats so its idiot users can continue to be arse-reamed by its inflated price.

It's just a hobby

As a hobby i have no desire to work on larger projects of my own or other established projects in my spare time. Occasionally i'll send in a patch to a project but if they want a bunch of fucking around then yeah, ... naah. In hindsight i somewhat regret how we did it on evolution but i think i've mentioned that before. Neither do i need to solicit work or build a portfolio or just gain experience.

I'm not sure how many hobbyists are around; anyone with remotely close to enough skill seems to be jumping into the wild casinos of app-stores or services and expecting to make billion$ and not just doing it for the fun of it. Some of those left over just seem to be arrogant egotistical fuckwits (and some would probably think the same of me). Same as it ever was I guess.

I suppose I will continue to code-drop even if it's just out of habit.

For another hobby I made kumquat marmalade on the weekend. Spent a couple of hours in the sun slicing the tiny fruit and extracting seeds (2-3 cups worth of seeds) and cooked it the next day. Unfortunately after all that effort it looks like it wasn't cooked quite enough and it probably won't set - it's a bit runny but at least it tastes good. Not sure what i'll do with 2-odd litres of the stuff though.

Tagged beagle, code, imagez, jjmpeg, mediaz, parallella, pdfz, puppybits, readerz, socles, videoz.
Thursday, 27 November 2014, 22:06

ezesdk 0.4 released

Been a while and it's been basically ready for weeks (months?) but I finally found a few hours to drop out a new release of ezesdk. As the post linked below notes it's mostly for the 3D rasteriser sample but I did a bunch of re-arranging and other cleanup too.

As usual it is available on the ezesdk home page.

It still requires root to run and i haven't looked at all into the "new driver" stuff. I've been keeping a pretty loose eye on that but not doing anything about it.

Tagged code, hacking, parallella.
Wednesday, 10 September 2014, 13:42

Little gpu bits

I've mostly been taking it easy - i'm not going to be on leave forever (unfortunately) - but i've tried a couple of little things on the gpu code.

First I tried creating a tile-based implementation for the ARM/host version but this runs about 1/2 the speed of the line-oriented one. Not that I really optimised it but that's a lot to make up and i don't see the point; it's a convenient test-bed for experimenting though.

Then I tried creating tile-accurate indexing rather than using the bounding box. This improves the output a small amount on the purely arm version but takes a hit on the epiphany backend since the hit to the arm-side code exceeds the gains on the epiphany-side. It will depend on the workload and it might be worth it for larger triangles. Then again maybe the index isn't helping as much as I thought.
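As a rough sketch of the bounding-box style of index mentioned above: each triangle is added to every tile its bounding box touches, which is cheap but over-includes tiles the triangle never covers. The class name, tile size and screen size are made up for illustration; this is not the ezegpu code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: bin triangle ids into screen tiles by bounding box.
final class TileIndex {
    final int tileSize, tilesX, tilesY;
    final List<List<Integer>> bins = new ArrayList<>();

    TileIndex(int width, int height, int tileSize) {
        this.tileSize = tileSize;
        this.tilesX = (width + tileSize - 1) / tileSize;
        this.tilesY = (height + tileSize - 1) / tileSize;
        for (int i = 0; i < tilesX * tilesY; i++)
            bins.add(new ArrayList<>());
    }

    // Add triangle 'id' to every tile touched by its bounding box.
    void add(int id, float x0, float y0, float x1, float y1, float x2, float y2) {
        float minx = Math.min(x0, Math.min(x1, x2)), maxx = Math.max(x0, Math.max(x1, x2));
        float miny = Math.min(y0, Math.min(y1, y2)), maxy = Math.max(y0, Math.max(y1, y2));

        // Entirely off-screen: nothing to index.
        if (maxx < 0 || maxy < 0 || minx >= tilesX * tileSize || miny >= tilesY * tileSize)
            return;

        // Clamp the box to the tile grid.
        int tx0 = Math.max(0, (int) Math.floor(minx) / tileSize);
        int ty0 = Math.max(0, (int) Math.floor(miny) / tileSize);
        int tx1 = Math.min(tilesX - 1, (int) Math.floor(maxx) / tileSize);
        int ty1 = Math.min(tilesY - 1, (int) Math.floor(maxy) / tileSize);

        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bins.get(ty * tilesX + tx).add(id);
    }
}
```

With a 64x64 screen and 8x8 tiles, a right-angle triangle spanning the whole screen lands in all 64 bins even though it only covers about half of them; that over-inclusion is what a tile-accurate index trades host-side work to avoid.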

I also started (re)reading about some lighting stuff but didn't get very far.

Feeling pretty lazy today too.

Update: But not too lazy to poke a bit more it seems.

I made a "slight improvement" to the ARM based tile renderer and now it's a bit faster (10%) than the line-based one with a specific test-case. Being lazy the first time I was just processing the tile row by row rather than performing the rasteriser pass across the whole tile first and then processing the fragments afterwards. This just helps the compiler keep more setup data in registers for each loop and is closer to how i'm doing it on the epiphany.
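The two-pass arrangement described above might look something like the following. It is a minimal sketch only: the tile size, the edge-function convention, the single colour channel and all of the names are assumptions for illustration rather than the real renderer.

```java
// Hypothetical sketch: rasterise a whole tile to fragments first, then shade them.
final class TwoPassTile {
    static final int TILE = 8;

    // Edge function: >= 0 when (px,py) is on the interior side of edge a->b,
    // for the winding order assumed in this sketch.
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // Pass 1: find covered pixels in tile (tx,ty) for triangle v = {x0,y0,x1,y1,x2,y2}.
    // frag[i] receives the pixel index inside the tile, w[3*i..3*i+2] the
    // normalised interpolation weights. Returns the number of fragments.
    static int rasterise(float[] v, int tx, int ty, int[] frag, float[] w) {
        float area = edge(v[0], v[1], v[2], v[3], v[4], v[5]);
        if (area <= 0)
            return 0; // degenerate or wrong winding
        int n = 0;
        for (int y = 0; y < TILE; y++) {
            for (int x = 0; x < TILE; x++) {
                float px = tx * TILE + x + 0.5f, py = ty * TILE + y + 0.5f;
                float w0 = edge(v[2], v[3], v[4], v[5], px, py);
                float w1 = edge(v[4], v[5], v[0], v[1], px, py);
                float w2 = edge(v[0], v[1], v[2], v[3], px, py);
                if (w0 >= 0 && w1 >= 0 && w2 >= 0) {
                    frag[n] = y * TILE + x;
                    w[3 * n] = w0 / area;
                    w[3 * n + 1] = w1 / area;
                    w[3 * n + 2] = w2 / area;
                    n++;
                }
            }
        }
        return n;
    }

    // Pass 2: shade the collected fragments with a 3-term interpolation
    // of one per-vertex value (c0, c1, c2).
    static void shade(int n, int[] frag, float[] w, float c0, float c1, float c2, float[] tile) {
        for (int i = 0; i < n; i++)
            tile[frag[i]] = w[3 * i] * c0 + w[3 * i + 1] * c1 + w[3 * i + 2] * c2;
    }
}
```

Keeping the two loops separate means each one has a smaller set of live values, which is the register-pressure benefit the paragraph above refers to.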

Update: Haven't been able to get into it this last week. I think hayfever season is starting and even before the symptoms hit it just seems to wreck my sleep more than normal. Been really tired/lethargic and not really feeling like doing anything - it just feels like all i'm doing each day is hanging around waiting to escape from it into the unconsciousness of sleep again. Today I even feel like i'm "coming down with something" although i'm pretty sure i'm not and it's just some hayfever related nonsense. I've done a little gardening at least - preparing some garden beds, putting in a few seeds, and rejuvenating some pots.

But as a bit of a puzzle a few days ago I tried to see if i could get the rasteriser loop any faster. I think I can get the inner loop down to 8 cycles with some unrolling, double load/stores and some constant preloads. The previous best was 10 cycles but i'm not sure this new version is practical.

This came out of playing with the idea of breaking the work up into squares (4x4 or 8x8) rather than rows. This has overheads due to performing the edge tests multiple times outside of each pixel test but also reduces the overheads of calculating over the bounding box. But it's one of those things I need a solid afternoon to try out by coding it up.

These tile tests also allow one to determine full coverage outside of the loop - which removes the need for the edge testing calculations at all. So I tried to see if that could save anything in the inner loop; but so far the latency from the z buffer testing has prevented any gains being made. Even assuming I could pipeline that away I think I can only save 1 cycle.
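One common way to do this kind of tile test is to evaluate each edge function at the four tile corners. The sketch below assumes that approach and a fixed winding order; the names are hypothetical and it is not necessarily how the epiphany code does it.

```java
// Hypothetical sketch: classify a square tile against a triangle using corner tests.
final class TileCoverage {
    enum Cover { OUTSIDE, PARTIAL, FULL }

    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // v = {x0,y0,x1,y1,x2,y2}, wound so interior points give edge() >= 0.
    static Cover classify(float[] v, float x0, float y0, float size) {
        float[] cx = {x0, x0 + size, x0, x0 + size};
        float[] cy = {y0, y0, y0 + size, y0 + size};
        boolean full = true;

        for (int e = 0; e < 3; e++) {
            int a = 2 * e, b = 2 * ((e + 1) % 3);
            int inside = 0;
            for (int c = 0; c < 4; c++)
                if (edge(v[a], v[a + 1], v[b], v[b + 1], cx[c], cy[c]) >= 0)
                    inside++;
            if (inside == 0)
                return Cover.OUTSIDE; // every corner is outside this edge
            if (inside < 4)
                full = false;         // this edge crosses the tile
        }
        // FULL tiles need no per-pixel edge tests at all.
        // PARTIAL is conservative: it may include tiles the triangle just misses.
        return full ? Cover.FULL : Cover.PARTIAL;
    }
}
```

For a FULL tile the inner loop only has to do the z test and interpolation, which is where the hoped-for saving would come from.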

I also toyed with creating an integer rasteriser that stores the framebuffer internally using bytes. For a flat shaded/z-buffered/non-blended triangle I think I can get that down to 7 cycles per pixel (and that's rendered, not just converted to fragments). Is that even useful? Who knows. But to test that idea out I need to work on a new design which will take another solid afternoon as well.

Tagged graphics, hacking, parallella.
Thursday, 04 September 2014, 23:12

ezegpu stuff

Did a bit more playing around on the ezegpu. I think i've hit another dead-end in performance although I guess I got somewhere reasonable with it.

Although there are some other things I haven't gotten to yet i've pretty much convinced myself this design is a dead-end now mostly due to the overhead of fragment transfer and poor system utilisation.

First I will put the rasterisers and fragment shaders back together again: splitting them didn't save nearly as much memory as I'd hoped and made it too difficult to fully utilise the flops due to the work imbalance and the transfer overheads. I'm not sure yet on the controller. I could keep the single controller and gang-schedule groups of 2/3/4 cores from the primitive input - i think some sort of multiplier is necessary here for bandwidth. Or I could use 3 or 4 of the first column of cores for this purpose since they all have fair access to the external ram.

Tagged code, hacking, parallella.
Wednesday, 03 September 2014, 02:47

simplex noise, less memory

I thought i'd look at something a bit different today: noise. Something to get the fragment shaders doing some more work.

I've looked at some of this before but it's been a while and never had much use for it.

I started with "wavelet noise" but when I realised it needed big lookup tables I went back to looking at the simplex noise algorithm. It seems wavelets are being used to create bandwidth limited versions of existing noise so that it scales better; but this isn't something I need to worry about.

A paper and implementation by Stefan Gustavson and others pretty much had me covered but I wanted to try and remove the 512+256 element lookup tables used to hash the integer coordinates to save some memory on the epiphany.

I came up with two working solutions in the end.

The first one uses a 32-element lookup table of prime numbers to implement a 2D hash function. I just grabbed 32 primes, taking every 20th one (the 20th, 40th, 60th and so on), for the table and fiddled with eor/mul and shift until I had something that seemed to work. I arbitrarily chose 32 because it was a nice round number.

// there's nothing particularly good or useful here

    static final int[] hasha = {
        71, 173, 281, 409, 541, 659, 809, 941,
        1069, 1223, 1373, 1511, 1657, 1811, 1987, 2129,
        2287, 2423, 2617, 2741, 2903, 3079, 3257, 3413,
        3571, 3727, 3907, 4057, 4231, 4409, 4583, 4751
    };

    // 2D hash to 0..15: each coordinate selects a prime which is mixed with the other
    private static int hash16(int a, int b) {
        return ((((b ^ hasha[a & 31]) * (a ^ hasha[b & 31])) >> 5) & 15);
    }

Because I was only interested in the 2D case I changed the gradient normal array to 16 elements so I didn't have to modulo the result as well. TBH it's kind of surprising it works as well as it does since hashing numbers is pretty tricky to get right and I really didn't know what I was doing.
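A 16-entry 2D gradient table of the kind described could be built as below. This is only a sketch: the post does not show the actual table, so the starting angle and the names here are assumptions.

```java
// Hypothetical sketch: 16 unit gradients evenly spaced around the circle.
final class Gradients16 {
    static final float[][] GRAD = new float[16][2];

    static {
        for (int i = 0; i < 16; i++) {
            double a = 2 * Math.PI * i / 16;
            GRAD[i][0] = (float) Math.cos(a);
            GRAD[i][1] = (float) Math.sin(a);
        }
    }

    // Dot product of the gradient selected by hash value h with the offset (x, y).
    // With 16 entries a 4-bit hash indexes the table directly, no modulo needed.
    static float dot(int h, float x, float y) {
        return GRAD[h & 15][0] * x + GRAD[h & 15][1] * y;
    }
}
```

Because every entry has the same length in 2D, the contribution of each corner is scaled consistently regardless of which gradient the hash picks.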

When I started I didn't realise exactly what it was for so once I had a better understanding of why it was there I thought i'd try an existing integer hash function. In general they failed miserably but I found one that came from the H2 database which worked sufficiently well.

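    // integer mixing function: two xor-shift/multiply rounds and a final xor-shift
    private static int hash(int x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x);
        return x;
    }

    // fold the two coordinates into one integer, hash it, keep the low 4 bits
    private static int hash16(int a, int b) {
        return (hash(a * b + a + b)) & 15;
    }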

I used the (a*b+a+b) calculation to turn it into a 2D hash function.
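A very basic way to eyeball a hash like this is to count how often each of the 16 outputs turns up over a grid of coordinates. The sketch below repeats the hash from above so it stands alone; the class name, the histogram helper and the 256x256 grid are made up for illustration.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical sketch: bucket counts for a 2D hash producing values 0..15.
final class HashCheck {
    static int hash(int x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x);
        return x;
    }

    static int hash16(int a, int b) {
        return hash(a * b + a + b) & 15;
    }

    // Count how many grid points in [0,size) x [0,size) fall into each bucket.
    static int[] histogram(IntBinaryOperator h, int size) {
        int[] counts = new int[16];
        for (int b = 0; b < size; b++)
            for (int a = 0; a < size; a++)
                counts[h.applyAsInt(a, b)]++;
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = histogram(HashCheck::hash16, 256);
        for (int i = 0; i < 16; i++)
            System.out.printf("%2d: %d%n", i, counts[i]);
    }
}
```

Roughly even counts don't prove the noise will look good, but a badly skewed bucket shows up immediately. One thing worth knowing is that a*b+a+b is symmetric in a and b, so (a,b) and (b,a) always select the same gradient.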

So this final version requires no lookup table for the gradient table permute at all - nice. But it requires 3 integer multiplies - not so nice for epiphany. And even the other version needs an integer multiply and thus the same costly fpu mode changes on epiphany.

Since I only need a limited number of output bits it might (should?) be possible to change this to using float multiplies to avoid the costly mode change; but this is something for further study. The first version might make this easier.

Screenshots ... this first is a simple 4-octave fractal noise generated using the 2D Simplex Noise code from Stefan. I think the 2D noise function has a small bug because it's using the 12-point 3D gradient bases which don't always evaluate to vectors of the same length in 2D but it isn't apparent once fractal noise is generated as here.

The next one is an example using the naive hash function (it may be a different scale to the others since I ran it separately). Covering 4 octaves hides some problems it might have but I've done some very basic testing to larger scales and it seems about as stable and nicely random as the others.

And the final shot is using the H2 hash function and 16 gradients evenly spaced around the unit circle rather than 12 evenly spaced around the unit sphere as in the traditional version.

Look about the same to me?
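For reference, a multi-octave fractal sum of the sort used for these images can be sketched as follows. The noise function is passed in, so any of the variants above would do; the frequency doubling and amplitude halving per octave are the conventional choices and are assumed here, since the post doesn't give the actual parameters.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch: sum several octaves of a 2D noise function.
final class Fractal {
    // Each octave doubles the frequency and halves the amplitude;
    // the result is normalised by the total amplitude.
    static double fbm(DoubleBinaryOperator noise, double x, double y, int octaves) {
        double sum = 0, norm = 0, amp = 1, freq = 1;
        for (int i = 0; i < octaves; i++) {
            sum += amp * noise.applyAsDouble(x * freq, y * freq);
            norm += amp;
            amp *= 0.5;
            freq *= 2;
        }
        return sum / norm;
    }
}
```

Since the higher octaves are layered on top of the base one, small differences between hash functions or gradient sets tend to get buried, which fits with the three images looking much the same.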

I don't know if it's useful for anything I might do or if it is even fast enough to run in a shader on the epiphany but I learnt a couple of interesting things along the way.


I've been doing some little bits and pieces on the ezegpu code as well.

Together the NEON changes amount to a 6% improvement to the total runtime of the all-ARM code for my current testing case (8x8x8 stars). Nothing major, although it goes up on simpler scenes mostly due to the faster RGBA float to byte conversion.
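The RGBA float to byte conversion is simple enough to show in scalar form. This is plain Java rather than NEON, and the channel order in the packed word and the rounding are assumptions, so treat it as a sketch of the operation and not of the optimised routine.

```java
// Hypothetical sketch: clamp float colour channels to [0,1] and pack to 8 bits each.
final class Pack {
    // Clamp to [0,1], scale to 0..255 and round to nearest.
    static int toByte(float v) {
        float c = Math.min(1.0f, Math.max(0.0f, v));
        return (int) (c * 255.0f + 0.5f);
    }

    // Pack with R in the low byte and A in the high byte (assumed layout).
    static int packRGBA(float r, float g, float b, float a) {
        return toByte(r) | (toByte(g) << 8) | (toByte(b) << 16) | (toByte(a) << 24);
    }
}
```

This runs once per output pixel no matter how little shading is done, which would explain why speeding it up matters more on simple scenes.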

Tagged code, hacking, parallella.
Sunday, 31 August 2014, 19:13

egpu mk ii part 2

After a couple of days' relaxing break including a nice ride down to the coast yesterday I had another look at the ezegpu today.

First task was just to create a common 'demo' frontend which can be linked to different backends so I can easily test different cases. I then created a backend based on the current mk ii state.

Well, I guess I jumped the gun a bit the other day by testing it with a poor example of large and mostly coincident triangles. Using the star-grid test the implementation is considerably faster than the line based renderer. The test code uses slightly different parameters but a 4x4x4 star test is now hitting 57fps vs 35fps for the line-based version, versus 31fps for single-core arm.

Then I upped the test to 8x8x8 stars (total of 4096 triangles) and zoomed out a bit and now the improved primitive input stage and 2d grouping really starts to show its paces: 22fps vs 7fps. The single-core ARM code is coping a bit better at 11fps.

Well that was nice to see I guess.

I guess i'll have a look through the points of the last few posts to decide what to look at next.

Tagged hacking, parallella.
Thursday, 28 August 2014, 12:44

some notes

Some waking up thoughts to jot down for later. It's too nice to be inside today. I have stuff I should be doing but i'm a little immobile due to hurting my foot again so I might just sit in the sun drinking. I thought it was better and over-tested it last weekend - and I wasn't even drinking :(. I can get around ok - it just doesn't heal at all if i don't rest it enough.

I had some further thoughts on yesterday's results: even though it's half the speed of the line renderer, considering the complexity of the interactions and the forced requirement of an additional read/write cycle across another core for each fragment it's probably actually fairly good. The main bottleneck seems to be the mismatch of rasterisation to fragment rendering time, which has nothing to do with the architecture. But the fragment shaders are only trivial 3-term colour interpolations, and if they were more complex then shifting the rasterisation to another core would leave more time for them to execute. So I will still hook it up to the gl frontend to test it and other backends which can use the same or similar controller setup.

Although I think due to the possibility of other highly optimised special cases a combined implementation will still be the ultimate target.

Tagged hacking, parallella.
Wednesday, 27 August 2014, 18:20

egpu mk ii.5

Well that took a bit longer than I wanted; and all i've done is rejigged all the comms around but that's enough for today.

I made a bunch of changes to address some of the problems; i'm still not sure it will fix the performance but it's some stuff I wanted to look at anyway. The big performance issue remaining is the rasteriser to fragment processor stream; I have a new communication protocol that addresses it as much as possible and have changed the fragment processor to use it but I haven't written the rasteriser to feed it yet. I was going to do a quick-and-dirty but that would just be wasted work and working toward the current target goal ended up ballooning out into a big pile of changes.

Hmm, so what was again going to be a short little poke turned into a whole afternoon and now the sun is rapidly leaving this hemisphere to a crisp but cold evening. This stuff is just too interesting to put down and i've just spent another hour and a half writing this and tweaking a few things I found while writing it. Might keep going now ...

Update: Hacked into the later evening ... did some profiling. It's about half the speed of the combined by-line processor at this point. Whilst this is a very large improvement on where it was, it's obviously not enough.

From some numbers I think the bottleneck is the rasteriser. The rasteriser routine is very simple and compiles quite well and the dma interface is about as minimal as possible so there is little possibility of improvement. It's probably just the 1:4 fan-out being too much.

Tagged hacking, parallella.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!