About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

Friday, 18 September 2015, 19:00

Nice curves!

Bézier Curves.

Wow what a page.

Tagged graphics, hacking.
Friday, 03 April 2015, 04:52

another variation on histogram equalise

Yeah I dunno this was something I thought up whilst getting up this morning. I can't even remember why I would have been thinking of such a thing.

The original:

Standard histogram equalise (it's slightly different from the Gimp's due to some range handling, but it's just as shithouse):

My other really cheap tunable histogram equalise, based on truncation, at the default setting. Maybe a bit dark in the dark areas but it can be tuned further.

The new one tries to solve the same problem but in a completely different way:

And another test image which probably shows it even better:

Here the standard operator just makes a pig's breakfast of pretty much everything:

The simple limit works pretty well here:

And here I think the new one pips it. It retains about the same amount of detail within the coins but adds more contrast to the surface they're resting on without burning it out.

The previous algorithm applies a hard ceiling, derived from the mean, to the histogram bins after they are calculated.

This algorithm instead applies a localised sameness test to each pixel before it is added to the histogram; if the test fails the pixel simply isn't counted. This means any areas of similar or slowly-changing intensity just drop out of the histogram calculation. Some tweaks are needed to handle extreme cases, but as can be seen it works pretty well as-is on challenging input.
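For the curious, here's roughly how the idea could be expressed in C - a minimal sketch only, assuming an 8-bit greyscale image; the 3x3 window and the threshold parameter are illustrative choices rather than what I actually used:

#include <stdint.h>

/*
 * Histogram equalise with a localised sameness test: pixels whose
 * neighbourhood is too uniform are not counted, so large flat or
 * slowly-changing areas drop out of the histogram calculation.
 */
void equalise_sameness(const uint8_t *src, uint8_t *dst,
                       int width, int height, int thresh) {
    unsigned int hist[256] = { 0 };
    unsigned int total = 0;

    /* pass 1: histogram over "busy" pixels only */
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int lo = 255, hi = 0;

            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int v = src[(x + dx) + (y + dy) * width];
                    lo = v < lo ? v : lo;
                    hi = v > hi ? v : hi;
                }
            }
            if (hi - lo > thresh) {
                hist[src[x + y * width]]++;
                total++;
            }
        }
    }

    /* pass 2: build the cumulative mapping and apply it as usual */
    uint8_t map[256];
    unsigned int sum = 0;

    for (int i = 0; i < 256; i++) {
        sum += hist[i];
        map[i] = total ? (sum * 255) / total : i;
    }
    for (int i = 0; i < width * height; i++)
        dst[i] = map[src[i]];
}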

But it's somewhat more expensive to calculate, so the previous algorithm is probably more desirable given the similarity of the results.

I just snarfed the base pics from over here.

Unrelated musings and rants

Hmm, 2am again. My fibre NBN was down again tonight and I discovered my old ADSL is still connected and still being billed. Yay. What the hell, Internode: make it bloody clear that "switch plans" isn't a switch, if it isn't a switch. This whole thing really hasn't been anywhere near as smooth as it should've been; then again they charged me for permanent dial-up for about 6 years after I got ADSL, so they certainly have form.

Probably should've gone to bed after that and a bit of TV, but it's been a long week of trying to decipher some odd(?) C++ which wasn't very pleasant, and I needed to 'relax'. Although the C++ thing ended with some success.

I wrote a fairly long rant against it yesterday but decided not to post it. In brief: it started with the pleasant surprise of the performance of JAXB on a 200MB XML file (hah, not memory performance, but running time), and another in discovering that you can put a whole array of primitives into an attribute, which drops the verbosity considerably - the 200MB was after finding this out, so it could've been much worse. Then a realisation of self-stupidity after spending an hour or so writing an s-expression loader from scratch for a completely one-off, throw-away data-transfer job (wasn't thinking straight there). Two lines of calling JAXB and a few annotated holding classes was only about 3x slower at reading the XML than the C++ - all hand-written, all-in-header-files macro-template stuff - was at reading its own proprietary binary format. For the task that would probably do, but I want to run this a few times and 5 seconds is too long, so that led to an ensuing of mirth once I tried the Java default all-i-did-was-add-an-interface serialisation via Object*Stream: about 5x faster and 30% smaller than the C++. And finally some satisfaction in porting a number-crunching algorithm across and having it run 50% faster, with simple plain code using consistent syntax that compiles in a 10th of a second rather than 30, and no multiple-screen error messages for using the wrong type of bracket in one place.

Not that C++ programmers have writing needlessly complex code all to themselves - it seems to be a universal skill and one widely practised. C++ does seem to attract adepts of the highest skill in the art though, and fortunately for these code-fiends it has the most facilities to aid them in their vitally important work. They certainly seem to take pride in it. Java is probably a close second, truth be told, but for very different reasons: the whole weird self-help-like book/training scene, and probably just a lot of newbies who don't know better, at least at one time (any time I look back ~5 years I always see things I wouldn't do that way now, or worse).

That is not abstraction

Now I remember one thing I thought of whilst writing the post yesterday based on my recent C++ exposure.

The C++ language just doesn't seem to understand what abstraction is for. It seems so busy hiding the actually-important machine-level details whilst exposing, in intricate detail, as much fluffy non-essential information about its own type system as it can - just so the compiler has any hope of understanding what you intended - together with some vague (and basically worthless) guarantee that all this noise will produce somehow "optimal" code, for some interpretation of "optimal" which doesn't seem to take physical hardware into account as such (not surprising given the preceding paragraph).

When these features are used in libraries (or perhaps conglomerations of syntax more succinctly described as "the headers are a platform") the information hiding can be taken to an absurd level: they've actually managed to hide the ability to understand even the syntax of a single line of program code in a widely used language by just looking at it. Now not only do you need to know the basic syntax and semantics of the language itself, you also need to have an intimate knowledge of the entire platform included in the header files before you know the meaning of basic expressions. Even in the one place where semantics cannot be deduced without extraneous knowledge - i.e. function invocation - C++ allows the removal of the one small mnemonic you had as a guide. Just think of all the man-years spent getting this snot to compile.

But back to the point - by doing this you haven't hidden anything - you've just exposed everything. Because without knowing everything you can't understand any of it.

Maybe it was a particularly weird bit of code, but I did some searches and it seems the things it does are common and even seen as desirable. But people also believe the strangest things, and the internet is on average simply full of shit.

Tagged graphics, java, rants.
Wednesday, 10 September 2014, 13:42

Little gpu bits

I've mostly been taking it easy - i'm not going to be on leave forever (unfortunately) - but i've tried a couple of little things on the gpu code.

First I tried creating a tile-based implementation for the ARM/host version but this runs about 1/2 the speed of the line-oriented one. Not that I really optimised it but that's a lot to make up and i don't see the point; it's a convenient test-bed for experimenting though.

Then I tried creating tile-accurate indexing rather than using the bounding box. This improves the output a small amount on the purely arm version but takes a hit on the epiphany backend since the hit to the arm-side code exceeds the gains on the epiphany-side. It will depend on the workload and it might be worth it for larger triangles. Then again maybe the index isn't helping as much as I thought.

I also started (re)reading about some lighting stuff but didn't get very far.

Feeling pretty lazy today too.

Update: But not too lazy to poke a bit more it seems.

I made a "slight improvement" to the ARM based tile renderer and now it's a bit faster (10%) than the line-based one with a specific test-case. Being lazy the first time I was just processing the tile row by row rather than performing the rasteriser pass across the whole tile first and then processing the fragments afterwards. This just helps the compiler keep more setup data in registers for each loop and is closer to how i'm doing it on the epiphany.

Update: Haven't been able to get into it this last week. I think hayfever season is starting and even before the symptoms hit it just seems to wreck my sleep more than normal. Been really tired/lethargic and not really feeling like doing anything - it just feels like all i'm doing each day is hanging around waiting to escape from it into the unconsciousness of sleep again. Today I even feel like i'm "coming down with something" although i'm pretty sure i'm not and it's just some hayfever related nonsense. I've done a little gardening at least - preparing some garden beds, putting in a few seeds, and rejuvenating some pots.

But as a bit of a puzzle a few days ago I tried to see if i could get the rasteriser loop any faster. I think I can get the inner loop down to 8 cycles with some unrolling, double load/stores and some constant preloads. The previous best was 10 cycles but i'm not sure this new version is practical.

This came out of playing with the idea of breaking the work up into squares (4x4 or 8x8) rather than rows. This has overheads due to performing the edge tests multiple times outside of each pixel test but also reduces the overheads of calculating over the bounding box. But it's one of those things I need a solid afternoon to try out by coding it up.

These tile tests also allow one to determine full coverage outside of the loop - which removes the need for the edge-testing calculations entirely (see the sketch below). So I tried to see if that could save anything in the inner loop; but so far the latency from the z-buffer testing has prevented any gains being made. Even assuming I could pipeline that away, I think I can only save 1 cycle.
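The usual trick for the coverage determination, for reference - evaluate each edge function at the tile's extreme corners. This is just a sketch with made-up names, not what I've actually coded:

enum coverage { TILE_OUT, TILE_PARTIAL, TILE_FULL };

/* edge function e(x,y) = A*x + B*y + C, inside when e >= 0 */
static enum coverage tile_coverage(float A, float B, float C,
                                   float x0, float y0, float size) {
    /* corners where the edge function is largest / smallest over the tile */
    float xmax = A >= 0 ? x0 + size : x0;
    float ymax = B >= 0 ? y0 + size : y0;
    float xmin = A >= 0 ? x0 : x0 + size;
    float ymin = B >= 0 ? y0 : y0 + size;

    if (A * xmax + B * ymax + C < 0)
        return TILE_OUT;      /* best corner fails: tile entirely outside */
    if (A * xmin + B * ymin + C >= 0)
        return TILE_FULL;     /* worst corner passes: no per-pixel edge tests */
    return TILE_PARTIAL;
}

A tile is fully covered when all three edges report TILE_FULL, at which point only the z-buffer test remains in the inner loop - which is where the latency problem above comes from.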

I also toyed with creating an integer rasteriser that stores the framebuffer internally using bytes. For a flat shaded/z-buffered/non-blended triangle I think I can get that down to 7 cycles per pixel (and that's rendered, not just converted to fragments). Is that even useful? Who knows. But to test that idea out I need to work on a new design which will take another solid afternoon as well.

Tagged graphics, hacking, parallella.
Saturday, 06 September 2014, 15:01

Too noisy

Been playing a bit with simplex noise. It's interesting how much you can create from the same basic function, and it's kind of cathartic and easy on the brain.

The following were all created with the same basic noise. 4 frequencies are combined in different ways. I'm using Z as an animator so they all smoothly animate.

Blobs of liquid. This uses max(). The frequency is the same for each layer but the amplitude is altered. Note that this is purely 2D and the depth appears due to attenuation.

Smoke or writhing organic mass. This uses max(abs()) and a lower frequency.

Lava lamp ringlets. Scales to an integer and selects one of the bits from the integer. Again the depth is from attenuation and scaling in this case.

Friesian cow-hide. Or a coastline. Or a burning piece of paper. Threshold with multiple frequencies. Works very nicely as a blending mask for image transition.
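In rough C the combinations look something like this - a sketch assuming a simplex noise function noise3(x,y,z) returning values in about [-1,1]; the constants and the per-layer z offsets are just placeholders:

#include <math.h>

float noise3(float x, float y, float z);   /* the basic noise function */

/* blobs of liquid: max() over layers; same frequency, halving amplitude */
float blobs(float x, float y, float z) {
    float v = -1.0f, amp = 1.0f;
    for (int i = 0; i < 4; i++) {
        v = fmaxf(v, amp * noise3(x, y, z + i * 10.0f));
        amp *= 0.5f;
    }
    return v;
}

/* smoke / writhing organic mass: max(abs()) and a lower frequency */
float smoke(float x, float y, float z) {
    float v = 0.0f, amp = 1.0f;
    for (int i = 0; i < 4; i++) {
        v = fmaxf(v, amp * fabsf(noise3(x * 0.5f, y * 0.5f, z + i * 10.0f)));
        amp *= 0.5f;
    }
    return v;
}

/* lava lamp ringlets: scale to an integer and select one of its bits */
float ringlets(float x, float y, float z) {
    int i = (int)((noise3(x, y, z) + 1.0f) * 127.0f);
    return (i >> 3) & 1 ? 1.0f : 0.0f;
}

/* cow-hide / coastline: threshold over multiple frequencies (a blend mask) */
float mask(float x, float y, float z, float level) {
    float v = 0.0f, amp = 0.5f, freq = 1.0f;
    for (int i = 0; i < 4; i++) {
        v += amp * noise3(x * freq, y * freq, z);
        amp *= 0.5f;
        freq *= 2.0f;
    }
    return v > level ? 1.0f : 0.0f;
}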

I've mostly been playing with the hash function to try and create something epiphany-efficient whilst still working sufficiently well. For 3D noise the current candidate uses 3 lookup tables to provide a basic hash of the x/y/z locations, which are then combined using floating-point multiplies and/or other bit ops. I'm only using random table values, which works most of the time, although a better choice should be possible. It may be worth just going back to the permutation array of the original code, as I realised I can implement that in only 256 bytes if I need to. I still don't know how it will run on the machine since the simplex setup code is pretty expensive too, but I haven't looked at how to optimise it yet. Originally I was looking at 2D noise because it was simpler, but as 3D noise is just so much more useful I will target that instead.

I also created a 16-element spherical set of vectors for the base noise gradients. First I used an inscribed cube and some others I made up, but then I finally found the code by Jon Leech (hint: it's at the bottom of the page!) which models electron repulsion to space the points evenly across the sphere. This does create a nicer result. 16 is used since the modulus of 16 is a lot easier to calculate than a modulus of 12 - it's just a bitwise AND, as in the sketch below.
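The hash-and-gradient lookup then reduces to something like the following sketch; the tables are assumed to be filled with random bytes, and I'm using xor to combine them here purely for illustration (the real candidate mixes multiplies and bit ops):

extern const unsigned char hashx[256], hashy[256], hashz[256];
extern const float grad16[16][3];   /* the 16 evenly-spaced unit vectors */

static const float *gradient(int i, int j, int k) {
    unsigned int h = hashx[i & 255] ^ hashy[j & 255] ^ hashz[k & 255];

    /* 16 gradients means "% 16" is just "& 15"; % 12 needs a real divide */
    return grad16[h & 15];
}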

I do see patterns showing up, particularly with the ringlet algorithm - lines at 45 degrees appear as you zoom out - but this shows up for the original too. The noise is definitely not zero-mean. If I average over many frames I also get fairly regular blobs at 45 degrees - although at 90 degrees to the ones that show up zoomed out - but again this is also present with the original simplex noise hash function and gradient set.

Tagged graphics, hacking.
Monday, 18 August 2014, 21:23

It lives!

Oops, wrong stride.

It lives!

I found the 1/2 hour required to hook up the epiphany rasteriser tonight.

Fun facts for that rotating double-triangular pyramid:

The epiphany should scale much better than the ARM, but I don't feel like poking more tonight.

Gawd, I just realised that screenshot looks way too much like the damn windoze logo. Just an unfortunate coincidence: the colours are just the primaries, and the background colours are supposed to be Commodore 64-like (the camera isn't picking them up very well).

The lack of any vblank interrupt in the video hardware ... well, that's very uninspiring too (not that it should really come into play, but it's the principle of the thing).

Update: Ok I had a tiny play. If I scale the model transform by 2x the times go to 2.6s, 23.5s, and 7.2s. i.e. much better scalability on the epiphany as expected.

Tagged graphics, hacking, parallella.
Sunday, 10 August 2014, 17:51

first triangle from epiphany soft-gpu

I was nearly going to leave it for the weekend but after Andreas twattered about the last post I figured i'd fill in the last little bit of work to get it running on-screen. It was a bit less work than I thought.

The times are for rotating the triangle around the centre of the screen for 360 degrees, one degree per frame. The active playfield is 512x512 pixels. Z buffer testing is on.

Actually the first triangle was a bit too boring, so it's a few hundred triangles later.

Update: I was just about to call it a night and I spotted a bit of testing code left in: it was always processing 1280 pixels for each triangle rather than the bounding box. So the times are somewhat out and it's more like arm(-O2)=15.5s, epu 1x=11.5s, 4x=3.9s, 8x=3.1s, 16x=2.4s. I also did some double-buffering and so on before I spotted this bug, but the timing is so shot it turned out to be pointless.

I did confirm that loading the primitive data is a major bottleneck however. But as a baseline the performance is a lot more interesting than it was a few hours ago.

Tagged graphics, hacking, parallella.
Friday, 08 August 2014, 21:15

epiphany soft-gpu thoughts

I've been feeling a bit off of late so I haven't been hacking much of an evening, but I did get a spare couple of hours to poke at the soft-gpu and finally write some epiphany code.

Of course I got completely side-tracked on the optimisation side of things, so I didn't get terribly far. But I solidified the plan of attack and sorted out a way to provide C-based shader code that will still get some performance. I have much of the interesting setup code done as well (although there is more uninteresting stuff; maybe I will just use Java as the driver).

I've re-settled on the earlier idea of separating the rasterisation from the fragment shading but it will all run on the same core. There will be 3 loops.

  1. Rasteriser which performs in-triangle and Z/W buffer tests and generates the X coordinate and interpolated 1/W value for all to-be-rendered fragments;
  2. Reciprocaliser[sic] which inverts all the 1/W values in a batch;
  3. Fragment processor which interpolates all of the varying values and invokes the fragment shader.

This allows each loop to be optimised separately and reduces register pressure. Due to the visual similarity of some of the setup I thought there would be some duplicated calculations, but there actually aren't, since each is working with different values.
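The row processing then ends up shaped roughly like this - names, the fragment record, and MAX_FRAGMENTS are placeholders; the real inner loops are the subject of the posts below:

struct fragment {
    int x;        /* pixel location within the row */
    float w;      /* interpolated 1/w, replaced by w after loop 2 */
};

void process_row(const struct triangle *t, int y) {
    struct fragment frags[MAX_FRAGMENTS];
    int count;

    /* 1: in-triangle + z/w tests, emits (x, 1/w) per surviving fragment */
    count = rasterise_row(t, y, frags);

    /* 2: reciprocaliser - batch-invert the 1/w values */
    for (int i = 0; i < count; i++)
        frags[i].w = 1.0f / frags[i].w;

    /* 3: interpolate the varyings and invoke the (inlined) fragment shader */
    shade_row(t, y, frags, count);
}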

1 and 2 will be hard-coded as part of the platform but 3 will be compiled separately for each shader so that the shader can be compiled in-line. This is the only way to get any performance out of the C code.

The shaders will be compiled something like this:

/*
 * Shader fragment to call
 */
#define SHADER_INVOKE(colour) solid_gourad(colour, uniform, var0, var1, var2)

/*
 * An example shader - solid (interpolated) colour
 */
static inline void solid_gourad(float *colour, float *uniform, float var0, float var1, float var2) {
    colour[0] = var0;
    colour[1] = var1;
    colour[2] = var2;
    colour[3] = 1.0f;
}

/*
 * Include the actual routine to use
 */
#include "e-fragment-processor.h"

And e-fragment-processor will have a generic inner loop which will be something like:

void draw_row(... arguments) {
 ... setup
    const float var0x = v[VS_X+0];
    const float var1x = v[VS_X+1];
    const float var2x = v[VS_X+2];

    // Set start location for interpolants
    float var0_w = (var0x * fx + v[0 + VS_Y] * fy + v[0 + VS_Z]);
    float var1_w = (var1x * fx + v[1 + VS_Y] * fy + v[1 + VS_Z]);
    float var2_w = (var2x * fx + v[2 + VS_Y] * fy + v[2 + VS_Z]);
    // ... up to whatever limit I have, 16 is probably practical

    for (int i=0;i<count;i++) {
        struct fragment f = fragments[i];

        // divide by w to get interpolated value
        float var0 = (var0_w + f.x * var0x) * f.w;
        float var1 = (var1_w + f.x * var1x) * f.w;
        float var2 = (var2_w + f.x * var2x) * f.w;
        // .. etc

        // shader says how many varX's it uses so compiler automatically
        // removes any redundant calculations: so only one version of this file
        // need be created
        SHADER_INVOKE(colour + f.x * 4);
    }
}

Written this way, a simple colour Gouraud shader is around 500 bytes and the inner loop is 20 instructions, although not very well scheduled.

The end goal would be to have multiple shaders loaded dynamically at runtime but that sounds like too much work so i'll keep it simple and just link them in.

It's a trade-off between ease of use and performance although from some preliminary benchmarking (well, looking at what the compiler produces) I think this is about as good as the compiler is going to get. Being able to provide a programmable shader at near-optimal performance would be a nice bullet-point.

An alternative is that the shader must just implement draw_row() and the code template above is copied; this might be useful if some other hard-to-calculate value like the reciprocal is required per-pixel and it can separate that pass into a separate loop.

Memory

On memory i've decided to set the rendering size to 512 pixels. I was hoping for 1024 but that's just a bit too big to fit and a bit too much work for the memory bus besides.

That leaves 15K (not 7K - oops, out by 8K) for code and stack and some other control structures - which should be enough to do some interesting things. I decided the data needs to be transferred using DMA because the final pass only needs to scale and clamp the floating-point framebuffer data to bytes (sketched below): this is not enough work to prevent the output writes stalling the CPU. Having a separate buffer for the DMA allows the rest to run asynchronously. I will need to round-robin the DMA writes for greatest performance or run them via a central framebuffer controller (and/or dedicate a whole core to the job, in which case it would maintain the colour transfer buffers too).
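That final pass is about as simple as it sounds - a plain-C sketch, with the unrolling and scheduling left to the imagination:

#include <stdint.h>

/* scale and clamp one row of the float framebuffer into the DMA byte buffer */
static void row_to_bytes(const float *src, uint8_t *dst, int count) {
    for (int i = 0; i < count; i++) {
        float v = src[i] * 255.0f;

        v = v < 0.0f ? 0.0f : v;
        v = v > 255.0f ? 255.0f : v;
        dst[i] = (uint8_t)v;
    }
}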

Actually the above design does let me efficiently split the fragment shaders into separate cores too if I want because they only need to transfer (x,1/w) tuples for each fragment to render - this was my original idea. If I did that then I could probably fit a 1024-pixel row in memory too.

The bottlenecks?

The gpu will work most efficiently by processing every triangle in the scene in one pass: this allows the framebuffer to stay on-core (and in the native floating-point format), which provides very high bandwidth and makes blending essentially free. Once every primitive on that row has been rendered, the local framebuffer row cache is converted to bytes and flushed out to the real framebuffer (multi-pass rendering would also require loading from the framebuffer first, but let's not get carried away here).

I'm intentionally not worrying about texture maps (as in, not implementing anything for them). Yes, they could be used, but the performance hit is going to be so dire that using them won't be desirable. If they were used, I think a separate texture-fetch pass would be required before the fragment shader - one that can fire off some scatter-gather DMA and then process the results as they arrive. I don't think this is going to be easy or efficient with the current DMA capabilities.

So, ... ignore that. I will need some useful noise functions so that interesting textures can be procedurally generated instead.

The epiphany-to-framebuffer speed is pretty low, but that's fixed: there's nothing I can do about it, so there's no use wasting time crying over spilt milk on that one.

So, ... ignore that too.

I think the main bottleneck will be the transfer of the primitives - because they will all have to be loaded for each row. I will add some input indexing mechanism to separate them into bands so the loading of out-of-range primitives is reduced, but fully indexing every row would be costly. If I can work out how to get the broadcast DMA to work (if indeed it does actually work) then that may help alleviate some of the bandwidth requirements, although it comes at the cost of forcing all rasterisers to operate in lock-step across the same band of framebuffer - which might be worse.

I may be completely off on this though - I really gotta just code this up and see how it works.

Deferred Rendering

Actually, just to get way ahead of myself here: another alternative is a type of deferred rendering. Rather than keeping track of the colour buffer it could just keep (triangle id, x, 1/w) for each visible pixel. Once it's finished it could then just process the visible pixels - at most once per pixel - something like the sketch below.
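The deferred buffer itself would be tiny - something like this per-pixel record and resolve pass. A hypothetical layout (NO_TRIANGLE and shade_pixel are made up); x becomes the array index if the buffer is stored densely per row:

#include <stdint.h>

struct deferred {
    uint16_t tri;    /* winning triangle id, or NO_TRIANGLE */
    float    w;      /* 1/w at the pixel */
};

/* resolve: shade each visible pixel at most once */
void resolve_row(const struct deferred *row, int width) {
    for (int x = 0; x < width; x++)
        if (row[x].tri != NO_TRIANGLE)
            shade_pixel(row[x].tri, x, 1.0f / row[x].w);
}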

This could be implemented by splitting the triangle primitive into two parts - first the bounding box, edge, z/w, and 1/w interpolation equations, and second the varying equations. Each pass only needs its own set of data - so it could reduce bandwidth requirements too.

Blending is more difficult: with it on, every visible triangle would need to be rendered immediately, and any previously-rendered triangles waiting in the deferred buffer would need to be flushed first.

Something to defer till later I guess (ho ho).

Tagged graphics, hacking, parallella.
Thursday, 31 July 2014, 22:06

lost in dots

Continued to play with software rendering. Ported the code to C and got it running on the framebuffer - still only on the workstation so I don't have any idea how it will run on the parallella.

The Java seems to be faster than the C, TBH, but that is probably rendering to X versus the framebuffer, which is very slow. gcc isn't having the best time with the critical inner loop either. Currently I'm not running any fragment shaders, and I have half an idea to just try creating a small soft-3d engine all in Java, since the jvm can handle some of that. Maybe not.

I thought i'd look at GLES2+ to see how it handles certain things, ... and decided to base some of the API on that because it dictates to a good degree the backend implementation and is based on 20+ years of industry experience. Apart from the shader compilers (which i'm going to side-step entirely) the core components are quite simple. The biggest one from my perspective is the one i've been focusing on so far - the rasteriser and varying interpolator.

I hadn't really thought about how many varyings there need to be, and 8x4 is a good amount. I hadn't got to thinking about uniforms yet either.

I played a bit with code to implement glDrawArrays(), which is mostly just data conversion and flattening. My first obvious thought was to select elements from the arrays and turn them into rows (i.e. array of structures) and then batch-process them using the vertex shader - locality of reference and all that. But after thinking about the inner loop of the rasteriser, it will be more efficient to use a structure-of-arrays approach (see the sketch below). It also makes it easier to SIMDise anything that needs to run on the ARM - quite likely the vertex shaders and primitive setup. It makes reading the vertex array data itself much more efficient anyway, since each array type becomes an outer-loop decision rather than an inner-loop one, although it could have some poor cache behaviour if that isn't taken into consideration.
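For illustration, the difference between the two layouts - field names and COUNT are mine:

/* array of structures: everything for one vertex together */
struct vertex {
    float x, y, z, w;
    float var[8 * 4];         /* up to 8 4-element varyings */
};
struct vertex verts_aos[COUNT];

/* structure of arrays: one array per attribute/component */
struct vertex_batch {
    float *x, *y, *z, *w;     /* COUNT elements each */
    float *var[8 * 4];
};

With the second form, selecting and converting each GL array type is decided once per attribute in the outer loop, and the inner loops always walk contiguous floats.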

The potential size of the various data structures has made me rethink the approach I was going to take. I was going to load as many triangles into the rasteriser as would fit in LDS and process them all at once. But that just won't be enough to be any use, so it will have to cycle through them multiple times for different rendering bands. If I drop it down to just a pair of buffers it simplifies the i/o stuff and doesn't make any real difference to the number of times things are read from external memory.

Due to memory constraints I'm going to have to split the rasteriser + z-buffer test onto one or more cores and the fragment rendering onto other cores. Each varying interpolator requires 3 floats per element, or 12 floats per 4-element vector. A simple Gouraud-shaded polygon will require at least 128 bytes to store everything needed (plus some control stuff). I was going to pass the uniforms around with the triangle but they are too big, so instead I will pass incremental updates to uniform elements in a processing stream, and the rasteriser will make sure the fragment shaders have the updates when necessary as well.

The queueing will be a bit of a pain to get right, but i'm ignoring it for now.

Rasteriser inner loop

Because of the number of varying elements I had to change the way I was thinking of writing the inner loop. I can't keep everything in registers and will need to load the interpolation addends as I go - this is why the data has to be stored as a structure of arrays: the loop is always fetching element 0 of a 3-element vector, and that's more efficient to load using dword loads.

Also, since fadd is just as costly as fmadd, I can avoid executing the incremental update of the coefficients on every pixel - i.e. when a z-buffer test fails. It would also allow me to split the work into a couple of passes - one that generates live pixels, and another which turns them into interpolated fragments and could run branchless. I'm not sure that is worth it yet though, since it doesn't remove the branch.

So a single horizontal scan will be something like the following. This is the biggest case.

  (v0,v1,v2) = edge equation results for start location
  (zw, 1/w)  = interpolants for z/w, 1/w
  (p0-p31)   = interpolants for varying values

  (e0,e1,e2) = update values for edges
  (ez, ew)   = update values for z/w, 1/w

   x = start;
   lastx = start
start:
   in = v0.sign | v1.sign | v2.sign;
   if in
     zt = load zw-buffer[x]
       if (zw > zt)  // i think some of my maths is flipped
         store zw-buffer[x] = zw
         // use delta of X so that 3-instruction fmadd can be used!
         xdiff = x - lastx;
         lastx = x;

         ; imagine this is all magically scheduled properly ...

         1/w += xdiff * ew

         dload ep0-ep1
         p0 += xdiff * ep0
         p1 += xdiff * ep1
         dload ep2-ep3
         p2 += xdiff * ep2
         p3 += xdiff * ep3
         .. etc

         write target fragment record
         write 1/w
         write z/w
         // (i don't think i need v0-v3)
         write p0-p31
      fi
  fi
  v0 += e0;
  v1 += e1;
  v2 += e2;
  zw += ez;
while !done

I'm still experimenting with ways to work out how 'start' and '!done' are implemented. I've got a bunch of different algorithms, although I need to get it on-hardware to get more accurate timing information. The stuff in ATTILA uses a sort of flood-fill algorithm, but for various reasons I don't want to use that. I have a pretty reliable way to get both the start and end point of a line using a binary search ... but it's costly if run every line (it's still not perfect either, but I think that's down to float being a shit :-/).

My current winner uses an 8x8 search every 8 lines and then uses that as the bounds. And that's even with a broken test which often over-runs (I've subsequently found the maths that works but haven't included it yet).

For small triangles just using the bounding box is pretty much as good as it gets, so may suffice.

Removing the loop branches

Actually it might be possible to have 2 branchless loops ...

  (v0,v1,v2) = edge equation results for start location
  (zw)       = interpolant for z/w,

  tmp        = room for hit locations
loop1 (over bounds):
  in = v0.sign | v1.sign | v2.sign;
  zt = read zwbuffer[x]
  in &= (zt - zw).sign
  zt = cmove(in, zw)
  write zwbuffer[x] = zt
  tmp[0] = x  
  tmp = cmove(in, tmp+1)

All branch-free. At the end, tmp points to the end of an array which includes the X values of all visible pixels.

The second loop just interpolates everything for the x locations stored in the tmp array.

  lastx = 0
loop2 (over tmp-tmp.start):
  load x
  increment everything by (x-lastx)
  output
  lastx = x

As with a previous post about getting good fmadd performance, it should be just about possible to dual-issue every fmadd and remove all stalls. All writes are written directly to the fragment shader core for that row?

Requires working memory for the X coordinates though; but that should be doable since this is all the core will be running.

However, having said all that: the fragment output from a single row can exceed the fragment processor's memory, so it needs to be throttled. Back to loops and branches I guess, but it was an interesting puzzle.

As one might tell, I'm not really that concerned with getting a working solution so much as investigating ways to fit the peculiarities of the hardware. There's just a lot to write, even if I've got a fairly good idea of a workable solution and all the components necessary :-/

Or maybe ...

Just do everything in a single-loop on the same core?

Yes, maybe. Hmm, dunno now. It really depends on whether there's enough memory for it to fit, and whether loading the triangles N times rather than once is slower than all the comms overheads. Unless the mesh is saturated, I would say no chance.

It's much easier and probably the first step to performance analysis anyway.

I guess I was more lost in dots than I realised when I made the subject (it was for the pic).

Tagged graphics, hacking, parallella.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!