About

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed

Tags

android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Tuesday, 29 July 2014, 14:04

better edges

Been poking more at the renderer. I have a fairly clean set of routines now that can be used to implement most of what I need: generating fragments, interpolating arbitrary parameters, handling a depth test.

And well, just running it and looking at the pretty pictures ...

The Z/W buffer stretched to 8-bit:

Add a few more, use a different 'fragment shader', ... polygonal shade-bobs.

I'm still just running stuff in Java on my workstation but i've been doing a lot of experimenting with micro-optimisations an trying to work out how i'm going to split the work up.

My current idea is to process the scene by complete scan-lines rather than tiles. This greatly simplifies the looping and address arithmetic but would not be very efficient for texture lookups. I'm basing the following on a loose understanding of a gpu pipeline but erring on the side of epu-efficiency.

I'm not sure where the first and second passes will reside, and for small triangles they will possibly run in the same loop. It may make sense to place some of the decision work on the arm so as to reduce the need for multiple passes over the triangles for each 'run'. In any event several rows will be processed at once since there will be several epus dedicated as fragment/row processors.

It's interesting that due to the FPU and branch latency you're forced to consider SIMD style algorithms for efficiency in a lot of cases. i.e. unrolling loops, branchless logic, re-arranging processing so decisions are taken elsewhere. I'm still juggling the code around to find out what works best on-paper although I should start poking on-board soon.

That better edge test

I'm also surprised at just how little code is required for the inner loop. Only 3 flops and 4 ialu ops (+branch) are required to implement a length-bounded 'inside triangle run' test.

This is an example of scanning right looking for the first set pixel in the triangle.

   ; v0, v1, v2 = edge equation value (already calculated)
   ; e0, e1, e1 = edge equation x coefficient
   ; i          = length left
1: and   code,v0,v1    ; negative test for v0 & v1
   fadd  v0,v0,e0
   and   code,code,v2  ; negative test for v0 & v1 & v2
   fadd  v1,v1,e1
   sub   i,i,#1
   fadd  v2,v2,e2
   orr   code,code,i   ; negative test for i
   bgte  1b 

5 cycles + branch = 8. Can be unrolled.

This is the equivalent of:

  do {
   code = (v0 < 0 && v1 < 0 && v2 < 0) || (i < 0));
   v0 += e0;
   v1 += e1;
   v2 += e2;
   i -= 1;
  } while (code == 0);

Rather than move the negative bit somewhere else as in the previous post I found just leaving them in-place reduces the instruction count. This also lets me combine float and integer comparisons in the same way.

A similar calculation can be used as the exit condition for the fragment generating loop. As each parameter is interpolated using a single addition within the loop it makes for a very simple loop. The only "problem" is that the calculations are so simple that branches become more expensive.

Tagged graphics, hacking, java.
lost in dots | edges
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!