Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


Wednesday, 10 February 2010, 12:37

Vectors and Bits again

Well I fixed the `c long' version of the rect-fill from the update mentioned a couple of posts ago ... and a bit more besides.

After sleeping in a bit I worked on some MMU code so I can start using the CPU cache. Most of that was just gaining a deeper understanding of the permission and memory type bits, which are a little confusing in places. It looks like it's been extended a couple of times whilst keeping compatibility, so there are multiple combinations that appear to do the same thing but with different nomenclature. Hmm, I have it more or less worked out ... I think. So once I got the MMU code working, it allowed me to enable caches and play a bit with various options. I used only section and super-section pages - 1MB or 16MB, so i'm probably only using a couple of TLB entries to run everything (= no page table walks).

I was assuming the caches were on when i enabled the MMU ... oh but they weren't, of course ... stupid me. Wow does that make a difference ... Wow.

Ok, pause to run a few more timings. ... Here goes.

Code                   Total    Slowest Fastest
C short             36097442    0.89    5.22
C long              40526536    1.00    5.86
ARM asm             15801430    0.38    2.28
NEON                 9654736    0.23    1.39
NEON2                9982542    0.24    1.44
NEON3                9421366    0.23    1.36
NEON4                9467262    0.23    1.37
sDMA                 6904794    0.17    1.00

(see 2 posts ago, or render-rect.c for what they mean)

This is the original scenario from a previous post, but with a 'fixed' C long version. Strangely, it runs slower than the short version. A cursory look at the assembly looks like it's doing the right thing - but it's not worth looking deeper. My guess is the extra logic required for the un-aligned edges is throwing it out or the pointer aliasing is making the compiler angry. Oddly, the performance monitor is registering the same number of data writes too.
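For reference, here is a minimal sketch of what a `C long' style fill has to deal with: 32-bit stores for the middle of each row, plus 16-bit stores for the un-aligned edges. It is not the code from render-rect.c; the name fill_rect_long() and its parameters are made up for illustration, and the 32-bit store goes through memcpy() purely to sidestep the pointer aliasing question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a w x h rectangle of RGB565 pixels, two pixels per store where possible.
 * stride is the row length of the framebuffer in pixels. */
void fill_rect_long(uint16_t *fb, size_t stride, size_t x, size_t y,
                    size_t w, size_t h, uint16_t colour)
{
    /* Both halves hold the same pixel, so byte order doesn't matter. */
    const uint32_t pair = ((uint32_t)colour << 16) | colour;

    assert(fb != NULL && x + w <= stride);

    for (size_t j = 0; j < h; j++) {
        uint16_t *p = fb + (y + j) * stride + x;
        size_t n = w;

        /* Leading edge: a single pixel if the row start isn't 32-bit aligned. */
        if (n > 0 && ((uintptr_t)p & 2u)) {
            *p++ = colour;
            n--;
        }
        /* Aligned middle: one 32-bit store per pixel pair. */
        for (; n >= 2; n -= 2, p += 2)
            memcpy(p, &pair, sizeof pair);
        /* Trailing edge: one left-over pixel. */
        if (n > 0)
            *p = colour;
    }
}
```

The two edge tests per row are the `extra logic' in question; for narrow rectangles they make up a fair share of the work.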

Anyway, who cares. Let's turn the MMU on and set the memory regions up properly and see what happens. Even with the caches off things happen, although not much.

With MMU on, graphics = wt                    With MMU on, graphics = wb

Code           Total    Slowest Fastest       Code           Total    Slowest Fastest
C short     36058684    0.89    7.33       C short     36233408    0.89    5.23
C long      40496404    1.00    8.23       C long      40584664    1.00    5.86
ARM asm      9367578    0.23    1.90       ARM asm     15811204    0.38    2.28
NEON         5332580    0.13    1.08       NEON         9653676    0.23    1.39
NEON2        4917308    0.12    1.00       NEON2       10057086    0.24    1.45
NEON3        5598968    0.13    1.13       NEON3        9555816    0.23    1.38
NEON4        5685246    0.14    1.15       NEON4        9431842    0.23    1.36
sDMA         6908602    0.17    1.40       sDMA         6917612    0.17    1.00

We're starting to beat the system DMA - I presume that even with the cache off this enables some sort of write-combining/write-buffering. It's interesting that the NEON2 code speeds up the most (nearly 2x) - probably given it has the smallest loop the CPU isn't in contention for memory bandwidth as much. You'd never use a write-back cache for video memory, but I timed it anyway. I really have no idea how or why using it is making any difference whatsoever though, since the global cache bits are all off!

Ok, so ... der, let's turn on the caches properly.

The way I set the MMU up is to have the first bank of memory - where all code and data resides - as write-back write-allocate (writes also read a cache-line), and the second - where the frame-buffer resides - as write-through no-write-allocate. For the `graphics = wb' case, I also set write-back write-allocate on the second bank of memory (in a separate run). All the IO devices are using shared-device mode.
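As a rough illustration of that setup, the sketch below builds 1MB section entries for an ARMv7 first-level table using the TEX/C/B encodings (with TEX remap off). The function and enum names are hypothetical, and it assumes domain 0 is set to client in the domain access control register and that every section is read/write for all modes; it is not the actual puppy bits code.

```c
#include <assert.h>
#include <stdint.h>

enum mem_type {
    MEM_STRONGLY_ORDERED,
    MEM_SHARED_DEVICE,   /* IO devices */
    MEM_WT_NO_WA,        /* write-through, no write-allocate */
    MEM_WB_NO_WA,        /* write-back, no write-allocate */
    MEM_WB_WA            /* write-back, write-allocate */
};

#define SECT_TYPE    (2u << 0)              /* bits[1:0] = 10: section */
#define SECT_B       (1u << 2)
#define SECT_C       (1u << 3)
#define SECT_AP_RW   (3u << 10)             /* AP[1:0] = 11, AP[2] = 0 */
#define SECT_TEX(n)  ((uint32_t)(n) << 12)

static uint32_t mem_type_bits(enum mem_type t)
{
    switch (t) {
    case MEM_SHARED_DEVICE: return SECT_B;
    case MEM_WT_NO_WA:      return SECT_C;
    case MEM_WB_NO_WA:      return SECT_C | SECT_B;
    case MEM_WB_WA:         return SECT_TEX(1) | SECT_C | SECT_B;
    default:                return 0;       /* strongly ordered */
    }
}

/* One first-level entry mapping a 1MB section (domain 0, XN clear). */
uint32_t section_entry(uint32_t phys, enum mem_type t)
{
    assert((phys & 0x000fffffu) == 0);      /* must be 1MB aligned */
    return phys | SECT_AP_RW | mem_type_bits(t) | SECT_TYPE;
}

/* Identity-map `megabytes' sections starting at base into a 4096-entry table. */
void map_sections(uint32_t *ttb, uint32_t base, uint32_t megabytes, enum mem_type t)
{
    for (uint32_t i = 0; i < megabytes; i++) {
        uint32_t addr = base + (i << 20);
        ttb[addr >> 20] = section_entry(addr, t);
    }
}
```

Switching the frame-buffer bank between the `wt' and `wb' runs is then just a matter of passing MEM_WT_NO_WA or MEM_WB_WA for that range.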

First, with unrolled loops.

MMU on, graphics = wt, -O3 -funroll-loops     MMU on, graphics = wb, -O3 -funroll-loops
                                              -- lots of artifacts
Code           Total    Slowest Fastest       Code           Total    Slowest Fastest
C short       957743    0.14    1.02       C short      1816546    0.28    1.00
C long        956818    0.14    1.02       C long       1992627    0.30    1.09
ARM asm       933198    0.14    1.00       ARM asm      1871829    0.28    1.03
NEON          930448    0.14    1.00       NEON         1857085    0.28    1.02
NEON2         945969    0.14    1.01       NEON2        1862711    0.28    1.02
NEON3         946522    0.14    1.01       NEON3        1848473    0.28    1.01
NEON4         945739    0.14    1.01       NEON4        1861538    0.28    1.02
sDMA         6456313    1.00    6.93       sDMA         6455228    1.00    3.55

Ahh, now this is more like it. Getting over 800MB/S (if my timing calculations are right).

Even the basic crappy C code is within a whisker of everything else - even though it executes about 3.5x as many instructions to get the same work done. The system DMA has fallen right off; but run asynchronously it would probably still be worth using since it is basically `free', and the CPU can do a lot more than just write memory. This code also polls the DMA status in a tight loop; I don't know if that is having any bandwidth effects.

The write-back timing is all out of whack - the C short version is the first to run, so it gets a benefit of having an empty cache and nothing to write-back. You also get to see the CPU write stuff back to the screen when it feels the need - lots of weird visual artifacts. And the explicit cache flushing required would only make it slower on top of that. In short - useless for a framebuffer. Any performance issues you might expect a write-back cache to address are handled much better by using proper algorithms. I saw it mentioned on the beagleboard list, so it seemed worthy of comment ...

And lastly, just with -O3, a typical compile flag (-funroll-loops generates much bigger code so might not always be desirable). I also added in a `hyper-optimised' memset implementation for good measure.

MMU on, graphics = wt, -O3

Code                   Total    Slowest Fastest
C short              1372096    0.21    1.47
C long               1038868    0.16    1.11
ARM asm               948600    0.14    1.02
NEON                  929968    0.14    1.00
NEON2                 939165    0.14    1.00
NEON3                 946102    0.14    1.01
NEON4                 945702    0.14    1.01
msNEON               1309313    0.20    1.40   (see memset_armneon())
sDMA                 6462071    1.00    6.94

The C is still ok, if a bit slower, but barely worth `optimising' in this trivial case.

The msNEON code is from the link indicated ... interesting that a more complex C loop beats it somewhat; the msNEON code is only writing the same amount of memory linearly not as a rectangle, and with severe alignment restrictions.

The NEON2 code has such a simple inner loop, yet is the most consistently top performer. Good to see that KISS sometimes still works.

 // write out 32-byte chunks
2: subs r6,#1
 vst1.64 { d0, d1, d2, d3 }, [r5, :64]! // ARM syntax is `r5 @ 64'
 bgt 2b

The ARM code is quite a mess by comparison:

 // write out 32-byte chunks
2: strd r2,[r5]
 strd r2,[r5, #8]
 strd r2,[r5, #16]
 subs r6,#1
 strd r2,[r5, #24]
 add r5,r5,#32
 bgt 2b

(FWIW I tried a similar trivial loop in ARM, a direct translation of the `C long' code, and that wasn't terribly fast).

Anyway, I think i've done memory fill/rect fill to bloody death (and beyond!) now. It's just not a terribly interesting problem - particularly for a SIMD unit. Apart from evaluating raw memory performance. Actually it is kind of handy for that since it will easily show if things aren't configured properly.

PS Code changes not committed yet.

Tagged beagle, hacking.
Tuesday, 09 February 2010, 17:57

Time is an illusion ...

Damn, it's 4am again. Knew I should've gone out for a ride yesterday, just haven't felt really sleepy - but it's starting to bite now.

I was watching TV (well, I had it on, it was a pretty boring - and extremely long - silent movie from Taiwan) and catching up on the news ... and then I got bored with that ... and poked around the 'my.ti.com' for a little while and came across some beagleboard TV out stuff. And being a glutton for punishment, something to look at at midnight ...

When playing with the Haiku boot process I had installed an older u-boot which initialises the video, so I guessed that should at least be a good signal. So I dragged it all around to the TV again and plugged it in and booted it up. Blah, still crap. What's going on. So as a last resort I tried another cable - i'd been using one of those expensive ones and just didn't expect any problems. Found a brand new cheapie from a video card or something ... damn, worked!

Well after much mucking about and a few mistakes I added some API to add TV out, and handle viewports on larger data (to clone the lcd display), and well, enough crapping on:

Don't mind the grey screen on the venerable old 1084 ... I don't have the right cable to hook up separated-lca to s-video, so there's no colour signal.

Hmm, now the cat's whining, wonder what he wants.

Tagged beagle, hacking, puppybits.
Monday, 08 February 2010, 11:03

vectors and bits

Updated, see the end of the post

Yesterday I started poking around with the SIMD unit. Wow, is that a way to eat up time or what.

Wasn't quite sure what to do with it, so played at first with writing an RGB888 to RGB565 converter. Didn't get to testing it, but it brought back memories of the SPU hacking I did before - the instruction set has a lot of similarities, although NEON is filled out more. And like with the SPU, there are so many ways to do the same thing it can be a bit overwhelming trying to find a good way of solving a problem. Particularly if you don't really know which instructions are there, or what they do. There seem to be some interesting ones though, like vsri which lets you insert the upper-bits of each element into the lower-bits of each element in another register (without clobbering its contents). I still seem to be wedded to the vtbl instruction as I was with the shuffleb instruction on SPU, although I think it's not always the best route. I really missed the spu_timing tool though - although the issue rules and latencies are simpler.
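For anyone unfamiliar with the conversion itself, a plain scalar version looks something like the sketch below. This is only a reference for the bit layout, not the untested NEON code mentioned above; the function names are invented here, and the source is assumed to be packed bytes in R, G, B order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one pixel: keep the top 5 bits of red, 6 of green, 5 of blue. */
uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a run of packed R,G,B bytes into RGB565 pixels. */
void convert_rgb888_to_rgb565(const uint8_t *src, uint16_t *dst, size_t pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; i++, src += 3)
        dst[i] = rgb888_to_rgb565(src[0], src[1], src[2]);
}
```

A SIMD version does the same shifts and merges on many pixels at once, which is where a shift-and-insert instruction becomes attractive.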

That idea didn't seem to be going anywhere in particular, so I thought i'd look at some specific stuff I need, and for which I have very slow implementations - font rendering and rect fill, although I only got around to looking at rect fill, and that still doesn't work 100%. I just did it using ARM code though. For such an old architecture I was a little surprised at the lack of info available for such tasks - at least as it applies to searching using google. Maybe it's too old, and the new stuff is hidden away in proprietary and embedded systems, and nobody does software rendering anymore.

And then I totally lost track of the time reading about the DSP ... at 4am I thought it was time to `call it a night' - that's what I get for having coffee and chips for dinner (and in short; there's no free tools to use it, and the Linux driver uses binary blobs - of course).

Today I filled out the rect fill code a bit and tried various implementations, including some NEON variants. Oh, I also `discovered' the performance counting unit - wow, you can track a lot of stuff, from branches taken to cache and memory stats to stalls. Very nice.
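A minimal sketch of using the cycle counter in that unit follows, assuming a Cortex-A8 class core running in a privileged mode. The helper names are made up for this example; the coprocessor register numbers are the ARMv7 PMCR, PMCNTENSET and PMCCNTR encodings. The inline assembly is only compiled for 32-bit ARM targets.

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_E       (1u << 0)   /* enable counters */
#define PMCR_P       (1u << 1)   /* reset event counters */
#define PMCR_C       (1u << 2)   /* reset cycle counter */
#define PMCR_D       (1u << 3)   /* cycle counter counts every 64th cycle */
#define PMCNTEN_CCNT (1u << 31)  /* cycle counter enable bit */

/* New PMCR value: count every cycle, reset and enable everything. */
uint32_t pmcr_start_value(uint32_t pmcr)
{
    return (pmcr & ~PMCR_D) | PMCR_E | PMCR_P | PMCR_C;
}

#if defined(__arm__)
static inline void cycle_counter_start(void)
{
    uint32_t pmcr;

    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr_start_value(pmcr)));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(PMCNTEN_CCNT));
}

static inline uint32_t cycle_counter_read(void)
{
    uint32_t cycles;

    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}
#endif
```

Bracketing a routine with cycle_counter_start() and cycle_counter_read() would give a total cycle figure for it; the event counters need extra setup that isn't shown here.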

Oh NEON. Fucking hell. Spent about 4 hours tracking down why the NEON instructions just threw an undefined instruction exception. After a couple of hours of digging I came across a reference to the Coprocessor Access Control Register, but that didn't really help (oh and a thread on the beagleboard group where people just say to turn CONFIG_NEON to y ... sigh). So here I was trying to turn on clocks and power and other PRCM registers ... and then I remembered something about a bit in a status register to enable/disable the whole shebang. A bit more tracking down (i've got nearly 10K pages of documentation to search now) and I discovered the FPEXC register and VMSR/VMRS instructions (my memory was wrong, but it was a lucky guess). Although the binutils i'm using doesn't support them ... sigh. Finally found a workaround using MRC/MCR from Linux - about the only thing i've managed to find in there when tracking things down (a lot of stuff is so abstracted it's very hard to follow). Gee that was frustrating.
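Roughly, the enable sequence comes down to two register writes, sketched below. This is a reconstruction under assumptions rather than the code actually used: the function names are invented, it must run in a privileged mode, and the last MCR is the coprocessor form of `vmsr fpexc, rN' for assemblers that don't know VMSR. The inline assembly is only compiled for 32-bit ARM targets.

```c
#include <assert.h>
#include <stdint.h>

#define CPACR_CP10_CP11_FULL (0xfu << 20)  /* full access to cp10 and cp11 */
#define FPEXC_EN             (1u << 30)    /* VFP/NEON global enable */

/* New Coprocessor Access Control Register value with VFP/NEON access opened. */
uint32_t cpacr_allow_vfp_neon(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

#if defined(__arm__)
static inline void neon_enable(void)
{
    uint32_t cpacr;

    /* Step 1: allow access to coprocessors 10 and 11. */
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 2" : "=r"(cpacr));
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 2" : : "r"(cpacr_allow_vfp_neon(cpacr)));
    /* Instruction barrier (CP15 form) so the new access rights take effect. */
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 4" : : "r"(0) : "memory");
    /* Step 2: set FPEXC.EN, written as MCR instead of VMSR. */
    __asm__ volatile("mcr p10, 7, %0, cr8, cr0, 0" : : "r"(FPEXC_EN) : "memory");
}
#endif
```

Without step 2 the access rights alone aren't enough, which fits the symptom described: every NEON instruction still raises an undefined instruction exception.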

Anyway, so I came up with some total cycle counts for various implementations of a 'rectangular block colour fill for RGB565'.

These are all with *NO CACHE* or write buffers, so they don't really mean anything other than relative to each other. You have to turn the MMU on to turn on data caches and write buffers, perhaps that is the next thing to try.

Code                   Total    Slowest Fastest
C short             36308222    1.00    5.25
C long              18307488    0.50    2.64
ARM asm             15877960    0.43    2.29 - uses 4x strd (writes 8 bytes/instruction)
NEON                 9735680    0.26    1.40 - uses 2x writes of 2xD regs, 64 bit aligned
NEON2                9134690    0.25    1.32 - uses 1x write of 4xD regs, 64 bit aligned
NEON3                9311284    0.25    1.34 - uses 2x writes of 4xD regs, 128 bit aligned
NEON4                9191652    0.25    1.33 - uses vstm of 8xD regs
sDMA                 6910682    0.19    1.00

The NEON implementations use ARM code for the non-aligned 'edges', and none of them are particularly fantastic code.
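As a baseline for the table, the `C short' case is about as simple as a rect fill gets: one 16-bit store per pixel. The sketch below shows the general shape; it is not the code from the project, and the name and parameter layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a w x h rectangle of RGB565 pixels, one 16-bit store per pixel.
 * stride is the row length of the framebuffer in pixels. */
void fill_rect_short(uint16_t *fb, size_t stride, size_t x, size_t y,
                     size_t w, size_t h, uint16_t colour)
{
    assert(fb != NULL && x + w <= stride);

    for (size_t j = 0; j < h; j++) {
        uint16_t *p = fb + (y + j) * stride + x;

        for (size_t i = 0; i < w; i++)
            p[i] = colour;
    }
}
```

Every other variant in the table is doing the same job with wider stores, so with no cache or write buffer the gap between them is mostly the number of memory transactions issued.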

Hrm, I thought the ARM asm one was ok when I was running it by itself, i guess twice as fast as something is quite noticeable, but obviously it's kind of slow.

Looking in more detail at a couple of them:

drawRect() C long
 total cycles=18307488
 dwrite intns=169668
 ext writes  =169671
 iexec       =701230
 istall      =1201453

drawRect() ARM asm
 total cycles=15877960
 dwrite intns=168963
 ext writes  =168965
 iexec       =310508
 istall      =182922

The C version executes more than 2x as many instructions, but the execution time isn't much different - everything is waiting on memory (although I wonder if it uses less power). At first I thought the total cycle count was a mistake, but of course, the cycle count is about 100x the number of write instructions (18307488 / 169668 is about 108 for C long, 15877960 / 168963 is about 94 for ARM asm), so each memory write must be costing around 100 cycles -- which sounds about right. Be interesting to see if any cache/write buffers make a noticeable difference here, although it is just a flood of writes.
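The cycles-per-write estimate is simple division over the counter values listed above. A throwaway helper (the name is hypothetical) makes the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles spent per data-write instruction. */
double cycles_per_write(uint32_t total_cycles, uint32_t write_insns)
{
    assert(write_insns != 0);
    return (double)total_cycles / (double)write_insns;
}
```

Calling cycles_per_write(18307488, 169668) gives a little under 108 and cycles_per_write(15877960, 168963) a little under 94, hence the `around 100 cycles' figure.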

Update: Should've tested more, the long version was still just a 'short' version, it just wrote half the width ... so all bogus. Will revisit in a newer post. The code in question is all in puppy bits:

Tagged beagle, hacking, puppybits.
Sunday, 07 February 2010, 07:47

Puppy Bits is born

Although it's still a bit broken, I figured the code was finally good enough to upload, so I've created a google-code project called Puppy Bits.

No 'demo' yet, but most of my library-so-far.

I had some 'lunch' too; damn, 6pm, another day just vanished.

Tagged beagle, hacking, puppybits.
Sunday, 07 February 2010, 04:53

Video Graphics

Well, I've still given up on the TV out and the video encoder, but I did have a bit of success with the rest of the video system. Seems that writing the video code using the register names did pay off after all.

So instead of enjoying another fantastic, if a little warm, day outside, i've been hacking away (seriously, it must be some sort of addiction) at some sort of video/graphics interface. And all i've had to eat today so far is beer ...

I added code to set various video modes - all the basic ones up to 1280x1024. Since i'm using a fixed clock with an integer divider, most of the pixel clocks are wrong, but they work with my monitor as they are 'close enough'. I also separated out the graphics part from the video part, so I can use the hardware more fully, as below.
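To show why an integer divider gives `close enough' rather than exact pixel clocks, here is a small sketch that picks a divider and reports the error. The function names are invented, and the 432MHz source clock used in the note below is purely a hypothetical figure, not necessarily what this board runs at.

```c
#include <assert.h>
#include <stdint.h>

/* Integer divider closest to source/target, never less than 1. */
uint32_t nearest_divider(uint32_t source_hz, uint32_t target_hz)
{
    assert(source_hz > 0 && target_hz > 0);
    uint64_t div = ((uint64_t)source_hz + target_hz / 2) / target_hz;
    return div ? (uint32_t)div : 1;
}

/* Signed error of the clock actually produced, as a percentage of the target. */
double clock_error_percent(uint32_t source_hz, uint32_t divider, uint32_t target_hz)
{
    assert(divider > 0 && target_hz > 0);
    double actual = (double)source_hz / (double)divider;
    return 100.0 * (actual - (double)target_hz) / (double)target_hz;
}
```

With a hypothetical 432MHz source, the standard 108MHz clock for 1280x1024 at 60Hz divides exactly (divider 4), while the 65MHz clock for 1024x768 at 60Hz lands on divider 7 and comes out about 5% slow.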

Anyway, obligatory screen-shot, then some explanation.

First, the video mode is set to 1280x1024, with a light-blue background colour. That's all it will display, until I add a graphical channel.

The dark-blue graphical channel is using channel 0 - the 'graphics' channel, in RGB16 format at 1024x768 resolution, centered on the main video window.

Then the noisy rectangle is using channel 2, again in RGB16 format, although it could also be in UYVY or YUV2 format. i.e. it is a 'video overlay'.

I'll upload it somewhere soon - maybe this week.


To help with debugging i've come up with a couple ideas too. First, when I get a fatal exception I now jump to a little 'crash monitor' that lets me examine memory. Well that's all I do now, but it can always be extended. But even that has proven quite handy, e.g. to examine (more of) the stack.

Exception: Data Abort
 pc: 0x80009f00 sr: 0x200001d3
 r0: 0x00000020
 r1: 0x00000040
 r2: 0x48050400
 r3: 0x00000040
 r4: 0x48050400
 r5: 0x480504a0
 r6: 0x80e3fd84
 r7: 0x00000002
 r8: 0x48050400
 r9: 0x00000054
r10: 0x00000002
r11: 0x80e3fdbc
r12: 0x00000066
r13: 0x80e3fd70 0x00000008 0x00000000 0x80e3fe14 0x80009ff4 0x80e3fe14 0x80009b08 0x72747300 0x00797063
r14: 0x80e3fd7c
r15: 0x80009f00
Entering crappy crash monitor.
 ? for help.
#> ?
?               help
m addr len      Show memory as words from addr for len words
#> m 0x80e3fd70 22

0x80e3fd70: 0x00000008 0x00000000 0x80e3fe14 0x80009ff4 0x80e3fe14 0x80009b08 0x72747300 0x00797063
0x80e3fd90: 0x74757064 0x6d6d0063 0x6c665f75 0x5f687375 0x00424c54 0x646e6573 0x7274735f 0x00676e69
0x80e3fdb0: 0x00000001 0x80200000 0x80200000 0x00000280 0x00000200 0x00000000
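The `m' command's output format is easy to reproduce. The sketch below formats one row of the dump into a buffer; it is an illustration of the layout shown above rather than the monitor's real code, and format_row() is a made-up name. A real clib-less monitor would write to the serial port with its own hex printer instead of snprintf().

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format "0xADDR: 0xWORD 0xWORD ..." for up to `count' words into out.
 * Returns the number of characters the full row needs. */
int format_row(char *out, size_t size, uint32_t addr,
               const uint32_t *words, size_t count)
{
    assert(out != NULL && size > 0);

    int len = snprintf(out, size, "0x%08" PRIx32 ":", addr);

    for (size_t i = 0; i < count && len >= 0 && (size_t)len < size; i++)
        len += snprintf(out + len, size - (size_t)len, " 0x%08" PRIx32, words[i]);
    return len;
}
```

Calling it once per 8 words, with the address advanced by 32 bytes each time, gives rows in the same shape as the dump above.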

The second is a sort of crash analyser, that turns addresses into functions. Basically, it takes the output of `objdump -d', and a list of addresses, and then turns them into the function the address resides in, and optionally the assembly language of the function. I've just been using `objdump | less' to do the same thing manually for individual addresses, but once you get more than a couple it gets tedious.

#> m 0x80e3fa00 64
0x80e3fa00: 0x80e3fab8 0x00000000 0x00000000 0x8000c394 0x00000010 0x80022940 0x00000000 0x00000000
0x80e3fa20: 0x00000000 0x8000b89c 0x80e3fab8 0x00000200 0x00000000 0x8000ba68 0x80e3fab8 0x00000200
0x80e3fa40: 0xffffffff 0x00000000 0x00000000 0x8000d0ec 0x80e3fab8 0x00000200 0x00000000 0x80022d68
0x80e3fa60: 0x00000000 0x8000b89c 0x80e3fab8 0x00000200 0x00000000 0x8000ba68 0x80e3fab8 0x00000200
0x80e3fa80: 0x00000010 0x00000000 0x00000000 0x80013294 0x80e3fab8 0x00000200 0x00000003 0x00000000
0x80e3faa0: 0x80e3fcd8 0x81204148 0x00000001 0x80013620 0x80e3fab8 0x00000001 0x00000013 0x000001e1
0x80e3fac0: 0x00000002 0x00000200 0xffffffff 0x00000001 0x00e3fb02 0x80022752 0x8004730b 0x80e3fb00
0x80e3fae0: 0x00000032 0x80008a84 0x80e3fb00 0x80046258 0x00000000 0x00000001 0x812040a0 0x8000851c

Dump that to a file on my workstation, then process it:

$ cat > a
0x80e3fa00: 0x80e3fab8 0x00000000 0x00000000 0x8000c394 0x00000010 0x80022940 0x00000000 0x00000000
$  cat a | while read line ; do ./crashdump -3 haiku-dump.text $line ; done
8000c31c <_ZN10MemoryDisk6ReadAtEPvxS0_m>:
8000c388:       e0811004        add     r1, r1, r4
8000c38c:       e1a02006        mov     r2, r6
8000c390:       ebfffba6        bl      8000b230 <memcpy>
8000b86c <_ZN10Descriptor6ReadAtExPvm>:
8000b890:       e1a0000c        mov     r0, ip
8000b894:       e1a0e00f        mov     lr, pc
8000b898:       e594f010        ldr     pc, [r4, #16]
8000ba28 <read_pos>:
8000ba5c:       e1a02004        mov     r2, r4
8000ba60:       e1a03005        mov     r3, r5
8000ba64:       ebffff80        bl      8000b86c <_ZN10Descriptor6ReadAtExPvm>
8000d050 <_ZN4boot9Partition6ReadAtEPvxS1_m>:
8000d0e0:       e0922004        adds    r2, r2, r4
8000d0e4:       e0a33005        adc     r3, r3, r5
8000d0e8:       ebfffa4e        bl      8000ba28 <read_pos>
8000b86c <_ZN10Descriptor6ReadAtExPvm>:
8000b890:       e1a0000c        mov     r0, ip
8000b894:       e1a0e00f        mov     lr, pc
8000b898:       e594f010        ldr     pc, [r4, #16]
8000ba28 <read_pos>:
8000ba5c:       e1a02004        mov     r2, r4
8000ba60:       e1a03005        mov     r3, r5
8000ba64:       ebffff80        bl      8000b86c <_ZN10Descriptor6ReadAtExPvm>
80013218 <_ZN18PartitionMapParser19_ReadPartitionTableExP15partition_table>:
80013288:       e0962004        adds    r2, r6, r4
8001328c:       e0a73005        adc     r3, r7, r5
80013290:       ebffe1e4        bl      8000ba28 <read_pos>
800135c8 <_ZN18PartitionMapParser5ParseEPKhP12PartitionMap>:
80013614:       e3a02000        mov     r2, #0  ; 0x0
80013618:       e3a03000        mov     r3, #0  ; 0x0
8001361c:       ebfffefd        bl      80013218 <_ZN18PartitionMapParser19_ReadPartitionTableExP15partition_table>
80008a00 <serial_puts>:
80008a78:       ebffffda        bl      800089e8 <serial_putc>
80008a7c:       e1a00007        mov     r0, r7
80008a80:       ebffffd8        bl      800089e8 <serial_putc>
800084e0 <dprintf>:
80008510:       a1a01003        movge   r1, r3
80008514:       e1a0000d        mov     r0, sp
80008518:       eb000138        bl      80008a00 <serial_puts>

Which is a lot more meaningful than a list of addresses.
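The core of such an analyser is small. The sketch below is not the crashdump tool itself, just one way to do the lookup under stated assumptions: the whole `objdump -d' listing is already in memory as a string, function headers look like `8000c31c <name>:', and find_function() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Find the function whose start address is the highest one at or below addr.
 * Returns 1 and fills name/start on success, 0 if no header qualifies. */
int find_function(const char *dump, uint32_t addr,
                  char *name, size_t name_len, uint32_t *start)
{
    int found = 0;
    const char *line = dump;

    assert(dump != NULL && name != NULL && name_len > 0 && start != NULL);

    while (*line) {
        unsigned int a;
        char sym[256];

        /* Only header lines have '<' straight after the address;
         * instruction lines have ':' there and fail to match. */
        if (sscanf(line, "%x <%255[^>]>:", &a, sym) == 2
            && a <= addr && (!found || a >= *start)) {
            snprintf(name, name_len, "%s", sym);
            *start = a;
            found = 1;
        }

        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return found;
}
```

Feeding it each non-zero word from a stack dump picks out the likely return addresses; words that aren't code addresses simply match whichever function happens to sit below them, so the results still need a human eye.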

Update: both of these are in the puppy bits project

Hmm, time to go in search of food I think.

Oh, turns out I was booting the wrong image with haiku ... so that splash screen really was less than it seemed. However once I changed to using the correct image, I get pretty much the same result - a pretty face but no brains.

Tagged beagle, hacking.
Friday, 05 February 2010, 08:53

Video killed the programming bloke ...

Well I was up till the wee hours working on some video code. Blah. Basically converting a register dump into code with #defined constants and other 'nice' stuff. Pity it doesn't do much more though.

Then I spent pretty much all day today failing at trying to get S-Video output working. But I just can't get it to work. I get some sort of signal out, and it looks like it could be the test pattern, but there doesn't appear to be any sync signal, and it's a bit weak too. At this point I think it might be worth cutting my losses and leaving it. For all I know the video DAC isn't even powered on properly - but to play with its power you need to use I2C.

Actually that isn't all I did, as well as the video setup, I was `cleaning up' some other basic routines. Some clib-less debug stuff, and better exception handlers. I'm sick of rewriting bits of mess every time I try something new, and maybe this'll let me put it on the 'net at some point too.

I submitted some patches to Haiku too, one of which was applied within a few minutes.

Hmm, forgot to eat too, and now it's evening again. Mates are down the pub asking me along but I just don't feel like it today. Just finished a beer here and all I want to do is sleep now.

Tagged beagle, hacking.
Thursday, 04 February 2010, 05:32


Ok, so my MMU code was all broken. First I was just using the wrong number of bits in the L2 page tables - x86 uses 4K page tables with 1K entries, but ARM uses only 1K tables with 256 entries, and I can't add up simple 2 digit numbers ... But even that didn't help ... many iterations and hours later ... ahh, I forgot to map the serial port - I was only mapping 16MB of i/o and there's another 1MB to map. Grr. Added that to the Haiku code and suddenly turning on the MMU 'works'.
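The bit-count difference is easy to see when the index calculations are written side by side. This sketch assumes 4KB small pages on ARM (short-descriptor format) and classic non-PAE 32-bit paging on x86; the function names are made up for the comparison.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_L2_ENTRIES 256u    /* 256 x 4 bytes = 1KB table, covers 1MB */
#define X86_PT_ENTRIES 1024u   /* 1024 x 4 bytes = 4KB table, covers 4MB */

/* ARM first level: 4096 entries, one per 1MB, indexed by va[31:20]. */
uint32_t arm_l1_index(uint32_t va)
{
    return va >> 20;
}

/* ARM second level: 8 index bits, va[19:12]. */
uint32_t arm_l2_index(uint32_t va)
{
    return (va >> 12) & (ARM_L2_ENTRIES - 1);
}

/* x86 page table: 10 index bits, va[21:12]. */
uint32_t x86_pt_index(uint32_t va)
{
    return (va >> 12) & (X86_PT_ENTRIES - 1);
}
```

Masking with 10 bits on ARM walks straight off the end of a 1KB table for three addresses out of four, which is the sort of thing that sends the CPU into la-la land without any useful exception.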

Well it wasn't all wasted effort, I have a better understanding of the various permission and cache bits now. Better than nothing at least.

So ... ta-da ...

Actually it's nothing to be too excited about - that's all it does, and it's been hacked in very messily.

Tagged beagle, hacking, haiku.
Wednesday, 03 February 2010, 07:51

Damn MMU

I didn't have much time today but I had another go with the MMU, but this time on some stand-alone code.

No dice. It just goes off into la-la land as soon as I turn it on, no exceptions or any indicator of what went wrong. I guess the page tables are bung.

I can see this is going to be fun.

Tagged beagle, hacking.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!