Tuesday, 01 December 2015, 16:26

more termz

I tried a few more variants on the OpenCL rendering code but none are all that fast - the overheads kill it, and whilst it does free up the CPU a little, it isn't by much. Probably not worth more time unless I look at OpenGL instead.

I added resizing. It meant I had to add some locking to the snapshot routine, but it only needs to lock around the resize operation itself, so it adds almost no overhead to normal operation (merely detecting that a resize has occurred). I'm not yet sure what's supposed to happen to saved cursors and alternate screens when a resize occurs; probably just clip to size. Unfortunately there seems to be no way to set WM_NORMAL_HINTS on the JavaFX stage, so there's no way to make the window resize to whole cells properly.
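
The shape of it is something like the sketch below - illustrative names only, not the actual termz code. The emulator's write path takes no lock at all; only resize() and the snapshot do, and since resizes are rare the common case is a single uncontended lock:

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: lock only around resize and snapshot, never the write path.
    class Screen {
        private final ReentrantLock resizeLock = new ReentrantLock();
        private int[] cells;   // packed cell data, rows * cols
        private int rows, cols;

        Screen(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.cells = new int[rows * cols];
        }

        // Hot path, emulator thread only: no locking needed because
        // only resize() ever replaces the array.
        void set(int row, int col, int cell) {
            cells[row * cols + col] = cell;
        }

        // Rare: reallocate the grid when the window size changes.
        void resize(int newRows, int newCols) {
            resizeLock.lock();
            try {
                int[] next = new int[newRows * newCols];
                // ... copy/clip the old contents into the new grid ...
                cells = next;
                rows = newRows;
                cols = newCols;
            } finally {
                resizeLock.unlock();
            }
        }

        // Render thread: hold the lock just long enough to take a
        // consistent copy, so a concurrent resize can't tear it.
        int[] snapshot() {
            resizeLock.lock();
            try {
                return cells.clone();
            } finally {
                resizeLock.unlock();
            }
        }
    }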

I did a bunch of benchmarking and profiling. One thing I tried was another comparison against xterm, this time at full-screen: running "find . -name '*.c' | xargs cat" from the root of the linux 3.19.8 source tree. After a couple of runs this settles at about 25 seconds in termz, and 16 minutes in xterm. Well. Yeah, it's a silly test, but all those ls -l's during the day add up.

Looking at the memory profiler, it doesn't really use much heap during operation - just a few MB, and most of that is the image and JavaFX. Nor should it; there aren't many data structures to maintain. But having multiple compilers in RAM (the JVM, OpenCL), their generated output, and all the added overhead of the runtime support needed for those really adds up, so it's very, very fat in practice. I guess if multiple terminals ran on the same JVM it would be ok.

enough yet?

So at this point I'm not really sure what I'll do with it. I'll probably poke at the edges when I'm bored, and eventually, when I get around to it, I'll dump what I have onto my software site as the result of a weekend-and-a-bit hack.

After that I'm not sure. It's actually quite functional and robust already (well, compared to the effort put in) and wouldn't take much more work to turn into a usable terminal for me - add some scrollback (pretty simple), mouse selection (not that hard), and sort out some of the keyboard details (reading obtuse documents and testing). So maybe that will happen.

If I got that far adding the "10x20" typeface would probably be on the cards. Fixed-size outline fonts would be possible by just pre-rendering them but to me they just aren't terminal fonts.

Anything further such as a re-usable term component (which might actually be of use to the world) would require substantially more work on the i18n side of things and I don't feel like learning enough to do that properly.

Make

I did a bit of poking at the Java makefile stuff and it's to the point where I'm using it for out-of-IDE builds of termz and zcl, and I will look into using it on other projects. That's the best way to find bugs - what works and what doesn't - and previous attempts never got that far. For all its GNUmakefile obtuseness it's really rather compact at under 200 lines excluding comments, and I didn't put any effort into making it particularly small. And that includes targets for javadocs, source jars, and binary builds.

Tagged hacking, java, javafx, opencl, termz.
Sunday, 29 November 2015, 01:22

opencl termz

I was just going to "try something out" while I waited for the washing to finish ...

... so after a long and full day of hacking ...

The screenshot is from an OpenCL renderer. Each work item processes one output pixel and adds any attributes on the fly, somewhat similar in effect to how fixed-function hardware might have done it. I implemented a 'fancy' underline that leaves a space around descenders. The font is a 16x16 glyph texture of iso-8859-1 characters. I haven't implemented colour yet, but there's room for 16.
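
Written as plain Java for a single pixel, each work item computes something like the function below. This is only a sketch of the idea - the names, the 8x16 cell geometry, and the packing of character + attributes into a short are illustrative assumptions, not the actual kernel:

    // A plain-Java rendition of what each OpenCL work item computes
    // for its one output pixel. Hypothetical layout: low byte of each
    // cell is the iso-8859-1 code, attribute bits sit above it, and
    // the atlas is a 16x16 grid of 8x16 glyph greymaps.
    class CellRender {
        static final int CW = 8, CH = 16;   // assumed cell geometry
        static final int INVERSE = 1 << 8;  // assumed attribute bit

        static byte pixel(short[] cells, byte[] atlas, int cols, int x, int y) {
            int cell = cells[(y / CH) * cols + (x / CW)] & 0xffff;
            int code = cell & 0xff;               // glyph index
            int gx = (code & 15) * CW + x % CW;   // pixel within the atlas
            int gy = (code >> 4) * CH + y % CH;
            int v = atlas[gy * 16 * CW + gx] & 0xff;
            if ((cell & INVERSE) != 0)            // attributes on the fly
                v = 255 - v;
            return (byte) v;                      // indexed-mode output
        }
    }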

On this Kaveri machine with only one DIMM (== miserable memory bandwidth) the OpenCL routine renders this buffer in about 35-40µs. That doesn't sound too bad, but it takes 3µs to "upload" the cell table input and 60µs to "download" the raster output (and this is indexed-mode 8-bit output; RGBA would be ~2x slower again), yet somehow by the time it's all gone through OpenCL that's grown to 300-500µs from first enqueue to final dequeue. Then add on writing to JavaFX (which converts it to BGRA) and it ends up ~1200µs.

I'm using some synchronous transfers and plain buffer read/write, so there could be some improvement to be had, but the vast majority of the overhead is due to the toolkit.

So I guess that's "a bit crap" but it would still be "fast enough". For comparison, a basic Java renderer that only implements inverse is about 1.5x slower overall.
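
That Java fallback is just the obvious per-cell loop - again a sketch only, reusing the illustrative CellRender constants from above rather than the actual code:

    // Per-cell CPU renderer (another method on the CellRender sketch):
    // blit each glyph greymap into the raster, handling only inverse.
    static void renderCpu(short[] cells, byte[] atlas, byte[] dst,
                          int rows, int cols) {
        int stride = cols * CW;                  // output raster width
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int cell = cells[r * cols + c] & 0xffff;
                int code = cell & 0xff;
                boolean inv = (cell & INVERSE) != 0;
                int sx = (code & 15) * CW, sy = (code >> 4) * CH;
                for (int y = 0; y < CH; y++) {
                    int si = (sy + y) * 16 * CW + sx;
                    int di = (r * CH + y) * stride + c * CW;
                    for (int x = 0; x < CW; x++) {
                        int v = atlas[si + x] & 0xff;
                        dst[di + x] = (byte) (inv ? 255 - v : v);
                    }
                }
            }
        }
    }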

But for whatever reason the app still uses ~8% CPU even when not doing anything, and that definitely isn't ok at all. I couldn't identify the cause. Another toolkit looks like a necessity if it ever goes beyond a play-thing toy.

I got bored doing the escape codes around "^[ [ ? Ps p" so it's broken aplenty beyond the bits I simply implemented incorrectly. But it's only a couple of days' poking and just 1.3 KLOC. While there is ample documentation on the codes, some of the important detail is lacking, and since I'm not looking at any other implementation (not even zvt) I have to try/test/compare bits against xterm and/or remember the fiddly bits from 15 years ago (like the way cursor wrapping is handled). I also have most of the slave process setup sorted beyond just the pty i/o - session leaders, controlling terminals, signal masks and signal actions, the environment. It might not be correct but I think all the scaffolding is now in place (albeit only for Linux).

FWIW a test I've been using is "time find ~/src" to measure elapsed time on my system - after a couple of runs to load everything into the buffer cache it's a consistent test with a lot of spew. If I run it in an xterm of the same size it takes ~25s to execute and grinds big parts of the desktop to a near halt while it's active. That really is abysmal behaviour given the modern hardware it's on (however "underpowered" it's supposed to be; and it's considerably worse on a much, much faster machine). The same test in termz takes about 4.5s and you'd barely know it was running. Adding a scrollback buffer would increase that (probably, and not by much), and it's already going through a fairly complete UTF-8 code-path.

The renderer has no effect on the runtime as it is polled on another thread (in this instance via the animation pulse on javafx). I don't use locks but rely on 'eventual consistency'. Some details must be taken atomically as part of the snapshot and these are handled appropriately.
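
The handoff amounts to something like this - an illustrative sketch, which assumes the atomically-read details fit in one volatile word (not necessarily how termz does it). JavaFX's AnimationTimer is the pulse hook:

    import javafx.animation.AnimationTimer;

    // The emulator mutates 'cells' on its own thread with no locks;
    // each animation pulse copies whatever happens to be there. A torn
    // frame only survives until the next pulse: 'eventual consistency'.
    final class TermView extends AnimationTimer {
        final short[] cells;    // written by the emulator thread
        final short[] shadow;   // render thread's private copy
        volatile long cursor;   // details that must be taken atomically,
                                // packed into a single volatile word

        TermView(short[] cells) {
            this.cells = cells;
            this.shadow = new short[cells.length];
        }

        @Override
        public void handle(long nanos) {
            System.arraycopy(cells, 0, shadow, 0, cells.length);
            long cur = cursor;  // one volatile read = atomic snapshot
            // ... convert 'shadow' (plus cur) to the screen image ...
        }
    }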

Right now I feel like I've had my fill of this for a while. I'm kinda interested, but I'm not sure if I'm interested enough to finish it sufficiently to use it - like pretty much all my hacking hacked-up hacks. Time will be the teller.

Tagged hacking, java, javafx, opencl, termz.
Friday, 27 November 2015, 19:35

termz

Every now and then I think about the sad state of affairs regarding terminal emulators for X11. It's been a bit of a thing for a while - it's how I ended up working at Ximian.

I stopped using gnome-terminal when I stopped working on it and went back to xterm. I never liked rxvt and its ilk, and all of the 'desktop environment' terminal emulators are pretty naff for whatever reason.

xterm works and is reliable, but with recent (meaning the last 10 years) X Window System servers the text rendering performance has plummeted, and even installing the only usable typefaces (misc-fixed 6x13 and 10x20 - and sometimes xterm itself) became a manual job. Whilst performance isn't bad on this Kaveri box, I also use an uber-Intel machine with a HD7970 where both emacs and xterm run like absolute pigs whenever any GL application is running, and it isn't even very fast otherwise (I'm talking the whole desktop grinding to a halt as it redraws exposes at about one character column per SECOND). It's an "older" distribution so that may have something to do with it, but there is no direct indication of why it's so horrible (well, apart from the AMD driver, but I have no choice there since it's needed for OpenCL dev). I might upgrade it to slackware next year.

Ho hum.

Anyway, I started poking last night at a basic xterm knockoff and got to the point of less sort of running inside it, and now I'm thinking about ways I might be able to implement something a bit more complete. I'm working in Java, with a tiny bit of JNI to get the process going and handle some ioctl stuff (which seems somewhat easier now than it was in zvt, but portability is not on the agenda here).

TermZ? Glyphs are greymaps extracted directly from the PCF font.

zvt

When I wrote ZVT the primary goal was performance, and to that end considerable effort was expended on making a terminal state machine which implemented zero-copy and zero-garbage algorithms. Zero-copy is always a good thing, but the zero-garbage was driven by the very slow malloc on Solaris at the time and my experience with Amiga memory management.

Another part of the puzzle was display, and the main mechanism was inspired by some Amiga terminal emulators that used COPPER lists to re-render rows to the screen in arbitrary order without requiring them to be re-ordered in memory (memory bandwidth was a massive bottleneck when using pre-1985-spec hardware in 199x). I used a cyclic double-linked (exec) list of rows, and to implement a scroll I just moved a row from the start to the end of the list, which takes 8 pointer updates and a memset to clear it (and it also works for partial screen scrolls). By tracking the last row a given one was actually displayed at, I could - at any point later - attempt to create an optimal screen-update sequence, including using blits for scrolling and minimising glyph redraws to only those that had changed. The algorithm for this was cheap and reliable, if a little fiddly to get correct.
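
zvt was C, but transliterated into Java the scroll trick looks roughly like this (names illustrative):

    import java.util.Arrays;

    // Rows in a circular doubly-linked list: scrolling one line moves
    // the head row to the tail and blanks it; no row data is copied.
    final class Row {
        Row prev, next;
        final char[] text;

        Row(int cols) { text = new char[cols]; }

        void remove() {                 // unlink from the list
            prev.next = next;
            next.prev = prev;
        }

        void insertBefore(Row node) {   // relink before 'node'
            prev = node.prev;
            next = node;
            prev.next = this;
            node.prev = this;
        }
    }

    final class RowList {
        Row head;                       // top visible row

        // Scroll up one line: in a circular list, inserting before the
        // head is the same as appending at the tail.
        void scrollUp() {
            Row top = head;
            head = top.next;
            top.remove();
            top.insertBefore(head);
            Arrays.fill(top.text, ' ');  // the memset
        }
    }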

That last point - deferring the screen update - is important, as it allows the state machine to outpace the screen refresh rate, which always becomes the largest bottleneck for terminal emulators in 'toolkit' environments. This is where it got all its performance from.

new hardware, new approach

Thinking about the problem with current hardware my initial ideas are a little bit different.

I still quite like the linked-list storage for the state machine and may go back to it, but my current idea is instead to store a full cell-grid for the displayable area. I can still make full-screen scrolling just as cheap using a simple cyclic row trick (in fact, even cheaper), but sub-region scrolling would require memory copies - although at 4 bytes per glyph these are insanely cheap nowadays.
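
As a sketch (names and layout being assumptions rather than settled design):

    import java.util.Arrays;

    // Flat cell grid with a cyclic row origin: a full-screen scroll is
    // an index bump plus one cleared row, no copying at all.
    final class Grid {
        final int rows, cols;
        final int[] cells;    // 4 bytes per glyph: character + attributes
        int origin;           // physical index of the top screen row

        Grid(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.cells = new int[rows * cols];
        }

        // Map screen coordinates to the cell array through the origin.
        int index(int row, int col) {
            return ((origin + row) % rows) * cols + col;
        }

        // Full-screen scroll: blank the old top row (it becomes the new
        // bottom row) and advance the origin. Sub-region scrolls would
        // fall back to System.arraycopy instead.
        void scrollUp() {
            Arrays.fill(cells, origin * cols, (origin + 1) * cols, 0);
            origin = (origin + 1) % rows;
        }
    }

The nice property is that scrollUp is O(cols) regardless of screen size, and the grid stays one flat allocation for the renderer to snapshot.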

The state machine is the most complex part of the emulator since it needs to implement all the control codes and whatnot - but for the most part that's just a mechanical process of implementing enough of them to have something functional.

I would also approach rendering from an entirely different angle. Rather than go smart, I'm going wide and brute-forcing a simpler problem. At any given time - which can be throttled based on arbitrary metrics - I can take a snapshot of the current emulator screen and then asynchronously convert that into a screen display while the emulator continues to run on its own thread.

For a basic CPU renderer it may still require some update optimisation, but given it will just be trivial cell fonts to copy, it probably won't be appreciably cheaper to scroll compared with just pasting new characters every time. And obviously this is utterly trivial code to implement.

The ultimate goal (and why the fixed-array grid backing is desirable) would be to use OpenCL or OpenGL (or more likely Vulkan, if it ever gets here) to implement the rendering as a single-pass operation which touches each output pixel only once. This would take the raw cell-sized rectangle of the terminal state machine as its only variable input and produce a fully rendered and styled framebuffer as the result. Basically, just render the cells as a low-res nearest-neighbour texture lookup into a texture holding the glyphs. The former is a tiny, tiny texture in GPU terms, and rendering a single full-screen NN-textured quad is absolutely nothing for any GPU. Compared to the gunk required to render a full screen of arbitrary text through any GUI toolkit ever, it's many orders of magnitude less effort.

Ideally this would only ever exist at full resolution in the on-screen framebuffer memory, which would also make it extremely cheap memory-wise.

But at least initially I would be going through JavaFX, so it will instead have to keep multiple copies and so on. The reason to use JavaFX is all the auxiliary but absolutely necessary fluff like clipboard and DnD operations. I don't really like tabbed terminals (I mean, I want to use windows as windows to multitask, not as a stack to task-switch), but that is one way to ameliorate the memory-use multiplication this would otherwise create.

So to begin with it would be extremely fat but that's just an implementation detail and not a limitation of the design.

worth bothering?

Still mulling that over. It's still a lot of work even if conceptually it's almost trivial.

Tagged hacking, java, javafx, opencl, termz.