Yes it still lives. I've just uploaded an update to zcl.
A bunch of bugfixes, new build system, more robustness, and OpenCL 2.1 support.
There are still some things i'm experimenting with - primarily the functional/task stuff as it's just not flexible enough - but it's stable and robust and easy to work with so i'm no longer using JOCL for anything at work.
On a personal note I still haven't really gotten back into hacking and i had a short sojourn into facebookland so i haven't had much to write about. It's mostly been work, very poor sleep, and drinking! Oh and I started wearing kilts ...
vulkan is out at last
I'm just starting to think about hacking again so that delay might not have been so bad for me. I haven't looked at it yet though.
Oh, and what a nice birthday present too!
Update: Not that i've had time to look but still no AMD drivers for linux; although if this means an end to fglrx then that will be worth it.
I tried a few more variants on the OpenCL rendering code but none are all that fast - the overheads kill it and whilst it does free the CPU up a little bit it isn't much. Probably not worth more time unless I look at OpenGL instead.
I added resizing. It meant I had to add some locking to the snapshot routine but it only needs to lock around the resize operation so it adds almost no overhead to normal operation (merely detecting a resize has occurred). I'm not yet sure what's supposed to happen with saved cursors and alternate screens when a resize occurs, probably just clip to size. Unfortunately there seems to be no way to set the WM_NORMAL_HINTS on the javafx stage, so there's no way to make it size to cells properly.
I did a bunch of benchmarking and profiling. One thing I tried was another test to compare to xterm - now at full-screen. Running "find . -name '*.c' | xargs cat" from the root of the linux 3.19.8 source tree. After a couple of runs this is about 25 seconds on termz, and 16 minutes in xterm. Well. Yeah it's a silly test but all those ls -l's during the day add up.
Looking at the memory profiler it doesn't really use too much heap during operation, just a few MB and most of that is the image and javafx. I mean nor should it, there aren't any data structures to maintain. But having multiple compilers in ram (jvm, OpenCL), their generated output, and all the added overhead of the runtime support needed for those really adds up so it's very very fat in practice. I guess if multiple terms ran on the same jvm it would be ok.
So at this point i'm not really sure what i'll do with it. I'll probably poke at the edges when i'm bored and eventually when I get around to it I will dump what I have to my software site as the result of a weekend-and-a-bit-hack.
After that I'm not sure. It's actually quite functional and robust already (well, compared to the effort in) and wouldn't take much more work to turn it into a usable terminal for me - add some scrollback (pretty simple), mouse selection stuff (not that hard), and sort out some of the keyboard details (reading obtuse documents and testing). So maybe that will happen.
If I got that far adding the "10x20" typeface would probably be on the cards. Fixed-size outline fonts would be possible by just pre-rendering them but to me they just aren't terminal fonts.
Anything further such as a re-usable term component (which might actually be of use to the world) would require substantially more work on the i18n side of things and I don't feel like learning enough to do that properly.
I did a bit of poking at the java makefile stuff and it's to the point where i'm using it for out-of-ide builds of termz and zcl and will look into using it on other projects. That's the best way to find bugs/what works and what doesn't and previous attempts never got that far. For all its gnumakefile obtuseness it's really rather compact at under 200 lines excluding comments and I didn't put any effort into making it particularly small. And that includes targets for javadocs, source jars, and binary builds.
I was just going to "try something out" while I waited for the washing to finish ...
... so after a long and full day of hacking ...
The screenshot is from an OpenCL renderer. Each work item processes one output pixel and adds any attributes on the fly, somewhat similar in effect to how hardcoded hardware might have done it. I implemented a 'fancy' underline that leaves a space around descenders. The font is a 16x16 glyph texture of iso-8859-1 characters. I haven't implemented colour but there's room for 16.
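To make the one-work-item-per-pixel idea concrete, here is a rough CPU stand-in in plain Java rather than the OpenCL kernel itself. The cell layout (character code in the low 8 bits, attribute bits above), the attribute flags, the class name and the plain bottom-row underline are all assumptions made for the sketch; the 'fancy' descender-aware underline and colour are not attempted.

```java
// Sketch only: a CPU stand-in for a per-pixel cell renderer.
// Assumed cell layout: bits 0-7 = iso-8859-1 code, bits 8+ = attribute flags.
final class CellRenderer {
    static final int ATTR_INVERSE = 1;   // hypothetical attribute bit
    static final int ATTR_UNDERLINE = 2; // hypothetical attribute bit

    // Computes one output pixel, the job of a single work item.
    // glyphs is an 8-bit greymap holding 16x16 glyphs, each gw x gh pixels.
    static byte pixel(int x, int y, int[] cells, int cols,
                      byte[] glyphs, int gw, int gh) {
        int cell = cells[(y / gh) * cols + (x / gw)];
        int ch = cell & 0xff;
        int attr = cell >>> 8;
        // Locate the glyph in the 16x16 atlas, then the pixel inside it.
        int gx = (ch & 15) * gw + (x % gw);
        int gy = (ch >> 4) * gh + (y % gh);
        int v = glyphs[gy * 16 * gw + gx] & 0xff;
        // Plain underline on the bottom row of the cell.
        if ((attr & ATTR_UNDERLINE) != 0 && (y % gh) == gh - 1)
            v = 255;
        if ((attr & ATTR_INVERSE) != 0)
            v = 255 - v;
        return (byte) v;
    }

    // Renders the whole grid, touching each output pixel exactly once.
    static byte[] render(int[] cells, int cols, int rows,
                         byte[] glyphs, int gw, int gh) {
        int width = cols * gw, height = rows * gh;
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                out[y * width + x] = pixel(x, y, cells, cols, glyphs, gw, gh);
        return out;
    }
}
```

Since every pixel depends only on the cell table and the glyph atlas, the two loops in render() are what a GPU would spread across work items.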
On this kaveri machine with only one DIMM (== miserable memory bandwidth) the OpenCL routine renders this buffer in about 35-40uS. This doesn't sound too bad but it takes 3uS to "upload" the cell table input, and 60uS to "download" the raster output (and this is an indexed-mode 8-bit rather than RGBA which is ~2x slower again), but somehow by the time it's all gone through OpenCL that's grown to 300-500uS from first enqueue to final dequeue. Then add on writing to JavaFX (which converts it to BGRA) and it ends up ~1200uS.
I'm using some synchronous transfers and just using buffer read/write so there could be some improvements but the vast majority of the overheads are due to the toolkit.
So I guess that's "a bit crap" but it would still be "fast enough". For comparison a basic java renderer that only implements inverse is about 1.5x slower overall.
But for whatever reason the app still uses ~8% cpu even when not doing anything; and that definitely isn't ok at all. I couldn't identify the cause. Another toolkit looks like a necessity if it ever went beyond play-thing-toy.
I got bored doing the escape codes around "^[ [ ? Ps p" so it's broken aplenty beyond the bits I simply implemented incorrectly. But it's only a couple days' poking and just 1K3LOC. While there is ample documentation on the codes some of the important detail is lacking and since i'm not looking at any other implementation (not even zvt) i have to try/test/compare bits to xterm and/or remember the fiddly bits from 15 years ago (like the way the cursor wrapping is handled). I also have most of the slave process setup sorted beyond just the pty i/o - session leaders, controlling terminals, signal masks and signal actions, the environment. It might not be correct but I think all the scaffolding is now in place (albeit only for Linux).
FWIW a test i've been using is "time find ~/src" to measure elapsed time on my system - after a couple of runs to load everything into the buffer cache this is a consistent test with a lot of spew. If I run it in an xterm of the same size this takes ~25s to execute and grinds big parts of the desktop to a near halt while it's active. It really is abysmal behaviour given the modern hardware it's on (however "underpowered" it's supposed to be; and it's considerably worse on a much much faster machine). The same test in 'termz' takes about 4.5s and you'd barely know it was running. Adding a scrollback buffer would increase this (well probably, and not by much) however this goes through a fairly complete UTF-8 code-path otherwise.
The renderer has no effect on the runtime as it is polled on another thread (in this instance via the animation pulse on javafx). I don't use locks but rely on 'eventual consistency'. Some details must be taken atomically as part of the snapshot and these are handled appropriately.
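As a rough illustration of the lock-free snapshot, and not necessarily how termz does it: the cell array is simply copied while the emulator may still be writing to it, and only the cursor position is packed into one atomic value so that x and y are always read as a pair. All names here are made up for the sketch. Note that the unsynchronised copy is a data race under the Java memory model; the sketch assumes a stale or half-updated cell is acceptable because the next snapshot will correct it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: lock-free screen snapshot relying on 'eventual consistency'.
final class Screen {
    final int cols, rows;
    final int[] cells;
    // Cursor packed as (y << 32) | x so both halves are read together.
    private final AtomicLong cursor = new AtomicLong();

    Screen(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.cells = new int[cols * rows];
    }

    // Called from the emulator thread.
    void put(int x, int y, int cell) {
        cells[y * cols + x] = cell;
    }

    // Called from the emulator thread.
    void moveCursor(int x, int y) {
        cursor.set(((long) y << 32) | (x & 0xffffffffL));
    }

    // Called from the renderer thread, e.g. once per animation pulse.
    // The cell copy may mix old and new cells; a later snapshot fixes that.
    Snapshot snapshot() {
        long c = cursor.get();
        return new Snapshot(cells.clone(), (int) c, (int) (c >>> 32));
    }
}

// Immutable view handed to the renderer.
final class Snapshot {
    final int[] cells;
    final int cursorX, cursorY;

    Snapshot(int[] cells, int cursorX, int cursorY) {
        this.cells = cells;
        this.cursorX = cursorX;
        this.cursorY = cursorY;
    }
}
```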
I feel like i've had my fill of this for now. I'm kinda interested, but i'm not sure if i'm interested enough to finish it sufficiently to use it - like pretty much all my hacking hacked up hacks. Time will be the teller.
Every now and then I think about the sad state of affairs regarding terminal emulators for X11. It's been a bit of a thing for a while - it's how i ended up working at Ximian.
I stopped using gnome-terminal when i stopped working on it and went back to xterm. I never liked rxvt or their ilk and all of the 'desktop environment' terminal emulators are pretty naff for whatever reason.
xterm works and is reliable but with recent (being last 10 years) X Window System servers the text rendering performance plummeted and even installing the only usable typefaces (misc-fixed 6x13, and 10x20, and sometimes xterm itself) became a manual job. Whilst performance isn't bad on this kaveri box I also use an uber-intel machine with a HD7970 where both emacs and xterm run like an absolute pig whenever any GL applications are running, and it isn't even very fast otherwise (i'm talking whole desktop grinding to a halt as it redraws exposes at about 1 character column per SECOND). It's an "older" distribution so that may have something to do with it but there is no direct indication why it's so horrible (well apart from the AMD driver but i have no choice for that since it's used for OpenCL dev). I might upgrade it to slackware next year.
Anyway I started poking last night at a basic xterm knockoff and got to the point of less sort of running inside it and now i'm thinking about ways I might be able to implement something a bit more complete. I'm working in Java and have a tiny bit of JNI to get the process going and handle some ioctl stuff (which seems somewhat easier now than it was in zvt, but portability is not on the agenda here).
TermZ? Glyphs are greymaps extracted directly from the PCF font.
When I wrote ZVT the primary goal was performance and to that end considerable effort was expended on making a terminal state machine which implemented zero-copy and zero-garbage algorithms. zero-copy is always a good thing but the zero-garbage was driven by the very slow malloc on Solaris at the time and my experience with Amiga memory management.
Another part of the puzzle was display and the main mechanism was inspired by some Amiga terminal emulators that used COPPER lists to re-render rows to the screen in arbitrary order without requiring them to be re-ordered in memory (memory bandwidth was a massive bottleneck when using pre 1985-spec hardware in 199x). I used a cyclic double-linked (exec) list of rows and to implement a scroll I just moved a row from the start to the end of the list which takes 8 pointer updates and a memset to clear it (and it also works for partial screen scrolls). By tracking the last row a given one was actually displayed at I could -at-any-point-later- attempt to create an optimal screen-update sequence including using blits for scrolling and minimising glyph redraws to only those that had changed. The algorithm for this was cheap and reliable if a little fiddly to get correct.
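For anyone who hasn't played with exec-style lists, the row trick looks roughly like this in Java. It is a sketch of the idea only, not the ZVT code (which was C); the class and field names are invented, and the update-sequence optimiser that would consume shownAt is left out.

```java
import java.util.Arrays;

// Sketch only: rows held in a cyclic doubly-linked list with a sentinel,
// so a full scroll is a handful of pointer updates plus one row clear.
final class RowList {
    static final class Row {
        Row prev, next;
        final int[] cells;
        // Screen line this row was last drawn at, -1 if never drawn.
        // A later pass could compare this with the row's current position
        // to decide between a blit and a redraw.
        int shownAt = -1;

        Row(int cols) {
            cells = new int[cols];
        }
    }

    private final Row head = new Row(0); // sentinel, holds no cells

    RowList(int cols, int rows) {
        head.prev = head.next = head;
        for (int i = 0; i < rows; i++)
            addTail(new Row(cols));
    }

    private void addTail(Row r) {
        r.prev = head.prev;
        r.next = head;
        head.prev.next = r;
        head.prev = r;
    }

    private void remove(Row r) {
        r.prev.next = r.next;
        r.next.prev = r.prev;
    }

    // Returns the row currently at screen line i.
    Row row(int i) {
        Row r = head.next;
        while (i-- > 0)
            r = r.next;
        return r;
    }

    // Scrolls up one line: the top row is cleared and reused as the bottom.
    void scrollUp() {
        Row r = head.next;
        remove(r);
        Arrays.fill(r.cells, 0);
        addTail(r);
    }
}
```

A partial-screen scroll would do the same thing but unlink and relink at the boundaries of the scroll region rather than at the ends of the list.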
This last point is important as it allows the state machine to outpace the screen refresh rate which always becomes the largest bottleneck for terminal emulators in 'toolkit' environments. This is where it got all its performance from.
new hardware, new approach
Thinking about the problem with current hardware my initial ideas are a little bit different.
I still quite like the linked list storage for the state machine and may go back to it but my current idea is instead to store a full cell-grid for the displayable area. I can still make full-screen scrolling just as cheap using a simple cyclic row trick (in fact, even cheaper) but sub-region scrolling would require memory copies - but at the resolution of 4-bytes-per-glyph these are insanely cheap nowadays.
This is the most complex part of the emulator since it needs to implement all the control codes and whatnot - but for the most part that's just a mechanical process of implementing enough of them to have something functional.
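A minimal sketch of what the cyclic row trick might look like on a flat grid, assuming one int (4 bytes) per cell. The names are invented for illustration. A full-screen scroll only bumps an offset and clears one row, while a sub-region scroll falls back to row copies.

```java
import java.util.Arrays;

// Sketch only: fixed cell grid where logical row 0 starts at physical row 'top'.
final class CellGrid {
    final int cols, rows;
    final int[] cells;
    private int top; // physical row that is currently logical row 0

    CellGrid(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.cells = new int[cols * rows];
    }

    // Maps a logical (x, y) position to an index into the flat array.
    int index(int x, int y) {
        return ((top + y) % rows) * cols + x;
    }

    int get(int x, int y) {
        return cells[index(x, y)];
    }

    void set(int x, int y, int cell) {
        cells[index(x, y)] = cell;
    }

    // Full-screen scroll: clear the old top row, which becomes the new bottom.
    void scrollUp() {
        Arrays.fill(cells, top * cols, top * cols + cols, 0);
        top = (top + 1) % rows;
    }

    // Sub-region scroll over logical rows first..last (inclusive): copy rows.
    void scrollRegionUp(int first, int last) {
        for (int y = first; y < last; y++)
            System.arraycopy(cells, index(0, y + 1), cells, index(0, y), cols);
        int start = index(0, last);
        Arrays.fill(cells, start, start + cols, 0);
    }
}
```

A renderer reading this grid needs the top offset as well as the cells, which is one small detail that would have to be captured along with any snapshot.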
I would also approach rendering from an entirely different angle. Rather than go smart i'm going wide and brute-forcing a simpler problem. At any given time - which can be throttled based on arbitrary metrics - I can take a snapshot of the current emulator screen and then asynchronously convert that to a screen display while the emulator continues to run on its own thread.
For a basic CPU renderer it may still require some update optimisation but given it will just be trivial cell fonts to copy it probably won't be appreciably cheaper to scroll compared to just pasting new characters every time. And obviously this is utterly trivial code to implement.
The ultimate goal (and why the fixed-array grid backing is desirable) would be to use OpenCL or OpenGL (or more likely Vulkan if it ever gets here) to implement the rendering as a single pass operation which touches each output pixel only once. This would just take the raw cell-sized rectangle of the terminal state machine as its only variable input and produce a fully rendered and styled framebuffer as the result. Basically just render the cells as a low-res nearest-neighbour texture lookup into a texture holding the glyphs. The former is a tiny tiny texture in GPU terms and rendering a single full-screen NN textured quad is absolutely nothing for any GPU. And compared to the gunk that is required to render a full-screen of arbitrary text through any gui toolkit ever it's many orders of magnitude less effort.
Ideally this would only ever exist at full-resolution in the on-screen framebuffer memory which would also make it extremely cheap memory wise.
But at least initially I would be going through JavaFX so it will instead have to have multiple copies and so on. The reason to use JavaFX is for all the auxiliary but absolutely necessary fluff like clipboard and dnd operations. I don't really like tabbed terminals (I mean I want to use windows as windows to multitask, not as a stack to task switch) but that is one way to ameliorate the memory use multiplication this would otherwise create.
So to begin with it would be extremely fat but that's just an implementation detail and not a limitation of the design.
Still mulling that over. It's still a lot of work even if conceptually it's almost trivial.
OpenCL 2.1 + java = zcl 0.x?
I noticed Khronos released the OpenCL 2.1 spec recently so I spent this morning updating zcl to include all the functions.
Since I don't have a suitable implementation nothing is tested and there's probably some typos and so on. I found a few small bugs in the enum tables while I was there.
But what took most of the time was the property queries. Each OpenCL object type has one or more query functions but rather than implement them all I use a tagged query function which branches to the correct function entry point at the lowest level but shares all the rest of the code. But then I had to add some specialist variants, and specialisations for return types and overloaded parameters - it started to get unwieldy and a new query type on CLKernel meant it wasn't going to be enough anyway.
So I said fuck that for a joke and just redid the whole mechanism.
For the basic 5-parameter queries I still share most of the code but I now add any type-specific queries separately. To cope with the api and code bloat i distilled the java side interface down to only two entry points for each query:
native <T> T getInfoAny(int type, int ctype, int param_name);
native <T> T getInfoAnyV(int type, int ctype, int param_name);
The first is a scalar query and the second an array one. It just means it now has to box primitive return types for scalar queries which is unlikely to have any measurable performance impact but the Java helpers which wrap the above interfaces in type-friendly calls could always be replaced with native equivalents if it was an issue.
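To show what the type-friendly helpers over a single generic entry point might look like, here is a self-contained sketch. It is not the zcl source: the ctype tags, the parameter numbers and the class names are invented, and the native call is replaced with a map-backed fake so the example runs without an OpenCL implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: typed helpers layered over one generic scalar query.
abstract class InfoObject {
    // Hypothetical tags telling the native side which C type to read.
    static final int CTYPE_INT = 0;
    static final int CTYPE_LONG = 1;
    static final int CTYPE_STRING = 2;

    final int type; // hypothetical tag selecting which clGet*Info to call

    InfoObject(int type) {
        this.type = type;
    }

    // In the real thing this would be a native method; scalars come back boxed.
    abstract <T> T getInfoAny(int type, int ctype, int param);

    // Type-friendly wrappers; the unboxing happens here.
    int getInfoInt(int param) {
        return this.<Integer>getInfoAny(type, CTYPE_INT, param);
    }

    long getInfoLong(int param) {
        return this.<Long>getInfoAny(type, CTYPE_LONG, param);
    }

    String getInfoString(int param) {
        return this.<String>getInfoAny(type, CTYPE_STRING, param);
    }
}

// Stand-in for a native-backed object, used only to exercise the helpers.
final class FakeDevice extends InfoObject {
    private final Map<Integer, Object> values = new HashMap<>();

    FakeDevice() {
        super(0);
    }

    FakeDevice put(int param, Object value) {
        values.put(param, value);
        return this;
    }

    @Override
    @SuppressWarnings("unchecked")
    <T> T getInfoAny(int type, int ctype, int param) {
        return (T) values.get(param);
    }
}
```

Usage is then just new FakeDevice().put(2, 8).getInfoInt(2), with the boxing cost hidden behind the helper.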
This let me merge some internal jni code and delete a lot of snot and I moved the re-usability to a different layer so that the more specific queries can share most of the code. For example this was the previous set of native interfaces on CLObject, and although this covered the kernel and program specific 6-argument queries like GetProgramBuildInfo() it was getting a bit messy.
native long getInfoLong(int type, long subtarget, int param);
native long getInfoLongA(int type, long subtarget, int param);
native int getInfoInt(int type, long subtarget, int param);
native byte getInfoByteA(int type, long subtarget, int param);
native <T> T getInfoP(int type, long subtarget, int param, int ctype);
native <T extends CLObject> T getInfoPA(int type, long subtarget, int param, int ctype);
native long getInfoSizeT(int type, long subtarget, int param);
native long getInfoSizeTA(int type, long subtarget, int param);
... It seemed like a good idea at the time.
The exposed interfaces remain the same (like getInfoString(param), getInfoInt(param), etc).
Given the complete lack of interest and because it needs some testing anyway I won't be releasing a zcl-0.6 just yet.
OpenCL 2.0 + Java = zcl 0.5, or ~= 1.0 beta
I spent a wet morning doing some clean up and packaging of another zcl build and just finished updating the home page and uploading the source.
Although I just bumped the revision, this is getting pretty close to a 1.0 release. It's still got a few missing bits but it's mostly because the documentation is a bit broken beyond the README. It is only compatible with Java 8.
The home page has more details but the big points are that it now garbage collects everything (with explicit override), adds the lambda interfaces (trivial though they are), dynamically links to libOpenCL, fills out the extension framework and implements some extensions, and supports cross-platform building of native code.
I had to add a small code-generator to make the dynamic linking practical but it relies on the strict formatting of cl.h and does nothing fancy.
Now i've got cross platform sorted out i'll probably do all my work to this interface rather than jogamp/jocl because it's just nicer to use and easier to work with. This might not mean any more frequent updates but at least it should get tested more. But apart from not being able to get SVM working at all on my machine (sdk demo works, cut and pasted bits from demo, or any other thing i write - crash) i've encountered very few bugs anyway.
I've probably covered enough of the new stuff in the blog previously so probably won't have much to add, but the curious are welcome to ask.
Not much going on so today here's a diary entry ...
I've added a few little things to ZCL - it's coming along quite nicely, I should probably work toward another release. I'm slowly adding the functional-like stuff into it, I decided to go with 'q.offer()' as the enqueue function, and 'ofXX()' as the factory methods. Together with the garbage collector support it does offer some interesting possibilities for code-reuse but i'm still experimenting with it in practice.
I needed to access a webcam so i added another api to jjmpeg (but i might move it) which just wraps v4l2 devices directly from the file descriptor (no library). Actually I did that a while ago but have slowly been filling it out as I needed more functionality. OTOH I started looking into the total snot-show that webcam access is on microsoft platforms and decided to give up - you can't even build the media-framework libraries with mingw-w64 as far as I can tell and you need vs of some form just to get the "system" headers. Ugh.
However I did find the webcam-capture library which has already solved these problems. It's probably what i will look at as a fallback, but on linux efficiency is a bit low. The simplest webcam dump to JavaFX with my library generates almost no garbage (it provides static access directly to the driver buffers) and uses 50% less cpu time compared to using the low-level interfaces it provides despite a pretty expensive YUYV conversion step. The high level swing one is 4x the cpu overhead.
Along the way I found that it was actually using the videoInput "library" but I couldn't get that to link (cross compile at any rate, it should work with the ms sdk) - and in any event that just uses the directshow stuff which I had working ... i dunno, years ago. But the driver no longer works for the webcam i'm using and i'd have to buy it, ... so yeah that can wait.
And that ultimately led me to openimaj which probably would've saved me doing almost all that myself, although i would've needed to understand it anyway for OpenCL translation. And maven, ffs. But I guess I should at least have a look.
Discovery of useful software isn't that easy these days with so much noise - even if you go looking which I can't say I did ...
I've also been using a small amount of OpenGL and interoperating with OpenCL. It's become all a bit ... naff, and JOGL has been necessarily messed up to support all this nafficity. Pity I had to do this now rather than in a couple of months otherwise I would be looking at vulkan instead but with any luck that will be out soon enough to move to it before I need to get too far into GL (with the steammachines in november?). I'm sure microsoft will find some way to totally fuck-up its cross-platform parts again though. My current thinking is that i will write a java binding for it once I have my hands on it (if only just-because) but I haven't looked into it all so far. But removing the static per-thread state should make it a lot saner.
For now I have some simple classes to do some off-screen rendering, and some OpenCL interop which is enough for what I want. Access to an output texture in JavaFX would be nice but it sounds like this is just not going to happen. Although one would expect a vulkan backend to be done at some point it will probably suffer the same hiding issues (well, with good reason I guess). If I really needed more performance i'd just use another toolkit - which is sad as that doesn't appear to be the intended vision of the javafx designers.
On that interop I had to fill out the extension mechanism in ZCL. I followed the prototype I'd created earlier. Currently each extension is provided by a different CLExtension class. It holds a pointer which in C-land is a function table resolved on the platform, and each platform object manages these. At first I was just going to use this as the mechanism for accessing extensions but it quickly becomes messy - you have to find the platform the object belongs to and in some cases this requires multiple queries (e.g. q.getDevice().getPlatform()). One approach I tried was to hide that by providing the extension methods directly on the object they extend - e.g. new CLContext or CLCommandQueue methods. These then manage looking up the extension and invoking the correct method for the given object. The details still need to be resolved but only once per object and it's all handled java-side.
There's a bit more behind the original mechanism than just code tidiness for the extensions - they could potentially be loaded at runtime, or written separately from the core library. But on reflection how useless is that? The problem with this approach is each extension has its own object - this is good and bad in that eventually you end up with a table required per context, queue, device, or whatever.
I think putting the extension methods on the target object is correct and after that the details don't really matter so much since it's an internal detail. But (on the fly design) I guess i should just maintain a CLPlatform reference on each object which can be extended and handle it that way. The extension objects will still be per-extension which keeps a cleaner namespace but they only need to be set per-platform which doesn't happen often. I'm pretty sure all objects have a 1:1 platform relationship, that would be the only thing to throw a spanner in the works; but the whole extension mechanism wouldn't work if that were the case.
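Thinking out loud in code, the platform-reference idea might be sketched like this. None of these class or method names are the real zcl ones (the Sketch prefix is there to make that obvious), the "GL sharing" extension is a toy, and the native function-table lookup is replaced with a plain constructor call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: extension objects are resolved once per platform, and the
// extension methods live on the object they extend.
class SketchExtension {
    final SketchPlatform platform;

    SketchExtension(SketchPlatform platform) {
        this.platform = platform;
    }
}

// Hypothetical extension; in C-land this would wrap a function table.
final class SketchGLSharing extends SketchExtension {
    SketchGLSharing(SketchPlatform platform) {
        super(platform);
    }

    String acquire(SketchQueue q) {
        return "acquired on queue " + q.id;
    }
}

final class SketchPlatform {
    private final Map<Class<?>, SketchExtension> extensions = new HashMap<>();
    int resolved; // counts how many times an extension was actually resolved

    // Looks up the extension, creating it on first use only.
    synchronized <T extends SketchExtension> T getExtension(
            Class<T> type, Function<SketchPlatform, T> create) {
        SketchExtension e = extensions.get(type);
        if (e == null) {
            e = create.apply(this);
            extensions.put(type, e);
            resolved++;
        }
        return type.cast(e);
    }
}

final class SketchQueue {
    final SketchPlatform platform; // kept on every extendable object
    final int id;

    SketchQueue(SketchPlatform platform, int id) {
        this.platform = platform;
        this.id = id;
    }

    // The extension method sits on the queue; callers never see the lookup.
    String enqueueAcquireGL() {
        return platform.getExtension(SketchGLSharing.class, SketchGLSharing::new)
                       .acquire(this);
    }
}
```

The point of the design is that q.enqueueAcquireGL() needs no q.getDevice().getPlatform() dance from the caller, and two queues on the same platform share one resolved extension object.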
Somewhere along this journey i came across some C++ code for something, I can't remember what it was. It was how I find most C++ code - it's been so over-engineered almost all of the actual lines of code are just boilerplate. The workings are so hidden I gave up trying to find them. It's just as bad in Java land where everyone wants to write a fucking framework before they even get started. Cut the bullshit and get to the point. C++ shows its heritage as being born from a time when "Software Engineering" was going to solve the world's software problems by taking the programmer out of programming; using UML and CASE tools and auto-generating everything. It really shows, it is not a good language. This craziness was at its peak just as I was going through uni and it's done its fair share of damage to the world and clearly continues to if abominations like C++ still exist.
That'll do for now.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved. Powered by gcc & me!