Michael Zucchi

 B.E. (Comp. Sys. Eng.)


android (44)
beagle (63)
biographical (82)
business (1)
code (56)
cooking (29)
dez (6)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (414)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (216)
java ee (3)
javafx (48)
jjmpeg (67)
junk (3)
kobo (15)
linux (3)
mediaz (27)
ml (15)
nativez (3)
opencl (117)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
politics (7)
ps3 (12)
puppybits (17)
rants (134)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (2)
zedzone (13)
Monday, 07 May 2018, 21:00

c dez port

I had a couple of hours to burn Sunday morning so I ported over the rest of the dez code to C, although I didn't feel like testing it till I had some hours to burn today.

Anyway, I fixed some bugs and ran some tests. It's only about 30-50% faster than the Java version on the bible test for practical "limit" values. The patches generated aren't necessarily identical because of some minor changes in the hash table design but the differences are minor. The C code also requires some more bounds and error checking for robustness.

I also added CRC32 checksums to the file format as a quick check that the input and output aren't corrupted.

Tagged dez.
Saturday, 05 May 2018, 16:36

cdez + other stuff

I started porting dez to C to look at using it here somewhere. Along the way I found a bug in the matcher implementation but otherwise got very distracted trying to gain a few neglible percent out of the delta sizes by manipulating the address encoding mechanism.

I tried modifying the matcher in various ways - experimenting with the hash table details. These involved including the hash value (i.e. to reduce spurious string matching - it just slows it down) or using a separate index table (no real difference). Probably the most surprising was that the performance was already somewhat better than covered in the dez benchmarks. Both considerably faster processing and smaller generated deltas. I guess that must have been an earlier implementation and I need to update them. For example the bible compression test only takes 11 seconds and creates a 1 566 019 byte delta - or 65% of the runtime at 90% of the output size.

This insprired me to play with the chain limit tunable - which sets how deep the hashtable chain gets before it starts to throw away older values. Using a setting of 5 (32 depth) it just beats the previous published results but in only 0.7s - still somewhat slower than 0.1 for gzip but at least it's not out of the range of practicality. This is where I found the bug in the entry discard indexing which was an easy fix.

This does mean that the other timings I did are pretty much pointless though - using a larger block search size than 1 just produces so much worse results and it's still slower. I haven't tried with a large source input string however, where a chain limit will truncate the search space prematurely.

Then I spent way too much time and effort trying various address encoding mechanisms to try to squeeze a little bit more out of the algorithm. In the end although I managed to get about 2.5% best case improvement in some cases I doubt it's really worth worrying about. However some of the alternative address encoding schemes are conceptually and mechanically simpler so I might use one of them (and break the file format).

Because of all that faffing about I never really got very far with the cdez conversion although I have the substring matcher basically done which is the more complex part. The encoding/decoding code is quite involved but otherwise straightforward bit bashing.

Update I tried a different test - one where i simulated the total delta size of encoding 180 revisions of jjmpeg development - not a particularly active project but still a real one. The original encoding is easily the best in this case.


For some reason the blog went offline for a few hours. It kept getting segfaults in libc somewhere. All I did to fix it was run make install (which simply copied the binary into the cgi directory and didn't rebuild anything) and it started working again. Unfortunately I didn't think to preserve the binary that was there to find out why it stopped working.

Something to keep an eye on anyway.

Tagged dez, zedzone.
Monday, 23 April 2018, 21:21

Versioning DB

Well I don't have any code ready yet but between falling asleep today i did a little more work on a versioned data-store i've been working on ... for years, a couple of decades infact.

In it's current iteration it utilises just 3 core (simple) relational(-like) tables and can support both SVN style branches (lightweight renames) and CVS style branches (even lighter weight). Like SVN it uses global revisions and transactions but like CVS it works on a version tree rather than a path tree, but both approaches are possible within the same small library.

Together with dez it allows for compact reverse delta storage.

Originally I started in C but i've been working Java since - but i'm looking at back-porting the latest design to C again for some performance comparisons. It's always used Berkeley DB (JE for Java) as storage although I did experiment with using a SQL version in the past.

My renewed interest is that the goal is eventually run this site with it as a backing storage - for code, documentation, musings. e.g. the ability to branch a document tree for versioning and yet have it served live from common storage. This was essentially the reason I started investigating the project many years ago but never quite got there. I'm pretty sure I've got a solid schema but still need to solidify the API around a few basic use-cases before I move forward.

The last time I touched the code as 2 years ago, and the last time I did any significant code on it was 3 years ago, so it's definitely a slow burner project!

Well, more when I have more to say.

Tagged dez, java, rez.
Friday, 03 June 2016, 16:34

Using GNU make to build Java software

I finally finished writing an article about Java make i started some time ago, multiple times. I was going through cleaning up a new release of dez (still pending) and decided to fill it out with the junit stuff and then write it up what I actually ended up with.

The following few lines is now the complete makefile for dez. This supports `jar' (normal build target), `sources' (ide source jar), `javadoc' (ide javadoc jar), dist (complete rebuildable source), and now even `test' or `check' (unit and integration tests via JUnit 4) targets. The stuff included from java.make is reusable and is under 200 lines once you exclude voluminous comments and documentation.

java_PROGRAMS = dez



include java.make

The article is over on my home page at Using GNU Make for java under my software articles section.

Tagged code, dez, gnu, hacking, java.
Wednesday, 01 April 2015, 02:37

other little things

So it turned out i was quite wrong on the prefix sorted table thing, it can be done in less memory and more efficiently and there's a whole field of research newly reinvigorated by search engines and genome stitching and modern hardware capabilities. But for a casual observer most of it is a bit complex for a weekend hack and needs some pretty messy pointer-filled code so doesn't translate well to Java; I tried a few things but gave up on that idea since my naive approaches weren't much worse in speed to the code i could find on the sort of data i'm playing with. If it was that important I would just use some C, but it really isn't. But like most of this stuff its good to know it's out there should I actually need it some day. And the maven stuff everyone in java-land insists on now using was giving me the utter shits too; e.g. oh, lets just import a whole fecking platform just so we can use one trivial one-line function that just calls java.util.* anyway, ... once. Why why?

Then I got a bit side-tracked in trying to re-use the system primitive array sort to sort sub-strings by sorting them 4-bytes at a time packed into a long. The lower 31 bits are the offset into the array (i.e. base pointer) and the next 32 bits are 4 bytes to sort in order with truncation handled using 0 padding. As it only uses 63 bits I can use a signed long sort to sort 4 bytes of string at a time. This gets sorted then ranges with a common prefix are recursively sorted again, and so on. Managed about a 2x speedup over a straight memcmp sort; which is probably nothing to sneeze at but not really enough here. Then I tried 8-bit radix sorting the first pass: gains a little bit more. And then a 16-bit radix first pass which was over 2x speed-up again, although by this time I didn't feel like writing multiple whole sorts just to test ideas and they still have undesirable worst-case behaviour. Actually a 6x speedup (if the code is functionally correct, which i didn't get to checking) is pretty decent: but it's delving into territory where far better and more scalable algorithms for the problem exist and i already have one that works well enough for me as a (better than) "70% solution".

Sorting is something i poke at every now just for interests sake so i may still revisit it later. I can't even remember if i've done radix sorting before, but i probably should have given it is parallelisable in a massively-parallel sense.

I also had the headache of trying to troll through a big C++ library yesterday trying to understand something. Ugh, it's like reading through a whole C application written using pre-processor macros, one for every case imaginable but few actually ever realised. Apparently this is a desirable thing in C++. But yeah, abstraction isn't really about hiding so much detail you can't work out what something is doing without intimately knowing everything about it; that's pretty much the opposite. And those compile times. And those error messages. With all that i'm suprised how much work the language and compiler still forces on the developer, and yet it is trivial to write incorrect code. But I digress, I had enough of that yesterday.


I did have another idea for the string matcher either this morning or yesterday which I finally just remembered and tried: just truncate the maximum hash bucket length to some fixed maximum. It drops the match rate (particularly for popular symbols) and thus the compression rate but it drastically improves run-time speed and worst-case behaviour. Saves some memory too. Ok I guess everyone else does this already but I tried to get by without it; but alas could not. Because i'm using a bucket-based hash table i've got some more options on how i deal with it; i don't just have to throw them away.

Actually the way i implemented it I think it has some interesting side-effects together with the new delta format; common prefixes are updated more often and this more likely to be close by or in the exact-address table so the address compression works better. Initially I tried replacing a random record with the new entry: this is usually a good idea for mathematical reasons "I recognise but don't completely understand", but generating the random numbers proved to be too expensive for short files. So I just have a rolling index shared across all records which is super-cheap and should be relatively random in a local sense and merely practical in all others.

So here's some variations and times for my two current format encoders on the kjv text - on the smaller files it doesn't make much difference. The limit is (2^n - 1) because i store the chain length/insertion point in the first index and i double the size each time i grow it. I included one result when using random replacement: as suggested it does improve the compression - even if only very slightly! Here the cost is somewhat hidden by the string compares. The dez home page has the benchmarks of the other software i've tried; they are not even remotely close in terms of output size even related to the fastest and "worst" below.

                   time                  size
             dez-1      dez-2*      dez-1    dez-2
3            0.232623   0.319473    2596997  1970223
7            0.334969   0.433130    2328817  1831470
15           0.429424   0.521525    2157175  1757288
15 (random)  0.572432   0.666268    2156093  1756776
31           0.545488   0.626991    2028280  1702969
63           0.732696   0.822705    1934709  1659212

unlimited    9.724143   9.853381    1731250  1539501

     empty string to KJV bible "delta" encode

* actually i'm going to just call dez-2 `dez-1'.

The unlimited case is a bit (well 70%!) faster than previous because I added a small optimisation to the string length matcher: it first checks to see if the next character after the current best length is a match, if not it doesn't bother scanning the rest of the string at all. But still, whether 10s or 17s, it's still pretty shit (the main issue with this text is it is indented, so 25K+ substrings begin with newline+several spaces and they all end up in the same hash bucket).

For a given limit both format encoders are being fed an identical list of edits, which shows just how much better the second format is and why the first will vanish next time I do a code-drop.

Tagged dez, hacking, java.
Saturday, 28 March 2015, 20:20

dez 1.2

So I came up with a couple more ideas to try ... actually they all pretty much amounted to nothing but here is a small update anyway. See the notes on the home page.

Some things I tried and discarded: incrementally chaining the hash table (avoiding the realloc+copy); inlining the hash function; calculating all hashes of source and target at the start; doing the same in parallel. Most of them amounted to no gain although the parallel hashing helped quite a bit in some cases but not often enough to be worth the memory required.

Thus this is mostly a post about the expanded homepage where i added a bit detail and more benchmarks.

I was a bit surprised with the memory use compared to the "competition" because dez amounts to a full exhaustive search, albeit index assisted, whereas the others do not.

Tagged code, dez, hacking, java.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved.Powered by gcc & me!