c dez port
I had a couple of hours to burn Sunday morning so I ported over
the rest of the dez code to C, although I didn't feel like testing
it till I had some hours to burn today.
Anyway, I fixed some bugs and ran some tests. It's only about
30-50% faster than the Java version on the bible test for
practical "limit" values. The patches generated aren't
necessarily identical because of some minor changes in the hash
table design but the differences are minor. The C code also
requires some more bounds and error checking for robustness.
I also added CRC32 checksums to the file format as a quick check
that the input and output aren't corrupted.
cdez + other stuff
I started porting dez to C to look
at using it here somewhere. Along the way I found a bug in the
matcher implementation but otherwise got very distracted trying to
gain a few neglible percent out of the delta sizes by manipulating
the address encoding mechanism.
I tried modifying the matcher in various ways - experimenting with
the hash table details. These involved including the hash value
(i.e. to reduce spurious string matching - it just slows it down) or
using a separate index table (no real difference). Probably the
most surprising was that the performance was already somewhat better
than covered in the dez benchmarks. Both considerably faster
processing and smaller generated deltas. I guess that must have
been an earlier implementation and I need to update them. For
example the bible compression test only takes 11 seconds and creates
a 1 566 019 byte delta - or 65% of the runtime at 90% of
the output size.
This insprired me to play with the
tunable - which sets how deep the hashtable chain gets before it
starts to throw away older values. Using a setting of 5 (32
depth) it just beats the previous published results but in only
0.7s - still somewhat slower than 0.1 for gzip but at least it's
not out of the range of practicality. This is where I found the
bug in the entry discard indexing which was an easy fix.
This does mean that the other timings I did are pretty much
pointless though - using a larger block search size than 1 just
produces so much worse results and it's still slower. I haven't
tried with a large source input string however, where a chain limit
will truncate the search space prematurely.
Then I spent way too much time and effort trying various address
encoding mechanisms to try to squeeze a little bit more out of the
algorithm. In the end although I managed to get about 2.5% best
case improvement in some cases I doubt it's really worth worrying
about. However some of the alternative address encoding schemes are
conceptually and mechanically simpler so I might use one of them
(and break the file format).
Because of all that faffing about I never really got very far with
the cdez conversion although I have the substring matcher
basically done which is the more complex part. The
encoding/decoding code is quite involved but otherwise
straightforward bit bashing.
Update I tried a different test - one where i simulated the
total delta size of encoding 180 revisions of jjmpeg development -
not a particularly active project but still a real one. The
original encoding is easily the best in this case.
For some reason the blog went offline for a few hours. It kept
getting segfaults in libc somewhere. All I did to fix it was
make install (which simply copied the binary into
the cgi directory and didn't rebuild anything) and it started
working again. Unfortunately I didn't think to preserve the binary
that was there to find out why it stopped working.
Something to keep an eye on anyway.
Well I don't have any code ready yet but between falling asleep
today i did a little more work on a versioned data-store i've been
working on ... for years, a couple of decades infact.
In it's current iteration it utilises just 3 core (simple)
relational(-like) tables and can support both SVN style branches
(lightweight renames) and CVS style branches (even lighter
weight). Like SVN it uses global revisions and transactions but
like CVS it works on a version tree rather than a path tree, but
both approaches are possible within the same small library.
Together with dez it allows for
compact reverse delta storage.
Originally I started in C but i've been working Java since - but
i'm looking at back-porting the latest design to C again for some
performance comparisons. It's always used Berkeley DB (JE for
Java) as storage although I did experiment with using a SQL
version in the past.
My renewed interest is that the goal is eventually run this site
with it as a backing storage - for code, documentation, musings.
e.g. the ability to branch a document tree for versioning and yet
have it served live from common storage. This was essentially the
reason I started investigating the project many years ago but
never quite got there. I'm pretty sure I've got a solid schema
but still need to solidify the API around a few basic use-cases
before I move forward.
The last time I touched the code as 2 years ago, and the last time
I did any significant code on it was 3 years ago, so it's
definitely a slow burner project!
Well, more when I have more to say.
Using GNU make to build Java software
I finally finished writing an article about Java make i started some time ago, multiple times. I was going through cleaning up a new release of dez (still pending) and decided to fill it out with the junit stuff and then write it up what I actually ended up with.
The following few lines is now the complete makefile for dez. This supports `jar' (normal build target), `sources' (ide source jar), `javadoc' (ide javadoc jar), dist (complete rebuildable source), and now even `test' or `check' (unit and integration tests via JUnit 4) targets. The stuff included from java.make is reusable and is under 200 lines once you exclude voluminous comments and documentation.
java_PROGRAMS = dez
DIST_EXTRA=COPYING.AGPL3 README Makefile
The article is over on my home page at Using GNU Make for java under my software articles section.
other little things
So it turned out i was quite wrong on the prefix sorted table thing, it can be done in less memory and more efficiently and there's a whole field of research newly reinvigorated by search engines and genome stitching and modern hardware capabilities. But for a casual observer most of it is a bit complex for a weekend hack and needs some pretty messy pointer-filled code so doesn't translate well to Java; I tried a few things but gave up on that idea since my naive approaches weren't much worse in speed to the code i could find on the sort of data i'm playing with. If it was that important I would just use some C, but it really isn't. But like most of this stuff its good to know it's out there should I actually need it some day. And the maven stuff everyone in java-land insists on now using was giving me the utter shits too; e.g. oh, lets just import a whole fecking platform just so we can use one trivial one-line function that just calls java.util.* anyway, ... once. Why why?
Then I got a bit side-tracked in trying to re-use the system primitive array sort to sort sub-strings by sorting them 4-bytes at a time packed into a long. The lower 31 bits are the offset into the array (i.e. base pointer) and the next 32 bits are 4 bytes to sort in order with truncation handled using 0 padding. As it only uses 63 bits I can use a signed long sort to sort 4 bytes of string at a time. This gets sorted then ranges with a common prefix are recursively sorted again, and so on. Managed about a 2x speedup over a straight memcmp sort; which is probably nothing to sneeze at but not really enough here. Then I tried 8-bit radix sorting the first pass: gains a little bit more. And then a 16-bit radix first pass which was over 2x speed-up again, although by this time I didn't feel like writing multiple whole sorts just to test ideas and they still have undesirable worst-case behaviour. Actually a 6x speedup (if the code is functionally correct, which i didn't get to checking) is pretty decent: but it's delving into territory where far better and more scalable algorithms for the problem exist and i already have one that works well enough for me as a (better than) "70% solution".
Sorting is something i poke at every now just for interests sake so i may still revisit it later. I can't even remember if i've done radix sorting before, but i probably should have given it is parallelisable in a massively-parallel sense.
I also had the headache of trying to troll through a big C++ library yesterday trying to understand something. Ugh, it's like reading through a whole C application written using pre-processor macros, one for every case imaginable but few actually ever realised. Apparently this is a desirable thing in C++. But yeah, abstraction isn't really about hiding so much detail you can't work out what something is doing without intimately knowing everything about it; that's pretty much the opposite. And those compile times. And those error messages. With all that i'm suprised how much work the language and compiler still forces on the developer, and yet it is trivial to write incorrect code. But I digress, I had enough of that yesterday.
I did have another idea for the string matcher either this morning or yesterday which I finally just remembered and tried: just truncate the maximum hash bucket length to some fixed maximum. It drops the match rate (particularly for popular symbols) and thus the compression rate but it drastically improves run-time speed and worst-case behaviour. Saves some memory too. Ok I guess everyone else does this already but I tried to get by without it; but alas could not. Because i'm using a bucket-based hash table i've got some more options on how i deal with it; i don't just have to throw them away.
Actually the way i implemented it I think it has some interesting side-effects together with the new delta format; common prefixes are updated more often and this more likely to be close by or in the exact-address table so the address compression works better. Initially I tried replacing a random record with the new entry: this is usually a good idea for mathematical reasons "I recognise but don't completely understand", but generating the random numbers proved to be too expensive for short files. So I just have a rolling index shared across all records which is super-cheap and should be relatively random in a local sense and merely practical in all others.
So here's some variations and times for my two current format encoders on the kjv text - on the smaller files it doesn't make much difference. The limit is (2^n - 1) because i store the chain length/insertion point in the first index and i double the size each time i grow it. I included one result when using random replacement: as suggested it does improve the compression - even if only very slightly! Here the cost is somewhat hidden by the string compares. The dez home page has the benchmarks of the other software i've tried; they are not even remotely close in terms of output size even related to the fastest and "worst" below.
dez-1 dez-2* dez-1 dez-2
3 0.232623 0.319473 2596997 1970223
7 0.334969 0.433130 2328817 1831470
15 0.429424 0.521525 2157175 1757288
15 (random) 0.572432 0.666268 2156093 1756776
31 0.545488 0.626991 2028280 1702969
63 0.732696 0.822705 1934709 1659212
unlimited 9.724143 9.853381 1731250 1539501
empty string to KJV bible "delta" encode
* actually i'm going to just call dez-2 `dez-1'.
The unlimited case is a bit (well 70%!) faster than previous because I added a small optimisation to the string length matcher: it first checks to see if the next character after the current best length is a match, if not it doesn't bother scanning the rest of the string at all. But still, whether 10s or 17s, it's still pretty shit (the main issue with this text is it is indented, so 25K+ substrings begin with newline+several spaces and they all end up in the same hash bucket).
For a given limit both format encoders are being fed an identical list of edits, which shows just how much better the second format is and why the first will vanish next time I do a code-drop.
So I came up with a couple more ideas to try ... actually they all pretty much amounted to nothing but here is a small update anyway. See the notes on the home page.
Some things I tried and discarded: incrementally chaining the hash table (avoiding the realloc+copy); inlining the hash function; calculating all hashes of source and target at the start; doing the same in parallel. Most of them amounted to no gain although the parallel hashing helped quite a bit in some cases but not often enough to be worth the memory required.
Thus this is mostly a post about the expanded homepage where i added a bit detail and more benchmarks.
I was a bit surprised with the memory use compared to the "competition" because dez amounts to a full exhaustive search, albeit index assisted, whereas the others do not.
Copyright (C) 2018 Michael Zucchi, All Rights Reserved.Powered by gcc & me!