cdez + other stuff
I started porting dez to C to look at using it here somewhere. Along the way I found a bug in the matcher implementation but otherwise got very distracted trying to gain a few neglible percent out of the delta sizes by manipulating the address encoding mechanism.
I tried modifying the matcher in various ways - experimenting with the hash table details. These involved including the hash value (i.e. to reduce spurious string matching - it just slows it down) or using a separate index table (no real difference). Probably the most surprising was that the performance was already somewhat better than covered in the dez benchmarks. Both considerably faster processing and smaller generated deltas. I guess that must have been an earlier implementation and I need to update them. For example the bible compression test only takes 11 seconds and creates a 1 566 019 byte delta - or 65% of the runtime at 90% of the output size.
This insprired me to play with the
tunable - which sets how deep the hashtable chain gets before it
starts to throw away older values. Using a setting of 5 (32
depth) it just beats the previous published results but in only
0.7s - still somewhat slower than 0.1 for gzip but at least it's
not out of the range of practicality. This is where I found the
bug in the entry discard indexing which was an easy fix.
This does mean that the other timings I did are pretty much pointless though - using a larger block search size than 1 just produces so much worse results and it's still slower. I haven't tried with a large source input string however, where a chain limit will truncate the search space prematurely.
Then I spent way too much time and effort trying various address encoding mechanisms to try to squeeze a little bit more out of the algorithm. In the end although I managed to get about 2.5% best case improvement in some cases I doubt it's really worth worrying about. However some of the alternative address encoding schemes are conceptually and mechanically simpler so I might use one of them (and break the file format).
Because of all that faffing about I never really got very far with the cdez conversion although I have the substring matcher basically done which is the more complex part. The encoding/decoding code is quite involved but otherwise straightforward bit bashing.
Update I tried a different test - one where i simulated the total delta size of encoding 180 revisions of jjmpeg development - not a particularly active project but still a real one. The original encoding is easily the best in this case.
For some reason the blog went offline for a few hours. It kept
getting segfaults in libc somewhere. All I did to fix it was
make install (which simply copied the binary into
the cgi directory and didn't rebuild anything) and it started
working again. Unfortunately I didn't think to preserve the binary
that was there to find out why it stopped working.
Something to keep an eye on anyway.