This is a basic binary delta generator for Java. It operates only on in-memory arrays.
It is based on the algorithm from Bentley & McIllroy's paper ``Data Compression Using Long Common Strings'' with some simplifications that take advantage of modern memory limits and processing power.
It only includes an encoder and decoder for a proprietary delta format 'DEZ1' (documented in the source) but other formats (e.g. vcdiff or gdiff) are possible by implementing a new decoder and matching ByteDeltaEncoder without needing to touch the core classes.
One aim was also to make a simple bit of well-engineered code as a goal in itself. There is a clean separation between the string matcher and encoding format and the string matcher itself is almost fully self-contained in 300 lines of commented and well-spaced Java.
It uses a rolling 'polycyclic' hash implemented via rotate instructions. It is very easy to understand, fast, and should suffice.
Because every candidate that maps to the same hash code is kept
conceptually it uses an instance
using java collections for this generates a lot of garbage and a
large fixed overhead due to containers and boxing. Instead a custom
hash table has been implemented as in-line code to gain both memory
overhead reduction and opportunity for micro optimisations. The
insertion/append function is under 15 lines of simple code and
iteration is a simple array scan. Unecessary features like re-hash
or delete are not implemented.
Look at the source for more details.
Because it can check each location for match candidates and includes the target decoded-so-far it can produce reasonably compact deltas and can be used as a compressor if the source is empty.
Runtime performance is good-to-excellent for typical text files but as it does not window the candidate search space it slows down with large files (>>1MB). But it can be tuned to produce a reasonable running time over a wide range and windowing could be added to improve scalability.
As a point of comparison to other java libraries, the following are the results for creating a delta from a few different files. The bible test checks using it for straight compression.
badiff-1.2 jvcdiff-2.0 dez-1.1 dez-1.1 (udiff) (6,4,1) (16,4,16) GPL 2.0 (18 092) to GPL 3.0 (35 147) encode time 0.17 0.0012 0.0023 0.00067 decode time 0.00059 0.00012 0.00013 0.000057 patch size 32 940 28 829 13 591 27 267 peak heap 1 105 767K 3 939K 1 972K 1 419K peak heap 100 115 500K 2 901K 1 962K 1 410K jjmpeg.dll r36 (144 658) to jjmpeg.dll r35 (144 145) encode time 1.1 0.0033 0.0072 0.0017 decode time 0.0026 0.00029 0.00017 0.00015 patch size 26 885 15 622 10 819 18 935 peak heap 1 105 339K 7 326K 4 434K 1 666K peak heap 100 139 841K 5 282K 5 527K 1 649K empty string (0) to KJV bible text (5 218 805) gzip --fast encode time 1.9 0.46 17. 1.1 0.098 decode time 0.0094 0.017 0.023 0.016 0.048 patch size 5 218 814 4 866 337 1 731 265 3 631 972 1 788 558 peak heap 1 311 631K 105 186K 51 746K 23 964K peak heap 2 497 405K 105 479K 56 951K 23 958K
Maybe badiff doesn't like this data or isn't very good for in-memory arrays but it seems to have little to recomment it. At best it takes a couple of orders longer and needs a couple of orders more memory, only to create bigger deltas.
Encodes fast but produces middling delta compression. Both are due to it's use of a 16-byte non-overlapping block size as this leads to less to search but fewer potential matches.
At it's best setting encoding time can scale poorly but it produces much smaller results and uses equal or less memory to do it. At the given fast setting it can still produce better results but in half the time with even lower memory requirements.
Not shown but the code simplicity also aids hotspot in optimisation so peak performance comes after fewer iterations and takes less effort.
Given the simplicity of the delta format these results are encouraging. The memory usage scales better than I expected and easily justifies the rather small cost of developing custom data structures.
Using the vcdiff format might reduce the size of the deltas further but an implementation is not planned at this time.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Sourcode here. You may also browse the repository.
26/4/2019: It's been 4 years, I think it's ready enough.
28/3/2015: Made a couple of minor changes:
27/3/2015: Fixed a silly typo.
27/3/2015: Initial release.
notzed on various mail servers, primarily gmail.com.