Monique said:
I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.
No general-purpose compression scheme is going to be very much help for such
short strings. They work by learning from the data, and 12 bytes doesn't give
enough input for the compressor to learn much, let alone start using what it
has learned.
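You can see this for yourself with a few lines of throwaway Java (a quick
sketch using the standard java.util.zip classes): gzip a 12-byte string and
the output actually comes out *larger* than the input, because the header,
trailer and the untrained model cost more than the payload itself.

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class TinyInputDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = "hello, world".getBytes("ISO-8859-1"); // 12 bytes in
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);                    // compress the lot in one go
        }
        // Prints something like "12 bytes in, 32 bytes out" -- the
        // "compressed" form is bigger than the original.
        System.out.println(input.length + " bytes in, "
                + out.size() + " bytes out");
    }
}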
So you /have/ to use a custom compression scheme. Adding to the suggestions
that have already been made:
The gzip compression scheme (as implemented in the Java API) can be pre-trained
with representative data (which is known to both the reading and writing
applications and therefore does not need to be transmitted). You can easily
test how well that would work for you by assembling a file consisting of (say)
8K of real, representative data and another which is exactly the same except
that it has one more chunk tacked onto the end. The difference in size between
the two files when compressed with gzip -9 will give an indication of the
kind of compression available via this approach. I suspect it won't work well,
but you never know and it's easy to try out.
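If you want to try the dictionary idea from Java itself rather than with gzip
on the command line, the relevant hook is setDictionary() on
java.util.zip.Deflater and Inflater. A rough sketch (the strings here are
stand-ins for your real representative data):

import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws Exception {
        // Representative data known to both ends; never transmitted.
        byte[] dictionary =
            "http://java.sun.com/index.html".getBytes("ISO-8859-1");
        byte[] record =
            "http://java.sun.com/docs.html".getBytes("ISO-8859-1");

        // Compress one short record with the shared dictionary.
        // Note: the zlib wrapper itself costs several bytes of header and
        // checksum, which matters a lot at this size; the "nowrap" (raw
        // DEFLATE) constructor avoids most of that overhead.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary);
        deflater.setInput(record);
        deflater.finish();
        byte[] compressed = new byte[256];
        int len = deflater.deflate(compressed);
        deflater.end();
        System.out.println(record.length + " bytes in, " + len + " bytes out");

        // Decompress: the Inflater signals that it needs the same dictionary.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, len);
        byte[] restored = new byte[256];
        int n = inflater.inflate(restored);
        if (n == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dictionary);
            n = inflater.inflate(restored);
        }
        inflater.end();
        System.out.println("restored " + n + " bytes");
    }
}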
Otherwise you will almost certainly have to look into Huffman compression or
even (if you are sufficiently desperate) Arithmetic compression (much slower
and harder to implement).
One thing to consider is whether the data has any structure. For instance
(this is a real example from a previous job) if you desperately need to
compress URLs such as found on the web:
http://java.sun.com/index.html
you can divide them up into separate "regions":
http://
java.sun
.com/
index
.html
each of which has distinct frequency characteristics. The first region, the
"scheme", can probably be represented in 1 bit in the majority of cases. You
are looking at a population that is:
http:// -- the overwhelming majority
https:// -- quite common
ftp:// -- moderately rare
<everything else> -- rare
So you can represent http:// as one bit, probably use three bits for https://
and ftp://, and more bits than that for other cases. Similarly the ".com"
suffix can be recognised as the most common top-level domain, and so can be
compressed to one or two bits. The .htm and .html suffixes are common too.
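As a concrete (made-up) illustration for the scheme region, hand-assigned
prefix codes might look something like this; the codes are returned as strings
of bits purely to keep the sketch readable, a real encoder would pack them
into bytes.

public class SchemeCoder {
    // Hand-assigned prefix codes for the "scheme" region, following the
    // frequency argument above.  The exact codes are illustrative.
    static String encodeScheme(String url) {
        if (url.startsWith("http://"))  return "0";   // the overwhelming majority
        if (url.startsWith("https://")) return "100"; // quite common
        if (url.startsWith("ftp://"))   return "101"; // moderately rare
        return "11"; // escape: a rare scheme follows, spelled out in full
    }

    public static void main(String[] args) {
        System.out.println(encodeScheme("http://java.sun.com/index.html")); // 0
    }
}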
You can use the Huffman algorithm to assign near optimal bit-patterns to the
various cases. OTOH the 'java.sun' bit is going to be much more "random" so
the best you can hope for is a normal text compression scheme, such as
character-by-character Huffman compression based on the frequency of ASCII
characters in domain names (less the top-level segment).
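The classic Huffman construction is only a few lines: count character
frequencies over a large sample of domain names (the tiny sample below is just
for illustration), repeatedly merge the two rarest nodes, and read the codes
off the resulting tree.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final int freq; final Character ch; final Node left, right;
        Node(int freq, Character ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    static Map<Character, String> buildCodes(Map<Character, Integer> freqs) {
        PriorityQueue<Node> heap = new PriorityQueue<>();
        for (Map.Entry<Character, Integer> e : freqs.entrySet())
            heap.add(new Node(e.getValue(), e.getKey(), null, null));
        while (heap.size() > 1) {          // merge the two rarest nodes
            Node a = heap.poll(), b = heap.poll();
            heap.add(new Node(a.freq + b.freq, null, a, b));
        }
        Map<Character, String> codes = new HashMap<>();
        walk(heap.poll(), "", codes);
        return codes;
    }

    static void walk(Node n, String prefix, Map<Character, String> codes) {
        if (n == null) return;
        if (n.ch != null) { codes.put(n.ch, prefix.isEmpty() ? "0" : prefix); return; }
        walk(n.left, prefix + "0", codes);
        walk(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = new HashMap<>();
        for (String domain : new String[] {"java.sun", "google", "example"})
            for (char c : domain.toCharArray())
                freqs.merge(c, 1, Integer::sum);
        // Prints the code table, e.g. {a=00, e=110, ...} depending on the sample.
        System.out.println(buildCodes(freqs));
    }
}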
BTW, if this data is the result of encoding some compound information, then
look at the information itself rather than the strings produced by your current
code scheme. I.e. don't think of the task as compressing the 12 bytes of data,
but as compressing the N bytes of information that is (currently) expressed by
those 12 bytes.
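To make that concrete with a purely invented example: if the 12-character
string is really a formatted date, a status letter and a counter, then packing
those fields directly is both smaller and simpler than running any compressor
over the text.

import java.nio.ByteBuffer;

public class PackFieldsSketch {
    public static void main(String[] args) {
        // Hypothetical 12-byte record: "20090131A042" = date + status + counter.
        String formatted = "20090131A042";

        int year = 2009, month = 1, day = 31;   // the information behind the text
        char status = 'A';
        int counter = 42;

        // A crude fixed layout: 2 bytes for the date, 1 for the status, 2 for
        // the counter -- 5 bytes instead of 12, before any compression at all.
        ByteBuffer packed = ByteBuffer.allocate(5);
        packed.putShort((short) ((year - 2000) * 512 + month * 32 + day));
        packed.put((byte) status);
        packed.putShort((short) counter);

        System.out.println(formatted.length() + " bytes as text vs "
                + packed.position() + " bytes packed");
    }
}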
-- chris