compressing short strings?


Paul Rubin

I have a lot of short English strings I'd like to compress in order to
reduce the size of a database. That is, I'd like a compression
function that takes a string like (for example) "George Washington"
and returns a shorter string, with luck maybe 6 bytes or so. One
obvious idea is to take the gzip function, compress some large text
corpus with it in streaming mode and throw away the output (but
setting up the internal state to model the statistics of English
text), then put in "George Washington" and treat the additional output
as the compressed string. Obviously to get reasonable speed there
would have to be a way to save the internal state after initializing
from the corpus.
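
(A minimal sketch of this priming idea using zlib's preset-dictionary support, available in Python 3.3+ as the zdict argument; corpus.txt is a hypothetical file of English text, and only its last 32 KB matter because that's all deflate can look back at.)

import zlib

# Hypothetical corpus file; deflate only uses the last 32 KB as history.
corpus = open("corpus.txt", "rb").read()[-32768:]

def compress_short(s, zdict=corpus):
    # Raw deflate (wbits=-15) avoids the zlib header/checksum overhead.
    c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=zdict)
    return c.compress(s.encode("utf-8")) + c.flush()

def decompress_short(data, zdict=corpus):
    d = zlib.decompressobj(-15, zdict=zdict)
    return (d.decompress(data) + d.flush()).decode("utf-8")

print(compress_short("George Washington"))

(This isn't quite the same as freezing the adaptive state of a streaming compressor, since a preset dictionary only supplies match history rather than learned statistics, but it is the closest thing the standard library offers.)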

Anyone know if this has been done and if there's code around for it?
Maybe I'm better off with freezing a dynamic Markov model? I think
there's DMM code around but am not sure where to look.

Thanks.
 

Arnaud Delobelle

Paul Rubin said:
I have a lot of short English strings I'd like to compress in order to
reduce the size of a database. That is, I'd like a compression
function that takes a string like (for example) "George Washington"
and returns a shorter string, with luck maybe 6 bytes or so. One
obvious idea is to take the gzip function, compress some large text
corpus with it in streaming mode and throw away the output (but
setting up the internal state to model the statistics of English
text), then put in "George Washington" and treat the additional output
as the compressed string. Obviously to get reasonable speed there
would have to be a way to save the internal state after initializing
from the corpus.

Anyone know if this has been done and if there's code around for it?
Maybe I'm better off with freezing a dynamic Markov model? I think
there's DMM code around but am not sure where to look.

Thanks.

Out of the blue idea which is probably rubbish:

Create a tree of your words, look at it as a Deterministic Finite
Automaton (DFA), minimise it, you get what some call a DAWG. That
works very well with lots of small words.

e.g.

Words aabc, babc, bbbc.

Minimal DFA:

1 -a-> 2 -a-> 4 -b-> 5 -c-> 6
 \            ^
  b-> 3 -a----+
       \      |
        b-----+


To compress a word, simply find its position in alphabetical order,
call that its 'path'.

The 'decompression algorithm' is then (in pseudo-python)

def decompress(DFA, path):
    letters = []
    state = DFA.initial_state
    while not state.is_final():
        transitions = state.transitions
        path, choice = divmod(path, len(transitions))
        state = transitions[choice].new_state
        letters.append(transitions[choice].letter)
    return ''.join(letters)
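
(A sketch of the matching 'compress' direction, assuming the same hypothetical DFA interface as above (initial_state, transitions, new_state, letter); it folds the per-state choices into a single integer and is the exact inverse of the divmod loop above.)

def compress(DFA, word):
    # Walk the word through the DFA, remembering which transition was
    # taken at each state and how many transitions that state offered.
    steps = []
    state = DFA.initial_state
    for letter in word:
        transitions = state.transitions
        choice = [t.letter for t in transitions].index(letter)
        steps.append((choice, len(transitions)))
        state = transitions[choice].new_state
    # Fold the choices into one integer, last step first, so that the
    # repeated divmod in decompress() peels them off in the right order.
    path = 0
    for choice, fanout in reversed(steps):
        path = path * fanout + choice
    return path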

I'm not too sure about this :)
 

Thomas Troeger

Paul said:
I have a lot of short English strings I'd like to compress in order to
reduce the size of a database. That is, I'd like a compression
function that takes a string like (for example) "George Washington"
[...]


Thanks.

I think your idea is good; maybe you'd want to build an LZ78 encoder in
Python (LZ78 is pretty easy), feed it a long English text and then
pickle the resulting object. You could then unpickle it at program start
and encode your short strings with it. I bet there's already a working
implementation around that does this ... but if you can't find any,
LZ78 can be implemented in an hour or two. There was a rather good
explanation of the algorithm in German, but unfortunately it vanished
from the net recently (I have a backup if you're interested).
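
(A minimal LZ78-style encoder sketch, not the implementation from that German write-up; the class and method names are made up for illustration. Feed it a long English text once, pickle it, and later encode short strings against the learned dictionary; a decoder starting from the same pickled dictionary can reverse it.)

class LZ78Encoder(object):
    def __init__(self):
        self.dictionary = {"": 0}          # phrase -> index

    def train(self, text):
        # Build the phrase dictionary exactly as LZ78 encoding would.
        self.encode(text)

    def encode(self, text):
        # Emit (index of longest known prefix, next char) pairs; phrases
        # keep being added, so short strings encoded later still benefit
        # from the phrases learned from the corpus.
        out = []
        phrase = ""
        for ch in text:
            if phrase + ch in self.dictionary:
                phrase += ch                       # extend the current match
            else:
                out.append((self.dictionary[phrase], ch))
                self.dictionary[phrase + ch] = len(self.dictionary)
                phrase = ""
        if phrase:
            out.append((self.dictionary[phrase], ""))   # flush the tail
        return out

enc = LZ78Encoder()
enc.train(open("corpus.txt").read())       # hypothetical English corpus
print enc.encode("George Washington")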

Cheers,
Thomas.
 

Helmut Jarausch

Paul said:
I have a lot of short English strings I'd like to compress in order to
reduce the size of a database. That is, I'd like a compression
function that takes a string like (for example) "George Washington"
and returns a shorter string, with luck maybe 6 bytes or so. One
obvious idea is to take the gzip function, compress some large text
corpus with it in streaming mode and throw away the output (but
setting up the internal state to model the statistics of English
text), then put in "George Washington" and treat the additional output
as the compressed string. Obviously to get reasonable speed there
would have to be a way to save the internal state after initializing
from the corpus.

Anyone know if this has been done and if there's code around for it?
Maybe I'm better off with freezing a dynamic Markov model? I think
there's DMM code around but am not sure where to look.

I'd ask in comp.compression, where the specialists are listening; they
are very helpful.


--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
 

bearophileHUGS

Helmut Jarausch:
I'd ask in comp.compression, where the specialists are listening; they
are very helpful.

Asking in comp.compression is a good starting point.

My suggestions (sorry if they look a bit unsorted): it depends on what
language you want to use, how much you want to compress the strings,
whether they are ASCII or Unicode, how quickly you need to compress/
decompress them, and how much time you have to write the encoder/
decoder.

You can use Python or a C/D extension; if you use C you will need more
time to develop it, but you will be much freer to use any algorithm you
want (well, you are free with Python too, but the code may be slower if
it performs low-level things), so D may be an acceptable compromise.

Generally it's better to start with the simplest solution and go on
from there, looking for more complex ones if the first isn't enough. In
Python the simplest thing to try is to compress the strings with the
"zip" codec:
"George Washington".encode("zip")
The result is probably longer than the original (25 bytes of output for
this string). With "bz2" the results are probably worse still (56 bytes
for this string).

A possible alternative is to compress many (10-30) strings at a time,
using an external table of their starting points (you can put the
table at the beginning), or separating the strings with a separator
char that's never present in them, and then zipping each group. If you
have 1 million strings you can divide them into such little groups, but
the indexing in the DB becomes complex, and I don't like this solution
much.
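
(A sketch of the grouping idea, assuming a NUL byte never appears in the strings; each block would be stored as one DB blob and indexed by block id plus position.)

import zlib

def pack_block(strings, sep="\0"):
    # Join one small group of strings with a separator that never occurs
    # in them, then compress the whole block at once.
    return zlib.compress(sep.join(strings), 9)

def unpack_block(blob, sep="\0"):
    return zlib.decompress(blob).split(sep)

blob = pack_block(["George Washington", "John Adams", "Thomas Jefferson"])
print len(blob), unpack_block(blob)[1]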

A slightly less simple Python solution is to keep a fixed corpus of text
and compress it with and without the little string you want to add
(like you say), but compressing/decompressing everything each time (if
the corpus is small enough this isn't that slow). For example, with a
small corpus s1 the compressed output grows by only about 11 bytes when
this string is appended, so you only need to store that 11-byte tail to
be able to decompress it. A bigger corpus (s2) allows a bit more
compression (about 9 bytes for this string), but it requires more time
for compression/decompression.
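
(A sketch of that measurement with the standard zlib module; small_corpus.txt is a hypothetical file, and the 11/9-byte figures obviously depend on the corpus used.)

import zlib

# Hypothetical corpus; the savings depend entirely on its contents.
s1 = open("small_corpus.txt").read()
x = "George Washington"

# Extra bytes the compressed output grows by when the string is appended.
print len(zlib.compress(s1 + x, 9)) - len(zlib.compress(s1, 9))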

One of the simpler ways to compress short ASCII English texts is to
notice that the most frequent chars are very few, so you can encode the
15 most common English chars with 1 nibble each, and all the other ones
with 3 nibbles (the first nibble is an escape, and the two that follow
are the whole byte). If you use a low-level language you can implement
this in a few lines of code; it's very fast and it may shave off 25-33%
of your bytes. It means that your "George Washington" (17 chars long)
may become about 12-13 chars long. I have implemented this with Python
too, it's easy.
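
(A sketch of the nibble scheme in Python 2; the 15-character table is only a guess at the most common English characters, space included.)

# Guessed 15 most common chars; nibble value 15 is the escape.
COMMON = " etaoinshrdlcum"

def nibble_encode(text):
    nibbles = []
    for ch in text:
        i = COMMON.find(ch)
        if i >= 0:
            nibbles.append(i)                      # common char: 1 nibble
        else:
            b = ord(ch)
            nibbles.extend([15, b >> 4, b & 0xF])  # escape + raw byte: 3 nibbles
    if len(nibbles) % 2:
        nibbles.append(15)                         # lone trailing escape = padding
    return "".join(chr((nibbles[i] << 4) | nibbles[i + 1])
                   for i in xrange(0, len(nibbles), 2))

print len(nibble_encode("George Washington"))      # 13 bytes instead of 17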

Another possible solution is to use Huffman compression with fixed
symbol frequencies. In Python there are ways to write such a compressor/
decompressor in just 10-15 lines of code, and you can use heapq to speed
up the processing. This may lead to more compression, but I presume not
much below 4-5 bpc. And if you want to compress/decompress a LOT of
strings you may need to write the code (or look for it) in C/D.

This is one implementation that doesn't use heapq yet:

from operator import itemgetter

def clean_tree(obj):
    # Strip the frequency counts, leaving nested [left, right] lists with
    # the symbols at the leaves.
    if isinstance(obj, tuple):
        return obj[0]
    else:
        return [clean_tree(obj[0][0]), clean_tree(obj[0][1])]

def huffman_generate(freqsl):
    # Repeatedly merge the two least frequent nodes (a full sort stands in
    # for the heap that heapq would provide).
    while len(freqsl) > 2:
        freqsl.sort(key=itemgetter(1))
        freqsl = [[freqsl[0:2], freqsl[0][1] + freqsl[1][1]]] + freqsl[2:]
    return [freqsl, -1]

def tree2dict(tree):
    # Walk the tree, appending "0" for a left branch and "1" for a right one.
    def encode(tree, out=""):
        if isinstance(tree, list):
            encode(tree[0], out + "0")
            encode(tree[1], out + "1")
        else:
            dic[tree[0]] = out
        return dic
    dic = {}
    return encode(tree)

freqs = [('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]
print tree2dict(clean_tree(huffman_generate(freqs)))

If you use Huffman coding you can apply the usual tricks to increase
compression: use 2-char symbols, or even whole words as symbols (but
whole words only work if you are sure your language is English with few
strange words).

If you want more compression, like a 6-byte (2.8 bpc) output, you will
need more complex algorithms. On long English texts the best
compressors need less than 0.9 bpc, but they are surely useless for
you. I don't think you will be able to go down to 6 bytes for that
string, unless you are ready to pay a lot of computational time :)

Bye,
bearophile
 

bearophileHUGS

bearophile:
So you need to store only this 11 byte long string to be able to
decompress it.

Note that there may also be a header, which can contain varying things
like the length of the compressed text, etc.

Bye,
bearophile
 

castironpi

bearophile:


Note that maybe there is a header, that may contain changing things,
like the length of the compressed text, etc.

Bye,
bearophile

I've read that military texts contain different letter frequencies
than standard English. If you use a non-QWERTY keyset, it may change
your frequency distribution also.

I worry that this is an impolite question; as such, I lean to peers
for backup:

Will you be additionally adding further entries to the zipped list?

Will you be rewriting the entire file upon update, or just appending
bytes?

If your sequence is 'ab, ab, ab, cd, ab', you might be at:

00010.

Add 'cd' again and you're at:

000101.

You didn't have to re-output the contents.

But, if you add 'bc', you have:

0001012, which isn't binary any more. So you're left re-encoding at
three bits per symbol:

000 000 000 001 000 001 010

But five more and the byte overflows.

I'd say pickle the corpus, with new additions, and re-zip the entire
contents each time. Would you like to break across
(coughdisksectorscough) multiple files, say, a different corpus-
compression file pair for every thousand entries?
 

inhahe

I don't see anybody mentioning Huffman encoding. I think it just works per
byte, so it's not as tight as gzip or whatever, but it sounds like it would
be easy to implement and wouldn't require any corpus-wide compression
information, except a character frequency count if you wanted to be optimal.

<i don't know anything about compression>
 

Paul Rubin

inhahe said:
I don't see anybody mentioning Huffman encoding. I think it just works per
byte, so it's not as tight as gzip or whatever, but it sounds like it would
be easy to implement and wouldn't require any corpus-wide compression
information, except a character frequency count if you wanted to be optimal.

In principle you could do it over digraphs but I tried that once and
it didn't help much. Basically -because- it doesn't use any
corpus-wide compression information, it doesn't compress anywhere near
as well as LZ, DMC, or whatever.
 
