Java text compression

C

Chris

What's the fastest way to compress/decompress text?

We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.

I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?

In our particular situation, decompression speed is a lot more important
than compression speed.
 
J

Joshua Cranmer

Chris said:
What's the fastest way to compress/decompress text?

I know of a compression method that can compress in constant time: it
replaces the entire text with a single `0' (hey, it's lossy!). What
degree of compression do you need?
We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.

I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

Questions:
1. From whence are the char arrays coming?
2. Where are the byte arrays going?
3. Which part of the process is the bottleneck? The UTF-8 conversion, or
the compression?
 
E

Eric Sosman

Chris said:
What's the fastest way to compress/decompress text?

If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo! Problem solved!

You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make. For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?
We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.

I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?

How badly do you need your own idiosyncratic UTF-8 conversion?
If you can use standard methods, consider wrapping the compressed
streams, somewhat like

Writer w = new OutputStreamWriter(
        new GZIPOutputStream(...));
w.write("Hello, world!");

BufferedReader r = new BufferedReader(
        new InputStreamReader(
                new GZIPInputStream(...)));
String s = r.readLine();

You'll have to make your own assessment of the speeds and the
degree of compression.
In our particular situation, decompression speed is a lot more important
than compression speed.

"You'll have to make your own assessment ..."
 
A

Arne Vajhøj

Roedy Green said:
#We're currently converting char arrays to byte arrays using our own
#UTF-8 conversion code, and then compressing the bytes using
#java.util.zip.
You can try GZIP. See http://mindprod.com/applet/fileio.html
for sample code. It is simpler. It might be faster.

ZIP and GZIP use the same basic algorithm - the difference
is just in the packaging: ZIP supports multiple files and
stores file metadata, while GZIP does not.
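
For illustration, a rough sketch of that packaging difference using only
java.util.zip (the entry name is arbitrary):

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipVsGzip {
    public static void main(String[] args) throws Exception {
        byte[] data = "some text to compress".getBytes("UTF-8");

        // GZIP: a single DEFLATE stream with a small header and trailer,
        // no file names or other archive metadata.
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        GZIPOutputStream gzOut = new GZIPOutputStream(gz);
        gzOut.write(data);
        gzOut.close();

        // ZIP: the same DEFLATE algorithm, wrapped in an archive format
        // with named entries and per-file metadata.
        ByteArrayOutputStream zip = new ByteArrayOutputStream();
        ZipOutputStream zipOut = new ZipOutputStream(zip);
        zipOut.putNextEntry(new ZipEntry("block.txt"));
        zipOut.write(data);
        zipOut.closeEntry();
        zipOut.close();

        System.out.println("gzip: " + gz.size() + " bytes, zip: " + zip.size() + " bytes");
    }
}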

Arne
 
R

Robert Klemme

If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo! Problem solved!

You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make. For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?

Bonus question for OP: what is the size of data sets and how are they
used? Especially, where are they stored?
How badly do you need your own idiosyncratic UTF-8 conversion?
If you can use standard methods, consider wrapping the compressed
streams, somewhat like

Writer w = new OutputStreamWriter(
new GZIPOutputStream(...));

Minor detail: The encoding is missing, so in this case it would rather be

new OutputStreamWriter(new GZIPOutputStream(...), "UTF-8")
w.write("Hello, world!");

BufferedReader r = new BufferedReader(
new InputStreamReader(
new GZIPInputStream(...)));
String s = r.readLine();

You'll have to make your own assessment of the speeds and the
degree of compression.

This is the solution I was going to suggest as well. If the custom
encoding yields the same results as Java's built in UTF-8 then I would
immediately switch to this approach of stacked streams. If the custom
encoding yields slightly different results I am sure you can plug in the
custom encoding into the standard Java io and nio classes with a little
effort.

If data is to be stored in a BLOB via JDBC you can even extend this
approach to directly stream into the database.
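
For example (only a sketch - the table and column names are made up, and
Connection.createBlob() needs a JDBC 4 driver):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.zip.GZIPOutputStream;

public class BlobWriteSketch {
    static void store(Connection con, long id, String text) throws Exception {
        // Stack the UTF-8 + gzip writer directly on the BLOB's output stream.
        Blob blob = con.createBlob();
        Writer w = new OutputStreamWriter(
                new GZIPOutputStream(blob.setBinaryStream(1)), "UTF-8");
        w.write(text);
        w.close(); // flushes the writer and finishes the gzip stream

        PreparedStatement ps =
                con.prepareStatement("INSERT INTO documents (id, body) VALUES (?, ?)");
        ps.setLong(1, id);
        ps.setBlob(2, blob);
        ps.executeUpdate();
        ps.close();
        blob.free();
    }
}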
"You'll have to make your own assessment ..."

Decompression is generally much faster than compression. I believe there is
not much difference in decompression speed between a GZIP stream compressed
at the highest level and one compressed at the lowest. If you dig a bit into
compression theory it becomes pretty obvious that finding a small compressed
representation is significantly harder than expanding a compressed data set
back again.
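
That is easy to check with something like the rough sketch below (the test
data is synthetic, and the timing needs warm-up runs and your own text to
mean anything):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class InflateTiming {
    public static void main(String[] args) throws Exception {
        // Stand-in data; replace with a representative block of your text.
        byte[] input = new byte[1 << 20];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) "the quick brown fox jumps over the lazy dog "
                    .charAt(i % 44);
        }

        int[] levels = { Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION };
        for (int level : levels) {
            byte[] compressed = deflate(input, level);
            long t0 = System.nanoTime();
            byte[] restored = inflate(compressed, input.length);
            long t1 = System.nanoTime();
            System.out.println("level " + level + ": " + compressed.length
                    + " bytes, inflate " + ((t1 - t0) / 1000000.0) + " ms, "
                    + restored.length + " bytes restored");
        }
    }

    static byte[] deflate(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] in, int originalLength) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(in);
        byte[] out = new byte[originalLength];
        int off = 0;
        while (!inf.finished()) {
            off += inf.inflate(out, off, out.length - off);
        }
        inf.end();
        return out;
    }
}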

Kind regards

robert
 
C

Chris

Joshua said:
I know of a compression method that can compress in constant time: it
replaces the entire text with a single `0' (hey, it's lossy!). What
degree of compression do you need?

As with all compression apps, as much as possible. It's hard to pick a
number until you know the available tradeoffs. We already know the
tradeoffs with java.util.zip, so obviously we're looking for something
faster.
Questions:
1. From whence are the char arrays coming?

Memory. They get streamed from an external source, compressed, and
written to disk as they come in.
2. Where are the byte arrays going?

Also memory, although ultimately to the UI.
3. Which part of the process is the bottleneck? The UTF-8 conversion, or
the compression?

Both, though the UTF-8 is the bigger burden.
 
C

Chris

Eric said:
If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo! Problem solved!

Um, actually no. When you're dealing with really large datasets it's
generally faster to compress than not to compress. The reason is that
compressed data requires less disk I/O.

Not always, of course; really processor-intensive compression schemes
can make you processor-bound. But in general, that's the case.

Yup, we've benchmarked it.

You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make. For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?

Yes, but I don't think that knowing that is essential to answering the
question. The question is really about finding better tradeoffs than we
can get with char conversion + zip compression.
 
C

Chris

Bonus question for OP: what is the size of data sets and how are they
used? Especially, where are they stored?

Multi-terabyte sized, split across multiple machines. On a single
machine, generally not more than a few hundred GB. One or two disks per
machine, SATA, no RAID.

At compression time, the data is streamed from an external source,
transformed in memory, and written to disk.

At decompression time, the app seeks to the particular block of text of
interest and decompresses it. Seek time dominates decompression time,
*except* when we do heavy caching, in which case the decompression
becomes the bottleneck. Storing the decompressed text in memory takes up
too much space. It has to be cached in compressed form.
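
The access pattern amounts to something like the sketch below - each block
compressed as an independent DEFLATE stream, with an in-memory index of
offsets and lengths (the names are made up):

import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockStore {
    static class BlockRef {
        final long offset;
        final int compressedLength;
        final int uncompressedLength;
        BlockRef(long offset, int compressedLength, int uncompressedLength) {
            this.offset = offset;
            this.compressedLength = compressedLength;
            this.uncompressedLength = uncompressedLength;
        }
    }

    // Compress each text block independently and remember where it landed.
    static List<BlockRef> writeBlocks(RandomAccessFile file, List<String> blocks)
            throws Exception {
        List<BlockRef> index = new ArrayList<BlockRef>();
        byte[] buf = new byte[8192];
        for (String block : blocks) {
            byte[] raw = block.getBytes("UTF-8");
            Deflater d = new Deflater(Deflater.BEST_SPEED);
            d.setInput(raw);
            d.finish();
            long start = file.getFilePointer();
            int written = 0;
            while (!d.finished()) {
                int n = d.deflate(buf);
                file.write(buf, 0, n);
                written += n;
            }
            d.end();
            index.add(new BlockRef(start, written, raw.length));
        }
        return index;
    }

    // Seek straight to one block and inflate only that block.
    static String readBlock(RandomAccessFile file, BlockRef ref) throws Exception {
        byte[] compressed = new byte[ref.compressedLength];
        file.seek(ref.offset);
        file.readFully(compressed);
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[ref.uncompressedLength];
        int off = 0;
        while (!inf.finished()) {
            off += inf.inflate(out, off, out.length - off);
        }
        inf.end();
        return new String(out, "UTF-8");
    }
}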
 
E

Eric Sosman

Chris said:
Multi-terabyte sized, split across multiple machines. On a single
machine, generally not more than a few hundred GB. One or two disks per
machine, SATA, no RAID.

At compression time, the data is streamed from an external source,
transformed in memory, and written to disk.

At decompression time, the app seeks to the particular block of text of
interest and decompresses it. Seek time dominates decompression time,
*except* when we do heavy caching, in which case the decompression
becomes the bottleneck. Storing the decompressed text in memory takes up
too much space. It has to be cached in compressed form.

Cutting the text into blocks usually makes the compression
less effective. Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it. Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.

Implications: (1) You'll get better compression if you can
keep the blocks "fairly long." (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end. (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.

Some interesting experiments are in order.
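
One such experiment is cheap to run: compress the same text cut into blocks
of several sizes and compare the totals. A rough sketch (the block sizes and
the file argument are arbitrary):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockSizeExperiment {
    public static void main(String[] args) throws Exception {
        byte[] text = readAll(new FileInputStream(args[0]));
        int[] blockSizes = { 4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024 };
        for (int blockSize : blockSizes) {
            long total = 0;
            for (int off = 0; off < text.length; off += blockSize) {
                int len = Math.min(blockSize, text.length - off);
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                GZIPOutputStream gz = new GZIPOutputStream(bos);
                gz.write(text, off, len);
                gz.close();
                total += bos.size();
            }
            System.out.println(blockSize + "-byte blocks -> " + total + " compressed bytes");
        }
    }

    static byte[] readAll(FileInputStream in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        in.close();
        return bos.toByteArray();
    }
}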
 
M

Mark Rafn

Chris said:
What's the fastest way to compress/decompress text?

As others have said, this is going to vary a lot based on exact requirements.
We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.
I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

I'd start by benchmarking standard Java encoding and compression against your
two-step process. This STILL buffers stuff internally, so I wouldn't expect
huge gains unless your home-grown code is broken somehow.

Writer w = new OutputStreamWriter(new GZIPOutputStream(real_out), "UTF-8");

This will at least get you a baseline, and it'll be fairly easy to use other
DeflaterOutputStream implementations.
Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?

Keep in mind that UTF-8 is actually a form of compression for many datasets.
It's mildly expensive, but it's also a 50% size reduction from the 16-bit
internal representation if you don't have many non-ASCII characters.

It's certainly worth benchmarking UTF-16 encoding on your data. This should
be much cheaper (AFAIK, all common JVMs use UTF-16 as their internal storage,
so UTF-16 encoding is a no-op). But it'll give you a lot more bytes to
compress, which may or may not wash out in the compression mechanism and
copying from buffer to buffer.
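
A quick way to get those numbers, assuming the data fits in a String for the
test (the file argument is arbitrary, and the timing needs warm-up to mean
much):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.GZIPOutputStream;

public class EncodingComparison {
    public static void main(String[] args) throws Exception {
        String text = readAll(new InputStreamReader(
                new FileInputStream(args[0]), "UTF-8"));
        String[] encodings = { "UTF-8", "UTF-16LE" };
        for (String enc : encodings) {
            long t0 = System.nanoTime();
            byte[] encoded = text.getBytes(enc);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(encoded);
            gz.close();
            long t1 = System.nanoTime();
            System.out.println(enc + ": " + encoded.length + " encoded bytes, "
                    + bos.size() + " compressed bytes, "
                    + ((t1 - t0) / 1000000.0) + " ms");
        }
    }

    static String readAll(Reader in) throws Exception {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        in.close();
        return sb.toString();
    }
}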
 
G

George Neuner

This sounds like it might be DNA sequences.
Cutting the text into blocks usually makes the compression
less effective. Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it. Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.

Most LZ based implementations (including DEFLATE) limit codes to
16-bits (I've heard of 32-bit LZ, but I've never seen it).
Compression studies have shown that, on average, the 16-bit code
dictionary will be filled after processing 200KB of input.

If the remainder of the input is characteristically similar to the
part already encoded, the full dictionary will compress the rest of
the input pretty well. But most input varies as it goes, sometimes
rapidly and drastically, so it does make sense to segment the input to
take advantage of the variation.
Implications: (1) You'll get better compression if you can
keep the blocks "fairly long." (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end. (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.

Some interesting experiments are in order.

Unfortunately, the best segment length depends on the input. For
general text I would probably use relatively small blocks, 16-64KB
depending on the input - the more variation in the input, the smaller
the blocks need to be to take advantage of it.

If we are talking about DNA sequences, then I would probably go for
256KB - once the base nucleotides and amino acid sequences are in the
dictionary (and you can guarantee this by preloading them),
compression is typically very good (80+%), so it makes sense to not
worry about it and just pick a convenient sized buffer to work with.
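
The preloading can be done with java.util.zip's preset-dictionary support.
A sketch (the dictionary contents below are only placeholders for whatever
is actually frequent in the data):

import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionary {
    public static void main(String[] args) throws Exception {
        // Strings expected to be common in the data; placeholders here.
        byte[] dictionary = "GATTACA TTGACA TATAAT ".getBytes("US-ASCII");
        byte[] input = "TATAAT GATTACA GATTACA TTGACA".getBytes("US-ASCII");

        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setDictionary(dictionary);       // preload before compressing anything
        d.setInput(input);
        d.finish();
        byte[] compressed = new byte[input.length + 64];
        int n = d.deflate(compressed);
        d.end();

        Inflater inf = new Inflater();
        inf.setInput(compressed, 0, n);
        byte[] out = new byte[input.length];
        int produced = inf.inflate(out);
        if (produced == 0 && inf.needsDictionary()) {
            inf.setDictionary(dictionary); // the inflater asks for the same dictionary
            inf.inflate(out);
        }
        inf.end();
        System.out.println(n + " compressed bytes -> " + new String(out, "US-ASCII"));
    }
}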

George
 
R

rossum

This sounds like it might be DNA sequences.
If it is DNA sequences, then you might be able to use something like
Base64 for initial coding:

A -> 00
C -> 01
G -> 10
T -> 11

AAA -> 000000
....
ACG -> 000110
....
TTT -> 111111

Those six-bit codes are a straight match for Base64 coding. That will
compress three bytes ("AAC") into one character ("B"): 000001 -> B in Base64.
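
A rough sketch of that packing (it assumes the input length is a multiple of
three and contains only A, C, G and T; padding and error handling are left
out):

public class DnaPack {
    private static final String BASE64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Two bits per base: A=00, C=01, G=10, T=11.
    static int bits(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a base: " + base);
        }
    }

    // Three bases -> six bits -> one Base64 character.
    static String pack(String dna) {
        StringBuilder out = new StringBuilder(dna.length() / 3);
        for (int i = 0; i + 2 < dna.length(); i += 3) {
            int sixBits = (bits(dna.charAt(i)) << 4)
                        | (bits(dna.charAt(i + 1)) << 2)
                        |  bits(dna.charAt(i + 2));
            out.append(BASE64.charAt(sixBits));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(pack("AAC"));   // 000001 -> "B", as above
    }
}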

rossum
 
A

Andrew Thompson

rossum said:
If it is DNA sequences, ..

Sounds more like a heap of speculation over a 'cagey' problem
specification.

To clarify the 'cagey' aspect, perhaps the OP can indeed express
what the ultimate point of this exercise is, rather than doling out
'precious little tid-bits' that supposedly represent the actual problem.

(Ultimately, the OP's own posts identify that they are entirely
unaware of the subtleties of the problems they face, and until
they fill in more detail, it really constitutes an 'interesting but
inefficient use of bandwidth' to speculate further.)

--
Andrew Thompson
http://www.physci.org/

 
E

Eric Sosman

George said:
This sounds like it might be DNA sequences.
[...]

Most LZ based implementations (including DEFLATE) limit codes to
16-bits (I've heard of 32-bit LZ, but I've never seen it).
Compression studies have shown that, on average, the 16-bit code
dictionary will be filled after processing 200KB of input.

If the remainder of the input is characteristically similar to the
part already encoded, the full dictionary will compress the rest of
the input pretty well. But most input varies as it goes, sometimes
rapidly and drastically, so it does make sense to segment the input to
take advantage of the variation.
[...]
If we are talking about DNA sequences, then I would probably go for
256KB - once the base nucleotides and amino acid sequences are in the
dictionary (and you can guarantee this by preloading them),
compression is typically very good (80+%), so it makes sense to not
worry about it and just pick a convenient sized buffer to work with.

If you're right, the alphabet is so small that I'd question
the need for the UTF-8 conversion. DEFLATE will quickly learn
that every other byte is a zero, and will compress them very well.
 
C

Chris

Andrew said:
Sounds more like a heap of speculation over a 'cagey' problem
specification.

To clarify the 'cagey' aspect, perhaps the OP can indeed express
what the ultimate point of this exercise is, rather than doling out
'precious little tid-bits' that supposedly represent the actual problem.

(Ultimately, the OP's own posts identify that they are entirely
unaware of the subtleties of the problems they face, and until
they fill in more detail, it really constitutes an 'interesting but
inefficient use of bandwidth' to speculate further.)

I started to write a long, irritable response to this, but I just don't
have time. Maybe I'll start a new thread later on twerps whose snotty,
fatuous responses waste bandwidth.

I'll say only this:

1. The problem is fully specified. Text compression is a known problem,
and I just asked if anyone knew of a Java library that had better
tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

3. I'm fully aware of the subtleties of this particular problem space,
and there is nothing in any of the posts so far that I haven't
considered and already benchmarked. (Including block sizes. For this
app, ~10k is optimal).

4. You don't need to know about the environment or the rest of the app,
because I'm not asking for a damn consultant.

5. Asking "how much compression I want" is just stupid.

In short, a narrow question is an invitation for a narrow answer, not an
invitation for you to tell me how you think I should write this app.
 
E

Eric Sosman

Chris said:
[...]
1. The problem is fully specified. Text compression is a known problem,
and I just asked if anyone knew of a Java library that had better
tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs. In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense? How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?
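
Concretely, something like the sketch below - UTF-16LE is essentially the
in-memory char layout, so the "encoding" step is little more than a copy
(the method name is arbitrary):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class RawCharCompress {
    static byte[] compress(char[] text) throws Exception {
        // UTF-16LE matches the 16-bit chars as they sit in memory, so this is
        // effectively a straight copy rather than a real conversion.
        ByteBuffer encoded = Charset.forName("UTF-16LE").encode(CharBuffer.wrap(text));

        Deflater def = new Deflater(Deflater.BEST_SPEED);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream out = new DeflaterOutputStream(bos, def);
        out.write(encoded.array(), encoded.arrayOffset(), encoded.remaining());
        out.close();   // finishes the DEFLATE stream
        def.end();
        return bos.toByteArray();
    }
}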
5. Asking "how much compression I want" is just stupid.

Well, you asked about compression speed. Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.
 
