Java text compression

Chris · Nov 20, 2007

Eric said:
Chris said:

[...]
1. The problem is fully specified. Text compression is a known
problem, and I just asked if anyone knew of a Java library that had
better tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

Click to expand...

Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs. In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense? How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?

I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only
byte arrays, not char arrays. I suppose it might be faster to copy the
chars to 2-byte sequences and compress, rather than run the UTF-8
compressor. An extra step, but worth a try.

Well, you asked about compression speed. Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.

Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".

Andrew Thompson · Nov 20, 2007

Chris said:
[...]

Click to expand...

Click to expand...

...
Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" ...

That might be an appropriate comparison if this were
a help desk. It is not a help desk. (Though you seem
to be treating the responders as though they were your
own personal servants - so perhaps you are confused
about the differences between 'help-desk' and 'usenet').

--
Andrew Thompson
http://www.physci.org/

Message posted via JavaKB.com
http://www.javakb.com/Uwe/Forums.aspx/java-general/200711/1

Roger LindsjÃ¶ · Nov 20, 2007

Chris said:
Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".

But don't you think it is an appropriate question? At least to get in
the ballpark? If you answered that you have regular text and you need
~10X compression, then people can suggest algorithms that has been shown
to give that compression. With no specification you will get suggestions
that are very fast (no compression was suggested) to methods that will
give very high compression rates but run very slowly.

Similar to your analogy, the electronics store where I have been looking
at a new TV has TVs ranging from about $100 to $120000. I doubt that
when someone walks and says "I want a new TV" that the first suggestion
is to look at the very expensive on (I heard a rumor that 1 was sold in
Sweden so far). Size and price are common to narrow down the number of
choices.

//Roger LindsjÃ¶

Eric Sosman · Nov 20, 2007

Chris said:
Eric said:

Chris said:

[...]
1. The problem is fully specified. Text compression is a known
problem, and I just asked if anyone knew of a Java library that had
better tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

Click to expand...

Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs. In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense? How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?

Click to expand...

I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only
byte arrays, not char arrays. I suppose it might be faster to copy the
chars to 2-byte sequences and compress, rather than run the UTF-8
compressor. An extra step, but worth a try.

Might it instead be the removal of a step? You've also
mentioned that the data are "streamed from an external source;"
in what form do they arrive? Unless you're using something
like RMI they probably don't arrive as full-fledged String
objects, but are converted to Strings from some more primitive
form -- like, perhaps, streams of bytes?

Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".

... and to push the analogy perhaps just a little bit too
far, the kindly clerk dumps the 600-page inventory list in your
lap and walks away. That is, a search is more efficient if the
searcher is an active participant instead of a filter-feeder.

Lew · Nov 20, 2007

But don't you think it is an appropriate question? At least to get in
the ballpark? ...

Similar to your analogy, the electronics store where I have been looking
at a new TV has TVs ranging from about $100 to $120000. I doubt that
when someone walks and says "I want a new TV" that the first suggestion
is to look at the very expensive on (I heard a rumor that 1 was sold in
Sweden so far). Size and price are common to narrow down the number of
choices.

Car dealers, clothing stores, flower shops - all routinely ask me "how much do
I want to spend?"

Lasse Reichstein Nielsen · Nov 20, 2007

Lew said:
Car dealers, clothing stores, flower shops - all routinely ask me "how
much do I want to spend?"

.... and this is where you lie!

/L

George Neuner · Nov 20, 2007

2. Text means text, not DNA. Written words.

3. I'm fully aware of the subtleties of this particular problem space,
and there is nothing in any of the posts so far that I haven't
considered and already benchmarked. (Including block sizes. For this
app, ~10k is optimal).

4. You don't need to know about the environment or the rest of the app,
because I'm not asking for a damn consultant.

There's no need to get nasty. Just in general you'll get better
answers when people have a general idea of where you're headed.
Storing gigabytes of "text" is not a typical problem other than for
search engines, data mining, or DNA research.

5. Asking "how much compression I want" is just stupid.

No, this question is FAR from stupid. For all you knew, the
alternatives to "UTF-8 + zip" may have been much worse for your
application.

This forum is open to all and people frequently tackle things that are
way beyond their capabilities. You are now claiming to have the
ability to evaluate suggestions yourself, but we had no way of knowing
that until now - you gave us no clues as to your own capabilities and
precious little information about what you are attempting.

In short, a narrow question is an invitation for a narrow answer, not an
invitation for you to tell me how you think I should write this app.

A narrow question is frequently a sign of a newbie in over his head.
Beyond that, answers to questions frequently create new questions ...
if not from the original poster, then from a reader seeking more
information about a response.

If you're looking for a simple answer, go find yourself a programming
oracle ... on Usenet any halfway interesting subject (of which
compression is one) is likely to invite a debate.

George

Lew · Nov 21, 2007

Lasse said:
.... and this is where you lie!

Which way? Does one claim to be a big spender or a skinflint?

Roedy Green · Nov 21, 2007

Well, you asked about compression speed. Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.

The proper answer would have been
use X if you want fast but not so hot compression
use Y if you want slow but deep compression.

The various answers seemed to be disguising the fact nobody had any
algorithms other that the minor modification of GZIP to offer. It gave
me the impression there were one-upmanship games in play. I can see
why Chris felt so pissed to be jerked around that way.

Roedy Green · Nov 21, 2007

Storing gigabytes of "text" is not a typical problem other than for
search engines, data mining, or DNA research.

The more you know about the text the more options you have for
compression. For example if the text used some limited character set
or some limited word vocabulary specialised compression is possible.

"Text" can vary from instrument readings in ASCII to armoured
encrypted data.

Roedy Green · Nov 21, 2007

Two bits, there are only four letters.

I presume you would have some meta information embedded as well, e.g.
names and locations of genes. You need an extra bit to escape. You
could also handle meta information by encoding lengths in front of
2-bit encoded nucleotide sequences.

Lasse Reichstein Nielsen · Nov 21, 2007

Lew said:
Which way? Does one claim to be a big spender or a skinflint?

OK, I guess it's mainly car dealers where you don't want to tell them
your limit. Car shopping is a negotiation, and it's harder to press
the price if you have already told them what you are willing to
spend. I'd say something at least 20% lower than my real limit.
I.e., neither a big spender or a skinflint, but someone more restricted
in your spending than you actually are.
But then again, I'm not a great negotiatior, so don't necessarily
trust my opinion

Flower shops are ok.

/L

Eric Sosman · Nov 21, 2007

Roedy said:
The proper answer would have been
use X if you want fast but not so hot compression
use Y if you want slow but deep compression.

Quite aside from the point that enumerating the set of
possible compression techniques may be significantly harder than
counting to two, I don't think that's the "proper" answer at all.
The O.P. does not have a compression problem, but a systemic
problem -- even in the original posting he mentioned the custom-
written UTF-8 conversion as part of the bottleneck. The "proper"
answer is to take a step back and get a wider view of the context.
Global optimization, not peephole optimization.

If someone says he's spending a hundred dollars a week putting
motor oil in his car and asks for advice on how to get inexpensive
oil, the proper answer is not to help him shave a penny here and
a penny there but to find out what's wrong with the car!

The various answers seemed to be disguising the fact nobody had any
algorithms other that the minor modification of GZIP to offer. It gave
me the impression there were one-upmanship games in play. I can see
why Chris felt so pissed to be jerked around that way.

I'll admit to directing some sarcasm at Chris that perhaps
an exceptionally thin-skinned person might have seen as ridicule,
but I deny jerking him around.

Lew · Nov 21, 2007

Eric said:
I'll admit to directing some sarcasm at Chris that perhaps
an exceptionally thin-skinned person might have seen as ridicule,
but I deny jerking him around.

There seems to be a nawful lot of thin-skinnedness around lately. Every FAQ
I've read, every essay on how best to use it, points out (almost pridefully)
that Usenet discussions tend toward lurid prose, even (Gods forfend!) an
occasional touch of sarcasm. People were warned (or would have been, if
they'd RTFM).

So people: Get over it. Get on with it. Did the information help? Good.
Did it not? Ask more questions. Did you feel offended by some turn of
rhetoric? Awwwwww.

Robert Klemme · Nov 21, 2007

Um, actually no. When you're dealing with really large datasets it's
generally faster to compress than not to compress. The reason is that
compressed data requires less disk I/O.

Not always, of course; really processor-intensive compression schemes
can make you processor-bound. But in general, that's the case.

Yup, we've benchmarked it.

Yes, but I don't think that knowing that is essential to answering the
question. The question is really about finding better tradeoffs than we
can get with char conversion + zip conversion.

If you have strings already then serialization is one form to convert
them into binary data (which can be compressed with Java's standard
lib). If you can get rid of UTF-8 compression this might be an alternative.

robert

Chris · Nov 21, 2007

Roedy said:
The proper answer would have been
use X if you want fast but not so hot compression
use Y if you want slow but deep compression.

The various answers seemed to be disguising the fact nobody had any
algorithms other that the minor modification of GZIP to offer. It gave
me the impression there were one-upmanship games in play. I can see
why Chris felt so pissed to be jerked around that way.

Thank you, thank you. I could not have said it better.

I've been thinking for some time about posting on this topic. The
problem is that many have a tendency to offer broad programming advice,
rather than deal with the poster's question directly.

The problem with this (aside from the irritation factor) is that it
tends to crowd out substantive advice. When a thread has a number of
answers, many participants tend to skip it, assuming that the question
has been answered.

Whenever a poster asks a "How do I make X faster?" question, the
non-substantive answers come in droves. Usually they take the form "It
depends" (without any discussion of what it depends on), or "Run a
profiler/benchmark it first" (without any knowledge about whether the
poster has already done that), or just "You don't need to make it fast,
just make it work first and optimize later" (which assumes that the
poster is so foolish as to have asked the question without first having
seen slow behavior in his program).

These sorts of responses are condescending, and really very damaging
when they crowd out real answers. I'm sure most of us in the forum have
real work that needs to get done, real deadlines, and our income or jobs
depend on the insights we get in this and other forums.

We really ought to apply a bit of social pressure to get the egotists
and trolls to stuff it.

Eric Sosman · Nov 22, 2007

Chris said:
[...]
We really ought to apply a bit of social pressure to get the egotists
and trolls to stuff it.

Good idea. Let us also apply some pressure to those who
ask strangers for help and then complain when they receive it.

Roedy Green · Nov 22, 2007

Good idea. Let us also apply some pressure to those who
ask strangers for help and then complain when they receive it.

People tend to be much more candid on the net than they would be in
person. That's just the way it is. There is no reason to read a
specific-to-you insult into it.

The late Ken Keyes put it this way:
"You add suffering to the world just as much when you take offense as
when you give offense."

On the other hand, I think several answerers routinely take a rather
sadistic joy is putting down newbies and questioners, playing a sort
of "Mother May I" game to humiliate them into begging in precisely the
correct form, then giving them only a crumb.

This whole compression thing with Chris smells of such a tease because
nobody appeared to even know of a potential algorithm.

I think it quite reasonable to request someone produce an SSCCE to
help diagnose a problem, but it is not OK to chastise someone new to
the group for not already having done so.

Imagine a Monty Pythonesque sketch along these lines:

customer: I would like a beer?

barkeep: would that be a chilled or room temperature?

customer: chilled

barkeep: you stupid sod. Wrong answer. Nobody whose anyone drinks
chilled beer. Where do you think you are, Texas?

customer: ok, room temperature.

barkeep: that's better. What brand?

customer: Rickard's Red?

barkeep: don't have it.

customer: Guinness, surely you have a Guinness ale.

barkeep: nope. Try again.

.... etc. etc.

customer: what would you recommend?

barkeep: It depends. What's your astrological sign?

.... etc. etc.

barkeep: what a silly twit you are! We don't have a liquor licence.
You should have read the fine print on the back of the napkins. We
don't serve any kind of alcohol here, you stupid, naive, newbie,
drunken lout. What do you think this is, a pub? Get out before I call
the Raging Grannies.

(not to be taken too literally)

Andrew Thompson · Nov 22, 2007

Roedy Green wrote:
...

I think it quite reasonable to request someone produce an SSCCE to
help diagnose a problem, but it is not OK to chastise someone new to
the group for not already having done so.

As someone that has a Google alert for the term 'SSCCE',
I can assure you - that does not happen.

Andrew Thompson · Nov 22, 2007

Andrew said:
...

As someone that has a Google alert for the term 'SSCCE',
I can assure you - that does not happen.

Or at least, it does not happen on *usenet*, and if I
see any so much as questionable references to
produce an SSCCE, I will usually reply and say so.

I *have* seen some rather odd references to SSCCEs on
*Sun* forums, but I am not in a position to either get regular
updates to the term, nor respond through those (painfully
slow) forums.

I do not moniter any other 'web based' forums and am
in no position to comment on them.

Java sound and compression	5	Aug 12, 2009
Help with Huffman Compression	0	Jul 18, 2010
inaccuracies in compression algorithm	18	Jul 23, 2009
Decompressing LZW compression from PDF file	2	Mar 22, 2010
Cannot use Extensions for Compression?	0	Apr 28, 2006
Enabling HTTP 1.1 compression	0	Apr 7, 2005
Web Service Compression when ASP.NET is Consuming	1	Oct 18, 2006
Query about compression and decompression in Java using java.util.zip package	4	Nov 23, 2004

Java text compression

Chris

Andrew Thompson

Roger LindsjÃ¶

Eric Sosman

Lew

Lasse Reichstein Nielsen

George Neuner

Lew

Roedy Green

Roedy Green

Roedy Green

Lasse Reichstein Nielsen

Eric Sosman

Lew

Robert Klemme

Chris

Eric Sosman

Roedy Green

Andrew Thompson

Andrew Thompson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads