Java text compression

Discussion in 'Java' started by Chris, Nov 18, 2007.

  1. Chris

    Chris Guest

    What's the fastest way to compress/decompress text?

    We're doing that over really large datasets in our app. We're currently
    converting char arrays to byte arrays using our own UTF-8 conversion
    code, and then compressing the bytes using java.util.zip. The code is
    pretty old.

    I don't like this two-step process, and the profiler shows that this is
    a bottleneck in our app.

    Is anyone aware of any code that compresses chars directly? Perhaps a
    third-party library that does it faster?

    In our particular situation, decompression speed is a lot more important
    than compression speed.
    Chris, Nov 18, 2007
    #1

  2. Chris

    Joshua Cranmer Guest

    Chris wrote:
    > What's the fastest way to compress/decompress text?


    I know of a compression method that can compress in constant time: it
    replaces the entire text with a single `0' (hey, it's lossy!). What
    degree of compression do you need?

    > We're doing that over really large datasets in our app. We're currently
    > converting char arrays to byte arrays using our own UTF-8 conversion
    > code, and then compressing the bytes using java.util.zip. The code is
    > pretty old.
    >
    > I don't like this two-step process, and the profiler shows that this is
    > a bottleneck in our app.


    Questions:
    1. From whence are the char arrays coming?
    2. Where are the byte arrays going?
    3. Which part of the process is the bottleneck? The UTF-8 conversion, or
    the compression?

    > Is anyone aware of any code that compresses chars directly? Perhaps a
    > third-party library that does it faster?
    >
    > In our particular situation, decompression speed is a lot more important
    > than compression speed.



    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Nov 18, 2007
    #2

  3. Chris

    Eric Sosman Guest

    Chris wrote:
    > What's the fastest way to compress/decompress text?


    If you're really interested in "the fastest way" to the
    exclusion of all other concerns, then don't compress at all.
    Bingo! Problem solved!

    You might be happier with a compression scheme that did
    a little better at reducing the size of the data, but now you
    can't get a sensible answer until you describe the trade-offs
    you're willing to make. For example, if you were offered a
    compression scheme that ran ten percent faster than your current
    method but emitted fifteen percent more data, would you take it
    or reject it?

    > We're doing that over really large datasets in our app. We're currently
    > converting char arrays to byte arrays using our own UTF-8 conversion
    > code, and then compressing the bytes using java.util.zip. The code is
    > pretty old.
    >
    > I don't like this two-step process, and the profiler shows that this is
    > a bottleneck in our app.
    >
    > Is anyone aware of any code that compresses chars directly? Perhaps a
    > third-party library that does it faster?


    How badly do you need your own idiosyncratic UTF-8 conversion?
    If you can use standard methods, consider wrapping the compressed
    streams, somewhat like

    Writer w = new OutputStreamWriter(
        new GZIPOutputStream(...));
    w.write("Hello, world!");

    BufferedReader r = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(...)));
    String s = r.readLine();

    You'll have to make your own assessment of the speeds and the
    degree of compression.
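
    For concreteness, a complete round trip in that style might look like
    the sketch below (the file name is invented; everything else is plain
    java.io and java.util.zip):

    import java.io.*;
    import java.util.zip.*;

    public class GzipRoundTrip {
        public static void main(String[] args) throws IOException {
            // Compress: chars -> UTF-8 bytes -> gzip, one stacked stream.
            Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream("data.gz")),
                "UTF-8");
            w.write("Hello, world!\n");
            w.close(); // close() also writes the gzip trailer

            // Decompress: gzip -> UTF-8 bytes -> chars.
            BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("data.gz")),
                "UTF-8"));
            String s = r.readLine();
            r.close();
            System.out.println(s); // Hello, world!
        }
    }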

    > In our particular situation, decompression speed is a lot more important
    > than compression speed.


    "You'll have to make your own assessment ..."

    --
    Eric Sosman
    Eric Sosman, Nov 18, 2007
    #3
  4. Chris

    Roedy Green Guest

    On Sun, 18 Nov 2007 13:44:44 -0600, Chris <>
    wrote, quoted or indirectly quoted someone who said :

    >What's the fastest way to compress/decompress text?


    You can try GZIP. See http://mindprod.com/applet/fileio.html
    for sample code. It is simpler. It might be faster.

    Note that ZIP has a compression level setting where you can trade off
    speed for compression.

    If you have a limited vocabulary, you might try just looking up words
    and converting to 16-bit index numbers.

    Each number is a word + a trailing space.

    see http://mindprod.com/project/supercompressor.html
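
    A toy sketch of the word-index idea (class and method names invented
    here; assumes the vocabulary stays under 65,536 distinct words):

    import java.util.*;

    public class WordIndexCodec {
        private final Map<String, Character> toCode =
            new HashMap<String, Character>();
        private final List<String> toWord = new ArrayList<String>();

        // Each word becomes one 16-bit char; the trailing space is implied.
        public String compress(String text) {
            StringBuilder out = new StringBuilder();
            for (String word : text.split(" ")) {
                Character c = toCode.get(word);
                if (c == null) {
                    c = Character.valueOf((char) toWord.size());
                    toCode.put(word, c);
                    toWord.add(word);
                }
                out.append(c.charValue());
            }
            return out.toString();
        }

        public String decompress(String codes) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < codes.length(); i++) {
                if (i > 0) out.append(' ');
                out.append(toWord.get(codes.charAt(i)));
            }
            return out.toString();
        }
    }
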
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Nov 18, 2007
    #4
  5. Chris

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Sun, 18 Nov 2007 13:44:44 -0600, Chris <>
    >> What's the fastest way to compress/decompress text?


    Not quoted:
    #We're currently converting char arrays to byte arrays using our own
    #UTF-8 conversion code, and then compressing the bytes using
    #java.util.zip.

    > You can try GZIP. See http://mindprod.com/applet/fileio.html
    > for sample code. It is simpler. It might be faster.


    ZIP and GZIP use the same basic algorithm - the difference
    is just in the packaging: ZIP supports multiple files and
    stores file metadata, while GZIP does not.

    Arne
    Arne Vajhøj, Nov 18, 2007
    #5
  6. Chris

    Robert Klemme Guest

    On 18.11.2007 21:16, Eric Sosman wrote:
    > Chris wrote:
    >> What's the fastest way to compress/decompress text?
    >
    > If you're really interested in "the fastest way" to the
    > exclusion of all other concerns, then don't compress at all.
    > Bingo! Problem solved!
    >
    > You might be happier with a compression scheme that did
    > a little better at reducing the size of the data, but now you
    > can't get a sensible answer until you describe the trade-offs
    > you're willing to make. For example, if you were offered a
    > compression scheme that ran ten percent faster than your current
    > method but emitted fifteen percent more data, would you take it
    > or reject it?


    Bonus question for OP: what is the size of data sets and how are they
    used? Especially, where are they stored?

    >> We're doing that over really large datasets in our app. We're
    >> currently converting char arrays to byte arrays using our own UTF-8
    >> conversion code, and then compressing the bytes using java.util.zip.
    >> The code is pretty old.
    >>
    >> I don't like this two-step process, and the profiler shows that this
    >> is a bottleneck in our app.
    >>
    >> Is anyone aware of any code that compresses chars directly? Perhaps a
    >> third-party library that does it faster?
    >
    > How badly do you need your own idiosyncratic UTF-8 conversion?
    > If you can use standard methods, consider wrapping the compressed
    > streams, somewhat like
    >
    > Writer w = new OutputStreamWriter(
    > new GZIPOutputStream(...));


    Minor detail: the encoding is missing, so in this case it should
    rather be

    new OutputStreamWriter(new GZIPOutputStream(...), "UTF-8")

    > w.write("Hello, world!");
    >
    > BufferedReader r = new BufferedReader(
    > new InputStreamReader(
    > new GZIPInputStream(...)));
    > String s = r.readLine();
    >
    > You'll have to make your own assessment of the speeds and the
    > degree of compression.


    This is the solution I was going to suggest as well. If the custom
    encoding yields the same results as Java's built-in UTF-8, I would
    immediately switch to this approach of stacked streams. If the custom
    encoding yields slightly different results, I am sure you can plug the
    custom encoding into the standard Java IO and NIO classes with a
    little effort.

    If data is to be stored in a BLOB via JDBC you can even extend this
    approach to directly stream into the database.
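
    A minimal sketch of that (the "docs" table is hypothetical; "conn",
    "id" and "text" are assumed to be in scope; imports are java.io.*,
    java.sql.* and java.util.zip.*):

    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    Writer w = new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8");
    w.write(text);
    w.close(); // finishes the gzip stream

    PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO docs (id, body) VALUES (?, ?)");
    ps.setInt(1, id);
    ps.setBytes(2, buf.toByteArray());
    ps.executeUpdate();

    (This buffers the compressed bytes in memory; for true streaming you
    would hand setBinaryStream() the reading end of a piped stream
    instead.)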

    >> In our particular situation, decompression speed is a lot more
    >> important than compression speed.
    >
    > "You'll have to make your own assessment ..."


    Decompression is generally much faster than compression. I believe
    there is not much difference in decompression speed between a GZIP
    stream compressed at the highest level and one compressed at the
    lowest. If you dig a bit into compression theory, it's pretty obvious
    that finding a small compressed representation is significantly harder
    than expanding it back.

    Kind regards

    robert
    Robert Klemme, Nov 18, 2007
    #6
  7. Chris

    Chris Guest

    Joshua Cranmer wrote:
    > Chris wrote:
    >> What's the fastest way to compress/decompress text?
    >
    > I know of a compression method that can compress in constant time: it
    > replaces the entire text with a single `0' (hey, it's lossy!). To what
    > degree of compression do you need?


    As with all compression apps, as much as possible. It's hard to pick a
    number until you know the available tradeoffs. We already know the
    tradeoffs with java.util.zip, so obviously we're looking for something
    faster.

    > Questions:
    > 1. From whence are the char arrays coming?


    Memory. They get streamed from an external source, compressed, and
    written to disk as they come in.

    > 2. Where are the byte arrays going?


    Also memory, although ultimately to the UI.

    > 3. Which part of the process is the bottleneck? The UTF-8 conversion, or
    > the compression?


    Both, though the UTF-8 is the bigger burden.
    Chris, Nov 18, 2007
    #7
  8. Chris

    Chris Guest

    Eric Sosman wrote:
    > Chris wrote:
    >> What's the fastest way to compress/decompress text?
    >
    > If you're really interested in "the fastest way" to the
    > exclusion of all other concerns, then don't compress at all.
    > Bingo! Problem solved!


    Um, actually no. When you're dealing with really large datasets it's
    generally faster to compress than not to compress. The reason is that
    compressed data requires less disk I/O.

    Not always, of course; really processor-intensive compression schemes
    can make you processor-bound. But in general, that's the case.

    Yup, we've benchmarked it.


    > You might be happier with a compression scheme that did
    > a little better at reducing the size of the data, but now you
    > can't get a sensible answer until you describe the trade-offs
    > you're willing to make. For example, if you were offered a
    > compression scheme that ran ten percent faster than your current
    > method but emitted fifteen percent more data, would you take it
    > or reject it?


    Yes, but I don't think that knowing that is essential to answering the
    question. The question is really about finding better tradeoffs than we
    can get with char conversion + zip compression.
    Chris, Nov 18, 2007
    #8
  9. Chris

    Chris Guest

    > Bonus question for OP: what is the size of data sets and how are they
    > used? Especially, where are they stored?


    Multi-terabyte sized, split across multiple machines. On a single
    machine, generally not more than a few hundred GB. One or two disks per
    machine, SATA, no RAID.

    At compression time, the data is streamed from an external source,
    transformed in memory, and written to disk.

    At decompression time, the app seeks to the particular block of text of
    interest and decompresses it. Seek time dominates decompression time,
    *except* when we do heavy caching, in which case the decompression
    becomes the bottleneck. Storing the decompressed text in memory takes up
    too much space. Has to be cached in compressed form.
    Chris, Nov 18, 2007
    #9
  10. Chris

    Eric Sosman Guest

    Chris wrote:
    >> Bonus question for OP: what is the size of data sets and how are they
    >> used? Especially, where are they stored?
    >
    > Multi-terabyte sized, split across multiple machines. On a single
    > machine, generally not more than a few hundred Gb. One or two disks per
    > machine, SATA, no RAID.
    >
    > At compression time, the data is streamed from an external source,
    > transformed in memory, and written to disk.
    >
    > At decompression time, the app seeks to the particular block of text of
    > interest and decompresses it. Seek time dominates decompression time,
    > *except* when we do heavy caching, in which case the decompression
    > becomes the bottleneck. Storing the decompressed text in memory takes up
    > too much space. Has to be cached in compressed form.


    Cutting the text into blocks usually makes the compression
    less effective. Most compression schemes nowadays (and I believe
    DEFLATE is among them) are adaptive, meaning that they adjust to
    the characteristics of the data stream as they process it. Thus,
    they compress relatively poorly at first, then improve as they
    learn more about the statistical profile of the data.

    Implications: (1) You'll get better compression if you can
    keep the blocks "fairly long." (1a) You might make the blocks
    "long" by concatenating multiple sub-blocks, at the expense of
    needing to decompress from the start of a block even if you only
    need the sub-block at its end. (2) If the blocks simply must be
    small, you probably shouldn't waste effort on BEST_COMPRESSION.

    Some interesting experiments are in order.
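
    A rough harness for such experiments might look like the following
    (the sample input is invented; real blocks would come from your data):

    import java.util.zip.*;

    public class BlockTrial {
        // Deflated size of one block at the given compression level.
        static int compressedSize(byte[] block, int level) {
            Deflater d = new Deflater(level);
            d.setInput(block);
            d.finish();
            byte[] out = new byte[block.length + 64];
            int total = 0;
            while (!d.finished()) {
                total += d.deflate(out); // only the byte count matters here
            }
            d.end();
            return total;
        }

        public static void main(String[] args) {
            byte[] block =
                "the quick brown fox the quick brown fox".getBytes();
            System.out.println("BEST_SPEED:       "
                + compressedSize(block, Deflater.BEST_SPEED));
            System.out.println("BEST_COMPRESSION: "
                + compressedSize(block, Deflater.BEST_COMPRESSION));
        }
    }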

    --
    Eric Sosman
    Eric Sosman, Nov 18, 2007
    #10
  11. Chris

    Mark Rafn Guest

    Chris <> wrote:
    >What's the fastest way to compress/decompress text?


    As others have said, this is going to vary a lot based on exact requirements.

    >We're doing that over really large datasets in our app. We're currently
    >converting char arrays to byte arrays using our own UTF-8 conversion
    >code, and then compressing the bytes using java.util.zip. The code is
    >pretty old.
    >I don't like this two-step process, and the profiler shows that this is
    >a bottleneck in our app.


    I'd start by benchmarking standard Java encoding and compression against your
    two-step process. This STILL buffers stuff internally, so I wouldn't expect
    huge gains unless your home-grown code is broken somehow.

    Writer w = new OutputStreamWriter(new GZIPOutputStream(real_out), "UTF-8");

    This will at least get you a baseline, and it'll be fairly easy to use other
    DeflaterOutputStream implementations.

    >Is anyone aware of any code that compresses chars directly? Perhaps a
    >third-party library that does it faster?


    Keep in mind that UTF-8 is actually a compression for many datasets. It's
    mildly expensive, but it's also a 50% size reduction from the 16-bit internal
    representation if you don't have many non-ASCII characters.

    It's certainly worth benchmarking UTF-16 encoding on your data. This should
    be much cheaper (AFAIK, all common JVMs use UTF-16 as their internal storage,
    so UTF-16 encoding is a no-op). But it'll give you a lot more bytes to
    compress, which may or may not wash out in the compression mechanism and
    copying from buffer to buffer.
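
    A quick way to compare the two encodings on a sample of your own data
    (the sample string here is invented):

    import java.io.*;
    import java.util.zip.*;

    public class EncodingBench {
        static int deflatedSize(byte[] input) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DeflaterOutputStream z = new DeflaterOutputStream(buf);
            z.write(input);
            z.close(); // flushes and finishes the deflater
            return buf.size();
        }

        public static void main(String[] args) throws IOException {
            String text = "Now is the winter of our discontent "
                + "made glorious summer by this sun of York.";
            byte[] utf8 = text.getBytes("UTF-8");
            byte[] utf16 = text.getBytes("UTF-16LE"); // LE avoids the BOM
            System.out.println("UTF-8 : " + utf8.length + " bytes -> "
                + deflatedSize(utf8));
            System.out.println("UTF-16: " + utf16.length + " bytes -> "
                + deflatedSize(utf16));
        }
    }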
    --
    Mark Rafn <http://www.dagon.net/>
    Mark Rafn, Nov 19, 2007
    #11
  12. Chris

    George Neuner Guest

    On Sun, 18 Nov 2007 18:30:19 -0500, Eric Sosman
    <> wrote:

    >Chris wrote:
    >>> Bonus question for OP: what is the size of data sets and how are they
    >>> used? Especially, where are they stored?
    >>
    >> Multi-terabyte sized, split across multiple machines. On a single
    >> machine, generally not more than a few hundred Gb. One or two disks per
    >> machine, SATA, no RAID.


    This sounds like it might be DNA sequences.

    >> At compression time, the data is streamed from an external source,
    >> transformed in memory, and written to disk.
    >>
    >> At decompression time, the app seeks to the particular block of text of
    >> interest and decompresses it. Seek time dominates decompression time,
    >> *except* when we do heavy caching, in which case the decompression
    >> becomes the bottleneck. Storing the decompressed text in memory takes up
    >> too much space. Has to be cached in compressed form.
    >
    > Cutting the text into blocks usually makes the compression
    >less effective. Most compression schemes nowadays (and I believe
    >DEFLATE is among them) are adaptive, meaning that they adjust to
    >the characteristics of the data stream as they process it. Thus,
    >they compress relatively poorly at first, then improve as they
    >learn more about the statistical profile of the data.


    Most LZ-based implementations (including DEFLATE) limit codes to
    16 bits (I've heard of 32-bit LZ, but I've never seen it).
    Compression studies have shown that, on average, the 16-bit code
    dictionary will be filled after processing 200KB of input.

    If the remainder of the input is characteristically similar to the
    part already encoded, the full dictionary will compress the rest of
    the input pretty well. But most input varies as it goes, sometimes
    rapidly and drastically, so it does make sense to segment the input to
    take advantage of the variation.

    > Implications: (1) You'll get better compression if you can
    >keep the blocks "fairly long." (1a) You might make the blocks
    >"long" by concatenating multiple sub-blocks, at the expense of
    >needing to decompress from the start of a block even if you only
    >need the sub-block at its end. (2) If the blocks simply must be
    >small, you probably shouldn't waste effort on BEST_COMPRESSION.
    >
    > Some interesting experiments are in order.


    Unfortunately, the best segment length depends on the input. For
    general text I would probably use relatively small blocks, 16-64KB
    depending on the input - the more variation in the input, the smaller
    the blocks need to be to take advantage of it.

    If we are talking about DNA sequences, then I would probably go for
    256KB - once the base nucleotides and amino acid sequences are in the
    dictionary (and you can guarantee this by preloading them),
    compression is typically very good (80+%), so it makes sense to not
    worry about it and just pick a convenient sized buffer to work with.

    George
    --
    for email reply remove "/" from address
    George Neuner, Nov 19, 2007
    #12
  13. Chris

    Roedy Green Guest

    Roedy Green, Nov 19, 2007
    #13
  14. Chris

    rossum Guest

    On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
    <gneuner2/@/comcast.net> wrote:

    >This sounds like it might be DNA sequences.

    If it is DNA sequences, then you might be able to use something like
    Base64 for initial coding:

    A -> 00
    C -> 01
    G -> 10
    T -> 11

    AAA -> 000000
    ....
    ACG -> 000110
    ....
    TTT -> 111111

    Those six-bit codes are a straight match for Base64 coding. That will
    compress three bytes "AAC" into one byte "B" (000001 -> B in Base64).
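
    A small sketch of that encoder (names invented; assumes the input
    length is a multiple of three):

    public class DnaPack {
        private static final String BASE64 =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
            + "0123456789+/";

        private static int bits(char base) {
            switch (base) {
                case 'A': return 0;
                case 'C': return 1;
                case 'G': return 2;
                case 'T': return 3;
                default:
                    throw new IllegalArgumentException("bad base: " + base);
            }
        }

        // Three bases -> six bits -> one Base64 character.
        public static String encode(String dna) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i + 2 < dna.length(); i += 3) {
                int six = (bits(dna.charAt(i)) << 4)
                        | (bits(dna.charAt(i + 1)) << 2)
                        | bits(dna.charAt(i + 2));
                out.append(BASE64.charAt(six));
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode("AAC")); // prints "B"
        }
    }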

    rossum
    rossum, Nov 19, 2007
    #14
  15. Chris

    Andrew Thompson Guest

    rossum wrote:
    >>This sounds like it might be DNA sequences.
    >If it is DNA sequences, ..


    Sounds more like a heap of speculation over a 'cagey' problem
    specification.

    To clarify the 'cagey' aspect, perhaps the OP can indeed express
    what the ultimate point of this exercise is, rather than doling out
    'precious little tidbits' that supposedly represent the actual problem.

    (Ultimately, the OP's own posts identify that they are entirely
    unaware of the subtleties of the problems they face, and until
    they fill in more detail, it really constitutes an 'interesting but
    inefficient use of bandwidth' to speculate further.)

    --
    Andrew Thompson
    http://www.physci.org/

    Message posted via JavaKB.com
    http://www.javakb.com/Uwe/Forums.aspx/java-general/200711/1
    Andrew Thompson, Nov 19, 2007
    #15
  16. Chris

    Roedy Green Guest

    On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
    <gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
    who said :

    >>> Multi-terabyte sized, split across multiple machines. On a single
    >>> machine, generally not more than a few hundred Gb. One or two disks per
    >>> machine, SATA, no RAID.
    >
    >This sounds like it might be DNA sequences.


    If they are, convert each letter to 3 bits, then pack them 21 to a
    long, wasting the sign bit.
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Nov 19, 2007
    #16
  17. Chris

    Eric Sosman Guest

    George Neuner wrote:
    > On Sun, 18 Nov 2007 18:30:19 -0500, Eric Sosman
    > <> wrote:
    >
    >> Chris wrote:
    >>>> Bonus question for OP: what is the size of data sets and how are they
    >>>> used? Especially, where are they stored?
    >>> Multi-terabyte sized, split across multiple machines. On a single
    >>> machine, generally not more than a few hundred Gb. One or two disks per
    >>> machine, SATA, no RAID.
    >
    > This sounds like it might be DNA sequences.
    > [...]
    >
    > Most LZ based implementations (including DEFLATE) limit codes to
    > 16-bits (I've heard of 32-bit LZ, but I've never seen it).
    > Compression studies have shown that, on average, the 16-bit code
    > dictionary will be filled after processing 200KB of input.
    >
    > If the remainder of the input is characteristically similar to the
    > part already encoded, the full dictionary will compress the rest of
    > the input pretty well. But most input varies as it goes, sometimes
    > rapidly and drastically, so it does make sense to segment the input to
    > take advantage of the variation.
    > [...]
    > If we are talking about DNA sequences, then I would probably go for
    > 256KB - once the base nucleotides and amino acid sequences are in the
    > dictionary (and you can guarantee this by preloading them),
    > compression is typically very good (80+%), so it makes sense to not
    > worry about it and just pick a convenient sized buffer to work with.


    If you're right, the alphabet is so small that I'd question
    the need for the UTF-8 conversion. DEFLATE will quickly learn
    that every other byte is a zero, and will compress them very well.

    --
    Eric Sosman
    Eric Sosman, Nov 19, 2007
    #17
  18. Chris

    rossum Guest

    On Mon, 19 Nov 2007 11:41:29 GMT, Roedy Green
    <> wrote:

    >On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
    ><gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
    >who said :
    >
    >>>> Multi-terabyte sized, split across multiple machines. On a single
    >>>> machine, generally not more than a few hundred Gb. One or two disks per
    >>>> machine, SATA, no RAID.
    >>
    >>This sounds like it might be DNA sequences.
    >
    >If they are, convert each letter to 3 bits,

    Two bits, there are only four letters.
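
    (At two bits per base, 32 letters fit in one 64-bit long with nothing
    wasted. An illustrative packer, names invented:)

    public class BasePacker {
        // Packs up to 32 bases, two bits each, into a single long.
        public static long pack(String bases) {
            long packed = 0L;
            for (int i = 0; i < bases.length(); i++) {
                int code;
                switch (bases.charAt(i)) {
                    case 'A': code = 0; break;
                    case 'C': code = 1; break;
                    case 'G': code = 2; break;
                    default:  code = 3; break; // 'T'
                }
                packed = (packed << 2) | code;
            }
            return packed;
        }
    }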

    rossum

    >then pack them 21 to a
    >long, wasting the sign bit.
    rossum, Nov 19, 2007
    #18
  19. Chris

    Chris Guest

    Andrew Thompson wrote:
    > rossum wrote:
    >>> This sounds like it might be DNA sequences.
    >> If it is DNA sequences, ..
    >
    > Sounds more like a heap of speculation over a 'cagey' problem
    > specification.
    >
    > To clarify the 'cagey' aspect, perhaps the OP can indeed express
    > what the ulitmate point of this exercise is, rather than doling out
    > 'precious little tid-bits' that supposedly represent the actual problem.
    >
    > (Ultimately, the OP's own posts identify that they are entirely
    > unaware of the subtleties of the problems they face, and until
    > they fill in more detail, it really constitutes an 'interesting but
    > inefficient use of bandwidth' to speculate further.)
    >


    I started to write a long, irritable response to this, but I just don't
    have time. Maybe I'll start a new thread later on twerps whose snotty,
    fatuous responses waste bandwidth.

    I'll say only this:

    1. The problem is fully specified. Text compression is a known problem,
    and I just asked if anyone knew of a Java library that had better
    tradeoffs than UTF-8 + zip.

    2. Text means text, not DNA. Written words.

    3. I'm fully aware of the subtleties of this particular problem space,
    and there is nothing in any of the posts so far that I haven't
    considered and already benchmarked. (Including block sizes. For this
    app, ~10k is optimal).

    4. You don't need to know about the environment or the rest of the app,
    because I'm not asking for a damn consultant.

    5. Asking "how much compression I want" is just stupid.

    In short, a narrow question is an invitation for a narrow answer, not an
    invitation for you to tell me how you think I should write this app.
    Chris, Nov 20, 2007
    #19
  20. Chris

    Eric Sosman Guest

    Chris wrote:
    > [...]
    > 1. The problem is fully specified. Text compression is a known problem,
    > and I just asked if anyone knew of a Java library that had better
    > tradeoffs than UTF-8 + zip.
    >
    > 2. Text means text, not DNA. Written words.


    Elsethread you've explained that the compressed stream
    gets read back into a companion program and decompressed there;
    this suggests that it doesn't need to be exchanged with "foreign"
    programs. In which case, I ask again: Does UTF-8 encoding buy
    you enough additional compression to justify its expense? How
    bad would things be if you just handed 16-bit chars to the
    compressor with no "intelligence" whatsoever?
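
    The "no intelligence" version is only a few lines (a sketch; DEFLATE
    would discover on its own that the high bytes are mostly zero for
    ASCII-heavy text):

    // Split each char into two bytes and let the compressor
    // find the redundancy.
    static byte[] toRawBytes(char[] chars) {
        byte[] bytes = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            bytes[2 * i]     = (byte) (chars[i] >> 8);
            bytes[2 * i + 1] = (byte) chars[i];
        }
        return bytes;
    }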

    > 5. Asking "how much compression I want" is just stupid.


    Well, you asked about compression speed. Other things
    being equal, faster compressors compress less well and "looser"
    compressors compress faster, so the question of "how much" must
    eventually arise when you weigh alternatives.

    --
    Eric Sosman
    Eric Sosman, Nov 20, 2007
    #20
