compression API available in Java & C++?

  • Thread starter Monique Y. Mudama

Monique Y. Mudama

Apologies if this is slightly off-topic ...

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

We're not using any form of compression right now, but the belief is
that gzip works in blocks, so small chunks of data don't get much
benefit. (Anyone disagree?)
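For what it's worth, the quick sanity check I have in mind looks roughly
like this (the 12-byte sample is made up); since gzip's header and trailer
alone are 18 bytes, I'd expect the output to be bigger than the input:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class TinyChunkGzipTest {
    public static void main(String[] args) throws Exception {
        byte[] chunk = "ABC123,XY;42".getBytes("US-ASCII"); // made-up 12-byte sample
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(chunk);
        gz.close(); // flushes the deflate data and writes the gzip trailer
        System.out.println("original: " + chunk.length + " bytes");
        System.out.println("gzipped:  " + bos.size() + " bytes");
    }
}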
 

Chris Smith

Monique Y. Mudama said:
I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

Compressing 12 bytes? That's a tough one. The concern is that the
overhead added by any compression algorithm for something like the
dictionary is going to be more than 12 bytes in length. I'd be
considering the following options:

1. Drop it. There may be very little that you can do.

2. Run length encoding. If you expect this to be a reliable hit, then a
simple RFE scheme could be devised that only adds a byte for the
uncompressible case and helps a little with the compressible case.

3. Out-of-stream dictionary. If you can devise a dictionary of pre-
known patterns that are likely to occur frequently in the data, then you
could pre-build that dictionary and devise a compression scheme that
assumes them.

None of the three cases involve using compression libraries, though.
You'd have to build this stuff yourself.
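To illustrate option 2, a minimal sketch in Java (untested, and assuming
smallish chunks like your 12 bytes) could look like this:

import java.io.ByteArrayOutputStream;

public class ChunkRle {
    // Encode a chunk either raw (flag 0) or run-length encoded (flag 1),
    // whichever is shorter, so the worst case costs exactly one extra byte.
    static byte[] encode(byte[] chunk) {
        ByteArrayOutputStream rle = new ByteArrayOutputStream();
        rle.write(1);                          // flag: (length, byte) pairs follow
        for (int i = 0; i < chunk.length; ) {
            int run = 1;
            while (i + run < chunk.length && chunk[i + run] == chunk[i] && run < 255) {
                run++;
            }
            rle.write(run);                    // run length, 1..255
            rle.write(chunk[i]);               // the repeated byte
            i += run;
        }
        if (rle.size() >= chunk.length + 1) {  // RLE didn't pay off: send raw
            byte[] raw = new byte[chunk.length + 1];
            raw[0] = 0;                        // flag: raw bytes follow
            System.arraycopy(chunk, 0, raw, 1, chunk.length);
            return raw;
        }
        return rle.toByteArray();
    }
}

The decoder just checks the flag byte and either copies the raw bytes or
expands the (length, byte) pairs.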

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Monique Y. Mudama

Thanks for the response, Chris!
Compressing 12 bytes? That's a tough one. The concern is that the
overhead added by any compression algorithm for something like the
dictionary is going to be more than 12 bytes in length.

Exactly right.
I'd be considering the following options:

1. Drop it. There may be very little that you can do.

Yeah, I'm thinking it may be more trouble than it's worth .. but if it
*is* possible, it would be a help.
2. Run length encoding. If you expect this to be a reliable hit, then a
simple RFE scheme could be devised that only adds a byte for the
uncompressible case and helps a little with the compressible case.

Was RFE a typo for RLE? I hadn't heard of RLE; I will google around
and see what comes up.
3. Out-of-stream dictionary. If you can devise a dictionary of pre-
known patterns that are likely to occur frequently in the data, then
you could pre-build that dictionary and devise a compression scheme
that assumes them.

I would have to think about this. I'm pretty sure the data is only
ASCII characters, and of those only the common ones that show up on a
US keyboard; is that predictable enough for a dictionary approach?
None of the three cases involve using compression libraries, though.
You'd have to build this stuff yourself.

It may be worth it if the compression is good enough. I haven't yet
figured out what "good enough" would mean for this, though.

Thank you for all your suggestions.
 

Monique Y. Mudama

Was RFE a typo for RLE? I hadn't heard of RLE; I will google around
and see what comes up.


I would have to think about this. I'm pretty sure the data is only
ASCII characters, and of those only the common ones that show up on
a US keyboard; is that predictable enough for a dictionary approach?

The more I think about it, the more I think RLE + dictionary may be
right for us. For some of the data we're streaming, up to half of the
chunk of data is a single character repeated. And we could possibly
use a few bits to describe each character we use. I'm not sure the
character patterns are predictable enough to describe multiple
characters at a time, though.
 

Bjorn Abelli

On 2005-12-07, Chris Smith penned:

I would have to think about this. I'm pretty sure the data is
only ASCII characters, and of those only the common ones that
show up on a US keyboard; is that predictable enough for a
dictionary approach?

If the characters are somewhat evenly distributed, I don't think you would
gain much by what I guess Chris meant, as it would mean that the number
of "patterns" would be just about the number of keys on the keyboard, hence
almost as many as can fit into a byte itself...

Although, there could be some more "compression" made if you're *really*
sure that it's only ASCII.

Then you can simply drop the unused high bit (making each character 7 bits)
and pack them into 11 bytes instead of 12. But then you've only gained one
byte.

If you can investigate what the data really is comprised of, you could
probably make an even better chart to map a kind of dictionary.

Let's say that the data sent is only alphanumeric characters (0-9, a-z and
A-Z), which "could" be the case; then you can make your own schema for
mapping each character to 6 bits instead of 8. After packing those 12
characters in that schema, you would end up sending 9 bytes instead of
12.

There would actually be room to map an additional 2 characters... ;-)
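Something along these lines would do the packing (just a sketch; I'm
assuming exactly 12 characters per chunk and the 62-character alphabet
above):

public class SixBitPacker {
    // 62 symbols fit in 6 bits each; 12 of them pack into exactly 9 bytes.
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static byte[] pack(String chunk) {                    // assumes chunk.length() == 12
        byte[] out = new byte[9];
        int bits = 0, nbits = 0, o = 0;
        for (int i = 0; i < chunk.length(); i++) {
            int code = ALPHABET.indexOf(chunk.charAt(i)); // 0..61; -1 means "not in the alphabet"
            bits = (bits << 6) | code;
            nbits += 6;
            while (nbits >= 8) {
                nbits -= 8;
                out[o++] = (byte) (bits >>> nbits);
                bits &= (1 << nbits) - 1;                 // keep only the still-pending bits
            }
        }
        return out;
    }
}

Unpacking is the mirror image: pull 6 bits at a time and index back into
the same string.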

// Bjorn A
 

Mike Schilling

Monique Y. Mudama said:
Apologies if this is slightly off-topic ...

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Consider the communication overhead of the protocol you're using. You might
accomplish more by packing multiple chunks into a message than by compressing
the chunks.
 

Roedy Green

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

We're not using any form of compression right now, but the belief is
that gzip works in blocks, so small chunks of data don't get much
benefit. (Anyone disagree?)

The key to this is getting a big sample to compress, studying it, and
making sure that both ends have access to the same summary info. You
won't do much with totally self-contained messages.

What do you know about the data? Every additional fact can be
exploited in designing custom compression.

In the ideal case, your 12 bytes are one of 100 phrases (commands) and
you can replace them with a 1-byte int indexing the message.
 

Monique Y. Mudama

The key to this is getting a big sample to compress, studying it, and
making sure that both ends have access to the same summary info. You
won't do much with totally self-contained messages.

What do you know about the data? Every additional fact can be
exploited in designing custom compression.

I think I need to spend more time considering exactly this question.
There are actually several "types" of data that can be coming from
this source. It may be that to compress effectively, we would have to
use a different compression technique for each "type."
In the ideal case, your 12 bytes are one of 100 phrases (commands)
and you can replace them with a 1-byte int indexing the message.

Sadly, no. It's data from an external source, and ...

I recognize that I'm not being very helpful, because I don't want to
reveal the full nature of the data I'm dealing with. I also recognize
that this means that people have to suggest general approaches rather
than being able to tell me exactly what to do, and I apologize for the
frustration that may cause.

Anyway, any amount of optimizing of the kind you describe has already
been done. I'm looking at ways of compressing data that's already
stripped down to codes, etc. But we're using ASCII representations,
which could possibly be represented more efficiently in bits. I
talked about it with my team lead last night, and we came to the
conclusion we'd probably only squeeze it down to 6 bits instead of 8,
which probably isn't enough gain.

We have a new team member, and it looks like his first project will be
to play with this data and see if he can come up with a useful
compression scheme. So I'm still thinking about it, but I'm not on
the hook, so to speak.
 

Monique Y. Mudama

Consider the communication overhead of the protocol you're using.
You might accomplish more by packing multiple chunks into a message
than by compressing the chunks.

The data is live; each chunk needs to be transmitted as soon as the
server can do so. Sorry if I forgot to mention a streaming need in
my description.
 

Monique Y. Mudama

...

If the characters are somewhat evenly distributed, I don't think you
would gain much by what I guess Chris meant, as it would mean
that the number of "patterns" would be just about the number of keys
on the keyboard, hence almost as many as can fit into a byte
itself...

Yeah, that's kind of what I came to realize earlier tonight ...
Although, there could be some more "compression" made if you're
*really* sure that it's only ASCII.

Then you can simply drop the unused high bit (making each character
7 bits) and pack them into 11 bytes instead of 12. But then you've
only gained one byte.

If you can investigate what the data really is comprised of, you
could probably make an even better chart to map a kind of
dictionary.

Let's say that the data sent is only alphanumeric characters (0-9,
a-z and A-Z), which "could" be the case; then you can make your own
schema for mapping each character to 6 bits instead of 8. After
packing those 12 characters in that schema, you would end up
sending 9 bytes instead of 12.

There would actually be room to map an additional 2 characters...
;-)

We also have some punctuation to deal with, so we'd probably end up
needing the full 6 bits, or maybe even all the way to seven. Bleh!

Thanks to you and everyone else who has made some suggestions. I am
going to think about it. It may be that our data just isn't
well-suited to compression; maybe I can come up with ways to
reorganize the data for compression, but I don't think that would be
well received because we already have apps coded to the existing
formats.
 

Knute Johnson

Monique said:
Apologies if this is slightly off-topic ...

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

We're not using any form of compression right now, but the belief is
that gzip works in blocks, so small chunks of data don't get much
benefit. (Anyone disagree?)

How much data are you actually sending at a time? If it is only 12
bytes and you aren't sending them fast enough to get a lot of them into
a TCP packet, then compression will gain you nothing. If you are, then
GZIP ought to work great.
 

Mike Schilling

Monique Y. Mudama said:
The data is live; each chunk needs to be transmitted as soon as the
server can do so. Sorry if I forgot to mention a streaming need in
my description.

What's the per-packet overhead, then? If it's much more than 12 bytes,
you're not going to get much out of compressing the data.
 

Hiran Chaudhuri

Knute Johnson said:
How much data are you actually sending at a time? If it is only 12 bytes
and you aren't sending them fast enough to get a lot of them into a TCP
packet, then compression will gain you nothing. If you are, then GZIP ought
to work great.

From her other posts it looks like they are already discussing the bit
representation of their messages. Any compression algorithm can only reduce
redundancy, repetition and patterns. But if their information set has been
reduced so that it no longer contains such structure, no compression algorithm
can squeeze the data any further. Well, you might call that custom compression
if you like.

Hiran
 

Chris Uppal

Monique said:
I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

No general-purpose compression scheme is going to be very much help for such
short strings. They work by learning from the data, and 12 bytes doesn't give
enough input for the compressor to learn much, let alone start using what it
has learned.

So you /have/ to use a custom compression scheme. Adding to the suggestions
that have already been made:

The gzip compression scheme (as implemented in the Java API) can be pre-trained
with representative data (which is known to both the reading and writing
applications and therefore does not need to be transmitted). You can easily
test how well that would work for you by assembling a file consisting of (say)
8K of real, representative, data and another which is exactly the same except
that it has one more code tacked onto the end. The difference in size between
the two files when compressed with gzip -9 will give an indication of the
kind of compression available via this approach. I suspect it won't work well,
but you never know and it's easy to try out.
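If you would rather try it in code than with files, java.util.zip.Deflater
accepts a preset dictionary directly. A rough sketch (the sample and chunk
contents here are invented) for comparing sizes with and without the
"training" data:

import java.util.zip.Deflater;

public class PresetDictTest {
    static int compressedSize(byte[] chunk, byte[] dictionary) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION, true); // raw deflate, no header
        if (dictionary != null) {
            def.setDictionary(dictionary);   // "pre-train" with representative data
        }
        def.setInput(chunk);
        def.finish();
        byte[] buf = new byte[256];
        int total = 0;
        while (!def.finished()) {
            total += def.deflate(buf);
        }
        def.end();
        return total;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "ABC123,XY;42ABD124,XZ;43".getBytes("US-ASCII"); // stand-in for ~8K of real data
        byte[] chunk  = "ABC125,XY;44".getBytes("US-ASCII");             // one 12-byte message
        System.out.println("without dictionary: " + compressedSize(chunk, null)   + " bytes");
        System.out.println("with dictionary:    " + compressedSize(chunk, sample) + " bytes");
    }
}

The reading side would need the same dictionary handed to
Inflater.setDictionary() before it can inflate, which is fine since the
dictionary never travels over the wire.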

Otherwise you will almost certainly have to look into Huffman compression or
even (if you are sufficiently desperate) Arithmetic compression (much slower
and harder to implement).

One thing to consider is whether the data has any structure. For instance
(this is a real example from a previous job) if you desperately need to
compress URLs such as found on the web:
http://java.sun.com/index.html
you can divide them up into separate "regions":
http://
java.sun
.com/
index
.html
each of which has distinct frequency characteristics. The first region, the
"scheme" can probably be represented in 1 bit in the majority of cases. You
are looking at a population that is:
http:// -- the overwhelming majority
https:// -- quite common
ftp:// -- moderately rare
<everything else> -- rare
So you can represent http:// as one bit, probably use three bits for https://
and ftp://, and more bits than that for other cases. Similarly the ".com"
suffix can be recognised as the most common top-level domain, and so can be
compressed to one or two bits. The .htm and .html suffixes are common too.
You can use the Huffman algorithm to assign near optimal bit-patterns to the
various cases. OTOH the 'java.sun' bit is going to be much more "random" so
the best you can hope for is a normal text compression scheme, such as
character-by-character Huffman compression based on the frequency of ASCII
characters in domain names (less the top-level segment).
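To make that concrete, here is a toy sketch for the scheme region only;
the code words are invented (you would derive the real ones from your own
frequency counts), and anything falling under the escape case would have
to be spelled out in full after the marker:

import java.io.ByteArrayOutputStream;

class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current = 0, nbits = 0;

    void writeBits(int value, int count) {   // write 'count' low bits of 'value', MSB first
        for (int i = count - 1; i >= 0; i--) {
            current = (current << 1) | ((value >>> i) & 1);
            if (++nbits == 8) { out.write(current); current = 0; nbits = 0; }
        }
    }

    byte[] toByteArray() {                   // pad the last partial byte with zero bits
        if (nbits > 0) { out.write(current << (8 - nbits)); current = 0; nbits = 0; }
        return out.toByteArray();
    }
}

class SchemeCoder {
    static void writeScheme(String url, BitWriter bits) {
        if (url.startsWith("http://"))        bits.writeBits(0b0,   1); // the overwhelming majority: 1 bit
        else if (url.startsWith("https://"))  bits.writeBits(0b100, 3);
        else if (url.startsWith("ftp://"))    bits.writeBits(0b101, 3);
        else                                  bits.writeBits(0b11,  2); // escape: spell the scheme out
    }
}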

BTW, if this data is the result of encoding some compound information, then
look at the information itself rather than the strings produced by your current
code scheme. I.e. don't think of the task as compressing the 12 bytes of data,
but as compressing the N bytes of information that are (currently) expressed by
the 12 bytes.

-- chris
 

mr.vaibhavjain

Hello Monique,


If your data is stream based, then you can try the Huffman compression
algorithm. It assigns variable-length bit strings to fixed-size symbols, with
the shortest strings going to the most frequent symbols. The implementation
is not difficult, but it does require a good deal of bit shifting. You can
find more information about Huffman coding on the web. In your case, if RLE
fails or does not satisfy your needs, then Huffman compression is the best
bet. You can achieve a good compression ratio with it even for data blocks
whose values are spread out and hence do not compress much with RLE.

But please bear in mind that you should avoid third-party libraries, as they
might add some overhead to the data being transmitted. Instead I suggest you
approach your problem in the following phases:

1. Analyze the ASCII data that is being transmitted and prepare a table
(256 entries in the case of ASCII) giving each ASCII character and its
corresponding frequency of occurrence in your test analysis.

2. Manually prepare the Huffman tree from the obtained data. From that you
will find the optimal encoding scheme for your data set. That means the
characters that occur most often get shorter bit strings than the characters
that occur less frequently.

3. Implement this encoding scheme in your transmitter / receiver
application. You could subclass FilterInputStream and FilterOutputStream
to decorate your socket I/O streams.

I guess that the above scheme will do your job, as in any data some
characters are more frequent than others.
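A rough sketch of phases 1 and 2 (the class name and the sample bytes are
only placeholders for real captured traffic):

import java.util.PriorityQueue;

public class HuffmanSketch {
    static class Node implements Comparable<Node> {
        final int symbol;          // -1 for internal nodes
        final long freq;
        final Node left, right;
        Node(int symbol, long freq) {
            this.symbol = symbol; this.freq = freq; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.symbol = -1; this.freq = left.freq + right.freq; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "AAAABBBCC,12;X".getBytes("US-ASCII");  // stand-in for captured traffic

        // Phase 1: frequency table (256 entries covers every byte value).
        long[] freq = new long[256];
        for (byte b : sample) freq[b & 0xFF]++;

        // Phase 2: build the Huffman tree bottom-up with a priority queue.
        PriorityQueue<Node> heap = new PriorityQueue<Node>();
        for (int s = 0; s < 256; s++) if (freq[s] > 0) heap.add(new Node(s, freq[s]));
        while (heap.size() > 1) heap.add(new Node(heap.poll(), heap.poll()));

        printCodes(heap.poll(), "");
    }

    // The path to each leaf is that character's (near-optimal) bit string.
    static void printCodes(Node n, String path) {
        if (n == null) return;
        if (n.symbol >= 0) System.out.println((char) n.symbol + " -> " + (path.isEmpty() ? "0" : path));
        printCodes(n.left, path + "0");
        printCodes(n.right, path + "1");
    }
}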
 

Eric Sosman

Monique Y. Mudama wrote on 12/06/05 18:58:
Apologies if this is slightly off-topic ...

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

We're not using any form of compression right now, but the belief is
that gzip works in blocks, so small chunks of data don't get much
benefit. (Anyone disagree?)

Thought #1: If you're "streaming" the data, can you
treat the entire stream as the compressible block instead
of compressing each mini-block separately?

Thought #2: Compression introduces delay, not only by
consuming CPU cycles but from the very fact that it absorbs
more bits than it emits. E.g., if your super-efficient
compressor squeezes twelve bytes down to four bits, it
needs to swallow two twelve-byte packets before it can
transmit one compressed byte. If you're "streaming" the
data, this might not be acceptable.

Thought #3: If protocol overhead already amounts to
the equivalent of twelve bytes, no compressor can possibly
do better than halve your bandwidth. If protocol overhead
is more like sixty bytes, compression cannot possibly give
you more than a one-sixth reduction.

Thought #4: My news server carries three Usenet groups
with "compression" in their titles ...
 

Thomas Hawtin

Mike said:
Consider the communication overhead of the protocol you're using. You might
accomplish more by packing multiple chunks into a message than by compressing
the chunks.

Not just overhead, but minimum packet lengths. ATM (used by ADSL, for
instance) packets always have exactly a 48-byte payload (although I don't know
how IP is mapped to ATM). Ethernet has a minimum packet size in order for
collision detection to work (machines at the extremes of a network have to
detect collisions while still sending the packet).

(48 bytes may seem a peculiar choice. IIRC, the story is: Europeans
wanted a 64-byte packet for better efficiency. Americans wanted 32 bytes,
as the shorter packets would give sufficiently low latency that echo
suppression logic would not be necessary. 48 bytes was the compromise.
It gives poor efficiency and does not remove the need for echo
suppression logic.)

Tom Hawtin
 

Monique Y. Mudama

On 2005-12-06, Monique Y. Mudama asked a question ...

Thanks all for your info and ideas. It's now clear to me that I didn't
even know how to analyze my data, and your feedback has helped me figure
out how I will need to reframe the question. I'm going to try to get a
better understanding of my data, rather than trying to answer questions
piecemeal and frustrating everyone.
 

Roedy Green

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data. Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Compression will buy you little if every chunk goes out in its own
packet. The packet wrapper overhead will overwhelm the payload.
See http://mindprod.com/jgloss/tcpip.html to see what I am talking
about.

However, if your packet send delay is sufficient to put, say, 2
chunks in a packet most of the time, you will halve the packet
overhead.

My suggestion then is to see if there is a way to tweak the send delay
to find the optimal value for acceptable latency and fast
transmission. This will likely prove more fruitful than working on
compression.
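Something like this sketch is what I have in mind; the class name and the
two knobs are invented, and the right values depend on how much latency you
can tolerate:

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BatchingSender {
    private final OutputStream out;
    private final int maxChunks;
    private final long maxDelayMillis;
    private int buffered = 0;
    private long oldestMillis = 0;

    BatchingSender(OutputStream socketOut, int maxChunks, long maxDelayMillis) {
        this.out = new BufferedOutputStream(socketOut);
        this.maxChunks = maxChunks;
        this.maxDelayMillis = maxDelayMillis;
    }

    synchronized void send(byte[] chunk) throws IOException {
        if (buffered == 0) oldestMillis = System.currentTimeMillis();
        out.write(chunk);
        buffered++;
        boolean waitedLongEnough = System.currentTimeMillis() - oldestMillis >= maxDelayMillis;
        if (buffered >= maxChunks || waitedLongEnough) {
            out.flush();           // one TCP write carries several chunks
            buffered = 0;
        }
    }

    // A timer (or the sending thread) should call this periodically so a lone
    // chunk never waits longer than maxDelayMillis.
    synchronized void flushIfStale() throws IOException {
        if (buffered > 0 && System.currentTimeMillis() - oldestMillis >= maxDelayMillis) {
            out.flush();
            buffered = 0;
        }
    }
}

Sweeping maxChunks and maxDelayMillis over a few values against your latency
budget is the tuning I am suggesting.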
 

Mike Schilling

Roedy Green said:
Compression will buy you little if every chunk goes out in its own
packet. The packet wrapper overhead will overwhelm the payload.
See http://mindprod.com/jgloss/tcpip.html to see what I am talking
about.

However, if your packet send delay is sufficient to put, say, 2
chunks in a packet most of the time, you will halve the packet
overhead.

My suggestion then is to see if there is a way to tweak the send delay
to find the optimal value for acceptable latency and fast
transmission. This will likely prove more fruitful than working on
compression.

Roedy makes an excellent point.
 
