removing the header from a gzip'd string

Rajarshi

Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress.

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining? (Obviously
I do not intend to perform the decompression!)

Thanks,
 
Fredrik Lundh

Rajarshi said:
Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining?

what makes you think there's a "header portion" in the data you get
from zlib.compress ? it's just a continuous stream of bits, all of
which are needed by the decoder.

> (Obviously I do not intend to perform the decompression!)

oh. in that case, this should be good enough:

data[random.randint(0,len(data)):]

</F>
 
Bjoern Schliessmann

Rajarshi said:
Does anybody know how I can remove the header portion of the
compressed bytes, such that I only have the compressed data
remaining? (Obviously I do not intend to perform the
decompression!)

Just curious: What's your goal? :) A home made hash function?

Regards,


Björn
 
Gabriel Genellina

Fredrik Lundh said:
what makes you think there's a "header portion" in the data you get
from zlib.compress ? it's just a continuous stream of bits, all of
which are needed by the decoder.

No. The first 2 bytes (or more if using a preset dictionary) are
header information. The last 4 bytes are a checksum. In between
lies the encoded bit stream.

Using the default options ("deflate", default compression level, no
custom dictionary) will make those first two bytes 0x78 0x9c.

If you want to encrypt a compressed text, you must remove redundant
information first. Knowing part of the clear message is a security
hole. Using a structured container (like a zip/rar/... file) makes it
worse because the fixed (or "guessable") part is longer, but anyway,
2 bytes may be bad enough.

See RFC1950 <ftp://ftp.isi.edu/in-notes/rfc1950.txt>
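
The layout described above can be checked directly. This is only a minimal sketch, assuming a plain zlib.compress call with no preset dictionary (so the header is exactly 2 bytes); the helper names split_zlib_stream and raw_deflate are made up for illustration and are not part of zlib.

```python
import zlib


def split_zlib_stream(blob: bytes):
    """Split a zlib (RFC 1950) stream into header, deflate body, trailer.

    Assumes no preset dictionary, so the header is exactly 2 bytes.
    """
    return blob[:2], blob[2:-4], blob[-4:]


def raw_deflate(data: bytes, level: int = -1) -> bytes:
    """Compress to a bare deflate stream; negative wbits omits the wrapper."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()


text = b"hello hello hello hello"
header, body, trailer = split_zlib_stream(zlib.compress(text))

# The trailer is the big-endian Adler-32 of the uncompressed input.
assert int.from_bytes(trailer, "big") == zlib.adler32(text)
# The body alone still inflates when the decoder is told there is no wrapper.
assert zlib.decompress(body, -zlib.MAX_WBITS) == text
```

Asking zlib for a raw stream up front (negative wbits) is probably tidier than slicing bytes off afterwards, since it does not depend on the header length.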


--
Gabriel Genellina
Softlab SRL

 
Fredrik Lundh

Gabriel said:
> Using the default options ("deflate", default compression level, no
> custom dictionary) will make those first two bytes 0x78 0x9c.
>
> If you want to encrypt a compressed text, you must remove redundant
> information first.

encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

> Knowing part of the clear message is a security hole.

well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format").
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...

</F>
 
vasudevram

Fredrik said:
encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?


well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format").
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...

</F>

Yes, I'm also interested to know why the OP wants to remove the header.

Though I'm not an expert on the zip format, my understanding is that
most binary formats are not of much use in pieces (though some
composite formats might be, e.g. you might be able to meaningfully
extract a piece, such as an image embedded in a Word file). I somehow
don't think a compressed zip file would be of use in pieces (except
possibly for the header itself). But I could be wrong of course.

Vasudev Ram
http://www.dancingbison.com
 
Gabriel Genellina

Fredrik Lundh wrote:
encryption? didn't the OP say that he *didn't* plan to decompress the
resulting data stream?

I was trying to imagine any motivation for asking that question. And I
considered the second part as "I'm not the guy who will reconstruct the
original data". But I'm still intrigued by the actual use case...
 
debarchana.ghosh

Bjoern said:
Just curious: What's your goal? :) A home made hash function?

Actually I was implementing the use of the normalized compression
distance to evaluate molecular similarity as described in an article in
J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z, subscriber
access only, unfortunately).

Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

So I was interested to see if the NCD behaved like a metric if I
removed everything that was not the compressed string. And since I only
need to calculate similarity between two strings, I do not need to do
any decompression.
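
For anyone following along, here is a rough sketch of that experiment. It assumes the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, and it uses a raw deflate stream so that no header or checksum bytes are counted. The function names and the two SMILES-style sample strings are illustrative assumptions; the thread does not say which string representation the article used.

```python
import zlib


def deflate_size(data: bytes) -> int:
    """Length of the bare deflate stream (no zlib/gzip header or checksum)."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return len(comp.compress(data) + comp.flush())


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance based on raw deflate sizes."""
    cx, cy, cxy = deflate_size(x), deflate_size(y), deflate_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs, chosen only to have two short, different strings.
a = b"CC(=O)OC1=CC=CC=C1C(=O)O"
b = b"CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

print("ncd(a, a) =", ncd(a, a))
print("ncd(a, b) =", ncd(a, b))
print("ncd(b, a) =", ncd(b, a))
```

Comparing ncd(a, b) with ncd(b, a), and ncd(a, a) with zero, is one quick way to see how far the values stray from metric behaviour on inputs this short.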
 
Fredrik Lundh

debarchana.ghosh said:
Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

gzip datastreams have a real header, with a file type identifier,
optional filenames, comments, and a bunch of flags.

but even if you strip that off (which is basically what happens if you
use zlib.compress instead of gzip), I doubt you'll get representative
"compressibility" metrics on strings that short. like most other
compression algorithms, those algorithms are designed for much larger
datasets.
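
To put numbers on the framing overhead, something like the following could be used. It is a sketch only: framing_sizes is a made-up helper, and the 48-byte sample is an arbitrary string picked to match the average length mentioned above. Per RFC 1950 the zlib wrapper adds 6 bytes (2 header, 4 checksum); per RFC 1952 a gzip member adds at least 18 (10 header, 8 trailer).

```python
import gzip
import zlib


def framing_sizes(data: bytes, level: int = 9) -> dict:
    """Compressed size of the same input as raw deflate, zlib and gzip."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    raw = comp.compress(data) + comp.flush()
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(data, level)),
        "gzip": len(gzip.compress(data, compresslevel=level)),
    }


# Arbitrary 48-byte sample; not taken from the article.
sample = b"CC(=O)OC1=CC=CC=C1C(=O)O" * 2
sizes = framing_sizes(sample)
print(len(sample), sizes)
```

On inputs this small the wrapper bytes can be a sizeable fraction of the total, which is the effect the article's authors suspected.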

</F>
 
