Reading delimited gzipped serialized objects

Dave Brown · Mar 8, 2006

Hi all,

Hoping someone might help me get my head around the reading of a data file.

The structure of my binary file begins with a 100-byte header. After
that there is a 0L delimiter (8 bytes of 0) followed by a set of
gzipped/serialised objects followed by the delimiter, then again
followed by more gzipped/serialised objects and a delimiter at the end...

I've got the header reading in OK using FileInputStream, but it's the
skipping over the delimiter and reading up to the next delimiter whilst
unzipping/deserialising which is confusing me...

Any help appreciated, thank you.

Chris Uppal · Mar 9, 2006

Dave said:
The structure of my binary file begins with a 100-byte header. After
that there is a 0L delimiter (8 bytes of 0) followed by a set of
gzipped/serialised objects followed by the delimiter, then again
followed by more gzipped/serialised objects and a delimiter at the end...

I think you have a couple of problems here.

One is that your format definition doesn't include the length of each segment
of data. That means that reading the data will be tricky. If the gzip format
were self-delimiting (meaning that the decompressor could tell when it got to
the end of the compressed data without external help) then you could decode it
easily enough (but see below). As it is, I don't think the gzip format /is/
self-delimiting. So I think that all you can do is read up to one of your
"delimiters" copying the data into a byte array as you do so, and then attempt
to un-gzip that. If it works, then you're fine, if not then you start again,
copying more data from the stream (not forgetting the delimiter you've just
seen) onto the end of the array. Loop until you find input that does
decompress correctly...
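
The retry loop described above could be sketched roughly as follows. This is only an illustration under assumptions: the file contents after the header are already in a byte array, each segment is a single gzip stream, and the class and method names (SegmentScanner, nextDelimiter, tryGunzip, findSegmentEnd) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SegmentScanner {
    static final int DELIM_LEN = 8; // eight zero bytes, i.e. a long 0L

    /** Index of the next run of eight zero bytes at or after 'from', or -1 if none. */
    static int nextDelimiter(byte[] data, int from) {
        for (int i = from; i + DELIM_LEN <= data.length; i++) {
            int zeros = 0;
            while (zeros < DELIM_LEN && data[i + zeros] == 0) zeros++;
            if (zeros == DELIM_LEN) return i;
        }
        return -1;
    }

    /** Gunzips data[start, end), or returns null if that range does not decompress. */
    static byte[] tryGunzip(byte[] data, int start, int end) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, start, end - start))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException incompleteOrCorrupt) {
            return null; // the candidate "delimiter" was really part of the data
        }
    }

    /** Index of the delimiter that ends the segment starting at 'start', or -1. */
    static int findSegmentEnd(byte[] data, int start) {
        int candidate = nextDelimiter(data, start);
        while (candidate != -1 && tryGunzip(data, start, candidate) == null) {
            // Keep the false delimiter as data and look for the next candidate.
            candidate = nextDelimiter(data, candidate + 1);
        }
        return candidate;
    }

    /** Helper for the demo only: gzip a byte array. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] segment = gzip("hello".getBytes("UTF-8"));
        byte[] data = new byte[segment.length + DELIM_LEN]; // segment followed by 0L
        System.arraycopy(segment, 0, data, 0, segment.length);
        int end = findSegmentEnd(data, 0);
        System.out.println(new String(tryGunzip(data, 0, end), "UTF-8"));
    }
}
```

Stepping the search forward one byte at a time matters: if the compressed data itself ends in zero bytes, the first run of eight zeros starts a few bytes before the real delimiter, and only a later candidate decompresses cleanly.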

A second problem with using gzip to decompress /parts/ of an input stream like
this is that (as far as I can see) there's a mis-design in the Java library.
You can attach a decompressor to an existing read stream, and use that to
decompress the data from that stream. That's easy enough (in fact that's the
/only/ way you can use a GZIPInputStream ;-) But the problem comes when you
reach the end of the compressed bit (assuming you know when that happens).
There doesn't seem to be a way to unhook the decompressor from the stream. One
problem is that the decompressor may have read ahead too far from the
underlying stream (consuming bytes from the next segment of data), and there's
no way to get that data back. The other is that you are supposed to close()
streams (it releases resources in the zlib library, for instance), but
close()ing the decompressing stream will close() the underlying stream...
So for that reason, too, you will need to read all of the compressed segment
from the stream into a byte array before decompressing it.
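
A minimal sketch of that buffering approach, assuming the byte length of the segment has already been worked out (for example by scanning for the delimiter) and that a segment is one gzip stream wrapping one object stream. SegmentDecoder and its method names are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SegmentDecoder {
    /** Copies exactly 'length' bytes of one compressed segment out of the file stream. */
    static byte[] readSegment(InputStream file, int length) throws IOException {
        byte[] segment = new byte[length];
        new DataInputStream(file).readFully(segment); // reads no further than asked
        return segment;
    }

    /** Decompresses a buffered segment and deserializes objects until the data runs out. */
    static List<Object> readObjects(byte[] segment)
            throws IOException, ClassNotFoundException {
        List<Object> objects = new ArrayList<>();
        // Closing these streams only closes the in-memory ByteArrayInputStream,
        // so the file stream the bytes came from stays open and correctly positioned.
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(segment)))) {
            while (true) {
                try {
                    objects.add(in.readObject());
                } catch (EOFException endOfSegment) {
                    break; // no more objects in this segment
                }
            }
        }
        return objects;
    }

    /** Helper for the demo only: serialize and gzip some objects into one segment. */
    static byte[] writeObjects(Object... objects) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            for (Object o : objects) out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] segment = writeObjects("first", 2, "third");
        InputStream file = new ByteArrayInputStream(segment); // stands in for the real file
        System.out.println(readObjects(readSegment(file, segment.length)));
    }
}
```

Because the decompressor only ever sees the byte array, any read-ahead it does stays inside the buffer, and the caller can carry on reading the delimiter and the next segment from the original stream.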

-- chris

dave · Mar 9, 2006

Thanks Chris, the second point you make is exactly the problem I've been
running up against.

I think the answer is to have one object that includes the 'delimited'
objects in an array list before serialization. So after I've written my
header, I just write the whole object in one go.
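
For illustration, that single-object layout might look something like the sketch below. It assumes the container is an ArrayList of ArrayLists (one inner list per former segment) and that the object is still gzipped; SingleObjectFile and its methods are hypothetical names, not anything from the actual program.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SingleObjectFile {
    static final int HEADER_LEN = 100; // header size from the original post

    /** Writes the raw header, then one gzipped, serialized container object. */
    static void write(OutputStream file, byte[] header,
                      ArrayList<ArrayList<Object>> groups) throws IOException {
        if (header.length != HEADER_LEN) {
            throw new IllegalArgumentException("header must be " + HEADER_LEN + " bytes");
        }
        file.write(header);
        // Closing the object stream finishes the gzip data and closes 'file' too,
        // which is fine here because nothing follows the container.
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(file))) {
            out.writeObject(groups);
        }
    }

    /** Fills 'header' from the stream, then reads the container back in one go. */
    @SuppressWarnings("unchecked")
    static ArrayList<ArrayList<Object>> read(InputStream file, byte[] header)
            throws IOException, ClassNotFoundException {
        new DataInputStream(file).readFully(header);
        try (ObjectInputStream in = new ObjectInputStream(new GZIPInputStream(file))) {
            return (ArrayList<ArrayList<Object>>) in.readObject();
        }
    }
}
```

With everything after the header in one compressed stream, there are no delimiters to scan for and no need to unhook the decompressor part-way through the file.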

My problem with that is that IF anyone analyses my file structure and
realises the structure they can create their own reader. Which is
exactly what i'm trying to avoid.

Roedy Green · Mar 9, 2006

My problem with that is that IF anyone analyses my file structure and
realises the structure they can create their own reader. Which is
exactly what i'm trying to avoid.

Then put the lengths all at the very end and scramble them in some
simple way, e.g. XOR them with the first n bytes of the file.

Your tail structure might look like this:

offset of member 0 = 0 (implied)
offset of member 1 + crazy-making number 1
offset of member 2 + crazy-making number 2
3 (count of members)

and you compute the lengths from the offsets, or store the lengths and
compute the offsets. Don't store both. That just makes it even easier
for a hacker.

Your crazy-making numbers can be "random" numbers using some bytes in
the file as a seed. If he is going to crack it, he is going to have
to disassemble your code.
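
One way such a tail could be sketched, purely as an illustration: the masks come from java.util.Random seeded with the first eight bytes of the file, the offsets are XORed with them rather than added, and each entry is a 4-byte int. All of those choices, and the names ScrambledTail, seedFrom, buildTail and readOffsets, are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Random;

class ScrambledTail {
    /** Seed for the masks: the first eight bytes of the file, read as a long. */
    static long seedFrom(byte[] file) {
        return ByteBuffer.wrap(file, 0, 8).getLong();
    }

    /**
     * Builds the tail for members starting at the given offsets.
     * offsets[0] is the implied 0 and is not stored; the plain count goes last.
     */
    static byte[] buildTail(int[] offsets, long seed) {
        Random masks = new Random(seed);
        ByteBuffer tail = ByteBuffer.allocate(4 * offsets.length);
        for (int i = 1; i < offsets.length; i++) {
            tail.putInt(offsets[i] ^ masks.nextInt()); // scrambled offset
        }
        tail.putInt(offsets.length); // count of members
        return tail.array();
    }

    /** Recovers the member offsets from a complete file image ending in the tail. */
    static int[] readOffsets(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int count = buf.getInt(file.length - 4);
        Random masks = new Random(seedFrom(file)); // same seed, same mask sequence
        int[] offsets = new int[count];            // offsets[0] stays 0 (implied)
        int pos = file.length - 4 * count;         // where the tail starts
        for (int i = 1; i < count; i++, pos += 4) {
            offsets[i] = buf.getInt(pos) ^ masks.nextInt();
        }
        return offsets;
    }
}
```

The length of each member then falls out as the difference between consecutive offsets, with the last member ending where the tail begins.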

Roedy Green · Mar 9, 2006

If you are feeling particularly cruel, duplicate the info at the head
of the file, or embed it in the file in some obvious way, hidden lightly
or not at all. Ensure it is correct 99 percent of the time.

See http://mindprod.com/jgloss/obfuscator.html

The site is down just now. I don't know why. I have a call in.

Thomas Weidenfeller · Mar 10, 2006

My problem with that is that IF anyone analyses my file structure and
realises the structure they can create their own reader. Which is
exactly what i'm trying to avoid.

Not when using Java. It is trivial to decompile non-obfuscated Java and
see exactly what you are doing. It just takes some more time to make
sense out of some obfuscated class files. But in the end, figuring out
what you are doing is not rocket science.

IMHO you are wasting your time with such attempts. In addition, you
probably annoy some of your users who badly need to get at that data for
good reasons.

/Thomas

Chris Uppal · Mar 10, 2006

My problem with that is that IF anyone analyses my file structure and
realises the structure they can create their own reader. Which is
exactly what i'm trying to avoid.

Since any file structure can be reverse engineered, it doesn't seem a
worthwhile reason to give yourself real difficulty parsing the data. Just
adding some field lengths isn't going to make a big difference to how easy it
is to work out where the fields are -- especially if you are going to use
delimiters to mark the segment ends.

There are lots of ways to make the data less readily readable, ranging from the
simple through the bizarre to the fantastic, but even if you do indulge in some
of them, it doesn't seem very productive to make it harder for /yourself/
too...

-- chris

Roedy Green · Mar 10, 2006

Not when using Java. It is trivial to decompile non-obfuscated Java and
see exactly what you are doing. It just takes some more time to make
sense out of some obfuscated class files. But in the end, figuring out
what you are doing is not rocket science.

If you distributed natively compiled code, reverse engineering it is a
lot more work. You would discourage casual hackers even with the
simple means I suggested.
