Reading delimited gzipped serialized objects

Dave Brown

Hi all,

Hoping someone might help me get my head around the reading of a data file.

The structure of my binary file begins with a 100-byte header. After
that there is a 0L delimiter (8 bytes of 0) followed by a set of
gzipped/serialised objects followed by the delimiter, then again
followed by more gzipped/serialised objects and a delimiter at the end...

I've got the header reading in OK using FileInputStream, but it's the
skipping over the delimiter and reading up to the next delimiter whilst
unzipping/deserialising which is confusing me...

Any help appreciated, thank you.
 
Chris Uppal

Dave said:
The structure of my binary file begins with a 100-byte header. After
that there is a 0L delimiter (8 bytes of 0) followed by a set of
gzipped/serialised objects followed by the delimiter, then again
followed by more gzipped/serialised objects and a delimiter at the end...

I think you have a couple of problems here.

One is that your format definition doesn't include the length of each segment
of data. That means that reading the data will be tricky. If the gzip format
were self-delimiting (meaning that the decompressor could tell when it got to
the end of the compressed data without external help) then you could decode it
easily enough (but see below). As it is, I don't think the gzip format /is/
self-delimiting. So I think that all you can do is read up to one of your
"delimiters" copying the data into a byte array as you do so, and then attempt
to un-gzip that. If it works, then you're fine, if not then you start again,
copying more data from the stream (not forgetting the delimiter you've just
seen) onto the end of the array. Loop until you find input that does
decompress correctly...
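
The loop described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: each segment is taken to be a single gzip stream followed by the 8-byte zero delimiter, and the caller is assumed to have already skipped the 100-byte header and the first delimiter. The names `DelimitedSegmentReader` and `readSegment` are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class DelimitedSegmentReader {
    private static final int DELIMITER_LENGTH = 8; // the 0L delimiter: 8 zero bytes

    /**
     * Reads one compressed segment plus its trailing delimiter and returns
     * the decompressed bytes, or null if the stream is already at its end.
     */
    public static byte[] readSegment(InputStream in) throws IOException {
        ByteArrayOutputStream candidate = new ByteArrayOutputStream();
        int zeroRun = 0;
        int b;
        while ((b = in.read()) != -1) {
            candidate.write(b);
            zeroRun = (b == 0) ? zeroRun + 1 : 0;
            if (zeroRun >= DELIMITER_LENGTH) {
                // The last 8 bytes look like a delimiter: try everything before them.
                byte[] raw = candidate.toByteArray();
                byte[] plain = tryGunzip(raw, raw.length - DELIMITER_LENGTH);
                if (plain != null) {
                    return plain;
                }
                // Not a real delimiter (zeros inside the gzip data): keep reading.
            }
        }
        if (candidate.size() == 0) {
            return null; // clean end of file
        }
        throw new EOFException("stream ended inside a segment");
    }

    /** Returns the decompressed bytes, or null if the prefix is not a complete gzip stream. */
    private static byte[] tryGunzip(byte[] data, int length) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data, 0, length))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
            return plain.toByteArray();
        } catch (IOException notYetComplete) {
            return null;
        }
    }
}
```

Each failed attempt decompresses the candidate from the start again, so the cost grows quickly if the compressed data contains many runs of zero bytes. The input stream should be buffered, since the sketch reads one byte at a time.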

A second problem with using gzip to decompress /parts/ of an input stream like
this is that (as far as I can see) there's a mis-design in the Java library.
You can attach a decompressor to an existing read stream, and use that to
decompress the data from that stream. That's easy enough (in fact that's the
/only/ way you can use a GZIPInputStream ;-). But the problem comes when you
reach the end of the compressed bit (assuming you know when that happens).
There doesn't seem to be a way to unhook the decompressor from the stream. One
problem is that the decompressor may have read ahead too far from the
underlying stream (consuming bytes from the next segment of uncompressed data),
and there's no way to get that data back. The other is that you are supposed
to close() streams (it releases resources in the zlib library, for instance),
but close()ing the decompressing stream will close() the underlying stream...
So for that reason, too, you will need to read all the compressed segment from
the stream into a byte array before decompressing it.
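
Once a compressed segment is sitting in a byte array, both problems go away: the decompressor can only read ahead within the array, and close() only closes the in-memory stream. A rough sketch, assuming each segment is one gzip stream wrapping a single ObjectOutputStream that wrote several objects (the original post does not say exactly how the objects were written); `SegmentDecoder` and `decode` are illustrative names.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class SegmentDecoder {
    /** Deserializes every object found in one gzip-compressed segment. */
    public static List<Object> decode(byte[] compressedSegment)
            throws IOException, ClassNotFoundException {
        List<Object> objects = new ArrayList<>();
        // Closing these streams releases the zlib resources without
        // touching the file stream the segment was copied from.
        try (ObjectInputStream ois = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(compressedSegment)))) {
            while (true) {
                try {
                    objects.add(ois.readObject());
                } catch (EOFException endOfSegment) {
                    break; // no more objects in this segment
                }
            }
        }
        return objects;
    }
}
```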

-- chris
 
dave

Thanks Chris, point 2 is exactly the problem I keep coming up
against.

I think the answer is to have one object that holds the 'delimited'
objects in an ArrayList before serialization, so after I've written my
header, I just write the whole object in one go.
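
A rough sketch of that single-object layout, assuming the file is nothing but the 100-byte header followed by one gzip stream that runs to the end of the file. `SingleObjectFile`, `write` and `read` are hypothetical names used only for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SingleObjectFile {
    static final int HEADER_LENGTH = 100;

    /** Writes the header, then the whole list as one gzipped serialized object. */
    public static void write(OutputStream out, byte[] header,
                             ArrayList<? extends Serializable> members) throws IOException {
        if (header.length != HEADER_LENGTH) {
            throw new IllegalArgumentException("header must be " + HEADER_LENGTH + " bytes");
        }
        out.write(header);
        // Closing the object stream also finishes the gzip stream and closes 'out'.
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(members);
        }
    }

    /** Fills 'header' (which must be 100 bytes long) and returns the deserialized list. */
    public static ArrayList<?> read(InputStream in, byte[] header)
            throws IOException, ClassNotFoundException {
        // DataInputStream does no buffering, so it consumes exactly the header bytes.
        new DataInputStream(in).readFully(header);
        try (ObjectInputStream ois = new ObjectInputStream(new GZIPInputStream(in))) {
            return (ArrayList<?>) ois.readObject();
        }
    }
}
```

Because the compressed data runs to the end of the file, there is no need to detach the decompressor part-way through, so the read-ahead and close() issues do not arise.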

My problem with that is that if anyone analyses my file and works out
the structure, they can create their own reader, which is exactly what
I'm trying to avoid.
 
Roedy Green

My problem with that is that if anyone analyses my file and works out
the structure, they can create their own reader, which is exactly what
I'm trying to avoid.

Then put the lengths all at the very end and scramble them in some
simple way, e.g. XOR them with the first n bytes of the file.

Your tail structure might look like this:

offset of member 0 = 0 (implied)
offset of member 1 + crazy-making number 1
offset of member 2 + crazy-making number 2
3 (count of members)

and you compute the lengths from the offsets, or do the reverse. Don't
put both. That is just making it even easier for a hacker.

Your crazy-making numbers can be "random" numbers using some bytes in
the file as a seed. If he is going to crack it, he is going to have
to disassemble your code.
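
One way such a tail could be built and read back is sketched below. Everything specific here is an assumption made for the example: offsets are stored as 8-byte values relative to the end of the header, they are scrambled with XOR rather than addition, and the masks come from a java.util.Random seeded with the first 8 bytes of the file. `ScrambledTailIndex`, `build` and `split` are made-up names.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;

public class ScrambledTailIndex {
    /** Masks are derived from the first 8 bytes of the file (part of the header). */
    private static Random maskSource(byte[] file) {
        return new Random(ByteBuffer.wrap(file, 0, 8).getLong());
    }

    /** Lays out header, members, scrambled offsets of members 1..n-1, then the count. */
    public static byte[] build(byte[] header, byte[][] members) {
        if (header.length < 8 || members.length < 1) {
            throw new IllegalArgumentException("need an 8+ byte header and at least one member");
        }
        int bodyLength = header.length;
        for (byte[] m : members) {
            bodyLength += m.length;
        }
        int tailLength = 8 * (members.length - 1) + 4;
        ByteBuffer file = ByteBuffer.allocate(bodyLength + tailLength);
        file.put(header);
        long[] offsets = new long[members.length];
        for (int i = 0; i < members.length; i++) {
            offsets[i] = file.position() - header.length; // member 0 is at offset 0
            file.put(members[i]);
        }
        Random masks = maskSource(file.array());
        for (int i = 1; i < members.length; i++) {
            file.putLong(offsets[i] ^ masks.nextLong());
        }
        file.putInt(members.length);
        return file.array();
    }

    /** Recovers the members; lengths are computed from consecutive offsets. */
    public static byte[][] split(byte[] file, int headerLength) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int count = buf.getInt(file.length - 4);
        int tailStart = file.length - 4 - 8 * (count - 1);
        long[] starts = new long[count + 1];
        Random masks = maskSource(file);
        for (int i = 1; i < count; i++) {
            starts[i] = buf.getLong(tailStart + 8 * (i - 1)) ^ masks.nextLong();
        }
        starts[count] = tailStart - headerLength; // the last member ends where the tail begins
        byte[][] members = new byte[count][];
        for (int i = 0; i < count; i++) {
            members[i] = Arrays.copyOfRange(
                file, headerLength + (int) starts[i], headerLength + (int) starts[i + 1]);
        }
        return members;
    }
}
```

The sketch works on whole byte arrays to stay short; a real reader would more likely seek to the end of the file, read the tail, and then read each member by offset.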
 
Roedy Green

If you are feeling particularly cruel, duplicate the info at the head
of the file or embedded in it in some obvious way, but hide it lightly
or not at all. Ensure it is correct 99 percent of the time.

see http://mindprod.com/jgloss/obfuscator.html

The site is down just now. I don't know why. I have a call in.
 
Thomas Weidenfeller

My problem with that is that if anyone analyses my file and works out
the structure, they can create their own reader, which is exactly what
I'm trying to avoid.

Not when using Java. It is trivial to decompile non-obfuscated Java and
see exactly what you are doing. It just takes some more time to make
sense out of some obfuscated class files. But in the end, figuring out
what you are doing is not rocket science.

IMHO you are wasting your time with such attempts. In addition, you
probably annoy some of your users who badly need to get at that data for
good reasons.

/Thomas
 
Chris Uppal

My problem with that is that if anyone analyses my file and works out
the structure, they can create their own reader, which is exactly what
I'm trying to avoid.

Since any file structure can be reverse engineered, it doesn't seem a
worthwhile reason to give yourself real difficulty parsing the data. Just
adding some field lengths isn't going to make a big difference to how easy it
is to work out where the fields are -- especially if you are going to use
delimiters to mark the segment ends.

There are lots of ways to make the data less readily readable, ranging from the
simple through the bizarre to the fantastic, but even if you do indulge in some
of them, it doesn't seem very productive to make it harder for /yourself/
too...

-- chris
 
Roedy Green

Not when using Java. It is trivial to decompile non-obfuscated Java and
see exactly what you are doing. It just takes some more time to make
sense out of some obfuscated class files. But in the end, figuring out
what you are doing is not rocket science.

If you distribute natively compiled code, reverse engineering it is a
lot more work. You would discourage casual hackers even with the
simple means I suggested.
 
