Serialisation inefficiency


Roedy Green

I serialised a simple array of objects. When I looked at it in a hex
viewer I discovered it contains the following strings:

[Lcom.mindprod.replicator.MiniZD;
com.mindprod.replicator.MiniZD
Lcom/mindprod/replicator/MiniFD;
com.mindprod.replicator.MiniFD

Why would it need to specify each type twice?

I can possibly see MiniZD appearing twice, once because it is the type of
the array and once for an object in the array, but not MiniFD, which is
just a reference from a MiniZD.
 

Will Hartung

Roedy Green said:
I serialised a simple array of objects. When I looked at it in a hex
viewer I discovered it contains the following strings:

[Lcom.mindprod.replicator.MiniZD;
com.mindprod.replicator.MiniZD
Lcom/mindprod/replicator/MiniFD;
com.mindprod.replicator.MiniFD

Why would it need to specify each type twice?

It's not specifying each type twice. It's specifying each type once. An
array type is a different type from its element class, even though it is
tied to that class, so this looks pretty straightforward and correct
to me.
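
To make the point about arrays being their own types concrete, here is a minimal sketch. It uses String rather than MiniZD (the class name ArrayTypeNameDemo and its helper methods are made up for illustration), but the naming pattern of the array class is the same one that shows up in the hex dump:

```java
class ArrayTypeNameDemo {
    // Name of the array class, in the "[L...;" form seen in the stream.
    static String arrayName() {
        return String[].class.getName();
    }

    // Name of the element class on its own.
    static String elementName() {
        return String.class.getName();
    }

    // The array class is a separate Class object that points at its element class.
    static boolean arrayWrapsElement() {
        return String[].class.isArray()
                && String[].class.getComponentType() == String.class;
    }

    public static void main(String[] args) {
        System.out.println(arrayName());         // [Ljava.lang.String;
        System.out.println(elementName());       // java.lang.String
        System.out.println(arrayWrapsElement()); // true
    }
}
```

Since the stream has to describe both classes, both names end up in it.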

Regards,

Will Hartung
 

Robert Olofsson

Roedy Green wrote:
: [Lcom.mindprod.replicator.MiniZD;

Array of objects of the class com.mindprod.replicator.MiniZD

: com.mindprod.replicator.MiniZD

Hmm, since you don't give the full stream I can't say where this comes
from. Probably an instance in the stream...

: Lcom/mindprod/replicator/MiniFD;

Object of class com.mindprod.replicator.MiniFD

: com.mindprod.replicator.MiniFD

Need full stream to give more info.

: Why would it need to specify each type twice?

It does not (unless you force it to by calling reset()).

: I can possibly see MiniZD appearing twice, once because it is the type of
: the array and once for an object in the array, but not MiniFD, which is
: just a reference from a MiniZD.

There is a white paper about how the streams are built. They are quite
efficient, but gzipping the data will probably save you about 50% (that's
what I see for the data I serialize, YMMV).

http://java.sun.com/j2se/1.4.2/docs/guide/serialization/spec/protocol.html#wp10258

Normally you will find something like
magic,
class 1 description, instance 1, instance 2,
class 2 description, instance 3
instance 4, instance 5....

where instance 1, 2, 4 are of class 1 and instance 3 and 5 are of
class 2.

If your class MiniZD has a variable of the type MiniFD, then the class
description of MiniZD will have the string MiniFD in it, and you will also
have the class MiniFD itself described separately in the stream => 2 strings....
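
A small sketch of where the two strings come from. The classes here are stand-ins, not the original ones: Holder plays the role of MiniZD and java.util.Date plays the role of MiniFD (chosen only because it has a package name, so both separator styles are visible). The field type recorded in Holder's class description uses the slash form, and Date's own class description uses the dotted form:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Date;

class FieldTypeStringDemo {
    /** Hypothetical stand-in for MiniZD; its single reference field stands in for MiniFD. */
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        Date stamp = new Date();
    }

    // Type string stored for the field inside Holder's class description.
    static String fieldTypeString() {
        return ObjectStreamClass.lookup(Holder.class).getField("stamp").getTypeString();
    }

    // Name stored in the referenced class's own class description.
    static String className() {
        return Date.class.getName();
    }

    // Serialise one Holder and view the bytes as text so the names can be searched for.
    static String streamAsText() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Holder());
        }
        return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        String stream = streamAsText();
        System.out.println(fieldTypeString() + " in stream: " + stream.contains(fieldTypeString()));
        System.out.println(className() + " in stream: " + stream.contains(className()));
    }
}
```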

Read the grammar for a better understanding. I did when I wanted to
write a Perl server that output Java serialized objects. The grammar
is quite easy to read and understand.

Have fun
/robo
 

Roedy Green

Robert Olofsson said:
Hmm, since you don't give the full stream I can't say where this comes
from. Probably an instance in the stream...

the object being persisted is MiniA[]

Each MiniA has a single MiniB reference.

There are no MiniB[]

The main mystery is solved by noting that MiniA and MiniA[] are
considered as distinct atomic types and hence the name appears twice.

That does not explain why MiniB appears twice:

Lcom/mindprod/replicator/MiniB
com.mindprod.replicator.MiniB

And why the two different separator conventions?
 

Robert Olofsson

Roedy Green wrote:

: the object being persisted is MiniA[]
: Each MiniA has a single MiniB reference.
: There are no MiniB[]

: The main mystery is solved by noting that MiniA and MiniA[] are
: considered as distinct atomic types and hence the name appears twice.

That is correct.

: That does not explain why MiniB appears twice
If you read my previous reply it should be quite clear, but I
will try to explain a bit more.

: Lcom/mindprod/replicator/MiniB
This one comes from the variable inside MiniA (the class description
of MiniA). The class description of MiniA says something like:
"here comes a variable named <variablename> of type
Lcom/mindprod/replicator/MiniB".

: com.mindprod.replicator.MiniB
This one is the class description of MiniB.

: And why the two different separator conventions?
That is a good question. Try to get Sun to answer that...

/robo
 

Roedy Green

: com.mindprod.replicator.MiniB
This one is the class description of MiniB.

Most of the references to MiniB objects just seem to point to the
class description. Why would that one spell it out longhand?

I presume it has something to do with it being a slightly forward
reference.

I think serialised objects are underused. To write the equivalent code
to accurately write and read a persistent object is a lot of extra
work, and it must be maintained every time anything changes.

In debugging it is no problem when I change the format. I just delete
all persistent files. However, once this program is deployed, I can't
be so cavalier. I will have to read old objects and write new ones.

I don't know to what extent you can trust the readObject to deal with
a slightly out of date object.
 

Robert Olofsson

Roedy Green wrote:
: Most of the references to MiniB objects just seem to point to the
: class description. Why would that one spell it out longhand?

Class descriptions are nice to have. The class description has an id
for the class and a list of fields (name and type). More info below.
Class descriptions normally only occur once for each class in an
ObjectStream.
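
A quick way to check the "only once" claim, as a rough sketch: write two instances of the same class to one stream and count how often the class name shows up in the bytes. The class Point and the helper countClassName are invented for this example. Calling reset() between the writes makes the stream forget what it has already written, so the description is emitted again:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

class DescriptorCountDemo {
    /** Hypothetical class with primitive fields only, so its name occurs only in its own description. */
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x;
        int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Writes two Points and counts occurrences of the class name in the stream bytes.
    static int countClassName(boolean resetBetween) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Point(1, 2));
            if (resetBetween) {
                out.reset(); // discard the stream's memory of objects and classes written so far
            }
            out.writeObject(new Point(3, 4));
        }
        String stream = new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
        String name = Point.class.getName();
        int count = 0;
        for (int i = stream.indexOf(name); i >= 0; i = stream.indexOf(name, i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without reset: " + countClassName(false)); // 1
        System.out.println("with reset:    " + countClassName(true));  // 2
    }
}
```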

: I presume it has something to do with it being a slightly forward
: reference.

Hm, no, I would not say that it is that. For readObject to work
correctly a class description is necessary.

: I think serialised objects are underused. To write the equivalent code
: to accurately write and read a persistent object is a lot of extra
: work, and it must be maintained every time anything changes.

Yes, serializing is an easy way and serialized objects are compact.
By wrapping the object stream in a gzip stream about 50% of the size
can be saved.
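
The gzip wrapping could look roughly like this. It is only a sketch: the class GzipSerialDemo, its helpers and the sample data are invented for illustration, and the saving you get depends entirely on your data, so no particular ratio is promised here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSerialDemo {
    // Serialises an object, optionally through a GZIPOutputStream.
    static byte[] serialize(Object value, boolean gzip) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(value);
        } // closing the object stream also finishes the gzip stream
        return bytes.toByteArray();
    }

    // Reads the object back, unwrapping the gzip layer if one was used.
    static Object deserialize(byte[] data, boolean gzip) throws IOException, ClassNotFoundException {
        InputStream source = new ByteArrayInputStream(data);
        if (gzip) {
            source = new GZIPInputStream(source);
        }
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            return in.readObject();
        }
    }

    // Made-up sample data: many similar but distinct strings.
    static ArrayList<String> sample() {
        ArrayList<String> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add("item-" + i);
        }
        return list;
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> data = sample();
        System.out.println("plain:   " + serialize(data, false).length + " bytes");
        System.out.println("gzipped: " + serialize(data, true).length + " bytes");
    }
}
```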

: In debugging it is no problem when I change the format. I just delete
: all persistent files. However, once this program is deployed, I can't
: be so cavalier. I will have to read old objects and write new ones.

That depends. If you only fix bugs in methods, you can declare the
serialVersionUID with a value from the serialver tool.
If you do change the data in an object it is possible to write a
readObject method that can handle multiple versions of a class.
It is possible to get fields from the stream that only existed in the
previous version of the class (via ObjectInputStream.readFields(),
something like: fields.get("fieldThatNoLongerExists", null)), and by doing
that it is possible to read the old type of objects. This is where the
class description in the stream is useful.
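
Here is a rough sketch of such a version-tolerant readObject. Everything about the class is an assumption made for illustration: BuildingInfo, its fields, and the idea that an older version stored a single "address" string in the form "street, zip". Only the current layout is exercised by the round trip below; reading a genuinely old stream needs a file written by the older class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class BuildingInfoDemo {
    /** Hypothetical current version; an assumed older version had "address" instead of street/zip. */
    static class BuildingInfo implements Serializable {
        // Kept identical across versions so that old streams are not rejected outright.
        private static final long serialVersionUID = 1L;
        String street;
        String zip;
        String owner;

        BuildingInfo(String street, String zip, String owner) {
            this.street = street;
            this.zip = zip;
            this.owner = owner;
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            ObjectInputStream.GetField fields = in.readFields();
            owner = (String) fields.get("owner", null);
            // The class description in the stream tells us which fields were written.
            if (fields.getObjectStreamClass().getField("address") != null) {
                // Old layout (assumed format "street, zip"): split it into the new fields.
                String address = (String) fields.get("address", null);
                int comma = address == null ? -1 : address.lastIndexOf(',');
                street = comma < 0 ? address : address.substring(0, comma).trim();
                zip = comma < 0 ? null : address.substring(comma + 1).trim();
            } else {
                street = (String) fields.get("street", null);
                zip = (String) fields.get("zip", null);
            }
        }
    }

    // Serialises and deserialises in memory, going through the custom readObject.
    static BuildingInfo roundTrip(BuildingInfo info) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(info);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (BuildingInfo) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BuildingInfo copy = roundTrip(new BuildingInfo("1 Main St", "12345", "R. Green"));
        System.out.println(copy.street + " / " + copy.zip + " / " + copy.owner);
    }
}
```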

: I don't know to what extent you can trust the readObject to deal with
: a slightly out of date object.

Not at all without help.

/robo
 

Roedy Green

Robert Olofsson said:
Hm, no, I would not say that it is that. For readObject to work
correctly a class description is necessary.

But why does the name of the class need to appear more than once in
the file?

I'd think you would need to describe each class once, then from then
on you could refer back to that. Even if you are describing a
reference in a class, it could point to the class description.

Possibly what happens is it does not bother providing a description of
class B on merely encountering a reference in class A to class B. It
procrastinates until it finds an object of class B.
In the meantime it has to use the long class name to fill out its
descriptor of what a class A is.

It is trading off the double name against the possible overhead of a
chain reaction of included class descriptors that are not of any
relevance in reconstituting the stream.

However, even if that is what they are doing, I still don't see the
need for embedding ANY name more than once.

The other approach they could have taken is not to embed any structural
information, just the actual types of referred objects. They could
just trust that the receiving end had accurate maps of each object.
 

Robert Olofsson

Roedy Green wrote:
: On 17 Sep 2003 21:18:28 GMT, Robert Olofsson wrote or quoted:
: >Hm, no, I would not say that it is that. For readObject to work
: >correctly a class description is necessary.
: but why does the name of the class need to appear more than once in
: the file?

Exercise:
1) Create a file with serialized data from some objects you have now
(for example an instance of BuildingInfo(address, owner)).
2) Change a variable type (split address -> street and zip or similar).
3) Serialize the new object to another file.
4) Try to read back both the files.

The format of the class has changed, so you will not be able to read
the first file. However, if you peek inside the stream with a specially
written readObject, you can see which version the file is, get the
address, add some code that splits it as you want, and then create
an instance of the new version.

For this to work the class description needs to describe the type of
the variables and the name of the variables.

A class description also needs the class name since that is the id for
that structure.

This makes it easy to see why data is the way it is today.

Ok, it should be possible to create a more optimized stream where the
class names are only stored once, however then you would need to store
an id for the String and the String. On average you should probably
save a few bytes (String def: <id> <length> <characters>, class name:
<string id>, variable type: L<string id>, an id would probably be 4
bytes).

The cost of doing this optimization would be a bit extra complexity in
the stream.

Serialized streams are not designed to be the most compressed form of
the data, only a binary compact format, for easy reading and writing.

If you serialize ~100 objects of 4 classes, maybe you could save
~20-50 bytes, not a big win (feel free to try it and give me the
exact count). Class descriptions happen only once in a stream, so in
this case you'd have 4 - 10 class descriptions (java.lang.String will
have one, as will the arrays you have, as will the java.util.HashSet
you have in your objects...). You will have some of the class names
twice, not a big deal.

And as I have stated before, if you want your object streams to be even
more compact, wrap the stream in a GZIP(Input|Output)Stream, normally
resulting in something like a 50% reduction of the size (for the data I
have tried).

/robo
 
