Serialisation inefficiency

Discussion in 'Java' started by Roedy Green, Sep 16, 2003.

  1. Roedy Green

    Roedy Green Guest

    I serialised a simple array of objects. When I looked at it in a hex
    viewer, I discovered it contains the following strings:

    [Lcom.mindprod.replicator.MiniZD;
    com.mindprod.replicator.MiniZD
    Lcom/mindprod/replicator/MiniFD;
    com.mindprod.replicator.MiniFD

    Why would it need to specify each type twice?

    I can possibly see MiniZD appearing twice, once because it is the type of
    the array and once for an object in the array, but not MiniFD, which is
    just a reference from a MiniZD.



    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Sep 16, 2003
    #1

  2. Will Hartung

    Will Hartung Guest

    "Roedy Green" <> wrote in message
    news:...
    > I serialised a simple array of objects. When I looked at it in a hex
    > viewer, I discovered it contains the following strings:
    >
    > [Lcom.mindprod.replicator.MiniZD;
    > com.mindprod.replicator.MiniZD
    > Lcom/mindprod/replicator/MiniFD;
    > com.mindprod.replicator.MiniFD
    >
    > Why would it need to specify each type twice?


    It's not specifying each type twice; it's specifying each type once. An
    array type is a different type from its element class, even though it is
    tied to that class, so this looks pretty straightforward and correct
    to me.
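
    A minimal sketch of that distinction. The class name ArrayTypeNames is made up for the example, and java.lang.String stands in for MiniZD, which is not available here.

```java
// Sketch: a class and the array type built on it are distinct types with distinct names.
class ArrayTypeNames {
    /** Name the runtime reports for the given type. */
    static String nameOf(Class<?> type) {
        return type.getName();
    }

    public static void main(String[] args) {
        System.out.println(nameOf(String.class));   // the element class
        System.out.println(nameOf(String[].class)); // the array type, in "[L...;" form
    }
}
```

    The array name has the same "[L...;" shape as the first string in the hex dump.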

    Regards,

    Will Hartung
    ()
     
    Will Hartung, Sep 17, 2003
    #2

  3. Robert Olofsson

    Roedy Green wrote:
    : [Lcom.mindprod.replicator.MiniZD;

    Array of objects of the class com.mindprod.replicator.MiniZD

    : com.mindprod.replicator.MiniZD

    Hmm, since you don't give the full stream I can't say where this comes
    from. Probably an instance in the stream...

    : Lcom/mindprod/replicator/MiniFD;

    Object of class com.mindprod.replicator.MiniFD

    : com.mindprod.replicator.MiniFD

    Need full stream to give more info.

    : Why would it need to specify each type twice?

    It does not (unless you force it to by calling reset()).
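
    A small sketch of the reset() effect. ResetDemo and Sample are hypothetical names used only for this illustration. It writes two objects of the same class and counts how often the class name occurs in the resulting bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ResetDemo {
    // Hypothetical class, only here to have something to serialize.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        int x;
        Sample(int x) { this.x = x; }
    }

    /** Writes two Sample objects and counts occurrences of the class name in the stream. */
    static int classNameCount(boolean resetBetween) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Sample(1));
            if (resetBetween) {
                out.reset(); // the stream forgets what it has written, class descriptions included
            }
            out.writeObject(new Sample(2));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // ISO-8859-1 maps each byte to one char, so an ASCII class name can be searched as text.
        String raw = new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
        String name = Sample.class.getName();
        int count = 0;
        for (int i = raw.indexOf(name); i >= 0; i = raw.indexOf(name, i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + classNameCount(false));
        System.out.println("with reset:    " + classNameCount(true));
    }
}
```

    Without reset() the second object refers back to the class description already in the stream; with reset() the description is written out again.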

    : I can possibly see MiniZD appearing twice, once because it is the type of
    : the array and once for an object in the array, but not MiniFD, which is
    : just a reference from a MiniZD.

    There is a white paper about how the streams are built. They are quite
    efficient, but gzipping the data will probably save you about 50% (that's
    what I see for the data I serialize; YMMV).

    http://java.sun.com/j2se/1.4.2/docs/guide/serialization/spec/protocol.html#wp10258

    Normally you will find something like:
    magic,
    class 1 description, instance 1, instance 2,
    class 2 description, instance 3,
    instance 4, instance 5....

    where instance 1, 2, 4 are of class 1 and instance 3 and 5 are of
    class 2.

    If your class MiniZD has a variable of the type MiniFD, then the class
    description of MiniZD will have the string MiniFD in it, and you will also
    have the class MiniFD described in its own right in the stream => 2 strings....
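
    This can be checked with a short sketch. MiniA and MiniB here are stand-ins written for the example (nested in a made-up StreamNameCount class), not the real com.mindprod classes. It serializes an array of MiniA, each holding one MiniB, and counts how often each simple name occurs in the bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamNameCount {
    // Stand-in for the referenced class.
    static class MiniB implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;
    }

    // Stand-in for the array element class; it holds a single MiniB reference.
    static class MiniA implements Serializable {
        private static final long serialVersionUID = 1L;
        MiniB b = new MiniB();
    }

    /** Serializes the object into a byte array. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Counts occurrences of an ASCII text in the raw stream bytes. */
    static int occurrences(byte[] stream, String text) {
        String raw = new String(stream, StandardCharsets.ISO_8859_1);
        int count = 0;
        for (int i = raw.indexOf(text); i >= 0; i = raw.indexOf(text, i + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] stream = serialize(new MiniA[] { new MiniA(), new MiniA(), new MiniA() });
        System.out.println("MiniA occurrences: " + occurrences(stream, "MiniA"));
        System.out.println("MiniB occurrences: " + occurrences(stream, "MiniB"));
    }
}
```

    The counts should not grow with the number of array elements: later instances point back at the class descriptions already written.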

    Read the grammar for a better understanding. I did it when I wanted to
    write a Perl server that output Java serialized objects. The grammar
    is quite easy to read and understand.

    Have fun
    /robo
     
    Robert Olofsson, Sep 17, 2003
    #3
  4. Roedy Green

    Roedy Green Guest

    On 17 Sep 2003 16:12:40 GMT, (Robert
    Olofsson) wrote or quoted :

    >Hmm, since you don't give the full stream I can't say where this comes
    >from. Probably an instance in the stream...


    The object being persisted is MiniA[].

    Each MiniA has a single MiniB reference.

    There are no MiniB[]

    The main mystery is solved by noting that MiniA and MiniA[] are
    considered as distinct atomic types and hence the name appears twice.

    That does not explain why MiniB appears twice:

    Lcom/mindprod/replicator/MiniB;
    com.mindprod.replicator.MiniB

    And why the two different separator conventions?

     
    Roedy Green, Sep 17, 2003
    #4
  5. Robert Olofsson

    Roedy Green wrote:

    : the object being persisted is MiniA[]
    : Each MiniA has a single MiniB reference.
    : There are no MiniB[]

    : The main mystery is solved by noting that MiniA and MiniA[] are
    : considered as distinct atomic types and hence the name appears twice.

    That is correct.

    : That does not explain why MiniB appears twice
    If you read my previous post it should be quite clear, but I
    will try to explain a bit more.

    : Lcom/mindprod/replicator/MiniB;
    This one comes from the variable inside MiniA (the class description
    of MiniA). The class description of MiniA says something like:
    "here comes a variable named <variablename> of type
    Lcom/mindprod/replicator/MiniB;".

    : com.mindprod.replicator.MiniB
    This one is the class description of MiniB.

    : And why the two different separator conventions?
    That is a good question; try to get Sun to answer that...
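
    For what it is worth, the two spellings can be reproduced through the public descriptor API. The slash form is the JVM-style type signature that ObjectStreamField.getTypeString() reports for an object field; the dotted form is the class name that ObjectStreamClass.getName() reports. A minimal sketch, where DescriptorNames and Holder are names made up for the example and java.util.Date stands in for MiniB:

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.util.Date;

class DescriptorNames {
    // Hypothetical class with one object reference, playing the role of MiniA.
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        Date when;
    }

    /** Type string recorded for a field inside the owner's class description. */
    static String fieldTypeString(Class<?> owner, String fieldName) {
        ObjectStreamField field = ObjectStreamClass.lookup(owner).getField(fieldName);
        return field.getTypeString();
    }

    /** Name recorded in a class's own description. */
    static String descriptorName(Class<?> type) {
        return ObjectStreamClass.lookup(type).getName();
    }

    public static void main(String[] args) {
        System.out.println(fieldTypeString(Holder.class, "when")); // slash-separated signature
        System.out.println(descriptorName(Date.class));            // dot-separated class name
    }
}
```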

    /robo
     
    Robert Olofsson, Sep 17, 2003
    #5
  6. Roedy Green

    Roedy Green Guest

    On 17 Sep 2003 20:28:43 GMT, (Robert
    Olofsson) wrote or quoted :

    >: com.mindprod.replicator.MiniB
    >This one is the class description of MiniB.


    Most of the references to MiniB objects just seem to point to the
    class description. Why would that one spell it out longhand?

    I presume it has something to do with it being a slightly forward
    reference.

    I think serialised objects are underused. To write the equivalent code
    to accurately write and read a persistent object is a lot of extra
    work, and it must be maintained every time anything changes.

    In debugging it is no problem when I change the format. I just delete
    all persistent files. However, once this program is deployed, I can't
    be so cavalier. I will have to read old objects and write new ones.

    I don't know to what extent you can trust readObject to deal with
    a slightly out-of-date object.
     
    Roedy Green, Sep 17, 2003
    #6
  7. Robert Olofsson

    Roedy Green wrote:
    : Most of the references to MiniB objects just seem to point to the
    : class description. Why would that one spell it out longhand?

    Class descriptions are nice to have. The class description has an id
    for the class and a list of fields (name and type). More info below.
    Class descriptions normally only occur once for each class in an
    ObjectStream.

    : I presume it has something to do with it being a slightly forward
    : reference.

    Hm, no, I would not say that it is that. For readObject to work
    correctly a class description is necessary.

    : I think serialised objects are underused. To write the equivalent code
    : to accurately write and read a persistent object is a lot of extra
    : work, and it must be maintained every time anything changes.

    Yes, serializing is an easy way and serialized objects are compact.
    By wrapping the object stream in a gzip stream about 50% of the size
    can be saved.

    : In debugging it is no problem when I change the format. I just delete
    : all persistent files. However, once this program is deployed, I can't
    : be so cavalier. I will have to read old objects and write new ones.

    That depends. If you only fix bugs in methods, you can declare
    serialVersionUID with a value from the serialver tool.
    If you do change the data in an object, it is possible to write a
    readObject method that can handle multiple versions of a class.
    It is possible to get fields from the stream that only existed in the
    previous version of the class (something like:
    getField ("fieldThatNoLongerExist")) and by doing that it is possible
    to read old types of objects. This is where the class description in
    the stream is useful.
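
    A minimal sketch of the first case, pinning the version id so that method-only changes do not alter it. Account, VersionIdDemo and the value 42L are all made up for the example; in practice the value would be whatever serialver printed for the already deployed class.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical persistent class.
class Account implements Serializable {
    // Pinned explicitly, so editing method bodies cannot change the id.
    // 42L is a placeholder, not a value produced by serialver.
    private static final long serialVersionUID = 42L;

    String owner;
}

class VersionIdDemo {
    /** Returns the id the serialization runtime will use for the class. */
    static long idOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(idOf(Account.class));
    }
}
```

    If the constant is left out, the runtime computes an id from the class's shape, so the computed id can differ after an otherwise harmless edit.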

    : I don't know to what extent you can trust the readObject to deal with
    : a slightly out of date object.

    Not at all without help.

    /robo
     
    Robert Olofsson, Sep 17, 2003
    #7
  8. Roedy Green

    Roedy Green Guest

    On 17 Sep 2003 21:18:28 GMT, (Robert
    Olofsson) wrote or quoted :

    >Hm, no, I would not say that it is that. For readObject to work
    >correctly a class description is necessary.



    But why does the name of the class need to appear more than once in
    the file?

    I'd think you would need to describe each class once, then from then
    on you could refer back to that. Even if you are describing a
    reference in a class, it could point to the class description.

    Possibly what happens is it does not bother providing a description of
    class B on merely encountering a reference in class A to class B. It
    procrastinates until it finds an object of class B.
    In the meantime it has to use the long class name to fill out its
    descriptor of what a class A is.

    It is trading off the double name against the possible overhead of a
    chain reaction of included class descriptors that are not of any
    relevance in reconstituting the stream.

    However, even if that is what they are doing, I still don't see the
    need for embedding ANY name more than once.

    The other approach they could have taken is not to embed any structural
    information, just the actual types of referred objects. They could
    just trust the receiving end had accurate maps of each object.




     
    Roedy Green, Sep 18, 2003
    #8
  9. Robert Olofsson

    Roedy Green wrote:
    : On 17 Sep 2003 21:18:28 GMT, (Robert
    : Olofsson) wrote or quoted :
    : >Hm, no, I would not say that it is that. For readObject to work
    : >correctly a class description is necessary.
    : But why does the name of the class need to appear more than once in
    : the file?

    Exercise:
    1) Create a file with serialized data from some objects you have now
    (for example an instance of BuildingInfo(address, owner)).
    2) Change a variable type (split address -> street and zip or similar).
    3) Serialize the new object to another file.
    4) Try to read back both files.

    The format of the class has changed, so you will not be able to read
    the first file. However, if you peek inside the stream with a specially
    written readObject, you can see which version the file is, get the
    address, add some code that splits it as you want, and then create
    an instance of the new version.
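
    One possible shape for such a readObject, as a sketch under several assumptions rather than a recipe. It uses serialPersistentFields with readFields()/putFields() so that the old "address" field stays readable. The owner field is left out for brevity, and the legacyAddress hook exists only so that a single program can produce an old-layout stream for the demonstration. Splitting at the last space is an invented convention. A real migration would also need the serialVersionUID to match the one in the old files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.io.UncheckedIOException;

class BuildingInfo implements Serializable {
    private static final long serialVersionUID = 1L;

    // Declared serial form: the old "address" field plus its two replacements.
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("address", String.class),
        new ObjectStreamField("street", String.class),
        new ObjectStreamField("zip", String.class),
    };

    String street;
    String zip;

    // Demo-only hook (an assumption of this sketch): when set, the object is
    // written the way the old single-field version would have written it.
    transient String legacyAddress;

    private void writeObject(ObjectOutputStream out) throws IOException {
        ObjectOutputStream.PutField fields = out.putFields();
        if (legacyAddress != null) {
            fields.put("address", legacyAddress); // old layout
        } else {
            fields.put("street", street);         // new layout
            fields.put("zip", zip);
        }
        out.writeFields();
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        ObjectInputStream.GetField fields = in.readFields();
        String address = (String) fields.get("address", (Object) null);
        if (address != null) {
            // Old layout: split "street zip" at the last space.
            int cut = address.lastIndexOf(' ');
            street = cut < 0 ? address : address.substring(0, cut);
            zip = cut < 0 ? "" : address.substring(cut + 1);
        } else {
            street = (String) fields.get("street", (Object) null);
            zip = (String) fields.get("zip", (Object) null);
        }
    }
}

class BuildingInfoDemo {
    /** Serializes the object and reads it straight back. */
    static BuildingInfo roundTrip(BuildingInfo info) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(info);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (BuildingInfo) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    Reading back an object written in the old layout yields street and zip filled in from the single address string, while objects written in the new layout pass through unchanged.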

    For this to work the class description needs to describe the type of
    the variables and the name of the variables.

    A class description also needs the class name since that is the id for
    that structure.

    This makes it easy to see why data is the way it is today.

    Ok, it should be possible to create a more optimized stream where the
    class names are only stored once, however then you would need to store
    an id for the String and the String. On average you should probably
    save a few bytes (String def: <id> <length> <characters>, class name:
    <string id>, variable type: L<string id>, an id would probably be 4
    bytes).

    The cost of doing this optimization would be a bit of extra complexity in
    the stream.

    Serialized streams are not designed to be the most compressed form of
    the data, only a binary compact format, for easy reading and writing.

    If you serialize ~100 objects of 4 classes, maybe you could save
    ~20-50 bytes, not a big win (feel free to try it and give me the
    exact count). Class descriptions happen only once in a stream, so in
    this case you'd have 4-10 class descriptions (java.lang.String will
    have one, as will the arrays you have, as will the java.util.HashSet
    you have in your objects...). You will have some of the class names
    twice; not a big deal.

    And as I have stated before, if you want your object streams to be even
    more compact, wrap the stream in a GZIP(Input|Output)Stream, normally
    resulting in something like a 50% reduction of the size (for the data I
    have tried).
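
    A minimal sketch of that wrapping. GzipSerialization is a made-up helper name, and how much is saved depends entirely on the data, so no particular ratio is claimed here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSerialization {
    /** Serializes the value, optionally passing the object stream through gzip. */
    static byte[] write(Serializable value, boolean gzip) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
            // Closing the object stream also closes (and so finishes) the gzip stream.
            try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
                out.writeObject(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back a value that was written with gzip enabled. */
    static Object readGzipped(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    Comparing write(value, false).length with write(value, true).length on your own objects gives the actual saving for your data.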

    /robo
     
    Robert Olofsson, Sep 18, 2003
    #9
