Encoding of primitives for binary serialization

Discussion in 'Java' started by kb, Apr 9, 2009.

  1. kb

    kb Guest

    Hey,

    I'm implementing binary serialization for primitive data types both in
    java and c++. Also I need to handle serialization/de-serialization
    across java and c++ i.e. serialization from java and de-serialization
    in c++ and vice-versa.

    For this I need to decide an encoding for primitive data types which
    is independent of language and platform. Does any one have some idea
    about such an encoding format.
     
    kb, Apr 9, 2009
    #1

  2. On 09.04.2009 14:56, kb wrote:
    > I'm implementing binary serialization for primitive data types both in
    > java and c++. Also I need to handle serialization/de-serialization
    > across java and c++ i.e. serialization from java and de-serialization
    > in c++ and vice-versa.
    >
    > For this I need to decide an encoding for primitive data types which
    > is independent of language and platform. Does any one have some idea
    > about such an encoding format.


    Why not use Java's serialization format? If you do not want to use that
    and only want to serialize String, char, int, long, float and other
    number types what you basically need is a type tag and a convention
    whether you store numbers in big endian or little endian format.
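
The "type tag plus byte-order convention" idea can be sketched in Java as below. This is only an illustration: the class name `TaggedEncoder` and the tag values are made up here, and big-endian is chosen simply because that is what `DataOutputStream` writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical wire format: one tag byte, then the value in big-endian order.
final class TaggedEncoder {
    // Tag values are invented for this sketch; any agreed-upon constants work.
    static final byte TAG_INT = 1;
    static final byte TAG_LONG = 2;

    static byte[] encodeInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(TAG_INT);
            out.writeInt(value); // DataOutputStream writes the high byte first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static byte[] encodeLong(long value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(TAG_LONG);
            out.writeLong(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeInt(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

A C++ reader would then switch on the first byte and assemble the following bytes with shifts, which works regardless of the host's native byte order.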

    If it does not have to be binary, you can use an existing format, for
    example http://www.yaml.org/ - implementations for Java and C++ do exist
    already. I am sure there are also libraries for XML serialization out
    there.

    Kind regards

    robert
     
    Robert Klemme, Apr 9, 2009
    #2

  3. kb

    Mark Space Guest

    kb wrote:
    > Hey,
    >
    > I'm implementing binary serialization for primitive data types both in
    > java and c++. Also I need to handle serialization/de-serialization
    > across java and c++ i.e. serialization from java and de-serialization
    > in c++ and vice-versa.
    >
    > For this I need to decide an encoding for primitive data types which
    > is independent of language and platform. Does any one have some idea
    > about such an encoding format.



    You might try DataInputStream and DataOutputStream. These classes allow
    you to do basic binary IO on primitives and strings. Even if you do use
    Serialization I think you'll end up overriding the serialization IO
    methods and using Data*Stream classes to do the actual work.

    <http://java.sun.com/docs/books/tutorial/essential/io/datastreams.html>

    However, data streams won't do everything, like little endian formats.
    For that, I think a ByteBuffer and associated classes are best.
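
A minimal sketch of the ByteBuffer route, assuming a little-endian wire format has been agreed on. The class and method names are hypothetical; `ByteBuffer.order` and `ByteOrder.LITTLE_ENDIAN` are the standard java.nio calls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: encodes and decodes a 32-bit int in little-endian order.
final class LittleEndianInts {
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN) // a new buffer defaults to big-endian
                .putInt(value)
                .array();
    }

    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    public static void main(String[] args) {
        byte[] wire = encode(0x01020304);
        System.out.println(wire[0] + " " + wire[1] + " " + wire[2] + " " + wire[3]);
        System.out.println(Integer.toHexString(decode(wire)));
    }
}
```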

    <http://java.sun.com/javase/6/docs/api/java/nio/class-use/ByteBuffer.html>
     
    Mark Space, Apr 9, 2009
    #3
  4. kb

    Roedy Green Guest

    On Thu, 9 Apr 2009 05:56:50 -0700 (PDT), kb <>
    wrote, quoted or indirectly quoted someone who said :

    >For this I need to decide an encoding for primitive data types which
    >is independent of language and platform. Does any one have some idea
    >about such an encoding format


    see http://mindprod.com/jgloss/corba.html

    If you don't have arbitrary records, DataOutputStream would work.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "At this point, 29 percent of fish and seafood species have collapsed - that is,
    their catch has declined by 90 percent. It is a very clear trend, and it is accelerating.
    If the long-term trend continues, all fish and seafood species are projected to collapse
    within my lifetime -- by 2048."
    ~ Dr. Boris Worm of Dalhousie University
     
    Roedy Green, Apr 9, 2009
    #4
  5. kb

    Tom Anderson Guest

    On Thu, 9 Apr 2009, kb wrote:

    > I'm implementing binary serialization for primitive data types both in
    > java and c++. Also I need to handle serialization/de-serialization
    > across java and c++ i.e. serialization from java and de-serialization in
    > c++ and vice-versa.
    >
    > For this I need to decide an encoding for primitive data types which is
    > independent of language and platform. Does any one have some idea about
    > such an encoding format.


    Use the formats used in internet protocols - see pretty much any low-level
    RFC for details. The TCP and IP ones would do. Bytes are bytes, 16- and
    32-bit numbers are written out byte by byte in 'network byte order', ie
    most significant first. In java, use Data{Out,In}putStream for that, and
    in C, the htons/ntohs and htonl/ntohl functions from arpa/inet.h. Not sure
    what you do about 64-bit numbers. You can do signed and unsigned, but be
    aware that in java, which has no native unsigned types, you'll need to use
    the next bigger type to hold unsigneds, eg an unsigned short will need an
    int to hold.
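
The point about unsigned values needing the next bigger type might look like this on the Java side. It is a sketch with hypothetical names (`UnsignedReads`, `readU16`, `readU32`), and it assumes the bytes arrive in network byte order.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helpers that read big-endian unsigned values into wider Java types.
final class UnsignedReads {
    // An unsigned 16-bit value does not fit in a short, so it is returned as an int.
    static int readU16(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // An unsigned 32-bit value does not fit in an int, so it is returned as a long.
    static long readU32(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readInt() & 0xFFFFFFFFL; // mask off the sign extension
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readU16(new byte[] {(byte) 0xFF, (byte) 0xFE}));
        System.out.println(readU32(new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF}));
    }
}
```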

    Floating-point numbers are harder; you might be better off avoiding them
    altogether if possible, but if not, use the IEEE 754 32- and 64-bit
    formats. Again, in java the Data*putStreams do that. I'm not aware of
    standard functions to do it in C, though - if you're on a machine which
    uses 754 natively, you can just pun the float as an int and write that out
    (through the htonl function, i think). On one that doesn't, like an x86,
    you'll need to find a machine-specific library with an encoding function
    in it.
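
For the Java half, writing a float as its IEEE 754 bit pattern can be sketched as follows. The helper names are hypothetical; `Float.floatToIntBits` and `Float.intBitsToFloat` are the standard library calls, and the byte order is assumed to be big-endian to match the integers above.

```java
// Hypothetical helpers: a float as four big-endian bytes of its IEEE 754 bit pattern.
final class FloatWire {
    static byte[] encode(float value) {
        int bits = Float.floatToIntBits(value); // IEEE 754 single-precision layout
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
        };
    }

    static float decode(byte[] b) {
        int bits = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                 | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] wire = encode(1.0f);
        System.out.printf("%02x %02x %02x %02x%n", wire[0], wire[1], wire[2], wire[3]);
        System.out.println(decode(wire));
    }
}
```

The same approach works for double with `Double.doubleToLongBits` and eight bytes.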

    Booleans are bytes - false is 0, true is 1.

    For characters, you're working in unicode (whether you like it or not!),
    and you just have to pick an encoding.

    UTF-16 will let you encode all characters (all the ones you're likely to
    encounter, anyway) in two bytes each, and is simple to do. Characters
    outside the Basic Multilingual Plane take four bytes, as a surrogate pair.

    UTF-8 encodes most latin characters in one byte each, greek, cyrillic,
    hebrew, arabic and a few other scripts in two bytes, and most others in
    three bytes (four for characters outside the Basic Multilingual Plane).
    That makes it a good choice if you're mostly handling western text but a
    poor one if you might be handling southern and eastern asian scripts. It
    has good library support in most languages.

    SCSU encodes all text in a minimal number of bytes (averaging one per
    character for alphabetic scripts, two per character for ideographic
    ones), but is rather complex (and is really a string rather than a
    character encoding); however, there are libraries for doing it in
    java and C.
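
To make the size trade-off between UTF-8 and UTF-16 concrete, here is a small sketch that counts encoded bytes. The class name is hypothetical; the sample characters are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that reports how many bytes a string needs in each encoding.
final class EncodedLengths {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE is used so that no byte-order mark is added to the count.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Latin letter, accented letter, CJK ideograph, and a character outside the BMP.
        String[] samples = {"A", "\u00e9", "\u4e2d", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes in UTF-8, "
                    + utf16Length(s) + " bytes in UTF-16");
        }
    }
}
```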

    There are various ways you could do strings. The best is probably to write
    the string length as an integer, then all the characters one by one. This
    is different from the standard formats in both java and C, but easier to
    implement!
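
A length-prefixed string might be sketched like this. Two choices here are assumptions of the sketch, not something the post specifies: the length counts UTF-8 bytes rather than characters, and it is written as a 32-bit big-endian int.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical format: 4-byte big-endian byte count, then the UTF-8 bytes.
final class LengthPrefixedString {
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length)
                .putInt(utf8.length) // ByteBuffer is big-endian by default
                .put(utf8)
                .array();
    }

    static String decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte[] utf8 = new byte[buf.getInt()];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = encode("hi");
        System.out.println(wire.length + " bytes on the wire");
        System.out.println(decode(wire));
    }
}
```

Prefixing the byte count means the reader knows how much to read before it starts decoding, which tends to be simpler on the C++ side than scanning for a terminator.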

    Alternatively, relax the 'binary' requirement and use JSON.

    tom

    --
    PS I am trying to stab a giant warthog in the arse but it keeps throwing
    me off a bridge :( -- Martin Lewis
     
    Tom Anderson, Apr 9, 2009
    #5
  6. kb

    Arne Vajhøj Guest

    Robert Klemme wrote:
    > I am sure there are also libraries for XML serialization out
    > there.


    Out there as in included with Java.

    Arne
     
    Arne Vajhøj, Apr 10, 2009
    #6
  7. kb

    Arne Vajhøj Guest

    Tom Anderson wrote:
    > On Thu, 9 Apr 2009, kb wrote:
    >> I'm implementing binary serialization for primitive data types both in
    >> java and c++. Also I need to handle serialization/de-serialization
    >> across java and c++ i.e. serialization from java and de-serialization
    >> in c++ and vice-versa.
    >>
    >> For this I need to decide an encoding for primitive data types which
    >> is independent of language and platform. Does any one have some idea
    >> about such an encoding format.

    >
    > Use the formats used in internet protocols - see pretty much any
    > low-level RFC for details. The TCP and IP ones would do. Bytes are
    > bytes, 16- and 32-bit numbers are written out byte by byte in 'network
    > byte order', ie most significant first. In java, use
    > Data{Out,In}putStream for that, and in C, the htons/ntohs and
    > htonl/ntohl functions from arpa/inet.h. Not sure what you do about
    > 64-bit numbers. You can do signed and unsigned, but be aware that in
    > java, which has no native unsigned types, you'll need to use the next
    > bigger type to hold unsigneds, eg an unsigned short will need an int to
    > hold.


    It is not that hard to code htonll and ntohll (or whatever one will call
    them) if 64 bit integers (long long's) are available - and these
    functions would probably not be needed if they were not.
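
The shift-based logic such a 64-bit conversion needs is short. It is sketched here in Java, with hypothetical method names, rather than as the C functions mentioned above; the same loops carry over to C or C++ with an unsigned 64-bit type.

```java
// Hypothetical helpers: a 64-bit value to and from network (big-endian) byte order.
final class Long64Wire {
    static byte[] toNetworkOrder(long value) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (value >>> (56 - 8 * i)); // most significant byte first
        }
        return out;
    }

    static long fromNetworkOrder(byte[] in) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (in[i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] wire = toNetworkOrder(0x0102030405060708L);
        System.out.println(wire[0] + " ... " + wire[7]);
        System.out.println(Long.toHexString(fromNetworkOrder(wire)));
    }
}
```

Because it works with shifts on the value rather than on the memory layout, this gives the same bytes on any host.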

    > Floating-point numbers are harder; you might be better off avoiding them
    > altogether if possible, but if not, use the IEEE 754 32- and 64-bit
    > formats. Again, in java the Data*putStreams do that. I'm not aware of
    > standard functions to do it in C, though - if you're on a machine which
    > uses 754 natively, you can just pun the float as an int and write that
    > out (through the htonl function, i think). On one that doesn't, like an
    > x86, you'll need to find a machine-specific library with an encoding
    > function in it.


    x86 uses IEEE floating point.

    Most real computers do today. Old IBM mainframes and DEC VAX'es did not.

    > Alternatively, relax the 'binary' requirement and use JSON.


    Or XML.

    Arne
     
    Arne Vajhøj, Apr 10, 2009
    #7
  8. kb

    Arne Vajhøj Guest

    Patricia Shanahan wrote:
    > Arne Vajhøj wrote:
    >> Tom Anderson wrote:

    > ...
    >>> Alternatively, relax the 'binary' requirement and use JSON.

    >>
    >> Or XML.

    >
    > If disk space is the reason for using binary, consider compressing a
    > text file.


    Just wrapping the streams in GZIPInputStream/GZIPOutputStream
    can often make it very easy to implement.
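
Wrapping the streams could look like the sketch below. The class name and the in-memory byte arrays are assumptions made to keep the example self-contained; with files, the same wrapping applies to FileOutputStream and FileInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip-compresses text in memory and reads it back.
final class GzipText {
    static byte[] compress(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // complete only after the gzip stream is closed
    }

    static String decompress(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                bytes.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("1.5,2.5,3.5\n");
        }
        byte[] packed = compress(sb.toString());
        System.out.println(sb.length() + " chars -> " + packed.length + " bytes");
    }
}
```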

    Arne
     
    Arne Vajhøj, Apr 10, 2009
    #8
  9. kb

    Roedy Green Guest

    On Thu, 09 Apr 2009 15:30:23 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >
    >If it does not have to be binary, you can use an existing format, for
    >example http://www.yaml.org/ - implementations for Java and C++ do exist
    >already. I am sure there are also libraries for XML serialization out
    >there.


    There is the venerable ASN.1:
    http://mindprod.com/jgloss/asn1.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "The most significant trend in the US industry has been the decline in the amount
    of energy recovered compared to energy expended. In 1916, the ratio was about 28
    to 1, a very handsome energy return. By 1985, the ratio had dropped to 2 to 1,
    and it is still dropping."
    ~ Walter Youngquist, Professor of Geology

    By 2003, it had dropped to 0.5 to 1 in the US, making oil extraction no longer economically viable, no matter how high the price of crude.
     
    Roedy Green, Apr 10, 2009
    #9
  10. kb

    Roedy Green Guest

    On Thu, 09 Apr 2009 09:50:36 -0500, In the Middle of the Pack
    <> wrote, quoted or indirectly quoted someone who
    said :

    >It avoids problems such as big-endian/little-endian, and different floating
    >point specs. on different computers.


    Nowadays it is much simpler. You don't have packed decimal formats. IEEE
    has standardised float. Unicode or UTF-8 is a common exchange
    format for characters.

    I suspect binary will end up being less work than other formats. All
    you have to deal with there is to use LEDataInputStream or
    DataInputStream to deal with the endian problem. With anything else,
    you end up having to write something to parse the chars, unless they
    used CSV.

    see http://mindprod.com/jgloss/csv.htm

    I think CSV is probably today's best interchange format for small
    amounts of data. It is easy for humans to understand. You can import
    it into a spreadsheet to figure out what you have. It is reasonably
    compact.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
     
    Roedy Green, Apr 10, 2009
    #10
  11. kb

    Roedy Green Guest

    On Thu, 09 Apr 2009 10:02:08 -0700, Mark Space
    <> wrote, quoted or indirectly quoted someone
    who said :

    >However, data streams won't do everything, like little endian formats.


    See http://mindprod.com/products1.html#LEDATASTREAM

    LEDataInputStream/LEDataOutputStream behave exactly like
    DataInputStream/DataOutputStream except they are little-endian.

    Presumably your stream is entirely little or big endian.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
     
    Roedy Green, Apr 10, 2009
    #11
  12. kb

    Tom Anderson Guest

    On Thu, 9 Apr 2009, Arne Vajhøj wrote:

    > Tom Anderson wrote:
    >
    >> Floating-point numbers are harder; you might be better off avoiding them
    >> altogether if possible, but if not, use the IEEE 754 32- and 64-bit
    >> formats. Again, in java the Data*putStreams do that. I'm not aware of
    >> standard functions to do it in C, though - if you're on a machine which
    >> uses 754 natively, you can just pun the float as an int and write that out
    >> (through the htonl function, i think). On one that doesn't, like an x86,
    >> you'll need to find a machine-specific library with an encoding function in
    >> it.

    >
    > x86 uses IEEE floating point.


    Yes, of course - it uses a funny 80-bit format in *registers*, but the
    normal 64-bit IEEE format on the heap, which is what matters here. The
    existence of the 80-bit format is only relevant when you're worrying about
    exact reproducibility of calculations.

    tom

    --
    secular utopianism is based on a belief in an unstoppable human ability
    to make a better world -- Rt Rev Tom Wright
     
    Tom Anderson, Apr 10, 2009
    #12
  13. kb

    Arne Vajhøj Guest

    Mark Space wrote:
    > kb wrote:
    >> I'm implementing binary serialization for primitive data types both in
    >> java and c++. Also I need to handle serialization/de-serialization
    >> across java and c++ i.e. serialization from java and de-serialization
    >> in c++ and vice-versa.
    >>
    >> For this I need to decide an encoding for primitive data types which
    >> is independent of language and platform. Does any one have some idea
    >> about such an encoding format.

    >
    > You might try DataInputStream and DataOutputStream. These classes allow
    > you to do basic binary IO on primitives and strings. Even if you do use
    > Serialization I think you'll end up overriding the serialization IO
    > methods and using Data*Stream classes to do the actual work.
    >
    > <http://java.sun.com/docs/books/tutorial/essential/io/datastreams.html>
    >
    > However, data streams won't do everything, like little endian formats.
    > For that, I think a ByteBuffer and associated classes are best.
    >
    > <http://java.sun.com/javase/6/docs/api/java/nio/class-use/ByteBuffer.html>


    Or just use the Data*Stream's and switch the bytes around. It is not
    exactly difficult to code.
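
Switching the bytes around before handing the value to a Data*Stream can be sketched with the standard `Integer.reverseBytes`. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: gets little-endian output from a big-endian DataOutputStream.
final class SwappedWrite {
    static byte[] writeIntLittleEndian(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // Reversing first means the stream's high-byte-first write emits the low byte first.
            out.writeInt(Integer.reverseBytes(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : writeIntLittleEndian(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

`Short.reverseBytes` and `Long.reverseBytes` cover the other integer widths. A float can go through `Float.floatToIntBits` first.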

    Arne
     
    Arne Vajhøj, Apr 20, 2009
    #13
  14. kb

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Thu, 09 Apr 2009 09:50:36 -0500, In the Middle of the Pack
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >> It avoids problems such as big-endian/little-endian, and different floating
    >> point specs. on different computers.

    >
    > Nowadays it much simpler. You don't have packed decimal formats. IEEE
    > has standardardised float. Unicode or UTF-8 is a common exchange
    > format or characters.
    >
    > I suspect binary will end up being less work than other formats. All
    > you have to deal with there is to use LEDataInputStream of
    > DataInputStream to deal with the endian problem.


    Given how simple it is to switch the bytes or even use the
    builtin code in NIO, then I don't see the point in using an
    external lib for it.

    > With anything else,
    > you end up having to write something to parse the chars, unless they
    > used CSV.
    >
    > see http://mindprod.com/jgloss/csv.htm
    >
    > I think CSV is probably today's best interchange format for small
    > amounts of data. It is easy for humans to understand. You can import
    > it into a spreadsheet to figure out what you have. It is reasonably
    > compact.


    XML is usually preferred today.

    Arne
     
    Arne Vajhøj, Apr 20, 2009
    #14
  15. kb

    kb Guest

    It looks like reading/writing real data types (float/double) in binary
    format, in a language and platform independent manner is pretty tough
    to implement. (I've already implemented reading/writing for other data
    types and it is working fine.)

    But given that I have to stick to binary format, the other option is
    to write float/double values as characters i.e. to first convert
    (format) the value to string and then write the string in binary
    format. Clearly this would mean some performance impact.
    Does anybody have an idea as to how much impact will this have on the
    performance? (writing float as byte vs converting the float value to a
    string and then writing the string to the stream)
     
    kb, Apr 22, 2009
    #15
  16. kb wrote:
    > It looks like reading/writing real data types (float/double) in binary
    > format, in a language and platform independent manner is pretty tough
    > to implement. (I've already implemented reading/writing for other data
    > types and it is working fine.)


    Once you get around endian issues, there are no real problems with a binary
    format. I don't know of any major architectures that are not IEEE 754,
    for example; even so, writing a routine to convert a floating-point
    number from IEEE 754 to a native format would not be difficult.
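
As a rough illustration of how little such a conversion routine involves, here is a sketch that decodes an IEEE 754 single-precision bit pattern by hand using only integer operations and a power-of-two scale. The class name is hypothetical, and the result is returned as a Java double purely for convenience.

```java
// Hypothetical decoder: IEEE 754 single-precision bits to a numeric value, by hand.
final class Ieee754Decode {
    static double decode(int bits) {
        int sign = (bits >>> 31) == 0 ? 1 : -1;
        int exponent = (bits >>> 23) & 0xFF;  // 8 exponent bits, biased by 127
        int fraction = bits & 0x7FFFFF;       // 23 fraction bits

        if (exponent == 0xFF) {               // all ones: infinity or NaN
            return fraction != 0 ? Double.NaN : sign * Double.POSITIVE_INFINITY;
        }
        if (exponent == 0) {                  // zero or subnormal: no implicit leading 1
            return sign * Math.scalb((double) fraction, -149);
        }
        // Normal number: restore the implicit leading 1, then scale by 2^(e - 127 - 23).
        return sign * Math.scalb((double) (fraction | 0x800000), exponent - 150);
    }

    public static void main(String[] args) {
        System.out.println(decode(Float.floatToIntBits(1.5f)));
        System.out.println(decode(Float.floatToIntBits(-0.15625f)));
    }
}
```

Going the other way (native value to IEEE 754 bits) needs rounding decisions as well, so it is somewhat longer.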

    > But given that I have to stick to binary format, the other options is
    > to write float/double values as characters i.e. to first convert
    > (format) the value to string and then write the string in binary
    > format. Clearly this would mean some performance impact.


    I'd also be concerned about precision. Converting decimals to and from
    string representations is liable to munge the lowest bits, assuming you
    even get a precise representation.
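
Whether the low bits survive depends on how many digits are written. The sketch below (hypothetical class and method names) compares `Float.toString`, which is specified to print enough digits to distinguish a value from its neighbours, with fixed-width scientific formatting. The claim that 9 significant digits always suffice for single precision is a general IEEE 754 property, not something stated in the post.

```java
import java.util.Locale;

// Hypothetical checks: does a float come back bit-for-bit after a trip through text?
final class TextRoundTrip {
    static boolean sameBits(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }

    static boolean survivesToString(float value) {
        return sameBits(value, Float.parseFloat(Float.toString(value)));
    }

    static boolean survivesFixedDigits(float value, int significantDigits) {
        // %e prints one digit before the decimal point plus 'precision' digits after it.
        String text = String.format(Locale.ROOT, "%." + (significantDigits - 1) + "e", value);
        return sameBits(value, Float.parseFloat(text));
    }

    public static void main(String[] args) {
        float justAboveOne = Math.nextUp(1.0f);
        System.out.println(survivesToString(justAboveOne));
        System.out.println(survivesFixedDigits(justAboveOne, 7));
        System.out.println(survivesFixedDigits(justAboveOne, 9));
    }
}
```

The C++ side would need to print with matching precision for the same guarantee to hold in both directions.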

    > Does anybody have an idea as to how much impact will this have on the
    > performance? (writing float as byte vs converting the float value to a
    > string and then writing the string to the stream)


    A single-precision floating point number will take up exactly four bytes
    in binary. A string representation would have 6-7 characters of
    significant figures, along with a likely decimal point. If the numbers
    are big enough, you'd also have a possible five more digits added
    (e-100, e.g.). So, at worst, your output string would be 13 characters
    long--thrice the size of the binary representation.

    The performance of conversion is a different story. Java's conversion
    actually uses a miniature bignum library to get full precision on input
    and output, so I can't imagine that it's very fast, relatively speaking.

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer, Apr 22, 2009
    #16
  17. kb

    Lew Guest

    Joshua Cranmer wrote:
    > The performance of conversion is a different story. Java's conversion
    > actually uses a miniature bignum library to get full precision on input
    > and output, so I can't imagine that it's very fast, relatively speaking.


    Relative to what? The question is about reading and writing; I/O will
    dominate the performance question. Bignum conversion should be very fast,
    relatively speaking.

    --
    Lew
     
    Lew, Apr 22, 2009
    #17
  18. kb

    Arne Vajhøj Guest

    Joshua Cranmer wrote:
    > kb wrote:
    >> It looks like reading/writing real data types (float/double) in binary
    >> format, in a language and platform independent manner is pretty tough
    >> to implement. (I've already implemented reading/writing for other data
    >> types and it is working fine.)

    >
    > Once you get around endian issues, there's no real problems to a binary
    > format.


    It is easier to work with text formats, because the content is
    directly visible instead of having to work with hex dump.

    > I don't know of any major architectures that are not IEEE 754,
    > for example;


    There are still a lot of old data around - various IBM, VAX etc.

    >> But given that I have to stick to binary format, the other options is
    >> to write float/double values as characters i.e. to first convert
    >> (format) the value to string and then write the string in binary
    >> format. Clearly this would mean some performance impact.

    >
    > I'd also be concerned about precision. Converting decimals to and from
    > string representations is liable to munge the lowest bits, assuming you
    > even get a precise representation.


    On the other hand - if the lowest bits were significant, then floating
    point should not have been used in the first place.

    >> Does anybody have an idea as to how much impact will this have on the
    >> performance? (writing float as byte vs converting the float value to a
    >> string and then writing the string to the stream)

    >
    > A single-precision floating point number will take up exactly four bytes
    > in binary. A string representation would have 6-7 characters of
    > significant figures, along with a likely decimal point. If the numbers
    > are big enough, you'd also have a possible five more digits added
    > (e-100, e.g.). So, at worst, your output string would be 13 characters
    > long--thrice the size of the binary representation.


    2 digits is enough for single precision exponent, but the number may
    be negative, so 13 it is.

    -d.ddddddE+dd
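
The 13-character pattern can be checked with a fixed-width formatter. This is a sketch with a hypothetical helper name. Note that it assumes 7 significant digits as in the pattern above. Writing 9 significant digits, which is what single precision needs in the worst case to be read back exactly, gives 15 characters instead.

```java
import java.util.Locale;

// Hypothetical helper: formats a float in scientific notation with a fixed digit count.
final class TextWidth {
    static String scientific(float value, int significantDigits) {
        // One digit before the point, the rest after it; Locale.ROOT keeps '.' as the separator.
        return String.format(Locale.ROOT, "%." + (significantDigits - 1) + "E", value);
    }

    public static void main(String[] args) {
        String seven = scientific(-1.5e-30f, 7);
        String nine = scientific(-1.5e-30f, 9);
        System.out.println(seven + " has " + seven.length() + " characters");
        System.out.println(nine + " has " + nine.length() + " characters");
    }
}
```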

    Arne
     
    Arne Vajhøj, May 3, 2009
    #18