Encoding of primitives for binary serialization

kb · Apr 9, 2009

Hey,

I'm implementing binary serialization for primitive data types both in
java and c++. Also I need to handle serialization/de-serialization
across java and c++ i.e. serialization from java and de-serialization
in c++ and vice-versa.

For this I need to decide on an encoding for primitive data types which
is independent of language and platform. Does anyone have an idea
about such an encoding format?

Robert Klemme · Apr 9, 2009

I'm implementing binary serialization for primitive data types both in
java and c++. Also I need to handle serialization/de-serialization
across java and c++ i.e. serialization from java and de-serialization
in c++ and vice-versa.

For this I need to decide an encoding for primitive data types which
is independent of language and platform. Does any one have some idea
about such an encoding format.

Why not use Java's serialization format? If you do not want to use that
and only want to serialize String, char, int, long, float and other
number types what you basically need is a type tag and a convention
whether you store numbers in big endian or little endian format.
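
As a rough illustration of the "type tag plus byte-order convention" idea, here is a minimal Java sketch. The class name `TaggedWriter` and the tag values are made up for this example; the only fixed points are that both sides must agree on the tags and that `DataOutputStream` writes multi-byte values most significant byte first.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TaggedWriter {
    // Hypothetical one-byte type tags; the values are arbitrary,
    // but the C++ reader must use the same ones.
    static final byte TAG_INT = 1;
    static final byte TAG_DOUBLE = 2;

    // Encodes an int as: tag byte, then 4 bytes, most significant first.
    static byte[] encodeInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(TAG_INT);
            out.writeInt(value); // DataOutputStream always writes big endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Encodes a double as: tag byte, then the 8 bytes of its IEEE 754 bit pattern.
    static byte[] encodeDouble(double value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(TAG_DOUBLE);
            out.writeDouble(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeInt(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

A reader on the other side would read one byte, switch on the tag, and then read the fixed number of bytes that the tag implies.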

If it does not have to be binary, you can use an existing format, for
example http://www.yaml.org/ - implementations for Java and C++ do exist
already. I am sure there are also libraries for XML serialization out
there.

Kind regards

robert

Mark Space · Apr 9, 2009

kb said:
Hey,

I'm implementing binary serialization for primitive data types both in
java and c++. Also I need to handle serialization/de-serialization
across java and c++ i.e. serialization from java and de-serialization
in c++ and vice-versa.

For this I need to decide an encoding for primitive data types which
is independent of language and platform. Does any one have some idea
about such an encoding format.

You might try DataInputStream and DataOutputStream. These classes allow
you to do basic binary IO on primitives and strings. Even if you do use
Serialization I think you'll end up overriding the serialization IO
methods and using Data*Stream classes to do the actual work.

<http://java.sun.com/docs/books/tutorial/essential/io/datastreams.html>

However, data streams won't do everything, like little endian formats.
For that, I think a ByteBuffer and associated classes are best.
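
A small sketch of the ByteBuffer route, assuming the whole stream is agreed to be little endian. The class and method names are invented for the example; `ByteBuffer.order` and `ByteOrder.LITTLE_ENDIAN` are the standard NIO calls.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianDemo {
    // Writes an int followed by a double, both in little-endian byte order.
    static byte[] encode(int i, double d) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Double.BYTES);
        buf.order(ByteOrder.LITTLE_ENDIAN); // the default would be big endian
        buf.putInt(i).putDouble(d);
        return buf.array();
    }

    // Reads the int back from offset 0.
    static int decodeInt(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // Reads the double back from the offset just after the int.
    static double decodeDouble(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getDouble(Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] wire = encode(0x01020304, 1.5);
        System.out.println(decodeInt(wire) + " " + decodeDouble(wire));
    }
}
```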

<http://java.sun.com/javase/6/docs/api/java/nio/class-use/ByteBuffer.html>

Roedy Green · Apr 9, 2009

For this I need to decide an encoding for primitive data types which
is independent of language and platform. Does any one have some idea
about such an encoding format

see http://mindprod.com/jgloss/corba.html

If you don't have arbitrary records, DataOutputStream would work.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"At this point, 29 percent of fish and seafood species have collapsed - that is,
their catch has declined by 90 percent. It is a very clear trend, and it is accelerating.
If the long-term trend continues, all fish and seafood species are projected to collapse
within my lifetime -- by 2048."
~ Dr. Boris Worm of Dalhousie University

Tom Anderson · Apr 9, 2009

I'm implementing binary serialization for primitive data types both in
java and c++. Also I need to handle serialization/de-serialization
across java and c++ i.e. serialization from java and de-serialization in
c++ and vice-versa.

For this I need to decide an encoding for primitive data types which is
independent of language and platform. Does any one have some idea about
such an encoding format.

Use the formats used in internet protocols - see pretty much any low-level
RFC for details. The TCP and IP ones would do. Bytes are bytes, 16- and
32-bit numbers are written out byte by byte in 'network byte order', ie
most significant first. In java, use Data{Out,In}putStream for that, and
in C, the htons/ntohs and htonl/ntohl functions from arpa/inet.h. Not sure
what you do about 64-bit numbers. You can do signed and unsigned, but be
aware that in java, which has no native unsigned types, you'll need to use
the next bigger type to hold unsigneds, eg an unsigned short will need an
int to hold.
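
To make the point about unsigned values concrete, here is a hedged Java sketch that reads an unsigned 16-bit and an unsigned 32-bit value from network-order bytes into the next bigger signed type. The class name and the wire layout (a u16 followed by a u32) are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class UnsignedRead {
    // Decodes a big-endian unsigned 16-bit value followed by an
    // unsigned 32-bit value. Returns {u16, u32} as longs.
    static long[] decode(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            int u16 = in.readUnsignedShort();       // 0..65535 fits in an int
            long u32 = in.readInt() & 0xFFFFFFFFL;  // mask off the sign extension
            return new long[] {u16, u32};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = {(byte) 0xFF, (byte) 0xFE, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFD};
        long[] values = decode(wire);
        System.out.println(values[0] + " " + values[1]);
    }
}
```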

Floating-point numbers are harder; you might be better off avoiding them
altogether if possible, but if not, use the IEEE 754 32- and 64-bit
formats. Again, in java the Data*putStreams do that. I'm not aware of
standard functions to do it in C, though - if you're on a machine which
uses 754 natively, you can just pun the float as an int and write that out
(through the htonl function, i think). On one that doesn't, like an x86,
you'll need to find a machine-specific library with an encoding function
in it.
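
On the Java side, the "pun the float as an int" step can be sketched with `Float.floatToIntBits` and `Float.intBitsToFloat`, which expose the IEEE 754 single-precision bit pattern. The class name and the explicit shifting are illustrative only; `DataOutputStream.writeFloat` produces the same four bytes.

```java
public class FloatBits {
    // Encodes a float as its IEEE 754 bit pattern, most significant byte first.
    static byte[] encodeFloat(float f) {
        int bits = Float.floatToIntBits(f);
        return new byte[] {
            (byte) (bits >>> 24),
            (byte) (bits >>> 16),
            (byte) (bits >>> 8),
            (byte) bits
        };
    }

    // Rebuilds the float from four big-endian bytes.
    static float decodeFloat(byte[] b) {
        int bits = ((b[0] & 0xFF) << 24)
                 | ((b[1] & 0xFF) << 16)
                 | ((b[2] & 0xFF) << 8)
                 | (b[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (byte b : encodeFloat(1.0f)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```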

Booleans are bytes - false is 0, true is 1.

For characters, you're working in unicode (whether you like it or not!),
and you just have to pick an encoding. UTF-16 will let you encode all
characters (all the ones you're likely to encounter, anyway) in two bytes
each, and is simple to do. UTF-8 encodes most latin characters in one byte
each, greek, cyrillic, hebrew, arabic and a few other scripts in two
bytes, and all others in three bytes, making it a good choice if you're
mostly handling western text but a poor one if you might be handling
southern and eastern asian scripts, and has good library support in most
languages. SCSU encodes all text in a minimal number of bytes (averaging
one per character for alphabetic scripts, two per character for
ideographic ones), but is rather complex (and is really a string rather
than a character encoding); however, there are libraries for doing it in
java and C.

There are various ways you could do strings. The best is probably to write
the string length as an integer, then all the characters one by one. This
is different from the standard formats in both java and C, but easier to
implement!
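
A minimal sketch of the length-prefixed string layout. Two details here are assumptions rather than anything stated above: the text is encoded as UTF-8, and the prefix counts bytes rather than characters, which lets a reader skip the string without decoding it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedString {
    // Layout: 4-byte big-endian byte count, then the UTF-8 bytes.
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + utf8.length);
        buf.putInt(utf8.length).put(utf8);
        return buf.array();
    }

    // Reads one string starting at the buffer's current position.
    static String decode(ByteBuffer buf) {
        int length = buf.getInt();
        byte[] utf8 = new byte[length];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = encode("h\u00e9llo");
        System.out.println(wire.length + " bytes, decoded: " + decode(ByteBuffer.wrap(wire)));
    }
}
```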

Alternatively, relax the 'binary' requirement and use JSON.

tom

Arne Vajhøj · Apr 10, 2009

Robert said:
I am sure there are also libraries for XML serialization out
there.

Out there as in included with Java.

Arne

Arne Vajhøj · Apr 10, 2009

Tom said:
Use the formats used in internet protocols - see pretty much any
low-level RFC for details. The TCP and IP ones would do. Bytes are
bytes, 16- and 32-bit numbers are written out byte by byte in 'network
byte order', ie most significant first. In java, use
Data{Out,In}putStream for that, and in C, the htons/ntohs and
htonl/ntohl functions from arpa/inet.h. Not sure what you do about
64-bit numbers. You can do signed and unsigned, but be aware that in
java, which has no native unsigned types, you'll need to use the next
bigger type to hold unsigneds, eg an unsigned short will need an int to
hold.

It is not that hard to code htonll and ntohll (or whatever one will call
them) if 64 bit integers (long long's) are available - and these
functions would probably not be needed if they were not.
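
For comparison, the byte shuffling that such a 64-bit helper would do can be sketched in Java, where it is independent of the host byte order. The names `toNetworkOrder` and `fromNetworkOrder` are invented for this example; a C version would apply the same shifts to an unsigned 64-bit integer.

```java
public class Long64 {
    // Writes a 64-bit value as 8 bytes, most significant byte first.
    static byte[] toNetworkOrder(long value) {
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) value; // take the lowest 8 bits
            value >>>= 8;
        }
        return out;
    }

    // Rebuilds the 64-bit value from 8 big-endian bytes.
    static long fromNetworkOrder(byte[] in) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (in[i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] wire = toNetworkOrder(0x0102030405060708L);
        System.out.println(Long.toHexString(fromNetworkOrder(wire)));
    }
}
```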

Floating-point numbers are harder; you might be better off avoiding them
altogether if possible, but if not, use the IEEE 754 32- and 64-bit
formats. Again, in java the Data*putStreams do that. I'm not aware of
standard functions to do it in C, though - if you're on a machine which
uses 754 natively, you can just pun the float as an int and write that
out (through the htonl function, i think). On one that doesn't, like an
x86, you'll need to find a machine-specific library with an encoding
function in it.

x86 uses IEEE floating point.

Most real computers do today. Old IBM mainframes and DEC VAX'es did not.

Alternatively, relax the 'binary' requirement and use JSON.

Or XML.

Arne

Arne Vajhøj · Apr 10, 2009

Patricia said:
If disk space is the reason for using binary, consider compressing a
text file.

Just wrapping the streams in GZIPInputStream/GZIPOutputStream
can often make it very easy to implement.
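
A short sketch of that wrapping, using in-memory streams so it is self-contained. The class and method names are made up; `InputStream.readAllBytes` needs Java 9 or later, so on older versions a read loop would be needed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipText {
    // Writes text through a GZIPOutputStream; closing the writer finishes the gzip stream.
    static byte[] compress(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the text back through a GZIPInputStream.
    static String decompress(byte[] gzipped) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sample.append("1.5,2.5,3.5\n");
        }
        byte[] gzipped = compress(sample.toString());
        System.out.println(sample.length() + " chars -> " + gzipped.length + " bytes");
    }
}
```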

Arne

Roedy Green · Apr 10, 2009

If it does not have to be binary, you can use an existing format, for
example http://www.yaml.org/ - implementations for Java and C++ do exist
already. I am sure there are also libraries for XML serialization out
there.

There is the venerable ASN.1
http://mindprod.com/jgloss/asn1.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"The most significant trend in the US industry has been the decline in the amount
of energy recovered compared to energy expended. In 1916, the ratio was about 28
to 1, a very handsome energy return. By 1985, the ratio had dropped to 2 to 1,
and it is still dropping."
~ Walter Youngquist, Professor of Geology

By 2003, it had dropped to 0.5 to 1 in the US, making oil extraction no longer economically viable, no matter how high the price of crude.

Roedy Green · Apr 10, 2009

It avoids problems such as big-endian/little-endian, and different floating
point specs. on different computers.

Nowadays it is much simpler. You don't have packed decimal formats. IEEE
has standardised float. Unicode or UTF-8 is a common exchange
format for characters.

I suspect binary will end up being less work than other formats. All
you have to deal with there is to use LEDataInputStream or
DataInputStream to deal with the endian problem. With anything else,
you end up having to write something to parse the chars, unless they
used CSV.

see http://mindprod.com/jgloss/csv.htm

I think CSV is probably today's best interchange format for small
amounts of data. It is easy for humans to understand. You can import
it into a spreadsheet to figure out what you have. It is reasonably
compact.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Roedy Green · Apr 10, 2009

However, data streams won't do everything, like little endian formats.

See http://mindprod.com/products1.html#LEDATASTREAM

LEDataInputStream/LEDataOutputStream behave exactly like
DataInputStream/DataOutputStream except they are little-endian.

Presumably your stream is entirely little or big endian.

--
Roedy Green Canadian Mind Products
http://mindprod.com

Tom Anderson · Apr 10, 2009

x86 uses IEEE floating point.

Yes, of course - it uses a funny 80-bit format in *registers*, but the
normal 64-bit IEEE format on the heap, which is what matters here. The
existence of the 80-bit format is only relevant when you're worrying about
exact reproducibility of calculations.

tom

Arne Vajhøj · Apr 20, 2009

Mark said:
You might try DataInputStream and DataOutputStream. These classes allow
you to do basic binary IO on primitives and strings. Even if you do use
Serialization I think you'll end up overriding the serialization IO
methods and using Data*Stream classes to do the actual work.

<http://java.sun.com/docs/books/tutorial/essential/io/datastreams.html>

However, data streams won't do everything, like little endian formats.
For that, I think a ByteBuffer and associated classes are best.

<http://java.sun.com/javase/6/docs/api/java/nio/class-use/ByteBuffer.html>

Or just use the Data*Stream's and switch the bytes around. It is not
exactly difficult to code.
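
One way to "switch the bytes around" while still using the Data*Stream classes is sketched below. `Integer.reverseBytes`, `Long.reverseBytes` and `Double.doubleToLongBits` are standard library methods; the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SwappedWriter {
    // Writes an int in little-endian order through a big-endian DataOutputStream.
    static byte[] littleEndianInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(Integer.reverseBytes(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Writes a double in little-endian order by swapping its IEEE 754 bit pattern.
    static byte[] littleEndianDouble(double value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(Long.reverseBytes(Double.doubleToLongBits(value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : littleEndianInt(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```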

Arne

Arne Vajhøj · Apr 20, 2009

Roedy said:
Nowadays it is much simpler. You don't have packed decimal formats. IEEE
has standardised float. Unicode or UTF-8 is a common exchange
format for characters.

I suspect binary will end up being less work than other formats. All
you have to deal with there is to use LEDataInputStream or
DataInputStream to deal with the endian problem.

Given how simple it is to switch the bytes or even use the
builtin code in NIO, then I don't see the point in using an
external lib for it.

With anything else,
you end up having to write something to parse the chars, unless they
used CSV.

see http://mindprod.com/jgloss/csv.htm

I think CSV is probably today's best interchange format for small
amounts of data. It is easy for humans to understand. You can import
it into a spreadsheet to figure out what you have. It is reasonably
compact.

XML is usually preferred today.

Arne

kb · Apr 22, 2009

It looks like reading/writing real data types (float/double) in binary
format, in a language and platform independent manner is pretty tough
to implement. (I've already implemented reading/writing for other data
types and it is working fine.)

But given that I have to stick to binary format, the other option is
to write float/double values as characters i.e. to first convert
(format) the value to string and then write the string in binary
format. Clearly this would mean some performance impact.
Does anybody have an idea as to how much impact will this have on the
performance? (writing float as byte vs converting the float value to a
string and then writing the string to the stream)

Joshua Cranmer · Apr 22, 2009

kb said:
It looks like reading/writing real data types (float/double) in binary
format, in a language and platform independent manner is pretty tough
to implement. (I've already implemented reading/writing for other data
types and it is working fine.)

Once you get around endian issues, there's no real problems to a binary
format. I don't know of any major architectures that are not IEEE 754,
for example; even so, writing a routine to convert a floating-point
number from IEEE 754 to a native format would not be difficult.

But given that I have to stick to binary format, the other option is
to write float/double values as characters i.e. to first convert
(format) the value to string and then write the string in binary
format. Clearly this would mean some performance impact.

I'd also be concerned about precision. Converting decimals to and from
string representations is liable to munge the lowest bits, assuming you
even get a precise representation.

Does anybody have an idea as to how much impact will this have on the
performance? (writing float as byte vs converting the float value to a
string and then writing the string to the stream)

A single-precision floating point number will take up exactly four bytes
in binary. A string representation would have 6-7 characters of
significant figures, along with a likely decimal point. If the numbers
are big enough, you'd also have a possible five more digits added
(e-100, e.g.). So, at worst, your output string would be 13 characters
long--thrice the size of the binary representation.

The performance of conversion is a different story. Java's conversion
actually uses a miniature bignum library to get full precision on input
and output, so I can't imagine that it's very fast, relatively speaking.

Lew · Apr 22, 2009

Joshua said:
The performance of conversion is a different story. Java's conversion
actually uses a miniature bignum library to get full precision on input
and output, so I can't imagine that it's very fast, relatively speaking.

Relative to what? The question is about reading and writing; I/O will
dominate the performance question. Bignum conversion should be very fast,
relatively speaking.

Arne Vajhøj · May 3, 2009

Joshua said:
Once you get around endian issues, there's no real problems to a binary
format.

It is easier to work with text formats, because the content is
directly visible instead of having to work with a hex dump.

I don't know of any major architectures that are not IEEE 754,
for example;

There are still a lot of old data around - various IBM, VAX etc..

I'd also be concerned about precision. Converting decimals to and from
string representations is liable to munge the lowest bits, assuming you
even get a precise representation.

On the other side - if the lowest bits were significant, then floating
point should not have been used in the first place.

A single-precision floating point number will take up exactly four bytes
in binary. A string representation would have 6-7 characters of
significant figures, along with a likely decimal point. If the numbers
are big enough, you'd also have a possible five more digits added
(e-100, e.g.). So, at worst, your output string would be 13 characters
long--thrice the size of the binary representation.

2 digits is enough for single precision exponent, but the number may
be negative, so 13 it is.

-d.ddddddE+dd
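
For anyone who wants to check the size and precision points on the Java side, here is a small sketch. It relies on the documented behaviour that `Float.toString` emits enough digits to distinguish the value from its neighbours, so `Float.parseFloat` recovers the same bits. Note that an exact round trip can need up to nine significant digits, so the text can be a little longer than a seven-digit estimate. The class and method names are made up.

```java
public class FloatText {
    // True if formatting the value as text and parsing it back yields the same bit pattern.
    static boolean roundTrips(float value) {
        float parsed = Float.parseFloat(Float.toString(value));
        return Float.floatToIntBits(parsed) == Float.floatToIntBits(value);
    }

    // Number of characters in Java's default text form of the value.
    static int textLength(float value) {
        return Float.toString(value).length();
    }

    public static void main(String[] args) {
        float[] samples = {0.1f, -1.2345678e-30f, Float.MAX_VALUE, Float.MIN_VALUE};
        for (float sample : samples) {
            System.out.println(Float.toString(sample)
                    + " length=" + textLength(sample)
                    + " roundTrips=" + roundTrips(sample));
        }
    }
}
```

Whether the same holds on the C++ side depends on how many digits the formatting call is asked to print there.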

Arne
