object serialisation

feng

Hi,
I can't understand the object serialisation concept. The definition
is attached below. Objects are already a sequence of bits stored in a
file or memory buffer, so why do all references define serialisation
as a means to convert an object to a sequence of bits? I thought
casting a block of bits would be enough to resurrect the object or any
other data structure. Any clarification, please?


In computer science, in the context of data storage and transmission,
serialization is the process of converting a data structure or object
into a sequence of bits so that it can be stored in a file or memory
buffer, or transmitted across a network connection link to be
"resurrected" later in the same or another computer environment.[1]
When the resulting series of bits is reread according to the
serialization format, it can be used to create a semantically
identical clone of the original object.
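A minimal sketch of that definition using Java's built-in `java.io` serialization; the `Point` class and all names here are invented for illustration:

```java
import java.io.*;

public class RoundTrip {
    // A simple serializable example type (hypothetical, not from the thread).
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point roundTrip(Point p) {
        try {
            // Serialize: object -> sequence of bytes.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            // Deserialize: bytes -> semantically identical clone.
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Point) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Point clone = roundTrip(new Point(3, 4));
        System.out.println(clone.x + "," + clone.y); // prints "3,4"
    }
}
```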
 
Arne Vajhøj

I can't understand the object serialisation concept. The definition
is attached below. Objects are already a sequence of bits stored in a
file or memory buffer, so why do all references define serialisation
as a means to convert an object to a sequence of bits? I thought
casting a block of bits would be enough to resurrect the object or any
other data structure. Any clarification, please?

In computer science, in the context of data storage and transmission,
serialization is the process of converting a data structure or object
into a sequence of bits so that it can be stored in a file or memory
buffer, or transmitted across a network connection link to be
"resurrected" later in the same or another computer environment.[1]
When the resulting series of bits is reread according to the
serialization format, it can be used to create a semantically
identical clone of the original object.

Serialization is a little bit more complex than just reading/writing
to/from memory.

One problem is references, which are most likely stored as an
address. But just because that address is 1000 within the sender's
JVM does not mean that it will be 1000 in the receiver's JVM.
That needs to be fixed up.
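This fix-up is visible in standard Java serialization: `ObjectOutputStream` writes each object once and replaces further references to it with stream handles, so sharing survives the trip even though the "addresses" change. A small sketch (class and variable names are illustrative):

```java
import java.io.*;
import java.util.Date;

public class SharedRefs {
    // Generic serialize-then-deserialize helper.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Date d = new Date();
        Object[] copy = roundTrip(new Object[] { d, d }); // same object twice
        // The receiver gets a fresh object at a new "address", but the
        // sharing is preserved: both slots point at the same clone.
        System.out.println(copy[0] == copy[1]); // true
        System.out.println(copy[0] == d);       // false: it is a clone
    }
}
```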

Fields marked with the transient keyword are not serialized,
so they need to be excluded.
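For example (the `Login` class is invented for illustration), a `transient` field is skipped on write and comes back as its default value:

```java
import java.io.*;

public class TransientDemo {
    // Hypothetical example class; field initializers do NOT re-run on
    // deserialization, so the transient field comes back as null.
    static class Login implements Serializable {
        String user = "feng";
        transient String password = "secret"; // excluded from the stream
    }

    static Login roundTrip(Login l) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(l);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Login) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Login copy = roundTrip(new Login());
        System.out.println(copy.user);     // prints "feng"
        System.out.println(copy.password); // prints "null"
    }
}
```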

It is also necessary to verify that the object and all objects
referenced from it are indeed serializable, which requires
the process to walk through everything.
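That walk is why `ObjectOutputStream.writeObject` throws `NotSerializableException` when any object reachable from the root is not serializable. A small illustrative sketch (`Holder` is an invented class):

```java
import java.io.*;

public class SerializableCheck {
    static class Holder implements Serializable {
        Thread worker = new Thread(); // Thread is NOT Serializable
    }

    static boolean canSerialize(Object obj) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj); // walks the whole object graph
            return true;
        } catch (NotSerializableException e) {
            return false; // some reachable object was not serializable
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize("hello"));      // prints "true"
        System.out.println(canSerialize(new Holder())); // prints "false"
    }
}
```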

Plus all the stuff that I can not think of right now.

Arne
 
markspace

I thought casting a
block of bits would be enough to resurrect the object or any other
data structure. Any clarification, please?


This, I think, is where you are off. Specifically, serialization
tries to avoid casting entire blocks of bits by putting an official
protocol around the bits. Casting is bad, C is bad, writing raw
structures to a storage device or network is bad. There are just too
many things that are "internal" to that struct that could mess you up.

How, for example, do you cast a block of raw data from a little
endian machine to an object to be used by a big endian machine? What
about pointer sizes or int sizes? This stuff just does not work with
a "cast"; it's going to mess you up.
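This is exactly what a serialization protocol pins down. Java's `DataOutputStream` (and the serialization format built on it) always writes multibyte values in big-endian network order regardless of the host CPU, whereas a raw memory dump reflects whatever the hardware uses. A sketch, with the little-endian side simulated via `ByteBuffer`:

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ByteOrderDemo {
    // Protocol-defined order: always big-endian, on every machine.
    static byte[] bigEndian(int v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(v);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // What a raw "cast" would see on a little-endian host (simulated).
    static byte[] littleEndian(int v) {
        return ByteBuffer.allocate(4)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .putInt(v)
                         .array();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(bigEndian(0x01020304)));    // [1, 2, 3, 4]
        System.out.println(Arrays.toString(littleEndian(0x01020304))); // [4, 3, 2, 1]
    }
}
```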

Just because it's going to another JVM doesn't mean you can assume this
stuff is consistent. Sometimes it's not even going to a JVM. There are
standards and other things that can read that object byte stream.

Thus the whole point of serialization is to make a consistent protocol
that can be used on any machine. Writing raw structs is pretty much the
polar opposite. Hopefully even die-hard low level C fans understand
this is a bad idea by now.
 
Stefan Ram

feng said:
I can't understand the object serialisation concept.

The state of the object and possibly more information (that
can be used to rebuild an equivalent object) is converted to
a string representation.
objects are already sequence of bits stored in a file or
memory buffer.

An object is an entity that accepts and sends messages,
usually according to certain interface specifications.
 
Eric Sosman

Hi,
I can't understand the object serialisation concept. The definition
is attached below. Objects are already a sequence of bits stored in a
file or memory buffer. [...]

Stop right there.

String str = "feng";
String[] arr = { str, str, "Hello, world!", str };

Now, consider the "sequence of bits" that makes up `arr'. Observe
that the letter 'f' appears three times in the array's contents,
yet there is only one 'f' in the program. How can one single 'f'
appear three times in a "sequence of bits" without triplication?

"Ah," I hear you say, "The elements of `arr' are not Strings,
but references to Strings. There are three references to the lone
String containing an 'f', and you, Sosman, are fffull of it!"

... but keep going: Suppose you gather up the "sequence of bits"
that represent those references, and send them across a network to a
remote JVM. Question: What can the remote JVM *do* with them? Does
it even know about the `f'? No, it only received the references as
a "sequence of bits," not the Strings those references referred to.
You sent an array of references to Strings, but you didn't send the
Strings themselves, and the remote JVM is helpless.

In serialization, both the references *and* the things they refer
to are packaged up for transmission. When you consider that a single
object may hold references to many other objects, which in turn may
hold references to still more objects, I think you'll see that the
problem is not merely transmitting a block of memory, but transmitting
an encoding of an arbitrarily complicated graph. The graph may even
contain cycles: consider a tree-like structure where each node has
references to its children *and* each child refers to its parent, for
example.
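A sketch of such a cyclic parent/child graph going through standard Java serialization, which copes with the cycle via its internal handle table (the `Node` class is illustrative):

```java
import java.io.*;
import java.util.*;

public class GraphDemo {
    static class Node implements Serializable {
        Node parent;                             // child -> parent: a cycle
        List<Node> children = new ArrayList<>(); // parent -> children
    }

    static Node roundTrip(Node n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(n);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Node) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node child = new Node();
        root.children.add(child);
        child.parent = root;        // the graph now contains a cycle

        Node copy = roundTrip(root);
        // The whole graph travelled, cycle included.
        System.out.println(copy.children.get(0).parent == copy); // true
    }
}
```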
file or memory buffer, so why do all references define serialisation
as a means to convert an object to a sequence of bits? I thought
casting a block of bits would be enough to resurrect the object or any
other data structure. Any clarification, please?

By now, I hope you understand that "sequence of bits" is inadequate
as a model for a Java object.
 
Roedy Green

I can't understand the object serialisation concept. The definition
is attached below. Objects are already a sequence of bits stored in a
file or memory buffer, so why do all references define serialisation
as a means to convert an object to a sequence of bits? I thought
casting a block of bits would be enough to resurrect the object or any
other data structure. Any clarification, please?

see http://mindprod.com/jgloss/serialization.html for an overview.
 
BGB


interesting...


although I haven't read the entire thing, it makes me wonder if I should
actually bother implementing the JVM's serialization mechanism in my own
pseudo-JVM (my VM supports JBC, but doesn't really aim to be a full JVM,
and many "unorthodox parts" are used).

there is no real "universal" data serialization system in my VM,
although a commonly used system (mostly for C and misc) is based roughly
on LISP-style S-Expressions, and they could be used for handling
class/instance objects. this is also commonly used for dumping data to
the console.

the S-Exp printer/parser is built directly on the VM's core typesystem...

(note, not to be confused with Rivest's S-Expressions, which are IMO a
somewhat different piece of technology...).



oddly, for all I have used XML, I have generally never really used XML
as a format for serializing internal data or for "data binding" (usually
because either specialized formats or S-Expressions are used, with in
the latter the notation containing a number of syntax extensions for the
various supported types...).

most of my use of XML is limited to DOM trees and data represented
internally as DOM trees (and neither data-binding nor schemas are used
in my case, although a binary-XML encoding is used for serialized DOM
nodes...).


actually, IIRC the S-Exp printer/parser already does class/instance
objects, with a syntax something like:
'#L<classname> { fields... }'
where each field is: 'name: value'.

note: 'name:' and ':name' are currently equivalent (this syntax is known
as a keyword). other names are parsed as "symbols", and quoted strings
are parsed as strings.


for example:
package foo;
class Bar {
    public int x;
    public int y;
}

would be printed as something like:
#L<foo/Bar> { x: 3 y: 4 }
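as a purely illustrative sketch (not the actual C-side implementation, which is built on the VM's typesystem), such a printer could be approximated in plain Java via reflection:

```java
import java.lang.reflect.Field;

public class SExpPrinter {
    public static class Bar { // stand-in for foo.Bar above
        public int x = 3, y = 4;
    }

    // Emit '#L<classname> { name: value ... }' from an object's public
    // fields. Note: reflection does not guarantee field order.
    public static String print(Object obj) {
        StringBuilder sb = new StringBuilder("#L<")
            .append(obj.getClass().getName().replace('.', '/'))
            .append("> {");
        try {
            for (Field f : obj.getClass().getFields())
                sb.append(' ').append(f.getName())
                  .append(": ").append(f.get(obj));
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        return sb.append(" }").toString();
    }

    public static void main(String[] args) {
        // Something like: #L<SExpPrinter$Bar> { x: 3 y: 4 }
        System.out.println(print(new Bar()));
    }
}
```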

integer and numeric values default to "fixint" and "flonum" types, which
by default both have 28 bits on x86 (+-2^27 is encoded directly in the
pointer for ints, and floats have slightly reduced range and accuracy).
x86-64 uses 48 bits for each. larger ints and full-precision
floats/doubles require boxing (which creates added heap garbage).

note:
{ x: 3 y: 4 }
would encode a prototype object (N/A for Java, used by JavaScript but
lacks a class and my VM only allows Java to access these via
interfaces). (hmm: a special class could be added to ease using
prototype objects from Java...). (note: Java and the external dynamic
typesystem don't currently mix well...).


typed arrays are printed as:
#A<signature> ( values... )

for example:
int[] {1, 2, 3}

would be printed as:
#A<i>(1 2 3)
in the past, I had considered allowing "#Ai(1 2 3)", but didn't
implement support for it.

dynamically-typed arrays (N/A for Java) are printed as:
#(values...)

in effect, they are equivalent to #A<r> (...) where 'r' is the character
for dynamic types.

"square arrays" (N/A for Java) are partly supported, but only for printing.

ex: #A2<i>((1 2 3) (4 5 6) (7 8 9))

(...)
will encode a list (also N/A for Java), which is composed of linked
cons-cells.

example: (foo (bar 2 3) 4 (5 6 (7 "ROFLOL" 8 9)))

note: (1 2 3) and (1 . (2 . (3 . ()))) are equivalent (note "()" is a
special value, encoded in memory as a NULL pointer).

....


looking over the code, it seems cycles are handled by encoding 0 or more
items as:
"#;#index=value"
followed by the root expression being encoded.

"#index#" can encode references to these.

the algo is 2-pass: the first pass identifies any multiply-referenced
items (via arrays), and any marked items are encoded
as references in the second pass.


a custom-rolled binary format could also work...


or such...
 
Arne Vajhøj

interesting...

although I haven't read the entire thing, it makes me wonder if I should
actually bother implementing the JVM's serialization mechanism in my own
pseudo-JVM (my VM supports JBC, but doesn't really aim to be a full JVM,
and many "unorthodox parts" are used).

It is a pretty important part of a JVM.

I seem to remember that it was one of the first things GCJ started
implementing.

If you are not creating a JVM, then obviously that does not apply.

Arne
 
BGB

It is a pretty important part of a JVM.

I seem to remember that is was one of the first things GCJ started
implementing.

If you are not creating a JVM, then obviously that does not apply.

well, I don't think this is such a "one way or another" matter, like
there are degrees as to how much of this stuff can be implemented...

it is "pseudo", as in, it supports the bytecode and basic language,
and some core classes (a lot of the stuff in "java.lang" and parts of
"java.io" and "java.util").

how much beyond this? it is an open question.


part of the reason though is I am not really intending to use it for
self-contained apps, but more for app scripting. so it matters more that
the Java<->C interface is good than that the runtime doesn't suck
(sadly... I am still mostly using JNI for a lot of this...).

like, I figure, if people want full Java support, they can use a real
JVM... (my VM does not intend to replace a real JVM, FWIW...).


now, how to classify this exactly is unclear, since the term JVM
implies a relatively complete implementation (and what is required for
this is not exactly small). similarly, the Java community seems to have
a fairly weak line of division between "Java Language" and "Java Class
Library".

JBC-VM works, but many people don't seem to realize that "JBC" means
"Java Bytecode". BJVM (used personally) still implies "JVM", and "BJ-VM"
just looks odd.

as well, the external "greater VM" (mostly unrelated to anything JVM,
and debatably even a VM, as opposed to a collection of vaguely VM
related libraries) goes under the names "BSCC" and "SilverfishVM" (named
after the insect).


it is a custom-written runtime; I recently noted in a line counter
that it is currently around 4 kloc (4000 lines of code).
a lot of this is stubs or calls into C land (or calls into other classes
which redirect into C).

for example, "System.in" and "System.out" redirect to special/internal
"Z_ConsoleInput" and "Z_ConsoleOutput" classes (these classes extend
InputStream and PrintStream, but use a native method for output), which
redirect console IO into C (actually, console IO is passed to my GC
library, which then passes it to whatever is hooked into this, such as
for logging or displaying messages where the user can see them).

at the moment, I am not even really bothering with "java.net" or
"java.awt" (nor at the moment "java.lang.reflect"). also lacking is
support for user-defined classloaders, ...

I only recently fully implemented exceptions, but they are not really
tested.



looking online, I guess it aligns roughly with the Java ME CLDC or CDC
profiles or similar.

I guess I could look at CLDC and see if this much can be fully
implemented. what status will this give the project? dunno...

apparently, serialization/... is absent from CLDC as well...



it is also plugged into a modified Quake2 engine (and also a few 3D
modeling tools and similar), ... currently this is its intended usage
domain...


or such...
 
Mike Schilling

BGB said:
part of the reason though is I am not really intending to use it for
self-contained apps, but more for app scripting. so it matters more that
the Java<->C interface is good than that the runtime doesn't suck
(sadly... I am still mostly using JNI for a lot of this...).

If you're rolling your own, how about this? Create a utility that
generates a C header file for a specified class, defining it as a C struct,
and when your JVM calls into C, pass a pointer to that struct instead of an
opaque pointer. Your C can now access everything in the class directly
instead of having to use JNI. (Having the header define function pointers
used to call the class's methods would be a further step.) You lose Java's
safety guarantees [1], which are the main reason that JNI is so clumsy, but
to offset that, you have huge gains in efficiency and ease.

1. That is, you gain C's full freedom to scribble all over random bits of
memory.
 
Tom Anderson

If you're rolling your own, how about this? Create a utility that
generates a C header file for a specified class, defining it as a C
struct, and when your JVM calls into C, pass a pointer to that struct
instead of an opaque pointer. Your C can now access everything in the
class directly instead of having to use JNI. (Having the header define
function pointers used to call the class's methods would be a further
step.) You lose Java's safety guarantees [1], which are the main reason
that JNI is so clumsy, but to offset that, you have huge gains in
efficiency and ease.

1. That is, you gain C's full freedom to scribble all over random bits
of memory.

You already have that freedom when writing C against JNI. JNI merely makes
it more challenging to find the random bits of memory where you can really
do some damage!

tom
 
BGB

BGB said:
part of the reason though is I am not really intending to use it for
self-contained apps, but more for app scripting. so it matters more
that the Java<->C interface is good than that the runtime doesn't suck
(sadly... I am still mostly using JNI for a lot of this...).

If you're rolling your own, how about this? Create a utility that
generates a C header file for a specified class, defining it as a C
struct, and when your JVM calls into C, pass a pointer to that struct
instead of an opaque pointer. Your C can now access everything in the
class directly instead of having to use JNI. (Having the header define
function pointers used to call the class's methods would be a further
step.) You lose Java's safety guarantees [1], which are the main reason
that JNI is so clumsy, but to offset that, you have huge gains in
efficiency and ease.

this is an interesting idea...

however, sadly, it would be a little problematic for my current
implementation, which internally uses opaque pointers for classes (for
sake of abstraction/modularity) and also uses relatively more "fluid"
memory handling than C prefers (IOW: physical class layout is not known
until runtime in the current implementation).


actually, the implementation is "transactional", and allows
physical class layout to change ("class versions" are used to track the
particular layout of the particular class instance...). class
definition/modification is done via begin/end pairs, where the 'end' may
implicitly commit changes to the class (only partially, as the layout is
not frozen until an instance of the class or a subclass is made, at
which point any further alterations would go into the new 'version').

the above mechanism was used to implement "java/lang/String", which has
a temporary (VM-provided) form initially, but during VM startup the
classfile is loaded and "dropped on top of" the VM defined class (mostly
this was to deal with the problem that otherwise there was no obvious
way to make String loadable, since the constant pool may in fact contain
String instances).

this could also allow for "partial classes" (like in .NET), apart from
the lack of any obvious way to handle this in Java.


so, the VM itself uses an interface fairly similar to JNI to access the
OO facilities...


using function pointers to access fields and methods can work though,
where any methods would be exposed directly, and any fields could be
accessed via getters/setters.


at the moment, the main issue is not the C->Java interface, as this
works well enough.


the main issue then is mostly that Java can't so easily access the bulk
of code and API's which exist in C land (this being the vast majority of
the codebase).

I have provided a few JNI alternatives for this part, but technical
issues remain (namely that it is still necessary to write classes with
'native' methods and provide some basic level of boilerplate in most cases).

I don't use JNA, as it is not currently implemented and is not
obviously better for my uses than JNI, or classes with native
methods and a short-circuit mechanism (as a lighter-weight JNI
alternative).

the main cost at the moment is then having to write classes with native
methods for any API calls one wishes to import, and dealing with the
language typesystem mismatches, ... there is no obvious way to export
API's to Java via header-processing, as it would be necessary to
somehow identify which functions belong to which class (unless I start using
C-side annotations for this), and classes need to be fully specified and
are limited to a "reasonable" size (64k methods), preventing simple
naive strategies.


apart from using an awkward/nasty strategy (using special classes and
methods to call into C land), there is no obvious way to do
boilerplate-free operation with a standard Java compiler...

another difficulty is that there is no good way to handle
variable-argument calls that doesn't force everything to Object in the
process, which is expensive (creates a bunch of boxed values which turn
into garbage).

....


thinking of it, I may have reason to start supporting function
annotations to identify methods.

for example:
DYC_CLASSAPI("bgb/vm/Foo") void fooMethod(dycThis self, int x)
{
    ...
}

this would allow some tools to autogenerate classes from C-level
metadata. (however, a few of my tools would need to be modified to
support this feature...).

note: 'dycThis' is already used, and is basically a magic type to
identify that a function is an instance method and expects a 'this'
object to be passed as the given argument (limited to the first function
argument).

DYC_CLASSAPI would be new, and would be analogous to (in C++):
namespace bgb {
namespace vm {
    void Foo::fooMethod(int x)
    {
        ...
    }
}
}

in a normal C compiler (say, MSVC), it would expand to, say
"__declspec(dllexport)".

in my metadata tool, it could expand to " __declspec(dllexport)
__declspec(dycClassAPI("bgb/vm/Foo")) ".

which would in turn go into a database to be processed by the latter
tool (the former tool exists, but the latter tool, namely a
class-emitter, would need to be written).

the bigger hassle is that my (slightly hackish) auto-header generator
tool would need to be modified to support this annotation.

1. That is, you gain C's full freedom to scribble all over random bits
of memory.

well, that happens sometimes...


I was checking earlier, and it seems I have already implemented most
of the classes for J2ME CLDC (and actually a few more than this, but
a few holes exist as well...).

I guess I may for now try to get this profile fully implemented.
 
