Serialization library, request for feedback


Greg Martin

yeah, these are general problems with data serialization, namely that
there is no good way to deal with "everything" within any reasonable
level of complexity.

it is easier, then, to find a reasonable subset of things that can be
serialized via a given mechanism, and use that.

At one point I did a lot of work with ASN.1, which is a specification
for defining protocols but is implemented using a variety of encoding
schemes. It actually is an effective means of serializing data and can
be very compact and, as I recall, could be extended to cover most data
encodings that I ran into, but it never seemed trivial to me. Debugging
a stream of encoded data can lead to premature blindness and
post-traumatic drunkenness. Well, maybe that was just me. To interpret
the data correctly you need the specification for the encoding, BER for
example, and also the ASN.1 specification.

Lately I've been embedding the V8 engine in an application. I haven't
looked at how they've implemented the JSON objects but it might be worth
a peek for anyone interested in serialization techniques. I'm very
impressed with the V8 JavaScript engine.
 

Ian Collins

Ben said:
My ring_buffer example is one. In fact, any structure where pointers are
"shared" would seem to be a bad fit with JSON.

How often would you want to serialise a structure that contains
pointers? Pointer values are only useful within the current executing
program, so you have to use some form of conversion.
Maybe "no obvious JSON representation" is too strong because you can
probably always map pointers so some sort of index or string label, but
it's not a natural fit.

Arrays and indexes would be one choice. But this is more of a general
issue of how to serialise a structure with pointers than a JSON one.
 

James Kuyper

How often would you want to serialise a structure that contains
pointers? Pointer values are only useful within the current executing
program, so you have to use some form of conversion.

Yes, and ideally the serialization system should handle that conversion
for you, if you give it enough information to do so. I don't claim to
know of any serialization system that meets that ideal.
 
 

BartC

personally, I far more often use specialized file-formats, rather than
dumping data structures, and if serialization is used, it is usually
limited mostly to certain sets of dedicated "safe to serialize"
data-types, which in my case are typically "lists" and "dynamically-typed
objects".

I would use dedicated file-formats too. Also the docs for this library seem
very complicated; I wouldn't know where to start. Or what the capabilities
and limitations actually are.

The idea sounds good, but it seems something more suitable for a language
where information about data-structures is available at runtime.

(The docs could do with being even more basic, for people like me. For
example, the Readme file says the output is a binary format, then a few
lines further on it says the serialised data is in C-like syntax. And
how, really, does it work? If I start off with these two values:

int a = 1234;
int b = 5678;

can I serialise both into the same file? What happens at the 'other end'
when I try to read this data; how will it work if I don't happen to have
a pair of ints called a and b? Etc. (I did say it needed to be basic!))
actually, given my frequent use of sequential read/write functions, this
is a major part of why, very often, I end up using variable-length
integer representations.

for example, the vast majority of integer-values are fairly small
(basically, forming a bell-curve with its center around 0), making it so
that, on-average, small integers may only require 1 byte, and most others
are 2 or 3 bytes, independent of the physical storage-type of the integer
(32 or 64 bits).

like, although a variable-length integer isn't very good for random-access
data, it is pretty good for data that is read/written sequentially.

Yes, I use a low-level layer for most of my binary formats now. So an entire
file is just a sequence of tagged integer, floating point or string values.
Small int values, as you say, are most common, so values 0 to 239 are output
as 0 to 239; no tag. (I think the majority will be positive.) Anything else
needs a tag, which is a code from 240 to 255 (the latter being an EOF
marker).
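
A minimal sketch of such a layer in C (the tag assignments and
little-endian payload layout below are assumptions for illustration, not
the actual codes):

#include <stdint.h>
#include <stdio.h>

enum { TAG_I32 = 240, TAG_I64 = 241, TAG_EOF = 255 };

/* Emit one integer: 0..239 as a bare byte, otherwise a tag byte
   followed by a fixed-size little-endian payload. */
static void emit_int(FILE *f, int64_t v)
{
    if (v >= 0 && v <= 239) {
        fputc((int)v, f);
    } else if (v >= INT32_MIN && v <= INT32_MAX) {
        fputc(TAG_I32, f);
        for (int i = 0; i < 4; i++)
            fputc((int)(((uint32_t)v >> (8 * i)) & 0xFF), f);
    } else {
        fputc(TAG_I64, f);
        for (int i = 0; i < 8; i++)
            fputc((int)(((uint64_t)v >> (8 * i)) & 0xFF), f);
    }
}

A reader does the inverse: a byte below 240 is a literal value, anything
else dispatches on the tag.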

The file format with its data sits on top of this layer.
 

BGB

At one point I did a lot of work with ASN.1, which is a specification
for defining protocols but is implemented using a variety of encoding
schemes. It actually is an effective means of serializing data and can
be very compact and, as I recall, could be extended to cover most data
encodings that I ran into, but it never seemed trivial to me. Debugging
a stream of encoded data can lead to premature blindness and
post-traumatic drunkenness. Well, maybe that was just me. To interpret
the data correctly you need the specification for the encoding, BER for
example, and also the ASN.1 specification.

ASN.1 BER isn't particularly compact.

it is about the same as my "typical" binary serialization format, where
both will typically require around 2 bytes for a marker, and 0 or more
bytes for payload data.

granted, yes, it is much more compact than more naive binary formats, or
simpler TLV formats like IFF or RIFF, but, either way...


ASN.1 PER is a bit more compact.
basically, it maps data to fixed-size bit-fields.


part of the "proof of concept" part of my "BSXRP" protocol was that it
is possible to generate a message stream on-average more compact than
ASN.1 PER, and without needing to make use of a dedicated schema. they
work in different ways though, whereas PER uses fixed bit-packing, my
protocol borrowed more heavily from Deflate and JPEG, basically working
to reduce the data mostly to near-0 integer values (via "prediction"),
which can be relatively efficiently coded via a VLC scheme (a
Huffman-coded value followed by 0 or more "extra bits").

say, one can encode one of:
a VLC value indicating which "prediction" is correct (regarding the next
item);
a VLC value indicating an item within an MRU-list of "recently seen items";
a direct representation of the item in question, in the case where it
hasn't been seen before, or was "forgotten" (fell off the end of the MRU).

Huffman coding these choice-values will tend to use a near optimal
number of bits (vs a fixed-field encoding, which will tend to allow
every possibility with a near equal weight regardless of the relative
probability of a given choice).

say, there are 6 possible choices.
a fixed-width encoding would use 3 bits here.

but, what if one choice is much more common than another:
000 = 90%
001 = 8%
others = 2%

then, 3 bits are no longer optimal, and instead we might want something
like:
000 -> 0
001 -> 10
others -> 11xxx

now, 90% of the time, we only need 1 bit.
this is part of the advantage of Huffman coding.
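
to make the gain concrete: with those code lengths (1, 2, and 5 bits),
the average cost is 0.90*1 + 0.08*2 + 0.02*5 = 1.16 bits per choice,
versus a flat 3 bits for the fixed-width encoding (roughly a 2.6x
saving).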


note that an Arithmetic coder can compress better than Huffman coding
(because it can deal with "fractional bits"), but tends to be a little
slower to encode/decode (note, however, that the H.264 video-codec is
built around an arithmetic coder).

in my case, things like strings and byte-arrays are compressed using an
LZ77 scheme very similar to Deflate (but using a 64kB window rather than
a 32kB one), which also does not need to resend the Huffman table with
each message.


personally though, I don't really like ASN.1, mostly on the grounds that
its design requires a schema in order to work correctly, and I much
prefer formats which will handle "whatever you throw at them" even
without any sort of schema.

Lately I've been embedding the V8 engine in an application. I haven't
looked at how they've implemented the JSON objects but it might be worth
a peek for anyone interested in serialization techniques. I'm very
impressed with the V8 JavaScript engine.

yeah, V8 is probably fairly good...


I looked at it before, but in my case decided to stick with my own
Scripting-VM (the VM is sufficiently tightly integrated with my project,
that it isn't really an "easily replaceable" component).

both have in common that they run ECMAScript / JavaScript variants
(but with different language extensions...).

but, mine has a fancier FFI, albeit worse performance...
but, it is easier to try to optimize things as-needed, than to try to
rip out and rewrite around 2/3 of the 3D engine's codebase.


or such...
 

Rui Maciel

Ben said:
My ring_buffer example is one. In fact, any structure where pointers are
"shared" would seem to be a bad fit with JSON.

Maybe "no obvious JSON representation" is too strong because you can
probably always map pointers so some sort of index or string label, but
it's not a natural fit.


Granted, JSON doesn't offer explicit support for pointers. Nonetheless,
pointers are essentially a map between a reference number and an object.
Therefore, if it's possible to represent a map between a number and an
object, it's possible to represent a pointer, and JSON supports those
types.
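
For instance, two nodes that point at each other can be flattened into
an object table plus indices, along the lines of (a sketch, not any
particular library's format):

{ "objects": [ { "value": 1, "next": 1 },
               { "value": 2, "next": 0 } ] }

Here each "next" holds the array index of the pointee rather than a
memory address.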


Rui Maciel
 

BGB

Yes, and ideally the serialization system should handle that conversion
for you, if you give it enough information to do so. I don't claim to
know of any serialization system that meets that ideal.


if I understand what is meant here, this case would likely be a *very*
hard problem for a general-purpose serialization API...


simpler and more well-behaved cases can be more easily addressed, say:
requiring a specific memory-manager;
requiring use of explicit type-tagging;
requiring structures to be allocated individually;
banning physically nested structures or the use of unions;
....
 

Malcolm McLean

Ulf Åström wrote:

The "let's blindly dump everything" approach is fundamentally broken,
because it actually doesn't describe the information. Instead, it dumps the
data as how it was stored in a particular data structure and omits any
semantics associated with it. This causes significant problems, because it
doesn't perform any integrity checks on the information and essentially
eliminates the possibility of adding the necessary sanity checks
to the parsing process, other than testing if a particular primitive data
type is actually valid.
But it's far easier to parse.
Imagine an image that consists of width, height, and rgb triplets. If we
store the dimensions as 16-bit big-endian numbers, we can read them in
with a handful of instructions, then loop to read in the bytes. If we
read the wrong number of bytes, we know the format is corrupted.
If we have to use some text scheme, we've got to guard against stuff like
<width -1000000000000000000000000000000000000000000000000000000000000000.02>
<height FRED>
<unrecognised tag order = rgba>

<Row 0xFFFFFF ... # this comment comments out the end >

That makes it far harder to write the parser.
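
A sketch of the binary reader described above, in C, assuming exactly
the layout stated (16-bit big-endian width and height, then
width*height RGB triplets):

#include <stdio.h>
#include <stdlib.h>

/* Read a 16-bit big-endian value; returns -1 on end-of-file. */
static long read_be16(FILE *f)
{
    int hi = fgetc(f), lo = fgetc(f);
    if (hi == EOF || lo == EOF) return -1;
    return ((long)hi << 8) | lo;
}

/* Returns a width*height*3 pixel buffer, or NULL if the stream is
   truncated or corrupt. */
static unsigned char *read_image(FILE *f, long *w, long *h)
{
    unsigned char *pix;
    size_t n;

    *w = read_be16(f);
    *h = read_be16(f);
    if (*w < 0 || *h < 0) return NULL;
    n = (size_t)*w * (size_t)*h * 3;
    pix = malloc(n);
    if (pix == NULL) return NULL;
    if (fread(pix, 1, n, f) != n) {   /* wrong byte count: corrupted */
        free(pix);
        return NULL;
    }
    return pix;
}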
 

Willem

Keith Thompson wrote:
) [...]
)> The user needs to set up translators (composed of one or more fields)
)> for each type they wish to serialize. Internally it works by
)> flattening each type into a list of void pointers and replacing them
)> with numeric indices. The output format has a simple "<tag> <value>;"
)> format; it is similar to JSON but it is not a reimplementation of it.
) [...]
)
) Why not just use JSON? It would make the flattened files accessible by
) other tools.

I think YAML is a much better choice in this case, because it has
some extra features such as object descriptions and object references.
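
For example, standard YAML anchors and aliases let one node reference
another directly:

first: &a
  value: 1
second: *a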


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Ian Collins

BGB said:
if I understand what is meant here, this case would likely be a *very*
hard problem for a general-purpose serialization API...

Don't lose sight of the difference between the serialisation technique
and the representation of the serialised data. The same problems (such
as what to do with pointers) exist with any general purpose
serialisation technique or sufficiently expressive data representation.

How the data types are described is another, independent, problem. Even
in C code, a structure object requires meta-data (the struct
declaration) to describe its type.

In my experience serialising data containing pointers is a very unusual
occurrence. The only time I have done this is where a set of
cooperating processes are using data in shared memory. In this case,
pointers can either be serialised as an offset from the segment base or
as plain integer values if all processes use the same base address for
the segment.
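
In that shared-memory case the conversion is just pointer arithmetic
against the segment base, something like:

#include <stddef.h>

/* 'seg_base' is the mapped segment's start address, known to every
   cooperating process. */
ptrdiff_t ptr_to_offset(const void *p, const void *seg_base)
{
    return (const char *)p - (const char *)seg_base;
}

void *offset_to_ptr(ptrdiff_t off, void *seg_base)
{
    return (char *)seg_base + off;
}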
 

BGB

Don't lose sight of the difference between the serialisation technique
and the representation of the serialised data. The same problems (such
as what to do with pointers) exist with any general purpose
serialisation technique or sufficiently expressive data representation.

usually, the matter of pointers is transformed into some sort of object
index, but the problem becomes, what if the pointer doesn't point at a
well-defined object?...

say, a person is pointing to raw memory gained via "mmap()" or
"malloc()" or similar.


a lot of my stuff doesn't work correctly in these cases, and if given a
pointer to such memory, it will have NULL as the dynamic type.


my scripting language can deal with it, so far as it is also capable of
seeing the types "C-style" (and so will work with the data according to
its struct declaration and/or pointer types). (when doing so, it will no
longer have any notion of array-bounds though, since there may not be
any information available to tell the VM where the beginning or end of
the array are located, ...).

but, the serialization code will have no idea, and will generally encode
it as an "UNDEFINED" value.


How the data types are described is another, independent, problem. Even
in C code, a structure object requires meta-data (the struct
declaration) to describe its type.

even as such, C has many holes in its natural type representations:
information which may be needed to serialize an object with a given type
may not be present within the types as understood by the language itself.

hence why some information may be needed in the form of special
annotations and run-time type-tags: even though the tool can understand
the types (struct declarations, ...), some information may still not be
knowable from the declarations themselves.


like, say:
"int *a;"

given the 'a' pointer, how do you know where the beginning or end of its
associated memory object is?... run-time information may be needed in
this case (in the sense that the memory manager may know where the array
is located, and how big it is).

but, the memory manager may only know this if the memory for 'a' was
allocated via the appropriate functions. (if it came directly from
"mmap()" or similar, the memory manager may not have any idea).

granted, yes, the memory manager gets its own memory initially from
"mmap()" or "VirtualAlloc()" or similar, but it remembers which memory
chunks exist, and can identify individual memory objects within these
chunks.


a problem though is that, usually, there is not sufficient information
to work with a union, either within C or within run-time tags: if all
we know is that it is a union, that doesn't tell us which possibility
regarding its contents is the "correct" one.

In my experience serialising data containing pointers is a very unusual
occurrence. The only time I have done this is where a set of
cooperating processes are using data in shared memory. In this case,
pointers can either be serialised as an offset from the segment base or
as plain integer values if all processes use the same base address for
the segment.

most often, IME, the pointers are actually object references (either
pointing to another structure, or an array of values, ...).

in other cases, there may be other problems.


granted, in my case, nearly all of my program's memory is allocated via
the special memory manager (I very rarely use "malloc()" or similar),
so, in this case, there is usually sufficient information to work with.
 

Ulf Åström

I would use dedicated file-formats too. Also the docs for this library seem
very complicated; I wouldn't know where to start. Or what the capabilities
and limitations actually are.

There are sections for both the capabilities ("Output format") and the
limitations ("What it doesn't do") in the readme.
The idea sounds good, but it seems something more suitable for a language
where information about data-structures is available at runtime.

That's why it must be provided with the type, size and offset of each
field; these are compiled into the program, but so are the structure
layouts. What other information do you think would be needed?
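
In C such a field table is typically built with offsetof and sizeof; a
generic illustration (the descriptor struct is hypothetical, not this
library's actual API):

#include <stddef.h>

struct point { int x, y; };

/* Hypothetical descriptor: enough to locate and size one field. */
struct field_desc {
    const char *name;
    size_t offset;
    size_t size;
};

static const struct field_desc point_fields[] = {
    { "x", offsetof(struct point, x), sizeof(int) },
    { "y", offsetof(struct point, y), sizeof(int) },
};
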
(The docs could do with being even more basic, for people like me. For
example, the Readme file says the output is a binary format, then a few
lines further on it says the serialised data is in C-like syntax.

No, the introduction about binary formats explains why they are
problematic. These are the things I'm trying to overcome.
And how, really, does it work. If I start off with these two values:

 int a = 1234;
 int b = 5678;

can I serialise both into the same file? What happens at the 'other end'
when I try to read this data; how will it work if I don't happen to have
a pair of ints called a and b? Etc. (I did say it needed to be basic!))

The ser_ialize() function will take a list of translators, a pointer
to the initial data and what type it is (and some extra arguments for
options). It returns a character array with the serialized data.
Passing such data to ser_parse() will restore it; the return value is
a pointer to a copy of the data you initially fed into it. To
serialize multiple variables they must be wrapped in a struct or
array. It doesn't care about global variables at all, it only converts
the things you point it at.

/Ulf
 

Ian Collins

BGB said:
usually, the matter of pointers is transformed into some sort of object
index, but the problem becomes, what if the pointer doesn't point at a
well-defined object?...

Then the object isn't a suitable candidate for serialisation.

I don't agree with your "usually". Unless you are dealing with the very
rare case where a pointer value (even expressed as a base + offset) is
meaningful to the reader of the serialised data as well as the writer,
the only sensible thing to do is the equivalent of a deep copy and
serialise the data pointed to and not the pointer value.
even as such, C has many holes in its natural type representations:
information which may be needed to serialize an object with a given type
may not be present within the types as understood by the language itself.
Eh?

hence, why some information may be needed in terms of special
annotations, and run-time type-tags, because even though the tool can
understand the types (struct declarations, ...), some information may
still not be knowable from the declarations themselves.

like, say:
"int *a;"

given the 'a' pointer, how do you know where the beginning or end of its
associated memory object is?... run-time information may be needed in
this case (in the sense that the memory manager may know where the array
is located, and how big it is).

Ah-ha. See above.
most often, IME, the pointers are actually object references (either
pointing to another structure, or an array of values, ...).

in other cases, there may be other problems.

granted, in my case, nearly all of my program's memory is allocated via
the special memory manager (I very rarely use "malloc()" or similar),
so, in this case, there is usually sufficient information to work with.

Not if you are sending the data to another language....
 

Shao Miller

if I understand what is meant here, this case would likely be a *very*
hard problem for a general-purpose serialization API...

Well seeing as how "the slow way" is the only portable way to test two
pointers for pointing into the same object (one step at a time with !=
or ==), "the slow way" could be adopted for assigning indices to
pointers, or referring to previously-established indices for the same
pointee. No? At least deserializing wouldn't be so slow. - Shao
 

Johann Klammer

Rui said:
Granted, JSON doesn't offer explicit support for pointers. Nonetheless,
pointers are essentially a map between a reference number and an object.
Therefore, if it's possible to represent a map between a number and an
object, it's possible to represent a pointer, and JSON supports those
types.


A one-to-one mapping will not do. There may be multiple pointers
pointing into the same area, and they need not point at the start of it.
You'd need at least some kind of interval table/tree, to check which
pointers fall into a certain range and try to associate them with
some 'base objects', into which they point... Otherwise the read-back
structure will end up different from what was written.
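
A sketch of that kind of lookup, assuming the serializer keeps a table
of known allocations (names invented; note the relational pointer
comparisons here are not strictly portable C, which is Shao Miller's
point above):

#include <stddef.h>

/* One known allocation: start address and size in bytes. */
struct alloc_rec {
    char *base;
    size_t size;
};

/* Find the allocation a (possibly interior) pointer falls into;
   returns NULL if none does. Linear scan for clarity. */
static const struct alloc_rec *
find_base(const struct alloc_rec *tab, size_t n, const void *p)
{
    const char *cp = p;
    for (size_t i = 0; i < n; i++)
        if (cp >= tab[i].base && cp < tab[i].base + tab[i].size)
            return &tab[i];
    return NULL;
}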
 

BGB

Then the object isn't a suitable candidate for serialisation.

I thought someone here was objecting to serialization not working
with structures mapped into a ring-buffer?... (IOW: a big glob of memory
treated like a ring, with structures allocated "around" the inside of
the ring...).

I was mostly noting that this sort of thing is a hard problem for a
serialization API to deal with (and, in my case, I don't even try to
make this work).


though, I may have misunderstood, and the reference may have just been
structures with cyclic linking (say, a ring composed of linked
structures), which is an easier problem (mostly this one involves
mapping objects to indices, and the "ring" will fall away, as any prior
object will already have been mapped).

I don't agree with your "usually". Unless you are dealing with the very
rare case where a pointer value (even expressed as a base + offset) is
meaningful to the reader of the serialised data as well as the writer,
the only sensible thing to do is the equivalent of a deep copy and
serialise the data pointed to and not the pointer value.

the "usually" is because there are some ways to implement persistence
mechanisms which do depend on pointer addresses... (usually by
implementing it via "mmap()" or "CreateFileMapping()" or similar...).

whether this is good or not is a separate matter.
(I don't do this, as it is just too ugly and broken IME to really be all
that useful).


in most other cases, any pointer is mapped to an object index or
similar, and the original address is not preserved. (hence, "usually").


like, array bounds for raw-memory arrays.

if a person is like:
int *a;
a=malloc(256*sizeof(int));

it isn't (necessarily) the case that a person can get back the base and
size of the pointed-to memory (absent compiler-specific extensions, such
as "_msize()" and similar).


other possibilities include a pointer to a struct physically embedded
within another struct.

typedef struct {
    int x, y;
} Foo;

typedef struct {
    int w;
    Foo foo;
    int z;
} Bar;

Bar *bar;
Foo *foo;
bar = malloc(sizeof(Bar));
foo = &(bar->foo);

if we try to serialize with nothing more than 'foo', there is not enough
information available (within the C type-system), for a serialization
API to realize that 'foo' is contained within an instance of Bar, and
that the object pointed to by 'bar' should be serialized instead.

another similar example would be allocating a buffer for the contents
of a file, then casting various parts of this buffer to various struct
pointers.

a serializer working simply by walking the pointers would likely be
entirely unaware that the structs are held within a common buffer (and
would not preserve their physical spatial relationship within this buffer).


granted, a lot of these are cases which my APIs don't really address,
nor would I really want to try to address them...

this is part of the reason for some of the "restrictions" mentioned
before, which mostly serve to place more limits on what can be called
"well defined" data structures.

Ah-ha. See above.

ok, fair enough...

Not if you are sending the data to another language....

a lot of this is used when sharing data with my own language
(BGBScript), which is built on the same custom memory manager as most of
my C code.


functionally though, BGBScript can (optionally) deal with most of the
same types and memory-management practices as in C.

apparently, I just sort of took it for granted that these sorts of
things "should" work, for such a scripting language to be "practically
useful"...

granted, yes, this isn't really the normal way of writing code in the
language...
 

BGB

Well seeing as how "the slow way" is the only portable way to test two
pointers for pointing into the same object (one step at a time with !=
or ==), "the slow way" could be adopted for assigning indices to
pointers, or referring to previously-established indices for the same
pointee. No? At least deserializing wouldn't be so slow. - Shao

could be...


in my case, I am using a customized memory manager, and it is possible
to use special calls, say:
int Foo_CheckPtrAisB(void *a, void *b)
{
    return (gcgetbase(a) != NULL) && (gcgetbase(a) == gcgetbase(b));
}

where any two pointers into the same object will produce the same base
address for "gcgetbase()", which returns the starting address for an
allocated object, or NULL if the pointer isn't into the GC's heap.


but, yeah, otherwise, it is a harder problem.


or such...
 

Ian Collins

BGB said:
I thought someone here was objecting to serialization not working
with structures mapped into a ring-buffer?... (IOW: a big glob of memory
treated like a ring, with structures allocated "around" the inside of
the ring...).

My point was an object that contains a pointer to a blob of unspecified
size isn't suitable for serialisation. If the object is designed to be
serialised (or if the requirement is retrofitted), it would have an
embedded size, or provide a means to determine the size of the blob. In
practice the conditions for serialising are pretty much the same as
those for copying an object.
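
Concretely, the difference is between declarations like (illustrative):

struct blob  { unsigned char *data; };              /* size unknowable */
struct sized { size_t len; unsigned char *data; };  /* serialisable */

Only the second form gives a serialiser (or a deep-copy function) enough
information to work with.
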
the "usually" is because there are some ways to implement persistence
mechanisms which do depend on pointer addresses... (usually by
implementing it via "mmap()" or "CreateFileMapping()" or similar...).

That's basically what I described up-thread: the case of a set of
cooperating processes using data in shared memory.
in most other cases, any pointer is mapped to an object index or
similar, and the original address is not preserved. (hence, "usually").

Again, I disagree. It makes more sense (in a general scheme) to
serialise the data pointed to, not the pointer, unless you also intend
including the entire memory space in the serialised data. The only time
I see that is in a core file.
like, array bounds for raw-memory arrays.

if a person is like:
int *a;
a=malloc(256*sizeof(int));

If you wanted to serialise a, you would hang on to the size.
other possibilities include a pointer to a struct physically embedded
within another struct.

typedef struct {
    int x, y;
} Foo;

typedef struct {
    int w;
    Foo foo;
    int z;
} Bar;

Bar *bar;
Foo *foo;
bar = malloc(sizeof(Bar));
foo = &(bar->foo);

if we try to serialize with nothing more than 'foo', there is not enough
information available (within the C type-system), for a serialization
API to realize that 'foo' is contained within an instance of Bar, and
that the object pointed to by 'bar' should be serialized instead.

That sounds like a programming error rather than a deficiency in the C
type system.
another similar example would be allocating a buffer for the contents
of a file, then casting various parts of this buffer to various struct
pointers.

a serializer working simply by walking the pointers would likely be
entirely unaware that the structs are held within a common buffer (and
would not preserve their physical spatial relationship within this buffer).

If you want a portable, persistent representation of the data, that is
probably a good thing!
granted, a lot of these are cases which my APIs don't really address,
nor would I really want to try to address them...

this is part of the reason for some of the "restrictions" mentioned
before, which mostly serve to place more limits on what can be called
"well defined" data structures.

In most of what I do the serialisation API is "convert to JSON, stream".
So if something can't be represented as JSON, it can't be serialised.
This rule came about in order to share data between JavaScript clients
and C++, C or PHP servers. I have subsequently found internal JSON
objects incredibly useful for building and passing dynamic types,
especially as I can use almost identical code in JavaScript, PHP and C++
(but alas, not C) to manipulate them.
 

Nick Keighley

At one point I did a lot of work with ASN.1, which is a specification
for defining protocols but is implemented using a variety of encoding
schemes. It actually is an effective means of serializing data and can
be very compact and, as I recall, could be extended to cover most data
encodings that I ran into, but it never seemed trivial to me. Debugging
a stream of encoded data can lead to premature blindness and
post-traumatic drunkenness.

I wrote a very basic ASN.1 decoder. It didn't seem that hard...

