Serialization library, request for feedback


Ulf Åström

Hello,

I'm writing a serialization library for C. It can convert binary data
(a struct or array, for example) into human-readable text and of
course also read serialized data back and reconstruct it in memory,
including nested pointers. The purpose is to provide a quick way to
save/load in applications and reduce the effort needed to keep file
formats in sync.

The user needs to set up translators (composed of one or more fields)
for each type they wish to serialize. Internally it works by
flattening each type into a list of void pointers and replacing them
with numeric indices. The output format has a simple "<tag> <value>;"
format; it is similar to JSON but it is not a reimplementation of it.

I'm looking for feedback on how I can make this most useful to other
people. The code and documentation can be found at:
http://www.happyponyland.net/serialize.php

I'm wondering if there is anything crucial I have missed. It can deal
with most primitive C types, strings, arrays and almost any
arrangement of linked structures. There are still a few pieces
missing; for example I would like to have a clean interface to let the
user provide custom read/write functions for complex data types. I
will also work on the type safety, so structures can't be linked
incorrectly by erroneous or malicious input.

How should I structure the documentation and what level of detail does
it need (I'm guessing more is always better)? Is the API consistent?

I'm not asking anyone to proofread my 3000 lines of spaghetti, but of
course I would also appreciate hearing thoughts on the safety and
portability of the source itself. I'm open to suggestions on how to
improve the conversion process.

Finally, do you think this could be genuinely useful? I'm doing it
just for fun and for use in my personal projects, but is there any
niche in programming where it would fit a need?

/Ulf
 

Keith Thompson

Ulf Åström said:
The user needs to set up translators (composed of one or more fields)
for each type they wish to serialize. Internally it works by
flattening each type into a list of void pointers and replacing them
with numeric indices. The output format has a simple "<tag> <value>;"
format; it is similar to JSON but it is not a reimplementation of it.
[...]

Why not just use JSON? It would make the flattened files accessible by
other tools.
 

Ulf Åström

Why not just use JSON?  It would make the flattened files accessible by
other tools.

I'm considering it, but I haven't read the JSON specs in detail yet.
At a glance they seem to be mostly compatible but I don't know how
well it would mesh with my memory offset <-> value layout or how I
would perform pointer substitution. I haven't decided if I want to
pursue a fully compliant JSON implementation, or take the format in
another direction entirely.

Also, there is always the question: Why not use C#, Java, or anything
else that has these things built in? Hm.

/Ulf
 

Ben Bacarisse

Ulf Åström said:
I'm writing a serialization library for C. It can convert binary data
(a struct or array, for example) into human-readable text and of
course also read serialized data back and reconstruct it in memory,
including nested pointers. The purpose is to provide a quick way to
save/load in applications and reduce the effort needed to keep file
formats in sync.

The user needs to set up translators (composed of one or more fields)
for each type they wish to serialize. Internally it works by
flattening each type into a list of void pointers and replacing them
with numeric indices. The output format has a simple "<tag> <value>;"
format; it is similar to JSON but it is not a reimplementation of it.

I'm looking for feedback on how I can make this most useful to other
people. The code and documentation can be found at:
http://www.happyponyland.net/serialize.php

I'm wondering if there is anything crucial I have missed. It can deal
with most primitive C types, strings, arrays and almost any
arrangement of linked structures.

The most striking omissions are support for bool, wide strings and
complex types. This is not a criticism, just an observation.

The other area that I could not understand is how to serialise multiple
pointers into a single object. For example, how does one serialise

struct ring_buffer {
T buffer[BSIZE];
T *head, *tail;
};

where 'head' and 'tail' both point to an element of 'buffer'? This may
be just because there's no documentation yet -- I had a look at the API
and did not immediately see how this could be done.

As for the design itself, I got stuck on an issue in the first example.
You seem to require a dummy object in order to pass pointers to its
members to the set-up functions, and a comment purports to explain why
this is needed but I just don't see it. The comment includes a
statement that is wrong (if I understand it) that a pointer to the first
member of a field may not equal (when suitably converted) a pointer to
the structure object itself. Even if this were true, I don't see why
that means you can't just pass the offset of the field. At the least,
more explanation is required.

<snip>
 

Ulf Åström

The most striking omissions are support for bool, wide strings and
complex types.  This is not a criticism, just an observation.

Ok! I will have a look at this. I don't use them much myself so I
haven't thought of them, but they should be supported.
The other area that I could not understand is how to serialise multiple
pointers into a single object.  For example, how does one serialise

  struct ring_buffer {
     T buffer[BSIZE];
     T *head, *tail;
  };

where 'head' and 'tail' both point to an element of 'buffer'?  This may
be just because there's no documentation yet -- I had a look at the API
and did not immediately see how this could be done.

Here is a quick example (assuming int for T):

struct ring_buffer rbuf;
tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
field->repeat = BSIZE;
ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);

The 1s to ser_new_field indicate that it is a pointer.

This exposed a bug, however; the int pointers will only be repointed
correctly (during inflation) if they point to rbuf.buffer[0]; other
addresses would be duplicated in memory. This is because it will only
check for pointers that exactly match a field offset. It should also
check all elements of an array.
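For what it's worth, the array-aware check described above could be sketched like this (this is my own illustration, not the library's actual code; the function name and signature are invented):

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical helper: does p point at an element of the array starting
   at base (elem_size bytes per element, count elements)?  If so, report
   which element, so the serializer can repoint to "field + index"
   instead of duplicating the target in memory. */
static int ptr_in_array(const void *p, const void *base,
                        size_t elem_size, size_t count, size_t *index_out)
{
    const char *cp = (const char *)p;
    const char *cb = (const char *)base;

    /* Note: relational comparison of pointers is only well-defined when
       both point into the same object; the serializer would only call
       this for pointers it suspects land inside this array field. */
    if (cp < cb || cp >= cb + elem_size * count)
        return 0;
    if ((size_t)(cp - cb) % elem_size != 0)
        return 0;   /* inside the array, but not on an element boundary */
    *index_out = (size_t)(cp - cb) / elem_size;
    return 1;
}
```

With this, a head pointer at rbuf.buffer[3] would deflate to index 3 rather than to a duplicated int.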
As for the design itself, I got stuck on an issue in the first example.
You seem to require a dummy object in order to pass pointers to its
members to the set-up functions, and a comment purports to explain why
this is needed but I just don't see it.  The comment includes a
statement that is wrong (if I understand it) that a pointer to the first
member of a field may not equal (when suitably converted) a pointer to
the structure object itself.  Even if this were true, I don't see why
that means you can't just pass the offset of the field.  At the least,
more explanation is required.

It's a typo; it should say "pos must be within st".

Using a dummy isn't strictly necessary, you may as well pass zero and
an offset. I designed it this way so people won't hardcode offset
values, then change their structures (or start porting their program
to a different architecture!) but forget to update the translator.
There doesn't seem to be any other reliable way to get the offset of a
member (offsetof() with null cast is undefined behaviour, according to
Wikipedia). Anyway, the dummy is only needed when setting up the
translator. Perhaps I could change it to only take an offset but
suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).

Good suggestions, thanks a lot.

/Ulf
 

Ben Bacarisse

Ulf Åström said:
On Dec 14, 12:11 am, Ben Bacarisse <[email protected]> wrote:
The other area that I could not understand is how to serialise multiple
pointers into a single object.  For example, how does one serialise

  struct ring_buffer {
     T buffer[BSIZE];
     T *head, *tail;
  };

where 'head' and 'tail' both point to an element of 'buffer'?  This may
be just because there's no documentation yet -- I had a look at the API
and did not immediately see how this could be done.

Here is a quick example (assuming int for T):

struct ring_buffer rbuf;
tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
field->repeat = BSIZE;
ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);

The 1s to ser_new_field indicate that it is a pointer.

What about pointers to pointers?
This exposed a bug, however; the int pointers will only be repointed
correctly (during inflation) if they point to rbuf.buffer[0]; other
addresses would be duplicated in memory. This is because it will only
check for pointers that exactly match a field offset. It should also
check all elements of an array.

Thanks. I think I see what's happening. As I understand it, this
suggests there must be a "root" structure for all of the data, but maybe
what you are doing is more subtle than that.

As an example, imagine a binary tree of nodes and a separate linked list
of node pointers. Can I serialise this? Must I serialise the tree
first?
It's a typo; it should say "pos must be within st".

I think that was a reference to the API documentation ("Note: pos must be
within pos"). I was talking about the comment in example1.c:

"In this example &thing_dummy and &thing_dummy.a will most probably be
the same location. &thing_dummy.b on the other hand will typically end
up 2 to 8 bytes from the base pointer (depending on the computer
architecture). For this reason we do not pass the offset as a number
but rely on the compiler to tell us exactly where the members will
be."

In your example, &thing_dummy.a and &thing_dummy must *always* be the
same location so the phrase "will most probably be" just looks odd.

That aside, the following bit is what threw me: "For this reason we do
not pass the offset as a number". Nothing previous to this seems to
explain why you pass two pointers rather than an offset.
Using a dummy isn't strictly necessary, you may as well pass zero and
an offset. I designed it this way so people won't hardcode offset
values, then change their structures (or start porting their program
to a different architecture!) but forget to update the translator.

That is a reason but not, to my mind, a strong one. To complicate the
API because it's possible to so grossly abuse a simpler one seems to be
contrary to the spirit of C. What's more, if you are serious about
this, then you should not permit sizeof(myType) in the earlier call,
since a programmer might hard code the size. Instead, they should be
obliged to pass two pointers: &thing_dummy and &thing_dummy + 1. If you
accept a simple size in one case, permit a simple offset in the other.
There doesn't seem to be any other reliable way to get the offset of a
member (offsetof() with null cast is undefined behaviour, according to
Wikipedia). Anyway, the dummy is only needed when setting up the
translator. Perhaps I could change it to only take an offset but
suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).

What's wrong with offsetof? I don't follow what you mean about a "null
cast" being undefined.

BTW, writing &rbuf.head - &rbuf won't work -- the pointers are of the
wrong type. You'd need to cast to char * but that's so messy. I think
offsetof is the way to go.
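To make the offsetof route concrete, here is a self-contained illustration (the struct mirrors Ben's ring_buffer example, with int for T; the constant names are mine):

```c
#include <stddef.h>   /* offsetof */

#define BSIZE 8

struct ring_buffer {
    int buffer[BSIZE];
    int *head, *tail;
};

/* offsetof expands to a constant byte offset; no dummy object is
   needed and no undefined behaviour occurs in user code.  The values
   automatically track any change to the struct layout or target
   architecture, which addresses the "hardcoded offset" worry. */
static const size_t off_buffer = offsetof(struct ring_buffer, buffer);
static const size_t off_head   = offsetof(struct ring_buffer, head);
static const size_t off_tail   = offsetof(struct ring_buffer, tail);
```

A field-registration function that accepts a plain offset could then simply be handed off_head, with no dummy object in sight.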
 

Ian Collins

Ulf said:
I'm considering it, but I haven't read the JSON specs in detail yet.

There isn't a great deal to read...
At a glance they seem to be mostly compatible but I don't know how
well it would mesh with my memory offset <-> value layout or how I
would perform pointer substitution. I haven't decided if I want to
pursue a fully compliant JSON implementation, or take the format in
another direction entirely.

Why make things complex? JSON is an ideal candidate for representing
structure and array types. It is after all designed as an object notation.
Also, there is always the question: Why not use C#, Java, or anything
else that has these things built in? Hm.

Other languages have better support for manipulating JSON objects, but
at least one of them (PHP) uses a C library under the hood.
 

Bart van Ingen Schenau

Using a dummy isn't strictly necessary, you may as well pass zero and
an offset. I designed it this way so people won't hardcode offset
values, then change their structures (or start porting their program
to a different architecture!) but forget to update the translator.
There doesn't seem to be any other reliable way to get the offset of a
member (offsetof() with null cast is undefined behaviour, according to
Wikipedia). Anyway, the dummy is only needed when setting up the
translator. Perhaps I could change it to only take an offset but
suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).

You could also setup the interface like this:

ser_field_t * ser_new_field_impl(ser_tra_t * tra, const char * type,
                                 const int ref, const char * tag,
                                 const int offset);
#define ser_new_field(tra, type, ref, tag, src_type, field) \
    ser_new_field_impl(tra, type, ref, tag, offsetof(src_type, field))

Also note that offsetof must be defined by the compiler/implementation
and they can use whatever magic that gets the job done, including the
trick with dereferencing a null pointer.
If you do the same, you incur Undefined Behaviour, but the compiler
itself is above the law in this respect.
That makes offsetof() the easiest and most reliable way to obtain the
offset of a member of a structure; it is in the standard precisely
because you need compiler magic to get the offset without incurring UB
or the overhead of a dummy object.
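A small check (struct and function names are mine) shows that the well-defined dummy-object subtraction and offsetof agree; the null-pointer trick mentioned above is what an implementation may hide behind the macro, and is exactly the part user code must not write itself:

```c
#include <stddef.h>

struct thing {
    char a;
    double b;
};

/* The dummy-object route discussed in the thread: subtracting two
   char pointers into the same object is well-defined C. */
static size_t offset_via_dummy(void)
{
    struct thing dummy;
    return (size_t)((char *)&dummy.b - (char *)&dummy);
}

/* The same offset via the standard macro.  An implementation might
   define it as ((size_t)&(((struct thing *)0)->b)) -- undefined
   behaviour if user code does that directly, but the implementation
   itself is "above the law" here, as Bart puts it. */
static size_t offset_via_offsetof(void)
{
    return offsetof(struct thing, b);
}
```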

Bart v Ingen Schenau
 

BGB

There isn't a great deal to read...


Why make things complex? JSON is an ideal candidate for representing
structure and array types. It is after all designed as an object notation.

yep, JSON makes sense.

one possible downside though is that it doesn't normally identify object
types, which means some means may be needed to identify what sort of
struct is being serialized, and/or the type of array, ...

it is either that, or deal with data in a dynamically-typed manner,
rather than directly mapping it to raw C structs.

JSON is generally better IMO for serializing dynamically-typed data,
than for doing data-binding against structs.

another minor problem with JSON is that, in its pure form, it has no
good way to deal with cyclic data (where a referenced object may refer
back to a prior object), but an extended form could allow this.



in my case, one mechanism I have serializes a wider range of data into a
binary format, but in my case, only deals with dynamically type-tagged
data (which basically means that it was allocated via my GC API, and
with a type-name supplied), and also that the typedefs have any relevant
annotations (it depends on data gathered by using a tool to parse the
headers in order to work correctly).

it may try to use a special serialization handler if one is registered,
but will otherwise fall back to using key/value serialization of the
structs.


the basic format (for each member) is:
<typeindex:VLI> <size:VLI> <data:byte[size]>

where 'VLI' is a special encoding for variable-length-integers.
in my case:
00-7F 0-127
80-BF XX 128-16383
C0-DF XX XX 16384-2097151
....

a slight VLI variant (SVLI) encoding signed values by folding the sign
into the LSB, so, the values follow a pattern:
0, -1, 1, -2, 2, ...
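a sketch of that scheme, with the byte layouts assumed from the table above (the 4-byte-and-up cases elided by "...." are omitted here, and the function names are mine):

```c
#include <stddef.h>

/* VLI: 0..127 in one byte (00-7F), 128..16383 in two (80-BF XX),
   16384..2097151 in three (C0-DF XX XX).  Larger values omitted. */
static size_t vli_encode(unsigned long v, unsigned char *out)
{
    if (v < 0x80) {
        out[0] = (unsigned char)v;
        return 1;
    }
    if (v < 0x4000) {
        out[0] = (unsigned char)(0x80 | (v >> 8));
        out[1] = (unsigned char)(v & 0xFF);
        return 2;
    }
    out[0] = (unsigned char)(0xC0 | (v >> 16));
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)(v & 0xFF);
    return 3;
}

static size_t vli_decode(const unsigned char *in, unsigned long *v)
{
    if (in[0] < 0x80) { *v = in[0]; return 1; }
    if (in[0] < 0xC0) {
        *v = ((unsigned long)(in[0] & 0x3F) << 8) | in[1];
        return 2;
    }
    *v = ((unsigned long)(in[0] & 0x1F) << 16)
       | ((unsigned long)in[1] << 8) | in[2];
    return 3;
}

/* SVLI: fold the sign into the least significant bit, so values map
   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... */
static unsigned long svli_fold(long v)
{
    return (v < 0) ? 2UL * (unsigned long)(-v) - 1
                   : 2UL * (unsigned long)v;
}

static long svli_unfold(unsigned long u)
{
    return (u & 1) ? -(long)((u + 1) / 2) : (long)(u / 2);
}
```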


the format was basically just a flat-array of serialized members, and
these members were linked via index (and allowed both backwards and
forwards references). it has a limited form of value-identity preservation.

the typeindex members are basically indices into this array, giving the
elements which give the type-name (which, as a simplifying assumption,
are assumed to be ASCII strings).

index 0 is reserved for NULL, and is not encoded. index 1 is the first
encoded index, and serves as the "root member" (basically, the "thing
that the program asked the serializer to serialize").

as a later addition, if the typeindex is 0, then the member is a comment
(and does not appear in the member array). a comment member immediately
at the start of the file is used to indicate the "type" of the file
(basically, it is a "magic string"), which is then followed by the root
member.


a partial drawback was that the format doesn't have any good way to
indicate "how" the data is encoded, making it potentially more subject
to versioning issues (consider, for example, if a structure-layout
changes, ...). I have tried previously to develop self-describing
serialization formats in the past, but trying to make a format fully
self-describing tends to make working with it unreasonably complex. (the
basic idea here would be that not only would the format identify the
types in use, but it would also encode information to describe all of
the data encodings used by the format, down to the level of some number
of "atomic" types, ...).


however, my Script-VM's bytecode serialization format is based on the
above mechanism (the bytecode is actually just the result of serializing
the output from compiling a source-module).



some amount of stuff also uses an S-Expression based notation:
999 //integer number
3.14159 //real number
"text" //string
name //symbol (identifier, used to identify something)
:name //keyword (special type of literal identifier)
name: //field or item name
....
( values ) //list of items (composed of "cons cells")
#( values ) //array of items (dynamically typed)
{ key: value ... } //object (dynamically-typed)
#A<sig> ( ... ) //array (statically-typed)
#X<name> { key: value ... } //struct
#L<name> { key: value ... } //instance of a class
....
#idx# //object index (declaration or reference)
#z //null
#u //undefined
#t //true
#f //false
....


so, for example, a struct like:
typedef dytname("foo_t") as_variant //(magic annotations, 1)
struct Foo_s Foo;

struct Foo_s {
    Foo *next;
    char *name;
    int x;
    float y;
    double z[16];
};


1: these annotations are no-op macros in a normal C compiler (and mostly
expand to special attributes used by the header-processing tool).

"dytname()" basically gives the type-name that will be used when
allocating instances of this struct-type (it is used to key the
type-name back to the struct).

"as_variant" is basically a hint for how it should be handled by my
scripting language. this modifier asserts that the type should be
treated as a self-defined VM type (potentially opaque), rather than be
treated as a boxed-struct or as a boxed "pointer to a struct" (more
literally mapping the struct and/or struct pointer to the scripting
language, causing script code to see it more like how C would see it).


with the structs being created like:
Foo *obj;
obj=gctalloc("foo_t", sizeof(Foo)); //allocate object with type-tag
obj->name=dystrdup("foo_instance_13"); //make a new tagged string
....


might be serialized as:
#0# = #X<foo_t> { next: #1# name: "foo_instance_13" x: 99 y: 4.9 z:
#A<d> ( 2.0 3.0 ... ) }
#1# = #X<foo_t> { ... }

where this format works, but isn't really, exactly, pretty...



I also have a network protocol I call "BSXRP", which basically works
very similar to the above (same data model, ...), just it uses Huffman
coding and predictive context modeling of the data, and "clever" ways of
VLC coding what data-values are sent. (compression is favorable to that
of S-Expressions + Deflate, typically being around 25% the size, whereas
Deflate by itself was reducing the S-Expressions to around 10% their
original size, or IOW: around 2.5% the size of the textual
serialization). (basically, if similar repeating structures are sent,
prior structures may be used as "templates" for sending later
structures, allowing them to be encoded in fewer bits, essentially
working sort of like a dynamically-built schema).

as-before, it has special cases to allow encoding cyclic data, but the
protocol does not generally preserve "value-identity" (value-identity or
data-identity is its own hairy set of issues, and in my case I leave
matters of identity to higher-level protocols).

some tweaks to the format also allow it to give modest compression
improvements over Deflate when being used for delivering lots of short
plaintext or binary data messages (it essentially includes a Deflate64
like compressor as a sub-mode, but addresses some "weak areas" regarding
Deflate).

a minor drawback though is that the context models can eat up a lot of
memory (the memory costs are considerably higher than those of Deflate).

(it was originally written partly as a "proof of concept", but is,
technically, pretty much overkill).

Other languages have better support for manipulating JSON objects, but
at least one of them (PHP) uses a C library under the hood.

yeah...

I use variants of both JSON and S-Expressions, but mostly for
dynamically typed data.

not depending on the use of type-tags and data mined from headers would
require a rather different implementation strategy.

most of my code is C, but I make fairly extensive use of dynamic-type
tagging.


so, yeah, all this isn't really a "general purpose" set of solutions for
the data-serialization process.

I suspect, though, that there may not actually be any sort of
entirely "general purpose" solution to this problem...

and, as-is, to use my implementation would probably require dragging
around roughly 400 kloc of code, and it is very likely that many
people would object to needing to use a special memory manager and
code-processing tools to be able to use these facilities...


or, IOW:
if you allocate the data with "malloc()" or via a raw "mmap()" or
similar, a lot of my code will have no idea what it is looking at (yes,
a lame limitation, I know).

granted, the scripting language can partly work around it:
if you don't use "as_variant" modifier, the struct will map literally,
and ironically, this allows script-code to still use "malloc()" for
these types.

however, the data serialization code generally isn't this clever...


or such...
 

Ian Collins

BGB said:
yeah...

I use variants of both JSON and S-Expressions, but mostly for
dynamically typed data.

not depending on the use of type-tags and data mined from headers would
require a rather different implementation strategy.

most of my code is C, but I make fairly extensive use of dynamic-type
tagging.

I originally wrote a JSON library to enable my (C++) server side web
application to interact with client side JavaScript. I soon found the
objects extremely useful for building dynamic type objects in general
programming. I doubt the same would be true in C, not elegantly at least.
 

Ulf Åström

yep, JSON makes sense.

one possible downside though is that it doesn't normally identify object
types, which means some means may be needed to identify what sort of
struct is being serialized, and/or the type of array, ...

it is either that, or deal with data in a dynamically-typed manner,
rather than directly mapping it to raw C structs.

I can see a couple of problems with JSON. It is really just an
associative array and doesn't specify the type of objects, so I would
either need to guess (from the fields provided) or add a wrapper of
some kind telling what each element is. It also seems JSON has no
built-in way to do object references.

I can imagine something like

{
  "some_struct #1" : { "property" : "value" },
  "another_struct #2" : { "reference" : "#1" }
}

where another_struct.reference would be defined as a some_struct
pointer, so it seems I can get around this. I'm not sure how useful
the end result would really be, though; even if it is technically
valid as JSON data and other applications can parse it, they still
wouldn't know what to make of it.

Another (minor) problem is that JSON doesn't allow comments, while my
format does.

/Ulf
 

Ben Bacarisse

Ian Collins said:
There isn't a great deal to read...


Why make things complex? JSON is an ideal candidate for representing
structure and array types. It is after all designed as an object
notation.

I don't think JSON is a good fit for the OP's task. There are lots of C
data structures that have no obvious JSON representation at all.

<snip>
 

Ulf Åström

Ulf Åström said:
On Dec 14, 12:11 am, Ben Bacarisse <[email protected]> wrote:
The other area that I could not understand is how to serialise multiple
pointers into a single object. For example, how does one serialise
struct ring_buffer {
T buffer[BSIZE];
T *head, *tail;
};
where 'head' and 'tail' both point to an element of 'buffer'? This may
be just because there's no documentation yet -- I had a look at the API
and did not immediately see how this could be done.
Here is a quick example (assuming int for T):
struct ring_buffer rbuf;
tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
field->repeat = BSIZE;
ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);
The 1s to ser_new_field indicate that it is a pointer.

What about pointers to pointers?

It can be done; have a look at test_single_ptr() in demo.c (I still
need to write up a clean example for this). It requires an extra
translator for "pointer to type", so other translators can reference
the pointer type itself. In this case it also needs the _int wrapper
(with a single int at offset 0) since I haven't found a satisfactory
way yet to translate primitives on their own.
This exposed a bug, however; the int pointers will only be repointed
correctly (during inflation) if they point to rbuf.buffer[0]; other
addresses would be duplicated in memory. This is because it will only
check for pointers that exactly match a field offset. It should also
check all elements of an array.

Thanks. I think I see what's happening. As I understand it, this
suggests there must be a "root" structure for all of the data, but maybe
what you are doing is more subtle than that.

As an example, imagine a binary tree of nodes and a separate linked list
of node pointers. Can I serialise this? Must I serialise the tree
first?

This is just the kind of problem I'm hoping to solve.

It needs translators for the tree nodes and the list nodes. It is true
it only works on hierarchical data. The only way with the current API
would be to make a wrapper struct (and a translator for it) that
points to both the first tree and list nodes; this should be passed to
ser_ialize(). I can imagine using a variadic function to pass an
arbitrary number of type/pointer pairs and omit the wrapper, but
another problem would arise in the other end since ser_parse() will
only return a single pointer.
I think that was a reference to the API documentation ("Note: pos must be
within pos"). I was talking about the comment in example1.c:

"In this example &thing_dummy and &thing_dummy.a will most probably be
the same location. &thing_dummy.b on the other hand will typically end
up 2 to 8 bytes from the base pointer (depending on the computer
architecture). For this reason we do not pass the offset as a number
but rely on the compiler to tell us exactly where the members will
be."

In your example, &thing_dummy.a and &thing_dummy must *always* be the
same location so the phrase "will most probably be" just looks odd.

True; I was probably overthinking things and imagining a larger
structure.
That aside, the following bit is what threw me: "For this reason we do
not pass the offset as a number". Nothing previous to this seems to
explain why you pass two pointers rather than an offset.


That is a reason but not, to my mind, a strong one. To complicate the
API because it's possible to so grossly abuse a simpler one seems to be
contrary to the spirit of C. What's more, if you are serious about
this, then you should not permit sizeof(myType) in the earlier call,
since a programmer might hard code the size. Instead, they should be
obliged to pass two pointers: &thing_dummy and &thing_dummy + 1. If you
accept a simple size in one case, permit a simple offset in the other.

I agree; I should trust the programmer to know what they are doing. I
will be changing it to take only an offset.

/Ulf
 

Rui Maciel

Ulf said:
I'm wondering if there is anything crucial I have missed. It can deal
with most primitive C types, strings, arrays and almost any
arrangement of linked structures. There are still a few pieces
missing; for example I would like to have a clean interface to let the
user provide custom read/write functions for complex data types. I
will also work on the type safety, so structures can't be linked
incorrectly by erroneous or malicious input.

I believe that's a poor approach to a common problem which also has a very
common and simple solution: for the output, a single custom
operator/function that takes in a data type/struct and calls fprintf() to
dump the relevant text following a specific data format, and for the input,
a custom parser. That's it. No need to complicate things.

The "let's blindly dump everything" approach is fundamentally broken,
because it actually doesn't describe the information. Instead, it dumps the
data as how it was stored in a particular data structure and omits any
semantics associated with it. This causes significant problems, because it
doesn't perform any integrity checks on the information and essentially
eliminates the possibility of adding the necessary sanity checks
to the parsing process, other than testing if a particular primitive data
type is actually valid. Therefore, dumping data makes it impossible to test
if the data which has been imported actually represents valid information.
And this is a bad thing.

The snags you've stumbled on are a natural consequence of trying to dump
data instead of exporting information, which happen to be trivial to fix
with the parser/document dump combo approach. Complex data types are never
a problem because the programmer is responsible for writing the necessary I/O
routines, and all the necessary sanity checks are automatically performed,
not only with regards to type safety but also in the form of semantics
checks, including the ability to infer if a problem is fatal or recoverable.


Rui Maciel
 

Rui Maciel

Ben said:
I don't think JSON is a good fit for the OP's task. There are lots of C
data structures that have no obvious JSON representation at all.

What data structures do you have in mind?


Rui Maciel
 

Ben Bacarisse

Rui Maciel said:
What data structures do you have in mind?

My ring_buffer example is one. In fact, any structure where pointers are
"shared" would seem to be a bad fit with JSON.

Maybe "no obvious JSON representation" is too strong because you can
probably always map pointers to some sort of index or string label, but
it's not a natural fit.
 

BGB

I originally wrote a JSON library to enable my (C++) server side web
application to interact with client side JavaScript. I soon found the
objects extremely useful for building dynamic type objects in general
programming. I doubt the same would be true in C, not elegantly at least.


I actually use a fair amount of dynamically-typed stuff in C.

another big area has to do with my scripting language (largely itself
based on JavaScript and ActionScript).


the main downsides have to do with performance, namely that
dynamically-typed code just doesn't perform anywhere near as well as
statically-typed code.

in many areas, this isn't a big issue, and/or leads to a hybrid
strategy, with some occasional type-checking, but much of the code
remains statically typed (for example, much of my 3D renderer largely
works this way).

typically, this would be using dynamic type-checking more for things
like classifying "what type of object is this?" and dispatching more to
the appropriate part of the renderer (or, IOW: type-checking on the
level of structures). a few places are "optimized" though, mostly to try
to avoid using dynamically-typed operations in tight-loops (such as
scene-graph object lookup operations, ...).


side note (main project):
http://www.youtube.com/user/BGBTech


(note, a lot of parts of the video are from much earlier versions of my
3D engine...).


(although the 3D engine has some amount of code written in my
script-language, I generally don't use script-code within the 3D
renderer, as the language's performance is a bit weak, and the 3D
renderer is a lot more performance-sensitive).

sadly, performance is kind of a problem area sometimes, but much of this
more has to do with getting everything fed through OpenGL.


in my case, as-noted, the type-tagging is largely transparent to C code,
apart from the use of special allocation and freeing functions, and
supplying type-names, ...

these type-names can be retrieved via API calls, like:
BGBDY_API char *dygettype(dyt val);

or checked with functions like:
BGBDY_API int dyTypeP(dyt val, char *ty);


many operations on dynamically typed values may check types using these
mechanisms, but many internal operations essentially bypass this
(usually, because if/else chains of "strcmp()" calls can get expensive...).


elsewhere, for many core operations rather than fetching the type-name
from the object, the code will instead fetch a "type vtable":

BGBDY_API dyt dyCopyValue(dyt p)
{
    BGBDY_ObjVTab *ty;
    if(dyCheckValueNoCopyP(p))
        return(p);
    ty=BGBDY_GetObjVTabFast(p);
    if(ty && ty->copyValue)
        return(ty->copyValue(p));
    return(p);
}

BGBDY_API int dyDropValue(dyt p)
{
    BGBDY_ObjVTab *ty;
    if(dyCheckValueNoCopyP(p))
        return(0);
    ty=BGBDY_GetObjVTabFast(p);
    if(ty && ty->dropValue)
        return(ty->dropValue(p));
    return(0);
}

where BGBDY_GetObjVTabFast basically calls off to code for some
(slightly long/ugly) logic for fetching the object-base, fetching the
"ObjType", and returning its "vtable" member.



in a few places (due mostly to performance issues), some of this has
led to frequent use of things like type-signatures and untagged union
types:

#ifndef DYTVALUE_T
#define DYTVALUE_T
typedef union
{
    s32 i;              //int
    u32 ui;             //uint
    s64 l;              //long
    u64 ul;             //ulong
    f32 f;              //float
    f64 d;              //double
    nlint a;            //address
    void *pv;           //raw pointer
    dyt r;              //variant reference
    dytf cr;            //variant reference (double)
    dycClass cls;       //class
    dycObject obj;      //object
    dycArray arr;       //array
    struct { int x, y; } k; //lexical env index
} dycValue;
#endif


the main reason being that these allow higher performance.

for the untagged union, it is because one can access values like:
dycValue a, b, c;
....
c.i=a.i+b.i;

and, apparently, the compiler is smart enough to figure this out at least
(and generate vaguely efficient code).

the reason for type-signatures is that unlike full dynamic typing,
signatures are more readily usable with the "build stuff with function
pointers" strategy.

say, for example:
void DYLL_CopyValueBufI(dycValue *v, void *p)
    { *(s32 *)p=v->i; }
void DYLL_CopyValueBufL(dycValue *v, void *p)
    { *(s64 *)p=v->l; }
....

typedef struct {
    void (*CopyValueBuf)(dycValue *v, void *p);
    void (*CopyBufValue)(void *p, dycValue *v);
    ....
    int offs;
} DYLL_ValueTransfer;

/* forward typedef, so the struct can refer to itself */
typedef struct DYLL_ArgList_s DYLL_ArgList;
struct DYLL_ArgList_s {
    DYLL_ValueTransfer *ret;
    DYLL_ValueTransfer *args;
    int nargs;
    void (*CopyValuesBuf)(DYLL_ArgList *ctx, dycValue *args, byte *buf);
    void (*CopyBufValues)(DYLL_ArgList *ctx, byte *buf, dycValue *args);
    ....
};

....

void DYLL_CopyValuesBuf3(DYLL_ArgList *ctx, dycValue *args, byte *buf)
{
    ctx->args[0].CopyValueBuf(args+0, buf+ctx->args[0].offs);
    ctx->args[1].CopyValueBuf(args+1, buf+ctx->args[1].offs);
    ctx->args[2].CopyValueBuf(args+2, buf+ctx->args[2].offs);
}

BGBDY_API void DYLL_CopyValuesBuf(DYLL_ArgList *ctx,
    dycValue *args, byte *buf)
    { ctx->CopyValuesBuf(ctx, args, buf); }

where the "DYLL_ArgList" structure may be built via a function which
parses the signature.

....

which, although arguably not exactly efficient, can always be a good
deal worse...

(there is a good deal more of this sort of logic for working with
"va_list" argument lists as well, typically for indirect calls
originating within C land; a lot of this stuff has to deal with calls
between natively compiled and interpreted code).

generally the "buffer" is a packed buffer-like representation of an
argument list, and is generally what is used by the ASM parts of the
call-handling machinery (which may in-turn repack these buffers into
ABI-specific representations, and then initiate the call).

partly, this is because it isn't really desirable to have to deal with
the various possible argument-list representations and ASM code at the
same time.


in some places, the function-pointer funkiness serves to bypass older
machinery which often involves lots of sometimes long if/else chains.


a downside though is that a lot of this is, admittedly, a bit more
painful to work with, and the code can get very long/winding and ugly
(so, the spread of this sort of code in many places within the VM core
has both good and bad points).

however, a lot of these sorts of mechanisms end up somewhat abstracted,
and so are generally unseen by "user code" (luckily, it all looks much
nicer from the outside).


or such...
 
B

BGB

I can see a couple of problems with JSON. It is really just an
associative array and doesn't specify the type of objects, so I would
either need to guess (from the fields provided) or add a wrapper of
some kind telling what each element is. It also seems JSON has no
built-in way to do object references.

actually, it can do both, it is just maybe not entirely obvious.
it helps to imagine how JavaScript itself works; JSON maps fairly
directly onto the associated JavaScript objects.

I can imagine something like

{
"some_struct #1" : { "property" : "value" },
"another_struct #2" : { "reference" : "#1" }
}

where another_struct.reference would be defined as a some_struct
pointer, so it seems I can get around this. I'm not sure how useful
the end result would really be, though; even if it is technically
valid as JSON data and other applications can parse it, they still
wouldn't know what to make of it.

Another (minor) problem is that JSON doesn't allow comments, while my
format does.

this is all, not exactly right...


however, how it actually works gets a little less trivial to explain.

handling it more "effectively" would involve essentially throwing
together dynamic type facilities on top of C.

a way of imagining this, is if you have some sort of opaque type, which
can hold a value of any type (no, it is *not* a string).

a popular solution (though not the one I am using) is basically to treat
the low bits of a pointer as a "magic tag", for example:
low 3 bits:
0=raw pointer
1/5=integer (30-bit)
2/6=float (30-bit)
3, 4=reserved
7=misc values (grabs another 5 bits as an expanded tag, with a 24-bit
value).

(my system is based instead on "magic address ranges" for some types,
and encoding the object type in the memory-manager's header, for other
types).
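A minimal sketch of the low-bits tagging scheme described above (all names
here are hypothetical, and the tag assignments are just for illustration):

```c
#include <stdint.h>

/* Tagged value: the low 3 bits of the word carry a type tag.
   Tag 0 is an ordinary (aligned) pointer; tag 1 marks a small integer. */
typedef uintptr_t dyt;

#define TAG_MASK 0x7u
#define TAG_INT  0x1u

/* Wrap a small integer: shift it up and stamp the tag into the low bits. */
static dyt dyint(intptr_t v)  { return ((uintptr_t)v << 3) | TAG_INT; }

/* Non-zero if the value carries the integer tag. */
static int dyintp(dyt v)      { return (v & TAG_MASK) == TAG_INT; }

/* Unwrap; note the signed right shift is implementation-defined for
   negative values, so a real implementation would handle sign explicitly. */
static intptr_t dyintv(dyt v) { return (intptr_t)(v >> 3); }
```

This works because malloc'd pointers are aligned, so their low bits are
always zero and are free to carry the tag.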


you might have functions, then, to wrap and unwrap values, say (using my
API as an example):
dyt dyint(s64 v); //wrap integer (or "long")
s64 dyintv(dyt v); //unwrap integer (or "long")
dyt dylong(s64 v); //wrap "long"
s64 dylongv(dyt v); //unwrap "long"
dyt dyfloat(float v); //wrap float
float dyfloatv(dyt v); //unwrap float
dyt dydouble(double v); //wrap double
double dydoublev(dyt v); //unwrap double
....

as well, as functions to perform "type-checks":
int dyintp(dyt v); //returns non-zero if value is an integer
int dyfloatp(dyt v);
....


and, so then, a person can write logic like:
if(dyintp(v))
{
    i=dyintv(v);
}else if(dyfloatp(v))
{
    f=dyfloatv(v);
}else if(dystringp(v))
{
    ...
}else if(dyObjectP(v))
{
    ...
}else ...


as for parsing JSON, it will generally involve looking at the syntax for
every token to identify what it is.

part of this is done in the "lexer"/"tokenizer", with the idea being
that the logic for splitting apart the input text into tokens, will
identify what sort of token it is looking at: numbers vs strings vs
braces vs operators vs ...

the parser logic will then build a representation of the data via the
tokens it sees.

a common strategy for hand-written parsers is "recursive descent", where
basically the code looks at the current tokens and sees what to do, and
then may call itself recursively in order to parse sub-items:
if(tokenty==TOK_INTEGER)
{
    value=dyint(atoi(token));
}else if(tokenty==TOK_REAL)
{
    value=dydouble(atof(token));
}else if(!strcmp(token, "{") && (tokenty==TOK_BRACE))
{
    ... parse object ...
}else if(!strcmp(token, "[") && (tokenty==TOK_BRACE))
{
    ... parse array ...
}else ...


a possible representation for a "naive" object, could be something like:
struct Object_s
{
    char **vars;
    dyt *vals;
    int n_vars;
    int m_vars;
};

with vars holding a list of member names, vals holding the values, and
n_vars holding a count for the number of members in an object, and
m_vars holding the max number of members (how many members can be added
before the arrays would need to be reallocated).
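A small sketch of how that naive object could be used, with illustrative
helper names (`obj_set`/`obj_get` are not from any of the posts above):

```c
#include <stdlib.h>
#include <string.h>

typedef void *dyt; /* stand-in for the variant value type */

struct Object_s
{
    char **vars;   /* member names */
    dyt *vals;     /* member values */
    int n_vars;    /* members in use */
    int m_vars;    /* allocated capacity */
};

/* Set (or add) a member; doubles the arrays when capacity runs out. */
static void obj_set(struct Object_s *o, const char *name, dyt val)
{
    int i;
    for (i = 0; i < o->n_vars; i++)
        if (!strcmp(o->vars[i], name)) { o->vals[i] = val; return; }
    if (o->n_vars == o->m_vars) {
        o->m_vars = o->m_vars ? o->m_vars * 2 : 4;
        o->vars = realloc(o->vars, o->m_vars * sizeof *o->vars);
        o->vals = realloc(o->vals, o->m_vars * sizeof *o->vals);
    }
    o->vars[o->n_vars] = malloc(strlen(name) + 1);
    strcpy(o->vars[o->n_vars], name);
    o->vals[o->n_vars++] = val;
}

/* Look up a member by name; returns NULL when absent. */
static dyt obj_get(struct Object_s *o, const char *name)
{
    int i;
    for (i = 0; i < o->n_vars; i++)
        if (!strcmp(o->vars[i], name)) return o->vals[i];
    return NULL;
}
```

The linear scan is the "naive" part; a hash table over the names would be
one of the more efficient representations alluded to below.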


there are potentially more efficient ways to handle all this, but this
is a "general idea".
 
B

BGB

My ring_buffer example is one. In fact, any structure where pointers are
"shared" would seem to be a bad fit with JSON.

Maybe "no obvious JSON representation" is too strong because you can
probably always map pointers to some sort of index or string label, but
it's not a natural fit.

yeah, these are general problems with data serialization, namely that
there is no good way to deal with "everything" within any reasonable
level of complexity.

it is easier then to find a reasonable subset of things that can be
serialized via a given mechanism, and use this.


more careful design of a serialization interface would be needed, for
example, to be able to deal with cyclic data structures, and, in turn,
to deal with preserving the identity of serialized objects (1).

1: if obja==objb, will this be true or false after the data is saved,
and re-loaded?...


some options will not preserve any identity, so if the same object is
reached via multiple paths, each may become separate objects when the
same image is read-in later.

other options may in-turn create false identity, where pretty much any
object which "looks the same" will have the same identity, regardless of
whether or not it was the same object prior to serialization.
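The usual way to preserve identity is the one Ben alluded to: assign each
distinct pointer an index the first time the writer sees it, so shared
references serialize to the same id. A tiny sketch (names are made up, and
a real implementation would use a hash table rather than a linear scan):

```c
#include <stddef.h>

/* Identity table for one serialization pass: each distinct pointer
   gets exactly one index, so two references to the same object write
   the same id and still compare equal after reload. */
#define MAX_OBJS 64
static const void *seen[MAX_OBJS];
static int n_seen;

/* Return the index for p, registering it on first encounter. */
static int obj_id(const void *p)
{
    int i;
    for (i = 0; i < n_seen; i++)
        if (seen[i] == p) return i;
    seen[n_seen] = p;
    return n_seen++;
}
```

A writer that emits `#obj_id(p)` instead of following `p` a second time
handles both sharing and cycles; skipping this step is exactly what loses
identity, and keying on contents instead of addresses is what creates the
false identity described above.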

identity becomes a particularly ugly issue if it needs to be dealt with
across the same object being saved across multiple images, or if network
gets involved (currently, I just don't deal with this problem). (this
part gets much more important if dealing with trying to serialize things
like program-state, say, where a "running program image" may be saved to
disk, and reloaded later, and lots of little "subtle details" will need
to be preserved, or else things will often go "critically wrong" when
the program state gets reloaded later on).


or, at least, short of much simpler but very ugly solutions, like doing
a raw memory-dump of the program's heap (I have done this before, but it
"doesn't really scale well"), but more practical solutions to general
program heap serialization are overly complicated, and it is easier just
to deal with smaller scale serialization and reloading (such as
"serialize this particular data structure").

a basic idea for how the heap-dump serialization works is:
imagine, if you will, that on startup the application uses a
memory-mapped file as the main VM heap, and uses special magic to
preserve any "roots" (generally global variables). on exiting and
reloading, any objects in this heap can be preserved, with the "roots"
automatically restored to their prior values.

yeah, long ago, one of my prior VMs actually did this, although the heap
wasn't a single giant file, but rather a bunch of several-MB "heap chunks".

I later generally abandoned this strategy, as it had some particularly
nasty technical issues (every exit/reload cycle leaked memory that could
not be reclaimed, nasty "bit rot" problems, structure-layout problems, ...).

it is IMO better to actually just try to "serialize the contents of the
data", rather than just "stupidly dump the memory to the HDD and try to
reload it later".



but, in general, dealing with these sorts of issues requires more
careful attention to detail, and sadly, this isn't really something JSON
does very well.

not that this makes JSON useless though.


more generally though, I just use plain files for data storage, as they
just work so-much better and more reliably.


one possible (partial) solution here for cyclic structures/... though is
an "extended and non-standard" JSON variant, say, where everything is
indexed.

#1 = { name: "foo", link: #2 };
#2 = { name: "bar", link: #3 };
#3 = { name: "baz", link: #4, first: #1 };
....
{ name: "someobject", listHead: #1 }

but, dealing effectively with a more comprehensive type-system would
require further somewhat extending the syntax, say, adding some
mechanism to identify up-front the type of object being serialized, ...


this is partly what my "somewhat hacked-up S-Expressions" do, even if
they are a little out of place (long ago, I had a Scheme implementation,
and even now, vestiges of it still remain, floating around in my
codebase...).


or such...
 

B

BGB

I believe that's a poor approach to a common problem which also has a very
common and simple solution: for the output, a single custom
operator/function that takes in a data type/struct and calls fprintf() to
dump the relevant text following a specific data format, and for the input,
a custom parser. That's it. No need to complicate things.

The "let's blindly dump everything" approach is fundamentally broken,
because it actually doesn't describe the information. Instead, it dumps the
data as how it was stored in a particular data structure and omits any
semantics associated with it. This causes significant problems, because it
doesn't perform any integrity checks on the information and essentially
eliminates the possibility of adding the necessary sanity checks
to the parsing process, other than testing if a particular primitive data
type is actually valid. Therefore, dumping data makes it impossible to test
if the data which has been imported actually represents valid information.
And this is a bad thing.

The snags you've stumbled on are a natural consequence of trying to dump
data instead of exporting information, and these happen to be trivial to fix
with the parser/document dump combo approach. Complex data types are never
a problem because the programmer is responsible for writing the necessary I/O
routines, and all the necessary sanity checks are performed there,
not only with regard to type safety but also in the form of semantic
checks, including the ability to infer whether a problem is fatal or recoverable.

makes sense IMO...


personally, I far more often use specialized file-formats, rather than
dumping data structures, and if serialization is used, it is usually
limited to certain sets of dedicated "safe to serialize" data-types,
which in my case are typically "lists" and "dynamically-typed objects".

most types provide dedicated handlers for serializing and reading back
in their serialized forms, many often using a small "version" tag, as a
way to indicate what fields are present in the data (this itself has
flaws, but basically works), or sometimes the handlers use individual
tags to indicate various items within the serialized structure.

types which don't bother with version-tags are typically things like
integer values or strings, which are basically "too simple to justify a
version tag".
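An illustrative versioned reader in that spirit (the struct, field names,
and wire layout here are all hypothetical, not the poster's actual format):

```c
#include <stdint.h>
#include <string.h>

/* A leading version byte tells the reader which fields follow. */
struct player {
    int32_t hp;
    int32_t mp;  /* added in version 2 */
};

/* Returns bytes consumed, or -1 for an unknown version. */
static int player_read(const uint8_t *buf, struct player *p)
{
    int ver = buf[0], n = 1;
    memset(p, 0, sizeof *p);            /* absent fields get defaults */
    if (ver < 1 || ver > 2) return -1;  /* fail loudly on unknown versions */
    memcpy(&p->hp, buf + n, sizeof p->hp); n += sizeof p->hp;
    if (ver >= 2) {
        memcpy(&p->mp, buf + n, sizeof p->mp); n += sizeof p->mp;
    }
    return n;
}
```

Old data keeps loading after new fields are added, which is the point of
the scheme; its flaw, as noted, is that a single version number can't
describe fields that were removed or reordered independently.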


actually, my frequent use of sequential read/write functions is a
major part of why, very often, I end up using variable-length integer
representations.

for example, the vast majority of integer-values are fairly small
(basically, forming a bell-curve with its center around 0), making it so
that, on average, small integers may only require 1 byte, and most
others are 2 or 3 bytes, independent of the physical storage-type of the
integer (32 or 64 bits).

like, although a variable-length integer isn't very good for
random-access data, it is pretty good for data that is read/written
sequentially.
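A common encoding with exactly these properties is the LEB128-style
varint (this sketch shows the general technique, not the poster's actual
representation): 7 value bits per byte, with the high bit set on every
byte except the last.

```c
#include <stdint.h>

/* Encode v as a ULEB128-style varint; returns bytes written (1..10).
   Values below 128 cost a single byte. */
static int varint_write(uint64_t v, uint8_t *out)
{
    int n = 0;
    do {
        uint8_t b = v & 0x7f;     /* low 7 bits */
        v >>= 7;
        if (v) b |= 0x80;         /* continuation bit: more bytes follow */
        out[n++] = b;
    } while (v);
    return n;
}

/* Decode a varint from in; stores the value in *v, returns bytes consumed. */
static int varint_read(const uint8_t *in, uint64_t *v)
{
    int n = 0, shift = 0;
    *v = 0;
    do {
        *v |= (uint64_t)(in[n] & 0x7f) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    return n;
}
```

Because the length is discovered byte-by-byte, this suits sequential
streams well but makes random access into the data impossible without
scanning, which matches the trade-off described above.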
 