Serialization library, request for feedback

Discussion in 'C Programming' started by Ulf Åström, Dec 13, 2012.

  1. Ulf Åström

    Ulf Åström Guest

    Hello,

    I'm writing a serialization library for C. It can convert binary data
    (a struct or array, for example) into human-readable text and of
    course also read serialized data back and reconstruct it in memory,
    including nested pointers. The purpose is to provide a quick way to
    save/load in applications and reduce the effort needed to keep file
    formats in sync.

    The user needs to set up translators (composed of one or more fields)
    for each type they wish to serialize. Internally it works by
    flattening each type into a list of void pointers and replacing them
    with numeric indices. The output format has a simple "<tag> <value>;"
    format; it is similar to JSON but it is not a reimplementation of it.
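
    To give a rough idea (this is only illustrative; the exact syntax
    the library emits may differ), a struct with an int member and a
    pointer member might come out as something like:

    some_struct #1 {
        count 42;
        next #2;
    };

    where #2 is the numeric index substituted for the pointer.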

    I'm looking for feedback on how I can make this most useful to other
    people. The code and documentation can be found at:
    http://www.happyponyland.net/serialize.php

    I'm wondering if there is anything crucial I have missed. It can deal
    with most primitive C types, strings, arrays and almost any
    arrangement of linked structures. There are still a few pieces
    missing; for example I would like to have a clean interface to let the
    user provide custom read/write functions for complex data types. I
    will also work on the type safety, so structures can't be linked
    incorrectly by erroneous or malicious input.

    How should I structure the documentation and what level of detail does
    it need (I'm guessing more is always better)? Is the API consistent?

    I'm not asking anyone to proofread my 3000 lines of spaghetti, but of
    course I would appreciate hearing thoughts on the safety and
    portability of the source itself. I'm open to suggestions on how to
    improve the conversion process.

    Finally, do you think this could be genuinely useful? I'm doing it
    just for fun and for use in my personal projects, but is there any
    niche in programming where it would fit a need?

    /Ulf
     
    Ulf Åström, Dec 13, 2012
    #1

  2. Ulf Åström <> writes:
    [...]
    > The user needs to set up translators (composed of one or more fields)
    > for each type they wish to serialize. Internally it works by
    > flattening each type into a list of void pointers and replacing them
    > with numeric indices. The output format has a simple "<tag> <value>;"
    > format; it is similar to JSON but it is not a reimplementation of it.

    [...]

    Why not just use JSON? It would make the flattened files accessible by
    other tools.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Will write code for food.
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Dec 13, 2012
    #2

  3. Ulf Åström

    Ulf Åström Guest

    On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    > Ulf Åström <> writes:
    >
    > > The user needs to set up translators (composed of one or more fields)
    > > for each type they wish to serialize. Internally it works by
    > > flattening each type into a list of void pointers and replacing them
    > > with numeric indices. The output format has a simple "<tag> <value>;"
    > > format; it is similar to JSON but it is not a reimplementation of it.

    >
    > Why not just use JSON?  It would make the flattened files accessible by
    > other tools.


    I'm considering it, but I haven't read the JSON specs in detail yet.
    At a glance they seem to be mostly compatible but I don't know how
    well it would mesh with my memory offset <-> value layout or how I
    would perform pointer substitution. I haven't decided if I want to
    pursue a fully compliant JSON implementation, or take the format in
    another direction entirely.

    Also, there is always the question: Why not use C#, Java, or anything
    else that has these things built in? Hm.

    /Ulf
     
    Ulf Åström, Dec 13, 2012
    #3
  4. Ulf Åström <> writes:

    > I'm writing a serialization library for C. It can convert binary data
    > (a struct or array, for example) into human-readable text and of
    > course also read serialized data back and reconstruct it in memory,
    > including nested pointers. The purpose is to provide a quick way to
    > save/load in applications and reduce the effort needed to keep file
    > formats in sync.
    >
    > The user needs to set up translators (composed of one or more fields)
    > for each type they wish to serialize. Internally it works by
    > flattening each type into a list of void pointers and replacing them
    > with numeric indices. The output format has a simple "<tag> <value>;"
    > format; it is similar to JSON but it is not a reimplementation of it.
    >
    > I'm looking for feedback how I can make this most useful to other
    > people. The code and documentation can be found at:
    > http://www.happyponyland.net/serialize.php
    >
    > I'm wondering if there is anything crucial I have missed. It can deal
    > with most primitive C types, strings, arrays and almost any
    > arrangement of linked structures.


    The most striking omissions are support for bool, wide strings and
    complex types. This is not a criticism, just an observation.

    The other area that I could not understand is how to serialise multiple
    pointers into a single object. For example, how does one serialise

    struct ring_buffer {
        T buffer[BSIZE];
        T *head, *tail;
    };

    where 'head' and 'tail' both point to an element of 'buffer'? This may
    be just because there's no documentation yet -- I had a look at the API
    and did not immediately see how this could be done.

    As for the design itself, I got stuck on an issue in the first example.
    You seem to require a dummy object in order to pass pointers to its
    members to the set-up functions, and a comment purports to explain why
    this is needed but I just don't see it. The comment includes a
    statement that is wrong (if I understand it) that a pointer to the first
    member of a structure may not equal (when suitably converted) a pointer
    to the structure object itself. Even if this were true, I don't see why
    that means you can't just pass the offset of the field. At the least,
    more explanation is required.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Dec 13, 2012
    #4
  5. Ulf Åström

    Ulf Åström Guest

    On Dec 14, 12:11 am, Ben Bacarisse <> wrote:
    > Ulf Åström <> writes:
    > > I'm wondering if there is anything crucial I have missed. It can deal
    > > with most primitive C types, strings, arrays and almost any
    > > arrangement of linked structures.

    >
    > The most striking omissions are support for bool, wide strings and
    > complex types.  This is not a criticism, just an observation.


    Ok! I will have a look at this. I don't use them much myself so I
    haven't thought of them, but they should be supported.

    > The other area that I could not understand is how to serialise multiple
    > pointers into a single object.  For example, how does one serialise
    >
    >   struct ring_buffer {
    >      T buffer[BSIZE];
    >      T *head, *tail;
    >   };
    >
    > where 'head' and 'tail' both point to an element of 'buffer'?  This may
    > be just because there's no documentation yet -- I had a look at the API
    > and did not immediately see how this could be done.


    Here is a quick example (assuming int for T):

    struct ring_buffer rbuf;
    tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
    field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
    field->repeat = BSIZE;
    ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
    ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);

    The 1s to ser_new_field indicate that it is a pointer.

    This exposed a bug, however; the int pointers will only be repointed
    correctly (during inflation) if they point to rbuf.buffer[0]; other
    addresses would be duplicated in memory. This is because it will only
    check for pointers that exactly match a field offset. It should also
    check all elements of an array.
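
    The fix will probably look something like this (the field layout
    names here are hypothetical, not the current API):

    /* check whether ptr falls anywhere inside the array field,
       rather than only at its first element */
    char *base = (char *) obj + field->offset;
    size_t span = field->size * field->repeat;
    char *p = (char *) ptr;

    if (p >= base && p < base + span
        && (size_t) (p - base) % field->size == 0)
    {
        size_t elem = (p - base) / field->size;
        /* record the reference as (field, element) instead of
           duplicating the pointed-to memory */
    }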

    > As for the design itself, I got stuck on an issue in the first example.
    > You seem to require a dummy object in order to pass pointers to its
    > members to the set-up functions, and a comment purports to explain why
    > this is needed but I just don't see it.  The comment includes a
    > statement that is wrong (if I understand it) that a pointer to the first
    > member of a structure may not equal (when suitably converted) a pointer
    > to the structure object itself.  Even if this were true, I don't see why
    > that means you can't just pass the offset of the field.  At the least,
    > more explanation is required.


    It's a typo; it should say "pos must be within st".

    Using a dummy isn't strictly necessary, you may as well pass zero and
    an offset. I designed it this way so people won't hardcode offset
    values, then change their structures (or start porting their program
    to a different architecture!) but forget to update the translator.
    There doesn't seem to be any other reliable way to get the offset of a
    member (offsetof() with null cast is undefined behaviour, according to
    Wikipedia). Anyway, the dummy is only needed when setting up the
    translator. Perhaps I could change it to only take an offset but
    suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).

    Good suggestions, thanks a lot.

    /Ulf
     
    Ulf Åström, Dec 13, 2012
    #5
  6. Ulf Åström <> writes:

    > On Dec 14, 12:11 am, Ben Bacarisse <> wrote:

    <snip>
    >> The other area that I could not understand is how to serialise multiple
    >> pointers into a single object.  For example, how does one serialise
    >>
    >>   struct ring_buffer {
    >>      T buffer[BSIZE];
    >>      T *head, *tail;
    >>   };
    >>
    >> where 'head' and 'tail' both point to an element of 'buffer'?  This may
    >> be just because there's no documentation yet -- I had a look at the API
    >> and did not immediately see how this could be done.

    >
    > Here is a quick example (assuming int for T):
    >
    > struct ring_buffer rbuf;
    > tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
    > field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
    > field->repeat = BSIZE;
    > ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
    > ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);
    >
    > The 1s to ser_new_field indicate that it is a pointer.


    What about pointers to pointers?

    > This exposed a bug, however; the int pointers will only be repointed
    > correctly (during inflation) if they point to rbuf.buffer[0]; other
    > addresses would be duplicated in memory. This is because it will only
    > check for pointers that exactly match a field offset. It should also
    > check all elements of an array.


    Thanks. I think I see what's happening. As I understand it, this
    suggests there must be a "root" structure for all of the data, but maybe
    what you are doing is more subtle than that.

    As an example, imagine a binary tree of nodes and a separate linked list
    of node pointers. Can I serialise this? Must I serialise the tree
    first?

    >> As for the design itself, I got stuck on an issue in the first example.
    >> You seem to require a dummy object in order to pass pointers to its
    >> members to the set-up functions, and a comment purports to explain why
    >> this is needed but I just don't see it.  The comment includes a
    >> statement that is wrong (if I understand it) that a pointer to the first
    >> member of a structure may not equal (when suitably converted) a pointer
    >> to the structure object itself.  Even if this were true, I don't see why
    >> that means you can't just pass the offset of the field.  At the least,
    >> more explanation is required.

    >
    > It's a typo; it should say "pos must be within st".


    I think that's a reference to the API documentation ("Note: pos must be
    within pos"). I was talking about the comment in example1.c:

    "In this example &thing_dummy and &thing_dummy.a will most probably be
    the same location. &thing_dummy.b on the other hand will typically end
    up 2 to 8 bytes from the base pointer (depending on the computer
    architecture). For this reason we do not pass the offset as a number
    but rely on the compiler to tell us exactly where the members will
    be."

    In your example, &thing_dummy.a and &thing_dummy must *always* be the
    same location so the phrase "will most probably be" just looks odd.

    That aside, the following bit is what threw me: "For this reason we do
    not pass the offset as a number". Nothing previous to this seems to
    explain why you pass two pointers rather than an offset.

    > Using a dummy isn't strictly necessary, you may as well pass zero and
    > an offset. I designed it this way so people won't hardcode offset
    > values, then change their structures (or start porting their program
    > to a different architecture!) but forget to update the translator.


    That is a reason but not, to my mind, a strong one. To complicate the
    API because it's possible to so grossly abuse a simpler one seems to be
    contrary to the spirit of C. What's more, if you are serious about
    this, then you should not permit sizeof(myType) in the earlier call,
    since a programmer might hard code the size. Instead, they should be
    obliged to pass two pointers: &thing_dummy and &thing_dummy + 1. If you
    accept a simple size in one case, permit a simple offset in the other.

    > There doesn't seem to be any other reliable way to get the offset of a
    > member (offsetof() with null cast is undefined behaviour, according to
    > Wikipedia). Anyway, the dummy is only needed when setting up the
    > translator. Perhaps I could change it to only take an offset but
    > suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).


    What's wrong with offsetof? I don't follow what you mean about a "null
    cast" being undefined.

    BTW, writing &rbuf.head - &rbuf won't work -- the pointers are of the
    wrong type. You'd need to cast to char * but that's so messy. I think
    offsetof is the way to go.
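
    That is, reusing your rbuf object:

    #include <stddef.h>

    size_t off1 = offsetof(struct ring_buffer, head);
    size_t off2 = (size_t) ((char *) &rbuf.head - (char *) &rbuf);

    Both yield the same offset, but the second needs the casts (and an
    object to point into); offsetof needs neither.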

    --
    Ben.
     
    Ben Bacarisse, Dec 14, 2012
    #6
  7. Ian Collins

    Ian Collins Guest

    Ulf Åström wrote:
    > On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    >> Ulf Åström <> writes:
    >>
    >>> The user needs to set up translators (composed of one or more fields)
    >>> for each type they wish to serialize. Internally it works by
    >>> flattening each type into a list of void pointers and replacing them
    >>> with numeric indices. The output format has a simple "<tag> <value>;"
    >>> format; it is similar to JSON but it is not a reimplementation of it.

    >>
    >> Why not just use JSON? It would make the flattened files accessible by
    >> other tools.

    >
    > I'm considering it, but I haven't read the JSON specs in detail yet.


    There isn't a great deal to read...

    > At a glance they seem to be mostly compatible but I don't know how
    > well it would mesh with my memory offset <-> value layout or how I
    > would perform pointer substitution. I haven't decided if I want to
    > pursue a fully compliant JSON implementation, or take the format in
    > another direction entirely.


    Why make things complex? JSON is an ideal candidate for representing
    structure and array types. It is after all designed as an object notation.

    > Also, there is always the question: Why not use C#, Java, or anything
    > else that has these things built in? Hm.


    Other languages have better support for manipulating JSON objects, but
    at least one of them (PHP) uses a C library under the hood.

    --
    Ian Collins
     
    Ian Collins, Dec 14, 2012
    #7
  8. On Thu, 13 Dec 2012 15:57:27 -0800, Ulf Åström wrote:
    >
    >Using a dummy isn't strictly necessary, you may as well pass zero and
    >an offset. I designed it this way so people won't hardcode offset
    >values, then change their structures (or start porting their program
    >to a different architecture!) but forget to update the translator.
    >There doesn't seem to be any other reliable way to get the offset of a
    >member (offsetof() with null cast is undefined behaviour, according to
    >Wikipedia). Anyway, the dummy is only needed when setting up the
    >translator. Perhaps I could change it to only take an offset but
    >suggest using a dummy to calculate it, e.g. (&rbuf.head - &rbuf).


    You could also set up the interface like this:

    ser_field_t * ser_new_field_impl(ser_tra_t * tra, const char * type,
                                     const int ref, const char * tag,
                                     const int offset);

    #define ser_new_field(tra, type, ref, tag, src_type, field) \
        ser_new_field_impl(tra, type, ref, tag, offsetof(src_type, field))
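
    A call would then look like:

    ser_new_field(tra, "int", 1, "head", struct ring_buffer, head);

    which expands to:

    ser_new_field_impl(tra, "int", 1, "head",
                       offsetof(struct ring_buffer, head));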

    Also note that offsetof must be defined by the compiler/implementation,
    which can use whatever magic gets the job done, including the trick
    of dereferencing a null pointer.
    If you do the same, you incur undefined behaviour, but the compiler
    itself is above the law in this respect.
    That makes offsetof() the easiest and most reliable way to obtain the
    offset of a member of a structure; it is in the standard precisely
    because you need compiler magic to get the offset without incurring UB
    or the overhead of a dummy object.

    >
    >/Ulf


    Bart v Ingen Schenau
     
    Bart van Ingen Schenau, Dec 14, 2012
    #8
  9. BGB

    BGB Guest

    On 12/14/2012 12:38 AM, Ian Collins wrote:
    > Ulf Åström wrote:
    >> On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    >>> Ulf Åström <> writes:
    >>>
    >>>> The user needs to set up translators (composed of one or more fields)
    >>>> for each type they wish to serialize. Internally it works by
    >>>> flattening each type into a list of void pointers and replacing them
    >>>> with numeric indices. The output format has a simple "<tag> <value>;"
    >>>> format; it is similar to JSON but it is not a reimplementation of it.
    >>>
    >>> Why not just use JSON? It would make the flattened files accessible by
    >>> other tools.

    >>
    >> I'm considering it, but I haven't read the JSON specs in detail yet.

    >
    > There isn't a great deal to read...
    >
    >> At a glance they seem to be mostly compatible but I don't know how
    >> well it would mesh with my memory offset <-> value layout or how I
    >> would perform pointer substitution. I haven't decided if I want to
    >> pursue a fully compliant JSON implementation, or take the format in
    >> another direction entirely.

    >
    > Why make things complex? JSON is an ideal candidate for representing
    > structure and array types. It is after all designed as an object notation.
    >


    yep, JSON makes sense.

    one possible downside though is that it doesn't normally identify object
    types, so some mechanism may be needed to identify what sort of
    struct is being serialized, and/or the type of array, ...

    it is either that, or deal with data in a dynamically-typed manner,
    rather than directly mapping it to raw C structs.

    JSON is generally better IMO for serializing dynamically-typed data,
    than for doing data-binding against structs.

    another minor problem with JSON is that, in its pure form, it has no
    good way to deal with cyclic data (where a referenced object may refer
    back to a prior object), but an extended form could allow this.



    in my case, one mechanism I have serializes a wider range of data into
    a binary format, but only deals with dynamically type-tagged data
    (which basically means that it was allocated via my GC API, with a
    type-name supplied), and also requires that the typedefs have any
    relevant annotations (it depends on data gathered by using a tool to
    parse the headers in order to work correctly).

    it may try to use a special serialization handler if one is registered,
    but will otherwise fall back to using key/value serialization of the
    structs.


    the basic format (for each member) is:

        <typeindex:VLI> <size:VLI> <data:byte[size]>

    where 'VLI' is a special encoding for variable-length integers.
    in my case:

        00-7F          0-127
        80-BF XX       128-16383
        C0-DF XX XX    16384-2097151
        ....

    a slight VLI variant (SVLI) encodes signed values by folding the sign
    into the LSB, so the values follow the pattern:
    0, -1, 1, -2, 2, ...
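
    a rough sketch of an encoder for this scheme (simplified; it assumes
    the value bits are stored big-endian within the multi-byte forms, an
    arithmetic right shift for signed types, and leaves out the longer
    forms):

    int vli_encode(unsigned char *buf, unsigned long v)
    {
        if(v<=127)
            { buf[0]=(unsigned char)v; return(1); }
        if(v<=16383)
        {
            buf[0]=0x80|(unsigned char)(v>>8);
            buf[1]=(unsigned char)(v&0xFF);
            return(2);
        }
        if(v<=2097151)
        {
            buf[0]=0xC0|(unsigned char)(v>>16);
            buf[1]=(unsigned char)((v>>8)&0xFF);
            buf[2]=(unsigned char)(v&0xFF);
            return(3);
        }
        return(0); /* longer forms elided */
    }

    /* SVLI: fold the sign into the LSB (zigzag), so
       0, -1, 1, -2, 2, ... map onto 0, 1, 2, 3, 4, ... */
    unsigned long svli_fold(long v)
        { return(((unsigned long)v<<1)^(unsigned long)(v>>(sizeof(long)*8-1))); }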


    the format was basically just a flat-array of serialized members, and
    these members were linked via index (and allowed both backwards and
    forwards references). it has a limited form of value-identity preservation.

    the typeindex members are basically indices into this array, giving the
    elements which give the type-name (which, as a simplifying assumption,
    are assumed to be ASCII strings).

    index 0 is reserved for NULL, and is not encoded. index 1 is the first
    encoded index, and serves as the "root member" (basically, the "thing
    that the program asked the serializer to serialize").

    as a later addition, if the typeindex is 0, then the member is a comment
    (and does not appear in the member array). a comment member immediately
    at the start of the file is used to indicate the "type" of the file
    (basically, it is a "magic string"), which is then followed by the root
    member.


    a partial drawback was that the format doesn't have any good way to
    indicate "how" the data is encoded, making it potentially more subject
    to versioning issues (consider, for example, if a structure-layout
    changes, ...). I have tried to develop self-describing serialization
    formats in the past, but trying to make a format fully self-describing
    tends to make working with it unreasonably complex. (the basic idea
    here would be that not only would the format identify the types in
    use, but it would also encode information to describe all of the data
    encodings used by the format, down to the level of some number of
    "atomic" types, ...).


    however, my Script-VM's bytecode serialization format is based on the
    above mechanism (the bytecode is actually just the result of serializing
    the output from compiling a source-module).



    some amount of stuff also uses an S-Expression based notation:
    999 //integer number
    3.14159 //real number
    "text" //string
    name //symbol (identifier, used to identify something)
    :name //keyword (special type of literal identifier)
    name: //field or item name
    ....
    ( values ) //list of items (composed of "cons cells")
    #( values ) //array of items (dynamically typed)
    { key: value ... } //object (dynamically-typed)
    #A<sig> ( ... ) //array (statically-typed)
    #X<name> { key: value ... } //struct
    #L<name> { key: value ... } //instance of a class
    ....
    #idx# //object index (declaration or reference)
    #z //null
    #u //undefined
    #t //true
    #f //false
    ....


    so, for example, a struct like:
    typedef dytname("foo_t") as_variant //(magic annotations, 1)
    struct Foo_s Foo;

    struct Foo_s {
        Foo *next;
        char *name;
        int x;
        float y;
        double z[16];
    };


    1: these annotations are no-op macros in a normal C compiler (and mostly
    expand to special attributes used by the header-processing tool).

    "dytname()" basically gives the type-name that will be used when
    allocating instances of this struct-type (it is used to key the
    type-name back to the struct).

    "as_variant" is basically a hint for how it should be handled by my
    scripting language. this modifier asserts that the type should be
    treated as a self-defined VM type (potentially opaque), rather than be
    treated as a boxed-struct or as a boxed "pointer to a struct" (more
    literally mapping the struct and/or struct pointer to the scripting
    language, causing script code to see it more like how C would see it).


    with the structs being created like:

    Foo *obj;
    obj=gctalloc("foo_t", sizeof(Foo)); //allocate object with type-tag
    obj->name=dystrdup("foo_instance_13"); //make a new tagged string
    ....


    might be serialized as:
    #0# = #X<foo_t> { next: #1# name: "foo_instance_13" x: 99 y: 4.9 z:
    #A<d> ( 2.0 3.0 ... ) }
    #1# = #X<foo_t> { ... }

    where this format works, but isn't really, exactly, pretty...



    I also have a network protocol I call "BSXRP", which basically works
    very similarly to the above (same data model, ...), just using Huffman
    coding, predictive context modeling of the data, and "clever" ways of
    VLC-coding which data-values are sent. (compression is favorable to
    that of S-Expressions + Deflate, typically being around 25% of the
    size, whereas Deflate by itself was reducing the S-Expressions to
    around 10% of their original size, or IOW: around 2.5% of the size of
    the textual serialization). (basically, if similar repeating
    structures are sent, prior structures may be used as "templates" for
    sending later structures, allowing them to be encoded in fewer bits,
    essentially working sort of like a dynamically-built schema).

    as-before, it has special cases to allow encoding cyclic data, but the
    protocol does not generally preserve "value-identity" (value-identity or
    data-identity is its own hairy set of issues, and in my case I leave
    matters of identity to higher-level protocols).

    some tweaks to the format also allow it to give modest compression
    improvements over Deflate when used for delivering lots of short
    plaintext or binary data messages (it essentially includes a
    Deflate64-like compressor as a sub-mode, but addresses some "weak
    areas" of Deflate).

    a minor drawback though is that the context models can eat up a lot of
    memory (the memory costs are considerably higher than those of Deflate).

    (it was originally written partly as a "proof of concept", but is,
    technically, pretty much overkill).


    >> Also, there is always the question: Why not use C#, Java, or anything
    >> else that has these things built in? Hm.

    >
    > Other languages have better support for manipulating JSON objects, but
    > at least one of them (PHP) uses a C library under the hood.
    >


    yeah...

    I use variants of both JSON and S-Expressions, but mostly for
    dynamically typed data.

    not depending on the use of type-tags and data mined from headers would
    require a rather different implementation strategy.

    most of my code is C, but I make fairly extensive use of dynamic-type
    tagging.


    so, yeah, all this isn't really a "general purpose" set of solutions for
    the data-serialization process.

    I suspect, though, that there may not actually be any sort of
    entirely "general purpose" solution to this problem...

    and, as-is, using my implementation would probably require dragging
    around roughly 400 kloc of code, and it is very likely that many
    people would object to needing a special memory manager and
    code-processing tools to be able to use these facilities...


    or, IOW:
    if you allocate the data with "malloc()" or via a raw "mmap()" or
    similar, a lot of my code will have no idea what it is looking at (yes,
    a lame limitation, I know).

    granted, the scripting language can partly work around it:
    if you don't use the "as_variant" modifier, the struct will map literally,
    and ironically, this allows script-code to still use "malloc()" for
    these types.

    however, the data serialization code generally isn't this clever...


    or such...
     
    BGB, Dec 14, 2012
    #9
  10. Ian Collins

    Ian Collins Guest

    BGB wrote:
    > On 12/14/2012 12:38 AM, Ian Collins wrote:
    >>
    >> Other languages have better support for manipulating JSON objects, but
    >> at least one of them (PHP) uses a C library under the hood.

    >
    > yeah...
    >
    > I use variants of both JSON and S-Expressions, but mostly for
    > dynamically typed data.
    >
    > not depending on the use of type-tags and data mined from headers would
    > require a rather different implementation strategy.
    >
    > most of my code is C, but I make fairly extensive use of dynamic-type
    > tagging.


    I originally wrote a JSON library to enable my (C++) server side web
    application to interact with client side JavaScript. I soon found the
    objects extremely useful for building dynamically typed objects in
    general programming. I doubt the same would be true in C, at least not
    elegantly.

    --
    Ian Collins
     
    Ian Collins, Dec 14, 2012
    #10
  11. Ulf Åström

    Ulf Åström Guest

    On Dec 14, 10:47 am, BGB <> wrote:
    > On 12/14/2012 12:38 AM, Ian Collins wrote:
    > > Ulf Åström wrote:
    > >> On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    > >>> Ulf Åström <> writes:

    >
    > >>>> The user needs to set up translators (composed of one or more fields)
    > >>>> for each type they wish to serialize. Internally it works by
    > >>>> flattening each type into a list of void pointers and replacing them
    > >>>> with numeric indices. The output format has a simple "<tag> <value>;"
    > >>>> format; it is similar to JSON but it is not a reimplementation of it.

    >
    > >>> Why not just use JSON?  It would make the flattened files accessible by
    > >>> other tools.

    >
    > >> At a glance they seem to be mostly compatible but I don't know how
    > >> well it would mesh with my memory offset <-> value layout or how I
    > >> would perform pointer substitution. I haven't decided if I want to
    > >> pursue a fully compliant JSON implementation, or take the format in
    > >> another direction entirely.

    >
    > > Why make things complex?  JSON is an ideal candidate for representing
    > > structure and array types.  It is after all designed as an object notation.

    >
    > yep, JSON makes sense.
    >
    > one possible downside though is that it doesn't normally identify object
    > types, so some mechanism may be needed to identify what sort of
    > struct is being serialized, and/or the type of array, ...
    >
    > it is either that, or deal with data in a dynamically-typed manner,
    > rather than directly mapping it to raw C structs.


    I can see a couple of problems with JSON. It is really just an
    associative array and doesn't specify the type of objects, so I would
    either need to guess (from the fields provided) or add a wrapper of
    some kind telling what each element is. It also seems JSON has no
    built-in way to do object references.

    I can imagine something like

    {
        "some_struct #1" : { "property" : "value" },
        "another_struct #2" : { "reference" : "#1" }
    }

    where another_struct.reference would be defined as a some_struct
    pointer, so it seems I can get around this. I'm not sure how useful
    the end result would really be, though; even if it is technically
    valid as JSON data and other applications can parse it, they still
    wouldn't know what to make of it.

    Another (minor) problem is that JSON doesn't allow comments, while my
    format does.

    /Ulf
     
    Ulf Åström, Dec 14, 2012
    #11
  12. Ian Collins <> writes:

    > Ulf Åström wrote:
    >> On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    >>> Ulf Åström <> writes:
    >>>
    >>>> The user needs to set up translators (composed of one or more fields)
    >>>> for each type they wish to serialize. Internally it works by
    >>>> flattening each type into a list of void pointers and replacing them
    >>>> with numeric indices. The output format has a simple "<tag> <value>;"
    >>>> format; it is similar to JSON but it is not a reimplementation of it.
    >>>
    >>> Why not just use JSON? It would make the flattened files accessible by
    >>> other tools.

    >>
    >> I'm considering it, but I haven't read the JSON specs in detail yet.

    >
    > There isn't a great deal to read...
    >
    >> At a glance they seem to be mostly compatible but I don't know how
    >> well it would mesh with my memory offset <-> value layout or how I
    >> would perform pointer substitution. I haven't decided if I want to
    >> pursue a fully compliant JSON implementation, or take the format in
    >> another direction entirely.

    >
    > Why make things complex? JSON is an ideal candidate for representing
    > structure and array types. It is after all designed as an object
    > notation.


    I don't think JSON is a good fit for the OP's task. There are lots of C
    data structures that have no obvious JSON representation at all.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Dec 14, 2012
    #12
  13. Ulf Åström

    Ulf Åström Guest

    On Dec 14, 3:53 am, Ben Bacarisse <> wrote:
    > Ulf Åström <> writes:
    > > On Dec 14, 12:11 am, Ben Bacarisse <> wrote:

    > <snip>
    > >> The other area that I could not understand is how to serialise multiple
    > >> pointers into a single object. For example, how does one serialise

    >
    > >> struct ring_buffer {
    > >> T buffer[BSIZE];
    > >> T *head, *tail;
    > >> };

    >
    > >> where 'head' and 'tail' both point to an element of 'buffer'? This may
    > >> be just because there's no documentation yet -- I had a look at the API
    > >> and did not immediately see how this could be done.

    >
    > > Here is a quick example (assuming int for T):

    >
    > > struct ring_buffer rbuf;
    > > tra = ser_new_tra("ring_buffer", sizeof(struct ring_buffer), NULL);
    > > field = ser_new_field(tra, "int", 0, "buffer", &rbuf, &rbuf.buffer);
    > > field->repeat = BSIZE;
    > > ser_new_field(tra, "int", 1, "head", &rbuf, &rbuf.head);
    > > ser_new_field(tra, "int", 1, "tail", &rbuf, &rbuf.tail);

    >
    > > The 1s to ser_new_field indicate that it is a pointer.

    >
    > What about pointers to pointers?


    It can be done; have a look at test_single_ptr() in demo.c (I still
    need to write up a clean example for this). It requires an extra
    translator for "pointer to type", so other translators can reference
    the pointer type itself. In this case it also needs the _int wrapper
    (with a single int at offset 0) since I haven't found a satisfactory
    way yet to translate primitives on their own.
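
    Roughly like this (the member name is made up; see demo.c for the
    real thing):

    struct _int { int value; };

    struct _int dummy;
    tra = ser_new_tra("_int", sizeof(struct _int), NULL);
    ser_new_field(tra, "int", 0, "value", &dummy, &dummy.value);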

    > > This exposed a bug, however; the int pointers will only be repointed
    > > correctly (during inflation) if they point to rbuf.buffer[0]; other
    > > addresses would be duplicated in memory. This is because it will only
    > > check for pointers that exactly match a field offset. It should also
    > > check all elements of an array.

    >
    > Thanks. I think I see what's happening. As I understand it, this
    > suggests there must be a "root" structure for all of the data, but maybe
    > what you are doing it more subtle than that.
    >
    > As an example, imagine a binary tree of nodes and a separate linked list
    > of node pointers. Can I serialise this? Must I serialise the tree
    > first?


    This is just the kind of problem I'm hoping to solve.

    It needs translators for the tree nodes and the list nodes. It is
    true that it only works on hierarchical data. The only way with the
    current API would be to make a wrapper struct (and a translator for
    it; see the sketch below) that points to both the first tree and list
    nodes; this should be passed to ser_ialize(). I can imagine using a
    variadic function to pass an arbitrary number of type/pointer pairs
    and omit the wrapper, but another problem would arise at the other
    end, since ser_parse() will only return a single pointer.
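
    For example (all the names here are made up):

    struct save_root {
        struct tree_node *tree; /* root of the binary tree */
        struct list_node *list; /* head of the linked list */
    };

    struct save_root root = { tree_root, list_head };
    tra = ser_new_tra("save_root", sizeof(struct save_root), NULL);
    ser_new_field(tra, "tree_node", 1, "tree", &root, &root.tree);
    ser_new_field(tra, "list_node", 1, "list", &root, &root.list);

    and then &root is what goes to ser_ialize().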

    > >> As for the design itself, I got stuck on an issue in the first example.
    > >> You seem to require a dummy object in order to pass pointers to its
    > >> members to the set-up functions, and a comment purports to explain why
    > >> this is needed but I just don't see it. The comment includes a
    > >> statement that is wrong (if I understand it) that a pointer to the first
    > >> member of a structure may not equal (when suitably converted) a pointer
    > >> to the structure object itself. Even if this were true, I don't see why
    > >> that means you can't just pass the offset of the field. At the least,
    > >> more explanation is required.

    >
    > > It's a typo; it should say "pos must be within st".

    >
    > I think that's a reference to the API documentation ("Note: pos must be
    > within pos"). I was talking about the comment in example1.c:
    >
    > "In this example &thing_dummy and &thing_dummy.a will most probably be
    > the same location. &thing_dummy.b on the other hand will typically end
    > up 2 to 8 bytes from the base pointer (depending on the computer
    > architecture). For this reason we do not pass the offset as a number
    > but rely on the compiler to tell us exactly where the members will
    > be."
    >
    > In your example, &thing_dummy.a and &thing_dummy must *always* be the
    > same location so the phrase "will most probably be" just looks odd.


    True; I was probably overthinking things and imagining a larger
    structure.

    > That aside, the following bit is what threw me: "For this reason we do
    > not pass the offset as a number". Nothing previous to this seems to
    > explain why you pass two pointers rather than an offset.
    >
    > > Using a dummy isn't strictly necessary, you may as well pass zero and
    > > an offset. I designed it this way so people won't hardcode offset
    > > values, then change their structures (or start porting their program
    > > to a different architecture!) but forget to update the translator.

    >
    > That is a reason but not, to my mind, a strong one. To complicate the
    > API because it's possible to so grossly abuse a simpler one seems to be
    > contrary to the spirit of C. What's more, if you are serious about
    > this, then you should not permit sizeof(myType) in the earlier call,
    > since a programmer might hard code the size. Instead, they should be
    > obliged to pass two pointers: &thing_dummy and &thing_dummy + 1. If you
    > accept a simple size in one case, permit a simple offset in the other.


    I agree; I should trust the programmer to know what they are doing. I
    will be changing it to take only an offset.

    > --
    > Ben.


    /Ulf
     
    Ulf Åström, Dec 14, 2012
    #13
  14. Rui Maciel

    Rui Maciel Guest

    Ulf Åström wrote:

    > I'm wondering if there is anything crucial I have missed. It can deal
    > with most primitive C types, strings, arrays and almost any
    > arrangement of linked structures. There are still a few pieces
    > missing; for example I would like to have a clean interface to let the
    > user provide custom read/write functions for complex data types. I
    > will also work on the type safety, so structures can't be linked
    > incorrectly by erroneous or malicious input.


    I believe that's a poor approach to a common problem which also has a very
    common and simple solution: for the output, a single custom
    operator/function that takes in a data type/struct and calls fprintf() to
    dump the relevant text following a specific data format, and for the input,
    a custom parser. That's it. No need to complicate things.
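
    For example, a trivial writer/reader pair for a concrete type (the
    data format shown is only illustrative):

    #include <stdio.h>

    struct point { int x, y; };

    void point_write(FILE *out, const struct point *p)
    {
        fprintf(out, "point { x %d; y %d; }\n", p->x, p->y);
    }

    /* the matching parser validates the input as it reads it */
    int point_read(FILE *in, struct point *p)
    {
        if (fscanf(in, " point { x %d ; y %d ; }", &p->x, &p->y) != 2)
            return -1; /* malformed or missing data */
        return 0;
    }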

    The "let's blindly dump everything" approach is fundamentally broken,
    because it actually doesn't describe the information. Instead, it dumps the
    data as how it was stored in a particular data structure and omits any
    semantics associated with it. This causes significant problems, because it
    doesn't perform any integrity checks on the information and essentially
    eliminates the possibility of adding the necessary sanity checks
    to the parsing process, other than testing if a particular primitive data
    type is actually valid. Therefore, dumping data makes it impossible to test
    if the data which has been imported actually represents valid information.
    And this is a bad thing.

    The snags you've stumbled on are a natural consequence of trying to
    dump data instead of exporting information, and they happen to be
    trivial to fix with the parser/document dump combo approach. Complex
    data types are never a problem because the programmer is responsible
    for writing the necessary I/O routines, and all the necessary sanity
    checks are automatically performed, not only with regards to type
    safety but also in the form of semantic checks, including the ability
    to infer whether a problem is fatal or recoverable.


    Rui Maciel
     
    Rui Maciel, Dec 14, 2012
    #14
  15. Rui Maciel

    Rui Maciel Guest

    Ben Bacarisse wrote:

    > I don't think JSON is a good fit for the OP's task. There are lots of C
    > data structures that have no obvious JSON representation at all.


    What data structures do you have in mind?


    Rui Maciel
     
    Rui Maciel, Dec 14, 2012
    #15
  16. Rui Maciel <> writes:

    > Ben Bacarisse wrote:
    >
    >> I don't think JSON is a good fit for the OP's task. There are lots of C
    >> data structures that have no obvious JSON representation at all.

    >
    > What data structures do you have in mind?


    My ring_buffer example is one. In fact, any structure where pointers are
    "shared" would seem to be a bad fit with JSON.

    Maybe "no obvious JSON representation" is too strong because you can
    probably always map pointers to some sort of index or string label, but
    it's not a natural fit.

    --
    Ben.
     
    Ben Bacarisse, Dec 14, 2012
    #16
  17. BGB

    BGB Guest

    On 12/14/2012 4:23 AM, Ian Collins wrote:
    > BGB wrote:
    >> On 12/14/2012 12:38 AM, Ian Collins wrote:
    >>>
    >>> Other languages have better support for manipulating JSON objects, but
    >>> at least one of them (PHP) uses a C library under the hood.

    >>
    >> yeah...
    >>
    >> I use variants of both JSON and S-Expressions, but mostly for
    >> dynamically typed data.
    >>
    >> not depending on the use of type-tags and data mined from headers would
    >> require a rather different implementation strategy.
    >>
    >> most of my code is C, but I make fairly extensive use of dynamic-type
    >> tagging.

    >
    > I originally wrote a JSON library to enable my (C++) server side web
    > application to interact with client side JavaScript. I soon found the
    > objects extremely useful for building dynamically typed objects in
    > general programming. I doubt the same would be true in C, at least not
    > elegantly.
    >



    I actually use a fair amount of dynamically-typed stuff in C.

    another big area has to do with my scripting language (largely itself
    based on JavaScript and ActionScript).


    the main downsides have to do with performance, namely that
    dynamically-typed code just doesn't perform anywhere near as well as
    statically-typed code.

    in many areas, this isn't a big issue, and/or leads to a hybrid
    strategy, with some occasional type-checking, but much of the code
    remains statically typed (for example, much of my 3D renderer largely
    works this way).

    typically, this would be using dynamic type-checking more for things
    like classifying "what type of object is this?" and dispatching more to
    the appropriate part of the renderer (or, IOW: type-checking on the
    level of structures). a few places are "optimized" though, mostly to try
    to avoid using dynamically-typed operations in tight-loops (such as
    scene-graph object lookup operations, ...).


    side note (main project):
    http://www.youtube.com/user/BGBTech

    http://www.youtube.com/watch?v=hhdq87CXL0U

    (note, a lot of parts of the video are from much earlier versions of my
    3D engine...).


    (although the 3D engine has some amount of code written in my
    script-language, I generally don't use script-code within the 3D
    renderer, as the languages' performance is a bit weak, and the 3D
    renderer is a lot more performance-sensitive).

    sadly, performance is kind of a problem area sometimes, but much of
    this has more to do with getting everything fed through OpenGL.


    in my case, as-noted, the type-tagging is largely transparent to C code,
    apart from the use of special allocation and freeing functions, and
    supplying type-names, ...

    these type-names can be retrieved via API calls, like:
    BGBDY_API char *dygettype(dyt val);

    or checked with functions like:
    BGBDY_API int dyTypeP(dyt val, char *ty);


    many operations on dynamically typed values may check types using these
    mechanisms, but many internal operations essentially bypass this
    (usually, because if/else chains of "strcmp()" calls can get expensive...).


    elsewhere, for many core operations rather than fetching the type-name
    from the object, the code will instead fetch a "type vtable":

    BGBDY_API dyt dyCopyValue(dyt p)
    {
        BGBDY_ObjVTab *ty;
        if(dyCheckValueNoCopyP(p))
            return(p);
        ty=BGBDY_GetObjVTabFast(p);
        if(ty && ty->copyValue)
            return(ty->copyValue(p));
        return(p);
    }

    BGBDY_API int dyDropValue(dyt p)
    {
        BGBDY_ObjVTab *ty;
        if(dyCheckValueNoCopyP(p))
            return(0);
        ty=BGBDY_GetObjVTabFast(p);
        if(ty && ty->dropValue)
            return(ty->dropValue(p));
        return(0);
    }

    where BGBDY_GetObjVTabFast basically calls off to code for some
    (slightly long/ugly) logic for fetching the object-base, fetching the
    "ObjType", and returning its "vtable" member.



    in a few places (due mostly to performance issues), some of this has
    led to frequent use of things like type-signatures and untagged union
    types:

    #ifndef DYTVALUE_T
    #define DYTVALUE_T
    typedef union
    {
        s32 i;          //int
        u32 ui;         //uint
        s64 l;          //long
        u64 ul;         //ulong
        f32 f;          //float
        f64 d;          //double
        nlint a;        //address
        void *pv;       //raw pointer
        dyt r;          //variant reference
        dytf cr;        //variant reference (double)
        dycClass cls;   //class
        dycObject obj;  //object
        dycArray arr;   //array
        struct { int x, y; }k; //lexical env index
    }dycValue;
    #endif


    the main reason being that these allow higher performance.

    for the untagged union, it is because one can access values like:
    dycValue a, b, c;
    ....
    c.i=a.i+b.i;

    and, apparently, the compiler is smart enough to figure this out at
    least (and generate vaguely efficient code).

    the reason for type-signatures is that unlike full dynamic typing,
    signatures are more readily usable with the "build stuff with function
    pointers" strategy.

    say, for example:
    void DYLL_CopyValueBufI(dycValue *v, void *p)
        { *(s32 *)p=v->i; }
    void DYLL_CopyValueBufL(dycValue *v, void *p)
        { *(s64 *)p=v->l; }
    ....

    typedef struct {
        void (*CopyValueBuf)(dycValue *v, void *p);
        void (*CopyBufValue)(void *p, dycValue *v);
        ....
        int offs;
    }DYLL_ValueTransfer;

    typedef struct DYLL_ArgList_s DYLL_ArgList; //forward declaration, so
                                                //the function pointers
                                                //below can name the type
    struct DYLL_ArgList_s {
        DYLL_ValueTransfer *ret;
        DYLL_ValueTransfer *args;
        int nargs;
        void (*CopyValuesBuf)(DYLL_ArgList *ctx, dycValue *args, byte *buf);
        void (*CopyBufValues)(DYLL_ArgList *ctx, byte *buf, dycValue *args);
        ....
    };

    ....

    void DYLL_CopyValuesBuf3(DYLL_ArgList *ctx, dycValue *args, byte *buf)
    {
        ctx->args[0].CopyValueBuf(args+0, buf+ctx->args[0].offs);
        ctx->args[1].CopyValueBuf(args+1, buf+ctx->args[1].offs);
        ctx->args[2].CopyValueBuf(args+2, buf+ctx->args[2].offs);
    }

    BGBDY_API void DYLL_CopyValuesBuf(DYLL_ArgList *ctx,
                                      dycValue *args, byte *buf)
        { ctx->CopyValuesBuf(ctx, args, buf); }

    where the "DYLL_ArgList" structure may be built via a function which
    parses the signature.

    ....

    which, although arguably not exactly efficient, could always be a
    good deal worse...

    (there is a good deal more of this sort of logic for working with
    "va_list" argument lists as well, typically for indirect calls
    originating within C land; a lot of this stuff has to deal with calls
    between natively compiled and interpreted code).

    generally the "buffer" is a packed buffer-like representation of an
    argument list, and is generally what is used by the ASM parts of the
    call-handling machinery (which may in-turn repack these buffer into
    ABI-specific representations, and then initiate the call).

    partly, this is because it isn't really desirable to have to deal
    with the various possible argument-list representations and ASM code
    at the same time.


    in some places, the function-pointer funkiness serves to bypass older
    machinery which often involves lots of sometimes long if/else chains.


    a downside though is that a lot of this is, admittedly, a bit more
    painful to work with, and the code can get very long/winding and ugly
    (so, the spread of this sort of code in many places within the VM core
    has both good and bad points).

    however, a lot of these sorts of mechanisms end up somewhat abstracted,
    and so are generally unseen by "user code" (luckily, it all looks much
    nicer from the outside).


    or such...
     
    BGB, Dec 14, 2012
    #17
  18. BGB

    BGB Guest

    On 12/14/2012 5:51 AM, Ulf Åström wrote:
    > On Dec 14, 10:47 am, BGB <> wrote:
    >> On 12/14/2012 12:38 AM, Ian Collins wrote:
    >>> Ulf Åström wrote:
    >>>> On Dec 13, 10:57 pm, Keith Thompson <> wrote:
    >>>>> Ulf Åström <> writes:

    >>
    >>>>>> The user needs to set up translators (composed of one or more fields)
    >>>>>> for each type they wish to serialize. Internally it works by
    >>>>>> flattening each type into a list of void pointers and replacing them
    >>>>>> with numeric indices. The output format has a simple "<tag> <value>;"
    >>>>>> format; it is similar to JSON but it is not a reimplementation of it.

    >>
    >>>>> Why not just use JSON? It would make the flattened files accessible by
    >>>>> other tools.

    >>
    >>>> At a glance they seem to be mostly compatible but I don't know how
    >>>> well it would mesh with my memory offset <-> value layout or how I
    >>>> would perform pointer substitution. I haven't decided if I want to
    >>>> pursue a fully compliant JSON implementation, or take the format in
    >>>> another direction entirely.

    >>
    >>> Why make things complex? JSON is an ideal candidate for representing
    >>> structure and array types. It is after all designed as an object notation.

    >>
    >> yep, JSON makes sense.
    >>
    >> one possible downside though is that it doesn't normally identify object
    >> types, so some mechanism may be needed to identify what sort of
    >> struct is being serialized, and/or the type of array, ...
    >>
    >> it is either that, or deal with data in a dynamically-typed manner,
    >> rather than directly mapping it to raw C structs.

    >
    > I can see a couple of problems with JSON. It is really just an
    > associative array and doesn't specify the type of objects, so I would
    > either need to guess (from the fields provided) or add a wrapper of
    > some kind telling what each element is. It also seems JSON has no
    > built-in way to do object references.
    >


    actually, it can do both; it is just maybe not entirely obvious.
    it helps to imagine how JavaScript itself works; JSON is more or less
    a direct mapping of the corresponding JavaScript objects.


    > I can imagine something like
    >
    > {
    > "some_struct #1" : { "property" : "value" },
    > "another_struct #2" : { "reference" : "#1" }
    > }
    >
    > where another_struct.reference would be defined as a some_struct
    > pointer, so it seems I can get around this. I'm not sure how useful
    > the end result would really be, though; even if it is technically
    > valid as JSON data and other applications can parse it, they still
    > wouldn't know what to make of it.
    >
    > Another (minor) problem is that JSON doesn't allow comments, while my
    > format does.
    >


    this is all, not exactly right...


    however, how it actually works gets a little less trivial to explain.

    handling it more "effectively" would involve essentially throwing
    together dynamic type facilities on top of C.

    a way of imagining this is if you have some sort of opaque type, which
    can hold a value of any type (no, it is *not* a string).

    a popular solution (though not the one I am using) is basically to
    treat the low bits of a pointer as a "magic tag", for example, using
    the low 3 bits:
        0 = raw pointer
        1/5 = integer (30-bit)
        2/6 = float (30-bit)
        3, 4 = reserved
        7 = misc values (grabs another 5 bits as an expanded tag, with a
            24-bit value)

    (my system is based instead on "magic address ranges" for some types,
    and encoding the object type in the memory-manager's header, for other
    types).
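
    for illustration, hypothetical wrap/check/unwrap helpers for the
    low-bit scheme above (not my actual API; assumes 64-bit pointers so
    the shifted payload fits):

    #include <stdint.h>

    #define TAG_MASK   7u
    #define TAG_FIXINT 1u

    typedef void *dytv; /* stand-in tagged-value type */

    dytv wrapint(int v)
        { return((dytv)((((uintptr_t)(uint32_t)v)<<3)|TAG_FIXINT)); }
    int fixintp(dytv p)
        { return(((uintptr_t)p&TAG_MASK)==TAG_FIXINT); }
    int unwrapint(dytv p)
        { return((int)((uintptr_t)p>>3)); }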


    you might have functions, then, to wrap and unwrap values, say (using my
    API as an example):
    dyt dyint(s64 v); //wrap integer (or "long")
    s64 dyintv(dyt v); //unwrap integer (or "long")
    dyt dylong(s64 v); //wrap "long"
    s64 dylongv(dyt v); //unwrap "long"
    dyt dyfloat(float v); //wrap float
    float dyfloatv(dyt v); //unwrap float
    dyt dydouble(double v); //wrap double
    double dydoublev(dyt v); //unwrap double
    ....

    as well, as functions to perform "type-checks":
    int dyintp(dyt v); //returns non-zero if value is an integer
    int dyfloatp(dyt v);
    ....


    and, so then, a person can write logic like:

    if(dyintp(v))
    {
        i=dyintv(v);
    }else if(dyfloatp(v))
    {
        f=dyfloatv(v);
    }else if(dystringp(v))
    {
        ...
    }else if(dyObjectP(v))
    {
        ...
    }else ...


    as for parsing JSON, it will generally involve looking at the syntax for
    every token to identify what it is.

    part of this is done in the "lexer"/"tokenizer", with the idea being
    that the logic for splitting apart the input text into tokens, will
    identify what sort of token it is looking at: numbers vs strings vs
    braces vs operators vs ...

    the parser logic will then build a representation of the data via the
    tokens it sees.

    a common strategy for hand-written parsers is "recursive descent", where
    basically the code looks at the current tokens and sees what to do, and
    then may call itself recursively in order to parse sub-items:
    if(tokenty==TOK_INTEGER)
    {
        value=dyint(atoi(token));
    }else if(tokenty==TOK_REAL)
    {
        value=dydouble(atof(token));
    }else if(!strcmp(token, "{") && (tokenty==TOK_BRACE))
    {
        ... parse object ...
    }else if(!strcmp(token, "[") && (tokenty==TOK_BRACE))
    {
        ... parse array ...
    }else ...


    a possible representation for a "naive" object could be something like:

    struct Object_s
    {
        char **vars;
        dyt *vals;
        int n_vars;
        int m_vars;
    };

    with vars holding a list of member names, vals holding the values,
    n_vars holding the number of members in the object, and m_vars holding
    the maximum number of members (how many can be added before the arrays
    would need to be reallocated).
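
    a member lookup over this representation is then just a linear scan,
    say:

    #include <string.h>

    dyt Object_GetVar(struct Object_s *obj, char *name)
    {
        int i;
        for(i=0; i<obj->n_vars; i++)
            if(!strcmp(obj->vars[i], name))
                return(obj->vals[i]);
        return(NULL); /* not found; a real version might return an
                         "undefined" value instead */
    }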


    there are potentially more efficient ways to handle all this, but this
    is a "general idea".
     
    BGB, Dec 14, 2012
    #18
  19. BGB

    BGB Guest

    On 12/14/2012 9:22 AM, Ben Bacarisse wrote:
    > Rui Maciel <> writes:
    >
    >> Ben Bacarisse wrote:
    >>
    >>> I don't think JSON is a good fit for the OP's task. There are lots of C
    >>> data structures that have no obvious JSON representation at all.

    >>
    >> What data structures do you have in mind?

    >
    > My ring_buffer example is one. In fact, any structure where pointers are
    > "shared" would seem to be a bad fit with JSON.
    >
    > Maybe "no obvious JSON representation" is too strong because you can
    > probably always map pointers to some sort of index or string label, but
    > it's not a natural fit.
    >


    yeah, these are general problems with data serialization, namely that
    there is no good way to deal with "everything" within any reasonable
    level of complexity.

    it is easier, then, to find a reasonable subset of things that can be
    serialized via a given mechanism, and use that.


    more careful design of a serialization interface would be needed, for
    example, to deal with cyclic data structures, and further to preserve
    the identity of serialized objects (1).

    1: if obja==objb, will this be true or false after the data is saved,
    and re-loaded?...


    some options will not preserve any identity, so if the same object is
    reached via multiple paths, each path may produce a separate object
    when the image is read back in later.

    other options may in turn create false identity, where pretty much any
    object which "looks the same" gets the same identity, regardless of
    whether or not it was the same object prior to serialization.

    identity becomes a particularly ugly issue if the same object is saved
    across multiple images, or if the network gets involved (currently, I
    just don't deal with this problem). it matters much more when trying to
    serialize things like program state, say, where a "running program
    image" may be saved to disk and reloaded later; lots of little "subtle
    details" will need to be preserved, or else things will often go
    "critically wrong" when the program state is reloaded.


    or, at least, short of much simpler but very ugly solutions, like doing
    a raw memory-dump of the program's heap (I have done this before, but it
    "doesn't really scale well"). more practical solutions to general
    program-heap serialization are overly complicated, so it is easier just
    to deal with smaller-scale serialization and reloading (such as
    "serialize this particular data structure").

    a basic idea for how the heap-dump serialization works is:
    imagine, if you will, that on startup the application uses a
    memory-mapped file as the main VM heap, and uses special magic to
    preserve any "roots" (generally global variables). on exiting and
    reloading, any objects in this heap can be preserved, with the "roots"
    automatically restored to their prior values.
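
    a minimal sketch of the idea (POSIX mmap; the base address and size are
    made up, and MAP_FIXED at a hard-coded address is fragile under ASLR,
    so this is an illustration rather than production code):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define HEAP_BASE ((void *)0x40000000)
    #define HEAP_SIZE (16 * 1024 * 1024)

    void *open_persistent_heap(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if(fd < 0) return NULL;
        ftruncate(fd, HEAP_SIZE);  /* size the backing file */

        /* map the file at the same address every run, so raw pointers
         * stored inside the heap remain valid across exit/reload */
        void *p = mmap(HEAP_BASE, HEAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);                 /* the mapping survives the close */
        if(p == MAP_FAILED) return NULL;

        /* by convention, the first words of the region would hold the
         * "roots" (former globals), restored on reload */
        return p;
    }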

    yeah, long ago, one of my prior VMs actually did this, although the heap
    wasn't a single giant file, but rather a bunch of several-MB "heap chunks".

    I later generally abandoned this strategy, as it had some particularly
    nasty technical issues (every exit/reload cycle leaked memory that could
    not be reclaimed, nasty "bit rot" problems, structure-layout problems, ...).

    it is IMO better to actually just try to "serialize the contents of the
    data", rather than just "stupidly dump the memory to the HDD and try to
    reload it later".



    but, in general, dealing with these sorts of issues requires more
    careful attention to detail, and sadly, this isn't really something JSON
    does very well.

    not that this makes JSON useless though.


    more generally though, I just use plain files for data storage, as they
    just work so much better and more reliably.


    one possible (partial) solution for cyclic structures, though, is an
    "extended and non-standard" JSON variant, say, where every object is
    indexed:

    #1 = { name: "foo", link: #2 };
    #2 = { name: "bar", link: #3 };
    #3 = { name: "baz", link: #4, first: #1 };
    ....
    { name: "someobject", listHead: #1 }

    but, dealing effectively with a more comprehensive type-system would
    require extending the syntax somewhat further, say, adding some
    mechanism to identify up-front the type of object being serialized, ...
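
    the writer's side of such a scheme needs a pointer-to-index table:
    assign each distinct pointer an index the first time it is seen, so
    shared or cyclic links can be emitted as "#n" references. a sketch
    (hypothetical helper; a linear scan for brevity, where a hash table
    would scale better):

    #include <stdlib.h>

    struct ptr_index
    {
        void **ptrs;
        int n, m;
    };

    /* returns a 1-based index ("#1", "#2", ...), registering p on first sight */
    int index_of(struct ptr_index *t, void *p)
    {
        for(int i = 0; i < t->n; i++)
            if(t->ptrs[i] == p) return i + 1;
        if(t->n >= t->m)
        {
            t->m = t->m ? (t->m * 2) : 16;
            t->ptrs = realloc(t->ptrs, t->m * sizeof(void *));
        }
        t->ptrs[t->n] = p;
        return ++t->n;
    }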


    this is partly what my "somewhat hacked-up S-Expressions" do, even if
    they are a little out of place (long ago, I had a Scheme implementation,
    and even now, vestiges of it still remain, floating around in my
    codebase...).


    or such...
     
    BGB, Dec 14, 2012
    #19
  20. BGB

    BGB Guest

    On 12/14/2012 8:25 AM, Rui Maciel wrote:
    > Ulf Åström wrote:
    >
    >> I'm wondering if there is anything crucial I have missed. It can deal
    >> with most primitive C types, strings, arrays and almost any
    >> arrangement of linked structures. There are still a few pieces
    >> missing; for example I would like to have a clean interface to let the
    >> user provide custom read/write functions for complex data types. I
    >> will also work on the type safety, so structures can't be linked
    >> incorrectly by erroneous or malicious input.

    >
    > I believe that's a poor approach to a common problem which also has a very
    > common and simple solution: for the output, a single custom
    > operator/function that takes in a data type/struct and calls fprintf() to
    > dump the relevant text following a specific data format, and for the input,
    > a custom parser. That's it. No need to complicate things.
    >
    > The "let's blindly dump everything" approach is fundamentally broken,
    > because it actually doesn't describe the information. Instead, it dumps the
    > data as how it was stored in a particular data structure and omits any
    > semantics associated with it. This causes significant problems, because it
    > doesn't perform any integrity checks on the information and essentially
    > eliminates the possibility of adding the necessary sanity checks
    > to the parsing process, other than testing if a particular primitive data
    > type is actually valid. Therefore, dumping data makes it impossible to test
    > if the data which has been imported actually represents valid information.
    > And this is a bad thing.
    >
    > The snags you've stumbled on are a natural consequence of trying to dump
    > data instead of exporting information, which happen to be trivial to fix
    > with the parser/document dump combo approach. Complex data types are never
    > a problem because the programmer is responsible to write the necessary I/O
    > routines, and all the necessary sanity checks are automatically performed,
    > not only with regards to type safety but also in the form of semantics
    > checks, including the ability to infer if a problem is fatal or recoverable.
    >


    makes sense IMO...


    personally, I far more often use specialized file-formats, rather than
    dumping data structures, and if serialization is used, it is usually
    limited to certain sets of dedicated "safe to serialize" data-types,
    which in my case are typically "lists" and "dynamically-typed objects".

    most types provide dedicated handlers for serializing themselves and
    reading back in their serialized forms, often using a small "version"
    tag as a way to indicate which fields are present in the data (this
    itself has flaws, but basically works); sometimes the handlers instead
    use individual tags to mark the various items within the serialized
    structure.

    types which don't bother with version-tags are typically things like
    integer values or strings, which are basically "too simple to justify a
    version tag".


    actually, my frequent use of sequential read/write functions is a major
    part of why, very often, I end up using variable-length integer
    representations.

    for example, the vast majority of integer values are fairly small
    (basically forming a bell-curve centered around 0), so that, on
    average, small integers only require 1 byte, and most others 2 or 3
    bytes, independent of the physical storage type of the integer (32 or
    64 bits).

    like, although a variable-length integer isn't very good for
    random-access data, it is pretty good for data that is read/written
    sequentially.
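
    as a sketch, an LEB128-style encoding with a zigzag mapping (so small
    magnitudes of either sign stay short; names are hypothetical, and EOF
    handling is omitted for brevity):

    #include <stdint.h>
    #include <stdio.h>

    /* zigzag maps 0,-1,1,-2,2,... to 0,1,2,3,4,..., so small negative
     * values also encode in few bytes */
    static uint64_t zigzag(int64_t v)    { return ((uint64_t)v << 1) ^ (uint64_t)(v >> 63); }
    static int64_t  unzigzag(uint64_t u) { return (int64_t)(u >> 1) ^ -(int64_t)(u & 1); }

    void write_varint(FILE *f, int64_t v)
    {
        uint64_t u = zigzag(v);
        do
        {
            int b = u & 0x7F;
            u >>= 7;
            fputc(u ? (b | 0x80) : b, f);  /* high bit = more bytes follow */
        } while(u);
    }

    int64_t read_varint(FILE *f)
    {
        uint64_t u = 0;
        int b, shift = 0;
        do
        {
            b = fgetc(f);
            u |= (uint64_t)(b & 0x7F) << shift;
            shift += 7;
        } while(b & 0x80);
        return unzigzag(u);
    }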
     
    BGB, Dec 14, 2012
    #20
