RFC, an ugly parser hack (and a bin-xml variant)

Discussion in 'XML' started by cr88192, Sep 5, 2005.

  1. cr88192

    cr88192 Guest

    for various reasons, I added an imo ugly hack to my xml parser.
    basically, I wanted the ability to have binary payload within the xml parse
    trees.

    this was partly because I came up with a binary xml format (mentioned more
    later), and thought it would be "useful" to be able to store binary data
    inline with this format, and still wanted to keep things balanced (whatever
    the binary version can do, the textual version can do as well).

    the approach involved, well, a bastardized subset of xml-data.
    the attribute 'dt:dt' now has a special meaning (along with the rest of the
    'dt' namespace prefix), and the contents of such nodes are parsed specially
    (though still within xml's syntactic rules, eg, as a normal xml text glob).

    (an alternative would have been leaving the trees and textual-parser as-is,
    but doing this in the binary-xml reader/writer, but I opted to hack the text
    version as, otherwise, there is no real point in this feature anyways...).

    <foo dt:dt="binary.base64">
    adz6A7dG9TaB41H7D6G5KSt3
    </foo>
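eg, the special handling for a `binary.base64` node boils down to decoding the text glob into bytes, roughly like this (a sketch, not my actual parser code; b64_decode is an illustrative name, and it skips whitespace since the payload sits inside a normal xml text node):

```c
#include <string.h>

/* decode a base64 text glob into bytes; returns the byte count, or -1
   on a bad character. whitespace is skipped, '=' padding ends input. */
static int b64_decode(const char *in, unsigned char *out)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    int acc = 0, bits = 0, n = 0;
    for (; *in; in++) {
        const char *p;
        if (*in == '=')
            break;
        if (*in == '\n' || *in == '\r' || *in == ' ' || *in == '\t')
            continue;
        p = strchr(tbl, *in);
        if (!p)
            return -1;
        acc = (acc << 6) | (int)(p - tbl);   /* accumulate 6 bits */
        bits += 6;
        if (bits >= 8) {                     /* emit a full byte */
            bits -= 8;
            out[n++] = (unsigned char)((acc >> bits) & 0xFF);
        }
    }
    return n;
}
```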

    I wonder if anyone has suggestions for a better approach?

    (the risk here being that one might want to use the 'dt' prefix for
    something else, and this is an actual parser-hack rather than a more generic
    semantics hack...).


    other comments:
    for the most part, my xml stuff is being used for "offline" uses, though
    avoiding my stuff blowing up when dealing with generic xml would be
    preferable (this happens occasionally, making me go and patch up whatever in
    the parser/other code).

    no schemas or similar are used, really, anywhere. pretty much everything
    tends to be hard-coded c code, which is expected to deal with whatever.
    likewise, many of the formats are ad-hoc and can change randomly (most
    things use an "ignore what is not understood" policy).

    my parser only really implements a subset of xml anyways (entity
    definitions, dtd's, .... are ignored; only the built-in entities are
    handled). it does, however, have basic namespaces support, and things like
    'CDATA', and it skips comments, ....


    about the binary format (for anyone that cares):

    well, I presently have no real intent on trying to push it as any form of
    standard or trying to compete with anything.

    everyone can do as they will imo, I just want to use the format for things
    for which the textual variety is not that well suited, eg, being intermixed with
    other binary data (as a particular example, for storing the object
    files+bytecode for a newer interpreter of mine, other possible uses may
    include persistent data stores and similar, storage of geometric/3d data,
    for which I have typically used other formats, ...).
    for some of these things, xml is used as an internal representation,
    sometimes flattening to some other external representation (eg:
    line-oriented text or a binary format).

    it is, well, significantly faster than my textual parser, largely because of
    the dramatic reduction in memory allocation. this is partly because, as a
    matter of the format's operation, most strings are merged. likewise, it is a
    bit smaller (around 20-30% the original size in my testing), which is a bit
    worse than what I can get from "real" compressors, but this is no big loss.

    and, also, the code is smaller (though, after adding features like integers,
    lz+markov text compression, cdata, ... it is around 800 loc, vs. about 1700
    for the text parser). initially it was about 300 loc, but, features cost
    some.


    it uses a byte-based structure vaguely similar to wbxml.
    note that, however, most of the structure/codespace is built dynamically,
    and the codespace is divided in a uniform manner. with most of the codespace
    going to the text mru (127 values), the tag mru (63 values), and the
    namespace mru (31 values), the remaining values encode literals (to be added
    to the mru's, located at the end of each mru), or special predefined codes
    (the low 32 values). attribute names also have a mru, but the codespace
    overlaps the one for tags (since a tag and attribute don't occur in the
    same context).

    also, at present, tags include no means to skip out on attribute or content
    end markers, as basic testing (using statistical methods to eliminate the
    markers) did not show "good enough" results to justify having it (nor do I
    have a good place to put the bits, eg, without somewhat reducing the mru
    size).
    also, only ascii/utf-8 is supported.
    no strings tables are used, everything is inline.

    for the most part, it is using linear mru caching of strings, and it also
    does not have lengths for most things (with the exception being binary
    chunks). as a result, random access is not possible in its present form.

    the lz+markov encoding used for text is aimed more for speed than decent
    compression (lz77 would likely do better in terms of compression, but would
    lead to slower encoding). however, hex dumps show that it does "good enough"
    at hacking away at text strings... a single window is used between all the
    strings (eg: for the purpose of eliminating common strings).

    (compression could likely be improved, eg, by use of huffman coding, but,
    this is neither cheap nor simple, and is thus likely not worth it).

    probably good enough for my uses anyways...

    0x00: general purpose ending marker
    0x01..0x1F: Special, Reserved
    0x20..0x3E: Namespace Prefix MRU
    0x3F: Namespace String
    0x40..0x7E: Opening Tag/Attr MRU
    0x7F: Opening Tag/Attr String
    0x80..0xFE: Text MRU
    0xFF: Text String
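    eg, a decoder's first step would be classifying each code byte per the
    table above (a sketch; the enum names are made up):

```c
/* classify one code byte per the codespace layout above */
typedef enum {
    C_END, C_SPECIAL,
    C_NS_MRU, C_NS_STR,
    C_TAG_MRU, C_TAG_STR,   /* shared by tags and attribute names */
    C_TEXT_MRU, C_TEXT_STR
} Code;

static Code classify(unsigned char b)
{
    if (b == 0x00) return C_END;       /* general purpose ending marker */
    if (b <= 0x1F) return C_SPECIAL;   /* special, reserved */
    if (b <= 0x3E) return C_NS_MRU;    /* namespace prefix mru (31) */
    if (b == 0x3F) return C_NS_STR;    /* namespace literal */
    if (b <= 0x7E) return C_TAG_MRU;   /* tag/attr mru (63) */
    if (b == 0x7F) return C_TAG_STR;   /* tag/attr literal */
    if (b <= 0xFE) return C_TEXT_MRU;  /* text mru (127) */
    return C_TEXT_STR;                 /* text literal */
}
```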

    Node: [<NS>] <TAG> <ATTR*> 0 <BODY*> 0
    Attr: [<NS>] <TAG> <TEXT*>
    Body: <NODE>|<TEXT>

    0x10, Integer (VLI)
    0x11, LZ+Markov Text String

    0x12, CDATA:
    0x12 <TEXT*> 0
    0x13, Binary Data:
    0x13 [<NS>] <TAG> <ATTR*> 0 <UVLI len> <BYTES[len]>

    UVLI, basically same as the variable-byte ints in WBXML;
    VLI, UVLI, but with the sign put in the LSB.
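    eg, roughly (a sketch; the function names are illustrative. UVLI here is
    big-endian 7-bit groups with the high bit as a continue flag, like wbxml's
    multi-byte ints, and putting the sign in the LSB is a zigzag-style
    mapping):

```c
#include <stdint.h>

/* UVLI: 7 bits per byte, most significant group first, high bit set on
   all but the final byte. returns the encoded length. */
static int uvli_encode(uint32_t v, unsigned char *buf)
{
    unsigned char tmp[5];
    int n = 0, i, len;
    do { tmp[n++] = v & 0x7F; v >>= 7; } while (v);
    len = n;
    for (i = 0; i < len; i++) {
        unsigned char b = tmp[len - 1 - i];
        buf[i] = (i < len - 1) ? (b | 0x80) : b;
    }
    return len;
}

static uint32_t uvli_decode(const unsigned char *buf, int *len)
{
    uint32_t v = 0;
    int i = 0;
    do { v = (v << 7) | (buf[i] & 0x7F); } while (buf[i++] & 0x80);
    *len = i;
    return v;
}

/* VLI: UVLI with the sign folded into the LSB (zigzag-style) */
static int vli_encode(int32_t v, unsigned char *buf)
{
    uint32_t u = (v < 0) ? (((uint32_t)(-(v + 1)) << 1) | 1)
                         : ((uint32_t)v << 1);
    return uvli_encode(u, buf);
}

static int32_t vli_decode(const unsigned char *buf, int *len)
{
    uint32_t u = uvli_decode(buf, len);
    return (u & 1) ? -(int32_t)(u >> 1) - 1 : (int32_t)(u >> 1);
}
```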

    (skipping on the rest...).
     
    cr88192, Sep 5, 2005
    #1

  2. cr88192

    cr88192 Guest

    correction (Re: RFC, an ugly parser hack (and a bin-xml variant))


    >
    > it is, well, significantly faster than my textual parser, largely because
    > of the dramatic reduction in memory allocation. this is partly because, as
    > a matter of the format's operation, most strings are merged. likewise, it
    > is a bit smaller (around 20-30% the original size in my testing), which is
    > a bit worse than what I can get from "real" compressors, but this is no
    > big loss.
    >

    just checked and recalculated percents, realized it was doing somewhat
    better than this, eg, around 10% original size or so for some larger files
    (around 900kB initial, around 1MB after being spit back out from my app with
    different formatting).

    binary files are presently about 2x as large as that of the output from gzip
    (eg: initial, about 900kB, my format about 100kB, gzip about 40kB).

    somehow, I had not taken this into account, remembering my initial results
    with smaller xml files (eg: 1.5kB to 400 bytes, ...).

    as for huffman compression, if done, it would likely come at least close
    to, or maybe exceed, that of gzip. this is difficult to predict though
    given the significant differences in the algos (gzip might win due to its
    ability to utilize patterns spanning multiple tags, but might be hurt by
    its inability to deal with regular but predictable variations in the
    pattern).

    gzip'ing the binary variant leads to an output of about 30kB, so about 10kB
    less than gzip'ing the input file. a specialized compressor may thus have a
    chance.

    eg, giving each tag a huffman code, possibly using a lz77 or markov variant
    for the strings (lz77+huffman is the basis of gzip anyways), ...

    but, then again, speed may no longer be good. by this point it may have
    dropped somewhat below the speed of the normal text printer/parser,
    effectively losing part of the gain.

    actually, it may yet be slower than deflate, eg, given my tendency to be
    lazy and use adaptive huffman coding most of the time (slower but generally
    easier to manage than the static varieties used in gzip/deflate). actually,
    the varieties I use are more often "quasi-static", eg, they only update
    every so often, vs after every symbol (I can, for example, encode a few kB
    and then rebuild the trees, which is faster than a pure-adaptive variant,
    but slower than static). as a result, for decoding at least I can still use
    an index table (vs. having to resort to decoding the file a single bit at a
    time). one then has to tune how often to rebuild the trees/tables
    (rebuilding more often hurts speed, but typically helps compression).
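    eg, the quasi-static update policy amounts to something like this (a
    sketch; REBUILD_INTERVAL and rebuild_tables are stand-ins, the real
    rebuild would regenerate the huffman trees/index tables from freq[]):

```c
#include <string.h>

#define NSYMS 256
#define REBUILD_INTERVAL 4096   /* assumed; "a few kB" per the text */

typedef struct {
    unsigned freq[NSYMS];       /* symbol frequencies since the start */
    unsigned count;             /* symbols seen since the last rebuild */
} QuasiStatic;

/* stand-in for rebuilding the huffman trees/index tables from freq[] */
static void rebuild_tables(QuasiStatic *qs) { (void)qs; /* ... */ }

/* account for one symbol; returns 1 when a rebuild happened here.
   between rebuilds the codes stay fixed, so the decoder can keep
   using an index table instead of decoding a bit at a time. */
static int qs_update(QuasiStatic *qs, unsigned char sym)
{
    qs->freq[sym]++;
    if (++qs->count >= REBUILD_INTERVAL) {
        rebuild_tables(qs);
        qs->count = 0;
        return 1;
    }
    return 0;
}
```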


    not like it matters probably.
    I am just a lame hobbyist...
     
    cr88192, Sep 5, 2005
    #2

  3. cr88192

    Soren Kuula Guest

    cr88192 wrote:
    > for various reasons, I added an imo ugly hack to my xml parser.
    > basically, I wanted the ability to have binary payload within the xml parse
    > trees.
    >
    > this was partly because I came up with a binary xml format (mentioned more
    > later), and thought it would be "useful" to be able to store binary data
    > inline with this format, and still wanted to keep things balanced (whatever
    > the binary version can do, the textual version can do as well).
    >
    > the approach involved, well, a bastardized subset of xml-data.
    > the attribute 'dt:dt' now has a special meaning (along with the rest of the
    > 'dt' namespace prefix), and the contents of such nodes are parsed specially
    > (though still within xml's syntactic rules, eg, as a normal xml text glob).


    My comment: Why don't you use the normal namespace mechanism, instead of
    magic prefixes?

    The parser must store the namespace prefix-->URI bindings it has
    encountered so far at some place anyway (if it's a namespace aware
    parser). It should then also be possible to modify it to go to binary
    mode when entering elements in your special namespace -- and leave it
    again when exiting (keep a counter of the nesting depth, increment,
    decrement). In binary mode, it will decode text to binary.
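    Roughly like this (a sketch; BIN_NS_URI and the callback names are made
    up):

```c
#include <string.h>

#define BIN_NS_URI "urn:example:binary"   /* assumed namespace URI */

typedef struct { int bin_depth; } ParseState;

/* called on each start tag with the element's resolved namespace URI;
   returns 1 if text should now be decoded as binary. */
static int on_start(ParseState *st, const char *ns_uri)
{
    if (st->bin_depth > 0 ||
        (ns_uri && !strcmp(ns_uri, BIN_NS_URI)))
        st->bin_depth++;                  /* entering / nesting deeper */
    return st->bin_depth > 0;
}

/* called on each end tag; returns 1 while still in binary mode */
static int on_end(ParseState *st)
{
    if (st->bin_depth > 0)
        st->bin_depth--;                  /* leaving when it hits zero */
    return st->bin_depth > 0;
}
```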

    Søren
     
    Soren Kuula, Sep 7, 2005
    #3
  4. cr88192

    cr88192 Guest

    "Soren Kuula" <> wrote in message
    news:9RqTe.66931$...
    > cr88192 wrote:
    >> for various reasons, I added an imo ugly hack to my xml parser.
    >> basically, I wanted the ability to have binary payload within the xml
    >> parse trees.
    >>
    >> this was partly because I came up with a binary xml format (mentioned
    >> more later), and thought it would be "useful" to be able to store binary
    >> data inline with this format, and still wanted to keep things balanced
    >> (whatever the binary version can do, the textual version can do as well).
    >>
    >> the approach involved, well, a bastardized subset of xml-data.
    >> the attribute 'dt:dt' now has a special meaning (along with the rest of
    >> the 'dt' namespace prefix), and the contents of such nodes are parsed
    >> specially (though still within xml's syntactic rules, eg, as a normal xml
    >> text glob).

    >
    > My comment: Why don't you use the normal namespace mechanism, instead of
    > magic prefixes?
    >

    well, at the time I had figured it would be an extra hassle.

    now, I am thinking, it makes more sense anyways, I just need to specify that
    a certain namespace needs to be used for binary tags, and make the necessary
    changes to allow resolving the namespace.

    > The parser must store the namespace prefix-->URI bindings it has
    > encountered so far at some place anyway (if it's a namespace aware
    > parser). It should then also be possible to modify it to go to binary mode
    > when entering elements in your special namespace -- and leave it again
    > when exiting (keep a counter of the nesting depth, increment, decrement).
    > In binary mode, it will decode text to binary.
    >

    yes, I could do this.

    a hassle though is that the parser, during parsing, does not know the
    correct namespaces. namespaces are resolved by stepping up the tree, and,
    presently, nodes are not bound into the tree until after they are parsed (as
    a result, I can't look up the tree during parsing).

    an alteration would be to always pass the parent node to the parse-node
    function and set the 'up' value before parsing sub-expressions, at least,
    so the search could be done during parsing, making it possible to resolve
    namespaces without needing a magic prefix...
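    eg, roughly (a sketch; the Node/Attr layout and resolve_ns are
    illustrative, not my actual structs):

```c
#include <stdio.h>
#include <string.h>

typedef struct Attr {
    const char *name, *value;
    struct Attr *next;
} Attr;

typedef struct Node {
    struct Node *up;        /* set before parsing sub-expressions */
    Attr *attrs;
} Node;

/* walk up the (partially built) tree looking for xmlns:PREFIX="..." */
static const char *resolve_ns(Node *n, const char *prefix)
{
    char want[64];
    snprintf(want, sizeof(want), "xmlns:%s", prefix);
    for (; n; n = n->up) {
        Attr *a;
        for (a = n->attrs; a; a = a->next)
            if (!strcmp(a->name, want))
                return a->value;
    }
    return NULL;            /* unbound prefix */
}
```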


    my parser is a recursive function (as opposed to a stack or similar).
    also, it is damn slow, or at least when I throw larger files at it...
    I could probably try to make it faster if needed.


    also, I will add a comment:
    the files are a little smaller, eg, the 900kB xml file now becomes 85kB in
    my format, and about 200kB in wbxml (not using any dtd's or similar, which
    is probably a bad case for wbxml). this was after a few minor tweaks to the
    format (eg: adding a workable means to eliminate many end markers).

    otherwise, the wbxml writer may be a bit naive, which could account for the
    size. most of the tags and text contents end up as string table references,
    which are naturally more expensive than my format (a few bytes, vs. a single
    byte for anything in the mru list).
    small files using dtd's are likely to do better (for large files, I doubt it
    makes much difference, the costs should be about the same due to the small
    contribution in total size of the strings table).

    also, I don't know if wbxml supports namespaces (I guess it could be done if
    the prefix is treated as part of the tag). if this is the case, then it is
    probable wbxml could win out with namespace-heavy code.
    why do I need a 32 element namespace mru anyways? it is doubtful this many
    unique prefixes will ever be used; it just fit well with the pattern, I
    guess.

    the gzip'ed version is still about 40kB.

    sizewise, my format is worse than gzip, but not that terribly worse. beating
    gzip wrt size would risk compromising speed...
    at least, it doesn't require decoding and using my textual parser...
     
    cr88192, Sep 7, 2005
    #4
