RFC, an ugly parser hack (and a bin-xml variant)


cr88192

for various reasons, I added an imo ugly hack to my xml parser.
basically, I wanted the ability to have binary payloads within the xml
parse trees.

this was partly because I came up with a binary xml format (mentioned more
later), and thought it would be "useful" to be able to store binary data
inline with this format, and still wanted to keep things balanced (whatever
the binary version can do, the textual version can do as well).

the approach involved, well, a bastardized subset of xml-data.
the attribute 'dt:dt' now has a special meaning (along with the rest of the
'dt' namespace prefix), and the contents of such nodes are parsed specially
(though still within xml's syntactic rules, eg, as a normal xml text glob).

(an alternative would have been to leave the trees and textual parser
as-is and do this only in the binary-xml reader/writer, but I opted to
hack the text version since, otherwise, there is no real point to this
feature anyways...).

<foo dt:dt="binary.base64">
adz6A7dG9TaB41H7D6G5KSt3
</foo>
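
on the parser side, this mostly amounts to base64-decoding the text glob
whenever dt:dt="binary.base64" shows up. eg, a minimal sketch of the decode
step (names here are mine and not the actual parser's):

#include <stdlib.h>
#include <string.h>

/* map a base64 character to its 6-bit value, -1 for '=', whitespace,
   and anything else (the payload is a normal xml text glob, so it may
   contain newlines and indentation that have to be skipped) */
static int b64val(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

unsigned char *base64_decode(char *s, int *rlen)
{
    unsigned char *buf;
    int acc, nbits, n, v;

    buf = malloc(strlen(s) * 3 / 4 + 4);
    acc = nbits = n = 0;

    for (; *s; s++) {
        if ((v = b64val(*s)) < 0) continue;
        acc = (acc << 6) | v; nbits += 6;
        if (nbits >= 8) { nbits -= 8; buf[n++] = (acc >> nbits) & 0xFF; }
    }
    *rlen = n;
    return buf;
}

the decoded buffer then gets swapped in for the node's text contents,
giving a binary payload node in the parse tree.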

I wonder if anyone has suggestions for a better approach?

(the risk here being that one might want to use the 'dt' prefix for
something else, and this is an actual parser-hack rather than a more generic
semantics hack...).


other comments:
for the most part, my xml stuff is being used for "offline" uses, though
avoiding my stuff blowing up when dealing with generic xml would be
preferable (this happens occasionally, making me go and patch up whatever in
the parser/other code).

no schemas or similar are used, really, anywhere. pretty much everything
tends to be hard-coded c code, which is expected to deal with whatever.
likewise, many of the formats are ad-hoc and can change randomly (most
things use an "ignore what is not understood" policy).

my parser only really implements a subset of xml anyways (dtd's, custom
entities, .... are ignored; only the built-in entities are handled). it
does, however, have basic namespace support and handles things like
'CDATA'; it skips comments, ....


about the binary format (for anyone that cares):

well, I presently have no real intent on trying to push it as any form of
standard or trying to compete with anything.

everyone can do as they will imo; I just want to use the format for things
for which the textual variety is not that well suited, eg, being intermixed
with other binary data (as a particular example, storing the object
files+bytecode for a newer interpreter of mine; other possible uses may
include persistent data stores and similar, or storage of geometric/3d
data, for which I have typically used other formats, ...).
for some of these things, xml is used as an internal representation,
sometimes flattening to some other external representation (eg:
line-oriented text or a binary format).

it is, well, significantly faster than my textual parser, largely because
of the dramatic reduction in memory allocation. this is partly because, as
a matter of the format's operation, most strings are merged. likewise, it
is a bit smaller (around 20-30% of the original size in my testing), which
is a bit worse than what I can get from "real" compressors, but this is no
big loss.

and, also, the code is smaller (though, after adding features like
integers, lz+markov text compression, cdata, ... it is around 800 loc, vs.
about 1700 for the text parser). initially it was about 300 loc, but
features cost some.


it uses a byte-based structure vaguely similar to wbxml.
note that, however, most of the structure/codespace is built dynamically,
and the codespace is divided in a uniform manner. with most of the codespace
going to the text mru (127 values), the tag mru (63 values), and the
namespace mru (31 values), the remaining values encode literals (to be added
to the mru's, located at the end of each mru), or special predefined codes
(the low 32 values). attribute names also have a mru, but the codespace
overlaps the one for tags (since a tag and an attribute don't occur in the
same context).
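
roughly, the encoder side of one mru works something like this (a sketch,
not the actual code; the move-to-front update policy here is just one way
to do it, and the decoder has to mirror whichever policy is used exactly,
or the two fall out of sync):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TEXT_MRU_MAX 127   /* codes 0x80..0xFE; 0xFF escapes to a literal */

static char *text_mru[TEXT_MRU_MAX];
static int text_mru_n = 0;

void emit_text(FILE *out, char *s)
{
    char *t;
    int i;

    for (i = 0; i < text_mru_n; i++) {
        if (!strcmp(text_mru[i], s)) {
            fputc(0x80 + i, out);    /* hit: a single byte */

            /* move-to-front, so hot strings keep small indices */
            t = text_mru[i];
            memmove(text_mru + 1, text_mru, i * sizeof(char *));
            text_mru[0] = t;
            return;
        }
    }

    fputc(0xFF, out);                /* miss: inline literal... */
    fputs(s, out); fputc(0, out);    /* ...as a nul-terminated string */

    if (text_mru_n == TEXT_MRU_MAX)  /* full: evict the coldest entry */
        free(text_mru[--text_mru_n]);
    memmove(text_mru + 1, text_mru, text_mru_n * sizeof(char *));
    text_mru[0] = strdup(s);
    text_mru_n++;
}

the tag and namespace mru's work the same way, just with smaller code
ranges (0x40+i and 0x20+i respectively).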

also, at present, tags include no means to skip out on attribute or content
end markers, as basic testing (using statistical methods to eliminate the
markers) did not show "good enough" results to justify having it (nor do I
have a good place to put the bits, eg, without somewhat reducing the mru
size).
also, only ascii/utf-8 is supported.
no string tables are used; everything is inline.

for the most part, it is using linear mru caching of strings, and it also
does not have lengths for most things (with the exception being binary
chunks). as a result, random access is not possible in its present form.

the lz+markov encoding used for text is aimed more for speed than decent
compression (lz77 would likely do better in terms of compression, but would
lead to slower encoding). however, hex dumps show that it does "good enough"
at hacking away at text strings... a single window is used between all the
strings (eg: for the purpose of eliminating common strings).
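
the lz+markov scheme itself is not spelled out here, but the shared-window
idea can be illustrated with a plain greedy lz77-style matcher in its place
(which is the thing I contrasted it against above; the names and the byte
encoding are made up for the example):

#include <stdio.h>
#include <string.h>

#define WIN_SIZE 65536
static unsigned char win[WIN_SIZE];
static int win_n = 0;

/* greedy search for the longest window match; returns the length
   (0 if below a minimum of 3) and stores the offset in *roff */
static int find_match(unsigned char *s, int len, int *roff)
{
    int i, j, best = 0;

    for (i = 0; i < win_n; i++) {
        for (j = 0; j < len && i + j < win_n && win[i + j] == s[j]; j++);
        if (j > best) { best = j; *roff = i; }
    }
    if (best > 255) best = 255;    /* keep the toy length field one byte */
    return (best >= 3) ? best : 0;
}

void emit_string(FILE *out, unsigned char *s, int len)
{
    int off, mlen;

    while (len > 0) {
        mlen = find_match(s, len, &off);
        if (mlen) {      /* match: tag byte, 16-bit offset, length */
            fputc(1, out);
            fputc(off >> 8, out); fputc(off & 0xFF, out);
            fputc(mlen, out);
        } else {         /* literal byte */
            fputc(0, out); fputc(*s, out);
            mlen = 1;
        }

        /* everything lands in the one shared window, which is what
           lets common strings across the whole file be eliminated */
        if (win_n + mlen <= WIN_SIZE) {
            memcpy(win + win_n, s, mlen);
            win_n += mlen;
        }
        s += mlen; len -= mlen;
    }
}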

(compression could likely be improved, eg, by use of huffman coding, but
this is neither cheap nor simple, and is thus likely not worth it).

probably good enough for my uses anyways...

0x00: general purpose ending marker
0x01..0x1F: Special, Reserved
0x20..0x3E: Namespace Prefix MRU
0x3F: Namespace String
0x40..0x7E: Opening Tag/Attr MRU
0x7F: Opening Tag/Attr String
0x80..0xFE: Text MRU
0xFF: Text String

Node: [<NS>] <TAG> <ATTR*> 0 <BODY*> 0
Attr: [<NS>] <TAG> <TEXT*>
Body: <NODE>|<TEXT>
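
to make the grammar concrete, a decoder skeleton might look about like this
(hypothetical code following my reading of the byte ranges above; the
special 0x10..0x13 codes listed below are left out for brevity):

#include <stdio.h>

/* stand-ins for the rest of the decode logic (made-up names) */
extern void decode_ns  (FILE *in, int code);
extern void decode_tag (FILE *in, int code);
extern void decode_attr(FILE *in, int code);
extern void decode_text(FILE *in, int code);

void decode_node(FILE *in)
{
    int c;

    c = getc(in);
    if (c >= 0x20 && c <= 0x3F) {      /* optional namespace prefix */
        decode_ns(in, c);
        c = getc(in);
    }
    decode_tag(in, c);                 /* 0x40..0x7F */

    while ((c = getc(in)) != 0x00)     /* <ATTR*> 0 */
        decode_attr(in, c);

    while ((c = getc(in)) != 0x00) {   /* <BODY*> 0 */
        if (c >= 0x80) {
            decode_text(in, c);        /* text body item */
        } else {
            ungetc(c, in);             /* nested node */
            decode_node(in);
        }
    }
}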

0x10, Integer (VLI)
0x11, LZ+Markov Text String

0x12, CDATA:
0x12 <TEXT*> 0
0x13, Binary Data:
0x13 [<NS>] <TAG> <ATTR*> 0 <UVLI len> <BYTES[len]>

UVLI: basically the same as the variable-byte ints in WBXML;
VLI: a UVLI, but with the sign put in the LSB.
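
eg, the two int encodings come out to something like this (a sketch; the
exact sign convention for VLI is my own reading of the above):

#include <stdio.h>

/* uvli: 7 bits per byte, most significant group first, high bit set
   on every byte except the last (same scheme as wbxml's mb_u_int32) */
void emit_uvli(FILE *out, unsigned int v)
{
    unsigned char tmp[5];
    int n = 0;

    do { tmp[n++] = v & 0x7F; v >>= 7; } while (v);
    while (n > 1) fputc(tmp[--n] | 0x80, out);  /* continuation bytes */
    fputc(tmp[0], out);                         /* final byte */
}

/* vli: magnitude shifted left one, sign in the lsb */
void emit_vli(FILE *out, int v)
{
    if (v < 0) emit_uvli(out, ((unsigned int)(-v) << 1) | 1);
    else       emit_uvli(out, (unsigned int)v << 1);
}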

(skipping on the rest...).
 

cr88192

cr88192 said:
it is, well, significantly faster than my textual parser, largely because
of the dramatic reduction in memory allocation. this is partly because, as
a matter of the format's operation, most strings are merged. likewise, it
is a bit smaller (around 20-30% of the original size in my testing), which
is a bit worse than what I can get from "real" compressors, but this is no
big loss.
just checked and recalculated the percentages, and realized it was doing
somewhat better than this, eg, around 10% of the original size for some
larger files (around 900kB initial, around 1MB after being spit back out
from my app with different formatting).

binary files are presently about 2x as large as that of the output from gzip
(eg: initial, about 900kB, my format about 100kB, gzip about 40kB).

somehow, I had not taken this into account, remembering my initial results
with smaller xml files (eg: 1.5kB to 400 bytes, ...).

as for huffman compression, if done, it would likely be at least close to,
or maybe exceed, that of gzip. this is difficult to predict though, given
the significant differences in the algos (gzip might win due to its ability
to utilize patterns spanning multiple tags, but might be hurt by its
inability to deal with regular but predictable variations in the pattern).

gzip'ing the binary variant leads to an output of about 30kB, so about 10kB
less than gzip'ing the input file. a specialized compressor may thus have a
chance.

eg: each tag as a huffman code, possibly using an lz77 or markov variant
for the strings (lz77+huffman is the basis of gzip anyways), ...

but, then again, speed may no longer be good. by this point it may have
dropped somewhat below the speed of the normal text printer/parser,
effectively losing part of the gain.

actually, it may yet be slower than deflate, eg, given my tendency to be
lazy and use adaptive huffman coding most of the time (slower but generally
easier to manage than the static varieties used in gzip/deflate). actually,
the varieties I use are more often "quasi-static", eg, they only update
every so often, vs. after every symbol (I can, for example, encode a few kB
and then rebuild the trees, which is faster than a pure-adaptive variant,
but slower than static). as a result, for decoding at least I can still use
an index table (vs. having to resort to decoding the file a single bit at a
time). one then has to tune how often to rebuild the trees/tables
(rebuilding more often hurts speed, but typically helps compression).
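
in code, the quasi-static policy amounts to little more than this (a
sketch; huff_encode/huff_rebuild stand in for a real coder):

#include <stdio.h>

/* stand-ins for a real huffman coder (made-up names) */
extern void huff_encode(int sym);
extern void huff_rebuild(unsigned int *freq);

#define REBUILD 4096    /* the tuning knob mentioned above */

static unsigned int freq[256];
static unsigned int since_rebuild = 0;

void put_symbol(int sym)
{
    huff_encode(sym);    /* encode with the current, slightly stale codes */
    freq[sym]++;

    if (++since_rebuild == REBUILD) {
        /* both encoder and decoder rebuild at the same symbol count,
           so the tables stay in sync with no side information */
        huff_rebuild(freq);
        since_rebuild = 0;
    }
}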


not like it matters probably.
I am just a lame hobbyist...
 

Soren Kuula

cr88192 said:
for various reasons, I added an imo ugly hack to my xml parser.
basically, I wanted the ability to have binary payloads within the xml
parse trees.

this was partly because I came up with a binary xml format (mentioned more
later), and thought it would be "useful" to be able to store binary data
inline with this format, and still wanted to keep things balanced (whatever
the binary version can do, the textual version can do as well).

the approach involved, well, a bastardized subset of xml-data.
the attribute 'dt:dt' now has a special meaning (along with the rest of the
'dt' namespace prefix), and the contents of such nodes are parsed specially
(though still within xml's syntactic rules, eg, as a normal xml text glob).

My comment: Why don't you use the normal namespace mechanism, instead of
magic prefixes?

The parser must store the namespace prefix-->URI bindings it has
encountered so far at some place anyway (if it's a namespace aware
parser). It should then also be possible to modify it to go to binary
mode when entering elements in your special namespace -- and leave it
again when exiting (keep a counter of the nesting depth, increment,
decrement). In binary mode, it will decode text to binary.

Søren
 

cr88192

Soren Kuula said:
My comment: Why don't you use the normal namespace mechanism, instead of
magic prefixes?
well, at the time I had figured it would be an extra hassle.

now, I am thinking it makes more sense anyways; I just need to specify that
a certain namespace is to be used for binary tags, and make the necessary
changes to allow resolving the namespace.
The parser must store the namespace prefix-->URI bindings it has
encountered so far at some place anyway (if it's a namespace aware
parser). It should then also be possible to modify it to go to binary mode
when entering elements in your special namespace -- and leave it again
when exiting (keep a counter of the nesting depth, increment, decrement).
In binary mode, it will decode text to binary.
yes, I could do this.

one hassle though is that the parser, during parsing, does not know the
correct namespaces. namespaces are resolved by stepping up the tree, and,
presently, nodes are not bound into the tree until after they are parsed (as
a result, I can't look up the tree during parsing).

an alteration would be to always pass the parent node to the parse-node
function and to set the 'up' value before parsing sub-expressions, so the
search could be done during parsing, making it possible to resolve
namespaces and avoid needing a magic prefix...
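
eg, with the 'up' pointer bound early, resolving a prefix mid-parse reduces
to a walk up the tree, something like (a sketch, with made-up names; no
handling of the default namespace or overflow checking):

#include <stdio.h>
#include <string.h>

typedef struct XmlNode XmlNode;
struct XmlNode {
    XmlNode *up;    /* parent, now set *before* the body is parsed */
    /* ... tag, attrs, contents, ... */
};

/* stand-in for the real attribute lookup */
extern char *xml_node_get_attr(XmlNode *n, char *name);

char *xml_resolve_ns(XmlNode *node, char *prefix)
{
    char attr[64];
    char *uri;

    sprintf(attr, "xmlns:%s", prefix);
    for (; node; node = node->up) {
        if ((uri = xml_node_get_attr(node, attr)))
            return uri;
    }
    return NULL;    /* prefix not bound anywhere up the tree */
}

the binary-handling check then keys on the resolved uri rather than on the
literal 'dt' prefix.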


my parser is a recursive function (as opposed to a stack or similar).
also, it is damn slow, or at least it is when I throw larger files at it...
I could probably try to make it faster if needed.


also, I will add a comment:
the files are a little smaller, eg, the 900kB xml file now becomes 85kB in
my format, and about 200kB in wbxml (not using any dtd's or similar, which
is probably a bad case for wbxml). this was after a few minor tweaks to the
format (eg: adding a workable means to eliminate many end markers).

otherwise, the wbxml writer may be a bit naive, which could explain the
size. most of the tags and text contents end up as string table references,
which are naturally more expensive than in my format (a few bytes, vs. a
single byte for anything in the mru list).
small files using dtd's are likely to do better (for large files, I doubt
it makes much difference; the costs should be about the same due to the
string table's small contribution to the total size).

also, I don't know if wbxml supports namespaces (I guess it could be done
if the prefix is treated as part of the tag). if this is the case, then it
is probable wbxml could win out with namespace-heavy content.
why do I need a 32 element namespace mru anyways? it is doubtful this many
unique prefixes will ever be used; it just fit well with the pattern I
guess.

the gzip'ed version is still about 40kB.

sizewise, my format is worse than gzip, but not that terribly worse, and
beating gzip wrt size would risk compromising speed...
at least, unlike gzip'ed text, it doesn't require decompressing and then
running my textual parser...
 
