The problem with this concept is that if someone really needs a data-
interchange format which is lean and doesn't need to be human-readable
then that person is better off adopting (or even implementing) a format
which is lean and doesn't need to be human-readable. Once we start by
picking a human-readable format and then mangle it to make it leaner, we
simply abandon the single most important justification (and maybe the
only one) for adopting that specific format.
Adding to that, if we adopt a human-readable format and are then forced
to implement some compression scheme so that we can use it for its
intended purpose, then we are needlessly complicating things and adding
yet another point of failure to our code. After all, if we are forced to
implement a compression scheme just to use our human-readable format for
its intended purpose, then we are basically adopting two different
parsers to handle a single document format. That means we are forced to
adopt/implement two different decoders which must be applied to the same
data stream in succession, and we must do all of that only to be able to
encode/decode and use the information.
Instead, if someone develops a binary format from the start and relies
on a single codec to encode and decode any data described through this
format, then that person not only gets exactly what they need but also
ends up with a lean format which requires a fraction of both the
resources and the code.
well, for compiler ASTs, basically, one needs a tree-structured format,
and human readability is very helpful for debugging the thing (so one
can see more of what is going on inside the compiler).
now, there are many options here.
some compilers use raw structs;
some use S-Expressions;
....
my current compiler internally uses XML (mostly in the front-end),
largely because it tends to be a reasonably flexible way to represent
tree-structured data (more flexible than S-Expressions).
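to make the comparison concrete, here is a tiny AST for the expression
`x + 1` in both styles (the tag and attribute names like `binop` are made
up for illustration, and Python's stdlib ElementTree stands in for the
real node representation):

```python
import xml.etree.ElementTree as ET

# S-expression style: nested lists, where position carries the meaning
sexpr = ["+", "x", 1]  # i.e. (+ x 1)

# XML style: tags plus named attributes; the named attributes are the
# extra flexibility mentioned above (nodes can be annotated with, say,
# source-line info without disturbing the child layout)
node = ET.Element("binop", op="+")
ET.SubElement(node, "var", name="x")
ET.SubElement(node, "int", value="1")
node.set("line", "42")  # annotate after the fact; children don't move

print(ET.tostring(node).decode())
```

in the S-expression form, adding that kind of annotation means either
more positional slots or ad-hoc conventions.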
however, yes, the current implementation does have some memory-footprint
issues, along with the data storage issues (using a DOM-like system eats
memory, and XML notation eats space).
a binary encoding can at least allow storing and decoding the trees more
quickly, while using a little less space; more so, my SBXE decoder is
much simpler than a full XML parser (and SBXE is the defined format for
representing these ASTs).
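to give an idea of why such a decoder can be so much simpler (this is a
purely hypothetical tag+attribute encoding for illustration, NOT the
actual SBXE wire format):

```python
# hypothetical binary tree encoding (not the actual SBXE format): each
# node is OPEN <tag>, zero or more ATTR <key> <value>, its children in
# order, then END; strings are length-prefixed UTF-8
OPEN, ATTR, END = 0x01, 0x02, 0x03

def _put_str(out, s):
    b = s.encode("utf-8")
    out.append(len(b))  # single-byte length: strings under 256 bytes
    out += b

def _get_str(buf, pos):
    n = buf[pos]; pos += 1
    return buf[pos:pos + n].decode("utf-8"), pos + n

def encode(node, out=None):
    out = bytearray() if out is None else out
    tag, attrs, kids = node  # node = (tag, {attr: value}, [children])
    out.append(OPEN); _put_str(out, tag)
    for k, v in attrs.items():
        out.append(ATTR); _put_str(out, k); _put_str(out, v)
    for kid in kids:
        encode(kid, out)
    out.append(END)
    return bytes(out)

def decode(buf, pos=0):
    assert buf[pos] == OPEN; pos += 1
    tag, pos = _get_str(buf, pos)
    attrs, kids = {}, []
    while buf[pos] != END:
        if buf[pos] == ATTR:
            pos += 1
            k, pos = _get_str(buf, pos)
            v, pos = _get_str(buf, pos)
            attrs[k] = v
        else:
            kid, pos = decode(buf, pos)
            kids.append(kid)
    return (tag, attrs, kids), pos + 1

tree = ("binop", {"op": "+"},
        [("var", {"name": "x"}, []), ("int", {"value": "1"}, [])])
assert decode(encode(tree))[0] == tree  # round-trips
```

no angle brackets, entities, or whitespace to scan past; the whole
decoder is a couple dozen lines versus a real XML parser.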
however, in some ways, XML is overkill for compiler ASTs, and possibly a
few features could be eliminated (to reduce memory footprint, creating a
subset):
raw text globs and CDATA;
namespaces;
....
so, the subset would only support tags and attributes.
however, as of yet, I have not adopted such a restrictive subset (text
globs, CDATA, namespaces, ... continue to be supported even if not
really used by the compiler).
even a few extensions are supported, such as "BDATA" globs (basically,
for raw globs of binary data, although if printed textually, BDATA is
written out in hex). but, these are also not used for ASTs.
although, a compromise is possible:
the in-memory nodes could eliminate raw text globs and CDATA, yet still
support them by internally moving the text into an attribute and using
special tags (such as "!TEXT").
or such...
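that compromise could look something like this (again with ElementTree
standing in for the real node representation; the transform just lifts
text globs into an attribute of a special child tag):

```python
import xml.etree.ElementTree as ET

def lift_text(elem):
    # move a raw text glob into a special "!TEXT" child tag, so the
    # in-memory nodes only ever need tags and attributes (tail text
    # and mixed content are ignored in this sketch)
    if elem.text and elem.text.strip():
        t = ET.Element("!TEXT", value=elem.text.strip())
        elem.text = None
        elem.insert(0, t)
    for kid in list(elem):
        lift_text(kid)

tree = ET.fromstring("<doc><note>hello</note></doc>")
lift_text(tree)
# <note> now has no text glob, just an attribute-only child, roughly
# <note><!TEXT value="hello"/></note> -- not legal XML on disk, but
# fine for nodes that only ever live in memory
```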