The problem with this concept is that if someone really needs a data-
interchange format which is lean and doesn't need to be human-readable
then that person is better off adopting (or even implementing) a format
which is lean and doesn't need to be human-readable. Once we start by
picking a human-readable format and then mangle it to make it leaner, we
simply abandon the single most important justification (and maybe the
only one) for adopting that specific format.
Adding to that, if we adopt a human-readable format and are then forced
to implement some compression scheme so that we can use it for its
intended purpose, then we are needlessly complicating things and adding
yet another point of failure to our code. After all, if we are forced to
implement a compression scheme just to use our human-readable format for
its intended purpose, then we are basically adopting two different
parsers to handle a single document format. That means we are forced to
adopt/implement two different decoders which must be applied to the same
data stream in succession, and we must do all of that only to be able to
encode/decode and use the information.
Instead, if someone develops a binary format from the start and relies
on a single codec to encode and decode any data described through this
format, then that person not only gets exactly what they need but also
ends up with a lean format which requires a fraction of both the
resources and the code.
well, for compiler ASTs, basically, one needs a tree-structured format,
and human readability is very helpful for debugging the thing (so one
can see more of what is going on inside the compiler).
now, there are many options here.
some compilers use raw structs;
some use S-Expressions;
....
my current compiler internally uses XML (mostly in the front-end),
largely because it tends to be a reasonably flexible way to represent
tree-structured data (more flexible than S-Expressions).
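to make the comparison concrete, here is a tiny AST for the expression
`x + 1` in both styles (the tag and attribute names like `binop` are made
up for illustration, and Python's stdlib ElementTree stands in for the
real node representation):

```python
import xml.etree.ElementTree as ET

# S-expression style: nested lists, where position carries the meaning
sexpr = ["+", "x", 1]  # i.e. (+ x 1)

# XML style: tags plus named attributes; the named attributes are the
# extra flexibility mentioned above (nodes can be annotated with, say,
# source-line info without disturbing the child layout)
node = ET.Element("binop", op="+")
ET.SubElement(node, "var", name="x")
ET.SubElement(node, "int", value="1")
node.set("line", "42")  # annotate after the fact; children don't move

print(ET.tostring(node).decode())
```

in the S-expression form, adding that kind of annotation means either
more positional slots or ad-hoc conventions.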
however, yes, the current implementation does have some memory-footprint
issues, along with the data storage issues (using a DOM-like system eats
memory, and XML notation eats space).
a binary encoding can at least allow storing and decoding the trees more
quickly, while using a little less space; more so, my SBXE decoder is
much simpler than a full XML parser (and SBXE is the defined format for
representing these ASTs).
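to give an idea of why such a decoder can be so much simpler (this is a
purely hypothetical tag+attribute encoding for illustration, NOT the
actual SBXE wire format):

```python
# hypothetical binary tree encoding (not the actual SBXE format): each
# node is OPEN <tag>, zero or more ATTR <key> <value>, its children in
# order, then END; strings are length-prefixed UTF-8
OPEN, ATTR, END = 0x01, 0x02, 0x03

def _put_str(out, s):
    b = s.encode("utf-8")
    out.append(len(b))  # single-byte length: strings under 256 bytes
    out += b

def _get_str(buf, pos):
    n = buf[pos]; pos += 1
    return buf[pos:pos + n].decode("utf-8"), pos + n

def encode(node, out=None):
    out = bytearray() if out is None else out
    tag, attrs, kids = node  # node = (tag, {attr: value}, [children])
    out.append(OPEN); _put_str(out, tag)
    for k, v in attrs.items():
        out.append(ATTR); _put_str(out, k); _put_str(out, v)
    for kid in kids:
        encode(kid, out)
    out.append(END)
    return bytes(out)

def decode(buf, pos=0):
    assert buf[pos] == OPEN; pos += 1
    tag, pos = _get_str(buf, pos)
    attrs, kids = {}, []
    while buf[pos] != END:
        if buf[pos] == ATTR:
            pos += 1
            k, pos = _get_str(buf, pos)
            v, pos = _get_str(buf, pos)
            attrs[k] = v
        else:
            kid, pos = decode(buf, pos)
            kids.append(kid)
    return (tag, attrs, kids), pos + 1

tree = ("binop", {"op": "+"},
        [("var", {"name": "x"}, []), ("int", {"value": "1"}, [])])
assert decode(encode(tree))[0] == tree  # round-trips
```

no angle brackets, entities, or whitespace to scan past; the whole
decoder is a couple dozen lines versus a real XML parser.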
however, in some ways, XML is overkill for compiler ASTs, and possibly a
few features could be eliminated (to reduce memory footprint, creating a
subset):
raw text globs and CDATA;
namespaces;
....
so, the subset would only support tags and attributes.
however, as of yet, I have not adopted such a restrictive subset (text
globs, CDATA, namespaces, ... continue to be supported even if not
really used by the compiler).
even a few extensions are supported, such as "BDATA" globs (basically,
for raw globs of binary data, although if printed textually, BDATA is
written out in hex). but, these are also not used for ASTs.
although, a compromise is possible:
the in-memory nodes could eliminate raw text globs and CDATA, yet still
support them by internally moving the text into an attribute and using
special tags (such as "!TEXT").
or such...
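that compromise could look something like this (again with ElementTree
standing in for the real node representation; the transform just lifts
text globs into an attribute of a special child tag):

```python
import xml.etree.ElementTree as ET

def lift_text(elem):
    # move a raw text glob into a special "!TEXT" child tag, so the
    # in-memory nodes only ever need tags and attributes (tail text
    # and mixed content are ignored in this sketch)
    if elem.text and elem.text.strip():
        t = ET.Element("!TEXT", value=elem.text.strip())
        elem.text = None
        elem.insert(0, t)
    for kid in list(elem):
        lift_text(kid)

tree = ET.fromstring("<doc><note>hello</note></doc>")
lift_text(tree)
# <note> now has no text glob, just an attribute-only child, roughly
# <note><!TEXT value="hello"/></note> -- not legal XML on disk, but
# fine for nodes that only ever live in memory
```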