Information on XML overhead analysis

Discussion in 'XML' started by Generic Usenet Account, Feb 14, 2011.

  1. Greetings,

    Have there been any studies done on the overhead imposed by XML? We
    are evaluating whether or not XML imposes an unacceptable overhead for
    severely resource constrained devices in M2M (Machine-to-Machine)
    deployments. These devices are expected to be very cheap (< $10) and
    are expected to run on battery power for years.

    Any pointers will be appreciated.

    Regards,
    Bhat
     
    Generic Usenet Account, Feb 14, 2011
    #1

  2. Peter Flynn (Guest)

    On 14/02/11 21:19, Generic Usenet Account <> wrote
    in comp.text.tex:
    > Greetings,
    >
    > Have there been any studies done on the overhead imposed by XML? We
    > are evaluating whether or not XML imposes an unacceptable overhead for
    > severely resource constrained devices in M2M (Machine-to-Machine)
    > deployments. These devices are expected to be very cheap (< $10) and
    > are expected to run on battery power for years.
    >
    > Any pointers will be appreciated.


    I think it depends on how *much* XML is "unacceptable". Parsing a very
    small, well-formed instance, with no reference to DTDs or Schemas, such
    as a simple config file, would not appear to present much difficulty,
    and there are libraries for the major scripting languages that could be
    cut down for the purpose.
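
    For a sense of scale, a complete SAX-style parse of a small config
    instance with expat (one such small C parsing library; the document and
    element names below are invented for the example) fits in a couple of
    dozen lines:

    #include <stdio.h>
    #include <string.h>
    #include <expat.h>

    static void XMLCALL on_start(void *userData, const XML_Char *name,
                                 const XML_Char **atts)
    {
        printf("element: %s\n", name);
        for (int i = 0; atts[i]; i += 2)
            printf("  %s = \"%s\"\n", atts[i], atts[i + 1]);
    }

    int main(void)
    {
        /* hypothetical config document */
        const char *doc =
            "<config><net host=\"10.0.0.1\" port=\"80\"/></config>";
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetElementHandler(p, on_start, NULL);   /* no end handler */
        if (XML_Parse(p, doc, (int)strlen(doc), 1) == XML_STATUS_ERROR)
            fprintf(stderr, "parse error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
        XML_ParserFree(p);
        return 0;
    }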

    Larger files of the "Data" genre may also be "acceptable", as they do
    not typically use mixed content, and rarely descend much below 4-5
    levels IMHE. "Document" files (eg DocBook, XHTML, TEI, etc) by contrast
    can be arbitrarily complex and may nest markup to a considerable depth;
    TEI in particular. In both cases, a definition of "severely
    constrained" would be needed: is this memory, speed, bandwidth, or all
    three?

    You might want to talk to some of the utility and application authors
    who have implemented some very fast XML software, and see what their
    approach was. I'm not a computer scientist, so I don't know how you
    would measure the balance between the demands of XML and the demands of
    the implementation language, but I would expect that there are metrics
    for this which would let you take the platform restrictions into account.

    There was some discussion of performance and resources at last year's
    XML Summerschool in Oxford, mostly in the sessions on JSON vs XML. I'm
    not sure that there was a formal conclusion at that stage, but the
    consensus seemed to be that they weren't in competition; rather, that
    they addressed different requirements. There was also a recent tweet
    from Michael Kay implying that there may be JSON support in Saxon 3.x,
    which would make serialisation easier. That, however, doesn't address
    the problem for small devices that Java is a hog :)
    (http://xmlsummerschool.com)

    The underlying implication of the XML Spec is that resources (disk
    space, bandwidth, processor speed) would become less and less of a
    factor: I'm not sure that we envisaged severely resource-constrained
    devices as forming part of the immediate future. But perhaps someone out
    there has indeed tested and measured the cycles and bytes needed.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Feb 14, 2011
    #2

  3. On 2/14/2011 4:19 PM, Generic Usenet Account wrote:
    > Have there been any studies done on the overhead imposed by XML?


    Depends on the XML, depends on the alternatives, depends on the specific
    task being addressed.

    Generally, my recommendation is that XML be thought of as a data model
    for interchange and toolability. If you're exchanging data entirely
    inside of something where nobody else is going to touch it, raw binary
    works just fine, and is maximally compact. When the data wants to move
    into or out of that controlled environment, XML can be a good choice as
    a representation that reliably works across architectures, is easy to
    debug, and has a great deal of support already in place which you can
    take advantage of.

    Tools for tasks. No one tool is perfect for everything, and they *ALL*
    involve tradeoffs.



    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Feb 15, 2011
    #3
  4. Rui Maciel (Guest)

    Generic Usenet Account wrote:

    > Greetings,
    >
    > Have there been any studies done on the overhead imposed by XML? We
    > are evaluating whether or not XML imposes an unacceptable overhead for
    > severely resource constrained devices in M2M (Machine-to-Machine)
    > deployments. These devices are expected to be very cheap (< $10) and
    > are expected to run on battery power for years.
    >
    > Any pointers will be appreciated.


    XML does impose a considerable overhead, which means that the answer to
    your question depends only on what you consider "unacceptable". For
    example, if you design a protocol for communication between two systems
    and need to exchange data structures, you will be forced either to
    feed/swallow a lot of cruft just to get that (i.e., tons of convoluted
    elements whose opening and closing tags end up wasting several times
    the space used to encode the information they are meant to convey) or
    to develop crude hacks to weasel your way out of the problem XML forced
    on you (i.e., dump your data structures into an element in your own
    format and then re-parse that a second time on the receiving end).

    And there is a good reason for that: XML is a markup language. It was
    designed to encode documents, such as HTML, and nothing else. It may do
    that well, but once you step beyond it, XML simply doesn't work as
    well. Plus, there are plenty of better-suited alternatives out there.

    My suggestion is that if you really want a data interchange language
    then you should go with a language designed specifically with that in
    mind. One such language is JSON, which, in spite of its name, happens
    to be a great language. For example, unlike XML it provides explicit
    support for data structures (objects, arrays/lists) and for basic data
    types (text strings, numbers, boolean values, null). Another advantage
    is that it is terribly simple to parse, which means you can develop a
    fully conforming parser in a hundred or so LoC of C, including all the
    state machine stuff for the lexer.
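
    By way of illustration (a rough sketch, not a conforming parser:
    strings keep their escapes, and numbers/keywords are not validated),
    the core of such a lexer in C is about this small:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* return 0 at end of input, the punctuation character itself,
       'S' for a string, or 'V' for a bare value (number/true/false/null) */
    static int next_token(const char **p, char *buf, size_t n)
    {
        while (isspace((unsigned char)**p))
            (*p)++;
        char c = **p;
        if (c == '\0')
            return 0;
        if (strchr("{}[]:,", c)) { (*p)++; return c; }
        if (c == '"') {                       /* string (escapes kept) */
            const char *s = ++(*p);
            while (**p && **p != '"') {
                if (**p == '\\') (*p)++;      /* skip escaped character */
                (*p)++;
            }
            size_t len = (size_t)(*p - s);
            if (len >= n) len = n - 1;
            memcpy(buf, s, len); buf[len] = '\0';
            if (**p) (*p)++;                  /* closing quote */
            return 'S';
        }
        const char *s = *p;                   /* number or keyword */
        while (**p && !isspace((unsigned char)**p) && !strchr("{}[]:,", **p))
            (*p)++;
        size_t len = (size_t)(*p - s);
        if (len >= n) len = n - 1;
        memcpy(buf, s, len); buf[len] = '\0';
        return 'V';
    }

    int main(void)
    {
        const char *json = "{\"id\": 42, \"ok\": true}";
        char buf[64];
        int t;
        while ((t = next_token(&json, buf, sizeof buf)) != 0)
            printf("%c %s\n", t, (t == 'S' || t == 'V') ? buf : "");
        return 0;
    }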


    Hope this helps,
    Rui Maciel
     
    Rui Maciel, Feb 16, 2011
    #4
  5. GUA wrote:
    >Have there been any studies done on the overhead imposed by XML? We
    >are evaluating whether or not XML imposes an unacceptable overhead for
    >severely resource constrained devices in M2M (Machine-to-Machine)
    >deployments.


    I personally find that markup/data overheads of several hundred
    percent are difficult to justify.
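
    (As a hypothetical illustration: a four-byte reading sent as
    <temperature unit="C">21.5</temperature> occupies 40 bytes on the wire,
    i.e. 36 bytes of markup for 4 bytes of payload, an overhead of roughly
    900%.)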

    Somewhat related, see "Why the Air Force needs binary XML"
    http://www.mitre.org/news/events/xml4bin/pdf/gilligan_keynote.pdf
    --
    Roberto Waltman

    [ Please reply to the group.
    Return address is invalid ]
     
    Roberto Waltman, Mar 1, 2011
    #5
  7. On 2/28/2011 7:20 PM, Roberto Waltman wrote:
    > I personally find that markup/data overheads of several hundred
    > percent are difficult to justify.


    XML compresses like a sonofagun. And industry experience has been that
    the time needed to parse XML vs. the time needed to reload from a
    binary-stream representation aren't all that different. That's the rock
    on which past attempts to push the idea of standardizing a binary
    equivalent of XML have foundered -- the intuitive sense that binary
    should automatically be better hasn't panned out.
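
    As a rough illustration of how well repetitive markup deflates (a
    hypothetical sketch using zlib's one-shot compress(); the sample
    document and the printed ratio are invented for the example, not
    measurements from any real system):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        /* build a deliberately repetitive XML document */
        char src[4096] = "";
        for (int i = 0; i < 60; i++)
            strcat(src, "<sample t=\"12:00\" v=\"21.5\"/>\n");
        uLong srcLen = (uLong)strlen(src);

        Bytef dest[4096];
        uLongf destLen = sizeof dest;
        if (compress(dest, &destLen, (const Bytef *)src, srcLen) == Z_OK)
            printf("%lu bytes -> %lu bytes (%.1fx smaller)\n",
                   (unsigned long)srcLen, (unsigned long)destLen,
                   (double)srcLen / (double)destLen);
        return 0;
    }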

    Sharing binary representations once the data is in memory makes more
    sense. In fact, XML's greatest strength is at the edges of a system --
    as a data interchange/standardization/tooling format -- while the
    interior of the system would often be better off using a data model
    specifically tuned to that system's needs.


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Mar 1, 2011
    #7
  8. Roberto Waltman <> writes:

    > GUA wrote:
    >>Have there been any studies done on the overhead imposed by XML? We
    >>are evaluating whether or not XML imposes an unacceptable overhead for
    >>severely resource constrained devices in M2M (Machine-to-Machine)
    >>deployments.

    >
    > I personally find that markup/data overheads of several hundred
    > percent are difficult to justify.
    >
    > Somewhat related, see "Why the Air Force needs binary XML"
    > http://www.mitre.org/news/events/xml4bin/pdf/gilligan_keynote.pdf


    Do they accept proposals?

    What about something like:

    element ::= 0x28 element-name 0x20 attributes 0x20 contents 0x29 .

    attributes ::= 0x28 ( attribute-name 0x20 attribute-value )* 0x29 .

    contents ::= ( element | value ) { 0x20 contents } .

    value ::= 0x22 ( non-double-quote-character | 0x5c 0x22 | 0x5c 0x5c )* 0x22
            | number
            | identifier .

    element-name ::= identifier .
    attribute-name ::= identifier .
    attribute-value ::= value .
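
    By way of example (one plausible reading of the grammar above; the
    element and attribute names are invented), the XML fragment
    <point x="1" y="2">origin</point> would serialize as:

    (point (x "1" y "2") "origin")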


    --
    __Pascal Bourguignon__ http://www.informatimago.com/
    A bad day in () is better than a good day in {}.
     
    Pascal J. Bourguignon, Mar 1, 2011
    #8
  9. BGB (Guest)

    On 2/28/2011 10:22 PM, Joe Kesselman wrote:
    > On 2/28/2011 7:20 PM, Roberto Waltman wrote:
    >> I personally find that markup/data overheads of several hundred
    >> percent are difficult to justify.

    >
    > XML compresses like a sonofagun. And industry experience has been that
    > the time needed to parse XML vs. the time needed to reload from a
    > binary-stream representation aren't all that different. That's the rock
    > on which past attempts to push the idea of standardizing a binary
    > equivalent of XML have foundered -- the intuitive sense that binary
    > should automatically be better hasn't panned out.
    >
    > Sharing binary representations once the data is in memory makes more
    > sense. In fact, XML's greatest strength is at the edges of a system --
    > as a data interchange/standardization/tooling format -- while the
    > interior of the system would often be better off using a data model
    > specifically tuned to that system's needs.
    >


    I think it depends somewhat on the type of data.

    in my own binary XML format (SBXE), which is mostly used for compiler
    ASTs (for C and several other languages), I am often seeing an approx 6x
    to 9x size difference.

    most of the difference is likely that of eliminating redundant strings
    and tag names (SBXE handles both via MRU lists).
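
    by way of illustration only (SBXE's actual wire format isn't shown in
    this thread): a move-to-front/MRU name table along these lines is one
    way such redundancy gets eliminated; the names, table size, and
    bracketed output below are invented for the sketch.

    #include <stdio.h>
    #include <string.h>

    #define MRU_SIZE 16
    static char mru[MRU_SIZE][32];

    /* emit a short back-reference for a recently seen name,
       or the literal string on first occurrence */
    static void emit_name(const char *name)
    {
        for (int i = 0; i < MRU_SIZE; i++) {
            if (strcmp(mru[i], name) == 0) {
                printf("[ref %d]\n", i + 1);  /* ~1 byte vs. the string */
                char tmp[32];
                strcpy(tmp, mru[i]);          /* move to front */
                memmove(&mru[1], &mru[0], (size_t)i * sizeof mru[0]);
                strcpy(mru[0], tmp);
                return;
            }
        }
        printf("[lit \"%s\"]\n", name);       /* first occurrence */
        memmove(&mru[1], &mru[0], (MRU_SIZE - 1) * sizeof mru[0]);
        strcpy(mru[0], name);
    }

    int main(void)
    {
        const char *tags[] = { "expr", "var", "expr", "var", "call" };
        for (int i = 0; i < 5; i++)
            emit_name(tags[i]);
        return 0;
    }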


    grabbing a few samples (ASTs in both formats), and running them through
    gzip:
    textual XML compresses by around 29x;
    SBXE compresses by around 3.7x.

    the gzip'ed text XML is 1.1x (approx 10%) larger than the gzip'ed SBXE.

    so, purely for the sake of size (if GZIP can be reasonably used in a
    given context), binary XML is not really needed.


    the binary format is likely also a little faster to decode, and, as
    typically used, I don't use deflate.

    it is mostly used within the same program, and also for stuffing XML
    data into a few other misc binary formats.


    however, it can be noted that most common uses of XML don't involve the
    corresponding use of deflate, so a format which is partly compressed by
    default will still save much over one which is not compressed at all.

    so, one would still likely need a "special" file format (let's just call
    it ".xml.gz" or maybe ".xgz" for the moment...).


    or such...
     
    BGB, Mar 1, 2011
    #9
  10. Rui Maciel (Guest)

    Roberto Waltman wrote:

    > I personally find that markup/data overheads of several hundred
    > percent are difficult to justify.
    >
    > Somewhat related, see "Why the Air Force needs binary XML"
    > http://www.mitre.org/news/events/xml4bin/pdf/gilligan_keynote.pdf


    At first glance, that presentation is yet another example of how XML is
    inexplicably forced into inappropriate uses. The presentation basically
    states that the US Air Force needs to implement "seamless
    interoperability between the warfighting elements", which means
    adopting a protocol to handle communications, and then out of nowhere
    XML is presented as a given, without any justification of why it is any
    good, let alone why it should be used. As if that weren't enough, half
    the presentation is then spent suggesting ways to mitigate one of XML's
    many problems, and the fix incidentally consists of simply eliminating
    XML's main (and only?) selling point: being a human-readable format.

    So, it appears it's yet another example of XML fever, where people
    involved in decision-making are attracted to a technology by marketing
    buzzwords instead of its technical merits.


    Rui Maciel
     
    Rui Maciel, Mar 1, 2011
    #10
  11. Rui Maciel (Guest)

    BGB wrote:

    > I think it depends somewhat on the type of data.
    >
    > in my own binary XML format (SBXE), which is mostly used for compiler
    > ASTs (for C and several other languages), I am often seeing an approx 6x
    > to 9x size difference.
    >
    > most of the difference is likely that of eliminating redundant strings
    > and tag names (SBXE handles both via MRU lists).
    >
    >
    > grabbing a few samples (ASTs in both formats), and running them through
    > gzip:
    > textual XML compresses by around 29x;
    > SBXE compresses by around 3.7x.
    >
    > the gzip'ed text XML is 1.1x (approx 10%) larger than the gzip'ed SBXE.
    >
    > so, purely for the sake of size (if GZIP can be reasonably used in a
    > given context), binary XML is not really needed.
    >
    >
    > the binary format is likely also a little faster to decode, and, as
    > typically used, I don't use deflate.
    >
    > it is mostly used within the same program, and also for stuffing XML
    > data into a few other misc binary formats.
    >
    >
    > however, it can be noted that most common uses of XML don't involve the
    > corresponding use of deflate, so a format which is partly compressed by
    > default will still save much over one which is not compressed at all.
    >
    > so, one would still likely need a "special" file format (let's just call
    > it ".xml.gz" or maybe ".xgz" for the moment...).


    The problem with this concept is that if someone really needs a data-
    interchange format which is lean and doesn't need to be human-readable,
    then that person is better off adopting (or even implementing) a format
    which is lean and doesn't need to be human-readable. Once we start by
    picking a human-readable format and then mangling it to make it leaner,
    we simply abandon the single most important justification (and maybe
    the only one) for adopting that specific format.

    Adding to that, if we adopt a human-readable format and are then forced
    to implement some compression scheme in order to use it for its
    intended purpose, we are needlessly complicating things and adding yet
    another point of failure to our code. After all, if we are forced to
    implement a compression scheme in order to use our human-readable
    format for its intended purpose, we are basically adopting two
    different parsers to handle a single document format. That means we are
    forced to adopt or implement two different parsers which must be
    applied to the same data stream in succession, only to be able to
    encode, decode, and use the information.

    Instead, if someone develops a binary format from the start and relies
    on a single parser to encode and decode any data described through that
    format, then that person not only gets exactly what he needs but also
    ends up with a lean format which requires a fraction of the resources
    and code to use.



    Rui Maciel
     
    Rui Maciel, Mar 1, 2011
    #11
  12. BGB (Guest)

    On 3/1/2011 3:31 AM, Rui Maciel wrote:
    > BGB wrote:
    >
    >> [snip]

    >
    > The problem with this concept is that if someone really needs a data-
    > interchange format which is lean and doesn't need to be human-
    > readable, then that person is better off adopting (or even
    > implementing) a format which is lean and doesn't need to be human-
    > readable. Once we start by picking a human-readable format and then
    > mangling it to make it leaner, we simply abandon the single most
    > important justification (and maybe the only one) for adopting that
    > specific format.
    >
    > Adding to that, if we adopt a human-readable format and are then
    > forced to implement some compression scheme in order to use it for
    > its intended purpose, we are needlessly complicating things and
    > adding yet another point of failure to our code. After all, if we are
    > forced to implement a compression scheme in order to use our human-
    > readable format for its intended purpose, we are basically adopting
    > two different parsers to handle a single document format. That means
    > we are forced to adopt or implement two different parsers which must
    > be applied to the same data stream in succession, only to be able to
    > encode, decode, and use the information.
    >
    > Instead, if someone develops a binary format from the start and
    > relies on a single parser to encode and decode any data described
    > through that format, then that person not only gets exactly what he
    > needs but also ends up with a lean format which requires a fraction
    > of the resources and code to use.
    >


    well, for compiler ASTs, basically, one needs a tree-structured format,
    and human readability is very helpful for debugging (so one can see
    more of what is going on inside the compiler).


    now, there are many options here.
    some compilers use raw structs;
    some use S-Expressions;
    ....

    my current compiler internally uses XML (mostly in the front-end),
    mostly as it tends to be a reasonably flexible way to represent
    tree-structured data (more flexible than S-Expressions).

    however, yes, the current implementation does have some memory-footprint
    issues, along with the data storage issues (using a DOM-like system eats
    memory, and XML notation eats space).

    a binary encoding at least allows storing and decoding the trees more
    quickly and using a little less space; what's more, my SBXE decoder is
    much simpler than a full XML parser (and SBXE is the defined format for
    representing these ASTs).


    however, in some ways, XML is overkill for compiler ASTs, and possibly a
    few features could be eliminated (to reduce memory footprint, creating a
    subset):
    raw text globs and CDATA;
    namespaces;
    ....

    so, the subset would only support tags and attributes.
    however, as of yet, I have not adopted such a restrictive subset (text
    globs, CDATA, namespaces, ... continue to be supported even if not
    really used by the compiler).
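
    as a minimal sketch, what in-memory nodes for such a tags-and-
    attributes-only subset might look like in C (the type and field names
    are invented for illustration, not SBXE's actual structures):

    /* hypothetical node types for a tags+attributes-only XML subset */
    struct xattr {
        char *name, *value;
        struct xattr *next;            /* singly-linked attribute list */
    };
    struct xnode {
        char *tag;                     /* element name; no text/CDATA */
        struct xattr *attrs;
        struct xnode *children, *next; /* first child, next sibling */
    };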

    even a few extensions are supported, such as "BDATA" globs (basically,
    for raw globs of binary data, although if printed textually, BDATA is
    written out in hex). but, these are also not used for ASTs.

    although, a compromise is possible:
    the in-memory nodes could eliminate raw text globs and CDATA, yet still
    support them by internally moving the text into an attribute of a
    special tag (such as "!TEXT").
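
    as a hypothetical illustration of that compromise (the "value"
    attribute name is invented here), a mixed-content node such as

        <expr>x + 1</expr>

    could be held in memory as an attribute-only node:

        <expr><!TEXT value="x + 1"/></expr>

    (the latter form is in-memory only; "!" is not a legal start for an
    element name on the wire).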


    or such...
     
    BGB, Mar 1, 2011
    #12
  13. Peter Flynn (Guest)

    On 01/03/11 10:11, Rui Maciel wrote:
    > Roberto Waltman wrote:
    >
    >> I personally find that markup/data overheads of several hundred
    >> percent are difficult to justify.
    >>
    >> Somewhat related, see "Why the Air Force needs binary XML"
    >> http://www.mitre.org/news/events/xml4bin/pdf/gilligan_keynote.pdf

    >
    > At first glance, that presentation is yet another example of how XML
    > is inexplicably forced into inappropriate uses. The presentation
    > basically states that the US Air Force needs to implement "seamless
    > interoperability between the warfighting elements", which means
    > adopting a protocol to handle communications, and then out of nowhere
    > XML is presented as a given, without any justification of why it is
    > any good, let alone why it should be used. As if that weren't enough,
    > half the presentation is then spent suggesting ways to mitigate one
    > of XML's many problems, and the fix incidentally consists of simply
    > eliminating XML's main (and only?) selling point: being a
    > human-readable format.
    >
    > So, it appears it's yet another example of XML fever, where people
    > involved in decision-making are attracted to a technology by
    > marketing buzzwords instead of its technical merits.


    [followups reset to c.t.x]

    Which is why we don't hear a lot about it now. The interoperability
    features of XML (plain text, robust structure, common syntax, etc) are
    ideal for open interop between multiple disparate systems, which is why
    it works so well for applications like TEI. In the case of milcom, they
    have the capacity to ensure identicality between stations, not
    disparity, and they also have absolute control over all other stages of
    messaging (capture, formation, passage, reception, and consumption), so
    the argument for openness and disparity falls.

    There is a well-intentioned tendency for milstd systems to be heavily
    over-engineered. While redundancy, error-correction, encryption, and
    other protective techniques are essential to message survival and
    reconstruction in a low-bandwidth environment, XML precisely does *not*
    address these aspects _per se_. Adding these to the design (at the
    schema stage) adds significantly to the markup overhead, which is
    typically already swollen by "design features" like unnecessarily long
    names.

    I see zero merit in using XML for realtime secure battle-condition
    military messaging. Perhaps some potential enemies do.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Mar 1, 2011
    #13
