Trying to parse a HUGE(1gb) xml file

Discussion in 'Python' started by spaceman-spiff, Dec 20, 2010.

  1. Hi c.l.p folks

    This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post.

    I am trying to parse a ginormous (~1GB) XML file.


    0. I am a Python & XML n00b, and have been relying on the excellent beginner book DIP (Dive Into Python 3 by Mark Pilgrim)... Mark, if you are reading this, you are AWESOME & so is your witty & humorous writing style.


    1. Almost all examples of parsing XML in Python that I have seen start off with these 4 lines of code.

    import xml.etree.ElementTree as etree
    tree = etree.parse('*path_to_ginormous_xml*')
    root = tree.getroot() #my huge xml has 1 root at the top level
    print root

    2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in memory (RAM), which represents the entire document.
    I tried this code, which works fine for a small file (~1MB), but when I run this simple 4-line py code in a terminal for my HUGE target file (1GB), nothing happens.
    In a separate terminal, I run the top command, & I can see a python process with memory (the VIRT column) increasing from 100MB all the way up to 2100MB.

    I am guessing that, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built up in memory, but even after 30-40 mins, nothing happens.
    I don't get an error, seg fault or out_of_memory exception.

    My hardware setup: I have a Win7 Pro box with 8GB of RAM & an Intel Core 2 Quad Q9400.
    On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest OS, with 23GB disk space & 2GB (2048MB) RAM assigned to the guest Ubuntu OS.

    3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
    [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]

    When I ran the same 4-line code above, but with lxml's ElementTree (using the import below in line 1 of the code above)
    import lxml.etree as lxml_etree

    I can see the memory consumption of the python process (which is running the code) shoot up to ~2700MB & then python (or the OS?) kills the process as it nears the total system memory (2GB).

    I ran the code from one terminal window (screenshot: http://imgur.com/ozLkB.png)
    & ran top from another terminal (screenshot: http://imgur.com/HAoHA.png)

    4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] and the iterparse interface [http://effbot.org/zone/element-iterparse.htm]

    Which one is the best for my situation?

    Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of the c.l.p community would be greatly appreciated.
    Please feel free to email me directly too.

    thanks a ton

    cheers
    ashish

    email :
    ashish.makani
    domain:gmail.com

    p.s.
    Other useful links on xml parsing in python
    0. http://diveintopython3.org/xml.html
    1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
    2. http://codespeak.net/lxml/tutorial.html
    3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
    4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    5. http://effbot.org/zone/element-index.htm
       http://effbot.org/zone/element-iterparse.htm
    6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML
    spaceman-spiff, Dec 20, 2010
    #1

  2. On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
    > Hi c.l.p folks
    > This is a rather long post, but i wanted to include all the details &
    > everything i have tried so far myself, so please bear with me & read
    > the entire boringly long post.
    > I am trying to parse a ginormous ( ~ 1gb) xml file.


    I do that hundreds of times a day.

    > 0. I am a python & xml n00b, s& have been relying on the excellent
    > beginner book DIP(Dive_Into_Python3 by MP(Mark Pilgrim).... Mark , if
    > u are readng this, you are AWESOME & so is your witty & humorous
    > writing style)
    > 1. Almost all exmaples pf parsing xml in python, i have seen, start off with these 4 lines of code.
    > import xml.etree.ElementTree as etree
    > tree = etree.parse('*path_to_ginormous_xml*')
    > root = tree.getroot() #my huge xml has 1 root at the top level
    > print root


    Yes, this is a terrible technique; most examples are crap.

    > 2. In the 2nd line of code above, as Mark explains in DIP, the parse
    > function builds & returns a tree object, in-memory(RAM), which
    > represents the entire document.
    > I tried this code, which works fine for a small ( ~ 1MB), but when i
    > run this simple 4 line py code in a terminal for my HUGE target file
    > (1GB), nothing happens.
    > In a separate terminal, i run the top command, & i can see a python
    > process, with memory (the VIRT column) increasing from 100MB , all the
    > way upto 2100MB.


    Yes, this is using DOM. DOM is evil and the enemy, full-stop.

    > I am guessing, as this happens (over the course of 20-30 mins), the
    > tree representing is being slowly built in memory, but even after
    > 30-40 mins, nothing happens.
    > I dont get an error, seg fault or out_of_memory exception.


    You need to process the document as a stream of elements; aka SAX.

    > 3. I also tried using lxml, but an lxml tree is much more expensive,
    > as it retains more info about a node's context, including references
    > to it's parent.
    > [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
    > When i ran the same 4line code above, but with lxml's elementree
    > ( using the import below in line1of the code above)
    > import lxml.etree as lxml_etree


    You're still using DOM; DOM is evil.

    > Which one is the best for my situation ?
    > Any & all
    > code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
    > the c.l.p community would be greatly appreciated.
    > Plz feel free to email me directly too.


    <http://docs.python.org/library/xml.sax.html>

    <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
    Adam Tauno Williams, Dec 20, 2010
    #2

  3. Tim Harig wrote:

    [Wrapped to meet RFC1855 Netiquette Guidelines]
    On 2010-12-20, spaceman-spiff <> wrote:
    > This is a rather long post, but i wanted to include all the details &
    > everything i have tried so far myself, so please bear with me & read
    > the entire boringly long post.
    >
    > I am trying to parse a ginormous ( ~ 1gb) xml file.

    [SNIP]
    > 4. I then investigated some streaming libraries, but am confused - there
    > is SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
    > interface[http://effbot.org/zone/element-iterparse.htm]


    I have made extensive use of SAX and it will certainly work for low
    memory parsing of XML. I have never used "iterparse"; so, I cannot make
    an informed comparison between them.

    > Which one is the best for my situation ?


    Your post was long, but it failed to tell us the most important piece
    of information: what does your data look like and what are you trying
    to do with it?

    SAX is a low-level API that provides a callback interface allowing you to
    process the various elements as they are encountered. You can therefore
    do anything you want with the information as you encounter it, including
    outputting and discarding small chunks as you process them; ignoring
    most of it and saving only what you want into in-memory data structures;
    or saving all of it to a more random-access database or on-disk data
    structure that you can load and process as required.
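    A minimal sketch of such a handler, assuming (purely for illustration) that the interesting elements are called "record" and that only their text content matters; substitute whatever your real data looks like:

    import sys
    import xml.sax

    class RecordHandler(xml.sax.ContentHandler):
        # collects the text of every <record> element as it streams past
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.in_record = False
            self.buffer = []

        def startElement(self, name, attrs):
            if name == 'record':                 # hypothetical tag name
                self.in_record = True
                self.buffer = []

        def characters(self, content):
            if self.in_record:
                self.buffer.append(content)

        def endElement(self, name):
            if name == 'record':
                self.in_record = False
                # handle the element's text right away, then forget it -
                # nothing else is kept in memory
                sys.stdout.write(''.join(self.buffer) + '\n')

    xml.sax.parse('huge.xml', RecordHandler())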

    What you need to do will depend on what you are actually trying to
    accomplish. Without knowing that, I can only affirm that SAX will work
    for your needs without providing any information about how you should
    be using it.
    Tim Harig, Dec 20, 2010
    #3
  4. Terry Reedy wrote:

    On 12/20/2010 2:49 PM, Adam Tauno Williams wrote:
    >
    > Yes, this is a terrible technique; most examples are crap.


    > Yes, this is using DOM. DOM is evil and the enemy, full-stop.


    > You're still using DOM; DOM is evil.


    For serial processing, DOM is superfluous superstructure.
    For random access processing, some might disagree.

    >
    >> Which one is the best for my situation ?
    >> Any& all
    >> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
    >> the c.l.p community would be greatly appreciated.
    >> Plz feel free to email me directly too.

    >
    > <http://docs.python.org/library/xml.sax.html>
    >
    > <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>


    For Python (unlike Java), wrapping module functions as class static
    methods is superfluous superstructure that only slows things down.

    raise Exception(...) # should be something specific like
    raise ValueError(...)

    --
    Terry Jan Reedy
    Terry Reedy, Dec 20, 2010
    #4
  5. Adam Tauno Williams, 20.12.2010 20:49:
    > On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
    >> This is a rather long post, but i wanted to include all the details&
    >> everything i have tried so far myself, so please bear with me& read
    >> the entire boringly long post.
    >> I am trying to parse a ginormous ( ~ 1gb) xml file.

    >
    > Do that hundreds of times a day.
    >
    >> 0. I am a python& xml n00b, s& have been relying on the excellent
    >> beginner book DIP(Dive_Into_Python3 by MP(Mark Pilgrim).... Mark , if
    >> u are readng this, you are AWESOME& so is your witty& humorous
    >> writing style)
    >> 1. Almost all exmaples pf parsing xml in python, i have seen, start off with these 4 lines of code.
    >> import xml.etree.ElementTree as etree


    Try

    import xml.etree.cElementTree as etree

    instead. Note the leading "c", which hints at the C implementation of
    ElementTree. It's much faster and much more memory friendly than the
    pure Python implementation.
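    For what it's worth, a common idiom (not specific to this thread) is to try the C version first and fall back to the pure-Python module if it isn't available:

    try:
        import xml.etree.cElementTree as etree   # C implementation: faster, leaner
    except ImportError:
        import xml.etree.ElementTree as etree    # pure-Python fallback

    The rest of the code can then stay exactly as written.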


    >> tree = etree.parse('*path_to_ginormous_xml*')
    >> root = tree.getroot() #my huge xml has 1 root at the top level
    >> print root

    >
    > Yes, this is a terrible technique; most examples are crap.
    >
    >> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
    >> function builds& returns a tree object, in-memory(RAM), which
    >> represents the entire document.
    >> I tried this code, which works fine for a small ( ~ 1MB), but when i
    >> run this simple 4 line py code in a terminal for my HUGE target file
    >> (1GB), nothing happens.
    >> In a separate terminal, i run the top command,& i can see a python
    >> process, with memory (the VIRT column) increasing from 100MB , all the
    >> way upto 2100MB.

    >
    > Yes, this is using DOM. DOM is evil and the enemy, full-stop.


    Actually, ElementTree is not "DOM", it's modelled after the XML Infoset.
    While I agree that DOM is, well, maybe not "the enemy", but not exactly
    beautiful either, ElementTree is really a good thing, likely also in this case.


    >> I am guessing, as this happens (over the course of 20-30 mins), the
    >> tree representing is being slowly built in memory, but even after
    >> 30-40 mins, nothing happens.
    >> I dont get an error, seg fault or out_of_memory exception.

    >
    > You need to process the document as a stream of elements; aka SAX.


    IMHO, this is the worst advice you can give.

    Stefan
    Stefan Behnel, Dec 21, 2010
    #5
  6. spaceman-spiff, 20.12.2010 21:29:
    > I am sorry i left out what exactly i am trying to do.
    >
    > 0. Goal: I am looking for a specific element. There are several 10s/100s of occurrences of that element in the 1GB XML file.
    > The contents of the xml is just a dump of config parameters from a packet switch (although imho, the contents of the xml don't matter)
    >
    > I need to detect them & then, for each one, I need to copy all the content b/w the element's start & end tags & create a smaller xml file.


    Then cElementTree's iterparse() is your friend. It allows you to basically
    iterate over the XML tags while it is building an in-memory tree from them.
    That way, you can either remove subtrees from the tree if you don't need
    them (to save memory) or otherwise handle them in any way you like, such as
    serialising them into a new file (and then deleting them).
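    A minimal sketch of that pattern with cElementTree, assuming the elements you want sit close to the document root and are tagged "switch-config" (a made-up name, substitute your own); it follows the recipe on the effbot iterparse page linked in the original post:

    import xml.etree.cElementTree as etree

    def extract(path, wanted_tag):
        # stream the file; write each matching subtree to its own small file
        context = iter(etree.iterparse(path, events=('start', 'end')))
        event, root = next(context)              # grab the root element first
        count = 0
        for event, elem in context:
            if event == 'end' and elem.tag == wanted_tag:
                etree.ElementTree(elem).write('%s-%d.xml' % (wanted_tag, count))
                count += 1
                root.clear()                     # discard everything handled so far
        return count

    extract('huge.xml', 'switch-config')         # hypothetical file and tag names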

    Also note that the iterparse implementation in lxml.etree allows you to
    specify a tag name to restrict the iterator to these tags. That's usually a
    lot faster, but it also means that you need to take more care to clean up
    the parts of the tree that the iterator stepped over. Depending on your
    requirements and the amount of manual code optimisation that you want to
    invest, either cElementTree or lxml.etree may perform better for you.
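    For the lxml variant, the extra cleanup mentioned above looks roughly like the fast_iter recipe from Liza Daly's article (referred to in the next paragraph and already linked by the OP); tag and file names are again made up:

    from lxml import etree

    def fast_iter(path, tag, handle):
        # call handle(elem) for every <tag> element, freeing memory as we go
        context = etree.iterparse(path, events=('end',), tag=tag)
        for event, elem in context:
            handle(elem)
            elem.clear()
            # also drop the now-empty siblings the iterator stepped over
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    fast_iter('huge.xml', 'switch-config', lambda elem: None)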

    It seems that you already found the article by Liza Daly about high
    performance XML processing with Python. Give it another read, it has a
    couple of good hints and examples that will help you here.

    Stefan
    Stefan Behnel, Dec 21, 2010
    #6
  7. On 20.12.2010 20:34, spaceman-spiff wrote:
    > Hi c.l.p folks
    >
    > This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long post.
    >
    > I am trying to parse a ginormous ( ~ 1gb) xml file.
    [SNIP - full original post quoted above]

    Normally (what is normal, anyway?) such files are auto-generated,
    and have an apparent similarity with a database query result,
    encapsulated in XML.
    Most of the time the structure is the same for every "row" that's in there.
    So, a very unpythonic but fast way would be to let awk reassemble the
    records and write them in CSV format to stdout,
    then pipe that to your Python cruncher of choice and let it do the hard
    work.
    The awk part can be done in Python anyway, so you could skip that (a rough sketch follows below).
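    A rough Python stand-in for that awk step - hedged heavily: it assumes one complete, hypothetical <record ...>...</record> per input line, which, as the next replies point out, real XML generators often do not guarantee:

    import csv
    import re
    import sys

    # fragile by design: works only on strictly line-oriented, regular records
    record = re.compile(r'<record id="(.*?)">(.*?)</record>')

    writer = csv.writer(sys.stdout)
    for line in sys.stdin:
        match = record.search(line)
        if match:
            writer.writerow(match.groups())

    Something like "python records2csv.py < huge.xml | python cruncher.py" would then feed the CSV to the real processing script.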

    And take a look at xmlsh.org; they offer tools for the command line,
    like xml2csv. (Needs Java, btw.)

    Cheers
    Stefan Sonnenberg-Carstens, Dec 22, 2010
    #7
  8. Nobody wrote:

    On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote:

    > Normally (what is normal, anyway?) such files are auto-generated,
    > and are something that has a apparent similarity with a database query
    > result, encapsuled in xml.
    > Most of the time the structure is same for every "row" thats in there.
    > So, a very unpythonic but fast, way would be to let awk resemble the
    > records and write them in csv format to stdout.


    awk works well if the input is formatted such that each line is a record;
    it's not so good otherwise. XML isn't a line-oriented format; in
    particular, there are many places where both newlines and spaces are just
    whitespace. A number of XML generators will "word wrap" the resulting XML
    to make it more human readable, so line-oriented tools aren't a good idea.
    Nobody, Dec 23, 2010
    #8
  9. On 23.12.2010 21:27, Nobody wrote:
    > On Wed, 22 Dec 2010 23:54:34 +0100, Stefan Sonnenberg-Carstens wrote:
    >
    >> Normally (what is normal, anyway?) such files are auto-generated,
    >> and are something that has a apparent similarity with a database query
    >> result, encapsuled in xml.
    >> Most of the time the structure is same for every "row" thats in there.
    >> So, a very unpythonic but fast, way would be to let awk resemble the
    >> records and write them in csv format to stdout.

    > awk works well if the input is formatted such that each line is a record;

    You shouldn't tell that to awk.
    > it's not so good otherwise. XML isn't a line-oriented format; in
    > particular, there are many places where both newlines and spaces are just
    > whitespace. A number of XML generators will "word wrap" the resulting XML
    > to make it more human readable, so line-oriented tools aren't a good idea.

    I have never had the opportunity of seeing awk fail at this task :)

    For large datasets I always have huge question marks when someone says "XML".
    But I don't want to start a flame war.
    Stefan Sonnenberg-Carstens, Dec 23, 2010
    #9
  10. Steve Holden wrote:

    On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
    > For large datasets I always have huge question marks if one says "xml".
    > But I don't want to start a flame war.


    I agree people abuse the "spirit of XML" using it to transfer gigabytes
    of data, but what else are they to use?

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
    See Python Video! http://python.mirocommunity.org/
    Holden Web LLC http://www.holdenweb.com/
    Steve Holden, Dec 25, 2010
    #10
  11. Steve Holden, 25.12.2010 16:55:
    > On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
    >> For large datasets I always have huge question marks if one says "xml".
    >> But I don't want to start a flame war.

    >
    > I agree people abuse the "spirit of XML" using it to transfer gigabytes
    > of data


    I keep reading people say that (and *much* worse). XML may not be the most
    tightly tailored solution for data of that size, but it's not inherently
    wrong to store gigabytes of data in XML. I mean, XML is a reasonably fast,
    versatile, widely used, well-compressing and safe data format with an
    extremely ubiquitous and well optimised set of tools available for all
    sorts of environments. So as soon as the data is at all complex or the
    environments require portable data exchange, I consider XML a reasonable
    choice, even for large data sets (which usually implies that it's
    machine-generated output anyway).

    Stefan
    Stefan Behnel, Dec 25, 2010
    #11
  12. "Steve Holden" <> wrote:
    >On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
    >> For large datasets I always have huge question marks if one says

    >"xml".
    >> But I don't want to start a flame war.

    >I agree people abuse the "spirit of XML" using it to transfer gigabytes
    >of data,


    How so? I think this assertion is bogus. XML works extremely well for large datasets.

    >but what else are they to use?


    If you are sending me data - please use XML . I've gotten 22GB XML files in the past - worked without issue and pretty quickly too.

    Sure better than trying to figure out whatever goofy document format someone cooks up on their own. XML toolkits are proven and documented.
    Adam Tauno Williams, Dec 25, 2010
    #12
  13. Tim Harig wrote:

    On 2010-12-25, Steve Holden <> wrote:
    > On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
    >> For large datasets I always have huge question marks if one says "xml".
    >> But I don't want to start a flame war.


    I would agree; but, you don't always have the choice over the data format
    that you have to work with. You just have to do the best you can with what
    they give you.

    > I agree people abuse the "spirit of XML" using it to transfer gigabytes
    > of data, but what else are they to use?


    Something with an index, so that you don't have to parse the entire file,
    would be nice. SQLite comes to mind. It is not standardized, but the
    implementation is free, with bindings for most languages.
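    A sketch of what that could look like: stream the XML once (with iterparse) into an SQLite file, then query it at random later. Element and column names here are invented for illustration:

    import sqlite3
    import xml.etree.cElementTree as etree

    conn = sqlite3.connect('config.db')
    conn.execute('CREATE TABLE IF NOT EXISTS param (name TEXT, value TEXT)')

    # one-off conversion pass over the big file
    for event, elem in etree.iterparse('huge.xml', events=('end',)):
        if elem.tag == 'param':                  # hypothetical element name
            conn.execute('INSERT INTO param VALUES (?, ?)',
                         (elem.get('name'), elem.text))
            elem.clear()
    conn.commit()

    # afterwards: indexed, random access without re-parsing the XML
    row = conn.execute("SELECT value FROM param WHERE name = 'hostname'").fetchone()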
    Tim Harig, Dec 25, 2010
    #13
  14. Roy Smith wrote:

    In article <>,
    Adam Tauno Williams <> wrote:

    > XML works extremely well for large datasets.


    Barf. I'll agree that there are some nice points to XML. It is
    portable. It is (to a certain extent) human readable, and in a pinch
    you can use standard text tools to do ad-hoc queries (i.e. grep for a
    particular entry). And, yes, there are plenty of toolsets for dealing
    with XML files.

    On the other hand, the verbosity is unbelievable. I'm currently working
    with a data feed we get from a supplier in XML. Every day we get
    incremental updates of about 10-50 MB each. The total data set at this
    point is 61 GB. It's got stuff like this in it:

    <Parental-Advisory>FALSE</Parental-Advisory>

    That's 54 bytes to store a single bit of information. I'm all for
    human-readable formats, but bloating the data by a factor of 432 is
    rather excessive. Of course, that's an extreme example. A more
    efficient example would be:

    <Id>1173722</Id>

    which is 26 bytes to store an integer. That's only a bloat factor of
    6-1/2.

    Of course, one advantage of XML is that with so much redundant text, it
    compresses well. We typically see gzip compression ratios of 20:1.
    But, that just means you can archive them efficiently; you can't do
    anything useful until you unzip them.
    Roy Smith, Dec 25, 2010
    #14
  15. On 25.12.2010 20:41, Roy Smith wrote:
    > In article<>,
    > Adam Tauno Williams<> wrote:
    >
    >> XML works extremely well for large datasets.

    > Barf. I'll agree that there are some nice points to XML. It is
    > portable. It is (to a certain extent) human readable, and in a pinch
    > you can use standard text tools to do ad-hoc queries (i.e. grep for a
    > particular entry). And, yes, there are plenty of toolsets for dealing
    > with XML files.
    >
    > On the other hand, the verbosity is unbelievable. I'm currently working
    > with a data feed we get from a supplier in XML. Every day we get
    > incremental updates of about 10-50 MB each. The total data set at this
    > point is 61 GB. It's got stuff like this in it:
    >
    > <Parental-Advisory>FALSE</Parental-Advisory>
    >
    > That's 54 bytes to store a single bit of information. I'm all for
    > human-readable formats, but bloating the data by a factor of 432 is
    > rather excessive. Of course, that's an extreme example. A more
    > efficient example would be:
    >
    > <Id>1173722</Id>
    >
    > which is 26 bytes to store an integer. That's only a bloat factor of
    > 6-1/2.
    >
    > Of course, one advantage of XML is that with so much redundant text, it
    > compresses well. We typically see gzip compression ratios of 20:1.
    > But, that just means you can archive them efficiently; you can't do
    > anything useful until you unzip them.

    Sending complete SQLite databases is absolutely perfect.
    For example, Fedora uses (used?) this for their yum catalog updates.
    Download to the right place, point your tool at it, ready.
    Stefan Sonnenberg-Carstens, Dec 25, 2010
    #15
  16. Nobody wrote:

    On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:

    >> XML works extremely well for large datasets.


    One advantage it has over many legacy formats is that there are no
    inherent 2^31/2^32 limitations. Many binary formats inherently cannot
    support files larger than 2GiB or 4GiB due to the use of 32-bit offsets in
    indices.

    > Of course, one advantage of XML is that with so much redundant text, it
    > compresses well. We typically see gzip compression ratios of 20:1.
    > But, that just means you can archive them efficiently; you can't do
    > anything useful until you unzip them.


    XML is typically processed sequentially, so you don't need to create a
    decompressed copy of the file before you start processing it.
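    In code terms: iterparse (and the SAX parsers) accept any file-like object, so a gzipped file can be streamed directly; file and tag names below are placeholders:

    import gzip
    import xml.etree.cElementTree as etree

    with gzip.open('huge.xml.gz', 'rb') as f:
        for event, elem in etree.iterparse(f, events=('end',)):
            if elem.tag == 'record':             # hypothetical element name
                # ... process elem here ...
                elem.clear()                     # keep memory flat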

    If file size is that much of an issue, eventually we'll see a standard for
    compressing XML. This could easily result in smaller files than using a
    dedicated format compressed with general-purpose compression algorithms,
    as a widely-used format such as XML merits more effort than any
    application-specific format.
    Nobody, Dec 25, 2010
    #16
  17. On Sat, 2010-12-25 at 22:34 +0000, Nobody wrote:
    > On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
    > >> XML works extremely well for large datasets.

    > One advantage it has over many legacy formats is that there are no
    > inherent 2^31/2^32 limitations. Many binary formats inherently cannot
    > support files larger than 2GiB or 4Gib due to the use of 32-bit offsets in
    > indices.


    And what legacy format has support for code pages, namespaces, schema
    verification, or comments? None.

    > > Of course, one advantage of XML is that with so much redundant text, it
    > > compresses well. We typically see gzip compression ratios of 20:1.
    > > But, that just means you can archive them efficiently; you can't do
    > > anything useful until you unzip them.

    > XML is typically processed sequentially, so you don't need to create a
    > decompressed copy of the file before you start processing it.


    Yep.

    > If file size is that much of an issue,


    Which it isn't.

    > eventually we'll see a standard for
    > compressing XML. This could easily result in smaller files than using a
    > dedicated format compressed with general-purpose compression algorithms,
    > as a widely-used format such as XML merits more effort than any
    > application-specific format.


    Agree; and there actually already is a standard compression scheme -
    HTTP compression [supported by every modern web-server]; so the data is
    compressed at the only point where it matters [during transfer].

    Again: "XML works extremely well for large datasets".
    Adam Tauno Williams, Dec 25, 2010
    #17
  18. BartC wrote:

    "Adam Tauno Williams" <> wrote in message
    news:...
    > On Sat, 2010-12-25 at 22:34 +0000, Nobody wrote:
    >> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
    >> >> XML works extremely well for large datasets.

    >> One advantage it has over many legacy formats is that there are no
    >> inherent 2^31/2^32 limitations. Many binary formats inherently cannot
    >> support files larger than 2GiB or 4Gib due to the use of 32-bit offsets
    >> in
    >> indices.

    >
    > And what legacy format has support for code pages, namespaces, schema
    > verification, or comments? None.
    >
    >> > Of course, one advantage of XML is that with so much redundant text, it
    >> > compresses well. We typically see gzip compression ratios of 20:1.
    >> > But, that just means you can archive them efficiently; you can't do
    >> > anything useful until you unzip them.

    >> XML is typically processed sequentially, so you don't need to create a
    >> decompressed copy of the file before you start processing it.

    >
    > Yep.
    >
    >> If file size is that much of an issue,

    >
    > Which it isn't.


    Only if you're prepared to squander resources that could be put to better
    use.

    XML is so redundant that anyone (even me :) could probably spend an afternoon
    coming up with a compression scheme to reduce it to a fraction of its size.

    It could even be a custom format, provided you also send along the few dozen
    lines of Python (or whatever language) needed to decompress it. Although if
    it's done properly, it might be possible to create an XML library that works
    directly on the compressed format, as a plug-in replacement for a
    conventional library.

    That will likely save time and memory.

    Anyway there seem to be existing schemes for binary XML, indicating some
    people do think it is an issue.

    I'm just concerned at the waste of computer power (I used to think HTML was
    bad, for example repeating the same long-winded font name hundreds of times
    over in the same document. And PDF: years ago I was sent a 1MB document for
    a modem; perhaps some substantial user manual for it? No, just a simple
    diagram showing how to plug it into the phone socket!).

    --
    Bartc
    BartC, Dec 26, 2010
    #18
  19. Tim Harig wrote:

    On 2010-12-25, Nobody <> wrote:
    > On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
    >>> XML works extremely well for large datasets.

    > One advantage it has over many legacy formats is that there are no
    > inherent 2^31/2^32 limitations. Many binary formats inherently cannot
    > support files larger than 2GiB or 4Gib due to the use of 32-bit offsets in
    > indices.


    That is probably true of many older and binary formats; but XML
    is certainly not the only format that supports arbitrary size.
    It certainly doesn't prohibit another format with better handling of
    large data sets from being developed. XML's primary benefit is its
    ubiquity. While it is an excellent format for a number of uses, I don't
    accept ubiquity as the only or preeminent metric when choosing a data
    format.

    >> Of course, one advantage of XML is that with so much redundant text, it
    >> compresses well. We typically see gzip compression ratios of 20:1.
    >> But, that just means you can archive them efficiently; you can't do
    >> anything useful until you unzip them.

    >
    > XML is typically processed sequentially, so you don't need to create a
    > decompressed copy of the file before you start processing it.


    Sometimes XML is processed sequentially. When the markup footprint is
    large enough, it must be. Quite often, as in the case of the OP, you only
    want to extract a small piece out of the total data. In those cases, being
    forced to read all of the data sequentially is both inconvenient and a
    performance penalty unless there is some way to address the data you want
    directly.
    Tim Harig, Dec 26, 2010
    #19
  20. Tim Harig wrote:

    On 2010-12-25, Adam Tauno Williams <> wrote:
    > On Sat, 2010-12-25 at 22:34 +0000, Nobody wrote:
    >> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
    >> XML is typically processed sequentially, so you don't need to create a
    >> decompressed copy of the file before you start processing it.

    >
    > Yep.


    Sometimes that is true and sometimes it isn't. There are many situations
    where you want to access the data nonsequentially or address just a small
    subset of it. Just because you never want to access data randomly doesn't
    mean others might not. Certainly the OP would be happier using something
    like XPath to get just the piece of data that he is looking for.
    Tim Harig, Dec 26, 2010
    #20