Re: Trying to parse a HUGE(1gb) xml file

Discussion in 'Python' started by spaceman-spiff, Dec 20, 2010.

  1. Hi Usenet

    First up, thanks for your prompt reply.
    I will make sure I read RFC 1855 before posting again, but right now I'm chasing a hard deadline :)

    I am sorry I left out what exactly I am trying to do.

    0. Goal: I am looking for a specific element. There are several tens or hundreds of occurrences of that element in the 1 GB XML file.
    The contents of the XML are just a dump of config parameters from a packet switch (although, IMHO, the contents of the XML don't matter).

    I need to detect them, and then for each one I need to copy all the content between the element's start and end tags and create a smaller XML file.

    1. Can you point me to some examples/samples of using SAX, especially ones dealing with really large XML files?

    2. This brings me to another question, which I forgot to ask in my OP (original post).
    Is simply opening the file and using regexes to look for the element I need a *good* approach?
    While researching my problem, some articles seemed to advise against this, especially since it's known a priori that the file is XML, and since regex code gets complicated very quickly and is not very readable.

    But is that just a "style"/"elegance" issue? For my particular problem (detecting a certain element, then creating/writing a smaller XML file corresponding to each pair of start and end tags of said element), is the open-file-and-regex approach something you would recommend?

    Thanks again for your super-prompt response :)

    cheers
    ashish
    spaceman-spiff, Dec 20, 2010
    #1

  2. On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
    > I need to detect them & then for each 1, i need to copy all the
    > content b/w the element's start & end tags & create a smaller xml
    > file.


    Yep, I do that a lot, via iterparse.

    > 1. Can you point me to some examples/samples of using SAX,
    > especially , ones dealing with really large XML files.


    SAX is equivalent to iterparse (iterparse is essentially a way to do
    SAX-like processing).

    I provided an iterparse example already. See the Read_Rows method in
    <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
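    For concreteness, a minimal iterparse sketch of that copy-out pattern (the tag name "target" and the file names are invented for illustration; the sketch writes a tiny sample file just to show the loop shape):

    ```python
    import xml.etree.ElementTree as ET

    def extract_elements(path, tag):
        """Yield the serialized text of every <tag> element without
        building a tree of the whole document."""
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == tag:
                yield ET.tostring(elem, encoding="unicode")
                elem.clear()  # drop the children we no longer need

    # tiny stand-in for the 1 GB file
    with open("sample.xml", "w") as f:
        f.write("<config><target><p>1</p></target><noise/>"
                "<target><p>2</p></target></config>")

    # write each occurrence out as its own smaller XML file
    for i, chunk in enumerate(extract_elements("sample.xml", "target")):
        with open("part-%d.xml" % i, "w") as out:
            out.write(chunk)
    ```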

    > 2.This brings me to another q. which i forgot to ask in my OP(original post).
    > Is simply opening the file, & using reg ex to look for the element i need, a *good* approach ?


    No.
    Adam Tauno Williams, Dec 20, 2010
    #2

  3. Tim Harig (Guest)

    On 2010-12-20, spaceman-spiff <> wrote:
    > 0. Goal :I am looking for a specific element..there are several 10s/100s
    > occurrences of that element in the 1gb xml file. The contents of the xml,
    > is just a dump of config parameters from a packet switch( although imho,
    > the contents of the xml dont matter)


    Then you need:
    1. To detect whenever you move inside the type of element you are
    seeking and whenever you move out of it. As long as these
    elements cannot be nested inside each other, this is an
    easy binary task. If they can be nested, then you will
    need to maintain some kind of level count or recursively
    decompose each level.

    2. Once you have obtained a complete element (from its start tag to
    its end tag) you will need to test whether you have the
    single correct element that you are looking for.

    Something like this (untested) will work if the target tag cannot be nested
    in another target tag:

    import xml.sax

    class tagSearcher(xml.sax.ContentHandler):

        def startDocument(self):
            self.inTarget = False
            self.saved = []

        def startElement(self, name, attrs):
            if name == targetName:
                self.inTarget = True
            elif self.inTarget:
                pass  # save element information (name, attrs)

        def endElement(self, name):
            if name == targetName:
                self.inTarget = False
                # test the saved information to see if you have the
                # one you want:
                #
                # if it's the piece you are looking for, then
                # you can process the information you have saved
                #
                # if not, discard the accumulated
                # information and move on

        def characters(self, content):
            if self.inTarget:
                self.saved.append(content)  # save the content

    yourHandler = tagSearcher()
    yourParser = xml.sax.make_parser()
    yourParser.parse(inputXML, yourHandler)

    Then you just walk through the document picking up and discarding each
    target element type until you have the one that you are looking for.

    > I need to detect them & then for each 1, i need to copy all the content
    > b/w the element's start & end tags & create a smaller xml file.


    Easy enough; but, with SAX you will have to recreate the tags from
    the information that they contain, because they will be skipped by the
    characters() events; so you will need to save the information from each
    tag as you come across it. This could probably be done more automatically
    using saxutils.XMLGenerator; but, I haven't actually worked with it
    before. xml.dom.pulldom also looks interesting.
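    A quick pulldom sketch of that idea (the tag name "target" is invented for illustration; only the matched element is ever expanded into an in-memory DOM node):

    ```python
    from xml.dom import pulldom

    def pull_targets(path, tag):
        """Yield the full XML text of each <tag> element; the rest of
        the document streams past without being built into a tree."""
        stream = pulldom.parse(path)
        for event, node in stream:
            if event == pulldom.START_ELEMENT and node.tagName == tag:
                stream.expandNode(node)  # reads ahead to the matching end tag
                yield node.toxml()

    # tiny sample input, just to show the shape of the loop
    with open("sample.xml", "w") as f:
        f.write("<config><target n='1'>x</target><target n='2'>y</target></config>")

    pieces = list(pull_targets("sample.xml", "target"))
    ```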

    > 1. Can you point me to some examples/samples of using SAX, especially ,
    > ones dealing with really large XML files.


    There is nothing special about large files with SAX. SAX is very simple.
    It walks through the document and calls the functions that you
    give it for each event as it reaches various elements. Your callback
    functions (methods of a handler) do everything with the information.
    SAX does nothing more than call your functions. There are events for
    reaching a start tag, an end tag, and characters between tags,
    as well as some for beginning and ending a document.

    > 2.This brings me to another q. which i forgot to ask in my OP(original
    > post). Is simply opening the file, & using reg ex to look for the element
    > i need, a *good* approach ? While researching my problem, some article
    > seemed to advise against this, especially since its known apriori, that
    > the file is an xml & since regex code gets complicated very quickly &
    > is not very readable.
    >
    > But is that just a "style"/"elegance" issue, & for my particular problem
    > (detecting a certain element, & then creating(writing) a smaller xml
    > file corresponding to, each pair of start & end tags of said element),
    > is the open file & regex approach, something you would recommend ?


    It isn't an invalid approach if it works for your situation. I have
    used it before for very simple problems. The thing is, XML is a
    context-free data format, which makes it difficult to write precise
    regular expressions, especially where tags of the same type can be nested.

    It can be very error prone. It's really easy to have a regex work for
    your tests and then fail, either by matching too much or by failing to
    match, because you didn't anticipate a given piece of data. I wouldn't
    consider it a robust solution.
    Tim Harig, Dec 20, 2010
    #3
  4. John Nagle (Guest)

    On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
    > On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
    >> I need to detect them & then for each 1, i need to copy all the
    >> content b/w the element's start & end tags & create a smaller xml
    >> file.

    >
    > Yep, do that a lot; via iterparse.
    >
    >> 1. Can you point me to some examples/samples of using SAX,
    >> especially , ones dealing with really large XML files.


    I've just subclassed HTMLParser for this. It's slow, but
    100% Python. Using the SAX parser is essentially equivalent.
    I'm processing multi-gigabyte XML files and updating a MySQL
    database, so I do need to look at all the entries, but don't
    need a parse tree of the XML.

    > SAX is equivalent to iterparse (iterparse is essentially a way to do
    > SAX-like processing).


    Iterparse does try to build a tree, although you can discard the
    parts you don't want. If you can't decide whether a part of the XML
    is of interest until you're deep into it, an "iterparse" approach
    may result in a big junk tree. You have to keep clearing the "root"
    element to discard that.
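    The root-clearing trick he describes looks roughly like this (an untested-style sketch; "target" is a placeholder tag name and the sample file is invented):

    ```python
    import xml.etree.ElementTree as ET

    def iter_targets(path, tag):
        """Like plain iterparse, but clear the root after each match so
        already-seen siblings don't accumulate into a junk tree."""
        context = ET.iterparse(path, events=("start", "end"))
        event, root = next(context)          # first event is the root's start tag
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                yield ET.tostring(elem, encoding="unicode")
                root.clear()                 # discard everything parsed so far

    with open("sample.xml", "w") as f:
        f.write("<dump><target>a</target><junk/><target>b</target></dump>")

    found = list(iter_targets("sample.xml", "target"))
    ```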

    > I provided an iterparse example already. See the Read_Rows method in
    > <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>


    I don't quite see the point of creating a class with only static
    methods. That's basically a verbose way to create a module.
    >
    >> 2. This brings me to another q. which i forgot to ask in my OP (original post).
    >> Is simply opening the file, & using reg ex to look for the element i need, a *good* approach ?

    >
    > No.


    If the XML file has a very predictable structure, that may not be
    a bad idea. It's not very general, but if you have some XML file
    that's basically fixed format records using XML to delimit the
    fields, pounding on the thing with a regular expression is simple
    and fast.
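    Both halves of that trade-off fit in a few lines (the <port> markup is invented for illustration):

    ```python
    import re

    # flat, predictable layout: a regex does fine here
    flat = "<switch><port id='1'>up</port><port id='2'>down</port></switch>"
    ports = re.findall(r"<port\b.*?</port>", flat)

    # nesting breaks it: the non-greedy match stops at the FIRST closing
    # tag, splitting the outer element -- the failure mode described above
    nested = "<port><port>inner</port>outer</port>"
    broken = re.findall(r"<port\b.*?</port>", nested)
    ```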

    John Nagle
    John Nagle, Dec 22, 2010
    #4
