python xml DOM? pulldom? SAX?

Discussion in 'Python' started by jog, Aug 29, 2005.

  1. jog

    jog Guest

    Hi,
    I want to get text out of some nodes of a huge XML file (1.5 GB). The
    architecture of the xml file is something like this:
    <parent>
    <page>
    <title>bla</title>
    <id></id>
    <revision>
    <id></id>
    <text>blablabla</text>
    </revision>
    </page>
    <page>
    </page>
    ....
    </parent>
    I want to combine the text out of page:title and page:revision:text for
    every single page element. One by one I want to index these combined
    texts (so for each page one index).
    What is the most efficient API for that: SAX (I don't think so), DOM,
    or pulldom?
    Or should I just use XPath somehow?
    I don't want to do anything else with this XML file afterwards.
    I hope someone will understand me.....
    Thank you very much
    Jog
     
    jog, Aug 29, 2005
    #1

  2. "jog" wrote:

    > I want to get text out of some nodes of a huge XML file (1.5 GB). The
    > architecture of the xml file is something like this:


    > I want to combine the text out of page:title and page:revision:text for
    > every single page element. One by one I want to index these combined
    > texts (so for each page one index)


    here's one way to do it:

    try:
        import cElementTree as ET
    except ImportError:
        from elementtree import ElementTree as ET

    for event, elem in ET.iterparse(file):
        if elem.tag == "page":
            title = elem.findtext("title")
            revision = elem.findtext("revision/text")
            print title, revision
            elem.clear()  # won't need this any more

    references:

    http://effbot.org/zone/element-index.htm
    http://effbot.org/zone/celementtree.htm (for best performance)
    http://effbot.org/zone/element-iterparse.htm
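    The same iterparse pattern can be sketched with the xml.etree.ElementTree
    module that later absorbed cElementTree into the standard library (current
    Python syntax; the sample document is made up to mirror jog's structure):

    ```python
    import io
    import xml.etree.ElementTree as ET

    # Hypothetical sample mirroring jog's structure.
    doc = b"""<parent>
      <page>
        <title>bla</title>
        <revision><id>r1</id><text>blablabla</text></revision>
      </page>
    </parent>"""

    pages = []
    # iterparse fires an "end" event once each element is complete,
    # so the subtree for <page> is fully built when we see it.
    for event, elem in ET.iterparse(io.BytesIO(doc)):
        if elem.tag == "page":
            pages.append((elem.findtext("title"), elem.findtext("revision/text")))
            elem.clear()  # free the subtree so memory stays bounded

    print(pages)  # [('bla', 'blablabla')]
    ```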

    </F>
     
    Fredrik Lundh, Aug 29, 2005
    #2

  3. tooper

    tooper Guest

    Hi,

    I'd advocate using SAX, as DOM-related methods imply loading the
    complete XML content into memory, whereas SAX grabs things on the fly.
    The SAX method should therefore be faster and less memory-consuming...

    By the way, if your goal is to just "combine the text out of page:title
    and page:revision:text for every single page element", maybe you should
    also consider an XSLT filter.

    Regards,
    Thierry
     
    tooper, Aug 29, 2005
    #3
  4. On 29 Aug 2005 08:17:04 -0700
    "jog" <> wrote:
    > I want to get text out of some nodes of a huge XML file (1.5 GB). The
    > architecture of the xml file is something like this:
    > [structure snipped]
    > I want to combine the text out of page:title and page:revision:text
    > for every single page element. One by one I want to index these
    > combined texts (so for each page one index)
    > What is the most efficient API for that: SAX (I don't think so), DOM,
    > or pulldom?


    Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an
    event-driven interface is ideal. DOM parses the *entire* XML tree into
    memory at once, before you can do anything - highly inefficient for a
    large data set like this. I've never used pulldom, it might have
    potential, but from my (limited and flawed) understanding of it, I
    think it may also wind up loading most of the file into memory by the
    time you're done.

    SAX will not build any memory structures other than the ones you
    explicitly create (SAX is commonly used to build DOM trees). With SAX,
    you can just watch for any tags of interest (and perhaps some
    surrounding tags to provide context) and extract the desired data, all
    very efficiently.

    It took me a bit to get the hang of SAX, but once I did, I haven't
    looked back. Event-driven parsing is a brilliant solution to this
    problem domain.
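    The event-driven approach can be sketched with xml.parsers.expat directly
    (current Python syntax; the sample document and tag names are assumptions
    mirroring jog's structure). A stack tracks context, and character data is
    accumulated because expat may deliver one text node in several pieces:

    ```python
    import xml.parsers.expat

    # Made-up sample mirroring jog's structure.
    doc = """<parent>
      <page>
        <title>Page one</title>
        <revision><id>r1</id><text>rev one</text></revision>
      </page>
    </parent>"""

    results = []
    stack, buf = [], []

    def start(name, attrs):
        stack.append(name)
        if name == "page":
            results.append({})
        buf.clear()  # new element, new text buffer

    def chars(data):
        buf.append(data)  # accumulate: expat may call this in pieces

    def end(name):
        path = "/".join(stack)
        if path.endswith("page/title"):
            results[-1]["title"] = "".join(buf)
        elif path.endswith("revision/text"):
            results[-1]["text"] = "".join(buf)
        stack.pop()

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start
    p.CharacterDataHandler = chars
    p.EndElementHandler = end
    p.Parse(doc, True)
    print(results)  # [{'title': 'Page one', 'text': 'rev one'}]
    ```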

    > Or should I just use Xpath somehow.


    XPath usually requires a DOM tree on which it can operate. The Python
    XPath implementation (in PyXML) requires DOM objects. I see this as
    being a highly inefficient solution.

    Another potential solution, if the data file has extraneous
    information: run the source file through an XSLT transform that strips
    it down to only the data you need, and then apply SAX to parse it.

    - Michael
     
    Michael Ekstrand, Aug 29, 2005
    #4
  5. Alan Kennedy

    Alan Kennedy Guest

    [jog]
    > I want to get text out of some nodes of a huge XML file (1.5 GB). The
    > architecture of the xml file is something like this:


    [snip]

    > I want to combine the text out of page:title and page:revision:text
    > for every single page element. One by one I want to index these
    > combined texts (so for each page one index)
    > What is the most efficient API for that?:
    > SAX (I don't think so)


    SAX is perfect for the job. See code below.

    > DOM


    If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory
    to load it into a DOM.

    > or pulldom?


    Not sure how pulldom does its pull "optimizations", but I think it
    still builds an in-memory object structure for your document, which will
    still take buckets of memory for such a big document. I could be wrong
    though.
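    For what it's worth, pulldom can keep memory bounded if you expand only
    the matched nodes: it streams events like SAX, and expandNode() builds a
    DOM subtree for just one element. A hedged sketch (current
    xml.dom.pulldom names, made-up sample data):

    ```python
    from xml.dom import pulldom

    # Made-up sample mirroring jog's structure.
    doc = """<parent>
      <page>
        <title>bla</title>
        <revision><id>r1</id><text>blablabla</text></revision>
      </page>
    </parent>"""

    titles = []
    events = pulldom.parseString(doc)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "page":
            events.expandNode(node)  # reify only this <page> subtree
            title = node.getElementsByTagName("title")[0].firstChild.data
            titles.append(title)
    print(titles)  # ['bla']
    ```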

    > Or should I just use Xpath somehow.


    Using xpath normally requires building a (D)OM, which will consume
    *lots* of memory for your document, regardless of how efficient the OM is.

    Best to use SAX and XPATH-style expressions.

    You can get a limited subset of xpath using a SAX handler and a stack.
    Your problem is particularly well suited to that kind of solution. Code
    that does a basic job of this for your specific problem is given below.

    Note that there are a number of caveats with this code

    1. characterdata handlers may get called multiple times for a single xml
    text() node. This is permitted in the SAX spec, and is basically a
    consequence of using buffered IO to read the contents of the xml file,
    e.g. the start of a text node is at the end of the last buffer read, and
    the rest of the text node is at the beginning of the next buffer.

    2. This code assumes that your "revision/text" nodes do not contain
    mixed content, i.e. a mixture of elements and text, e.g.
    "<revision><text>This is a piece of <b>revision</b>
    text</text></revision>". The code below will fail to extract all
    character data in that case.

    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    import xml.sax

    class Page:

        def append(self, field_name, new_value):
            old_value = ""
            if hasattr(self, field_name):
                old_value = getattr(self, field_name)
            setattr(self, field_name, "%s%s" % (old_value, new_value))

    class page_matcher(xml.sax.handler.ContentHandler):

        def __init__(self, page_handler=None):
            xml.sax.handler.ContentHandler.__init__(self)
            self.page_handler = page_handler
            self.stack = []

        def check_stack(self):
            stack_expr = "/" + "/".join(self.stack)
            if '/parent/page' == stack_expr:
                self.page = Page()
            elif '/parent/page/title/text()' == stack_expr:
                self.page.append('title', self.chardata)
            elif '/parent/page/revision/id/text()' == stack_expr:
                self.page.append('revision_id', self.chardata)
            elif '/parent/page/revision/text/text()' == stack_expr:
                self.page.append('revision_text', self.chardata)
            else:
                pass

        def startElement(self, elemname, attrs):
            self.stack.append(elemname)
            self.check_stack()

        def endElement(self, elemname):
            if elemname == 'page' and self.page_handler:
                self.page_handler(self.page)
                self.page = None
            self.stack.pop()

        def characters(self, data):
            self.chardata = data
            self.stack.append('text()')
            self.check_stack()
            self.stack.pop()

    testdoc = """
    <parent>
    <page>
    <title>Page number 1</title>
    <id>p1</id>
    <revision>
    <id>r1</id>
    <text>revision one</text>
    </revision>
    </page>
    <page>
    <title>Page number 2</title>
    <id>p2</id>
    <revision>
    <id>r2</id>
    <text>revision two</text>
    </revision>
    </page>
    </parent>
    """

    def page_handler(new_page):
        print "New page"
        print "title\t\t%s" % new_page.title
        print "revision_id\t%s" % new_page.revision_id
        print "revision_text\t%s" % new_page.revision_text
        print

    if __name__ == "__main__":
        parser = xml.sax.make_parser()
        parser.setContentHandler(page_matcher(page_handler))
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        parser.feed(testdoc)
    #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
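    Caveat 1 above can be addressed by buffering the chunks that characters()
    delivers and joining them when the element ends, instead of keeping only
    the last chunk. A minimal sketch (current Python syntax, standalone
    handler rather than the page_matcher class):

    ```python
    import xml.sax

    class BufferingHandler(xml.sax.handler.ContentHandler):
        """Accumulates character data: characters() may fire several
        times for one text node, so chunks are joined at endElement."""

        def __init__(self):
            super().__init__()
            self.chunks = []
            self.texts = []

        def startElement(self, name, attrs):
            self.chunks = []          # reset buffer for each element

        def characters(self, data):
            self.chunks.append(data)  # accumulate, never overwrite

        def endElement(self, name):
            if name == "text":
                self.texts.append("".join(self.chunks))

    handler = BufferingHandler()
    xml.sax.parseString(b"<revision><text>one two</text></revision>", handler)
    print(handler.texts)  # ['one two']
    ```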

    HTH,

    --
    alan kennedy
    ------------------------------------------------------
    email alan: http://xhaus.com/contact/alan
     
    Alan Kennedy, Aug 29, 2005
    #5
  6. Alan Kennedy wrote:

    > SAX is perfect for the job. See code below.


    depends on your definition of perfect...

    using a 20 MB version of jog's sample, and having replaced
    the print statements with local variable assignments, I get the
    following timings:

    5 lines of cElementTree code: 7.2 seconds
    60+ lines of xml.sax code: 63 seconds

    (Python 2.4.1, Windows XP, Pentium 3 GHz)

    </F>
     
    Fredrik Lundh, Aug 29, 2005
    #6
  7. Alan Kennedy

    Alan Kennedy Guest

    [Alan Kennedy]
    >>SAX is perfect for the job. See code below.


    [Fredrik Lundh]
    > depends on your definition of perfect...


    Obviously, perfect is the eye of the beholder ;-)

    [Fredrik Lundh]
    > using a 20 MB version of jog's sample, and having replaced
    > the print statements with local variable assignments, I get the
    > following timings:
    >
    > 5 lines of cElementTree code: 7.2 seconds
    > 60+ lines of xml.sax code: 63 seconds
    >
    > (Python 2.4.1, Windows XP, Pentium 3 GHz)


    Impressive!

    At first, I thought your code sample was building a tree for the entire
    document, so I checked the API docs. It appeared to me that an event
    processing model *couldn't* obtain the text for the node when notified
    of the node: the text node is still in the future.

    That's when I understood the nature of iterparse, which must generate an
    event *after* the node is complete, with its subdocument reified. That's
    also when I understood the meaning of the "elem.clear()" call at the
    end. Only the required section of the tree is modelled in memory at any
    given time. Nice.

    There are some minor inefficiencies in my pure python sax code, e.g.
    building the stack expression for every evaluation, but I left them in
    for didactic reasons. But even if every possible speed optimisation was
    added to my python code, I doubt it would be able to match your code.

    I'm guessing that a lot of the reason why cElementTree performs best is
    because the model-building is primarily implemented in C: Both of our
    solutions run python code for every node in the tree, i.e. are O(N). But
    yours also avoids the overhead of having function-calls/stack-frames for
    every single node event, by processing all events inside a single function.

    If the SAX algorithm were implemented in C (or Java, for that matter), I
    wonder if it might give comparable performance to the cElementTree code,
    primarily because the data structures it is building are simpler
    compared to the tree-subsections being reified and discarded by
    cElementTree. But that's not of relevance, because we're looking for
    Python solutions. (Aside: I can't wait to run my solution on a
    fully-optimising PyPy :)

    That's another nice thing I didn't know (c)ElementTree could do.

    enlightened-ly'yrs,

    --
    alan kennedy
    ------------------------------------------------------
    email alan: http://xhaus.com/contact/alan
     
    Alan Kennedy, Aug 29, 2005
    #7
  8. William Park

    William Park Guest

    jog <> wrote:
    > Hi,
    > I want to get text out of some nodes of a huge XML file (1.5 GB). The
    > architecture of the xml file is something like this:
    > <parent>
    > <page>
    > <title>bla</title>
    > <id></id>
    > <revision>
    > <id></id>
    > <text>blablabla</text>
    > </revision>
    > </page>
    > <page>
    > </page>
    > ....
    > </parent>
    > I want to combine the text out of page:title and page:revision:text for
    > every single page element. One by one I want to index these combined
    > texts (so for each page one index).
    > What is the most efficient API for that: SAX (I don't think so), DOM,
    > or pulldom?
    > Or should I just use XPath somehow?
    > I don't want to do anything else with this XML file afterwards.
    > I hope someone will understand me.....
    > Thank you very much
    > Jog


    I would use the Expat interface from Python, Awk, or even the Bash
    shell. I'm most familiar with the shell interface to Expat, which would
    go something like this:

    start() # Usage: start tag att=value ...
    {
        case $1 in
            page) unset title text ;;
        esac
    }
    data() # Usage: data text
    {
        case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
            title.page.*) title=$1 ;;
            text.revision.page) text=$1 ;;
        esac
    }
    end() # Usage: end tag
    {
        case $1 in
            page) echo "title=$title text=$text" ;;
        esac
    }
    expat -s start -d data -e end < file.xml

    --
    William Park <>, Toronto, Canada
    ThinFlash: Linux thin-client on USB key (flash) drive
    http://home.eol.ca/~parkw/thinflash.html
    BashDiff: Super Bash shell
    http://freshmeat.net/projects/bashdiff/
     
    William Park, Aug 29, 2005
    #8
  9. jog

    jog Guest

    Thanks a lot for all your replies, that was really great and helpful.
    Now I have some problems with the indexing: it takes too much memory
    and takes too long. I have to look into it.....
     
    jog, Sep 6, 2005
    #9
