"drop-in" DOM replacement for minidom?

Discussion in 'Python' started by Paul Miller, Aug 13, 2003.

  1. Paul Miller

    Paul Miller Guest

    We've run into minidom's inability to handle large (20+MB) XML files, and
    need a replacement that can handle it. Unfortunately, we're pretty
    dependent on a DOM, so a pulldom or SAX replacement is likely out of the
    question for now.

    Has someone done a more efficient minidom replacement module that we can
    just drop in? Preferably written in C?
    Paul Miller, Aug 13, 2003
    #1

  2. Quoting Paul Miller:
    > We've run into minidom's inability to handle large (20+MB) XML files, and
    > need a replacement that can handle it. Unfortunately, we're pretty
    > dependent on a DOM, so a pulldom or SAX replacement is likely out of the
    > question for now.
    >
    > Has someone done a more efficient minidom replacement module that we can
    > just drop in? Preferably written in C?


    I've posted on a related topic in the past, when a friend of mine was
    blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
    every response I got was of the general form "well what the hell are
    you using DOM for? are you defective?" Some were more diplomatic than
    others.

    My friend also had some more challenging problems. He was running on a
    DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
    byte-ordering problems. PyRXP wouldn't compile for him, if I recall
    correctly -- or maybe there were licensing problems? Anyway, he
    ultimately settled on using pulldom; that gave him simplicity, speed,
    and a small enough memory profile that it satisfied his needs.

    Obviously it won't help in your case.

    I don't think you'll find something that precisely mimics the minidom
    module's interface, so you're going to hafta do some retooling.
    However, I believe that if you can get 4Suite to compile, you might
    find some love in there. There's a cDomlette component (labelled at
    the time of my last reading as "experimental") that builds the parse
    tree in C, with a minimal memory consumption.

    Here's a link to something that should tell you how to make it work
    (though when I personally used cDomlette, I seem to remember it being
    harder than this....)

    http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

    Also, you may be interested in looking at the comparisons done by the
    PyRXP folks on their page:

    http://www.reportlab.com/xml/pyrxp.html

    Best of luck!

    --G.

    --
    Geoff Gerrietts <geoff at gerrietts net>
    "Whenever people agree with me I always feel I must be wrong." --Oscar Wilde
    Geoff Gerrietts, Aug 13, 2003
    #2

  3. Harry George wrote:
    > Paul Miller writes:
    >
    > Switching to
    > SAX was a major improvement in mem usage and thus in parse time.
    >


    As an alternative you can easily build a custom, lightweight, Object
    Model. I'm using one designed naively to reflect the set of elements
    used in the several XML schemas we use. I use SAX to parse the
    document into our object model and have the convenience of programming
    with the nicer (in some ways DOM like) interface.

    Basically there is a class Element which (since 2.2) is a child of
    list. By convention it can contain either a unicode string (CDATA) or
    another element. The XML attributes can be either stored as a
    dictionary or, as I eventually did, directly as attributes of the
    class. Record the parent element (aka location), add some methods
    such as nextSibling() etc and you're on your way.
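
    A minimal sketch of the idea described above, written for current
    Python rather than 2.2. The names Element, TreeBuilder, parse_string
    and next_sibling are illustrative assumptions, not the poster's code,
    and attributes are kept in a dictionary (the first of the two options
    mentioned):

```python
import xml.sax


class Element(list):
    """A lightweight node: a list of child Elements and text strings."""

    def __init__(self, name, attrs=None, parent=None):
        super().__init__()
        self.name = name
        self.attrs = dict(attrs.items()) if attrs is not None else {}
        self.parent = parent

    def next_sibling(self):
        """Return the node that follows this one in its parent, or None."""
        if self.parent is None:
            return None
        # Compare by identity: list equality would treat two distinct
        # Elements with equal children as the same node.
        for i, node in enumerate(self.parent):
            if node is self:
                if i + 1 < len(self.parent):
                    return self.parent[i + 1]
                return None
        return None


class TreeBuilder(xml.sax.ContentHandler):
    """SAX handler that assembles Elements instead of DOM nodes."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._current = None

    def startElement(self, name, attrs):
        node = Element(name, attrs, parent=self._current)
        if self._current is None:
            self.root = node
        else:
            self._current.append(node)
        self._current = node

    def characters(self, content):
        # Whitespace-only chunks are dropped; SAX may deliver one run of
        # text in several chunks, which are appended separately here.
        if self._current is not None and content.strip():
            self._current.append(content)

    def endElement(self, name):
        self._current = self._current.parent


def parse_string(text):
    """Parse an XML string and return the root Element."""
    builder = TreeBuilder()
    xml.sax.parseString(text.encode("utf-8"), builder)
    return builder.root
```

    With this in place, parse_string('<doc><x>hi</x><y/></doc>') returns
    an Element whose children can be indexed and iterated like any list.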

    In our case I've adopted a naive approach, i.e. there is a separate
    class for every type of XML element (all of which ultimately derive
    from Element). This suffers from being non-general (i.e. specific to
    the set of schemas we use), but it has the advantage that you don't
    have to look up what kind of Element you are dealing with and
    determine what to do with it; you can use polymorphism instead.
    Further, there is no conceptual difference between a chunk of XML and
    the Python object structure (i.e. Elements within Elements) used to
    represent it.
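
    The class-per-element idea can be sketched roughly as follows. This
    is a self-contained illustration only: the tag names "title" and
    "para", the render() method, and the REGISTRY lookup are all
    hypothetical and not taken from the poster's schemas.

```python
class Element(list):
    """Base node: a list of child Elements and text strings."""

    tag = None

    def __init__(self, *children):
        super().__init__(children)

    def render(self):
        # Default behaviour: concatenate whatever the children render to.
        return "".join(
            c.render() if isinstance(c, Element) else c for c in self
        )


class Title(Element):
    tag = "title"

    def render(self):
        return super().render().upper() + "\n"


class Para(Element):
    tag = "para"

    def render(self):
        return super().render() + "\n"


# Tag name -> class, so a SAX startElement() could pick the right class.
REGISTRY = {cls.tag: cls for cls in (Title, Para)}


def make_element(tag, *children):
    """Instantiate the class registered for tag, or plain Element."""
    return REGISTRY.get(tag, Element)(*children)
```

    Calling render() on the root then dispatches to each subclass without
    any explicit check of the element type.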

    It was because Python was so ideally suited to this kind of thing
    that I originally adopted it. As an aside, I wrote an XSLT sheet
    which reads the various xml-schema files (I only write DTDs myself,
    relying on converters to generate xsd) and writes out the Python stub
    code (i.e. creates the basic class definition for each element, adding
    the appropriate attributes etc). This saves a lot of boring boilerplate
    typing and allows for quick and accurate code updates if new
    attributes are added to the schema.

    Going about it in this kind of way, you get something of much lighter
    weight than DOM, but which does have that nice structural (as opposed
    to SAX's event-driven) way of working with XML.
    Armin Wittfoth, Aug 14, 2003
    #3
  4. On Wed, 13 Aug 2003 11:09:39 -0500, Paul Miller wrote:

    >We've run into minidom's inability to handle large (20+MB) XML files, and
    >need a replacement that can handle it. Unfortunately, we're pretty
    >dependent on a DOM, so a pulldom or SAX replacement is likely out of the
    >question for now.
    >
    >Has someone done a more efficient minidom replacement module that we can
    >just drop in? Preferably written in C?
    >

    I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
    Can you assume that you are dealing with valid (error-free) XML as input?

    Regards,
    Bengt Richter
    Bengt Richter, Aug 14, 2003
    #4
  5. Uche Ogbuji

    Uche Ogbuji Guest

    Geoff Gerrietts wrote:
    > Quoting Paul Miller:
    > > We've run into minidom's inability to handle large (20+MB) XML files, and
    > > need a replacement that can handle it. Unfortunately, we're pretty
    > > dependent on a DOM, so a pulldom or SAX replacement is likely out of the
    > > question for now.
    > >
    > > Has someone done a more efficient minidom replacement module that we can
    > > just drop in? Preferably written in C?

    >
    > I've posted on a related topic in the past, when a friend of mine was
    > blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
    > every response I got was of the general form "well what the hell are
    > you using DOM for? are you defective?" Some were more diplomatic than
    > others.


    My response is usually more like "why are you using a single 30MB
    XML file?"

    I've long maintained that when working with XML, modest document sizes
    are very important, regardless of what tools you're using.

    But that having been said, some documents are 30MB, and it makes sense
    that they're 30MB, and that's just the way it is.


    > My friend also had some more challenging problems. He was running on a
    > DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
    > byte-ordering problems.


    4Suite used to have byte-ordering problems, originally reported under
    Solaris 9, and also affecting some Mac OS X users. Those are fixed
    now.


    > PyRXP wouldn't compile for him, if I recall
    > correctly -- or maybe there were licensing problems? Anyway, he
    > ultimately settled on using pulldom; that gave him simplicity, speed,
    > and a small enough memory profile that it satisfied his needs.
    >
    > Obviously it won't help in your case.


    pulldom is always worth considering.

    http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html
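
    For readers unfamiliar with it, here is a minimal sketch of the
    pulldom pattern using the standard library's xml.dom.pulldom. The
    helper name iter_records and the idea of a repeated record element
    are assumptions for illustration; the thread does not say what the
    documents look like.

```python
from xml.dom import pulldom


def iter_records(source, tag):
    """Yield one small minidom subtree per <tag> element in source.

    source may be a filename or an open file object.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build a DOM subtree for just this element; the rest of
            # the document is never held in memory at once.
            events.expandNode(node)
            yield node
```

    Each yielded node supports the usual minidom calls (getAttribute,
    childNodes and so on), so existing DOM code can often be reused on
    one record at a time, provided the caller does not keep every node.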

    > I don't think you'll find something that precisely mimics the minidom
    > module's interface, so you're going to hafta do some retooling.
    > However, I believe that if you can get 4Suite to compile,


    Which I hardly expect to be a problem.

    > you might
    > find some love in there. There's a cDomlette component (labelled at
    > the time of my last reading as "experimental")


    cDomlette hasn't been experimental for nearly a year now. We use it
    heavily in production.


    > that builds the parse
    > tree in C, with a minimal memory consumption.


    And fast parse and mutation time.


    > Here's a link to something that should tell you how to make it work
    > (though when I personally used cDomlette, I seem to remember it being
    > harder than this....)
    >
    > http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes


    Your memories must be from long ago :) That API is how it's been for
    a while.


    > Also, you may be interested in looking at the comparisons done by the
    > PyRXP folks on their page:
    >
    > http://www.reportlab.com/xml/pyrxp.html
    >
    > Best of luck!


    Ditto.

    --Uche
    http://uche.ogbuji.net
    Uche Ogbuji, Aug 15, 2003
    #5
  6. Paul Miller

    Paul Miller Guest

    >>Has someone done a more efficient minidom replacement module that we can
    >>just drop in? Preferably written in C?
    >>

    >I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
    >Can you assume that you are dealing with valid (error-free) XML as input?


    Yes, it is assumed to be valid. We don't even use a DTD. But we use the DOM
    to point to later nodes in the tree by following references in nodes higher
    in the tree.

    But, building a sparse object model initially and resolving references
    later might be the right solution.
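
    One way the "sparse model first, resolve references later" idea might
    look, as a rough sketch only. The attribute names "id" and "ref" and
    the Node, SparseBuilder and load names are hypothetical; the thread
    does not say how references are encoded in the actual documents.

```python
import xml.sax


class Node:
    """Sparse record: only the fields needed to follow references."""

    def __init__(self, name, node_id, ref):
        self.name = name
        self.node_id = node_id
        self.ref = ref        # id of the referenced node, or None
        self.target = None    # filled in by load() after parsing


class SparseBuilder(xml.sax.ContentHandler):
    """Collect one Node per element and index those that carry an id."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.by_id = {}

    def startElement(self, name, attrs):
        node = Node(name, attrs.get("id"), attrs.get("ref"))
        self.nodes.append(node)
        if node.node_id is not None:
            self.by_id[node.node_id] = node


def load(source):
    """Parse source (filename or file object) and resolve references."""
    builder = SparseBuilder()
    xml.sax.parse(source, builder)
    # Second pass: every id is known now, so references to nodes that
    # appear later in the document can be resolved.
    for node in builder.nodes:
        if node.ref is not None:
            node.target = builder.by_id.get(node.ref)
    return builder.nodes
```

    An unresolved reference simply leaves target as None, which matches
    the assumption that the input is valid but keeps the failure visible.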
    Paul Miller, Aug 15, 2003
    #6
