"drop-in" DOM replacement for minidom?

Paul Miller

We've run into minidom's inability to handle large (20+MB) XML files, and
need a replacement that can handle it. Unfortunately, we're pretty
dependent on a DOM, so a pulldom or SAX replacement is likely out of the
question for now.

Has someone done a more efficient minidom replacement module that we can
just drop in? Preferably written in C?
 
Geoff Gerrietts

Quoting Paul Miller ([email protected]):
We've run into minidom's inability to handle large (20+MB) XML files, and
need a replacement that can handle it. Unfortunately, we're pretty
dependent on a DOM, so a pulldom or SAX replacement is likely out of the
question for now.

Has someone done a more efficient minidom replacement module that we can
just drop in? Preferably written in C?

I've posted on a related topic in the past, when a friend of mine was
blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
every response I got was of the general form "well what the hell are
you using DOM for? are you defective?" Some were more diplomatic than
others.

My friend also had some more challenging problems. He was running on a
DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
byte-ordering problems. PyRXP wouldn't compile for him, if I recall
correctly -- or maybe there were licensing problems? Anyway, he
ultimately settled on using pulldom; that gave him simplicity, speed,
and a small enough memory profile that it satisfied his needs.

Obviously it won't help in your case.

I don't think you'll find something that precisely mimics the minidom
module's interface, so you're going to hafta do some retooling.
However, I believe that if you can get 4Suite to compile, you might
find some love in there. There's a cDomlette component (labelled at
the time of my last reading as "experimental") that builds the parse
tree in C, with a minimal memory consumption.

Here's a link to something that should tell you how to make it work
(though when I personally used cDomlette, I seem to remember it being
harder than this....)

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

Also, you may be interested in looking at the comparisons done by the
PyRXP folks on their page:

http://www.reportlab.com/xml/pyrxp.html

Best of luck!

--G.
 
Armin Wittfoth

Harry George said:
Switching to
SAX was a major improvement in mem usage and thus in parse time.

As an alternative you can easily build a custom, lightweight, Object
Model. I'm using one designed naively to reflect the set of elements
used in the several XML schemas we use. I use SAX to parse the
document into our object model and have the convenience of programming
with the nicer (in some ways DOM like) interface.

Basically there is a class Element which (since Python 2.2) is a subclass of
list. By convention it can contain either a unicode string (CDATA) or
another element. The XML attributes can be either stored as a
dictionary or, as I eventually did, directly as attributes of the
class. Record the parent element (aka location), add some methods
such as nextSibling() etc and you're on your way.
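
A minimal sketch of the idea described above, written for current Python 3 rather than the 2.2-era code the post refers to. The names `TreeBuilder`, `parse`, `tag`, and `attrs` are assumptions made for illustration; only `Element` and `nextSibling()` come from the post. Here the XML attributes are kept in a dictionary, the first of the two options mentioned.

```python
import xml.sax


class Element(list):
    """A list whose items are child Elements or text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        super().__init__()
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.parent = parent  # the element's location in the tree

    def nextSibling(self):
        """Return the item following this element in its parent, or None."""
        if self.parent is None:
            return None
        # Search by identity: list equality compares contents, so two
        # empty elements would otherwise be indistinguishable.
        for i, child in enumerate(self.parent):
            if child is self:
                if i + 1 < len(self.parent):
                    return self.parent[i + 1]
                return None
        return None


class TreeBuilder(xml.sax.ContentHandler):
    """SAX handler that assembles Element objects (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def startElement(self, name, attrs):
        node = Element(name, attrs.items(), parent=self.current)
        # Compare with None explicitly: an empty Element is falsy.
        if self.current is None:
            self.root = node
        else:
            self.current.append(node)
        self.current = node

    def endElement(self, name):
        self.current = self.current.parent

    def characters(self, content):
        # Whitespace-only text is skipped; SAX may deliver long text
        # in several chunks, which this sketch does not merge.
        if self.current is not None and content.strip():
            self.current.append(content)


def parse(xml_bytes):
    handler = TreeBuilder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.root
```

With this in place, `parse(b'<doc><a id="1">hi</a><b/></doc>')` gives a `doc` element containing two children that can be indexed, sliced, and iterated like any list.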

In our case I've adopted a naive approach, ie there is a separate
class for every type of XML element (which all ultimately derive from
Element). This suffers from being non-general (ie specific to the
set of schemas we use), but it has the advantage that you
don't have to look up what kind of Element you are dealing with and
determine what to do with it, but can use polymorphism nicely.
Further there is no conceptual difference between a chunk of XML, and
the python object structure (ie Elements within Elements) used to
represent it.
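
A rough sketch of the class-per-element approach, assuming two made-up element types (`title` and `para`) and a hypothetical `render()` method; none of these names come from our actual schemas. XML attributes are stored directly as instance attributes, which is the second option mentioned earlier.

```python
class Element(list):
    """Base class: a list of child elements and text strings."""

    tag = None

    def __init__(self, **attrs):
        super().__init__()
        # XML attributes become instance attributes. An attribute named
        # like a list method (e.g. "index") would shadow that method.
        self.__dict__.update(attrs)

    def render(self):
        # Default behaviour: concatenate text and rendered children.
        return "".join(
            child if isinstance(child, str) else child.render()
            for child in self
        )


class Title(Element):
    tag = "title"

    def render(self):
        return super().render().upper() + "\n"


class Para(Element):
    tag = "para"

    def render(self):
        return super().render() + "\n"


# Map tag names to classes so a parser callback can pick the right one.
REGISTRY = {cls.tag: cls for cls in (Title, Para)}


def make_element(tag, **attrs):
    """Instantiate the class registered for `tag`, or a plain Element."""
    return REGISTRY.get(tag, Element)(**attrs)
```

Calling `render()` on any node then does the right thing for its type, with no lookup of what kind of element it is.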

It was because Python was so ideally suited to this kind of thing,
that I originally adopted it. As an aside I wrote an XSLT sheet,
which reads the various xml-schema files (I only write DTDs myself,
relying on converters to generate xsd), and writes out the python stub
code, (ie creates the basic class definition for each element adding
the appropriate attributes etc), saving a lot of boring boilerplate
typing and allowing for quick and accurate code updates if new
attributes are added to the schema.

Going about it in this kind of way, you get something of much lighter
weight than DOM, but which does have that nice structural (as opposed
to SAX's event-driven) way of working with XML.
 
Bengt Richter

We've run into minidom's inability to handle large (20+MB) XML files, and
need a replacement that can handle it. Unfortunately, we're pretty
dependent on a DOM, so a pulldom or SAX replacement is likely out of the
question for now.

Has someone done a more efficient minidom replacement module that we can
just drop in? Preferably written in C?

I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
Can you assume that you are dealing with valid (error-free) XML as input?

Regards,
Bengt Richter
 
Uche Ogbuji

Geoff Gerrietts said:
Quoting Paul Miller ([email protected]):

I've posted on a related topic in the past, when a friend of mine was
blowing thru 8GB of memory parsing a 30MB file in minidom. Pretty much
every response I got was of the general form "well what the hell are
you using DOM for? are you defective?" Some were more diplomatic than
others.

My response is usually more like "what are you using a single 30MB
XML file for?"

I've long maintained that when working with XML, keeping document sizes
modest is very important, regardless of what tools you're using.

But that having been said, some documents are 30MB, and it makes sense
that they're 30MB, and that's just the way it is.

My friend also had some more challenging problems. He was running on a
DEC Alpha, I think under Digital Unix, and as a consequence 4Suite had
byte-ordering problems.

4Suite used to have byte-ordering problems, originally reported under
Solaris 9, and also affecting some Mac OS X users. Those are fixed
now.

PyRXP wouldn't compile for him, if I recall
correctly -- or maybe there were licensing problems? Anyway, he
ultimately settled on using pulldom; that gave him simplicity, speed,
and a small enough memory profile that it satisfied his needs.

Obviously it won't help in your case.

pulldom is always worth considering.
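
As a rough illustration using the standard library's `xml.dom.pulldom`: the stream is scanned event by event, and only the elements of interest are expanded into small DOM subtrees. The function name `iter_records` and the sample element name `entry` are made up for this sketch.

```python
from xml.dom import pulldom


def iter_records(xml_text, tag):
    """Yield each <tag> element as a fully built minidom subtree."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull in the children of this one element only; the rest
            # of the document is never held in memory as a tree.
            events.expandNode(node)
            yield node


sample = '<log><entry id="1">a</entry><skip/><entry id="2">b</entry></log>'
for entry in iter_records(sample, "entry"):
    print(entry.getAttribute("id"), entry.firstChild.data)
```

For a large file on disk, `pulldom.parse(filename)` can be used in place of `pulldom.parseString`. Each yielded node supports the usual minidom methods, so existing DOM code can often be reused on a per-record basis.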

http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html

I don't think you'll find something that precisely mimics the minidom
module's interface, so you're going to hafta do some retooling.
However, I believe that if you can get 4Suite to compile,

Which I hardly expect to be a problem.

you might
find some love in there. There's a cDomlette component (labelled at
the time of my last reading as "experimental")

cDomlette hasn't been experimental for nearly a year now. We use it
heavily in production.

that builds the parse
tree in C, with a minimal memory consumption.

And fast parse and mutation time.

Here's a link to something that should tell you how to make it work
(though when I personally used cDomlette, I seem to remember it being
harder than this....)

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

Your memories must be from long ago :) That API is how it's been for
a while.

Also, you may be interested in looking at the comparisons done by the
PyRXP folks on their page:

http://www.reportlab.com/xml/pyrxp.html

Best of luck!

Ditto.

--Uche
http://uche.ogbuji.net
 
Paul Miller

Has someone done a more efficient minidom replacement module that we can

I'm curious how DOM dependent you really are. I.e., what minidom methods do you really use?
Can you assume that you are dealing with valid (error-free) XML as input?

Yes, it is assumed to be valid. We don't even use a DTD. But we use the DOM
to point to later nodes in the tree by following references in nodes higher
in the tree.

But, building a sparse object model initially and resolving references
later might be the right solution.
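
A minimal sketch of what that could look like, assuming (purely for illustration) that nodes are identified by an `id` attribute and point at each other through a `ref` attribute; our real files use other names. `Node`, `SparseBuilder`, and `load` are hypothetical as well.

```python
import xml.sax


class Node:
    """Lightweight node: tag, attributes, children, resolved target."""

    __slots__ = ("tag", "attrs", "children", "target")

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.children = []
        self.target = None  # filled in by the resolve pass


class SparseBuilder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []
        self.by_id = {}    # id value -> Node
        self.pending = []  # nodes whose "ref" is not yet resolved

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)
        if "id" in node.attrs:
            self.by_id[node.attrs["id"]] = node
        if "ref" in node.attrs:
            # The target may appear later in the file, so defer lookup.
            self.pending.append(node)

    def endElement(self, name):
        self.stack.pop()

    def resolve(self):
        """Second pass: link every referencing node to its target."""
        for node in self.pending:
            # Raises KeyError if a reference points at an unknown id.
            node.target = self.by_id[node.attrs["ref"]]
        self.pending.clear()


def load(xml_bytes):
    builder = SparseBuilder()
    xml.sax.parseString(xml_bytes, builder)
    builder.resolve()
    return builder.root
```

Text content is ignored here to keep the model sparse; a `characters()` method could be added for the elements that need it.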
 
