xml.dom.minidom memory usage

Dan

I'm using python's xml.dom.minidom module to generate xml files, and
I'm running into memory problems. The xml files I'm trying to create
are relatively flat, with one root node which may have millions of
direct child nodes.

Here's an example script:
#!/usr/bin/env python

import xml.dom.minidom


def gen_xml(n):
    # Build one root element with n flat children; the whole tree stays in memory.
    doc = xml.dom.minidom.Document()
    root = xml.dom.minidom.Element("foo")
    root.ownerDocument = doc
    root.setAttribute("one", "1")
    doc.appendChild(root)
    for x in xrange(n):
        elem = xml.dom.minidom.Element("bar")
        elem.ownerDocument = doc
        elem.setAttribute("attr1", "12345678")
        elem.setAttribute("attr2", "87654321")
        root.appendChild(elem)
    return doc


If I run gen_xml(1000000), my python process ends up using about 90%
of my memory, and the system ends up thrashing (Linux, P4, 1G RAM,
python 2.4.3).

So, my questions are (1) am I doing something dumb in the script that
stops python from collecting temp garbage? (2) If not, is there
another reasonable module to generate xml (as opposed to parsing it),
or should I just implement my own xml generation solution?

Thanks,
-Dan
 
Jonathan Curran

Dan,
The DOM (Document Object Model) loads every element of the XML document
into memory before you can do anything with it. With your file containing
millions of child nodes, this will eat up as much memory as you have. A
solution is to use the SAX method of parsing XML documents instead. SAX
reads the XML document only a node (or a few nodes) at a time.

Unfortunately, whether to use DOM or SAX depends entirely on what kind of
processing you need to do on the XML document. If you are editing a record
at a time (which is what I've gathered from your code), it would be wise to
use SAX. I highly suggest looking into this method of processing.

- Jonathan Curran
 
Jonathan Curran

Dan,
I jumped the gun and didn't read your entire post. I ended up skipping a lot
of parts, especially where you say that you are creating a document (for some
reason I thought you were reading it). I looked at the setAttribute function
and thought: ah he's writing it out.
Sorry about all that. I'd still ask you to look into SAX though :) Especially
when dealing with really large XML documents, whether it be reading or
writing them.

- Jonathan Curran
 
Dan

Jonathan,

Thanks for the response. I'm comfortable using SAX for parsing, but
I'm unsure how it helps me with XML generation. To clarify in my
example code, with the DOM document, I can call doc.toxml() or
doc.toprettyxml(), and get the desired output (an xml file). AFAIK,
SAX has no analog, it just does parsing. Is there a way to do XML
generation with SAX that I'm unaware of?

-Dan
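
For reference, the standard library does have a SAX-side writer:
xml.sax.saxutils.XMLGenerator is a ContentHandler that writes XML text as
its methods are called, so nothing accumulates in memory. A minimal sketch,
assuming the same foo/bar layout as the example script and a current
Python 3 interpreter; write_xml and its parameters are hypothetical names,
not something from the thread:

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def write_xml(out, n):
    """Stream a <foo> root with n <bar> children to a text file-like object.

    write_xml is a hypothetical helper name used for illustration only.
    """
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("foo", AttributesImpl({"one": "1"}))
    for _ in range(n):
        # Each child is written out immediately instead of being kept in a tree.
        attrs = AttributesImpl({"attr1": "12345678", "attr2": "87654321"})
        gen.startElement("bar", attrs)
        gen.endElement("bar")
    gen.endElement("foo")
    gen.endDocument()


# Small in-memory demo; for a million children, pass an open file instead.
buf = io.StringIO()
write_xml(buf, 3)
xml_text = buf.getvalue()
```

Memory use should stay roughly constant regardless of n, since only the
current element is ever held. The trade-off is that there is no tree to go
back and modify once an element has been written.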
 
Bruno Desthuilliers

Dan wrote:
> I'm using python's xml.dom.minidom module to generate xml files, and
> I'm running into memory problems. The xml files I'm trying to create
> are relatively flat, with one root node which may have millions of
> direct child nodes.

Whoops! You're looking for trouble.

> So, my questions are (1) am I doing something dumb in the script

Yes: using minidom !-)

> that
> stops python from collecting temp garbage?

That's not the problem. The problem is with how XML DOM APIs work: they
build the whole damn tree in memory.

> (2) If not, is there
> another reasonable module to generate xml (as opposed to parsing it),
> or should I just implement my own xml generation solution?

You should have a look at Genshi:
http://genshi.edgewall.org/
 
Stefan Behnel

Dan said:
> I'm using python's xml.dom.minidom module to generate xml files, and
> I'm running into memory problems.

Then take a look at cElementTree. It's part of Python 2.5 and is available as
a separate module for Python 2.4. It's fast, has a very low memory profile, and
if you ever decide you need more features, there's lxml to the rescue.
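
A rough sketch of the example script ported to the ElementTree API. It
assumes a current Python 3, where the C accelerator is used automatically
by xml.etree.ElementTree (the separate cElementTree name was removed in
Python 3.9). gen_tree is a hypothetical name. This still builds the whole
tree in memory, just with much lighter nodes than minidom's:

```python
import io
import xml.etree.ElementTree as ET


def gen_tree(n):
    """Build a <foo> root with n <bar> children as an ElementTree.

    gen_tree is a hypothetical helper name used for illustration only.
    """
    root = ET.Element("foo", {"one": "1"})
    for _ in range(n):
        # SubElement creates the child and appends it to root in one call.
        ET.SubElement(root, "bar", {"attr1": "12345678", "attr2": "87654321"})
    return ET.ElementTree(root)


tree = gen_tree(3)

# Serialize to bytes; pass a filename or an open binary file for real output.
buf = io.BytesIO()
tree.write(buf, encoding="utf-8", xml_declaration=True)
xml_bytes = buf.getvalue()
```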

You might also consider streaming your XML piece by piece instead of creating
in-memory trees. Python's generators are a good starting point here.
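
One way the generator approach could look, as a sketch only: each element
is yielded as a string and written straight out, so only one chunk exists
at a time. iter_xml is a hypothetical name, and the attribute values are
the ones from the example script. quoteattr from xml.sax.saxutils handles
the attribute quoting and escaping:

```python
import io
from xml.sax.saxutils import quoteattr


def iter_xml(n):
    """Yield the document piece by piece as text chunks.

    iter_xml is a hypothetical helper name used for illustration only.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>\n'
    yield "<foo one=%s>\n" % quoteattr("1")
    for _ in range(n):
        # quoteattr adds the surrounding quotes and escapes &, < and >.
        yield "<bar attr1=%s attr2=%s/>\n" % (
            quoteattr("12345678"),
            quoteattr("87654321"),
        )
    yield "</foo>\n"


# Small in-memory demo; writelines accepts any iterable of strings,
# so the same call works on an open text file for large outputs.
out = io.StringIO()
out.writelines(iter_xml(3))
streamed = out.getvalue()
```

Hand-written output like this puts well-formedness on you (matching tags,
escaping text with xml.sax.saxutils.escape), which is fine for a flat
format like this one but gets fragile for deeper nesting.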

Stefan
 
