10GB XML Blows out Memory, Suggestions?


axwack

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?
 

Mathias Waack

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

More memory ;)

Maybe you should have a look at pulldom, a combination of SAX and DOM: it
reads your document in a SAX-like manner and expands only selected
sub-trees.

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

Assuming a good design, of course not. Especially if you only need some
selected parts of the document, SAX should be your choice.

Mathias
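
A minimal sketch of the pulldom approach described above, using the standard library's xml.dom.pulldom. The element names (contacts, contact, name) and the sample data are hypothetical, since the thread never shows the real file layout. Only the subtree passed to expandNode is built as a DOM; everything else is consumed as a stream of events.

```python
import io
from xml.dom import pulldom

# Hypothetical sample data; a real dump would be opened from disk instead.
SAMPLE = b"""<contacts>
  <contact id="1"><name>Ann</name><city>Oslo</city></contact>
  <contact id="2"><name>Bob</name><city>Rome</city></contact>
</contacts>"""


def iter_contacts(stream):
    """Yield (id, name) for each <contact>, expanding one subtree at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "contact":
            # Build a small DOM for this one element only.
            events.expandNode(node)
            name_elem = node.getElementsByTagName("name")[0]
            # Join all text children in case the parser split the text.
            name = "".join(child.data for child in name_elem.childNodes)
            yield node.getAttribute("id"), name


for contact_id, name in iter_contacts(io.BytesIO(SAMPLE)):
    print(contact_id, name)
```

Inside the loop, the expanded node supports the usual minidom calls, so existing DOM code that works on one record at a time may carry over with few changes.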
 

Diez B. Roggisch

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

Yes.

You could use ElementTree's iterparse - that should be the easiest solution.

http://effbot.org/zone/element-iterparse.htm

Diez
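
A minimal sketch of the iterparse idea, written for Python 3's xml.etree.ElementTree. The tag names (items, item, id) are assumptions made for illustration. It grabs the root element from the first "start" event and clears it after each record, so finished records do not pile up under the root.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: five <item> records under one root element.
SAMPLE = (
    b"<items>"
    + b"".join(b"<item><id>%d</id></item>" % i for i in range(5))
    + b"</items>"
)


def iter_item_ids(source):
    """Yield the <id> text of each <item> while keeping memory use flat."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            yield elem.findtext("id")
            # Drop the children collected so far from the root element.
            root.clear()


for item_id in iter_item_ids(io.BytesIO(SAMPLE)):
    print(item_id)
```

The same function accepts a filename or an open binary file in place of the BytesIO object.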
 

K.S.Sreeram

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

With a 10gb file, your best bet might be to just use Expat and C!!

Regards
Sreeram



 

Nicola Musatti

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

What you clearly need is a better suited file format, but I suspect
you're not in a position to change it, are you?

Cheers,
Nicola Musatti
 

Diez B. Roggisch

K.S.Sreeram said:
With a 10gb file, your best bet might be to just use Expat and C!!

Now what exactly makes C grok a 10Gb file where python will fail to do so?

What the OP needs is a different approach to XML-documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performance woes, btw.

Diez
 

Paul McGuire

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

You clearly need something instead of XML.

This sounds like a case where a prototype, which worked for the developer's
simple test data set, blows up in the face of real user/production data.
XML adds lots of overhead for nested structures, when in fact, the actual
meat of the data can be relatively small. Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes. Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
    xValue : integer,
    yValue : integer,
    zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
    <xValue>4</xValue>
    <yValue>5</yValue>
    <zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters. Throw in namespaces
for good measure, and you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how this cuts
down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters, or *only* 4 times the size of the contained data
(assuming 4-byte integers).
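
The character counts above can be checked with a short script. This is only a sketch, and the totals depend on whitespace assumptions: the element form is taken with four-space indentation and newlines between lines but no trailing newline, which gives 101 characters; the attribute form is 49 characters, or 50 with a trailing newline. The packed size assumes three 4-byte integers.

```python
import struct

# Element form, assuming four-space indentation and no trailing newline.
element_form = "\n".join([
    "<ThreeDimPoint>",
    "    <xValue>4</xValue>",
    "    <yValue>5</yValue>",
    "    <zValue>6</zValue>",
    "</ThreeDimPoint>",
])

# Attribute form on a single line, without a trailing newline.
attribute_form = '<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>'

# The raw data: three 4-byte little-endian integers.
packed = struct.pack("<3i", 4, 5, 6)

sizes = {
    "elements": len(element_form),
    "attributes": len(attribute_form),
    "binary": len(packed),
}

for label, size in sizes.items():
    print("%-10s %4d bytes  (%.1fx the raw data)" % (label, size, size / len(packed)))
```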

Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).

-- Paul
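
A rough sketch of the SQLite suggestion, assuming the attribute-style ThreeDimPoint records from the example above wrapped in a hypothetical points root element. The table name, column names and batch size are likewise made up for illustration. It streams the XML with iterparse and inserts rows in batches, so neither the whole document nor the whole row set is held in memory.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data using the attribute form from the post above.
SAMPLE = b"""<points>
<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>
<ThreeDimPoint xValue="7" yValue="8" zValue="9"/>
</points>"""


def load_points(source, conn, batch_size=1000):
    """Stream ThreeDimPoint elements into a SQLite table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)"
    )
    batch = []
    total = 0
    for _, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == "ThreeDimPoint":
            batch.append((
                int(elem.get("xValue")),
                int(elem.get("yValue")),
                int(elem.get("zValue")),
            ))
            elem.clear()  # the values are copied out, so free the element
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO points VALUES (?, ?, ?)", batch)
                total += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO points VALUES (?, ?, ?)", batch)
        total += len(batch)
    conn.commit()
    return total


conn = sqlite3.connect(":memory:")  # use a file path for a shippable database
print(load_points(io.BytesIO(SAMPLE), conn), "rows loaded")
```

For a really large input, the emptied elements still hang off the root; clearing the root as well would keep memory flatter.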
 

Kay Schluehr

I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

If your XML files grow so large, you might rethink the representation
model. Maybe give eXist a try?

http://exist.sourceforge.net/

Regards,
Kay
 

K.S.Sreeram

Diez said:
What the OP needs is a different approach to XML-documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performance woes, btw.


There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.

Diez said:
Now what exactly makes C grok a 10Gb file where python will fail to do so?

In most typical cases where there's any kind of significant python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most
cases, the speedup is not worth it and we just trade it for the
increased flexibility/power of the python language. But in this situation
using a bit of tight C code could make the difference between the
process taking just 15mins or taking a few hours!

Of course I'm not asking him to write the entire application in C. It
makes sense to just write the performance critical sections in C, and
wrap it in Python, and write the rest of the application in Python.
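
For reference, a minimal sketch of what 'stream' processing through a SAX-like API looks like in pure Python with xml.sax. The element names and sample data are hypothetical. The main difference from DOM code is that the handler has to track by hand where it is in the document, because no tree exists to navigate.

```python
import io
import xml.sax

# Hypothetical sample data standing in for the real dump.
SAMPLE = b"""<contacts>
  <contact><name>Ann</name></contact>
  <contact><name>Bob</name></contact>
</contacts>"""


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element without building a tree."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._parts = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._parts = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._parts.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._parts))
            self._in_name = False


handler = NameCollector()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.names)
```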




 

Fredrik Lundh

K.S.Sreeram said:
There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.

both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm

</F>
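
A small sketch of the "sax-style" event generation mentioned above, written against Python 3's xml.etree.ElementTree, where the parser class is XMLParser and the separate cElementTree module no longer exists in recent releases. The TagCounter target and the sample chunks are made up for illustration. The parser calls start, end and data on the target as input is fed in, and no tree is built.

```python
import xml.etree.ElementTree as ET


class TagCounter:
    """Parser target: receives SAX-style callbacks and builds no tree."""

    def __init__(self):
        self.counts = {}

    def start(self, tag, attrib):
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # Whatever this returns becomes the result of parser.close().
        return self.counts


def count_tags(chunks):
    """Feed byte chunks to the parser incrementally and return tag counts."""
    parser = ET.XMLParser(target=TagCounter())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


# The chunks may split the input anywhere, even in the middle of a tag.
print(count_tags([b"<items><item/>", b"<item/><ot", b"her/></items>"]))
```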
 

K.S.Sreeram

Fredrik said:
both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm

That's cool! Thanks for the info!

For a multi-gigabyte file, I would still recommend C/C++, because the
processing code which sits on top of the XML library needs to be Python,
and that could turn out to be a significant overhead in such extreme cases.

Of course, the exact strategy to follow would depend on the specifics of
the case, and all this speculation may not really apply! :)

Regards
Sreeram


 

gregarican

10 gigs? Wow, even using SAX I would imagine that you would be pushing
the limits of reasonable performance. Any way you can depart from the
XML requirement? That's not really what XML was intended for in terms
of passing along information IMHO...
 

axwack

The file is an XML dump from Goldmine. I have built a document parser
that allows for the population of data from Goldmine into SugarCRM. The
client's data set is 10gb.
 

Fredrik Lundh

gregarican said:
10 gigs? Wow, even using SAX I would imagine that you would be pushing
the limits of reasonable performance.

depends on how you define "reasonable", of course. modern computers are
quite fast:

> dir data.xml

2006-06-06 21:35 1 002 000 015 data.xml
1 File(s) 1 002 000 015 bytes

> more test.py
from xml.etree import cElementTree as ET
import time

t0 = time.time()

for event, elem in ET.iterparse("data.xml"):
    if elem.tag == "item":
        # free each item as soon as it has been parsed
        elem.clear()

print time.time() - t0

gives me timings between 27.1 and 49.1 seconds over 5 runs.

(Intel Dual Core T2300, slow laptop disks, 1000000 XML "item" elements
averaging 1000 bytes each, bundled cElementTree, peak memory usage 33 MB.
your mileage may vary.)

</F>
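
The script above is Python 2 code. A rough Python 3 sketch of the same measurement follows, with a helper that writes a synthetic test file, since the original data.xml is not available. The file layout (a data root holding item elements padded to about 1000 bytes) is an assumption based on the description above, and the timings it prints will not be comparable to the 2006 figures.

```python
import os
import tempfile
import time
import xml.etree.ElementTree as ET


def write_test_file(path, n_items, payload_size=1000):
    """Write a synthetic XML file with n_items <item> elements."""
    payload = "x" * payload_size
    with open(path, "w", encoding="ascii") as f:
        f.write("<data>\n")
        for _ in range(n_items):
            f.write("<item>%s</item>\n" % payload)
        f.write("</data>\n")


def time_iterparse(path):
    """Return (item count, elapsed seconds) for one iterparse pass."""
    t0 = time.perf_counter()
    count = 0
    for event, elem in ET.iterparse(path):
        if elem.tag == "item":
            count += 1
            elem.clear()  # discard the item's content once it is complete
    return count, time.perf_counter() - t0


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        write_test_file(path, 10000)  # raise this for a more realistic run
        count, elapsed = time_iterparse(path)
        print("%d items in %.2f seconds" % (count, elapsed))
    finally:
        os.remove(path)
```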
 

axwack

Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?
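
One possible answer, as a sketch: iterparse accepts an open file object as well as a filename, and gzip.open returns one, so a gzip-compressed dump can be parsed as it is decompressed. The item tag name is hypothetical, and this assumes the file was compressed with gzip rather than another format.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_items_gz(gz_source, tag="item"):
    """Count elements in a gzip-compressed XML file without unpacking it first.

    gz_source may be a path or a binary file object holding gzip data.
    """
    count = 0
    with gzip.open(gz_source, "rb") as stream:
        for _, elem in ET.iterparse(stream):
            if elem.tag == tag:
                count += 1
                elem.clear()
    return count


# Demo on an in-memory gzip stream; a real run would pass a file path.
compressed = io.BytesIO(gzip.compress(b"<data><item/><item/><item/></data>"))
print(count_items_gz(compressed))
```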
 

gregarican

That's a good-sized Goldmine database. In past lives I have supported
that app and recall that you could match the Goldmine front end against
an SQL backend. If you can get to the underlying data utilizing SQL you
can selectively port over sections of the database and might be able to
attack things more methodically than parsing through a mongo XML file.
Instead you could bulk insert portions of the Goldmine data into
SugarCRM. Know what I mean?
 

John J. Lee

K.S.Sreeram said:
There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python.

Yes.

So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.

No, that's not true. I guess you didn't read the other posts:

http://effbot.org/zone/element-iterparse.htm

In most typical cases where there's any kind of significant python code,
its possible to achieve a *minimum* of a 10x speedup by using C. In most
[...]

I don't know where you got that from. And in this particular case, of
course, cElementTree *is* written in C, there's presumably plenty of
"significant python code" around since, one assumes, *all* of the OP's
code is written in Python (does that count as "any kind" of Python
code?), and yet rewriting something in C here may not make much
difference.


John
 
