cElementTree clear semantics

Igor V. Rafienko · Sep 25, 2005

Hi,

I am trying to understand how cElementTree's clear works: I have a
(relatively) large XML file, that I do not wish to load into memory.
So, naturally, I tried something like this:

from cElementTree import iterparse
count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
        elem.clear()

... which resulted in caching of all elements in memory except for
those named <schnappi> (i.e. the process' memory footprint grew more
and more). Then I thought about clear()'ing all elements that I did not
really need:

from cElementTree import iterparse
count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
    # clear every element, not just <schnappi>
    elem.clear()

... which gave a suitably small memory footprint, *BUT* since
<schnappi> has a number of subelements, and I subscribe to
'end'-events, the <schnappi> element is returned after all of its
subelements have been read and clear()'ed. So, I do indeed see a
<schnappi> element, but calling its getiterator() gives me completely
empty subelements, which is not what I wanted.
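
The effect is easy to reproduce on a tiny document. What follows is a minimal sketch, not the code from the original post: it assumes the standard-library xml.etree.ElementTree module (the successor to the standalone cElementTree package) and a made-up in-memory sample, and it uses iter() where older code used getiterator().

```python
import io
import xml.etree.ElementTree as ET  # stdlib successor to cElementTree (assumption)

# Hypothetical sample data, shaped like the file discussed in this thread.
SAMPLE = b"""<data>
  <schnappi><color>green</color><food><id>human</id></food></schnappi>
</data>"""

seen = []
for event, elem in ET.iterparse(io.BytesIO(SAMPLE)):  # default: 'end' events only
    if elem.tag == "schnappi":
        # Every child already got its own 'end' event, and was cleared,
        # before the parent's 'end' event arrives.
        seen = [(child.tag, child.text, len(child))
                for child in elem.iter() if child is not elem]
    elem.clear()  # clears *every* element as soon as it ends

print(seen)
```

By the time the <schnappi> 'end' event is delivered, its children have lost their text, and <food> has already dropped its <id> child.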

Finally, I thought about keeping track of when to clear and when not
to by subscribing to start and end elements (so that I would collect
the entire <schnappi>-subtree in memory and only then release it):

from cElementTree import iterparse
clear_flag = True
for event, elem in iterparse("data.xml", ("start", "end")):
    if event == "start" and elem.tag == "schnappi":
        # start collecting elements
        clear_flag = False
    if event == "end" and elem.tag == "schnappi":
        clear_flag = True
        # do something with elem
    # unless we are collecting elements, clear()
    if clear_flag:
        elem.clear()

This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

Thanks in advance,

ivr

D H · Sep 25, 2005

Igor said:
This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

Try emailing the author for support.

Reinhold Birkenfeld · Sep 25, 2005

D said:
Try emailing the author for support.

I don't think that's needed. He is one of the most active members
of c.l.py, and you should know that yourself.

Reinhold

D H · Sep 25, 2005

Reinhold said:
I don't think that's needed. He is one of the most active members
of c.l.py, and you should know that yourself.

I would recommend emailing the author of a library when you have a
question about that library. You should know that yourself as well.

Reinhold Birkenfeld · Sep 25, 2005

D said:
I would recommend emailing the author of a library when you have a
question about that library. You should know that yourself as well.

Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Reinhold

D H · Sep 25, 2005

Reinhold said:
Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Reinhold

> If I had wanted to say "you have opinions? **** off!", I would have said
>"you have opinions? **** off!".

Take your own advice asshole.

D H · Sep 25, 2005

D said:
Take your own advice asshole.

Reinhold Birkenfeld · Sep 25, 2005

D said:
Take your own advice asshole.

QED. Irony tags for sale.

Reinhold

Reinhold Birkenfeld · Sep 25, 2005

And what's that about?

Reinhold

D H · Sep 25, 2005

Reinhold said:
And what's that about?

I think it means you should **** off, asshole.

Reinhold Birkenfeld · Sep 25, 2005

D said:
I think it means you should **** off, asshole.

I think you've made that clear.

*plonk*

Reinhold

PS: I really wonder why you get upset when someone except you mentions boo.

Fredrik Lundh · Sep 25, 2005

Igor said:
Finally, I thought about keeping track of when to clear and when not
to by subscribing to start and end elements (so that I would collect
the entire <schnappi>-subtree in memory and only then release it):

from cElementTree import iterparse
clear_flag = True
for event, elem in iterparse("data.xml", ("start", "end")):
    if event == "start" and elem.tag == "schnappi":
        # start collecting elements
        clear_flag = False
    if event == "end" and elem.tag == "schnappi":
        clear_flag = True
        # do something with elem
    # unless we are collecting elements, clear()
    if clear_flag:
        elem.clear()

This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

the iterparse/clear approach works best if your XML file has a
record-like structure. if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in:

from cElementTree import iterparse
for event, elem in iterparse("data.xml"):
    if elem.tag == "record":
        # deal with schnappi subrecords
        for schnappi in elem.findall(".//schnappi"):
            process(schnappi)
        # release the whole record once its subrecords are handled
        elem.clear()
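
A self-contained variant of the record loop can be run without a file on disk. This is only a sketch under stated assumptions: it uses the standard-library xml.etree.ElementTree in place of cElementTree, the <record> wrapper and the sample document are invented for illustration, and schnappi_colors is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET  # stdlib successor to cElementTree (assumption)

# Hypothetical record-structured input; <record> is an assumed wrapper tag.
SAMPLE = b"""<data>
  <record><schnappi><color>green</color></schnappi></record>
  <record><schnappi><color>blue</color></schnappi><other/></record>
</data>"""

def schnappi_colors(source):
    """Collect the colour of every <schnappi>, one <record> at a time."""
    colors = []
    for event, elem in ET.iterparse(source):  # default: 'end' events only
        if elem.tag == "record":
            # The whole record subtree is complete at its 'end' event.
            for schnappi in elem.findall(".//schnappi"):
                colors.append(schnappi.findtext("color"))
            elem.clear()  # drop the record's children once processed
    return colors

colors = schnappi_colors(io.BytesIO(SAMPLE))
print(colors)
```

Only the record currently being parsed holds its full subtree; earlier records stay attached to the root as emptied elements.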

the collect flag approach isn't that bad ("twice as slow" doesn't
really say much: "raw" cElementTree is extremely fast compared
to the Python interpreter, so everything you end up doing in
Python will slow things down quite a bit).

to make your application code look a bit less convoluted, put the
logic in a generator function:

# in library
from cElementTree import iterparse

def process(filename, annoying_animal):
    clear = True
    start = "start"; end = "end"
    for event, elem in iterparse(filename, (start, end)):
        if elem.tag == annoying_animal:
            if event is start:
                clear = False
            else:
                yield elem
                clear = True
        if clear:
            elem.clear()

# in application
for subelem in process(filename, "schnappi"):
    pass # do something with subelem

(I've reorganized the code a bit to cut down on the operations.
also note the "is" trick; iterparse returns the event strings you
pass in, so comparing on object identities is safe)
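
For experimenting with the generator approach, here is a runnable sketch under a few assumptions: it uses the standard-library xml.etree.ElementTree, the name iter_subtrees and the sample data are made up, events are compared with == rather than relying on the identity trick, and elements are only cleared on 'end' events. It also assumes the wanted tag is never nested inside itself.

```python
import io
import xml.etree.ElementTree as ET  # stdlib successor to cElementTree (assumption)

# Hypothetical sample with one uninteresting element and two <schnappi> records.
SAMPLE = b"""<data>
  <noise><x>1</x></noise>
  <schnappi><color>green</color><food><id>human</id><id>rabbit</id></food></schnappi>
  <schnappi><color>blue</color></schnappi>
</data>"""

def iter_subtrees(source, tag):
    """Yield each complete <tag> element; clear everything outside of one."""
    collecting = False
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == tag:
            if event == "start":
                collecting = True   # keep everything until the matching 'end'
                continue
            yield elem              # the caller sees the intact subtree here
            collecting = False      # resumed: the caller is done with it
        if event == "end" and not collecting:
            elem.clear()

summary = []
for subelem in iter_subtrees(io.BytesIO(SAMPLE), "schnappi"):
    summary.append((subelem.findtext("color"),
                    [e.text for e in subelem.findall("food/id")]))
print(summary)
```

Each yielded element is cleared as soon as the loop asks for the next one, so the caller should copy out whatever it needs instead of keeping a reference to the element.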

an alternative is to use the lower-level XMLParser class (which
is similar to SAX, but faster), but that will most likely result in
more and trickier Python code...

</F>

D H · Sep 25, 2005

Reinhold said:
I think you've made that clear.

*plonk*

Reinhold

PS: I really wonder why you get upset when someone except you mentions boo.

You're the only one making any association between this thread about
cElementTree and boo. So again I'll say, take your own advice and **** off.

Fredrik Lundh · Sep 25, 2005

Doug said:
You're the only one making any association between this thread about
cElementTree and boo.

really? judging from the Original-From header in your posts, your internet
provider is sure making the same association...

</F>

Igor V. Rafienko · Sep 25, 2005

[ Fredrik Lundh ]

[ ... ]

the iterparse/clear approach works best if your XML file has a
record-like structure. if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in: (...)

The problem is that the file looks like this:

<data>
  <schnappi>
    <color>green</color>
    <friends>
      <friend>
        <id>Lama</id>
        <color>white</color>
      </friend>
      <friend>
        <id>mother schnappi</id>
        <color>green</color>
      </friend>
    </friends>
    <food>
      <id>human</id>
      <id>rabbit</id>
    </food>
  </schnappi>
  <schnappi>

  </schnappi>

</data>

... and there is really nothing above <schnappi>. The "something
interesting" part consists of a variety of elements, and calling
findall for each of them, although possible, would probably be
impractical (say, distinguishing <friend>'s colors from <schnappi>'s).

Conceptually I need an "XML subtree iterator", rather than an XML
element iterator. <schnappi>-elements are the ones having a complex
internal structure, and I'd like to be able to speak of my XML as a
sequence of Python objects representing <schnappi>s and their internal
structure.
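
One possible reading of such a subtree iterator is sketched here, assuming the standard-library xml.etree.ElementTree and the sample layout shown above. The names iter_schnappis and schnappi_to_dict are invented for illustration. Path expressions such as "color" and "friends/friend" are resolved relative to the <schnappi> element, so a friend's color is not confused with the schnappi's own.

```python
import io
import xml.etree.ElementTree as ET  # stdlib successor to cElementTree (assumption)

# Sample modelled on the file layout described in this post.
SAMPLE = b"""<data>
  <schnappi>
    <color>green</color>
    <friends>
      <friend><id>Lama</id><color>white</color></friend>
      <friend><id>mother schnappi</id><color>green</color></friend>
    </friends>
    <food><id>human</id><id>rabbit</id></food>
  </schnappi>
</data>"""

def schnappi_to_dict(elem):
    """Turn one complete <schnappi> subtree into plain Python data."""
    return {
        # "color" matches direct children only, not the friends' colors
        "color": elem.findtext("color"),
        "friends": [
            {"id": f.findtext("id"), "color": f.findtext("color")}
            for f in elem.findall("friends/friend")
        ],
        "food": [e.text for e in elem.findall("food/id")],
    }

def iter_schnappis(source):
    """Yield one dict per <schnappi>, clearing each subtree after use."""
    for event, elem in ET.iterparse(source):  # default: 'end' events only
        if elem.tag == "schnappi":
            yield schnappi_to_dict(elem)
            elem.clear()

records = list(iter_schnappis(io.BytesIO(SAMPLE)))
print(records)
```

Because only <schnappi> elements are cleared, each subtree is still intact when its 'end' event arrives.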

[ ... ]

(I've reorganized the code a bit to cut down on the operations. also
note the "is" trick; iterparse returns the event strings you pass
in, so comparing on object identities is safe)

Neat trick.

Thank you for your input,

ivr

D H · Sep 25, 2005

Fredrik said:
Doug Holton wrote:

really? judging from the Original-From header in your posts, your internet
provider is sure making the same association...

You seriously need some help.

Paul Boddie · Sep 25, 2005

Reinhold said:
Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Regardless of anyone's alleged connection with Boo or newsgroup
participation level, the advice to contact the package
author/maintainer is sound. It happens every now and again that people
post questions to comp.lang.python about fairly specific issues or
packages that would be best sent to mailing lists or other resources
devoted to such topics. It's far better to get a high quality opinion
from a small group of people than a lower quality opinion from a larger
group or a delayed response from the maintainer because he/she doesn't
happen to be spending time sifting through flame wars amidst large
volumes of relatively uninteresting/irrelevant messages.

Paul

Fredrik Lundh · Sep 25, 2005

Igor said:
The problem is that the file looks like this:

<data>

... lots of schnappi records ...

okay. I think your first approach

from cElementTree import iterparse

count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
        elem.clear()

is the right one for this case. with this code, the clear call will
destroy each schnappi record when you're done with it, so you
will release all memory allocated for the schnappi elements.

however, you will end up with a single toplevel element that
contains a large number of empty subelements. this is usually
no problem (it'll use a couple of megabytes), but you can get
rid of the dead schnappis too, if you want to. see the example
that starts with "context = iterparse" on this page

http://effbot.org/zone/element-iterparse.htm

for more information.

</F>

Reinhold Birkenfeld · Sep 25, 2005

Paul said:
Regardless of anyone's alleged connection with Boo or newsgroup
participation level

the advice to contact the package author/maintainer is sound.

Correct. But if the post is already in the newsgroup and the author is known
to write there extensively, it sounds ridiculous to say "contact the author".

It happens every now and again that people
post questions to comp.lang.python about fairly specific issues or
packages that would be best sent to mailing lists or other resources
devoted to such topics. It's far better to get a high quality opinion
from a small group of people than a lower quality opinion from a larger
group or a delayed response from the maintainer because he/she doesn't
happen to be spending time sifting through flame wars amidst large
volumes of relatively uninteresting/irrelevant messages.

Hey, the flame war stopped before it got interesting

Reinhold

Grant Edwards · Sep 25, 2005

I would recommend emailing the author of a library when you
have a question about that library. You should know that
yourself as well.

Why??

For the things I "support", I much prefer answering questions
in a public forum. That way the knowledge is available to
everybody, and it reduces the number of e-mailed duplicate
questions. Most of the gurus I know (not that I'm attempting
to place myself in that category) feel the same way. ESR
explained it well.

Quoting from http://www.catb.org/~esr/faqs/smart-questions.html#forum

You are likely to be ignored, or written off as a loser, if
you:

[...]

* post a personal email to somebody who is neither an
acquaintance of yours nor personally responsible for
solving your problem

[...]

In general, questions to a well-selected public forum are
more likely to get useful answers than equivalent questions
to a private one. There are multiple reasons for this. One
is simply the size of the pool of potential respondents.
Another is the size of the audience; hackers would rather
answer questions that educate a lot of people than questions
which only serve a few.
