cElementTree clear semantics

Igor V. Rafienko

Hi,


I am trying to understand how cElementTree's clear works: I have a
(relatively) large XML file, that I do not wish to load into memory.
So, naturally, I tried something like this:

from cElementTree import iterparse

count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
        elem.clear()

... which resulted in caching of all elements in memory except for
those named <schnappi> (i.e. the process' memory footprint grew more
and more). Then I thought about clear()'ing all elements that I did not
really need:

from cElementTree import iterparse

count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
    elem.clear()

... which gave a suitably small memory footprint, *BUT* since
<schnappi> has a number of subelements, and I subscribe to
'end'-events, the <schnappi> element is returned after all of its
subelements have been read and clear()'ed. So I do see a
<schnappi> element, but calling its getiterator() gives me completely
empty subelements, which is not what I wanted :(
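
The effect described above can be reproduced with a few lines. This is only a sketch: it assumes the modern xml.etree.ElementTree module (the successor of cElementTree) and a tiny made-up sample document, and the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample data; the real file is assumed to be much larger.
SAMPLE = b"<data><schnappi><color>green</color></schnappi></data>"

def colors_seen_when_clearing_everything(xml_bytes):
    """Clear every element on its 'end' event and record what each
    <schnappi> still holds when its own 'end' event arrives."""
    seen = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        if elem.tag == "schnappi":
            # The <color> child ended (and was cleared) before its parent.
            seen.append([child.text for child in elem.iter("color")])
        elem.clear()
    return seen

result = colors_seen_when_clearing_everything(SAMPLE)
```

The <color> child is still attached to <schnappi>, but its text is gone, because its own 'end' event was handled first.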

Finally, I thought about keeping track of when to clear and when not
to by subscribing to start and end elements (so that I would collect
the entire <schnappi>-subtree in memory and only then release it):

from cElementTree import iterparse

clear_flag = True
for event, elem in iterparse("data.xml", ("start", "end")):
    if event == "start" and elem.tag == "schnappi":
        # start collecting elements
        clear_flag = False
    if event == "end" and elem.tag == "schnappi":
        clear_flag = True
        # do something with elem
    # unless we are collecting elements, clear()
    if clear_flag:
        elem.clear()

This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

Thanks in advance,





ivr
 
D H

Igor said:
This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

Try emailing the author for support.
 
Reinhold Birkenfeld

D said:
Try emailing the author for support.

I don't think that's needed. He is one of the most active members
of c.l.py, and you should know that yourself.

Reinhold
 
D H

Reinhold said:
I don't think that's needed. He is one of the most active members
of c.l.py, and you should know that yourself.

I would recommend emailing the author of a library when you have a
question about that library. You should know that yourself as well.
 
Reinhold Birkenfeld

D said:
I would recommend emailing the author of a library when you have a
question about that library. You should know that yourself as well.

Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Reinhold
 
D H

Reinhold said:
Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Reinhold
> If I had wanted to say "you have opinions? **** off!", I would have said
>"you have opinions? **** off!".


Take your own advice asshole.
 
Reinhold Birkenfeld

D said:
I think it means you should **** off, asshole.

I think you've made that clear.

*plonk*

Reinhold

PS: I really wonder why you get upset when someone except you mentions boo.
 
Fredrik Lundh

Igor said:
Finally, I thought about keeping track of when to clear and when not
to by subscribing to start and end elements (so that I would collect
the entire <schnappi>-subtree in memory and only then release it):

from cElementTree import iterparse

clear_flag = True
for event, elem in iterparse("data.xml", ("start", "end")):
    if event == "start" and elem.tag == "schnappi":
        # start collecting elements
        clear_flag = False
    if event == "end" and elem.tag == "schnappi":
        clear_flag = True
        # do something with elem
    # unless we are collecting elements, clear()
    if clear_flag:
        elem.clear()

This gave me the desired behaviour, but:

* It looks *very* ugly
* It's twice as slow as the version which sees 'end'-events only.

Now, there *has* to be a better way. What am I missing?

the iterparse/clear approach works best if your XML file has a
record-like structure. if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in:

for event, elem in iterparse("data.xml"):
    if elem.tag == "record":
        # deal with schnappi subrecords
        for schnappi in elem.findall(".//schnappi"):
            process(schnappi)
        elem.clear()

the collect flag approach isn't that bad ("twice as slow" doesn't
really say much: "raw" cElementTree is extremely fast compared
to the Python interpreter, so everything you end up doing in
Python will slow things down quite a bit).

to make your application code look a bit less convoluted, put the
logic in a generator function:

# in library
def process(filename, annoying_animal):
    clear = True
    start = "start"; end = "end"
    for event, elem in iterparse(filename, (start, end)):
        if elem.tag == annoying_animal:
            if event is start:
                clear = False
            else:
                yield elem
                clear = True
        if clear:
            elem.clear()

# in application
for subelem in process(filename, "schnappi"):
    # do something with subelem

(I've reorganized the code a bit to cut down on the operations.
also note the "is" trick; iterparse returns the event strings you
pass in, so comparing on object identities is safe)
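
For readers who want to try the generator approach on current Python, here is a self-contained sketch. It assumes xml.etree.ElementTree in place of cElementTree, uses a made-up in-memory sample, and compares events with == rather than relying on the identity trick; the name iter_subtrees is hypothetical. Like the code above, it does not handle a tag nested inside itself.

```python
import io
import xml.etree.ElementTree as ET

def iter_subtrees(source, tag):
    """Yield each complete <tag> subtree, clearing everything else."""
    collecting = False
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == tag:
            if event == "start":
                collecting = True      # keep children until the subtree ends
            else:
                yield elem             # caller sees the fully populated subtree
                collecting = False
        if not collecting:
            elem.clear()               # release whatever is no longer needed

# Made-up sample data for illustration only.
SAMPLE = b"""<data>
  <schnappi><color>green</color><food><id>human</id><id>rabbit</id></food></schnappi>
  <schnappi><color>blue</color></schnappi>
</data>"""

# Each subtree is consumed before the generator resumes and clears it.
summaries = [
    (s.findtext("color"), [i.text for i in s.iterfind("food/id")])
    for s in iter_subtrees(io.BytesIO(SAMPLE), "schnappi")
]
```

Anything needed from a subtree has to be extracted before asking for the next one, since the element is cleared as soon as the generator resumes.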

an alternative is to use the lower-level XMLParser class (which
is similar to SAX, but faster), but that will most likely result in
more and trickier Python code...
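
To give a rough idea of what the lower-level route looks like, here is a sketch that assumes the XMLParser target interface of the modern xml.etree.ElementTree module (the cElementTree API of the time may have differed). TagCounter is a hypothetical class that only counts elements and never builds a tree.

```python
import xml.etree.ElementTree as ET

class TagCounter:
    """Parser target that counts elements with a given tag.

    No tree is built, so memory use stays flat, but any state you
    care about must be tracked by hand in these callbacks.
    """

    def __init__(self, tag):
        self.tag = tag
        self.count = 0

    def start(self, tag, attrib):
        if tag == self.tag:
            self.count += 1

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        # The parser's close() returns whatever the target's close() returns.
        return self.count

parser = ET.XMLParser(target=TagCounter("schnappi"))
parser.feed(b"<data><schnappi><color>green</color></schnappi><schnappi/></data>")
count = parser.close()
```

Collecting a whole subtree this way means tracking nesting depth and text chunks yourself, which is where the extra code comes from.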

</F>
 
D H

Reinhold said:
I think you've made that clear.

*plonk*

Reinhold

PS: I really wonder why you get upset when someone except you mentions boo.

You're the only one making any association between this thread about
cElementTree and boo. So again I'll say, take your own advice and **** off.
 
Fredrik Lundh

Doug said:
You're the only one making any association between this thread about
cElementTree and boo.

really? judging from the Original-From header in your posts, your internet
provider is sure making the same association...

</F>
 
Igor V. Rafienko

[ Fredrik Lundh ]

[ ... ]
the iterparse/clear approach works best if your XML file has a
record-like structure. if you have toplevel records with lots of
schnappi records in them, iterate over the records and use find
(etc) to locate the subrecords you're interested in: (...)


The problem is that the file looks like this:

<data>
  <schnappi>
    <color>green</color>
    <friends>
      <friend>
        <id>Lama</id>
        <color>white</color>
      </friend>
      <friend>
        <id>mother schnappi</id>
        <color>green</color>
      </friend>
    </friends>
    <food>
      <id>human</id>
      <id>rabbit</id>
    </food>
  </schnappi>
  <schnappi>
    <!-- something interesting -->
  </schnappi>
  <!-- 60,000 more schnappis -->
</data>

... and there is really nothing above <schnappi>. The "something
interesting" part consists of a variety of elements, and calling
findall for each of them, although possible, would probably be
impractical (say, distinguishing <friend>'s colors from <schnappi>'s).

Conceptually I need an "XML subtree iterator", rather than an XML
element iterator. <schnappi>-elements are the ones having a complex
internal structure, and I'd like to be able to speak of my XML as a
sequence of Python objects representing <schnappi>s and their internal
structure.

[ ... ]

(I've reorganized the code a bit to cut down on the operations. also
note the "is" trick; iterparse returns the event strings you pass
in, so comparing on object identities is safe)


Neat trick.

Thank you for your input,





ivr
 
D H

Fredrik said:
Doug Holton wrote:




really? judging from the Original-From header in your posts, your internet
provider is sure making the same association...

You seriously need some help.
 
Paul Boddie

Reinhold said:
Well, if I had e.g. a question about Boo, I would of course first ask
here because I know the expert writes here.

Regardless of anyone's alleged connection with Boo or newsgroup
participation level, the advice to contact the package
author/maintainer is sound. It happens every now and again that people
post questions to comp.lang.python about fairly specific issues or
packages that would be best sent to mailing lists or other resources
devoted to such topics. It's far better to get a high quality opinion
from a small group of people than a lower quality opinion from a larger
group or a delayed response from the maintainer because he/she doesn't
happen to be spending time sifting through flame wars amidst large
volumes of relatively uninteresting/irrelevant messages.

Paul
 
Fredrik Lundh

Igor said:
The problem is that the file looks like this:

<data>
... lots of schnappi records ...

okay. I think your first approach

from cElementTree import iterparse

count = 0
for event, elem in iterparse("data.xml"):
    if elem.tag == "schnappi":
        count += 1
        elem.clear()

is the right one for this case. with this code, the clear call will
destroy each schnappi record when you're done with it, so you
will release all memory allocated for the schnappi elements.

however, you will end up with a single toplevel element that
contains a large number of empty subelements. this is usually
no problem (it'll use a couple of megabytes), but you can get
rid of the dead schnappis too, if you want to. see the example
that starts with "context = iterparse" on this page

http://effbot.org/zone/element-iterparse.htm

for more information.
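
A sketch of one common way to drop the dead schnappis as well: grab the root element from the first "start" event and clear it after each record. This is written from memory against the modern xml.etree.ElementTree module, not copied from the page above; count_and_prune is a hypothetical name and the sample is made up. It assumes the records are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET

def count_and_prune(source, tag):
    """Count <tag> records while keeping the root element empty."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    event, root = next(context)        # first event: start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1                 # ... process the complete record here ...
            root.clear()               # drop the record and its empty shell
    return count, len(root)

# Made-up sample data for illustration only.
SAMPLE = b"<data><schnappi><color>green</color></schnappi><schnappi/><schnappi/></data>"
count, remaining_children = count_and_prune(io.BytesIO(SAMPLE), "schnappi")
```

Clearing the root also discards the root's own attributes, so read those first if they matter.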

</F>
 
Reinhold Birkenfeld

Paul said:
Regardless of anyone's alleged connection with Boo or newsgroup
participation level

the advice to contact the package author/maintainer is sound.

Correct. But if the post is already in the newsgroup and the author is known
to write there extensively, it sounds ridiculous to say "contact the author".

Paul said:
It happens every now and again that people
post questions to comp.lang.python about fairly specific issues or
packages that would be best sent to mailing lists or other resources
devoted to such topics. It's far better to get a high quality opinion
from a small group of people than a lower quality opinion from a larger
group or a delayed response from the maintainer because he/she doesn't
happen to be spending time sifting through flame wars amidst large
volumes of relatively uninteresting/irrelevant messages.

Hey, the flame war stopped before it got interesting ;)

Reinhold
 
Grant Edwards

I would recommend emailing the author of a library when you
have a question about that library. You should know that
yourself as well.

Why??

For the things I "support", I much prefer answering questions
in a public forum. That way the knowledge is available to
everybody, and it reduces the number of e-mailed duplicate
questions. Most of the gurus I know (not that I'm attempting
to place myself in that category) feel the same way. ESR
explained it well.

Quoting from http://www.catb.org/~esr/faqs/smart-questions.html#forum

You are likely to be ignored, or written off as a loser, if
you:

[...]

* post a personal email to somebody who is neither an
acquaintance of yours nor personally responsible for
solving your problem

[...]

In general, questions to a well-selected public forum are
more likely to get useful answers than equivalent questions
to a private one. There are multiple reasons for this. One
is simply the size of the pool of potential respondents.
Another is the size of the audience; hackers would rather
answer questions that educate a lot of people than questions
which only serve a few.
 
