cElementTree clear semantics

Discussion in 'Python' started by Igor V. Rafienko, Sep 25, 2005.

  1. Hi,


    I am trying to understand how cElementTree's clear works: I have a
    (relatively) large XML file, that I do not wish to load into memory.
    So, naturally, I tried something like this:

    from cElementTree import iterparse
    count = 0
    for event, elem in iterparse("data.xml"):
        if elem.tag == "schnappi":
            count += 1
            elem.clear()  # only <schnappi> elements are cleared

    ... which resulted in caching of all elements in memory except for
    those named <schnappi> (i.e. the process' memory footprint grew more
    and more). Then I thought about clear()'ing all elements that I did not
    really need:

    from cElementTree import iterparse
    count = 0
    for event, elem in iterparse("data.xml"):
        if elem.tag == "schnappi":
            count += 1
        elem.clear()  # every element is cleared, not just <schnappi>

    ... which gave a suitably small memory footprint, *BUT* since
    <schnappi> has a number of subelements, and I subscribe to
    'end'-events, the <schnappi> element is returned after all of its
    subelements have been read and clear()'ed. So, I see indeed a
    <schnappi> element, but calling its getiterator() gives me completely
    empty subelements, which is not what I wanted :(

    Finally, I thought about keeping track of when to clear and when not
    to by subscribing to start and end elements (so that I would collect
    the entire <schnappi>-subtree in memory and only then release it):

    from cElementTree import iterparse
    clear_flag = True
    for event, elem in iterparse("data.xml", ("start", "end")):
        if event == "start" and elem.tag == "schnappi":
            # start collecting elements
            clear_flag = False
        if event == "end" and elem.tag == "schnappi":
            clear_flag = True
            # do something with elem
        # unless we are collecting elements, clear()
        if clear_flag:
            elem.clear()

    This gave me the desired behaviour, but:

    * It looks *very* ugly
    * It's twice as slow as version which sees 'end'-events only.

    Now, there *has* to be a better way. What am I missing?

    Thanks in advance,





    ivr
    --
    "...but it's HDTV -- it's got a better resolution than the real world."
    -- Fry, "When aliens attack"
    Igor V. Rafienko, Sep 25, 2005
    #1

  2. Igor V. Rafienko

    D H Guest

    Igor V. Rafienko wrote:
    > This gave me the desired behaviour, but:
    >
    > * It looks *very* ugly
    > * It's twice as slow as version which sees 'end'-events only.
    >
    > Now, there *has* to be a better way. What am I missing?
    >


    Try emailing the author for support.
    D H, Sep 25, 2005
    #2

  3. D H wrote:
    > Igor V. Rafienko wrote:
    >> This gave me the desired behaviour, but:
    >>
    >> * It looks *very* ugly
    >> * It's twice as slow as version which sees 'end'-events only.
    >>
    >> Now, there *has* to be a better way. What am I missing?
    >>

    >
    > Try emailing the author for support.


    I don't think that's needed. He is one of the most active members
    of c.l.py, and you should know that yourself.

    Reinhold
    Reinhold Birkenfeld, Sep 25, 2005
    #3
  4. Igor V. Rafienko

    D H Guest

    Reinhold Birkenfeld wrote:
    > D H wrote:
    >
    >>Igor V. Rafienko wrote:
    >>
    >>>This gave me the desired behaviour, but:
    >>>
    >>>* It looks *very* ugly
    >>>* It's twice as slow as version which sees 'end'-events only.
    >>>
    >>>Now, there *has* to be a better way. What am I missing?
    >>>

    >>
    >>Try emailing the author for support.

    >
    >
    > I don't think that's needed. He is one of the most active members
    > of c.l.py, and you should know that yourself.
    >


    I would recommend emailing the author of a library when you have a
    question about that library. You should know that yourself as well.
    D H, Sep 25, 2005
    #4
  5. D H wrote:
    > Reinhold Birkenfeld wrote:
    >> D H wrote:
    >>
    >>>Igor V. Rafienko wrote:
    >>>
    >>>>This gave me the desired behaviour, but:
    >>>>
    >>>>* It looks *very* ugly
    >>>>* It's twice as slow as version which sees 'end'-events only.
    >>>>
    >>>>Now, there *has* to be a better way. What am I missing?
    >>>>
    >>>
    >>>Try emailing the author for support.

    >>
    >>
    >> I don't think that's needed. He is one of the most active members
    >> of c.l.py, and you should know that yourself.
    >>

    >
    > I would recommend emailing the author of a library when you have a
    > question about that library. You should know that yourself as well.


    Well, if I had e.g. a question about Boo, I would of course first ask
    here because I know the expert writes here.

    Reinhold
    Reinhold Birkenfeld, Sep 25, 2005
    #5
  6. Igor V. Rafienko

    D H Guest

    Reinhold Birkenfeld wrote:

    >
    > Well, if I had e.g. a question about Boo, I would of course first ask
    > here because I know the expert writes here.
    >
    > Reinhold


    Reinhold Birkenfeld also wrote:
    > If I had wanted to say "you have opinions? **** off!", I would have said
    >"you have opinions? **** off!".



    Take your own advice asshole.
    D H, Sep 25, 2005
    #6
  7. Igor V. Rafienko

    D H Guest

    Reinhold Birkenfeld [was "Re: cElementTree clear semantics"]

    D H wrote:
    > Reinhold Birkenfeld wrote:
    >
    >>
    >> Well, if I had e.g. a question about Boo, I would of course first ask
    >> here because I know the expert writes here.
    >>
    >> Reinhold

    >
    >
    > Reinhold Birkenfeld also wrote:
    > > If I had wanted to say "you have opinions? **** off!", I would have said
    > >"you have opinions? **** off!".

    >
    >
    > Take your own advice asshole.
    D H, Sep 25, 2005
    #7
  8. D H wrote:
    > Reinhold Birkenfeld wrote:
    >
    >>
    >> Well, if I had e.g. a question about Boo, I would of course first ask
    >> here because I know the expert writes here.
    >>
    >> Reinhold

    >
    > Reinhold Birkenfeld also wrote:
    > > If I had wanted to say "you have opinions? **** off!", I would have said
    > >"you have opinions? **** off!".

    >
    >
    > Take your own advice asshole.


    QED. Irony tags for sale.

    Reinhold
    Reinhold Birkenfeld, Sep 25, 2005
    #8
  9. Re: "Re: cElementTree clear semantics"

    D H wrote:
    > D H wrote:
    >> Reinhold Birkenfeld wrote:
    >>
    >>>
    >>> Well, if I had e.g. a question about Boo, I would of course first ask
    >>> here because I know the expert writes here.
    >>>
    >>> Reinhold

    >>
    >>
    >> Reinhold Birkenfeld also wrote:
    >> > If I had wanted to say "you have opinions? **** off!", I would have said
    >> >"you have opinions? **** off!".

    >>
    >>
    >> Take your own advice asshole.


    And what's that about?

    Reinhold
    Reinhold Birkenfeld, Sep 25, 2005
    #9
  10. Igor V. Rafienko

    D H Guest

    Reinhold Birkenfeld [Re: "Re: cElementTree clear semantics"]

    Reinhold Birkenfeld wrote:
    > D H wrote:
    >
    >>D H wrote:
    >>
    >>>Reinhold Birkenfeld wrote:
    >>>
    >>>
    >>>>Well, if I had e.g. a question about Boo, I would of course first ask
    >>>>here because I know the expert writes here.
    >>>>
    >>>>Reinhold
    >>>
    >>>
    >>>Reinhold Birkenfeld also wrote:
    >>> > If I had wanted to say "you have opinions? **** off!", I would have said
    >>> >"you have opinions? **** off!".
    >>>
    >>>
    >>>Take your own advice asshole.

    >
    >
    > And what's that about?


    I think it means you should **** off, asshole.
    D H, Sep 25, 2005
    #10
  11. Re: "Re: cElementTree clear semantics"

    D H wrote:
    > Reinhold Birkenfeld wrote:
    >> D H wrote:
    >>
    >>>D H wrote:
    >>>
    >>>>Reinhold Birkenfeld wrote:
    >>>>
    >>>>
    >>>>>Well, if I had e.g. a question about Boo, I would of course first ask
    >>>>>here because I know the expert writes here.
    >>>>>
    >>>>>Reinhold
    >>>>
    >>>>
    >>>>Reinhold Birkenfeld also wrote:
    >>>> > If I had wanted to say "you have opinions? **** off!", I would have said
    >>>> >"you have opinions? **** off!".
    >>>>
    >>>>
    >>>>Take your own advice asshole.

    >>
    >>
    >> And what's that about?

    >
    > I think it means you should **** off, asshole.


    I think you've made that clear.

    *plonk*

    Reinhold

    PS: I really wonder why you get upset when someone except you mentions boo.
    Reinhold Birkenfeld, Sep 25, 2005
    #11
  12. Igor V. Rafienko wrote:

    > Finally, I thought about keeping track of when to clear and when not
    > to by subscribing to start and end elements (so that I would collect
    > the entire <schnappi>-subtree in memory and only then release it):
    >
    > from cElementTree import iterparse
    > clear_flag = True
    > for event, elem in iterparse("data.xml", ("start", "end")):
    >     if event == "start" and elem.tag == "schnappi":
    >         # start collecting elements
    >         clear_flag = False
    >     if event == "end" and elem.tag == "schnappi":
    >         clear_flag = True
    >         # do something with elem
    >     # unless we are collecting elements, clear()
    >     if clear_flag:
    >         elem.clear()
    >
    > This gave me the desired behaviour, but:
    >
    > * It looks *very* ugly
    > * It's twice as slow as version which sees 'end'-events only.
    >
    > Now, there *has* to be a better way. What am I missing?


    the iterparse/clear approach works best if your XML file has a
    record-like structure. if you have toplevel records with lots of
    schnappi records in them, iterate over the records and use find
    (etc) to locate the subrecords you're interested in:

    for event, elem in iterparse("data.xml"):
        if elem.tag == "record":
            # deal with schnappi subrecords
            for schnappi in elem.findall(".//schnappi"):
                process(schnappi)
            elem.clear()

    the collect flag approach isn't that bad ("twice as slow" doesn't
    really say much: "raw" cElementTree is extremely fast compared
    to the Python interpreter, so everything you end up doing in
    Python will slow things down quite a bit).

    to make your application code look a bit less convoluted, put the
    logic in a generator function:

    # in library
    from cElementTree import iterparse

    def process(filename, annoying_animal):
        clear = True
        start = "start"; end = "end"
        for event, elem in iterparse(filename, (start, end)):
            if elem.tag == annoying_animal:
                if event is start:
                    # entering the subtree: stop clearing
                    clear = False
                else:
                    # subtree is complete: hand it to the caller
                    yield elem
                    clear = True
            if clear:
                elem.clear()

    # in application
    for subelem in process(filename, "schnappi"):
        pass  # do something with subelem

    (I've reorganized the code a bit to cut down on the operations.
    also note the "is" trick; iterparse returns the event strings you
    pass in, so comparing on object identities is safe)
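
    For readers on current Python, where the separate cElementTree module
    is no longer available, the same flag-based generator can be sketched
    against the standard library's xml.etree.ElementTree. This is only an
    illustration under assumptions: the name iter_subtrees and the sample
    document are made up for the example, plain == is used instead of the
    "is" trick, and nested elements with the same tag are not handled.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data, shaped like a file of <schnappi> records.
XML = b"""<data>
  <schnappi><color>green</color><food><id>human</id><id>rabbit</id></food></schnappi>
  <schnappi><color>blue</color></schnappi>
  <other><color>red</color></other>
</data>"""


def iter_subtrees(source, tag):
    """Yield each complete <tag> element, clearing everything else.

    Assumes <tag> elements are not nested inside each other.
    """
    collecting = False
    for event, elem in iterparse(source, events=("start", "end")):
        if elem.tag == tag:
            if event == "start":
                # Entering a subtree we want to keep intact.
                collecting = True
                continue
            # "end" event: the subtree is complete, hand it to the caller.
            yield elem
            collecting = False
        if not collecting:
            # Outside a collected subtree (or just done with one): free it.
            elem.clear()


colors, foods = [], []
for sub in iter_subtrees(io.BytesIO(XML), "schnappi"):
    # The subtree is only valid until the next iteration clears it.
    colors.append(sub.findtext("color"))
    foods.extend(e.text for e in sub.iter("id"))
```

    Note that the yielded element is cleared as soon as the loop asks for
    the next one, so anything you need from it has to be copied out inside
    the loop body.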

    an alternative is to use the lower-level XMLParser class (which
    is similar to SAX, but faster), but that will most likely result in
    more and trickier Python code...

    </F>
    Fredrik Lundh, Sep 25, 2005
    #12
  13. Igor V. Rafienko

    D H Guest

    Reinhold Birkenfeld [Re: "Re: cElementTree clear semantics"]

    Reinhold Birkenfeld wrote:
    > D H wrote:
    >
    >>Reinhold Birkenfeld wrote:
    >>
    >>>D H wrote:
    >>>
    >>>
    >>>>D H wrote:
    >>>>
    >>>>
    >>>>>Reinhold Birkenfeld wrote:
    >>>>>
    >>>>>
    >>>>>
    >>>>>>Well, if I had e.g. a question about Boo, I would of course first ask
    >>>>>>here because I know the expert writes here.
    >>>>>>
    >>>>>>Reinhold
    >>>>>
    >>>>>
    >>>>>Reinhold Birkenfeld also wrote:
    >>>>>
    >>>>>>If I had wanted to say "you have opinions? **** off!", I would have said
    >>>>>>"you have opinions? **** off!".
    >>>>>
    >>>>>
    >>>>>Take your own advice asshole.
    >>>
    >>>
    >>>And what's that about?

    >>
    >>I think it means you should **** off, asshole.

    >
    >
    > I think you've made that clear.
    >
    > *plonk*
    >
    > Reinhold
    >
    > PS: I really wonder why you get upset when someone except you mentions boo.


    You're the only one making any association between this thread about
    celementree and boo. So again I'll say, take your own advice and **** off.
    D H, Sep 25, 2005
    #13
  14. Re: Reinhold Birkenfeld [Re: "Re: cElementTree clear semantics"]

    Doug Holton wrote:

    > You're the only one making any association between this thread about
    > celementree and boo.


    really? judging from the Original-From header in your posts, your internet
    provider is sure making the same association...

    </F>
    Fredrik Lundh, Sep 25, 2005
    #14
  15. [ Fredrik Lundh ]

    [ ... ]

    > the iterparse/clear approach works best if your XML file has a
    > record-like structure. if you have toplevel records with lots of
    > schnappi records in them, iterate over the records and use find
    > (etc) to locate the subrecords you're interested in: (...)



    The problem is that the file looks like this:

    <data>
      <schnappi>
        <color>green</color>
        <friends>
          <friend>
            <id>Lama</id>
            <color>white</color>
          </friend>
          <friend>
            <id>mother schnappi</id>
            <color>green</color>
          </friend>
        </friends>
        <food>
          <id>human</id>
          <id>rabbit</id>
        </food>
      </schnappi>
      <schnappi>
        <!-- something interesting -->
      </schnappi>
      <!-- 60,000 more schnappis -->
    </data>

    ... and there is really nothing above <schnappi>. The "something
    interesting" part consists of a variety of elements, and calling
    findall for each of them, although possible, would probably be
    impractical (say, distinguishing <friend>'s colors from <schnappi>'s).

    Conceptually I need an "XML subtree iterator", rather than an XML
    element iterator. <schnappi>-elements are the ones having a complex
    internal structure, and I'd like to be able to speak of my XML as a
    sequence of Python objects representing <schnappi>s and their internal
    structure.

    [ ... ]


    > (I've reorganized the code a bit to cut down on the operations. also
    > note the "is" trick; iterparse returns the event strings you pass
    > in, so comparing on object identities is safe)



    Neat trick.

    Thank you for your input,





    ivr
    --
    "...but it's HDTV -- it's got a better resolution than the real world."
    -- Fry, "When aliens attack"
    Igor V. Rafienko, Sep 25, 2005
    #15
  16. Igor V. Rafienko

    D H Guest

    Fredrik Lundh [Re: Reinhold Birkenfeld [Re: "Re: cElementTree clear semantics"]]

    Fredrik Lundh wrote:
    > Doug Holton wrote:
    >
    >
    >>You're the only one making any association between this thread about
    >>celementree and boo.

    >
    >
    > really? judging from the Original-From header in your posts, your internet
    > provider is sure making the same association...


    You seriously need some help.
    D H, Sep 25, 2005
    #16
  17. Igor V. Rafienko

    Paul Boddie Guest

    Reinhold Birkenfeld wrote:
    > D H wrote:
    > > I would recommend emailing the author of a library when you have a
    > > question about that library. You should know that yourself as well.

    >
    > Well, if I had e.g. a question about Boo, I would of course first ask
    > here because I know the expert writes here.


    Regardless of anyone's alleged connection with Boo or newsgroup
    participation level, the advice to contact the package
    author/maintainer is sound. It happens every now and again that people
    post questions to comp.lang.python about fairly specific issues or
    packages that would be best sent to mailing lists or other resources
    devoted to such topics. It's far better to get a high quality opinion
    from a small group of people than a lower quality opinion from a larger
    group or a delayed response from the maintainer because he/she doesn't
    happen to be spending time sifting through flame wars amidst large
    volumes of relatively uninteresting/irrelevant messages.

    Paul
    Paul Boddie, Sep 25, 2005
    #17
  18. Igor V. Rafienko wrote:

    > The problem is that the file looks like this:
    >
    > <data>

    ... lots of schnappi records ...

    okay. I think your first approach

    from cElementTree import iterparse

    for event, elem in iterparse("data.xml"):
        if elem.tag == "schnappi":
            count += 1
            elem.clear()

    is the right one for this case. with this code, the clear call will
    destroy each schnappi record when you're done with it, so you
    will release all memory allocated for the schnappi elements.

    however, you will end up with a single toplevel element that
    contains a large number of empty subelements. this is usually
    no problem (it'll use a couple of megabytes), but you can get
    rid of the dead schnappis too, if you want to. see the example
    that starts with "context = iterparse" on this page

    http://effbot.org/zone/element-iterparse.htm

    for more information.
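
    As a rough sketch of that idea (not a copy of the page's code): grab
    the root element from the first "start" event, then clear the root
    each time a record has been processed, so the emptied records do not
    pile up under it. The sketch assumes the standard library's
    xml.etree.ElementTree in place of cElementTree, that every <schnappi>
    is a direct child of the root, and a made-up helper name
    schnappi_colors with made-up sample data.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data; a real file would hold many more records.
XML = b"""<data>
  <schnappi>
    <color>green</color>
    <friends><friend><id>Lama</id><color>white</color></friend></friends>
  </schnappi>
  <schnappi><color>blue</color></schnappi>
</data>"""


def schnappi_colors(source):
    """Collect each record's own <color>, releasing records as we go."""
    context = iter(iterparse(source, events=("start", "end")))
    # The first event is the "start" of the document's root element.
    _, root = next(context)
    colors = []
    for event, elem in context:
        if event == "end" and elem.tag == "schnappi":
            # A plain tag name matches direct children only, so this is
            # the record's own color, not a <friend>'s.
            colors.append(elem.findtext("color"))
            # Drop the finished record (and anything else parsed so far)
            # from the root so it can be garbage collected.
            root.clear()
    return colors


result = schnappi_colors(io.BytesIO(XML))
```

    Because only "end" events of <schnappi> trigger any work, the whole
    subtree is still intact when it is processed, which is the "subtree
    iterator" behaviour asked for above.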

    </F>
    Fredrik Lundh, Sep 25, 2005
    #18
  19. Paul Boddie wrote:
    > Reinhold Birkenfeld wrote:
    >> D H wrote:
    >> > I would recommend emailing the author of a library when you have a
    >> > question about that library. You should know that yourself as well.

    >>
    >> Well, if I had e.g. a question about Boo, I would of course first ask
    >> here because I know the expert writes here.

    >
    > Regardless of anyone's alleged connection with Boo or newsgroup
    > participation level


    Which was sort of an ironic <wink> from my side. I did not expect "D H"
    to go overboard on this.

    > the advice to contact the package author/maintainer is sound.


    Correct. But if the post is already in the newsgroup and the author is known
    to write there extensively, it sounds ridiculous to say "contact the author".

    > It happens every now and again that people
    > post questions to comp.lang.python about fairly specific issues or
    > packages that would be best sent to mailing lists or other resources
    > devoted to such topics. It's far better to get a high quality opinion
    > from a small group of people than a lower quality opinion from a larger
    > group or a delayed response from the maintainer because he/she doesn't
    > happen to be spending time sifting through flame wars amidst large
    > volumes of relatively uninteresting/irrelevant messages.


    Hey, the flame war stopped before it got interesting ;)

    Reinhold
    Reinhold Birkenfeld, Sep 25, 2005
    #19
  20. On 2005-09-25, D H <no@spam> wrote:
    >>>Igor V. Rafienko wrote:
    >>>
    >>>>This gave me the desired behaviour, but:
    >>>>
    >>>>* It looks *very* ugly
    >>>>* It's twice as slow as version which sees 'end'-events only.
    >>>>
    >>>>Now, there *has* to be a better way. What am I missing?
    >>>
    >>>Try emailing the author for support.

    >>
    >> I don't think that's needed. He is one of the most active
    >> members of c.l.py, and you should know that yourself.

    >
    > I would recommend emailing the author of a library when you
    > have a question about that library. You should know that
    > yourself as well.


    Why??

    For the things I "support", I much prefer answering questions
    in a public forum. That way the knowledge is available to
    everybody, and it reduces the number of e-mailed duplicate
    questions. Most of the gurus I know (not that I'm attempting
    to place myself in that category) feel the same way. ESR
    explained it well.

    Quoting from http://www.catb.org/~esr/faqs/smart-questions.html#forum

    You are likely to be ignored, or written off as a loser, if
    you:

    [...]

    * post a personal email to somebody who is neither an
    acquaintance of yours nor personally responsible for
    solving your problem

    [...]

    In general, questions to a well-selected public forum are
    more likely to get useful answers than equivalent questions
    to a private one. There are multiple reasons for this. One
    is simply the size of the pool of potential respondents.
    Another is the size of the audience; hackers would rather
    answer questions that educate a lot of people than questions
    which only serve a few.

    --
    Grant Edwards                   grante at visi.com
    Yow! I'm a GENIUS! I want to dispute sentence structure with SUSAN SONTAG!!
    Grant Edwards, Sep 25, 2005
    #20
