lxml removing tag, keeping text order

Discussion in 'Python' started by Tim Arnold, Oct 24, 2008.

  1. Tim Arnold

    Tim Arnold Guest

    Hi,
    Using lxml to clean up auto-generated xml to validate against a dtd; I need
    to remove an element tag but keep the text in order. For example
    s0 = '''
    <option>
    <optional> first text
    <someelement>ladida</someelement>
    <emphasis>emphasized text</emphasis>
    middle text
    <anotherelement/>
    last text
    </optional>
    </option>'''

    I want to get rid of the <emphasis> tag but keep everything else as it is;
    that is, I need this result:

    <option>
    <optional> first text
    <someelement>ladida</someelement>
    emphasized text
    middle text
    <anotherelement/>
    last text
    </optional>
    </option>

    I'm beginning to think this is an impossible task, so I'm asking here to see if
    there is some method that will work. What I've done so far is this:

    (outer encloses the parent, outside is the parent, inside is the child to
    remove)
    from lxml import etree
    import copy
    def rm_tag(elem, outer, outside, inside):
        newdiv = etree.Element(outside)
        newdiv.text = ''
        for e0 in elem.getiterator(outside):
            for i,e1 in enumerate(e0.getiterator()):
                if i == 0:
                    if e1.text: newdiv.text += e1.text
                elif (e1.tag != inside):
                    newdiv.append(copy.deepcopy(e1))
                elif (e1.text):
                    newdiv.text += e1.text

        for t in elem.getiterator():
            if t.tag == outer:
                t.clear()
                t.append(newdiv)
                break
        return etree.ElementTree(elem)

    print etree.tostring(rm_tag(el,'option','optional','emphasis'),pretty_print=True)

    But the text is messed up using this method. I see why it's wrong, but not
    how to make it right.
    It returns:
    <option>
    <optional> first text
    emphasized text
    <someelement>ladida</someelement>
    <anotherelement/>
    last text
    </optional>
    </option>
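A note on why the order gets scrambled: in the ElementTree model that lxml uses, text that follows a child element is stored on that child as its .tail, not on the parent. Collecting only .text values therefore skips every tail, while appending to the parent's .text moves text in front of all children. The following is a minimal sketch of that behaviour, assuming Python 3 and a made-up sample document (the names root, b and i are illustrative only):

```python
from lxml import etree

# Hypothetical sample: text before, between, and after child elements.
root = etree.fromstring('<p>first<b>bold</b>middle<i/>last</p>')
b, i = root  # the two children, in document order

# 'first' lives on the parent; 'middle' and 'last' live on the children as tails.
parent_text = root.text
b_text, b_tail = b.text, b.tail
i_tail = i.tail

# Collecting only .text loses every tail ...
texts_only = [e.text for e in root.iter() if e.text]

# ... while itertext() yields text and tails in document order.
in_order = list(root.itertext())

print(texts_only)
print(in_order)
```

Any approach that removes a tag has to move both the element's text and its tail to the right neighbour to keep the order intact.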

    Maybe I should send the outside element (via tostring) to a regexp for
    removing the child and return that string? Regexp? Getting desperate, hey.

    Any pointers much appreciated,
    --Tim Arnold
    Tim Arnold, Oct 24, 2008
    #1

  2. Tim Arnold wrote:
    > Hi,
    > Using lxml to clean up auto-generated xml to validate against a dtd; I need
    > to remove an element tag but keep the text in order. For example
    > s0 = '''
    > <option>
    > <optional> first text
    > <someelement>ladida</someelement>
    > <emphasis>emphasized text</emphasis>
    > middle text
    > <anotherelement/>
    > last text
    > </optional>
    > </option>'''
    >
    > I want to get rid of the <emphasis> tag but keep everything else as it is;
    > that is, I need this result:
    >
    > <option>
    > <optional> first text
    > <someelement>ladida</someelement>
    > emphasized text
    > middle text
    > <anotherelement/>
    > last text
    > </optional>
    > </option>


    There's a drop_tag() method in lxml.html (lxml/html/__init__.py) that does
    what you want. Just copy the code over to your code base and adapt it as needed.
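As a rough illustration of that suggestion, here is a sketch of a standalone helper that follows the same idea as drop_tag(): merge the element's text and tail into the preceding sibling (or the parent), then splice the element's children into its place. This is not a verbatim copy of the lxml source; the function name drop_tag_generic is hypothetical, and the code assumes Python 3 and plain lxml.etree elements.

```python
from lxml import etree

def drop_tag_generic(el):
    """Remove el's tag but keep its text, children, and tail in place.

    Hypothetical helper modelled on the idea behind lxml.html's drop_tag().
    """
    parent = el.getparent()
    if parent is None:
        raise ValueError("cannot drop the root element")
    previous = el.getprevious()
    # Fold the tail into the last child's tail, or into el's own text.
    if el.tail:
        if len(el):
            last = el[-1]
            last.tail = (last.tail or '') + el.tail
        else:
            el.text = (el.text or '') + el.tail
    # Hand the leading text to the previous sibling's tail or the parent's text.
    if el.text:
        if previous is None:
            parent.text = (parent.text or '') + el.text
        else:
            previous.tail = (previous.tail or '') + el.text
    # Replace el with its own children (possibly none).
    index = parent.index(el)
    parent[index:index + 1] = el[:]

s0 = '''
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>'''

root = etree.fromstring(s0)
# Build the list first so the tree is not modified while iterating over it.
for el in list(root.iter('emphasis')):
    drop_tag_generic(el)
print(etree.tostring(root).decode())
```

Later lxml releases also provide etree.strip_tags(tree, 'emphasis'), which may make a hand-written helper unnecessary, depending on the installed version.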

    Stefan
    Stefan Behnel, Oct 25, 2008
    #2

  3. Tim Arnold

    Tim Arnold Guest

    "Stefan Behnel" wrote:
    > Tim Arnold wrote:
    >> Hi,
    >> Using lxml to clean up auto-generated xml to validate against a dtd; I
    >> need
    >> to remove an element tag but keep the text in order. For example
    >> s0 = '''
    >> <option>
    >> <optional> first text
    >> <someelement>ladida</someelement>
    >> <emphasis>emphasized text</emphasis>
    >> middle text
    >> <anotherelement/>
    >> last text
    >> </optional>
    >> </option>'''
    >>
    >> I want to get rid of the <emphasis> tag but keep everything else as it
    >> is;
    >> that is, I need this result:
    >>
    >> <option>
    >> <optional> first text
    >> <someelement>ladida</someelement>
    >> emphasized text
    >> middle text
    >> <anotherelement/>
    >> last text
    >> </optional>
    >> </option>

    >
    > There's a drop_tag() method in lxml.html (lxml/html/__init__.py) that does
    > what you want. Just copy the code over to your code base and adapt it as
    > needed.
    >
    > Stefan

    Thanks Stefan, I was going crazy with this. That method is going to be quite
    useful for my project and it's good to learn from too; I was making it too
    hard.

    thanks,
    --Tim Arnold
    Tim Arnold, Oct 27, 2008
    #3
