lxml removing tag, keeping text order

Tim Arnold

Hi,
I'm using lxml to clean up auto-generated XML so that it validates against a
DTD. I need to remove an element tag but keep the text in order. For example:
s0 = '''
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>'''

I want to get rid of the <emphasis> tag but keep everything else as it is;
that is, I need this result:

<option>
<optional> first text
<someelement>ladida</someelement>
emphasized text
middle text
<anotherelement/>
last text
</optional>
</option>

I'm beginning to think this is an impossible task, so I'm asking here to see
if there is some method that will work. What I've done so far is this:

(outer encloses the parent, outside is the parent, inside is the child to
remove)
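from lxml import etree
import copy

def rm_tag(elem, outer, outside, inside):
    # Rebuild the parent (outside) without the child tag (inside).
    newdiv = etree.Element(outside)
    newdiv.text = ''
    for e0 in elem.iter(outside):
        for i, e1 in enumerate(e0.iter()):
            if i == 0:
                # e0.iter() yields e0 itself first: keep its leading text
                if e1.text:
                    newdiv.text += e1.text
            elif e1.tag != inside:
                newdiv.append(copy.deepcopy(e1))
            elif e1.text:
                # the removed child's text is added to the parent's leading
                # text, so it ends up ahead of every copied sibling
                newdiv.text += e1.text

    # Swap the rebuilt parent into the enclosing element (outer).
    for t in elem.iter():
        if t.tag == outer:
            t.clear()
            t.append(newdiv)
            break
    return etree.ElementTree(elem)

el = etree.fromstring(s0)  # parse the sample document above
print(etree.tostring(rm_tag(el, 'option', 'optional', 'emphasis'),
                     pretty_print=True, encoding='unicode'))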

But the text is messed up using this method. I see why it's wrong, but not
how to make it right.
It returns:
<option>
<optional> first text
emphasized text
<someelement>ladida</someelement>
<anotherelement/>
last text
</optional>
</option>

Maybe I should send the outside element (via tostring) to a regexp for
removing the child and return that string? Regexp? Getting desperate, hey.

Any pointers much appreciated,
--Tim Arnold
 
Stefan Behnel

There's a drop_tag() method in lxml.html (lxml/html/__init__.py) that does
what you want. Just copy the code over to your code base and adapt it as needed.

Stefan
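
For readers following along, here is a minimal sketch of that idea for plain ElementTree-style elements. It is not the code from lxml.html, only an illustration of the same text/tail bookkeeping: the removed element's text is appended to the previous sibling's tail (or to the parent's text when there is no previous sibling), its tail is carried over as well, and any children are spliced into its old position. The function name drop_tag and its (parent, child) signature are assumptions made for this example; the parent is passed explicitly so the sketch also runs on the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

SAMPLE = """
<option>
<optional> first text
<someelement>ladida</someelement>
<emphasis>emphasized text</emphasis>
middle text
<anotherelement/>
last text
</optional>
</option>"""


def drop_tag(parent, child):
    """Remove `child` from `parent` but keep its text, children and tail in place.

    Hypothetical helper written for this example, not an lxml API.
    """
    index = list(parent).index(child)
    kids = list(child)
    text = child.text or ""
    tail = child.tail or ""

    if kids:
        # The tail must follow the last moved child.
        last = kids[-1]
        last.tail = (last.tail or "") + tail
    else:
        # No children: text and tail are adjacent, so merge them.
        text += tail

    if index == 0:
        # Nothing precedes the child, so its text joins the parent's text.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it joins the tail of the preceding sibling.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text

    parent.remove(child)
    for offset, kid in enumerate(kids):
        parent.insert(index + offset, kid)


root = etree.fromstring(SAMPLE)
text_before = "".join(root.itertext())

optional = root.find("optional")
for child in list(optional):  # iterate over a copy while modifying
    if child.tag == "emphasis":
        drop_tag(optional, child)

print(etree.tostring(root, encoding="unicode"))
```

With lxml itself, etree.strip_tags(tree, 'emphasis') should give the same kind of result for every matching element in one call, so the hand-written helper is mainly useful for seeing how text and tail have to be moved around.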
 
Tim Arnold

Thanks Stefan, I was going crazy with this. That method is going to be quite
useful for my project and it's good to learn from too; I was making it too
hard.

thanks,
--Tim Arnold
 
