elementtree question

Tim Arnold

Hi, I'm using elementtree and elementtidy to work with some HTML files. For
some of these files I need to enclose the body content in a new div tag,
like this:
<body>
<div class="remapped">
original contents...
</div>
</body>

I figure there must be a way to do it by creating a 'div' SubElement to the
'body' tag and somehow copying the rest of the tree under that SubElement,
but it's beyond my comprehension.

How can I accomplish this?
(I know I could put the class on the body tag itself, but that won't satisfy
the powers-that-be).

thanks,
--Tim Arnold
 
Ivo

Tim said:
Hi, I'm using elementtree and elementtidy to work with some HTML files. For
some of these files I need to enclose the body content in a new div tag,
like this:
<body>
<div class="remapped">
original contents...
</div>
</body>

I figure there must be a way to do it by creating a 'div' SubElement to the
'body' tag and somehow copying the rest of the tree under that SubElement,
but it's beyond my comprehension.

How can I accomplish this?
(I know I could put the class on the body tag itself, but that won't satisfy
the powers-that-be).

thanks,
--Tim Arnold

You could also try something like this:

# Python 2 only: sgmllib was removed in Python 3.
from sgmllib import SGMLParser

class IParse(SGMLParser):
    def __init__(self, verbose=0):
        SGMLParser.__init__(self, verbose)
        self.data = ""

    def _attr_to_str(self, attrs):
        # attrs is a list of (name, value) pairs
        return ' '.join(['%s="%s"' % a for a in attrs])

    def start_body(self, attrs):
        self.data += "<body %s>" % self._attr_to_str(attrs)
        print "remapping"
        self.data += '''<div class="remapped">'''

    def end_body(self):
        self.data += "</div>"  # end remapping
        self.data += "</body>"

    def handle_data(self, data):
        self.data += data

    def unknown_starttag(self, tag, attrs):
        self.data += "<%s %s>" % (tag, self._attr_to_str(attrs),)

    def unknown_endtag(self, tag):
        self.data += "</%s>" % tag


if __name__ == "__main__":
    i = IParse()
    i.feed('''
<html>
<body bgcolor="#fffff">
original
<i>italic</i>
<b class="test">contents</b>...
</body>
</html>''')
    i.close()  # flush any buffered data before reading the result
    print i.data


Just look at the code of sgmllib (standard lib); it makes writing a parser
very easy. The example above could use some much-needed refactoring.
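
Since sgmllib is not available in Python 3, the same streaming idea can be sketched with html.parser from the standard library. This is only a rough sketch under some assumptions: the names BodyWrapper and wrap_body are made up for illustration, and comments, doctype declarations and processing instructions are silently dropped because no handlers are defined for them.

```python
from html import escape
from html.parser import HTMLParser


class BodyWrapper(HTMLParser):
    """Re-emit the HTML it is fed, wrapping the body contents in a div."""

    def __init__(self):
        # convert_charrefs=False routes &amp; and &#38; through the
        # handle_entityref/handle_charref hooks so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.parts = []

    @staticmethod
    def _format_attrs(attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        out = []
        for name, value in attrs:
            if value is None:
                out.append(" " + name)
            else:
                out.append(' %s="%s"' % (name, escape(value, quote=True)))
        return "".join(out)

    def handle_starttag(self, tag, attrs):
        self.parts.append("<%s%s>" % (tag, self._format_attrs(attrs)))
        if tag == "body":
            self.parts.append('<div class="remapped">')

    def handle_startendtag(self, tag, attrs):
        # Keep the self-closing form for tags such as <br/>.
        self.parts.append("<%s%s />" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        if tag == "body":
            self.parts.append("</div>")
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def wrap_body(html_text):
    """Return html_text with the body contents wrapped in a div."""
    parser = BodyWrapper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = '<html><body bgcolor="#ffffff">original <i>italic</i> 2&amp;3</body></html>'
result = wrap_body(sample)
print(result)
```

Unlike the tree-based answers, this never builds a document tree, so it only suits simple pass-through rewrites like this one.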
 
Gabriel Genellina

Hi, I'm using elementtree and elementtidy to work with some HTML files. For
some of these files I need to enclose the body content in a new div tag,
like this:
<body>
<div class="remapped">
original contents...
</div>
</body>

I figure there must be a way to do it by creating a 'div' SubElement to the
'body' tag and somehow copying the rest of the tree under that SubElement,
but it's beyond my comprehension.

import xml.etree.ElementTree as ET
source = """<html><head><title>Test</title></head><body>
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</body></html>"""
tree = ET.XML(source)
body = tree.find("body")
newdiv = ET.Element('div', {'class':'remapped'})
newdiv.append(body)
bodyidx = list(tree).index(body)
tree[bodyidx]=newdiv
ET.dump(tree)
 
Mark T

Gabriel Genellina said:
Hi, I'm using elementtree and elementtidy to work with some HTML files. For
some of these files I need to enclose the body content in a new div tag,
like this:
<body>
<div class="remapped">
original contents...
</div>
</body>

I figure there must be a way to do it by creating a 'div' SubElement to the
'body' tag and somehow copying the rest of the tree under that SubElement,
but it's beyond my comprehension.

import xml.etree.ElementTree as ET
source = """<html><head><title>Test</title></head><body>
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</body></html>"""
tree = ET.XML(source)
body = tree.find("body")
newdiv = ET.Element('div', {'class':'remapped'})
newdiv.append(body)
bodyidx = list(tree).index(body)
tree[bodyidx]=newdiv
ET.dump(tree)

The above wraps the body element, not the contents of the body element. I'm
no ElementTree expert, but this seems to work:

import xml.etree.ElementTree as ET
source = """<html><head><title>Test</title></head><body>
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</body></html>"""
tree = ET.XML(source)
body = tree.find("body")
newdiv = ET.Element('div', {'class':'remapped'})
for e in list(body):  # getchildren() was removed in Python 3.9
    newdiv.append(e)
newdiv.text = body.text
newdiv.tail = body.tail
body.clear()
body.append(newdiv)
ET.dump(tree)

Result:

<html><head><title>Test</title></head><body><div class="remapped">
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</div></body></html>

-Mark
 
Gabriel Genellina

[wrong code]
The above wraps the body element, not the contents of the body element.
I'm
no ElementTree expert, but this seems to work:

[better code]

Almost right. clear() removes all attributes too, so if the body element
had any attribute, it is lost. I would remove children from body at the
same time they're copied into newdiv.
(This whole thing appears to be harder than one would expect at first)

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
source = """<html><head><title>Test</title></head><body lang="en">
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</body></html>"""
tree = ET.XML(source)
body = tree.find("body")
newdiv = ET.Element('div', {'class':'remapped'})
for e in list(body):  # iterate over a copy, since we remove as we go
    newdiv.append(e)
    body.remove(e)
newdiv.text, body.text = body.text, ''
newdiv.tail, body.tail = body.tail, ''
body.append(newdiv)
ET.dump(tree)
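
The approach above can be folded into a small reusable helper. This is a minimal sketch, assuming plain xml.etree.ElementTree; the name wrap_children is made up for illustration. It differs from the code above in one deliberate way: it leaves the parent's tail alone, on the assumption that text following the closing body tag should stay outside the body.

```python
import xml.etree.ElementTree as ET


def wrap_children(parent, tag, attrib=None):
    """Move parent's text and children into a new child element."""
    wrapper = ET.Element(tag, attrib or {})
    # Leading text belongs to the wrapper once it becomes the first child.
    wrapper.text, parent.text = parent.text, None
    # Slice assignment moves all children in one step and leaves
    # parent.attrib and parent.tail untouched.
    wrapper[:] = list(parent)
    parent[:] = [wrapper]
    return wrapper


source = """<html><head><title>Test</title></head><body lang="en">
original contents... 2&amp;3 <a href="hello/world">some text</a>
<p>Another paragraph</p>
</body></html>"""
root = ET.XML(source)
body = root.find("body")
wrap_children(body, "div", {"class": "remapped"})
print(ET.tostring(root, encoding="unicode"))
```

Because clear() is never called, the lang attribute on body survives the rewrite.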
 
Stefan Behnel

Tim said:
Hi, I'm using elementtree and elementtidy to work with some HTML files. For
some of these files I need to enclose the body content in a new div tag,
like this:
<body>
<div class="remapped">
original contents...
</div>
</body>

Give lxml.etree (or lxml.html) a try:

from lxml import etree

tree = etree.parse("http://url.to/some.html", etree.HTMLParser())
body = tree.find("body")

and then:

div = etree.Element("div", {"class" : "remapped"})
div.extend(body)
body.append(div)

or alternatively:

children = list(body)
div = etree.SubElement(body, "div", {"class" : "remapped"})
div.extend(children)

http://codespeak.net/lxml/

and for lxml.html, which is currently in alpha status:

http://codespeak.net/lxml/dev/

ET 1.3 will also support the extend() function, BTW.

Stefan
 
Tim Arnold

Thanks for the great answers--I learned a lot. I'm looking forward to the ET
1.3 version. I'm currently working on some older HP-UX 10.20 machines and
haven't been able to compile lxml all the way through yet.

thanks again,
--Tim Arnold
 
Fredrik Lundh

Stefan said:
ET 1.3 will also support the extend() function, BTW.

div.extend(seq) can be trivially rewritten as

div[len(div):] = seq

and in this case, you know that len(div) is 0, so you can simply do:

div[:] = seq
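
A short runnable illustration of the two slice idioms, using the standard library xml.etree.ElementTree (the element names are arbitrary examples):

```python
import xml.etree.ElementTree as ET

div = ET.Element("div", {"class": "remapped"})
seq = [ET.Element("p"), ET.Element("a")]

# Same effect as div.extend(seq): assign to the empty slice at the end.
div[len(div):] = seq
tags_after_first = [e.tag for e in div]

# The idiom keeps working once the element already has children.
div[len(div):] = [ET.Element("span")]
tags_after_second = [e.tag for e in div]

# Assigning to the full slice replaces whatever was there before.
div[:] = seq
tags_after_replace = [e.tag for e in div]

print(tags_after_first, tags_after_second, tags_after_replace)
```

Note that div[:] = seq is only interchangeable with extend() when the element is empty, since it discards existing children.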

(this recent lxml habit of using lxml-specific versions of things that
are trivial to do with the standard API is a bit disappointing. kind of
defeats the purpose of having a standard API...)

</F>
 
Stefan Behnel

Fredrik said:
(this recent lxml habit of using lxml-specific versions of things that
are trivial to do with the standard API is a bit disappointing. kind of
defeats the purpose of having a standard API...)

ElementTree is not the only standard API that lxml is following. Another one
is the standard API of the "list" builtin type, which has an extend() method.

ah-you're-just-jealous-we-had-it-first-ly,

Stefan :)
 
Stefan Behnel

Tim said:
Thanks for the great answers--I learned a lot. I'm looking forward to the ET
1.3 version.

Note that there is a difference in behaviour, though. lxml.etree forces
Elements to be uniquely positioned in a tree, so the code I posted relies on
the "side effect" of automatically removing an Element from the old position
when inserting it at a different place. ElementTree does not do that, so this
code is not portable between the two libraries.

Stefan
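
The difference described above can be seen with a small experiment in plain xml.etree.ElementTree. This is only a sketch with a made-up two-paragraph document; it shows that the moved elements stay in their old parent until they are removed explicitly.

```python
import xml.etree.ElementTree as ET

body = ET.XML("<body><p>one</p><p>two</p></body>")
children = list(body)

div = ET.SubElement(body, "div", {"class": "remapped"})
div[:] = children  # same effect as div.extend(children)

# ElementTree did not detach the paragraphs from body: body now holds
# p, p, div, and the div holds the very same two p objects.
tags_before = [e.tag for e in body]

# Remove the originals explicitly so each element has a single parent.
for child in children:
    body.remove(child)
tags_after = [e.tag for e in body]

print(tags_before, tags_after)
print(ET.tostring(body, encoding="unicode"))
```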
 
Fredrik Lundh

Tim Arnold wrote:

I figure there must be a way to do it by creating a 'div' SubElement to the
'body' tag and somehow copying the rest of the tree under that SubElement,
but it's beyond my comprehension.

How can I accomplish this?
(I know I could put the class on the body tag itself, but that won't satisfy
the powers-that-be).

for completeness, here's an efficient and fairly straightforward way to
do it under plain 2.5 xml.etree:

import copy

# doc is an already parsed tree or root element
body = doc.find(".//body")

# clone and mutate the body element
div = copy.copy(body)
div.tag = "div"
div.set("class", "remapped")

# replace the body contents with the new div
body.clear()
body[:] = [div]
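
A self-contained version of the clone-and-mutate idea, for experimenting. The sample document is made up for illustration. One thing worth noticing when running it: because the div starts life as a copy of body and body is then cleared, any attributes of body end up on the div rather than on body.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.XML('<html><body lang="en">text <b>bold</b></body></html>')
body = doc.find(".//body")

# Shallow copy: the div gets body's text, attributes and child references.
div = copy.copy(body)
div.tag = "div"
div.set("class", "remapped")

# clear() drops body's children, text, tail and attributes.
body.clear()
body[:] = [div]

print(ET.tostring(doc, encoding="unicode"))
```

If body's own attributes must stay on body, they could be saved into a dict before clear() and restored afterwards with body.attrib.update().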

</F>
 
