Simple elementtree question

IamIan

This is in Python 2.3.5. I've had success with elementtree and other
RSS feeds, but I can't get it to work with this format:

<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fr="http://ASPRSS.com/fr.html"
xmlns:pa="http://ASPRSS.com/pa.html"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title>
<link>http://www.sample.com</link>
<description>Sample News Agency - News Feed</description>
<image rdf:resource="http://www.sample.com/img/new.gif" />
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.sample.com/news/20000/news.htm" />
<rdf:li rdf:resource="http://www.sample.com/news/20001/news.htm" />
</rdf:Seq>
</items>
</channel><image rdf:about="http://www.sample.com/img/about.gif">
<title>Our News Feed</title>
<url>http://www.sample.com/img/title.gif</url>
<link>http://www.sample.com</link>
</image>
<item rdf:about="http://www.sample.com/news/20000/news.htm">
<title>First story</title>
<description>30 August, 2007 : - - First description including unicode
characters</description>
<link>http://www.sample.com/news/20000/news.htm</link>
</item>
<item rdf:about="http://www.sample.com/news/20001/news.htm">
<title>Second story</title>
<description>30 August, 2007 : - - Second description including
unicode characters</description>
<link>http://www.sample.com/news/20001/news.htm</link>
</item>
</rdf:RDF>

What I want to extract is the text in the title and link tags for each
item (e.g. <title>First story</title> and
<link>http://www.sample.com/news/20000/news.htm</link>). Starting with
the title, my test script is:

import sys
from urllib import urlopen

sys.path.append("/home/me/lib/python")
import elementtree.ElementTree as ET

news = urlopen("http://www.sample.com/rss/rss.xml")
nTree = ET.parse(news)
for item in nTree.getiterator("title"):
    print item.text

Whether I try this for title or link, nothing is printed. There are
also unicode characters in the <title> tags, and I'm not sure whether
that could affect the output like this. In case it did, I passed an
encoding argument to ET.parse (which I'd seen in other posts), but it
said encoding was an unexpected argument...

Printing all subelements does work:
print nTree.getiterator()

[<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 40436d2c>,
<Element {http://purl.org/rss/1.0/}channel at 40436b2c>,
<Element {http://purl.org/rss/1.0/}title at 40436dcc>,
<Element {http://purl.org/rss/1.0/}link at 40436d6c>,
<Element {http://purl.org/rss/1.0/}description at 40436e0c>,
<Element {http://purl.org/rss/1.0/}image at 40436e6c>,
<Element {http://purl.org/rss/1.0/}items at 40436f2c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}Seq at 40436f6c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436f0c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436fec>,
<Element {http://purl.org/rss/1.0/}item at 4044624c>,
<Element {http://purl.org/rss/1.0/}title at 4044626c>,
<Element {http://purl.org/rss/1.0/}description at 4044614c>,
<Element {http://purl.org/rss/1.0/}link at 4044630c>,
<Element {http://purl.org/rss/1.0/}item at 404463ac>,
<Element {http://purl.org/rss/1.0/}title at 404463cc>,
<Element {http://purl.org/rss/1.0/}description at 404462ac>,
<Element {http://purl.org/rss/1.0/}link at 4044640c>]

Any ideas are greatly appreciated.
 
Stefan Behnel

IamIan said:
This is in Python 2.3.5. I've had success with elementtree and other
RSS feeds, but I can't get it to work with this format:

<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fr="http://ASPRSS.com/fr.html"
xmlns:pa="http://ASPRSS.com/pa.html"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title> [...]
</rdf:RDF>

What I want to extract is the text in the title and link tags for each
item (e.g. <title>First story</title> and
<link>http://www.sample.com/news/20000/news.htm</link>). Starting with
the title, my test script is:

import sys
from urllib import urlopen

import elementtree.ElementTree as ET

news = urlopen("http://www.sample.com/rss/rss.xml")
nTree = ET.parse(news)
for item in nTree.getiterator("title"):
    print item.text

Whether I try this for title or link, nothing is printed.

Your document uses namespaces. What you are looking for is not the tag "title"
without a namespace, but the tag "{http://purl.org/rss/1.0/}title" with the
default namespace.

http://effbot.org/zone/element.htm#xml-namespaces
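
A minimal sketch of the difference, assuming the standard library's xml.etree.ElementTree (the bundled successor of the standalone elementtree package, not what Python 2.3.5 ships with) and a trimmed copy of the feed above. The names RSS_NS and SAMPLE are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the default namespace URI declared on <rdf:RDF> in the feed.
RSS_NS = "http://purl.org/rss/1.0/"

# Trimmed copy of the feed from the question (hypothetical sample data).
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.sample.com">
    <title>Example feed</title>
    <link>http://www.sample.com</link>
  </channel>
  <item rdf:about="http://www.sample.com/news/20000/news.htm">
    <title>First story</title>
    <link>http://www.sample.com/news/20000/news.htm</link>
  </item>
  <item rdf:about="http://www.sample.com/news/20001/news.htm">
    <title>Second story</title>
    <link>http://www.sample.com/news/20001/news.htm</link>
  </item>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)

# A bare "title" matches nothing: every element is in the RSS namespace.
plain_titles = [el.text for el in root.iter("title")]

# The fully qualified tag matches the channel title and both item titles.
qualified_titles = [el.text for el in root.iter("{%s}title" % RSS_NS)]

print(plain_titles)
print(qualified_titles)
```

With the older standalone elementtree package, getiterator() takes the same qualified tag string in place of iter().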

Stefan
 
IamIan

Thank you very much! That did it.

In the source XML <item> tags have rdf:about attributes with the link
to the story, and it was here I planned on grabbing the link and
matching it up with the <title> child text. After seeing the output of
elementtree's getiterator() though, it now looks like each item, title,
description, and link is a separate element...

I could use a dictionary or lists to match the first title to the
first link, but is there a more elegant way in elementtree (or
otherwise) to do this?

Thanks again,

Ian
 
Stefan Behnel

IamIan said:
Thank you very much! That did it.

In the source XML <item> tags have rdf:about attributes with the link
to the story, and it was here I planned on grabbing the link and
matching it up with the <title> child text. After seeing the output of
elementtree's getiterator() though, it now looks like each item, title,
description, and link is a separate element...

I could use a dictionary or lists to match the first title to the
first link, but is there a more elegant way in elementtree (or
otherwise) to do this?

You can iterate over the channel Elements and then select the title child
(el.find()) to see if it's interesting.
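
The same idea applied to the item elements from the question rather than the channel, as a rough sketch: getiterator() flattens the tree, but title and link are still children of their item, so they can be read together per item. This assumes the standard library's xml.etree.ElementTree and a trimmed copy of the feed; item_links, RSS, RDF and SAMPLE are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URIs as declared on <rdf:RDF> in the feed,
# written in ElementTree's "{uri}" prefix form.
RSS = "{http://purl.org/rss/1.0/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Trimmed copy of the feed from the question (hypothetical sample data).
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.sample.com">
    <title>Example feed</title>
    <link>http://www.sample.com</link>
  </channel>
  <item rdf:about="http://www.sample.com/news/20000/news.htm">
    <title>First story</title>
    <link>http://www.sample.com/news/20000/news.htm</link>
  </item>
  <item rdf:about="http://www.sample.com/news/20001/news.htm">
    <title>Second story</title>
    <link>http://www.sample.com/news/20001/news.htm</link>
  </item>
</rdf:RDF>"""


def item_links(root):
    """Return one (about, title, link) tuple per <item> child of root."""
    rows = []
    for item in root.findall(RSS + "item"):
        about = item.get(RDF + "about")       # the rdf:about attribute
        title = item.findtext(RSS + "title")  # text of the <title> child
        link = item.findtext(RSS + "link")    # text of the <link> child
        rows.append((about, title, link))
    return rows


root = ET.fromstring(SAMPLE)
stories = item_links(root)
for about, title, link in stories:
    print("%s -> %s" % (title, link))
```

Because each tuple is built from a single item, no separate dictionary or parallel lists are needed to keep titles and links matched up.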

You can also try lxml.etree, which supports XPath:
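>>> from lxml import etree
>>> ns = {"rss": "http://purl.org/rss/1.0/"}  # XPath needs a prefix for the default namespace
>>> find_channel = etree.XPath("//rss:channel[rss:title = $title]", namespaces=ns)
>>> tree = etree.parse("http://somewhere/the_document.xml")
>>> channel = find_channel(tree, title="example title")[0]  # XPath returns a list
>>> print channel.findtext("{http://purl.org/rss/1.0/}link")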

or lxml.objectify:
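>>> from lxml import etree, objectify
>>> ns = {"rss": "http://purl.org/rss/1.0/"}  # XPath needs a prefix for the default namespace
>>> find_channel = etree.XPath("//rss:channel[rss:title = $title]", namespaces=ns)
>>> tree = objectify.parse("http://somewhere/the_document.xml")
>>> channel = find_channel(tree, title="example title")[0]  # XPath returns a list
>>> print channel.title, channel.link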

http://codespeak.net/lxml

Stefan
 
