IamIan
This is in Python 2.3.5. I've had success with elementtree and other
RSS feeds, but I can't get it to work with this format:
<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fr="http://ASPRSS.com/fr.html"
xmlns:pa="http://ASPRSS.com/pa.html"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title>
<link>http://www.sample.com</link>
<description>Sample News Agency - News Feed</description>
<image rdf:resource="http://www.sample.com/img/new.gif" />
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.sample.com/news/20000/news.htm" />
<rdf:li rdf:resource="http://www.sample.com/news/20001/news.htm" />
</rdf:Seq>
</items>
</channel><image rdf:about="http://www.sample.com/img/about.gif">
<title>Our News Feed</title>
<url>http://www.sample.com/img/title.gif</url>
<link>http://www.sample.com</link>
</image>
<item rdf:about="http://www.sample.com/news/20000/news.htm"><title>First story</title>
<description>30 August, 2007 : - - First description including unicode
characters</description>
<link>http://www.sample.com/news/20000/news.htm</link>
</item>
<item rdf:about="http://www.sample.com/news/20001/news.htm"><title>Second story</title>
<description>30 August, 2007 : - - Second description including
unicode characters</description>
<link>http://www.sample.com/news/20001/news.htm</link>
</item>
</rdf:RDF>
What I want to extract is the text in the title and link tags for each
item (e.g. <title>First story</title> and
<link>http://www.sample.com/news/20000/news.htm</link>). Starting with
the title, my test script is:
import sys
from urllib import urlopen
sys.path.append("/home/me/lib/python")
import elementtree.ElementTree as ET
news = urlopen("http://www.sample.com/rss/rss.xml")
nTree = ET.parse(news)
for item in nTree.getiterator("title"):
    print item.text
Whether I try this for title or link, nothing is printed. There are
also unicode characters in the <title> tags, and I'm not sure whether
that could affect the output like this. In case it did, I passed an
encoding argument to ET.parse (which I'd seen in other posts), but it
said encoding was an unexpected argument.
Printing all subelements does work:
print nTree.getiterator()
[<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 40436d2c>,
<Element {http://purl.org/rss/1.0/}channel at 40436b2c>,
<Element {http://purl.org/rss/1.0/}title at 40436dcc>,
<Element {http://purl.org/rss/1.0/}link at 40436d6c>,
<Element {http://purl.org/rss/1.0/}description at 40436e0c>,
<Element {http://purl.org/rss/1.0/}image at 40436e6c>,
<Element {http://purl.org/rss/1.0/}items at 40436f2c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}Seq at 40436f6c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436f0c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436fec>,
<Element {http://purl.org/rss/1.0/}item at 4044624c>,
<Element {http://purl.org/rss/1.0/}title at 4044626c>,
<Element {http://purl.org/rss/1.0/}description at 4044614c>,
<Element {http://purl.org/rss/1.0/}link at 4044630c>,
<Element {http://purl.org/rss/1.0/}item at 404463ac>,
<Element {http://purl.org/rss/1.0/}title at 404463cc>,
<Element {http://purl.org/rss/1.0/}description at 404462ac>,
<Element {http://purl.org/rss/1.0/}link at 4044640c>]
Any ideas are greatly appreciated.
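One thing the getiterator() listing does suggest: every tag is stored with its namespace URI in braces, so a search for the bare name "title" may simply never match. Below is a minimal sketch of that idea, not a confirmed fix. It assumes the standard-library xml.etree.ElementTree (Python 2.5 and later) and its iter() method in place of the standalone elementtree package and getiterator() used above. The names RSS_NS, SAMPLE and item_titles_and_links are made up for illustration, and SAMPLE is a trimmed copy of the feed rather than the real URL.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on <rdf:RDF> in the feed (xmlns="...").
RSS_NS = "http://purl.org/rss/1.0/"

# Trimmed copy of the feed above, used here instead of fetching the URL.
SAMPLE = """<?xml version="1.0"?><rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title>
<link>http://www.sample.com</link>
</channel>
<item rdf:about="http://www.sample.com/news/20000/news.htm"><title>First story</title>
<link>http://www.sample.com/news/20000/news.htm</link>
</item>
<item rdf:about="http://www.sample.com/news/20001/news.htm"><title>Second story</title>
<link>http://www.sample.com/news/20001/news.htm</link>
</item>
</rdf:RDF>"""


def item_titles_and_links(xml_text):
    """Return a list of (title, link) pairs, one per <item> element."""
    root = ET.fromstring(xml_text)
    results = []
    # Tags must be written as "{namespace-uri}localname" to match.
    for item in root.iter("{%s}item" % RSS_NS):
        title = item.findtext("{%s}title" % RSS_NS)
        link = item.findtext("{%s}link" % RSS_NS)
        results.append((title, link))
    return results


if __name__ == "__main__":
    for title, link in item_titles_and_links(SAMPLE):
        print("%s %s" % (title, link))
```

Looking up title and link inside each item, rather than iterating over every title in the tree, also skips the channel's own title and link, which is what I want here.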