Simple elementtree question

IamIan · Aug 30, 2007

This is in Python 2.3.5. I've had success with elementtree and other
RSS feeds, but I can't get it to work with this format:

<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fr="http://ASPRSS.com/fr.html"
xmlns:pa="http://ASPRSS.com/pa.html"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title>
<link>http://www.sample.com</link>
<description>Sample News Agency - News Feed</description>
<image rdf:resource="http://www.sample.com/img/new.gif" />
<items>
<rdf:Seq>
<rdf:li rdf:resource="http://www.sample.com/news/20000/news.htm" />
<rdf:li rdf:resource="http://www.sample.com/news/20001/news.htm" />
</rdf:Seq>
</items>
</channel><image rdf:about="http://www.sample.com/img/about.gif">
<title>Our News Feed</title>
<url>http://www.sample.com/img/title.gif</url>
<link>http://www.sample.com</link>
</image>
<item rdf:about="http://www.sample.com/news/20000/news.htm"><title>First story</title>
<description>30 August, 2007 : - - First description including unicode
characters</description>
<link>http://www.sample.com/news/20000/news.htm</link>
</item>
<item rdf:about="http://www.sample.com/news/20001/news.htm"><title>Second story</title>
<description>30 August, 2007 : - - Second description including
unicode characters</description>
<link>http://www.sample.com/news/20001/news.htm</link>
</item>
</rdf:RDF>

What I want to extract is the text in the title and link tags for each
item (eg. <title>First story</title> and <link>http://www.sample.com/
news/20000/news.htm</link>). Starting with the title, my test script
is:

import sys
from urllib import urlopen

sys.path.append("/home/me/lib/python")
import elementtree.ElementTree as ET

news = urlopen("http://www.sample.com/rss/rss.xml")
nTree = ET.parse(news)
for item in nTree.getiterator("title"):
    print item.text

Whether I try this for title or link, nothing is printed. There are
also unicode characters in the <title> tags, and I'm not sure whether
that could affect the output like this. In case it did, I passed an
encoding argument to ET.parse (which I'd seen in other posts), but it
said encoding was an unexpected argument...

Printing all subelements does work:
print nTree.getiterator()

[<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 40436d2c>,
<Element {http://purl.org/rss/1.0/}channel at 40436b2c>,
<Element {http://purl.org/rss/1.0/}title at 40436dcc>,
<Element {http://purl.org/rss/1.0/}link at 40436d6c>,
<Element {http://purl.org/rss/1.0/}description at 40436e0c>,
<Element {http://purl.org/rss/1.0/}image at 40436e6c>,
<Element {http://purl.org/rss/1.0/}items at 40436f2c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}Seq at 40436f6c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436f0c>,
<Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}li at 40436fec>,
<Element {http://purl.org/rss/1.0/}item at 4044624c>,
<Element {http://purl.org/rss/1.0/}title at 4044626c>,
<Element {http://purl.org/rss/1.0/}description at 4044614c>,
<Element {http://purl.org/rss/1.0/}link at 4044630c>,
<Element {http://purl.org/rss/1.0/}item at 404463ac>,
<Element {http://purl.org/rss/1.0/}title at 404463cc>,
<Element {http://purl.org/rss/1.0/}description at 404462ac>,
<Element {http://purl.org/rss/1.0/}link at 4044640c>]

Any ideas are greatly appreciated.

Stefan Behnel · Aug 30, 2007

IamIan said:
This is in Python 2.3.5. I've had success with elementtree and other
RSS feeds, but I can't get it to work with this format:

<?xml version="1.0"?><rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:fr="http://ASPRSS.com/fr.html"
xmlns:pa="http://ASPRSS.com/pa.html"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.sample.com">
<title>Example feed</title> [...]
</rdf:RDF>

What I want to extract is the text in the title and link tags for each
item (eg. <title>First story</title> and <link>http://www.sample.com/
news/20000/news.htm</link>). Starting with the title, my test script
is:

import sys
from urllib import urlopen

import elementtree.ElementTree as ET

news = urlopen("http://www.sample.com/rss/rss.xml")
nTree = ET.parse(news)
for item in nTree.getiterator("title"):
    print item.text

Whether I try this for title or link, nothing is printed.

Your document uses namespaces. What you are looking for is not the tag "title"
without a namespace, but the tag "{http://purl.org/rss/1.0/}title" with the
default namespace.

http://effbot.org/zone/element.htm#xml-namespaces
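
To make the difference concrete, here is a minimal sketch of the lookup with and without the namespace. It assumes a current Python where ElementTree ships as xml.etree.ElementTree and elements have iter(); on Python 2.3 with the standalone package the import would be elementtree.ElementTree and the call getiterator(). SAMPLE is a trimmed stand-in for the feed, and RSS_NS is just a local name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the feed's xmlns="..." declaration.
RSS_NS = "http://purl.org/rss/1.0/"

# Trimmed stand-in for the feed in the question (not the real document).
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.sample.com">
    <title>Example feed</title>
  </channel>
  <item rdf:about="http://www.sample.com/news/20000/news.htm">
    <title>First story</title>
  </item>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)

# A bare tag name only matches elements that have no namespace at all.
unqualified = [el.text for el in root.iter("title")]

# ElementTree stores tags as "{namespace-uri}localname", so search for that.
qualified = [el.text for el in root.iter("{%s}title" % RSS_NS)]

print(unqualified)
print(qualified)
```

The first list comes back empty, which matches the "nothing is printed" symptom; the second contains both the channel title and the item title.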

Stefan

IamIan · Aug 30, 2007

Thank you very much! That did it.

In the source XML <item> tags have rdf:about attributes with the link
to the story, and it was here I planned on grabbing the link and
matching it up with the <title> child text. After seeing the output of
elementtree's getiterator() though, it now looks like each item, title,
description, and link is a separate element...

I could use a dictionary or lists to match the first title to the
first link, but is there a more elegant way in elementtree (or
otherwise) to do this?

Thanks again,

Ian

Stefan Behnel · Aug 31, 2007

IamIan said:
Thank you very much! That did it.

In the source XML <item> tags have rdf:about attributes with the link
to the story, and it was here I planned on grabbing the link and
matching it up with the <title> child text. After seeing the output of
elementtree's getiterator() though, it now looks like each item, title,
description, and link is a separate element...

I could use a dictionary or lists to match the first title to the
first link, but is there a more elegant way in elementtree (or
otherwise) to do this?

You can iterate over the channel Elements and then select the title child
(el.find()) to see if it's interesting.
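
A rough sketch of that approach, applied here to the item elements from the question rather than to channel: each item is still the parent of its own title and link, so looking up the children per item keeps them paired without any extra bookkeeping. This assumes a current Python with xml.etree.ElementTree; SAMPLE, the RSS and RDF prefixes, and the helper name item_titles_and_links are made up for the example.

```python
import xml.etree.ElementTree as ET

# "{uri}" prefixes for building qualified tag and attribute names.
RSS = "{http://purl.org/rss/1.0/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Trimmed stand-in for the feed in the question (not the real document).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://www.sample.com/news/20000/news.htm">
    <title>First story</title>
    <link>http://www.sample.com/news/20000/news.htm</link>
  </item>
  <item rdf:about="http://www.sample.com/news/20001/news.htm">
    <title>Second story</title>
    <link>http://www.sample.com/news/20001/news.htm</link>
  </item>
</rdf:RDF>"""


def item_titles_and_links(root):
    """Return (title, link, rdf:about) for each <item> directly under root."""
    pairs = []
    for item in root.findall(RSS + "item"):
        title = item.findtext(RSS + "title")  # text of this item's own title
        link = item.findtext(RSS + "link")    # text of this item's own link
        about = item.get(RDF + "about")       # namespaced attribute lookup
        pairs.append((title, link, about))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = item_titles_and_links(root)
for title, link, about in pairs:
    print(title, link)
```

The rdf:about attribute is read with the same "{uri}name" convention as the tags, so either it or the link child can serve as the story URL.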

You can also try lxml.etree, which supports XPath:

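>>> from lxml import etree
>>> # the feed's elements are in the RSS 1.0 namespace, so the path needs a prefix
>>> ns = {"rss": "http://purl.org/rss/1.0/"}
>>> find_channel = etree.XPath("//rss:channel[rss:title = $title]", namespaces=ns)
>>> tree = etree.parse("http://somewhere/the_document.xml")
>>> channels = find_channel(tree, title="example title")  # returns a list
>>> print channels[0].findtext("{http://purl.org/rss/1.0/}link")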


or lxml.objectify:

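>>> from lxml import etree, objectify
>>> # the feed's elements are in the RSS 1.0 namespace, so the path needs a prefix
>>> ns = {"rss": "http://purl.org/rss/1.0/"}
>>> find_channel = etree.XPath("//rss:channel[rss:title = $title]", namespaces=ns)
>>> tree = objectify.parse("http://somewhere/the_document.xml")
>>> channel = find_channel(tree, title="example title")[0]  # first match
>>> print channel.title, channel.link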


http://codespeak.net/lxml

Stefan
