Parsing XML RSS feed byte stream for <item> tag

darrel.rendell · Feb 7, 2013

I'm attempting to parse an RSS feed for the first instance of an <item> element.

import urllib2

def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.URLError, e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    except urllib2.HTTPError, e:
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    else:
        outputPage = readPage.read()
        return outputPage

Assume the arguments being passed are correct. The function returns a str object containing the entire RSS feed; I've confirmed the type with:

a = isinstance(value, str)
if not a:
    return -1

So the function call returns an entire RSS feed, and this is where I hit a brick wall. I've tried parsing the feed with BeautifulSoup, lxml and various other libs, with no success. (I had some success with BeautifulSoup, but it wasn't able to pull certain child elements from the parent, for example <pubDate> and <link>.) I'm just about ready to resort to writing my own parser, but I'd like to know if anybody has any suggestions.

To recreate my error, simply call the above function with an argument similar to:

http://www.cert.org/nav/cert_announcements.rss

You'll see I'm trying to return the first child.

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_t...insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

As I've said, BeautifulSoup fails to find both pubDate and Link, which are crucial to my app.

Any advice would be greatly appreciated.

John Gordon · Feb 7, 2013

darrel.rendell said:
def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.URLError, e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    except urllib2.HTTPError, e:
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    else:
        outputPage = readPage.read()
        return outputPage

To recreate my error, simply call the above function with an argument
similar to:

You'll see I'm trying to return the first child.

The above code produces no output at all. The pageReader() function is
defined but never called.

If we add a few lines at the bottom:

if __name__ == '__main__':
    print pageReader('http://www.cert.org/nav/cert_announcements.rss')

Then we get some output:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">

<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<language>en-us</language>
<description>Announcements: What's New on the CERT web site</description>

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_t...insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

....

As I've said, BeautifulSoup fails to find both pubDate and Link, which are
crucial to my app.

Any advice would be greatly appreciated.

You haven't included the BeautifulSoup code which attempts to parse the XML,
so it's impossible to say exactly what the error is.

However, I have a guess: you said you're trying to return the first
child. Based on the above output, the first child is the <channel>
element, not an <item> element. Perhaps that's the issue?
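
To illustrate the point about <channel> versus <item>, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is not the original poster's code: SAMPLE_FEED is a made-up, trimmed-down stand-in for the string that pageReader() returns, and first_item is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed-down stand-in for the feed text returned by pageReader().
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>CERT Announcements</title>
<link>http://www.cert.org/nav/whatsnew.html</link>
<item>
<title>First entry</title>
<link>http://example.com/first</link>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>
<item>
<title>Second entry</title>
<link>http://example.com/second</link>
<pubDate>Tue, 05 Feb 2013 09:00:00 -0500</pubDate>
</item>
</channel>
</rss>
"""


def first_item(feed_text):
    """Return a dict of the first <item>'s child tags and text, or None."""
    # Hand the parser bytes so the encoding declaration in the XML
    # prolog is honoured on both Python 2 and Python 3.
    if not isinstance(feed_text, bytes):
        feed_text = feed_text.encode('utf-8')
    root = ET.fromstring(feed_text)      # root is the <rss> element
    # The first child of <rss> is <channel>; the items live inside it.
    item = root.find('channel/item')     # first matching <item>, if any
    if item is None:
        return None
    return dict((child.tag, child.text) for child in item)


if __name__ == '__main__':
    item = first_item(SAMPLE_FEED)
    print(item['link'])
    print(item['pubDate'])
```

ElementTree keeps tag names case-sensitive, so the lookup key has to be 'pubDate' exactly as it is spelled in the feed.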

xDog Walker · Feb 8, 2013

As I've said, BeautifulSoup fails to find both pubDate and Link, which are
crucial to my app
Any advice would be greatly appreciated.

http://packages.python.org/feedparser
