Annoying feedparser issues

John Nagle · May 16, 2009

This really isn't the fault of the "feedparser" module, but it's
worth mentioning.

I have an application which needs to read each new item from a feed
as it shows up, as efficiently as possible, because it's monitoring multiple
feeds. I want exactly one copy of each item as it comes in.

In theory, this is easy. Each time the feed is polled, pass in the
timestamp and ID from the previous poll, and if nothing has changed,
a 304 status should come back.
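
For reference, the polling scheme described above might look roughly like this using feedparser's `etag` and `modified` arguments. This is only a sketch: `poll_feed`, the `state` dict, and the injectable `parse` argument are names made up for illustration and are not part of feedparser.

```python
def poll_feed(url, state, parse=None):
    """Poll one feed with a conditional GET.

    `state` holds the "etag" and "modified" values saved from the previous
    poll (hypothetical layout). Returns (entries, new_state); entries is
    empty when the server answers 304 Not Modified.
    """
    if parse is None:
        # Third-party module, imported lazily so the sketch loads without it.
        import feedparser
        parse = feedparser.parse

    # feedparser sends these back as If-None-Match / If-Modified-Since.
    result = parse(url, etag=state.get("etag"), modified=state.get("modified"))

    if result.get("status") == 304:
        # Nothing changed according to the server; keep the old validators.
        return [], state

    new_state = {
        "etag": result.get("etag"),
        "modified": result.get("modified"),
    }
    return list(result.get("entries", [])), new_state
```

A caller would keep one `state` dict per feed and pass it back in on the next poll. As the rest of this post explains, a non-304 reply does not guarantee that any of the returned entries are actually new.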

Results are spotty. It mostly works for Reuters. It doesn't work
for Twitter at all; Twitter updates the timestamp even when nothing changes.
So items are routinely re-read. (That has to be costing Twitter a huge
amount of bandwidth from useless polls.)

Some sites have changing feed etags because they're using multiple
servers and a load balancer. These can be recognized because the same
etags will show up again after a change.
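
One way to spot that pattern is to remember every etag a feed has returned and flag the ones that come back. The class below is a hypothetical helper, not anything feedparser provides, and it assumes a recurring etag means "another server behind the balancer" rather than the feed reverting to older content.

```python
class EtagHistory:
    """Tracks the etags seen for one feed (illustrative helper)."""

    def __init__(self):
        self.seen = set()    # every etag this feed has returned so far
        self.current = None  # etag from the most recent poll

    def observe(self, etag):
        """Classify the etag from the latest poll.

        Returns "same" if it matches the previous poll, "rotated" if it
        differs but was seen earlier (likely a different backend server),
        and "new" if it has never been seen before.
        """
        if etag == self.current:
            return "same"
        verdict = "rotated" if etag in self.seen else "new"
        self.seen.add(etag)
        self.current = etag
        return verdict
```

The `seen` set grows without bound in this sketch; a long-running monitor would probably want to cap it.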

Items can supposedly be de-duplicated by using the "etag" value.
This almost works, but it's trickier than one might think. On some feeds,
an item might go away, yet come back in a later feed. This happens with
news feeds from major news sources, because they have priorities that
don't show up in RSS. High priority stories might push a low priority story
off the feed, but it may come back later. Also, every night at 00:00, some
feeds like Reuters re-number everything. The only thing that works reliably
is comparing the story text.

John Nagle

J Kenneth King · May 19, 2009

I can't really offer much help, but I feel your pain. I had to write
something of a similar system for a large company once and it hurt.
They mixed different formats with different protocols and man it was
something in the end. The law of fuzzy inputs makes this stuff tough.

It may help to create a hash from the first N bytes of the
article text. Then cache all the hashes in a local dbm-style
database. We used Berkeley DB, but it doesn't really matter. Whatever
way you can generate and store a keyed signature will allow you to do
a quick lookup and see if you've already processed that article.
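
Something along these lines, using only the standard library. The function names, the 512-byte prefix, the whitespace normalization, and the choice of SHA-1 are all assumptions for the sake of the example, not what we actually ran.

```python
import dbm
import hashlib


def article_signature(text, prefix_bytes=512):
    """Hash the first `prefix_bytes` bytes of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    data = normalized.encode("utf-8")[:prefix_bytes]
    return hashlib.sha1(data).hexdigest()


def open_seen_db(path):
    """Open (or create) the local dbm file that stores the signatures."""
    return dbm.open(path, "c")


def is_new_article(db, text, prefix_bytes=512):
    """Return True the first time an article is seen, and record it.

    `db` can be a dbm handle or any dict-like mapping with bytes keys.
    """
    key = article_signature(text, prefix_bytes).encode("ascii")
    if key in db:
        return False
    db[key] = b"1"
    return True
```

In use, you would call `open_seen_db("seen_articles")` once at startup and run every incoming entry's text through `is_new_article` before processing it. Hashing only a prefix keeps it cheap, but two stories that share the same opening text would collide, so pick the prefix length with your feeds in mind.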
