Annoying feedparser issues

John Nagle

This really isn't the fault of the "feedparser" module, but it's
worth mentioning.

I have an application which needs to read each new item from a feed
as it shows up, as efficiently as possible, because it's monitoring multiple
feeds. I want exactly one copy of each item as it comes in.

In theory, this is easy. Each time the feed is polled, pass in the
Last-Modified timestamp and ETag from the previous poll, and if nothing
has changed, a 304 status should come back.
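
For reference, a minimal sketch of that polling scheme, assuming feedparser's
documented `etag` and `modified` keyword arguments to `feedparser.parse`. The
`poll_feed` helper, the `cache` dict and the injectable `parse` argument are
made up for illustration and are not part of feedparser.

```python
def poll_feed(url, cache, parse=None):
    """Poll one feed; return its entries, or [] if the server answers 304.

    `cache` is a plain dict that remembers the validators per URL.
    `parse` defaults to feedparser.parse; it is injectable so the logic
    can be exercised without network access.
    """
    if parse is None:
        import feedparser  # third-party: pip install feedparser
        parse = feedparser.parse

    previous = cache.get(url, {})
    result = parse(url,
                   etag=previous.get("etag"),
                   modified=previous.get("modified"))

    # 304 Not Modified: nothing new since the validators we sent.
    if result.get("status") == 304:
        return []

    # Remember whatever validators the server returned for the next poll.
    cache[url] = {
        "etag": result.get("etag"),
        "modified": result.get("modified"),
    }
    return list(result.get("entries", []))
```

This only saves work when the server honors conditional requests. A server
that changes its validators on every response will never return 304.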

Results are spotty. It mostly works for Reuters. It doesn't work
for Twitter at all; Twitter updates the timestamp even when nothing changes.
So items are routinely re-read. (That has to be costing Twitter a huge
amount of bandwidth from useless polls.)

Some sites have changing feed etags because they're using multiple
servers and a load balancer. These can be recognized because the same
etags will show up again after a change.
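
One rough way to spot that pattern is to keep a short history of etags per
feed and flag the feed when an older etag comes back after a different one.
The `EtagHistory` class and its window size are hypothetical, and this is a
heuristic sketch rather than a reliable detector.

```python
from collections import deque


class EtagHistory:
    """Remember the last few etags seen for one feed."""

    def __init__(self, maxlen=16):
        # Bounded window; 16 is an arbitrary choice for illustration.
        self.recent = deque(maxlen=maxlen)

    def observe(self, etag):
        """Record an etag; return True if it reappears after a different one.

        A reappearing etag suggests several servers behind a load balancer,
        each issuing its own etag for the same content.
        """
        reappeared = etag in self.recent and self.recent[-1] != etag
        self.recent.append(etag)
        return reappeared
```

A feed flagged this way cannot be trusted to return 304, so its items still
have to be deduplicated some other way.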

Items can supposedly be deduplicated by using each item's "id" value.
This almost works, but it's trickier than one might think. On some feeds,
an item might go away, yet come back in a later poll. This happens with
news feeds from major news sources, because they have priorities that
don't show up in RSS. High priority stories might push a low priority story
off the feed, but it may come back later. Also, every night at 00:00, some
feeds like Reuters re-number everything. The only thing that works reliably
is comparing the story text.

John Nagle
 
J Kenneth King

I can't really offer much help, but I feel your pain. I had to write
something of a similar system for a large company once and it hurt.
They mixed different formats with different protocols and man it was
something in the end. The law of fuzzy inputs makes this stuff tough.

It may help to create a hash from the first x bytes of the
article text. Then cache all the hashes in a local dbm-style
database. We used Berkeley DB, but it doesn't really matter. Whatever
way you can generate and store a keyed signature will allow you to do
a quick lookup and see if you've already processed that article.
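
A minimal sketch of that idea using Python's standard `dbm` and `hashlib`
modules instead of Berkeley DB. The prefix length, the whitespace
normalization, the choice of SHA-256 and the function names are all
assumptions made for illustration.

```python
import dbm
import hashlib

PREFIX_BYTES = 512  # assumed value for "the first x bytes"


def article_key(text, prefix_bytes=PREFIX_BYTES):
    """Hash the first `prefix_bytes` bytes of whitespace-normalized text."""
    normalized = " ".join(text.split())
    prefix = normalized.encode("utf-8")[:prefix_bytes]
    return hashlib.sha256(prefix).hexdigest().encode("ascii")


def is_new(store, text):
    """Return True the first time an article is seen, and record it."""
    key = article_key(text)
    if key in store:
        return False
    store[key] = b"1"
    return True


def open_store(path):
    """Open (or create) the on-disk signature database."""
    return dbm.open(path, "c")
```

`is_new` works with anything that behaves like a mapping, so a plain dict is
enough for experiments and `open_store("seen_articles")` gives the persistent
version. Because the key comes from the text itself, a story that drops off
the feed and returns later, or gets a new item id overnight, is still
recognized as already seen.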
 
