Making LWP::RobotUA cache robots.txt files

pwaring · Dec 22, 2005

I'm working on a script that regularly fetches URLs extracted from a
database, and I'd like to make sure that it complies with any
robots.txt rules at the sites it pulls data from. I'm using an instance
of LWP::RobotUA to do this, and it seems to work correctly (in that it
doesn't fetch files if they're blocked by robots.txt), but it also
fetches the robots.txt file every time I make a request to that host.

The whole idea of checking the file is to see whether my crawler is
wanted or not, and these files don't change that often, so there's not
much point in fetching one on every request. I'd prefer to fetch it
only every 48 hours or so.

Is there an easy way to make the module I'm using cache the files to
disk instead of fetching a new copy every time, or should I be using a
different module to achieve the same effect? I know how to write the
code to do all the legwork myself, but I'd rather not if someone else
has already done the same thing.

If it helps, the code I'm using is as follows (with the use statements
and other fluff snipped):

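# Robot-aware user agent: agent name first, then contact address
my $crawler = LWP::RobotUA->new('feedread', '(e-mail address removed)');
$crawler->delay(1);                     # LWP::RobotUA delay is in minutes
$crawler->max_redirect(5);              # follow at most 5 redirects
$crawler->protocols_allowed(['http']);  # plain HTTP only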

my $response = $crawler->get('http://www.feedread.org/test.html');

Thanks in advance,

Paul
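
One possible approach, offered as a sketch under assumptions rather than
anything confirmed in this thread: LWP::RobotUA accepts a rules object as
an optional third constructor argument, and WWW::RobotRules::AnyDBM_File
keeps parsed robots.txt rules in a DBM file on disk, so they can survive
between runs of the script. The subclass name Feedread::Rules48h, the
cache path, and the contact address below are all hypothetical. Forcing a
fixed 48-hour lifetime by overriding parse() is also an assumption about
what is wanted here, not documented behaviour of either module.

```perl
use strict;
use warnings;
use File::Spec;
use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;

# Hypothetical subclass: store every robots.txt for a fixed 48 hours,
# ignoring whatever freshness time the caller passes in.
{
    package Feedread::Rules48h;
    use parent -norequire, 'WWW::RobotRules::AnyDBM_File';

    my $MAX_AGE = 48 * 60 * 60;    # 48 hours, in seconds

    sub parse {
        my ($self, $robots_url, $text, $fresh_until) = @_;
        $fresh_until = time + $MAX_AGE;
        return $self->SUPER::parse($robots_url, $text, $fresh_until);
    }
}

# Hypothetical cache location; the DBM backend may append its own suffix.
my $cache_file = File::Spec->catfile(File::Spec->tmpdir, 'feedread-robots');

# Rules are read from and written to the DBM file, not just kept in memory.
my $rules = Feedread::Rules48h->new('feedread', $cache_file);

# Same settings as the original snippet, with the rules object added.
# 'crawler@example.com' is a placeholder contact address.
my $crawler = LWP::RobotUA->new('feedread', 'crawler@example.com', $rules);
$crawler->delay(1);
$crawler->max_redirect(5);
$crawler->protocols_allowed(['http']);
```

Requests would then go through $crawler->get(...) exactly as before.
With this setup robots.txt for a host should only be fetched again once
its stored expiry time has passed, even if the script exits and is
started again in between.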
