Making LWP::RobotUA cache robots.txt files

pwaring

I'm working on a script that fetches URLs extracted from a database on
a regular basis, and I'd like to make sure it complies with any
robots.txt rules at the sites it pulls data from. I'm using an instance
of LWP::RobotUA to do this, and it seems to work correctly (in that it
doesn't fetch files that are blocked by robots.txt), but it also
fetches the robots.txt file every time I make a request to that host.
Seeing as the whole point of checking the file is to find out whether
my crawler is welcome (and these files don't change often, so there's
little point in fetching them on every request), I'd prefer to fetch it
only every 48 hours or so. Is there an easy way to make the module
cache the files to disk instead of fetching a new copy every time, or
should I be using a different module to achieve the same effect? I know
how to write the code to do all the legwork myself, but I'd rather not
if someone else has already done it.

If it helps, the code I'm using is as follows (stripped of all the use
statements and other fluff):

my $crawler = LWP::RobotUA->new('feedread', '(e-mail address removed)');
$crawler->delay(1);
$crawler->max_redirect(5);
$crawler->protocols_allowed(['http']);

my $response = $crawler->get('http://www.feedread.org/test.html');
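
One approach I've been wondering about is passing a
WWW::RobotRules::AnyDBM_File object as the third argument to
LWP::RobotUA->new, so the parsed rules are stored in a DBM file on disk
between runs rather than only in memory. A rough, untested sketch is
below (the cache filename 'robots-cache' is just a placeholder I made
up):

use strict;
use warnings;

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;

# Persistent robots.txt rules store backed by an on-disk DBM file,
# so the parsed rules survive between runs of the script instead of
# being re-fetched on every invocation.
my $rules = WWW::RobotRules::AnyDBM_File->new('feedread', 'robots-cache');

my $crawler = LWP::RobotUA->new('feedread', '(e-mail address removed)', $rules);
$crawler->delay(1);
$crawler->max_redirect(5);
$crawler->protocols_allowed(['http']);

my $response = $crawler->get('http://www.feedread.org/test.html');

From what I can tell, the rules object records a freshness time for
each host's robots.txt, so with a persistent store it should only go
back to the server once the cached copy has gone stale, though I
haven't confirmed whether that lifetime can be tuned to 48 hours.
Would this be the right way to go about it?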

Thanks in advance,

Paul
 
