pwaring
I'm working on a script that fetches URLs extracted from a database on
a regular basis, and I'd like to make sure that it complies with any
robots.txt rules at the sites it pulls data from. I'm using an instance
of LWP::RobotUA to do this, and it seems to work correctly (in that it
doesn't fetch files if they're blocked by robots.txt) but it also
fetches the robots.txt file every time I make a request to that host.
Seeing as the whole idea of checking the file is to see whether my
crawler is wanted or not (plus these files don't change that often so
there's not much point in fetching them every time), I'd prefer to only
fetch it every 48 hours or so. Is there an easy way to make the module
I'm using cache the files to disk instead of fetching a new copy
every time, or should I be using a different module to achieve the same
effect? I know how to write the code to do all the legwork myself, but
I'd rather not if someone else has already done the same thing.
If it helps, the code I'm using is as follows (with the use statements
and other fluff snipped):
my $crawler = LWP::RobotUA->new('feedread', '(e-mail address removed)');
$crawler->delay(1);
$crawler->max_redirect(5);
$crawler->protocols_allowed(['http']);
my $response = $crawler->get('http://www.feedread.org/test.html');
Thanks in advance,
Paul