Using LWP to get last modified date of web page

Arvin Portlock

I wrote the following program, which works great:

use LWP::UserAgent;
my $agent = new LWP::UserAgent;
my $response = $agent->head('http://www.perl.org/about.html');
print $response->last_modified;

But most other URLs I try don't return a modified date.
Is this because most servers aren't accepting HEAD requests
anymore? There has to be a way to get the last-modified date
for a web page. We have crawlers and spiders that do it all
the time. Can anybody suggest a way to do this? I'm particularly
concerned about not downloading entire pages but just
getting the sizes, and I'm not sure how to do this with LWP.

Thanks for your help
 
Joe Smith

Arvin said:
AP> I write the following program which works great:
AP>
AP> use LWP::UserAgent;
AP> my $agent = new LWP::UserAgent;
AP> my $response = $agent->head('http://www.perl.org/about.html');
AP> print $response->last_modified;
AP>
AP> But most other URLs I try don't return a modified date.

That's up to the web page author; it is out of your control.

AP> Is this because most servers aren't accepting HEAD requests anymore?

A lot of dynamically generated pages (such as ones with banner ads
or table-of-contents links) have no modified date.

AP> There has to be a way to get the last modified date
AP> for a web page. We have crawlers and spiders that do it all
AP> the time. ... not downloading entire pages but just
AP> getting the sizes and I'm not sure how to do this with LWP.

Spiders and crawlers download the entire page.
In some cases, the date is the last time it went looking
instead of the file's date.
-Joe
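
Since a missing Last-Modified header is normal for dynamic pages, it may help to check for it explicitly rather than printing whatever comes back. The following is only a minimal sketch: the helper name describe_head and the command-line usage are made up for illustration, and it assumes LWP::UserAgent and HTTP::Date are installed.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date qw(time2str);

# Hypothetical helper: summarize the headers of an HTTP::Response.
# last_modified and size are undef when the server did not send the
# corresponding header (common for dynamically generated pages).
sub describe_head {
    my ($response) = @_;
    return {
        ok            => ($response->is_success ? 1 : 0),
        status        => scalar $response->status_line,
        last_modified => scalar $response->last_modified,   # epoch seconds or undef
        size          => scalar $response->content_length,  # bytes or undef
    };
}

# Usage: perl head_info.pl URL [URL ...]
if (@ARGV) {
    my $agent = LWP::UserAgent->new(timeout => 15);
    for my $url (@ARGV) {
        # HEAD asks for headers only, so the page body is not downloaded.
        my $info = describe_head($agent->head($url));
        my $when = defined $info->{last_modified}
            ? time2str($info->{last_modified})
            : '(no Last-Modified header)';
        my $size = defined $info->{size} ? $info->{size} : '(unknown)';
        print "$url\n  status:   $info->{status}\n",
              "  modified: $when\n  size:     $size\n";
    }
}
```

Note that last_modified returns seconds since the epoch, which is why the sketch formats it with time2str before printing.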
 
Andrew Bryson

JS> Spiders and crawlers download the entire page.
JS> In some cases, the date is the last time it went looking
JS> instead of the file's date.

I would hope that in all cases they would use the time they last
went looking... otherwise they would not be accurate at all.
 
Uri Guttman

huh? some spiders do a HEAD first to avoid downloading a page that
hasn't changed. you compare the date from the HEAD with the date of the
file the last time you downloaded it. you have to assume the web server
is returning accurate and proper file timestamps.

AB> I would hope that in all cases they would use the time they last
AB> went looking... otherwise they would not be accurate at all.

not last looking but last modified of the fetched page. there is an
If-Modified-Since header in http requests which also does this. you get a
304 not modified result instead of the page if it hasn't changed since you
last got it. again, this needs the server to behave nicely and not all do.

uri
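
The conditional request described above can be sketched with LWP roughly as follows. This is an illustration, not a definitive implementation: the sub name fetch_if_modified and its return values are invented here, and it assumes the server honors If-Modified-Since, which (as noted) not all do.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date qw(time2str);

# Hypothetical helper: GET $url only if it changed since $last_seen
# (epoch seconds). Returns ('unchanged') on a 304 response,
# ('changed', $response) on success, and ('error', $response) otherwise.
sub fetch_if_modified {
    my ($agent, $url, $last_seen) = @_;
    my $response = $agent->get($url,
        'If-Modified-Since' => time2str($last_seen));
    return ('unchanged')          if $response->code == 304;
    return ('changed', $response) if $response->is_success;
    return ('error', $response);
}

# Usage: perl conditional_get.pl URL EPOCH_SECONDS
if (@ARGV == 2) {
    my ($url, $last_seen) = @ARGV;
    my $agent = LWP::UserAgent->new(timeout => 15);
    my ($state, $response) = fetch_if_modified($agent, $url, $last_seen);
    my $detail = defined $response ? ' (' . $response->status_line . ')' : '';
    print "$url: $state$detail\n";
}
```

If the page is being saved to a local file anyway, LWP::UserAgent's mirror method does much the same thing, using the file's modification time for the If-Modified-Since header.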
 
Richard Bell

UG> huh? some spiders do a HEAD first to avoid downloading a page that
UG> hasn't changed. you compare the date from the HEAD with the date of the
UG> file the last time you downloaded it. you have to assume the web server
UG> is returning accurate and proper file timestamps.
UG>
UG> AB> I would hope that in all cases they would use the time they last
UG> AB> went looking... otherwise they would not be accurate at all.
UG>
UG> not last looking but last modified of the fetched page. there is a last
UG> modified field in http requests which also does this. you get a not
UG> modified result instead of the page if it hasn't changed since you last
UG> got it. again, this needs the server to behave nicely and not all do.
UG>
UG> uri

In my experience (currently working with about 6000 sites), HEAD and
related mechanisms are ignored about as often as they are honored. There
are some sites that respond with what is by all appearances good data.
Others do not respond at all. Still others respond with what is clearly
bogus data (sometimes synthesized, other times just plain lies).
While your mileage may vary, in my experience, if you want to know
what's on a page today, download it today.
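
If the headers can't be trusted, one way to act on "download it today" is to fetch the page and compare a digest of its body with the digest saved from the previous fetch. This is a rough sketch under that assumption; the sub name content_changed and the choice of MD5 are illustrative, not something anyone in this thread prescribed.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: compare the body of an HTTP::Response with a
# digest saved from an earlier fetch. Returns (changed flag, new digest).
# With no earlier digest, the page is reported as changed.
sub content_changed {
    my ($response, $old_digest) = @_;
    my $digest  = md5_hex($response->content);   # digest of the raw body bytes
    my $changed = (!defined $old_digest || $digest ne $old_digest) ? 1 : 0;
    return ($changed, $digest);
}

# Usage: perl page_digest.pl URL [OLD_DIGEST]
if (@ARGV) {
    my ($url, $old_digest) = @ARGV;
    my $response = LWP::UserAgent->new(timeout => 15)->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    my ($changed, $digest) = content_changed($response, $old_digest);
    print(($changed ? 'changed' : 'unchanged'), " $digest\n");
}
```

One caveat: pages with rotating banner ads or embedded timestamps will produce a different digest on every fetch, so this only says that the bytes differ, not that the content meaningfully changed.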