Using LWP to get last modified date of web page

Discussion in 'Perl Misc' started by Arvin Portlock, Jun 4, 2004.

  1. I wrote the following program, which works great:

    use LWP::UserAgent;

    # Send a HEAD request and print the Last-Modified time (epoch seconds).
    my $agent = LWP::UserAgent->new;
    my $response = $agent->head('http://www.perl.org/about.html');
    print $response->last_modified, "\n";

    But most other URLs I try don't return a modified date.
    Is this because most servers aren't accepting HEAD requests
    anymore? There has to be a way to get the last modified date
    for a web page. We have crawlers and spiders that do it all
    the time. Can anybody suggest a way to do this? I'm particularly
    concerned about not downloading entire pages but just getting
    the sizes, and I'm not sure how to do this with LWP.

    Thanks for your help
     
    Arvin Portlock, Jun 4, 2004
    #1

  2. Joe Smith (Guest)

    Arvin Portlock wrote:

    > I write the following program which works great:
    >
    > use LWP::UserAgent;
    > my $agent = new LWP::UserAgent;
    > my $response = $agent->head('http://www.perl.org/about.html');
    > print $response->last_modified;
    >
    > But most other URLs I try don't return a modified date.


    That's up to the web page author; it is out of your control.

    > Is this because most servers aren't accepting HEAD requests anymore?


    A lot of dynamically generated pages (such as ones with banner ads
    or table-of-contents links) have no modified date.
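    A missing date can be told apart from a failed request by checking
    whether last_modified returns a defined value. The following is only
    a minimal sketch: the helper name describe_last_modified is made up
    for illustration, and the network part runs only when a URL is given
    on the command line.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Hypothetical helper (not part of LWP): summarize what a response
# says about its modification time.
# last_modified() returns epoch seconds, or undef if the header is absent.
sub describe_last_modified {
    my ($response) = @_;
    return 'request failed: ' . $response->status_line
        unless $response->is_success;
    my $mtime = $response->last_modified;
    return 'no Last-Modified header sent' unless defined $mtime;
    return 'last modified: ' . scalar(gmtime($mtime)) . ' GMT';
}

# Network usage: only runs when a URL is passed as an argument.
if (@ARGV) {
    my $agent = LWP::UserAgent->new(timeout => 10);
    print describe_last_modified($agent->head($ARGV[0])), "\n";
}
```

    A reply of "no Last-Modified header sent" means the server answered
    the HEAD request but chose not to supply a date, which is the case
    described above for dynamically generated pages.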

    > There has to be a way to get the last modified date
    > for a web page. We have crawlers and spiders that do it all
    > the time. ... not downloading entire pages but just
    > getting the sizes and I'm not sure how to do this with LWP.


    Spiders and crawlers download the entire page.
    In some cases, the date is the last time it went looking
    instead of the file's date.
    -Joe
     
    Joe Smith, Jun 5, 2004
    #2

  3. On Sat, 05 Jun 2004 18:26:29 +0000, Joe Smith wrote:

    > Spiders and crawlers download the entire page.
    > In some cases, the date is the last time it went looking
    > instead of the file's date.


    I would hope that in all cases they would use the time they last
    went looking... otherwise they would not be accurate at all.
     
    Andrew Bryson, Jun 7, 2004
    #3
  4. Uri Guttman (Guest)

    >>>>> "AB" == Andrew Bryson writes:

    AB> On Sat, 05 Jun 2004 18:26:29 +0000, Joe Smith wrote:
    >> Spiders and crawlers download the entire page.
    >> In some cases, the date is the last time it went looking
    >> instead of the file's date.


    huh? some spiders do a HEAD first to avoid downloading a page that
    hasn't changed. you compare the date from the HEAD with the date of the
    file the last time you downloaded it. you have to assume the web server
    is returning accurate and proper file timestamps.
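    The HEAD-then-compare approach could be sketched roughly as below.
    This is an illustration under assumptions, not how any particular
    spider does it: the helper name needs_refetch is hypothetical, the
    stored timestamp is assumed to be epoch seconds saved at the last
    download, and a missing or unusable date is treated as "changed".

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Hypothetical helper: decide whether a page should be downloaded again.
# $head_response is the result of a HEAD request; $stored_mtime is the
# Last-Modified value (epoch seconds) recorded at the previous download.
sub needs_refetch {
    my ($head_response, $stored_mtime) = @_;
    my $mtime = $head_response->last_modified;
    # No usable date (failed HEAD or missing header): assume it changed.
    return 1 unless $head_response->is_success && defined $mtime;
    return $mtime > $stored_mtime ? 1 : 0;
}

# Network usage: perl script.pl URL STORED_EPOCH_SECONDS
if (@ARGV == 2) {
    my ($url, $stored_mtime) = @ARGV;
    my $agent = LWP::UserAgent->new(timeout => 10);
    if (needs_refetch($agent->head($url), $stored_mtime)) {
        my $page = $agent->get($url);
        print 'fetched ', length($page->content), " bytes\n";
    } else {
        print "unchanged, skipping download\n";
    }
}
```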

    AB> I would hope that in all cases they would use the time they last
    AB> went looking... otherwise they would not be accurate at all.

    not last looking but last modified of the fetched page. there is an
    If-Modified-Since header in http requests which also does this. you get
    a 304 Not Modified result instead of the page if it hasn't changed since
    you last got it. again, this needs the server to behave nicely and not
    all do.
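    In HTTP terms, the conditional request described here is a GET that
    carries an If-Modified-Since header, answered with status 304 (Not
    Modified) when nothing has changed. A minimal sketch with LWP follows;
    the helper names conditional_headers and is_not_modified are
    hypothetical, and the stored timestamp is assumed to be epoch seconds.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Date qw(time2str);
use HTTP::Response;

# Hypothetical helper: build the extra header for a conditional GET.
# time2str() formats epoch seconds as an HTTP date string.
sub conditional_headers {
    my ($stored_mtime) = @_;
    return ('If-Modified-Since' => time2str($stored_mtime));
}

# Hypothetical helper: true when the server answered 304 Not Modified.
sub is_not_modified {
    my ($response) = @_;
    return $response->code == 304 ? 1 : 0;
}

# Network usage: perl script.pl URL STORED_EPOCH_SECONDS
if (@ARGV == 2) {
    my ($url, $stored_mtime) = @ARGV;
    my $agent    = LWP::UserAgent->new(timeout => 10);
    my $response = $agent->get($url, conditional_headers($stored_mtime));
    if (is_not_modified($response)) {
        print "not modified since last fetch\n";
    } elsif ($response->is_success) {
        print 'changed, got ', length($response->content), " bytes\n";
    } else {
        print 'request failed: ', $response->status_line, "\n";
    }
}
```

    LWP::UserAgent also provides a mirror method that saves a URL to a
    local file and sends If-Modified-Since based on that file's
    timestamp, which may be simpler when the pages are kept on disk.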

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Jun 7, 2004
    #4
  5. Richard Bell (Guest)

    On Mon, 07 Jun 2004 15:40:02 GMT, Uri Guttman wrote:

    >>>>>> "AB" == Andrew Bryson writes:
    >
    > AB> On Sat, 05 Jun 2004 18:26:29 +0000, Joe Smith wrote:
    > >> Spiders and crawlers download the entire page.
    > >> In some cases, the date is the last time it went looking
    > >> instead of the file's date.
    >
    >huh? some spiders do a HEAD first to avoid downloading a page that
    >hasn't changed. you compare the date from the HEAD with the date of the
    >file the last time you downloaded it. you have to assume the web server
    >is returning accurate and proper file timestamps.
    >
    > AB> I would hope that in all cases they would use the time they last
    > AB> went looking... otherwise they would not be accurate at all.
    >
    >not last looking but last modified of the fetched page. there is an
    >If-Modified-Since header in http requests which also does this. you get
    >a 304 Not Modified result instead of the page if it hasn't changed since
    >you last got it. again, this needs the server to behave nicely and not
    >all do.
    >
    >uri


    In my experience (currently working with about 6000 sites), HEAD and
    related mechanisms are ignored about as often as they are honored.
    Some sites respond with what is by all appearances good data. Others
    do not respond at all. Still others respond with what is clearly
    bogus data (sometimes synthesized, other times just plain lies).
    While your mileage may vary, in my experience, if you want to know
    what's on a page today, download it today.
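    If the page is downloaded anyway, one way to detect changes without
    trusting server dates is to compare a digest of the body against the
    digest saved from the previous fetch. This is a swapped-in technique
    and not something the posts above describe; the helper names and the
    choice of MD5 are assumptions made for the sketch.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: hex digest of the raw response body.
sub content_fingerprint {
    my ($response) = @_;
    return md5_hex($response->content);
}

# Hypothetical helper: true when the body differs from what was seen
# before, judged only by comparing digests.
sub content_changed {
    my ($response, $old_fingerprint) = @_;
    return content_fingerprint($response) ne $old_fingerprint ? 1 : 0;
}

# Network usage: perl script.pl URL OLD_FINGERPRINT
if (@ARGV == 2) {
    my ($url, $old_fingerprint) = @ARGV;
    my $agent    = LWP::UserAgent->new(timeout => 10);
    my $response = $agent->get($url);
    if (!$response->is_success) {
        print 'request failed: ', $response->status_line, "\n";
    } elsif (content_changed($response, $old_fingerprint)) {
        print 'changed, new fingerprint ', content_fingerprint($response), "\n";
    } else {
        print "content is identical to the last fetch\n";
    }
}
```

    Pages that embed rotating ads or timestamps will look changed on
    every fetch with this approach, so it only answers "are the bytes
    different", not "did the author edit the page".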
     
    Richard Bell, Jun 8, 2004
    #5