LWP::UserAgent and 404 page not found

Discussion in 'Perl Misc' started by P.R.Brady, Jun 22, 2005.

  1. P.R.Brady

    P.R.Brady Guest

    I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
    crawler, but there's a page I just can't read -
    http://www.psychology.bangor.ac.uk/ gives '404 not found'. It is
    similarly inaccessible for many of the web checkers out there (like
    http://validator.w3.org/) but is okay with 'real' browsers like Internet
    Explorer and Netscape.
    There's a redirection there somewhere behind the scenes to index.php
    (which can be read), but then that is so for our main web page
    http://www.bangor.ac.uk/ as well and that redirects okay.

    I suppose the problem is not understanding how redirection takes place.
    Is it a server issue? Do the regular browsers 'guess' at filenames if
    none are given? Is there some browser/server negotiation which is not
    being implemented?

    An extract from the code which exhibits the symptoms is below (but note
    the folding of the 'my $referer' line!)

    I'd appreciate any help you can give - I've drawn blanks elsewhere!

    Regards
    Phil



    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Response;
    use HTML::TokeParser;

    #the page which refers to the culprit:
    my $referer = 'http://www.bangor.ac.uk/corporate/informationabout/depts.php';

    #the inaccessible page
    my $url='http://www.psychology.bangor.ac.uk/';

    #but these are okay
    # $url='http://www.informatics.bangor.ac.uk/';
    # $url='http://www.psychology.bangor.ac.uk/index.php';
    # $url='http://www.bangor.ac.uk/';

    #open the browser

    my $browser = LWP::UserAgent->new;
    $browser->timeout(30);

    #try to get the page

    my $response = $browser->get($url, Referer => $referer);
    print "Response $response\n";

    my $status= $response->status_line;
    ($status) = split(' ',$status.' ');
    print "Status_line $status\n";

    exit;
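
    A side note on the status handling above: HTTP::Response also exposes the numeric code directly through its code method, so the split on status_line is not strictly needed. The sketch below builds a 404 response by hand (a stand-in for what $browser->get would return, chosen here so it runs without network access) and reads the same information back.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for the object returned by $browser->get($url, ...);
# constructed by hand so the snippet needs no network access.
my $response = HTTP::Response->new(404, 'Not Found');

my $status_line = $response->status_line;   # code plus message
my $code        = $response->code;          # numeric code only

print "Status line: $status_line\n";
print "Code: $code\n";
print "Request failed\n" unless $response->is_success;
```

    With a real response object the same three calls apply unchanged.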
     
    P.R.Brady, Jun 22, 2005
    #1

  2. Brian Wakem

    Brian Wakem Guest

    P.R.Brady wrote:

    > I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
    > crawler, but there's a page I just can't read -
    > http://www.psychology.bangor.ac.uk/ gives '404 not found'. It is
    > similarly inaccessible for many of the web checkers out there (like
    > http://validator.w3.org/) but is okay with 'real' browsers like Internet
    > Explorer and Netscape.
    > There's a redirection there somewhere behind the scenes to index.php
    > (which can be read), but then that is so for our main web page
    > http://www.bangor.ac.uk/ as well and that redirects okay.
    >
    > I suppose the problem is not understanding how redirection takes place.
    > Is it a server issue? Do the regular browsers 'guess' at filenames if
    > none are given? Is there some browser/server negotiation which is not
    > being implemented?
    >
    > An extract from the code which exhibits the symptoms is below (but note
    > the folding of the 'my $referer' line!)
    >
    > I'd appreciate any help you can give - I've drawn blanks elsewhere!
    >
    > Regards
    > Phil
    >
    > my $response = $browser->get($url, Referer => $referer);



    They seem to be doing a redirect based upon the language that your browser
    declares itself to accept. As you aren't doing this you get an error page.


    Try:-

    my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
    'en');
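
    For a crawler that makes many requests, the header can also be set once on the user agent instead of on every get call. This is only a sketch, and it assumes a libwww-perl release that provides default_header, which the version bundled with an older ActivePerl may not.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(30);

# Sent with every request made through $browser from now on.
$browser->default_header('Accept-Language' => 'en');

# Read the stored value back to confirm what will be sent.
my $language = $browser->default_header('Accept-Language');
print "Default Accept-Language: $language\n";
```

    A header passed explicitly to get for a single request should still take precedence over the default.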


    --
    Brian Wakem
     
    Brian Wakem, Jun 22, 2005
    #2

  3. P.R.Brady

    P.R.Brady Guest

    Brian Wakem wrote:
    > P.R.Brady wrote:
    >
    >
    >>I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
    >>crawler, but there's a page I just can't read -
    >>http://www.psychology.bangor.ac.uk/ gives '404 not found'. It is
    >>similarly inaccessible for many of the web checkers out there (like
    >>http://validator.w3.org/) but is okay with 'real' browsers like Internet
    >>Explorer and Netscape.
    >>There's a redirection there somewhere behind the scenes to index.php
    >>(which can be read), but then that is so for our main web page
    >>http://www.bangor.ac.uk/ as well and that redirects okay.
    >>



    [ ... snipped ...]

    >
    > They seem to be doing a redirect based upon the language that your browser
    > declares itself to accept. As you aren't doing this you get an error page.
    >
    > Try:-
    >
    > my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
    > 'en');
    >


    Thanks Brian, that certainly works. Much appreciated.

    Now do I have to alter my crawler to scan pages twice I wonder, once for
    English, once for Welsh?
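
    If both language versions do turn out to be needed, one possible shape is a loop that builds one request per language. The sketch below only constructs the requests and does not send them. The codes 'en' and 'cy' are the ISO 639-1 codes for English and Welsh; whether the server answers 'cy' with the Welsh pages is an assumption that would need checking.

```perl
use strict;
use warnings;
use HTTP::Request;

my $url = 'http://www.psychology.bangor.ac.uk/';

# One request per language: 'en' = English, 'cy' = Welsh.
my @requests;
for my $lang (qw(en cy)) {
    my $request = HTTP::Request->new(GET => $url);
    $request->header('Accept-Language' => $lang);
    push @requests, $request;

    print $request->method, ' ', $request->uri,
          ' Accept-Language: ', $request->header('Accept-Language'), "\n";
}

# Each element of @requests could then be passed to $browser->request(...).
```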

    Phil
     
    P.R.Brady, Jun 23, 2005
    #3
  4. Sherm Pendley

    "P.R.Brady" writes:

    > Brian Wakem wrote:
    >> P.R.Brady wrote:
    >>
    >>>I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
    >>>crawler, but there's a page I just can't read -

    >
    > [ ... snip ...]
    >
    >> Try:-
    >> my $response = $browser->get($url, Referer => $referer,
    >> ACCEPT_LANGUAGE =>
    >> 'en');
    >>

    >
    > Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
    > words, but to what? The UserAgent? HTTP protocol?


    HTTP. Here's a reference:

    <http://www.w3.org/Protocols/rfc2616/rfc2616.html>
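
    To connect this with the code earlier in the thread: the key/value pairs passed to get are HTTP request header fields, and the generic term to search for is "HTTP request headers". As far as I know, HTTP::Headers translates underscores in field names to dashes by default, which would explain why ACCEPT_LANGUAGE ends up as the Accept-Language header. A small sketch that needs no network access:

```perl
use strict;
use warnings;
use HTTP::Headers;

# Same pairs as in the get() call from the thread.
my $headers = HTTP::Headers->new(
    Referer         => 'http://www.bangor.ac.uk/corporate/informationabout/depts.php',
    ACCEPT_LANGUAGE => 'en',
);

# Look the field up under its standard HTTP name.
my $language = $headers->header('Accept-Language');
print "Accept-Language: $language\n";

# Show all fields as they would appear in a request.
print $headers->as_string;
```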

    sherm--
     
    Sherm Pendley, Jun 24, 2005
    #4
  5. P.R.Brady

    P.R.Brady Guest

    Brian Wakem wrote:
    > P.R.Brady wrote:
    >
    >
    >>I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
    >>crawler, but there's a page I just can't read -


    [ ... snip ...]

    >
    > Try:-
    >
    > my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
    > 'en');
    >



    Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
    words, but to what? The UserAgent? HTTP protocol?
    Where are they listed and defined, or what are they called generically
    so I can google them?

    Phil
     
    P.R.Brady, Jun 24, 2005
    #5
