Question on download by LWP

Discussion in 'Perl Misc' started by Wonder, Sep 19, 2006.

  1. Wonder

    Wonder Guest

    Hello,

    I'm trying to use a Perl script to download some HTML files from a
    website, but strangely the response to the fetch is always "404 not
    found", even though I have no problem opening the same URL in IE. The
    same script works for URLs on other websites.

    It seems that the website only allows connections from web browsers and
    not from download tools (it also denies download requests from tools
    like Flashget). Is there any field I need to set in my Perl code to make
    the remote website think the request comes from a web browser? Thanks.

    The following is my code:

    use LWP::UserAgent;
    use HTTP::Response;
    $ua = LWP::UserAgent->new;
    $CurrentURL = "http://DomainName.com/RemoteFileName.html";
    $filename = ".LocalFileName.html";
    $response = $ua->mirror($CurrentURL, $filename);
    die "Can't get $CurrentURL -- ", $response->status_line unless
    $response->is_success;
    Wonder, Sep 19, 2006
    #1

  2. Wonder

    John Bokma Guest

    "Wonder" <> wrote:

    > Hello,
    >
    > I'm trying to use a perl script to download some html files on a
    > website, but it's very strange that the response for the fetch-file is
    > always "404 not found", even though I have no problem to open the same
    > url in IE. And the same script can work for the urls of other
    > websites.
    >
    > It seems that website only allow the connection from web browser, but
    > not download tools ( it also denied the download request of tools like
    > Flashget). Is there any field I need to set up in my perl code to
    > cheat the remote website to make it think this is a request from the
    > web browser? Thanks.
    >
    > The following is my code:
    >
    > use LWP::UserAgent;
    > use HTTP::Response;


    use strict;
    use warnings;

    > $ua = LWP::UserAgent->new;


    You might want to try:

    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/5.0'
    );

    If that gives the same problem, try to use a longer agent, e.g.

    my $ua = LWP::UserAgent->new(
        agent => 'Mozilla/5.0'
               . ' (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)'
               . ' Gecko/20060728 Firefox/1.5.0.6'
    );

    (yes, I am going to update after this post :) ).

    Quite a few sites block well-known "grabbers".
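
To check which User-Agent string a script would send before trying it against a site, the `agent` accessor can be read back without making any request. A minimal sketch; the variable names are illustrative, and the exact default string depends on the installed libwww-perl version:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Default agent: identifies the script as libwww-perl, which some sites block.
my $default_ua = LWP::UserAgent->new;
print "default: ", $default_ua->agent, "\n";

# Browser-like agent, set through the constructor as suggested above.
my $browser_ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
print "custom:  ", $browser_ua->agent, "\n";
```

If the default line starts with "libwww-perl/", the site can trivially tell the script apart from a browser; the custom line shows what it would see instead.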

    > $CurrentURL = "http://DomainName.com/RemoteFileName.html";


    don't use camelcase with Perl (yes, that sounds like a contradiction).
    Also, RemoteFileName.html is not really a filename. Also, use example.com
    for examples; don't invent domain names.

    > $filename = ".LocalFileName.html";
    > $response = $ua->mirror($CurrentURL, $filename);
    > die "Can't get $CurrentURL -- ", $response->status_line unless
    > $response->is_success;


    I would change that into:

    $response->is_success or die .....


    Which you can read as: the response has to be successful or we give up.
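
The "succeed or give up" reading can be tried without a network connection by building a response by hand. This is only a sketch: the 404 response is constructed locally as a stand-in for a real fetch, and the `eval` is there purely so the `die` message can be inspected.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built 404 response stands in for a real fetch (assumption: no network).
my $response = HTTP::Response->new( 404, 'Not Found' );

# eval catches the die so the message can be inspected afterwards.
my $ok = eval {
    $response->is_success
        or die "Can't get page -- ", $response->status_line, "\n";
    1;
};
my $error = $ok ? '' : $@;
print "gave up: $error" if $error;
```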


    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Sep 19, 2006
    #2

  3. Wonder

    David Squire Guest

    John Bokma wrote:

    > don't use camelcase with Perl


    Now, familiar as I am with the doctrinaire nature of this group, this
    seems to me to go too far. Surely folk can have their own naming
    conventions and you (we) can be quiet about that.


    DS
    David Squire, Sep 19, 2006
    #3
  4. Wonder

    John Bokma Guest

    David Squire <> wrote:

    > John Bokma wrote:
    >
    >> don't use camelcase with Perl

    >
    > Now, familiar as I am with the doctrinaire nature of this group, this
    > seems to me to go too far. Surely folk can have their own naming
    > conventions and you (we) can be quiet about that.


    Yup, as they can decide not to use strict & warnings. Yet this group
    insists on some form of readability, and *I* include not using camel case
    with that. If I have to include a disclaimer with each and every sentence
    my replies take too much time. It's free advice, after all, and I assume
    that the reader is a grown up who can ignore advice at will.

    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Sep 19, 2006
    #4
  5. Wonder

    David Squire Guest

    John Bokma wrote:
    > David Squire <> wrote:
    >
    >> John Bokma wrote:
    >>
    >>> don't use camelcase with Perl

    >> Now, familiar as I am with the doctrinaire nature of this group, this
    >> seems to me to go too far. Surely folk can have their own naming
    >> conventions and you (we) can be quiet about that.

    >
    > Yup, as they can decide not to use strict & warnings. Yet this group
    > insists on some form of readability, and *I* include not using camel case
    > with that.


    I can't agree with this. Using strict and warnings is language-specific
    advice that helps folk solve and avoid problems. The readability or not
    of "camelcase" is person-specific and has nothing to do with Perl.

    So, I have no problems with your saying "I don't like camelcase", but
    "don't use camelcase with Perl" to me over-steps the mark.

    I, and no doubt others, participate in multiple Perl projects where we
    don't always have control over naming conventions (how do you feel about
    Hungarian?), and editing that down should surely not be a requirement
    for posting here. Getting a minimal script is a big enough ask for most
    folk.


    DS
    David Squire, Sep 19, 2006
    #5
  6. David Squire <> wrote:
    > John Bokma wrote:
    >> David Squire <> wrote:
    >>
    >>> John Bokma wrote:
    >>>
    >>>> don't use camelcase with Perl
    >>> Now, familiar as I am with the doctrinaire nature of this group, this
    >>> seems to me to go too far. Surely folk can have their own naming
    >>> conventions and you (we) can be quiet about that.



    Except when their convention is to use PERL, perhaps? :)


    > So, I have no problems with your saying "I don't like camelcase", but
    > "don't use camelcase with Perl" to me over-steps the mark.



    He should have probably said:

    people will think less of you if you use camelcase with Perl

    That is not to say that it is right to think that, but I am quite sure
    that my re-phrase is true nonetheless.

    Similar to:

    people will think less of you if you spell it PERL

    They both just tell folks that you don't know the secret handshake.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 19, 2006
    #6
  7. Wonder

    John Bokma Guest

    David Squire <> wrote:

    > John Bokma wrote:


    [..]

    > I can't agree with this. Using strict and warnings is
    > language-specific advice that helps folk solve and avoid problems. The
    > readability or not of "camelcase" is person-specific and has nothing
    > to do with Perl.


    "... use underscores to separate words." Programming Perl 3rd ed, p605
    (Programming with Style).

    "... use underscores to separate words in longer identifiers."
    perldoc perlstyle

    IMO this has everything to do with Perl, for the same reason that using
    camel case in Java has everything to do with Java.


    > So, I have no problems with your saying "I don't like camelcase", but
    > "don't use camelcase with Perl" to me over-steps the mark.


    I just passed on the advice given in Programming Perl ;-)

    > I, and no doubt others, participate in multiple Perl projects where we
    > don't always have control over naming conventions (how do you feel
    > about Hungarian?), and editing that down should surely not be a
    > requirement for posting here.


    For the same reason, not using strict / warnings shouldn't be an issue
    if that isn't causing any problem. Yet "we" insist on it being present.
    Personally I think the posted problem should conform at least to
    perldoc perlstyle.

    I agree with you about sticking with the naming conventions of a given
    project. And for that very reason, I think it's good to follow, in
    Perl-related newsgroups, the naming convention outlined in perldoc
    perlstyle / Programming Perl.

    > Getting a minimal script is a big enough
    > ask for most folk.


    They need help, and ask "us" to answer their question(s). I think it's fair
    to ask them to present their question in a format that's easy to read
    for most people. To me that's following perlstyle as closely as possible.

    Finally, I am not saying you have no point at all: I agree that too much
    nitpicking can be overwhelming for the OP :) But I also think I have a
    point ;-)

    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Sep 19, 2006
    #7
  8. Wonder

    Wonder Guest

    "John Bokma" <> wrote in message
    >
    > You might want to try:
    >
    > my $ua = LWP::UserAgent->new(
    > agent => 'Mozilla/5.0'
    > );
    >
    > If that gives the same problem, try to use a longer agent, e.g.
    >
    > my $ua = LWP::UserAgent->new(
    > agent => 'Mozilla/5.0'
    > . ' (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)'
    > . ' Gecko/20060728 Firefox/1.5.0.6'
    > );
    >
    > (yes, I am going to update after this post :) ).


    Thanks, John. It seems neither method works. I got the same "404 not
    found" error.
    Here is the web page I'd like to access:
    http://0daycheck.eastgame.net/0day/archives/20060901.html

    >
    > Quite some sites block well known "grabbers".
    >
    >> $CurrentURL = "http://DomainName.com/RemoteFileName.html";

    >
    > don't use camelcase with Perl (yes, that sounds like a contradiction).
    > Also, RemoteFileName.html is not really a filename Also, use example.com
    > for examples, don't invent domain names.
    >


    Sorry, I'm new to this group, and I got the camelcase habit from C++
    programming. It's the first time I've heard of the perl_style (shouldn't
    it be like this?). Thank you for letting me know. I'll be very glad to
    conform to it here.

    DS: Thank you too for your understanding.

    Wonder
    Wonder, Sep 19, 2006
    #8
  9. Wonder

    John Bokma Guest

    "Wonder" <> wrote:

    > Thanks, John. It seems both methods don't work. I got the same "404 not
    > found" error.
    > Here is the web page I'd like to access:
    > http://0daycheck.eastgame.net/0day/archives/20060901.html


    http://0daycheck.eastgame.net/0day/archives/20060901.html

    GET /0day/archives/20060901.html HTTP/1.1
    Host: 0daycheck.eastgame.net
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

    HTTP/1.x 404 Not Found
    Server: Zeus/4.3
    Date: Tue, 19 Sep 2006 17:39:10 GMT
    Content-Type: text/html
    X-Cache: MISS from cache.xal.megared.net.mx
    Connection: close


    So, yeah, they return a 404 :) (output via Firefox + live headers
    extension).

    use strict;
    use warnings;

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new();

    my $url = 'http://0daycheck.eastgame.net/0day/archives/20060901.html';
    my $response = $ua->get( $url );

    $response->is_success
        or $response->code == 404 # ignore 404
        or die "Can't get '$url': ", $response->status_line;

    print $response->content;
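
One detail the switch from `mirror` to `get` sidesteps: as far as the LWP::UserAgent documentation describes it, `mirror` only stores the document when the request succeeds, so a body that arrives with a 404 never reaches the local file. The sketch below tries to illustrate this offline with a `file:` URL pointing at a file that does not exist. The temporary paths are made up for the example, and it assumes the installed libwww-perl answers such a URL with a 404.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use LWP::UserAgent;
use URI::file;

# Scratch directory; neither file exists when the script starts.
my $dir     = tempdir( CLEANUP => 1 );
my $missing = File::Spec->catfile( $dir, 'missing.html' );     # never created
my $target  = File::Spec->catfile( $dir, 'local-copy.html' );  # mirror destination

my $ua       = LWP::UserAgent->new;
my $response = $ua->mirror( URI::file->new($missing)->as_string, $target );

print "status: ", $response->status_line, "\n";
print "local copy written: ", ( -e $target ? 'yes' : 'no' ), "\n";
```

With `get`, by contrast, whatever body came back remains available through `$response->content`, whatever the status code.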


    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Sep 19, 2006
    #9
  10. Wonder

    Mumia W. Guest

    On 09/19/2006 10:29 AM, Wonder wrote:
    > "John Bokma" <> wrote in message
    >> You might want to try:
    >>
    >> my $ua = LWP::UserAgent->new(
    >> agent => 'Mozilla/5.0'
    >> );
    >>
    >> If that gives the same problem, try to use a longer agent, e.g.
    >>
    >> my $ua = LWP::UserAgent->new(
    >> agent => 'Mozilla/5.0'
    >> . ' (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)'
    >> . ' Gecko/20060728 Firefox/1.5.0.6'
    >> );
    >>
    >> (yes, I am going to update after this post :) ).

    >
    > Thanks, John. It seems both methods don't work. I got the same "404 not
    > found" error.
    > Here is the web page I'd like to access:
    > http://0daycheck.eastgame.net/0day/archives/20060901.html
    >
    >> Quite some sites block well known "grabbers".
    >>
    >>> $CurrentURL = "http://DomainName.com/RemoteFileName.html";

    >> don't use camelcase with Perl (yes, that sounds like a contradiction).
    >> Also, RemoteFileName.html is not really a filename Also, use example.com
    >> for examples, don't invent domain names.
    >>

    >
    > Sorry, I'm new in this group, and get the habit of camelcase from C++
    > programming. It's the first time I heard of the perl_style (shouldn't
    > it be like this?). Thank you for letting me know. I'll be very glad to
    > conform to it here.
    >
    > DS: Thank you too for your understanding.
    >
    > Wonder
    >


    What does "DS" mean?

    I accessed that URL using a few programs, and these are my results:

    Firefox 1.5 (works)
    wget (fails)
    curl (works)
    lynx (works)

    I got real curious, so I wrote a perl [0] program to download that site.
    I'm a dial-up user, so I notice any lags in download time. Usually, a
    404 response comes back immediately, but not this time.

    For some insane reason, that site outputs a 221 thousand-byte 404 error
    response--probably due to webmaster brain-death. The output is largely
    in Chinese, and it seems to be some sort of warez page.

    I hope this helps. Don't steal any software. YMMV. Be nice to others.
    Eat your peas....

    --
    [0]
    use LWP::UserAgent;
    use File::Slurp;
    my $url = "http://0daycheck.eastgame.net/0day/archives/20060901.html";
    unlink '404-out.html';

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get($url);
    if ($response->is_success) {
        print $response->content;
    } else {
        print $response->status_line;
        write_file '404-out.html', $response->content;
    }
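
The program above relies on the fact that an unsuccessful response can still carry a full body; a status code and a body are independent parts of an HTTP response, and browsers generally render whatever body arrives. A minimal offline sketch, assuming a hand-built response in place of what the server actually sent (the body text is invented for the example):

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response (assumption: stands in for what the server sent).
my $body     = "<html><body>page content</body></html>";
my $response = HTTP::Response->new(
    404, 'Not Found',
    [ 'Content-Type' => 'text/html' ],
    $body,
);

# The status says "failure", yet the body is fully available.
printf "success: %s\n", $response->is_success ? 'yes' : 'no';
printf "status:  %s\n", $response->status_line;
printf "body:    %d bytes\n", length $response->content;
```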
    Mumia W., Sep 19, 2006
    #10
  11. Wonder

    Wonder Guest

    Mumia W. wrote:
    >
    > What does "DS" mean?
    >
    > I accessed that URL using a few programs, and these are my results:
    >
    > Firefox 1.5 (works)
    > wget (fails)
    > curl (works)
    > lynx (works)
    >
    > I got real curious, so I wrote a perl [0] program to download that site.
    > I'm a dial-up user, so I notice any lags in download time. Usually, a
    > 404 response comes back immediately, but not this time.
    >
    > For some insane reason, that site outputs a 221 thousand-byte 404 error
    > response--probably due to webmaster brain-death. The output is largely
    > in Chinese, and it seems to be some sort of warez page.
    >
    > I hope this helps. Don't steal any software. YMMV. Be nice to others.
    > Eat your peas....
    >
    > --


    Thanks, Mumia. DS are the initials of a guy who took part in the
    discussion yesterday.

    Yeah, this is a website about warez, but it only briefly introduces
    what the software is. It doesn't supply any downloads. I would just
    download the html file and fetch the links of some pictures, so from my
    understanding, it is legal.



    > [0]
    > use LWP::UserAgent;
    > use File::Slurp;
    > my $url = "http://0daycheck.eastgame.net/0day/archives/20060901.html";
    > unlink '404-out.html';
    >
    > my $ua = LWP::UserAgent->new();
    > my $response = $ua->get($url);
    > if ($response->is_success) {
    > print $response->content;
    > } else {
    > print $response->status_line;
    > write_file '404-out.html', $response->content;
    > }
    Wonder, Sep 19, 2006
    #11
  12. Wonder

    Wonder Guest

    John Bokma wrote:
    > http://0daycheck.eastgame.net/0day/archives/20060901.html
    >
    > GET /0day/archives/20060901.html HTTP/1.1
    > Host: 0daycheck.eastgame.net
    > User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7)
    > Gecko/20060909 Firefox/1.5.0.7
    > Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=
    > 0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    > Accept-Language: en-us,en;q=0.5
    > Accept-Encoding: gzip,deflate
    > Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    > Keep-Alive: 300
    > Connection: keep-alive
    >
    > HTTP/1.x 404 Not Found
    > Server: Zeus/4.3
    > Date: Tue, 19 Sep 2006 17:39:10 GMT
    > Content-Type: text/html
    > X-Cache: MISS from cache.xal.megared.net.mx
    > Connection: close
    >
    >
    > So, yeah, they return a 404 :) (output via Firefox + live headers
    > extension).
    >
    > use strict;
    > use warnings;
    >
    > use LWP::UserAgent;
    >
    > my $ua = LWP::UserAgent->new();
    >
    > my $url = 'http://0daycheck.eastgame.net/0day/archives/20060901.html';
    > my $response = $ua->get( $url );
    >
    > $response->is_success
    > or $response->code == 404 # ignore 404
    > or die "Can't get '$url': ", $response->status_line;
    >
    > print $response->content;
    >
    >
    > --
    > John Experienced Perl programmer: http://castleamber.com/
    >
    > Perl help, tutorials, and examples: http://johnbokma.com/perl/


    Thanks a lot, John. The webpage is in $response->content. So, I'm
    curious how the browser knows it's not a real 404 error, and knows to
    display the real content instead of a 404 error.

    Besides, this is probably a silly question: if I put quotation marks
    around the $response member variables, such as

    print "$response->content";

    I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
    that?

    Thanks.

    Wonder
    Wonder, Sep 19, 2006
    #12
  13. Wonder

    John Bokma Guest

    "Wonder" <> wrote:

    > print "$response->content";
    >
    > I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
    > that?


    You might want to read:

    perldoc perlintro (Basic syntax overview, print)
    perldoc perlboot (which explains why $response "contains"
    HTTP::Response=HASH(...))


    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
    John Bokma, Sep 19, 2006
    #13
  14. Wonder

    Matt Garrish Guest

    David Squire wrote:

    > John Bokma wrote:
    > > David Squire <> wrote:
    > >
    > >> John Bokma wrote:
    > >>
    > >>> don't use camelcase with Perl
    > >> Now, familiar as I am with the doctrinaire nature of this group, this
    > >> seems to me to go too far. Surely folk can have their own naming
    > >> conventions and you (we) can be quiet about that.

    > >
    > > Yup, as they can decide not to use strict & warnings. Yet this group
    > > insists on some form of readability, and *I* include not using camel case
    > > with that.

    >
    > I can't agree with this. Using strict and warnings is language-specific
    > advice that helps folk solve and avoid problems. The readability or not
    > of "camelcase" is person-specific and has nothing to do with Perl.
    >
    > So, I have no problems with your saying "I don't like camelcase", but
    > "don't use camelcase with Perl" to me over-steps the mark.
    >
    > I, and no doubt others, participate in multiple Perl projects where we
    > don't always have control over naming conventions (how do you feel about
    > Hungarian?), and editing that down should surely not be a requirement
    > for posting here.


    Don't forget those of us who can't type worth a damn. It's enough just
    to get me to use the shift key, but to also try and reach the
    underscore key is asking too much. And besides, what gets called "camel
    case" here is standard coding convention for almost all the other
    languages I have to deal in. It certainly doesn't warrant chastising a
    poster about, because as you note, it is just style and I'm sure we've
    all used it.

    Matt
    Matt Garrish, Sep 20, 2006
    #14
  15. Wonder <> wrote:


    > Besides, this is probably a silly question: if I put quotation marks
    > around the $response member variables, such as
    >
    > print "$response->content";
    >
    > I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
    > that?



    Part 1: you get the "->content" part because subroutines (and methods)
    do not interpolate.

    Part 2: you get the "HTTP::Response=HASH(0x104832bc)" part because
    that is the stringified representation for a blessed reference (object).
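
Both parts can be seen in a few lines. This is a sketch only: `Page` is a hypothetical stand-in class, not part of LWP, used so the example runs without any modules installed. The last two assignments show two common ways to get the method's result into a string.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTTP::Response: a blessed hash with one method.
package Page;
sub new     { my ( $class, $text ) = @_; return bless { content => $text }, $class }
sub content { my ($self) = @_; return $self->{content} }

package main;

my $response = Page->new('hello');

# Only $response interpolates; "->content" stays as literal text.
my $literal = "$response->content";

# Call the method outside the string and concatenate ...
my $concat = "body: " . $response->content;

# ... or interpolate an anonymous array holding the method's result.
my $inline = "body: @{[ $response->content ]}";

print "$literal\n$concat\n$inline\n";
```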


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 20, 2006
    #15
