Question on download by LWP

Wonder

Hello,

I'm trying to use a Perl script to download some HTML files from a
website, but strangely the response to the fetch is always "404 Not
Found", even though I have no problem opening the same URL in IE. The
same script works for URLs on other websites.

It seems that the website only allows connections from web browsers, not
from download tools (it also denies download requests from tools like
Flashget). Is there a field I need to set in my Perl code to make the
remote website think the request comes from a web browser? Thanks.

The following is my code:

use LWP::UserAgent;
use HTTP::Response;
$ua = LWP::UserAgent->new;
$CurrentURL = "http://DomainName.com/RemoteFileName.html";
$filename = ".LocalFileName.html";
$response = $ua->mirror($CurrentURL, $filename);
die "Can't get $CurrentURL -- ", $response->status_line unless
$response->is_success;
 
John Bokma

Wonder said:
Hello,

I'm trying to use a Perl script to download some HTML files from a
website, but strangely the response to the fetch is always "404 Not
Found", even though I have no problem opening the same URL in IE. The
same script works for URLs on other websites.

It seems that the website only allows connections from web browsers, not
from download tools (it also denies download requests from tools like
Flashget). Is there a field I need to set in my Perl code to make the
remote website think the request comes from a web browser? Thanks.

The following is my code:

use LWP::UserAgent;
use HTTP::Response;

use strict;
use warnings;
$ua = LWP::UserAgent->new;

You might want to try:

my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0'
);

If that gives the same problem, try to use a longer agent, e.g.

my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0'
        . ' (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)'
        . ' Gecko/20060728 Firefox/1.5.0.6'
);

(yes, I am going to update after this post :) ).
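
If the agent string alone is not enough, other request headers can be set on the same object. The following is only a sketch, assuming LWP::UserAgent is installed; the header values are illustrative and not copied from any particular browser. It prints what would be sent without contacting a server.

```perl
use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

# Headers added here are sent with every request made through $ua.
# The values are illustrative assumptions, not a real browser's set.
$ua->default_header( 'Accept'          => 'text/html,*/*;q=0.5' );
$ua->default_header( 'Accept-Language' => 'en-us,en;q=0.5' );

# Inspect what will be sent, without contacting any server.
print "Agent: ", $ua->agent, "\n";
print $ua->default_headers->as_string;
```

Whether a given site looks at any of these headers is something only a test against that site can tell.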

Quite a few sites block well-known "grabbers".

Don't use camelcase with Perl (yes, that sounds like a contradiction).
Also, RemoteFileName.html is not really a filename. Also, use example.com
for examples; don't invent domain names.
$filename = ".LocalFileName.html";
$response = $ua->mirror($CurrentURL, $filename);
die "Can't get $CurrentURL -- ", $response->status_line unless
$response->is_success;

I would change that into:

$response->is_success or die .....


Which you can read as: the response has to be successful or we give up.
 
David Squire

John said:
don't use camelcase with Perl

Now, familiar as I am with the doctrinaire nature of this group, this
seems to me to go too far. Surely folk can have their own naming
conventions and you (we) can be quiet about that.


DS
 
John Bokma

David Squire said:
Now, familiar as I am with the doctrinaire nature of this group, this
seems to me to go too far. Surely folk can have their own naming
conventions and you (we) can be quiet about that.

Yup, as they can decide not to use strict & warnings. Yet this group
insists on some form of readability, and *I* include not using camel case
with that. If I have to include a disclaimer with each and every sentence
my replies take too much time. It's free advice, after all, and I assume
that the reader is a grown up who can ignore advice at will.
 
David Squire

John said:
Yup, as they can decide not to use strict & warnings. Yet this group
insists on some form of readability, and *I* include not using camel case
with that.

I can't agree with this. Using strict and warnings is language-specific
advice that helps folk solve and avoid problems. The readability or not
of "camelcase" is person-specific and has nothing to do with Perl.

So, I have no problems with your saying "I don't like camelcase", but
"don't use camelcase with Perl" to me over-steps the mark.

I, and no doubt others, participate in multiple Perl projects where we
don't always have control over naming conventions (how do you feel about
Hungarian?), and editing that down should surely not be a requirement
for posting here. Getting a minimal script is a big enough ask for most
folk.


DS
 
Tad McClellan

Except when their convention is to use PERL, perhaps? :)

So, I have no problems with your saying "I don't like camelcase", but
"don't use camelcase with Perl" to me over-steps the mark.


He should have probably said:

people will think less of you if you use camelcase with Perl

That is not to say that it is right to think that, but I am quite sure
that my re-phrase is true nonetheless.

Similar to:

people will think less of you if you spell it PERL

They both just tell folks that you don't know the secret handshake.
 
John Bokma

David Squire said:
John Bokma wrote:
[..]

I can't agree with this. Using strict and warnings is
language-specific advice that helps folk solve and avoid problems. The
readability or not of "camelcase" is person-specific and has nothing
to do with Perl.

"... use underscores to separate words." Programming Perl 3rd ed, p605
(Programming with Style).

"... use underscores to separate words in longer identifiers."
perldoc perlstyle

IMO this has everything to do with Perl, for the same reason that using
camel case in Java has everything to do with Java.

So, I have no problems with your saying "I don't like camelcase", but
"don't use camelcase with Perl" to me over-steps the mark.

I just passed on the advice given in Programming Perl ;-)

I, and no doubt others, participate in multiple Perl projects where we
don't always have control over naming conventions (how do you feel
about Hungarian?), and editing that down should surely not be a
requirement for posting here.

For the same reason not using use strict / warnings shouldn't be an issue
if that isn't causing any problem. Yet "we" insist on it being present.
Personally I think the posted problem should conform at least to
perldoc perlstyle.

I agree with you about sticking with the naming conventions of a given
project. And for that very reason, I think it's good to follow, in
Perl-related newsgroups, the naming convention outlined in perldoc
perlstyle / Programming Perl.

Getting a minimal script is a big enough
ask for most folk.

They need help, and ask "us" to answer their question(s). I think it's fair
to ask them to present their question in a format that's easy to read
for most people. To me that means following perlstyle as closely as possible.

Finally, I am not saying you have no point at all: I agree that too much
nitpicking can be overwhelming for the OP :) But I also think I have a
point ;-)
 
Wonder

John Bokma said:
You might want to try:

my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0'
);

If that gives the same problem, try to use a longer agent, e.g.

my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0'
        . ' (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)'
        . ' Gecko/20060728 Firefox/1.5.0.6'
);

(yes, I am going to update after this post :) ).

Thanks, John. It seems neither method works. I got the same "404 not
found" error.
Here is the web page I'd like to access:
http://0daycheck.eastgame.net/0day/archives/20060901.html

Quite a few sites block well-known "grabbers".


Don't use camelcase with Perl (yes, that sounds like a contradiction).
Also, RemoteFileName.html is not really a filename. Also, use example.com
for examples; don't invent domain names.

Sorry, I'm new to this group, and I picked up the camelcase habit from C++
programming. This is the first time I've heard of perl_style (shouldn't
it be written like this?). Thank you for letting me know. I'll be glad to
conform to it here.

DS: Thank you too for your understanding.

Wonder
 
John Bokma

Wonder said:
Thanks, John. It seems neither method works. I got the same "404 not
found" error.
Here is the web page I'd like to access:
http://0daycheck.eastgame.net/0day/archives/20060901.html

http://0daycheck.eastgame.net/0day/archives/20060901.html

GET /0day/archives/20060901.html HTTP/1.1
Host: 0daycheck.eastgame.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Zeus/4.3
Date: Tue, 19 Sep 2006 17:39:10 GMT
Content-Type: text/html
X-Cache: MISS from cache.xal.megared.net.mx
Connection: close


So, yeah, they return a 404 :) (output via Firefox + live headers
extension).

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

my $url = 'http://0daycheck.eastgame.net/0day/archives/20060901.html';
my $response = $ua->get( $url );

$response->is_success
    or $response->code == 404 # ignore 404
    or die "Can't get '$url': ", $response->status_line;

print $response->content;
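
The reason the code above can print a page at all is that the status code and the body are separate parts of an HTTP response, and a browser typically renders whatever body arrives, even with a 404. A minimal offline sketch of that, assuming HTTP::Response (installed alongside LWP) is available; the body text is made up for illustration and nothing is fetched:

```perl
use strict;
use warnings;

use HTTP::Response;

# Build a response by hand (no network): a 404 status carrying a full
# HTML body, similar in shape to what the server in this thread sends.
# The body text is an invented placeholder.
my $body     = "<html><body>archive page</body></html>";
my $response = HTTP::Response->new( 404, 'Not Found',
    [ 'Content-Type' => 'text/html' ], $body );

# The status says "error" ...
print "status:  ", $response->status_line, "\n";
print "success: ", ( $response->is_success ? "yes" : "no" ), "\n";

# ... but the body is still there for anyone who chooses to read it.
print "body:    ", $response->content, "\n";
```

So checking is_success alone throws away a body that may be exactly the page you wanted.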
 
Mumia W.

Thanks, John. It seems neither method works. I got the same "404 not
found" error.
Here is the web page I'd like to access:
http://0daycheck.eastgame.net/0day/archives/20060901.html


Sorry, I'm new to this group, and I picked up the camelcase habit from C++
programming. This is the first time I've heard of perl_style (shouldn't
it be written like this?). Thank you for letting me know. I'll be glad to
conform to it here.

DS: Thank you too for your understanding.

Wonder

What does "DS" mean?

I accessed that URL using a few programs, and these are my results:

Firefox 1.5 (works)
wget (fails)
curl (works)
lynx (works)

I got real curious, so I wrote a perl [0] program to download that site.
I'm a dial-up user, so I notice any lags in download time. Usually, a
404 response comes back immediately, but not this time.

For some insane reason, that site outputs a 221 thousand-byte 404 error
response--probably due to webmaster brain-death. The output is largely
in Chinese, and it seems to be some sort of warez page.

I hope this helps. Don't steal any software. YMMV. Be nice to others.
Eat your peas....

--
[0]
use strict;
use warnings;
use LWP::UserAgent;
use File::Slurp;
my $url = "http://0daycheck.eastgame.net/0day/archives/20060901.html";
unlink '404-out.html';    # remove output left over from a previous run

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
if ($response->is_success) {
    print $response->content;
} else {
    # Report the status, then keep the error body for inspection.
    print $response->status_line, "\n";
    write_file '404-out.html', $response->content;
}
 
Wonder

Mumia said:
What does "DS" mean?

I accessed that URL using a few programs, and these are my results:

Firefox 1.5 (works)
wget (fails)
curl (works)
lynx (works)

I got real curious, so I wrote a perl [0] program to download that site.
I'm a dial-up user, so I notice any lags in download time. Usually, a
404 response comes back immediately, but not this time.

For some insane reason, that site outputs a 221 thousand-byte 404 error
response--probably due to webmaster brain-death. The output is largely
in Chinese, and it seems to be some sort of warez page.

I hope this helps. Don't steal any software. YMMV. Be nice to others.
Eat your peas....

--

Thanks, Mumia. DS are the initials of a guy who took part in the
discussion yesterday.

Yeah, this is a website about warez, but it only briefly describes
what the software is. It doesn't supply any downloads. I just want to
download the HTML file and fetch the links to some pictures, so as far
as I understand, it is legal.


[0]
use LWP::UserAgent;
use File::Slurp;
my $url = "http://0daycheck.eastgame.net/0day/archives/20060901.html";
unlink '404-out.html';

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
if ($response->is_success) {
print $response->content;
} else {
print $response->status_line;
write_file '404-out.html', $response->content;
}
 
Wonder

John said:
http://0daycheck.eastgame.net/0day/archives/20060901.html

GET /0day/archives/20060901.html HTTP/1.1
Host: 0daycheck.eastgame.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 404 Not Found
Server: Zeus/4.3
Date: Tue, 19 Sep 2006 17:39:10 GMT
Content-Type: text/html
X-Cache: MISS from cache.xal.megared.net.mx
Connection: close


So, yeah, they return a 404 :) (output via Firefox + live headers
extension).

use strict;
use warnings;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

my $url = 'http://0daycheck.eastgame.net/0day/archives/20060901.html';
my $response = $ua->get( $url );

$response->is_success
    or $response->code == 404 # ignore 404
    or die "Can't get '$url': ", $response->status_line;

print $response->content;

Thanks a lot, John. The webpage is in $response->content. So, I'm
curious how the browser knows it's not a real 404 error, and knows to
display the real content instead of a 404 error.

Besides, this is probably a silly question: if I put quotation marks
around the $response member variables, such as

print "$response->content";

I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
that?

Thanks.

Wonder
 
John Bokma

Wonder said:
print "$response->content";

I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
that?

You might want to read:

perldoc perlintro (Basic syntax overview, print)
perldoc perlboot (which explains why $response "contains"
HTTP::Response=HASH(...))
 
Matt Garrish

David said:
I can't agree with this. Using strict and warnings is language-specific
advice that helps folk solve and avoid problems. The readability or not
of "camelcase" is person-specific and has nothing to do with Perl.

So, I have no problems with your saying "I don't like camelcase", but
"don't use camelcase with Perl" to me over-steps the mark.

I, and no doubt others, participate in multiple Perl projects where we
don't always have control over naming conventions (how do you feel about
Hungarian?), and editing that down should surely not be a requirement
for posting here.

Don't forget those of us who can't type worth a damn. It's enough just
to get me to use the shift key, but to also try and reach the
underscore key is asking too much. And besides, what gets called "camel
case" here is standard coding convention for almost all the other
languages I have to deal in. It certainly doesn't warrant chastising a
poster about, because as you note, it is just style and I'm sure we've
all used it.

Matt
 
Tad McClellan

Besides, this is probably a silly question: if I put quotation marks
around the $response member variables, such as

print "$response->content";

I'll get a line "HTTP::Response=HASH(0x104832bc)->content". Why is
that?


Part 1: you get the "->content" part because subroutines (and methods)
do not interpolate.

Part 2: you get the "HTTP::Response=HASH(0x104832bc)" part because
that is the stringified representation for a blessed reference (object).
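
Both parts can be seen in a short sketch, along with two common ways to get the method's result into the string. Demo::Response is a hypothetical stand-in class, used only so the snippet runs without LWP installed; an HTTP::Response object interpolates the same way.

```perl
use strict;
use warnings;

# Demo::Response is a hypothetical stand-in for HTTP::Response.
package Demo::Response;
sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
sub content { return $_[0]{content} }

package main;

my $response = Demo::Response->new( content => 'page body' );

# Only the variable interpolates; "->content" is left as literal text.
my $wrong = "$response->content";

# Option 1: call the method outside the string and concatenate.
my $concat = "Body: " . $response->content;

# Option 2: interpolate an anonymous array holding the method's result.
my $inline = "Body: @{[ $response->content ]}";

print "$wrong\n$concat\n$inline\n";
```

Concatenation is usually the easier of the two to read; the @{[ ... ]} form is handy inside longer strings or here-documents.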
 
