How to make a Perl program do concurrent downloading?

Adlene

Hi, there:

I wrote a program to download 500,000 HTML files from a website. I have
compiled all the links in a file, and my grabber.pl will download all of
them...

I have a fast internet connection, so I think it is better to run multiple
downloads at the same time, but $INET = new Win32::Internet() only allows
one at a time... what can I do?
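
One common way around this is to fork several download processes, for
example with Parallel::ForkManager and LWP::UserAgent from CPAN. A rough,
untested sketch (links.txt, the limit of 5 parallel downloads, and the
file-naming rule are just placeholders; note too that fork is emulated on
Windows, so your mileage may vary there):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);        # at most 5 downloads at once
my $ua = LWP::UserAgent->new(timeout => 60);

open my $urls, '<', 'links.txt' or die "Cannot open links.txt: $!";
while (my $url = <$urls>) {
    chomp $url;
    next unless length $url;

    $pm->start and next;                       # parent: move on to the next URL

    # child: fetch one URL, save it, and exit
    my $resp = $ua->get($url);
    if ($resp->is_success) {
        (my $file = $url) =~ s{[^\w.-]+}{_}g;  # crude file name from the URL
        open my $out, '>', $file or die "Cannot write $file: $!";
        print {$out} $resp->content;
        close $out;
    }
    else {
        warn "$url failed: ", $resp->status_line, "\n";
    }

    $pm->finish;                               # child exits here
}
close $urls;
$pm->wait_all_children;                        # wait for the last downloads

Each URL runs in its own child process, so a failure in one download does
not abort the others.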

I also found that occasionally the grabber just hangs somewhere... In such
a situation I need to bypass $INET->FetchURL($url), write the offending URL
to an error file, and continue on to the next iteration... How can I do
that?
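
The usual fix for a hanging fetch is a per-request timeout, so a stalled
download turns into an ordinary error that can be logged and skipped.
FetchURL does not take a timeout as far as I know, so this rough, untested
sketch swaps in LWP::UserAgent, whose constructor accepts one (the
60-second limit, errors.txt, and @urls are placeholders):

use strict;
use warnings;
use LWP::UserAgent;

# timeout => 60: a fetch that stalls for 60 seconds gives up with an
# error instead of hanging the whole run
my $ua = LWP::UserAgent->new(timeout => 60);

open my $err, '>>', 'errors.txt' or die "Cannot open errors.txt: $!";

for my $url (@urls) {                  # @urls: the links read from your file
    my $resp = $ua->get($url);
    unless ($resp->is_success) {
        print {$err} "$url\t", $resp->status_line, "\n";   # remember it
        next;                          # and carry on with the next URL
    }
    # ... write $resp->content to disk as before ...
}
close $err;

Anything that times out or otherwise fails gets appended to errors.txt and
the loop simply moves on, so one bad URL no longer stalls the whole run.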

Best Regards,
Adlene
 

Bryan Castillo

Adlene said:
Hi, there:

I wrote a program to download 500,000 HTML files from a website. I have
compiled all the links in a file, and my grabber.pl will download all of
them...

Depending on who owns the site, they may find it rude that you want to
download so many files and take up as many resources as possible from
their web server. Perhaps you should find a different way of retrieving
the data, such as contacting the web site administrator and telling them
what you want to do. They might even give you a gzipped tar file of the
site.

I have a fast internet connection. I think it is better to run multiple
downloads at the same time...

It may be better for you, but that is questionable for everyone else.

Here is some information on web robots. You might want to do some more
searching on the subject, though.

http://www.phantomsearch.com/usersguide/R04Robot.htm

<from the above URL>

The Four Laws of Web Robotics
Law One: A Web Robot Must Show Identification
Phantom supports this. You can set the "User-Agent" and "From E-Mail"
fields in the preferences dialog. Both of these are reported in the
HTTP header when Phantom makes requests of remote Web servers.

Law Two: A Web Robot Must Obey Exclusion Standard
Phantom fully supports the exclusion standard.

Law Three: A Web Robot Must Not Hog Resources
Phantom only retrieves files it can index (unless mirroring with
binaries option on) and restricts its movement to the path specified
by starting points. You can also set the minimum time between hits on
the same server. Generally, 60 seconds is considered polite.

For busy sites a greater hit rate may be acceptable, but do not assume
whether a site is "busy" or not; contact the webmaster first. When
crawling your own server, of course, you can set the hit interval to
anything you like, including zero.

Law Four: A Web Robot Must Report Errors
Phantom can show you links that are no longer valid. Please contact
the Webmaster and pass this information on if broken URLs are found.
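
For what it is worth, the libwww-perl distribution ships LWP::RobotUA,
which bakes most of these rules in: it will not run without an agent name
and a contact address, it fetches and honours each site's robots.txt, and
it waits between requests to the same host (one minute by default). A
small sketch (the agent string, e-mail address, and URL are placeholders):

use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires both an agent name and a contact address, obeys
# each site's robots.txt, and pauses between requests to the same host.
my $ua = LWP::RobotUA->new(
    agent => 'grabber.pl/0.1',        # Law One: identify yourself...
    from  => 'you@example.com',       # ...and say how to reach you
);
$ua->delay(1);                        # Law Three: 1 minute between hits

my $resp = $ua->get('http://www.example.com/some/page.html');
if ($resp->is_success) {
    print $resp->content;
}
elsif ($resp->code == 403 && $resp->message =~ /robots\.txt/i) {
    warn "Disallowed by robots.txt (Law Two)\n";
}
else {
    warn "Failed: ", $resp->status_line, "\n";   # Law Four: report errors
}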
 
