Downloading lots and lots and lots of files

coolneo

First, what I am doing is legit... I'm NOT trying to grab someone
else's content. I work for a non-profit organization and we have
something going on with Google where they are providing digitized
versions of our material. They (Google) provided some information on
how to write a script (shell) to download the digitized versions using
wget.

There are about 50,000 items, ranging in size from 15MB-600MB. My
script downloads them fine, but it would be much faster if I could
multi-thread(?) it. I'm running wget via the system command on a
Windows box (I know, I know, but the whole place is Windows, so I
don't have much of a choice).
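
Roughly, the script boils down to something like this (the list file
name here is just a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the URL list Google gave us and fetch the items one at a
    # time.  'urls.txt' is a made-up name for that list.
    open my $list, '<', 'urls.txt' or die "Can't open urls.txt: $!";
    while (my $url = <$list>) {
        chomp $url;
        next unless length $url;
        # system() waits for each wget to finish, so everything is serial.
        system('wget', '--continue', $url) == 0
            or warn "wget failed for $url (exit $?)\n";
    }
    close $list;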

Am I on the right track? Or should I be doing this differently?

Thanks!
J
 
coolneo

Purl Gurl said:
Why do you want to download those files again?


I managed to download about 21,000 of the 50,000 items over the course
of some time. Initially, Google was processing these items at a slow
rate, but lately they have picked it up.

Bandwidth is indeed a concern, and I understand downloading 5TB will
take a long, long time, but I think it would be a little shorter if I
could spawn off 4 downloads at a time, or even 2, during our off
business hours and the weekend (I get . The average file size is
125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
it?).
 
Peter Scott

First, what I am doing is legit... I'm NOT trying to grab someone
else's content. I work for a non-profit organization and we have
something going on with Google where they are providing digitized
versions of our material. They (Google) provided some information on
how to write a script (shell) to download the digitized versions using
wget.

There are about 50,000 items, ranging in size from 15MB-600MB. My
script downloads them fine, but it would be much faster if I could
multi-thread(?) it. I'm running wget via the system command on a
Windows box (I know, I know, but the whole place is Windows, so I
don't have much of a choice).

You could try

http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel.pm

Looks like you'll need Cygwin.
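
Untested, but going by the module's synopsis it would look something
like this (the URL list and the output file naming are made up; check
the docs for the exact register() arguments):

    use strict;
    use warnings;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    # Read the URL list from stdin, one URL per line.
    chomp(my @urls = <STDIN>);

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->max_hosts(1);    # everything comes from the same host
    $pua->max_req(4);      # at most 4 requests in flight at once
    $pua->redirect(1);     # follow redirects

    my $n = 0;
    for my $url (@urls) {
        # The second argument to register() should name a file to
        # stream the body into, so a 600MB response doesn't have to
        # sit in memory.
        my $file = sprintf 'item%05d.dat', $n++;
        my $err = $pua->register(HTTP::Request->new(GET => $url), $file);
        warn $err->error_as_HTML if $err;
    }

    # Block until everything registered has finished or timed out.
    my $entries = $pua->wait(120);
    for my $key (keys %$entries) {
        my $res = $entries->{$key}->response;
        printf "%s => %s\n", $res->request->url, $res->status_line;
    }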
 
Ted Zlatanov

I managed to download about 21,000 of the 50,000 items over the course
of some time. Initially, Google was processing these items at a slow
rate, but lately they have picked it up.

Bandwidth is indeed a concern, and I understand downloading 5TB will
take a long, long time, but I think it would be a little shorter if I
could spawn off 4 downloads at a time, or even 2, during our off
business hours and the weekend (I get . The average file size is
125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
it?).

You should contact Google and request the data directly. I guarantee
you they will be happy to avoid the load on their network and
servers, since HTTP is not the best way to transfer lots of data.

Ted
 
xhoster

Abigail said:
Of course, it's quite likely that the network is the bottleneck.
Starting up many simultaneous connections isn't going to help in
that case.

Finally, I wouldn't use threads. I'd either fork() or use a select()
loop, depending on the details of the work that needs to be done.
But then, I'm a Unix person.

I probably wouldn't even use fork. I'd just make 3 (or 4, or 10,
whatever) different to-do lists and start up 3 (or 4, or 10) completely
independent programs from the command line.
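
Something like this would do the splitting (the list file names are
made up):

    use strict;
    use warnings;

    # Deal the big URL list out round-robin into N smaller lists,
    # then start N copies of the existing download script by hand,
    # one per part file.
    my $parts = 4;
    open my $in, '<', 'urls.txt' or die "urls.txt: $!";
    my @out = map {
        open my $fh, '>', "urls.part$_.txt" or die "urls.part$_.txt: $!";
        $fh;
    } 1 .. $parts;

    my $i = 0;
    while (my $url = <$in>) {
        print { $out[ $i++ % $parts ] } $url;
    }
    close $_ for $in, @out;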

Xho
 
gf

coolneo said:
[...] They (Google) provided some information on
how to write a script (shell) to download the digitized versions using
wget.

There are about 50,000 items, ranging in size from 15MB-600MB. My
script downloads them fine, but it would be much faster if I could
multi-thread(?) it. I'm running wget via the system command on a
Windows box (I know, I know, but the whole place is Windows, so I
don't have much of a choice).

Am I on the right track? Or should I be doing this differently?

You didn't say if this is a one-time job or something that'll be
ongoing.

If it's a one-time job, then I'd split that file list into however
many processes I want to run, then start that many shell jobs and just
let 'em run until it's done. It's not elegant, it's brute force, but
sometimes that's plenty good.

If you're going to be doing this regularly, then LWP::Parallel is
pretty sweet. You can have each LWP agent shift an individual URL off
the list and slowly whittle it down.
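
Not LWP::Parallel itself, but the same shift-one-URL-at-a-time idea
can be sketched with Parallel::ForkManager and plain LWP (untested,
file names are made up, and note Perl's fork is only emulated on
Windows, so this may be another argument for Cygwin):

    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::UserAgent;

    chomp(my @urls = <STDIN>);               # one URL per line

    my $pm = Parallel::ForkManager->new(4);  # at most 4 children at once
    my $ua = LWP::UserAgent->new;

    my $n = 0;
    while (my $url = shift @urls) {
        my $file = sprintf 'item%05d.dat', $n++;
        $pm->start and next;                 # parent: shift the next URL
        # Child: fetch one item, streaming the body straight to disk.
        my $res = $ua->get($url, ':content_file' => $file);
        warn "$url: ", $res->status_line, "\n" unless $res->is_success;
        $pm->finish;
    }
    $pm->wait_all_children;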

The I/O issues mentioned are going to be worse on a single box,
though. You can hit a point where the machine is network I/O bound, so
you might want to consider confiscating a couple of PCs and running a
separate job on each one, as long as you're on a switch and a fast
pipe.

I'd also seriously consider a modern sneaker-net: see about buying
some hard drives that'll hold the entire set of data, send them to
Google, have them fill the drives, and then return them by overnight
air. That might be a lot faster, and then you could reuse the drives
later.
 
coolneo

You should contact Google and request the data directly. I guarantee
you they will be happy to avoid the load on their network and
servers, since HTTP is not the best way to transfer lots of data.

Ted

Ted, there is some additional information I didn't provide that may
make you think differently:

Google is kinda odd sometimes. It took them forever to allow multiple
download streams, and then they provided this web interface to recall
data in text format with wget. I mean, for Google, you figure they
could do better. I think they would prefer to not give us anything at
all. Once we have it, there is always the chance we'll give it away or
lose it or have it stolen (by Microsoft!).

Another thing I didn't mention is that this can grow to much larger
than the 50,000 items, in which case I'd much rather just
auto-download than deal with media.
 
Ted Zlatanov

Google is kinda odd sometimes. It took them forever to allow multiple
download streams, and then they provided this web interface to recall
data in text format with wget. I mean, for Google, you figure they
could do better. I think they would prefer to not give us anything at
all. Once we have it, there is always the chance we'll give it away or
lose it or have it stolen (by Microsoft!).

As a business decision it may make sense; technically it's nonsense :)

At the very least they should give you an rsync interface. It's a
single TCP stream, it's fast, and it can be resumed if the connection
should abort. HTTP is low on my list of transport mechanisms for
large files.

Another thing I didn't mention is that this can grow to much larger
than the 50,000 items, in which case I'd much rather just
auto-download than deal with media.

Sure. I was talking about your initial data load; subsequent loads
can be incremental.

I would also suggest limiting to N downloads per hour, to avoid bugs
or other situations (unmounted disk, for example) where you're
repeatedly requesting all the data you already have. That's a very
nasty situation.
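
A crude guard along those lines, just as a sketch (the quota and the
file checks are arbitrary placeholders):

    use strict;
    use warnings;

    # Wrap each download in a check: skip anything already on disk,
    # and stop fetching once an hourly quota is used up.
    my $max_per_hour = 50;
    my $fetched      = 0;
    my $hour_start   = time;

    sub should_fetch {
        my ($local_file) = @_;

        # Already have it?  (A checksum or a size check against the
        # server's headers would be safer than "non-empty file exists".)
        return 0 if -e $local_file && -s _;

        # Roll the hourly counter over.
        if (time - $hour_start >= 3600) {
            $hour_start = time;
            $fetched    = 0;
        }
        return 0 if $fetched >= $max_per_hour;

        $fetched++;
        return 1;
    }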

Ted
 
coolneo

Thanks everyone. I'm going to give LWP::Parallel a closer look. That
looks like it will do what I want. Thanks for the advice on queuing
the downloads. That makes perfect sense.
 
