Downloading lots and lots and lots of files

Discussion in 'Perl Misc' started by coolneo, Jan 29, 2007.

  1. coolneo

    coolneo Guest

    First, what I am doing is legit... I'm NOT trying to grab someone
    else's content. I work for a non-profit organization and we have
    something going on with Google where they are providing digitized
    versions of our material. They (Google) provided some information on
    how to write a script (shell) to download the digitized versions using
    wget.

    There are about 50,000 items, ranging in size from 15MB-600MB. My
    script downloads them fine, but it would be much faster if I could
    multi-thread(?) it. I'm running wget using the system command on a
    Windows box (I know, I know, but the whole place is Windows so I don't
    have much of a choice).
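
    (A minimal sketch of the serial loop being described, assuming the URLs
    sit one per line in a file called items.txt; file names and paths here
    are illustrative, not the actual script.)

        use strict;
        use warnings;

        # Read the URL list and fetch each item one at a time.
        open my $list, '<', 'items.txt' or die "Can't open items.txt: $!";
        while (my $url = <$list>) {
            chomp $url;
            next unless $url;
            # -c resumes a partial download, -P picks the target directory;
            # wget.exe needs to be on the PATH of the Windows box.
            my $rc = system('wget', '-c', '-P', 'downloads', $url);
            warn "wget failed for $url (exit code $rc)\n" if $rc != 0;
        }
        close $list;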

    Am I on the right track? Or should I be doing this differently?

    Thanks!
    J
     
    coolneo, Jan 29, 2007
    #1

  2. coolneo

    coolneo Guest

    On Jan 29, 10:04 am, Purl Gurl wrote:
    > coolneo wrote:
    > > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > > script downloads them fine, but it would be much faster if I could
    > > multi-thread(?) it.
    >
    > You indicate you have already downloaded those files.
    > Why do you want to download those files again?
    >
    > Purl Gurl



    I managed to download about 21,000 of the 50,000 items over the course
    of some time. Initially, Google was processing these items at a slow
    rate but lately they have picked it up.

    Bandwidth is indeed a concern, and I understand downloading 5TB will
    take a long long time, but I think it would be a little shorter if I
    could spawn off 4 downloads at a time, or even 2, during our off
    business hours and the weekend (I get . The average file size is
    125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    it?).
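
    (Back-of-envelope, using the figures above: roughly 29,000 items remain
    at an average of 125MB each, or about 3.6TB; a 200Mbit pipe at full
    tilt moves about 25MB/s, roughly 90GB per hour, so the raw transfer is
    on the order of 40 hours. In practice the per-connection rate the far
    end allows is usually what limits a single wget stream, which is why a
    few parallel downloads can help.)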
     
    coolneo, Jan 29, 2007
    #2

  3. Peter Scott

    Peter Scott Guest

    On Mon, 29 Jan 2007 06:44:02 -0800, coolneo wrote:
    > First, what I am doing is legit... I'm NOT trying to grab someone
    > else's content. I work for a non-profit organization and we have
    > something going on with Google where they are providing digitized
    > versions of our material. They (Google) provided some information on
    > how to write a script (shell) to download the digitized versions using
    > wget.
    >
    > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > script downloads them fine, but it would be much faster if I could
    > multi-thread(?) it. I'm running wget using the system command on a
    > Windows box (I know, I know, but the whole place is Windows so I don't
    > have much of a choice).


    You could try

    http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel.pm

    Looks like you'll need Cygwin.
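
    (A rough sketch of what that might look like. The URL list, the local
    file naming, and passing a file name as register()'s second argument,
    LWP::UserAgent-style, are assumptions to check against the module's
    docs.)

        use strict;
        use warnings;
        use LWP::Parallel::UserAgent;
        use HTTP::Request;
        use File::Basename qw(basename);

        open my $fh, '<', 'items.txt' or die "items.txt: $!";
        chomp(my @urls = <$fh>);
        close $fh;

        my $pua = LWP::Parallel::UserAgent->new();
        $pua->in_order(0);      # completion order doesn't matter
        $pua->duplicates(0);    # ignore duplicate URLs
        $pua->redirect(1);      # follow redirects
        $pua->max_req(4);       # at most 4 simultaneous requests per host

        for my $url (@urls) {
            my $file = 'downloads/' . basename($url);
            next if -e $file;   # crude "already got it" check
            # The second argument is meant to behave like LWP::UserAgent's
            # $arg: a file name to stream the body into, rather than holding
            # a 125MB response in memory.
            if (my $err = $pua->register(HTTP::Request->new(GET => $url), $file)) {
                print STDERR $err->error_as_HTML;
            }
        }

        my $entries = $pua->wait();   # blocks until registered requests finish
        for my $key (keys %$entries) {
            my $res = $entries->{$key}->response;
            printf "%s => %s\n", $entries->{$key}->request->uri, $res->status_line;
        }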

    --
    Peter Scott
    http://www.perlmedic.com/
    http://www.perldebugged.com/
     
    Peter Scott, Jan 29, 2007
    #3
  4. Ted Zlatanov

    Ted Zlatanov Guest

    On 29 Jan 2007, coolneo wrote:

    > I managed to download about 21,000 of the 50,000 items over the course
    > of some time. Initially, Google was processing these items at a slow
    > rate but lately they have picked it up.
    >
    > Bandwidth is indeed a concern, and I understand downloading 5TB will
    > take a long long time, but I think it would be a little shorter if I
    > could spawn off 4 downloads at a time, or even 2, during our off
    > business hours and the weekend (I get . The average file size is
    > 125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    > it?).


    You should contact Google and request the data directly. I guarantee
    you they will be happy to avoid the load on their network and
    servers, since HTTP is not the best way to transfer lots of data.

    Ted
     
    Ted Zlatanov, Jan 29, 2007
    #4
  5. Xho

    Guest

    Abigail wrote:
    >
    > Of course, it's quite likely that the network is the bottleneck.
    > Starting up many simultaneous connections isn't going to help in
    > that case.
    >
    > Finally, I wouldn't use threads. I'd either fork() or use a select()
    > loop, depending on the details of the work that needs to be done.
    > But then, I'm a Unix person.


    I probably wouldn't even use fork. I'd just make 3 (or 4, or 10, whatever)
    different to-do lists, and start up 3 (or 4, or 10) completely independent
    programs from the command line.
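
    (A sketch of that split, assuming one master file of URLs; it deals the
    lines out round-robin into N smaller lists, one per job.)

        use strict;
        use warnings;

        my $n = 4;    # how many independent download jobs to run

        open my $in, '<', 'items.txt' or die "items.txt: $!";
        my @out;
        for my $i (0 .. $n - 1) {
            open $out[$i], '>', "items.$i.txt" or die "items.$i.txt: $!";
        }
        my $count = 0;
        while (my $url = <$in>) {
            print { $out[$count++ % $n] } $url;   # deal URLs out round-robin
        }
        close $_ for $in, @out;

    Each copy of the download script is then pointed at its own items.N.txt
    and launched from its own command window (or with the Windows start
    command).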

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jan 29, 2007
    #5
  6. gf

    gf Guest

    coolneo wrote:
    > [...] They (Google) provided some information on
    > how to write a script (shell) to download the digitized versions using
    > wget.
    >
    > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > script downloads them fine, but it would be much faster if I could
    > multi-thread(?) it. I'm running wget using the system command on a
    > Windows box (I know, I know, but the whole place is Windows so I don't
    > have much of a choice).
    >
    > Am I on the right track? Or should I be doing this differently?


    You didn't say if this is a one-time job or something that'll be ongoing.

    If it's a one-time job, then I'd split that file list into however
    many processes I want to run, then start that many shell jobs and just
    let 'em run until it's done. It's not elegant, it's brute force, but
    sometimes that's plenty good.

    If you're going to be doing this regularly, then LWP::Parallel is
    pretty sweet. You can have each LWP agent shift an individual URL off
    the list and slowly whittle it down.

    The I/O issues mentioned are going to be worse on a single box, though.
    You can hit a point where the machine is network I/O bound, so you
    might want to consider confiscating a couple of PCs and running a
    separate job on each, as long as you're on a switch and a fast pipe.

    I'd also seriously consider a modern sneaker-net: see about buying
    some hard drives that'll hold the entire set of data, send them to
    Google, have them fill the drives, and then have them returned by
    overnight air. That might be a lot faster, and then you could reuse
    the drives later.
     
    gf, Jan 29, 2007
    #6
  7. coolneo

    coolneo Guest

    On Jan 29, 12:20 pm, Ted Zlatanov wrote:
    > On 29 Jan 2007, coolneo wrote:
    >
    > > I managed to download about 21,000 of the 50,000 items over the course
    > > of some time. Initially, Google was processing these items at a slow
    > > rate but lately they have picked it up.
    > > Bandwidth is indeed a concern, and I understand downloading 5TB will
    > > take a long long time, but I think it would be a little shorter if I
    > > could spawn off 4 downloads at a time, or even 2, during our off
    > > business hours and the weekend (I get . The average file size is
    > > 125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    > > it?).
    >
    > You should contact Google and request the data directly. I guarantee
    > you they will be happy to avoid the load on their network and
    > servers, since HTTP is not the best way to transfer lots of data.
    >
    > Ted


    Ted, I didn't provide some additional information that may make
    you think differently:

    Google is kinda odd sometimes. It took them forever to allow multiple
    download streams, and then they provide this web interface to recall
    data in text format with wget. I mean, for Google, you'd figure they
    could do better. I think they would prefer to not give us anything at
    all. Once we have it there is always the chance we'll give it away or
    lose it or have it stolen (by Microsoft!).

    Another thing I didn't mention is that this can grow to much more
    than the 50,000 items, in which case I'd much rather just auto-download
    than deal with media.
     
    coolneo, Jan 29, 2007
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    coolneo wrote:

    > recall data in text format with wget.


    I assume it is gz-compressed?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jan 29, 2007
    #8
  9. Ted Zlatanov

    Ted Zlatanov Guest

    On 29 Jan 2007, coolneo wrote:

    > Google is kinda odd sometimes. It took them forever to allow multiple
    > download streams, and then they provide this web interface to recall
    > data in text format with wget. I mean, for Google, you'd figure they
    > could do better. I think they would prefer to not give us anything at
    > all. Once we have it there is always the chance we'll give it away or
    > lose it or have it stolen (by Microsoft!).


    As a business decision it may make sense; technically it's nonsense :)

    At the very least they should give you an rsync interface. It's a
    single TCP stream, it's fast, and it can be resumed if the connection
    should abort. HTTP is low on my list of transport mechanisms for
    large files.

    > Another thing I didn't mention is that this can grow to much larger
    > than the 50,000, in which case, I'd much rather just auto-download,
    > than deal with media.


    Sure. I was talking about your initial data load; subsequent loads
    can be incremental.

    I would also suggest limiting to N downloads per hour, to avoid bugs
    or other situations (unmounted disk, for example) where you're
    repeatedly requesting all the data you already have. That's a very
    nasty situation.
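
    (A sketch of that kind of guard. The local naming scheme and a per-run
    cap, rather than a strict per-hour limit, are assumptions here.)

        use strict;
        use warnings;
        use File::Basename qw(basename);

        my $MAX_PER_RUN = 200;   # cap on new download attempts per scheduled run
        my $fetched     = 0;

        open my $list, '<', 'items.txt' or die "items.txt: $!";
        while (my $url = <$list>) {
            chomp $url;
            my $file = 'downloads/' . basename($url);
            next if -e $file && -s $file;         # already have a non-empty copy
            last if $fetched++ >= $MAX_PER_RUN;   # stop a runaway run cold
            system('wget', '-c', '-O', $file, $url) == 0
                or warn "wget failed for $url\n";
        }
        close $list;

    Scheduled for off-hours (Task Scheduler on the Windows box), a cap like
    this keeps a bad list or a missing download directory from turning into
    an endless re-fetch of data you already have.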

    Ted
     
    Ted Zlatanov, Jan 29, 2007
    #9
  10. coolneo

    coolneo Guest

    Thanks everyone. I'm going to give LWP::Parallel a closer look. That
    looks like it will do what I want. Thanks for the advice on queuing
    the downloads. That makes perfect sense.
     
    coolneo, Jan 30, 2007
    #10

Similar Threads
  1. blantz (Replies: 3, Views: 414; Alexander, Nov 23, 2004)
  2. Himanshu, "problem in uploading and downloading files from DB in ASP.Net", Jun 25, 2005, in forum: ASP .Net (Replies: 4, Views: 721; Himanshu, Jul 1, 2005)
  3. javadrivesmenuts (Replies: 2, Views: 498; Andrew Thompson, Nov 26, 2003)
  4. Jim Bancroft (Replies: 6, Views: 347; Laurent Bugnion, MVP, Aug 2, 2007)
  5. brad (Replies: 9, Views: 383; Bruno Desthuilliers, Jun 19, 2008)