Downloading lots and lots and lots of files

Discussion in 'Perl Misc' started by coolneo, Jan 29, 2007.

  1. coolneo

    coolneo Guest

    First, what I am doing is legit... I'm NOT trying to grab someone
    else's content. I work for a non-profit organization and we have
    something going on with Google where they are providing digitized
    versions of our material. They (Google) provided some information on
    how to write a script (shell) to download the digitized versions using
    wget.

    There are about 50,000 items, ranging in size from 15MB-600MB. My
    script downloads them fine, but it would be much faster if I could
    multi-thread(?) it. I'm running wget using the system command on a
    Windows box (I know, I know, but the whole place is Windows so I don't
    have much of a choice).
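
    (A minimal sketch of the serial loop being described, assuming the URLs
    sit one per line in a file called items.txt; file names and paths here
    are illustrative, not the actual script.)

        use strict;
        use warnings;

        # Read the URL list and fetch each item one at a time.
        open my $list, '<', 'items.txt' or die "Can't open items.txt: $!";
        while (my $url = <$list>) {
            chomp $url;
            next unless $url;
            # -c resumes a partial download, -P picks the target directory;
            # wget.exe needs to be on the PATH of the Windows box.
            my $rc = system('wget', '-c', '-P', 'downloads', $url);
            warn "wget failed for $url (exit code $rc)\n" if $rc != 0;
        }
        close $list;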

    Am I on the right track? Or should I be doing this differently?

    Thanks!
    J
     
    coolneo, Jan 29, 2007
    #1

  2. coolneo

    coolneo Guest

    On Jan 29, 10:04 am, Purl Gurl wrote:
    > coolneo wrote:
    > > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > > script downloads them fine, but it would be much faster if I could
    > > multi-thread(?) it.
    >
    > You indicate you have already downloaded those files.
    > Why do you want to download those files again?
    >
    > Purl Gurl



    I managed to download about 21,000 of the 50,000 items over the course
    of some time. Initially, Google was processing these items at a slow
    rate but lately they have picked it up.

    Bandwidth is indeed a concern, and I understand downloading 5TB will
    take a long long time, but I think it would be a little shorter if I
    could spawn off 4 downloads at a time, or even 2, during our off
    business hours and the weekend (I get . The average file size is
    125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    it?).
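
    (Back-of-envelope, using the figures above: roughly 29,000 items remain
    at an average of 125MB each, or about 3.6TB; a 200Mbit pipe at full
    tilt moves about 25MB/s, roughly 90GB per hour, so the raw transfer is
    on the order of 40 hours. In practice the per-connection rate the far
    end allows is usually what limits a single wget stream, which is why a
    few parallel downloads can help.)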
     
    coolneo, Jan 29, 2007
    #2

  3. Peter Scott

    Peter Scott Guest

    On Mon, 29 Jan 2007 06:44:02 -0800, coolneo wrote:
    > First, what I am doing is legit... I'm NOT trying to grab someone
    > else's content. I work for a non-profit organization and we have
    > something going on with Google where they are providing digitized
    > versions of our material. They (Google) provided some information on
    > how to write a script (shell) to download the digitized versions using
    > wget.
    >
    > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > script downloads them fine, but it would be much faster if I could
    > multi-thread(?) it. I'm running wget using the system command on a
    > Windows box (I know, I know, but the whole place is Windows so I don't
    > have much of a choice).


    You could try

    http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel.pm

    Looks like you'll need Cygwin.
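
    (A rough sketch of what that might look like. The URL list, the local
    file naming, and passing a file name as register()'s second argument,
    LWP::UserAgent-style, are assumptions to check against the module's
    docs.)

        use strict;
        use warnings;
        use LWP::Parallel::UserAgent;
        use HTTP::Request;
        use File::Basename qw(basename);

        open my $fh, '<', 'items.txt' or die "items.txt: $!";
        chomp(my @urls = <$fh>);
        close $fh;

        my $pua = LWP::Parallel::UserAgent->new();
        $pua->in_order(0);      # completion order doesn't matter
        $pua->duplicates(0);    # ignore duplicate URLs
        $pua->redirect(1);      # follow redirects
        $pua->max_req(4);       # at most 4 simultaneous requests per host

        for my $url (@urls) {
            my $file = 'downloads/' . basename($url);
            next if -e $file;   # crude "already got it" check
            # The second argument is meant to behave like LWP::UserAgent's
            # $arg: a file name to stream the body into, rather than holding
            # a 125MB response in memory.
            if (my $err = $pua->register(HTTP::Request->new(GET => $url), $file)) {
                print STDERR $err->error_as_HTML;
            }
        }

        my $entries = $pua->wait();   # blocks until registered requests finish
        for my $key (keys %$entries) {
            my $res = $entries->{$key}->response;
            printf "%s => %s\n", $entries->{$key}->request->uri, $res->status_line;
        }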

    --
    Peter Scott
    http://www.perlmedic.com/
    http://www.perldebugged.com/
     
    Peter Scott, Jan 29, 2007
    #3
  4. Ted Zlatanov

    Ted Zlatanov Guest

    On 29 Jan 2007, coolneo wrote:

    > I managed to download about 21,000 of the 50,000 items over the course
    > of some time. Initially, Google was processing these items at a slow
    > rate but lately they have picked it up.
    >
    > Bandwidth is indeed a concern, and I understand downloading 5TB will
    > take a long long time, but I think it would be a little shorter if I
    > could spawn off 4 downloads at a time, or even 2, during our off
    > business hours and the weekend (I get . The average file size is
    > 125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    > it?).


    You should contact Google and request the data directly. I guarantee
    you they will be happy to avoid the load on their network and
    servers, since HTTP is not the best way to transfer lots of data.

    Ted
     
    Ted Zlatanov, Jan 29, 2007
    #4
  5. Xho

    Guest

    Abigail wrote:
    >
    > Of course, it's quite likely that the network is the bottleneck.
    > Starting up many simultaneous connections isn't going to help in
    > that case.
    >
    > Finally, I wouldn't use threads. I'd either fork() or use a select()
    > loop, depending on the details of the work that needs to be done.
    > But then, I'm a Unix person.


    I probably wouldn't even use fork. I'd just make 3 (or 4, or 10, whatever)
    different to-do lists, and start up 3 (or 4, or 10) completely independent
    programs from the command line.
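
    (A sketch of that split, assuming one master file of URLs; it deals the
    lines out round-robin into N smaller lists, one per job.)

        use strict;
        use warnings;

        my $n = 4;    # how many independent download jobs to run

        open my $in, '<', 'items.txt' or die "items.txt: $!";
        my @out;
        for my $i (0 .. $n - 1) {
            open $out[$i], '>', "items.$i.txt" or die "items.$i.txt: $!";
        }
        my $count = 0;
        while (my $url = <$in>) {
            print { $out[$count++ % $n] } $url;   # deal URLs out round-robin
        }
        close $_ for $in, @out;

    Each copy of the download script is then pointed at its own items.N.txt
    and launched from its own command window (or with the Windows start
    command).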

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jan 29, 2007
    #5
  6. gf

    gf Guest

    coolneo wrote:
    > [...] They (Google) provided some information on
    > how to write a script (shell) to download the digitized versions using
    > wget.
    >
    > There are about 50,000 items, ranging in size from 15MB-600MB. My
    > script downloads them fine, but it would be much faster if I could
    > multi-thread(?) it. I'm running wget using the system command on a
    > Windows box (I know, I know, but the whole place is Windows so I don't
    > have much of a choice).
    >
    > Am I on the right track? Or should I be doing this differently?


    You didn't say if this is a one-time job or something that'll be ongoing.

    If it's a one-time job, then I'd split that file list into however
    many processes I want to run, then start that many shell jobs and just
    let 'em run until it's done. It's not elegant, it's brute force, but
    sometimes that's plenty good.

    If you're going to be doing this regularly, then LWP::Parallel is
    pretty sweet. You can have each LWP agent shift an individual URL off
    the list and slowly whittle it down.

    The I/O issues mentioned are going to be worse on a single box, though.
    You can hit a point where the machine is network I/O bound, so you
    might want to consider confiscating a couple of PCs and running a
    separate job on each, as long as you're on a switch and a fast pipe.

    I'd also seriously consider a modern sneaker-net: see about buying
    some hard drives that'll hold the entire set of data, send them to
    Google, have them fill the drives, and then have them returned by
    overnight air. That might be a lot faster, and then you could reuse
    the drives later.
     
    gf, Jan 29, 2007
    #6
  7. coolneo

    coolneo Guest

    On Jan 29, 12:20 pm, Ted Zlatanov wrote:
    > On 29 Jan 2007, coolneo wrote:
    >
    > > I managed to download about 21,000 of the 50,000 items over the course
    > > of some time. Initially, Google was processing these items at a slow
    > > rate but lately they have picked it up.
    > > Bandwidth is indeed a concern, and I understand downloading 5TB will
    > > take a long long time, but I think it would be a little shorter if I
    > > could spawn off 4 downloads at a time, or even 2, during our off
    > > business hours and the weekend (I get . The average file size is
    > > 125MB. We have a 200Mb pipe, so it's not entirely unreasonable (is
    > > it?).
    >
    > You should contact Google and request the data directly. I guarantee
    > you they will be happy to avoid the load on their network and
    > servers, since HTTP is not the best way to transfer lots of data.
    >
    > Ted


    Ted, I didn't provide some additional information that may make
    you think differently:

    Google is kinda odd sometimes. It took them forever to allow multiple
    download streams, and then they provide this web interface to recall
    data in text format with wget. I mean, for Google, you'd figure they
    could do better. I think they would prefer to not give us anything at
    all. Once we have it there is always the chance we'll give it away or
    lose it or have it stolen (by Microsoft!).

    Another thing I didn't mention is that this can grow to much more
    than the 50,000 items, in which case I'd much rather just auto-download
    than deal with media.
     
    coolneo, Jan 29, 2007
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    coolneo wrote:

    > recall data in text format with wget.


    I assume it is gz-compressed?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jan 29, 2007
    #8
  9. Ted Zlatanov

    Ted Zlatanov Guest

    On 29 Jan 2007, coolneo wrote:

    > Google is kinda odd sometimes. It took them forever to allow multiple
    > download streams, and then they provide this web interface to recall
    > data in text format with wget. I mean, for Google, you'd figure they
    > could do better. I think they would prefer to not give us anything at
    > all. Once we have it there is always the chance we'll give it away or
    > lose it or have it stolen (by Microsoft!).


    As a business decision it may make sense; technically it's nonsense :)

    At the very least they should give you an rsync interface. It's a
    single TCP stream, it's fast, and it can be resumed if the connection
    should abort. HTTP is low on my list of transport mechanisms for
    large files.

    > Another thing I didn't mention is that this can grow to much larger
    > than the 50,000, in which case, I'd much rather just auto-download,
    > than deal with media.


    Sure. I was talking about your initial data load; subsequent loads
    can be incremental.

    I would also suggest limiting to N downloads per hour, to avoid bugs
    or other situations (unmounted disk, for example) where you're
    repeatedly requesting all the data you already have. That's a very
    nasty situation.
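
    (A sketch of that kind of guard. The local naming scheme and a per-run
    cap, rather than a strict per-hour limit, are assumptions here.)

        use strict;
        use warnings;
        use File::Basename qw(basename);

        my $MAX_PER_RUN = 200;   # cap on new download attempts per scheduled run
        my $fetched     = 0;

        open my $list, '<', 'items.txt' or die "items.txt: $!";
        while (my $url = <$list>) {
            chomp $url;
            my $file = 'downloads/' . basename($url);
            next if -e $file && -s $file;         # already have a non-empty copy
            last if $fetched++ >= $MAX_PER_RUN;   # stop a runaway run cold
            system('wget', '-c', '-O', $file, $url) == 0
                or warn "wget failed for $url\n";
        }
        close $list;

    Scheduled for off-hours (Task Scheduler on the Windows box), a cap like
    this keeps a bad list or a missing download directory from turning into
    an endless re-fetch of data you already have.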

    Ted
     
    Ted Zlatanov, Jan 29, 2007
    #9
  10. coolneo

    coolneo Guest

    Thanks everyone. I'm going to give LWP::Parallel a closer look. That
    looks like it will do what I want. Thanks for the advice on queuing
    the downloads. That makes perfect sense.
     
    coolneo, Jan 30, 2007
    #10

Similar Threads
  1. blantz (Replies: 3, Views: 414; Alexander, Nov 23, 2004)
  2. Himanshu, "problem in uploading and downloading files from DB in ASP.Net", Jun 25, 2005, in forum: ASP .Net (Replies: 4, Views: 721; Himanshu, Jul 1, 2005)
  3. javadrivesmenuts (Replies: 2, Views: 498; Andrew Thompson, Nov 26, 2003)
  4. Jim Bancroft (Replies: 6, Views: 347; Laurent Bugnion, MVP, Aug 2, 2007)
  5. brad (Replies: 9, Views: 383; Bruno Desthuilliers, Jun 19, 2008)