parallel downloads


John Deas

Hi,

I would like to write a Python script that will download a list of
files (mainly MP3s) from the Internet. For this, I thought to use urllib,
with

urlopen("myUrl").read() and then writing the resulting string to a
file.

My problem is that I would like to download several files at a time.
As I don't have much experience in programming, could you point me to the
easiest ways to do this in Python?

Thanks,

JD
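The approach JD describes (urlopen().read(), then write the bytes out) is enough for a single file. A minimal runnable sketch in today's urllib.request spelling, with fetch as a made-up helper name:

```python
from urllib.request import urlopen

def fetch(url, filename):
    # Download one file: read the whole body into memory, then write it out.
    # Fine for one MP3 at a time; for many large files you'd stream instead.
    data = urlopen(url).read()
    with open(filename, "wb") as f:
        f.write(data)
```

For example, fetch("http://example.com/show.mp3", "show.mp3") (a placeholder URL).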
 

Gary Herron

poof65 said:
For your problem you have to use threads.
Not at all true. Threads provide one way to solve this, but another is
the select function. For this simple case, select() may (or may not) be
easier to write. Pseudo-code would look something like this:

openSockets = list of sockets, one per download file
while openSockets:
    readySockets = select(openSockets ...)  # identifies sockets with data to be read
    for each s in readySockets:
        read from s and do whatever with the data
        if s is at EOF: close and remove s from openSockets

That's it. Far easier than threads.
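Filling in the details, a runnable version of that sketch might look like the following. This speaks plain HTTP over raw sockets; the host names and file names are up to the caller, and for simplicity the HTTP response headers end up in the output file along with the body:

```python
import select
import socket

def download_all(jobs, port=80):
    """Download several files concurrently with select().

    jobs is a list of (host, path, filename) tuples.  For simplicity the
    HTTP response headers are written to the file along with the body;
    a real script would strip them first.
    """
    open_sockets = {}  # maps each open socket to its output file object
    for host, path, fn in jobs:
        s = socket.create_connection((host, port))
        request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
        s.sendall(request.encode("ascii"))
        open_sockets[s] = open(fn, "wb")

    while open_sockets:
        # Block until at least one socket has data ready to be read
        ready, _, _ = select.select(list(open_sockets), [], [])
        for s in ready:
            data = s.recv(8192)
            if data:
                open_sockets[s].write(data)
            else:  # EOF: the server closed the connection
                open_sockets[s].close()
                s.close()
                del open_sockets[s]
```

For example, download_all([("example.com", "/a.mp3", "a.mp3"), ("example.com", "/b.mp3", "b.mp3")]) (placeholder hosts and paths).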

Gary Herron
 

John Deas


Thank you both for your help. Threads are working for me. However, a
new problem for me is that the URLs I want to download are in an XML
file (I want to download podcasts), and the URL listed is not the same
as the file downloaded:

http://www.sciam.com/podcast/podcast.mp3?e_id=86102326-0B1F-A3D4-74B2BBD61E9ECD2C&ref=p_rss

will be redirected to download:

http://podcast.sciam.com/daily/sa_d_podcast_080307.mp3

Is there a way, knowing the first URL, to get the second at runtime in
my script?
 

John Deas


Found it: geturl() does the job
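For reference: geturl() on the object returned by urlopen() reports the final URL after any redirects have been followed. In today's urllib.request spelling, a small sketch (final_url is just an illustrative name):

```python
from urllib.request import urlopen

def final_url(url):
    # urlopen() follows HTTP redirects automatically; geturl() then
    # reports the URL the data actually came from.
    response = urlopen(url)
    return response.geturl()
```

Called on the sciam.com feed URL above, this would return the podcast.sciam.com address the download is redirected to.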
 

castironpi

 my problem is that I would like to download several files at a time.
Found it: geturl() does the job

That's for normalizing schemes. I believe you subclass FancyURLopener
and override the read method.
 

Gabriel Genellina

On Sat, 08 Mar 2008 14:47:45 -0200, Gary Herron wrote:
That's it. Far easier than threads.

Easier? If you omit all the relevant details, yes, it looks easy. For
example, you read some data from one socket, part of the file you're
downloading. Where do you write it? You need additional structures to
keep track of things.
Pseudocode for the threaded version, complete with socket creation:

def downloadfile(url, fn):
    s = create socket for url
    f = open filename for writing
    shutil.copyfileobj(s.makefile(), f)

for each url, filename to retrieve:
    t = threading.Thread(target=downloadfile, args=(url, filename))
    add t to threadlist
    t.start()

for each t in threadlist:
    t.join()

The downloadfile function looks simpler to me - it's what anyone would
write in a single-threaded program, with local variables and keeping full
state.
The above pseudocode can be converted directly into Python code - no
additional structures or code are required.

Of course, don't try to download a million files at the same time -
neither a million sockets nor a million threads would work.
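That pseudocode maps almost line for line onto urllib and threading. A runnable sketch in Python 3 spelling, with download_many as a made-up wrapper name (the urlopen() response object works directly with shutil.copyfileobj):

```python
import shutil
import threading
from urllib.request import urlopen

def downloadfile(url, fn):
    # One complete, self-contained download: open the URL, stream to disk.
    with urlopen(url) as response, open(fn, "wb") as f:
        shutil.copyfileobj(response, f)

def download_many(jobs):
    """jobs: list of (url, filename) pairs, downloaded concurrently."""
    threads = []
    for url, filename in jobs:
        t = threading.Thread(target=downloadfile, args=(url, filename))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
```

Each thread keeps its whole state in local variables, which is what makes this version easy to reason about.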
 

castironpi

That's it.  Far easier than threads.

I'll order an 'easiness' metric from the warehouse. Of course,
resources are parameters to the metric, such as facility given lots of
time, facility given lots of libraries, facility given hot shots, &c.

Easier? If you omit all the relevant details, yes, looks easy.

def downloadfile(url, fn):
   s = create socket for url
   f = open filename for writing
   shutil.copyfileobj(s.makefile(), f)

for each url, filename to retrieve:
[ threadlist.addandstart( threading.Thread(target=downloadfile,
args=(url,filename)) ) ][ threadlist.joineach() ]
Of course, don't try to download a million files at the same time -  
neither a million sockets nor a million threads would work.

Dammit! Then what's my million-core machine for? If architectures
"have reached the point of diminishing returns" ( off py.ideas ), then
what's the PoDR for numbers of threads per core?

Answer: One. Just write data structures and don't swap context. But
when do you want it by? What is the PoDR for amount of effort per
clock cycle saved? Get a Frank and a Brit and ask them what language
is easiest to speak.

(Answer: Math. Har *plonk*.)
 
