Downloading binary files - Python3


Anders Eriksson

Hello,

I have made a short program that, given a URL, will download all files
referenced on that page.

It works, but I'm thinking it could use some optimization since it's very
slow.

I create a list of tuples where each tuple consists of the URL to the file
and the path where I want to save it, e.g. (http://somewhere.com/foo.mp3,
c:\Music\foo.mp3).

The downloading part (which is the part I need help with) looks like this:
def GetFiles():
    """do the actual copying of files"""
    for url, path in hreflist:
        print(url, end=" ")
        srcdata = urlopen(url).read()
        dstfile = open(path, mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

At the moment the print(url, end=" ") output does not appear before the
actual download; instead it appears at the same time as print("Done!").
I would like it to work the way I intended.

Is downloading a binary file with srcdata = urlopen(url).read() the best
way? Is there some other way that would speed up the downloading?

// Anders
 

Matteo

        srcdata = urlopen(url).read()
        dstfile = open(path, mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

Have you tried reading all the files first, then saving each one in the
appropriate directory? It might work if you have enough memory, i.e.
if the files you are downloading are small (and I assume they are,
otherwise it would be almost useless to optimize the code, since the
most time-consuming part would always be the download). Anyway, I would
try and time it, or timeit. ;)
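
A minimal sketch of that two-phase idea, assuming the hreflist of
(url, path) tuples and urlopen from the original post (note that it keeps
every downloaded file in memory, so it is only suitable for small files):

from urllib.request import urlopen

def get_files_two_phase(hreflist):
    # Phase 1: download everything into memory.
    downloads = [(path, urlopen(url).read()) for url, path in hreflist]
    # Phase 2: write each file to its destination.
    for path, data in downloads:
        with open(path, mode='wb') as dstfile:
            dstfile.write(data)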

Anyway, opening a network connection does take some time, independent
of the size of the files you are downloading and of the kind of code
requesting it; you can't do much about that. If you had Linux you
could probably get better results with wget, but that's another story
altogether.
 

Peter Otten

Anders said:
Hello,

I have made a short program that, given a URL, will download all files
referenced on that page.

It works, but I'm thinking it could use some optimization since it's very
slow.

I create a list of tuples where each tuple consists of the URL to the file
and the path where I want to save it, e.g. (http://somewhere.com/foo.mp3,
c:\Music\foo.mp3).

The downloading part (which is the part I need help with) looks like this:
def GetFiles():

Consider passing 'hreflist' explicitly. Global variables make your script
harder to manage in the long run.
"""do the actual copying of files"""
for url,path in hreflist:
print(url,end=" ")

You can force python to write out its internal buffer by calling

sys.stdout.flush()
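
In context that would be (a sketch; from Python 3.3 on,
print(url, end=" ", flush=True) does the same in a single call):

import sys

for url, path in hreflist:
    print(url, end=" ")
    sys.stdout.flush()   # make the URL appear before the download starts
    ...                  # download and save as before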

You may also take a look at the logging package.
        srcdata = urlopen(url).read()

For large files you would read the source in chunks:

src = urlopen(url)
with open(path, mode="wb") as dstfile:
    while True:
        chunk = src.read(2**20)
        if not chunk:
            break
        dstfile.write(chunk)

Instead of writing this loop yourself you can use

shutil.copyfileobj(src, dstfile)

or even

urllib.request.urlretrieve(url, path)

which also takes care of opening the file.
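
For example (a sketch wiring those calls into the existing loop; url and
path come from hreflist as before):

import shutil
from urllib.request import urlopen, urlretrieve

# chunked copy without writing the loop yourself
src = urlopen(url)
with open(path, mode='wb') as dstfile:
    shutil.copyfileobj(src, dstfile)

# or let urllib open and write the file as well
urlretrieve(url, path)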
        dstfile = open(path, mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

At the moment the print(url, end=" ") output does not appear before the
actual download; instead it appears at the same time as print("Done!").
I would like it to work the way I intended.

Is downloading a binary file with srcdata = urlopen(url).read() the best
way? Is there some other way that would speed up the downloading?


The above method may not be faster (the operation is "io-bound"), but it
can handle large files gracefully.

Peter
 

Stefan Behnel

Anders said:
I have made a short program that, given a URL, will download all files
referenced on that page.

It works, but I'm thinking it could use some optimization since it's very
slow.

What's slow about it? Is downloading each file slow, is it the overhead of
connecting to the server before the download, or is it more the feeling
that the overall process could use your bandwidth better?

I create a list of tuples where each tuple consists of the URL to the file
and the path where I want to save it, e.g. (http://somewhere.com/foo.mp3,
c:\Music\foo.mp3).

The downloading part (which is the part I need help with) looks like this:
def GetFiles():
    """do the actual copying of files"""
    for url, path in hreflist:
        print(url, end=" ")
        srcdata = urlopen(url).read()
        dstfile = open(path, mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

At the moment the print(url, end=" ") output does not appear before the
actual download; instead it appears at the same time as print("Done!").
I would like it to work the way I intended.

Is downloading a binary file with srcdata = urlopen(url).read() the best
way? Is there some other way that would speed up the downloading?

Yes. Instead of running the downloads in a sequential loop, put the code
for downloading one file into a function and start one thread per file,
each of which runs that function (see the threading module). That way, each
thread can happily sit and wait for data coming from its server, without
preventing other threads from receiving data from their server at the same
time. That should get your bandwidth usage up.

You may have to take care that you do not run too many threads against the
same server (which may get upset and block your requests, depending on the
site), or that you limit the number of threads when you download a large
number of files. Running too many threads can slow things down again. But
you'll see that when you try.
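
A minimal sketch of the one-thread-per-file idea, reusing hreflist and the
download code from the original post (the function name is illustrative):

import threading
from urllib.request import urlopen

def download(url, path):
    # fetch one file; each thread runs this independently
    with open(path, mode='wb') as dstfile:
        dstfile.write(urlopen(url).read())

threads = [threading.Thread(target=download, args=(url, path))
           for url, path in hreflist]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait until every download has finished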

Stefan
 

MRAB

Matteo said:
Have you tried reading all the files first, then saving each one in the
appropriate directory? It might work if you have enough memory, i.e.
if the files you are downloading are small (and I assume they are,
otherwise it would be almost useless to optimize the code, since the
most time-consuming part would always be the download). Anyway, I would
try and time it, or timeit. ;)

Anyway, opening a network connection does take some time, independent
of the size of the files you are downloading and of the kind of code
requesting it; you can't do much about that. If you had Linux you
could probably get better results with wget, but that's another story
altogether.
If your net connection is working at its maximum then there's nothing
you can do to speed up the downloads.

If it's the response time that's the problem then you could put the
tuples into a queue and run a number of threads, each one repeatedly
getting a tuple from the queue and downloading, until the queue is
empty.
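
A sketch of that queue-based worker pool (the worker count and names are
illustrative; hreflist is assumed from the original post):

import queue
import threading
from urllib.request import urlopen

NUM_WORKERS = 4   # tune to your bandwidth and the servers involved

def worker(q):
    while True:
        try:
            url, path = q.get_nowait()
        except queue.Empty:
            return   # queue is empty, this worker is done
        with open(path, mode='wb') as dstfile:
            dstfile.write(urlopen(url).read())

q = queue.Queue()
for item in hreflist:
    q.put(item)
threads = [threading.Thread(target=worker, args=(q,))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()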
 
