urllib2 rate limiting

  • Thread starter Dimitrios Apostolou

Dimitrios Apostolou

Hello list,

I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter. Perhaps I can have the reporthook increment a global counter and
sleep as necessary when a threshold is reached.
However, there is nothing similar in urllib2. Isn't urllib2 supposed
to be a superset of urllib in functionality? Why is there no reporthook
parameter in any of urllib2's functions?
Moreover, even the existing reporthook behaviour doesn't seem quite
right: reporthook(blocknum, bs, size) is always called with bs=8K, even
for the last block, and sometimes (blocknum*bs > size) is possible if the
server sends a wrong Content-Length HTTP header.

3) Perhaps I can use filehandle.read(1024) and manually read as many
chunks of data as I need. However, I think this would generally be
inefficient, and I'm not sure how it would interact with urllib2's
internal buffering.
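The blocknum*bs overshoot described in point 2 can be guarded against by clamping the byte count; a minimal sketch (the helper name `bytes_downloaded` is my own, not part of urllib):

```python
def bytes_downloaded(blocknum, bs, size):
    """Return a sane byte count from reporthook arguments.

    blocknum * bs can overshoot the real size on the last block
    (and size may be wrong, or -1 when the server sends no
    Content-Length), so clamp only when a plausible size is known.
    """
    count = blocknum * bs
    if size >= 0:
        count = min(count, size)
    return count
```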

So how do you think I can achieve rate limiting in urllib2?


Thanks in advance,
Dimitris

P.S. And something simpler: how can I prevent urllib2 from following
redirections to foreign hosts?
 

Dimitrios Apostolou

You need to subclass `urllib2.HTTPRedirectHandler`, override the
`http_error_301` and `http_error_302` methods, and raise a
`urllib2.HTTPError` exception.

Thanks! I think for my case it's better to override the redirect_request
method and return a Request only when the redirection goes to the same
site. Just one more question, since I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I am
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?


Thanks,
Dimitris
 

Rob Wolfe

Dimitrios Apostolou said:
Thanks! I think for my case it's better to override the redirect_request
method and return a Request only when the redirection goes to the same
site. Just one more question, since I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I am
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?

Well, according to the documentation there is no better way.
But I looked into the source code of `urllib2`, and it seems
that the `redirect_request` method takes one more parameter,
`newurl`, which is probably what you're looking for. ;)
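Putting this together with the earlier suggestion to subclass `urllib2.HTTPRedirectHandler`, a minimal sketch of a same-host-only redirect policy might look as follows (the class name `SameHostRedirectHandler` is my own, and the try/except import is only there so the snippet also runs on Python 3, where urllib2 became urllib.request):

```python
try:                       # Python 2, as used in this thread
    import urllib2 as request
    from urlparse import urlparse
except ImportError:        # Python 3: the same classes live in urllib.request
    import urllib.request as request
    from urllib.parse import urlparse

class SameHostRedirectHandler(request.HTTPRedirectHandler):
    """Follow redirects only when they stay on the original host."""
    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        # newurl is the resolved 'Location:' target
        if urlparse(newurl).hostname != urlparse(req.get_full_url()).hostname:
            raise request.HTTPError(newurl, code,
                                    "refusing cross-host redirect", hdrs, fp)
        # same host: fall back to the stock redirect behaviour
        return request.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, hdrs, newurl)
```

An opener built with `request.build_opener(SameHostRedirectHandler)` then raises `HTTPError` on a cross-host redirect instead of silently following it.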

Regards,
Rob
 

Dimitrios Apostolou

Well, according to the documentation there is no better way.
But I looked into the source code of `urllib2`, and it seems
that the `redirect_request` method takes one more parameter,
`newurl`, which is probably what you're looking for. ;)

Regards,
Rob

Cool! :) Sometimes undocumented features provide superb solutions... I wonder
if there is something similar for rate limiting :-s


Thank you,
Dimitris
 

Nick Craig-Wood

Dimitrios Apostolou said:
I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter.

Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

------------------------------------------------------------
#!/usr/bin/python

"""
Fetch a url rate limited

Syntax: rate-limited-fetch.py rate URL local_file_name
"""

import os
import sys
import urllib
from time import time, sleep

class RateLimit(object):
    """Rate limit a url fetch"""
    def __init__(self, rate_limit):
        """rate limit in kBytes / second"""
        self.rate_limit = rate_limit
        self.start = time()
    def __call__(self, block_count, block_size, total_size):
        total_kb = total_size / 1024
        downloaded_kb = (block_count * block_size) / 1024
        elapsed_time = time() - self.start
        if elapsed_time != 0:
            rate = downloaded_kb / elapsed_time
            print "%d kb of %d kb downloaded %.1f kBytes/s" % (downloaded_kb, total_kb, rate)
        expected_time = downloaded_kb / self.rate_limit
        sleep_time = expected_time - elapsed_time
        print "Sleep for", sleep_time
        if sleep_time > 0:
            sleep(sleep_time)

def main():
    """Fetch the contents of urls"""
    if len(sys.argv) != 4:
        print 'Syntax: %s "rate in kBytes/s" URL "local output path"' % sys.argv[0]
        raise SystemExit(1)
    rate_limit, url, out_path = sys.argv[1:]
    rate_limit = float(rate_limit)
    print "Fetching %r to %r with rate limit %.1f" % (url, out_path, rate_limit)
    urllib.urlretrieve(url, out_path, reporthook=RateLimit(rate_limit))

if __name__ == "__main__":
    main()
------------------------------------------------------------

Use it like this

$ ./rate-limited-fetch.py 16 http://some/url/or/other z
Fetching 'http://some/url/or/other' to 'z' with rate limit 16.0
0 kb of 10118 kb downloaded 0.0 kBytes/s
Sleep for -0.0477550029755
8 kb of 10118 kb downloaded 142.1 kBytes/s
Sleep for 0.443691015244
16 kb of 10118 kb downloaded 32.1 kBytes/s
Sleep for 0.502038002014
24 kb of 10118 kb downloaded 24.0 kBytes/s
Sleep for 0.498028993607
32 kb of 10118 kb downloaded 21.3 kBytes/s
Sleep for 0.497982025146
40 kb of 10118 kb downloaded 20.0 kBytes/s
Sleep for 0.497948884964
48 kb of 10118 kb downloaded 19.2 kBytes/s
Sleep for 0.498008966446
....
1416 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499262094498
1424 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499293088913
1432 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499292135239
1440 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499267101288
Sleep for 0.499267101288
....
 

Dimitrios Apostolou

Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

Thanks! Really nice implementation. However, I'm stuck with urllib2
because of its extra functionality, so I'll try to implement something
similar using handle.read(1024) to read in small chunks.

It really seems weird that urllib2 is missing reporthook functionality!
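For reference, the chunked-read idea can be sketched with a limiter shared between parallel downloads, so their combined rate stays under the cap as asked at the top of the thread (class and function names are my own; `fetch` works with any file-like object, including the handle returned by `urllib2.urlopen`):

```python
import threading
import time

class SharedRateLimiter(object):
    """Limiter shared by several download threads.

    Each thread calls wait(nbytes) after reading a chunk; the call
    sleeps just long enough to keep the combined average rate of all
    callers under the cap.
    """
    def __init__(self, max_bytes_per_sec):
        self.max_bps = float(max_bytes_per_sec)
        self.lock = threading.Lock()
        self.total = 0
        self.start = time.time()

    def wait(self, nbytes):
        with self.lock:
            self.total += nbytes
            # seconds the bytes *should* have taken at the capped rate
            expected = self.total / self.max_bps
            delay = expected - (time.time() - self.start)
        if delay > 0:
            time.sleep(delay)

def fetch(handle, limiter, chunk_size=1024):
    """Read handle in small chunks, throttling via the shared limiter."""
    chunks = []
    while True:
        data = handle.read(chunk_size)
        if not data:
            break
        limiter.wait(len(data))
        chunks.append(data)
    return b"".join(chunks)
```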


Thank you,
Dimitris
 
