urllib2 rate limiting

  • Thread starter Dimitrios Apostolou

Dimitrios Apostolou

Hello list,

I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter. Perhaps I can have the reporthook increment a global counter and
sleep as necessary when a threshold is reached.
However, there is nothing similar in urllib2. Isn't urllib2 supposed
to be a superset of urllib in functionality? Why is there no reporthook
parameter in any of urllib2's functions?
Moreover, even the existing reporthook behaviour doesn't seem quite
right: reporthook(blocknum, bs, size) is always called with bs=8K, even
for the last block, and sometimes (blocknum*bs > size) is possible if the
server sends a wrong Content-Length HTTP header.

3) Perhaps I can use filehandle.read(1024) and manually read as many
chunks of data as I need. However, I think this would generally be
inefficient, and I'm not sure how it would interact with urllib2's
internal buffering.
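The blocknum*bs overshoot described in point 2 can be guarded against by clamping the byte count; a minimal sketch (the helper name `bytes_downloaded` is my own, not part of urllib):

```python
def bytes_downloaded(blocknum, bs, size):
    """Return a sane byte count from reporthook arguments.

    blocknum * bs can overshoot the real size on the last block
    (and size may be wrong, or -1 when the server sends no
    Content-Length), so clamp only when a plausible size is known.
    """
    count = blocknum * bs
    if size >= 0:
        count = min(count, size)
    return count
```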

So how do you think I can achieve rate limiting in urllib2?


Thanks in advance,
Dimitris

P.S. And something simpler: how can I prevent urllib2 from following
redirections to foreign hosts?
 

Dimitrios Apostolou

You need to subclass `urllib2.HTTPRedirectHandler`, override the
`http_error_301` and `http_error_302` methods, and raise a
`urllib2.HTTPError` exception.

Thanks! I think for my case it's better to override the redirect_request
method and return a Request only when the redirection goes to the same
site. Just one more question, since I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I am
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?


Thanks,
Dimitris
 

Rob Wolfe

Dimitrios Apostolou said:
Thanks! I think for my case it's better to override the redirect_request
method and return a Request only when the redirection goes to the same
site. Just one more question, since I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I am
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?

Well, according to the documentation there is no better way.
But I looked into the source code of `urllib2`, and it seems
that the `redirect_request` method takes one more parameter,
`newurl`, which is probably what you're looking for. ;)
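Putting this together with the earlier suggestion to subclass `urllib2.HTTPRedirectHandler`, a minimal sketch of a same-host-only redirect policy might look as follows (the class name `SameHostRedirectHandler` is my own, and the try/except import is only there so the snippet also runs on Python 3, where urllib2 became urllib.request):

```python
try:                       # Python 2, as used in this thread
    import urllib2 as request
    from urlparse import urlparse
except ImportError:        # Python 3: the same classes live in urllib.request
    import urllib.request as request
    from urllib.parse import urlparse

class SameHostRedirectHandler(request.HTTPRedirectHandler):
    """Follow redirects only when they stay on the original host."""
    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        # newurl is the resolved 'Location:' target
        if urlparse(newurl).hostname != urlparse(req.get_full_url()).hostname:
            raise request.HTTPError(newurl, code,
                                    "refusing cross-host redirect", hdrs, fp)
        # same host: fall back to the stock redirect behaviour
        return request.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, hdrs, newurl)
```

An opener built with `request.build_opener(SameHostRedirectHandler)` then raises `HTTPError` on a cross-host redirect instead of silently following it.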

Regards,
Rob
 

Dimitrios Apostolou

Well, according to the documentation there is no better way.
But I looked into the source code of `urllib2`, and it seems
that the `redirect_request` method takes one more parameter,
`newurl`, which is probably what you're looking for. ;)

Regards,
Rob

Cool! :) Sometimes undocumented features provide superb solutions... I wonder
if there is something similar for rate limiting :-s


Thank you,
Dimitris
 

Nick Craig-Wood

Dimitrios Apostolou said:
I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter.

Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

------------------------------------------------------------
#!/usr/bin/python

"""
Fetch a url rate limited

Syntax: rate-limited-fetch.py rate URL local_file_name
"""

import os
import sys
import urllib
from time import time, sleep

class RateLimit(object):
    """Rate limit a url fetch"""
    def __init__(self, rate_limit):
        """rate limit in kBytes / second"""
        self.rate_limit = rate_limit
        self.start = time()
    def __call__(self, block_count, block_size, total_size):
        total_kb = total_size / 1024
        downloaded_kb = (block_count * block_size) / 1024
        elapsed_time = time() - self.start
        if elapsed_time != 0:
            rate = downloaded_kb / elapsed_time
            print "%d kb of %d kb downloaded %.1f kBytes/s" % (downloaded_kb, total_kb, rate)
        expected_time = downloaded_kb / self.rate_limit
        sleep_time = expected_time - elapsed_time
        print "Sleep for", sleep_time
        if sleep_time > 0:
            sleep(sleep_time)

def main():
    """Fetch the contents of urls"""
    if len(sys.argv) != 4:
        print 'Syntax: %s "rate in kBytes/s" URL "local output path"' % sys.argv[0]
        raise SystemExit(1)
    rate_limit, url, out_path = sys.argv[1:]
    rate_limit = float(rate_limit)
    print "Fetching %r to %r with rate limit %.1f" % (url, out_path, rate_limit)
    urllib.urlretrieve(url, out_path, reporthook=RateLimit(rate_limit))

if __name__ == "__main__":
    main()
------------------------------------------------------------

Use it like this

$ ./rate-limited-fetch.py 16 http://some/url/or/other z
Fetching 'http://some/url/or/other' to 'z' with rate limit 16.0
0 kb of 10118 kb downloaded 0.0 kBytes/s
Sleep for -0.0477550029755
8 kb of 10118 kb downloaded 142.1 kBytes/s
Sleep for 0.443691015244
16 kb of 10118 kb downloaded 32.1 kBytes/s
Sleep for 0.502038002014
24 kb of 10118 kb downloaded 24.0 kBytes/s
Sleep for 0.498028993607
32 kb of 10118 kb downloaded 21.3 kBytes/s
Sleep for 0.497982025146
40 kb of 10118 kb downloaded 20.0 kBytes/s
Sleep for 0.497948884964
48 kb of 10118 kb downloaded 19.2 kBytes/s
Sleep for 0.498008966446
....
1416 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499262094498
1424 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499293088913
1432 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499292135239
1440 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499267101288
Sleep for 0.499267101288
....
 

Dimitrios Apostolou

Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

Thanks! Really nice implementation. However, I'm stuck with urllib2
because of its extra functionality, so I'll try to implement something
similar using handle.read(1024) to read in small chunks.

It really seems weird that urllib2 is missing reporthook functionality!
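For reference, the chunked-read idea can be sketched with a limiter shared between parallel downloads, so their combined rate stays under the cap as asked at the top of the thread (class and function names are my own; `fetch` works with any file-like object, including the handle returned by `urllib2.urlopen`):

```python
import threading
import time

class SharedRateLimiter(object):
    """Limiter shared by several download threads.

    Each thread calls wait(nbytes) after reading a chunk; the call
    sleeps just long enough to keep the combined average rate of all
    callers under the cap.
    """
    def __init__(self, max_bytes_per_sec):
        self.max_bps = float(max_bytes_per_sec)
        self.lock = threading.Lock()
        self.total = 0
        self.start = time.time()

    def wait(self, nbytes):
        with self.lock:
            self.total += nbytes
            # seconds the bytes *should* have taken at the capped rate
            expected = self.total / self.max_bps
            delay = expected - (time.time() - self.start)
        if delay > 0:
            time.sleep(delay)

def fetch(handle, limiter, chunk_size=1024):
    """Read handle in small chunks, throttling via the shared limiter."""
    chunks = []
    while True:
        data = handle.read(chunk_size)
        if not data:
            break
        limiter.wait(len(data))
        chunks.append(data)
    return b"".join(chunks)
```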


Thank you,
Dimitris
 
