urllib2 slow for multiple requests


Tomas Svarovsky

Hello everybody, really new to Python, so bear with me. I am trying to
write a very basic scraping tool. Basically it just grabs a page N
times and tells me how long it took. When I do this once, it is
blazingly fast, but when I increase the number of repetitions, it
slows down considerably (1 request is about 3 ms, 100 take 6 seconds).
I have done implementations in a couple of other languages (PHP, Ruby)
and none of them suffers from a similar problem; they behave linearly.
Maybe it is a known issue in urllib2, or I am simply using it badly. I
am using Python 2.4.3 and the machine runs CentOS; the script is
below. Thanks in advance.

import urllib2
from datetime import datetime

def application():
    start = datetime.now()
    req = urllib2.Request("http://127.0.0.1/gdc/about", None,
                          {'Accept': 'application/json'})
    for number in range(100):
        response = urllib2.urlopen(req)
    end = datetime.now()
    output = end - start
    print output

application()
 

cgoldberg

Basically it just grabs a page N
times and tells me how long it took.

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket. So your timing wouldn't be accurate.

try this instead:
response = urllib2.urlopen(req).read()

But that is not the problem you are describing...

when I increase the number of repetitions, it
slows down considerably (1 request is about 3 ms, 100 take 6 seconds).
Maybe it is a known issue in urllib2

I ran your code and cannot reproduce that behavior. No matter how
many repetitions, I still get a similar response time per transaction.

any more details or code samples you can provide?

-Corey Goldberg
 

Tomas Svarovsky

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket.  So your timing wouldn't be accurate.

try this instead:
response = urllib2.urlopen(req).read()

But that is not the problem you are describing...

Thanks for this pointer; it hadn't occurred to me.
I ran your code and cannot reproduce that behavior. No matter how
many repetitions, I still get a similar response time per transaction.

any more details or code samples you can provide?

I don't know. I have tried the program on my local Mac OS machine,
where I have several Python runtimes installed, and there is a huge
difference between the results under 2.6 and 2.4. So this might be the
problem. When run on 2.6, the results are comparable to PHP and better
than Ruby, which is what I expect.

The problem is that the server runs CentOS, and only 2.4 is available.
On which version did you run these tests?

Thanks
 

Tomas Svarovsky

One more thing: since I am stuck with 2.4 (and if this really is a
2.4 issue), is there some substitute for urllib2?
 

Richard Brodie

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket.

But that is not the problem you are describing...

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.
 

Tomas Svarovsky

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.

This is a good point, but then it would manifest regardless of the
language used, AFAIK. And this is not the case; the Ruby and PHP
implementations work quite fine.

Thanks for reply
 

cgoldberg

The problem is that the server runs CentOS, and only 2.4 is available.
On which version did you run these tests?

I tested with Windows XP and Python 2.5.4. I don't have a 2.4 setup I
can easily test with.

You can try httplib rather than urllib2. httplib is slightly lower
level and is actually used inside urllib2 for transport.
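
For example, a minimal sketch of the same test against the same local
endpoint might look like this (an illustration, with a fresh connection
per request to mirror what the urllib2 version does):

import httplib
from datetime import datetime

def application():
    start = datetime.now()
    for number in range(100):
        # a fresh connection per request, like the urllib2 version
        conn = httplib.HTTPConnection("127.0.0.1")
        conn.request("GET", "/gdc/about",
                     headers={'Accept': 'application/json'})
        response = conn.getresponse()
        response.read()  # drain the body before closing
        conn.close()
    print datetime.now() - start

application()

httplib also lets you hoist the HTTPConnection out of the loop and
reuse one keep-alive connection, which urllib2 does not offer; that
variant measures something different, though, since it skips the
per-request TCP setup.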

-Corey
 

cgoldberg

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.

True, I didn't think of that. I was assuming the client machine
wasn't resource constrained. That would definitely lead to inaccurate
timings if that were the case.
 

Richard Brodie

This is a good point, but then it would manifest regardless of the
language used AFAIK. And this is not the case, ruby and php
implementations are working quite fine.

What I meant was: not reading the data and leaving the connection
open is going to force the server to handle all 100 requests concurrently.
I'm guessing that's not what your other implementations do.
What happens to the timing if you call response.read(), response.close()?
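
In other words, a sketch of the suggested change to the original loop:

import urllib2
from datetime import datetime

def application():
    start = datetime.now()
    req = urllib2.Request("http://127.0.0.1/gdc/about", None,
                          {'Accept': 'application/json'})
    for number in range(100):
        response = urllib2.urlopen(req)
        response.read()   # actually pull the body off the socket
        response.close()  # release the connection before the next request
    print datetime.now() - start

application()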
 

Tomas Svarovsky

What I meant was: not reading the data and leaving the connection
open is going to force the server to handle all 100 requests concurrently.
I'm guessing that's not what your other implementations do.
What happens to the timing if you call response.read(), response.close()?

Now I get it. Nevertheless, even when I explicitly read from the
socket and then close it properly, the timing still doesn't change.

Thanks for advice though
 
