urllib2 slow for multiple requests


Tomas Svarovsky

Hello everybody, really new to Python, so bear with me. I am trying to
write a very basic scraping tool. Basically it just grabs a page N
times and tells me how long it took. When I do this once, it is
blazingly fast, but when I increase the number of repetitions, it
slows down considerably (1 request is about 3 ms, 100 take 6 seconds).
I have done implementations in a couple of other languages (PHP, Ruby)
and none of them suffers from a similar problem; they behave linearly.
Maybe it is a known issue in urllib2, or I am simply using it badly. I
am using Python 2.4.3 and the machine runs CentOS; the script is
below. Thanks in advance.

import urllib2
from datetime import datetime

def application():
    start = datetime.now()
    req = urllib2.Request("http://127.0.0.1/gdc/about", None,
                          {'Accept': 'application/json'})
    for number in range(100):
        response = urllib2.urlopen(req)
    end = datetime.now()
    output = end - start
    print output

application()
 

cgoldberg

Basically it just grabs a page N
times and tells me how long it took.

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket. So your timing wouldn't be accurate.

try this instead:
response = urllib2.urlopen(req).read()

But that is not the problem you are describing...

when I increase the number of repetitions, it
slows down considerably (1 request is about 3 ms, 100 take 6 seconds).
Maybe it is a known issue in urllib2

I ran your code and cannot reproduce that behavior. No matter how
many repetitions, I still get a similar response time per transaction.

any more details or code samples you can provide?

-Corey Goldberg
 

Tomas Svarovsky

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket.  So your timing wouldn't be accurate.

try this instead:
response = urllib2.urlopen(req).read()

But that is not the problem you are describing...

Thanks for this pointer; it hadn't occurred to me.
I ran your code and cannot reproduce that behavior. No matter how
many repetitions, I still get a similar response time per transaction.

any more details or code samples you can provide?

I don't know. I have tried the program on my local Mac OS machine,
where I have several Python runtimes installed, and there is a huge
difference between the results under 2.6 and 2.4. So this might be the
problem. When run on 2.6, the results are comparable to PHP and better
than Ruby, which is what I expect.

The problem is that the server runs CentOS, and only 2.4 is available.
On which version did you run these tests?

Thanks
 

Tomas Svarovsky

One more thing: since I am stuck with 2.4 (and if this really is a
2.4 issue), is there some substitute for urllib2?
 

Richard Brodie

you aren't doing a read(), so technically you are just connecting to
the web server and sending the request but never reading the content
back from the socket.

But that is not the problem you are describing...

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.
 

Tomas Svarovsky

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.

This is a good point, but then it would manifest regardless of the
language used, AFAIK. And this is not the case; the Ruby and PHP
implementations work quite fine.

Thanks for reply
 

cgoldberg

The problem is that the server runs CentOS, and only 2.4 is available.
On which version did you run these tests?

I tested with Windows XP and Python 2.5.4. I don't have a 2.4 setup I
can easily test with.

You can try httplib rather than urllib2. httplib is slightly lower
level and is actually used inside urllib2 for transport.
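
For example, a minimal sketch of the same test against the same local
endpoint might look like this (an illustration, with a fresh connection
per request to mirror what the urllib2 version does):

import httplib
from datetime import datetime

def application():
    start = datetime.now()
    for number in range(100):
        # a fresh connection per request, like the urllib2 version
        conn = httplib.HTTPConnection("127.0.0.1")
        conn.request("GET", "/gdc/about",
                     headers={'Accept': 'application/json'})
        response = conn.getresponse()
        response.read()  # drain the body before closing
        conn.close()
    print datetime.now() - start

application()

httplib also lets you hoist the HTTPConnection out of the loop and
reuse one keep-alive connection, which urllib2 does not offer; that
variant measures something different, though, since it skips the
per-request TCP setup.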

-Corey
 

cgoldberg

It might be, if the local server doesn't scale well enough to handle
100 concurrent requests.

True, I didn't think of that. I was assuming the client machine
wasn't resource constrained. That would definitely lead to inaccurate
timings if that were the case.
 

Richard Brodie

This is a good point, but then it would manifest regardless of the
language used AFAIK. And this is not the case, ruby and php
implementations are working quite fine.

What I meant was: not reading the data and leaving the connection
open is going to force the server to handle all 100 requests concurrently.
I'm guessing that's not what your other implementations do.
What happens to the timing if you call response.read(), response.close()?
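
In other words, a sketch of the suggested change to the original loop:

import urllib2
from datetime import datetime

def application():
    start = datetime.now()
    req = urllib2.Request("http://127.0.0.1/gdc/about", None,
                          {'Accept': 'application/json'})
    for number in range(100):
        response = urllib2.urlopen(req)
        response.read()   # actually pull the body off the socket
        response.close()  # release the connection before the next request
    print datetime.now() - start

application()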
 

Tomas Svarovsky

What I meant was: not reading the data and leaving the connection
open is going to force the server to handle all 100 requests concurrently.
I'm guessing that's not what your other implementations do.
What happens to the timing if you call response.read(), response.close()?

Now I get it. Nevertheless, even when I explicitly read from the
socket and then close it properly, the timing still doesn't change.

Thanks for advice though
 
