Making HTTP requests using Twisted

rzimerman

I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for a batch of page requests?

Thanks!



#-------------------------------------------------

from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    # do some processing on html, writing to stdout
    pass

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

reactor.callLater(25, stopReactor)
reactor.run()
 
K.S.Sreeram

rzimerman said:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for a batch of page requests?

Have a look at pyCurl. (http://pycurl.sourceforge.net)
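
pycurl can drive many transfers from a single CurlMulti object, which lets you cap the number of simultaneous connections and give every transfer its own timeout. Below is a rough, untested sketch along the lines of pycurl's retriever-multi.py example; the 50-connection cap and 3 second timeout are the numbers from the original post, extract() is the function from that script, and the remaining names are made up here:

import sys
import pycurl
from StringIO import StringIO

urls = [line.strip() for line in sys.stdin if line.strip()]
queue = urls[:]

MAX_CONNECTIONS = 50

multi = pycurl.CurlMulti()
free = []
for i in range(min(MAX_CONNECTIONS, len(urls))):
    c = pycurl.Curl()
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.CONNECTTIMEOUT, 3)
    c.setopt(pycurl.TIMEOUT, 3)        # per-transfer timeout, not per-batch
    c.setopt(pycurl.NOSIGNAL, 1)
    free.append(c)

num_processed = 0
while num_processed < len(urls):
    # start new transfers while there are free handles and queued urls
    while queue and free:
        url = queue.pop(0)
        c = free.pop()
        c.url = url
        c.body = StringIO()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, c.body.write)
        multi.add_handle(c)
    # let curl move data on all active transfers
    while True:
        ret, num_handles = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # harvest finished transfers and recycle their handles
    while True:
        num_queued, ok_list, err_list = multi.info_read()
        for c in ok_list:
            multi.remove_handle(c)
            extract(c.body.getvalue())   # extract() as in the original script
            free.append(c)
        for c, errno, errmsg in err_list:
            multi.remove_handle(c)
            print >> sys.stderr, "Error:", c.url, errmsg
            free.append(c)
        num_processed += len(ok_list) + len(err_list)
        if num_queued == 0:
            break
    # wait for activity (at most one second) before looping again
    multi.select(1.0)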

Regards
Sreeram



 
Fredrik Lundh

rzimerman said:
Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for a batch of page requests?

there are probably ways to solve this with Twisted, but in case you want a
simpler alternative, you could use Python's standard asyncore module and
the stuff described here:

http://effbot.org/zone/effnews.htm

especially

http://effbot.org/zone/effnews-1.htm#storing-the-rss-data
http://effbot.org/zone/effnews-3.htm#managing-downloads
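
the heart of that approach is an asyncore dispatcher that writes a GET request and collects whatever comes back; here's a bare-bones, untested sketch (the articles above wrap the same idea in a reusable http_client class and add a download manager that caps open connections and handles timeouts, which you'd still need for the 50-page / 3-second requirements):

import asyncore
import socket
import urlparse

class HTTPGetter(asyncore.dispatcher):
    # minimal "fetch one page" dispatcher: no redirects, no https,
    # no per-request timeout

    def __init__(self, url, consumer):
        asyncore.dispatcher.__init__(self)
        self.consumer = consumer
        scheme, host, path, params, query, fragment = urlparse.urlparse(url)
        if not path:
            path = "/"
        if query:
            path = path + "?" + query
        self.request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
        self.data = []
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))

    def handle_connect(self):
        pass

    def writable(self):
        return len(self.request) > 0

    def handle_write(self):
        sent = self.send(self.request)
        self.request = self.request[sent:]

    def handle_read(self):
        self.data.append(self.recv(8192))

    def handle_close(self):
        self.close()
        # hand the raw response (status line + headers + body) to the consumer
        self.consumer("".join(self.data))

# e.g. create up to 50 HTTPGetter(url, extract) instances, then asyncore.loop()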

</F>
 
Manlio Perillo

rzimerman wrote:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Take a look at
http://svn.twistedmatrix.com/cvs/trunk/doc/core/examples/stdiodemo.py?view=markup&rev=15456

And read
http://twistedmatrix.com/documents/current/api/twisted.web.client.HTTPClientFactory.html

You can pass a timeout to the constructor.
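
For example, getPage() forwards extra keyword arguments to the HTTPClientFactory constructor, so a per-request timeout looks like this:

from twisted.web.client import getPage

# 3 second timeout for this request only
deferred = getPage("http://www.example.org/", timeout=3)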

To download at most 50 pages in parallel you can use a download queue.

Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds).addCallback(self._callback)
            self.deferreds = []

            # queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))

            return deferred
        else:
            # execute the request now
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            return deferred

    def _callback(self, results):
        if len(self.requests) > self.SIZE:
            queue = self.requests[:self.SIZE]
            self.requests = self.requests[self.SIZE:]
        else:
            queue = self.requests[:]
            self.requests = []

        # execute the queued requests
        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            deferred.chainDeferred(deferredHelper)
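
Wired into the original script, it could be used more or less like this (also untested; extract, printError and the reactor import come from that script, DeferredList from the imports above):

queue = DownloadQueue()
deferreds = []

for url in sys.stdin:
    url = url.rstrip()
    d = queue.addRequest(url, timeout=3)
    d.addCallback(extract).addErrback(printError)
    deferreds.append(d)

# stop the reactor once every request has either fired or failed
DeferredList(deferreds).addCallback(lambda result: reactor.stop())
reactor.run()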




Regards Manlio Perillo
 
Manlio Perillo

Manlio Perillo wrote:
Here is a quick example, ABSOLUTELY NOT TESTED:

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds).addCallback(self._callback)
            self.deferreds = []

The deferreds list should be cleared in the _callback method, not here.
Please note that there are probably other bugs.
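
One way to express that fix is a (hypothetical, equally untested) subclass of the DownloadQueue sketch above; it only relocates the clearing and leaves the other problems alone:

class DownloadQueue2(DownloadQueue):

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # as before, but self.deferreds is no longer cleared here
            DeferredList(self.deferreds).addCallback(self._callback)

            deferred = Deferred()
            self.requests.append((url, timeout, deferred))
            return deferred
        else:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)
            return deferred

    def _callback(self, results):
        # clear the waiting list only once the whole batch has completed,
        # then let the base class launch the queued requests
        self.deferreds = []
        DownloadQueue._callback(self, results)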


Regards Manlio Perillo
 
