Web Spider

Discussion in 'Python' started by Thomas Lindgaard, Jul 6, 2004.

  1. Hello

    I'm a newcomer to the world of Python trying to write a web spider. I
    downloaded the skeleton from

    http://starship.python.net/crew/aahz/OSCON2001/ThreadPoolSpider.py

    Some of the source is shown below.

    A couple of questions:

    1) Why use the

    if __name__ == '__main__':

    construct?

    2) In Retrievepool.__init__ the Retriever.__init__ is called with
    self.inputQueue and self.outputQueue as arguments. Does this mean that
    each Retriever thread has a reference to Retrievepool.inputQueue and
    Retrievepool.outputQueue (i.e. there is only one input queue and one
    output queue, which all the threads share, pushing and popping whenever
    they want, which is safe due to the synchronized nature of Queue)?

    3) How many threads will be running? Spider.run initializes the
    Retrievepool and this will consist of MAX_THREADS threads, so once the
    crawler is running there will be the main thread (caught in the while loop
    in Spider.run) and MAX_THREADS Retriever threads running, right?

    Hmm... I think that's about it for now.

    ---------------------------------------------------------------------

    MAX_THREADS = 3

    ....

    class Retriever(threading.Thread):
        def __init__(self, inputQueue, outputQueue):
            threading.Thread.__init__(self)
            self.inputQueue = inputQueue
            self.outputQueue = outputQueue

        def run(self):
            while 1:
                self.URL = self.inputQueue.get()
                self.getPage()
                self.outputQueue.put(self.getLinks())

        ...


    class RetrievePool:
        def __init__(self, numThreads):
            self.retrievePool = []
            self.inputQueue = Queue.Queue()
            self.outputQueue = Queue.Queue()
            for i in range(numThreads):
                retriever = Retriever(self.inputQueue, self.outputQueue)
                retriever.start()
                self.retrievePool.append(retriever)

        ...


    class Spider:
        def __init__(self, startURL, maxThreads):
            self.URLs = []
            self.queue = [startURL]
            self.URLdict = {startURL: 1}
            self.include = startURL
            self.numPagesQueued = 0
            self.retriever = RetrievePool(maxThreads)

        def run(self):
            self.startPages()
            while self.numPagesQueued > 0:
                self.queueLinks()
                self.startPages()
            self.retriever.shutdown()
            self.URLs = self.URLdict.keys()
            self.URLs.sort()

        ...


    if __name__ == '__main__':
        startURL = sys.argv[1]
        spider = Spider(startURL, MAX_THREADS)
        spider.run()
        print
        for URL in spider.URLs:
            print URL


    --
    Regards
    /Thomas
    Thomas Lindgaard, Jul 6, 2004
    #1

  2. Peter Hansen (Guest)

    Thomas Lindgaard wrote:

    > A couple of questions:
    >
    > 1) Why use the
    > if __name__ == '__main__':
    > construct?


    Answered indirectly in this FAQ:
    http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name
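
    In short, a minimal sketch of the idea (not taken from the spider itself;
    the module and function names here are made up for illustration):

    # mymodule.py
    def main():
        print "running as a script"

    if __name__ == '__main__':
        # Runs only when you do "python mymodule.py",
        # not when another module does "import mymodule".
        main()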

    > 2) In Retrievepool.__init__ the Retriever.__init__ is called with
    > self.inputQueue and self.outputQueue as arguments. Does this mean that
    > each Retriever thread has a reference to Retrievepool.inputQueue and
    > Retrievepool.outputQueue


    Yes, and that's sort of the whole point of the thing.
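
    If it helps to see the pattern outside the spider, here is a minimal
    sketch (the class and variable names are invented; the Queue and
    threading usage is just the standard library's):

    import threading, Queue

    inputQueue = Queue.Queue()     # one queue object...
    outputQueue = Queue.Queue()    # ...and one shared result queue

    class Doubler(threading.Thread):
        def __init__(self, inputQueue, outputQueue):
            threading.Thread.__init__(self)
            self.inputQueue = inputQueue      # same object the caller created
            self.outputQueue = outputQueue

        def run(self):
            while 1:
                item = self.inputQueue.get()      # get() blocks; Queue does the locking
                self.outputQueue.put(item * 2)    # put() is thread-safe as well

    for i in range(3):
        worker = Doubler(inputQueue, outputQueue)
        worker.setDaemon(True)     # let the demo exit without a shutdown protocol
        worker.start()

    for n in range(5):
        inputQueue.put(n)
    for n in range(5):
        print outputQueue.get()    # all three workers feed the same output queue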

    > 3) How many threads will be running? Spider.run initializes the
    > Retrievepool and this will consist of MAX_THREADS threads, so once the
    > crawler is running there will be the main thread (caught in the while loop
    > in Spider.run) and MAX_THREADS Retriever threads running, right?


    Yep. Good analysis. :) You could inject this somewhere to
    check:

    print len(threading.enumerate()), 'threads exist'
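
    For example (just one possible spot, the exact placement isn't part of
    the skeleton), inside the loop in Spider.run:

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            # should report MAX_THREADS + 1: the Retriever threads plus the main thread
            print len(threading.enumerate()), 'threads exist'
            self.queueLinks()
            self.startPages()
        ...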

    -Peter
    Peter Hansen, Jul 6, 2004
    #2

  3. Thomas Lindgaard

    On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:

    > Answered indirectly in this FAQ:
    > http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name


    Let me just see if I understood this correctly...

    The reason for using the construct is to have two "modes" for the script:
    one for running the script by itself (i.e. run main()) and one for when it
    is imported from somewhere else (i.e. main() should not be run unless
    called from the surrounding code).

    >> 2) In Retrievepool.__init__ the Retriever.__init__ is called with
    >> self.inputQueue and self.outputQueue as arguments. Does this mean that
    >> each Retriever thread has a reference to Retrievepool.inputQueue and
    >> Retrievepool.outputQueue

    >
    > Yes, and that's sort of the whole point of the thing.


    Okidoki :)

    >> 3) How many threads will be running? Spider.run initializes the
    >> Retrievepool and this will consist of MAX_THREADS threads, so once the
    >> crawler is running there will be the main thread (caught in the while
    >> loop in Spider.run) and MAX_THREADS Retriever threads running, right?

    >
    > Yep. Good analysis. :) You could inject this somewhere to check:


    Thanks - sometimes it actually helps to read code you want to elaborate on
    closely :)

    > print len(threading.enumerate()), 'threads exist'


    Can a thread die spontaneously if, for instance, an exception is thrown?

    --
    Mvh.
    /Thomas
    Thomas Lindgaard, Jul 7, 2004
    #3
  4. Peter Hansen (Guest)

    Thomas Lindgaard wrote:
    > On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
    >>Answered indirectly in this FAQ:
    >>http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name

    >
    > Let me just see if I understood this correctly...
    >
    > The reason for using the construct is to have two "modes" for the script:
    > one for running the script by itself (i.e. run main()) and one for when it
    > is imported from somewhere else (i.e. main() should not be run unless
    > called from the surrounding code).


    Yep.
    >
    > Can a thread die spontaneously if for instance an exception is thrown?


    The interactive prompt is your friend for such questions in Python.
    Good to get in the habit of being able to check such stuff out
    easily:

    c:\>python
    Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import time, threading
    >>> class Test(threading.Thread):
    ...     def run(self):
    ...         while 1:
    ...             time.sleep(5)
    ...             1/0
    ...
    >>> a = Test()
    >>> threading.enumerate()
    [<_MainThread(MainThread, started)>]
    >>> a.start()
    >>> threading.enumerate()
    [<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
    >>> # wait a few seconds here
    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
        self.run()
      File "<stdin>", line 5, in run
    ZeroDivisionError: integer division or modulo by zero
    >>> threading.enumerate()
    [<_MainThread(MainThread, started)>]

    Tada! The answer is yes. :)
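
    If that matters for the spider, one common precaution (not in the skeleton
    itself, and it assumes the Spider expects one list of links back for every
    URL it queues) is to catch exceptions inside Retriever.run so a bad page
    doesn't silently kill a worker:

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            try:
                self.getPage()
                self.outputQueue.put(self.getLinks())
            except Exception:
                # keep the one-result-per-URL protocol so the Spider's
                # bookkeeping doesn't stall on a failed page
                self.outputQueue.put([])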

    -Peter
    Peter Hansen, Jul 7, 2004
    #4
