Web Spider

Discussion in 'Python' started by Thomas Lindgaard, Jul 6, 2004.

  1. Hello

    I'm a newcomer to the world of Python trying to write a web spider. I
    downloaded the skeleton from

    http://starship.python.net/crew/aahz/OSCON2001/ThreadPoolSpider.py

    Some of the source is shown below.

    A couple of questions:

    1) Why use the

    if __name__ == '__main__':

    construct?

    2) In Retrievepool.__init__ the Retriever.__init__ is called with
    self.inputQueue and self.outputQueue as arguments. Does this mean that
    each Retriever thread has a reference to Retrievepool.inputQueue and
    Retrievepool.outputQueue (i.e. there is only one input queue and one
    output queue, which all the threads share, pushing and popping whenever
    they want, which is safe due to the synchronized nature of Queue)?

    3) How many threads will be running? Spider.run initializes the
    Retrievepool and this will consist of MAX_THREADS threads, so once the
    crawler is running there will be the main thread (caught in the while loop
    in Spider.run) and MAX_THREADS Retriever threads running, right?

    Hmm... I think that's about it for now.

    ---------------------------------------------------------------------

    MAX_THREADS = 3

    ....

    class Retriever(threading.Thread):
        def __init__(self, inputQueue, outputQueue):
            threading.Thread.__init__(self)
            self.inputQueue = inputQueue
            self.outputQueue = outputQueue

        def run(self):
            while 1:
                self.URL = self.inputQueue.get()
                self.getPage()
                self.outputQueue.put(self.getLinks())

        ...


    class RetrievePool:
        def __init__(self, numThreads):
            self.retrievePool = []
            self.inputQueue = Queue.Queue()
            self.outputQueue = Queue.Queue()
            for i in range(numThreads):
                retriever = Retriever(self.inputQueue, self.outputQueue)
                retriever.start()
                self.retrievePool.append(retriever)

        ...


    class Spider:
        def __init__(self, startURL, maxThreads):
            self.URLs = []
            self.queue = [startURL]
            self.URLdict = {startURL: 1}
            self.include = startURL
            self.numPagesQueued = 0
            self.retriever = RetrievePool(maxThreads)

        def run(self):
            self.startPages()
            while self.numPagesQueued > 0:
                self.queueLinks()
                self.startPages()
            self.retriever.shutdown()
            self.URLs = self.URLdict.keys()
            self.URLs.sort()

        ...


    if __name__ == '__main__':
        startURL = sys.argv[1]
        spider = Spider(startURL, MAX_THREADS)
        spider.run()
        print
        for URL in spider.URLs:
            print URL


    --
    Regards
    /Thomas
    Thomas Lindgaard, Jul 6, 2004
    #1

  2. Peter Hansen (Guest)

    Thomas Lindgaard wrote:

    > A couple of questions:
    >
    > 1) Why use the
    > if __name__ == '__main__':
    > construct?


    Answered indirectly in this FAQ:
    http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name
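
    In short, a minimal sketch of the idea (not taken from the spider itself;
    the module and function names here are made up for illustration):

    # mymodule.py
    def main():
        print "running as a script"

    if __name__ == '__main__':
        # Runs only when you do "python mymodule.py",
        # not when another module does "import mymodule".
        main()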

    > 2) In Retrievepool.__init__ the Retriever.__init__ is called with
    > self.inputQueue and self.outputQueue as arguments. Does this mean that
    > each Retriever thread has a reference to Retrievepool.inputQueue and
    > Retrievepool.outputQueue


    Yes, and that's sort of the whole point of the thing.
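
    If it helps to see the pattern outside the spider, here is a minimal
    sketch (the class and variable names are invented; the Queue and
    threading usage is just the standard library's):

    import threading, Queue

    inputQueue = Queue.Queue()     # one queue object...
    outputQueue = Queue.Queue()    # ...and one shared result queue

    class Doubler(threading.Thread):
        def __init__(self, inputQueue, outputQueue):
            threading.Thread.__init__(self)
            self.inputQueue = inputQueue      # same object the caller created
            self.outputQueue = outputQueue

        def run(self):
            while 1:
                item = self.inputQueue.get()      # get() blocks; Queue does the locking
                self.outputQueue.put(item * 2)    # put() is thread-safe as well

    for i in range(3):
        worker = Doubler(inputQueue, outputQueue)
        worker.setDaemon(True)     # let the demo exit without a shutdown protocol
        worker.start()

    for n in range(5):
        inputQueue.put(n)
    for n in range(5):
        print outputQueue.get()    # all three workers feed the same output queue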

    > 3) How many threads will be running? Spider.run initializes the
    > Retrievepool and this will consist of MAX_THREADS threads, so once the
    > crawler is running there will be the main thread (caught in the while loop
    > in Spider.run) and MAX_THREADS Retriever threads running, right?


    Yep. Good analysis. :) You could inject this somewhere to
    check:

    print len(threading.enumerate()), 'threads exist'
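
    For example (just one possible spot, the exact placement isn't part of
    the skeleton), inside the loop in Spider.run:

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            # should report MAX_THREADS + 1: the Retriever threads plus the main thread
            print len(threading.enumerate()), 'threads exist'
            self.queueLinks()
            self.startPages()
        ...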

    -Peter
    Peter Hansen, Jul 6, 2004
    #2

  3. Thomas Lindgaard

    On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:

    > Answered indirectly in this FAQ:
    > http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name


    Let me just see if I understood this correctly...

    The reason for using the construct is to have two "modes" for the script:
    one for running the script by itself (i.e. run main()) and one for when it
    is imported from somewhere else (i.e. main() should not be run unless
    called from the surrounding code).

    >> 2) In Retrievepool.__init__ the Retriever.__init__ is called with
    >> self.inputQueue and self.outputQueue as arguments. Does this mean that
    >> each Retriever thread has a reference to Retrievepool.inputQueue and
    >> Retrievepool.outputQueue

    >
    > Yes, and that's sort of the whole point of the thing.


    Okidoki :)

    >> 3) How many threads will be running? Spider.run initializes the
    >> Retrievepool and this will consist of MAX_THREADS threads, so once the
    >> crawler is running there will be the main thread (caught in the while
    >> loop in Spider.run) and MAX_THREADS Retriever threads running, right?

    >
    > Yep. Good analysis. :) You could inject this somewhere to check:


    Thanks - sometimes it actually helps to read code you want to elaborate on
    closely :)

    > print len(threading.enumerate()), 'threads exist'


    Can a thread die spontaneously if, for instance, an exception is thrown?

    --
    Mvh.
    /Thomas
    Thomas Lindgaard, Jul 7, 2004
    #3
  4. Peter Hansen (Guest)

    Thomas Lindgaard wrote:
    > On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
    >>Answered indirectly in this FAQ:
    >>http://www.python.org/doc/faq/programming.html#how-do-i-find-the-current-module-name

    >
    > Let me just see if I understood this correctly...
    >
    > The reason for using the construct is to have two "modes" for the script:
    > one for running the script by itself (i.e. run main()) and one for when it
    > is imported from somewhere else (i.e. main() should not be run unless
    > called from the surrounding code).


    Yep.
    >
    > Can a thread die spontaneously if for instance an exception is thrown?


    The interactive prompt is your friend for such questions in Python.
    Good to get in the habit of being able to check such stuff out
    easily:

    c:\>python
    Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import time, threading
    >>> class Test(threading.Thread):
    ...     def run(self):
    ...         while 1:
    ...             time.sleep(5)
    ...             1/0
    ...
    >>> a = Test()
    >>> threading.enumerate()
    [<_MainThread(MainThread, started)>]
    >>> a.start()
    >>> threading.enumerate()
    [<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
    >>> # wait a few seconds here
    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
        self.run()
      File "<stdin>", line 5, in run
    ZeroDivisionError: integer division or modulo by zero
    >>> threading.enumerate()
    [<_MainThread(MainThread, started)>]

    Tada! The answer is yes. :)
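
    If that matters for the spider, one common precaution (not in the skeleton
    itself, and it assumes the Spider expects one list of links back for every
    URL it queues) is to catch exceptions inside Retriever.run so a bad page
    doesn't silently kill a worker:

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            try:
                self.getPage()
                self.outputQueue.put(self.getLinks())
            except Exception:
                # keep the one-result-per-URL protocol so the Spider's
                # bookkeeping doesn't stall on a failed page
                self.outputQueue.put([])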

    -Peter
    Peter Hansen, Jul 7, 2004
    #4
