urllib2 and threading

Discussion in 'Python' started by robean, May 1, 2009.

  1. robean

    robean Guest

    I am writing a program that involves visiting several hundred webpages
    and extracting specific information from the contents. I've written a
    modest 'test' example here that uses a multi-threaded approach to
    reach the urls with urllib2. The actual program will involve fairly
    elaborate scraping and parsing (I'm using Beautiful Soup for that) but
    the example shown here is simplified and just confirms the url of the
    site visited.

    Here's the problem: the script simply crashes after getting a couple
    of urls, and it takes a long time to run (slower than a non-threaded
    version that I wrote and ran). Can anyone figure out what I am doing
    wrong? I am new to both threading and urllib2, so it's possible that
    the SNAFU is quite obvious.

    The urls are stored in a text file that I read from. The urls are all
    valid, so there's no problem there.

    Here's the code:

    #!/usr/bin/python

    import urllib2
    import threading

    class MyThread(threading.Thread):
        """subclass threading.Thread to create Thread instances"""
        def __init__(self, func, args):
            threading.Thread.__init__(self)
            self.func = func
            self.args = args

        def run(self):
            apply(self.func, self.args)


    def get_info_from_url(url):
        """ A dummy version of the function; simply visits urls and prints
        the url of the page. """
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError, e:
            print "**** error ****", e.reason
        except urllib2.HTTPError, e:
            print "**** error ****", e.code
        else:
            ulock.acquire()
            print page.geturl() # obviously, do something more useful here, eventually
            page.close()
            ulock.release()

    ulock = threading.Lock()
    num_links = 10
    threads = []  # store threads here
    urls = []     # store urls here

    fh = open("links.txt", "r")
    for line in fh:
        urls.append(line.strip())
    fh.close()

    # collect threads
    for i in range(num_links):
        t = MyThread(get_info_from_url, (urls[i],))
        threads.append(t)

    # start the threads
    for i in range(num_links):
        threads[i].start()

    for i in range(num_links):
        threads[i].join()

    print "all done"
     
    robean, May 1, 2009
    #1

  2. Paul Rubin

    Paul Rubin Guest

    robean <> writes:
    > reach the urls with urllib2. The actual program will involve fairly
    > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
    > the example shown here is simplified and just confirms the url of the
    > site visited.


    Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
    of pages and have multiple CPUs, you probably want parallel processes
    rather than threads.
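
    For instance (a sketch only, assuming Python 2.6's multiprocessing
    module and a made-up scrape() helper; the real version would do the
    parsing inside the worker):

    from multiprocessing import Pool
    import urllib2

    def scrape(url):
        # fetch one page per worker process; CPU-bound parsing would
        # also go here, out of reach of the GIL
        try:
            return urllib2.urlopen(url).geturl()
        except urllib2.URLError:
            return None

    if __name__ == '__main__':
        urls = [line.strip() for line in open("links.txt")]
        pool = Pool(processes=4)        # e.g. one worker per CPU
        for result in pool.map(scrape, urls):
            print result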

    > wrong? I am new to both threading and urllib2, so it's possible that
    > the SNAFU is quite obvious.
    > ...
    > ulock = threading.Lock()


    Without looking at the code for more than a few seconds, using an
    explicit lock like that is generally not a good sign. The usual
    Python style is to send all inter-thread communications through
    Queues. You'd dump all your urls into a queue and have a bunch of
    worker threads getting items off the queue and processing them. This
    really avoids a lot of lock-related headache. The price is that you
    sometimes use more threads than strictly necessary. Unless it's a LOT
    of extra threads, it's usually not worth the hassle of messing with
    locks.
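
    A minimal sketch of that pattern, in the same Python 2 idiom as the
    rest of the thread (the worker count of 10 is arbitrary):

    import Queue
    import threading
    import urllib2

    def worker(q):
        while True:
            try:
                url = q.get_nowait()
            except Queue.Empty:
                break                   # queue drained, thread exits
            try:
                page = urllib2.urlopen(url)
                print page.geturl()
                page.close()
            except urllib2.HTTPError, e:
                print "**** error ****", e.code
            except urllib2.URLError, e:
                print "**** error ****", e.reason

    q = Queue.Queue()
    for line in open("links.txt"):
        q.put(line.strip())

    workers = [threading.Thread(target=worker, args=(q,)) for _ in range(10)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print "all done"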
     
    Paul Rubin, May 1, 2009
    #2

  3. robean

    robean Guest

    Thanks for your reply. Obviously you make several good points about
    Beautiful Soup and Queue. But here's the problem: even if I do nothing
    whatsoever with the threads beyond just visiting the urls with
    urllib2, the program chokes. If I replace

    else:
        ulock.acquire()
        print page.geturl() # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

    with

    else:
        pass

    urllib2 starts raising URLErrors after the first 3 - 5 urls have
    been visited. Do you have any sense what in the threads is corrupting
    urllib2's behavior? Many thanks,

    Robean



    On May 1, 12:27 am, Paul Rubin <http://> wrote:
    > robean <> writes:
    > > reach the urls with urllib2. The actual program will involve fairly
    > > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
    > > the example shown here is simplified and just confirms the url of the
    > > site visited.

    >
    > Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
    > of pages and have multiple CPUs, you probably want parallel processes
    > rather than threads.
    >
    > > wrong? I am new to both threading and urllib2, so it's possible that
    > > the SNAFU is quite obvious.
    > > ...
    > > ulock = threading.Lock()

    >
    > Without looking at the code for more than a few seconds, using an
    > explicit lock like that is generally not a good sign.  The usual
    > Python style is to send all inter-thread communications through
    > Queues.  You'd dump all your urls into a queue and have a bunch of
    > worker threads getting items off the queue and processing them.  This
    > really avoids a lot of lock-related headache.  The price is that you
    > sometimes use more threads than strictly necessary.  Unless it's a LOT
    > of extra threads, it's usually not worth the hassle of messing with
    > locks.
     
    robean, May 1, 2009
    #3
  4. Stefan Behnel

    Stefan Behnel Guest

    robean wrote:
    > I am writing a program that involves visiting several hundred webpages
    > and extracting specific information from the contents. I've written a
    > modest 'test' example here that uses a multi-threaded approach to
    > reach the urls with urllib2. The actual program will involve fairly
    > elaborate scraping and parsing (I'm using Beautiful Soup for that)


    Try lxml.html instead. It often parses HTML pages better than BS, can parse
    directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
    faster and more memory friendly than the combination of urllib2 and BS,
    especially when threading is involved. It also supports CSS selectors for
    finding page content, so your "elaborate scraping" might actually turn out
    to be a lot simpler than you think.

    http://codespeak.net/lxml/
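
    A short sketch of what that can look like (the URL and CSS selector
    here are made up for illustration):

    from lxml import html

    # lxml.html can fetch and parse straight from a URL, no urllib2 needed
    doc = html.parse("http://example.com/page.html").getroot()

    # hypothetical selector -- adapt it to the target markup
    for link in doc.cssselect("div.story a"):
        print link.get("href"), link.text_content()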

    These might be worth reading:

    http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    Stefan
     
    Stefan Behnel, May 1, 2009
    #4
  5. Shailen

    Shailen Guest

    Performance-wise, lxml easily outperforms Beautiful Soup.

    For what it's worth, the code runs fine if you switch from urllib2 to
    urllib (different exceptions are raised, obviously). I have no
    experience using urllib2 in a threaded environment, so I'm not sure
    why it breaks; urllib does OK, though.
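
    A sketch of the swap (urllib signals failures as IOError rather than
    urllib2's exception classes):

    import urllib

    def get_info_from_url(url):
        """Same dummy worker, using urllib instead of urllib2."""
        try:
            page = urllib.urlopen(url)
        except IOError, e:      # urllib raises IOError on failure
            print "**** error ****", e
        else:
            print page.geturl()
            page.close()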

    - Shailen

    On May 1, 9:29 am, Stefan Behnel <> wrote:
    > robean wrote:
    > > I am writing a program that involves visiting several hundred webpages
    > > and extracting specific information from the contents. I've written a
    > > modest 'test' example here that uses a multi-threaded approach to
    > > reach the urls with urllib2. The actual program will involve fairly
    > > elaborate scraping and parsing (I'm using Beautiful Soup for that)

    >
    > Try lxml.html instead. It often parses HTML pages better than BS, can parse
    > directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
    > faster and more memory friendly than the combination of urllib2 and BS,
    > especially when threading is involved. It also supports CSS selectors for
    > finding page content, so your "elaborate scraping" might actually turn out
    > to be a lot simpler than you think.
    >
    > http://codespeak.net/lxml/
    >
    > These might be worth reading:
    >
    > http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
    > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    >
    > Stefan
     
    Shailen, May 1, 2009
    #5
  6. Piet van Oostrum

    Piet van Oostrum Guest

    >>>>> robean <> (R) wrote:

    >R> def get_info_from_url(url):
    >R>     """ A dummy version of the function simply visits urls and prints
    >R>     the url of the page. """
    >R>     try:
    >R>         page = urllib2.urlopen(url)
    >R>     except urllib2.URLError, e:
    >R>         print "**** error ****", e.reason
    >R>     except urllib2.HTTPError, e:
    >R>         print "**** error ****", e.code


    There's a problem here. HTTPError is a subclass of URLError so it should
    be first. Otherwise when you have an HTTPError (like a 404 File not
    found) it will be caught by the "except URLError", but it will not have
    a reason attribute, and then you get an exception in the except clause
    and the thread will crash.
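
    That is, the handlers should go most-specific first:

    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError, e:    # subclass first: has .code
        print "**** error ****", e.code
    except urllib2.URLError, e:     # base class second: has .reason
        print "**** error ****", e.reason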
    --
    Piet van Oostrum <>
    URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
    Private email:
     
    Piet van Oostrum, May 1, 2009
    #6
  7. Aahz

    Aahz Guest

    In article <>,
    robean <> wrote:
    >
    >Here's the problem: the script simply crashes after getting a couple
    >of urls, and it takes a long time to run (slower than a non-threaded
    >version that I wrote and ran). Can anyone figure out what I am doing
    >wrong? I am new to both threading and urllib2, so it's possible that
    >the SNAFU is quite obvious.


    For an example, see

    http://www.pythoncraft.com/OSCON2001/index.html
    --
    Aahz () <*> http://www.pythoncraft.com/

    "Typing is cheap. Thinking is expensive." --Roy Smith
     
    Aahz, May 2, 2009
    #7
