urllib2 and threading

robean

I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.

Here's the problem: the script simply crashes after getting a couple
of urls, and it takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.

The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.

Here's the code:

#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
    """subclass threading.Thread to create Thread instances"""
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)


def get_info_from_url(url):
    """ A dummy version of the function simply visits urls and prints
    the url of the page. """
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print "**** error ****", e.reason
    except urllib2.HTTPError, e:
        print "**** error ****", e.code
    else:
        ulock.acquire()
        print page.geturl() # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

ulock = threading.Lock()
num_links = 10
threads = [] # store threads here
urls = [] # store urls here

fh = open("links.txt", "r")
for line in fh:
    urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
    t = MyThread(get_info_from_url, (urls[i],))
    threads.append(t)

# start the threads
for i in range(num_links):
    threads[i].start()

for i in range(num_links):
    threads[i].join()

print "all done"
 
Paul Rubin

robean said:
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> the example shown here is simplified and just confirms the url of the
> site visited.

Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
of pages and have multiple cpu's, you probably want parallel processes
rather than threads.
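A minimal sketch of the process-based alternative, using a multiprocessing Pool to sidestep the GIL for CPU-heavy parsing. (Written with the Python 3 names; the `parse` function here is a hypothetical stand-in for the real Beautiful Soup scrape-and-parse step.)

```python
import multiprocessing

def parse(url):
    # Placeholder for the real scrape-and-parse step, which is
    # CPU-bound and therefore benefits from separate processes.
    return len(url)

def parse_all(urls, workers=2):
    # Each url is handed to a worker process; Pool.map preserves order.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(parse, urls)

if __name__ == "__main__":
    print(parse_all(["http://a", "http://bb", "http://ccc"]))  # -> [8, 9, 10]
```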
> wrong? I am new to both threading and urllib2, so it's possible that
> the SNAFU is quite obvious.
> ...
> ulock = threading.Lock()

Without looking at the code for more than a few seconds, using an
explicit lock like that is generally not a good sign. The usual
Python style is to send all inter-thread communications through
Queues. You'd dump all your urls into a queue and have a bunch of
worker threads getting items off the queue and processing them. This
really avoids a lot of lock-related headache. The price is that you
sometimes use more threads than strictly necessary. Unless it's a LOT
of extra threads, it's usually not worth the hassle of messing with
locks.
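A rough sketch of the worker-pool pattern described above: the urls go into a queue, a fixed pool of threads pulls them off, and a sentinel value per worker signals the end of the work. (Written with the Python 3 module names; in the Python 2 of this thread the queue module is spelled Queue. `url.upper()` is just a stand-in for the real fetch-and-parse step.)

```python
import threading
import queue  # spelled Queue in Python 2

def worker(q, results):
    # Each worker loops: take a url off the queue, process it, repeat.
    while True:
        url = q.get()
        if url is None:              # sentinel value: no more work
            break
        results.append(url.upper())  # stand-in for fetch-and-parse

def crawl(urls, num_workers=4):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    return results

print(sorted(crawl(["http://a", "http://b", "http://c"])))
```

No explicit lock is needed: the queue serializes hand-off of work, and list.append is itself thread-safe.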
 
robean

Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace

else:
    ulock.acquire()
    print page.geturl() # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

with

else:
    pass

urllib2 starts raising URLErrors after the first 3 - 5 urls have
been visited. Do you have any sense of what in the threads is corrupting
urllib2's behavior? Many thanks,

Robean
 
Stefan Behnel

robean said:
> I am writing a program that involves visiting several hundred webpages
> and extracting specific information from the contents. I've written a
> modest 'test' example here that uses a multi-threaded approach to
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that)

Try lxml.html instead. It often parses HTML pages better than BS, can parse
directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
faster and more memory friendly than the combination of urllib2 and BS,
especially when threading is involved. It also supports CSS selectors for
finding page content, so your "elaborate scraping" might actually turn out
to be a lot simpler than you think.

http://codespeak.net/lxml/
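A minimal sketch of the kind of extraction described above, assuming lxml is installed; the HTML snippet and the `entry` class are made up for illustration.

```python
import lxml.html

html = """
<html><body>
  <div class="entry"><a href="http://example.com/a">first</a></div>
  <div class="entry"><a href="http://example.com/b">second</a></div>
</body></html>
"""

doc = lxml.html.fromstring(html)
# XPath query; with the cssselect helper available, the same thing
# can be written as doc.cssselect("div.entry a").
links = [a.get("href") for a in doc.xpath("//div[@class='entry']/a")]
print(links)
```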

These might be worth reading:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
 
shailen.tuli

On performance, lxml easily outperforms Beautiful Soup.

For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, though.

- Shailen
 
Piet van Oostrum

robean said:
R> def get_info_from_url(url):
R>     """ A dummy version of the function simply visits urls and prints
R>     the url of the page. """
R>     try:
R>         page = urllib2.urlopen(url)
R>     except urllib2.URLError, e:
R>         print "**** error ****", e.reason
R>     except urllib2.HTTPError, e:
R>         print "**** error ****", e.code

There's a problem here. HTTPError is a subclass of URLError so it should
be first. Otherwise when you have an HTTPError (like a 404 File not
found) it will be caught by the "except URLError", but it will not have
a reason attribute, and then you get an exception in the except clause
and the thread will crash.
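A sketch of the corrected ordering, using the Python 3 names (urllib.error replaced urllib2's exception classes). The `opener_404` function is a hypothetical opener that always fails, just to exercise the handlers without touching the network.

```python
import urllib.error

def fetch(url, opener):
    try:
        return opener(url)
    except urllib.error.HTTPError as e:   # specific subclass first
        print("**** error ****", e.code)
    except urllib.error.URLError as e:    # base class second
        print("**** error ****", e.reason)

def opener_404(url):
    # Hypothetical opener that always fails with an HTTP 404.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)

# HTTPError really is a subclass of URLError, which is why the
# reversed ordering never reaches the HTTPError clause.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
fetch("http://example.com/missing", opener_404)  # prints the 404 code
```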
 
