Concurrent threads to pull web pages?

Gilles Ganault

Hello

I recently asked how to pull company IDs from an SQLite database,
have multiple instances of a Python script download each company's web
page from a remote server, e.g. www.acme.com/company.php?id=1, and use
regexes to extract some information from each page.

I need to run multiple instances to save time, since each page takes
about 10 seconds to be returned to the script/browser.

Since I've never written a multi-threaded Python script before, to
save time investigating, I was wondering if someone already had a
script that downloads web pages and saves some information into a
database.

Thank you for any tip.
 
exarkun

There's no need to use threads for this. Have a look at Twisted:

http://twistedmatrix.com/trac/

Here's an example of how to use the Twisted HTTP client:

http://twistedmatrix.com/projects/web/documentation/examples/getpage.py

Jean-Paul
 
MRAB

You could put the URLs into a queue and have multiple worker threads
repeatedly get a URL from the queue, download the page, and then put the
page into another queue for processing by another extraction thread.
This post might help:

http://mail.python.org/pipermail/python-list/2009-September/195866.html
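The two-queue layout described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked post: the names run, download_worker, extract_worker and extract_title, the worker count, and the title regex are all assumptions made for the example, and it is written for Python 3.

```python
import queue
import re
import threading
import urllib.request

NUM_WORKERS = 4    # assumed value; tune it for the remote server
STOP = object()    # sentinel that tells a thread to exit


def fetch(url):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(page):
    """Hypothetical extraction step: pull the <title> text with a regex."""
    match = re.search(r"<title>(.*?)</title>", page, re.I | re.S)
    return match.group(1).strip() if match else None


def download_worker(url_queue, page_queue, fetch_page):
    """Take URLs off url_queue and put (url, page) pairs on page_queue."""
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        try:
            page = fetch_page(url)
        except Exception:
            page = None    # a failed download is passed on as None
        page_queue.put((url, page))


def extract_worker(page_queue, results, extract):
    """Single thread that processes pages, so only it writes to results."""
    while True:
        item = page_queue.get()
        if item is STOP:
            break
        url, page = item
        results[url] = extract(page) if page is not None else None


def run(urls, fetch_page=fetch, extract=extract_title, num_workers=NUM_WORKERS):
    url_queue, page_queue = queue.Queue(), queue.Queue()
    results = {}
    downloaders = [
        threading.Thread(target=download_worker,
                         args=(url_queue, page_queue, fetch_page))
        for _ in range(num_workers)
    ]
    extractor = threading.Thread(target=extract_worker,
                                 args=(page_queue, results, extract))
    for thread in downloaders + [extractor]:
        thread.start()
    for url in urls:
        url_queue.put(url)
    for _ in downloaders:
        url_queue.put(STOP)      # one sentinel per download thread
    for thread in downloaders:
        thread.join()
    page_queue.put(STOP)         # all pages are queued; stop the extractor
    extractor.join()
    return results
```

Calling run() with a list of URLs built from the company IDs would return a dict mapping each URL to the extracted value, or None where the download failed or nothing matched. Keeping a single extraction thread means the database writes could also happen there without sharing a connection between threads.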
 
exarkun

I don't think he was looking for a framework... Specifically a framework
that you work on.

He's free to use anything he likes. I'm offering an option he may not
have been aware of before. It's okay. It's great to have options.

Jean-Paul
 
Dennis Lee Bieber

There's no need to use threads for this. Have a look at Twisted:

http://twistedmatrix.com/trac/

Strange... While I can easily visualize how to convert the problem
to a task pool -- especially given that code to do a single occurrence
is already in place...

... conversion to an event-dispatch based system is something /I/
can not imagine...

Twisted may be a magnificent effort... but it doesn't fit my mental
framework.
 
exarkun

Strange... While I can easily visualize how to convert the problem
to a task pool -- especially given that code to do a single occurrence
is already in place...

... conversion to an event-dispatch based system is something /I/
can not imagine...

The cool thing is that there's not much conversion to do from the single
request version to the multiple request version, if you're using
Twisted. The single request version looks like this:

getPage(url).addCallback(pageReceived)

And the multiple request version looks like this:

getPage(firstURL).addCallback(pageReceived)
getPage(secondURL).addCallback(pageReceived)

Since the APIs don't block, doing things concurrently ends up being the
easy thing.

Not to say it isn't a bit of a challenge to get into this mindset, but I
think anyone who wants to put a bit of effort into it can manage. :)
Getting used to using Deferreds in the first place (necessary to
write/use even the single request version) is probably where more people
have trouble.
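
For readers who want to try the callback style without installing anything, here is a rough standard-library analogue. It does not use Twisted: asyncio tasks stand in for Deferreds, add_done_callback stands in for addCallback, and get_page is a made-up placeholder that only sleeps instead of contacting a server. The names crawl and page_received are likewise assumptions for this sketch.

```python
import asyncio
import functools


async def get_page(url, delay=0.1):
    """Placeholder for a non-blocking HTTP request (no network involved)."""
    await asyncio.sleep(delay)    # pretend to wait for the server
    return f"<html>body of {url}</html>"


async def crawl(urls):
    received = {}

    def page_received(task, url):
        # Called by the event loop once this request has finished.
        received[url] = task.result()

    tasks = []
    for url in urls:
        # Starting a request does not block, so all of them run concurrently.
        task = asyncio.ensure_future(get_page(url))
        task.add_done_callback(functools.partial(page_received, url=url))
        tasks.append(task)

    await asyncio.gather(*tasks)  # wait until every request is done
    return received
```

Running asyncio.run(crawl([firstURL, secondURL])) starts both requests before either has finished, which is the same point made above about non-blocking APIs: the multiple-request version is the single-request version repeated.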

Jean-Paul
 
