Concurrent threads to pull web pages?

Discussion in 'Python' started by Gilles Ganault, Oct 1, 2009.

  1. Hello

    I recently asked how to pull company IDs from an SQLite database,
    have multiple instances of a Python script download each company's web
    page from a remote server, e.g. www.acme.com/company.php?id=1, and use
    regexes to extract some information from each page.

    I need to run multiple instances to save time, since each page takes
    about 10 seconds to be returned to the script/browser.

    Since I've never written a multi-threaded Python script before, and to
    save time investigating, I was wondering whether someone already has a
    script that downloads web pages and saves some information into a
    database.

    Thank you for any tip.
    Gilles Ganault, Oct 1, 2009
    #1

  2. Gilles Ganault

    Guest

    On 1 Oct, 09:28 am, wrote:
    >Hello
    >
    >I recently asked how to pull companies' ID from an SQLite database,
    >have multiple instances of a Python script download each company's web
    >page from a remote server, eg. www.acme.com/company.php?id=1, and use
    >regexes to extract some information from each page.
    >
    >I need to run multiple instances to save time, since each page takes
    >about 10 seconds to be returned to the script/browser.
    >
    >Since I've never written a multi-threaded Python script before, to
    >save time investigating, I was wondering if someone already had a
    >script that downloads web pages and save some information into a
    >database.


    There's no need to use threads for this. Have a look at Twisted:

    http://twistedmatrix.com/trac/

    Here's an example of how to use the Twisted HTTP client:

    http://twistedmatrix.com/projects/web/documentation/examples/getpage.py

    Jean-Paul
    , Oct 2, 2009
    #2

  3. Gilles Ganault

    MRAB Guest

    Gilles Ganault wrote:
    > Hello
    >
    > I recently asked how to pull companies' ID from an SQLite database,
    > have multiple instances of a Python script download each company's web
    > page from a remote server, eg. www.acme.com/company.php?id=1, and use
    > regexes to extract some information from each page.
    >
    > I need to run multiple instances to save time, since each page takes
    > about 10 seconds to be returned to the script/browser.
    >
    > Since I've never written a multi-threaded Python script before, to
    > save time investigating, I was wondering if someone already had a
    > script that downloads web pages and save some information into a
    > database.
    >
    > Thank you for any tip.


    You could put the URLs into a queue and have multiple worker threads
    repeatedly get a URL from the queue, download the page, and then put the
    page into another queue for processing by another extraction thread.
    This post might help:

    http://mail.python.org/pipermail/python-list/2009-September/195866.html
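
A minimal sketch of the two-queue layout described above, written for current Python 3 and using only the standard library. It is not the code from the linked post. The names `fetch`, `crawl`, `extract` and `num_workers` are made up for illustration, and `extract` stands in for whatever regex work is needed on each page.

```python
import queue
import threading
import urllib.request


def fetch(url):
    """Blocking download of one page, returned as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch=fetch, extract=len, num_workers=4):
    """Download urls with num_workers threads; one extra thread extracts."""
    url_queue = queue.Queue()    # URLs waiting to be downloaded
    page_queue = queue.Queue()   # (url, page or exception) waiting for extraction
    results = {}

    def downloader():
        while True:
            url = url_queue.get()
            if url is None:           # sentinel: no more work for this worker
                return
            try:
                page_queue.put((url, fetch(url)))
            except Exception as exc:  # one failed page must not kill the worker
                page_queue.put((url, exc))

    def extractor():
        while True:
            item = page_queue.get()
            if item is None:          # sentinel: all downloads are finished
                return
            url, page = item
            # Only this thread writes to results, so no lock is needed.
            results[url] = page if isinstance(page, Exception) else extract(page)

    workers = [threading.Thread(target=downloader) for _ in range(num_workers)]
    consumer = threading.Thread(target=extractor)
    for thread in workers + [consumer]:
        thread.start()

    for url in urls:
        url_queue.put(url)
    for _ in workers:                 # one sentinel per download worker
        url_queue.put(None)
    for thread in workers:
        thread.join()

    page_queue.put(None)              # downloads done, stop the extractor
    consumer.join()
    return results
```

Keeping extraction in a single thread also gives one natural place to do the database writes, which matters because Python's `sqlite3` connections are, by default, only usable from the thread that created them.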
    MRAB, Oct 2, 2009
    #3
  4. Gilles Ganault

    Guest

    On 01:36 am, wrote:
    >On Thu, Oct 1, 2009 at 6:33 PM, <> wrote:
    >>On 1 Oct, 09:28 am, wrote:
    >>>Hello
    >>>
    >>>I recently asked how to pull companies' ID from an SQLite database,
    >>>have multiple instances of a Python script download each company's web
    >>>page from a remote server, eg. www.acme.com/company.php?id=1, and use
    >>>regexes to extract some information from each page.
    >>>
    >>>I need to run multiple instances to save time, since each page takes
    >>>about 10 seconds to be returned to the script/browser.
    >>>
    >>>Since I've never written a multi-threaded Python script before, to
    >>>save time investigating, I was wondering if someone already had a
    >>>script that downloads web pages and save some information into a
    >>>database.

    >>
    >>There's no need to use threads for this. Have a look at Twisted:
    >>
    >> http://twistedmatrix.com/trac/
    >>
    >>Here's an example of how to use the Twisted HTTP client:
    >>
    >>http://twistedmatrix.com/projects/web/documentation/examples/getpage.py

    >
    >I don't think he was looking for a framework... Specifically a framework
    >that you work on.


    He's free to use anything he likes. I'm offering an option he may not
    have been aware of before. It's okay. It's great to have options.

    Jean-Paul
    , Oct 2, 2009
    #4
  5. On Fri, 02 Oct 2009 01:33:18 -0000, declaimed
    the following in gmane.comp.python.general:

    > There's no need to use threads for this. Have a look at Twisted:
    >
    > http://twistedmatrix.com/trac/
    >


    Strange... While I can easily visualize how to convert the problem
    to a task pool -- especially given that code to do a single occurrence
    is already in place...

    ... conversion to an event-dispatch based system is something /I/
    can not imagine...

    Twisted may be a magnificent effort... but it doesn't fit my mental
    framework.
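
For comparison, a rough sketch of the task-pool conversion mentioned above, assuming the single-occurrence code can be wrapped in one function. It uses `concurrent.futures`, which was added to the standard library after this thread was written. `process_company`, `process_all` and the title regex are hypothetical placeholders, and `fetch` is any callable that takes a URL and returns the page text.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def process_company(company_id, fetch):
    """The existing single-occurrence code: fetch one page, extract one field."""
    page = fetch("http://www.acme.com/company.php?id=%d" % company_id)
    match = re.search(r"<title>(.*?)</title>", page, re.S)
    title = match.group(1).strip() if match else None
    return company_id, title


def process_all(company_ids, fetch, max_workers=8):
    """Run process_company for every ID on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pairs = pool.map(lambda cid: process_company(cid, fetch), company_ids)
        return dict(pairs)
```

The per-page function stays unchanged; only the loop that calls it is replaced by the pool.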
    --
    Wulfraed Dennis Lee Bieber KD6MOG
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Oct 2, 2009
    #5
  6. Gilles Ganault

    Guest

    On 05:48 am, wrote:
    >On Fri, 02 Oct 2009 01:33:18 -0000, declaimed
    >the following in gmane.comp.python.general:
    >>There's no need to use threads for this. Have a look at Twisted:
    >>
    >> http://twistedmatrix.com/trac/

    >
    > Strange... While I can easily visualize how to convert the problem
    >to a task pool -- especially given that code to do a single occurrence
    >is already in place...
    >
    > ... conversion to an event-dispatch based system is something /I/
    >can not imagine...


    The cool thing is that there's not much conversion to do from the single
    request version to the multiple request version, if you're using
    Twisted. The single request version looks like this:

    getPage(url).addCallback(pageReceived)

    And the multiple request version looks like this:

    getPage(firstURL).addCallback(pageReceived)
    getPage(secondURL).addCallback(pageReceived)

    Since the APIs don't block, doing things concurrently ends up being the
    easy thing.

    Not to say it isn't a bit of a challenge to get into this mindset, but I
    think anyone who wants to put a bit of effort into it can manage. :)
    Getting used to using Deferreds in the first place (necessary to
    write/use even the single request version) is probably where more people
    have trouble.
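
To make the "APIs don't block" point concrete without depending on Twisted, here is a rough analogue using the standard library's `asyncio` instead. This is a swapped-in technique, not Twisted's API: `get_page` below is a stand-in that only sleeps and returns fake HTML, `add_done_callback` plays the role that `addCallback` plays on a Deferred, and `fetch_all` and `run` are illustrative names.

```python
import asyncio


async def get_page(url, delay=0.1):
    """Stand-in for a non-blocking HTTP request: sleeps, then returns fake HTML."""
    await asyncio.sleep(delay)
    return "<html>page for %s</html>" % url


async def fetch_all(urls, page_received, delay=0.1):
    """Start every request at once and call page_received as each one finishes."""
    tasks = []
    for url in urls:
        task = asyncio.create_task(get_page(url, delay))
        # The callback receives the finished task; pass its result along.
        task.add_done_callback(lambda done: page_received(done.result()))
        tasks.append(task)
    await asyncio.gather(*tasks)      # wait until every request has completed


def run(urls, delay=0.1):
    """Collect the pages handed to the callback and return them as a list."""
    received = []
    asyncio.run(fetch_all(urls, received.append, delay))
    return received
```

Because starting a request returns immediately, adding a second or third URL costs one more line, and the total wait is roughly that of the slowest request instead of the sum of all of them.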

    Jean-Paul
    , Oct 2, 2009
    #6
