Problem with subprocess in threaded environment

Discussion in 'Python' started by Ningyu Shi, Mar 12, 2008.

  1. Ningyu Shi

    Ningyu Shi Guest

    I'm trying to write a multi-task downloader to download files from a
    website using multi-threading. I have one thread that analyzes the
    webpage, gets the addresses of the files to be downloaded and puts
    them in a Queue. The main thread then starts some threads to pull
    addresses from the queue and download them. To cap the number of
    concurrent downloads, I use a semaphore, e.g. at most 5 downloads at
    the same time.
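
    Roughly, the loop in my main thread that hands addresses to download
    threads looks like this (names are approximate; download() is the
    function shown below):

    import threading
    import Queue

    url_queue = Queue.Queue()           # filled by the page-analyzer thread
    semaphore = threading.Semaphore(5)  # at most 5 concurrent downloads
    print_lock = threading.Lock()

    def start_downloads():
        while True:
            url = url_queue.get()
            semaphore.acquire()         # wait until a download slot is free
            threading.Thread(target=download, args=(url,)).start()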

    I tried using urllib.urlretrieve in the download() thread, but from
    time to time one download thread seemed to freeze the whole program.
    So I gave up on that and used subprocess to call wget to do the job.
    My download thread looks like this:

    def download(url):
        subprocess.call(["wget", "-q", url])
        with print_lock:
            print url, 'finished.'
        semaphore.release()

    But later I found that after a given wget job finished downloading,
    the download() thread never reached the print url statement. So I end
    up with files being downloaded, but the download() thread never ends,
    doesn't release the semaphore, and so blocks the whole program. My
    guess is that at the time the wget process ends, that specific
    download thread is not active and misses the return of the call.

    Any comments or suggestions about this problem? Thanks
     
    Ningyu Shi, Mar 12, 2008
    #1

  2. dripton

    dripton Guest

    On Mar 12, 12:58 pm, Ningyu Shi <> wrote:
    > I'm trying to write a multi-task downloader to download files from a
    > website using multi-threading. I have one thread that analyzes the
    > webpage, gets the addresses of the files to be downloaded and puts
    > them in a Queue. The main thread then starts some threads to pull
    > addresses from the queue and download them. To cap the number of
    > concurrent downloads, I use a semaphore, e.g. at most 5 downloads at
    > the same time.


    You don't need a semaphore for that. Semaphores are low-level and
    very hard to get right. (Read http://www.greenteapress.com/semaphores/
    if you don't believe me.) You don't need any other type of explicit
    lock either. Just use Queues.

    Queues are much easier. Make simple threads, each of which typically
    reads from one queue and writes to one queue. If you want N-way
    concurrency at one stage, start N instances of that thread. If you
    need to guarantee non-concurrency at some stage, make that thread a
    singleton. Avoid sharing mutable data across threads except through
    Queues.

    > I tried using urllib.urlretrieve in the download() thread, but from
    > time to time one download thread seemed to freeze the whole program.
    > So I gave up on that and used subprocess to call wget to do the job.
    > My download thread looks like this:
    >
    > def download(url):
    >     subprocess.call(["wget", "-q", url])
    >     with print_lock:
    >         print url, 'finished.'
    >     semaphore.release()


    If the prints are just for debugging, print_lock is overkill. And in
    practice that print statement is probably atomic in CPython because of
    the GIL.

    But let's assume that it really is critical not to mix the output
    streams, and that the print isn't necessarily atomic. So instead of
    the lock, use a print_queue and a single PrintThread that just pulls
    strings off the queue and prints them.

    > But later I found that after a given wget job finished downloading,
    > the download() thread never reached the print url statement. So I
    > end up with files being downloaded, but the download() thread never
    > ends, doesn't release the semaphore, and so blocks the whole
    > program. My guess is that at the time the wget process ends, that
    > specific download thread is not active and misses the return of the
    > call.


    The fact that changing libraries didn't fix your problem is a strong
    indicator that it's your code, not the libraries. You're probably
    introducing deadlocks with your semaphores. Too much locking is as
    bad as too little.

    Here's the skeleton of how I'd write this program:

    main thread: create queues, create and start other threads, find URLs,
    put URLs on input_queue

    class DownloadThread(threading.Thread):
        def run(self):
            while True:
                url = input_queue.get()
                urllib.urlretrieve(url)
                print_queue.put("done with %s" % url)

    class PrintThread(threading.Thread):
        def run(self):
            while True:
                st = print_queue.get()
                print st

    Then you just have to worry about clean shutdown. A simple way to
    handle this is to initialize a counter to the number of URLs in main
    and decrement it in PrintThread. Then have main busy-wait (with a
    sleep to avoid consuming too much CPU) until it hits zero, then join
    all the other threads and exit. (This seems to violate the no-shared-
    mutable-data rule, but only one thread at a time mutates it, so it's
    safe.)
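
    Here's a rough, self-contained version of that shutdown scheme.
    Instead of joining the threads, I mark them as daemons here so their
    endless run() loops don't block exit; the daemon flag, the 0.1 second
    sleep and the example URLs in main are my own choices, not part of
    the recipe:

    import threading
    import time
    import urllib
    import Queue

    input_queue = Queue.Queue()
    print_queue = Queue.Queue()
    remaining = [0]     # decremented only by PrintThread, polled by main

    class DownloadThread(threading.Thread):
        def run(self):
            while True:
                url = input_queue.get()
                urllib.urlretrieve(url)
                print_queue.put("done with %s" % url)

    class PrintThread(threading.Thread):
        def run(self):
            while True:
                st = print_queue.get()
                print st
                remaining[0] -= 1       # count this URL as finished

    def main(urls):
        remaining[0] = len(urls)
        workers = [DownloadThread() for _ in range(5)] + [PrintThread()]
        for t in workers:
            t.setDaemon(True)           # the loops above never return,
            t.start()                   # so let them die with the process
        for url in urls:
            input_queue.put(url)
        while remaining[0] > 0:         # busy-wait with a sleep
            time.sleep(0.1)

    if __name__ == "__main__":
        main(["http://example.com/a", "http://example.com/b"])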

    Once you've done this manually, consider using the TaskQueue recipe on
    the Python Cookbook site, to simplify it a bit.
     
    dripton, Mar 13, 2008
    #2
