parallel subprocess.getoutput

Discussion in 'Python' started by Jaroslav Dobrek, May 11, 2012.

  1. Hello,

    I wrote the following code for using egrep on many large files:

    import os
    import subprocess

    MY_DIR = '/my/path/to/dir'
    FILES = os.listdir(MY_DIR)

    def grep(regex):
        i = 0
        l = len(FILES)
        output = []
        while i < l:
            # Build the egrep command line for the i-th file.
            command = ("egrep " + '"' + regex + '" ' + MY_DIR + '/' +
                       FILES[i])
            result = subprocess.getoutput(command)
            if result:
                output.append(result)
            i += 1
        return output

    Yet, I don't think that the files are searched in parallel. Am I
    right? How can I search them in parallel?

    Jaroslav
    Jaroslav Dobrek, May 11, 2012
    #1

  2. Sorry, for code-historical reasons this was unnecessarily complicated.
    Should be:


    import os
    import subprocess

    MY_DIR = '/my/path/to/dir'
    FILES = os.listdir(MY_DIR)


    def grep(regex):
        output = []
        for f in FILES:
            # Build the egrep command line for this file.
            command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + f
            result = subprocess.getoutput(command)
            if result:
                output.append(result)
        return output
    Jaroslav Dobrek, May 11, 2012
    #2

  3. On May 11, 8:04 am, Jaroslav Dobrek wrote:
    > Hello,
    >
    > I wrote the following code for using egrep on many large files:
    >
    > MY_DIR = '/my/path/to/dir'
    > FILES = os.listdir(MY_DIR)
    >
    > def grep(regex):
    >     i = 0
    >     l = len(FILES)
    >     output = []
    >     while i < l:
    >         command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + FILES[i]
    >         result = subprocess.getoutput(command)
    >         if result:
    >             output.append(result)
    >         i += 1
    >     return output
    >
    > Yet, I don't think that the files are searched in parallel. Am I
    > right? How can I search them in parallel?


    subprocess.getoutput() blocks until the command has exited and all of
    its output has been collected, so no, the searches do not run in
    parallel. You really shouldn't use it anyway: it hands the command
    string to a shell, which makes it very difficult to use securely.
    Your code, as it stands, could be exploited if the user can supply the
    regex or the directory.
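
To illustrate the security point, here is a minimal sketch of passing the regex and path as an argument list rather than as a shell string. The helper names build_argv, grep_file and echo_argument are hypothetical and not part of the original post, and the sketch assumes Python 3.5 or later (for subprocess.run) and a grep binary that accepts -E and -e.

```python
import subprocess
import sys


def build_argv(regex, path):
    """Return an argument list for grep; no shell ever parses it."""
    # "-e" marks the next argument as the pattern and "--" ends option
    # parsing, so a regex or path starting with "-" is not read as an option.
    return ["grep", "-E", "-e", regex, "--", path]


def grep_file(regex, path):
    """Search one file and return grep's standard output as text."""
    proc = subprocess.run(
        build_argv(regex, path),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stdout


def echo_argument(text):
    """Run a child Python that prints its first argument unchanged."""
    proc = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", text],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stdout.rstrip("\n")
```

Because no shell is involved, quotes and semicolons inside the regex reach the child process as ordinary characters. echo_argument shows this with a child Python interpreter standing in for egrep.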

    There are plenty of tools to do parallel execution in a shell, such
    as: http://code.google.com/p/ppss/. I would use one of those tools
    first.

    Nevertheless, if you must do it in Python, then the most portable way
    to accomplish what you want is to:
    0) Create a thread-safe queue object to hold the output.
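    1) Create each process using a subprocess.Popen object. Do this
    safely and securely, which means NOT passing shell=True in the
    constructor, passing stdin=subprocess.DEVNULL, and passing
    stderr=subprocess.DEVNULL unless you intend to capture error output
    (DEVNULL requires Python 3.3 or later). Pass stdout=subprocess.PIPE
    so that the output can be read.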
    2) Spawn a new thread for each process. That thread should block
    reading the Popen.stdout file object. Each time it reads some output,
    it should then write it to the queue. If you monitor stderr as well,
    you'll need to spawn two threads per subprocess. When EOF is reached,
    close the descriptor and call Popen.wait() to wait for the process to
    exit and collect its exit status (this is trickier with two threads
    and requires additional synchronization).
    3) After spawning each process, monitor the queue in the first thread
    and capture all of the output.
    4) Call the join() method on all of the threads to wait for them to
    finish. The easiest way to know when a thread is done is to have it
    write a special object (a sentinel) to the queue.
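
The steps above can be sketched roughly as follows. This is one possible reading of them, not a definitive implementation: the names run_parallel, _reader and _DONE are hypothetical, stderr is discarded rather than captured, and Python 3.3 or later is assumed for subprocess.DEVNULL. The function takes a list of argument lists so that it does not depend on any particular command.

```python
import queue
import subprocess
import threading

_DONE = object()  # sentinel that each reader thread puts on the queue once


def _reader(proc, tag, out_queue):
    """Block on proc.stdout, forward every line, then reap the process."""
    try:
        for line in proc.stdout:
            out_queue.put((tag, line.rstrip("\n")))
    finally:
        proc.stdout.close()
        proc.wait()
        out_queue.put(_DONE)


def run_parallel(commands):
    """Run every argv list concurrently; return {index: [output lines]}."""
    out_queue = queue.Queue()  # step 0: thread-safe queue
    threads = []
    for tag, argv in enumerate(commands):
        # Step 1: no shell=True, stdin and stderr detached from the parent.
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        # Step 2: one reader thread per process.
        thread = threading.Thread(target=_reader, args=(proc, tag, out_queue))
        thread.start()
        threads.append(thread)

    # Step 3: drain the queue until every reader has sent its sentinel.
    results = {tag: [] for tag in range(len(commands))}
    remaining = len(threads)
    while remaining:
        item = out_queue.get()
        if item is _DONE:
            remaining -= 1
        else:
            tag, line = item
            results[tag].append(line)

    # Step 4: all sentinels have arrived, so join() returns promptly.
    for thread in threads:
        thread.join()
    return results
```

For the original question, commands could be built as one list such as ["grep", "-E", "-e", regex, "--", os.path.join(MY_DIR, f)] per entry in FILES. Note that this sketch starts every process at once, so a very large directory would probably need some limit on how many run at the same time.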

    If you don't mind platform specific code (and it doesn't look like you
    do), then you can use fcntl.fcntl to make each file-object non-
    blocking, and then use any of the various asynchronous I/O APIs to
    avoid the use of threads. You still need to clean up all of the file
    objects and processes when you are done, though.
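
A thread-free variant along these lines might look like the sketch below. It is an assumption-laden illustration rather than the approach the post prescribes: it uses the selectors module (Python 3.4 or later) as the asynchronous I/O API and os.set_blocking (Python 3.5 or later) in place of a direct fcntl.fcntl call, it assumes a Unix-like platform, and the name run_parallel_select is hypothetical.

```python
import os
import selectors
import subprocess


def run_parallel_select(commands):
    """Run argv lists concurrently in one thread; return {index: bytes}."""
    sel = selectors.DefaultSelector()
    chunks = {}
    for tag, argv in enumerate(commands):
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        # Make the pipe non-blocking so a read can never stall the loop.
        os.set_blocking(proc.stdout.fileno(), False)
        sel.register(proc.stdout, selectors.EVENT_READ, data=(tag, proc))
        chunks[tag] = []

    pending = len(commands)
    while pending:
        for key, _events in sel.select():
            tag, proc = key.data
            try:
                data = os.read(key.fd, 65536)
            except BlockingIOError:
                continue  # spurious wakeup; try again on the next pass
            if data:
                chunks[tag].append(data)
            else:
                # EOF: clean up the file object and reap the process.
                sel.unregister(key.fileobj)
                key.fileobj.close()
                proc.wait()
                pending -= 1
    sel.close()
    return {tag: b"".join(parts) for tag, parts in chunks.items()}
```

The output is returned as raw bytes per command because the reads bypass Python's text layer; decoding is left to the caller.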
    Adam Skutt, May 11, 2012
    #3
