parallel subprocess.getoutput


Jaroslav Dobrek

Hello,

I wrote the following code for using egrep on many large files:

import os
import subprocess

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    i = 0
    l = len(FILES)
    output = []
    while i < l:
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + FILES[i]
        result = subprocess.getoutput(command)
        if result:
            output.append(result)
        i += 1
    return output

Yet, I don't think that the files are searched in parallel. Am I
right? How can I search them in parallel?

Jaroslav
 

Jaroslav Dobrek

Sorry, for code-historical reasons this was unnecessarily complicated.
Should be:


MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    output = []
    for f in FILES:
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' + f
        result = subprocess.getoutput(command)
        if result:
            output.append(result)
    return output
 

Adam Skutt

Hello,

I wrote the following code for using egrep on many large files:

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    i = 0
    l = len(FILES)
    output = []
    while i < l:
        command = "egrep " + '"' + regex + '" ' + MY_DIR + '/' +
FILES
        result = subprocess.getoutput(command)
        if result:
            output.append(result)
        i += 1
    return output

Yet, I don't think that the files are searched in parallel. Am I
right? How can I search them in parallel?


subprocess.getoutput() blocks until the command writes out all of its
output, so no, they're not going to be run in parallel. You really
shouldn't use it anyway, as it's very difficult to use it securely.
Your code, as it stands, could be exploited if the user can supply the
regex or the directory.
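
For illustration, here is a minimal, still sequential sketch of the same grep loop with the command passed as an argument list instead of a shell string (reusing the MY_DIR and FILES names from the post; '-e' simply marks the next argument as the pattern). It only addresses the injection problem, not the parallelism:

import os
import subprocess

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep(regex):
    output = []
    for f in FILES:
        # Argument list, no shell: quotes, semicolons, $(...) and the
        # like in the regex reach egrep verbatim instead of being
        # interpreted by /bin/sh.
        proc = subprocess.Popen(
            ['egrep', '-e', regex, os.path.join(MY_DIR, f)],
            stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        if out:
            output.append(out.decode())
    return output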

There are plenty of tools to do parallel execution in a shell, such
as: http://code.google.com/p/ppss/. I would use one of those tools
first.

Nevertheless, if you must do it in Python, then the most portable way
to accomplish what you want (a code sketch follows this list) is to:
0) Create a thread-safe queue object to hold the output.
1) Create each process using a subprocess.Popen object. Do this
safely and securely, which means NOT passing shell=True to the
constructor, not feeding the child anything on stdin, and not
capturing stderr unless you intend to read the error output.
2) Spawn a new thread for each process. That thread should block
reading the Popen.stdout file object. Each time it reads some output,
it should then write it to the queue. If you monitor stderr as well,
you'll need to spawn two threads per subprocess. When EOF is reached,
close the descriptor and call Popen.wait() to terminate the process
(this is trickier with two threads and requires additional
synchronization).
3) After spawning the processes, monitor the queue in the main thread
and capture all of the output.
4) Call the join() method on all of the threads to terminate them.
The easiest way to do this is to have each thread write a special
object (a sentinel) to the queue to indicate that it is done.
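
Put together, a rough sketch of steps 0) through 4) on a current Python 3
might look like the following (the function name grep_parallel and the
sentinel object are mine, not anything from the original code):

import os
import queue
import subprocess
import threading

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)
SENTINEL = object()   # posted to the queue when a worker thread is done

def _reader(proc, out_queue):
    # Step 2: block on the child's stdout, forward every line to the
    # queue, then reap the process and post the sentinel.
    for line in proc.stdout:
        out_queue.put(line)
    proc.stdout.close()
    proc.wait()
    out_queue.put(SENTINEL)

def grep_parallel(regex):
    out_queue = queue.Queue()                      # step 0
    threads = []
    for f in FILES:
        # Step 1: one egrep per file, no shell, nothing fed on stdin.
        proc = subprocess.Popen(
            ['egrep', '-e', regex, os.path.join(MY_DIR, f)],
            stdout=subprocess.PIPE,
            stdin=subprocess.DEVNULL,
            universal_newlines=True)
        t = threading.Thread(target=_reader, args=(proc, out_queue))
        t.start()
        threads.append(t)

    # Step 3: drain the queue until every worker has posted its sentinel.
    output = []
    finished = 0
    while finished < len(threads):
        item = out_queue.get()
        if item is SENTINEL:
            finished += 1
        else:
            output.append(item)

    for t in threads:                              # step 4
        t.join()
    return output

Note that this starts one process per file at once; for a very large
directory you would probably want to cap the number of concurrent egreps.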

If you don't mind platform specific code (and it doesn't look like you
do), then you can use fcntl.fcntl to make each file-object non-
blocking, and then use any of the various asynchronous I/O APIs to
avoid the use of threads. You still need to clean up all of the file
objects and processes when you are done, though.
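
On a Unix-only setup, one way to realize that idea without threads is a
plain select() loop over the pipes (again, the name grep_select is mine
and this is only a sketch):

import os
import select
import subprocess

MY_DIR = '/my/path/to/dir'
FILES = os.listdir(MY_DIR)

def grep_select(regex):
    # Start one egrep per file; a single select() loop then services
    # every pipe as data becomes available, with no helper threads.
    procs = {}                                   # fd -> Popen
    for f in FILES:
        p = subprocess.Popen(
            ['egrep', '-e', regex, os.path.join(MY_DIR, f)],
            stdout=subprocess.PIPE)
        procs[p.stdout.fileno()] = p

    output = []
    while procs:
        ready, _, _ = select.select(list(procs), [], [])
        for fd in ready:
            chunk = os.read(fd, 65536)
            if chunk:                            # data from this egrep
                output.append(chunk.decode())
            else:                                # EOF: clean up
                p = procs.pop(fd)
                p.stdout.close()
                p.wait()
    return output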
 
