Parallel/Multiprocessing script design question

A

Amit N

Hi guys,

I tend to ramble, and I am afraid none of you busy experts will bother
reading my long post, so I will try to summarize it first:

1. I have a script that processes ~10GB of data daily, and runs for a long
time that I need to parallelize on a multicpu/multicore system. I am trying
to decide on a module/toolkit that would help me create a multiprocessing
solution but there are so many of them that I can't decide what to use. I am
looking for a cross platform solution. Although right now it has to work in
windows first, so many of the fork based modules are out. I am hoping people
with experience using any of these would chime in with tips. The main thing
I would look for in a toolkit is maturity and no extra dependencies. Plus a
wide user community is always good.
POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

2. The processing involves multiple steps that each input file has to go
through. I am trying to decide between a batch mode design and a pipelined
design for concurrency. In the batched design, all files will be processed
on one processing step(in parallel) before the next step is started. In a
pipelined design, each file will be taken through all steps to the end. So
multiple files will be in parallel pipelines at the same time. I can't
decide which is better. I guess I am asking for experienced eyes to take a
look at the alternatives, for things that I, making my very first concurrent
design, won't see.

DETAILS:

I have been trying to choose a design for this project but am striken by my
usual case of analysis paralysis.

I had decided to learn Python about 3 weeks ago specifically for this
project, as it needed parsing and text processing, not realizing that I
would need concurrency. I am having the same trouble in deciding which
parser generator to use, but I will ask about parsing in a separate thread
to keep this focused.

It was slow, so I tried to run a multithreaded version, naively expecting a
2x speedup. I barely got a 5% improvement and only then learned about the
GIL. I guess I still haven't got too much time invested in this, so I can
still switch to another language. I am not sure which other scripting
languages have real multithreading? Perl? But I had chosen Python over Perl
for readability and maintainability and am not ready to give that up yet. I
know about stackless/Ironpython/Jython but I want to stick to CPython. So I
am going to try to figure this out.

Even after deciding to go for a SMP solution, I still don't know which
toolkit to use. The subprocess module should allow spawning new processes,
but I am not sure how to get status/error codes back from those? I guess
this is why people made those parallel processing modules that might help by
taking care of these things. I think my application is fairly simple and
should be easy to SMP.

THE TASK:

About 800+ 10-15MB files are generated daily that need to be processed. The
processing consists of different steps that the files must go through:

-Uncompress
-FilterA
-FilterB
-Parse
-Possibly compress parsed files for archival

All files have to be run through each of the two filters. The two filters
are independent of each other and produce output files that need separate
parsers. So they can in fact run in parallel, and so can the subsequent
parsers. Furthermore, multiple files can be running in parallel inside each
step. Eg. 4 files being uncompressed at the same time. I am using the python
library for uncompressing and will be doing the parsing in Python too. But
the two filters are external console programs that I spawn in the system
shell with subprocess.call(). I guess I can forget about communicating with
those?

The first method that came to mind was to finish each step on all files
before going to the next. So all files are uncompressed first, using
multiple processes in parallel. Then all files are filtered in parallel,
etc. I guess I would need some sort of queuing system here, to submit files
to the CPUs properly?

The other way could be to have each individual file run through all the
steps and have multiple such "pipelines" running simultaneously in parallel.
It feels like this method will lose cache performance because all the code
for all the steps will be loaded at the same time, but I am not sure if I
should be worrying about that. This will have the advantage of "Fast
First-Out" which means that something waiting for the results of processing
won't have to wait till the very end. They can start receiving data
incrementally from the start(kind of streaming?). Pipelined mode may also
help to rerun an individual file quickly in case it had an error.
So whats the better method?

EVALUATIONS:

POSH - Doesn't seem mature, was supposed to be proof of concept only. People
have reported Bugs/Problems using it. POSIX Only.
delegate/forkmap/pprocess - fork based, POSIX only
ParallelPython - Seems to meet all criteria, and is cross platform. I will
be trying this one first.
remoteD - Claims to be platform independent, but I don't think so. Code
shows os.fork only. Last updated 2004 v0.8
processing - Is in beta V0.33 but looks promising and is cross platform.
Emulates processes as threads. http://www.python.org/pypi/processing

MPI based modules(probably overkill for my application):

pyPar - Mature, cross platform. Has a dependency on Numeric Python + needs a
C compiler.
pyMpi - POSIX only . Alpha status. From lawrence livermore labs. It modifies
the interpreter itself to make it multi-noded.
mpi4py - ? another MPI implementation.

LINKS & DISCUSSIONS

http://wiki.python.org/moin/ParallelProcessing
http://blog.ianbicking.org/gil-of-doom.html
http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
http://groups.google.com/group/comp.lang.python/browse_thread/thread/1f5d927d34f8f323/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/332083cdc8bc44b/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/13da24f2d6dc24a9/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/f822ec289f30b26a/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/902dbddfc31b8891
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d8fa9ad770c17c70/
 
?

=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=

I tend to ramble, and I am afraid none of you busy experts will bother
reading my long post

I think that's a fairly accurate description, and prediction.
I am hoping people
with experience using any of these would chime in with tips. The main thing
I would look for in a toolkit is maturity and no extra dependencies. Plus a
wide user community is always good.
POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

After reading your problem description, I think you need none of these.
Parallelization-wise, it seems to be a fairly straight-forward task,
with coarse-grained parallelism (which is the easier kind).
2. The processing involves multiple steps that each input file has to go
through. I am trying to decide between a batch mode design and a pipelined
design for concurrency.

Try to avoid synchronization as much as you can if you want a good
speed-up. Have long-running tasks that don't need to interact with
each other, rather than having frequent synchronization. If you
haven't read about Amdahl's law, do so now.

From your description, it seems that having a single process process an
entire input file, from the beginning to the end, sounds like the right
approach; use the multiple CPUs to process different input files in
parallel (IIUC, they can be processed in any order, and simultaneously).
In the batched design, all files will be processed
on one processing step(in parallel) before the next step is started.

Why that? If you have N "worker" processes, each one should do the
processing at its own rate. I.e. each one does all the steps for a
single file in sequence; no need to wait until all processes have
completed one step before starting the next one.

As for N: don't make it the number of input files. Instead, make it
the number of CPUs, or perhaps two times the number of CPUs. The maximum
speed-up out of k CPUs is k, so it is pointless (and memory-consuming)
to have many-more-than-k worker processes.
In a pipelined design, each file will be taken through all steps to the end.

Perhaps this is what I suggested above - I would not call it
"pipelined", though, because in pipelining, I would expect that the
separate steps of a pipeline run in parallel, each one passing its
output to the next step in the pipeline. That allows for a maximum
speedup equal to the number of pipeline steps. IIUC, you have many
more input documents than pipeline steps, so parallelizing by input
data allows for higher speedups.
The subprocess module should allow spawning new processes,
but I am not sure how to get status/error codes back from those?

What's wrong with the returncode attribute in subprocess?
I guess I can forget about communicating with those?

Assuming all you want to know is whether they succeeded or failed: yes.
Just look at the exit status.
I guess I would need some sort of queuing system here, to submit files
to the CPUs properly?

No. You don't submit files to CPUs, you submit them to processes (if
you "submit" anything at all). The operating system will chose a CPU
for you.

If you follow the architecture I proposed above (i.e. one Python
process processes a file from the beginning to the end), you pass
the file to be processed on the command line. If the processing is
fairly short (which I assume it is not), you could have a single
process process multiple input files, one after another (e.g.
process 1 does files 1..400, process 401..800, and so on).
The other way could be to have each individual file run through all the
steps and have multiple such "pipelines" running simultaneously in parallel.

Ok. See above - I wouldn't call it piplelined, but this is what you
should do.
It feels like this method will lose cache performance because all the code
for all the steps will be loaded at the same time, but I am not sure if I
should be worrying about that.

[I assume you mean "CPU cache" here]

You should not worry about it, plus I very much doubt that this actually
happens. The code for all the steps will *not* be loaded into the cache;
that it sits in main memory has no effect whatsoever on cache
performance. You can assume that all the code that runs a single step
will be in the cache (unless the algorithms are very large). OTOH,
the data of a input file will probably not fit in the CPU cache. You
should be worried about reducing disk IO, so all processing of a file
should run completely in memory. For that, it is better if you run an
input file from the beginning to the end - if you would first read
all 800 files, decompress them, then likely the output will not
fit into the disk cache, so the system will have to read the compressed
data from disk in the next step.

Regards,
Martin
 
A

A.T.Hofkamp

Hi guys,

I tend to ramble, and I am afraid none of you busy experts will bother
reading my long post, so I will try to summarize it first:

I haven't read the details, but you seem to aim for a single python program
that does 'it'. A single sequential thread is not fast enough, so you want
parallel execution (and yes there are a zillion ways of doing that).

Why don't you start at the other end?

Write a program for each task that you have, then fork/spawn/chain/whatever
enough processes at OS level to eat all your data. The OS is usually much
better at balancing CPU's. The Python module 'subprocess' would be your friend
in that case.

In addition, you can run each program independently, which will come in handy
one day.


Sincerely,
Albert
 
I

Ivan Voras

Amit said:
About 800+ 10-15MB files are generated daily that need to be processed. The
processing consists of different steps that the files must go through:

-Uncompress
-FilterA
-FilterB
-Parse
-Possibly compress parsed files for archival

You can implement one of two easy straightforward approaches:

1 - Create one program, start N instances of it, where N is the number
of CPUs/cores, and let each process one file to completion. You'll
probably need an "overseer" program to start them and dispatch jobs to
them. The easiest is to start your processes with first N files, then
monitor them for completion and when any of them finishes, start another
with the next file in queue, etc.

2 - Create a program / process for each of these steps and let the steps
operate independently, but feed output from one step to the input of the
next. You'll probably need some buffering and more control, so that if
(for example) "FilterA" is slower then "Uncompress", the "Uncompress"
process is signaled to wait a little until "FilterA" needs more data.
The key is that, as long as all the steps run at approximatly the same
speed, they can run in parallel.

Note that both approaches are in principle independent on whether you
use threads or processes, with the exception of communication between
the steps/stages, but you can't use threads in python if your goal is
parallel execution of threads.



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFG6UDpldnAQVacBcgRAzwpAJ47N30PYtdplO0DZSCP0ZEvGaRRggCgrVt+
Dm+G4CceP2VeupmY5SThBmc=
=pkzH
-----END PGP SIGNATURE-----
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,011
Latest member
AjaUqq1950

Latest Threads

Top