Parallel/Multiprocessing script design question

Discussion in 'Python' started by Amit N, Sep 13, 2007.

  1. Amit N

    Amit N Guest

    Hi guys,

    I tend to ramble, and I am afraid none of you busy experts will bother
    reading my long post, so I will try to summarize it first:

    1. I have a script that processes ~10GB of data daily and runs for a long
    time, and I need to parallelize it on a multi-CPU/multi-core system. I am
    trying to decide on a module/toolkit that would help me create a
    multiprocessing solution, but there are so many of them that I can't decide
    what to use. I am looking for a cross-platform solution, although right now
    it has to work on Windows first, so many of the fork-based modules are out.
    I am hoping people
    with experience using any of these would chime in with tips. The main thing
    I would look for in a toolkit is maturity and no extra dependencies. Plus a
    wide user community is always good.
    POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

    2. The processing involves multiple steps that each input file has to go
    through. I am trying to decide between a batch mode design and a pipelined
    design for concurrency. In the batched design, all files will be processed
    on one processing step (in parallel) before the next step is started. In a
    pipelined design, each file will be taken through all steps to the end. So
    multiple files will be in parallel pipelines at the same time. I can't
    decide which is better. I guess I am asking for experienced eyes to take a
    look at the alternatives, for things that I, making my very first concurrent
    design, won't see.

    DETAILS:

    I have been trying to choose a design for this project but am stricken by my
    usual case of analysis paralysis.

    I had decided to learn Python about 3 weeks ago specifically for this
    project, as it needed parsing and text processing, not realizing that I
    would need concurrency. I am having the same trouble in deciding which
    parser generator to use, but I will ask about parsing in a separate thread
    to keep this focused.

    The script was slow, so I tried to run a multithreaded version, naively
    expecting a 2x speedup. I barely got a 5% improvement and only then learned
    about the GIL. I guess I don't have too much time invested in this yet, so I can
    still switch to another language. I am not sure which other scripting
    languages have real multithreading? Perl? But I had chosen Python over Perl
    for readability and maintainability and am not ready to give that up yet. I
    know about Stackless/IronPython/Jython but I want to stick to CPython. So I
    am going to try to figure this out.

    Even after deciding to go for an SMP solution, I still don't know which
    toolkit to use. The subprocess module should allow spawning new processes,
    but I am not sure how to get status/error codes back from those? I guess
    this is why people made those parallel processing modules that might help by
    taking care of these things. I think my application is fairly simple and
    should be easy to parallelize.

    THE TASK:

    About 800+ 10-15MB files are generated daily that need to be processed. The
    processing consists of different steps that the files must go through:

    -Uncompress
    -FilterA
    -FilterB
    -Parse
    -Possibly compress parsed files for archival

    All files have to be run through each of the two filters. The two filters
    are independent of each other and produce output files that need separate
    parsers, so they can in fact run in parallel, and so can the subsequent
    parsers. Furthermore, multiple files can be running in parallel inside each
    step, e.g. 4 files being uncompressed at the same time. I am using the Python
    library for uncompressing and will be doing the parsing in Python too. But
    the two filters are external console programs that I spawn in the system
    shell with subprocess.call(). I guess I can forget about communicating with
    those?
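
    To make this concrete, here is a rough sketch of what a single file goes
    through (gzip is just a stand-in for the actual compression library, and
    the "filterA"/"filterB" command names and the parse_* stubs are
    placeholders for the real filters and parsers):

        import gzip
        import shutil
        import subprocess

        def parse_a(path):
            pass  # placeholder for the parser of FilterA output

        def parse_b(path):
            pass  # placeholder for the parser of FilterB output

        def process_one_file(compressed_path):
            raw_path = compressed_path[:-3]  # e.g. "x.dat.gz" -> "x.dat"

            # Uncompress with the standard library.
            src = gzip.open(compressed_path, 'rb')
            dst = open(raw_path, 'wb')
            shutil.copyfileobj(src, dst)
            src.close()
            dst.close()

            # The two independent external filters, each writing its own output.
            rc_a = subprocess.call(['filterA', raw_path, raw_path + '.a'])
            rc_b = subprocess.call(['filterB', raw_path, raw_path + '.b'])
            if rc_a != 0 or rc_b != 0:
                raise RuntimeError('filter failed on ' + raw_path)

            # Parse both filter outputs.
            parse_a(raw_path + '.a')
            parse_b(raw_path + '.b')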

    The first method that came to mind was to finish each step on all files
    before going to the next. So all files are uncompressed first, using
    multiple processes in parallel. Then all files are filtered in parallel,
    etc. I guess I would need some sort of queuing system here, to submit files
    to the CPUs properly?

    The other way could be to have each individual file run through all the
    steps and have multiple such "pipelines" running simultaneously in parallel.
    It feels like this method will lose cache performance because all the code
    for all the steps will be loaded at the same time, but I am not sure if I
    should be worrying about that. This will have the advantage of "Fast
    First-Out", which means that something waiting for the results of processing
    won't have to wait till the very end. It can start receiving data
    incrementally from the start (kind of streaming?). Pipelined mode may also
    help to rerun an individual file quickly in case it had an error.
    So what's the better method?

    EVALUATIONS:

    POSH - Doesn't seem mature; it was supposed to be a proof of concept only.
    People have reported bugs/problems using it. POSIX only.
    delegate/forkmap/pprocess - fork-based, POSIX only.
    ParallelPython - Seems to meet all criteria and is cross-platform. I will
    be trying this one first.
    remoteD - Claims to be platform independent, but I don't think so; the code
    shows os.fork only. Last updated 2004, v0.8.
    processing - Is in beta (v0.33) but looks promising and is cross-platform.
    It gives processes an API modelled on the threading module (see the sketch
    after the MPI list below). http://www.python.org/pypi/processing

    MPI-based modules (probably overkill for my application):

    pyPar - Mature, cross-platform. Has a dependency on Numeric Python and needs
    a C compiler.
    pyMpi - POSIX only. Alpha status. From Lawrence Livermore National
    Laboratory. It modifies the interpreter itself to make it multi-noded.
    mpi4py - ? Another MPI binding.
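
    If the processing package works out, the worker-pool style from its
    documentation would look roughly like this (a hedged sketch: I am assuming
    the Pool/map API it advertises, process_one_file stands for the per-file
    work sketched above, the input glob is made up, and the __main__ guard is
    apparently required on Windows):

        import glob
        import processing  # http://www.python.org/pypi/processing

        def process_one_file(path):
            pass  # uncompress, filter, parse -- the per-file pipeline goes here

        if __name__ == '__main__':                 # required on Windows
            pool = processing.Pool(processes=4)    # roughly the number of CPUs
            pool.map(process_one_file, glob.glob('incoming/*.gz'))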

    LINKS & DISCUSSIONS

    http://wiki.python.org/moin/ParallelProcessing
    http://blog.ianbicking.org/gil-of-doom.html
    http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/1f5d927d34f8f323/
    http://groups.google.com/group/comp.lang.python/browse_frm/thread/332083cdc8bc44b/
    http://groups.google.com/group/comp.lang.python/browse_frm/thread/13da24f2d6dc24a9/
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/f822ec289f30b26a/
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/902dbddfc31b8891
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/d8fa9ad770c17c70/
     
    Amit N, Sep 13, 2007
    #1

  2. Martin v. Löwis

    > I tend to ramble, and I am afraid none of you busy experts will bother
    > reading my long post


    I think that's a fairly accurate description, and prediction.

    > I am hoping people
    > with experience using any of these would chime in with tips. The main thing
    > I would look for in a toolkit is maturity and no extra dependencies. Plus a
    > wide user community is always good.
    > POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(


    After reading your problem description, I think you need none of these.
    Parallelization-wise, it seems to be a fairly straight-forward task,
    with coarse-grained parallelism (which is the easier kind).

    > 2. The processing involves multiple steps that each input file has to go
    > through. I am trying to decide between a batch mode design and a pipelined
    > design for concurrency.


    Try to avoid synchronization as much as you can if you want a good
    speed-up. Have long-running tasks that don't need to interact with
    each other, rather than having frequent synchronization. If you
    haven't read about Amdahl's law, do so now.

    From your description, having a single process handle an entire input file,
    from beginning to end, sounds like the right approach; use the multiple CPUs
    to process different input files in parallel (IIUC, they can be processed in
    any order, and simultaneously).

    > In the batched design, all files will be processed
    > on one processing step (in parallel) before the next step is started.


    Why that? If you have N "worker" processes, each one should do the
    processing at its own rate. I.e. each one does all the steps for a
    single file in sequence; no need to wait until all processes have
    completed one step before starting the next one.

    As for N: don't make it the number of input files. Instead, make it
    the number of CPUs, or perhaps two times the number of CPUs. The maximum
    speed-up out of k CPUs is k, so it is pointless (and memory-consuming)
    to have many-more-than-k worker processes.
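
    For illustration, a minimal overseer along these lines needs nothing
    beyond the standard library. "worker.py" is a hypothetical script that
    processes the single file named on its command line:

        import subprocess
        import sys
        import time

        def run_all(files, max_workers=4):    # max_workers ~= number of CPUs
            pending = list(files)
            running = []
            while pending or running:
                # Top up to max_workers child processes.
                while pending and len(running) < max_workers:
                    path = pending.pop(0)
                    running.append(
                        subprocess.Popen([sys.executable, 'worker.py', path]))
                # Keep only children that are still running; report failures.
                still_running = []
                for proc in running:
                    if proc.poll() is None:
                        still_running.append(proc)
                    elif proc.returncode != 0:
                        sys.stderr.write('worker failed with exit code %d\n'
                                         % proc.returncode)
                running = still_running
                time.sleep(0.5)               # avoid a busy-wait loop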

    > In a pipelined design, each file will be taken through all steps to the end.


    Perhaps this is what I suggested above - I would not call it
    "pipelined", though, because in pipelining, I would expect that the
    separate steps of a pipeline run in parallel, each one passing its
    output to the next step in the pipeline. That allows for a maximum
    speedup equal to the number of pipeline steps. IIUC, you have many
    more input documents than pipeline steps, so parallelizing by input
    data allows for higher speedups.

    > The subprocess module should allow spawning new processes,
    > but I am not sure how to get status/error codes back from those?


    What's wrong with the returncode attribute in subprocess?
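
    For instance (the commands here are harmless stand-ins for the real
    filter invocations):

        import subprocess
        import sys

        # call() blocks and returns the child's exit code directly.
        rc = subprocess.call([sys.executable, '-c', 'pass'])
        assert rc == 0

        # With Popen, wait() returns the exit code and also sets .returncode.
        proc = subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(3)'])
        proc.wait()
        assert proc.returncode == 3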

    > I guess I can forget about communicating with those?


    Assuming all you want to know is whether they succeeded or failed: yes.
    Just look at the exit status.

    > I guess I would need some sort of queuing system here, to submit files
    > to the CPUs properly?


    No. You don't submit files to CPUs, you submit them to processes (if
    you "submit" anything at all). The operating system will choose a CPU
    for you.

    If you follow the architecture I proposed above (i.e. one Python
    process processes a file from the beginning to the end), you pass
    the file to be processed on the command line. If the processing is
    fairly short (which I assume it is not), you could have a single
    process handle multiple input files, one after another (e.g.
    process 1 does files 1..400, process 2 does files 401..800, and so on).
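
    A sketch of such a worker script (process_one_file is a placeholder for
    your uncompress/filter/parse steps):

        # worker.py -- processes every file named on its command line.
        import sys

        def process_one_file(path):
            pass  # uncompress, run the filters, parse

        if __name__ == '__main__':
            for path in sys.argv[1:]:
                process_one_file(path)

    The overseer then only has to run "python worker.py somefile" (or a whole
    slice of the file list) once per worker process.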

    > The other way could be to have each individual file run through all the
    > steps and have multiple such "pipelines" running simultaneously in parallel.


    Ok. See above - I wouldn't call it pipelined, but this is what you
    should do.

    > It feels like this method will lose cache performance because all the code
    > for all the steps will be loaded at the same time, but I am not sure if I
    > should be worrying about that.


    [I assume you mean "CPU cache" here]

    You should not worry about it, plus I very much doubt that this actually
    happens. The code for all the steps will *not* be loaded into the cache;
    that it sits in main memory has no effect whatsoever on cache
    performance. You can assume that all the code that runs a single step
    will be in the cache (unless the algorithms are very large). OTOH,
    the data of an input file will probably not fit in the CPU cache. You
    should worry about reducing disk I/O, so all processing of a file
    should run completely in memory. For that, it is better to run an
    input file from beginning to end - if you first decompressed all 800
    files, the output would likely not fit into the disk cache, so the
    system would have to read the decompressed data back from disk in the
    next step.

    Regards,
    Martin
     
    Martin v. Löwis, Sep 13, 2007
    #2

  3. A.T.Hofkamp

    A.T.Hofkamp Guest

    On 2007-09-13, Amit N <> wrote:
    > Hi guys,
    >
    > I tend to ramble, and I am afraid none of you busy experts will bother
    > reading my long post, so I will try to summarize it first:


    I haven't read the details, but you seem to aim for a single Python program
    that does 'it'. A single sequential thread is not fast enough, so you want
    parallel execution (and yes, there are a zillion ways of doing that).

    Why don't you start at the other end?

    Write a program for each task that you have, then fork/spawn/chain/whatever
    enough processes at OS level to eat all your data. The OS is usually much
    better at balancing CPUs. The Python module 'subprocess' would be your friend
    in that case.

    In addition, you can run each program independently, which will come in handy
    one day.


    Sincerely,
    Albert
     
    A.T.Hofkamp, Sep 13, 2007
    #3
  4. Ivan Voras

    Ivan Voras Guest

    Amit N wrote:

    > About 800+ 10-15MB files are generated daily that need to be processed. The
    > processing consists of different steps that the files must go through:
    >
    > -Uncompress
    > -FilterA
    > -FilterB
    > -Parse
    > -Possibly compress parsed files for archival


    You can implement one of two straightforward approaches:

    1 - Create one program, start N instances of it, where N is the number
    of CPUs/cores, and let each instance process one file to completion.
    You'll probably need an "overseer" program to start them and dispatch
    jobs to them. The easiest approach is to start your processes with the
    first N files, then monitor them for completion, and when any of them
    finishes, start another with the next file in the queue, etc.

    2 - Create a program/process for each of these steps and let the steps
    operate independently, but feed output from one step to the input of the
    next. You'll probably need some buffering and more control, so that if
    (for example) "FilterA" is slower than "Uncompress", the "Uncompress"
    process is signaled to wait a little until "FilterA" needs more data.
    The key is that, as long as all the steps run at approximately the same
    speed, they can run in parallel.
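
    As a rough sketch of approach 2 with plain OS pipes (the stage commands
    are hypothetical; the pipe buffer itself provides the "wait a little"
    back-pressure):

        import subprocess

        # Three stages chained stdout -> stdin, like a shell pipeline.
        p1 = subprocess.Popen(['uncompress_step'], stdout=subprocess.PIPE)
        p2 = subprocess.Popen(['filter_step'], stdin=p1.stdout,
                              stdout=subprocess.PIPE)
        p3 = subprocess.Popen(['parse_step'], stdin=p2.stdout)
        p1.stdout.close()   # so filter_step sees EOF when uncompress_step exits
        p2.stdout.close()   # likewise for parse_step
        p3.wait()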

    Note that both approaches are in principle independent of whether you
    use threads or processes, with the exception of communication between
    the steps/stages, but you can't use threads in Python if your goal is
    parallel execution of threads.



     
    Ivan Voras, Sep 13, 2007
    #4
  5. Paddy

    Paddy Guest

    Paddy, Sep 13, 2007
    #5
