Multiprocessing: don't push the pedal to the metal?

Discussion in 'Python' started by John Ladasky, May 22, 2011.

  1. John Ladasky

    John Ladasky Guest

    Hello again, everyone.

    I'm developing some custom neural network code. I'm using Python
    2.6, Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core
    CPU. About six weeks ago, I asked some questions about
    multiprocessing in Python, and I got some very helpful responses from
    you all.

    http://groups.google.com/group/comp.lang.python/browse_frm/thread/374e1890efbcc87b

    Now I'm back with a new question. I have gotten comfortable with
    cProfile, and with multiprocessing's various Queues (I've graduated
    from Pool). I just ran some extensive tests of my newest code, and
    I've learned some surprising things. I have a pretty picture here (be
    sure to view the full-size image):

    http://www.flickr.com/photos/15579975@N00/5744093219

    I'll quickly ask my question first, to avoid a TL;DR problem: when you
    have a multi-core CPU with N cores, is it common to see the
    performance peak at N-1, or even N-2 processes? And so, should you
    avoid using quite as many processes as there are cores? I was
    expecting diminishing returns for each additional core -- but not
    outright declines.

    That's what I think my data shows for many of my trial runs. I've
    tried running this test twice. Once, I was reading a few PDFs and web
    pages while my speed test was running. But even when I wasn't using
    the computer for these other (light) tasks, I saw the same performance
drops. Perhaps this is due to OS overhead? The load average on my
    system monitor looks pretty quiet to me when I'm not running my
    program.

    OK, if you care to read further, here's some more detail...

    My graphs show the execution times of my neural network evaluation
    routine as a function of:

    - the size of my neural network (six sizes were tried -- with varying
    numbers of inputs, outputs and hidden nodes),
    - the subprocess configuration (either not using a subprocess, or
    using 1-6 subprocesses), and
    - the size of the input data vector (from 7 to 896 sets of inputs --
    I'll explain the rationale for the exact numbers I chose if anyone
    cares to know).

    Each graph is normalized to the execution time that running the
    evaluation routine takes on a single CPU, without invoking a
    subprocess. Obviously, I'm looking for the conditions which yield
    performance gains above that baseline. (I'll be running this
    particular piece of code millions of times!)

    I tried 200 repetitions for each combination of network size, input
    data size, and number of CPU cores. Even so, there was substantial
    size, and number of CPU cores. Even so, there was substantial
    irregularity in the timing graphs. So, rather than connecting the
    dots directly, which would lead to some messy crossing lines which are
    a bit hard to read, I fit B-spline curves to the data.
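    A minimal sketch of this kind of timing sweep, using a Queue-based
    worker pool as described above (modern Python 3 here; the worker
    function, chunking scheme, and numbers are illustrative stand-ins,
    not the poster's actual network code):

    ```python
    import time
    from multiprocessing import Process, Queue

    def worker(tasks, results):
        """Pull chunks from the task queue, 'evaluate' them, push results back."""
        for chunk in iter(tasks.get, None):          # None is the stop sentinel
            results.put(sum(x * x for x in chunk))   # stand-in for a network evaluation

    def timed_run(data, n_workers, chunks=24):
        """Time one pass of the data divided among n_workers subprocesses."""
        tasks, results = Queue(), Queue()
        procs = [Process(target=worker, args=(tasks, results))
                 for _ in range(n_workers)]
        start = time.time()
        for p in procs:
            p.start()
        step = max(1, len(data) // chunks)
        n_chunks = 0
        for i in range(0, len(data), step):
            tasks.put(data[i:i + step])
            n_chunks += 1
        for _ in procs:
            tasks.put(None)                          # one sentinel per worker
        total = sum(results.get() for _ in range(n_chunks))
        for p in procs:
            p.join()
        return time.time() - start, total

    if __name__ == '__main__':
        data = list(range(10000))
        for k in (1, 2, 3, 4):
            elapsed, total = timed_run(data, k)
            print(k, round(elapsed, 4), total)
    ```

    Running the loop for each worker count, many repetitions per point,
    gives the raw numbers behind graphs like the ones linked above.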

    As I anticipated, there is a performance penalty that is incurred just
    for parceling out the data to the multiple processes and collating the
    results at the end. When the data set is small, it's faster to send
    it to a single CPU, without invoking a subprocess. In fact, dividing
    a small task among 3 processes can underperform a two-process
    approach, and so on! See the leftmost two panels in the top row, and
    the rightmost two panels in the bottom row.

    When the networks increase in complexity, the size of the data set for
    which break-even performance is achieved drops accordingly. I'm more
    concerned about optimizing these bigger problems, obviously, because
    they take the longest to run.

    What I did not anticipate was finding that performance reversal with
    added computing power for large data sets. Comments are appreciated!
     
    John Ladasky, May 22, 2011
    #1

  2. John Ladasky

    John Ladasky Guest

    Following up to my own post...

    Flickr informs me that quite a few of you have been looking at my
    graphs of performance vs. the number of sub-processes employed in a
    parallelizable task:

    On May 21, 8:58 pm, John Ladasky <> wrote:
    > http://www.flickr.com/photos/15579975@N00/5744093219

    [...]
    > I'll quickly ask my question first, to avoid a TL;DR problem: when you
    > have a multi-core CPU with N cores, is it common to see the
    > performance peak at N-1, or even N-2 processes?  And so, should you
    > avoid using quite as many processes as there are cores?  I was
    > expecting diminishing returns for each additional core -- but not
    > outright declines.


    But no one has offered any insight yet? Well, I slept on it, and I
    had a thought. Please feel free to shoot it down.

    If I spawn N worker sub-processes, my application in fact has N+1
    processes in all, because there's also the master process itself. If
    the master process has anything significant to do (and mine does, and
    I would surmise that many multi-core applications would be that way),
    then the master process may sometimes find itself competing for time
    on a CPU core with a worker sub-process. This could impact
    performance even when the demands from the operating system and/or
    other applications are modest.
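    The N+1 count is easy to confirm with multiprocessing's
    active_children() (a modern Python 3 sketch; the worker body is just
    a placeholder for real work):

    ```python
    import time
    from multiprocessing import Process, active_children, cpu_count

    def worker():
        time.sleep(0.5)   # stand-in for real work

    if __name__ == '__main__':
        n_workers = max(1, cpu_count() - 1)   # the heuristic under discussion
        procs = [Process(target=worker) for _ in range(n_workers)]
        for p in procs:
            p.start()
        # The master is a process too: N workers means N + 1 processes in all.
        print('total processes:', len(active_children()) + 1)
        for p in procs:
            p.join()
    ```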

    I'd still appreciate hearing from anyone else who has more experience
    with multiprocessing. If there are general rules about how to do this
    best, I haven't seen them posted anywhere. This may not be a Python-
    specific issue, of course.

    Tag, you're it!
     
    John Ladasky, May 22, 2011
    #2

  3. On Mon, May 23, 2011 at 7:06 AM, John Ladasky <> wrote:
    > If I spawn N worker sub-processes, my application in fact has N+1
    > processes in all, because there's also the master process itself.


    This would definitely be correct. How much impact the master process
    has depends on how much it's doing.

    > I'd still appreciate hearing from anyone else who has more experience
    > with multiprocessing.  If there are general rules about how to do this
    > best, I haven't seen them posted anywhere.  This may not be a Python-
    > specific issue, of course.


    I don't have much experience with Python's multiprocessing model, but
    I've done concurrent programming on a variety of platforms, and there
    are some common issues.

    Each CPU (or core) has its own execution cache. If you can keep one
    thread running on the same core all the time, it will benefit more
    from that cache than if it has to keep flitting from one to another.

    You undoubtedly will have other processes in the system, too. As well
    as your master, there'll be processes over which you have no control
    (unless you're on a bare-bones system). Some of them may preempt your
    processes.

    Leaving one CPU/core available for "everything else" may allow the OS
    to keep each thread on its own core. Having as many workers as cores
    means that every time there's something else to do, one of your
    workers has to be kicked off its nice warm CPU and sent out into the
    cold for a while. If all your workers are at the same priority, it
    will then grab a timeslice off one of the other cores, kicking its
    incumbent off... rinse and repeat.

    This is a tradeoff, though. If the rest of your system is going to use
    0.01 of a core, then 1% thrashing is worth having one more core
    available 99% of the time. If the numbers are reversed, it's equally
    obvious that you should leave one core available. In your case, it's
    probably turning out that the contention causes more overhead than the
    extra worker is worth.
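    That tradeoff is often reduced to a one-line heuristic (a sketch;
    the function name is mine, not from this thread):

    ```python
    from multiprocessing import cpu_count

    def pick_worker_count(reserve_one_core=True):
        """Leave one core free for the master process and the OS,
        but never go below a single worker."""
        n = cpu_count()
        return max(1, n - 1) if reserve_one_core else n
    ```

    Whether reserving the core pays off depends, as above, on how busy
    the master and the rest of the system actually are; measuring beats
    guessing.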

    That's just some general concepts, without an in-depth analysis of
    your code and your entire system. It's probably easier to analyse by
    results rather than inspection.

    Chris Angelico
     
    Chris Angelico, May 23, 2011
    #3
  4. On Mon, 2011-05-23 at 10:32 +1000, Chris Angelico wrote:
    > On Mon, May 23, 2011 at 7:06 AM, John Ladasky <> wrote:
    > > If I spawn N worker sub-processes, my application in fact has N+1
    > > processes in all, because there's also the master process itself.
    > > I'd still appreciate hearing from anyone else who has more experience
    > > with multiprocessing. If there are general rules about how to do this
    > > best, I haven't seen them posted anywhere. This may not be a Python-
    > > specific issue, of course.

    > I don't have much experience with Python's multiprocessing model, but
    > I've done concurrent programming on a variety of platforms, and there
    > are some common issues.


    I develop an app that uses multiprocessing heavily. Remember that all
    these processes are processes - so you can use all the OS facilities
    regarding processes on them. This includes setting nice values,
    scheduler options, CPU pinning, etc...

    > Each CPU (or core) has its own execution cache. If you can keep one
    > thread running on the same core all the time, it will benefit more
    > from that cache than if it has to keep flitting from one to another.


    +1

    > You undoubtedly will have other processes in the system, too. As well
    > as your master, there'll be processes over which you have no control
    > (unless you're on a bare-bones system). Some of them may preempt your
    > processes.


    This is very true. You get a benefit from dividing work up to the
    correct number of processes - but too many processes will quickly take
    back all the benefit. One good trick is to have the parent monitor the
    load average and only spawn additional workers when that value is below
    a certain value.
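    A sketch of that load-average gate (os.getloadavg() is Unix-only;
    the threshold here is a placeholder to tune per machine, not a value
    from this thread):

    ```python
    import os
    from multiprocessing import cpu_count

    # Hypothetical threshold: spawn only while the 1-minute load average
    # stays below (cores - 1), leaving headroom for the master and the OS.
    MAX_LOAD = cpu_count() - 1.0

    def may_spawn_worker():
        """os.getloadavg() returns the 1-, 5- and 15-minute load
        averages; gate new workers on the most recent figure."""
        load_1min, _, _ = os.getloadavg()
        return load_1min < MAX_LOAD
    ```

    The parent simply checks this before each spawn and queues the work
    instead when it returns False.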
     
    Adam Tauno Williams, May 23, 2011
    #4
  5. John Ladasky

    John Ladasky Guest

    On May 23, 2:50 am, Adam Tauno Williams <>
    wrote:

    > I develop an app that uses multiprocessing heavily.  Remember that all
    > these processes are processes - so you can use all the OS facilities
    > regarding processes on them.  This includes setting nice values,
    > scheduler options, CPU pinning, etc...


    That's interesting. Does code exist in the Python library which
    allows the adjustment of CPU pinning and nice levels? I just had
    another look at the multiprocessing docs, and also at the subprocess
    module.
    I didn't see anything that pertains to these issues.

    > > Each CPU (or core) has its own execution cache. If you can keep one
    > > thread running on the same core all the time, it will benefit more
    > > from that cache than if it has to keep flitting from one to another.

    >
    > +1
     
    John Ladasky, May 23, 2011
    #5
  6. On Mon, 2011-05-23 at 12:51 -0700, John Ladasky wrote:
    > On May 23, 2:50 am, Adam Tauno Williams <>
    > wrote:
    > > I develop an app that uses multiprocessing heavily. Remember that all
    > > these processes are processes - so you can use all the OS facilities
    > > regarding processes on them. This includes setting nice values,
    > > scheduler options, CPU pinning, etc...

    > That's interesting. Does code exist in the Python library which
    > allows the adjustment of CPU pinning and nice levels? I just had
    > another look at the multiprocessing docs, and also at the subprocess
    > module.
    > I didn't see anything that pertains to these issues.


    "in the Python library" - no. All these types of behaviors are platform
    specific.

    For example you can set the "nice" (priority) of a UNIX/LINUX process
    using the nice method from the os module. Our workflow engine does this
    on all worker processes it starts - it sends the workers to the lowest
    priority.


    from os import nice as os_priority
    ....
    try:
        # os.nice() adds an increment to the current niceness;
        # +20 pushes the worker to the lowest priority.
        os_priority(20)
    except Exception, e:
        ...

    I'm not aware of a tidy way to call sched_setaffinity from Python; but
    my own testing indicates that the LINUX kernel is very good at figuring
    this out on its own, so long as it isn't swamped. Queuing additional
    workflows (rather than starting them) when the load average exceeds
    X.Y, and setting the process priority of workers to very low, seems
    to work very well.

    There is <http://pypi.python.org/pypi/affinity> for setting affinity,
    but I haven't used it.
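    For later readers: os.sched_setaffinity()/os.sched_getaffinity()
    were added to the standard library in Python 3.3, after this thread,
    and are Linux-only; they make the pinning tidy without a third-party
    package:

    ```python
    import os

    # pid 0 means "the calling process"
    pid = 0
    original = os.sched_getaffinity(pid)      # remember the full core set
    os.sched_setaffinity(pid, {0})            # pin this process to core 0
    print(os.sched_getaffinity(pid))          # the process now runs on core 0 only
    os.sched_setaffinity(pid, original)       # undo the pinning
    ```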
     
    Adam Tauno Williams, May 23, 2011
    #6
