Multiprocessing: don't push the pedal to the metal?

John Ladasky

Hello again, everyone.

I'm developing some custom neural network code. I'm using Python
2.6, Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core
CPU. About six weeks ago, I asked some questions about
multiprocessing in Python, and I got some very helpful responses from
you all.

http://groups.google.com/group/comp.lang.python/browse_frm/thread/374e1890efbcc87b

Now I'm back with a new question. I have gotten comfortable with
cProfile, and with multiprocessing's various Queues (I've graduated
from Pool). I just ran some extensive tests of my newest code, and
I've learned some surprising things. I have a pretty picture here (be
sure to view the full-size image):

http://www.flickr.com/photos/15579975@N00/5744093219

I'll quickly ask my question first, to avoid a TL;DR problem: when you
have a multi-core CPU with N cores, is it common to see the
performance peak at N-1, or even N-2 processes? And so, should you
avoid using quite as many processes as there are cores? I was
expecting diminishing returns for each additional core -- but not
outright declines.

That's what I think my data shows for many of my trial runs. I've
run this test twice. The first time, I was reading a few PDFs and web
pages while the speed test was running. But even when I wasn't using
the computer for these other (light) tasks, I saw the same performance
drops. Perhaps this is due to OS overhead? The load average on my
system monitor looks pretty quiet when I'm not running my program.

OK, if you care to read further, here's some more detail...

My graphs show the execution times of my neural network evaluation
routine as a function of:

- the size of my neural network (six sizes were tried -- with varying
numbers of inputs, outputs and hidden nodes),
- the subprocess configuration (either not using a subprocess, or
using 1-6 subprocesses), and
- the size of the input data vector (from 7 to 896 sets of inputs --
I'll explain the rationale for the exact numbers I chose if anyone
cares to know).

Each graph is normalized to the execution time that running the
evaluation routine takes on a single CPU, without invoking a
subprocess. Obviously, I'm looking for the conditions which yield
performance gains above that baseline. (I'll be running this
particular piece of code millions of times!)

I ran 200 repetitions for each combination of network size, input data
size, and number of CPU cores. Even so, there was substantial
irregularity in the timing graphs. So, rather than connecting the
dots directly, which would have produced messy crossing lines that are
a bit hard to read, I fit B-spline curves to the data.
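A minimal sketch of that kind of timing harness (evaluate and data are
hypothetical stand-ins for the evaluation routine and one batch of
inputs; taking the median is just one way to damp the run-to-run
irregularity):

import time

def median_time(evaluate, data, n_reps=200):
    # Time n_reps calls of the evaluation routine and return the
    # median, which is less sensitive to scheduler noise than the mean.
    times = []
    for _ in range(n_reps):
        start = time.time()
        evaluate(data)
        times.append(time.time() - start)
    times.sort()
    return times[len(times) // 2]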

As I anticipated, there is a performance penalty that is incurred just
for parceling out the data to the multiple processes and collating the
results at the end. When the data set is small, it's faster to send
it to a single CPU, without invoking a subprocess. In fact, dividing
a small task among 3 processes can underperform a two-process
approach, and so on! See the leftmost two panels in the top row, and
the rightmost two panels in the bottom row.

When the networks increase in complexity, the size of the data set for
which break-even performance is achieved drops accordingly. I'm more
concerned about optimizing these bigger problems, obviously, because
they take the longest to run.

What I did not anticipate was finding that performance reversal with
added computing power for large data sets. Comments are appreciated!
 
John Ladasky

Following up on my own post...

Flickr informs me that quite a few of you have been looking at my
graphs of performance vs. the number of sub-processes employed in a
parallelizable task:

> http://www.flickr.com/photos/15579975@N00/5744093219
> [...]
> I'll quickly ask my question first, to avoid a TL;DR problem: when you
> have a multi-core CPU with N cores, is it common to see the
> performance peak at N-1, or even N-2 processes?  And so, should you
> avoid using quite as many processes as there are cores?  I was
> expecting diminishing returns for each additional core -- but not
> outright declines.

But no one has offered any insight yet? Well, I slept on it, and I
had a thought. Please feel free to shoot it down.

If I spawn N worker sub-processes, my application in fact has N+1
processes in all, because there's also the master process itself. If
the master process has anything significant to do (and mine does, and
I would surmise that many multi-core applications would be that way),
then the master process may sometimes find itself competing for time
on a CPU core with a worker sub-process. This could impact
performance even when the demands from the operating system and/or
other applications are modest.

I'd still appreciate hearing from anyone else who has more experience
with multiprocessing. If there are general rules about how to do this
best, I haven't seen them posted anywhere. This may not be a Python-
specific issue, of course.

Tag, you're it!
 
Chris Angelico

> If I spawn N worker sub-processes, my application in fact has N+1
> processes in all, because there's also the master process itself.

This would definitely be correct. How much impact the master process
has depends on how much it's doing.

> I'd still appreciate hearing from anyone else who has more experience
> with multiprocessing.  If there are general rules about how to do this
> best, I haven't seen them posted anywhere.  This may not be a Python-
> specific issue, of course.

I don't have much experience with Python's multiprocessing model, but
I've done concurrent programming on a variety of platforms, and there
are some common issues.

Each CPU (or core) has its own execution cache. If you can keep one
thread running on the same core all the time, it will benefit more
from that cache than if it has to keep flitting from one to another.

You undoubtedly will have other processes in the system, too. As well
as your master, there'll be processes over which you have no control
(unless you're on a bare-bones system). Some of them may preempt your
processes.

Leaving one CPU/core available for "everything else" may allow the OS
to keep each thread on its own core. Having as many workers as cores
means that every time there's something else to do, one of your
workers has to be kicked off its nice warm CPU and sent out into the
cold for a while. If all your workers are at the same priority, it
will then grab a timeslice off one of the other cores, kicking its
incumbent off... rinse and repeat.

This is a tradeoff, though. If the rest of your system is going to use
0.01 of a core, then 1% thrashing is worth having one more core
available 99% of the time. If the numbers are reversed, it's equally
obvious that you should leave one core available. In your case, it's
probably turning out that the contention causes more overhead than the
extra worker is worth.
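A minimal sketch of that leave-one-core-free pattern, using the
Queue-based layout mentioned earlier in the thread (do_work is a
placeholder for the real computation):

import multiprocessing as mp

def do_work(task):
    return task * task                    # placeholder computation

def worker(task_q, result_q):
    # Pull tasks until the poison pill (None) arrives.
    for task in iter(task_q.get, None):
        result_q.put(do_work(task))

if __name__ == '__main__':
    n_workers = max(1, mp.cpu_count() - 1)  # leave one core for the master
    task_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    tasks = range(100)
    for t in tasks:
        task_q.put(t)
    for _ in workers:
        task_q.put(None)                  # one poison pill per worker
    results = [result_q.get() for _ in tasks]
    for w in workers:
        w.join()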

That's just some general concepts, without an in-depth analysis of
your code and your entire system. It's probably easier to analyse by
results rather than inspection.

Chris Angelico
 
Adam Tauno Williams

> I don't have much experience with Python's multiprocessing model, but
> I've done concurrent programming on a variety of platforms, and there
> are some common issues.

I develop an app that uses multiprocessing heavily. Remember that all
these processes are processes - so you can use all the OS facilities
regarding processes on them. This includes setting nice values,
scheduler options, CPU pinning, etc...

> Each CPU (or core) has its own execution cache. If you can keep one
> thread running on the same core all the time, it will benefit more
> from that cache than if it has to keep flitting from one to another.

+1

> You undoubtedly will have other processes in the system, too. As well
> as your master, there'll be processes over which you have no control
> (unless you're on a bare-bones system). Some of them may preempt your
> processes.

This is very true. You get a benefit from dividing work up to the
correct number of processes - but too many processes will quickly take
back all the benefit. One good trick is to have the parent monitor the
load average and only spawn additional workers when it is below a
certain threshold, as in the sketch below.
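A sketch of that trick (spawn_if_quiet and LOAD_CEILING are made-up
names, and os.getloadavg is POSIX-only):

import os
import multiprocessing as mp

LOAD_CEILING = 4.0     # assumed threshold - tune for your machine

def spawn_if_quiet(target, args=()):
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    # Start another worker only while the 1-minute figure is below the
    # ceiling; otherwise the caller should keep the work queued and
    # retry later.
    if os.getloadavg()[0] < LOAD_CEILING:
        p = mp.Process(target=target, args=args)
        p.start()
        return p
    return None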
 
John Ladasky

> I develop an app that uses multiprocessing heavily.  Remember that all
> these processes are processes - so you can use all the OS facilities
> regarding processes on them.  This includes setting nice values,
> scheduler options, CPU pinning, etc...

That's interesting. Does code exist in the Python library which
allows the adjustment of CPU pinning and nice levels? I just had
another look at the multiprocessing docs, and also at the subprocess
module. I didn't see anything that pertains to these issues.
 
Adam Tauno Williams

> That's interesting. Does code exist in the Python library which
> allows the adjustment of CPU pinning and nice levels? I just had
> another look at the multiprocessing docs, and also at the subprocess
> module. I didn't see anything that pertains to these issues.

"In the Python library" - no. All these types of behaviors are
platform-specific.

For example, you can set the "nice" value (priority) of a UNIX/Linux
process using the nice function from the os module. Our workflow engine
does this for all the worker processes it starts - it sends the workers
to the lowest priority.


from os import nice as os_priority

# ... later, in each worker process ...
try:
    # os.nice() takes an *increment*: +20 pushes the worker to the
    # lowest scheduling priority (niceness is capped at 19).
    os_priority(20)
except OSError:
    pass    # lowering priority is best-effort; ignore failures

I'm not aware of a tidy way to call sched_setaffinity from Python; but
my own testing indicates that the Linux kernel is very good at figuring
this out on its own so long as it isn't swamped. Queuing, rather than
starting, additional workflows when the load average exceeds X.Y, and
setting the process priority of workers very low, seems to work very
well.

There is <http://pypi.python.org/pypi/affinity> for setting affinity,
but I haven't used it.
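For what it's worth, its usage appears to be along these lines;
set_process_affinity_mask is taken from the package's description, so
treat the signature as an assumption:

import os
from affinity import set_process_affinity_mask  # third-party, untested here

# The mask is a bit field: bit n set means the process may run on
# CPU n, so 0x1 pins to CPU 0 and 0x3 allows CPUs 0 and 1.
set_process_affinity_mask(os.getpid(), 0x1)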
 
