Parallel Python

parallelpython

Has anybody tried to run parallel Python applications?
It appears that if your application is computation-bound, using the 'thread'
or 'threading' modules will not get you any speedup. That is because the
Python interpreter uses the GIL (Global Interpreter Lock) for internal
bookkeeping. The latter allows only one Python byte-code instruction to
be executed at a time, even if you have a multiprocessor computer.
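The claim is easy to demonstrate (a minimal sketch; exact timings will vary by machine): a pure-Python CPU-bound loop takes about as long with two threads as it does run twice serially, because the GIL serializes byte-code execution.

```python
import threading
import time

def busy(n):
    # pure-Python, CPU-bound work: no I/O, so the GIL is held throughout
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 500_000

t0 = time.perf_counter()
busy(N)
busy(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# On CPython the two timings are typically close: the threads take turns
# holding the GIL instead of running on two processors at once.
print(f"serial: {serial:.3f}s, two threads: {threaded:.3f}s")
```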
To overcome this limitation, I've created the ppsmp module:
http://www.parallelpython.com
It provides an easy way to run parallel Python applications on SMP
computers.
I would appreciate any comments/suggestions regarding it.
Thank you!
 
Laszlo Nagy

Has anybody tried to run parallel Python applications?
It appears that if your application is computation-bound, using the 'thread'
or 'threading' modules will not get you any speedup. That is because the
Python interpreter uses the GIL (Global Interpreter Lock) for internal
bookkeeping. The latter allows only one Python byte-code instruction to
be executed at a time, even if you have a multiprocessor computer.
To overcome this limitation, I've created the ppsmp module:
http://www.parallelpython.com
It provides an easy way to run parallel Python applications on SMP
computers.
I would appreciate any comments/suggestions regarding it.
I always thought that if you use multiple processes (e.g. os.fork) then
Python can take advantage of multiple processors. I think the GIL locks
one processor only. The problem is that one interpreter can run on
one processor only. Am I right? Does your ppsmp module run the same
interpreter on multiple processors? That would be very interesting, and
something new.


Or does it start multiple interpreters? Another way to do this is to
start multiple processes and let them communicate through IPC or a local
network.
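The multiple-processes-plus-IPC route can be sketched with the standard library (a minimal Unix-only illustration; `work` is a stand-in for a real computation): fork a child process and pass the pickled result back through a pipe.

```python
import os
import pickle

def work(x):
    # stand-in for a CPU-bound computation
    return x * x

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # child process: a full copy of the interpreter, with its own GIL
    os.close(r)
    with os.fdopen(w, "wb") as out:
        pickle.dump(work(7), out)
    os._exit(0)          # leave without running parent-side cleanup
else:
    # parent process: read the child's pickled result back
    os.close(w)
    with os.fdopen(r, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    print(result)        # prints 49
```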


Laszlo
 
Duncan Booth

Laszlo Nagy said:
I always thought that if you use multiple processes (e.g. os.fork) then
Python can take advantage of multiple processors. I think the GIL locks
one processor only. The problem is that one interpreter can run on
one processor only. Am I right? Does your ppsmp module run the same
interpreter on multiple processors? That would be very interesting, and
something new.
The GIL locks all processors, but just for one process. So, yes, if you
spawn off multiple processes then Python will take advantage of this. For
example we run Zope on a couple of dual processor dual core systems, so we
use squid and pound to ensure that the requests are spread across 4
instances of Zope on each machine. That way we do get fairly even CPU
usage.

For some applications it is much harder to split the tasks across separate
processes rather than just separate threads, but there is a benefit once
you've done it since you can then distribute the processing across cpus on
separate machines.

The 'parallel python' site seems very sparse on the details of how it is
implemented but it looks like all it is doing is spawning some subprocesses
and using some simple ipc to pass details of the calls and results. I can't
tell from reading it what it is supposed to add over any of the other
systems which do the same.

Combined with the closed source 'no redistribution' license I can't really
see anyone using it.
 
robert

Duncan said:
The 'parallel python' site seems very sparse on the details of how it is
implemented but it looks like all it is doing is spawning some subprocesses
and using some simple ipc to pass details of the calls and results. I can't
tell from reading it what it is supposed to add over any of the other
systems which do the same.

Combined with the closed source 'no redistribution' license I can't really
see anyone using it.


That's true. IPC through sockets or (somewhat faster) shared memory - with cPickle at least - is usually the best such approaches can do.
See http://groups.google.de/group/comp.lang.python/browse_frm/thread/f822ec289f30b26a
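The socket-plus-pickle pattern mentioned here looks roughly like this (a minimal sketch; on Python 2.x the faster `cPickle` would be used in place of `pickle`):

```python
import pickle
import socket

# a connected pair of sockets standing in for two cooperating processes
parent_end, child_end = socket.socketpair()

# "send" a task description by pickling it over the socket
payload = {"task": "square", "arg": 12}
parent_end.sendall(pickle.dumps(payload))

# the other end unpickles it back into a live Python object
received = pickle.loads(child_end.recv(4096))
print(received)  # prints {'task': 'square', 'arg': 12}

parent_end.close()
child_end.close()
```

The serialization step is the cost being discussed: every object crossing the boundary is copied, not shared.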

For tasks really requiring threading one can consider IronPython.
The most advanced technique I've seen for CPython is POSH: http://poshmodule.sourceforge.net/

I'd say Py3K should just do the locking job for dicts / collections, obmalloc and refcounting (or drop the refcount mechanism) and handle the other minor things in order to enable free threading. Or it should at least enable careful sharing of Python objects between multiple separate interpreter instances in one process.
.NET and Java have shown that the speed costs of this technique are not so extreme - I'd guess less than 10%.
And Python is a VHLL with less focus on speed anyway.
Also see discussions in http://groups.google.de/group/comp.lang.python/browse_frm/thread/f822ec289f30b26a .


Robert
 
parallelpython

I always thought that if you use multiple processes (e.g. os.fork) then
Python can take advantage of multiple processors. I think the GIL locks
one processor only. The problem is that one interpreter can run on
one processor only. Am I right? Does your ppsmp module run the same
interpreter on multiple processors? That would be very interesting, and
something new.


Or does it start multiple interpreters? Another way to do this is to
start multiple processes and let them communicate through IPC or a local
network.

That's right. ppsmp starts multiple interpreters in separate
processes and organizes communication between them through IPC.

Originally ppsmp was designed to speed up an existing application
which is written in pure Python but is quite computationally expensive
(other ways of optimizing it were applied too). It was also required
that the application run out of the box on the most common Linux
distributions (they all contain CPython).
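The general technique can be sketched with the standard library (a hypothetical illustration of the approach, not ppsmp's actual API): each task gets a freshly started CPython interpreter, and results come back over a pipe.

```python
import subprocess
import sys

def run_in_new_interpreter(expr):
    # each call starts a separate CPython process with its own GIL
    out = subprocess.run(
        [sys.executable, "-c", f"print(eval({expr!r}))"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

# two independent interpreters; on an SMP box the OS is free to
# schedule them on different processors
results = [run_in_new_interpreter(f"sum(range({n}))") for n in (10, 100)]
print(results)  # prints ['45', '4950']
```

A real system would keep a pool of long-lived worker interpreters rather than paying interpreter start-up per task.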
 
sturlamolden

robert said:
That's true. IPC through sockets or (somewhat faster) shared memory - with cPickle at least - is usually the best such approaches can do.
See http://groups.google.de/group/comp.lang.python/browse_frm/thread/f822ec289f30b26a

For tasks really requiring threading one can consider IronPython.
The most advanced technique I've seen for CPython is POSH: http://poshmodule.sourceforge.net/


In SciPy there is an MPI-binding project, mpi4py.

MPI is becoming the de facto standard for high-performance parallel
computing, both on shared memory systems (SMPs) and clusters. Spawning
threads or processes is not the recommended way to do numerical parallel
computing. Threading makes programming certain tasks more convenient
(particularly GUI and I/O, for which the GIL does not matter anyway),
but it is not a good paradigm for dividing CPU-bound computations between
multiple processors. MPI is a high level API based on the concept of
"message passing", which allows the programmer to focus on solving the
problem instead of on irrelevant distractions such as thread management
and synchronization.

Although MPI has standard APIs for C and Fortran, it may be used with
any programming language. For Python, an additional advantage of using
MPI is that the GIL has no practical consequence for performance. The
GIL can lock a process but not prevent MPI from using multiple
processors as MPI is always using multiple processes. For IPC, MPI will
e.g. use shared-memory segments on SMPs and TCP/IP on clusters, but all
these details are hidden.

It seems like 'ppsmp' of parallelpython.com is just a reinvention of a
small portion of MPI.


http://mpi4py.scipy.org/
http://en.wikipedia.org/wiki/Message_Passing_Interface
 
Nick Maclaren

|>
|> MPI is becoming the de facto standard for high-performance parallel
|> computing, both on shared memory systems (SMPs) and clusters.

It has been for some time, and is still gaining ground.

|> Spawning
|> threads or processes is not the recommended way to do numerical parallel
|> computing.

Er, MPI works by getting SOMETHING to spawn processes, which then
communicate with each other.

|> Threading makes programming certain tasks more convenient
|> (particularly GUI and I/O, for which the GIL does not matter anyway),
|> but it is not a good paradigm for dividing CPU-bound computations between
|> multiple processors. MPI is a high level API based on the concept of
|> "message passing", which allows the programmer to focus on solving the
|> problem instead of on irrelevant distractions such as thread management
|> and synchronization.

Grrk. That's not quite it.

The problem is that the current threading models (POSIX threads and
Microsoft's equivalent) were intended for running large numbers of
semi-independent, mostly idle, threads: Web servers and similar.
Everything about them, including their design (such as it is), their
interfaces and their implementations, is unsuitable for parallel HPC
applications. One can argue whether that is insoluble, but let's not,
at least not here.

Now, Unix and Microsoft processes are little better but, because they
are more separate (and, especially, because they don't share memory)
are MUCH easier to run effectively on shared memory multi-CPU systems.
You still have to play administrator tricks, but they aren't as foul
as the ones that you have to play for threaded programs. Yes, I know
that it is a bit Irish for the best way to use a shared memory system
to be to not share memory, but that's how it is.


Regards,
Nick Maclaren.
 
sturlamolden

Nick said:
as the ones that you have to play for threaded programs. Yes, I know
that it is a bit Irish for the best way to use a shared memory system
to be to not share memory, but that's how it is.

Thank you for clearing that up.

In any case, this means that Python can happily keep its GIL, as the
CPU bound 'HPC' tasks for which the GIL does matter should be done
using multiple processes (not threads) anyway. That leaves threads as a
tool for programming certain i/o tasks and maintaining 'responsive'
user interfaces, for which the GIL incidentally does not matter.

I wonder if too much emphasis is put on thread programming these days.
Threads may be nice for programming web servers and the like, but not
for numerical computing. Reading books about thread programming, one
can easily get the impression that it is 'the' way to parallelize
numerical tasks on computers with multiple CPUs (or multiple CPU
cores). But if threads are inherently designed and implemented to stay
idle most of the time, that is obviously not the case.

I like MPI. Although it is a huge API with lots of esoteric functions,
I only need to know a handful to cover my needs. Not to mention the
fact that I can use MPI with Fortran, which is frowned upon by computer
scientists but loved by scientists and engineers specialized in any
other field.
 
Paul Rubin

Yes, I know that it is a bit Irish for the best way to use a shared
memory system to be to not share memory, but that's how it is.

But I thought serious MPI implementations use shared memory if they
can. That's the beauty of it, you can run your application on SMP
processors getting the benefit of shared memory, or split it across
multiple machines using ethernet or infiniband or whatever, without
having to change the app code.
 
Nick Maclaren

|>
|> In any case, this means that Python can happily keep its GIL, as the
|> CPU bound 'HPC' tasks for which the GIL does matter should be done
|> using multiple processes (not threads) anyway. That leaves threads as a
|> tool for programming certain i/o tasks and maintaining 'responsive'
|> user interfaces, for which the GIL incidentally does not matter.

Yes. That is the approach being taken at present by almost everyone.

|> I wonder if too much emphasis is put on thread programming these days.
|> Threads may be nice for programming web servers and the like, but not
|> for numerical computing. Reading books about thread programming, one
|> can easily get the impression that it is 'the' way to parallelize
|> numerical tasks on computers with multiple CPUs (or multiple CPU
|> cores). But if threads are inherently designed and implemented to stay
|> idle most of the time, that is obviously not the case.

You have to distinguish "lightweight processes" from "POSIX threads"
from the generic concept. It is POSIX and Microsoft threads that are
inherently like that, and another kind of thread model might be very
different. Don't expect to see one provided any time soon, even by
Linux.

OpenMP is the current leader for SMP parallelism, and it would be
murder to produce a Python binding that had any hope of delivering
useful performance. I think that it could be done, but implementing
the result would be a massive task. The Spruce Goose and Project
Habbakuk (sic) spring to my mind, by comparison[*] :)

|> I like MPI. Although it is a huge API with lots of esoteric functions,
|> I only need to know a handful to cover my needs. Not to mention the
|> fact that I can use MPI with Fortran, which is frowned upon by computer
|> scientists but loved by scientists and engineers specialized in any
|> other field.

Yup. MPI is also debuggable and tunable (with difficulty). Debugging
and tuning OpenMP and POSIX threads are beyond anyone except the most
extreme experts; I am only on the borderline of being able to.

The ASCI bunch favour Co-array Fortran, and its model matches Python
like a steam turbine is a match for a heart transplant.


[*] They are worth looking up, if you don't know about them.


Regards,
Nick Maclaren.
 
Nick Maclaren

|>
|> > Yes, I know that it is a bit Irish for the best way to use a shared
|> > memory system to be to not share memory, but that's how it is.
|>
|> But I thought serious MPI implementations use shared memory if they
|> can. That's the beauty of it, you can run your application on SMP
|> processors getting the benefit of shared memory, or split it across
|> multiple machines using ethernet or infiniband or whatever, without
|> having to change the app code.

They use it for the communication, but don't expose it to the
programmer. It is therefore easy to put the processes on different
CPUs, and get the memory consistency right.


Regards,
Nick Maclaren.
 
Sergei Organov

|> I wonder if too much emphasis is put on thread programming these days.
|> Threads may be nice for programming web servers and the like, but not
|> for numerical computing. Reading books about thread programming, one
|> can easily get the impression that it is 'the' way to parallelize
|> numerical tasks on computers with multiple CPUs (or multiple CPU
|> cores). But if threads are inherently designed and implemented to stay
|> idle most of the time, that is obviously not the case.

You have to distinguish "lightweight processes" from "POSIX threads"
from the generic concept. It is POSIX and Microsoft threads that are
inherently like that,

Do you mean that POSIX threads are inherently designed and implemented
to stay idle most of the time?! If so, I'm afraid those guys that
designed POSIX threads won't agree with you. In particular, as far as I
remember, David R. Butenhof said a few times in comp.programming.threads
that POSIX threads were primarily designed to meet parallel programming
needs on SMP, or at least that was how I understood him.

-- Sergei.
 
Nick Maclaren

|>
|> Do you mean that POSIX threads are inherently designed and implemented
|> to stay idle most of the time?! If so, I'm afraid those guys that
|> designed POSIX threads won't agree with you. In particular, as far as I
|> remember, David R. Butenhof said a few times in comp.programming.threads
|> that POSIX threads were primarily designed to meet parallel programming
|> needs on SMP, or at least that was how I understood him.

I do mean that, and I know that they don't agree. However, the word
"designed" doesn't really make a lot of sense for POSIX threads - the
one I tend to use is "perpetrated".

The people who put the specification together were either unaware of
most of the experience of the previous 30 years, or chose to ignore it.
In particular, in this context, the importance of being able to control
the scheduling was well-known, as was the fact that it is NOT possible
to mix processes with different scheduling models on the same set of
CPUs. POSIX's facilities are completely hopeless for that purpose, and
most of the systems I have used effectively ignore them.

I could go on at great length, and the performance aspects are not even
the worst aspect of POSIX threads. The fact that there is no usable
memory model, and the synchronisation depends on C to handle the
low-level consistency, but there are no CONCEPTS in common between
POSIX and C's memory consistency 'specifications' is perhaps the worst.
That is why many POSIX threads programs work until the genuinely
shared memory accesses become frequent enough that you get some to the
same location in a single machine cycle.


Regards,
Nick Maclaren.
 
Carl J. Van Arsdall

Just as something to note, many HPC applications will use a
combination of both MPI and threading (OpenMP usually; as for the
underlying thread implementation I don't have much to say). It's
interesting to see on this message board this huge "anti-threading"
mindset, but the HPC community seems to be happy using a little of both,
depending on their application and the topology of their parallel
machine. Although if I were doing HPC applications, I probably would not
choose Python but would write things in C or FORTRAN.

What I liked about Python threads was that they were easy, whereas using
processes and IPC is a real pain in the butt sometimes. I don't
necessarily think this module is the end-all solution to all of our
problems, but I do think it's a good thing and I will toy with it
some in my spare time. I think that any effort to make Python
threading better is a good thing, and I'm happy to see the community
attempt to make improvements. It would also be cool if this were
open sourced, and I'm not quite sure why it's not.

-carl


--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 
Nick Maclaren

|>
|> Just as something to note, many HPC applications will use a
|> combination of both MPI and threading (OpenMP usually; as for the
|> underlying thread implementation I don't have much to say). It's
|> interesting to see on this message board this huge "anti-threading"
|> mindset, but the HPC community seems to be happy using a little of both,
|> depending on their application and the topology of their parallel
|> machine. Although if I were doing HPC applications, I probably would not
|> choose Python but would write things in C or FORTRAN.

That is a commonly quoted myth.

Some of the ASCI community did that, but even they have backed off
to a great extent. Such code is damn near impossible to debug, let
alone tune. To the best of my knowledge, no non-ASCI application
has ever done that, except for virtuosity. I have several times
asked claimants to name some examples of code that does that and is
used in the general research community, and have so far never had a
response.

I managed the second-largest HPC system in UK academia for a decade,
ending less than a year ago, incidentally, and was and am fairly well
in touch with what is going on in HPC world-wide.


Regards,
Nick Maclaren.
 
robert

sturlamolden said:
Nick Maclaren wrote:

I wonder if too much emphasis is put on thread programming these days.
Threads may be nice for programming web servers and the like, but not
for numerical computing. Reading books about thread programming, one
can easily get the impression that it is 'the' way to parallelize
numerical tasks on computers with multiple CPUs (or multiple CPU


Most threads on this planet are not used for number-crunching jobs, but for the "organization of execution".

Also, if one wants to exploit the speed of upcoming multi-core CPUs for all kinds of fine-grained programs, things need fast fine-grained communication - and, most important, huge data trees in memory have to be shared effectively.
CPU frequencies will not grow much more in the future, but we will see multi-cores/SMP. How do we exploit them as if we really had faster CPUs? Threads and thread-like techniques.

Things like MPI and IPC are just for the area of "small message, big job" - typically scientific number crunching, where you collect the results "at the end of the day". It's more of a slow-network technique.

A most challenging example of this is probably games - not to discuss gaming here, but as a technical example to the point: would you do MPI, RPC etc. while 30fps 3D rendering and real-time physics simulation are going on?


Robert
 
robert

Nick said:
|>
|> > Yes, I know that it is a bit Irish for the best way to use a shared
|> > memory system to be to not share memory, but that's how it is.
|>
|> But I thought serious MPI implementations use shared memory if they
|> can. That's the beauty of it, you can run your application on SMP
|> processors getting the benefit of shared memory, or split it across
|> multiple machines using ethernet or infiniband or whatever, without
|> having to change the app code.

They use it for the communication, but don't expose it to the
programmer. It is therefore easy to put the processes on different
CPUs, and get the memory consistency right.

Thus the communicated data is "serialized" - not directly usable as it is with threads or with custom shared-memory techniques like POSH object sharing.


Robert
 
Sergei Organov

|>
|> Do you mean that POSIX threads are inherently designed and implemented
|> to stay idle most of the time?! If so, I'm afraid those guys that
|> designed POSIX threads won't agree with you. In particular, as far as I
|> remember, David R. Butenhof said a few times in comp.programming.threads
|> that POSIX threads were primarily designed to meet parallel programming
|> needs on SMP, or at least that was how I understood him.

I do mean that, and I know that they don't agree. However, the word
"designed" doesn't really make a lot of sense for POSIX threads - the
one I tend to use is "perpetrated".

OK, then I don't think POSIX threads were "perpetrated" to be idle
most of the time.
The people who put the specification together were either unaware of
most of the experience of the previous 30 years, or chose to ignore it.
In particular, in this context, the importance of being able to control
the scheduling was well-known, as was the fact that it is NOT possible
to mix processes with different scheduling models on the same set of
CPUs. POSIX's facilities are completely hopeless for that purpose, and
most of the systems I have used effectively ignore them.

I won't argue with that. On the other hand, POSIX threads' capabilities in the
field of I/O-bound and real-time threads are also limited, and that's
where the "threads that are idle most of the time" idiom comes from, I
think. What I argue is that POSIX threads were no more "perpetrated" to
support I/O-bound or real-time apps than to support parallel-calculation
apps. Besides, pthreads' real-time extensions came later
than pthreads themselves.

What I do see is that Microsoft designed their system so that it's
almost impossible to implement an interactive application without using
threads, and that fact leads to the current situation where threads are
considered to be beasts that are sleeping most of the time.
I could go on at great length, and the performance aspects are not even
the worst aspect of POSIX threads. The fact that there is no usable
memory model, and the synchronisation depends on C to handle the
low-level consistency, but there are no CONCEPTS in common between
POSIX and C's memory consistency 'specifications' is perhaps the worst.

I won't argue with that either. However, I don't see how that makes POSIX
threads "perpetrated" to be idle most of the time.
That is why many POSIX threads programs work until the genuinely
shared memory accesses become frequent enough that you get some to the
same location in a single machine cycle.

Sorry, I don't understand. Are you saying that it's inherently
impossible to write an application that uses POSIX threads and that
doesn't have bugs accessing shared state? I thought that pthreads
mutexes guarantee sequential access to shared data. Or do you mean
something entirely different? Lock-free algorithms maybe?
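In Python terms, the mutex discipline being asked about looks like this (a minimal sketch using `threading.Lock`, the stdlib analogue of a pthreads mutex): with every access to the shared counter taken under the lock, the final value is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:       # the mutex serializes access to the shared state
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # prints 40000
```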

-- Sergei.
 
Nick Maclaren

|>
|> OK, then I don't think the POSIX threads were "perpetrated" to be idle
|> most of time.

Perhaps I was being unclear. I should have added "In the case where
there are more threads per system than CPUs per system". The reasons
are extremely obscure and are to do with the scheduling, memory access
and communication.

I am in full agreement that the above effect was not INTENDED.

|> > That is why many POSIX threads programs work until the genuinely
|> > shared memory accesses become frequent enough that you get some to the
|> > same location in a single machine cycle.
|>
|> Sorry, I don't understand. Are you saying that it's inherently
|> impossible to write an application that uses POSIX threads and that
|> doesn't have bugs accessing shared state? I thought that pthreads
|> mutexes guarantee sequential access to shared data. Or do you mean
|> something entirely different? Lock-free algorithms maybe?

I mean precisely the first.

The C99 standard uses a bizarre consistency model, which requires serial
execution, and its consistency is defined in terms of only volatile
objects and external I/O. Any form of memory access, signalling or
whatever is outside that, and is undefined behaviour.

POSIX uses a different but equally bizarre one, based on some function
calls being "thread-safe" and others forcing "consistency" (which is
not actually defined, and there are many possible, incompatible,
interpretations). It leaves all language aspects (including allowed
code movement) to C.

There are no concepts in common between C's and POSIX's consistency
specifications (even when they are precise enough to use), and so no
way of mapping the two standards together.


Regards,
Nick Maclaren.
 
