Parallelization on multi-CPU hardware?


P.M.

Hi,

I'm working on a C++/embedded-Python application that will run on
multi-CPU hardware and I need to efficiently distribute the execution
of multiple scripts across the CPUs. I'm embedding Python using the
Boost Python library.

My question is, how can I best parallelize the running of separate
autonomous Python scripts within this app? Can I run multiple
interpreters in separate threads within a single process? In past
newsgroup messages I've seen advice that the only way to get
scalability, due to the GIL, is to use an IPC mechanism between
multiple distinct processes running separate interpreters. Is this
still true or are there better ways to accomplish this?

Thanks
Peter
 

Andrew Dalke

P.M. said:
My question is, how can I best parallelize the running of separate
autonomous Python scripts within this app? Can I run multiple
interpreters in separate threads within a single process?

No. Python only supports one interpreter per process. There
may be several threads running, but only in the context of a
single interpreter. Why? Various data structures assume
global state. One example is the support for imported modules.
> In past
> newsgroup messages I've seen advice that the only way to get
> scalability, due to the GIL, is to use an IPC mechanism between
> multiple distinct processes running separate interpreters. Is this
> still true or are there better ways to accomplish this?

Hasn't changed. There are many ways to do it, ranging from
ZODB to Twisted to POSH to CORBA to shared memory to Pyro
to a few dozen more. I've been waiting for a chance to
give Pyro a try.

If most of your time is spent in Boost calculations then
you can release the GIL when you leave Python. So long
as extensions do not call back into Python there won't
be a scaling problem due to the GIL.
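
To make the multiple-process approach concrete, here is a minimal sketch using the standard subprocess module (which postdates this thread): each script runs in its own interpreter with its own GIL, so the OS can schedule the processes on different CPUs. The inline scripts are hypothetical stand-ins for the autonomous scripts described above.

```python
# Hedged sketch (modern stdlib): one interpreter process per script,
# so each process has its own GIL; stdout serves as a simple IPC channel.
import subprocess
import sys

# Hypothetical stand-ins for the autonomous scripts being distributed.
scripts = [
    "print(1 + 1)",
    "print(2 * 21)",
]

# Launch one interpreter per script; the OS scheduler can place each
# process on a different CPU.
procs = [
    subprocess.Popen([sys.executable, "-c", source],
                     stdout=subprocess.PIPE)
    for source in scripts
]

results = [int(p.communicate()[0]) for p in procs]
print(results)  # -> [2, 42]
```

In a real application the stdout pipe would be replaced by whichever IPC mechanism fits (sockets, shared memory, Pyro, etc.).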

Andrew
(e-mail address removed)
 

Sam G.

Given that all threads run on the same CPU (if I haven't misunderstood),
I'm asking whether Python will suffer on future multicore CPUs. Won't
Python use only one core, and therefore only a half or a quarter of the
CPU? It could be a serious problem for the future of Python...

I'm sorry if I'm asking a stupid almost-newbie question, but it's not
the first time I've asked it.

Sam.
 

Alan Kennedy

[Sam G.]
> Given that all threads run on the same CPU (if I haven't misunderstood),
> I'm asking whether Python will suffer on future multicore CPUs. Won't
> Python use only one core, and therefore only a half or a quarter of the
> CPU? It could be a serious problem for the future of Python...

I agree that it could potentially be a serious hindrance for cpython if
"multiple core" CPUs become commonplace. This is in contrast to jython
and ironpython, both of which support multiple-cpu parallelism.

Although I completely accept the usual arguments offered in defense of
the GIL, i.e. that it isn't a problem in the great majority of use
cases, I think that position will become more difficult to defend as
desktop CPUs sprout more and more execution pipelines.

I think that this also fits in with AM Kuchling's recent
musing/thesis/prediction that the existing cpython VM may no longer be
in use in 5 years, and that it may be superseded by python
"interpreters" running on top of other VMs, namely the JVM, the CLR,
Smalltalk VM, Parrot, etc, etc, etc.

http://www.amk.ca/diary/archives/cat_python.html#003382

I too agree with Andrew's basic position: the Python language needs a
period of library consolidation. There is so much duplication of
functionality out there, with the situation only getting worse as people
re-invent the wheel yet again using newer features such as generators,
gen-exps and decorators.

Just my €0,02.
 

Grant Edwards

> Given that all threads run on the same CPU

I don't think anybody said that, and based on my understanding
of Python's threading, I don't see why it would be true.

> (if I haven't misunderstood), I'm asking whether Python will suffer
> on future multicore CPUs. Won't Python use only one core, and
> therefore only a half or a quarter of the CPU? It could be a serious
> problem for the future of Python...

AFAIK, the threads aren't all locked to a single CPU. However,
there is a global interpreter lock (a mutex), which somewhat
prevents simultaneous execution of threads when they _are_ on
different CPUs.
 

Aahz

> I agree that it could potentially be a serious hindrance for cpython if
> "multiple core" CPUs become commonplace. This is in contrast to jython
> and ironpython, both of which support multiple-cpu parallelism.
>
> Although I completely accept the usual arguments offered in defense of
> the GIL, i.e. that it isn't a problem in the great majority of use
> cases, I think that position will become more difficult to defend as
> desktop CPUs sprout more and more execution pipelines.

Perhaps. Then again, those pipelines will probably have their work cut
out running firewalls, spam filters, voice recognition, and so on. I
doubt the GIL will make much difference, still.
 

Aahz

> So, IMHO, there are basically the following design decisions:
> GIL: large granularity
> MSL: (many small locks) would slow down the overall execution of Python
> programs.
> MSLu: (many small locks, unsafe) unacceptable because it would change
> the Python experience ;)

That's a good summation. Not quite QOTW, but definitely something
everyone should remember.
 

Daniel Dittmar

Andreas said:
So, IMHO, there are basically the following design decisions:
GIL: large granularity
MSL: (many small locks) would slow down the overall execution of Python
programs.
MSLu: (many small locks, unsafe) unacceptable because it would change
the Python experience ;)

You forgot GarbageCollection instead of reference counting.

Daniel
 

Andreas Kostyrka

> Perhaps. Then again, those pipelines will probably have their work cut
> out running firewalls, spam filters, voice recognition, and so on. I
> doubt the GIL will make much difference, still.

Actually I doubt that losing the GIL would make that much performance
difference even on a real (not HT) 4-way box ->
What the GIL buys us is no lock contention and no locking overhead
with Python.

And before somebody cries out, just think how the following trivial
statement would happen.

a = b + c
Lock a
Lock b
Lock c
a = b + c
Unlock c
Unlock b
Unlock a

So basically you either get a really huge number of locks (one per
object) with enough potential for conflicts, deadlocks and all the other
stuff to really slow down the ceval loop.

One could use less granularity, and lock say the class of the object
involved, but that wouldn't help that much either.

So basically the GIL is a design decision that makes sense, perhaps it
shouldn't be just called the GIL, call it the "very large locking
granularity design decision".

And before somebody points out that other languages can use locks too:
well, other languages usually have a much lower-level execution model than
Python, and they usually force the developer to deal with
synchronization primitives. Python, OTOH, has always had the "no segmentation
fault" policy -> so locking would have to be "safe" as in "not producing
segfaults". That's not trivial to implement; for example, reference
counting isn't trivially implementable without locking (at least
portably).

So, IMHO, there are basically the following design decisions:
GIL: large granularity
MSL: (many small locks) would slow down the overall execution of Python
programs.
MSLu: (many small locks, unsafe) unacceptable because it would change
the Python experience ;)

Andreas
 

Jeff Shannon

Daniel said:
You forgot GarbageCollection instead of reference counting.


That's an orthogonal issue. GC/Refcounting may depend on the
data-locking scheme being thread-safe, but even if one used a GC scheme
that was completely thread-safe regardless of locking, you'd still need
data-locking to safely use objects in your own code. The fact that I
can be completely confident that the memory space pointed to by a, b,
and c won't be deallocated behind my back says nothing about whether
another worker thread may change b in between when I test it and when I
use it. This means that in a free-threaded environment (without GIL),
given

a = b + c
d = b + c
assert (a == d)

I cannot guarantee that the assertion is true, unless I explicitly lock
all the variables involved -- four locking operations plus four
unlocking operations. Locking the local namespace won't work, because
any of those could be a global reference or a cross-module reference or
whatever. One could, I suppose, explicitly acquire/release a
GIL-granularity lock (process-global "critical section", in essence),
but that leaves the possibility of a mistakenly unprotected variable
causing a segfault -- again, a violation of one of Python's design
constraints. And I don't think that it's practical to have the
interpreter infer where it'd be safe to release the GIL.
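
As a present-day illustration of Jeff's point (a sketch, not anything from the thread): even with the GIL, a read-modify-write on shared state spans several bytecodes, so application-level locking is still needed for consistency.

```python
# Sketch: "counter += 1" is several bytecodes (load, add, store), so
# concurrent updates need an application-level lock to be consistent,
# GIL or no GIL.
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:        # drop this lock and updates may be lost
            counter += 1  # read-modify-write: not atomic by contract

threads = [threading.Thread(target=add_many, args=(100000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # -> 400000 (deterministic only because of the lock)
```

The GIL protects the interpreter's internal structures, not application invariants; that distinction is exactly what the lock here covers.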

Jeff Shannon
Technician/Programmer
Credit International
 

Bryan Olson

I have to agree with Alan. The major chip vendors are telling
us that the future is multi-processor.

I don't know of any quantitative demonstration either way, but
demands on computers tend to be bursty. Frequently, a single
process has a lot of tasks that suddenly demand computation,
while all other processes are quiet. A GIL is a killer.
> Actually I doubt that loosing the GIL would make that much performance
> difference even on a real (not HT) 4-way box ->
> What the GIL buys us is no lock contention and no locking overhead
> with Python.
>
> And before somebody cries out, just think how the following trivial
> statement would happen.
>
> a = b + c
> Lock a
> Lock b
> Lock c
> a = b + c
> Unlock c
> Unlock b
> Unlock a

If that happens frequently, the program has blown it rather
badly.
> So basically you either get a really huge number of locks (one per
> object) with enough potential for conflicts, deadlocks and all the other
> stuff to make it real slow down the ceval.
>
> One could use less granularity, and lock say the class of the object
> involved, but that wouldn't help that much either.

There's a significant point there in that threads do well with
coarse granularity, and badly with finely parallel algorithms.
But if we look at how sensible multi-threaded programs work, we
see that the threads share little writable data. Thread
contention is surprisingly rare, outside of identifiable
hot-spots around shared data-structures.
> So basically the GIL is a design decision that makes sense, perhaps it
> shouldn't be just called the GIL, call it the "very large locking
> granularity design decision".
>
> And before somebody points out that other languages can use locks too.
> Well, other languages have usually much lower level execution model than
> Python. And they usually force the developer to deal with
> synchronization primitives. Python OTOH had always the "no segmentation
> fault" policy -> so locking would have be "safe" as in "not producing
> segfaults". That's not trivial to implement, for example reference
> counting isn't trivially implementable without locking (at least
> portably).

It's not trivial to implement, but it is implementable, and is
so valuable that languages that do it are likely to antiquate
those that do not. For example, Java strictly specifies what
the programmer can assume and what he cannot. Updates to 32-bit
integers are always atomic; updates to 64-bit integers are not.
The Java VM (with the Java libraries) does not crash because of
application-program threading mistakes; or at least if it does
everyone recognizes it as a major bug. As presently
implemented, Java both allows more thread-concurrency than
Python, and is more thread-safe.
> So, IMHO, there are basically the following design decisions:
> GIL: large granularity
> MSL: (many small locks) would slow down the overall execution of Python
> programs.
> MSLu: (many small locks, unsafe) unacceptable because it would change
> the Python experience ;)

One-grain-at-a-time is quickly becoming unacceptable regardless
of the granularity.
 

P.M.

Alan Kennedy said:
I agree that it could potentially be a serious hindrance for cpython if
"multiple core" CPUs become commonplace. This is in contrast to jython
and ironpython, both of which support multiple-cpu parallelism.

Although I completely accept the usual arguments offered in defense of
the GIL, i.e. that it isn't a problem in the great majority of use
cases, I think that position will become more difficult to defend as
desktop CPUs sprout more and more execution pipelines.

I wonder if Python could be changed to use thread local storage? That
might allow for multiple interpreter instances without the GIL (I've
never looked at the actual code so I'm just hypothesizing). I took a
quick look at Lua today and it has no problems with creating multiple
instances of the interpreter, so it definitely is a solvable problem.
 

Steve Holden

The GIL has been a problem to me (probably a "fairly typical" Python
user) precisely once, when I tried to run a lengthy database process on
my brand-new H-P desktop-replacement laptop. The task seemed to be
running a little slower than I would have expected from the quoted clock
speed, so I opened up the Windows task manager and lo and behold, the
performance monitor was showing two CPUs, one hitting the stop at 100%
and the other basically idle.

Until that time I hadn't realized that the laptop was built on a
multi-core processor. Fortunately the task was easily partitioned into
two independent processes, each dealing with a separate set of database
rows, so the run completed well under the necessary 24 hours.

> I wonder if Python could be changed to use thread local storage? That
> might allow for multiple interpreter instances without the GIL (I've
> never looked at the actual code so I'm just hypothesizing). I took a
> quick look at Lua today and it has no problems with creating multiple
> instances of the interpreter, so it definitely is a solvable problem.

Thread synchronization through a common namespace would likely be the
problem. The last time this was discussed on c.l.py someone who knows
far more than I about the interpreter's structure (Tim Peters, unless I
mistake my guess) opined that removing the GIL from CPython was far from
being a simple undertaking. Whether that's changed since I dunno.

regards
Steve
 

Sam G.

OK, my mistake; I've read more about the GIL to understand it better.
Thanks for correcting me.

But if only one thread per interpreter can run at a time, then on a
multicore machine two (or more) threads can't run at the same time.
Performance-wise the result is almost the same: if the server is doing
nothing but running Python (not realistic, I know, just to keep it
simple), only one core works...

Not very good for Zope, as multicore is the near future...

Sam.


Grant Edwards wrote:
[...]
AFAIK, the threads aren't all locked to a single CPU. However,
there is a global interpreter lock (a mutex), which somewhat
prevents simultaneous execution of threads when they _are_ on
different CPUs.
[...]
 

Alex Martelli

Sam G. said:
OK, my mistake; I've read more about the GIL to understand it better.
Thanks for correcting me.

But if only one thread per interpreter can run at a time, then on a
multicore machine two (or more) threads can't run at the same time.
Performance-wise the result is almost the same: if the server is doing
nothing but running Python (not realistic, I know, just to keep it
simple), only one core works...

Not very good for Zope, as multicore is the near future...

Unless Zope can use multiple processes (I don't know whether it can or
not), the trouble's already there -- multi-CPU computers are already
widespread and have been for a while (all PowerMacs except the
entry-point ones, and I believe all Mac Servers, have been using
multiple CPUs for years now). It does, however, appear to me that
structuring a server application into multiple processes rather than
into multiple threads shouldn't be that big a deal anyway. And with
multiple processes you can exploit multiple CPUs or cores easily.


Alex
 

Neil Hodgson

Steve Holden:
Until that time I hadn't realized that the laptop was built on a
multi-core processor.

It is more likely that you had a machine featuring 'hyperthreading',
which is much less than multiple cores: somewhere between two sets of
registers and two processors.

> Fortunately the task was easily partitioned into
> two independent processes, each dealing with a separate set of database
> rows, so the run completed well under the necessary 24 hours.

Did you measure a real performance increase, that is, elapsed time to
completion of job? Many benchmarks show minimal or even negative performance
improvements for hyperthreading. Relying on secondary indicators such as CPU
busyness can be misleading.

Neil
 

Daniel Dittmar

Jeff said:
a = b + c
d = b + c
assert (a == d)

I don't think that anybody expects the assertion to be true if threads
and any global variables are involved.
> I cannot guarantee that the assertion is true, unless I explicitly
> lock all the variables involved -- four locking operations plus four
> unlocking operations.

Trying to lock more than one object automatically is a recipe for
deadlocks.
> causing a segfault -- again, a violation of one of Python's design
> constraints. And I don't think that it's practical to have the
> interpreter infer where it'd be safe to release the GIL.

Unlike Java (1), Python assignments are not atomic operations, but
rather complex operations (updating of a dictionary).

Dictionaries would be locked at the C-level, so every assignment to
module or instance variables would trigger at least one lock.
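
A quick way to see this today (a sketch using the stdlib dis module): at module scope, `a = b + c` compiles to namespace-dictionary lookups and a store, i.e. several bytecode operations rather than one atomic step.

```python
# Sketch: disassemble "a = b + c" at module scope and confirm it is
# several operations against the namespace dictionary.
import dis
import io

buf = io.StringIO()
dis.dis(compile("a = b + c", "<example>", "exec"), file=buf)
listing = buf.getvalue()
print(listing)

# b and c are fetched from the namespace dict, the result stored back.
assert "LOAD_NAME" in listing and "STORE_NAME" in listing
```

Each of those name operations is a point where a fine-grained locking scheme would have to take and release a dictionary lock.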

I'm not sure if locals() would have to be locked. As Python's closures
can't assign new values to variables of the surrounding scope, I'm not
aware of a situation where a local variable could be updated (2) by a
second thread.

Although 'Garbage Collection' fits your 'MSL: (many small locks) would
slow down the overall execution of Python programs.' in name, it is
technically a very different solution from locking the reference counter,
probably with very different performance implications.

Daniel

(1) assignments of certain types are not guaranteed to be atomic in Java
(2) 'updated' is here meant as assigning a new value, not as calling a
method on a mutable object.
 

Alan Kennedy

[Andreas Kostyrka]
> So basically you either get a really huge number of locks (one per
> object) with enough potential for conflicts, deadlocks and all the
> other stuff to make it real slow down the ceval.
>
> One could use less granularity, and lock say the class of the object
> involved, but that wouldn't help that much either.
>
> So basically the GIL is a design decision that makes sense, perhaps it
> shouldn't be just called the GIL, call it the "very large locking
> granularity design decision".

Reading the above, one might be tempted to conclude that presence of the
GIL in the cpython VM is actually a benefit, and perhaps that all
language interpreters should have one!

But the reality is that few other VMs have selected to employ such a
"very large locking granularity design decision". If one were to believe
the arguments about the problems of fine-grained locking, one might be
tempted to conclude that other VMs such as the JVM and the CLR are
incapable of competing with the cpython VM, in terms of performance. But
Jim Hugunin's pre-release figures for IronPython performance indicate
that it can, in some cases, outperform cpython while running on just a
single processor, let alone when multiple processors are available. And
that is for an "interpreter running on an interpreter".

CPython's GIL does give a *small* performance benefit, but only when
there is a single execution pipeline. Once there is more than one
pipeline, it degrades performance.

As I've already stated, I believe the benefits and trade-offs of the GIL
are arguable either way when there is a small number of processors
involved, e.g. 2 or less. But if chip makers are already producing chips
with 2 execution pipelines, then you can be sure it won't too be long
before they are shipping units with 4, 8, 16, 32, etc, execution
pipelines. As this number increases, the GIL will increasingly become a
restrictive bottleneck.

Contrarily, jython, ironpython, etc, will continue to benefit from the
enormous and massively-resourced optimisation efforts going into the JVM
and CLR respectively (e.g. JIT compilation), all without a single change
to the python code or the code interpreting it.

Lastly, as the number of execution pipelines in CPUs grow, what will
happen if/when they start talking back and forth to each other,
transputer-style, instead of executing mostly isolated and in parallel
as they do today? Transputers[1] had the lovely concept of high-speed
hardware communication channels cross-linking all CPUs in the "array".
The reason this model never really took off back in the 80's was because
there were no familiar high-level language models for exploiting it,
besides Occam[2].

IMHO, new python concepts such as generators are precisely the right
high-level concepts for enabling transputer style fine-granularity,
inter-execution-pipeline, on-demand, "pull" comms. This would put python
at an enormous conceptual advantage compared to other mainstream
languages, which generally don't have generator-style concepts. What a
shame that the GIL would restrict only one such CPU at a time to
actually be running python code.
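
As a small sketch of that "pull" style (nothing transputer-specific, just plain generators): each stage demands a value from the stage before it only when asked, a channel-like dataflow within a single thread.

```python
# Sketch of generator-based "pull" dataflow: each stage is a channel
# that produces a value only when the next stage asks for one.
def produce(n):
    for i in range(n):
        yield i

def square(source):
    for value in source:
        yield value * value

def running_total(source):
    total = 0
    for value in source:
        total += value
        yield total

# Values flow on demand through the chained stages.
result = list(running_total(square(produce(5))))
print(result)  # -> [0, 1, 5, 14, 30]
```

The interesting hypothetical is running each stage on its own execution pipeline with the yields acting as hardware channels; the GIL is precisely what prevents that in CPython.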

regards,

--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan

[1] Transputer architecture
http://www.cmpe.boun.edu.tr/courses/cmpe511/fall2003/transputer.ppt

[2] Occam programming language
http://encyclopedia.thefreedictionary.com/Occam programming language
 

Andreas Kostyrka

> You forgot GarbageCollection instead of reference counting.

Well, "GarbageCollection" isn't really an option for a portable ANSI C
program, is it?

Andreas
 

Nick Craig-Wood

Alan Kennedy said:
[Andreas Kostyrka]
So basically you either get a really huge number of locks (one per
object) with enough potential for conflicts, deadlocks and all the
other stuff to make it real slow down the ceval.

One could use less granularity, and lock say the class of the object
involved, but that wouldn't help that much either.

So basically the GIL is a design decision that makes sense, perhaps it
shouldn't be just called the GIL, call it the "very large locking
granularity design decision".

Reading the above, one might be tempted to conclude that presence of the
GIL in the cpython VM is actually a benefit, and perhaps that all
language interpreters should have one!

Interestingly, the linux kernel has been through a similar evolution...

In linux 2.0, multi-processing (SMP) was brought in. This was done
using something called the Big Kernel Lock (BKL). Sounds familiar,
doesn't it! This worked fine but had the problem that only one CPU
could be executing in the kernel at once, and hence a performance loss
in certain situations. This is exactly Python's situation now:
there can only be one thread in the Python core at once.

However, the linux kernel has evolved to replace the BKL with a series
of smaller locks. This has been a gradual process from 2.2 to 2.6. The
BKL is still there, but it's only used by legacy code in 2.6.

Yes fine grained locking does have an overhead. More locks mean more
memory and more wasted time in locking and unlocking them. The linux
kernel has got round this by

1) The locks aren't compiled in at all for a uniprocessor kernel (this
would mean a non-threading python - probably not an option)

2) The linux kernel has many different forms of locking both heavy and
light-weight

3) Care has been taken to properly align the locks on cache line
boundaries. This costs memory, though.

4) The linux kernel now has Read-Copy-Update (RCU), which requires no
locking.

Linux is aiming at the future when everyone has 4 or 8
cores/hyperthreads and I think that is the right decision. Fine
grained locking will come to python one day I'm sure.
 
