Exploiting dual cores with Py_NewInterpreter and a separate GIL?

robert

I'd like to use multiple CPU cores for selected time-consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I thought about simply using multiple Python interpreter instances (Py_NewInterpreter), each with its own GIL, in the same process.
I expect to be able to directly push Python object trees around between the 2 (or more) interpreters by doing some careful locking.
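
For concreteness, here is a minimal C embedding sketch of what that would look like (assuming a CPython 2.x build linked against the C API; error handling omitted). All the calls shown exist; note, as discussed below, that both interpreters still share the single global GIL:

    #include <Python.h>

    int main(void)
    {
        Py_Initialize();                              /* main interpreter */
        PyThreadState *main_ts = PyThreadState_Get();

        /* Py_NewInterpreter creates a second interpreter with its own
           __main__, sys.modules etc. and makes its thread state current. */
        PyThreadState *sub = Py_NewInterpreter();
        PyRun_SimpleString("print 'hello from the sub-interpreter'");

        Py_EndInterpreter(sub);           /* tear the sub-interpreter down */
        PyThreadState_Swap(main_ts);      /* reattach the main thread state */
        Py_Finalize();
        return 0;
    }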

Any hope of getting this to work? If so, what are the main dangers? Is there an example or guideline around for this task - using ctypes or so?

Or is there even a ready-made Python module which makes it easy to set up and deal with extra interpreter instances?
If not, would it be an idea to create such a thing in the Python standard library to make Python multi-processor-ready? I guess Python will always have a GIL - otherwise it would lose a lot of the comfort of threaded programming.


robert
 
Daniel Dittmar

robert said:
I'd like to use multiple CPU cores for selected time-consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I thought about simply using multiple Python interpreter instances (Py_NewInterpreter), each with its own GIL, in the same process.

If I understand Python/ceval.c, the GIL is really global, not specific
to an interpreter instance:
static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */

Daniel
 
robert

Feature Request: Py_NewInterpreter to create a separate GIL (branch)

Daniel said:
If I understand Python/ceval.c, the GIL is really global, not specific
to an interpreter instance:
static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */

That's the showstopper as of now.
There are only a handful of functions in ceval.c that use that one global lock directly; the rest works through those functions via thread states.
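
For orientation, these are the entry points meant here, as declared in the CPython 2.x headers (a non-exhaustive list; everything else funnels through them):

    /* ceval.c functions that manage the one global interpreter_lock: */
    void PyEval_AcquireLock(void);                 /* block until the GIL is held */
    void PyEval_ReleaseLock(void);                 /* release the GIL */
    PyThreadState *PyEval_SaveThread(void);        /* detach thread state, drop GIL */
    void PyEval_RestoreThread(PyThreadState *ts);  /* reacquire GIL, reattach state */
    void PyEval_AcquireThread(PyThreadState *ts);  /* acquire GIL and swap in state */
    void PyEval_ReleaseThread(PyThreadState *ts);  /* swap out state and drop GIL */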

Would it be a possibility in a future Python to have the lock separate for each interpreter instance?
Thus: have a separate *interpreter_lock in each PyThreadState instance, so that only threads of the same interpreter share the same GIL?
Separation between interpreters seems to be enough. The interpreter runs mainly on the stack. Possibly only very few global C-level resources would require individual extra locks.

Sooner or later Python will have to answer the multi-processor question.
A per-interpreter GIL plus a nice module for tunneling Python objects directly between interpreters inside one process might be the answer at the right borderline? The existing extension code base would remain compatible, as far as there is already decent locking on module globals, which is the usual case.

Robert
 
Filip Wasilewski

robert said:
I'd like to use multiple CPU cores for selected time-consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I thought about simply using multiple Python interpreter instances (Py_NewInterpreter), each with its own GIL, in the same process.
I expect to be able to directly push Python object trees around between the 2 (or more) interpreters by doing some careful locking.

I don't want to discourage you, but what about reference counting/memory management for shared objects? Doesn't seem like fun to me.


Take a look at IPython1 and its parallel computing capabilities [1, 2]. It is designed to run on multiple systems or on a single system with multiple CPUs/multiple cores. Its worker interpreters (engines) are loosely coupled and can utilize several MPI modules, so there is no low-level messing with the GIL. Although it is work in progress, it already looks quite awesome.

[1] http://ipython.scipy.org/moin/Parallel_Computing
[2] http://ipython.scipy.org/moin/Parallel_Computing/Tutorial

fw
 
robert

Filip said:
I don't want to discourage you, but what about reference counting/memory management for shared objects? Doesn't seem like fun to me.

In combination with some simple locking (necessary anyway), I don't see a problem with ref-counting.

If at least one interpreter branch still has a pointer to the (root) object in question, its ref-count is > 0.


----
Side question: do concurrent INC/DEC machine instructions execute atomically on multiple cores, as they do among threads on a single core?

Example:

obj = Obj()

In a read-only phase (e.g. after computations), without locking, 2 interpreters would for example both access obj (changing the refcount around, but not the data).
The CPU will execute the 2 [INC/DEC @refcount] opcodes on different cores concurrently. Is it guaranteed that the counts sum up correctly?

Take a look at IPython1 and its parallel computing capabilities [1, 2]. It is designed to run on multiple systems or on a single system with multiple CPUs/multiple cores. Its worker interpreters (engines) are loosely coupled and can utilize several MPI modules, so there is no low-level messing with the GIL. Although it is work in progress, it already looks quite awesome.

[1] http://ipython.scipy.org/moin/Parallel_Computing
[2] http://ipython.scipy.org/moin/Parallel_Computing/Tutorial


There are some MPI methods around. (This IPython method seems to be only at the level of an artefact of the interactive terminal connections.)

Yet with its burden of expensive data synchronization, that's far away from my requirements. That's for massively parallel computing in scientific areas.

I already do selected things with interprocess shared-memory techniques. That's moderately efficient.

Multiple interpreters inside one process seem most promising for seamless multi-core programming. All Python objects share the same malloc space - that's the key requirement in order to get the main effect. As soon as we have object pickling in between, it's well away from this very discussion.


robert
 
Daniel Dittmar

robert said:
---- Side question: do concurrent INC/DEC machine instructions execute atomically on multiple cores, as they do among threads on a single core?

No, not at the level at which Python's reference counting is implemented. CPUs often have special assembler instructions for these operations. I think it's even more complicated, as some machines require additional read and write barriers around these operations.
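
To see this concretely, here is a small self-contained C demo (a sketch assuming GCC and pthreads; __sync_fetch_and_add is a GCC builtin) showing both the lost updates of a plain increment and an atomic variant:

    #include <pthread.h>
    #include <stdio.h>

    static volatile long plain_count = 0;   /* incremented with plain ++ */
    static volatile long atomic_count = 0;  /* incremented atomically */

    static void *worker(void *arg)
    {
        int i;
        for (i = 0; i < 1000000; i++) {
            plain_count++;                          /* separate read/modify/write: updates can be lost */
            __sync_fetch_and_add(&atomic_count, 1); /* emits a LOCK-prefixed add on x86 */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* On a multi-core machine plain_count typically falls short of
           2000000, while atomic_count is reliably exact. */
        printf("plain=%ld atomic=%ld\n", plain_count, atomic_count);
        return 0;
    }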
robert said:
Take a look at IPython1 and its parallel computing capabilities [1,
[...]
I already do selected things with interprocess shared-memory techniques. That's moderately efficient.
Multiple interpreters inside one process seem most promising for seamless multi-core programming.

Both IPython and Jython use multiple CPUs without the GIL, so you don't
need to add the complication of multiple interpreters.
robert said:
All Python objects share the same malloc space - that's the key requirement in order to get the main effect. As soon as we have object pickling in between, it's well away from this very discussion.

There's the complication that CPython has garbage collection in addition
to reference counting. I'm not sure what happens when the garbage
collector looks at objects rooted in another interpreter.

Daniel
 
robert

Daniel said:
robert said:
---- Side question: do concurrent INC/DEC machine instructions execute atomically on multiple cores, as they do among threads on a single core?

No, not at the level at which Python's reference counting is implemented. CPUs often have special assembler instructions for these operations. I think it's even more complicated, as some machines require additional read and write barriers around these operations.

Take a look at IPython1 and its parallel computing capabilities [1,
[...]

I already do selected things with interprocess shared-memory techniques. That's moderately efficient.
Multiple interpreters inside one process seem most promising for seamless multi-core programming.

Both IPython and Jython use multiple CPUs without the GIL, so you don't
need to add the complication of multiple interpreters.

(IPython is only a special "Python network terminal", as already said.)

With Jython I'm not familiar. Yet I think going to Jython will at first cost more performance than you can win - at least on mere dual cores as we have now. In 5 years, with >8-way MPU systems, this may be different...
Does Jython really eliminate the GIL? What happens when different threads alter/read a dict concurrently - the basic operation in Python, which is usually not protected by locks? Does Jython use a dict data type (and other collection types) which locks during _each_ access? In that case, it may be very expensive.
There's the complication that CPython has garbage collection in addition
to reference counting. I'm not sure what happens when the garbage
collector looks at objects rooted in another interpreter.

Garbage is collected at the earliest when the refcount has gone to 0. Once it has gone to 0, no one will ever use such an object again. Thus GC should not be any different at all.


robert
 
Daniel Dittmar

robert said:
(IPython is only a special "Python network terminal", as already said.)

Sorry, I thought of IronPython, the .NET variant.
robert said:
Does Jython really eliminate the GIL?

Yes.

robert said:
What happens when different threads alter/read a dict concurrently - the basic operation in Python, which is usually not protected by locks? Does Jython use a dict data type (and other collection types) which locks during _each_ access? In that case, it may be very expensive.

True, though Java synchronization has gotten better and better. Still, Sun introduced non-locking hash tables, so it was a problem. It'd be interesting to know which kind of dict is used in Jython for attribute access.
robert said:
Garbage is collected at the earliest when the refcount has gone to 0. Once it has gone to 0, no one will ever use such an object again. Thus GC should not be any different at all.

Since Python 2.?, there's mark-and-sweep garbage collection in addition to the reference-counting scheme. This was put into Python to be able to reclaim object cycles.


Daniel
 
robert

Daniel said:
Sorry, I thought of IronPython, the .NET variant.


True, though Java synchronization has gotten better and better. Still, Sun introduced non-locking hash tables, so it was a problem. It'd be interesting to know which kind of dict is used in Jython for attribute access.

I hate Java :) , but to come back to the concern of this thread: speed. I found not much about it.
http://mail.zope.org/pipermail/zpt/2002-March/002884.html
says Jython is at least 9x slower. Is this still the gap (and is constant locking a big factor in this)?

IronPython seems to be surprisingly faster than CPython (sometimes): http://www.python.org/pycon/dc2004/papers/9/
But I cannot imagine that this includes GIL-free locking dicts and all... !?

Is numpy/scipy available for IronPython/Jython?

Daniel said:
Since Python 2.?, there's mark-and-sweep garbage collection in addition to the reference-counting scheme. This was put into Python to be able to reclaim object cycles.

Still, that only walks over objects which already have a zero refcount. I cannot imagine any additional problems with the GIL.

robert
 
Martin v. Löwis

robert said:
In combination with some simple locking (necessary anyway), I don't see a problem with ref-counting.

In the current implementation, simple locking isn't necessary.
The refcounter can be modified freely since the code modifying
it will always hold the GIL.
robert said:
---- Side question: do concurrent INC/DEC machine instructions execute atomically on multiple cores, as they do among threads on a single core?

Not necessarily, no. On x86, you need to prefix the instruction
with the LOCK prefix for it to become atomic. Otherwise, the
two necessary read/write cycles to main memory may interleave
with the memory operations of another processor.

On other processors, INC <mem> might not exist at all as an
instruction; when you only have register add, then implementing
an atomic increment is a challenge. That's why the Windows API
provides you with an atomic increment function as part of the
operating system API.
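
Concretely, the two routes look like this (a sketch only; the GCC inline-assembly syntax and the Win32 call both exist, but treat this as illustration, not a drop-in CPython patch):

    /* Atomic increment of a 32-bit counter, two ways. */

    /* GCC inline assembly on x86: the LOCK prefix makes the
       read-modify-write INC atomic across processors. */
    static void atomic_inc(int *p)
    {
        __asm__ __volatile__("lock; incl %0" : "+m" (*p));
    }

    #ifdef _WIN32
    #include <windows.h>
    /* The operating-system-provided atomic increment mentioned above. */
    static LONG atomic_inc_win(volatile LONG *p)
    {
        return InterlockedIncrement(p);
    }
    #endif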

Regards,
Martin
 
Paul Rubin

robert said:
In combination with some simple locking (necessary anyway), I don't see a problem with ref-counting.
If at least one interpreter branch still has a pointer to the (root) object in question, its ref-count is > 0.
---- Side question: do concurrent INC/DEC machine instructions execute atomically on multiple cores, as they do among threads on a single core?

Generally speaking, no, the inc/dec instructions are not atomic. You
can do an atomic increment on the x86 using LOCK XCHG (or maybe LOCK
INC is possible). The thing is that the locking protocol that
guarantees atomicity is very expensive, like 100x as expensive as an
unlocked instruction on a big multiprocessor. So yes, of course you
could accomplish reference counting through locks around the ref
counts, but performance suffers terribly. The solution is to get rid
of the ref counts and manage the entire heap using garbage collection.

For stuff like dictionary access, there are protocols (again based on
LOCK XCHG) that don't require locking for lookups. Only updates
require locking. Simon Peyton-Jones has a good paper about how it's
done in Concurrent Haskell:

http://research.microsoft.com/~simonpj/papers/stm/stm.pdf

This is really cool stuff and has found its way into Perl 6. I'd like
to see Python get something like it.
 
Steve Holden

robert said:
Daniel said:
robert said:
[...]

Since Python 2.?, there's mark-and-sweep garbage collection in addition to the reference-counting scheme. This was put into Python to be able to reclaim object cycles.

Still, that only walks over objects which already have a zero refcount. I cannot imagine any additional problems with the GIL.

If I understand you correctly, then you are suffering a misapprehension.
Any object whose reference count goes to zero will immediately be
reclaimed. The mark-sweep garbage collector is used to detect objects
that are taking part in cycles - sets of objects that only refer to each
other without being bound to a name in any current namespace or to any
container object bound to such a name.

In other words, it detects (and reclaims) objects with non-zero
reference counts which nevertheless can be reclaimed without ill effect
on the program.

regards
Steve
 
robert

Martin said:
In the current implementation, simple locking isn't necessary.
The refcounter can be modified freely since the code modifying
it will always hold the GIL.

(I meant: a lock to prevent multiple interpreters from concurrently accessing the hot shared/tunneled objects.)

Martin said:
Not necessarily, no. On x86, you need to prefix the instruction
with the LOCK prefix for it to become atomic. Otherwise, the
two necessary read/write cycles to main memory may interleave
with the memory operations of another processor.

On other processors, INC <mem> might not exist at all as an
instruction; when you only have register add, then implementing
an atomic increment is a challenge. That's why the Windows API
provides you with an atomic increment function as part of the
operating system API.

Thanks for that info. That is interesting.
So even on x86, this LOCK is currently not used (it is just (op)->ob_refcnt++).

Reading this, I got pinched:
In win32ui there are in fact Py_INCREF/Py_DECREFs outside of the GIL!
And I have a severe crash problem with threaded apps - the problem occurs only on dual cores!
That hint will probably end a long search...


robert


PS: Besides, what are the speed costs of LOCK INC <mem>?
 
GHUM

robert,
Interprocess communication is tedious and out of the question [...]
I expect to be able to directly push Python object trees around between the 2 (or more) interpreters by doing some careful locking.

Please do yourself a favour and have a look at Pyro. Pyro makes inter-computer and inter-process communication a snap, and it allows you to push around Python objects not only between processes, but between computers.

Maybe it solves your problem much more easily - and solves even more than you want to solve now, by letting you use more than one computer.

Harald
 
robert

Paul said:
Generally speaking, no, the inc/dec instructions are not atomic. You
can do an atomic increment on the x86 using LOCK XCHG (or maybe LOCK
INC is possible). The thing is that the locking protocol that
guarantees atomicity is very expensive, like 100x as expensive as an
unlocked instruction on a big multiprocessor. So yes, of course you
could accomplish reference counting through locks around the ref
counts, but performance suffers terribly. The solution is to get rid
of the ref counts and manage the entire heap using garbage collection.

For stuff like dictionary access, there are protocols (again based on
LOCK XCHG) that don't require locking for lookups. Only updates
require locking. Simon Peyton-Jones has a good paper about how it's
done in Concurrent Haskell:

http://research.microsoft.com/~simonpj/papers/stm/stm.pdf

This is really cool stuff and has found its way into Perl 6. I'd like
to see Python get something like it.

That's really interesting. Do you expect this to eventually remove the GIL from Python? As dict accesses (which here must also be atomic) make up a major share of Python's CPU load, instructions costing 100x would probably put a >30% burden on Python's overall speed.

A lock-protected possibility to use multiple well-separated interpreters by tunneling objects will probably still be the most effective solution without speed costs.
The problem of the singletons' refcounts (None, "", 1, 2, 3, ...), which MvL mentioned, is the main concern as far as I've understood.
The locking of Python-core global resources (files etc.) can be done with little effort.
The globals of extension modules are not really critical, as in typical applications the tunnel method is mainly used for limited number crunching, and as a programmer you are well aware of multi-interpreter-unfit modules (until they finally become mature). There could also be a special import method to duplicate extension data as a workaround.

The singletons' refcounts could be fix-reset to MAXINT/2 at ref-creation time (or at GC time, or ...) or so, to freeze them quiet forever.
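
A hedged sketch of that idea (illustration only - Py_None and ob_refcnt are real CPython fields, but the freezing itself is just the proposal above, not anything CPython does):

    #include <Python.h>
    #include <limits.h>

    /* Pin an immortal singleton's refcount far above zero, so that racy,
       unlocked INC/DEC from several interpreters can never drive it to 0
       and trigger deallocation; lost updates merely make the count drift
       within a huge safety margin. */
    static void freeze_singleton(PyObject *op)
    {
        op->ob_refcnt = INT_MAX / 2;
    }

    /* e.g.: freeze_singleton(Py_None); */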

(Mutable (why?)) exception types could be doubled in the same style as normal Python modules, or they could be rendered read-only. Thrown exceptions will not cross the border. (I think the fact that "Exception.x=5" is possible is more an artefact than intended.)


robert
 
robert

GHUM said:
robert,
Interprocess communication is tedious and out of the question [...]
I expect to be able to directly push Python object trees around between the 2 (or more) interpreters by doing some careful locking.

Please do yourself a favour and have a look at Pyro. Pyro makes inter-computer and inter-process communication a snap, and it allows you to push around Python objects not only between processes, but between computers.

Maybe it solves your problem much more easily - and solves even more than you want to solve now, by letting you use more than one computer.

Another MPI - pickles and worse over the network :-( Heavy networked batch jobs were never a problem with Python. That's what I want to get away from. Read the rest of this thread: the discussion is about an in-place object tunnel for fine-grained, high-speed SMP multithreading ("memory bus").
 
Martin v. Löwis

robert said:
PS: Besides, what are the speed costs of LOCK INC <mem>?

That very much depends on the implementation. In

http://gcc.gnu.org/ml/java/2001-03/msg00132.html

Hans Boehm claims it's 15 cycles. The LOCK prefix
itself asserts the lock# bus signal for the entire
operation, meaning that the other processors
can't perform memory operations during that time.
On the P6, if the data is cacheable (for some Intel
definition of this word), the lock# signal will
not be asserted, just the cache gets locked.
The LOCK prefix also causes any pending writes
to be performed before the operation starts.

So in the worst case, a LOCK INC will have to wait for pending writes, then assert the lock# signal, then perform a read and a write memory cycle, with the increment processor cycle in between. Assuming a 400 MHz memory bus and a 4 GHz processor, those two bus cycles cost about 10 processor cycles each, so a LOCK INC will take around 20 cycles, whereas a plain INC might get done in a single cycle or less (assuming pipelining and caching).

Regards,
Martin
 
Martin v. Löwis

Paul said:
I think that has to be on a single processor, or at most a dual-core processor with a shared on-die cache. With multiple CPU chips I don't think you can get the signals around that fast.

Can you explain what you mean? The lock# signal takes *immediate*
effect, within the CPU cycle in which it is asserted. There is no
delay whatsoever (except for speed-of-light issues, of course).

Regards,
Martin
 
Paul Rubin

Martin v. Löwis said:
Can you explain what you mean? The lock# signal takes *immediate*
effect, within the CPU cycle in which it is asserted. There is no
delay whatsoever (except for speed-of-light issues, of course).

I dunno about x86 hardware signals, but these instructions do read-modify-write operations. That means there has to be enough interlocking to prevent two CPUs from updating the same memory location simultaneously, which means the CPUs have to communicate. See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is how the current x86s do it):

A cache that holds a line in the Modified state must snoop
(intercept) all attempted reads (from all of the other CPUs in the
system) of the corresponding main memory location and insert the
data that it holds. This is typically done by forcing the read to
back off (i.e. to abort the memory bus transaction), then writing
the data to main memory and changing the cache line to the Shared
state.

A cache that holds a line in the Shared state must also snoop all
invalidate broadcasts from other CPUs, and discard the line (by
moving it into Invalid state) on a match.

It is quite difficult to get signals off of a chip at GHz speeds, and
so this signalling is much slower than the CPU cycle.
 
