Exploiting Dual Cores with Py_NewInterpreter's separated GIL?


Joe Seigh

Ross said:
So give an example where reference counting is unsafe.

Nobody claimed that, in that thread. Instead, the claim was
"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."
[...]

The problem you're describing isn't that reference counting hasn't been
made safe. What you and Joe seem to be trying to say is that atomic
increment and decrement instructions alone don't make accessing shared
structure members safe.

How you increment and decrement a reference count is an implementation
issue and whether it's correct or not depends on the semantics of the
refcount based pointer. I.e. what kind of thread safety guarantees
are we ascribing to pointer operations? Java reference (pointer)
guarantees are not the same as C++ Boost shared_ptr guarantees. In
Java simple pointer assignment, "a = b;" is always safe no matter what
any other thread may be doing to a and/or b. With shared_ptr you have
to have a priori knowledge that the refcount for b will never go to
zero during the copy operation. Usually that implies ownership of b.
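
To make that shared_ptr-style caveat concrete, here is a minimal C sketch (a hypothetical refcounted object using GCC __sync builtins, not any particular library's implementation): unless the copying thread already owns a reference, the window between loading the pointer and incrementing the count is unsafe, no matter how atomic the increment itself is.

/* Minimal sketch of the race - hypothetical object, GCC __sync builtins. */
#include <stdlib.h>

typedef struct {
    volatile int refcnt;
    /* payload ... */
} obj_t;

obj_t *shared;                        /* pointer visible to several threads */

/* Thread A: tries to take its own reference to the shared object. */
obj_t *acquire(void)
{
    obj_t *p = shared;                /* (1) load the pointer */
    /* window: thread B may call release(p) here, the count hits zero,
       the object is freed, and the increment below touches freed
       memory - the atomicity of the increment does not help. */
    __sync_add_and_fetch(&p->refcnt, 1);   /* (2) atomic increment */
    return p;
}

/* Thread B: drops a reference; the last one frees the object. */
void release(obj_t *p)
{
    if (__sync_sub_and_fetch(&p->refcnt, 1) == 0)
        free(p);
}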

To get a better feeling for the latter semantics you should take a look
at C++ String implementations. A C++ String COW (copy on write) implementation
using atomic inc/dec is as thread-safe as a non-COW String implementation
because in both cases you have to own the strings you access.

About the term "thread-safe". In Posix it's taken as meaning an operation
on non-shared data isn't affected by other threads. So non-COW strings
are thread-safe, and COW strings are thread-safe if their internal
refcounting is synchronized properly. But note that in both cases,
the implementation is transparent to the user.

So as you say, a lot depends on how you access or want to access shared
structure members, e.g. with or without locking.
 

robert

Ross said:
So give an example of where atomic increment and decrement instructions
are not by themselves sufficient to make reference counting safe.


Your example is of how access to the "f_name" member is unsafe, not of
how reference counting is unsafe. The same sort of race condition can
occur without reference counting being involved at all. Consider the
"f_fp" member: if one thread tries to use "fprintf()" on it while
another thread calls "fclose()", then you can have the same problem. The
race condition here doesn't happen because reference counting hasn't
been made safe, nor does it happen because stdio isn't thread-safe. It
happens because accessing "f_fp" (without the GIL) is unsafe.

The problem you're describing isn't that reference counting hasn't been
made safe. What you and Joe seem to be trying to say is that atomic
increment and decrement instructions alone don't make accessing shared
structure members safe.

Yes.

To recall the motivation and give a real-world example:
The idea is to share/hand over objects between 2 (N) well-separated Python interpreter instances (free-threading) with 2 different GILs.
There is of course the usage condition: * Only one interpreter may access (read & write) the hot (tunneled) object tree at a time *
(e.g. application: a numeric calculation, and when it is finished (flag) the other interpreter again walks the objects directly (without MPI/pickling))

But a key problem was that the other thread can have old pointer objects (containers) pointing into the hot object tree - and there is shared use of (immutable) singleton objects (None, 1, 2, ...): the old pointer objects in the other interpreter may disappear at any time, or the pointers may be duplicated.
Thus the refcount of a hot object will be counted down/up from the interpreter which does not have possession of the hot object tree - even if the pointers are not used for read/write access.

Only if you truly have atomic Py_INCREF/Py_DECREF is this no problem.

As Ross Ridge already pointed out, the object will not disappear accidentally before all interpreters have released it.
In addition, concurrent read access to _constant_ objects/sub-trees would be possible, and also concurrent read & write access by using explicit locking!
Thus the original OP requirements would be fulfilled.
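
For illustration, a rough sketch of what such atomic refcount macros could look like on top of GCC's __sync builtins - these are not CPython's actual Py_INCREF/Py_DECREF, just the idea:

/* Hypothetical atomic refcount macros - a sketch only, not CPython's
   Py_INCREF/Py_DECREF; assumes GCC __sync builtins on x86. */
#include "Python.h"

#define Py_ATOMIC_INCREF(op) \
    __sync_add_and_fetch(&((PyObject *)(op))->ob_refcnt, 1)

#define Py_ATOMIC_DECREF(op)                                               \
    do {                                                                   \
        if (__sync_sub_and_fetch(&((PyObject *)(op))->ob_refcnt, 1) == 0)  \
            _Py_Dealloc((PyObject *)(op));                                 \
    } while (0)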

So far I see only 5 main additional requirements for offering the possibility of separated-GIL/free-threading interpreters:

* pointer to current GIL in threadstate and dynamic PyThreadState_GET() / currentthreadstate in TLS

* locks for global objects (files etc.) of course, if they are to be supported for this. (I'd use the free-threading only for mere computations)

* enable the already existing obmalloc.c LOCK & UNLOCK hooks with something fast like (see the runnable sketch after this list):
_retry:
    __asm lock inc malloc_lock
    if (malloc_lock != 1) { __asm lock dec malloc_lock; /*yield();*/ goto _retry; }

* a special (LOCK INC) locking dict type for the global dict of extension modules
(it is clearly created by Py_InitModule(name, methods) - thus this would also preserve backwards compatibility for extension C code)

* nice tunnel functions to create extra interpreters and for actually tunneling the objects and maybe offering the fast locking-dict type to enable a fast sharing of the hot tunneled object tree.
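
To make the locking idea in that list concrete, below is a self-contained sketch of such an inc/dec spin lock for obmalloc's LOCK/UNLOCK hooks, written with GCC's __sync builtins instead of MSVC inline assembly; the function names and the sched_yield() call are illustrative choices, not existing CPython code.

/* Sketch of the inc/dec spin lock suggested in the list above for
   obmalloc.c's LOCK/UNLOCK hooks.  Names are illustrative only. */
#include <sched.h>

static volatile int malloc_lock = 0;

static void malloc_lock_acquire(void)
{
    for (;;) {
        if (__sync_add_and_fetch(&malloc_lock, 1) == 1)
            return;                              /* lock acquired */
        __sync_sub_and_fetch(&malloc_lock, 1);   /* contended: undo ... */
        sched_yield();                           /* ... and retry politely */
    }
}

static void malloc_lock_release(void)
{
    __sync_sub_and_fetch(&malloc_lock, 1);
}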

Speed costs? Probably not much, as far as the discussion in this thread suggests...

Of course this option of 2 interpreters - though easy to use - would still be for power use cases only: a Python programming bug doing accidental unlocked concurrent access into a hot tunneled tree can cause a C-level crash. (This cannot happen so far with simple Python threading - you'd only get inconsistent data or a Python exception. But of course you can already crash at C level now by using the Python standard library :) ).
That danger would be OK for me so far. Conceptually it's not more complicated than using locks correctly in normal Python thread programming - only the effect of bugs will be more critical ...

If one thinks about overcoming the GIL at all - we are probably not far away. Mainly:

* make the locking dict type (and locking list type) the common case - the non-locking ones obsolete or for optimization only (a C sketch of such a locking dict operation follows below)

* lock some other non-constant types which are not already mainly dicts/lists. Most objects whose access functions only change integers etc. and call thread-safe C-lib functions don't require extra locking

Maybe the separated-GIL/interpreter method can be a bridge to that.
Refcounting probably doesn't necessarily block that road.
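
As a rough illustration of what a "locking dict" entry point could look like at C level - the lockdictobject layout and the function are made up for this sketch, while PyThread_acquire_lock/PyThread_release_lock and PyDict_SetItem are CPython's existing APIs:

/* Sketch of a "locking dict" mutation entry point: the normal dict
   operation is wrapped in a per-object lock. */
#include "Python.h"
#include "pythread.h"

typedef struct {
    PyDictObject dict;           /* ordinary dict layout first */
    PyThread_type_lock lock;     /* one lock per instance      */
} lockdictobject;

static int
lockdict_setitem(PyObject *self, PyObject *key, PyObject *value)
{
    lockdictobject *ld = (lockdictobject *)self;
    int result;

    PyThread_acquire_lock(ld->lock, WAIT_LOCK);
    result = PyDict_SetItem((PyObject *)&ld->dict, key, value);
    PyThread_release_lock(ld->lock);
    return result;
}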

It would be really interesting to run an experiment on the speed costs of LOCK INC. My guess is that the hot spot regarding refcounting will be Py_None's cache line (on multi-core CPUs with a separate cache per core).


Robert
 

robert

robert said:
Martin v. Löwis wrote: [..]

Thanks for that info. That is interesting.
Thus even on x86 this LOCK is currently not used (just
(op)->ob_refcnt++).

Reading this I got pinched: in win32ui there are in fact Py_INC/DECREFs
outside of the GIL!
And I have a severe crash problem with threaded apps - the problem shows up
only on dual cores!
That pointer will probably end a long search...

In fact it was in win32ui. For months I have searched hard for the bug, and this fancy discussion revealed it :)
 

robert

Sandra-24 said:
Why not use IronPython? It's up to date (unlike Jython), has no GIL,
and is cross-platform wherever you can get .NET or Mono (UNIX, Macs,
Windows), and you can use most any scientific libraries written for the
.NET/Mono platform (there's a lot). Take a look anyway.

-Sandra

What about speed? Is it true that IronPython is by now almost as fast as CPython?

If this is all really true, it's probably proof that putting LOCK-INC locks (on dicts, lists, mutables ...) into CPython to remove the GIL in the future should not really be as expensive as was taught in the past :)

Still, adapting to .NET libraries will be a big step. Is there really anything out there as usable as numpy/scipy? And GUI programming in IronPython ...


( The FAQs on codeplex.com and many docs are currently not readable due to site errors. What about the overall stability and number of users of IronPython as of today? )


-robert
 

robert

Shane said:
of multiple cores. I think Python only needs a nice way to share a
relatively small set of objects using shared memory. POSH goes in that
direction, but I don't think it's simple enough yet.

http://poshmodule.sourceforge.net/

Interesting - a solution possibly a little faster than pickling, but maybe only in selected situations. I have already made experiments with pickling through shared memory.

With "x = posh.share(x)" an object tree will be (deep-)copied to shared mem ( as far as objects fullfil some conditions http://poshmodule.sourceforge.net/posh/html/node6.html: is this true for numpy arrays?)
Every object to be inserted in the hot tunnel object tree has to be copied that same style. Thus ~pickling, but somewhat easier to use.

And compiling it for Windows will be quite difficult ...

Robert
 

sturlamolden

robert said:
I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Threading is not the best way to exploit multiprocessors in this
context. Threads are not the preferred way of exploiting multiple
processors in scientific computing.

Here are a few thoughts on the matter:

1. SciPy uses ATLAS/BLAS and LAPACK. You can compile these libraries
for SMPs. The same goes for FFTW, vendor optimized math kernels, etc.
If most of the CPU time is spent inside these numeric libraries, using
multi-processor versions of these libraries is a much better strategy.

2. The number of CPUs is not the only speed-limiting factor on an SMP.
Use of cache and prefetching are just as important. That can make
multi-processor aware numeric libraries a lot more efficient than
manual multi-threading.

3. One often uses cluster architectures (e.g. Beowulf) instead of SMPs
for scientific computing. MPI works on SMP and clusters. Threads only
work on SMPs.

4. Fortran compilers can recognize parallel array statements in
Fortran90/95 and exploit multiple processors on an SMP automatically.
NumPy should be able to do the same when it matures. E.g. if you make a
statement like "arr[1,::] = arr[2,::] * arr[3,::]", then this statement
could be evaluated in parallel on multiple CPUs, without any
multi-threading on your part. Since the order in which the
multiplications are performed is of no significance, the work can just
as well be spread out to multiple processors in an SMP or a cluster
(see the C sketch after this list). NumPy is still immature, but Fortran
compilers have done this for at least two decades.

5. Streaming SIMD extensions (SSE) and similar opcodes: Are you aware
that Pentium III (and newer) processors are pipe-lined to do four
floating-point operations in parallel? You could theoretically
quadruple your flops using the SSE registers, using no threading at
all. (The actual improvement is slightly less, due to some extra
book-keeping required to get the data in and out of the SSE registers.)
Again this requires modifications inside NumPy, not multi-threading.
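
To illustrate point 4 above: the elementwise statement reduces to a loop whose iterations are independent, so they can be distributed over cores. A minimal C sketch of such a row multiply using OpenMP follows; the function is illustrative only, not NumPy's internals.

/* Illustration for point 4 - not NumPy internals.  The statement
   arr[1,::] = arr[2,::] * arr[3,::] boils down to an order-independent
   loop, so it can be spread over cores, e.g. via OpenMP
   (compile with -fopenmp or an equivalent flag). */
#include <stddef.h>

void row_multiply(double *dst, const double *a, const double *b, size_t n)
{
    long i;
    #pragma omp parallel for        /* each core handles a chunk of the row */
    for (i = 0; i < (long)n; ++i)
        dst[i] = a[i] * b[i];
}
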
If not, would it be an idea to create such a thing in the Python std libs to make Python multi-processor-ready? I guess Python will always have a GIL - otherwise it would lose lots of comfort in threaded programming.

I would say that the GIL actually has very little effect on Python's
potential in high-performance numeric and scientific computing. It all
depends on the libraries, not on Python per se. Threading is for making
certain tasks more comfortable to write, not so much for computational
speed.
 

Paul Rubin

robert said:
What about speed? Is it true that IronPython is by now almost as fast as CPython?

If this is all really true, it's probably proof that putting
LOCK-INC locks (on dicts, lists, mutables ...) into CPython to remove
the GIL in the future should not really be as expensive as was taught
in the past :)

I don't think IronPython uses locks that way. AFAIK it doesn't use
reference counts, and relies on user-supplied synchronization to keep
stuff like dictionaries safe.
 

sturlamolden

sturlamolden said:
3. One often uses cluster architectures (e.g. Beowulf) instead of SMPs
for scientific computing. MPI works on SMP and clusters. Threads only
work on SMPs.

Following up on my previous post, there is a simple Python MPI wrapper
that can be used to exploit multiple processors for scientific
computing. It only works for Numeric, but an adaptation to NumPy should
be easy (there is only one small C file in the source):

http://datamining.anu.edu.au/~ole/pypar/

If you are using Windows, you can get a free implementation of MPI
here:

http://www-unix.mcs.anl.gov/mpi/mpich1/mpich-nt/
 

Martin v. Löwis

Ross said:
The problem you're describing isn't that reference counting hasn't been
made safe. What you and Joe seem to be trying to say is that atomic
increment and decrement instructions alone don't make accessing shared
structure members safe.

All I can do is to repeat Joe's words literally
"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."

Replace "by themselves" with "alone", and "reference counting"
with "reference counting in the presence of shared structure
members", and you find that your statement and Joe's are equivalent.
*Of course* reference counting is about shared structure members.
That's its primary purpose.

Regards,
Martin
 

Fernando Perez

sturlamolden said:
Following up on my previous post, there is a simple Python MPI wrapper
that can be used to exploit multiple processors for scientific
computing. It only works for Numeric, but an adaptaion to NumPy should
be easy (there is only one small C file in the source):

http://datamining.anu.edu.au/~ole/pypar/

If you are using Windows, you can get a free implementation of MPI
here:

http://www-unix.mcs.anl.gov/mpi/mpich1/mpich-nt/

It's also worth noting

http://mpi4py.scipy.org

The project just moved over to scipy hosting a few days ago, and the boxes
haven't been unpacked yet. But it is very actively developed, has full
numpy and MPI-2 support, and is looking very nice.

Cheers,

f
 

Paul Boddie

robert said:
Interesting - a solution possibly a little faster than pickling, but maybe only in selected
situations. I have already made experiments with pickling through shared memory.

What did you discover in your experiments? I'd certainly suspect that
pickling is going to add an unacceptable degree of overhead in the kind
of application you're attempting to write (using a shared data
structure with random access properties), and I'll admit that I haven't
really had to deal with that kind of situation when using my own
pickle-based parallelisation solutions (which involve communicating
processes).
With "x = posh.share(x)" an object tree will be (deep-)copied to shared mem ( as far as
objects fullfil some conditions http://poshmodule.sourceforge.net/posh/html/node6.html: is
this true for numpy arrays?)

My impression is that POSH isn't maintained any more and that work was
needed to make it portable, as you have observed. Some discussions did
occur on one of the Python development mailing lists about the
possibility of using shared memory together with serialisation
representations faster than pickles (which also don't need to be
managed as live objects by Python), and I think that this could be an
acceptable alternative.
Every object to be inserted into the hot tunnel object tree has to be copied in that same style.
Thus it is ~pickling, but somewhat easier to use.

If my understanding of "hot tunnel object tree" is correct, you're
really wanting fast access involving mutual exclusion to some shared
data structure. At this point it becomes worth considering some kind of
distributed object technology (or even a database technology) where you
have to initialise the data structure and then communicate with an
object (or database) to perform operations on it, all for reasons I'll
explain shortly.

In your ideal situation, you say that you'd have the data structure in
the same address space as a number of threads, and each thread would be
able to perform some algorithm on the data structure, but the pattern
of accessing the structure isn't an insignificant concern even where
you can assume that the overheads are generally low: reading and
writing things might be fast in the same address space, but if the
nature of access involves lots of locking, you'll incur quite a penalty
anyway. In other words, if each read or write to the structure involves
acquiring a lock for that operation in isolation, this could
significantly diminish performance, whereas if you can guarantee that
the granularity of locking is more coarse than having to occur upon
each read or write access - that there exist some high-level operations
that require consistency within the data structure - then reasonable
performance might be maintained.
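
To make the contrast concrete, a minimal sketch with POSIX threads (the data and function names are invented for illustration): a coarse operation pays one lock round-trip, while per-access locking pays one per element.

/* Coarse vs. fine-grained locking, sketched with POSIX threads. */
#include <pthread.h>

#define N 1000000
static double data[N];
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void update_coarse(void)        /* one lock/unlock per high-level operation */
{
    int i;
    pthread_mutex_lock(&m);
    for (i = 0; i < N; ++i)
        data[i] += 1.0;
    pthread_mutex_unlock(&m);
}

void update_fine(void)          /* one lock/unlock per element access */
{
    int i;
    for (i = 0; i < N; ++i) {
        pthread_mutex_lock(&m);
        data[i] += 1.0;
        pthread_mutex_unlock(&m);
    }
}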

However, if the pattern of concurrent access is restricted to coarse
operations, where some entity has exclusive access to a potentially
large dataset, and where the overhead of the communication of inputs
and outputs to and from that entity is low in comparison to the cost of
performing such coarse operations, and where such operations are
themselves performed infrequently, then such a pattern coincides with
classic database or distributed object scenario. In other words, you
implement the operations acting on the data structure in a distributed
object (or as a database query or operation) and then invoke such
operations from separate processes.

I hope this makes some sense. Generally, demands for high concurrent
performance using threads often ignore other important properties such
as reliability and scalability. Certainly, threads have an important
place - classically, this involved maintaining responsiveness in
graphical user interfaces - but even then, the background threads were
often detached and not competing for the same resources as the
foreground threads servicing the event loop. The only kinds of
situation I can think of right now which might benefit from uninhibited
random access to shared data structures might be things like
simulations or large-scale multi-user games, where an appropriate data
architecture cannot be decided in advance, but even then there are
approaches which attempt to mitigate the seemingly unpredictable nature
of access to shared data.

Paul
 

Robin Becker

Paul said:
My impression is that POSH isn't maintained any more and that work was
needed to make it portable, as you have observed. Some discussions did
occur on one of the Python development mailing lists about the
possibility of using shared memory together with serialisation
representations faster than pickles (which also don't need to be
managed as live objects by Python), and I think that this could be an
acceptable alternative.
........

Your impression is correct; when I contacted the author a while ago he
maintained that it was a proof of concept only. When I compiled this today with
python-2.4.2 the example fails with this rather cryptic message:

~/devel/posh/examples:
$ python Matrix.py
Traceback (most recent call last):
  File "Matrix.py", line 6, in ?
    import posh
  File "/usr/local/lib/python2.4/site-packages/posh/__init__.py", line 14, in ?
    import _core
RuntimeError: semaphore set creation failed

but that's most likely something to do with not allocating a semaphore of the
right sort or something.

I think it uses sysv semaphores and although freeBSD 6 has them perhaps there's
something I need to do to allow them to work.
 

robert

Paul said:
I don't think IronPython uses locks that way. AFAIK it doesn't use
reference counts, and relies on user-supplied synchronization to keep
stuff like dictionaries safe.

Thus there would be a crash if 2 threads use the global variables (module.__dict__) of a module?
I cannot imagine that .. it would break most of the reasons for nice Python threading.

-robert
 

robert

sturlamolden said:
Threading is not the best way to exploit multiprocessors in this
context. Threads are not the preferred way of exploiting multiple
processors in scientific computing.


You keep coming back to the simple bulk jobs in scientific computing where a file batch interface is enough:
process X works on input file Y and writes output to file Z. A manager script collects the results and creates a gnuplot diagram.
Such things of course would not raise the question of this discussion - not even the question of threads at all, or of MPI ...

-robert
 

Martin v. Löwis

robert said:
thus there would be a crash if 2 threads use the global variables
(module.__dict__) of a module?

IronPython uses the .NET virtual machine. That, in itself, gives
consistency guarantees. Read the ECMA IL specification to find
out what the consistency model is.

Regards,
Martin
 

robert

Paul said:
What did you discover in your experiments? I'd certainly suspect that
pickling is going to add an unacceptable degree of overhead in the kind
of application you're attempting to write (using a shared data
structure with random access properties), and I'll admit that I haven't
really had to deal with that kind of situation when using my own
pickle-based parallelisation solutions (which involve communicating
processes).


My impression is that POSH isn't maintained any more and that work was
needed to make it portable, as you have observed. Some discussions did
occur on one of the Python development mailing lists about the
possibility of using shared memory together with serialisation
representations faster than pickles (which also don't need to be
managed as live objects by Python), and I think that this could be an
acceptable alternative.


If my understanding of "hot tunnel object tree" is correct, you're
really wanting fast access involving mutual exclusion to some shared
data structure. At this point it becomes worth considering some kind of
distributed object technology (or even a database technology) where you
have to initialise the data structure and then communicate with an
object (or database) to perform operations on it, all for reasons I'll
explain shortly.

In your ideal situation, you say that you'd have the data structure in
the same address space as a number of threads, and each thread would be
able to perform some algorithm on the data structure, but the pattern
of accessing the structure isn't an insignificant concern even where
you can assume that the overheads are generally low: reading and
writing things might be fast in the same address space, but if the
nature of access involves lots of locking, you'll incur quite a penalty
anyway. In other words, if each read or write to the structure involves
acquiring a lock for that operation in isolation, this could
significantly diminish performance, whereas if you can guarantee that
the granularity of locking is more coarse than having to occur upon
each read or write access - that there exist some high-level operations
that require consistency within the data structure - then reasonable
performance might be maintained.

However, if the pattern of concurrent access is restricted to coarse
operations, where some entity has exclusive access to a potentially
large dataset, and where the overhead of the communication of inputs
and outputs to and from that entity is low in comparison to the cost of
performing such coarse operations, and where such operations are
themselves performed infrequently, then such a pattern coincides with
classic database or distributed object scenario. In other words, you
implement the operations acting on the data structure in a distributed
object (or as a database query or operation) and then invoke such
operations from separate processes.

I hope this makes some sense. Generally, demands for high concurrent
performance using threads often ignore other important properties such
as reliability and scalability. Certainly, threads have an important
place - classically, this involved maintaining responsiveness in
graphical user interfaces - but even then, the background threads were
often detached and not competing for the same resources as the
foreground threads servicing the event loop. The only kinds of
situation I can think of right now which might benefit from uninhibited
random access to shared data structures might be things like
simulations or large-scale multi-user games, where an appropriate data
architecture cannot be decided in advance, but even then there are
approaches which attempt to mitigate the seemingly unpredictable nature
of access to shared data.

Some of these thoughts make sense when one wants to trick the single GIL a little.
But the length of the text tells us again that we should possibly at least make refcounting thread-safe (within access restrictions) - or get rid of refcounting - or do it and go the full path towards a GIL-free Python.

Besides some programming costs, the speed cost would be that of a few LOCK INCs. Rumor says not much - and almost nothing when not using SMP threading. I'd like to see experimental data regarding the speed; I have found nothing on the web so far.

And, optimistically, all lock-free basic data structures (obmalloc's freelists, dicts, queues, ...) could be built up for further optimizations:
http://64.233.183.104/search?q=cach...SMP+"LOCK+INC"+speed&hl=de&gl=de&ct=clnk&cd=2


Little test:

--------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

volatile int x = 0;
clock_t c0, c1;
time_t t0, t1;

int main(int argc, char **argv) {
    int sum = 0, i, count = 1000000000;
    t0 = time(&t0);
    c0 = clock();
    for (i = 0; i < count; ++i) {
        __asm lock inc x;
        // __asm inc x;
        sum += x;
    }
    c1 = clock();
    t1 = time(&t1);
    // printf("%d\n", sum);
    printf("clocks: %d\n", c1 - c0);
    printf("secs: %d\n", t1 - t0);
    return sum;
}


--------------
0040101F mov eax,3B9ACA00h
13: for (i = 0; i < count; ++i) {
14: __asm lock inc x;
00401024 lock inc dword ptr [_x (00408a00)]
15: sum += x;
0040102B mov edx,dword ptr [_x (00408a00)]
00401031 add esi,edx
00401033 dec eax
00401034 jne main+24h (00401024)
16: }
---------------

results on a UP PIII :

INC version:
clocks: 7520
secs: 7

LOCK INC version:
clocks: 36632
secs: 36


It's probably not much...

-robert
 

Andrew MacIntyre

Robin said:
I think it uses sysv semaphores and although freeBSD 6 has them perhaps there's
something I need to do to allow them to work.

IIRC, you need to explicitly configure loading the kernel module, or
compile the kernel with the necessary option in the config file.

--
 

Robin Becker

Andrew said:
IIRC, you need to explicitly configure loading the kernel module, or
compile the kernel with the necessary option in the config file.
I tried loading the module, but it seems already to be in the kernel.
Probably some sysctl stuff to do.
 

robert

robert said:
--------------
0040101F mov eax,3B9ACA00h
13: for (i = 0; i < count; ++i) {
14: __asm lock inc x;
00401024 lock inc dword ptr [_x (00408a00)]
15: sum += x;
0040102B mov edx,dword ptr [_x (00408a00)]
00401031 add esi,edx
00401033 dec eax
00401034 jne main+24h (00401024)
16: }
---------------

results on a UP PIII :

INC version:
clocks: 7520
secs: 7

LOCK INC version:
clocks: 36632
secs: 36


It's probably not much...

The Intel CPUs I checked all show about this factor of ~5 in that simple example.

AMDs were typically faster: less than ~3.


-robert

PS:
the asm for x86 linux/gcc:
-------------------

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

volatile int x = 0;
clock_t c0, c1;
time_t t0, t1;

int main(int argc, char **argv) {
    int sum = 0, i, count = 100000000;
    t0 = time(&t0);
    c0 = clock();
    for (i = 0; i < count; ++i) {
        /* plain INC; use "lock incl %0" for the LOCK INC variant */
        asm ("incl %0;" : "+m"(x));
        sum += x;
    }
    c1 = clock();
    t1 = time(&t1);
    // printf("%d\n", sum);
    printf("clocks: %d\n", c1 - c0);
    printf("secs: %d\n", t1 - t0);
    return sum;
}
 
