Exploiting Dual Cores with Py_NewInterpreter's separated GIL?

Ross Ridge

Paul said:
I dunno about x86 hardware signals but these instructions do
read-modify-write operations. That means there has to be enough
interlocking to prevent two CPUs from updating the same memory
location simultaneously, which means the CPUs have to communicate.
See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is
how current x86s do it):

The MESI protocol is used to ensure all read and write operations are
cache coherent, not just locked read-modify-write operations. Whether
the LOCK prefix is used or not in an instruction normally isn't
externally visible. From the IA-32 Intel® Architecture Software
Developer's Manual Volume 3A: System Programming Guide, Part 1:

For the Pentium 4, Intel Xeon, and P6 family processors,
if the area of memory being locked during a LOCK operation
is cached in the processor that is performing the LOCK operation
as write-back memory and is completely contained in a cache line,
the processor may not assert the LOCK# signal on the bus. Instead,
it will modify the memory location internally and allow its
cache coherency mechanism to ensure that the operation is carried
out atomically. This operation is called "cache locking."
The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from
simultaneously modifying data in that area.

The cost of maintaining cache coherency for a locked increment
instruction should be no different than that of an unlocked increment
instruction.
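
To make that concrete, here is a minimal C sketch (assuming GCC's
__sync builtins; both functions pay the same cache-coherency costs,
and only the atomicity of the read-modify-write differs):

#include <stdio.h>

static int refcnt = 1;

void incref_atomic(void)
{
    /* Atomic read-modify-write; on x86 this compiles to a
       LOCK-prefixed XADD/ADD.  With the value cached write-back and
       contained in one cache line, "cache locking" applies and no
       LOCK# bus signal is asserted. */
    __sync_fetch_and_add(&refcnt, 1);
}

void incref_plain(void)
{
    /* Plain INC/ADD: each load and store is still cache coherent,
       but the read-modify-write as a whole is not atomic; two CPUs
       can both read 1 and both store 2. */
    refcnt++;
}

int main(void)
{
    incref_atomic();
    incref_plain();
    printf("refcnt = %d\n", refcnt);
    return 0;
}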

Ross Ridge
 
"Martin v. Löwis"

Paul said:
I dunno about x86 hardware signals but these instructions do
read-modify-write operations. That means there has to be enough
interlocking to prevent two CPUs from updating the same memory
location simultaneously, which means the CPUs have to communicate.
See <http://en.wikipedia.org/wiki/MESI_protocol> (I think this is
how current x86s do it):

Ah, but in the case where the lock# signal is used, it's known that
the data is not in the cache of the CPU performing the lock operation;
I believe it is also known that the data is not in the cache of any
other CPU. So the CPU performing the LOCK INC sequence just has
to perform two memory cycles. No cache coherency protocol runs
in that case.

But even when caching is involved, I don't see why there should be
more than three memory cycles. The MESI "protocol" really consists
only of two pins: HIT# and HITM#; the actual implementation is just
in keeping the state for each cache line, and in snooping. The
CPUs don't really "communicate". Instead, if one processor tries
to fill a cache line, the others snoop the read, and assert either
HIT# (when they have not modified their cache lines) or HITM#
(when they have modified their cache lines). Assertion of
these signals is also immediate.

The worst case would be that one processor performs a LOCK INC,
and another processor has the modified value in its cache line.
So it needs to first flush the cache line, before the other
processor can modify the memory. If the memory is not cached
yet in another processor, the MESI protocol does not involve
additional penalties.

Regards,
Martin
 
Paul Rubin

Martin v. Löwis said:
Ah, but in the case where the lock# signal is used, it's known that
the data is not in the cache of the CPU performing the lock operation;
I believe it is also known that the data is not in the cache of any
other CPU. So the CPU performing the LOCK INC sequence just has
to perform two memory cycles. No cache coherency protocol runs
in that case.

How can any CPU know in advance that the data is not in the cache of
some other CPU?

Martin said:
But even when caching is involved, I don't see why there should be
more than three memory cycles. The MESI "protocol" really consists
only of two pins: HIT# and HITM#; the actual implementation is just
in keeping the state for each cache line, and in snooping. The
CPUs don't really "communicate". Instead, if one processor tries
to fill a cache line, the others snoop the read, and assert either
HIT# (when they have not modified their cache lines) or HITM#
(when they have modified their cache lines). Assertion of
these signals is also immediate.

OK, this is logical, but it already implies a cache miss, which costs
many dozens (100?) of cycles. But this case may be uncommon, since one
hopes that cache misses are relatively rare.

Martin said:
The worst case would be that one processor performs a LOCK INC,
and another processor has the modified value in its cache line.
So it needs to first flush the cache line, before the other
processor can modify the memory. If the memory is not cached
yet in another processor, the MESI protocol does not involve
additional penalties.

I think for Python refcounts this case must occur quite frequently
since there are many Python objects (small integers, None, etc.)
whose refcounts get modified extremely often.

IIRC, the SPJ paper that I linked claims that lock-free protocols
outperform traditional lock-based ones even with just two processors.
But maybe things are better with a dual core processor (shared cache)
than with two separate packages.
 
Ross Ridge

Martin v. Löwis said:
Ah, but in the case where the lock# signal is used, it's known that
the data is not in the cache of the CPU performing the lock operation;
I believe it is also known that the data is not in the cache of any
other CPU. So the CPU performing the LOCK INC sequence just has
to perform two memory cycles. No cache coherency protocol runs
in that case.

Paul said:
How can any CPU know in advance that the data is not in the cache of
some other CPU?

In the case where the LOCK# signal is asserted the area of memory
accessed is marked as being uncachable. In a SMP system all CPUs must
have the same mapping of cached and uncached memory or things like this
break. In the case where the LOCK# signal isn't used, the MESI
protocol informs the CPU of which of its cache lines might also be in
the cache of another CPU.

Paul said:
OK, this is logical, but it already implies a cache miss, which costs
many dozens (100?) of cycles. But this case may be uncommon, since one
hopes that cache misses are relatively rare.

The cost of the cache miss is the same whether the increment
instruction is locked or not.

Ross Ridge
 
Joe Seigh

Paul said:
Generally speaking, no, the inc/dec instructions are not atomic. You
can do an atomic increment on the x86 using LOCK XCHG (or maybe LOCK
INC is possible). The thing is that the locking protocol that
guarantees atomicity is very expensive, like 100x as expensive as an
unlocked instruction on a big multiprocessor. So yes, of course you
could accomplish reference counting through locks around the ref
counts, but performance suffers terribly. The solution is to get rid
of the ref counts and manage the entire heap using garbage collection.

Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe. It's what Boost::shared_ptr
uses to make itself internally thread-safe but it's not safe to copy
a shared_ptr without "owning" it. Not like Java where you can safely
copy a reference (Java pointer) no matter what.

Basically there's a race condition where an object containing the
refcount can be deleted between the time you load a pointer to
the object and the time you increment what used to be a refcount
and is possibly something else but definitely undefined.
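
A minimal C sketch of that race (the object layout here is
illustrative, not CPython's):

#include <stdlib.h>

typedef struct {
    int ob_refcnt;
    /* ... payload ... */
} object_t;

object_t *shared;                 /* pointer visible to both threads */

/* Thread A: tries to take a new reference. */
void thread_a(void)
{
    object_t *p = shared;                    /* (1) load the pointer */
    /* window: thread B can run here, drop the last reference,
       and free the object */
    __sync_fetch_and_add(&p->ob_refcnt, 1);  /* (2) atomic increment */
    /* if B freed p in the window, (2) incremented freed memory:
       what used to be a refcount and is now undefined */
}

/* Thread B: drops what happens to be the last reference. */
void thread_b(void)
{
    object_t *p = shared;
    shared = NULL;
    if (__sync_sub_and_fetch(&p->ob_refcnt, 1) == 0)
        free(p);
}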

I have an experimental refcounting implementation at
http://atomic-ptr-plus.sourceforge.net/

Paul said:
For stuff like dictionary access, there are protocols (again based on
LOCK XCHG) that don't require locking for lookups. Only updates
require locking. Simon Peyton-Jones has a good paper about how it's
done in Concurrent Haskell:

http://research.microsoft.com/~simonpj/papers/stm/stm.pdf

This is really cool stuff and has found its way into Perl 6. I'd like
to see Python get something like it.

That seems to be about STM (Software Transactional Memory). What you're
describing seems to be read lock-free, using what I call PDR
(PCOW (Partial Copy On Write) Deferred Reclamation). Examples of PDR
are RCU (used in the Linux kernel), Maged Michael's SMR hazard pointers,
and thread-safe GC (used in Java's concurrent collections, java.util.concurrent).

You can also use PDR to make refcounting safe, e.g. RCU-based
refcounting (rcuref) in the Linux kernel.
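
Roughly, the rcuref pattern looks like this in Linux-kernel-style C
(a sketch: rcu_read_lock(), rcu_dereference(), atomic_inc_not_zero(),
atomic_dec_and_test() and call_rcu() are real kernel primitives, but
the surrounding structure is illustrative and header names vary by
kernel version):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj {
    atomic_t        refcnt;
    struct rcu_head rcu;
    /* ... payload ... */
};

struct obj *global_ptr;        /* published with rcu_assign_pointer() */

static void obj_free_rcu(struct rcu_head *head)
{
    kfree(container_of(head, struct obj, rcu));
}

/* Reader: take a counted reference without any lock. */
struct obj *obj_get(void)
{
    struct obj *p;

    rcu_read_lock();           /* reclamation deferred from here on */
    p = rcu_dereference(global_ptr);
    if (p && !atomic_inc_not_zero(&p->refcnt))
        p = NULL;              /* count already hit zero; lost the race */
    rcu_read_unlock();
    return p;                  /* caller now owns a reference */
}

/* Writer: drop a reference; free via RCU so a reader sitting between
   rcu_dereference() and atomic_inc_not_zero() never touches freed
   memory. */
void obj_put(struct obj *p)
{
    if (atomic_dec_and_test(&p->refcnt))
        call_rcu(&p->rcu, obj_free_rcu);  /* free after a grace period */
}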
 
"Martin v. Löwis"

Paul said:
How can any CPU know in advance that the data is not in the cache of
some other CPU?

AFAIU, the LOCK# line, in P6, is only used for memory regions that
are marked non-cacheable. The CPU determines a memory region to be
non-cacheable if either the memory type range register (MTRR) says it's
non-cacheable, or if the cache-disable bit in the page table is set.

If a certain location is known to be modified a lot from different
CPUs (e.g. a spin lock), it might be best if the operating system
sets the page where this location lives as non-cacheable - it might
be faster to always modify it through main memory than
to have cache coherency deal with it.

Paul said:
OK, this is logical, but it already implies a cache miss, which costs
many dozens (100?) of cycles. But this case may be uncommon, since one
hopes that cache misses are relatively rare.

Right - I'm completely uncertain as to what the cost of a cache miss
is, in terms of internal cycles. E.g. how many memory cycles does
the CPU have to perform to fill a cache line? And what is the ratio
between memory cycles and CPU cycles?

Paul said:
IIRC, the SPJ paper that I linked claims that lock-free protocols
outperform traditional lock-based ones even with just two processors.
But maybe things are better with a dual core processor (shared cache)
than with two separate packages.

Likely, yes. Although different dual-core designs are out there, some
of them use two individual caches rather than a shared one.

Regards,
Martin
 
Ross Ridge

Joe said:
Basically there's a race condition where an object containing the
refcount can be deleted between the time you load a pointer to
the object and the time you increment what used to be a refcount
and is possibly something else but definitely undefined.

That doesn't really make sense. The object can't be deleted because
the thread should already have a reference (directly or indirectly) to
the object, otherwise any access to it can cause the race condition you
describe.

Ross Ridge
 
Joe Seigh

Ross said:
That doesn't really make sense. The object can't be deleted because
the thread should already have a reference (directly or indirectly) to
the object, otherwise any access to it can cause the race condition you
describe.

True, but if the thread didn't already have a reference, how would you get
that initial reference to a shared object without a lock?
 
Ross Ridge

Ross said:
That doesn't really make sense. The object can't be deleted because
the thread should already have a reference (directly or indirectly) to
the object, otherwise any access to it can cause the race condition you
describe.

Joe said:
True, but if the thread didn't already have a reference, how would you get
that initial reference to a shared object without a lock?

The thread that shares it increments the reference count before passing
its address directly to another thread or indirectly through a shared
container.

Ross Ridge
 
"Martin v. Löwis"

Ross said:
The thread that shares it increments the reference count before passing
its address directly to another thread or indirectly through a shared
container.

To make a specific example, consider this fragment from
Objects/fileobject.c:

static PyObject *
file_repr(PyFileObject *f)
{
    if (PyUnicode_Check(f->f_name)) {
    ....

Now, assume there wasn't a GIL protecting this all, and also
assume f_name was a mutable member (which it currently isn't).

Then, this access wouldn't be thread-safe: This code roughly
translates to

reg_x = f->f_name
push reg_x
call PyUnicode_Check (assuming this was a function and
not a macro)

Meanwhile, another thread might perform

old = f->f_name;
f->f_name = new;
Py_DECREF(old);

i.e. change the file name. Now, it might be that they
interleave this way:

reg_x = f->f_name
old = f->f_name
f->f_name = new
Py_DECREF(old)
push reg_x
call PyUnicode_Check

which would now operate on a deallocated object.

To fix this, one might think that we need

Py_INCREF(f->f_name)
if (PyUnicode_Check(f->f_name))
...
Py_DECREF(f->f_name)

However, this would not help, because the
first incref translates to

reg_x = f->f_name
LOCK INC reg_x->ob_refcnt

which again leaves a race condition where the
INCREF operation comes "too late".

How would you propose to fix file_repr to prevent such
a race condition?

Regards,
Martin
 
Ross Ridge

Martin said:
How would you propose to fix file_repr to prevent such
a race condition?

The race condition you describe is different from the one Joe Seigh
described. It's caused because, without the GIL, access to the file
object is no longer thread-safe, not because reference counting isn't
thread-safe.

Ross Ridge
 
"Martin v. Löwis"

Ross said:
The race condition you describe is different from the one Joe Seigh
described. It's caused because, without the GIL, access to the file
object is no longer thread-safe, not because reference counting isn't
thread-safe.

Joe's claim (quoting him) was: "Atomic increment and decrement
instructions are not by themselves sufficient to make reference
counting safe."

Rephrased in Python terms: "if the GIL is dropped and
incref/decref are made atomic, the reference counting is not
automatically thread-safe". This is what my example demonstrates:
Drop the GIL, and assume an atomic inc/dec, and it won't be
thread-safe.

It's not that the file object becomes thread-unsafe. Instead,
access to the f_name property won't work anymore (nor would
access to any other PyObject* field).

You still didn't say what you would suggest to make it thread-safe
again; most likely, your proposal would be to add locking. If I
understand Joe's approach correctly, he has a solution that does
not involve locking (although I don't understand how it works).

Regards,
Martin
 
Joe Seigh

Martin said:
You still didn't say what you would suggest to make it thread-safe
again; most likely, your proposal would be to add locking. If I
understand Joe's approach correctly, he has a solution that does
not involve locking (although I don't understand how it works).

Sun had applied for a patent on it. You can go to the
uspto search page here http://www.uspto.gov/patft/index.html
and look for

20060218561 Code preparation technique employing lock-free pointer operations
20060037026 Lightweight reference counting using single-target synchronization

Click on the images link on the patent application; the illustrations
there probably show the concepts better than the text.

The first one above is actually a continuation patent on three different
techniques: one using double-wide compare-and-swap; one using ROP (Repeat
Offender Problem), a form of PDR; and one using DCAS (compare-and-swap
of two separate locations), which only exists on the MC68020 and MC68030
processors.
 
Chaz Ginger

Joe said:
Sun had applied for a patent on it. You can go to the
uspto search page here http://www.uspto.gov/patft/index.html
and look for

20060218561 Code preparation technique employing lock-free pointer operations
20060037026 Lightweight reference counting using single-target synchronization

Click on the images link on the patent application; the illustrations
there probably show the concepts better than the text.

The first one above is actually a continuation patent on three different
techniques: one using double-wide compare-and-swap; one using ROP (Repeat
Offender Problem), a form of PDR; and one using DCAS (compare-and-swap
of two separate locations), which only exists on the MC68020 and MC68030
processors.

Check out the work in the '80s from the NYU Ultra project. They did a
great deal of work on using atomic incr/decr for all sorts of algorithms
to get around locking on parallel processors.

Chaz.
 
Shane Hathaway

robert said:
I'd like to use multiple CPU cores for selected time consuming Python
computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I
thought about simply using more Python interpreter instances
(Py_NewInterpreter), each with its own GIL, in the same process. I expect to
be able to directly push around Python Object-Trees between the 2 (or
more) interpreters by doing some careful locking.

Any hope to come through? If possible, what are the main dangers? Is
there an example / guideline around for that task? - using ctypes or
so.

Or is there even a ready-made Python module which makes it easy to
set up and deal with extra interpreter instances? If not, would it be
an idea to create such a thing in the Python std libs to make Python
multi-processor-ready? I guess Python will always have a GIL -
otherwise it would lose lots of comfort in threaded programming.

I'd like to mention mod_python, which creates multiple interpreters
inside Apache. It does this transparently and works well. There is no
apparent lock contention between the interpreters, because no Python
objects are shared.

Also, I'd like to make the observation that it is not always necessary
to share the entire interpreter (a la threads) in order to take advantage
of multiple cores. I think Python only needs a nice way to share a
relatively small set of objects using shared memory. POSH goes in that
direction, but I don't think it's simple enough yet.

http://modpython.org/
http://poshmodule.sourceforge.net/
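
For anyone who hasn't used the API robert mentions, a minimal
embedding sketch (no error handling; note that sub-interpreters
created this way still share the one GIL, which is exactly what
robert would like to change):

#include <Python.h>

int main(void)
{
    PyThreadState *main_ts, *sub_ts;

    Py_Initialize();                  /* creates the main interpreter */
    main_ts = PyThreadState_Get();

    sub_ts = Py_NewInterpreter();     /* fresh interpreter: its own
                                         modules and sys.path, but it
                                         shares the single GIL */
    PyRun_SimpleString("import sys; print sys.prefix");

    Py_EndInterpreter(sub_ts);        /* sub_ts must be current here */
    PyThreadState_Swap(main_ts);      /* back to the main interpreter */

    Py_Finalize();
    return 0;
}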

Shane
 
Ross Ridge

Martin said:
How would you propose to fix file_repr to prevent such
a race condition?

Ross said:
The race condition you describe is different from the one Joe Seigh
described. It's caused because, without the GIL, access to the file
object is no longer thread-safe, not because reference counting isn't
thread-safe.

Martin said:
Joe's claim (quoting him) was: "Atomic increment and decrement
instructions are not by themselves sufficient to make reference
counting safe."

So give an example where reference counting is unsafe.

Remember that I only said that Joe Seigh's statement didn't make sense,
not that I had some magic solution that would somehow make Python
thread safe without locking.

Ross Ridge
 
grahamd

Shane said:
I'd like to mention mod_python, which creates multiple interpreters
inside Apache. It does this transparently and works well. There is no
apparent lock contention between the interpreters, because no Python
objects are shared.

In relation to mod_python, your statement isn't strictly true. One can
create scenarios in mod_python where a Python object created in the
context of one interpreter is then used in another interpreter at a
later point. Generally this isn't an issue because it is part of a
recursive sequence of processing and thus while the second interpreter
is doing stuff with the object the first is stalled waiting for the
second to return. It wouldn't be totally inconceivable though for a
user to create threads which may operate on the object at the same time
in the first interpreter as the object is being used in the second
interpreter.

The situation gets even more complicated when you start to use third
party modules which may be used from multiple interpreters and a
multithreaded Apache MPM is used and where the third party module
performs callbacks into interpreters with some common object.

Thus, mod_python may not be able to benefit from each Python
interpreter instance having its own GIL if that was your point in
referencing mod_python.

Graham
 
Sandra-24

robert said:
I'd like to use multiple CPU cores for selected time consuming Python computations (incl. numpy/scipy) in a frictionless manner.

Interprocess communication is tedious and out of the question, so I thought about simply using more Python interpreter instances (Py_NewInterpreter), each with its own GIL, in the same process.

Why not use IronPython? It's up to date (unlike Jython), has no GIL,
and is cross-platform wherever you can get .NET or Mono (Unix, Macs,
Windows), and you can use almost any scientific library written for the
.NET/Mono platform (there are a lot). Take a look anyway.

-Sandra
 
"Martin v. Löwis"

Ross said:
So give an example where reference counting is unsafe.

Nobody claimed that in this thread. Instead, the claim was
"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."

I did give an example, in <[email protected]>.
Even though f_name is reference-counted, it might happen that you get a
dangling pointer. More specifically, even if file_repr *incremented* the
reference counter for f_name, it might still be a dangling pointer.

To quote Joe again:

# Basically there's a race condition where an object containing the
# refcount can be deleted between the time you load a pointer to
# the object and the time you increment what used to be a refcount
# and is possibly something else but definitely undefined.

I don't understand why you can't see that this statement is factually
correct.

Regards,
Martin
 
Ross Ridge

Ross said:
So give an example where reference counting is unsafe.

Martin said:
Nobody claimed that in this thread. Instead, the claim was
"Atomic increment and decrement instructions are not by themselves
sufficient to make reference counting safe."

So give an example of where atomic increment and decrement instructions
are not by themselves sufficient to make reference counting safe.

Martin said:
I did give an example, in <[email protected]>.
Even though f_name is reference-counted, it might happen that you get a
dangling pointer.

Your example is of how access to the "f_name" member is unsafe, not of
how reference counting is unsafe. The same sort of race condition
can occur without reference counting being involved at all. Consider the
"f_fp" member: if one thread tries to use "fprintf()" on it while
another thread calls "fclose()", then you can have the same problem. The
race condition here doesn't happen because reference counting hasn't
been made safe, nor does it happen because stdio isn't thread-safe. It
happens because accessing "f_fp" (without the GIL) is unsafe.
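
Concretely, that race involves no reference counts at all (a fragment,
assuming Python 2.x's PyFileObject layout):

/* Thread 1: uses the FILE * member ...            */
fprintf(f->f_fp, "hello\n");

/* Thread 2: ... while this thread invalidates it. */
fclose(f->f_fp);
f->f_fp = NULL;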

The problem you're describing isn't that reference counting hasn't been
made safe. What you and Joe seem to be trying to say is that atomic
increment and decrement instructions alone don't make accessing shared
structure members safe.

Ross Ridge
 
