what's the point of rpython?


Ross Ridge

Carl Banks said:
I just looked at the boost documentation, which claims that multiple
asynchronous writes to the same shared_ptr results in undefined
behavior. That will not suffice for Python reference counting.

If you read the Boost documentation you'll see that while multiple
simultaneous writes to a shared_ptr *instance* aren't supported, multiple
simultaneous updates of the reference counter used by the shared_ptr
implementation are supported. Example #1 in the URL that Paul Rubin
gave demonstrates this. The allowed simultaneous assignment of the two
shared_ptr instances "p2" and "p3" from the same shared_ptr instance
"p" requires that a single reference count be incremented simultaneously.

On the other hand, the documentation also notes that the implementation
is only lock free on x86, IA-64, and PowerPC systems, though I think it
would also be possible on MIPS CPUs.

Ross Ridge
 

Rhamphoryncus

So I kind of wanted to ask this question on the pypy mailing list...
but there's only a pypy-dev list, and I don't want to put noise on the
dev list.

What's the point of RPython? By this, I don't mean "What is RPython?"
I get that. I mean, why?

There are some distinct benefits of RPython:

* starts up as real python code, letting you define globals that later
get "snapshotted" into static code
* can be executed using a real python interpreter, getting much more
reliable debugging (but broken rpython code might run fine under
python)
* provides a clear avenue for extension, by adding more python
features

You might argue just having a python syntax is also a benefit, but the
semantics are so different as to more than counteract it. Add in the
cost of implementing your own compiler... yeah.
 

Rhamphoryncus

Maybe I'm missing something here but a lock free algorithm for
reference counting seems pretty trivial. As long as you can atomically
increment and decrement an integer without locking you are pretty much
done.

"lock free" is largely meaningless. What it really means is "we use
small hardware locks rather than big software locks, thereby reducing
(but not eliminating!) the contention".

Atomic refcounting is easy. If done sparingly to minimize contention
it works great. Python uses refcounting massively, with heavily
contended objects (think about your builtin types, constants, globals,
classes, etc.), so it does not perform acceptably for Python.
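
For reference, the "trivial" scheme being weighed here looks roughly
like this in C++ (a sketch using std::atomic; the Object type is made
up for illustration):

    // Sketch of plain atomic refcounting. fetch_add/fetch_sub compile
    // to LOCK-prefixed instructions on x86; the per-update cost of that
    // prefix on hot, shared objects is exactly what's at issue.
    #include <atomic>

    struct Object {
        std::atomic<long> refcnt{1};
    };

    void incref(Object* o) {
        // A pure bump needs no ordering beyond atomicity.
        o->refcnt.fetch_add(1, std::memory_order_relaxed);
    }

    void decref(Object* o) {
        // The thread that drops the count to zero frees the object;
        // acq_rel makes prior writes visible before the delete.
        if (o->refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete o;
    }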

The second issue is the objects themselves, like a list which is
mutable. If you're using it in a single thread or writing from
multiple threads, this is a non-trivial constant cost. If your object
is not modified after creation and is read from many threads, a lock
would be a point of contention, preventing you from scaling freely.
The dicts used by classes and globals are an important example of this,
and a successful implementation needs something non-contending. I
assume Jython and IronPython do this.
 

Paul Rubin

Brendan Miller said:
I think you are misreading that. It says that multiple assignments to
different copies of a share_ptr in different threads are fine.

I'll respond to this later.
http://www.codemaestro.com/reviews/8
http://siyobik.info/index.php?module=x86&id=159

Again, I'm not an assembly guru, but this sounds like exactly what
you'd want. It gains exclusive access to system memory in a
multi-processor environment without leaving user space. Thus XADD is
an atomic increment/decrement. It would be educational if someone more
familiar with x86 than me could speak to the performance merits of this
on modern multicore machines.

Those links describe using the LOCK prefix, which as the name implies,
asserts a lock, so it is no longer "lockless reference counting". The
LOCK prefix adds about 100 cycles to the instruction.
 

Paul Rubin

Rhamphoryncus said:
"lock free" is largely meaningless. What it really means is "we use
small hardware locks rather than big software locks, thereby reducing
(but not eliminating!) the contention".

At least in the case of Haskell's "software transactional memory",
reads are genuinely lock free in the normal uncontended case. You use
LOCK XCHG to increment a counter in the object when you want to
update, and increment it again when you are done updating. Since the
counter starts at 0, any time it has an odd value, an update is in
progress and any other thread can notice that the object is locked and
try again later without asserting any locks itself. If the counter
has an even value, it is unlocked, but there is a chance that a writer
thread can lock and update the object while a reader thread has a
lockless access in progress. The reader detects this has occurred
when it finishes its transaction, reads the counter a second time, and
notices that the counter has been incremented (maybe more than once)
since the transaction started; it then rolls back its operation and
tries again. I'm not explaining that well, so you can read about it in
SPJ's paper or Keir Fraser's.
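
That counter protocol is essentially what's elsewhere called a seqlock.
A simplified C++ sketch, assuming writers are already serialized among
themselves and keeping the payload atomic to sidestep data-race
subtleties:

    #include <atomic>

    struct SeqLocked {
        std::atomic<unsigned> seq{0};  // even = unlocked, odd = write in progress
        std::atomic<int> data{0};      // the protected payload

        void write(int v) {            // writers must be serialized externally
            seq.fetch_add(1);          // counter goes odd: "locked"
            data.store(v);
            seq.fetch_add(1);          // counter even again: "unlocked"
        }

        int read() {
            for (;;) {
                unsigned before = seq.load();
                if (before & 1) continue;    // writer active: retry, no lock taken
                int v = data.load();         // the lockless read
                if (seq.load() == before)    // counter unchanged: read is consistent
                    return v;
                // a writer intervened: roll back and try again
            }
        }
    };

Readers never write shared state, so uncontended reads stay lock free;
the cost is that a reader overlapping a writer must redo its work.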
Rhamphoryncus said:
The second issue is the objects themselves, like a list which is
mutable. If you're using it in a single thread or writing from
multiple threads, this is a non-trivial constant cost. If your object
is not modified after creation and is read from many threads, a lock
would be a point of contention, preventing you from scaling freely.
The dicts used by classes and globals are an important example of this,
and a successful implementation needs something non-contending. I
assume Jython and IronPython do this.

I'm pretty sure Jython makes no attempt at all to mess with ref
counts--it just relies on the underlying Java gc. I have no idea
about IronPython.
 

Rhamphoryncus

Paul Rubin said:
At least in the case of Haskell's "software transactional memory",
reads are genuinely lock free in the normal uncontended case. You use
LOCK XCHG to increment a counter in the object when you want to
update, and increment it again when you are done updating. Since the
counter starts at 0, any time it has an odd value, an update is in
progress and any other thread can notice that the object is locked and
try again later without asserting any locks itself. If the counter
has an even value, it is unlocked, but there is a chance that a writer
thread can lock and update the object while a reader thread has a
lockless access in progress. The reader detects this has occurred
when it finishes its transaction, reads the counter a second time, and
notices that the counter has been incremented (maybe more than once)
since the transaction started; it then rolls back its operation and
tries again. I'm not explaining that well, so you can read about it in
SPJ's paper or Keir Fraser's.

a) The contended case is the issue, not the uncontended case. An
uncontended lock is just constant overhead, not a barrier to
scalability.
b) Locks on Linux (since the switch to futexes) are pretty close to
free when uncontended.

Transactions are really for different properties:
a) read patterns can be uncontended (fully scalable)
b) a preempted writer does not block other writers, guaranteeing
forward progress
c) ad-hoc combination of several objects into a single atomic update

I'm pretty sure Jython makes no attempt at all to mess with ref
counts--it just relies on the underlying Java gc.  I have no idea
about IronPython.

The second issue has *nothing* to do with refcounts. It is the
objects themselves. A classic example is trying to do "i += 1", which
breaks down into "i = i + 1", which isn't an atomic operation.
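
Sketched in C++ for concreteness (the non-atomic increment is formally
a data race, shown only to make the lost update visible):

    #include <atomic>
    #include <iostream>
    #include <thread>

    int plain = 0;               // racy: "++plain" is load, add, store
    std::atomic<int> safe{0};    // fetch_add is a single atomic step

    void work() {
        for (int n = 0; n < 1000000; ++n) {
            ++plain;             // two threads can load the same old value
            safe.fetch_add(1, std::memory_order_relaxed);
        }
    }

    int main() {
        std::thread a(work), b(work);
        a.join(); b.join();
        // "plain" usually comes out below 2000000; "safe" is exact.
        std::cout << plain << " " << safe.load() << "\n";
    }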

Java's ConcurrentHashMap gets this right.
 

Ross Ridge

Paul Rubin said:
Those links describe using the LOCK prefix, which as the name implies,
asserts a lock, so it is no longer "lockless reference counting".

No, it doesn't assert a lock in the sense used in this thread. On modern
Intel systems it's normally handled completely within the cache.
The LOCK prefix adds about 100 cycles to the instruction.

That's unavoidable. It's still faster than a spin lock, which would also
have to use synchronizing instructions.

Ross Ridge
 

Rhodri James

Paul Rubin said:
What CPUs do you know of that can atomically increment and decrement
integers without locking?

x86 (and pretty much any 8080 derivative, come to think of it).

That said, what you actually need is an atomic read-and-increment,
which is a lot harder to find. Even if you could find a platform
supporting it, it doesn't help you on other platforms you may need to
run on. Just do the locking properly and worry about optimisations
later.
 

Paul Rubin

Rhodri James said:
x86 (and pretty much any 8080 derivative, come to think of it).

It would not have occurred to me that "lock inc" increments "without
locking". I understand that's different from a lock value sitting in
the data object but I thought that "lock-free algorithm" meant one
that didn't assert any of these hardware locks either. Maybe I'm
wrong.
Just do the locking properly and worry about optimisations later.

That has already been tried, and found to be unacceptably slow for the
purpose at hand. Now we're looking for the optimizations.
 

Rhodri James

Paul Rubin said:
It would not have occurred to me that "lock inc" increments "without
locking". I understand that's different from a lock value sitting in
the data object but I thought that "lock-free algorithm" meant one
that didn't assert any of these hardware locks either. Maybe I'm
wrong.

I tend to live in single-core worlds, so "inc" on its lonesome works
just fine.
That has already been tried, and found to be unacceptably slow for the
purpose at hand. Now we're looking for the optimizations.

In that case I'd second the suggestion of taking a long, hard look
at the Linux core locking and synchronisation primatives.
 

Brendan Miller

Paul Rubin said:
It would not have occurred to me that "lock inc" increments "without
locking". I understand that's different from a lock value sitting in
the data object but I thought that "lock-free algorithm" meant one
that didn't assert any of these hardware locks either. Maybe I'm
wrong.

Right... I was wondering about that. Well, any kind of memory access
gets exclusive control of the bus except on NUMA, but I'm wondering
how CMPXCHG (http://en.wikipedia.org/wiki/Compare-and-swap) compares
to XADD performance-wise.

It seems to me that both of them must pull the old value across the
bus, hang onto the bus, and move the new value in. Maybe since XADD
needs to perform arithmetic there will be a few cycles lag between
getting the old value and pushing the new value? Maybe CMPXCHG doesn't
go through the ALU?

If the bus isn't just sitting idle and you can immediately push out
the new value then there's no real locking. Actually this article
explicitly mentions CMPXCHG as lock free.

http://en.wikipedia.org/wiki/Lock-free_and_wait-free_algorithms
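
Both primitives are reachable from portable C++ through std::atomic,
which makes the practical difference easy to state: on x86, fetch_add
typically compiles to LOCK XADD and always succeeds, while a
CMPXCHG-based increment must loop on contention. A sketch:

    #include <atomic>

    std::atomic<int> counter{0};

    void inc_with_xadd() {
        counter.fetch_add(1);   // one LOCK XADD; never retries
    }

    void inc_with_cas() {
        int old = counter.load();
        // LOCK CMPXCHG: swap in old+1 only if nobody changed the value
        // since our load; on failure "old" is reloaded and we retry.
        while (!counter.compare_exchange_weak(old, old + 1)) {
        }
    }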
 

Paul Rubin

Rhodri James said:
I tend to live in single-core worlds, so "inc" on its lonesome works
just fine.

In a single core world we wouldn't be having these endless angsty
conversations about eliminating the GIL.
In that case I'd second the suggestion of taking a long, hard look
at the Linux core locking and synchronisation primatives.

Do you understand what the issue is, about CPython's reference counts?
Do you have any idea how often the interpreter updates them? Even
using LOCK INC (raw machine instruction, not portable C, no operating
system involvement), manipulating the ref counts would be around 100x
slower than it is now.
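
Anyone who wants to put their own number on that gap can try a rough
micro-benchmark along these lines (single-threaded and uncontended, so
it captures only the instruction cost; the ratio varies a lot by CPU
and compiler flags):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        const long N = 100000000;
        volatile long plain = 0;   // volatile keeps the loop from being optimized away
        std::atomic<long> locked{0};
        using clk = std::chrono::steady_clock;

        auto t0 = clk::now();
        for (long i = 0; i < N; ++i) plain = plain + 1;  // plain ADD
        auto t1 = clk::now();
        for (long i = 0; i < N; ++i)
            locked.fetch_add(1, std::memory_order_relaxed);  // LOCK XADD
        auto t2 = clk::now();

        auto ms = [](auto d) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
        };
        std::printf("plain:  %lld ms\n", (long long)ms(t1 - t0));
        std::printf("locked: %lld ms\n", (long long)ms(t2 - t1));
    }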
 

Rhodri James

Paul Rubin said:
Do you understand what the issue is, about CPython's reference counts?
Do you have any idea how often the interpreter updates them? Even
using LOCK INC (raw machine instruction, not portable C, no operating
system involvement), manipulating the ref counts would be around 100x
slower than it is now.

You asked a question about CPUs with atomic update, strongly implying
there were none. All I did was supply a counter-example, and observe
that neither this nor the question were in fact helpful.

My inability to spell "primitive", on the other hand, is all my own
fault.
 

Paul Rubin

Rhodri James said:
You asked a question about CPUs with atomic update, strongly implying
there were none. All I did was supply a counter-example,

Well, more specifically, atomic update without locking, but I counted
the LOCK prefix as locking while other people didn't, and that caused
some confusion. Of course I'm aware of the LOCK prefix but it slows
down the instruction enormously compared with a non-locked instruction.
 

Brendan Miller

Paul Rubin said:
Well, more specifically, atomic update without locking, but I counted
the LOCK prefix as locking while other people didn't, and that caused
some confusion. Of course I'm aware of the LOCK prefix but it slows
down the instruction enormously compared with a non-locked instruction.

I'm curious about that. I've been looking around for timing
information on the lock signal, but am having some trouble finding
them. Intuitively, given that the processor is much faster than the
bus, and you are just waiting for the processor to complete an addition
or comparison before putting the new memory value on the bus, it seems
like
there should be very little additional bus contention vs a normal add
instruction.
 

Paul Rubin

Brendan Miller said:
I'm curious about that. I've been looking around for timing
information on the lock signal, but am having some trouble finding
them. Intuitively, given that the processor is much faster than the
bus, and you are just waiting for the processor to complete an addition
or comparison before putting the new memory value on the bus, it seems
like
there should be very little additional bus contention vs a normal add
instruction.

The bus is slow compared with the L1 cache. I just looked for figures
and couldn't find any either, but I remember seeing some for the Core
2 saying around 100 cycles, and something similar for the Athlon. I
just came across something saying the Core i7 is considerably better
than the Core 2 at this.

The real solution is to not use so much bus locking. Get rid of the
ref counts and use a real gc, if not in CPython then in PyPy.
 

skip

Brendan> Intuitively, given that the processor is much
Brendan> faster than the bus, and you are just waiting for the processor to
Brendan> complete an addition or comparison before putting the new memory
Brendan> value on the bus...

I think in Python's case the reference count will often be in the
processor's local cache. (Presumably, if the current thread is going to
Py_DECREF the object, it has been working with the object's state recently.)
The stuff I read suggested that simply locking the local cache would be
significantly faster than having to lock the memory bus.
 

Brendan Miller

Scott David Daniels said:
The opcode cannot simply talk to its cache, it must either go directly
to off-chip memory or communicate to other processors that it (and it
alone) owns the increment target.

Oh, right. *light bulb goes on* I wasn't thinking about cache at all.
 

MRAB

Brendan said:
Oh, right. *light bulb goes on* I wasn't thinking about cache at all.

I'm not sure whether multicore processors share a cache or, if not, have
some other on-chip mechanism. Multiprocessor machines, however, are a
different matter...
 

Ross Ridge

Scott David Daniels said:
The opcode cannot simply talk to its cache, it must either go directly
to off-chip memory or communicate to other processors that it (and it
alone) owns the increment target.

In fact, all it does is simply talk to its cache. From the "Intel 64 and
IA-32 Architectures Software Developer's Manual, Volume 3A: System
Programming Guide, Part 1":

For the P6 and more recent processor families, if the area of
memory being locked during a LOCK operation is cached in the
processor that is performing the LOCK operation as write-back
memory and is completely contained in a cache line, the processor
may not assert the LOCK# signal on the bus. Instead, it will
modify the memory location internally and allow its cache
coherency mechanism to ensure that the operation is carried
out atomically. This operation is called "cache locking." The
cache coherency mechanism automatically prevents two or more
processors that have cached the same area of memory from
simultaneously modifying data in that area.

The same cache coherency mechanism that prevents ordinary "unlocked"
instructions from simultaneously modifying the same cache line on
two different processors also provides the guarantee with "locked"
instructions. There are no additional hardware locks involved, and no
additional communication required.

Ross Ridge
 
