2.6, 3.0, and truly independent interpreters

Glenn Linderman

Actually, the GIL doesn't make Python faster; it is a design decision
that reduces the overhead of lock acquisition, while still allowing use
of global variables.

Using finer-grained locks has higher run-time cost; eliminating the use
of global variables has a higher programmer-time cost, but would
actually run faster and more concurrently than using a GIL. Especially
on a multi-core/multi-CPU machine.
Another peeve I have is his characterization of the observer pattern.
The generalized form of the problem exists both in single-threaded
sequential programs, in the form of unexpected reentrancy, and in message
passing, in the form of infinite CPU usage or an infinite number of pending
messages.

So how do you get reentrancy in a single-threaded sequential program? I
think only via recursion? Which isn't a serious issue for the observer
pattern. If you add interrupts, then your program is no longer sequential.
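
As a tiny, made-up illustration of that kind of reentrancy (the Subject class and clamping_observer are invented names): an observer that reacts to a notification by mutating the subject again re-enters the notification loop recursively, with no threads or interrupts involved.

class Subject(object):
    def __init__(self):
        self.observers = []
        self.value = 0

    def set_value(self, value):
        self.value = value
        for callback in list(self.observers):
            callback(value)          # a callback may call set_value() again

def clamping_observer(value):
    # Reacting to the change by changing the value again re-enters
    # set_value() recursively -- single thread, no interrupts.
    if value > 10:
        subject.set_value(10)

subject = Subject()
subject.observers.append(clamping_observer)
subject.set_value(42)
print(subject.value)                 # 10: the nested call ran inside the outer one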

Try looking at it on another level: when your CPU wants to read from a
bit of memory controlled by another CPU it sends them a message
requesting they get it for us. They send back a message containing
that memory. They also note we have it, in case they want to modify
it later. We also note where we got it, in case we want to modify it
(and not wait for them to do modifications for us).

I understand that level... one of my degrees is in EE, and I started
college wanting to design computers (at about the time the first
microprocessor chip came along, and they, of course, have now taken
over). But I was side-lined by the malleability of software, and have
mostly practiced software during my career.

Anyway, that is the level that Herb Sutter was describing in the Dr
Dobbs articles I mentioned. And the overhead of doing that at the level
of a cache line is high, if there is lots of contention for particular
memory locations between threads running on different cores/CPUs. So to
achieve concurrency, you must not only limit explicit software locks,
but must also avoid memory layouts where data needed by different
cores/CPUs are in the same cache line.

Message passing vs shared memory isn't really a yes/no question. It's
about ratios, usage patterns, and tradeoffs. *All* programs will
share data, but in what way? If it's just the code itself you can
move the cache validation into software and simplify the CPU, making
it faster. If the shared data is a lot more than that, and you use it
to coordinate accesses, then it'll be faster to have it in hardware.

I agree there are tradeoffs... unfortunately, the hardware architectures
vary, and the languages don't generally understand the hardware. So then
it becomes an OS API, which adds the overhead of an OS API call to the
cost of the synchronization... It could instead be (and in clever
applications is) a non-portable assembly level function that wraps an OS
locking or waiting API.

Nonetheless, while putting the shared data accesses in hardware might be
more efficient per unit operation, there are still tradeoffs: A software
solution can group multiple accesses under a single lock acquisition;
the hardware probably doesn't have enough smarts to do that. So it may
well require many more hardware unit operations for the same overall
concurrently executed function, and the resulting performance may not be
any better.
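
A small Python sketch of that grouping trade-off (the function names and data are invented for illustration): one lock acquisition covering a whole batch of updates versus one acquisition per update, which is roughly what a purely per-access scheme is forced into.

import threading

lock = threading.Lock()
shared_counts = {}

def record_events_batched(events):
    # One lock acquisition covers the whole group of updates.
    with lock:
        for key in events:
            shared_counts[key] = shared_counts.get(key, 0) + 1

def record_events_unbatched(events):
    # Same result, but one acquisition per update: many more lock
    # operations for the same overall concurrently executed function.
    for key in events:
        with lock:
            shared_counts[key] = shared_counts.get(key, 0) + 1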

Sidestepping the whole issue, by minimizing shared data in the
application design, avoiding not only software lock calls but also hardware
cache contention, is going to provide the best performance... it isn't
the things you do efficiently that make software fast — it is the things
you don't do at all.
 
Rhamphoryncus

Hmm.  So I think your PyE is an attempt to be more
explicit about what I said above in PyC: PyC threads do not share data
between threads except by explicit interfaces.  I consider your
definitions of shared data types somewhat orthogonal to the types of
threads, in that both PyA and PyC threads could use these new shared
data items.

Unlike PyC, there's a *lot* shared by default (classes, modules,
functions), but it requires only minimal recoding. It's as close to
"have your cake and eat it too" as you're gonna get.

I think/hope that you meant that "many types are now only allowed to be
non-shareable"?  At least, I think that should be the default; they
should be within the context of a single, independent interpreter
instance, so other interpreters don't even know they exist, much less
how to share them.  If so, then I understand most of the rest of your
paragraph, and it could be a way of providing shared objects, perhaps.

There aren't multiple interpreters under my model. You only need
one. Instead, you create a monitor, and run a thread on it. A list
is not shareable, so it can only be used within the monitor it's
created within, but the list type object is shareable.

I've no interest in *requiring* a C/C++ extension to communicate
between isolated interpreters. Without that they're really no better
than processes.
 
Rhamphoryncus

Actually, the GIL doesn't make Python faster; it is a design decision that
reduces the overhead of lock acquisition, while still allowing use of global
variables.

Using finer-grained locks has higher run-time cost; eliminating the use of
global variables has a higher programmer-time cost, but would actually run
faster and more concurrently than using a GIL. Especially on a
multi-core/multi-CPU machine.

Those "globals" include classes, modules, and functions. You can't
have *any* objects shared. Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)

Or use safethread. It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions. Still need
garbage collection though, and on CPython that means refcounting and
the GIL.

So how do you get reentrancy in a single-threaded sequential program? I
think only via recursion? Which isn't a serious issue for the observer
pattern. If you add interrupts, then your program is no longer sequential.

Sorry, I meant recursion. Why isn't it a serious issue for
single-threaded programs? Just the fact that it's much easier to
handle when it does happen?

I understand that level... one of my degrees is in EE, and I started college
wanting to design computers (at about the time the first microprocessor chip
came along, and they, of course, have now taken over). But I was side-lined
by the malleability of software, and have mostly practiced software during
my career.

Anyway, that is the level that Herb Sutter was describing in the Dr Dobbs
articles I mentioned. And the overhead of doing that at the level of a cache
line is high, if there is lots of contention for particular memory locations
between threads running on different cores/CPUs. So to achieve concurrency,
you must not only limit explicit software locks, but must also avoid memory
layouts where data needed by different cores/CPUs are in the same cache
line.

I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller). Same cache line size, but with
masking.

You still need to minimize contention of course, but that should at
least be more predictable. Having two unrelated mallocs contend could
suck.

I agree there are tradeoffs... unfortunately, the hardware architectures
vary, and the languages don't generally understand the hardware. So then it
becomes an OS API, which adds the overhead of an OS API call to the cost of
the synchronization... It could instead be (and in clever applications is) a
non-portable assembly level function that wraps an OS locking or waiting
API.

In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).

Nonetheless, while putting the shared data accesses in hardware might be
more efficient per unit operation, there are still tradeoffs: A software
solution can group multiple accesses under a single lock acquisition; the
hardware probably doesn't have enough smarts to do that. So it may well
require many more hardware unit operations for the same overall concurrently
executed function, and the resulting performance may not be any better.

Speculative ll/sc? ;)

Sidestepping the whole issue, by minimizing shared data in the application
design, avoiding not only software lock calls but also hardware cache
contention, is going to provide the best performance... it isn't the things
you do efficiently that make software fast — it is the things you don't do
at all.

Minimizing contention, certainly. Minimizing the shared data itself
is iffier though.
 
Glenn Linderman

Another great post, Glenn!! Very well laid-out and posed!! Thanks for
taking the time to lay all that out.

Like any software that's meant to be used without restrictions, our
code and frameworks always use a context object pattern so that
there's never any non-const global/shared data. I would go as far as to
say that this is the case with more performance-oriented software than
you may think since it's usually a given for us to have to be parallel
friendly in as many ways as possible. Perhaps Patrick can back me up
there.

I agree that performance-oriented software wants to be parallel friendly.

And avoiding concurrently shared data is a great way to be parallel
friendly. But having the data be used by one thread or another at
different times, is also necessary... I'm guessing that part of your
parallelism is to split up the data and process it in chunks; or perhaps
you generate a sequence of frames in different threads from static
data. Either of those techniques could work in a mostly independent
manner, using shared, read-only data, plus a collection of output buffers.
As to what modules are "essential"... As you point out, once
reentrant module implementations caught on in PyC or hybrid world, I
think we'd start to see real effort to whip them into compliance--
there's just so much to be gained imho.

Indeed. Often it is just a matter of finding the migration path, to enable something.
But to answer the question,
there's the obvious ones (operator, math, etc), string/buffer
processing (string, re), C bridge stuff (struct, array), and OS basics
(time, file system, etc). Nice-to-haves would be buffer and image
decompression (zlib, libpng, etc), crypto modules, and xml. As far as
I can imagine, I have to believe all of these modules already contain
little, if any, global data, so I have to believe they'd be super easy
to make "PyC happy". Patrick, what would you see you guys using?

That's a reasonable list, and I agree it sounds like the modules should
be straightforward to make reentrant, and they likely mostly are already.
As I understand things, multiprocessing puts stuff in a child
process (i.e. a separate address space), so the only way to get stuff to/
from it is via IPC, which can include a shared/mapped memory region.
Unfortunately, a shared address region doesn't work when you have
large and opaque objects (e.g. a rendered CoreVideo movie in the
QuickTime API or 300 megs of audio data that just went through a
DSP). Then you've got the hit of serialization if you've got
intricate data structures (that would normally need to be
serialized, such as a hashtable or something). Also, if I may speak
for commercial developers out there who are just looking to get the
job done without new code, it's usually preferable to just use a
single high-level sync object (for when the job is complete) than to
start a child process and use IPC. The former is just WAY less
code, plain and simple.

If your shared address space is large enough, it certainly could handle
large and opaque objects... but you would likely have to write your own
allocator for that space. Yes, if you want to share Python objects, you
would have to serialize; but that would be true with PyC also, at least
in my mental "phase 1."
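
As a rough sketch of that kind of "phase 1" sharing with the stock multiprocessing module (the frame size, worker count, and render function are invented for illustration): the bulk data lives in a shared ctypes buffer, so only small frame indices travel over the queues, never the megabytes themselves.

import multiprocessing
from multiprocessing import sharedctypes

FRAME_SIZE = 1024 * 1024        # illustrative sizes only
NUM_FRAMES = 8

def render(buf, frame_index):
    # Write directly into the shared buffer; nothing gets pickled
    # except the small frame_index that arrived via the queue.
    start = frame_index * FRAME_SIZE
    for i in range(start, start + FRAME_SIZE):
        buf[i] = (frame_index + i) % 256

def worker(buf, jobs, done):
    while True:
        frame_index = jobs.get()
        if frame_index is None:
            break
        render(buf, frame_index)
        done.put(frame_index)

if __name__ == '__main__':
    buf = sharedctypes.RawArray('B', FRAME_SIZE * NUM_FRAMES)
    jobs = multiprocessing.Queue()
    done = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(buf, jobs, done))
             for _ in range(2)]
    for p in procs:
        p.start()
    for i in range(NUM_FRAMES):
        jobs.put(i)
    for _ in range(NUM_FRAMES):
        print(done.get())       # frame indices, as they complete
    for _ in procs:
        jobs.put(None)
    for p in procs:
        p.join()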
 
Glenn Linderman

Unlike PyC, there's a *lot* shared by default (classes, modules,
functions), but it requires only minimal recoding. It's as close to
"have your cake and eat it too" as you're gonna get.

Yes, but I like my cake frosted with performance; Guido's non-acceptance
of granular locks in the blog entry someone referenced was due to the
slowdown incurred with granular locking and shared objects. Your PyE
model, with highly granular sharing, will likely suffer the same fate.

The independent threads model, with only slight locking for a few
explicitly shared objects, has a much better chance of getting better
performance overall. With one thread running, it would be the same as
today; with multiple threads, it should scale at the same rate as the
system... minus any locking done at the higher level.
There aren't multiple interpreters under my model. You only need
one. Instead, you create a monitor, and run a thread on it. A list
is not shareable, so it can only be used within the monitor it's
created within, but the list type object is shareable.

The python interpreter code should be sharable, having been written in
C, and being/becoming reentrant. So in that sense, there is only one
interpreter. Similarly, any other reentrant C extensions would be that
way. On the other hand, each thread of execution requires its own
interpreter context, so that would have to be independent for the
threads to be independent. It is the combination of code+context that I
call an interpreter, and there would be one per thread for PyC threads.
Bytecode for loaded modules could potentially be shared, if it is also
immutable. However, that could be in my mental "phase 2", as it would
require an extra level of complexity in the interpreter as it creates
shared bytecode... there would be a memory savings from avoiding
multiple copies of shared bytecode, likely, and maybe also a compilation
performance savings. So it sounds like a win, but it is a win that can be
deferred for initial simplicity, to prove the concept is or is not workable.

A monitor allows a single thread to run at a time; that is the same
situation as the present GIL. I guess I don't fully understand your model.
I've no interest in *requiring* a C/C++ extension to communicate
between isolated interpreters. Without that they're really no better
than processes.

Indeed. My PyC model only immediately supplied a context block for
passing data to (and back from) a spawned, independent thread. That is
pretty limited. However, the context block could be provided either by
a C program or an independent Python interpreter. I agree that some
additional shared data, either shared data objects, or message queues,
would make things more useful. But these should be carefully designed
based on application needs, and as limited as possible, to maximize
performance and minimize lock contention and lock acquisition overhead.
Shared objects will be significantly more expensive to access and
manipulate than private objects; not only must they be protected with a
lock, but all attributes of the shared object must be in the same
concurrency group, or must have their own locks, or must be immutable, and
must be protected from garbage collection. Or some such set of rules; various
techniques are possible. But they are all more expensive than private
objects manipulated in one independent thread.
 
Glenn Linderman

Those "globals" include classes, modules, and functions. You can't
have *any* objects shared. Your interpreters are entirely isolated,
much like processes (and we all start wondering why you don't use
processes in the first place.)

Indeed; isolated, independent interpreters are one of the goals. It is,
indeed, much like processes, but in a single address space. It allows
the master process (Python or C for the embedded case) to be coded using
memory references and copies and pointer swaps instead of using
semaphores, and potentially multi-megabyte message transfers.

It is not clear to me that, with the use of shared memory between
processes, the application couldn't use processes and achieve many
of the same goals. On the other hand, the code to create and manipulate
processes and shared memory blocks is harder to write and has more
overhead than the code to create and manipulate threads, which can, when
told, access any memory block in the process. This allows the shared
memory to be resized more easily, or more blocks of shared memory
created more easily. On the other hand, the creation of shared memory
blocks shouldn't be a high-use operation in a program that has
sufficient number crunching to do to be able to consume multiple cores/CPUs.
Or use safethread. It imposes safe semantics on shared objects, so
you can keep your global classes, modules, and functions. Still need
garbage collection though, and on CPython that means refcounting and
the GIL.

Sounds like safethread has 35-40% overhead. Sounds like too much, to me.
Sorry, I meant recursion. Why isn't it a serious issue for
single-threaded programs? Just the fact that it's much easier to
handle when it does happen?

There are generally only a finite number of observer patterns that would
intersect with recursive algorithms, within a particular program. So
yes, it is easier to deal with, and less likely to be missed during
testing, than in true multi-threaded programs.
I suspect they'll end up redesigning the caching to use a size and
alignment of 64 bits (or smaller). Same cache line size, but with
masking.

You still need to minimize contention of course, but that should at
least be more predictable. Having two unrelated mallocs contend could
suck.

Contention on highly shared, frequently referenced data structures
definitely reduces concurrency. Significantly. Placing two such data
structures in the same cache line is definitely a performance no-no in a
multicore/CPU world.

Larger cache lines allow for wider memory buses, which increases memory
throughput. Memory is the current bottleneck, as its speed has not
increased at the same rate as CPU speeds. So wider memory buses are
good for performance.

Most modern caches are "N-way" associative, meaning that you can have N
cache lines in each bucket. So two unrelated mallocs are not a problem
– but N+1 could be. You might jump to the conclusion that you would
want the N for cache associativity to be at least as big as the number
of cores/CPUs, but that isn't good enough... a single thread of execution
could reference memory that would load 2 or more cache lines in the same
bucket. That would be relatively rare, but if it happened on multiple
cores/CPUs concurrently, it could produce a slowdown. It is for this
sort of reason that it is very difficult to construct a data intensive
application that scales linearly with the number of cores/CPUs in use.

In practice I highly doubt we'll see anything that doesn't extend
traditional threading (posix threads, whatever MS has, etc).

I doubt it either. But there is plenty of room for innovation in those
extensions; the Berkeley paper basically boils down to two items, in my
mind:

1) Minimize shared data objects to achieve maximum concurrency.

2) Design the application around the necessary inter-thread
interactions. Focus on that aspect of the design independently of the
sequential code necessary to do bits of work. Avoid the temptation to
solve problems by adding more locks and shared data items.

The paper hopes for high-level, language-level codifications of useful
patterns for inter-thread interactions, mentioning a few, and a few
languages/schemes that are in the research stages. It doesn't
necessarily hope that individual languages will be extended to support
those patterns, but rather that a meta-language might do so. But that
is no reason to expect that Python _couldn't_ be the meta-language,
and also retain its current function; it might (or might not) mean that
if Python implemented the meta-language, it could become the
meta-language for other languages. Someone will be first to the table.

The PyC design I'm outlining seems to provide a basis for doing
experimentation with meta-language options – first as a set of functions
or objects or shared objects in a master thread, and then perhaps as
language syntax where and if appropriate.

Speculative ll/sc? ;)

CPUs have gotten amazingly good at branch prediction and speculative
execution, but it adds significantly to the complexity of the design.
That complexity is one of the reasons that multiple-CPUs, multiple
cores, and VLIW designs are becoming more common... to push some of the
complexity back at the software, because the ROI for the increased
hardware complexity buck is diminishing.

So there might be some room for a bit more speculative execution, and
for better caching; N-way associativity may be able to double or
quadruple N in the future, but I don't hold out a lot of hope for big
paybacks from that sort of technology.

Minimizing contention, certainly. Minimizing the shared data itself
is iffier though.

I should have said (because I meant) minimizing the number of chunks of
shared data. The size of the chunk is less relevant; the overhead per
chunk is roughly the same, regardless of the size.

For example, Python presently has a rather stupid algorithm for string
concatenation. It allocates only the exactly necessary space for the
concatenated string. This is a brilliant move, when you realize that
strings are immutable, and once allocated can never change, but the
operation

for line in mylistofstrings:
    string = string + line

is basically O(N-squared) as a result. The better algorithm would
double the size of memory allocated for string each time there is not
enough room to add the next line, and that reduces the cost of the
algorithm to O(N). [Notes: I haven't figured out if this algorithm is
more efficient with Python buffers instead of strings; and I'm aware of
''.join(mylistofstrings) which is also O(N).]
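
A small, hedged illustration of the difference (sizes are arbitrary): on an implementation without the in-place concatenation optimization discussed later in this thread, the first loop degrades toward O(N-squared), while ''.join stays O(N).

import time

lines = ['x' * 80 for _ in range(20000)]

start = time.time()
s = ''
for line in lines:
    s = s + line              # may reallocate and copy the whole string each pass
print('concatenation: %.3fs' % (time.time() - start))

start = time.time()
s = ''.join(lines)            # one allocation of the final size
print('join:          %.3fs' % (time.time() - start))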

So arranging to have large buffers of data that are shared is generally
higher performance than sharing unit data items wrapped in little objects.
 
Andy O'Meara

Are you familiar with the API at all? Multiprocessing was designed to
mimic threading in just about every way possible; the only restriction on
shared data is that it must be serializable, but even then you can
override or customize the behavior.

Also, inter process communication is done via pipes. It can also be
done with messages if you want to tweak the manager(s).

I apologize in advance if I don't understand something correctly, but
as I understand them, everything has to be serialized in order to go
through IPC. So when you're talking about thousands of objects,
buffers, and/or large OS opaque objects (e.g. memory-resident video
and images), that seems like a pretty rough hit on run-time resources.
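
For readers who haven't used the API being discussed, a minimal sketch of the pipe style (the worker function and payload are invented for illustration); everything crossing the Pipe is pickled and unpickled, which is exactly the serialization cost being described.

import multiprocessing

def worker(conn):
    while True:
        job = conn.recv()          # arrives pickled (serialized) over the pipe
        if job is None:
            break
        conn.send(sum(job))        # the result goes back the same way

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send([1, 2, 3, 4])
    print(parent_conn.recv())      # 10
    parent_conn.send(None)
    p.join()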

Please don't misunderstand my comments to suggest that multiprocessing
isn't great stuff. On the contrary, it's very impressive and it
singlehandedly catapults python *way* closer to efficient CPU-bound
processing than it ever was before. All I mean to say is that in the
case where you're using a shared address space with a worker pthread per
spare core to do CPU-bound work, it's a really big win not to have to
serialize stuff. And in the case of hundreds of megs of data and/or
thousands of data structure instances, it's a deal breaker to
serialize and unserialize everything just so that it can be sent
through IPC. It's a deal breaker for most performance-centric apps
because of the unnecessary runtime resource hit and because now all
those data structures being passed around have to have accompanying
serialization code written (and maintained) for them. That's
actually what I meant when I made the comment that a high-level sync
object in a shared address space is "better" than sending it all
through IPC (when the data sets are wild and crazy). From a C/C++
point of view, I would venture to say that it's always a huge win to
just stick those "embarrassingly easy" parallelization cases into a
thread with a sync object rather than forking and using IPC and having to
write all the serialization code. And in the case of huge data types--
such as video or image rendering--it makes me nervous to think of
serializing it all just so it can go through IPC when it could just be
passed using a pointer change and a single sync object.

So, if I'm missing something and there's a way to pass data structures
without serialization, then I'd definitely like to learn more (sorry
in advance if I missed something there). When I took a look at
multiprocessing my concerns were:
- serialization (discussed above)
- maturity (are we ready to bet the farm that mp is going to work
properly on the platforms we need it to?)

Again, I'm psyched that multiprocessing appeared in 2.6 and it's a
huge huge step in getting everyone to unlock the power of python!
But, then some of the tidbits described above are additional data
points for you and others to chew on. I can tell you they're pretty
important points for any performance-centric software provider (us,
game developers--from EA to Ambrosia, and A/V production app
developers like Patrick).

Andy
 
Adam Olsen

Yes, but I like my cake frosted with performance; Guido's non-acceptance of
granular locks in the blog entry someone referenced was due to the slowdown
incurred with granular locking and shared objects. Your PyE model, with
highly granular sharing, will likely suffer the same fate.

No, my approach includes scalable performance. Typical paths will
involve *no* contention (i.e. no locking). Classes and modules use
shareddict, which is based on a read-write lock built into the
interpreter, so it's uncontended for read-only usage patterns. Pretty
much everything else is immutable.
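
The shareddict described here is interpreter-level machinery, but the read-write idea can be sketched in plain Python (an assumed, simplified implementation: this version still touches a small counter lock on the read path, so it only approximates the truly uncontended read-only case).

import threading

class ReadWriteLock(object):
    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()   # protects _readers
        self._writer_lock = threading.Lock()    # held while readers or a writer are active

    def acquire_read(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()

    def release_read(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()

    def acquire_write(self):
        self._writer_lock.acquire()

    def release_write(self):
        self._writer_lock.release()

shared_config = {}                               # stand-in for a shared class/module dict
rwlock = ReadWriteLock()

def lookup(key):
    rwlock.acquire_read()
    try:
        return shared_config.get(key)            # the common, read-only path
    finally:
        rwlock.release_read()

def update(key, value):
    rwlock.acquire_write()
    try:
        shared_config[key] = value               # the rare, exclusive path
    finally:
        rwlock.release_write()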

Of course that doesn't include the cost of garbage collection.
CPython's refcounting can't scale.

The independent threads model, with only slight locking for a few explicitly
shared objects, has a much better chance of getting better performance
overall. With one thread running, it would be the same as today; with
multiple threads, it should scale at the same rate as the system... minus
any locking done at the higher level.

So use processes with a little IPC for these expensive-yet-"shared"
objects. multiprocessing does it already.

The python interpreter code should be sharable, having been written in C,
and being/becoming reentrant. So in that sense, there is only one
interpreter. Similarly, any other reentrant C extensions would be that way.
On the other hand, each thread of execution requires its own interpreter
context, so that would have to be independent for the threads to be
independent. It is the combination of code+context that I call an
interpreter, and there would be one per thread for PyC threads. Bytecode
for loaded modules could potentially be shared, if it is also immutable.
However, that could be in my mental "phase 2", as it would require an extra
level of complexity in the interpreter as it creates shared bytecode...
there would be a memory savings from avoiding multiple copies of shared
bytecode, likely, and maybe also a compilation performance savings. So it
sounds like a win, but it is a win that can be deferred for initial simplicity,
to prove the concept is or is not workable.

A monitor allows a single thread to run at a time; that is the same
situation as the present GIL. I guess I don't fully understand your model.

To use your terminology, each monitor is a context. Each thread
operates in a different monitor. As you say, most C functions are
already thread-safe (reentrant). All I need to do is avoid letting
multiple threads modify a single mutable object (such as a list) at a
time, which I do by containing it within a single monitor (context).
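
A very loose Python approximation of that description (the Monitor class here is invented; safethread's real monitors are built into the interpreter): a mutable object is owned by one monitor, and a thread touches it only by entering that monitor, so at most one thread mutates it at a time.

import threading

class Monitor(object):
    # Hypothetical stand-in: owns some mutable state and serializes access to it.
    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def enter(self, func, *args):
        # Run func against the owned state while inside the monitor.
        with self._lock:
            return func(self._state, *args)

results = Monitor([])                    # the list lives inside this monitor

def record(state, item):
    state.append(item)

threads = [threading.Thread(target=results.enter, args=(record, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results.enter(lambda state: sorted(state)))   # [0, 1, 2, 3]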
 
Adam Olsen

Indeed; isolated, independent interpreters are one of the goals. It is,
indeed, much like processes, but in a single address space. It allows the
master process (Python or C for the embedded case) to be coded using memory
references and copies and pointer swaps instead of using semaphores, and
potentially multi-megabyte message transfers.

It is not clear to me that, with the use of shared memory between processes,
the application couldn't use processes and achieve many of the same
goals. On the other hand, the code to create and manipulate processes and
shared memory blocks is harder to write and has more overhead than the code
to create and manipulate threads, which can, when told, access any memory
block in the process. This allows the shared memory to be resized more
easily, or more blocks of shared memory created more easily. On the other
hand, the creation of shared memory blocks shouldn't be a high-use operation
in a program that has sufficient number crunching to do to be able to
consume multiple cores/CPUs.


Sounds like safethread has 35-40% overhead. Sounds like too much, to me.

The specific implementation of safethread, which attempts to remove
the GIL from CPython, has significant overhead and had very limited
success at being scalable.

The monitor design proposed by safethread has no inherent overhead and
is completely scalable.
 
Martin v. Löwis

It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

I'm not sure whether you realize that this is not simple at all.
Consider this fragment

if (string == Py_None || index >= state->lastmark ||
        !state->mark[index] || !state->mark[index+1]) {
    if (empty)
        /* want empty string */
        i = j = 0;
    else {
        Py_INCREF(Py_None);
        return Py_None;
    }
}

Py_None here is a global variable. How would you replace it?
It's used in thousands of places.

For another example, consider

PyErr_SetString(PyExc_ValueError,
                "Empty module name");
or

dp = PyObject_New(dbmobject, &Dbmtype);

There are tons of different variables denoting exceptions and
other types which all somehow need to be rewritten (likely with
undesirable effects on readability).

So I don't think that this is a simple solution. It's the right
one, but it will take five or ten years to implement.

Regards,
Martin
 
Martin v. Löwis

A C-level module, on the other hand, can sidestep/release
...Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right?
Wrong.

(even if the interpreter is 100% independent, yikes).

Again, wrong.
For
example, have a python C module designed to programmatically generate
images (and video frames) in RAM for immediate and subsequent use in
animation. Meanwhile, we'd like to have a pthread with its own
interpreter with an instance of this module and have it dequeue jobs
as they come in (in fact, there'd be one of these threads for each
excess core present on the machine).

I don't understand how this example involves multiple threads. You
mention a single thread (running the module), and you mention designing
a module. Where is the second thread?

Let's assume there is another thread producing jobs, and then
a thread that generates the images. The structure would be this

while 1:
    job = queue.get()
    processing_module.process(job)

and in process:

PyArg_ParseTuple(args, "s", &job_data);
result = PyString_FromStringAndSize(NULL, bufsize);  /* uninitialised buffer */
buf = PyString_AsString(result);
Py_BEGIN_ALLOW_THREADS
compute_frame(job_data, buf);                        /* GIL released here */
Py_END_ALLOW_THREADS
return result;

All these compute_frames could happily run in parallel.
As far as I can tell, it seems
CPython's current state can't do CPU-bound parallelization in the same
address space.

That's not true.

Regards,
Martin
 
Glenn Linderman

All I mean to say is that in the
case where you're using a shared address space with a worker pthread per
spare core to do CPU-bound work, it's a really big win not to have to
serialize stuff.

I would agree that it is a big win not to have to serialize stuff....

And in the case of hundreds of megs of data

.... and I would be surprised at someone that would embed hundreds of
megs of data into an object such that it had to be serialized... seems
like the proper design is to point at the data, or a subset of it, in a
big buffer. Then data transfers would just transfer the offset/length
and the reference to the buffer.
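
A tiny sketch of that hand-off style (names are invented; memoryview semantics as in Python 2.7/3.x assumed): the big blob stays put, and what moves between threads is just a lightweight zero-copy descriptor carrying the offset and length.

frame_buffer = bytearray(16 * 1024 * 1024)     # one big blob, allocated once

def make_chunk(offset, length):
    # A zero-copy view; only this small descriptor object is handed around.
    return memoryview(frame_buffer)[offset:offset + length]

chunk = make_chunk(4096, 1024)
chunk[:] = b'\xff' * 1024                      # fills just that slice of the big buffer
print(frame_buffer[4096:4100])                 # the write landed in frame_buffer itself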

and/or thousands of data structure instances,

.... and this is another surprise! You have thousands of objects (data
structure instances) to move from one thread to another?

Of course, I know that data get large, but typical multimedia streams
are large, binary blobs. I was under the impression that processing
them usually proceeds along the lines of keeping offsets into the blobs,
and interpreting, etc. Editing is usually done by making a copy of a
blob, transforming it or a subset in some manner during the copy
process, resulting in a new, possibly different-sized blob. But then, I
can't say that I am any sort of expert in the multimedia arena...
generating a bit of MIDI is the closest I've come to multimedia, and
that is a much smaller scale data set.

So I was working under the assumption that the data blobs would be
handled outside of the concepts of objects, other than objects that
would contain references to appropriate sections of the blobs, and
corresponding access methods. Otherwise, as you point out, the data
has to be serialized to be transferred via messages, which causes a
triple copy (1-serialize, 2-message copy, 3-deserialize) as well as
other processing.

Care to educate me, if you can, without revealing anything proprietary?

On the other hand, if the data are in thousands of objects, then those
objects would have to be sharable objects if they are to be handed from
one thread to another, and overhead would go up. As long as they are in
the same concurrency group, though, the overhead might not be too
significant... although it would cost an extra compare (to validate that
fact) on each operation that modifies an object.
 
Glenn Linderman

It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

I'm not sure whether you realize that this is not simple at all.
Consider this fragment

if (string == Py_None || index >= state->lastmark ||
        !state->mark[index] || !state->mark[index+1]) {
    if (empty)
        /* want empty string */
        i = j = 0;
    else {
        Py_INCREF(Py_None);
        return Py_None;
    }
}

Py_None here is a global variable. How would you replace it?
It's used in thousands of places.

If Py_None corresponds to None in Python syntax (sorry I'm not familiar
with Python internals yet; glad you are commenting, since you are), then
it is a fixed constant and could be left global, probably. But if we
want a separate None for each interpreter, or if we just use Py_None as
an example global variable to use to answer the question then here goes
(as much as I find the solution distasteful, it is a solution, and I
believe this solution is used in Perl for exactly the same reason – the
desire to convert from a single interpreter per process to allow
multiple interpreters per process).

Instead of this:

PyInstance Py_None;          /* or whatever type it is declared as; I'll use
                                PyInstance in this example */
PyInstance Py_AnotherGlobal; /* and so forth, until all globals are
                                enumerated */

The following is coded:

struct PyInterpreterContext
{
    PyInstance Py_None_member;
    PyInstance Py_AnotherGlobal_member;  /* and so forth, until all globals
                                            are included in the struct */
};

and you define a parameter struct PyInterpreterContext *PyIC that gets
added as the first parameter to every internal Python function
definition, declaration, and invocation.

And then you define:

#define Py_None PyIC->Py_None_member
#define Py_AnotherGlobal PyIC->Py_AnotherGlobal_member /* and so forth,
until all globals are enumerated */


So there is a performance cost to this, but I don't think it is 35% like
Stackless Python. The cost is one more push on every function call, and
one more indirection on every reference to a global. Depending on the
coding style for Python internals, it may be possible to write a script
to edit the source to add the PyIC parameter, at least for most invocations.

On many platforms, there is the concept of TLS, or thread-local storage.
Such a mechanism allows you to define globals in a "special" way (varies
per platform), and then those globals are specially linked somehow and
can still (I think) be visible without passing them around, yet the
platform code ensures that the TLS storage is swapped at the same time
as the thread. This can provide better performance on platforms that
support the concept... the coding is then somewhat different: for those
platforms, the special syntax is used for the struct declaration

TLS struct PyInterpreterContext PyTLSIC;

and the defines would change to be

#define Py_None PyTLSIC.Py_None_member
#define Py_AnotherGlobal PyTLSIC.Py_AnotherGlobal_member /* etc. */
 
Terry Reedy

Glenn said:
For example, Python presently has a rather stupid algorithm for string
concatenation.

Python the language has syntax and semantics. Python implementations
have algorithms that fulfill the defined semantics.
It allocates only the exactly necessary space for the
concatenated string. This is a brilliant move, when you realize that
strings are immutable, and once allocated can never change, but the
operation

for line in mylistofstrings:
    string = string + line

is basically O(N-squared) as a result. The better algorithm would
double the size of memory allocated for string each time there is not
enough room to add the next line, and that reduces the cost of the
algorithm to O(N).

If there is more than one reference to a guaranteed immutable object,
such as a string, the 'stupid' algorithm seems necessary to me. In-place
modification of a shared immutable would violate semantics.

However, if you do

string = ''
for line in strings:
    string += line

so that there is only one reference and you tell the interpreter that
you don't mind the old value being updated, then I believe in 2.6, if
not before, CPython does overallocation and in-place extension. (I am
not sure about s=s+l.) But this is just ref-counted CPython.

Terry Jan Reedy
 
Glenn Linderman

Python the language has syntax and semantics. Python implementations
have algorithms that fulfill the defined semantics.

I can buy that, but when Python is not qualified, CPython should be
assumed, as it predominates. Of course, the latest official release
should probably also be assumed, but that is so recent, few have likely
upgraded as yet... I should have qualified the statement.

Of course, the poor algorithm was just an example of how things
shouldn't be done, and that better algorithms should be found when
possible, thus reducing the overall cost of the computation.
If there is more than one reference to a guaranteed immutable object,
such as a string, the 'stupid' algorithm seems necessary to me.
In-place modification of a shared immutable would violate semantics.

Absolutely. But after the first iteration, there is only one reference
to string.
However, if you do

string = ''
for line in strings:
    string += line

so that there is only one reference and you tell the interpreter that
you don't mind the old value being updated, then I believe in 2.6, if
not before, CPython does overallocation and in-place extension. (I am
not sure about s=s+l.) But this is just ref-counted CPython.

Sounds like a useful improvement in 2.6. I'm pretty sure 2.5.2 doesn't
have it, because that's where I first encountered the poorly performing
algorithm. Thanks for the info. I'm sort of waiting for a final
release of PyQt (I'm too much of a PyQt novice to try the snapshots) for
2.6 to upgrade.
 
greg

Andy said:
I would definitely agree if there was a context (i.e. environment)
object passed around then perhaps we'd have the best of all worlds.

Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance. The idea mightn't go down too well if it
slows things significantly in the case where you're only
using one interpreter.
 
greg

Andy said:
- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app.

I hope you realize that starting up one of these interpreters
is going to be fairly expensive. It will have to create its
own versions of all the builtin constants and type objects,
and import its own copy of all the modules it uses.

One wonders if it wouldn't be cheaper just to fork the
process. Shared memory can be used to transfer large lumps
of data if needed.
 
greg

Glenn said:
If Py_None corresponds to None in Python syntax ... then
it is a fixed constant and could be left global, probably.

No, it couldn't, because it's a reference-counted object
like any other Python object, and therefore needs to be
protected against simultaneous refcount manipulation by
different threads. So each interpreter would need its own
instance of Py_None.

The same goes for all the other built-in constants and
type objects -- there are dozens of these.
The cost is one more push on every function call,

Which sounds like it could be a rather high cost! If
(just a wild guess) each function has an average of 2
parameters, then this is increasing the amount of
argument pushing going on by 50%...
On many platforms, there is the concept of TLS, or thread-local storage.

That's another possibility, although doing it that
way would require you to have a separate thread for
each interpreter, which you mightn't always want.
 
greg

Andy said:
In our case, we're doing image and video
manipulation--stuff not good to be messaging from address space to
address space.

Have you considered using shared memory?

Using mmap or equivalent, you can arrange for a block of
memory to be shared between processes. Then you can dump
the big lump of data to be transferred in there, and send
a short message through a pipe to the other process to
let it know it's there.
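
A hedged sketch of that mmap-plus-pipe arrangement (Unix-style, using fork; the sizes and messages are arbitrary): the child dumps the big lump into the shared mapping, and only a short notification crosses the pipe.

import mmap
import os

SIZE = 1024 * 1024
shared = mmap.mmap(-1, SIZE)            # anonymous shared mapping, inherited across fork
read_fd, write_fd = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: dump the big lump of data into the shared mapping...
    shared.seek(0)
    shared.write(b'x' * SIZE)
    # ...then send a short "it's there" message through the pipe.
    os.write(write_fd, b'ready')
    os._exit(0)
else:
    os.read(read_fd, 5)                 # wait for the notification
    shared.seek(0)
    print(shared.read(4))               # data is already here; nothing big crossed the pipe
    os.waitpid(pid, 0)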
 
greg

Rhamphoryncus said:
A list
is not shareable, so it can only be used within the monitor it's
created within, but the list type object is shareable.

Type objects contain dicts, which allow arbitrary values
to be stored in them. What happens if one thread puts
a private object in there? It becomes visible to other
threads using the same type object. If it's not safe
for sharing, bad things happen.

Python's data model is not conducive to making a clear
distinction between "private" and "shared" objects,
except at the level of an entire interpreter.
 
