2.6, 3.0, and truly independent interpreters

A

Andy O'Meara

Instead of "appdomains" (one interpreter per thread), or free
threading, you could use multiple processes. Take a look at the new
multiprocessing module in Python 2.6.

That's mentioned earlier in the thread.
There is a fundamental problem with using homebrew loading of multiple
(but renamed) copies of PythonXX.dll that is easily overlooked. That
is, extension modules (.pyd) are DLLs as well.

Tell me about it--there's all kinds of problems and maintenance
liabilities with our approach. That's why I'm here talking about this
stuff.
There are other options as well:

- Use IronPython. It does not have a GIL.

- Use Jython. It does not have a GIL.

- Use pywin32 to create isolated outproc COM servers in Python. (I'm
not sure what the effect of inproc servers would be.)

- Use os.fork() if your platform supports it (Linux, Unix, Apple,
Cygwin, Windows Vista SUA). This is the standard posix way of doing
multiprocessing. It is almost unbeatable if you have a fast copy-on-
write implementation of fork (that is, all platforms except Cygwin).

This is discussed earlier in the thread--they're unfortunately all
out.
 
S

Stefan Behnel

Terry said:
Everything in DLLs is compiled C extensions. I see about 15 for Windows
3.0.

Ah, weren't those wonderful times back in the days of Win3.0, when DLL-hell was
inhabited by only 15 libraries? *sigh*

.... although ... wait, didn't Win3.0 have more than that already? Maybe you
meant Windows 1.0?

SCNR-ly,

Stefan
 
S

sturlamolden

This is discussed earlier in the thread--they're unfortunately all
out.

It occurs to me that tcl is doing what you want. Have you ever thought
of not using Python?

That aside, the fundamental problem is what I perceive as a fundamental
design flaw in Python's C API. In Java JNI, each function takes a
JNIEnv* pointer as its first argument. There is nothing that
prevents you from embedding several JVMs in a process. Python can
create embedded subinterpreters, but it works differently. It swaps
subinterpreters like a finite state machine: only one is concurrently
active, and the GIL is shared. The approach is fine, except it kills
free threading of subinterpreters. The argument seems to be that
Apache's mod_python somehow depends on it (for reasons I don't
understand).
 
A

Andy O'Meara

Something like that is necessary for independent interpreters,
but not sufficient. There are also all the built-in constants
and type objects to consider. Most of these are statically
allocated at the moment.

Agreed--I was just trying to speak generally. Or, put another way,
there's no hope for independent interpreters without the likes of PEP
3121. Also, as Martin pointed out, there's the issue of module
cleanup some guys here may underestimate (and I'm glad Martin pointed
out the importance of it). Without the module cleanup, every time a
dynamic library using python loads and unloads you've got leaks. This
issue is a real problem for us since our software is loaded and
unloaded many many times in a host app (iTunes, WMP, etc). I hadn't
raised it here yet (and I don't want to turn the discussion to this),
but lack of multiple load and unload support has been another painful
issue that we didn't expect to encounter when we went with python.

No, it's there because it's necessary for acceptable performance
when multiple threads are running in one interpreter. Independent
interpreters wouldn't mean the absence of a GIL; it would only
mean each interpreter having its own GIL.

I see what you're saying, but let's note that what you're talking
about at this point is an interpreter that contains protection against
client-level code violating the (supposed) direction put forth in python's
multithreading guidelines. Glenn Linderman's post really gets at
what's at hand here. It's really important to consider that it's not
a given that python (or any framework) has to be designed against
hazardous use. Again, I refer you to the diagrams and guidelines in
the QuickTime API:

http://developer.apple.com/technotes/tn/tn2125.html

They tell you point-blank what you can and can't do, and it's that
simple. Their engineers can then simply create the implementation
around those specs and not weigh any of the implementation down with
sync mechanisms. I'm in the camp that simplicity and convention wins
the day when it comes to an API. It's safe to say that software
engineers expect and assume that a thread that doesn't have contact
with other threads (except for explicit, controlled message/object
passing) will run unhindered and safely, so I raise an eyebrow at the
GIL (or any internal "helper" sync stuff) holding up a thread's
performance when the app is designed to not need lower-level global
locks.

Anyway, let's talk about solutions. My company is looking to support a
python dev community endeavor that allows the following (see the rough
sketch after the list):

- an app makes N worker threads (using the OS)

- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app. Generally, we're talking about CPU-bound work
here.

- each interpreter has the essentials (e.g. math support, string
support, re support, and so on -- I realize this is open-ended, but
work with me here).
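
To make that model concrete, here's a purely hypothetical sketch in 2.6-era
Python -- the Interpreter class below does not exist in CPython; it's only
meant to show the shape of what's being asked for:

    import threading, Queue

    class Interpreter(object):
        """Hypothetical fully independent interpreter (no such object exists)."""
        def run(self, script_source):
            # would compile and run script_source with no shared GIL,
            # no shared globals, and no shared object pools
            raise NotImplementedError

    work_queue = Queue.Queue()      # scripts pushed by the host app
    results = Queue.Queue()         # results pulled by the host app

    def worker():
        interp = Interpreter()              # one private interpreter per OS thread
        while True:
            job = work_queue.get()
            if job is None:                 # sentinel from the host app: shut down
                break
            results.put(interp.run(job))    # CPU-bound work, unhindered by other threads

    workers = [threading.Thread(target=worker) for _ in range(4)]
    # (not started here: Interpreter is only a placeholder)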

Let's guesstimate about what kind of work we're talking about here and
if this is even in the realm of possibility. If we find that it *is*
possible, let's figure out what level of work we're talking about.
From there, I can get serious about writing up a PEP/spec, paid
support, and so on.

Regards,
Andy
 
A

Andy O'Meara

That aside, the fundamental problem is what I perceive as a fundamental
design flaw in Python's C API. In Java JNI, each function takes a
JNIEnv* pointer as its first argument. There is nothing that
prevents you from embedding several JVMs in a process. Python can
create embedded subinterpreters, but it works differently. It swaps
subinterpreters like a finite state machine: only one is concurrently
active, and the GIL is shared.

Bingo, it seems that you've hit it right on the head there. Sadly,
that's why I regard this thread as largely futile (but I'm an optimist
when it comes to cool software communities so here I am). I've been
afraid to say it for fear of getting mauled by everyone here, but I
would definitely agree that if there were a context (i.e. environment)
object passed around then perhaps we'd have the best of all worlds.
*winces*

It occurs to me that tcl is doing what you want. Have you ever thought
of not using Python?

Bingo again. Our research says that the options are tcl, perl
(although it's generally untested and not recommended by the
community--definitely dealbreakers for a commercial user like us), and
lua. Also, I'd rather saw off my own right arm than adopt perl, so
that's out. :^)

As I mentioned, we're looking to either (1) support a python dev
community effort, (2) make our own high-performance python interpreter
(that uses an env object as you described), or (3) drop python and go
to lua. I'm favoring them in the order I list them, but the more I
discuss the issue with folks here, the more people seem to be
unfortunately very divided on (1).

Andy
 
P

Patrick Stinson

I'm not finished reading the whole thread yet, but I've got some
responses to this post below.

I've been following this discussion with interest, as it certainly seems
that multi-core/multi-CPU machines are the coming thing, and many
applications will need to figure out how to use them effectively.


Reading this PDF paper is extremely interesting (albeit somewhat dependent
on understanding abstract theories of computation; I have enough math
background to follow it, sort of, and most of the text can be read even
without fully understanding the theoretical abstractions).

I have already heard people talking about "Java applications are buggy". I
don't believe that general sequential programs written in Java are any
buggier than programs written in other languages... so I had interpreted
that to mean (based on some inquiry) that complex, multi-threaded Java
applications are buggy. And while I also don't believe that complex,
multi-threaded programs written in Java are any buggier than complex,
multi-threaded programs written in other languages, it does seem to be true
that Java is one of the currently popular languages in which to write
complex, multi-threaded programs, because of its language support for
threads and concurrency primitives. These reports were from people that are
not programmers, but are field IT people, that have bought and/or support
software and/or hardware with drivers, that are written in Java, and seem to
have non-ideal behavior, (apparently only) curable by stopping/restarting
the application or driver, or sometimes requiring a reboot.

The paper explains many traps that lead to complex, multi-threaded programs
being buggy, and being hard to test. I have worked with parallel machines,
applications, and databases for 25 years, and can appreciate the succinct
expression of the problems explained within the paper, and can, from
experience, agree with its premises and conclusions. Parallel applications
only have been commercial successes when the parallelism is tightly
constrained to well-controlled patterns that could be easily understood.
Threads, especially in "cooperation" with languages that use memory
pointers, have the potential to get out of control, in inexplicable ways.



This statement, after reading the paper, seems somewhat in line with the
author's premise that language acceptability requires that a language be
self-contained/monolithic, and potentially sufficient to implement itself.
That seems to also be one of the reasons that Java is used today for
threaded applications. It does seem to be true, given current hardware
trends, that _some mechanism_ must be provided to obtain the benefit of
multiple cores/CPUs to a single application, and that Python must either
implement or interface to that mechanism to continue to be a viable language
for large scale application development.

Andy seems to want an implementation of independent Python processes
implemented as threads within a single address space, that can be
coordinated by an outer application. This actually corresponds to the model
promulgated in the paper as being most likely to succeed. Of course, it
maps nicely into a model using separate processes, coordinated by an outer
process, also. The differences seem to be:

1) Most applications are historically perceived as corresponding to single
processes. Language features for multi-processing are rare, and such
languages are not in common use.

2) A single address space can be convenient for the coordinating outer
application. It does seem simpler and more efficient to simply "copy" data
from one memory location to another, rather than send it in a message,
especially if the data are large. On the other hand, coordination of memory
access between multiple cores/CPUs effectively causes memory copies from one
cache to the other, and if memory is accessed from multiple cores/CPUs
regularly, the underlying hardware implements additional synchronization and
copying of data, potentially each time the memory is accessed. Being forced
to do message passing of data between processes can actually be more
efficient than access to shared memory at times. I should note that in my
25 years of parallel development, all the systems created used a message
passing paradigm, partly because the multiple CPUs often didn't share the
same memory chips, much less the same address space, and that a key feature
of all the successful systems of that nature was an efficient inter-CPU
message passing mechanism. I should also note that Herb Sutter has a recent
series of columns in Dr Dobbs regarding multi-core/multi-CPU parallelism and
a variety of implementation pitfalls, that I found to be very interesting
reading.

I have noted the multiprocessing module that is new to Python 2.6/3.0 being
feverishly backported to Python 2.5, 2.4, etc... indicating that people
truly find the model/module useful... seems that this is one way, in Python
rather than outside of it, to implement the model Andy is looking for,
although I haven't delved into the details of that module yet, myself. I
suspect that a non-Python application could load one embedded Python
interpreter, and then indirectly use the multiprocessing module to control
other Python interpreters in other processes. I don't know that
multithreading primitives such as described in the paper are available in
the multiprocessing module, but perhaps they can be implemented in some
manner using the tools that are provided; in any case, some interprocess
communication primitives are provided via this new Python module.

There could be opportunity to enhance Python with process creation and
process coordination operations, rather than have it depend on
easy-to-implement-incorrectly coordination patterns or
easy-to-use-improperly libraries/modules of multiprocessing primitives (this
is not a slam of the new multiprocessing module, which appears to be filling
a present need in rather conventional ways, but just to point out that ideas
promulgated by the paper, which I suspect 2 years later are still research
topics, may be a better abstraction than the conventional mechanisms).

One thing Andy hasn't yet explained (or I missed) is why any of his
application is coded in a language other than Python. I can think of a
number of possibilities:

A) (Historical) It existed, then the desire for extensions was seen, and
Python was seen as a good extension language.

B) Python is inappropriate (performance?) for some of the algorithms (but
should they be coded instead as Python extensions, with the core application
being in Python?)

C) Unavailability of Python wrappers for particularly useful 3rd-party
libraries

D) Other?

We develop virtual instrument plugins for music production using
AudioUnit, VST, and RTAS on Windows and OS X. While our dsp engine's
code has to be written in C/C++ for performance reasons, the gui could
have been written in python. But, we didn't because:

1) Our project lead didn't know python, and the project began with
little time for him to learn it.
2) All of our third-party libs (for dsp, plugin-wrappers, etc) are
written in C++, so it would be far easier to write and debug our app if
written in the same language. Could I do it now? yes. Could we do it
then? No.

** Additionally **, we would have run into this problem, which is very
appropriate to this thread:

3) Adding python as an audio scripting language in the audio thread
would have caused concurrency issues if our GUI had been written in
python, since audio threads are not allowed to make blocking calls
(f.ex. acquiring the GIL).

OK, I'll continue reading the thread now :)
 
A

Andy O'Meara

Glenn, great post and points!
Andy seems to want an implementation of independent Python processes
implemented as threads within a single address space, that can be
coordinated by an outer application.  This actually corresponds to the
model promulgated in the paper as being most likely to succeed.

Yeah, that's the idea--let the highest levels run and coordinate the
show.
It does seem simpler and more efficient to simply "copy"
data from one memory location to another, rather than send it in a
message, especially if the data are large.

That's the rub... In our case, we're doing image and video
manipulation--stuff not good to be messaging from address space to
address space. The same argument holds for numerical processing with
large data sets. The workers handing back huge data sets via
messaging isn't very attractive.
One thing Andy hasn't yet explained (or I missed) is why any of his
application is coded in a language other than Python.  

Our software runs in real time (so performance is paramount),
interacts with other static libraries, depends on worker threads to
perform real-time image manipulation, and leverages Windows and Mac OS
API concepts and features. Python's performance hits have generally
been a huge challenge with our animators because they often have to go
back and massage their python code to improve execution performance.
So, in short, there are many reasons why we use python as a part
rather than a whole.

The other area of pain that I mentioned in one of my other posts is
that what we ship, above all, can't be flaky. The lack of module
cleanup (intended to be addressed by PEP 3121), using a duplicate copy
of the python dynamic lib, and namespace black magic to achieve
independent interpreters are all examples that have made using python
for us much more challenging and time-consuming than we ever
anticipated.

Again, if it turns out nothing can be done about our needs (which
appears to be more and more like the case), I think it's important for
everyone here to consider the points raised here in the last week.
Moreover, realize that the python dev community really stands to gain
from making python usable as a tool (rather than a monolith). This
fact alone has caused lua to *rapidly* rise in popularity with
software companies looking to embed a powerful, lightweight
interpreter in their software.

As a python language fan and enthusiast, don't let lua win! (I say
this endearingly of course--I have the utmost respect for both
communities and I only want to see CPython be an attractive pick when
a company is looking to embed a language that won't intrude upon their
app's design).


Andy
 
P

Patrick Stinson

We are in the same position as Andy here.

I think that something that would help people like us produce
something in code form is a collection of information outlining the
problem and suggested solutions, appropriate parts of the CPython's
current threading API, and pros and cons of the many various proposed
solutions to the different levels of the problem. The most valuable
information I've found is contained in the many (lengthy!) discussions
like this one, a few related PEP's, and the CPython docs, but has
anyone condensed the state of the problem into a wiki or something
similar? Maybe we should start one?

For example, Guido's post here
http://www.artima.com/weblogs/viewpost.jsp?thread=214235 describes some
possible solutions to the problem, like interpreter-specific locks, or
fine-grained object locks, and he also mentions the primary
requirement of not harming the performance of single-threaded
apps. As I understand it, that requirement does not rule out new build
configurations that provide some level of concurrency, as long as you
can still compile python so as to perform as well on single-threaded
apps.

To add to the heap of use cases, the most important thing to us is to
simply have the python language and the sip/PyQt modules available to
us. All we wanted to do was embed the interpreter and language core as
a local scripting engine, so had we patched python to provide
concurrent execution, we wouldn't have cared about all of the other
unsupported extension modules since our scripts are quite
application-specific.

It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

Anyway, I'm having fun reading all of these papers and news postings,
but it's true that code talks, and it could be a little easier if the
state of the problems was condensed. This could be an intense and fun
project, but frankly it's a little tough to keep it all in my head. Is
there a wiki or something out there or should we start one, or do I
just need to read more code?
 
P

Patrick Stinson

As a side note to the performance question, we are executing python
code in an audio thread that is used in all of the top-end music
production environments. We have found the language to perform
extremely well when executed at control-rate frequency, meaning we
aren't doing DSP computations, just responding to less-frequent events
like user input and MIDI messages.

So we are sitting on this music platform with unimaginable possibilities
in the music world (of which python does not play a role), but those
little CPU spikes caused by the GIL at low latencies won't let us have
it. AFAIK, there is no music scripting language out there that would
come close, and yet we are sooooo close! This is a big deal.
 
T

Terry Reedy

Stefan said:
Ah, weren't those wonderful times back in the days of Win3.0, when DLL-hell was
inhabited by only 15 libraries? *sigh*

... although ... wait, didn't Win3.0 have more than that already? Maybe you
meant Windows 1.0?

SCNR-ly,

Is that the equivalent of a smiley? or did you really not understand
what I wrote?
 
J

Jesse Noller

I see what you're saying, but let's note that what you're talking
about at this point is an interpreter that contains protection against
client-level code violating the (supposed) direction put forth in python's
multithreading guidelines. Glenn Linderman's post really gets at
what's at hand here. It's really important to consider that it's not
a given that python (or any framework) has to be designed against
hazardous use. Again, I refer you to the diagrams and guidelines in
the QuickTime API:

http://developer.apple.com/technotes/tn/tn2125.html

They tell you point-blank what you can and can't do, and it's that
simple. Their engineers can then simply create the implementation
around those specs and not weigh any of the implementation down with
sync mechanisms. I'm in the camp that simplicity and convention wins
the day when it comes to an API. It's safe to say that software
engineers expect and assume that a thread that doesn't have contact
with other threads (except for explicit, controlled message/object
passing) will run unhindered and safely, so I raise an eyebrow at the
GIL (or any internal "helper" sync stuff) holding up a thread's
performance when the app is designed to not need lower-level global
locks.

Anyway, let's talk about solutions. My company is looking to support a
python dev community endeavor that allows the following:

- an app makes N worker threads (using the OS)

- each worker thread makes its own interpreter, pops scripts off a
work queue, and manages exporting (and then importing) result data to
other parts of the app. Generally, we're talking about CPU-bound work
here.

- each interpreter has the essentials (e.g. math support, string
support, re support, and so on -- I realize this is open-ended, but
work with me here).

Let's guesstimate about what kind of work we're talking about here and
if this is even in the realm of possibility. If we find that it *is*
possible, let's figure out what level of work we're talking about.
From there, I can get serious about writing up a PEP/spec, paid
support, and so on.

Point of order! Just for my own sanity if anything :) I think some
minor clarifications are in order.

What are "threads" within Python:

Python has built-in support for POSIX lightweight threads. This is
what most people are talking about when they see, hear and say
"threads" - they mean Posix Pthreads
(http://en.wikipedia.org/wiki/POSIX_Threads). This is not what you
(Andy) seem to be asking for. PThreads are attractive due to the fact
they exist within a single interpreter, can share memory all "willy
nilly", etc.

Python does in fact, use OS-Level pthreads when you request multiple threads.

The Global Interpreter Lock is fundamentally designed to make the
interpreter easier to maintain and safer: Developers do not need to
worry about other code stepping on their namespace. This makes things
thread-safe, inasmuch as having multiple PThreads within the same
interpreter space modifying global state and variables at once is,
well, bad. A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on its merry way and process away.

POSIX Threads/pthreads/threads as we get from Java, allow unsafe
programming styles. These programming styles are of the "shared
everything deadlock lol" kind. The GIL *partially* protects against
some of the pitfalls. You do not seem to be asking for pthreads :)

http://www.python.org/doc/faq/library/#can-t-we-get-rid-of-the-global-interpreter-lock
http://en.wikipedia.org/wiki/Multi-threading
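
To make the "share memory all willy nilly" point concrete, here's a tiny
2.x-style sketch: two real OS threads mutate the same object freely, but
because the work is pure-Python and CPU-bound, the GIL serializes them (and
the unsynchronized += can still lose updates):

    import threading, time

    counter = [0]                           # shared between threads, no locking

    def spin(n):
        for _ in xrange(n):
            counter[0] += 1                 # not atomic: updates can be lost

    start = time.time()
    threads = [threading.Thread(target=spin, args=(5000000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print counter[0], time.time() - start   # runs no faster than a single thread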

However, then there are processes.

The difference between threads and processes is that processes do *not
share memory* with each other, but they can share state via
queues/pipes/message passing. What you seem to be asking for is the ability
to completely fork independent Python interpreters, with their own
namespaces, and coordinate work via a shared queue accessed with pipes
or some other communications mechanism. Correct?

Multiprocessing, as it exists within python 2.6 today, actually forks
(see trunk/Lib/multiprocessing/forking.py) a completely independent
interpreter per process created, and then constructs pipes to
inter-communicate and queues to coordinate work. I am not
suggesting this is good for you - I'm trying to get to exactly what
you're asking for.
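
As a rough sketch of that model with the 2.6 module (one process per worker,
fed from a shared queue; this assumes the work items and results are
picklable, and crunch() is just a stand-in for real CPU-bound work):

    from multiprocessing import Process, JoinableQueue, Queue

    def crunch(n):
        total = 0                           # stand-in for an image frame, DSP block, ...
        for i in xrange(n):
            total += i * i
        return total

    def worker(jobs, results):
        while True:
            item = jobs.get()
            if item is None:                # sentinel: no more work
                jobs.task_done()
                break
            results.put(crunch(item))
            jobs.task_done()

    if __name__ == '__main__':
        jobs, results = JoinableQueue(), Queue()
        procs = [Process(target=worker, args=(jobs, results)) for _ in range(4)]
        for p in procs:
            p.start()
        for n in [1000000] * 8:
            jobs.put(n)
        for _ in procs:
            jobs.put(None)                  # one sentinel per worker
        jobs.join()                         # block until every job is marked done
        print [results.get() for _ in range(8)]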

Fundamentally, allowing total free-threading with POSIX threads, using
the same Java model for control, is a recipe for pain - we're just
repeating mistakes instead of solving a problem; ergo, Adam Olsen's
work. Monitors, Actors, etc have all been discussed, proposed and are
being worked on.

So, just to clarify - Andy, do you want one interpreter, $N threads
(e.g. PThreads) or the ability to fork multiple "heavyweight"
processes?

Other bits for reading:
http://www.boddie.org.uk/python/pprocess.html (as an alternative to
multiprocessing)
http://smparkes.net/tag/dramatis/
http://osl.cs.uiuc.edu/parley/
http://candygram.sourceforge.net/
 
J

Jesse Noller

I almost forgot:

http://www.kamaelia.org/Home
 
G

Glenn Linderman

Glenn, great post and points!

Thanks. I need to admit here that while I've got a fair bit of
professional programming experience, I'm quite new to Python -- I've not
learned its internals, nor even the full extent of its rich library. So
I have some questions that are partly about the goals of the
applications being discussed, partly about how Python is constructed,
and partly about how the library is constructed. I'm hoping to get a
better understanding of all of these; perhaps once a better
understanding is achieved, limitations will be understood, and maybe
solutions be achievable.

Let me define some speculative Python interpreters; I think the first is
today's Python:

PyA: Has a GIL. PyA threads can run within a process; but are
effectively serialized to the places where the GIL is obtained/released.
Needs the GIL because that solves lots of problems with non-reentrant
code (an example of non-reentrant code, is code that uses global (C
global, or C static) variables – note that I'm not talking about Python
vars declared global... they are only module global). In this model,
non-reentrant code could include pieces of the interpreter, and/or
extension modules.

PyB: No GIL. PyB threads acquire/release a lock around each reference to
a global variable (like "with" feature). Requires massive recoding of
all code that contains global variables. Reduces performance
significantly by the increased cost of obtaining and releasing locks.

PyC: No locks. Instead, recoding is done to eliminate global variables
(interpreter requires a state structure to be passed in). Extension
modules that use globals are prohibited... this eliminates large
portions of the library, or requires massive recoding. PyC threads do
not share data between threads except by explicit interfaces.

PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate
global variables, and each interpreter instance is provided a state
structure. There is still a GIL, however, because globals are
potentially still used by some modules. Code is added to detect use of
global variables by a module, or some contract is written whereby a
module can be declared to be reentrant and global-free. PyA threads will
obtain the GIL as they would today. PyC threads would be available to be
created. PyC instances refuse to call non-reentrant modules, but also
need not obtain the GIL... PyC threads would have limited module support
initially, but over time, most modules can be migrated to be reentrant
and global-free, so they can be used by PyC instances. Most 3rd-party
libraries today are starting to care about reentrancy anyway, because of
the popularity of threads.

The assumptions here are that:

Data-1) A Python interpreter doesn't provide any mechanism to share
normal data among threads, they are independent... but message passing
works.
Data-2) A Python interpreter could be extended to provide mechanisms to
share special data, and the data would come with an implicit lock.
Data-3) A Python interpreter could be extended to provide unlocked
access to special data, requiring the application to handle the
synchronization between threads. Data of type 2 could be used to control
access to data of type 3. This type of data could be large, or
frequently referenced data, but only by a single thread at a time, with
major handoffs to a different thread synchronized by the application in
whatever way it chooses.

Context-1) A Python interpreter would know about threads it spawns, and
could pass in a block of context (in addition to the state structure) as
a parameter to a new thread. That block of context would belong to the
thread as long as it exists, and return to the spawner when the thread
completes. An embedded interpreter would also be given a block of
context (in addition to the state structure). This would allow
application context to be created and passed around. Pointers to shared
memory structures, might be typical context in the embedded case.

Context-2) Embedded Python interpreters could be spawned either as PyA
threads or PyC threads. PyC threads would be limited to modules that are
reentrant.


I think that PyB and PyC are the visions that people see, which argue
against implementing independent interpreters. PyB isn't truly
independent, because data are shared, recoding is required, and
performance suffers. Ick. PyC requires "recoding the whole library"
potentially, if it is the only solution. PyD allows access to the whole
standard library of modules, exactly like today, but the existing
limitations still obtain for PyA threads using that model – very limited
concurrency. But PyC threads would execute in their own little
environments, and not need locking. Pure Python code would be
immediately happy there. Properly coded (reentrant, global-free)
extensions would be happy there. Lots of work could be done there, to
use up multi-core/multi-CPU horsepower (shared-memory architecture).

Questions for people that know the Python internals: Is PyD possible?
How hard? Is a PyC thread an effective way of implementing a Python
sandbox? If it is, and if it would attract the attention of Brett
Cannon, who at least once wanted to do a thesis on Python sandboxes, he
could be a helpful supporter.

Questions for Andy: is the type of work you want to do in independent
threads mostly pure Python? Or with libraries that you can control to
some extent? Are those libraries reentrant? Could they be made
reentrant? How much of the Python standard library would need to be
available in reentrant mode to provide useful functionality for those
threads? I think you want PyC

Questions for Patrick: So if you had a Python GUI using the whole
standard library -- would it likely runs fine in PyA threads, and still
be able to use PyC threads for the audio scripting language? Would it be
a problem for those threads to have limited library support (only
reentrant modules)?
That's the rub... In our case, we're doing image and video
manipulation--stuff not good to be messaging from address space to
address space. The same argument holds for numerical processing with
large data sets. The workers handing back huge data sets via
messaging isn't very attractive.

In the module multiprocessing environment could you not use shared
memory, then, for the large shared data items?
Our software runs in real time (so performance is paramount),
interacts with other static libraries, depends on worker threads to
perform real-time image manipulation, and leverages Windows and Mac OS
API concepts and features. Python's performance hits have generally
been a huge challenge with our animators because they often have to go
back and massage their python code to improve execution performance.
So, in short, there are many reasons why we use python as a part
rather than a whole.

The other area of pain that I mentioned in one of my other posts is
that what we ship, above all, can't be flaky. The lack of module
cleanup (intended to be addressed by PEP 3121), using a duplicate copy
of the python dynamic lib, and namespace black magic to achieve
independent interpreters are all examples that have made using python
for us much more challenging and time-consuming than we ever
anticipated.

Again, if it turns out nothing can be done about our needs (which
appears to be more and more like the case), I think it's important for
everyone here to consider the points raised here in the last week.
Moreover, realize that the python dev community really stands to gain
from making python usable as a tool (rather than a monolith). This
fact alone has caused lua to *rapidly* rise in popularity with
software companies looking to embed a powerful, lightweight
interpreter in their software.

As a python language fan and enthusiast, don't let lua win! (I say
this endearingly of course--I have the utmost respect for both
communities and I only want to see CPython be an attractive pick when
a company is looking to embed a language that won't intrude upon their
app's design).

Thanks for the further explanations.
 
A

Andy O'Meara

The Global Interpreter Lock is fundamentally designed to make the
interpreter easier to maintain and safer: Developers do not need to
worry about other code stepping on their namespace. This makes things
thread-safe, inasmuch as having multiple PThreads within the same
interpreter space modifying global state and variables at once is,
well, bad. A c-level module, on the other hand, can sidestep/release
the GIL at will, and go on its merry way and process away.

....Unless part of the C module execution involves the need to do CPU-
bound work on another thread through a different python interpreter,
right? (even if the interpreter is 100% independent, yikes). For
example, imagine a python C module designed to programmatically generate
images (and video frames) in RAM for immediate and subsequent use in
animation. Meanwhile, we'd like to have a pthread with its own
interpreter with an instance of this module and have it dequeue jobs
as they come in (in fact, there'd be one of these threads for each
excess core present on the machine). As far as I can tell, it seems
CPython's current state can't support CPU-bound parallelization in the same
address space (basically, it seems that we're talking about the
"embarrassingly parallel" scenario raised in that paper). Why does it
have to be in same address space? Convenience and simplicity--the
same reasons that most APIs let you hang yourself if the app does dumb
things with threads. Also, when the data sets that you need to send
to and from each process is large, using the same address space makes
more and more sense.

So, just to clarify - Andy, do you want one interpreter, $N threads
(e.g. PThreads) or the ability to fork multiple "heavyweight"
processes?

Sorry if I haven't been clear, but we're talking the app starting a
pthread, making a fresh/clean/independent interpreter, and then being
responsible for its safety at the highest level (with the payoff of
each of these threads executing without hindrance). No different
than if you used most APIs out there where step 1 is always to make
and init a context object and the final step is always to destroy/take-
down that context object.

I'm a lousy writer sometimes, but I feel bad if you took the time to
describe threads vs processes. The only reason I raised IPC with my
"messaging isn't very attractive" comment was to respond to Glenn
Linderman's points regarding tradeoffs of shared memory vs no.


Andy
 
J

Jesse Noller

I'm a lousy writer sometimes, but I feel bad if you took the time to
describe threads vs processes. The only reason I raised IPC with my
"messaging isn't very attractive" comment was to respond to Glenn
Linderman's points regarding tradeoffs of shared memory vs no.

I actually took the time to bring anyone listening in up to speed, and
to clarify so I could better understand your use case. Don't feel bad,
things in the thread are moving fast and I just wanted to clear it up.

Ideally, we all want to improve the language, and the interpreter.
However trying to push it towards a particular use case is dangerous
given the idea of "general use".

-jesse
 
R

Rhamphoryncus

Thanks. I need to admit here that while I've got a fair bit of
professional programming experience, I'm quite new to Python -- I've not
learned its internals, nor even the full extent of its rich library. So
I have some questions that are partly about the goals of the
applications being discussed, partly about how Python is constructed,
and partly about how the library is constructed. I'm hoping to get a
better understanding of all of these; perhaps once a better
understanding is achieved, limitations will be understood, and maybe
solutions be achievable.

Let me define some speculative Python interpreters; I think the first is
today's Python:

PyA: Has a GIL. PyA threads can run within a process; but are
effectively serialized to the places where the GIL is obtained/released.
Needs the GIL because that solves lots of problems with non-reentrant
code (an example of non-reentrant code, is code that uses global (C
global, or C static) variables – note that I'm not talking about Python
vars declared global... they are only module global). In this model,
non-reentrant code could include pieces of the interpreter, and/or
extension modules.

PyB: No GIL. PyB threads acquire/release a lock around each reference to
a global variable (like "with" feature). Requires massive recoding of
all code that contains global variables. Reduces performance
significantly by the increased cost of obtaining and releasing locks.

PyC: No locks. Instead, recoding is done to eliminate global variables
(interpreter requires a state structure to be passed in). Extension
modules that use globals are prohibited... this eliminates large
portions of the library, or requires massive recoding. PyC threads do
not share data between threads except by explicit interfaces.

PyD: (A hybrid of PyA & PyC). The interpreter is recoded to eliminate
global variables, and each interpreter instance is provided a state
structure. There is still a GIL, however, because globals are
potentially still used by some modules. Code is added to detect use of
global variables by a module, or some contract is written whereby a
module can be declared to be reentrant and global-free. PyA threads will
obtain the GIL as they would today. PyC threads would be available to be
created. PyC instances refuse to call non-reentrant modules, but also
need not obtain the GIL... PyC threads would have limited module support
initially, but over time, most modules can be migrated to be reentrant
and global-free, so they can be used by PyC instances. Most 3rd-party
libraries today are starting to care about reentrancy anyway, because of
the popularity of threads.

PyE: objects are reclassified as shareable or non-shareable, many
types are now only allowed to be shareable. A module and its classes
become shareable with the use of a __future__ import, and their
shareddict uses a read-write lock for scalability. Most other
shareable objects are immutable. Each thread is run in its own
private monitor, and thus protected from the normal threading memory
model nasties. Alas, this gives you all the semantics, but you still
need scalable garbage collection.. and CPython's refcounting needs the
GIL.

Our software runs in real time (so performance is paramount),
interacts with other static libraries, depends on worker threads to
perform real-time image manipulation, and leverages Windows and Mac OS
API concepts and features.  Python's performance hits have generally
been a huge challenge with our animators because they often have to go
back and massage their python code to improve execution performance.
So, in short, there are many reasons why we use python as a part
rather than a whole. [...]
As a python language fan and enthusiast, don't let lua win!  (I say
this endearingly of course--I have the utmost respect for both
communities and I only want to see CPython be an attractive pick when
a company is looking to embed a language that won't intrude upon their
app's design).

I agree with the problem, and desire to make python fill all niches,
but let's just say I'm more ambitious with my solution. ;)
 
A

Andy O'Meara

Another great post, Glenn!! Very well laid-out and posed!! Thanks for
taking the time to lay all that out.
Questions for Andy: is the type of work you want to do in independent
threads mostly pure Python? Or with libraries that you can control to
some extent? Are those libraries reentrant? Could they be made
reentrant? How much of the Python standard library would need to be
available in reentrant mode to provide useful functionality for those
threads? I think you want PyC

I think you've defined everything perfectly, and you're of
course correct about my love for the PyC model. :^)

Like any software that's meant to be used without restrictions, our
code and frameworks always use a context object pattern so that
there's never any non-const global/shared data. I would go as far as to
say that this is the case with more performance-oriented software than
you may think since it's usually a given for us to have to be parallel
friendly in as many ways as possible. Perhaps Patrick can back me up
there.
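
For what it's worth, here's a small sketch of that context-object pattern in
plain Python (the names are made up for illustration): all state hangs off an
explicitly created context, nothing lives in globals, and the caller owns
creation and teardown:

    class ScriptContext(object):            # hypothetical name
        def __init__(self):
            self.state = {}                 # every bit of state lives here
        def run(self, source):
            exec source in self.state       # 2.x exec-statement syntax
        def destroy(self):
            self.state.clear()

    ctx = ScriptContext()                   # step 1: create/init the context
    ctx.run("x = 40 + 2")
    print ctx.state['x']                    # -> 42
    ctx.destroy()                           # final step: tear the context down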

As to what modules are "essential"... As you point out, once
reentrant module implementations caught on in PyC or hybrid world, I
think we'd start to see real effort to whip them into compliance--
there's just so much to be gained imho. But to answer the question,
there's the obvious ones (operator, math, etc), string/buffer
processing (string, re), C bridge stuff (struct, array), and OS basics
(time, file system, etc). Nice-to-haves would be buffer and image
decompression (zlib, libpng, etc), crypto modules, and xml. As far as
I can imagine, I have to believe all of these modules already contain
little, if any, global data, so I have to believe they'd be super easy
to make "PyC happy". Patrick, what would you see you guys using?

In the module multiprocessing environment could you not use shared
memory, then, for the large shared data items?

As I understand things, the multiprocessing module puts stuff in a child
process (i.e. a separate address space), so the only way to get stuff to/
from it is via IPC, which can include a shared/mapped memory region.
Unfortunately, a shared address region doesn't work when you have
large and opaque objects (e.g. a rendered CoreVideo movie in the
QuickTime API or 300 megs of audio data that just went through a
DSP). Then you've got the hit of serialization if you've got
intricate data structures (that would normally need to be
serialized, such as a hashtable or something). Also, if I may speak
for commercial developers out there who are just looking to get the
job done without new code, it's usually preferable to use just a
single high-level sync object (for when the job is complete) than to
start child processes and use IPC. The former is just WAY less
code, plain and simple.


Andy
 
G

Glenn Linderman

PyE: objects are reclassified as shareable or non-shareable, many
types are now only allowed to be shareable. A module and its classes
become shareable with the use of a __future__ import, and their
shareddict uses a read-write lock for scalability. Most other
shareable objects are immutable. Each thread is run in its own
private monitor, and thus protected from the normal threading memory
model nasties. Alas, this gives you all the semantics, but you still
need scalable garbage collection.. and CPython's refcounting needs the
GIL.

Hmm. So I think your PyE is an attempt to be more
explicit about what I said above in PyC: PyC threads do not share data
between threads except by explicit interfaces. I consider your
definitions of shared data types somewhat orthogonal to the types of
threads, in that both PyA and PyC threads could use these new shared
data items.

I think/hope that you meant that "many types are now only allowed to be
non-shareable"? At least, I think that should be the default; they
should be within the context of a single, independent interpreter
instance, so other interpreters don't even know they exist, much less
how to share them. If so, then I understand most of the rest of your
paragraph, and it could be a way of providing shared objects, perhaps.

I don't understand the comment that CPython's refcounting needs the
GIL... yes, it needs the GIL if multiple threads see the object, but not
for private objects... only one thread uses the private objects... so
today's refcounting should suffice... with each interpreter doing its
own refcounting and collecting its own garbage.

Shared objects would have to do refcounting in a protected way, under
some lock. One "easy" solution would be to have just two types of
objects; non-shared private objects in a thread, and global shared
objects; access to global shared objects would require grabbing the GIL,
and then accessing the object, and releasing the GIL. An interface
could allow for grabbing/releasing the GIL around a block of accesses to
shared objects (with GIL:). This could reduce the number of GIL
acquires. Then the reference counting for those objects would also be
done under the GIL, and the garbage collecting? By another PyA thread,
perhaps, that grabs the GIL by default? Or a PyC one that explicitly
grabs the GIL and does a step of global garbage collection?

A more complex, more parallel solution would allow for independent
groups of shared objects. Of course, once there is more than one lock
involved, there is more potential for deadlock, but it also provides for
more parallelism. So a shared object might inherit from a "concurrency
group" which would have a lock that could be acquired (with conc_group:)
for access to those data items. Again, the reference counting would be
done under that lock for that group of objects, and garbage collecting
those objects would potentially require that lock as well...
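
As a rough illustration of that "concurrency group" idea using nothing but
today's threading primitives (the names are made up, and this is an
application-level sketch, not a proposed interpreter feature):

    import threading

    class ConcurrencyGroup(object):
        def __init__(self):
            self._lock = threading.RLock()  # one lock shared by the whole group
        def __enter__(self):
            self._lock.acquire()
            return self
        def __exit__(self, *exc):
            self._lock.release()
            return False

    class SharedDict(object):
        def __init__(self, group):
            self.group = group
            self._data = {}
        def __setitem__(self, key, value):
            with self.group:                # all access goes through the group lock
                self._data[key] = value
        def __getitem__(self, key):
            with self.group:
                return self._data[key]

    render_state = ConcurrencyGroup()
    frames = SharedDict(render_state)
    frames['latest'] = object()             # threads hand data off through frames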

The solution with multiple concurrency groups allows for such groups to
contain a single shared object, or many (probably related) shared
objects. So the application gets a choice of the granularity of sharing
and locking, and can choose the number of locks to optimize performance
and achieve correctness. This sort of shared data among threads,
though, suffers in the limit from all the problems described in the
Berkeley paper. More reliable programs might be achieved by using
straight PyC threads, and some very limited "data ports" that can be
combined using a higher-order flow control concept, as outlined in the
paper.

While Python might be extended with these flow control concepts, they
could be added gradually over time, and in the embedded case, could be
implemented in some other language.


--
Glenn
 
J

Jesse Noller

As I understand things, the multiprocessing module puts stuff in a child
process (i.e. a separate address space), so the only way to get stuff to/
from it is via IPC, which can include a shared/mapped memory region.
Unfortunately, a shared address region doesn't work when you have
large and opaque objects (e.g. a rendered CoreVideo movie in the
QuickTime API or 300 megs of audio data that just went through a
DSP). Then you've got the hit of serialization if you've got
intricate data structures (that would normally need to be
serialized, such as a hashtable or something). Also, if I may speak
for commercial developers out there who are just looking to get the
job done without new code, it's usually preferable to use just a
single high-level sync object (for when the job is complete) than to
start child processes and use IPC. The former is just WAY less
code, plain and simple.

Are you familiar with the API at all? Multiprocessing was designed to
mimic threading in about every way possible; the only restriction on
shared data is that it must be serializable, but even then you can
override or customize the behavior.

Also, inter process communication is done via pipes. It can also be
done with messages if you want to tweak the manager(s).
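
For the large-buffer case specifically, here's a minimal 2.6 sketch of
multiprocessing's shared-memory arrays -- the buffer lives in shared memory,
so handing it to the child involves no serialization of the data itself
(process_block() stands in for real DSP or image work):

    from multiprocessing import Process, Array

    def process_block(buf, start, count):
        for i in xrange(start, start + count):
            buf[i] *= 0.5                   # work done in place on shared memory

    if __name__ == '__main__':
        samples = Array('d', 1 << 20)       # ~8 MB of doubles in shared memory
        samples[0] = 1.0
        p = Process(target=process_block, args=(samples, 0, 1024))
        p.start()
        p.join()
        print samples[0]                    # -> 0.5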

-jesse
 
