2.6, 3.0, and truly independent interpreters


Glenn Linderman

I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not via serialization/messaging. This is what I've been trying
to explain to others here; that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.

And I think we still are miscommunicating! Or maybe communicating anyway!

So when you said "object", I actually don't know whether you meant
Python object or something else. I assumed Python object, which may not
have been correct... but read on, I think the stuff below clears it up.

Your instincts are right. I'd only add on that when you're talking
about data structures associated with an intricate video format, the
complexity and depth of the data structures is insane -- the LAST
thing you want to burn cycles on is serializing and unserializing that
stuff (so IPC is out)--again, we're already on the same page here.

I think at one point you made the comment that shared memory is a
solution to handle large data sets between a child process and the
parent. Although this is certainly true in principle, it doesn't hold
up in practice since complex data structures often contain 3rd party
and OS API objects that have their own allocators. For example, in
video encoding, there's TONS of objects that comprise memory-resident
video from all kinds of APIs, so the idea of having them allocated
from a shared/mapped memory block isn't even possible. Again, I only
raise this to offer evidence that doing real-world work in a child
process is a deal breaker--a shared address space is just way too much
to give up.

So I was thinking of multimedia data structures as a blob, and, in fact,
as a contiguous blob... that would be easy to toss into a shared memory.

Then when you mentioned thousands of objects, I imagined thousands of
Python objects, and somehow transforming the blob into same... and back
again. And Python objects certainly would need to be
serialized/deserialized, either via pickle or some
application-specific-more-efficient mechanism, but still that process
would add 3 copies to the process of moving data from one thread to another.

But now I think I understand your issue, about why shared memory is a
problem.

In addition to contiguous blobs, a multimedia application might have
non-contiguous blobs. For video, the contiguous blob might be an MPEG
stream (of one standard or another). This might get transformed into a
list of frames; the MPEG stream has frames of different types due to
compression, but that isn't the best format for doing transformations,
so a 3rd party library might be called to decompress the stream into a
set of independently allocated chunks, each containing one frame (each
possibly consisting of several allocations of memory for associated
metadata) that is independent of other frames (there may still be some
internal compression for the frame, such as the difference between JPEG
and BMP). This collection of frames is now subdivided into 8 parts, and
each of the 8 parts wants to be passed to a thread for processing. The
application provides a pointer to one part of the frames, each thread
has been loaded with modules that understand the structure of the
frames, and the user code manipulates those frames based on these
manipulation modules. If there are 8 processors, this goes 8 times as
fast as it would otherwise, except for GIL.
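The work distribution described above can be sketched roughly like this (all names are illustrative; in stock CPython the 8x speedup only materializes if the per-part work happens in C with the GIL released, which is exactly the sticking point of this thread):

```python
from concurrent.futures import ThreadPoolExecutor

def process_part(frames):
    # Placeholder for the 3rd-party frame-manipulation module;
    # the real work would happen in C on independently allocated frames.
    return [f * 2 for f in frames]

frames = list(range(80))                    # stand-in for decoded frames
parts = [frames[i::8] for i in range(8)]    # subdivide into 8 parts

# Hand each part to one of 8 threads; each thread gets a pointer to
# its own part of the frames and manipulates them independently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_part, parts))
```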

Hence, shared memory or shared temporary files is hard, because the
splitter is a 3rd party process that uses the standard C allocator, and
the data would have to be reconstructed/copied in shared space
afterwards to use it in multiple processes, which is not only
performance killing, but is extra code to maintain.

So I think this description is of a problem for which PyC (non-GIL,
independent) threads would be useful, and other solutions would be lower
performance.

It'll be interesting to see if anyone can suggest an alternative, now
that the problem is described this way. I somehow doubt it.
 

Glenn Linderman

Glenn said:
If None remains global, then type(None) also remains global, and so is
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.

I certainly don't grok the implications of what you say above, as I
barely grok the semantics of it.

Not only is there a link from a class to its base classes, there
is a link to all its subclasses as well.

Since every class is ultimately a subclass of 'object', this means
that starting from *any* object, you can work your way up the
__bases__ chain until you get to 'object', then walk the subclass
hierarchy and find every class in the system.

This means that if any object at all is shared, then all class
objects, and any object reachable from them, are shared as well.

Thanks for the explanation.

So that means that each PyC thread should have its own object, and all
its own globals. Unless Python were reimplemented to allow the
interpreter itself to do things that the Python code can't access. But
that wouldn't be very Pythonic, perhaps. So that means that each PyC
thread should have its own object, and all its own globals, and be truly
independent.
 

Rhamphoryncus

Grrr... I posted a ton of lengthy replies to you and other recent
posts here using Google and none of them made it, argh. Poof. There's
nothing that fires me up more than lost work, so I'll have to
revert to short and simple answers for the time being. Argh, damn.






I'm with you on all counts, so no disagreement there.  On the "passing
a ptr everywhere" issue, perhaps one idea is that all objects could
have an additional field that would point back to their parent context
(ie. their interpreter).  So the only prototypes that would have to be
modified to contain the context ptr would be the ones that don't
inherently operate on objects (e.g. importing a module).

Trying to directly share objects like this is going to create
contention. The refcounting becomes the sequential portion of
Amdahl's Law. This is why safethread doesn't scale very well: it shares
a massive number of objects.
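To put a number on that: Amdahl's Law says that if a fraction s of the work is serialized (here, contended refcount updates), the best speedup on n cores is 1/(s + (1-s)/n). Even a modest serial fraction caps scaling hard:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Maximum speedup when serial_fraction of the work cannot be
    parallelized (e.g. refcounting on shared objects)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# With just 10% of the time spent in contended refcounting:
print(round(amdahl_speedup(0.10, 8), 2))     # ~4.71x on 8 cores
print(round(amdahl_speedup(0.10, 1024), 2))  # ~9.91x no matter how many
```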

An alternative, actually simpler, is to create proxies to your real
object. The proxy object has a pointer to the real object and the
context containing it. When you call a method it serializes the
arguments, acquires the target context's GIL (while releasing yours),
and deserializes in the target context. Once the method returns it
reverses the process.
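A toy sketch of that proxy scheme (all names here are illustrative; threading.Lock stands in for a per-context GIL and pickle for the serializer):

```python
import pickle
import threading

class Context:
    """Stand-in for an interpreter context with its own 'GIL'."""
    def __init__(self):
        self.gil = threading.Lock()

class Proxy:
    """Proxy to an object living in another context. Method calls
    serialize their arguments, run under the target context's lock,
    and serialize the result back to the caller."""
    def __init__(self, target, context):
        self._target = target
        self._context = context

    def call(self, name, *args):
        payload = pickle.dumps(args)           # serialize in caller's context
        with self._context.gil:                # acquire the target's "GIL"
            real_args = pickle.loads(payload)  # deserialize in the target
            result = getattr(self._target, name)(*real_args)
            out = pickle.dumps(result)         # serialize the result
        return pickle.loads(out)               # deserialize back in caller

ctx = Context()
shared = []                      # the "real object" living in ctx
proxy = Proxy(shared, ctx)
proxy.call("append", 42)
print(proxy.call("__len__"))     # 1
```

A real implementation would also release the caller's own lock around the blocking acquire; that is elided here.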

There are two reasons why this may perform well for you: First,
operations done purely in C may cheat (if so designed). A copy from
one memory buffer to another memory buffer may be given two proxies as
arguments, but then operate directly on the target objects (ie without
serialization).

Second, if a target context is idle you can enter it (acquiring its
GIL) without any context switch.

Of course that scenario is full of "maybes", which is why I have
little interest in it.

An even better scenario is if your memory buffer's methods are in pure
C and it's a simple object (no pointers). You can stick the memory
buffer in shared memory and have multiple processes manipulate it from
C. More "maybes".

An evil trick if you need pointers, but control the allocation, is to
take advantage of the fork model. Have a master process create a
bunch of blank files (temp files if linux doesn't allow /dev/zero),
mmap them all using MAP_SHARED, then fork and utilize. The addresses
will be inherited from the master process, so any pointers within them
will be usable across all processes. If you ever want to return
memory to the system you can close that file, then have all processes
use MAP_SHARED|MAP_FIXED to overwrite it. Evil, but should be
disturbingly effective, and still doesn't require modifying CPython.
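A minimal sketch of the shared-mapping-plus-fork idea in Python (using a single anonymous MAP_SHARED region rather than the per-file scheme described above; Unix-only):

```python
import mmap
import os
import struct

SIZE = 4096

# Create the shared mapping *before* forking; the child inherits the
# mapping at the same address, so writes are visible to both sides.
buf = mmap.mmap(-1, SIZE)        # fileno=-1: anonymous MAP_SHARED memory

pid = os.fork()
if pid == 0:                     # child: write into the shared region
    buf[0:4] = struct.pack("i", 42)
    os._exit(0)

os.waitpid(pid, 0)               # parent: wait for the child, then read
(value,) = struct.unpack("i", buf[0:4])
print(value)                     # 42
```

Because the addresses match across the fork, raw pointers stored inside the region remain valid in every process, which is the crux of the trick.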
 

Andy O'Meara

Andy,

Why don't you just use a temporary file
system (ram disk) to store the data that
your app is manipulating. All you need to
pass around then is a file descriptor.

--JamesMills

Unfortunately, it's the penalty of serialization and unserialization.
When you're talking about stuff like memory-resident images and video
(complete with their intricate and complex codecs), then the only
option is to be passing around a couple of pointers rather than take the
hit of serialization (which is huge for video, for example). I've
gone into more detail in some other posts but I could have missed
something.


Andy
 

Andy O'Meara

Andy O'Meara wrote:


WHAT PARENT PROCESS? "In the same address space", to me, means
"a single process only, not multiple processes, and no parent process
anywhere". If you have just multiple threads, the notion of passing
data from a "child process" back to the "parent process" is
meaningless.

I know... I was just responding to you and others here who keep beating
the "fork" drum. I was just trying to make it clear that a shared address
space is the only way to go. Ok, good, so we're in agreement that
threads are the only way to deal with the "intricate and complex" data
set issue in a performance-centric application.
I understand that this is your mission in this thread. However, why
is that your problem? Why can't you just use the existing (limited)
multiple-interpreters machinery, and solve your problems with that?

Because then we're back into the GIL not permitting threads efficient
core use on CPU bound scripts running on other threads (when they
otherwise could). Just so we're on the same page, "when they
otherwise could" is relevant here because that's the important given:
that each interpreter ("context") truly never has any contact with
the others.

An example would be python scripts that generate video programmatically
using an initial set of params and use an in-house C module to
construct frames (which in turn make and modify python C objects that
wrap intricate codec-related data structures). Suppose you wanted
to render 3 of these at the same time, one on each thread (3
threads). With the GIL in place, these threads can't get anywhere close
to their potential. Your response thus far is that the C module
should release the GIL before it commences its heavy lifting. Well,
the problem is that during its heavy lifting it may need to call back
into its interpreter. It turns out that this isn't an exotic case
at all: there's a *ton* of utility gained by making calls back into
the interpreter. The best example is that since code is more easily
maintained in python than in C, a lot of the module "utility" code is
likely to be in python. Unsurprisingly, this is the situation myself
and many others are in: where we want to subsequently use the
interpreter within the C module (so, as I understand it, the proposal
to have the C module release the GIL unfortunately doesn't work as a
general solution).
And that's indeed the case for Python, too. The app can make as many
subinterpreters as it wants to, and it must not pass objects from one
subinterpreter to another one, nor should it use a single interpreter
from more than one thread (although that is actually supported by
Python - but it surely won't hurt if you restrict yourself to a single
thread per interpreter).

I'm not following you there... I thought we're all in agreement that
the existing C modules are FAR from being reentrant, regularly making
use of static/global objects. The point I had made before is that
other industry-caliber packages specifically don't have restrictions
in *any* way.

I appreciate your argument that a PyC concept is a lot of work requiring
some careful design, but let's not kill the discussion just
because of that. The fact remains that the video encoding scenario
described above is a pretty reasonable situation, and as more people
are commenting in this thread, there's an increasing need to offer
apps more flexibility when it comes to multi-threaded use.


Andy
 

Andy O'Meara

These discussion pop up every year or so and I think that most of them
are not really all that necessary, since the GIL isn't all that bad.

Thing is, if the topic keeps coming up, then that may be an indicator
that change is truly needed. Someone much wiser than me once shared
that a measure of the usefulness and quality of a package (or API) is
how easily it can be added to an application--of any flavor--without
the application needing to change.

So in the rising world of idle cores and worker threads, I do see an
increasing concern over the GIL. Although I recognize that the debate
is lengthy, heated, and has strong arguments on both sides, my reading
on the issue makes me feel like there's a bias for the pro-GIL side
because of the volume of design and coding work associated with
considering various alternatives (such as Glenn's "Py*" concepts).
And I DO respect and appreciate where the pro-GIL people come from:
who the heck wants to do all that work and recoding so that a tiny
percent of developers can benefit? And my best response is that as
unfortunate as it is, python needs to be more multi-threaded app-
friendly if we hope to attract the next generation of app developers
that want to just drop python into their app (and not have to change
their app around python). For example, Lua has that property, as
evidenced by its rapidly growing presence in commercial software
(Blizzard uses it heavily, for example).
Furthermore, there are lots of ways to tune the CPython VM to make
it more or less responsive to thread switches via the various sys.set*()
functions (such as sys.setcheckinterval()).

Most computing or I/O intense C extensions, built-in modules and object
implementations already release the GIL for you, so it usually doesn't
get in the way all that often.


The main issue I take there is that it's often highly useful for C
modules to make subsequent calls back into the interpreter. I suppose
the response to that is to acquire the GIL before reentry, but it just
seems to be more code and responsibility in scenarios where it's not
necessary. Although that code and protocol may come easy to veteran
CPython developers, let's not forget that an important goal is to
attract new developers and companies to the scene, where they get
their thread-independent code up and running using python without any
unexpected reengineering. Again, why are companies choosing Lua over
Python when it comes to an easy and flexible drop-in interpreter? And
please take my points here to be exploratory, and not hostile or
accusatory, in nature.


Andy
 

Andy O'Meara

And I think we still are miscommunicating!  Or maybe communicating anyway!

So when you said "object", I actually don't know whether you meant
Python object or something else.  I assumed Python object, which may not
have been correct... but read on, I think the stuff below clears it up.


Then when you mentioned thousands of objects, I imagined thousands of
Python objects, and somehow transforming the blob into same... and back
again.  

My apologies to you and others here on my use of "objects" -- I use
the term generically and mean it to *not* refer to python objects (for
all the reasons discussed here). Python only makes up a small
part of our app, hence my habit of using "objects" to refer to other
APIs' allocated and opaque objects (including our own and OS APIs). For all
the reasons we've discussed, in our world, python objects don't travel
around outside of our python C modules -- when python objects need to
be passed to other parts of the app, they're converted into their non-
python (portable) equivalents (ints, floats, buffers, etc--but most of
the time, the objects are PyCObjects, so they can enter and leave a
python context with negligible overhead). I venture to say this is
pretty standard when any industry app uses a package (such as python),
for various reasons:
- Portability/Future (e.g. if we decide to drop Python and go
with Lua, the changes are limited to only one region of code).
- Sanity (having any API's objects show up in places "far away"
goes against easy-to-follow code).
- MT flexibility (because we never use static/global
storage, we have all kinds of options when it comes to
multithreading). For example, recall that by throwing python in
multiple dynamic libs, we were able to achieve the GIL-less
interpreter independence that we want (albeit ghetto and a pain).



Andy
 

Rhamphoryncus

Thing is, if the topic keeps coming up, then that may be an indicator
that change is truly needed.  Someone much wiser than me once shared
that a measure of the usefulness and quality of a package (or API) is
how easily it can be added to an application--of any flavor--without
the application needing to change.

So in the rising world of idle cores and worker threads, I do see an
increasing concern over the GIL.  Although I recognize that the debate
is lengthy, heated, and has strong arguments on both sides, my reading
on the issue makes me feel like there's a bias for the pro-GIL side
because of the volume of design and coding work associated with
considering various alternatives (such as Glenn's "Py*" concepts).
And I DO respect and appreciate where the pro-GIL people come from:
who the heck wants to do all that work and recoding so that a tiny
percent of developers can benefit?  And my best response is that as
unfortunate as it is, python needs to be more multi-threaded app-
friendly if we hope to attract the next generation of app developers
that want to just drop python into their app (and not have to change
their app around python).  For example, Lua has that property, as
evidenced by its rapidly growing presence in commercial software
(Blizzard uses it heavily, for example).





The main issue I take there is that it's often highly useful for C
modules to make subsequent calls back into the interpreter. I suppose
the response to that is to acquire the GIL before reentry, but it just
seems to be more code and responsibility in scenarios where it's not
necessary.  Although that code and protocol may come easy to veteran
CPython developers, let's not forget that an important goal is to
attract new developers and companies to the scene, where they get
their thread-independent code up and running using python without any
unexpected reengineering.  Again, why are companies choosing Lua over
Python when it comes to an easy and flexible drop-in interpreter?  And
please take my points here to be exploratory, and not hostile or
accusatory, in nature.

Andy

Okay, here's the bottom line:
* This is not about the GIL. This is about *completely* isolated
interpreters; most of the time when we want to remove the GIL we want
a single interpreter with lots of shared data.
* Your use case, although not common, is not extraordinarily rare
either. It'd be nice to support.
* If CPython had supported it all along we would continue to maintain
it.
* However, since it's not supported today, it's not worth the time
invested, API incompatibility, and general breakage it would imply.
* Although it's far more work than just solving your problem, if I
were to remove the GIL I'd go all the way and allow shared objects.

So there's really only two options here:
* get a short-term bodge that works, like hacking the 3rd party
library to use your shared-memory allocator. Should be far less work
than hacking all of CPython.
* invest yourself in solving the *entire* problem (GIL removal with
shared python objects).
 

Martin v. Löwis

Because then we're back into the GIL not permitting threads efficient
core use on CPU bound scripts running on other threads (when they
otherwise could).

Why do you think so? For C code that is carefully written, the GIL
works *very well* for CPU-bound scripts running on other threads.
(please do get back to Jesse's original remark in case you have lost
the thread :)
An example would be python scripts that generate video programmatically
using an initial set of params and use an in-house C module to
construct frames (which in turn make and modify python C objects that
wrap intricate codec-related data structures). Suppose you wanted
to render 3 of these at the same time, one on each thread (3
threads). With the GIL in place, these threads can't get anywhere close
to their potential. Your response thus far is that the C module
should release the GIL before it commences its heavy lifting. Well,
the problem is that during its heavy lifting it may need to call back
into its interpreter.

So it should reacquire the GIL then. Assuming the other threads
all do their heavy lifting, it should immediately get the GIL,
fetch some data, release the GIL, and continue to do heavy lifting.
If it's truly CPU-bound, I hope it doesn't spend most of its time
in Python API, but in true computation.
It turns out that this isn't an exotic case
at all: there's a *ton* of utility gained by making calls back into
the interpreter. The best example is that since code is more easily
maintained in python than in C, a lot of the module "utility" code is
likely to be in python.

You should really reconsider writing performance-critical code in
Python. Regardless of the issue under discussion, a lot of performance
can be gained by using "flattened" data structures, fewer pointers,
less reference counting, fewer objects, and so on - in the inner loops
of the computation. You didn't reveal what *specific* computation you
perform, so it's difficult to give specific advice.
Unsurprisingly, this is the situation myself
and many others are in: where we want to subsequently use the
interpreter within the C module (so, as I understand it, the proposal
to have the C module release the GIL unfortunately doesn't work as a
general solution).

Not if you do the actual computation in Python, no. However, this
subthread started with Jesse's remark that you *can* release the GIL
in C code.

Again, if you do heavy-lifting in Python, you should consider rewriting
the performance-critical parts in C. You may find that the need for
multiple CPUs even goes away.
I appreciate your argument that a PyC concept is a lot of work requiring
some careful design, but let's not kill the discussion just
because of that.

Any discussion in this newsgroup is futile, except when it either
a) leads to a solution that is already possible, and the OP didn't
envision, or
b) is followed up by code contributions from one of the participants.

If neither is likely to result, killing the discussion is the most
productive thing we can do.

Regards,
Martin
 

Patrick Stinson

Close, I work currently for EastWest :)

Well, I actually like almost everything else about CPython,
considering my audio work the only major problem I've had is with the
GIL. I like the purist community, and I like the code, since
integrating it on both platforms has been relatively clean, and
required *zero* support. Frankly, with the exception of some windows
deployment issues relating to static linking of libpython and some
extensions, it's been a dream lib to use.

Further, I really appreciate the discussions that happen in these
lists, and I think that this particular problem is a wonderful example
of a situation that requires tons of miscellaneous opinions and input
from all angles - especially at this stage. I think that this problem
has lots of standing discussion and lots of potential solutions and/or
workarounds, and it would be cool for someone to aggregate and
paraphrase that stuff into a page to assist those thinking about doing
some patching. That's probably something that the coder would do
themselves though.
 

Patrick Stinson

Another great post, Glenn!! Very well laid-out and posed!! Thanks for
taking the time to lay all that out.


I think you've defined everything perfectly, and you're of
course correct about my love for the PyC model. :^)

Like any software that's meant to be used without restrictions, our
code and frameworks always use a context object pattern so that
there's never any non-const global/shared data. I would go as far as to
say that this is the case with more performance-oriented software than
you may think since it's usually a given for us to have to be parallel
friendly in as many ways as possible. Perhaps Patrick can back me up
there.

And I will.
As to what modules are "essential"... As you point out, once
reentrant module implementations caught on in a PyC or hybrid world, I
think we'd start to see real effort to whip them into compliance--
there's just so much to be gained imho. But to answer the question,
there's the obvious ones (operator, math, etc), string/buffer
processing (string, re), C bridge stuff (struct, array), and OS basics
(time, file system, etc). Nice-to-haves would be buffer and image
decompression (zlib, libpng, etc), crypto modules, and xml. As far as
I can imagine, I have to believe all of these modules already contain
little, if any, global data, so I have to believe they'd be super easy
to make "PyC happy". Patrick, what would you see you guys using?

We don't need anything :) Since our goal is just to use python as a
scripting language/engine to our MIDI application, all we really need
is to make calls to the api that we expose using __builtins__.

You know, the standard python library is pretty siiiiiick, but the
syntax, object model, and import mechanics of python itself is an
**equally exportable function** of the code. Funny that I'm lucky
enough to say:

"Screw the extension modules - I just want the LANGUAGE". But, I can't have it.
 

Paul Boddie

* get a short-term bodge that works, like hacking the 3rd party
library to use your shared-memory allocator.  Should be far less work
than hacking all of CPython.

Did anyone come up with a reason why shared memory couldn't be used
for the purpose described by the inquirer? With the disadvantages of
serialisation circumvented, that would leave issues of contention, and
on such matters I have to say that I'm skeptical about solutions which
try and make concurrent access to CPython objects totally transparent,
mostly because it appears to be quite a lot of work to get right (as
POSH illustrates, and as your own safethread work shows), and also
because systems where contention is spread over a large "surface" (any
object can potentially be accessed by any process at any time) are
likely to incur a lot of trouble for the dubious benefit of being
vague about which objects are actually being shared.

Paul
 

Rhamphoryncus

Did anyone come up with a reason why shared memory couldn't be used
for the purpose described by the inquirer? With the disadvantages of
serialisation circumvented, that would leave issues of contention, and
on such matters I have to say that I'm skeptical about solutions which
try and make concurrent access to CPython objects totally transparent,
mostly because it appears to be quite a lot of work to get right (as
POSH illustrates, and as your own safethread work shows), and also
because systems where contention is spread over a large "surface" (any
object can potentially be accessed by any process at any time) are
likely to incur a lot of trouble for the dubious benefit of being
vague about which objects are actually being shared.

I believe large existing libraries were the reason. Thus my
suggestion of the evil fork+mmap abuse.
 

Patrick Stinson

If you are dealing with "lots" of data like in video or sound editing,
you would just keep the data in shared memory and send the reference
over IPC to the worker process. Otherwise, if you marshal and send you
are looking at a temporary doubling of the memory footprint of your
app because the data will be copied, and marshaling overhead.
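A sketch of that pattern with a file-backed mapping (the writer and reader would normally be separate processes, with only the path travelling over IPC; on Linux, creating the file under /dev/shm keeps it in RAM):

```python
import mmap
import os
import tempfile

# Writer side: put the bulk data in a file-backed mapping. Only the
# file *path* crosses the process boundary, never the data itself.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
with mmap.mmap(fd, 4096) as m:
    m[0:5] = b"hello"

# Reader side (normally another process, after receiving `path`):
rfd = os.open(path, os.O_RDONLY)
with mmap.mmap(rfd, 4096, prot=mmap.PROT_READ) as m2:
    data = bytes(m2[0:5])

os.close(rfd)
os.close(fd)
os.unlink(path)
print(data)                      # b'hello'
```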
 

Glenn Linderman

If you are dealing with "lots" of data like in video or sound editing,
you would just keep the data in shared memory and send the reference
over IPC to the worker process. Otherwise, if you marshal and send you
are looking at a temporary doubling of the memory footprint of your
app because the data will be copied, and marshaling overhead.

Right. Sounds, and is, easy, if the data is all directly allocated by
the application. But when pieces are allocated by 3rd party libraries,
that use the C-runtime allocator directly, then it becomes more
difficult to keep everything in shared memory.

One _could_ replace the C-runtime allocator, I suppose, but that could
have some adverse effects on other code, that doesn't need its data to
be in shared memory. So it is somewhat between a rock and a hard place.

By avoiding shared memory, such problems are sidestepped... until you
run smack into the GIL.
 

Andy O'Meara

Okay, here's the bottom line:
* This is not about the GIL.  This is about *completely* isolated
interpreters; most of the time when we want to remove the GIL we want
a single interpreter with lots of shared data.
* Your use case, although not common, is not extraordinarily rare
either.  It'd be nice to support.
* If CPython had supported it all along we would continue to maintain
it.
* However, since it's not supported today, it's not worth the time
invested, API incompatibility, and general breakage it would imply.
* Although it's far more work than just solving your problem, if I
were to remove the GIL I'd go all the way and allow shared objects.

Great recap (although saying "it's not about the GIL" may cause some
people to lose track of the root issues here, but your following comment
on GIL removal shows that we're on the same page).
So there's really only two options here:
* get a short-term bodge that works, like hacking the 3rd party
library to use your shared-memory allocator.  Should be far less work
than hacking all of CPython.

The problem there is that we're not talking about a single 3rd party
API/allocator--there are many, including the OS, which has its own
internal allocators. My video encoding example is meant to illustrate
a point, but the real-world use case is where there's allocators all
over the place from all kinds of APIs, and when you want your C module
to reenter the interpreter often to execute python helper code.
* invest yourself in solving the *entire* problem (GIL removal with
shared python objects).

Well, as I mentioned, I do represent a company willing and able to
expend real resources here. However, as you pointed out, there's some
serious work at hand here (sadly--it didn't have to be this way) and
there seems to be some really polarized people here that don't seem as
interested as I am to make python more attractive for app developers
shopping for an interpreter to embed.

From our point of view, there are two other options, which unfortunately
seem to be the only way out the more we uncover with this
discussion:

3) Start a new python implementation, let's call it "CPythonES", that
specifically targets performance apps and uses an explicit object/
context concept to permit the free threading under discussion here.
The idea would be to just implement the core language, feature set,
and a handful of modules. I refer you to that list I made earlier of
"essential" modules.

4) Drop python, switch to Lua.

The interesting thing about (3) is that it'd be in the same spirit as
how OpenGL ES came to be (except in place of the need for free
threading was the fact the standard OpenGL API was too overgrown and
painful for the embedded scale).

We're currently doing our own in-house version of (3), but we unfortunately
have other priorities at the moment that would otherwise slow this
down. Given the direction of many-core machines these days, option
(3) or (4), for us, isn't a question of *if*, it's a question of
*when*. So that's basically where we're at right now.

As to my earlier point about representing a company ready to spend
real resources, please email me off-list if anyone here would have an
interest in an open "CPythonES" project (and get full compensation).
I can say for sure that we'd be able to lead with API framework design
work--that's my personal strength and we have a lot of real world
experience there.

Andy
 

Jesse Noller

Right. It sounds, and is, easy if the data is all directly allocated by
the application. But when pieces are allocated by 3rd party libraries
that use the C-runtime allocator directly, it becomes more difficult to
keep everything in shared memory.

One _could_ replace the C-runtime allocator, I suppose, but that could have
some adverse effects on other code, that doesn't need its data to be in
shared memory. So it is somewhat between a rock and a hard place.

By avoiding shared memory, such problems are sidestepped... until you run
smack into the GIL.

If you do not have shared memory: you don't need threads, ergo you
don't get penalized by the GIL. Threads are only useful when you need
large in-memory data structures to be shared and modified by a pool of
workers.
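Jesse's point can be illustrated with a toy sketch (names and numbers invented for illustration, not anyone's production code): a pool of threads mutating one large in-memory structure in place, with no copying or serialization, which is exactly the pattern that forces you into threads:

```python
import threading

# One large shared structure -- no copies, no serialization, no IPC.
frame = [0] * 1_000_000

def brighten(start, stop, delta):
    """Each worker mutates its own slice of the shared buffer in place."""
    for i in range(start, stop):
        frame[i] += delta  # safe without a lock: slices are disjoint

workers = [
    threading.Thread(target=brighten, args=(i * 250_000, (i + 1) * 250_000, 5))
    for i in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Of course, pure-Python inner loops like this are serialized by the GIL; the pattern only pays off when the per-slice work happens in C code that releases the GIL, which is the whole point of contention in this thread.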

-jesse
 

Andy O'Meara

Why do you think so? For C code that is carefully written, the GIL
allows *very well* for writing CPU-bound scripts running on other threads.
(please do get back to Jesse's original remark in case you have lost
the thread :)

I don't follow you there. If you're referring to multiprocessing, our
concerns are:

- Maturity (am I willing to tell my partners and employees that I'm
betting our future on a brand-new module that imposes significant
restrictions as to how our app operates?)
- Liability (am I ready to invest our resources into lots of new
python module-specific code to find out that a platform that we want
to target isn't supported or has problems?). Like it or not, we're a
company and we have to show sensitivity about new or fringe packages
that make our codebase less agile -- C/C++ continues to win the day in
that department.
- Shared memory -- for the reasons listed in my other posts, IPC or a
shared/mapped memory region doesn't work for our situation (and I
venture to say, for many real-world situations; otherwise you'd see end-
user/common apps use forking more often than threading).

You should really reconsider writing performance-critical code in
Python.

I don't follow you there... Performance-critical code in Python??
Suppose you're doing pixel-level filters on images or video, or
Patrick needs to apply a DSP to some audio... Our app's performance
would *tank*, in a MAJOR way (that, and/or background tasks would take
100x+ longer to do their work).
Regardless of the issue under discussion, a lot of performance
can be gained by using "flattened" data structures, fewer pointers,
less reference counting, fewer objects, and so on - in the inner loops
of the computation. You didn't reveal what *specific* computation you
perform, so it's difficult to give specific advice.

I tried to list some abbreviated examples in other posts, but here's
some elaboration:

- Pixel-level effects and filters, where some filters may use C procs
while others may call back into the interpreter to execute logic --
while some do both, multiple times.
- Image and video analysis/recognition where there's TONS of intricate
data structures and logic. Those data structures and logic are
easiest to develop and maintain in python, but you'll often want to
call back to C procs which will, in turn, want to access Python (as
well as C-level) data structures.

The common pattern here is where there's a serious mix of C and python
code and data structures, BUT it can all be done with a free-thread
mentality since the finish point is unambiguous and distinct -- where
all the "results" are handed back to the "main" app in a black and
white handoff. It's *really* important for an app to freely make
calls into its interpreter (or the interpreter's data structures)
without having to perform lock/unlocking because that affords an app a
*lot* of options and design paths. It's just not practical to be
locking and unlocking the GIL when you want to operate on python data
structures or call back into python.

You seem to have placed the burden of proof on my shoulders for an app
to deserve the ability to free-thread when using 3rd party packages,
so how about we just agree it's not an unreasonable desire for a
package (such as python) to support it and move on with the
discussion.
Again, if you do heavy lifting in Python, you should consider rewriting
the performance-critical parts in C. You may find that the need for
multiple CPUs even goes away.

Well, the entire premise we're operating under here is that we're
dealing with "embarrassingly easy" parallelization scenarios, so when
you suggest that the need for multiple CPUs may go away, I'm worried
that you're not keeping the big picture in mind.
Any discussion in this newsgroup is futile, except when it either
a) leads to a solution that is already possible, and the OP didn't
envision, or
b) is followed up by code contributions from one of the participants.

If neither is likely to result, killing the discussion is the most
productive thing we can do.

Well, most others here seem to have a very different definition of what
qualifies as a "futile" discussion, so how about you allow the rest of
us to continue to discuss these issues and possible solutions. And, for
the record, I've said multiple times I'm ready to contribute
monetarily, professionally, and personally, so if that doesn't qualify
as the precursor to "code contributions from one of the participants"
then I don't know WHAT does.


Andy
 

Jesse Noller

I don't follow you there. If you're referring to multiprocessing, our
concerns are:

- Maturity (am I willing to tell my partners and employees that I'm
betting our future on a brand-new module that imposes significant
restrictions as to how our app operates?)
- Liability (am I ready to invest our resources into lots of new
python module-specific code to find out that a platform that we want
to target isn't supported or has problems?). Like it or not, we're a
company and we have to show sensitivity about new or fringe packages
that make our codebase less agile -- C/C++ continues to win the day in
that department.
- Shared memory -- for the reasons listed in my other posts, IPC or a
shared/mapped memory region doesn't work for our situation (and I
venture to say, for many real world situations otherwise you'd see end-
user/common apps use forking more often than threading).

FWIW (and again, I am not saying MP is good for your problem domain) -
multiprocessing works on Windows, OS X, Linux and Solaris quite well.
The only platforms it has problems on right now are *BSD and AIX. It has
plenty of tests (I want more, more, more) and has a decent amount of
usage, if my mail box and bug list are any indication.

Multiprocessing is not *new* - it's a branch of the pyprocessing package.

Multiprocessing is written in C, so as for the "less agile" - I don't
see how it's any less agile than what you've talked about. If you
wanted true platform independence, then Java is a better bet :) As
for your final point:
- Shared memory -- for the reasons listed in my other posts, IPC or a
shared/mapped memory region doesn't work for our situation (and I
venture to say, for many real world situations otherwise you'd see end-
user/common apps use forking more often than threading).

I philosophically disagree with you here. Pthreads and shared memory,
as they exist today, are largely based on Java's influence on the world.
I would argue that the reason most people use threads as opposed to
processes is simply based on "ease of use and entry" (which is ironic,
given how many problems they cause). Not because they *need* the shared
memory aspects, or because they could not decompose the problem
into Actors/message passing, but because threads:

A> are there (e.g. in Java, Python, etc)
B> allow you to "share anything" (which allows you to take horrible shortcuts)
C> are what everyone "knows" at this point.

Even luminaries such as Brian Goetz and many, many others have pointed
out that threading, as it exists today, is fundamentally difficult to
get right. Ergo the "renaissance" (read: echo chamber) towards
Erlang-style concurrency.

For many "real world" applications - threading is just "simple". This
is why Multiprocessing exists at all - to attempt to make forking/IPC
as "simple" as the API to threading. It's not foolproof, but the goal
was to open the door to multiple cores with a familiar API:

Quoting PEP 371:

"The pyprocessing package offers a method to side-step the GIL
allowing applications within CPython to take advantage of
multi-core architectures without asking users to completely change
their programming paradigm (i.e.: dropping threaded programming
for another "concurrent" approach - Twisted, Actors, etc).

The Processing package offers CPython a "known API" which mirrors
albeit in a PEP 8 compliant manner, that of the threading API,
with known semantics and easy scalability."
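That "known API" is easy to see in a minimal sketch (a hypothetical example, not code from the thread): multiprocessing side-steps the GIL with worker processes while keeping a threading-like, map-style feel.

```python
import multiprocessing

def square(x):
    # CPU-bound work runs in a separate process, outside the parent's GIL
    return x * x

def parallel_squares(values, workers=2):
    # Pool mirrors the familiar map() semantics across processes
    with multiprocessing.Pool(workers) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The catch, as discussed above, is that arguments and results cross the process boundary by pickling, which is exactly the serialization cost Andy's use case can't absorb.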

I would argue that most of the people taking part in this discussion
are working on "real world" applications - sure, multiprocessing as it
exists today, right now - may not support your use case, but it was
evaluated to fit *many* use cases.

Most of the people here are working in pure Python, or they're using a
few extension modules here and there (in C). Again, when you say
threads and processes, most people here are going to think "import
threading", "fork()" or "import multiprocessing".

Please correct me if I am wrong in understanding what you want: You
are making threads in another language (not via the threading API),
embed python in those threads, but you want to be able to share
objects/state between those threads, and independent interpreters. You
want to be able to pass state from one interpreter to another via
shared memory (e.g. pointers/contexts/etc).

Example:

ParentAppFoo makes 10 threads (in C)
Each thread gets an itty bitty python interpreter
ParentAppFoo gets an object (video) to render
Rather than marshal that object, you pass a pointer to the object to
the children
You want to pass that pointer to an existing, or newly created itty
bitty python interpreter for mangling
Itty bitty python interpreter passes the object back to a C module via
a pointer/context

If the above is wrong, I think outlining it in the above form
may help people conceptualize it - I really don't think you're talking
about python-level processes or threads.
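That outline can be mocked up in plain Python (a toy model with invented names; the real itty-bitty interpreters would live in C behind Py_NewInterpreter): the parent hands a worker a *handle* to the video object rather than a marshaled copy, and the worker "mangles" the shared object in place.

```python
import queue
import threading

# Registry standing in for C pointers/contexts: a handle is just a key.
objects = {1: {"kind": "video", "frames": 240, "rendered": False}}
work = queue.Queue()

def itty_bitty_worker():
    # Stands in for an embedded interpreter receiving a pointer, not a copy.
    handle = work.get()
    video = objects[handle]      # dereference the "pointer"
    video["rendered"] = True     # mangle the shared object in place
    work.task_done()

t = threading.Thread(target=itty_bitty_worker)
t.start()
work.put(1)                      # pass the handle, never the object
work.join()
t.join()
```

The point of the model is what crosses the queue: a tiny handle, never the (potentially enormous) structure behind it.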

-jesse
 

VanL

Jesse said:
Even luminaries such as Brian Goetz and many, many others have pointed
out that threading, as it exists today is fundamentally difficult to
get right. Ergo the "renaissance" (read: echo chamber) towards
Erlang-style concurrency.

I think this is slightly missing what Andy is saying. Andy is trying for
something that would look much more like Erlang-style concurrency than
classic threads - "green processes" to use someone else's term.

AFAIK, Erlang "processes" aren't really processes at the OS level.
Instead, they are called processes because they only communicate through
message passing. When multiple "processes" are running in the same
os-level multi-threaded interpreter, the interpreter cheats to make the
message passing fast.

I think Andy is thinking along the same lines. With a Python
subinterpreter per thread, he is suggesting intra-process message
passing as a way to get concurrency.

It's actually not too far from what he is doing already, but he is
fighting OS-level shared-library semantics to do it. Instead, if Python
supported a per-subinterpreter GIL and per-subinterpreter state, then
you could theoretically get to a good place:

- You only initialize subinterpreters if you need them, so
single-process Python doesn't pay a large (any?) penalty
- Intra-process message passing can be fast, but still has the
no-shared-state benefits of the Erlang concurrency model
- There are fewer changes to the Python core, because the GIL doesn't go
away

No, this isn't whole-hog free threading (or safe threading); there are
restrictions that go along with this model - but there would be benefits.
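The "green process" shape is easy to sketch in today's Python (a toy model, with threads standing in for per-subinterpreter workers): each worker owns all of its state and communicates only through message passing, Erlang-style.

```python
import queue
import threading

def green_process(inbox, outbox):
    # Owns all of its state; the only sharing is via messages.
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel message: shut down
            outbox.put(total)    # hand the result back as a message too
            return
        total += msg

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=green_process, args=(inbox, outbox))
worker.start()
for n in (1, 2, 3):
    inbox.put(n)
inbox.put(None)
worker.join()
result = outbox.get()
print(result)  # 6
```

With a per-subinterpreter GIL, the same shape would actually run in parallel; here the queues demonstrate only the no-shared-state discipline, not the speedup.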
 
