2.6, 3.0, and truly independent interpreters


Martin v. Löwis

If Py_None corresponds to None in Python syntax (sorry I'm not familiar
with Python internals yet; glad you are commenting, since you are), then
it is a fixed constant and could be left global, probably.

If None remains global, then type(None) also remains global, and
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.
But if we
want a separate None for each interpreter, or if we just use Py_None as
an example global variable with which to answer the question, then here goes:

There are a number of problems with that approach. The biggest one is
that it is theoretical. Of course I'm aware of thread-local variables,
and the abstract possibility of collecting all global variables in
a single data structure (in fact, there is already an interpreter
structure and per-interpreter state in Python). I wasn't claiming that
it was impossible to solve that problem - just that it is not simple.
If you want to find out what all the problems are, please try
implementing it for real.

Regards,
Martin
 

M.-A. Lemburg

These discussions pop up every year or so, and I think that most of them
are not really all that necessary, since the GIL isn't all that bad.

Some pointers into the past:

* http://effbot.org/pyfaq/can-t-we-get-rid-of-the-global-interpreter-lock.htm
Fredrik on the GIL

* http://mail.python.org/pipermail/python-dev/2000-April/003605.html
Greg Stein's proposal to move forward on free threading

* http://www.sauria.com/~twl/conferences/pycon2005/20050325/Python at Google.notes
(scroll down to the Q&A section)
Greg Stein on whether the GIL really does matter that much

Furthermore, there are lots of ways to tune the CPython VM to make
it more or less responsive to thread switches via the various sys.set*()
functions.
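For example (in CPython 2.x the relevant knob was sys.setcheckinterval(),
counted in bytecodes; current CPython replaced it with the time-based
sys.setswitchinterval() shown here):

```python
import sys

# In CPython 2.x, sys.setcheckinterval(n) set how many bytecodes ran
# between GIL release points. Modern CPython uses a time interval:
old = sys.getswitchinterval()
sys.setswitchinterval(0.0005)   # check for thread switches more often
print(sys.getswitchinterval())  # 0.0005
sys.setswitchinterval(old)      # restore the default (0.005 s)
```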

Most compute- or I/O-intensive C extensions, built-in modules and object
implementations already release the GIL for you, so it usually doesn't
get in the way all that often.

So you have the option of using a single process with multiple
threads, allowing efficient sharing of data. Or you use multiple
processes and OS mechanisms to share data (shared memory, memory
mapped files, message passing, pipes, shared file descriptors, etc.).
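As a minimal sketch of the message-passing end of that spectrum (an OS
pipe, which works just as well across a fork() boundary):

```python
import os

# The simplest OS message-passing primitive: a pipe. The write end
# could just as well live in a forked child process.
r, w = os.pipe()
os.write(w, b"result:42")
os.close(w)
msg = os.read(r, 64)
os.close(r)
print(msg)  # b'result:42'
```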

Both have their pros and cons.

There's no general answer to the
problem of how to make best use of multi-core processors, multiple
linked processors or any of the more advanced parallel processing
mechanisms (http://en.wikipedia.org/wiki/Parallel_computing).
The answers will always have to be application specific.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 25 2008)



eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
 

Terry Reedy

Yes, CPython 2.x, x<=5 did.
I can buy that, but when Python is not qualified, CPython should be
assumed, as it predominates.

People do that, and it sometimes leads to unnecessary confusion. As to
the present discussion, is it about
* changing Python, the language
* changing all Python implementations
* changing CPython, the leading implementation
* branching CPython with a compiler switch, much as there was one for
including Unicode or not.
* forking CPython
* modifying an existing module
* adding a new module
* making better use of the existing facilities
* some combination of the above
Of course, the latest official release
should probably also be assumed, but that is so recent, few have likely
upgraded as yet... I should have qualified the statement.

People do that, and it sometimes leads to unnecessary confusion. People
routinely post version-specific problems and questions without
specifying the version (or the platform, when relevant). In a month or
so, there will be *2* latest official releases. There will be more
confusion without qualification.

* Is the target of this discussion 2.7 or 3.1 (some changes would be 3.1
only).

[diversion to the side topic]
Absolutely. But after the first iteration, there is only one reference
to string.

Which is to say, 'string' is the only reference to the object it refers
to. You are right, so I presume that the optimization described would
then kick in. But I have not read the code, and CPython optimizations
are not part of the *language* reference.

[back to the main topic]

There is some discussion/debate/confusion about how much of the stdlib
is 'standard Python library' versus 'standard CPython library'. [And
there is some feeling that standard Python modules should have a default
Python implementation that any implementation can use until it
optionally replaces it with a faster compiled version.] Hence my
question about the target of this discussion and the first three options
listed above.

Terry Jan Reedy
 

Philip Semanchuk

If the posh module had a bit of TLC, it would be extremely useful for
this, since it does (surprisingly) still work with Python 2.5, but does
need a bit of TLC to make it usable.

http://poshmodule.sourceforge.net/

Last time I checked that was Windows-only. Has that changed?

The only IPC modules for Unix that I'm aware of are one which I
adopted (for System V semaphores & shared memory) and one which I
wrote (for POSIX semaphores & shared memory).

http://NikitaTheSpider.com/python/shm/
http://semanchuk.com/philip/posix_ipc/


If anyone wants to wrap POSH cleverness around them, go for it! If
not, maybe I'll make the time someday.

Cheers
Philip
 

Glenn Linderman

No, it couldn't, because it's a reference-counted object
like any other Python object, and therefore needs to be
protected against simultaneous refcount manipulation by
different threads. So each interpreter would need its own
instance of Py_None.

The same goes for all the other built-in constants and
type objects -- there are dozens of these.

Fine. The code fragment shown didn't provide enough information to be
sure. There is a "simplicity" benefit to having built-in constants be
dynamically created initially, to avoid special-casing throughout the
code. So that's why I said "probably", I'm sure you know more about it
than I do.
Which sounds like it could be a rather high cost! If
(just a wild guess) each function has an average of 2
parameters, then this is increasing the amount of
argument pushing going on by 50%...

Some actual statistics would be better than wild guesses, of course. If
the average number of parameters is 6, then this would only be a 17%
increase. And it only affects interpreter functions, not system
functions, which would reduce the overall percentage in either case.
And if there are any inline functions, their cost would not increase at
all, again reducing the overall percentage. And depending on the size
of the function, parameter pushing may be an insignificant amount of the
overall cost of calling the function.

So your 50% number is just a scare tactic, it would seem, based on wild
guesses. Was there really any benefit to the comment?
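In the spirit of statistics over wild guesses, here is one crude way to
put a number on it from pure Python (a hypothetical micro-benchmark; the
C-level calling-convention cost being debated is a different thing, so
treat this only as a sanity check on how argument count scales call
overhead):

```python
import timeit

# Time an extra argument's worth of call overhead at the Python level.
setup2 = "def f(a, b): return 0"
setup3 = "def f(a, b, c): return 0"
t2 = timeit.timeit("f(1, 2)", setup=setup2, number=200_000)
t3 = timeit.timeit("f(1, 2, 3)", setup=setup3, number=200_000)
print(f"extra-arg overhead ratio: {t3 / t2:.2f}")
```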
That's another possibility, although doing it that
way would require you to have a separate thread for
each interpreter, which you mightn't always want.

I suppose there could be uses for multiple interpreters without
requiring a thread for them. The primary focus of the discussion, and
PyC threads in particular, is to allow interpreters to run in parallel,
which certainly requires them to be in separate threads.

It would be possible to implement multiple PyA interpreters in a single
thread, by swapping out the content of the interpreter's context within
the TLS area for that thread, in approximately the same manner as they
are presently implemented.
 

Glenn Linderman

If Py_None corresponds to None in Python syntax (sorry I'm not familiar
with Python internals yet; glad you are commenting, since you are), then
it is a fixed constant and could be left global, probably.

If None remains global, then type(None) also remains global, and
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.

You know a lot more about it than I do. Greg thinks each PyC
interpreter should have its own Py_None, so it sounds like you do too,
perhaps. I certainly don't grok the implications of what you say above,
as I barely grok the semantics of it.
There are a number of problems with that approach. The biggest one is
that it is theoretical.

Not theoretical. Used successfully in Perl. Granted Perl is quite a
different language than Python, but then there are some basic
similarities in the concepts.

Perhaps you should list the problems, instead of vaguely claiming that
there are a number of them. Hard to respond to such a vague claim.
Of course I'm aware of thread-local variables,
and the abstract possibility of collecting all global variables in
a single data structure (in fact, there is already an interpreter
structure and per-interpreter state in Python). I wasn't claiming that
it was impossible to solve that problem - just that it is not simple.
If you want to find out what all the problems are, please try
implementing it for real.

I'm certainly not yet at the point of Python expertise where I could
attempt this, having not yet looked at the internals at all; still
learning the language. But the approach is sound; nearly any monolithic
program can be turned into a multithreaded program containing one
monolith per thread using such a technique.
 

Martin v. Löwis

There are a number of problems with that approach. The biggest one is
that it is theoretical.

Not theoretical. Used successfully in Perl.

Perhaps it is indeed what Perl does, I know nothing about that.
However, it *is* theoretical for Python. Please trust me that
there are many many many many pitfalls in it, each needing a
separate solution, most likely with no equivalent in Perl.

If you had a working patch, *then* it would be practical.
Granted Perl is quite a
different language than Python, but then there are some basic
similarities in the concepts.

Yes - just as much as both are implemented in C :-(
Perhaps you should list the problems, instead of vaguely claiming that
there are a number of them. Hard to respond to such a vague claim.

As I said: go implement it, and you will find out. Unless you are
really going at an implementation, I don't want to spend my time
explaining it to you.
But the approach is sound; nearly any monolithic
program can be turned into a multithreaded program containing one
monolith per thread using such a technique.

I'm not debating that. I just claim that it is far from simple.

Regards,
Martin
 

Andy O'Meara

Again, wrong.


I don't understand how this example involves multiple threads. You
mention a single thread (running the module), and you mention designing
a module. Where is the second thread?

Glenn seems to be following me here... The point is to have as many
threads as the app wants, each in its own world, running without
restriction (performance-wise). Maybe the app wants to run a thread
for each extra core on the machine.

Perhaps the disconnect here is that when I've been saying "start a
thread", I mean the app starts an OS thread (e.g. a pthread), with the
understanding that any contact with other threads is managed at the app
level (as opposed to starting threads through Python). So, as far as
Python knows, there's zero mention or use of threading in any way,
*anywhere*.

That's not true.

Um... So let's say you have an opaque object ref from the OS that
represents hundreds of megs of data (e.g. memory-resident video). How
do you get that back to the parent process without serialization and
IPC? What should really happen is to just use the same address space, so
that only a pointer changes hands. THAT's why I'm saying that a separate
address space is generally a deal breaker when you have large or
intricate data sets (ie. when performance matters).

Andy
 

Andy O'Meara

It seems to me that the very simplest move would be to remove global
static data so the app could provide all thread-related data, which
Andy suggests through references to the QuickTime API. This would
suggest compiling python without thread support so as to leave it up
to the application.

I'm not sure whether you realize that this is not simple at all.
Consider this fragment

    if (string == Py_None || index >= state->lastmark ||
            !state->mark[index] || !state->mark[index+1]) {
        if (empty)
            /* want empty string */
            i = j = 0;
        else {
            Py_INCREF(Py_None);
            return Py_None;
        }
    }


The way to think about it is that, ideally in PyC, there are never any
global variables. Instead, all "globals" are now part of a context
(ie. an interpreter) and it would presumably be illegal to ever use
them in a different context. I'd say this is already the expectation
and convention for any modern, industry-grade software package
marketed as an extension for apps. Industry app developers just want to
drop in a 3rd party package, make as many contexts as they want (in as
many threads as they want), and expect to use each context without
restriction (since they're ensuring contexts never interact with each
other). For example, if I use zlib, libpng, or libjpg, I can make as
many contexts as I want and put them in whatever threads I want. In
the app, the only things I'm on the hook for are to: (a) never use
objects from one context in another context, and (b) ensure that I
never make any calls into a module from more than one thread at the
same time. Both of these requirements are trivial to follow in the
"embarrassingly easy" parallelization scenarios, and that's why I
started this thread in the first place. :^)

Andy
 

Andy O'Meara

... and I would be surprised at someone that would embed hundreds of
megs of data into an object such that it had to be serialized... seems
like the proper design is to point at the data, or a subset of it, in a
big buffer.  Then data transfers would just transfer the offset/length
and the reference to the buffer.


... and this is another surprise!  You have thousands of objects (data
structure instances) to move from one thread to another?

Heh, no, we're actually in agreement here. I'm saying that in the
case where the data sets are large and/or intricate, a single top-
level pointer changing hands is *always* the way to go rather than
serialization. For example, suppose you had some nifty python code
and C procs that were doing lots of image analysis, outputting tons of
intricate and rich data structures. Once the thread is done with that
job, all that output is trivially transferred back to the appropriate
thread by a pointer changing hands.
Of course, I know that data get large, but typical multimedia streams
are large, binary blobs.  I was under the impression that processing
them usually proceeds along the lines of keeping offsets into the blobs,
and interpreting, etc.  Editing is usually done by making a copy of a
blob, transforming it or a subset in some manner during the copy
process, resulting in a new, possibly different-sized blob.

No, you're definitely right-on, with the additional point that the
representation of multimedia usually employs intricate and diverse
data structures (imagine the data structure representation of a movie
encoded in a modern codec, such as H.264, complete with paths, regions,
pixel flow, geometry, transformations, and textures). As we both
agree, that's something that you *definitely* want to move around via
a single pointer (and not in a serialized form). Hence, my position
that apps that use Python can't be forced to go through IPC, or else:
(a) there's a performance/resource waste to serialize and unserialize
large or intricate data sets, and (b) they're required to write and
maintain serialization code that otherwise doesn't serve any other
purpose.

Andy
 

Andy O'Meara

Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance.


Good points--I would agree with you on all counts there. On the
"passing a context everywhere" performance hit, perhaps one idea is
that all objects could have an additional field that would point back
to their parent context (ie. their interpreter). So the only
prototypes that would have to be modified to contain the context ptr
would be the ones that inherently don't take any objects. This would
conveniently and generally correspond to procs associated with
interpreter control (e.g. importing modules, shutting down modules,
etc).

I hope you realize that starting up one of these interpreters
is going to be fairly expensive.

Absolutely. I had just left that issue out in an effort to keep the
discussion pointed, but it's a great point to raise. My response is
that, like any 3rd party industry package, I'd say this is the
expectation (that context startup and shutdown is non-trivial and
should be minimized for performance reasons). For simplicity, my
examples didn't talk about this issue, but in practice it'd be typical
for apps to have their "worker" interpreters persist as they chew
through jobs.


Andy
 

Rhamphoryncus

Type objects contain dicts, which allow arbitrary values
to be stored in them. What happens if one thread puts
a private object in there? It becomes visible to other
threads using the same type object. If it's not safe
for sharing, bad things happen.

Python's data model is not conducive to making a clear
distinction between "private" and "shared" objects,
except at the level of an entire interpreter.

Shareable type objects (enabled by a __future__ import) use a
shareddict, which requires all keys and values to themselves be
shareable objects.

Although it's a significant semantic change, in many cases it's easy
to deal with: replace mutable (unshareable) global constants with
immutable ones (ie list -> tuple, set -> frozenset). If you've got
some global state you move it into a monitor (which doesn't scale, but
that's your design). The only time this really fails is when you're
deliberately storing arbitrary mutable objects from any thread, and
later inspecting them from any other thread (such as our new ABC
system's cache). If you want to store an object, but only to give it
back to the original thread, I've got a way to do that.
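The list -> tuple substitution he describes looks like this in practice
(the Monitor class below is a hypothetical sketch of the monitor idea,
not an API from the proposal):

```python
import threading

# Mutable module-level "constants" become immutable, shareable ones:
SUPPORTED_CODECS = ("h264", "vp8", "theora")   # was: a list
RESERVED_NAMES = frozenset({"None", "True"})   # was: a set

class Monitor:
    """Hypothetical monitor: serializes access to leftover mutable state."""
    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def set(self, key, value):
        with self._lock:
            self._state[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._state.get(key, default)

m = Monitor()
m.set("jobs_done", 3)
print(m.get("jobs_done"))  # 3
```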
 

Greg Ewing

Glenn said:
So your 50% number is just a scare tactic, it would seem, based on wild
guesses. Was there really any benefit to the comment?

All I was really trying to say is that it would be a
mistake to assume that the overhead will be negligible,
as that would be just as much a wild guess as 50%.
 

greg

Glenn said:
If None remains global, then type(None) also remains global, and
type(None).__bases__[0]. Then type(None).__bases__[0].__subclasses__()
will yield "interesting" results. This is essentially the status quo.

I certainly don't grok the implications of what you say above,
as I barely grok the semantics of it.

Not only is there a link from a class to its base classes, there
is a link to all its subclasses as well.

Since every class is ultimately a subclass of 'object', this means
that starting from *any* object, you can work your way up the
__bases__ chain until you get to 'object', then walk the subclass
hierarchy and find every class in the system.

This means that if any object at all is shared, then all class
objects, and any object reachable from them, are shared as well.
 

Martin v. Löwis

As far as I can tell, it seems
Um... So let's say you have an opaque object ref from the OS that
represents hundreds of megs of data (e.g. memory-resident video). How
do you get that back to the parent process without serialization and
IPC?

What parent process? I thought you were talking about multi-threading?
What should really happen is just use the same address space so
just a pointer changes hands. THAT's why I'm saying that a separate
address space is generally a deal breaker when you have large or
intricate data sets (ie. when performance matters).

Right. So use a single address space, multiple threads, and perform the
heavy computations in C code. I don't see how Python is in the way at
all. Many people do that, and it works just fine. That's what
Jesse (probably) meant with his remark
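For example (relying on the fact that CPython's zlib module releases
the GIL while compressing a large buffer, so the OS threads below can
genuinely overlap on a multi-core machine):

```python
import threading
import zlib

# zlib.compress drops the GIL while deflating, so these threads can
# run the heavy C work in parallel despite the GIL.
data = b"frame" * (1024 * 1024)
results = [None] * 4

def work(i):
    results[i] = len(zlib.compress(data))

threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(isinstance(r, int) and r > 0 for r in results))  # True
```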

Please reconsider this; it might be a solution to your problem.

Regards,
Martin
 

Andy O'Meara

Grrr... I posted a ton of lengthy replies to you and other recent
posts here using Google and none of them made it, argh. Poof. There's
nothing that fires me up more than lost work, so I'll have to revert to
short and simple answers for the time being. Argh, damn.


Moreover, I think this is probably the *only* way that
totally independent interpreters could be realized.

Converting the whole C API to use this strategy would be
a very big project. Also, on the face of it, it seems like
it would render all existing C extension code obsolete,
although it might be possible to do something clever with
macros to create a compatibility layer.

Another thing to consider is that passing all these extra
pointers around everywhere is bound to have some effect
on performance.


I'm with you on all counts, so no disagreement there. On the "passing
a ptr everywhere" issue, perhaps one idea is that all objects could
have an additional field that would point back to their parent context
(ie. their interpreter). So the only prototypes that would have to be
modified to contain the context ptr would be the ones that don't
inherently operate on objects (e.g. importing a module).


I hope you realize that starting up one of these interpreters
is going to be fairly expensive. It will have to create its
own versions of all the builtin constants and type objects,
and import its own copy of all the modules it uses.

Yeah, for sure. And I'd say that's a pretty well established
convention already out there for any industry package. The pattern
I'd expect to see is where the app starts worker threads, starts
interpreters in one or more of each, and throws jobs to different ones
(and the interpreter would persist to move on to subsequent jobs).
One wonders if it wouldn't be cheaper just to fork the
process. Shared memory can be used to transfer large lumps
of data if needed.

As I mentioned, when you're talking about intricate data structures, OS
opaque objects (ie. that have their own internal allocators), or huge
data sets, even a shared memory region unfortunately can't fit the
bill.


Andy
 

Andy O'Meara



Let's take a step back and remind ourselves of the big picture. The
goal is to have independent interpreters running in pthreads that the
app starts and controls. Each interpreter never at any point is doing
any thread-related stuff in any way. For example, each script job
just does meat-and-potatoes CPU work, using callbacks that, say,
programmatically use OS APIs to edit and transform frame data.

So I think the disconnect here is that maybe you're envisioning
threads being created *in* Python. To be clear, we're talking about
making threads at the app level and making it a given for the app to
take its safety into its own hands.


That's not true.

Well, when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job. Otherwise, please describe in detail how
I'd get an opaque OS object (e.g. an OS ref that refers to memory-
resident video) from the child process back to the parent process.

Again, the big picture that I'm trying to plant here is that there
really is a serious need for truly independent interpreters/contexts
in a shared address space. Consider stuff like libpng, zlib, libjpg,
or whatever; the use pattern is always the same: make a context
object, do your work in the context, and take it down. For most
industry-caliber packages, the expectation and convention (unless
documented otherwise) is that the app can make as many contexts as it
wants in whatever threads it wants, because the convention is that the
app must (a) never use one context's objects in another context,
and (b) never use a context at the same time from more than one
thread. That's all I'm really trying to look at here.


Andy
 

Andy O'Meara

... and I would be surprised at someone that would embed hundreds of
megs of data into an object such that it had to be serialized... seems
like the proper design is to point at the data, or a subset of it, in a
big buffer.  Then data transfers would just transfer the offset/length
and the reference to the buffer.


... and this is another surprise!  You have thousands of objects (data
structure instances) to move from one thread to another?

I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not via serialization/messaging. This is what I've been trying to
explain to others here: that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.

Of course, I know that data get large, but typical multimedia streams
are large, binary blobs.  I was under the impression that processing
them usually proceeds along the lines of keeping offsets into the blobs,
and interpreting, etc.  Editing is usually done by making a copy of a
blob, transforming it or a subset in some manner during the copy
process, resulting in a new, possibly different-sized blob.


Your instincts are right. I'd only add that when you're talking about
data structures associated with an intricate video format, the
complexity and depth of the data structures is insane -- the LAST
thing you want to burn cycles on is serializing and unserializing that
stuff (so IPC is out)--again, we're already on the same page here.

I think at one point you made the comment that shared memory is a
solution to handle large data sets between a child process and the
parent. Although this is certainly true in principle, it doesn't hold
up in practice, since complex data structures often contain 3rd party
and OS API objects that have their own allocators. For example, in
video encoding, there are TONS of objects that comprise memory-resident
video from all kinds of APIs, so the idea of having them allocated
from a shared/mapped memory block isn't even possible. Again, I only
raise this to offer evidence that doing real-world work in a child
process is a deal breaker--a shared address space is just way too much
to give up.


Andy
 

James Mills

I think we miscommunicated there--I'm actually agreeing with you. I
was trying to make the same point you were: that intricate and/or
large structures are meant to be passed around by a top-level pointer,
not via serialization/messaging. This is what I've been trying to
explain to others here: that IPC and shared memory unfortunately
aren't viable options, leaving app threads (rather than child
processes) as the solution.

Andy,

Why don't you just use a temporary file
system (ram disk) to store the data that
your app is manipulating? All you need to
pass around then is a file descriptor.
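A sketch of that suggestion (using an ordinary temp file; on Linux,
pointing tempfile at a tmpfs mount such as /dev/shm would keep the data
RAM-resident):

```python
import mmap
import os
import tempfile

# Producer writes the big blob once; consumers get only a path (or fd).
fd, path = tempfile.mkstemp()
os.write(fd, b"frame-data" * 1000)
os.close(fd)

# A "consumer" maps the same file instead of copying the bytes around.
with open(path, "rb") as f:
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = view[:10]
    view.close()
os.unlink(path)
print(header)  # b'frame-data'
```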

--JamesMills
 

Martin v. Löwis

Andy said:
[...]

So I think the disconnect here is that maybe you're envisioning
threads being created *in* Python. To be clear, we're talking about
making threads at the app level and making it a given for the app to
take its safety into its own hands.

No. Whether or not threads are created by Python or the application
does not matter for my "Wrong" evaluation: in either case, C module
execution can easily side-step/release the GIL.
Well, when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job. Otherwise, please describe in detail how
I'd get an opaque OS object (e.g. an OS ref that refers to memory-
resident video) from the child process back to the parent process.

WHAT PARENT PROCESS? "In the same address space", to me, means
"a single process only, not multiple processes, and no parent process
anywhere". If you have just multiple threads, the notion of passing
data from a "child process" back to the "parent process" is
meaningless.
Again, the big picture that I'm trying to plant here is that there
really is a serious need for truly independent interpreters/contexts
in a shared address space.

I understand that this is your mission in this thread. However, why
is that your problem? Why can't you just use the existing (limited)
multiple-interpreters machinery, and solve your problems with that?
For most
industry-caliber packages, the expectation and convention (unless
documented otherwise) is that the app can make as many contexts as it
wants in whatever threads it wants, because the convention is that the
app must (a) never use one context's objects in another context,
and (b) never use a context at the same time from more than one
thread. That's all I'm really trying to look at here.

And that's indeed the case for Python, too. The app can make as many
subinterpreters as it wants to, and it must not pass objects from one
subinterpreter to another one, nor should it use a single interpreter
from more than one thread (although that is actually supported by
Python - but it surely won't hurt if you restrict yourself to a single
thread per interpreter).

Regards,
Martin
 
