2.6, 3.0, and truly independent interpreters


sturlamolden


All this says is:

1. The cost of serialization and deserialization is too large.
2. Complex data structures cannot be placed in shared memory.

The first claim is unsubstantiated. It depends on how much and what
you serialize. If you use something like NumPy arrays, the cost of
pickling is tiny. Erlang is a language specifically designed for
concurrent programming, yet it does not allow anything to be shared.

The second claim is plain wrong. You can put anything you want in
shared memory. The mapping address of the shared memory segment may
vary, but that can be dealt with (basically, store integer offsets
from the segment's base address instead of raw pointers). Pyro is a Python
project that has investigated this. With Pyro you can put any Python
object in a shared memory region. You can also use NumPy record arrays
to put very complex data structures in shared memory.
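A minimal sketch of the offsets-instead-of-pointers idea, using only the standard library's multiprocessing.sharedctypes (the Node and walk names are illustrative, not from any particular library):

```python
# Sketch: a pointer-free "linked list" in a shared ctypes array.
# Nodes reference each other by integer index instead of pointer,
# so the layout stays valid at any mapping address.
import ctypes
import multiprocessing.sharedctypes as sct

class Node(ctypes.Structure):
    _fields_ = [("value", ctypes.c_double),
                ("next", ctypes.c_int)]   # index of next node, -1 = end

pool = sct.RawArray(Node, 3)              # lives in shared memory

# Build the chain 0 -> 2 -> 1 using indices, never raw addresses.
pool[0].value, pool[0].next = 1.0, 2
pool[2].value, pool[2].next = 2.0, 1
pool[1].value, pool[1].next = 3.0, -1

def walk(arr, start):
    i, out = start, []
    while i != -1:
        out.append(arr[i].value)
        i = arr[i].next
    return out

print(walk(pool, 0))   # [1.0, 2.0, 3.0]
```

Any process that maps the same segment can traverse the structure, because nothing in it depends on where the segment happens to be mapped.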

What do you gain by placing multiple interpreters in the same process?
You will avoid the complication that the mapping address of the shared
memory region may be different. But this is a problem that has been
worked out and solved. Instead you get a lot of issues dealing with
DLL loading and unloading (Python extension objects).

The multiprocessing module has something called proxy objects, which
also deals with this issue. An object is hosted in a server process,
and client processes may access it through synchronized IPC calls.
Inside the client process the remote object looks like any other
Python object. The synchronized IPC is hidden away in an abstraction
layer. In Windows, you can also construct outproc ActiveX objects,
which are not that different from multiprocessing's proxy objects.
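A minimal sketch of such a proxy with the standard multiprocessing.Manager (the demo wrapper is illustrative): the dict itself lives in the manager's server process, and the client manipulates it through a proxy whose attribute accesses are synchronized IPC calls under the hood.

```python
# Sketch of multiprocessing proxy objects: the dict lives in a
# manager server process; the proxy in the client forwards each
# operation over IPC, so it looks like any other Python object.
from multiprocessing import Manager

def demo():
    with Manager() as mgr:
        shared = mgr.dict()      # proxy; the real dict is in the server
        shared["hits"] = 0
        shared["hits"] += 1      # read + write, each an IPC round trip
        return shared["hits"]

if __name__ == "__main__":
    print(demo())                # 1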

If you need to place a complex object in shared memory:

1. Check if a NumPy record array may suffice (dtypes may be nested).
It will if you don't have dynamically allocated pointers inside the
data structure.

2. Consider using multiprocessing's proxy objects or outproc ActiveX
objects.

3. Go to http://pyro.sourceforge.net, download the code and read the
documentation.

Saying that "it can't be done" is silly before you have tried.
Programmers are not that good at guessing where the bottlenecks
reside, even if we think we are.
 

sturlamolden

The language features look a lot like what others have already been
offering for a while: keywords for parallelised constructs (cilk_for)
which are employed by solutions for various languages (C# and various C
++ libraries spring immediately to mind); spawning and synchronisation
are typically supported in existing Python solutions, although
obviously not using language keywords.

Yes, but there is not a 'concurrency platform' that takes care of
things like load balancing and testing for race conditions. If you
spawn with cilk++, the result is not that a new process or thread is
spawned. The task is put in a queue (scheduled using work stealing),
and executed by a pool of threads/processes. Multiprocessing makes
it easy to write concurrent algorithms (as opposed to subprocess or
popen), but automatic load balancing is something it does not do. It
also does not identify and warn the programmer about race conditions.
It does not have a barrier synchronization paradigm, but one can be
constructed.

java.util.concurrent.forkjoin is actually based on cilk.

Something like cilk can easily be built on top of the multiprocessing
module. Extra keywords can and should be avoided. But it is easier in
Python than C. Keywords are used in cilk++ because they can be defined
out by the preprocessor, thus restoring the original sequential code.
In Python we can e.g. use a decorator instead.
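A sketch of that decorator idea, here using a concurrent.futures thread pool as the task queue (the spawn name and pool size are illustrative; a real cilk-style system would schedule onto processes with work stealing):

```python
# Hypothetical spawn decorator: queues the call to a pool and
# returns a future. Deleting "@spawn" restores sequential code,
# much like cilk keywords being defined out by the preprocessor.
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def spawn(func):
    def wrapper(*args, **kwargs):
        return _pool.submit(func, *args, **kwargs)  # runs in the pool
    return wrapper

@spawn
def heavy(x):
    return x * x                             # stand-in for real work

futures = [heavy(i) for i in range(5)]       # all queued immediately
print([f.result() for f in futures])         # [0, 1, 4, 9, 16]
```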
 

Walter Overby

Hi,

I've been following this discussion, and although I'm not nearly the
Python expert that others on this thread are, I think I understand
Andy's point of view. His premises seem to include at least:

1. His Python code does not control the creation of the threads. That
is done "at the app level".
2. Perhaps more importantly, his Python code does not control the
allocation of the data he needs to operate on. He's got, for example,
"an opaque OS object" that is manipulated by CPU-intensive OS
functions.

sturlamolden suggests a few approaches:
1. Check if a NumPy record array may suffice (dtypes may be nested).
It will if you don't have dynamically allocated pointers inside the
data structure.

I suspect that the OS is very likely to have dynamically allocated
pointers inside their opaque structures.
2. Consider using multiprocessing's proxy objects or outproc ActiveX
objects.

I don't understand how this would help. If these large data
structures reside only in one remote process, then the overhead of
proxying the data into another process for manipulation requires too
much IPC, or at least so Andy stipulates.
3. Go to http://pyro.sourceforge.net, download the code and read the
documentation.

I don't see how this solves the problem with 2. I admit I have only
cursory knowledge, but I understand "remoting" approaches to have the
same weakness.

I understand Andy's problem to be that he needs to operate on a large
amount of in-process data from several threads, and each thread mixes
CPU-intensive C functions with callbacks to Python utility functions.
He contends that, even though he releases the GIL in the CPU-bound C
functions, the reacquisition of the GIL for the utility functions
causes unacceptable contention slowdowns in the current implementation
of CPython.

After reading Martin's posts, I think I also understand his point of
view. Is the time spent in these Python callbacks so large compared
to the C functions that you really have to wait? If so, then Andy has
crossed over into writing performance-critical code in Python. Andy
proposes that the Python community could work on making that possible,
but Martin cautions that it may be very hard to do so.

If I understand them correctly, none of these concerns are silly.

Walter.
 

sturlamolden

I don't understand how this would help. If these large data
structures reside only in one remote process, then the overhead of
proxying the data into another process for manipulation requires too
much IPC, or at least so Andy stipulates.

Perhaps it will, or perhaps not. Reading or writing to a pipe has
slightly more overhead than a memcpy. There are things that Python
needs to do that are slower than the IPC. In this case, the real
constraint would probably be contention for the object in the server,
not the IPC. (And don't blame it on the GIL, because putting a lock
around the object would not be any better.)

I don't see how this solves the problem with 2.

It puts Python objects in shared memory. Shared memory is the fastest
form of IPC there is. The overhead is basically zero. The only
constraint will be contention for the object.

I understand Andy's problem to be that he needs to operate on a large
amount of in-process data from several threads, and each thread mixes
CPU-intensive C functions with callbacks to Python utility functions.
He contends that, even though he releases the GIL in the CPU-bound C
functions, the reacquisition of the GIL for the utility functions
causes unacceptable contention slowdowns in the current implementation
of CPython.

Yes, callbacks to Python are expensive. But is the problem the GIL?
Instead of contention for the GIL, he seems to prefer contention for a
complex object. Is that any better? It too has to be protected by a
lock.

If I understand them correctly, none of these concerns are silly.

No they are not. But I think he underestimates what multiple processes
can do. The objects in 'multiprocessing' are already a lot faster than
their 'threading' and 'Queue' counterparts.
 

Walter Overby

Perhaps it will, or perhaps not. Reading or writing to a pipe has
slightly more overhead than a memcpy. There are things that Python
needs to do that are slower than the IPC. In this case, the real
constraint would probably be contention for the object in the server,
not the IPC. (And don't blame it on the GIL, because putting a lock
around the object would not be any better.)

(I'm not blaming anything on the GIL.)

I read Andy to stipulate that the pipe needs to transmit "hundreds of
megs of data and/or thousands of data structure instances." I doubt
he'd be happy with memcpy either. My instinct is that contention for
a lock could be the quicker option.

And don't forget, he says he's got an "opaque OS object." He asked
the group to explain how to send that via IPC to another process. I
surely don't know how.
It puts Python objects in shared memory. Shared memory is the fastest
form of IPC there is. The overhead is basically zero. The only
constraint will be contention for the object.

I don't think he has Python objects to work with. I'm persuaded when
he says: "when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job."

Why aren't you persuaded?

Yes, callbacks to Python are expensive. But is the problem the GIL?
Instead of contention for the GIL, he seems to prefer contention for a
complex object. Is that any better? It too has to be protected by a
lock.

At a couple points, Andy has expressed his preference for a "single
high level sync object" to synchronize access to the data, at least
that's my reading. What he doesn't seem to prefer is the slowdown
arising from the Python callbacks acquiring the GIL. I think that
would be an additional lock, and that's near the heart of Andy's
concern, as I read him.
No they are not. But I think he underestimates what multiple processes
can do. The objects in 'multiprocessing' are already a lot faster than
their 'threading' and 'Queue' counterparts.

Andy has complimented 'multiprocessing' as a "huge huge step." He
just offers a scenario where multiprocessing might not be the best
solution, and so far, I see no evidence he is wrong. That's not
underestimation, in my estimation!

Walter.
 

sturlamolden

I read Andy to stipulate that the pipe needs to transmit "hundreds of
megs of data and/or thousands of data structure instances."  I doubt
he'd be happy with memcpy either.  My instinct is that contention for
a lock could be the quicker option.

If he needs to communicate that amount of data very often, he has a
serious design problem.

A pipe can transmit hundreds of megs in a split second by the way.
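A rough way to check that claim on your own machine (stdlib only; the figures are obviously hardware-dependent): write 100 MB into an OS pipe from one thread while reading it back in another, and time it.

```python
# Rough sketch: push 100 MB through an OS pipe and time it.
# Not a precise benchmark; the point is only that bulk pipe
# transfers complete in a fraction of a second.
import os, threading, time

SIZE = 100 * 1024 * 1024
CHUNK = 1 << 16

def writer(fd):
    remaining = SIZE
    block = b"x" * CHUNK
    while remaining:
        n = os.write(fd, block[:min(CHUNK, remaining)])
        remaining -= n
    os.close(fd)

r, w = os.pipe()
t = threading.Thread(target=writer, args=(w,))
start = time.perf_counter()
t.start()
received = 0
while True:
    data = os.read(r, CHUNK)
    if not data:
        break
    received += len(data)
t.join()
os.close(r)
elapsed = time.perf_counter() - start
print(f"{received / 1e6:.0f} MB in {elapsed:.3f} s")
```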

And don't forget, he says he's got an "opaque OS object."  He asked
the group to explain how to send that via IPC to another process.  I
surely don't know how.

This is a typical situation where one could use a proxy object. Let
one server process own the opaque OS object, and multiple client
processes access it via IPC calls to the server.
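A sketch of that arrangement with multiprocessing.managers.BaseManager (Handle is a hypothetical stand-in for a wrapper around the process-local OS object):

```python
# Sketch of "one server process owns the opaque object": clients get
# a proxy, and every method call executes in the owning process.
from multiprocessing.managers import BaseManager

class Handle:
    def __init__(self):
        self._state = 0          # imagine an OS handle held here
    def poke(self):
        self._state += 1
        return self._state

class HandleManager(BaseManager):
    pass

HandleManager.register("Handle", Handle)

def demo():
    mgr = HandleManager()
    mgr.start()                  # server process owns the Handle
    try:
        h = mgr.Handle()         # client-side proxy
        h.poke()
        return h.poke()          # each call is an IPC round trip
    finally:
        mgr.shutdown()

if __name__ == "__main__":
    print(demo())                # 2
```

The opaque object never leaves the server process; only method arguments and return values cross the process boundary.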

I don't think he has Python objects to work with.  I'm persuaded when
he says: "when you're talking about large, intricate data structures
(which include opaque OS object refs that use process-associated
allocators), even a shared memory region between the child process and
the parent can't do the job."

Why aren't you persuaded?

I am persuaded that shared memory may be difficult in that particular
case. I am not persuaded that multiple processes cannot be used,
because one can let one server process own the object.
 

Paul Boddie

If he needs to communicate that amount of data very often, he has a
serious design problem.

As far as I can tell, he wants to keep the data in one place and just
pass a pointer around between execution contexts. The apparent issue
with using shared memory segments for this is that he relies on
existing components which have their own allocation preferences. So
although you or I might choose shared memory if writing this stuff
from scratch, he doesn't appear to have this option.

The inquirer hasn't acknowledged my remarks about tinypy, but I know
that if I were considering dropping $40000 and/or 2-3 man-months, I'd
at least have a look at what those people have done and whether
there's any mileage in using it before starting a new, embeddable
implementation of Python from scratch.

Paul
 

sturlamolden

As far as I can tell, he wants to keep the data in one place and just
pass a pointer around between execution contexts.

This would be the easiest solution if Python were designed to do this
from the beginning. I have previously stated that I believe the lack
of a context pointer in Python's C API is a design flaw, albeit one
that is difficult to change.

If the alternative is to rewrite the whole CPython interpreter, I
would say it is easier to try a proxy object design instead (either
using multiprocessing or an outproc ActiveX object).
 

Andy O'Meara

Anyway, to keep things constructive, I should ask (again) whether you
looked at tinypy [1] and whether that might possibly satisfy your
embedded requirements.

Actually, I'm starting to get into the tinypy codebase and have been
talking in detail with the leads for that project (I just branched it,
in fact). TP indeed has all the right ingredients for a CPython "ES"
API, so I'm currently working on a first draft. Interestingly, the TP
VM is largely based on Lua's implementation and stresses compactness.
One challenge is that its design may be overly compact, making it a
little tricky to extend and maintain (but I anticipate things will
improve as we rev it).

When I have a draft of this "CPythonES" API, I plan to post here for
everyone to look at and give feedback on. The only thing that sucks
is that I have a lot of other commitments right now, so I can't spend
the time on this that I'd like to. Once we have that API finalized,
I'll be able to start offering some bounties for filling in some of
its implementation. In any case, I look forward to updating folks
here on our progress!

Andy
 

Andy O'Meara

All this says is:

1. The cost of serialization and deserialization is too large.
2. Complex data structures cannot be placed in shared memory.

The first claim is unsubstantiated. It depends on how much and what
you serialize.

Right, but I'm telling you that it *is* substantial... Unfortunately,
you can't serialize thousands of opaque OS objects (which undoubtedly
contain sub-allocations and pointers) in a frame-based,
performance-centric app. Please consider that others (such as myself) are not
trying to be difficult here--turns out that we're actually
professionals. Again, I'm not the type to compare credentials, but it
would be nice if you considered that you aren't the final authority on
real-time professional software development.
The second claim is plain wrong. You can put anything you want in
shared memory. The mapping address of the shared memory segment may
vary, but it can be dealt with (basically use integers instead of
pointers, and use the base address as offset.)

I explained this in other posts: OS objects are opaque and their
serialization has to be done via their APIs, which is never marketed
as being fast *OR* cheap. I've gone into this many times and in many
posts.
Saying that "it can't be done" is silly before you have tried.

Your attitude and unwillingness to look at the use cases listed by
myself and others in this thread show that this discussion may not be
a good use of your time. In any case, you haven't even acknowledged that a
package can't "wag the dog" when it comes to app development--and
that's the bottom line and root liability.


Andy
 

Andy O'Meara

If he needs to communicate that amount of data very often, he has a
serious design problem.

Hmmm... Your comment there seems to be an indicator that you don't
have a lot of experience with real-time, performance-centric apps.
Consider my previously listed examples of video rendering and
programmatic effects in real-time. You need to have a lot of stuff in
threads being worked on, and as Walter described, using a signal
rather than serialization is the clear choice. Or, consider Patrick's
case where you have massive amounts of audio being run through a DSP--
it just doesn't make sense to serialize an intricate, high-level object
when you could otherwise just hand it off via a single sync step.
Walter and Paul really get what's being said here, so that should be
an indicator to take a step back for a moment and ease up a bit...
C'mon, man--we're all on the same side here! :^)


Andy
 
