Python does not play well with others


John Nagle

The GIL doesn't affect separate processes, and any large server that
cares about stability is going to be running a pre-forking MPM no
matter what language they're supporting.

Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request. For
many AJAX apps, the loading cost tends to dominate the transaction.

FastCGI, though, reuses the loaded Python (or whatever) image.
The code has to be written to be able to process transactions
in sequence (i.e. don't rely on variables initialized at load),
but each process lives for more than one transaction cycle.
However, each process has a copy of the whole Python system,
although maybe some code gets shared.
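
For illustration, here is a minimal sketch of that pattern as a plain
WSGI application (everything named here is invented for the example):
the module-level setup runs once when the long-lived process starts,
and only the function body runs per request.

    import time

    # Paid once, when the long-lived process (FastCGI child, mod_python
    # child, ...) starts up and imports this module.
    STARTED = time.time()
    ROUTES = dict(('/page%d' % i, 'handler%d' % i) for i in range(1000))

    def application(environ, start_response):
        # Only this function runs per request. Anything that must be
        # fresh for each transaction (database cursors, request state)
        # belongs here, not at module level.
        start_response('200 OK', [('Content-Type', 'text/plain')])
        body = 'process alive for %.1fs, %d routes loaded\n' % (
            time.time() - STARTED, len(ROUTES))
        return [body.encode('ascii')]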

John Nagle
 

Paul Boddie

Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request. For
many AJAX apps, the loading cost tends to dominate the transaction.

According to the Apache prefork documentation, you can configure how
long the child processes stick around...

http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild

...and I suppose mod_python retains its state in a child process as
long as that process is running:

"Once created, a subinterpreter will be reused for subsequent
requests. It is never destroyed and exists until the Apache process
dies."
- http://www.modpython.org/live/current/doc-html/pyapi-interps.html

"Depending on the use of various PythonInter* directives, a single
python interpreter (and list of imported modules, and per-module
global variables, etc) might be shared between multiple mod_python
applications."
- http://www.modpython.org/FAQ/faqw.py?req=show&file=faq03.005.htp

The FAQ entry (3.1) about module reloading would appear to be
pertinent here, too:

http://www.modpython.org/FAQ/faqw.py?req=show&file=faq03.001.htp

Paul
 

Paul Rubin

John Nagle said:
Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request. For
many AJAX apps, the loading cost tends to dominate the transaction.

I think the idea is that each pre-forked subprocess has its own
mod_python that services multiple requests serially.

New to me is the idea that you can have multiple separate Python
interpreters in a SINGLE process (mentioned in another post). I'd
thought that being limited to one interpreter per process was a
significant and hard-to-fix limitation of the current CPython
implementation that's unlikely to be fixed earlier than 3.0.
 

Graham Dumpleton

Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request. For
many AJAX apps, the loading cost tends to dominate the transaction.

FastCGI, though, reuses the loaded Python (or whatever) image.
The code has to be written to be able to process transactions
in sequence (i.e. don't rely on variables initialized at load),
but each process lives for more than one transaction cycle.
However, each process has a copy of the whole Python system,
although maybe some code gets shared.

As someone else pointed out, your understanding of how mod_python
works within Apache is somewhat wrong. I'll explain some things a bit
further to make it clearer for you.

When the main Apache process (parent) is started it will load all the
various Apache modules including that for mod_python. Each of these
modules has the opportunity to hook into various configuration phases
to perform actions. In the case of mod_python it will hook into the
post config phase and initialise Python, which will in turn set up all
the builtin Python modules.

When Apache forks off child processes each of those child processes
will inherit Python already in an initialised state and also the
initial Python interpreter instance which was created. This therefore
avoids the need to perform initialisation of Python every time that a
child process is created.

In general this initial Python interpreter instance isn't actually
used though, as the default strategy of mod_python is to allocate
distinct Python interpreter instances for each VirtualHost, thereby at
least keeping applications running in distinct VirtualHost containers
separate so they don't interfere with each other.

Yes, these per VirtualHost interpreter instances will only be created
on demand in the child process when a request arrives which
necessitates it be created and so there is some first time setup for
that specific interpreter instance at that point, but the main Python
initialisation has already occurred so this is minor. Most
importantly, once that interpreter instance is created for the
specific VirtualHost in the child process it is retained in memory and
used from one request to the next. If the handler for a request loads
in Python modules, those Python modules are retained in memory and do
not have to be reloaded on each request as you believe.
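
As a concrete illustration (a sketch only, not anything from the
mod_python sources), a handler module along these lines pays its setup
cost on the first request each child process sees, and every later
request reuses the module-level cache:

    from mod_python import apache

    _CACHE = None   # module level: survives from one request to the
                    # next within this interpreter in this child process

    def _expensive_setup():
        # Stand-in for loading templates, config, lookup tables, etc.
        return dict((i, i * i) for i in range(100000))

    def handler(req):
        global _CACHE
        if _CACHE is None:        # only true for the first request
            _CACHE = _expensive_setup()
        req.content_type = 'text/plain'
        req.write('cache holds %d entries\n' % len(_CACHE))
        return apache.OK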

If you are concerned about the fact that you don't specifically know
when an interpreter instance will be first created in the child
process, ie., because it would only be created upon the first request
arriving that actually required it, you can force interpreter
instances to be created as soon as the child process has been created
by using the PythonImport directive. What this directive allows you to
do is specify a Python module that should be preloaded into a specific
interpreter instance as soon as the child process is created. Because
the interpreter will need to exist, it will first be created before
the module is loaded thereby achieving the effect of precreating the
specific named interpreter instance.
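
A sketch of how that might look; the directive syntax is from the
mod_python documentation, while the module name, interpreter name and
file path are invented for the example:

    # startup.py -- preloaded via an httpd.conf line along the lines of:
    #
    #     PythonImport startup my_vhost_interpreter
    #
    # The module body runs as soon as each child process is created,
    # which also forces the named interpreter instance into existence
    # ahead of the first request.

    import logging

    logging.basicConfig(filename='/tmp/myapp-startup.log',   # example path
                        level=logging.INFO)
    TABLES = dict((i, str(i)) for i in range(10000))  # built once per child
    logging.getLogger('myapp').info('interpreter precreated, %d rows',
                                    len(TABLES))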

So as to make sure you don't think that that first interpreter
instance created in the parent and inherited by the child processes is
completely wasted, it should be pointed out that the first interpreter
instance created by Python is sort of special. In general it shouldn't
matter, but there is one case where it does. This is where a third
party extension module for Python has not been written so as to work
properly in a context where there are multiple sub interpreters.
Specifically, if a third party extension module uses the simplified
API for GIL locking, one can have problems using that module in
anything but the first interpreter instance created by Python. Thus,
the first instance is retained, and in some cases it may be necessary
to force your application to run within the context of that
interpreter instance to get it to work when using such a module. If
you have to do this for multiple applications running under different
VirtualHost containers you lose your separation though, so this is
only provided as a fallback for when you don't have a choice.

I'll mention one other area in case you have the wrong idea about it
as well. In mod_python there is a feature for certain Python modules
to be reloaded. This feature is normally on by default but is always
recommended to be turned off in a production environment. To make it
quite clear, this feature does not mean that the modules which are
candidates for reloading will be reloaded on every request. Such
modules will only be reloaded if the code file for that module has
been changed, i.e., its modification time on disk has changed. In
mod_python 3.3 where this feature is a bit more thorough and robust,
it will also reload a candidate module if some child or descendant of
the module has been changed.
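
The idea is roughly the following (a simplified sketch of the mtime
check, not mod_python's actual importer code):

    import os

    _mtimes = {}   # path -> modification time seen at the last (re)load

    def maybe_reload(module):
        """Reload `module` only if its source file has changed on disk."""
        path = getattr(module, '__file__', None)
        if not path:
            return module
        if path.endswith('.pyc') or path.endswith('.pyo'):
            path = path[:-1]                  # check the .py source file
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            return module
        last = _mtimes.get(path)
        if last is not None and mtime > last:
            module = reload(module)   # importlib.reload() on Python 3
        _mtimes[path] = mtime
        return module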

So to summarise. Interpreter instances once created in the child
processes for a particular context are retained in memory and used
from one request to the next. Further, any modules loaded by code for
a request handler are retained in memory and do not have to be reloaded
on each request. Even when module reloading is enabled in mod_python,
modules are only reloaded where a code file associated with that
module has been changed on disk.

Does that clarify any misunderstandings you have?

So far it looks like the only problem that has been identified is one
that I already know about, which is that there isn't any good
documentation out there which describes how it all works. As a result
there are a lot of people out there who describe wrongly how they
think it works and thus give others the wrong impression. I already
knew this, as I quite often find myself having to correct statements
on various newsgroups and in documentation for various Python web
frameworks. What is annoying is that even though you point out to some
of the Python web frameworks that what they state in their
documentation is wrong or misleading they don't correct it. Thus the
wrong information persists and keeps spreading the myth that there
must be some sort of problem where there isn't really. :-(

Graham
 

Graham Dumpleton

I think the idea is that each pre-forked subprocess has its own
mod_python that services multiple requests serially.

And where 'worker' MPM is used, each child process can be handling
multiple concurrent requests at the same time. Similarly on Windows
although there is only one process.

New to me is the idea that you can have multiple separate Python
interpreters in a SINGLE process (mentioned in another post). I'd
thought that being limited to one interpreter per process was a
significant and hard-to-fix limitation of the current CPython
implementation that's unlikely to be fixed earlier than 3.0.

No such limitation exists with mod_python as it does all the
interpreter creation and management at the Python C API level. The one
interpreter per process limitation is only when using the standard
'python' runtime executable and you are doing everything in Python
code.

Graham
 

sjdevnull

Pre-forking doesn't reduce load; it just improves responsiveness.
You still pay for loading all the modules on every request.

No, you don't. Each server is persistent and serves many requests--
it's not at all like CGI, and it reuses the loaded Python image.

So if you have, say, an expensive to load Python module, that will
only be executed once for each server you start...e.g. if you have
Apache configured to accept up to 50 connections, the module will be
run at most 50 times; once each of the 50 processes has started up,
they stick around until you restart Apache, unless you've configured
apache to only serve X requests in one process before restarting it.
(The one major feature that mod_python _is_ missing is the ability to
do some setup in the Python module prior to forking. That would make
restarting Apache somewhat nicer).

The major advantage of pre-forking is that you have memory protection
between servers, so a bug in one won't take down the whole apache
server (just the connection(s) that are affected by that bug). Most
shared hosting providers use pre-forking just for these stability
reasons.

A nice side effect of the memory protection is that you have
completely separate Python interpreters in each process--while each
one is reused between connections, they run in independent processes
and the GIL doesn't come into play at all.
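
For anyone who hasn't seen the pattern, a bare-bones sketch of a
pre-forking server (address and worker count are arbitrary): each
worker is a separate process with its own interpreter, so a crash takes
out only that worker and the GIL is never shared between workers.

    import os
    import socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('127.0.0.1', 8080))
    listener.listen(64)

    for _ in range(4):              # pre-fork 4 workers
        if os.fork() == 0:          # child: serve requests forever
            while True:
                conn, _addr = listener.accept()
                conn.sendall(('HTTP/1.0 200 OK\r\n\r\n'
                              'hello from pid %d\n'
                              % os.getpid()).encode('ascii'))
                conn.close()

    os.wait()                       # parent just waits on the workers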
 

Paul Rubin

Graham Dumpleton said:
No such limitation exists with mod_python as it does all the
interpreter creation and management at the Python C API level. The one
interpreter per process limitation is only when using the standard
'python' runtime executable and you are doing everything in Python code.

Oh cool, I thought CPython used global and/or static variables or had
other obstacles to supporting multiple interpreters. Is there a
separate memory pool for each interpreter when you have multiple ones?
So each one has its own copies of 0,1,2,None,..., etc? How big is the
memory footprint per interpreter? I guess there's no way to timeshare
(i.e. with microthreads) between interpreters but that's o.k.
 

Paul Rubin

Graham Dumpleton said:
Yes, these per VirtualHost interpreter instances will only be created
on demand in the child process when a request arrives which
necessitates it be created and so there is some first time setup for
that specific interpreter instance at that point, but the main Python
initialisation has already occurred so this is minor.

Well ok, but what if each of those interpreters wants to load, say,
the cookie module? Do you have separate copies of the cookie module
in each interpreter? Does each one take the overhead of loading the
cookie module? It would be neat if there was a way of including
frequently used modules in the shared text segment of the
interpreters, as created during the initial build process. GNU Emacs
used to do something like that with a contraption called "unexec" (it
could dump out parts of its data segment into a pure (shared)
executable that you could then run without the overhead of loading all
those modules) but the capability went away as computers got faster
and it became less common to have a lot of Emacs instances weighing
down timesharing systems. Maybe it's time for a revival of those
techniques.
 

Graham Dumpleton

No, you don't. Each server is persistent and serves many requests--
it's not at all like CGI, and it reuses the loaded Python image.

So if you have, say, an expensive to load Python module, that will
only be executed once for each server you start...e.g. if you have
Apache configured to accept up to 50 connections, the module will be
run at most 50 times; once each of the 50 processes has started up,
they stick around until you restart Apache, unless you've configured
apache to only serve X requests in one process before restarting it.
(The one major feature that mod_python _is_ missing is the ability to
do some setup in the Python module prior to forking. That would make
restarting Apache somewhat nicer).

There would be a few issues with preloading modules before the main
Apache process performed the fork.

The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.

The second issue is that there can be multiple Python interpreters
ultimately created depending on how URLs are mapped, thus it isn't
just an issue with loading a module once, you would need to create all
the interpreters you think might need it and preload it into each. All
this will blow out the memory size of the main Apache process.

There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.
One particular area which could be a problem is where Apache wants to
do a restart, as it will attempt to unload the mod_python module and
reload it. Right now this may not be an issue as mod_python does the
wrong thing and doesn't shut down Python, allowing it to be
reinitialised when mod_python is reloaded, but in mod_wsgi (when
mod_python isn't also being loaded), it will shut down Python. If there
is user code executing in a thread within the parent process, this may
actually stop mod_wsgi from cleanly shutting down Python, thus causing
Apache to hang.

All up, the risks of loading extra modules in the parent process
aren't worth it and could just result in things being less stable.

Graham
 

Graham Dumpleton

Well ok, but what if each of those interpreters wants to load, say,
the cookie module? Do you have separate copies of the cookie module
in each interpreter? Does each one take the overhead of loading the
cookie module?

Each interpreter instance will have its own copy of any Python based
code modules. You can't avoid this as Python code is so modifiable
that they have to be separate else you would be modifying the same
instance as used by a different interpreter, which could screw up the
other application's view of the world. The whole point of having
separate interpreters is to avoid applications trampling on each
other. If you really are concerned about multiple loading, use the
PythonInterpreter directive to specifically say that applications
running under different VirtualHost containers should use the same
interpreter.

Note, though, that although you can run multiple applications in one
interpreter in many cases, it may not be possible in others.
For example, it is not possible to run two instances of Django within
the one interpreter instance. The first reason this can't be
done is that Django expects certain information about its
configuration to come from os.environ. Since there is only one
os.environ it can't have two different values for each application at
the same time. Some may argue that in 'prefork' you could just change
os.environ to be correct for the application for the current request
and this effectively is what the mod_python adapter for Django does,
but this will fail when 'worker' MPM or Windows is used. I suspect
this is where the idea that Django can't be run on 'worker' MPM
came from. Although the documentation for Django suggests it is a
mod_python problem, it is actually a Django problem. This use of
os.environ by Django also means that Django isn't a well behaved WSGI
application component. :-(
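
To make the os.environ point concrete (the settings module names are
invented): os.environ is a single process-wide mapping shared by every
application in the same interpreter, so two Django sites cannot each
keep their own value in it.

    import os

    os.environ['DJANGO_SETTINGS_MODULE'] = 'site_one.settings'
    # ... site one handles a request using site_one.settings ...

    os.environ['DJANGO_SETTINGS_MODULE'] = 'site_two.settings'
    # From here on *both* applications see site_two.settings; with the
    # threaded 'worker' MPM the switch can even happen in the middle of
    # another request.
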
It would be neat if there was a way of including
frequently used modules in the shared text segment of the
interpreters, as created during the initial build process. GNU Emacs
used to do something like that with a contraption called "unexec" (it
could dump out parts of its data segment into a pure (shared)
executable that you could then run without the overhead of loading all
those modules) but the capability went away as computers got faster
and it became less common to have a lot of Emacs instances weighing
down timesharing systems. Maybe it's time for a revival of those
techniques.

I don't see it as being applicable. Do note that provided there are
precompiled byte code files for .py files then load time is at least
reduced because Python doesn't have to recompile the code. This
actually can be quite significant.
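
For example, the stdlib compileall module can be pointed at an
application directory ahead of time (the path is only an example), so
the .pyc files already exist when a child process first imports the
code; the same thing is available from the command line as
"python -m compileall /var/www/myapp".

    import compileall

    # Writes foo.pyc next to each foo.py under the directory, so later
    # imports skip the compile step entirely.
    compileall.compile_dir('/var/www/myapp', quiet=True)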

Graham
 

Paul Rubin

Graham Dumpleton said:
The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.

It switches very early, I think. It starts as root so it can listen
on port 80.

There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.

Certainly launching any new threads should be postponed til after the
fork.
 

Graham Dumpleton

Certainly launching any new threads should be postponed til after the
fork.

Except that you can't outright prevent it from being done as a Python
module could create the threads as a side effect of the module import
itself. I guess though if you load a module which does that and it
screws things up, then you have brought it on yourself as it would
have been your choice to make mod_python load it in the first place if
the feature was there. :)
 

Paul Rubin

Graham Dumpleton said:
Except that you can't outright prevent it from being done as a Python
module could create the threads as a side effect of the module import
itself.

Yeah, the preload would have to be part of the server configuration,
requiring appropriate care in choosing the preloaded modules (they'd
normally be stdlib modules which rarely do uncivilized things like
launch new threads on import). It wouldn't do to let random user
scripts into the preload.

One could imagine languages in which this could be enforced by a
static type system. Hmm.
 

sjdevnull

There would be a few issues with preloading modules before the main
Apache process performed the fork.

The first is whether it would be possible for code to be run with
elevated privileges given that the main Apache process usually is
started as root. I'm not sure at what point it switches to the special
user Apache generally runs as and whether in the main process the way
this switch is done is enough to prevent code getting back root
privileges in some way, so would need to be looked into.

In our case, the issue is this: we load a ton of info at server
restart, from the database. Some of it gets processed a bit based on
configuration files and so forth. If this were done in my own C
server, I'd do all of that and set up the (read-only) runtime data
structures prior to forking. That would mean that:
a) The processing time would be lower since you're just doing the pre-
processing once; and
b) The memory footprint could be lower if large data structures were
created prior to fork; they'd be in shared copy-on-write pages.

b) isn't really possible in Python as far as I can tell (you're going
to wind up touching the reference counts when you get pointers to
objects in the page, so everything's going to get copied into your
process eventually), but a) would be very nice to have.
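
A rough sketch of (a) and (b) with a bare os.fork(), outside Apache
entirely (the data below is a stand-in for the real database-derived
structures):

    import os

    def load_big_readonly_tables():
        # Stand-in for the expensive, read-mostly startup work.
        return {'answers': dict((i, i * i) for i in range(100000))}

    TABLES = load_big_readonly_tables()   # once, in the parent, pre-fork

    children = []
    for _ in range(4):
        pid = os.fork()
        if pid == 0:
            # The child starts with the parent's pages mapped
            # copy-on-write, so the setup above was paid exactly once.
            print('%d sees %d entries'
                  % (os.getpid(), len(TABLES['answers'])))
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)

    # The caveat from (b) still applies: CPython bumps a reference count
    # every time an object is touched, which dirties the page holding
    # it, so the children gradually copy much of this "shared" data.
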
The second issue is that there can be multiple Python interpreters
ultimately created depending on how URLs are mapped, thus it isn't
just an issue with loading a module once, you would need to create all
the interpreters you think might need it and preload it into each. All
this will blow out the memory size of the main Apache process.

It'll blow out the children, too, though. Most real-world
implementations I've seen just use one interpreter, so even a solution
that didn't account for this would be very useful in practice.

There is also much more possibility for code, if it runs up extra
threads, to interfere with the operation of the Apache parent process.

Yeah, you don't want to run threads in the parent (I'm not sure many
big mission-critical sites use multiple threads anyway, certainly none
of the 3 places I've worked at did). You don't want to allow
untrusted code. You have to be careful, and you should treat anything
run there as part of the server configuration.

But it would still be mighty nice. We're considering migrating to
another platform (still Python-based) because of this issue, but
that's only because we've gotten big enough (in terms of "many big fat
servers sucking up CPU on one machine", not "tons of traffic") that
it's finally an issue. mod_python is still very nice and frankly if
our startup coding was a little less piggish it might not be an issue
even now--on the other hand, we've gotten a lot of flexibility out of
our approach, and the code base is up to 325,000 lines of python or
so. We might be able to refactor things to cut down on startup costs,
but in general a way to call startup code only once seems like the
Right Thing(TM).
 

Paul Rubin

In our case, the issue is this: we load a ton of info at server
restart, from the database. Some of it gets processed a bit based on
configuration files and so forth. If this were done in my own C
server, I'd do all of that and set up the (read-only) runtime data
structures prior to forking. That would mean that:
a) The processing time would be lower since you're just doing the pre-
processing once; and
b) The memory footprint could be lower if large data structures were
created prior to fork; they'd be in shared copy-on-write pages.

If you completely control the server, write an apache module that
dumps this data into a file on startup, then mmap it into your Python app.
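
That is, something along these lines on the Python side (the file name
is only an example):

    import mmap

    f = open('/tmp/precomputed.dat', 'rb')
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # whole file

    # `data` behaves like a read-only string of bytes backed by shared
    # pages; every child that maps the same file shares one copy of it
    # in physical memory.
    header = data[:16]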
 

sjdevnull

If you completely control the server, write an apache module that
dumps this data into a file on startup, then mmap it into your Python app.

The final data after loading is in the form of a bunch of python
objects in a number of complex data structures, so that's not really a
good solution as far as I can tell. We read in a bunch of data from
the database and build a data layer describing all the various classes
(and some kinds of global configuration data, etc) used by the various
applications in the system.

It's possible that we could build it all in a startup module and then
pickle everything we've built into a file that each child would
unpickle, but I'm a bit leery about that approach.
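
In outline the pickle approach would look something like this (paths
invented), which is probably why it feels fragile: the pickled classes
still have to be importable and version-compatible in every child that
unpickles them.

    import cPickle as pickle   # plain `pickle` on Python 3

    CACHE_PATH = '/var/cache/myapp/startup.pickle'   # example location

    def dump_startup_data(data, path=CACHE_PATH):
        # Run once in the startup module, before the children need it.
        f = open(path, 'wb')
        try:
            pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        finally:
            f.close()

    def load_startup_data(path=CACHE_PATH):
        # Run in each child the first time the data is needed.
        f = open(path, 'rb')
        try:
            return pickle.load(f)
        finally:
            f.close()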
 

Paul Rubin

It's possible that we could build it all in a startup module and then
pickle everything we've built into a file that each child would
unpickle, but I'm a bit leery about that approach.

Yeah, that's not so great. You could look at POSH.
 

azrael

You have to understand that Python is not like other languages. I have
been working with it for 3 months, and in this time I learned more than
through C, C++, Java or PHP. Take PHP: what is PHP? A language
developed primarily for web applications. Take Zope and you have the
same, and besides that Zope will do fantastic things when combined with
other modules.
C and C++ are incredible languages, and if I am not wrong, C is the way
to make applications run in minimum time (not counting assembler). It
has been developed for quite a long time and the research has been
sponsored even by governments. The development of PHP continues while
Rasmus Lerdorf is working for Yahoo. The Python community is not as big
as the PHP one (at least in my country), but people like me, who fell
in love with Python, work on our projects; we know what we want, and if
something doesn't work we make it work. I am working on a project and
had no filters I needed for signal processing, so I wrote them. My next
step, when they are finished and work well, is to send them to the
admins of PIL. This is open source, and we help each other.

That's open source.

You try Python and you like it or not; you keep using it or not. If
something doesn't work, be so kind and make it work, but please, don't
expect someone else to do your "homework". If there are problems,
contact the admins and offer them your help.
 
