threading and multicores, pros and cons

Maric Michaud · Feb 14, 2007

This is a recurrent problem I encounter when I try to sell python solutions to
my customers. I'm aware that this problem is sometimes overlooked, but here
is the market's law.

I've heard of a bunch of arguments to defend python's choice of GIL, but I'm
not quite sure of their technical background, nor what is really important
and what is not. These discussions often end in a prudent "python has made a
choice among others"... which is not really convincing.

If some guru has written a good recipe, or wants to summarize the main points, it
would be really appreciated.

regards,

--
_____________

Maric Michaud
_____________

Aristote - www.aristote.info
3 place des tapis
69004 Lyon
Tel: +33 426 880 097
Mobile: +33 632 77 00 21

Paul Rubin · Feb 14, 2007

Maric Michaud said:
If some guru has written a good recipe, or wants to summarize the main points, it
would be really appreciated.

Basically Python applications are usually not too CPU-intensive; there
are some ways you can get parallelism with reasonable extra effort;
and for most of Python's history, multi-CPU systems have been rather
exotic so the GIL didn't create too big a problem. Right now it is
starting to become more of a problem than before, but it's not yet
intolerable. Obviously something will have to be done about it in the
long run, maybe with PyPy.

Maric Michaud · Feb 14, 2007

On Wednesday 14 February 2007 05:49, Paul Rubin wrote:

Basically Python applications are usually not too CPU-intensive; there
are some ways you can get parallelism with reasonable extra effort;

Even if it is not CPU-intensive, an application server needs to take advantage
of all the hardware's resources.
When a customer arrives with his beautiful new dual-core server and gets a basic
plone install up and running, he will immediately compare it to J2EE and
wonder why he should pay a consultant to make it work properly.
At that point, it's not easy to explain to him that python is not flawed compared
to Java, and that he will not regret his choice in the future.
First impressions may be decisive.

I'm afraid the historical explanation would not be effective here. What about
the argument that multithreading is not that good for parallelism?
Is it strong enough?


Paul Rubin · Feb 14, 2007

Maric Michaud said:
On Wednesday 14 February 2007 05:49, Paul Rubin wrote:
Even if it is not CPU-intensive, an application server needs to take
advantage of all the hardware's resources.

But this is impossible--if the application is not CPU intensive, by
definition it leaves a lot of the available CPU cycles unused.

When a customer arrives with his beautiful new dual-core server and
gets a basic plone install up and running, he will immediately
compare it to J2EE and wonder why he should pay a consultant to make
it work properly. At that point, it's not easy to explain to him that
python is not flawed compared to Java, and that he will not regret
his choice in the future. First impressions may be decisive.

That is true, parallelism is an area where Java is ahead of us.

I'm afraid the historical explanation would not be effective
here. What about the argument that multithreading is not that
good for parallelism? Is it strong enough?

It's not much good for parallelism in the typical application that
spends most of its time blocked waiting for I/O. That is many
applications. It might even be most applications. But there are
still such things as CPU-intensive applications which can benefit from
parallelism, and Python has a weak spot there.
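
To illustrate the I/O-bound case described above, here is a minimal sketch in Python. It assumes that time.sleep() is a fair stand-in for a blocking I/O call (CPython releases the GIL while a thread sleeps or blocks on I/O); the function names are made up for this illustration.

```python
import threading
import time


def fake_io(delay, results, index):
    """Stand-in for a blocking I/O call; the GIL is released while sleeping."""
    time.sleep(delay)
    results[index] = delay


def run_concurrently(n, delay):
    """Run n blocking 'I/O' calls in n threads and return (elapsed, results)."""
    results = [None] * n
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i in range(n)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results


if __name__ == "__main__":
    elapsed, _ = run_concurrently(4, 0.3)
    print(f"4 blocking calls of 0.3s took {elapsed:.2f}s in total")
```

The waits overlap, so the total is close to one delay rather than four. Replacing the sleep with a pure-Python CPU loop would not overlap in the same way under the GIL, which is the weak spot mentioned above.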

garrickp · Feb 14, 2007

I've heard of a bunch of arguments to defend python's choice of GIL, but I'm
not quite sure of their technical background, nor what is really important
and what is not. These discussions often end in a prudent "python has made a
choice among others"... which is not really convincing.

Well, INAG (I'm not a Guru), but we recently had training from a Guru.
When we brought up this question, his response was fairly simple.
Paraphrased for inaccuracy:

"Some time back, a group did remove the GIL from the python core, and
implemented locks on the core code to make it threadsafe. Well, the
problem was that while it worked, the necessary locks made
single-threaded code take significantly longer to execute."

He then proceeded to show us how to achieve the same effect
(multithreading python for use on multi-core computers) using popen2
and stdio pipes.
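
The trainer's actual code is not in this thread, so the following is only a sketch of the general idea (worker processes fed through stdio pipes). It uses the subprocess module rather than popen2, and the worker script and function names are assumptions made up for this illustration.

```python
import subprocess
import sys

# Hypothetical worker: reads integers from stdin, prints the sum of squares.
WORKER = """
import sys
numbers = [int(tok) for tok in sys.stdin.read().split()]
print(sum(n * n for n in numbers))
"""


def parallel_sum_of_squares(numbers, workers=2):
    """Split the work across separate Python processes connected by pipes."""
    chunks = [numbers[i::workers] for i in range(workers)]
    procs = []
    for chunk in chunks:
        proc = subprocess.Popen(
            [sys.executable, "-c", WORKER],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        # Hand over the input and close stdin so the worker starts computing
        # while the remaining workers are still being launched.
        proc.stdin.write(" ".join(map(str, chunk)))
        proc.stdin.close()
        procs.append(proc)

    total = 0
    for proc in procs:
        total += int(proc.stdout.read())
        proc.stdout.close()
        proc.wait()
    return total


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), workers=2))
```

Each worker has its own interpreter and its own GIL, so the operating system can schedule them on different cores. The price is that all data has to be serialized through the pipes.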

FWIW, ~G

Istvan Albert · Feb 14, 2007

At that point, it's not easy to explain to him that python
is not flawed compared to Java, and that he will not
regret his choice in the future.

Database adaptors such as psycopg do release the GIL while connecting
and exchanging data. Apache's MPM (multi processing module) can run
mod_python and with that multiple python instances as separate
processes thus avoiding the global lock as well.

plone install up and running, he will immediately compare it to
J2EE and wonder why he should pay a consultant to make it work properly.

I really doubt that any performance difference will be due to the
global interpreter lock. This is not how things work. You most certainly
have far more substantial bottlenecks in each application.

i.

Nikita the Spider · Feb 14, 2007

Maric Michaud said:
This is a recurrent problem I encounter when I try to sell python solutions to
my customers. I'm aware that this problem is sometimes overlooked, but here
is the market's law.

I've heard of a bunch of arguments to defend python's choice of GIL, but I'm
not quite sure of their technical background, nor what is really important
and what is not. These discussions often end in a prudent "python has made a
choice among others"... which is not really convincing.

If some guru has written a good recipe, or wants to summarize the main points, it
would be really appreciated.

When designing a new Python application I read a fair amount about the
implications of multiple cores for using threads versus processes, and
decided that using multiple processes was the way to go for me. On that
note, there is a (sort of) new module available that allows interprocess
communication via shared memory and semaphores with Python. You can find
it here:
http://NikitaTheSpider.com/python/shm/

Hope this helps
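
For readers who want to see the general idea of two processes sharing memory, here is a rough sketch. It does not use the shm module linked above (its API is not shown in this thread); it substitutes the standard library's mmap over a temporary file, and it has no semaphore: the parent simply waits for the child to exit. All names are made up for this illustration.

```python
import mmap
import os
import struct
import subprocess
import sys
import tempfile

# Hypothetical child: maps the same file and doubles the 64-bit integer in it.
CHILD = """
import mmap, struct, sys
with open(sys.argv[1], "r+b") as f:
    with mmap.mmap(f.fileno(), 8) as shared:
        (value,) = struct.unpack("<q", shared[:8])
        shared[:8] = struct.pack("<q", value * 2)
        shared.flush()
"""


def double_in_child(value):
    """Let a separate process modify a value through a shared mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, struct.pack("<q", value))
        os.close(fd)
        # Waiting for the child to finish stands in for real synchronization.
        subprocess.run([sys.executable, "-c", CHILD, path], check=True)
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), 8) as shared:
                return struct.unpack("<q", shared[:8])[0]
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(double_in_child(21))
```

Long-lived cooperating processes would need a proper semaphore or lock around the shared region, which is the part the shm module mentioned above is said to provide.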

sjdevnull · Feb 14, 2007

That is true, parallelism is an area where Java is ahead of us.

Java's traditionally been ahead in one case, but well behind in
general.

Java has historically had no support at all for real multiple process
solutions (akin to fork() or ZwCreateProcess() with NULL
SectionHandle), which should make up the vast majority of parallel
programs (basically all of those except where you don't want memory
protection).

Has this changed in recent Java releases? Is there a way to use
efficient copy-on-write multiprocess architectures?

Paul Rubin · Feb 14, 2007

sjdevnull said:
Java has historically had no support at all for real multiple process
solutions (akin to fork() or ZwCreateProcess() with NULL
SectionHandle), which should make up the vast majority of parallel
programs (basically all of those except where you don't want memory
protection).

I don't know what ZwCreateProcess is (sounds like a Windows-ism) but I
remember using popen() under Java 1.1 in Solaris. That at least
allows launching a new process and communicating with it. I don't
know if there was anything like mmap. I think this is mostly a
question of library functions--you could certainly write JNI
extensions for that stuff.

Has this changed in recent Java releases? Is there a way to use
efficient copy-on-write multiprocess architectures?

I do think they've been adding more stuff for parallelism in general.

sjdevnull · Feb 14, 2007

I don't know what ZwCreateProcess is (sounds like a Windows-ism)

Yeah, it's the Windows equivalent to fork. Does true copy-on-write, so
you can do efficient multiprocess work.
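
A small sketch of the fork() side of this, in Python and for POSIX systems only (os.fork is not available on Windows, and nothing here uses ZwCreateProcess). The child starts with a copy-on-write view of the parent's memory, so its changes are not visible to the parent; the function name is made up for this illustration.

```python
import os


def fork_and_mutate(data):
    """Fork; the child changes its private copy and reports its length back.

    Returns (length seen by the child, length seen by the parent).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: works on its own copy of the parent's memory.
        try:
            os.close(read_fd)
            data.append("child-only")
            os.write(write_fd, str(len(data)).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent process: its list is untouched by the child's append.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_len = int(reader.read())
    os.waitpid(pid, 0)
    return child_len, len(data)


if __name__ == "__main__":
    print(fork_and_mutate([1, 2, 3]))
```

No data is copied up front; the kernel duplicates a memory page only when one side writes to it.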

but I
remember using popen() under Java 1.1 in Solaris. That at least
allows launching a new process and communicating with it.

Yep. That's okay for limited kinds of applications.

I don't know if there was anything like mmap.

That would be important as well.

I think this is mostly a
question of library functions--you could certainly write JNI
extensions for that stuff.

Sure. If you're writing extensions you can work around the GIL, too.

I do think they've been adding more stuff for parallelism in general.

Up through 1.3/1.4 or so they were pretty staunchly in the "threads
for everything!" camp, but they've added a select/poll-style call a
couple versions back. That was a pretty big sticking point previously.
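
For comparison, this is roughly what the select/poll style looks like in Python, using the standard selectors module. It is not the Java API referred to above, just a sketch of the same idea: one thread waits on several sockets at once. The function name and the use of socketpair() as a data source are assumptions for this illustration, and each message is assumed to be under 1024 bytes.

```python
import selectors
import socket


def collect(messages):
    """Receive one message per socket using a single thread and a selector."""
    selector = selectors.DefaultSelector()
    for index, message in enumerate(messages):
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        selector.register(reader, selectors.EVENT_READ, data=index)
        writer.sendall(message)
        writer.close()  # data stays buffered; the reader then sees EOF

    received = {}
    while len(received) < len(messages):
        # Block until at least one registered socket is readable.
        for key, _events in selector.select(timeout=1.0):
            received[key.data] = key.fileobj.recv(1024)
            selector.unregister(key.fileobj)
            key.fileobj.close()
    selector.close()
    return [received[i] for i in range(len(messages))]


if __name__ == "__main__":
    print(collect([b"first", b"second", b"third"]))
```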

Paul Rubin · Feb 14, 2007

sjdevnull said:
question of library functions--you could certainly write JNI
extensions for that stuff [access to mmap, etc.]

Sure. If you're writing extensions you can work around the GIL, too.

I don't think that's comparable--if you have extensions turning off
the GIL, they can't mess with Python data objects, which generally
assume the GIL's presence. Python's mmap module can't do that either.

Up through 1.3/1.4 or so they were pretty staunchly in the "threads
for everything!" camp, but they've added a select/poll-style call a
couple versions back. That was a pretty big sticking point previously.

They've gone much further now and they actually have some STM features:

http://www-128.ibm.com/developerworks/java/library/j-jtp11234/

MRAB · Feb 14, 2007

Well, INAG (I'm not a Guru), but we recently had training from a Guru.
When we brought up this question, his response was fairly simple.
Paraphrased for inaccuracy:

"Some time back, a group did remove the GIL from the python core, and
implemented locks on the core code to make it threadsafe. Well, the
problem was that while it worked, the necessary locks made
single-threaded code take significantly longer to execute."

He then proceeded to show us how to achieve the same effect
(multithreading python for use on multi-core computers) using popen2
and stdio pipes.

Hmm. I wonder whether it would be possible to have a pair of python
cores, one for single-threaded code (no locks necessary) and the other
for multi-threaded code. When the Python program went from single-
threaded to multi-threaded or multi-threaded to single-threaded there
would be a switch from one core to the other.

Maric Michaud · Feb 15, 2007

On Wednesday 14 February 2007 16:24, (e-mail address removed) wrote:

"Some time back, a group did remove the GIL from the python core, and
implemented locks on the core code to make it threadsafe. Well, the
problem was that while it worked, the necessary locks made
single-threaded code take significantly longer to execute."

Very interesting point, this is exactly the sort of thing I'm looking for. Is
there a good link on this?


Paul Rubin · Feb 15, 2007

Maric Michaud said:
Very interesting point, this is exactly the sort of thing I'm
looking for. Is there a good link on this?

I think it was a long time ago, Python 1.5.2 or something. However it
really wasn't that useful, since as Garrick said, it slowed Python
down. The reason was CPython's structures weren't designed for thread
safety so it needed a huge amount of locking/releasing. For example,
adjusting any reference count required setting and releasing a lock,
and CPython does this all the time. Getting rid of the GIL in a
serious way requires radically changing the interpreter, not just
sticking some locks here and there.
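
To see how often reference counts change, here is a tiny sketch using sys.getrefcount. It assumes a standard CPython build; other implementations, or CPython builds with a different memory model, may report different numbers. The function name is made up for this illustration.

```python
import sys


def refcount_delta(n):
    """Return how an object's refcount moves when n references come and go."""
    obj = object()
    before = sys.getrefcount(obj)

    holders = [obj] * n  # n new references to the same object
    after = sys.getrefcount(obj)

    del holders  # dropping the list releases all n references
    restored = sys.getrefcount(obj)

    return after - before, restored - before


if __name__ == "__main__":
    print(refcount_delta(10))
```

Every one of those increments and decrements would have needed a lock, or an atomic instruction, in the patched interpreter described above.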

John Nagle · Feb 15, 2007

If locking is expensive on x86, it's implemented wrong.
It's done right in QNX, with inline code for the non-blocking
case. Not sure about the current libraries for Linux, but
by now, somebody should have gotten this right.

John Nagle

Paul Rubin · Feb 15, 2007

John Nagle said:
If locking is expensive on x86, it's implemented wrong.
It's done right in QNX, with inline code for the non-blocking case.

Acquiring the lock still takes an expensive instruction, LOCK XCHG or
whatever. I think QNX is usually run on embedded CPUs with less
extensive caching than these multicore x86's, so the lock prefix may be
less expensive in the QNX systems.

John Nagle · Feb 15, 2007

Paul said:
Acquiring the lock still takes an expensive instruction, LOCK XCHG or
whatever. I think QNX is usually run on embedded CPUs with less
extensive caching than these multicore x86's, so the lock prefix may be
less expensive in the QNX systems.

That's not so bad. See

http://lists.freebsd.org/pipermail/freebsd-current/2004-August/033462.html

But there are dumb thread implementations that make
a system call for every lock.

John Nagle

Paul Rubin · Feb 15, 2007

John Nagle said:
But there are dumb thread implementations that make
a system call for every lock.

Yes, a sys call on each lock access would really be horrendous. But I
think that in a modern cpu, LOCK XCHG costs as much as hundreds of
regular instructions. Doing that on every adjustment of a Python
reference count is enough to impact the interpreter significantly.
It's not just mutating user data; every time you use an integer, or
call a function and make an arg tuple and bind the function's locals
dictionary, you're touching refcounts.

The preferred locking scheme in Linux these days is called futex,
which avoids system calls in the uncontended case--see the docs.
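
As a loose analogy only (this measures Python-level threading.Lock overhead, not the cost of LOCK XCHG or of a futex), the following sketch compares a counting loop with and without an uncontended lock around every increment. The function names are made up for this illustration, and no particular timing ratio is claimed.

```python
import threading
import time


def count_plain(n):
    """Increment a counter n times with no locking."""
    total = 0
    for _ in range(n):
        total += 1
    return total


def count_locked(n):
    """Same loop, but take and release an uncontended lock on every step."""
    lock = threading.Lock()
    total = 0
    for _ in range(n):
        with lock:
            total += 1
    return total


def timed(fn, n):
    """Return (result, seconds taken) for fn(n)."""
    start = time.perf_counter()
    result = fn(n)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for fn in (count_plain, count_locked):
        result, seconds = timed(fn, 1_000_000)
        print(f"{fn.__name__}: {result} in {seconds:.3f}s")
```

Both loops run in a single thread, so the lock is never contended; whatever difference shows up is pure bookkeeping, which is the kind of cost single-threaded code would pay for per-refcount locking.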

Rhamphoryncus · Feb 15, 2007

Hmm. I wonder whether it would be possible to have a pair of python
cores, one for single-threaded code (no locks necessary) and the other
for multi-threaded code. When the Python program went from single-
threaded to multi-threaded or multi-threaded to single-threaded there
would be a switch from one core to the other.

I have explored this option (and some simpler variants). Essentially,
you end up rewriting a massive amount of CPython's codebase to change
the refcount API. Even all the C extension types assume the refcount
can be statically initialized (which may not be true if you're trying
to make it efficient on multiple CPUs.)

Once you realize the barrier for entry is so high you start
considering alternative implementations. Personally, I'm watching
PyPy to see if they get reasonable performance using JIT. Then I can
start hacking on it.

Paul Boddie · Feb 15, 2007

Yeah, it's the Windows equivalent to fork. Does true copy-on-write, so
you can do efficient multiprocess work.

Aside from some code floating around the net which possibly originates
from some book on Windows systems programming, is there any reference
material on ZwCreateProcess, is anyone actually using it as "fork on
Windows", and would it be in any way suitable for an implementation of
os.fork in the Python standard library? I only ask because there's a
lot of folklore about this particular function (everyone seems to
repeat more or less what you've just said), but aside from various
Cygwin mailing list threads where they reject its usage, there's
precious little information of substance.

Not that I care about Windows, but it would be useful to be able to
offer fork-based multiprocessing solutions to people using that
platform. Although the python-dev people currently seem more intent on
considering (and now hopefully rejecting) yet more syntax sugar [1],
it'd be nice to consider matters seemingly below the python-dev
threshold of consideration and offer some kind of roadmap for
convenient parallel processing.

Paul

[1] http://mail.python.org/pipermail/python-dev/2007-February/070939.html
