Parallelization on multi-CPU hardware?


Piet van Oostrum

BO> I'm not really up-to-date on modern multi-processor support.
BO> Back in grad school I read some papers on cache coherence, and I
BO> don't know how well the problems have been solved. The issue
BO> was that a single processor can support a one-instruction lock
BO> (in the usual no-contention case) simply by supplying an
BO> uninterruptable read-and-update instruction, but on a multi-
BO> processor, all the processors have to respect the lock.

For multiprocessor systems, uninterruptible isn't enough (depending on your
definition of uninterruptible). Memory operations from other processors
also shouldn't interleave with the instruction. Modern processors usually
have a 'test and set' or 'compare and swap' instruction for this purpose,
which lock memory for the duration of the operation.
 

Luis P Caamano

One critical reason for the GIL is to support CPython's ability to call
random C libraries with little effort. Too many C libraries are not
thread-safe, let alone thread-hot.

Very true and I think that's one of the main reasons we still have the
GIL.
Forcing libraries that wish to
participate in threading to use Python's GIL-release mechanism is the
only safe approach.

The ONLY one? That's too strong a statement unless you qualify it.
One problem is that CPython tries to make C library development as
easy as possible to the point of being condescending. IOW, it seems
that CPython assumes that most C library writers are too stupid to
write thread-safe code. However, one could say that if it had always
forced all C libraries to be written thread-safe from scratch, most
library developers would've practiced a lot by now.

Given enough time, familiarity is a good tool to tame apparently
complex issues.

Unfortunately, it's not that simple or maybe not even true. I suspect
that the real reason C libraries are not required to be thread-safe is
that there are already too many C libraries out there that are NOT
thread safe and too many people think (arguably mistakenly) that fixing
all those libraries to get rid of the GIL is not worth the trouble.

I speculate that the GIL "problem" might have grown to its current
state because CPython grew in parallel with thread programming.
CPython existed way before POSIX 1003.1c was ratified or even before we
had good pthread libraries. In addition, CPython always wanted to be
OS agnostic. In 1995, it would've been possible to write a
CPosixPython with full support for the Posix thread library with
support for all those cool things like CPU affinity, options for
kernel or user threads, thread cancellation, thread local storage,
etc. but it would not have worked on any OS that didn't have a good,
conformant pthreads library, which at that time were many.

Therefore, fixing the GIL was always in competition with existing code
and libraries, and it always lost because it was (and still is)
considered "not worth the effort."

If we were going to write a CPython interpreter today, it would be a
lot easier to write the interpreter loop and its data in a manner that
would maximize the use of threads because today thread support is
widely spread among supported OSes.

As a kernel developer and experienced pthreads programmer, the first
time I saw CPython's main interpreter loop in ceval.c my jaw hit the
floor because I couldn't believe why anybody would write a threaded
interpreter that grabs ONE BIG mutex, runs 100 op codes (checkinterval
default), and then releases the mutex. It took a while to finally
understand that the reason is not technical but mostly historical
(IMHO).

I've said it before. One day enough people will think that the GIL is
a problem big enough to warrant a solution, e.g., when the majority of
systems where CPython runs have more than one CPU. Until then we have
to go back to early 90s programming and use IPC (interprocess
communication) to scale applications that want to run PURE python code
on more than one CPU. That's probably the main disagreement I have
with those that think that the GIL is not a big problem, IPC is not a
solution but a workaround.

This is not a Python unique problem either. Most major UNIX OSes,
e.g., HPUX, Sun, AIX, etc. went through the same thing when adding
support for SMP in early through mid 90s. And that was a bigger
problem than CPython's GIL because they had a lot of drivers that were
not SMP safe.

I know the problem is complex and there are other non-technical issues
to consider. However, and I don't mean to oversimplify, here are a
few ideas that might help with the GIL problem:

- Detect multiple CPUs, if single CPU, do not lock globals. That's
what some OSes do to avoid unnecessary SMP penalties on single CPU
systems.

- Create python objects in thread local storage by default, which
don't need locking.

- Rewrite the interpreter loop so that it doesn't grab a BIG lock,
unless configured to do so.

- Let users lock, not the interpreter. If you use threads and you
access global objects without locking, you might get undefined
behaviors, just like in C/C++.

- Allow legacy behavior with a command line option to enable the GIL
and create global objects (or vice versa).

- Require C libraries to tell the interpreter if they are thread
safe or not. If they are not, the interpreter would not load them
unless running in legacy-mode.

- Augment the posix module to include full support for the pthreads
library, e.g., thread cancellation, CPU affinity.


Of course, that's easier said than done and I'm not saying that it can
or should be done now. The point is that getting rid of the GIL is
straightforward; it's been done before many times, but it will not
happen until it's widely viewed as a problem.

Unfortunately, those changes are big enough that I don't think they'll
happen under the CPython source tree even if we all wanted them. More
than likely it will require a separate project, PosixPython perhaps?

I hope one day I'll work for an employer that could afford to donate
some (or all) my time for python development so I could start working
on that. Now, that would be cool.

With a <sigh> and a <wink>,
 

Neil Hodgson

Luis P Caamano:
Therefore, fixing the GIL was always in competition with existing code
and libraries, and it always lost because it was (and still is)
considered "not worth the effort."

Greg Stein implemented a "free-threaded" version of Python in 1996 and
had another look in 2000.

http://python.tpnet.pl/contrib-09-Dec-1999/System/threading.README
http://mail.python.org/pipermail/python-dev/2000-April/003605.html

So, it is not the initial implementation of a GIL-free Python that is
the stumbling block but the maintenance of the code. The poor performance of
free-threaded Python contributed to the lack of interest.
That's probably the main disagreement I have
with those that think that the GIL is not a big problem, IPC is not a
solution but a workaround.

For me, the GIL works fine because it is released around I/O operations
which are the bottlenecks in the types of application I write.
- Let users lock, not the interpreter. If you use threads and you
access global objects without locking, you might get undefined
behaviors, just like in C/C++.

That will cause problems for much existing code.
Unfortunately, those changes are big enough that I don't think they'll
happen under the CPython source tree even if we all wanted them.

It won't happen until the people that think this is a problem are
prepared to provide the effort to maintain free-threaded Python.

Neil
 

Bryan Olson

Piet said:
> BO> I'm not really up-to-date on modern multi-processor support.
> BO> Back in grad school I read some papers on cache coherence, and I
> BO> don't know how well the problems have been solved. The issue
> BO> was that a single processor can support a one-instruction lock
> BO> (in the usual no-contention case) simply by supplying an
> BO> uninterruptable read-and-update instruction, but on a multi-
> BO> processor, all the processors have to respect the lock.
>
> For multiprocessor systems, uninterruptible isn't enough (depending on your
> definition of uninterruptible). Memory operations from other processors
> also shouldn't interleave with the instruction. Modern processors usually
> have a 'test and set' or 'compare and swap' instruction for this purpose,
> which lock memory for the duration of the operation.

I'm with you that far. The reason I mentioned cache coherence
is that locking memory isn't enough on MP systems. Another
processor may have the value in local cache memory. For all the
processors to respect the lock, they have to communicate the
locking before any processor can update the address.

If a certain memory address is used as a lock, every thread
can access it only by instructions that go to main memory.
But then locking every object can have a devastating effect on
speed.
 

Bengt Richter

I'm with you that far. The reason I mentioned cache coherence
is that locking memory isn't enough on MP systems. Another
processor may have the value in local cache memory. For all the
processors to respect the lock, they have to communicate the
locking before any processor can update the address.

If a certain memory address is used as a lock, every thread
can access it only by instructions that go to main memory.
But then locking every object can have a devastating effect on
speed.
I don't know if smarter instructions are available now, but if a lock
is in a locked state, there is no point in writing locked status value
through to memory, which is what a naive atomic swap would do. The
returned value would just leave you to try again. If n-1 out of n
processors were doing that to a queue lock waiting for something to
be dequeuable, that would be a huge waste of bus bandwidth, maybe
interfering with DMA i/o. I believe the strategy is (or was, if better
instructions are available) to read lock values passively in a short
spin, so that each processor so doing is just looking at its cache value.
Then when the lock is released, that is done by a write through, and
everyone's cache is invalidated and reloaded, and potentially many see the
lock as free. At that point, everyone tries their atomic swap instructions,
writing locked status through to memory, and only one succeeds in getting
the unlocked value back in the atomic swap. As soon as everyone sees locked
again, they go back to passive spinning. The spins don't go on forever, since
multiple threads in one CPU may be waiting for different things. That's a balance
between context switch (within CPU) overhead vs a short spin lock wait. Probably
lock-dependent. Maybe automatically tunable.

The big MP cache hammer comes if multiple CPU's naively dequeue thread work
first-come-first-serve, and they alternate reloading each other's previously
cached thread active memory content. I suspect all of the above is history
long ago by now. It's been a while ;-)

Regards,
Bengt Richter
 

Nicolas Lehuen

Sure. I wonder what (e.g.) IronPython or Ruby do about it -- never
studied the internals of either, yet.


Alex

I've just asked the question on the IronPython mailing list - I'll
post the answer here. I've also asked about Jython since Jim Hugunin
worked on it.

Just as a note, coming from a Java background (I switched to Python
two years ago and nearly never looked back - my current job allows
this, I'm a happy guy), I must say that there is a certain deficit of
thread-awareness in the Python community at large (no offense meant).

The stdlib is actually quite poor on the thread-safe side, and whereas
I see people rush to implement Doug Lea's concurrent package
[http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html]
in new languages like D, it seems that it is not a priority for the
Python community (maybe a point in the 'develop the stdlib vs develop
the interpreter and language' debate). A symptom of this is that
must-have classes like threading.RLock() are still implemented in
Python! These should definitely be written in C for the sake of
performance.

I ran into many multi-threading problems in mod_python as well, and the
problem was the same; people expected mod_python to run in a
multi-process context, not a multi-threaded context (I guess this is
due to a Linux-centered mindset, forgetting about BSD, Mac OS X or Win32
OSes). When I asked questions and pointed out problems, the answer was 'Duh
- use Linux with the forking MPM, not Windows with the threading MPM'.
Well, this is not going to take us anywhere, especially with all the
multicore CPUs coming.

Java, C#, D and many other languages prove that threading is not
necessarily a complicated issue. It can be tricky, granted, but where
would the fun be if it were not? In a former life, I have
implemented an application server and had to write a big bunch of
code, always with thread awareness in mind. I guarantee you that
provided your language gives you the proper support (I was using
Java), it's a real treat :).

With all respect due to the Founding Fathers (and most of all Guido),
there are quite a few languages out there that do not require a GIL...
I understand that historical implementation issues meant that a GIL
was required (we talk about a language whose implementation began in
the early nineties), but maybe it is time to reconsider this in Python
3000 ? And while I'm on this iconoclast binge, why not forget about
reference counting and just switch to plain garbage collecting
everywhere, thus clearing a lot of the mess we have to deal with when
writing C extensions ? Plus, a full garbage collection model can
actually help with threading issues, and vice versa.

I know a lot of people who fled Python for two reasons. The first
reason is a bad one, it's the indentation-as-syntax reason. My answer
is 'just try it'. The second reason is concerns about performance and
scalability, especially from those who heard about the GIL. Well, all
I can answer is 'That will improve over time, look at Psyco, for
instance'. But as far as the GIL is concerned, well... I'm as worried
as them. Please, please, could we change this ?

Regards,

Nicolas Lehuen
 

Nicolas Lehuen

Neil Hodgson said:

Greg Stein implemented a "free-threaded" version of Python in 1996 and
had another look in 2000.

http://python.tpnet.pl/contrib-09-Dec-1999/System/threading.README
http://mail.python.org/pipermail/python-dev/2000-April/003605.html

So, it is not the initial implementation of a GIL-free Python that is
the stumbling block but the maintenance of the code. The poor performance of
free-threaded Python contributed to the lack of interest.

Yeah, and the poor performance of the ActiveState Python for .NET implementation led them to conclude that the CLR could not implement Python efficiently. Turns out that the problem was in the implementation, not the CLR, and that IronPython gives pretty good performance.

An implementation with poor performance doesn't mean the whole free-threaded concept is bad.

Plus, the benchmark is highly questionable. Compare a single-threaded CPU-intensive program with a multi-threaded CPU-intensive program and the single-threaded program wins. Now compare IO-intensive (especially network IO) programs and the multi-threaded program wins. Granted, you can build high-performing asynchronous frameworks (Twisted or Medusa come to mind), but sooner or later you'll have to use threads (see the DB adapter for Twisted).

Regards,

Nicolas Lehuen
 

Alex Martelli

Nicolas Lehuen said:
problem was the same ; people expected mod_python to run in a
multi-process context, not a multi-threaded context (I guess this is
due to a Linux-centered mindset, forgetting about BSD, MacosX or Win32
OSes). When I asked questions and pointed problem, the answer was 'Duh
- use Linux with the forking MPM, not Windows with the threading MPM'.

Sorry, I don't get your point. Sure, Windows makes process creation
hideously expensive and has no forking. But all kinds of BSD, including
MacOSX, are just great at forking. Why is a preference for multiple
processes over threads "forgetting about BSD, MacOSX", or any other
flavour of Unix for that matter?
Well, this is not going to take us anywhere, especially with all the
multicore CPUs coming.

Again, I don't get it. Why would multicore CPUs be any worse than
current multi-CPU machines at multiple processes, and forking?


Alex
 

Richie Hindle

[Nicolas, arguing for a free-threading Python]
compare IO intensive (especially network IO) programs, the multi-threaded
program wins.

The GIL is released during network IO. Multithreaded network-bound
Python programs can take full advantage of multiple CPUs already.

(You may already know this and I'm missing a point somewhere, but it's
worth repeating for the benefit of those that don't know.)
 

Nicolas Lehuen

Alex Martelli said:
Sorry, I don't get your point. Sure, Windows makes process creation
hideously expensive and has no forking. But all kinds of BSD, including
MacOSX, are just great at forking. Why is a preference for multiple
processes over threads "forgetting about BSD, MacOSX", or any other
flavour of Unix for that matter?

Because when you have multithreaded programs, you can easily share objects between different threads, provided you carefully implement them. In the web framework I wrote, this means sharing and reusing the same DB connection pool, template cache, other caches and so on. This means a reduced memory footprint and increased performance. In a multi-process environment, you have to instantiate as many connections, caches, templates etc. as you have processes. This is a waste of time and memory.

BTW [shameless plug] here is the cookbook recipe I wrote about thread-safe caching.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302997
Again, I don't get it. Why would multicore CPUs be any worse than
current multi-CPU machines at multiple processes, and forking?

Obviously they won't be any worse. Well, to be precise, it still depends on the OS, because the scheduler must know the difference between 2 processors and a 2-core processor to efficiently balance the work, but anyway.

What I meant is that right now I'm writing this on a desktop PC with hyperthreading. This means that even on a desktop PC you can benefit from having multithreaded (or multi-processed) applications. With multicore or multiprocessor machines becoming more and more common, the pressure to have proper threading support in Python will grow and grow.

Regards,

Nicolas
 

Nicolas Lehuen

Richie Hindle said:
[Nicolas, arguing for a free-threading Python]
compare IO intensive (especially network IO) programs, the multi-threaded
program wins.

The GIL is released during network IO. Multithreaded network-bound
Python programs can take full advantage of multiple CPUs already.

(You may already know this and I'm missing a point somewhere, but it's
worth repeating for the benefit of those that don't know.)

Yeah I know this. My point was that when benchmarking single-threaded programs vs multithreaded ones, you have to choose the kind of program you use: CPU-intensive vs IO-intensive. On a single CPU machine, CPU-intensive programs will always run better in a single-threaded model. But of course, if you have 2 procs and a nice threading model, you can even do 2 CPU-intensive tasks simultaneously, which Python cannot do if I've understood everything so far.

Regards,

Nicolas
 

Alex Martelli

Nicolas Lehuen said:
"Alex Martelli" <[email protected]> wrote in message news:1gljja1.1nxj82c1a25c1bN%[email protected]...

Because when you have multithreaded programs, you can easily share objects
between different threads, provided you carefully implement them. On the
web framework I wrote, this means sharing and reusing the same DB
connection pool, template cache, other caches and so on. This means a
reduced memory footprint and increased performance. In a multi-process
environment, you have to instantiate as many connections, caches,
templates etc. that you have processes. This is a waste of time and
memory.

I'm not particularly interested in debating the pros and cons of threads
vs processes, right now, but in getting a clarification of your original
assertion which I _still_ don't get. It's still quoted up there, so
please DO clarify: how would "a Linux-centered mindset forgetting about
BSD" (and MacOSX is a BSD in all that matters in this context, I'd say)
bias one against multi-threading or towards multi-processing? In what
ways are you claiming that Unix-like systems with BSD legacy are
inferior in multi-processing, or superior in multi-threading, to ones
with Linux kernels? I think I understand both families decently well
and yet I _still_ don't understand whence this claim is coming. (Forget
the red herring of Win32 -- nobody's disputing that _their_ process
spawning is a horror of inefficiency -- let's focus about Unixen, since
you did see fit to mention them so explicitly contrasted, hm?!).

_Then_, once that point is cleared, we may (e.g.) debate how the newest
Linux VM development (presumably coming in 2.6.9) may make mmap quite as
fast as sharing memory among threads (which, I gather from hearsay only,
is essentially the case today already... but _only_ for machines with no
more than 2GB, 3GB tops of physical memory being so shared -- the claim
is that the newest developments will let you have upwards of 256 GB of
physical memory shares with similar efficiency, by doing away with the
current pagetables overhead of mmap), and what (if anything) is there in
BSD-ish kernels to match those accomplishments. But until you clarify
that (to me) strange and confusing assertion, to help me understand what
point you were making there (and nothing in this "answer" of yours is at
all addressing my doubts and my very specific question about it!), I see
little point in trying to delve into such exoterica.

BTW [shameless plug] here is the cookbook recipe I wrote about thread-safe caching.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302997

I do not see any relevance of that recipe (with which I'm quite
familiar, since I'm preparing the 2nd Edition of the Cookbook) to your
assertion with Linux on one side, and BSD derivatives grouped with
Windows on the other, quoted, questioned and never clarified above.

Obviously they won't be any worse. Well, to be precise, it still depends
on the OS, because the scheduler must know the difference between 2
processors and a 2-core processor to efficiently balance the work, but
anyway.

As far as I know there is no special support in current kernels for
multi-core CPUs as differentiated from multiple CPUs sharing external
buses... (nor even yet, unless I'm mistaken, for such fundamental
novelties as HyperTransport, aka DirectConnect, which _has_ been around
for quite a while now -- and points to a completely different paradigm
than shared memory as being potentially much-faster IPC... bandwidths of
over 10 gigabytes/second, which poor overworked memory subsystems might
well have some trouble reaching... how one would exploit hypertransport
within multiple threads of a process, programmed on the basis of sharing
memory, I dunno -- using it between separate processes which exchange
messages appears conceptually simpler). Anyway, again to the limited
amount of my current knowledge, this holds just as much for multiple
threads as for multiple processes, no?

What I meant is that right now I'm writing this on a desktop PC with
hyperthreading. This means that even on a desktop PC you can benefit
from having multithreaded (or multi-processed) applications.

I'm writing this on my laptop (uniprocessor, no quirks), because I'm on
a trip, but at home I do have a dual-processor desktop (actually a
minitower, but many powerful 'desktops' are that way), and it's a
year-old model (and Apple was making dual processors for years before
the one I own, though with 32-bit 'G4' chips rather than 64-bit 'G5'
ones they're using now). So this is hardly news: I can run make on a
substantial software system much faster with a -j switch to let it spawn
multiple jobs (processes).
With multicore or multiprocessor machines being more and more current, the
pressure to have proper threading support in Python will grow and grow.

The pressure has been growing for a while and I concur it will keep
growing, particularly since the OS by far most widespread on desktops
has such horrible features for multiple-process spawning and control.
But, again, before we go on to debate this, I would really appreciate it
if you clarified your previous assertions, to help me understand why you
believe that BSD derivatives, including Mac OS X, are to be grouped on
the same side as Windows, while only Linux would favour processes over
threads -- when, to _me_, it seems so obvious that the reasonable
grouping is with all Unix-like systems on one side, Win on the other.
You either know something I don't, about the internals of these systems,
and it appears so obvious to you that you're not even explaining it now
that I have so specifically requested you to explain; or there is
something else going on that I really do not understand.


Alex
 

Nicolas Lehuen

Alex Martelli said:
between different threads, provided you carefully implement them. On the
web framework I wrote, this means sharing and reusing the same DB
connection pool, template cache, other caches and so on. This means a
reduced memory footprint and increased performance. In a multi-process
environment, you have to instantiate as many connections, caches,
templates etc. that you have processes. This is a waste of time and
memory.

I'm not particularly interested in debating the pros and cons of threads
vs processes, right now, but in getting a clarification of your original
assertion which I _still_ don't get. It's still quoted up there, so
please DO clarify: how would "a Linux-centered mindset forgetting about
BSD" (and MacOSX is a BSD in all that matters in this context, I'd say)
bias one against multi-threading or towards multi-processing? In what
ways are you claiming that Unix-like systems with BSD legacy are
inferior in multi-processing, or superior in multi-threading, to ones
with Linux kernels? I think I understand both families decently well
and yet I _still_ don't understand whence this claim is coming. (Forget
the red herring of Win32 -- nobody's disputing that _their_ process
spawning is a horror of inefficiency -- let's focus about Unixen, since
you did see fit to mention them so explicitly contrasted, hm?!).

Wow, I don't want to launch any OS flame war here. My point is just that I have noticed that the vast majority of people running mod_python are running it on Apache 2 on Linux with the forking MPM. Hence, multi-threading problems that you DO encounter if you use the threading MPM are often dismissed because, hey, everybody uses the forking MPM, in which a single thread handles all the requests it is given by the parent process. Now, when I discussed this problem on the mod_python mailing list, I had some echo from people who would like to use the multithreading MPM under Mac OS X (which is a BSD indeed). Just to say that people running Apache 2 on Win32 are not the only ones interested.

To sum up: people running mod_python under Linux don't have any multithreading issues. They represent 95% of mod_python's market (pure guess). So the multithreading issues have little chance of being fixed soon. That is why I said that a "Linux-centered mindset forgetting [other OSes, none in particular]" is hindering the bugfix process.
_Then_, once that point is cleared, we may (e.g.) debate how the newest
Linux VM development (presumably coming in 2.6.9) may make mmap quite as
fast as sharing memory among threads (which, I gather from hearsay only,
is essentially the case today already... but _only_ for machines with no
more than 2GB, 3GB tops of physical memory being so shared -- the claim
is that the newest developments will let you have upwards of 256 GB of
physical memory shares with similar efficiency, by doing away with the
current pagetables overhead of mmap), and what (if anything) is there in
BSD-ish kernels to match those accomplishments. But until you clarify
that (to me) strange and confusing assertion, to help me understand what
point you were making there (and nothing in this "answer" of yours is at
all addressing my doubts and my very specific question about it!), I see
little point in trying to delve into such exoterica.

I do hope the point above is cleared.

Now, you propose to share objects between Python VMs using shared memory. Why not, if it is correctly implemented (I've seen it done in Gemstone for Java, IIRC); I'd be as happy with this as I am when sharing objects between threads. The trouble is that you'll have exactly the same problems, if not more. You'll have to implement the same locking primitives. You'll have to make sure that all the stdlib and extensions are ready to support objects that are shared this way. All the trouble we have now with multiple threads, you'll have with multiple processes. And you may have big portability issues. I don't see any benefit vs the work already done, even if tiny, on the multithreading support.
BTW [shameless plug] here is the cookbook recipe I wrote about thread-safe caching.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302997

I do not see any relevance of that recipe (with which I'm quite
familiar, since I'm preparing the 2nd Edition of the Cookbook) to your
assertion with Linux on one side, and BSD derivatives grouped with
Windows on the other, quoted, questioned and never clarified above.

Well, that was a shameless plug... But this recipe allows you to build, for example, module caches that can safely be shared between threads, with minimum locking. I encountered a bug in mod_python on the subject, which I fixed using it, so I thought it might be illustrative.
on the OS, because the scheduler must know the difference between 2
processors and a 2-core processor to efficiently balance the work, but
anyway.

As far as I know there is no special support in current kernels for
multi-core CPUs as differentiated from multiple CPUs sharing external
buses... (nor even yet, unless I'm mistaken, for such fundamental
novelties as HyperTransport, aka DirectConnect, which _has_ been around
for quite a while now -- and points to a completely different paradigm
than shared memory as being potentially much-faster IPC... bandwidths of
over 10 gigabytes/second, which poor overworked memory subsystems might
well have some trouble reaching... how one would exploit hypertransport
within multiple threads of a process, programmed on the basis of sharing
memory, I dunno -- using it between separate processes which exchange
messages appears conceptually simpler). Anyway, again to the limited
amount of my current knowledge, this holds just as much for multiple
threads as for multiple processes, no?

I am way beyond my competence here, but I've read some articles about hyperthreading-aware schedulers (in WinXP, Win2003, and patches for Linux). The idea is that on multi-core CPUs, threads from the same process should be run on the same core for maximum cache efficiency, whereas different processes can freely run on different cores. I've read that you can effectively get worse performance on multicore CPUs if the kernel scheduler does not know about HT. I cannot find the articles again but there are a bunch referenced on Google:

http://www.google.com/search?q=hyperthread+scheduler

But apart from this caveat, yes, multi-threaded and multi-process applications benefit equally from multi-core CPUs.
hyperthreading. This means that even on a desktop PC you can benefit
from having multithreaded (or multi-processed) applications.

I'm writing this on my laptop (uniprocessor, no quirks), because I'm on
a trip, but at home I do have a dual-processor desktop (actually a
minitower, but many powerful 'desktops' are that way), and it's a
year-old model (and Apple was making dual processors for years before
the one I own, though with 32-bit 'G4' chips rather than 64-bit 'G5'
ones they're using now). So this is hardly exotic hardware: I can run
make on a substantial software system much faster with a -j switch to
let it spawn multiple jobs (processes).

pressure to have proper threading support in Python will grow and grow.

The pressure has been growing for a while and I concur it will keep
growing, particularly since the OS by far most widespread on desktops
has such horrible features for multiple-process spawning and control.
But, again, before we go on to debate this, I would really appreciate it
if you clarified your previous assertions, to help me understand why you
believe that BSD derivatives, including Mac OS X, are to be grouped on
the same side as Windows, while only Linux would favour processes over
threads -- when, to _me_, it seems so obvious that the reasonable
grouping is with all Unix-like systems on one side, Win on the other.
You either know something I don't, about the internals of these systems,
and it appears so obvious to you that you're not even explaining it now
that I have so specifically requested you to explain; or there is
something else going on that I really do not understand.

I hope that has been cleared up: it's just that some people using MacOS X seemed as interested as I am in having decent multi-threading support in mod_python. If more people used a multi-threading MPM on Linux, their reaction would be the same. I'm really not religious about OSes; no offence meant.

Best regards,

Nicolas
 
A

Alex Martelli

Nicolas Lehuen said:
...
mod_python's market (pure guess). So the multithreading issues have not
many chances of being fixed soon. That is why I said that a
"Linux-centered mindset forgetting [other OSes, none in particular]" is
hindering the bugfix process. ...
I do hope the point above is cleared.

It's basically "retracted", as I see things, except that whatever's left
still doesn't make any sense to me. Why should a Linux user need to
"forget" another system, that's just as perfect for multiprocessing as
Linux, before deciding he's got better things to do with his or her time
than work on a problem which multiprocessing finesses?
Now, you propose to share objects between Python VMs using shared memory.

Not necessarily -- as you say, if you need synchronization the problem
is just about as hard (maybe even worse). I was just trying to
understand the focus on multithreading vs multiprocessing, it now
appears it's more of a focus on the shared-memory paradigm of
multiprocessing in general.

...
I am way beyond my competences here, but I've read some articles about
hyperthreading-aware schedulers (in WinXP, Win2003, and patches for
Linux). The idea is that on multi-core CPUs, threads from the same
process should be run on the same core for maximum cache efficiency,
whereas different processes can freely run on different cores. I've read

And how is this a difference between two processors and two cores within
the same processor, which is what I quoted you above as saying? If two
CPUs do not share caches, the CPU-affinity issues (of processing units
that share address spaces vs ones that don't) would appear to be the
same. If two CPUs share some level of cache (as some multi-CPU designs
do), that's different from the case where the CPUs share no cache but do
share RAM.
But apart from this caveat, yes, multi-threads and multi-processes
application equally benefit from multi-core CPUs.

So it would seem to me, yes -- except that if CPUs share caches this may
help (perhaps) the use of shared memory (if the cache design is optimized
for that and the scheduler actively helps), even though even in that
case it doesn't seem to me you can reach the same bandwidth that
hypertransport promises for a more streaming/message-passing approach.


Alex
 
N

Nicolas Lehuen

Alex Martelli said:
Nicolas Lehuen said:
Sorry, I don't get your point. Sure, Windows makes process creation
hideously expensive and has no forking. But all kinds of BSD, including
MacOSX, are just great at forking. Why is a preference for multiple
processes over threads "forgetting about BSD, MacOSX", or any other
flavour of Unix for that matter?
...
mod_python's market (pure guess). So the multithreading issues have not
many chances of being fixed soon. That is why I said that a
"Linux-centered mindset forgetting [other OSes, none in particular]" is
hindering the bugfix process. ...
I do hope the point above is cleared.

It's basically "retracted", as I see things, except that whatever's left
still doesn't make any sense to me. Why should a Linux user need to
"forget" another system, that's just as perfect for multiprocessing as
Linux, before deciding he's got better things to do with his or her time
than work on a problem which multiprocessing finesses?

Because a Linux user could eventually be interested in sharing data among different workers, be they processes or threads. Things that are taken for granted in the Java world, like a good connection pool, are still not being used here, even though they are really interesting from a performance and management point of view.
Not necessarily -- as you say, if you need synchronization the problem
is just about as hard (maybe even worse). I was just trying to
understand the focus on multithreading vs multiprocessing, it now
appears it's more of a focus on the shared-memory paradigm of
multiprocessing in general.

Exactly. I don't care about multithreading or multiprocessing, I want to share data between my workers. It's just more easily done with threads, as of today. The connection pool example is a good one: how can you easily share TCP connections to a DBMS server between processes? With threads it's a matter of a few lines of Python code (I'll publish my connection pool code on the Python Cookbook soon).
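The "few lines of Python" can be sketched roughly as follows; this is not the poster's actual Cookbook recipe, just a minimal thread-safe pool built on the standard library's `queue.Queue`, assuming a generic zero-argument `factory` callable (in practice, a DB-API `connect` call):

```python
import queue

class ConnectionPool:
    """A minimal thread-safe connection pool sketch.

    `factory` is any zero-argument callable that returns a new
    connection object; the pool never inspects the connections.
    """

    def __init__(self, factory, size=5):
        # queue.Queue is internally locked, so no explicit
        # synchronization is needed here.
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        # Blocks until a connection is free, which naturally
        # throttles the number of concurrent users.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        # Hand the connection back for the next thread to reuse.
        self._pool.put(conn)
```

Each worker thread brackets its database work with `acquire()`/`release()`; since a connection lives in exactly one thread at a time, the connections themselves need not be thread-safe.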
And how is this a difference between two processors and two cores within
the same processor, which is what I quoted you above as saying? If two
CPUs do not share caches, the CPU-affinity issues (of processing units
that share address spaces vs ones that don't) would appear to be the
same. If two CPUs share some level of cache (as some multi-CPU designs
do), that's different from the case where the CPUs share no cache but do
share RAM.


So it would seem to me, yes -- except that if CPUs share caches this may
help (perhaps) the use of shared memory (if the cache design is optimized
for that and the scheduler actively helps), even though even in that
case it doesn't seem to me you can reach the same bandwidth that
hypertransport promises for a more streaming/message-passing approach.


Alex

The trick is that there are many levels of internal or external cache, and IIRC from my readings the two logical processors of a Pentium 4 with HT share some level of internal cache.

But again, I'm pretty much out of my depth here, so I'll call it quits on the subject of hyperthreading and its comparative impact on the threaded model vs the fork model :)

Best regards,

Nicolas
 
A

Aahz

BTW, Nicolas, it'd be nice if you did not post quoted-printable. I'm
leaving the crud in so you can see it.

Yeah I know this. My point was that when benchmarking single-threaded =
programs vs multithreaded ones, you have to choose the kind of program =
you use : CPU-intensive vs IO-intensive. On a single CPU machine, =
CPU-intensive program will always run better in a single-threaded model. =
But of course, if you have 2 procs and a nice threading model, you can =
even do 2 CPU-intensive tasks simultaneously, which Python cannot do if =
I've understood everything so far.

Python itself cannot, but there's nothing preventing anyone from writing
numeric extensions that release the GIL.
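Some standard-library C extensions already do this; `zlib`, for instance, releases the GIL while its C code compresses a buffer, so two CPU-bound compressions can overlap from pure Python. A small sketch (the speedup, of course, only materializes with more than one processor; on a uniprocessor the threads simply interleave):

```python
import threading
import zlib

def compress_job(data, results, index):
    # zlib's C implementation drops the GIL while it works, so
    # these threads can genuinely run in parallel on multiple CPUs.
    results[index] = zlib.compress(data)

data = b"a fairly repetitive payload " * 50_000
results = [None, None]
threads = [
    threading.Thread(target=compress_job, args=(data, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads produced valid, identical compressed output.
assert results[0] == results[1]
assert zlib.decompress(results[0]) == data
```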
 
N

Nicolas Lehuen

Aahz said:
BTW, Nicolas, it'd be nice if you did not post quoted-printable. I'm
leaving the crud in so you can see it.

I write non-ASCII emails, since I'm French (using accented characters
and all that); which encoding would you suggest I use? AFAIK,
quoted-printable is a standard encoding (though a poor one). The MUA I
used may have chosen a bad encoding, since it was Outlook Express. At
home, I use Thunderbird 0.8; tell me if the encoding it uses works
better for you.
Python itself cannot, but there's nothing preventing anyone from writing
numeric extensions that release the GIL.

There is nothing preventing anyone from writing a new language from
scratch, either. What I was suggesting was that we could try to make
Python a more thread-friendly language, so that we could write more
CPU-efficient code in the new context of multicore or multi-CPU machines
without having to resort to a low-level language like C.

If only I could tell the Python Interpreter 'OK, this bit of Python code
is thread-safe, so you don't need to hold the GIL on it', maybe I could
be more efficient, but as of today, this is a thing that can only be
done in C, not in Python. Of course, being able to release the GIL from
Python code is not a true solution. A true solution would be to get rid
of the GIL itself.

Regards,

Nicolas
 
J

Jon Perez

http://poshmodule.sf.net

POSH (Python Object sharing) allows Python objects
to be placed in shared memory and accessed transparently
between different Python processes.

This removes the need to futz around with IPC mechanisms
and lets you use Python objects (placed in shared memory)
between processes in much the same way that you use them
in threads.

"On multiprocessor architectures, multi-process applications
using POSH can significantly outperform similar multi-threaded
applications, since Python threads don't scale to take advantage
of multiple processors. Even so, POSH lends itself to a
programming model very similar to threads."
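For comparison (this is not POSH itself, whose API I won't reproduce from memory), the shared-object idea can be sketched with the standard library's later `multiprocessing` module, which likewise hides the IPC plumbing behind a typed value living in shared memory:

```python
from multiprocessing import Process, Value

def add_many(counter, n):
    for _ in range(n):
        # get_lock() serializes the read-modify-write across processes.
        with counter.get_lock():
            counter.value += 1

def run_demo(n_workers=4, n_increments=1000):
    counter = Value("i", 0)  # a C int placed in shared memory
    workers = [
        Process(target=add_many, args=(counter, n_increments))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Every increment from every process landed in the one shared int.
    return counter.value

if __name__ == "__main__":
    assert run_demo() == 4000
```

As with POSH, each worker is a full OS process with its own interpreter and its own GIL, so the increments can proceed on separate CPUs; only the shared counter needs locking.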

POSH has actually been around for quite a while, so I wonder
why it has not been mentioned more often.
 
A

Andrew Dalke

Jon said:
POSH (Python Object sharing) allows Python objects
to be placed in shared memory and accessed transparently
between different Python processes. ...
POSH has actually been around for quite a while, so I wonder
why it has not been mentioned more often.

How often should it be mentioned? I mentioned it twice
early this month. The previous mentions were in August,
then April, then January, then ...

Huh. But I see I mention it the most, so I'm biased.

The major problem I have with it is its use of Intel-specific
assembly, generated (as I recall) through gcc-specific inline
code. I run OS X.

So the only people who would use it are Python developers
running on a multi-proc Intel machine with Linux/*BSD
who have compute intensive jobs and aren't interested
in portability. And don't prefer the dozen+ other more
mature/robust solutions.

I suspect that might be a small number.

Andrew
(e-mail address removed)
 
