Overcoming the Python performance penalty on multicore CPUs

John Nagle

I know there's a performance penalty for running Python on a
multicore CPU, but how bad is it? I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
if the GIL just limited Python to running on one CPU at a time,
but it's worse than that; there's excessive overhead due to
a lame locking implementation. Running CPU-bound multithreaded
code on a dual-core CPU runs HALF AS FAST as on a single-core
CPU, according to Beazley.

My main server application, which runs "sitetruth.com",
has both multiple processes and multiple threads in each process.
The system rates web sites, which involves reading and parsing
up to 20 pages from each domain. Analysis of each domain is
performed in a separate process, but each process uses multiple
threads to read and process several web pages simultaneously.

Some of the threads go compute-bound for a second or two at a time as
they parse web pages. Sometimes two threads (and occasionally three, but
never more) in the same process are parsing web pages at the same time, so
they're contending for CPU time.

So this is nearly the worst case for the lame GIL lock logic.
Has anyone tried using "affinity" ("http://pypi.python.org/pypi/affinity")
to lock each Python process to a single CPU? Does that help?

John Nagle
 
exarkun

I know there's a performance penalty for running Python on a
multicore CPU, but how bad is it? I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
if the GIL just limited Python to running on one CPU at a time,
but it's worse than that; there's excessive overhead due to
a lame locking implementation. Running CPU-bound multithreaded
code on a dual-core CPU runs HALF AS FAST as on a single-core
CPU, according to Beazley.

It's not clear that Beasley's performance numbers apply to any platform
except OS X, which has a particularly poor implementation of the
threading primitives CPython uses to implement the GIL.

You should check to see if it actually applies to your deployment
environment.

The GIL has been re-implemented recently. Python 3.2, I think, will
include the new implementation, which should bring OS X performance up
to the level of other platforms. It may also improve certain other
aspects of thread switching.

Jean-Paul
 
alex23

    I know there's a performance penalty for running Python on a
multicore CPU, but how bad is it?  I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course.

It's a shame that Python 3.x is dead to you, otherwise you'd be able
to enjoy the new GIL implementation in 3.2: http://www.dabeaz.com/python/NewGIL.pdf

Actually, it looks like you probably still can:
+ patch for 2.5.4: http://thread.gmane.org/gmane.comp.python.devel/109929
+ patch for 2.7? http://bugs.python.org/issue7753

(Can't comment on affinity, though, sorry)
 
Terry Reedy

It's a shame that Python 3.x is dead to you, otherwise you'd be able
to enjoy the new GIL implementation in 3.2: http://www.dabeaz.com/python/NewGIL.pdf

Actually, it looks like you probably still can:
+ patch for 2.5.4: http://thread.gmane.org/gmane.comp.python.devel/109929
+ patch for 2.7? http://bugs.python.org/issue7753

The patch was rejected for 2.7 (and earlier) because it could break code,
as explained in the discussion. One would have to apply the patch and
compile one's own binary.
 
Paul Rubin

John Nagle said:
Analysis of each domain is
performed in a separate process, but each process uses multiple
threads to read and process several web pages simultaneously.

Some of the threads go compute-bound for a second or two at a time as
they parse web pages.

You're probably better off using separate processes for the different
pages. If I remember, you were using BeautifulSoup, which while very
cool, is pretty doggone slow for use on large volumes of pages. I don't
know if there's much that can be done about that without going off on a
fairly messy C or C++ coding adventure. Maybe someday someone will do
that.
 
John Nagle

Paul said:
You're probably better off using separate processes for the different
pages. If I remember, you were using BeautifulSoup, which while very
cool, is pretty doggone slow for use on large volumes of pages. I don't
know if there's much that can be done about that without going off on a
fairly messy C or C++ coding adventure. Maybe someday someone will do
that.

I already use separate processes for different domains. I could
live with Python's GIL as long as moving to a multicore server
doesn't make performance worse. That's why I asked about CPU dedication
for each process, to avoid thrashing at the GIL.

There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work. The
subprocess module assumes you're willing to launch a subprocess
for each transaction.

The current program organization is that there's a scheduler
process which gets requests, prioritizes them, and runs the requested
domains through the site evaluation mill. The scheduler maintains a
pool of worker processes which get work requests via their input pipes,
in pickle format, and return results, again in pickle format. When not in
use, the worker processes sit there dormant, so there's no Python
launch cost for each transaction. If a worker process crashes, the
scheduler replaces it with a fresh one, and every few hundred uses,
each worker process is replaced with a fresh copy, in case Python
has a memory leak. It's a lot like the way
FCGI works.
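
For what it's worth, the worker side of that pattern boils down to
something like this (a minimal sketch only; analyze_domain is a
hypothetical stand-in for the real site-evaluation code, and the
scheduler-side plumbing with subprocess and pipes is omitted):

import cPickle
import sys

def analyze_domain(request):
    # hypothetical stand-in for the real site-evaluation work
    return {"domain": request, "rating": "?"}

def worker_loop(infile=sys.stdin, outfile=sys.stdout):
    while True:
        try:
            request = cPickle.load(infile)    # block until the scheduler sends work
        except EOFError:
            break                             # scheduler closed the pipe; exit cleanly
        result = analyze_domain(request)
        cPickle.dump(result, outfile, cPickle.HIGHEST_PROTOCOL)
        outfile.flush()                       # push the reply back up the pipe

if __name__ == "__main__":
    worker_loop()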

Scheduling is managed using an in-memory
table in MySQL, so the load can be spread over a cluster if desired,
with a scheduler process on each machine.

So I already have a scalable architecture. The only problem
is excess overhead on multicore CPUs.

John Nagle
 
Steve Holden

John said:
I already use separate processes for different domains. I could
live with Python's GIL as long as moving to a multicore server
doesn't make performance worse. That's why I asked about CPU dedication
for each process, to avoid thrashing at the GIL.
I believe it's already been said that the GIL thrashing is mostly MacOS
specific. You might also find something in the affinity module

http://pypi.python.org/pypi/affinity/0.1.0

to ensure that each process in your pool runs on only one processor.
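
Untested, but presumably something as small as this at worker start-up
would do (assuming the module's set_process_affinity_mask(pid, mask)
call, with pid 0 meaning the calling process and the mask a bit field of
allowed CPUs):

import os
import affinity

NUM_CPUS = 2                          # adjust to the machine
cpu = os.getpid() % NUM_CPUS          # spread the worker pool across cores
affinity.set_process_affinity_mask(0, 1 << cpu)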

regards
Steve
 
John Nagle

Steve said:
I believe it's already been said that the GIL thrashing is mostly MacOS
specific. You might also find something in the affinity module

No, the original analysis was MacOS oriented, but the same mechanism
applies for fighting over the GIL on all platforms. There was some
pontification that it might be a MacOS-only issue, but no facts
were presented. It might be cheaper on C implementations with mutexes
that don't make system calls for the non-blocking cases.

John Nagle
 
Paul Rubin

John Nagle said:
There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work. The
subprocess module assumes you're willing to launch a subprocess
for each transaction.

Why not just use socketserver and have something like a fastcgi?
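
Roughly along these lines, say (a sketch only, Python 2 spelling of the
module; fetch_and_parse is a hypothetical placeholder for the real page
work). ForkingTCPServer forks a cheap child per connection, so the warm
interpreter is reused and there is no per-page Python start-up cost:

import SocketServer
import cPickle

def fetch_and_parse(url):
    # hypothetical placeholder for fetching and parsing one page
    return {"url": url, "status": "ok"}

class PageHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        url = self.rfile.readline().strip()   # one URL per connection
        self.wfile.write(cPickle.dumps(fetch_and_parse(url)))

if __name__ == "__main__":
    server = SocketServer.ForkingTCPServer(("127.0.0.1", 9999), PageHandler)
    server.serve_forever()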
 
Anh Hai Trinh

    There's enough intercommunication between the threads working on
a single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the
Python load time would be worse than the actual work.  The
subprocess module assumes you're willing to launch a subprocess
for each transaction.

You could perhaps use a process pool inside each domain worker to work
on the pages? There is multiprocessing.Pool and other
implementations.
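
For instance, a minimal sketch with multiprocessing.Pool (fetch_and_parse
here is a hypothetical placeholder for the real per-page work):

import urllib2
from multiprocessing import Pool

def fetch_and_parse(url):
    html = urllib2.urlopen(url, timeout=30).read()
    return url, len(html)                 # placeholder for the real parsing

if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b"]
    pool = Pool(processes=4)              # four parser processes, no GIL contention
    try:
        results = pool.map(fetch_and_parse, urls)
    finally:
        pool.close()
        pool.join()
    print results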

For example, in this library you can s/ThreadPool/ProcessPool/g and this
example would still work:
<http://www.onideas.ws/stream.py/#retrieving-web-pages-concurrently>.

If you want to DIY, with multiprocessing.Lock/Pipe/Queue, I don't
understand why it would be more of a pain to write your threads as
processes.


// aht
http://blog.onideas.ws
 
Stefan Behnel

Paul Rubin, 04.02.2010 02:51:
You're probably better off using separate processes for the different
pages. If I remember, you were using BeautifulSoup, which while very
cool, is pretty doggone slow for use on large volumes of pages. I don't
know if there's much that can be done about that without going off on a
fairly messy C or C++ coding adventure. Maybe someday someone will do
that.

Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
 
Paul Rubin

Stefan Behnel said:
Well, if multi-core performance is so important here, then there's a pretty
simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and it is written in
Python. Writing something that complex in C to handle so much
potentially malicious input would be quite a lot of work to write at
all, and very difficult to ensure was really safe. Look at the many
browser vulnerabilities we've seen over the years due to that sort of
problem, for example. But, for web crawling, you really do need to
handle the messy and wrong HTML properly.
 
Antoine Pitrou

On Tue, 02 Feb 2010 15:02:49 -0800, John Nagle wrote:
I know there's a performance penalty for running Python on a multicore
CPU, but how bad is it? I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if
the GIL just limited Python to running on one CPU at a time, but it's
worse than that; there's excessive overhead due to a lame locking
implementation. Running CPU-bound multithreaded code on a dual-core CPU
runs HALF AS FAST as on a single-core CPU, according to Beazley.

That's on certain types of workloads, and perhaps on certain OSes, so you
should try benching your own workload to see whether it applies.

Two closing remarks:
- this should (hopefully) be fixed in 3.2, as exarkun noticed
- instead of spawning one thread per Web page, you could use Twisted or
another event loop mechanism in order to process pages serially, in the
order of arrival (a rough sketch below)
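
A rough sketch of that second idea, assuming the twisted.web.client.getPage
API of that era (parse is a hypothetical placeholder):

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

def parse(html, url):
    return url, len(html)                 # placeholder for the real parsing

def done(results):
    for success, value in results:        # DeferredList yields (success, result) pairs
        print success, value
    reactor.stop()

urls = ["http://example.com/a", "http://example.com/b"]
work = [getPage(url).addCallback(parse, url) for url in urls]
DeferredList(work, consumeErrors=True).addCallback(done)
reactor.run()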

Regards

Antoine.
 
J Kenneth King

Paul Rubin said:
Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and it is written in
Python. Writing something that complex in C to handle so much
potentially malicious input would be quite a lot of work to write at
all, and very difficult to ensure was really safe. Look at the many
browser vulnerabilities we've seen over the years due to that sort of
problem, for example. But, for web crawling, you really do need to
handle the messy and wrong HTML properly.

If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.
 
John Krukoff

Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for that reason, and it is written in
Python. Writing something that complex in C to handle so much
potentially malicious input would be quite a lot of work to write at
all, and very difficult to ensure was really safe. Look at the many
browser vulnerabilities we've seen over the years due to that sort of
problem, for example. But, for web crawling, you really do need to
handle the messy and wrong HTML properly.

Actually, lxml has an HTML parser which does pretty well with the
standard level of broken HTML one finds most often on the web. And, when
it falls down, it's easy to integrate BeautifulSoup as a slow backup for
when things go really wrong (as J Kenneth King mentioned earlier):

http://codespeak.net/lxml/lxmlhtml.html#parsing-html

In my experience, though, I haven't yet had to parse anything that lxml
couldn't handle.
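
The fallback arrangement is only a few lines (a sketch; BeautifulSoup 3
import style, which was current at the time, and lxml so rarely fails
outright that the except clause is mostly a safety net):

import lxml.html
from BeautifulSoup import BeautifulSoup

def page_title(html):
    try:
        tree = lxml.html.fromstring(html)     # fast C parser, tolerant of bad HTML
        return tree.findtext(".//title")
    except Exception:
        soup = BeautifulSoup(html)            # slow but very forgiving fallback
        return soup.title.string if soup.title else None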
 
Martin v. Loewis

John said:
I know there's a performance penalty for running Python on a
multicore CPU, but how bad is it? I've read the key paper
("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate
if the GIL just limited Python to running on one CPU at a time,
but it's worse than that; there's excessive overhead due to
a lame locking implementation. Running CPU-bound multithreaded
code on a dual-core CPU runs HALF AS FAST as on a single-core
CPU, according to Beazley.

I couldn't reproduce these results on Linux. Not sure what "HALF AS
FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
couldn't reproduce.

If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
machine, I get 30s for the sequential execution, 40s for the
multi-threaded case, and 32s for the multi-threaded case when pinning
the Python process to a single CPU (using taskset(1)).

So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
far from the 100% you seem to expect.
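
For anyone who wants to repeat the measurement on their own OS and
hardware, Beazley's test boils down to something like this (a minimal
reconstruction of the CPU-bound countdown from his slides):

import time
from threading import Thread

def count(n):
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N); count(N)                        # sequential baseline
print "sequential :", time.time() - start

start = time.time()
t1 = Thread(target=count, args=(N,))
t2 = Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print "two threads:", time.time() - start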

Regards,
Martin
 
Ryan Kelly

I couldn't reproduce these results on Linux. Not sure what "HALF AS
FAST" is; I suppose it means "it runs TWICE AS LONG" - this is what I
couldn't reproduce.

If I run Beazley's program on Linux 2.6.26, on a 4 processor Xeon (3GHz)
machine, I get 30s for the sequential execution, 40s for the
multi-threaded case, and 32s for the multi-threaded case when pinning
the Python process to a single CPU (using taskset(1)).

So it's 6% overhead for threading, and 25% penalty for multicore CPUs -
far from the 100% you seem to expect.

It's far from scientific, but I've seen behaviour that's close to a 100%
performance penalty on a dual-core linux system:

http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2

Short story: a particular test suite of mine used to run in around 25
seconds, but a bit of ctypes magic to set thread affinity dropped the
running time to under 13 seconds.
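
For the curious, the "ctypes magic" is roughly this kind of thing on
Linux (a sketch only; it assumes glibc's sched_setaffinity, at most 64
CPUs, and it pins just the calling thread):

import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)
mask = ctypes.c_ulong(1)                  # bit 0 set -> CPU 0 only
result = libc.sched_setaffinity(0, ctypes.sizeof(mask), ctypes.byref(mask))
if result != 0:
    raise OSError(ctypes.get_errno(), "sched_setaffinity failed")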


Cheers,

Ryan

 
Martin v. Loewis

It's far from scientific, but I've seen behaviour that's close to a 100%
performance penalty on a dual-core linux system:

http://www.rfk.id.au/blog/entry/a-gil-adventure-threading2

Short story: a particular test suite of mine used to run in around 25
seconds, but a bit of ctypes magic to set thread affinity dropped the
running time to under 13 seconds.

Indeed, it's not scientific - but with a few more details you could
improve it quite a lot: which Linux distribution (the posting doesn't even
say it's Linux)? Which Python version were you using? (Less important:)
which CPUs? And, if you can say: which specific test suite?

A lot of science is about repeatability. Making a systematic study is
(IMO) over-valued - anecdotal reports are useful, too, as long as they
allow for repeatable experiments.

Regards,
Martin
 
Ryan Kelly

Indeed, it's not scientific - but with a few more details you could
improve it quite a lot: which Linux distribution (the posting doesn't even
say it's Linux)? Which Python version were you using? (Less important:)
which CPUs? And, if you can say: which specific test suite?

I'm on Ubuntu Karmic, Python 2.6.4, an AMD Athlon 7750 dual core.

Unfortunately the test suite is for a proprietary application. I've
been able to reproduce similar behaviour with an open-source test suite,
using the current trunk of the "pyfilesystem" project:

http://code.google.com/p/pyfilesystem/


In this project "OSFS" is an object-oriented interface to the local
filesystem. The test case "TestOSFS.test_cases_in_separate_dirs" runs
three threads, each doing a bunch of IO in a different directory.

Running the tests normally:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 9.787s


That's the best result from five runs - I saw it go as high as 12
seconds. Watching it in top, I see CPU usage at around 150%.

Now using threading2 to set the process cpu affinity at the start of the
test run:

rfk@durian:/storage/software/fs$ nosetests fs/tests/test_fs.py:TestOSFS.test_cases_in_separate_dirs
.
----------------------------------------------------------------------
Ran 1 test in 3.792s


Again, best of five. The variability in times here is much lower - I
never saw it go above 4 seconds. CPU usage is consistently 100%.



Cheers,

Ryan


 
