another thread on Python threading


cgwalters

Hi,

I've recently been working on an application[1] which does quite a bit
of searching through large data structures and string matching, and I
was thinking that it would help to put some of this CPU-intensive work
in another thread, but of course this won't work because of Python's
GIL.

There's a lot of past discussion on this, and I want to bring it up
again because with the work on Python 3000, I think it is worth trying
to take a look at what can be done to address portions of the problem
through language changes.

Also, the recent hardware trend towards multicore processors is
another reason I think it is worth taking a look at the problem again.

= dynamic objects, locking and __slots__ =

I remember reading (though I can't find it now) one person's attempt
at true multithreaded programming involved adding a mutex to all
object access. The obvious question though is - why don't other true
multithreaded languages like Java need to lock an object when making
changes? The answer is that they don't support adding random
attributes to objects; in other words, they default to the equivalent
of __slots__.

== Why hasn't __slots__ been successful? ==

I very rarely see Python code use __slots__. I think there are
several reasons for this. The first is that a lot of programs don't
need to optimize on this level. The second is that it's annoying to
use, because it means you have to type your member variables *another*
time (in addition to __init__ for example), which feels very un-
Pythonic.
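To make the complaint concrete, here is a minimal toy class using __slots__; note that every attribute has to be named once in __slots__ and then again in __init__:

```python
class Point(object):
    __slots__ = ('x', 'y')  # attributes typed once here...

    def __init__(self, x, y):
        self.x = x          # ...and typed again here
        self.y = y

p = Point(1, 2)
try:
    p.z = 3                 # not in __slots__, so this fails
except AttributeError:
    print("can't add attributes outside __slots__")
```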

== Defining object attributes ==

In my Python code, one restriction I try to follow is to set all the
attributes I use for an object in __init__. You could do this as
class member variables, but often I want to set them in __init__
anyways from constructor arguments, so "defining" them in __init__
means I only type them once, not twice.

One random idea for Python 3000 is to make the equivalent of
__slots__ the default, *but* gather
the set of attributes from all member variables set in __init__. For
example, if I write:

class Foo(object):
    def __init__(self, bar=None):
        self.__baz = 20
        if bar:
            self.__bar = bar
        else:
            self.__bar = time.time()

f = Foo()
f.otherattr = 40  # this would be an error! Can't add random
                  # attributes not defined in __init__

I would argue that the current Python default of supporting adding
random attributes is almost never what you really want. If you *do*
want to set random attributes, you almost certainly want to be using a
dictionary or a subclass of one, not an object. What's nice about the
current Python is that you don't need to redundantly type things, and
we should preserve that while still allowing more efficient
implementation strategies.
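A runtime approximation of this proposal is possible in today's Python with a class decorator (the decorator name here is hypothetical, and unlike the proposal it enforces the restriction at runtime rather than at load time):

```python
def fixed_attributes(cls):
    """Freeze the attribute set once __init__ has run
    (a sketch approximating the proposed semantics)."""
    original_init = cls.__init__
    original_setattr = cls.__setattr__

    def __init__(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        object.__setattr__(self, '_frozen', True)  # bypass our hook

    def __setattr__(self, name, value):
        # after __init__, only already-existing attributes may be set
        if getattr(self, '_frozen', False) and not hasattr(self, name):
            raise AttributeError("can't add attribute %r" % name)
        original_setattr(self, name, value)

    cls.__init__ = __init__
    cls.__setattr__ = __setattr__
    return cls

@fixed_attributes
class Foo(object):
    def __init__(self, bar=None):
        self.baz = 20
        self.bar = bar if bar is not None else 0

f = Foo()
f.baz = 30        # fine: defined in __init__
try:
    f.otherattr = 40
except AttributeError:
    print("can't add random attributes")
```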

= Limited threading =

Now, I realize there are a ton of other things the GIL protects other
than object dictionaries; with true threading you would have to touch
the importer, the garbage collector, verify all the C extension
modules, etc. Obviously non-trivial. What if as an initial push
towards real threading, Python had support for "restricted threads".
Essentially, restricted threads would be limited to a subset of the
standard library that had been verified for thread safety, would not
be able to import new modules, etc.

Something like this:

def datasearcher(list, queue):
    for item in list:
        if item.startswith('foo'):
            queue.put(item)
    queue.done()

vals = ['foo', 'bar']
queue = queue.Queue()
threading.start_restricted_thread(datasearcher, vals, queue)

def print_item(item):
    print item

queue.set_callback(print_item)

I know I'm making up some API above, but the point here is that
"datasearcher" could pretty easily run in a true thread and touch very
little of the interpreter; only support for atomic reference counting
and a concurrent garbage collector would be needed.
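For comparison, the same producer/consumer shape written against today's real threading and Queue modules (the restricted-thread API above is invented; this version uses a None sentinel in place of the hypothetical queue.done(), and under the GIL it gains no parallelism):

```python
import threading
import queue  # named Queue in the Python of this thread's era

def datasearcher(items, out):
    for item in items:
        if item.startswith('foo'):
            out.put(item)
    out.put(None)  # sentinel standing in for the invented queue.done()

out = queue.Queue()
t = threading.Thread(target=datasearcher,
                     args=(['foo', 'bar', 'foobar'], out))
t.start()

results = []
while True:
    item = out.get()
    if item is None:
        break
    results.append(item)
t.join()
# results == ['foo', 'foobar']
```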

Thoughts?

[1] http://submind.verbum.org/hotwire/wiki
 

cgwalters

The pitfall here is that to reduce code duplication,
you might initialize certain variables in a method
called by __init__, because your object might want to
return to its initial state.

This is a good point. I was thinking that this analysis would
occur during module loading; i.e. it would be the equivalent of Java's
classloading.

What if the startup code analysis just extended to methods called
during __init__?
That seems like a relatively straightforward extension. Remember we
aren't actually *executing*
the startup code in my proposal; we are just analyzing it for all
possible execution paths which cause
a member variable assignment.
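The load-time analysis described here could be prototyped with the ast module. A rough sketch (it collects every self.attr assignment along any path in __init__, but does not yet follow calls into other methods):

```python
import ast

def collect_init_attrs(source):
    """Collect every attribute assigned to self on any execution
    path inside __init__ (sketch; does not follow method calls)."""
    tree = ast.parse(source)
    attrs = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == '__init__':
            for sub in ast.walk(node):
                if (isinstance(sub, ast.Attribute)
                        and isinstance(sub.ctx, ast.Store)
                        and isinstance(sub.value, ast.Name)
                        and sub.value.id == 'self'):
                    attrs.add(sub.attr)
    return attrs

src = """
class Foo(object):
    def __init__(self, bar=None):
        self.baz = 20
        if bar:
            self.bar = bar
        else:
            self.bar = 0
"""
# collect_init_attrs(src) == {'baz', 'bar'}
```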
 

Josiah Carlson

I've recently been working on an application[1] which does quite a bit
of searching through large data structures and string matching, and I
was thinking that it would help to put some of this CPU-intensive work
in another thread, but of course this won't work because of Python's
GIL.

If you are doing string searching, implement the algorithm in C, and
call out to the C (remembering to release the GIL).
There's a lot of past discussion on this, and I want to bring it up
again because with the work on Python 3000, I think it is worth trying
to take a look at what can be done to address portions of the problem
through language changes.

Not going to happen. All Python 3000 PEPs had a due-date at least a
month ago (possibly even 2), so you are too late to get *any*
substantial change in.
I remember reading (though I can't find it now) one person's attempt
at true multithreaded programming involved adding a mutex to all
object access. The obvious question though is - why don't other true
multithreaded languages like Java need to lock an object when making
changes?

From what I understand, the Java runtime uses fine-grained locking on
all objects. You just don't notice it because you don't need to write
the acquire()/release() calls. It is done for you. (in a similar
fashion to Python's GIL acquisition/release when switching threads)
They also have a nice little decorator-like thingy (I'm not a Java guy,
so I don't know the name exactly) called 'synchronize', which locks and
unlocks the object when accessing it through a method.


- Josiah


 

sturlamolden

I remember reading (though I can't find it now) one person's attempt
at true multithreaded programming involved adding a mutex to all
object access. The obvious question though is - why don't other true
multithreaded languages like Java need to lock an object when making
changes?

From what I understand, the Java runtime uses fine-grained locking on
all objects. You just don't notice it because you don't need to write
the acquire()/release() calls. It is done for you. (in a similar
fashion to Python's GIL acquisition/release when switching threads)

The problem is CPython's reference counting. Access to reference
counts must be synchronized.

Java, IronPython and Jython use another scheme for the garbage
collector and do not need a GIL.

Changing CPython's garbage collection from reference counting to a
generational GC would be a major undertaking. There are also pros and
cons to using reference counts instead of 'modern' garbage collectors.
For example, unless there are cyclic references, one can always know
when an object is garbage collected. One also avoids periodic delays
when garbage is collected, and memory use can be more modest when a
lot of small temporary objects are being used.

Also beware that the GIL is only a problem for CPU bound code. IO
bound code is not slowed by the GIL. The Python runtime itself is a
bigger problem for CPU bound code.

In C or Fortran, writing parallel algorithms for multiprocessor
systems typically involves using OpenMP or MPI. Parallelizing
algorithms using manual threading should be discouraged. It is far
better to insert a compiler directive (#pragma omp) and let an OpenMP
compiler do the job.

There are a number of different options for exploiting multiple CPUs
from CPython, including:

- MPI (e.g. mpi4py or PyMPI)
- PyPar
- os.fork() on Linux or Unix
- subprocess.Popen
- C extensions that use OpenMP
- C extensions that spawn threads (should be discouraged!)
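As a toy illustration of the subprocess route from the list above, each child runs its own interpreter and therefore its own GIL, so the two computations genuinely proceed in parallel on a multicore machine (the squared-sum is arbitrary busywork):

```python
import subprocess
import sys

# CPU-bound busywork to run in two separate interpreter processes
code = "print(sum(i * i for i in range(100000)))"

procs = [subprocess.Popen([sys.executable, "-c", code],
                          stdout=subprocess.PIPE)
         for _ in range(2)]
outputs = [int(p.communicate()[0]) for p in procs]
```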
They also have a nice little decorator-like thingy (I'm not a Java guy,
so I don't know the name exactly) called 'synchronize', which locks and
unlocks the object when accessing it through a method.

A similar Python 'synchronized' function decorator may look like this:

def synchronized(fun):
    from threading import RLock
    rl = RLock()
    def decorator(*args, **kwargs):
        with rl:
            retv = fun(*args, **kwargs)
        return retv
    return decorator

It is not possible to define a 'synchronized' block though, as Python
does not have Lisp macros :(
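In use, the decorator might look like this toy example (repeating the decorator so the snippet is self-contained; the counter increment is the classic non-atomic read-modify-write that the lock serializes):

```python
import threading

def synchronized(fun):
    # one re-entrant lock per decorated function
    rl = threading.RLock()
    def decorator(*args, **kwargs):
        with rl:
            return fun(*args, **kwargs)
    return decorator

counter = {'n': 0}

@synchronized
def bump():
    for _ in range(1000):
        counter['n'] += 1  # read-modify-write, serialized by the lock

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter['n'] == 4000
```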
 

cgwalters

If you are doing string searching, implement the algorithm in C, and
call out to the C (remembering to release the GIL).

I considered that, but...ick! The whole reason I'm writing this
program in Python in the first place is so I don't have to deal with
the mess that is involved when you do string matching and data
structure traversal in C.

On the other hand, there are likely C libraries out there for
searching the kinds of data structures I use; I'll investigate.
Not going to happen. All Python 3000 PEPs had a due-date at least a
month ago (possibly even 2), so you are too late to get *any*
substantial change in.

=( Too bad. It might be possible to do these changes in a backwards
compatible way, though less elegantly. For example, the object change
could be denoted by inheriting from "fixedobject" or something.
 

Josiah Carlson

sturlamolden said:
The problem is CPython's reference counting. Access to reference
counts must be synchronized.
Java, IronPython and Jython use another scheme for the garbage
collector and do not need a GIL.

There was a discussion regarding this on the python-ideas list recently.
You *can* attach a lock to every object, and use fine-grained locking
to handle refcounts. Alternatively, you can use platform-specific
atomic increments and decrements, or even a secondary 'owner thread'
refcount that the owning thread can update without locking.

It turns out that atomic updates are slow, and I wasn't able to get any
sort of productive results using 'owner threads' (the effect seemed
generally negative, and it was certainly more work to make happen). I
don't believe anyone bothered to test fine-grained locking on objects.

However, locking isn't just for refcounts, it's to make sure that thread
A isn't mangling your object while thread B is traversing it. With
object locking (coarse via the GIL, or fine via object-specific locks),
you get the same guarantees, with the problem being that with
fine-grained locking it is about a bazillion times more difficult to
verify the absence of deadlocks than with a GIL-based approach.
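A toy illustration of that verification problem: two locks acquired in opposite orders by two threads can deadlock, and the standard mitigation is to impose a single global acquisition order (here, ordering by id(), as a sketch):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def with_both(first, second):
    # Acquire in one global order (by id) so that every caller takes
    # the locks the same way, ruling out the classic A-then-B versus
    # B-then-A deadlock.
    first, second = sorted((first, second), key=id)
    with first:
        with second:
            return True
```

Without the sorted() line, one thread calling with_both(lock_a, lock_b) while another calls with_both(lock_b, lock_a) can deadlock; proving that no such pair exists anywhere in a large program is the hard part.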

Changing CPython's garbage collection from reference counting to a
generational GC would be a major undertaking. There are also pros and
cons to using reference counts instead of 'modern' garbage collectors.
For example, unless there are cyclic references, one can always know
when an object is garbage collected. One also avoids periodic delays
when garbage is collected, and memory use can be more modest when a
lot of small temporary objects are being used.

It was done a while ago. The results? On a single-processor machine,
Python code ran at roughly 1/4 to 1/3 the speed of the original
runtime. When using 4+ processors, there were some gains in threaded
code, but nothing substantial at that point.

There are a number of different options for exploiting multiple CPUs
from CPython, including:

My current favorite is the processing package (available from the Python
cheeseshop). You get much of the same API as threading, only you are
using processes instead. It works on Windows, OSX, and *nix.

def synchronized(fun):
    from threading import RLock
    rl = RLock()
    def decorator(*args, **kwargs):
        with rl:
            retv = fun(*args, **kwargs)
        return retv
    return decorator

It is not possible to define a 'synchronized' block though, as Python
does not have Lisp macros :(

Except that you just used the precise mechanism necessary to get a
synchronized block in Python:

lock = threading.Lock()

with lock:
    # synchronized block!
    pass


- Josiah
 

sturlamolden

lock = threading.Lock()

with lock:
    # synchronized block!
    pass

True, except that the lock has to be shared among the threads. This
explicit initialization of a reentrant lock is avoided in a Java
synchronized block.
 

sturlamolden

However, locking isn't just for refcounts, it's to make sure that thread
A isn't mangling your object while thread B is traversing it.

With
object locking (coarse via the GIL, or fine via object-specific locks),
you get the same guarantees, with the problem being that with
fine-grained locking it is about a bazillion times more difficult to
verify the absence of deadlocks than with a GIL-based approach.

I think this is just as much a question of what the runtime should
guarantee. One doesn't need a guarantee that two threads are not
mangling the same object simultaneously. Instead, the runtime could
leave it to the programmer to use explicit locks or synchronized
blocks on the object to get that guarantee.

It was done a while ago. The results? On a single-processor machine,
Python code ran at roughly 1/4 to 1/3 the speed of the original
runtime. When using 4+ processors, there were some gains in threaded
code, but nothing substantial at that point.

I am not surprised. Reference counts are quite efficient, contrary to
common belief. The problem with reference counts is cyclic references
involving objects that define a __del__ method. As these objects are
not eligible for cyclic garbage collection, this can produce resource
leaks.

My current favorite is the processing package (available from the Python
cheeseshop).

Thanks. I'll take a look at that.
 

Josiah Carlson

sturlamolden said:
True, except that the lock has to be shared among the threads. This
explicit initialization of a reentrant lock is avoided in a Java
synchronized block.

You toss the lock creation in the global namespace of the module for
which you would like to synchronize access.
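That is, something like this toy sketch, where the module-level lock is shared by every thread that imports the module:

```python
import threading

# module-level lock: one object, shared by every thread that
# imports this module
_lock = threading.Lock()

shared = {}

def update(key, value):
    # every caller synchronizes on the same module-level lock
    with _lock:
        shared[key] = value
```

Usage: any thread can simply call update('a', 1) and the writes to `shared` are serialized, without each caller having to create or pass a lock around.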


- Josiah
 

Josiah Carlson

sturlamolden said:
I think this is just as much a question of what the runtime should
guarantee. One doesn't need a guarantee that two threads are not
mangling the same object simultaneously. Instead, the runtime could
leave it to the programmer to use explicit locks or synchronized
blocks on the object to get that guarantee.

Why? Right now we have a language where the worst that doing silly
things with threads can get you is perhaps a deadlock, perhaps
incorrect execution, or maybe some data corruption if you are working
with files. If we forced all thread users to synchronize everything
themselves, we would get an uglier language, and incorrectly written
code could potentially cause crashes (while all of the earlier
drawbacks would still apply). In the "safety vs. speed" or "easy vs.
fast" arenas, Python has already chosen safe and easy rather than
fast. I doubt that is going to change any time soon.

I am not surprised. Reference counts are quite efficient, contrary to
common belief. The problem with reference counts is cyclic references
involving objects that define a __del__ method. As these objects are
not eligible for cyclic garbage collection, this can produce resource
leaks.

There was a discussion about removing __del__ within the last couple
weeks. I didn't pay much attention to it (having learned never to use
__del__), but I believe it involved some sort of weakref-based cleanup.
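For reference, the weakref-based cleanup idea resembles what eventually landed in the standard library as weakref.finalize (a modern API, long after this thread):

```python
import weakref

class Resource(object):
    pass

log = []
r = Resource()
# register a callback to run when r is garbage collected, without
# giving Resource a __del__ method
weakref.finalize(r, log.append, 'cleaned up')
del r  # refcount hits zero; in CPython the finalizer runs immediately
# log == ['cleaned up']
```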

- Josiah
 
