Multiprocessing, shared memory vs. pickled copies


John Ladasky

Hi folks,

I'm developing some custom neural network code. I'm using Python 2.6,
Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core CPU,
and I want to take full advantage of it. I love to hear my CPU fan
running, and watch my results come back faster.

When I'm training a neural network, I pass two numpy.ndarray objects
to a function called evaluate. One array contains the weights for the
neural network, and the other array contains the input data. The
evaluate function returns an array of output data.

I have been playing with multiprocessing for a while now, and I have
some familiarity with Pool. Apparently, arguments passed to a Pool
subprocess must be able to be pickled. Pickling is still a pretty
vague process to me, but I can see that you have to write custom
__reduce__ and __setstate__ methods for your objects. An example of
code which creates a pickle-friendly ndarray subclass is here:

http://www.mail-archive.com/[email protected]/msg02446.html

Now, I don't know that I actually HAVE to pass my neural network and
input data as copies -- they're both READ-ONLY objects for the
duration of an evaluate function (which can go on for quite a while).
So, I have also started to investigate shared-memory approaches. I
don't know how a shared-memory object is referenced by a subprocess
yet, but presumably you pass a reference to the object, rather than
the whole object. Also, it appears that subprocesses also acquire a
temporary lock over a shared memory object, and thus one process may
well spend time waiting for another (individual CPU caches may
sidestep this problem?) Anyway, an implementation of a shared-memory
ndarray is here:

https://bitbucket.org/cleemesser/numpy-sharedmem/src/3fa526d11578/shmarray.py

I've added a few lines to this code which allow subclassing the
shared memory array, which I need (because my neural net objects are
more than just the array, they also contain meta-data). But I've run
into some trouble doing the actual sharing part. The shmarray class
CANNOT be pickled. I think that my understanding of multiprocessing
needs to evolve beyond the use of Pool, but I'm not sure yet. This
post suggests as much.

http://mail.scipy.org/pipermail/scipy-user/2009-February/019696.html

I don't believe that my questions are specific to numpy, which is why
I'm posting here, in a more general Python forum.

When should one pickle and copy? When to implement an object in
shared memory? Why is pickling apparently such a non-trivial process
anyway? And, given that multi-core CPUs are apparently here to stay,
should it be so difficult to make use of them?
 

Philip Semanchuk

I have been playing with multiprocessing for a while now, and I have
some familiarity with Pool. Apparently, arguments passed to a Pool
subprocess must be able to be pickled.

Hi John,
multiprocessing's use of pickle is not limited to Pool. For instance, objects put into a multiprocessing.Queue are also pickled, as are the args to a multiprocessing.Process. So if you're going to use multiprocessing, you're going to use pickle, and you need pickleable objects.
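
To make that concrete, here is a tiny sketch (mine, not from the original exchange) showing the copy semantics: whatever goes into a Queue is pickled by the sender and unpickled by the receiver, so the child works on a copy of the array rather than the parent's memory.

# Everything put() into a multiprocessing.Queue is pickled on the way in
# and unpickled on the way out, so the worker receives a copy.
from multiprocessing import Process, Queue
import numpy

def worker(inbox, outbox):
    arr = inbox.get()        # an unpickled copy, not the parent's array
    arr *= 2                 # modifies only the copy
    outbox.put(arr.sum())

if __name__ == '__main__':
    inbox, outbox = Queue(), Queue()
    original = numpy.arange(5)
    inbox.put(original)      # pickled here
    child = Process(target=worker, args=(inbox, outbox))
    child.start()
    print outbox.get()       # 20
    print original           # [0 1 2 3 4] -- unchanged in the parent
    child.join()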

Pickling is still a pretty
vague process to me, but I can see that you have to write custom
__reduce__ and __setstate__ methods for your objects.

Well, that's only if one's objects don't support pickle by default. A lot of classes do without any need for custom __reduce__ and __setstate__ methods. Since you're apparently not too familiar with pickle, I don't want you to get the false impression that it's a lot of trouble. I've used pickle a number of times and never had to write custom methods for it.


Now, I don't know that I actually HAVE to pass my neural network and
input data as copies -- they're both READ-ONLY objects for the
duration of an evaluate function (which can go on for quite a while).
So, I have also started to investigate shared-memory approaches. I
don't know how a shared-memory object is referenced by a subprocess
yet, but presumably you pass a reference to the object, rather than
the whole object. Also, it appears that subprocesses also acquire a
temporary lock over a shared memory object, and thus one process may
well spend time waiting for another (individual CPU caches may
sidestep this problem?) Anyway, an implementation of a shared-memory
ndarray is here:

There's no standard shared memory implementation for Python. The mmap module is as close as you get. I wrote & support the posix_ipc and sysv_ipc modules which give you IPC primitives (shared memory and semaphores) in Python. They work well (IMHO) but they're *nix-only and much lower level than multiprocessing. If multiprocessing is like a kitchen well stocked with appliances, posix_ipc (and sysv_ipc) is like a box of sharp knives.

Note that mmap and my IPC modules don't expose Python objects. They expose raw bytes in memory. You're still going to have to jump through some hoops (...like pickle) to turn your Python objects into a bytestream and vice versa.
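
For the curious, the knife-box route looks roughly like this (a sketch under assumptions: the segment name "/demo_weights" and the size are invented, and real code would need cleanup via unlink() plus some synchronization):

# Rough sketch: create a named POSIX shared-memory segment, map it, and
# view the raw bytes as a float64 array with numpy.frombuffer.
import mmap
import numpy
import posix_ipc

SIZE = 10 * 8                                  # room for ten float64 values
shm = posix_ipc.SharedMemory("/demo_weights",  # hypothetical segment name
                             posix_ipc.O_CREAT, size=SIZE)
buf = mmap.mmap(shm.fd, shm.size)
shm.close_fd()                                 # the mapping keeps the memory alive

weights = numpy.frombuffer(buf, dtype=numpy.float64)
weights[:] = 0.5       # visible to any other process that maps "/demo_weights"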


What might be easier than fooling around with boxes of sharp knives is to convert your ndarray objects to Python lists. Lists are pickle-friendly and easy to turn back into ndarray objects once they've crossed the pickle boundary.
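
The list detour, for illustration (trivial, but it shows the round trip and the cost: the data is copied and converted on both sides):

import numpy

arr = numpy.array([[1.0, 2.0], [3.0, 4.0]])
payload = arr.tolist()           # plain nested lists pickle with no fuss
restored = numpy.array(payload)  # back to an ndarray on the far side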

When should one pickle and copy? When to implement an object in
shared memory? Why is pickling apparently such a non-trivial process
anyway? And, given that multi-core CPUs are apparently here to stay,
should it be so difficult to make use of them?

My answers to these questions:

1) Depends
2) In Python, almost never unless you're using a nice wrapper like shmarray.py
3) I don't think it's non-trivial =)
4) No, definitely not. Python will only get better at working with multiple cores/CPUs, but there's plenty of room for improvement on the status quo.

Hope this helps
Philip
 

Robert Kern

Hi folks,

I'm developing some custom neural network code. I'm using Python 2.6,
Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core CPU,
and I want to take full advantage of it. I love to hear my CPU fan
running, and watch my results come back faster.

You will want to ask numpy questions on the numpy mailing list.

http://www.scipy.org/Mailing_Lists

When I'm training a neural network, I pass two numpy.ndarray objects
to a function called evaluate. One array contains the weights for the
neural network, and the other array contains the input data. The
evaluate function returns an array of output data.

I have been playing with multiprocessing for a while now, and I have
some familiarity with Pool. Apparently, arguments passed to a Pool
subprocess must be able to be pickled. Pickling is still a pretty
vague process to me, but I can see that you have to write custom
__reduce__ and __setstate__ methods for your objects. An example of
code which creates a pickle-friendly ndarray subclass is here:

http://www.mail-archive.com/[email protected]/msg02446.html

Note that numpy arrays are already pickle-friendly. This message is telling you
how, *if* you are already subclassing, to make your subclass pickle the extra
information it holds.
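
A quick check of that point (mine, not from the original message): a plain ndarray round-trips through pickle with no custom methods at all.

import pickle
import numpy

weights = numpy.arange(6, dtype=numpy.float64).reshape(2, 3)
clone = pickle.loads(pickle.dumps(weights))
assert (clone == weights).all() and clone.dtype == weights.dtype
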
Now, I don't know that I actually HAVE to pass my neural network and
input data as copies -- they're both READ-ONLY objects for the
duration of an evaluate function (which can go on for quite a while).
So, I have also started to investigate shared-memory approaches. I
don't know how a shared-memory object is referenced by a subprocess
yet, but presumably you pass a reference to the object, rather than
the whole object. Also, it appears that subprocesses also acquire a
temporary lock over a shared memory object, and thus one process may
well spend time waiting for another (individual CPU caches may
sidestep this problem?) Anyway, an implementation of a shared-memory
ndarray is here:

https://bitbucket.org/cleemesser/numpy-sharedmem/src/3fa526d11578/shmarray.py

I've added a few lines to this code which allow subclassing the
shared memory array, which I need (because my neural net objects are
more than just the array, they also contain meta-data).

Honestly, you should avoid subclassing ndarray just to add metadata. It never
works well. Make a plain class, and keep the arrays as attributes.
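
For illustration, a minimal sketch of that advice (the class name and attributes here are invented): keep the ndarray and the metadata as attributes of an ordinary class, and the default pickle machinery handles everything through __dict__.

import pickle
import numpy

class NeuralNet(object):
    def __init__(self, weights, description):
        self.weights = numpy.asarray(weights)   # ndarray kept as an attribute
        self.description = description          # arbitrary metadata

net = NeuralNet([[0.1, 0.2], [0.3, 0.4]], "two-input toy network")
clone = pickle.loads(pickle.dumps(net))
assert clone.description == "two-input toy network"
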
But I've run
into some trouble doing the actual sharing part. The shmarray class
CANNOT be pickled.

Please never just *say* that something doesn't work. Show us what you tried, and
show us exactly what output you got. I assume you tried something like this:

[Downloads]$ cat runmp.py
from multiprocessing import Pool
import shmarray


def f(z):
    return z.sum()

y = shmarray.zeros(10)
z = shmarray.ones(10)

p = Pool(2)
print p.map(f, [y, z])


And got output like this:

[Downloads]$ python runmp.py
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/7.0/lib/python2.7/threading.py", line 530, in __bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/7.0/lib/python2.7/threading.py", line 483, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.0/lib/python2.7/multiprocessing/pool.py", line 287, in _handle_tasks
    put(task)
PicklingError: Can't pickle <class 'multiprocessing.sharedctypes.c_double_Array_10'>: attribute lookup multiprocessing.sharedctypes.c_double_Array_10 failed


Now, the sharedctypes is supposed to implement shared arrays. Underneath, they
have some dynamically created types like this c_double_Array_10 type.
multiprocessing has a custom pickler which has a registry of reduction functions
for types that do not implement a __reduce_ex__() method. For these dynamically
created types that cannot be imported from a module, this dynamic registry is
the only way to do it. At least at one point, the Connection objects which
communicate between processes would use this custom pickler to serialize objects
to bytes to transmit them.

However, at least in Python 2.7, multiprocessing seems to have a C extension
module defining the Connection objects. Unfortunately, it looks like this C
extension just imports the regular pickler that is not aware of these custom
types. That's why you get this error. I believe this is a bug in Python.

So what did you try, and what output did you get? What version of Python are you
using?

I think that my understanding of multiprocessing
needs to evolve beyond the use of Pool, but I'm not sure yet. This
post suggests as much.

http://mail.scipy.org/pipermail/scipy-user/2009-February/019696.html

Maybe. If the __reduce_ex__() method is implemented properly (and
multiprocessing bugs aren't getting in the way), you ought to be able to pass
them to a Pool just fine. You just need to make sure that the shared arrays are
allocated before the Pool is started. And this only works on UNIX machines. The
shared memory objects that shmarray uses can only be inherited. I believe that's
what Sturla was getting at.
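
One pickle-free variation on that idea, sketched here for what it's worth (it assumes the shmarray module from the thread and a fork-based platform): allocate the shared array before the Pool is created and let the workers reach it as a module-level global, so nothing but small task arguments ever goes through the task queue.

from multiprocessing import Pool
import shmarray

# Allocated before the Pool, so forked workers inherit it.
shared = shmarray.ones(10)

def scaled_sum(factor):
    # 'shared' is inherited by the worker; only 'factor' is pickled.
    return factor * shared.sum()

if __name__ == '__main__':
    pool = Pool(2)
    print pool.map(scaled_sum, [1, 2, 3])   # [10.0, 20.0, 30.0]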

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Robert Kern

On 4/4/11 3:20 PM, John Ladasky wrote:
However, at least in Python 2.7, multiprocessing seems to have a C extension
module defining the Connection objects. Unfortunately, it looks like this C
extension just imports the regular pickler that is not aware of these custom
types. That's why you get this error. I believe this is a bug in Python.

So what did you try, and what output did you get? What version of Python are you
using?


Maybe. If the __reduce_ex__() method is implemented properly (and
multiprocessing bugs aren't getting in the way), you ought to be able to pass
them to a Pool just fine. You just need to make sure that the shared arrays are
allocated before the Pool is started. And this only works on UNIX machines. The
shared memory objects that shmarray uses can only be inherited. I believe that's
what Sturla was getting at.

I'll take that back a little bit. Since the underlying shared memory types can
only be shared by inheritance, the ForkingPickler is only used for the arguments
passed to Process(), and not used for things sent through Queues (which Pool
uses). Since Pool cannot guarantee that the data exists before the Pool starts
its subprocesses, it must use the general mechanism.

So in short, if you pass the shmarrays as arguments to the target function in
Process(), it should work fine.
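
In other words, something like the following sketch (again assuming the shmarray module discussed above, on a UNIX machine): the shared arrays are handed to Process() directly and inherited across the fork, and only the small result travels back.

from multiprocessing import Process, Queue
import shmarray

def evaluate(weights, inputs, results):
    # Both arrays live in shared memory; only the scalar goes back.
    results.put((weights * inputs).sum())

if __name__ == '__main__':
    weights = shmarray.ones(10)
    inputs = shmarray.ones(10)
    results = Queue()
    worker = Process(target=evaluate, args=(weights, inputs, results))
    worker.start()
    print results.get()      # 10.0
    worker.join()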

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

John Ladasky

Hi Philip,

Thanks for the reply.

So if you're going to use multiprocessing, you're going to use pickle, and you
need pickleable objects.

OK, that's good to know.
Well, that's only if one's objects don't support pickle by default. A lot of
classes do without any need for custom __reduce__ and __setstate__ methods.
Since you're apparently not too familiar with pickle, I don't want you to get
the false impression that it's a lot of trouble. I've used pickle a number of
times and never had to write custom methods for it.

I used __reduce__ and __setstate__ once before with success, but I
just hacked at code I found on the Net, without fully understanding
it.

All right, since my last post, I've made an attempt to understand a
bit more about pickle. In doing so, I'm getting into the guts of the
Python class model, in a way that I did not expect. I believe that I
need to understand the following:

1) What kinds of objects have a __dict__,

2) Why the default pickling methods do not simply look for a __dict__
and serialize its contents, if it is present, and

3) Why objects can exist that do not support pickle by default, and
cannot be correctly pickled without custom __reduce__ and __setstate__
methods.

I can only furnish a (possibly incomplete) answer to question #1.
User subclasses apparently always have a __dict__. Here:
>>> class Foo:
...     pass  # It doesn't get any simpler
...
>>> x = Foo()
>>> x.__dict__
{}
>>> x.spam = "eggs"
>>> x.__dict__
{'spam': 'eggs'}
>>> setattr(x, "parrot", "dead")
>>> dir(x)
['__doc__', '__module__', 'parrot', 'spam']
>>> x.__dict__
{'parrot': 'dead', 'spam': 'eggs'}
>>> x.__dict__ = {"a":"b", "c":"d"}
>>> dir(x)
['__doc__', '__module__', 'a', 'c']
>>> x.__dict__
{'a': 'b', 'c': 'd'}

Aside: this finding makes me wonder what exactly dir() is showing me
about an object (all objects in the namespace, including methods, but
not __dict__ itself?), and why it is distinct from __dict__.

In contrast, it appears that BUILT-IN classes (numbers, lists,
dictionaries, etc.) do not have a __dict__. And that you can't force
a __dict__ into built-in objects, no matter how you may try.
>>> x = 5
>>> dir(x)
['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__',
'__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__',
'__format__', '__getattribute__', '__getnewargs__', '__hash__', '__hex__',
'__index__', '__init__', '__int__', '__invert__', '__long__', '__lshift__',
'__mod__', '__mul__', '__neg__', '__new__', '__nonzero__', '__oct__', '__or__',
'__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__rdivmod__',
'__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__',
'__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__',
'__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__sizeof__', '__str__',
'__sub__', '__subclasshook__', '__truediv__', '__trunc__', '__xor__', 'conjugate',
'denominator', 'imag', 'numerator', 'real']
>>> x.__dict__
Traceback (most recent call last):
  File "<console>", line 1, in <module>
<type 'exceptions.AttributeError'>: 'int' object has no attribute '__dict__'


This leads straight into my second question. I THINK, without knowing
for sure, that most user classes would pickle correctly by simply
iterating through __dict__. So, why isn't this the default behavior
for Python? Was the assumption that programmers would only want to
pickle built-in classes? Can anyone show me an example of a common
object where my suggested approach would fail?

What might be easier than fooling around with boxes of sharp knives is to convert
your ndarray objects

.... with their attendant meta-data, of course...
to Python lists. Lists are pickle-friendly and easy to turn
back into ndarray objects once they've crossed the pickle boundary.

That's a cute trick. I may try it, but I would rather do it the way
that Python recommends. I always prefer to understand what I'm doing,
and why.
Python will only get better at working with multiple cores/CPUs, but there's plenty of room for improvement on the status quo.

Thank you for agreeing with me. I'm not formally trained in computer
science, and there was always the chance that my opinion as a non-
professional might come from simply not understanding the issues well
enough. But I chose Python as a programming language (after Applesoft
Basic, 6502 assembler, Pascal, and C -- yeah, I've been at this a
while) because of its ease of entry. And while Python has carried me
far, I don't think that an issue that is as central to modern
computing as using multiple CPUs concurrently is so arcane that we
should declare it black magic -- and therefore, it's OK if Python's
philosophical goal of achieving clarity and simplicity is not met in
this case.
 

Philip Semanchuk

Hi Philip,

Thanks for the reply.



OK, that's good to know.

But as Dan Stromberg pointed out, there are some pickle-free ways to communicate between processes using multiprocessing.

This leads straight into my second question. I THINK, without knowing
for sure, that most user classes would pickle correctly by simply
iterating through __dict__. So, why isn't this the default behavior
for Python? Was the assumption that programmers would only want to
pickle built-in classes?

One can pickle user-defined classes:

>>> import pickle
>>> class Foo(object):
...     pass
...
>>> pickle.dumps(Foo())
'ccopy_reg\n_reconstructor\np0\n(c__main__\nFoo\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n.'


And as Robert Kern pointed out, numpy arrays are also pickle-able:

>>> import numpy
>>> pickle.dumps(numpy.zeros(3))
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I3\ntp6\ncnumpy\ndtype\np7\n(S'f8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."

As a side note, you should always use "new style" classes, particularly since you're exploring the details of Python class construction. "New" is a bit of a misnomer now, as "new" style classes were introduced in Python 2.2. They have been the status quo in Python 2.x for a while now and are the only choice in Python 3.x.

Subclassing object gives you a new style class:
class Foo(object):

Not subclassing object (as you did in your example) gives you an old style class:
class Foo:



Cheers
Philip
 

John Ladasky

Hello again, Philip,

I really appreciate you sticking with me. Hopefully this will help
someone else, too. I've done some more reading, and will offer some
minimal code below.

I've known about this page for a while, and it describes some of the
unconventional things one needs to consider when subclassing
numpy.ndarray:

http://www.scipy.org/Subclasses

Now, section 11.1 of the Python documentation says the following
concerning pickling: "Classes can further influence how their
instances are pickled; if the class defines the method __getstate__(),
it is called and the return state is pickled as the contents for the
instance, instead of the contents of the instance’s dictionary. If
there is no __getstate__() method, the instance’s __dict__ is
pickled."

http://docs.python.org/release/2.6.6/library/pickle.html

That being said, I'm having problems there! I've found this minimal
example of pickling with overridden __getstate__ and __setstate__
methods:

http://stackoverflow.com/questions/...le-example-about-setstate-and-getstate-thanks

I'll put all of the thoughts generated by these links together after
responding to a few things you wrote.

But as Dan Stromberg pointed out, there are some pickle-free ways to communicate between processes using multiprocessing.

I only see your reply to Dan Stromberg in this thread, but not Dan
Stromberg's original post. I am reading this through Google Groups.
Perhaps Dan's post failed to make it through a gateway for some
reason?

As a side note, you should always use "new style" classes, particularly since you're exploring the details of Python class construction. "New" is a bit of a misnomer now, as "new" style classes were introduced in Python 2.2. They have been the status quo in Python 2.x for a while now and are the only choice in Python 3.x.

Sorry, that was an oversight on my part. Normally I do remember that,
but I've been doing a lot of subclassing rather than defining top-
level classes, and therefore it slipped my mind.

One can pickle user-defined classes:

OK, I got that working, I'll show an example farther down. (I never
tried to pickle a minimal class until now, I went straight for a hard
one.)

And as Robert Kern pointed out, numpy arrays are also pickle-able.

OK, but SUBCLASSES of numpy.ndarray are not, in my hands, pickling as
I would expect. I already have lots of code that is based on such
subclasses, and they do everything I want EXCEPT pickle correctly. I
may have to direct this line of questions to the numpy discussion
group after all.

This is going to be a longer exposition. So let me thank you in
advance for your patience, and ask you to get comfortable.


========== begin code ==========

from numpy import array, ndarray
from pickle import dumps, loads

##===

class Foo(object):

    def __init__(self, contents):
        self.contents = contents

    def __str__(self):
        return str(type(self)) + "\n" + str(self.__dict__)

##===

class SuperFoo(Foo):

    def __getstate__(self):
        print "__getstate__"
        duplicate = dict(self.__dict__)
        duplicate["bonus"] = "added during pickling"
        return duplicate

    def __setstate__(self, d):
        print "__setstate__"
        self.__dict__ = d
        self.finale = "added during unpickling"

##===

class ArraySubclass(ndarray):
    """
    See http://www.scipy.org/Subclasses, this class is very similar.
    """
    def __new__(subclass, data, extra=None, dtype=None, copy=False):
        print " __new__"
        arr = array(data, dtype=dtype, copy=copy)
        arr = arr.view(subclass)
        if extra is not None:
            arr.extra = extra
        elif hasattr(data, "extra"):
            arr.extra = data.extra
        return arr

    def __array_finalize__(self, other):
        print " __array_finalize__"
        self.__dict__ = getattr(other, "__dict__", {})

    def __str__(self):
        return str(type(self)) + "\n" + ndarray.__str__(self) + \
            "\n__dict__ : " + str(self.__dict__)

##===

class PicklableArraySubclass(ArraySubclass):

    def __getstate__(self):
        print "__getstate__"
        return self.__dict__

    def __setstate__(self, d):
        print "__setstate__"
        self.__dict__ = d

##===

print "\n\n** Create a Foo object, then create a copy via pickling."
original = Foo("pickle me please")
print original
clone = loads(dumps(original))
print clone

print "\n\n** Create a SuperFoo object, just to watch __setstate__ and __getstate__."
original = SuperFoo("pickle me too, please")
print original
clone = loads(dumps(original))
print clone

print "\n\n** Create a numpy ndarray, then create a copy via pickling."
original = array(((1,2,3),(4,5,6)))
print original
clone = loads(dumps(original))
print clone

print "\n\n** Create an ArraySubclass object, with meta-data..."
original = ArraySubclass(((9,8,7),(6,5,4)), extra = "pickle me PLEASE!")
print original
print "\n...now attempt to create a copy via pickling."
clone = loads(dumps(original))
print clone

print "\n\n** That failed, try a PicklableArraySubclass..."
original = PicklableArraySubclass(((1,2),(3,4)), extra = "pickle, dangit!!!")
print original
print "\n...now try to create a copy of the PicklableArraySubclass via pickling."
clone = loads(dumps(original))
print clone


========== end code, begin output ==========


** Create a Foo object, then create a copy via pickling.
<class '__main__.Foo'>
{'contents': 'pickle me please'}
<class '__main__.Foo'>
{'contents': 'pickle me please'}


** Create a SuperFoo object, just to watch __setstate__ and
__getstate__.
<class '__main__.SuperFoo'>
{'contents': 'pickle me too, please'}
__getstate__
__setstate__
<class '__main__.SuperFoo'>
{'bonus': 'added during pickling', 'finale': 'added during
unpickling', 'contents': 'pickle me too, please'}


** Create a numpy ndarray, then create a copy via pickling.
[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]


** Create an ArraySubclass object, with meta-data...
__new__
__array_finalize__
__array_finalize__
__array_finalize__
<class '__main__.ArraySubclass'>
[[9 8 7]
[6 5 4]]
__dict__ : {'extra': 'pickle me PLEASE!'}

...now attempt to create a copy via pickling.
__array_finalize__
__array_finalize__
__array_finalize__
<class '__main__.ArraySubclass'>
[[9 8 7]
[6 5 4]]
__dict__ : {}


** That failed, try a PicklableArraySubclass...
__new__
__array_finalize__
__array_finalize__
__array_finalize__
<class '__main__.PicklableArraySubclass'>
[[1 2]
[3 4]]
__dict__ : {'extra': 'pickle, dangit!!!'}

...now try to create a copy of the PicklableArraySubclass via pickling.
 __array_finalize__
__setstate__
Traceback (most recent call last):
  File "minimal ndarray subclass example for posting.py", line 109, in <module>
    clone = loads(dumps(original))
  File "/usr/lib/python2.6/pickle.py", line 1374, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.6/pickle.py", line 1217, in load_build
    setstate(state)
  File "minimal ndarray subclass example for posting.py", line 75, in __setstate__
    self.__dict__ = d
TypeError: __dict__ must be set to a dictionary, not a 'tuple'
Exit code: 1

========== end output ==========

Observations:

My first three examples work fine. The second example, my SuperFoo
class, shows that I can intercept pickling attempts and add data to
__dict__.

Problems appear in the fourth example, when I try to pickle an
ArraySubclass object. It gets the array itself, but __dict__ is not
copied, despite the fact that that is supposed to be Python's default
behavior, and my first example did indeed show this default behavior.

So in my fifth and final example, the PicklableArraySubclass object,
I tried to override __getstate__ and __setstate__ just as I did with
SuperFoo. Here's what I see: 1) __getstate__ is NEVER called. 2)
__setstate__ is called, but it's expecting a dictionary as an
argument, and that dictionary is absent.

What's up with that?
 

John Ladasky

Following up to my own post...

What's up with that?

Apparently, "what's up" is that I will need to implement a third
method in my ndarray subclass -- namely, __reduce__.

http://www.mail-archive.com/[email protected]/msg02446.html

I'm burned out for tonight, I'll attempt to grasp what __reduce__ does
tomorrow.

Again, I'm going to point out that, given the extent that
multiprocessing depends upon pickling, pickling should be made
easier. This is Python, for goodness' sake! I'm still surprised at
the hoops I've got to jump through.
 

Philip Semanchuk

Following up to my own post...



Apparently, "what's up" is that I will need to implement a third
method in my ndarray subclass -- namely, __reduce__.

http://www.mail-archive.com/[email protected]/msg02446.html

I'm burned out for tonight, I'll attempt to grasp what __reduce__ does
tomorrow.

Again, I'm going to point out that, given the extent that
multiprocessing depends upon pickling, pickling should be made
easier. This is Python, for goodness' sake! I'm still surprised at
the hoops I've got to jump through.

Hi John,
My own experience has been that when I reach a surprising level of hoop jumping, it usually means there's an easier path somewhere else that I'm neglecting.

But if pickling subclasses of numpy.ndarray objects is what you really feel you need to do, then yes, I think asking on the numpy list is the best idea.


Good luck
Philip
 

Robert Kern

OK, but SUBCLASSES of numpy.ndarray are not, in my hands, pickling as
I would expect. I already have lots of code that is based on such
subclasses, and they do everything I want EXCEPT pickle correctly. I
may have to direct this line of questions to the numpy discussion
group after all.

Define the __reduce_ex__() method, not __getstate__(), __setstate__().

http://docs.python.org/release/2.6.6/library/pickle.html#pickling-and-unpickling-extension-types

ndarrays are extension types, so they use that mechanism.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

John Ladasky

Define the __reduce_ex__() method, not __getstate__(), __setstate__().

http://docs.python.org/release/2.6.6/library/pickle.html#pickling-and...

ndarrays are extension types, so they use that mechanism.

Thanks, Robert, as you can see, I got on that track shortly after I
posted my code example. This is apparently NOT a numpy issue, it's an
issue for pickling all C extension types.

Is there a way to examine a Python object, and determine whether it's
a C extension type or not? Or do you have to deduce that from the
documentation and/or the source code?

I started hunting through the numpy source code last night. It's
complicated. I haven't found the ndarray object definition yet.
Perhaps because I was looking through .py files, when I actually
should have been looking through .c files?
 

Robert Kern

Thanks, Robert, as you can see, I got on that track shortly after I
posted my code example. This is apparently NOT a numpy issue, it's an
issue for pickling all C extension types.

Yes, but seriously, you should ask on the numpy mailing list. You will probably
run into more numpy-specific issues. At least, we'd have been able to tell you
things like "ndarray is an extension type, so look at that part of the
documentation" quicker.
Is there a way to examine a Python object, and determine whether it's
a C extension type or not?

For sure? No, not really. Not at the Python level, at least. You may be able to
do something at the C level, I don't know.
Or do you have to deduce that from the
documentation and/or the source code?

I started hunting through the numpy source code last night. It's
complicated. I haven't found the ndarray object definition yet.
Perhaps because I was looking through .py files, when I actually
should have been looking through .c files?

Yes. The implementation for __reduce__ is in numpy/core/src/multiarray/methods.c
as array_reduce(). You may want to look in numpy/ma/core.py for the definition
of MaskedArray. It shows how you would define __reduce__, __getstate__ and
__setstate__ for a subclass of ndarray.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

sturlamolden

PicklingError: Can't pickle <class
'multiprocessing.sharedctypes.c_double_Array_10'>: attribute lookup
multiprocessing.sharedctypes.c_double_Array_10 failed

Hehe :D

That is why programmers should not mess with code they don't
understand!

Gaël and I wrote shmem to avoid multiprocessing.sharedctypes, because
they cannot be pickled (they are shared by handle inheritance)! To do
this we used raw Windows API and Unix System V IPC instead of
multiprocessing.Array, and the buffer is pickled by giving it a name
in the file system. Please be informed that the code on Bitbucket has
been "fixed" by someone who doesn't understand my code. "If it ain't
broke, don't fix it."

http://folk.uio.no/sturlamo/python/sharedmem-feb13-2009.zip

Known issues/bugs: 64-bit support is lacking, and os._exit in
multiprocessing causes a memory leak on Linux.

Maybe. If the __reduce_ex__() method is implemented properly (and
multiprocessing bugs aren't getting in the way), you ought to be able to pass
them to a Pool just fine. You just need to make sure that the shared arrays are
allocated before the Pool is started. And this only works on UNIX machines. The
shared memory objects that shmarray uses can only be inherited. I believe that's
what Sturla was getting at.

It's a C extension that gives a buffer to NumPy. Then YOU changed how
NumPy pickles arrays referencing these buffers, using
pickle.copy_reg :)

Sturla
 

sturlamolden

https://bitbucket.org/cleemesser/numpy-sharedmem/src/3fa526d11578/shm...

I've added a few lines to this code which allow subclassing the
shared memory array, which I need (because my neural net objects are
more than just the array, they also contain meta-data).  But I've run
into some trouble doing the actual sharing part.  The shmarray class
CANNOT be pickled.

That is hilarious :)

I see that the bitbucket page has my and Gaëls name on it, but the
code is changed and broken beyond repair! I don't want to be
associated with that crap!

Their "shmarray.py" will not work -- ever. It fails in two ways:

1. multiprocessing.Array cannot be pickled (as you noticed). It is
shared by handle inheritance. Thus we (that is Gaël and I) made a
shmem buffer object that could be pickled by giving it a name in the
file system, instead of sharing it anonymously by inheriting the
handle. Obviously those behind the bitbucket page don't understand the
difference between named and anonymous shared memory (that is, System
V IPC and BSD mmap, respectively.)

2. By subclassing numpy.ndarray a pickle dump would encode a copy of
the buffer. But that is what we want to avoid! We want to share the
buffer itself, not make a copy of it! So we changed how numpy pickles
arrays pointing to shared memory, instead of subclassing ndarray. I
did that by slightly modifying some code written by Robert Kern.

http://folk.uio.no/sturlamo/python/sharedmem-feb13-2009.zip
Known issues/bugs: 64-bit support is lacking, and os._exit in
multiprocessing causes a memory leak on Linux.


Sturla
 

sturlamolden

http://folk.uio.no/sturlamo/python/sharedmem-feb13-2009.zip
Known issues/bugs: 64-bit support is lacking, and os._exit in
multiprocessing causes a memory leak on Linux.

I should probably fix it for 64-bit now. Just recompiling with 64-bit
integers will not work, because I intentionally hardcoded the higher
32 bits to 0. It doesn't help to represent the lower 32 bits with a
64-bit integer (which it seems someone actually has tried :-D). The
memory leak on Linux is pesky: os._exit prevents clean-up code from
executing, but unlike Windows, the Linux kernel does no reference
counting. I am worried we actually need to make a small kernel driver
to make this work properly for Linux, since os._exit makes the current
user-space reference counting fail on child process exit.


Sturla
 

sturlamolden

I should probably fix it for 64-bit now. Just recompiling with 64-bit
integers will not work, because I intentionally hardcoded the higher
32 bits to 0.

That was easy, 64-bit support for Windows is done :)

Now I'll just have to fix the Linux code, and figure out what to do
with os._exit preventing clean-up on exit... :-(

Sturla
 

John Ladasky

That was easy, 64-bit support for Windows is done :)

Now I'll just have to fix the Linux code, and figure out what to do
with os._exit preventing clean-up on exit... :-(

Sturla

Hi Sturla,

Thanks for finding my discussion! Yes, it's about passing numpy
arrays to multiple processors. I'll accomplish that any way that I
can.

AND thanks to the discussion provided here by Philip and Robert, I've
become a bit less confused about pickling. I have working code which
subclasses ndarray, pickles using __reduce_ex__, and unpickles using
__setstate__. Just seven lines of additional code do what I want.

I almost understand it, too. :^) Seriously, I'm not altogether sure
about the purpose and structure of the tuple generated by
__reduce_ex__. It looks very easy to abuse and break. Referencing
tuple elements by number seems rather unPythonic. Has Python 3 made
any improvements to pickle, I wonder?

At the moment, my arrays are small enough that pickling them should
not be a very expensive process -- provided that I don't have to keep
doing it over and over again! I'm returning my attention to the
multiprocessing module now. Processes created by Pool do not appear
to persist. They seem to disappear after they are called. So I can't
call them once with the neural net array, and then once again (or even
repeatedly) with input data. Perhaps I need to look at Queue.
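
For what it's worth, a sketch of that "send the network once, then stream inputs" idea (the names and the placeholder evaluate() are invented): a long-lived worker Process receives the weights at start-up and then pulls input batches from a Queue until it sees a sentinel.

from multiprocessing import Process, Queue
import numpy

def evaluate(weights, inputs):
    return numpy.dot(inputs, weights)        # stand-in for the real network

def worker(weights, tasks, results):
    # The weights arrive once, with the Process; only batches move afterwards.
    while True:
        inputs = tasks.get()
        if inputs is None:                   # sentinel: shut down
            break
        results.put(evaluate(weights, inputs))

if __name__ == '__main__':
    weights = numpy.ones((4, 2))
    tasks, results = Queue(), Queue()
    w = Process(target=worker, args=(weights, tasks, results))
    w.start()
    for batch in (numpy.zeros((3, 4)), numpy.ones((3, 4))):
        tasks.put(batch)                     # each batch is pickled once
        print results.get()
    tasks.put(None)                          # tell the worker to exit
    w.join()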

I will retain a copy of YOUR shmarray code (not the Bitbucket code)
for some time in the future. I anticipate that my arrays might get
really large, and then copying them might not be practical in terms of
time and memory usage.
 

sturlamolden

Thanks for finding my discussion!  Yes, it's about passing numpy
arrays to multiple processors.  I'll accomplish that any way that I
can.

My preferred ways of doing this are:

1. Most cases for parallel processing are covered by libraries, even
for neural nets. This particularly involves linear algebra solvers and
FFTs, or calling certain expensive functions (sin, cos, exp) over and
over again. The solution here is optimised LAPACK and BLAS (Intel MKL,
AMD ACML, GotoBLAS, ATLAS, Cray libsci), optimised FFTs (FFTW, Intel
MKL, ACML), and fast vector math libraries (Intel VML, ACML). For
example, if you want to make multiple calls to the function "exp",
there is a good chance you want to use a vector math library. Despite
this, most Python programmers' instinct seems to be to use multiple
processes with numpy.exp or math.exp, or use multiple threads in C
with exp from libm (cf. math.h). Why go through this pain when a
single function call to Intel VML or AMD ACML (acml-vm) will be much
better? It is common to see scholars argue that "yes but my needs are
so special that I need to customise everything myself." Usually this
translates to "I don't know these libraries (not even that they exist)
and am happy to reinvent the wheel." Thus, if you think you need to
use manually managed threads or processes for parallel technical
computing, and even contemplate that the GIL might get in your way,
there is a 99% chance you are wrong. You will almost ALWAYS want to
use a fast library, either directly in Python or linked to your own
serial C or Fortran code. You have probably heard that "premature
optimisation is the root of all evil in computer programming." It
particularly applies here. Learn to use the available performance
libraries, and it does not matter from which language they are called
(C or Fortran will not be faster than Python!) This is one of the
major reasons Python can be used for HPC (high-performance computing)
even though the Python part itself is "slow". Most of these libraries
are available for free (GotoBLAS, ATLAS, FFTW, ACML), but Intel MKL
and VML require a license fee.

Also note that there are comprehensive numerical libraries you can
use, which can be linked with multi-threaded performance libraries
under the hood. Of particular interest are the Fortran libraries from
NAG and IMSL, which are the two gold standards of technical computing.
Also note that the linear algebra solvers of NumPy and SciPy in
Enthought Python Distribution are linked with Intel MKL. Enthought's
licenses are cheaper than Intel's, and you don't need to build NumPy or
SciPy against MKL yourself. Using scipy.linalg from EPD is likely to
cover your need for parallel computing with neural nets.


2. Use Cython, threading.Thread, and release the GIL. Perhaps we
should have a cookbook example in scipy.org on this. In the "nogil"
block you can call a library or do certain things that Cython allows
without holding the GIL.


3. Use C, C++ or Fortran with OpenMP, and call these using Cython,
ctypes or f2py. (I prefer Cython and Fortran 95, but any combination
will do.)


4. Use mpi4py with any MPI implementation.


Note that 1-4 are not mutually exclusive, you can always use a certain
combination.

I will retain a copy of YOUR shmarray code (not the Bitbucket code)
for some time in the future.  I anticipate that my arrays might get
really large, and then copying them might not be practical in terms of
time and memory usage.

The expensive overhead in passing a NumPy array to
multiprocessing.Queue is related to pickle/cPickle, not IPC or making
a copy of the buffer.

For any NumPy array you can afford to copy in terms of memory, just
work with copies.

The shared memory arrays I made are only useful for large arrays. They
are just as expensive to pickle in terms of time, but can be
inexpensive in terms of memory.

Also beware that the buffer is not copied to the pickle, so you need
to call .copy() to pickle the contents of the buffer.

But again, I'd urge you to consider a library or threads
(threading.Thread in Cython or OpenMP) before you consider multiple
processes. The reason I have not updated the sharedmem arrays for two
years is that I have come to the conclusion that there are better ways
to do this (particularly vendor-tuned libraries). But since they are
mostly useful with 64-bit (i.e. large arrays), I'll post an update
soon.

If you decide to use a multithreaded solution (or shared memory as
IPC), beware of "false sharing". If multiple processors write to the
same cache line (they can be up to 512K depending on hardware), you'll
create an invisible "GIL" that will kill any scalability. That is
because dirty cache lines need to be synchronized with RAM. "False
sharing" is one of the major reasons that "home-brewed" compute-
intensive code will not scale.

It is not uncommon to see Java programmers complain about Python's
GIL, and then they go on to write i/o bound or false shared code. Rest
assured that multi-threaded Java will not scale better than Python in
these cases :)


Regards,
Sturla
 

John Ladasky

My preferred ways of doing this are:

1. Most cases for parallel processing are covered by libraries, even
for neural nets. This particularly involves linear algebra solvers and
FFTs, or calling certain expensive functions (sin, cos, exp) over and
over again. The solution here is optimised LAPACK and BLAS (Intel MKL,
AMD ACML, GotoBLAS, ATLAS, Cray libsci), optimised FFTs (FFTW, Intel
MKL, ACML), and fast vector math libraries (Intel VML, ACML). For
example, if you want to make multiple calls to the function "exp",
there is a good chance you want to use a vector math library. Despite
this, most Python programmers' instinct seems to be to use multiple
processes with numpy.exp or math.exp, or use multiple threads in C
with exp from libm (cf. math.h). Why go through this pain when a
single function call to Intel VML or AMD ACML (acml-vm) will be much
better? It is common to see scholars argue that "yes but my needs are
so special that I need to customise everything myself." Usually this
translates to "I don't know these libraries (not even that they exist)
and am happy to reinvent the wheel."

Whoa, Sturla. That was a proper core dump!

You're right, I'm unfamiliar with the VAST array of libraries that you
have just described. I will have to look at them. It's true, I
probably only know of the largest and most widely-used Python
libraries. There are so many, who can keep track?

Now, I do have a special need. I've implemented a modified version of
the Fahlman Cascade Correlation algorithm that will not be found in
any existing libraries, and which I think should be superior for
certain types of problems. (I might even be able to publish this
algorithm, if I can get it working and show some examples?)

That doesn't mean that I can't use the vector math libraries that
you've recommended. As long as those libraries can take advantage of
my extra computing power, I'm interested. Note, however, that the
cascade evaluation does have a strong sequential requirement. It's
not a traditional three-layer network. In fact, describing a cascade
network according to the number of "layers" it has is not very
meaningful, because each hidden node is essentially its own layer.

So, there are limited advantages to trying to parallelize the
evaluation of ONE cascade network's weights against ONE input vector.
However, evaluating several copies of one cascade network's output,
against several different test inputs simultaneously, should scale up
nicely. Evaluating many possible test inputs is exactly what you do
when training a network to a data set, and so this is how my program
is being designed.

Thus, if you think you need to
use manually managed threads or processes for parallel technical
computing, and even contemplate that the GIL might get in your way,
there is a 99% chance you are wrong. You will almost ALWAYS want to
use a fast library, either directly in Python or linked to your own
serial C or Fortran code. You have probably heard that "premature
optimisation is the root of all evil in computer programming." It
particularly applies here.

Well, I thought that NUMPY was that fast library...

Funny how this works, though -- I built my neural net class in Python,
rather than avoiding numpy and going straight to wrapping code in C,
precisely because I wanted to AVOID premature optimization (for
unknown, and questionable gains in performance). I started on this
project when I had only a single-core CPU, though. Now that multi-
core CPUs are apparently here to stay, and I've seen just how long my
program takes to run, I want to make full use of multiple cores. I've
even looked at MPI. I'm considering networking to another multi-CPU
machine down the hall, once I have my program working.

But again, I'd urge you to consider a library or threads
(threading.Thread in Cython or OpenMP) before you consider multiple
processes.

My single-CPU neural net training program had two threads, one for the
GUI and one for the neural network computations. Correct me if I'm
wrong here, but -- since the two threads share a single Python
interpreter, this means that only a single CPU is used, right? I'm
looking at multiprocessing for this reason.

The reason I have not updated the sharedmem arrays for two
years is that I have come to the conclusion that there are better ways
to do this (particularly vendor-tuned libraries). But since they are
mostly useful with 64-bit (i.e. large arrays), I'll post an update
soon.

If you decide to use a multithreaded solution (or shared memory as
IPC), beware of "false sharing". If multiple processors write to the
same cache line (they can be up to 512K depending on hardware), you'll
create an invisible "GIL" that will kill any scalability. That is
because dirty cache lines need to be synchronized with RAM. "False
sharing" is one of the major reasons that "home-brewed" compute-
intensive code will not scale.

Even though I'm not formally trained in computer science, I am very
conscious of the fact that WRITING to shared memory is a problem,
cache or otherwise. At the very top of this thread, I pointed out
that my neural network training function would need READ-ONLY access
to two items -- the network weights, and the input data. Given that,
and my (temporary) struggles with pickling, I considered the shared-
memory approach as an alternative.
It is not uncommon to see Java programmers complain about Python's
GIL, and then they go on to write i/o bound or false shared code. Rest
assured that multi-threaded Java will not scale better than Python in
these cases :)

I've never been a Java programmer, and I hope it stays that way!
 
