Traceback when using multiprocessing, less than helpful?

John Ladasky

Hi folks,

Somewhat over a year ago, I struggled with implementing a routine using multiprocessing.Pool and numpy. I eventually succeeded, but I remember finding it very hard to debug. Now I have managed to provoke an error from that routine again, and once again, I'm struggling.

Here is the end of the traceback, starting with the last line of my code: "result = pool.map(evaluate, bundles)". After that, I'm into Python itself.

File ".../evaluate.py", line 81, in evaluate
result = pool.map(evaluate, bundles)
File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
raise self._value
ValueError: operands could not be broadcast together with shapes (1,3) (4)

Notice that no line of numpy appears in the traceback? Still, there are three things that make me think that this error is coming from numpy.

1. "raise self._value" means that an exception is stored in a variable, to be re-raised.

2. The words "operands" and "broadcast" do not appear anywhere in the source code of multiprocessing.pool.

3. The words "operands" and "broadcast" are common to numpy errors I have seen before. Numpy does many very tricky things when dealing with arrays of different dimensions and shapes.

Of course, I am sure that the bug must be in my own code. I even have old programs which are using my evaluate.evaluate() without generating errors. I am comparing the data structures that my working and my non-working programs send to pool.map(). I am comparing the code between my two programs. There is some subtle difference that I haven't spotted.

If I could only see the line of numpy code which is generating the ValueError, I would have a better chance of spotting the bug in my code. So, WHY isn't there any reference to numpy in my traceback?

Here's my theory. The numpy error was generated in a subprocess. The line "raise self._value" is intercepting the exception generated by my subprocess, and passing it back to the master Python interpreter.

Does re-raising an exception, and/or passing an exception from a subprocess, truncate a traceback? That's what I think I'm seeing.
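
A quick single-process test (toy code, nothing from my actual program) suggests that re-raising a stored exception does NOT truncate anything by itself:

def fail():
    return 1 / 0

try:
    fail()
except ZeroDivisionError as e:
    saved = e

# Re-raising in the same process prints the frames from fail() as well
# as this raise site, so the truncation must be happening somewhere
# else -- presumably at the process boundary.
raise saved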

Thanks for any advice!
 
Chris Angelico

Here is the end of the traceback, starting with the last line of my code: "result = pool.map(evaluate, bundles)". After that, I'm into Python itself.

File ".../evaluate.py", line 81, in evaluate
result = pool.map(evaluate, bundles)
File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
raise self._value
ValueError: operands could not be broadcast together with shapes (1,3) (4)

Notice that no line of numpy appears in the traceback? Still, there are three things that make me think that this error is coming from numpy.

Hmm. This looks like a possible need for the 'raise from' syntax. I
just checked multiprocessing/pool.py from the 3.4 alpha, and it has much
the same code as what you're seeing there, in the definition of AsyncResult (of which
MapResult is a subclass). The question is, though, how well does the
information traverse the process boundary?

ChrisA
 
John Ladasky

Hmm. This looks like a possible need for the 'raise from' syntax.

Thank you, Chris, that made me feel like a REAL Python programmer -- I just did some reading, and the "raise from" feature was not implemented until Python 3! And I might actually need it! :^)

I think that the article http://www.python.org/dev/peps/pep-3134/ is relevant. Reading it now. To be clear: the complete exception chain is stored on every exception, it's just not being displayed? I hope that's the case. I shouldn't have to install a "raise from" hook in multiprocessing.map_async itself.
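
For anyone following along, a minimal illustration of the PEP 3134 chaining I'm reading about (toy exceptions, not my actual code):

try:
    1 / 0
except ZeroDivisionError as exc:
    # 'from' stores the original exception as __cause__, and the
    # default traceback printer then shows both tracebacks in order.
    raise ValueError("bad operand") from exc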
 
Chris Angelico

Thank you, Chris, that made me feel like a REAL Python programmer -- I just did some reading, and the "raise from" feature was not implemented until Python 3! And I might actually need it! :^)

I think that the article http://www.python.org/dev/peps/pep-3134/ is relevant. Reading it now. To be clear: the complete exception chain is stored on every exception, it's just not being displayed? I hope that's the case. I shouldn't have to install a "raise from" hook in multiprocessing.map_async itself.

That PEP is all about the 'raise from' notation, yes; but the
exception chaining is presumably not being stored, or else you would
be able to see it in the default printout. So the best solution to
this is, most likely, a patch to multiprocessing to have it chain
exceptions properly. I think that would be considered a bugfix, and
thus back-ported to all appropriate versions (rather than a feature
enhancement that goes in 3.4 or 3.5 only).

What you could try is printing out the __cause__ and __context__ of
the exception, to see if there's anything useful in them; if there's
nothing, the next thing to try would be some kind of wrapper in your
inner handler (the evaluate function) that retains additional
information.

Oh, something else to try: It might be that the proper exception
chaining would happen, except that the info isn't traversing processes
properly due to pickling or something. Can you patch your code to use
threading instead of multiprocessing? That might reveal something.
(Don't worry about abysmal performance at this stage.)
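
(A low-effort way to do that swap, by the way: multiprocessing.dummy exposes the same Pool API backed by threads, so -- reusing the names from your snippet -- it should be close to a drop-in change:

from multiprocessing.dummy import Pool  # same interface, but threads

pool = Pool(4)
result = pool.map(evaluate, bundles)  # nothing crosses a process boundary

If the full traceback shows up with that version, pickling is the prime suspect.)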

Hopefully someone with more knowledge of Python's internals can help
out, here. One way or another, I suspect this will result in a tracker
issue.

ChrisA
 
John Ladasky

What you could try is

Suggestion 1:
printing out the __cause__ and __context__ of the exception, to see if there's anything useful in them;

Suggestion 2:
if there's nothing, the next thing to try would be some kind of wrapper in your inner handler (the evaluate function) that retains additional information.

Suggestion 3:
Oh, something else to try: It might be that the proper exception chaining would happen, except that the info isn't traversing processes properly due to pickling or something. Can you patch your code to use threading instead of multiprocessing? That might reveal something. (Don't worry about abysmal performance at this stage.)

I have tried the first suggestion, at the top level of my code. Here are the modified lines, and the output:

==============================================

try:
    out = evaluate(net, domain)
except ValueError as e:
    print(type(e))
    print(e)             # this just produces the exception string itself
    print(e.__context__)
    print(e.__cause__)
    raise e              # just so my program actually stops

==============================================

<class 'ValueError'>
operands could not be broadcast together with shapes (1,3) (4)
None
None

==============================================

So, once I catch the exception, both __context__ and __cause__ are None.

I will proceed as you have suggested -- but if anything comes to mind based on what I have already done, please feel free to chime in!
 
John Ladasky

Followup:

I didn't need to go as far as Chris Angelico's second suggestion. I haven't looked at certain parts of my own code for a while, but it turns out that I wrote it REASONABLY logically...

My evaluate() calls another function through pool.map_async() -- _evaluate(), which actually processes the data, on a single CPU. So I didn't need to hassle with threading, as Chris suggested. All I did was to import _evaluate in my top-level code, then change my function calls from evaluate() to _evaluate(). Out popped my numpy error, with a proper traceback. I can now debug it!
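
In other words, the whole debugging change amounted to something like this (paraphrased; 'bundle' stands for whatever one pool task would have received):

from evaluate import _evaluate

# Call the single-CPU worker directly instead of going through
# pool.map(); the failure now happens in this process, so the full
# numpy traceback gets printed.
out = _evaluate(bundle)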

I can probably refactor my code to make it even cleaner. I'll have to deal with the fact that pool.map() requires that all arguments to each subprocess be submitted as a single, iterable object. I didn't want to have to do this when I only had a single process to run, but perhaps the tradeoff will be acceptable.
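
The refactoring itself should be mechanical -- something like this sketch, with the argument names invented for illustration:

from multiprocessing import Pool

def _evaluate(bundle):
    # pool.map() hands each task exactly one object, so the worker
    # unpacks its own arguments from that object.
    net, domain = bundle
    ...

bundles = list(zip(nets, domains))  # one tuple per task
with Pool() as pool:
    results = pool.map(_evaluate, bundles)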

So now, for anyone who is still reading this: is it your opinion that the traceback that I obtained through multiprocessing.pool._map_async().get() SHOULD have allowed me to see what the ultimate cause of the exception was? I think so. Is it a bug? Should I request a bugfix? How do I go about doing that?
 
Ethan Furman

So now, for anyone who is still reading this: is it your
opinion that the traceback that I obtained through
multiprocessing.pool._map_async().get() SHOULD have allowed
me to see what the ultimate cause of the exception was?

It would certainly be nice.

I think so. Is it a bug? Should I request a bugfix? How do I go about doing that?

Check out bugs.python.org. Search for multiprocessing and tracebacks to see if anything is already there; if not,
create a new issue.
 
Terry Reedy

On 11/21/2013 12:01 PM, John Ladasky wrote:

This is a case where you need to dig into the code (or maybe the docs) a bit.

File ".../evaluate.py", line 81, in evaluate
    result = pool.map(evaluate, bundles)
File "/usr/lib/python3.3/multiprocessing/pool.py", line 228, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()

The call to _map_async gets a blank MapResult (a subclass of
ApplyResult), queues tasks to fill it in, and returns it. This call is
designed to always return normally, as task exceptions are caught and
assigned to MapResult._value in both ApplyResult._set and
MapResult._set.

result = MapResult(self._cache, chunksize, len(iterable), callback,
                   error_callback=error_callback)
self._taskqueue.put((((result._job, i, mapper, (x,), {})
                      for i, x in enumerate(task_batches)), None))
return result

It is the subsequent call to get() that 'fails', because it raises
the caught exception.

File "/usr/lib/python3.3/multiprocessing/pool.py", line 564, in get
    raise self._value

ValueError: operands could not be broadcast together with shapes (1,3) (4)

Notice that no line of numpy appears in the traceback? Still, there are three things that make me think that this error is coming from numpy.

It comes from one of your tasks as the 'result', and your tasks use numpy.
If I could only see the line of numpy code which is generating the ValueError, I would have a better chance of spotting the bug in my code.

Definitely.

So, WHY isn't there any reference to numpy in my traceback?

I suspect that raising the exception may replace its __traceback__
attribute. Anyway, there are three things I might try.

1. Use 3.3.3 or latest 3.4 to see if there is any improvement in output.
I vaguely remember a tracker issue that might be related.

2. _map_async takes an error_callback arg that defaults to None and
which is passed on to MapResult. When _value is set to an exception,
"error_callback(_value)" is called in ._set() before the later .get()
re-raises it. pool.map does not allow you to set either the (success)
callback or the error_callback, but pool.map_async does (this is the
difference between the two methods). So switch to the latter so you can
pass a function that uses the traceback module to print (or log) the
traceback attached to _value, assuming that there is one. (There is a sketch of this after item 3 below.)

3. If that does not work, wrap the current body of your task function in

try:
    <current suite>
except Exception as e:
    <use the traceback module to add the traceback to the message>
    raise e  # or a new exception
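
For the second idea, something like this (untested; 'evaluate' and 'bundles' are the names from the original post) would show whatever traceback, if any, actually arrives attached to _value:

import traceback

def report(exc):
    # The error_callback runs in the parent process as soon as a task
    # fails, before get() re-raises the stored exception.
    traceback.print_exception(type(exc), exc, exc.__traceback__)

async_result = pool.map_async(evaluate, bundles, error_callback=report)
result = async_result.get()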
 
John Ladasky

Check out bugs.python.org. Search for multiprocessing and tracebacks to see
if anything is already there; if not, create a new issue.


1. Use 3.3.3 or latest 3.4 to see if there is any improvement in output.
I vaguely remember a tracker issue that might be related.


All right, there appear to be two recent bug reports which are relevant.

http://bugs.python.org/issue13831
http://bugs.python.org/issue17836

The comments in the first link, from Richard Oudkerk, appear to indicate that pickling an Exception (so that it can be sent between processes) is difficult, perhaps impossible. I have never completely understood what can be pickled, and what cannot -- or, for that matter, why data needs to be pickled to pass it between processes.

In any case, a string representation of the traceback can be pickled. For debugging purposes, that can still help. So, if I understand everything correctly, in this link...

http://hg.python.org/cpython/rev/c4f92b597074/

...Richard submits his "hack" (his description) to Python 3.4, which pickles and passes the string. When time permits, I'll try it out. Or maybe I'll wait, since Python 3.4.0 is still in alpha.
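
If I understand the idea, it can even be imitated by hand in the meantime -- a sketch of my own wrapper, not Richard's actual patch:

import traceback

def _evaluate_wrapped(bundle):
    try:
        return _evaluate(bundle)
    except Exception:
        # The traceback object itself can't cross the process boundary,
        # but its string rendering can, and RuntimeError pickles fine.
        raise RuntimeError("worker failed:\n" + traceback.format_exc())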
 
Chris Angelico

or, for that matter, why data needs to be pickled to pass it between processes.

Oh, that part's easy. Let's leave the multiprocessing module out of it
for the moment; imagine you spin up two completely separate instances
of Python. Create some object in one of them; now, transfer it to the
other. How are you going to do it?

Ultimately, the operating system isn't going to give you facilities
for moving complex objects around - what you almost exclusively get is
streams of bytes (or occasionally messaged chunks with lengths, but
still of bytes). Pickling is one method of turning an object into a
stream of bytes, in such a way that it can be turned back into an
equivalent object on the other side. And therein is the problem with
exceptions; since the traceback includes references to stack frames
and such, it's not as simple as saying "Two to beam up" and hearing
the classic sound effect - somehow you need to transfer all the
appropriate information across processes.
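
You can actually watch the traceback get lost in that round trip without any subprocess at all:

import pickle

try:
    1 / 0
except ZeroDivisionError as e:
    exc = e

print(exc.__traceback__)                # <traceback object at 0x...>
clone = pickle.loads(pickle.dumps(exc))
print(clone.__traceback__)              # None -- the frames are gone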

ChrisA
 
John Ladasky

Oh, that part's easy. Let's leave the multiprocessing module out of it
for the moment; imagine you spin up two completely separate instances
of Python. Create some object in one of them; now, transfer it to the
other. How are you going to do it?

For what definition of "completely separate"?

If I have two instances of the same version of the Python interpreter running on the same hardware, and the same operating system, I expect I would just copy a block of memory from one interpreter to the other, and then write some new pointers. That kind of data sharing has to be the most common kind. It's also the simplest.

I understand that pickling allows sharing of Python objects between Python interpreters even if those interpreters run on different CPU's with different memory architecture, different operating systems, etc. It just seems like overkill to me to use pickling in the simple case.
 
Chris Angelico

For what definition of "completely separate"?

If I have two instances of the same version of the Python interpreter running on the same hardware, and the same operating system, I expect I would just copy a block of memory from one interpreter to the other, and then write some new pointers. That kind of data sharing has to be the most common kind. It's also the simplest.

Okay, so you copy a block of memory. Now how are you going to
guarantee that you picked up everything that object references? Python
objects frequently reference other objects:

send_me = [1.0, 2.0, 3.0]

The block of memory might have the addresses of those three floats,
but that'll be invalid in the target. Somehow you need to package up
this object and everything else you need.

Ultimately, you need some system for turning a single object reference
(a pointer, if you like) into the entire package of information needed
to recreate that object on the other side. That's what pickling is.
It's a compact format (with people to fight for its compactness; there's
current discussion elsewhere about that) that can be easily transferred
around, which refcounted blocks of memory can't be.
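
To make it concrete with your list example:

import pickle

send_me = [1.0, 2.0, 3.0]
payload = pickle.dumps(send_me)   # a self-contained byte string
received = pickle.loads(payload)  # an equivalent object, rebuilt
print(received == send_me, received is send_me)  # True False

Same value on the other side, but a brand-new object; nothing about the original's memory addresses survives the trip.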

ChrisA
 
