numpy NaN, not surviving pickle/unpickle?

John Ladasky · Sep 13, 2009

Hi folks,

I am aware that numpy has its own discussion group, which is hosted at
gmane. Unfortunately, I can't seem to get in to gmane today.

In any case, I'm not sure whether I have a problem with numpy, or with
my understanding of the Python pickle module, so I'm posting here.

I am pickling numpy.ndarray objects to disk which are of type "float",
but which may include NaN in some cells. When I unpickle these
objects and then test for the presence of NaN, the test fails. Here's
a minimal sample program, and its output:

=== program ========================================

## numpy nan pickle test.py

import pickle
from numpy import *

print "\n\nNaN equivalency tests:\n"
x, y = nan, NaN # Capitalization reality check
print "x =", x, ", y =", y
print "x is nan:", x is nan
print "y is NaN:", y is NaN
print "x is y:", x is y

A0 = array([[1.2, nan], [3.4, 5.6]])
print "\n\nPickling and saving this array to disk:\n\n", A0
f0 = open("test array pickle.py", "w")
pickle.dump(A0, f0)
f0.close()
print "\nArray saved to disk."

f1 = open("test array pickle.py", "r")
A1 = pickle.load(f1)
f1.close()
print "\n\nThe array reloaded from the disk is:\n\n", A1
print "\narray[0,1] =", A1[0,1]
print "array[0,1] is nan:", A1[0,1] is nan, "\n\n"

=== output ========================================

NaN equivalency tests:

x = nan , y = nan
x is nan: True
y is NaN: True
x is y: True

Pickling and saving this array to disk:

[[ 1.2  NaN]
 [ 3.4  5.6]]

Array saved to disk.

The array reloaded from the disk is:

[[ 1.2  NaN]
 [ 3.4  5.6]]

array[0,1] = nan
array[0,1] is nan: False

============================================================================

The last line of my output is unexpected. I've printed the contents
of the cell in the array, and it says that it contains "nan". But
when I try the same equivalency test that I tried in the first few
lines of the program (with unpickled objects), this time it says that
my test object isn't "nan".

I thought that Python was supposed to make values and even objects
portable?

Obligatory version information:

Numpy: 1.0.4
Python: 2.5.2
OS: Ubuntu Linux 8.04

Thanks for any help!

Robert Kern · Sep 13, 2009

John said:
Hi folks,

I am aware that numpy has its own discussion group, which is hosted at
gmane. Unfortunately, I can't seem to get in to gmane today.

It is not hosted at GMane. It just has a GMane mirror.

http://www.scipy.org/Mailing_Lists

In any case, I'm not sure whether I have a problem with numpy, or with
my understanding of the Python pickle module, so I'm posting here.

I am pickling numpy.ndarray objects to disk which are of type "float",
but which may include NaN in some cells. When I unpickle these
objects and then test for the presence of NaN, the test fails.

The problem is that you are trying to use "is" to compare by Python object
identity. Except for dtype=object arrays, the object identities of the
individual elements that you extract from numpy arrays are never guaranteed.
Usually, they will be different. You need to use numpy.isnan() to
determine whether an object is a NaN.
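A minimal sketch of the difference between the two tests, written for Python 3 and a current numpy rather than the Python 2.5 / numpy 1.0.4 setup in the original post. The variable names are illustrative only, and the round trip goes through memory instead of a file.

```python
import pickle

import numpy as np

original = np.array([[1.2, np.nan], [3.4, 5.6]])
restored = pickle.loads(pickle.dumps(original))  # in-memory pickle round trip

element = restored[0, 1]

# "is" compares object identity: indexing builds a fresh numpy scalar,
# so it is never the same object as the module-level np.nan.
identity_check = element is np.nan

# np.isnan tests the value, which is what survives pickling.
value_check = bool(np.isnan(element))

# np.isnan also works element-wise on the whole array.
nan_mask = np.isnan(restored)

print("identity:", identity_check)
print("isnan:   ", value_check)
print(nan_mask)
```

The NaN value itself comes through the pickle intact; only the identity test is unreliable.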

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

John Ladasky · Sep 13, 2009

Hi Robert,

Thanks for the quick reply.

The problem is that you are trying to use "is" to compare by Python object
identity. Except for dtype=object arrays, the object identities of the
individual elements that you extract from numpy arrays are never guaranteed.
Usually, they will be different. You need to use numpy.isnan() to
determine whether an object is a NaN.

OK, so there's a dedicated function in numpy to handle this. Thanks!

I tried "x is NaN" after noting the obvious, that any equality or
inequality test involving NaN will return False.
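A small caveat on the statement above, as a sketch in plain Python 3: under IEEE 754 the ordered comparisons and `==` are all False for NaN, but `!=` is the exception and returns True. That is why `x != x` is sometimes used as a NaN test, although `math.isnan()` or `numpy.isnan()` states the intent more clearly.

```python
import math

nan = float("nan")

results = {
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
    "nan < 1.0": nan < 1.0,
    "nan >= 1.0": nan >= 1.0,
}
for expression, value in results.items():
    print(expression, "->", value)

# Two value-based NaN tests that do not depend on object identity.
self_compare = nan != nan
library_check = math.isnan(nan)
```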

In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.

Robert Kern · Sep 13, 2009

John said:
In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.

Why do you think that? We would have to keep a reference around to every scalar
object that gets created and check against that cache whenever someone accesses
an element in order to reuse the previously created object. That slows element
access down for essentially no benefit.


Carl Banks · Sep 14, 2009

In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.

Unlike Python lists, numpy arrays don't store objects. They store the
underlying number, not the object containing the number. So whenever
you get a value from a numpy array, Python (usually) has to create a
new object for it.
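The point can be sketched as follows (Python 3, current numpy; all names are illustrative). A list hands back the object it holds, a float64 array builds a new scalar object on each access, and a dtype=object array behaves like the list because it stores references.

```python
import numpy as np

values = [1.5, 2.5]
as_list = list(values)
as_array = np.array(values)  # dtype float64: raw 8-byte numbers, no objects

# A list returns the very object it stores.
list_same = as_list[0] is as_list[0]

# A float64 array creates a new scalar object for every element access.
array_same = as_array[0] is as_array[0]

# The array's storage is just the packed numbers: 2 elements * 8 bytes.
raw_size = len(as_array.tobytes())

# A dtype=object array stores references, so identity is preserved.
marker = float("nan")
as_objects = np.empty(1, dtype=object)
as_objects[0] = marker
object_same = as_objects[0] is marker

print(list_same, array_same, raw_size, object_same)
```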

Carl Banks

Steven D'Aprano · Sep 14, 2009

Why do you think that? We would have to keep a reference around to every
scalar object that gets created and check against that cache whenever
someone accesses an element in order to reuse the previously created
object. That slows element access down for essentially no benefit.

Exactly -- there are 2**53 distinct floats on most IEEE systems, the vast
majority of which might as well be "random". What's the point of caching
numbers like 2.5209481723210079? Chances are it will never come up again
in a calculation.

There may be something to be said for caching "common" floats, like pi,
small integers (0.0, 1.0, 2.0, ...), 0.5, 0.25 and similar, but I doubt
the memory savings would be worth the extra complexity.

You can do your own caching: pass every calculation result through the
following:

    _cache = {}
    def cache(f):
        """Cache and return float f."""
        if f in _cache:
            return _cache[f]
        _cache[f] = f
        return f

John Ladasky · Sep 14, 2009

Unlike Python lists, numpy arrays don't store objects.

That would be the crux of it, I think. I've gotten so used to the
behavior of Python lists that I now have to unlearn it!

Gabriel Genellina · Sep 14, 2009

On Sun, 13 Sep 2009 20:53:26 -0300, Steven D'Aprano wrote:

There may be something to be said for caching "common" floats, like pi,
small integers (0.0, 1.0, 2.0, ...), 0.5, 0.25 and similar, but I doubt
the memory savings would be worth the extra complexity.

I read some time ago that simply caching 0.0 appreciably reduced the
memory usage of a Zope application.
(Note that Zope relies on pickling and unpickling objects all the time, so
even if two objects started as the "same" zero, they may become different
at a later time.)

Terry Reedy · Sep 14, 2009

Gabriel said:
On Sun, 13 Sep 2009 20:53:26 -0300, Steven D'Aprano wrote:

Pi is already cached -- in the math module.
Zero is not because one can easily write zero=0.0, etc.
The main memory saving comes on allocation of large arrays or of
multiple medium arrays. For that, one can use one named object.

It is easy to cache and test for ints in a contiguous range.
Cached ints are heavily reused in the interpreter before it executes a
line of code.

Built-in equality tests for several floats would slow down all float
code, and the interpreter does not use floats for its internal
operations. So the idea was considered and rejected by the devs.

Mark Dickinson · Sep 15, 2009

You are missing a few orders of magnitude here; there are approx. 2 ** 64
distinct floats. 2 ** 53 is the mantissa of regular floats. There are
2**52 floats X where 1.0 <= X < 2.0.
The number of "normal" floats is 2 ** 64 - 2 ** 52 + 1.

Since we're being picky here:

Don't you mean 2 ** 64 - 2 ** 54 + 1?
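These counts can be checked from the bit patterns, as a sketch that assumes Python floats are IEEE 754 binary64 (true on common platforms) and takes "normal" to mean a biased exponent from 1 to 2046. The helper `bits` is a hypothetical name used only here. Counted this way, the normal floats come to 2 ** 64 - 2 ** 54; the "+ 1" in the figures above is not reproduced by this count.

```python
import struct
import sys

def bits(x):
    """Return the 64-bit pattern of a float as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Significand precision in bits, including the implicit leading bit.
significand_bits = sys.float_info.mant_dig

# Consecutive positive floats have consecutive bit patterns, so the
# difference counts the floats X with 1.0 <= X < 2.0.
between_one_and_two = bits(2.0) - bits(1.0)

# Every 64-bit pattern, including NaNs, infinities, zeros and subnormals.
total_patterns = 2 ** 64

# Normal floats: 2 signs * 2046 exponent values * 2**52 fraction values.
normal_count = 2 * 2046 * 2 ** 52

print(significand_bits, between_one_and_two == 2 ** 52)
print(normal_count == 2 ** 64 - 2 ** 54)
```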
