numpy NaN, not surviving pickle/unpickle?

John Ladasky

Hi folks,

I am aware that numpy has its own discussion group, which is hosted at
gmane. Unfortunately, I can't seem to get in to gmane today.

In any case, I'm not sure whether I have a problem with numpy, or with
my understanding of the Python pickle module, so I'm posting here.

I am pickling numpy.ndarray objects to disk which are of type "float",
but which may include NaN in some cells. When I unpickle these
objects and then test for the presence of NaN, the test fails. Here's
a minimal sample program, and its output:


=== program ========================================


## numpy nan pickle test.py

import pickle
from numpy import *

print "\n\nNaN equivalency tests:\n"
x, y = nan, NaN # Capitalization reality check
print "x =", x, ", y =", y
print "x is nan:", x is nan
print "y is NaN:", y is NaN
print "x is y:", x is y

A0 = array([[1.2, nan], [3.4, 5.6]])
print "\n\nPickling and saving this array to disk:\n\n", A0
f0 = open("test array pickle.py", "w")
pickle.dump(A0, f0)
f0.close()
print "\nArray saved to disk."

f1 = open("test array pickle.py", "r")
A1 = pickle.load(f1)
f1.close()
print "\n\nThe array reloaded from the disk is:\n\n", A1
print "\narray[0,1] =", A1[0,1]
print "array[0,1] is nan:", A1[0,1] is nan, "\n\n"


=== output ========================================


NaN equivalency tests:

x = nan , y = nan
x is nan: True
y is NaN: True
x is y: True


Pickling and saving this array to disk:

[[ 1.2  NaN]
 [ 3.4  5.6]]

Array saved to disk.


The array reloaded from the disk is:

[[ 1.2  NaN]
 [ 3.4  5.6]]

array[0,1] = nan
array[0,1] is nan: False


============================================================================


The last line of my output is unexpected. I've printed the contents
of the cell in the array, and it says that it contains "nan". But
when I try the same equivalency test that I tried in the first few
lines of the program (with unpickled objects), this time it says that
my test object isn't "nan".

I thought that Python was supposed to make values and even objects
portable?


Obligatory version information:

Numpy: 1.0.4
Python: 2.5.2
OS: Ubuntu Linux 8.04


Thanks for any help!
 
Robert Kern

John said:
Hi folks,

I am aware that numpy has its own discussion group, which is hosted at
gmane. Unfortunately, I can't seem to get in to gmane today.

It is not hosted at GMane. It just has a GMane mirror.

http://www.scipy.org/Mailing_Lists

John said:
In any case, I'm not sure whether I have a problem with numpy, or with
my understanding of the Python pickle module, so I'm posting here.

I am pickling numpy.ndarray objects to disk which are of type "float",
but which may include NaN in some cells. When I unpickle these
objects and then test for the presence of NaN, the test fails.

The problem is that you are trying to use "is" to compare by Python object
identity. Except for dtype=object arrays, the object identities of the
individual elements that you extract from numpy arrays are never guaranteed.
Usually, they will be different. You need to use numpy.isnan() to
determine whether an object is a NaN.
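
As a rough illustration of the difference, the sketch below round-trips the array from the original post through pickle in memory and compares the identity test with numpy.isnan(). It is written for a current Python 3 and NumPy rather than the Python 2.5.2 / Numpy 1.0.4 setup mentioned in the post, and the variable names are made up for the example.

```python
import pickle

import numpy as np

# Same data as in the original post.
a0 = np.array([[1.2, np.nan], [3.4, 5.6]])

# Round-trip through pickle in memory instead of a file on disk.
a1 = pickle.loads(pickle.dumps(a0))

elem = a1[0, 1]

# Identity test: elem is a freshly created numpy scalar, not the np.nan object.
identity_check = elem is np.nan

# Value test: ask numpy whether the value is a NaN.
value_check = bool(np.isnan(elem))

# numpy.isnan() also works element-wise on a whole array.
nan_mask = np.isnan(a1)

print("elem is np.nan:", identity_check)
print("np.isnan(elem):", value_check)
print(nan_mask)
```

The element-wise mask is usually the more convenient form when the question is "where are the NaNs in this array?".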

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
John Ladasky

Hi Robert,

Thanks for the quick reply.

The problem is that you are trying to use "is" to compare by Python object
identity. Except for dtype=object arrays, the object identities of the
individual elements that you extract from numpy arrays are never guaranteed.
Usually, they will be different. You need to use numpy.isnan() to
determine whether an object is a NaN.

OK, so there's a dedicated function in numpy to handle this. Thanks!

I tried "x is NaN" after noting the obvious, that any equality or
inequality test involving NaN will return False.
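
One detail worth checking here: under IEEE 754 rules, == and the ordering comparisons involving NaN return False, but != returns True. A small sketch using only the standard library follows. math.isnan() is the plain-Python counterpart of numpy.isnan(); it was added in Python 2.6, so it is not available in the 2.5.2 mentioned above. The helper name is_nan is made up for the example.

```python
import math

nan = float("nan")

# Collect the results of the usual comparison operators applied to NaN.
comparisons = {
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
    "nan < 1.0": nan < 1.0,
    "nan > 1.0": nan > 1.0,
    "nan >= nan": nan >= nan,
}
for label, result in comparisons.items():
    print(label, "->", result)


def is_nan(x):
    """Hand-rolled NaN test: NaN is the only float that differs from itself."""
    return x != x


print(is_nan(nan), math.isnan(nan), is_nan(1.0))
```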

In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.
 
Robert Kern

John said:
In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.

Why do you think that? We would have to keep a reference around to every scalar
object that gets created and check against that cache whenever someone accesses
an element in order to reuse the previously created object. That slows element
access down for essentially no benefit.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Carl Banks

In my leisure time, I would like to dig deeper into the issue of why
object identities are not guaranteed for elements in numpy arrays...
with elements of type "float", at least, I thought this would be
trivial.

Unlike Python lists, numpy arrays don't store objects. They store the
underlying numbers, not the objects containing the numbers. So whenever
you get a value from a numpy array, Python (usually) has to create a
new object for it.
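
A minimal sketch of that difference, assuming a current Python 3 and NumPy (the variable names are invented for the illustration). It compares a list, a float64 array, and a dtype=object array built from the same Python float.

```python
import numpy as np

x = 1.5
lst = [x, 2.5]
arr = np.array(lst)                # float64 array: stores the raw 8-byte values
obj = np.array(lst, dtype=object)  # object array: stores references to the objects

list_same = lst[0] is x            # the list holds the very same object
arr_same = arr[0] is x             # a new numpy scalar wraps the stored value
arr_twice = arr[0] is arr[0]       # each access builds another scalar object
obj_same = obj[0] is x             # the object array hands back the original

print("list     :", list_same)
print("float64  :", arr_same, arr_twice)
print("object   :", obj_same)
print("equal    :", arr[0] == x)   # the values still compare equal
```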


Carl Banks
 
Steven D'Aprano

Why do you think that? We would have to keep a reference around to every
scalar object that gets created and check against that cache whenever
someone accesses an element in order to reuse the previously created
object. That slows element access down for essentially no benefit.


Exactly -- there are 2**53 distinct floats on most IEEE systems, the vast
majority of which might as well be "random". What's the point of caching
numbers like 2.5209481723210079? Chances are it will never come up again
in a calculation.

There may be something to be said for caching "common" floats, like pi,
small integers (0.0, 1.0, 2.0, ...), 0.5, 0.25 and similar, but I doubt
the memory savings would be worth the extra complexity.

You can do your own caching: pass every calculation result through the
following:

_cache = {}
def cache(f):
    """Cache and return float f."""
    if f in _cache:
        return _cache[f]
    _cache[f] = f
    return f
 
John Ladasky

Unlike Python lists, numpy arrays don't store objects.  

That would be the crux of it, I think. I've gotten so used to the
behavior of Python lists that I now have to unlearn it!
 
Gabriel Genellina

On Sun, 13 Sep 2009 20:53:26 -0300, Steven D'Aprano wrote:
There may be something to be said for caching "common" floats, like pi,
small integers (0.0, 1.0, 2.0, ...), 0.5, 0.25 and similar, but I doubt
the memory savings would be worth the extra complexity.

I read some time ago that simply caching 0.0 appreciably reduced the
memory usage of a Zope application.
(Note that Zope relies on pickling and unpickling objects all the time, so
even if two objects started as the "same" zero, they may become different
at a later time.)
 
Terry Reedy

Gabriel said:
On Sun, 13 Sep 2009 20:53:26 -0300, Steven D'Aprano wrote:

Pi is already cached -- in the math module.
Zero is not because one can easily write zero=0.0, etc.
The main memory saving comes on allocation of large arrays or of
multiple medium arrays. For that, one can use one named object.

It is easy to cache and test for ints in a contiguous range.
Cached ints are heavily reused in the interpreter before it executes a
line of code.

Built-in equality tests for several floats would slow down all float
code. The interpreter does not use floats for its internal operations, so
the idea was considered and rejected by the devs.
 
Mark Dickinson

You are missing a few orders of magnitude here; there are approx. 2 ** 64
distinct floats.  2 ** 53 is the mantissa of regular floats.  There are
2**52 floats X where 1.0 <= X < 2.0.
The number of "normal" floats is 2 ** 64 - 2 ** 52 + 1.

Since we're being picky here:

Don't you mean 2 ** 64 - 2 ** 54 + 1? :)
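
For anyone who wants to check the arithmetic, here is a rough sketch that assumes the usual IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits). The helper name bits and the constants are made up for the example, and it does not try to settle the trailing "+ 1", which depends on exactly what is being counted.

```python
import struct


def bits(x):
    """Return the 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


SIGN_BITS, EXPONENT_BITS, FRACTION_BITS = 1, 11, 52

# Every 64-bit pattern, including infinities, NaNs, zeros and subnormals.
total_patterns = 2 ** (SIGN_BITS + EXPONENT_BITS + FRACTION_BITS)

# Patterns sharing one exponent field value (both signs, all fractions).
per_exponent = 2 ** (SIGN_BITS + FRACTION_BITS)

# Exponent all ones -> infinities and NaNs; exponent all zeros -> zeros and subnormals.
special_patterns = 2 * per_exponent

normal_count = total_patterns - special_patterns

# Floats X with 1.0 <= X < 2.0 occupy consecutive bit patterns.
unit_interval_count = bits(2.0) - bits(1.0)

print(normal_count == 2 ** 64 - 2 ** 54)
print(unit_interval_count == 2 ** 52)
```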
 
