Steven D'Aprano
Hello folks,
I'm designing an API for some lightweight calculator-like statistics
functions, such as mean, standard deviation, etc., and I want to support
missing values. Missing values should simply be ignored. E.g.:
mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.
My question is, should I accept None as the missing value, or a dedicated
singleton?
In favour of None: it's already there, no extra code required. People may
expect it to work.
Against None: it's too easy to add None to a data set by mistake, because
functions return None by default.
In favour of a dedicated MISSING singleton: it's obvious from context. It's
not much more work to implement than using None. It's hard to include by
accident. If None does creep into the data by accident, you get a nice
explicit exception.
Against MISSING: users may expect to be able to choose their own sentinel by
assigning to MISSING. I don't want to support that.
I've considered what other packages do:
R uses a special value, NA, to stand in for missing values. This is more or
less the model I wish to follow.
I believe that MATLAB treats float NANs as missing values. I consider this
an abuse of NANs and I won't be supporting that.
Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
cells, and give you a choice between ignoring text and treating it as zero.
E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
the AVERAGEA function returns 1.5.
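The two spreadsheet behaviours can be sketched in Python as follows. The function names average and averagea are my own stand-ins, and the sketch is modelled only on the single example above (text cells, no blank cells), not on the spreadsheets' full rules:

```python
from numbers import Number


def average(cells):
    """Ignore non-numeric cells entirely."""
    nums = [c for c in cells if isinstance(c, Number)]
    return sum(nums) / len(nums)


def averagea(cells):
    """Count non-numeric cells as zero."""
    nums = [c if isinstance(c, Number) else 0 for c in cells]
    return sum(nums) / len(nums)


cells = [1, 2, "spam", 3]
print(average(cells))   # 6/3
print(averagea(cells))  # 6/4
```

The only difference is the denominator: ignoring a cell drops it from the count, while treating it as zero keeps it in the count.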
numpy uses masked arrays, which is probably overkill for my purposes; I am
gratified to see it doesn't abuse NANs:
>>> import numpy as np
>>> a = np.array([1, 2, float('nan'), 3])
>>> np.mean(a)
nan
numpy also treats None as an error:
>>> a = np.array([1, 2, None, 3])
>>> np.mean(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 860, in mean
    return mean(axis, dtype, out)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
I would appreciate any comments, advice or suggestions.