None versus MISSING sentinel -- request for design feedback

Discussion in 'Python' started by Steven D'Aprano, Jul 15, 2011.

  1. Hello folks,

    I'm designing an API for some lightweight calculator-like statistics
    functions, such as mean, standard deviation, etc., and I want to support
    missing values. Missing values should be just ignored. E.g.:

    mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.

    My question is, should I accept None as the missing value, or a dedicated
    singleton?

    In favour of None: it's already there, no extra code required. People may
    expect it to work.

    Against None: it's too easy to add None to a data set by mistake,
    because functions return None by default.

    In favour of a dedicated MISSING singleton: it's obvious from context. It's
    not a lot of work to implement compared to using None. It's hard to
    include it by mistake. If None does creep into the data by accident, you
    get a nice explicit exception.

    Against MISSING: users may expect to be able to choose their own sentinel by
    assigning to MISSING. I don't want to support that.


    I've considered what other packages do:-

    R uses a special value, NA, to stand in for missing values. This is more or
    less the model I wish to follow.

    I believe that MATLAB treats float NANs as missing values. I consider this
    an abuse of NANs and I won't be supporting that :p

    Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
    cells, and give you a choice between ignoring text and treating it as zero.
    E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
    the AVERAGEA function returns 1.5.
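
    The two spreadsheet behaviours can be imitated in a few lines. This is
    only a rough sketch: the names `average` and `averagea` are borrowed
    from the spreadsheet functions, and real spreadsheets have further
    rules (for booleans and blank cells, for example) that are not
    modelled here.

```python
import numbers


def average(cells):
    """Like AVERAGE: text cells are ignored."""
    nums = [c for c in cells if isinstance(c, numbers.Number)]
    return sum(nums) / float(len(nums))


def averagea(cells):
    """Like AVERAGEA: text cells count as zero."""
    nums = [c if isinstance(c, numbers.Number) else 0 for c in cells]
    return sum(nums) / float(len(nums))


cells = [1, 2, "spam", 3]
print(average(cells))   # 6/3 -> 2.0
print(averagea(cells))  # 6/4 -> 1.5
```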

    numpy uses masked arrays, which is probably over-kill for my purposes; I am
    gratified to see it doesn't abuse NANs:

    >>> import numpy as np
    >>> a = np.array([1, 2, float('nan'), 3])
    >>> np.mean(a)
    nan

    numpy also treats None as an error:

    >>> a = np.array([1, 2, None, 3])
    >>> np.mean(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 860, in mean
        return mean(axis, dtype, out)
    TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'


    I would appreciate any comments, advice or suggestions.


    --
    Steven
    Steven D'Aprano, Jul 15, 2011
    #1

  2. On Fri, Jul 15, 2011 at 3:28 PM, Steven D'Aprano wrote:
    > My question is, should I accept None as the missing value, or a dedicated
    > singleton?
    >
    > In favour of None: it's already there, no extra code required. People may
    > expect it to work.
    >
    > Against None: it's too easy to mistakenly add None to a data set by mistake,
    > because functions return None by default.


    I guess the question is: Why are the missing values there? If they're
    there because some function returned None because it didn't have a
    value to return, and therefore it's a missing value, then using None
    as "missing" would make a lot of sense. But if it's a more explicit
    concept of "here's a table of values, and the user said that this one
    doesn't exist", it'd be better to have an explicit MISSING. (Which I
    assume would be exposed as yourmodule.MISSING or something.)

    Agreed that float('nan') and "" and "spam" are all bad values for
    Missings. Possibly "" should come out as 0, but "spam" should
    definitely fail.

    Chris Angelico
    Chris Angelico, Jul 15, 2011
    #2

  3. Steven D'Aprano wrote in news:4e1fd009$0$29986$c3e8da3
    $ in gmane.comp.python.general:

    > I'm designing an API for some lightweight calculator-like statistics
    > functions, such as mean, standard deviation, etc., and I want to support
    > missing values. Missing values should be just ignored. E.g.:
    >
    > mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.


    If you can't make your mind up then maybe you shouldn't:

    MISSING = MissingObject()

    def mean(sequence, missing=MISSING):
        ...

    Rob.
    Rob Williscroft, Jul 15, 2011
    #3
  4. On 15Jul2011 15:28, Steven D'Aprano wrote:
    | In favour of None: it's already there, no extra code required. People may
    | expect it to work.

    Broadly, I like this one for the reasons you cite.

    | Against None: it's too easy to mistakenly add None to a data set by mistake,
    | because functions return None by default.

    This is a hazard everywhere, but won't such a circumstance normally
    break lots of stuff anyway? What's an example scenario for getting None
    by accident but still a bunch of non-None values? The main one I can
    imagine is a function with a return path that accidentally fails to
    return a value, e.g.:

    def f(x):
        if blah:
            return 7
        ...
        if foo:
            return 0
        # whoops!


    I suppose there's no scope for having the append-to-the-list step sanity
    check for the sentinel (be it None or otherwise)?
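
    For what it's worth, a check at the append step might look like the
    sketch below. `CheckedList` is a hypothetical name, not part of any
    library discussed here, and only `append()` is guarded.

```python
import numbers


class CheckedList(list):
    """Hypothetical list that refuses non-numeric items on append().

    Only append() is guarded in this sketch; extend(), insert(), slice
    assignment and the constructor would need the same treatment.
    """

    def append(self, item):
        if not isinstance(item, numbers.Number):
            raise TypeError("refusing to append non-number %r" % (item,))
        list.append(self, item)


def f(x):
    # Deliberately buggy: falls off the end and returns None for x <= 0.
    if x > 0:
        return x * 2


data = CheckedList()
data.append(f(1))       # fine, appends 2
try:
    data.append(f(-1))  # f returns None here
except TypeError as exc:
    print("caught early:", exc)
```

    The accidental None is then reported where it is produced rather than
    later, inside the statistics function.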

    | In favour of a dedicated MISSING singleton: it's obvious from context. It's
    | not a lot of work to implement compared to using None. Hard to accidentally
    | include it by mistake. If None does creep into the data by accident, you
    | get a nice explicit exception.

    I confess to being about to discard None as a sentinel in a bit of my
    own code, but only to allow None to be used as a valid value, using the
    usual idiom:

    class IQ(Queue):
        def __init__(self, ...):
            self._sentinel = object()
            ...
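
    Fleshed out a little, that idiom might look like the following sketch.
    The `close()` and `__iter__()` methods are assumptions added for
    illustration; the fragment above only shows the sentinel being
    created.

```python
import queue  # this module is named Queue in Python 2


class IQ(queue.Queue):
    """Iterable queue; a private sentinel marks the end of the stream."""

    def __init__(self, maxsize=0):
        queue.Queue.__init__(self, maxsize)
        # A fresh object() can never collide with real data, so None
        # stays available as an ordinary value.
        self._sentinel = object()

    def close(self):
        self.put(self._sentinel)

    def __iter__(self):
        while True:
            item = self.get()
            if item is self._sentinel:
                return
            yield item


q = IQ()
for item in (1, None, 3):
    q.put(item)
q.close()
print(list(q))  # [1, None, 3]
```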

    | Against MISSING: users may expect to be able to choose their own sentinel by
    | assigning to MISSING. I don't want to support that.

    Well, we don't have readonly values to play with :-(
    Personally I'd do what I did above: give it a "private" name like
    _MISSING so that people should expect to have inside (and unsupported,
    unguaranteed) knowledge if they fiddle with it. Or are you publishing
    the sentinel's name to your callers, i.e. may they really return _MISSING
    legitimately from their functions?

    Cheers,
    --
    Cameron Simpson DoD#743
    http://www.cskk.ezoshosting.com/cs/

    What's fair got to do with it? It's going to happen. - Lawrence of Arabia
    Cameron Simpson, Jul 15, 2011
    #4
  5. Guest

    On Jul 15, 8:08 am, Chris Angelico wrote:
    >
    > Agreed that float('nan') and "" and "spam" are all bad values for
    > Missings. Possibly "" should come out as 0


    "In the face of ambiguity, refuse the temptation to guess."

    As far as I'm concerned, I'd expect this to raise a TypeError...
    Guest, Jul 15, 2011
    #5
  6. Guest

    On Jul 15, 7:28 am, Steven D'Aprano wrote:
    >
    > I'm designing an API for some lightweight calculator-like statistics
    > functions, such as mean, standard deviation, etc., and I want to support
    > missing values. Missing values should be just ignored. E.g.:



    (snip)

    > Against None: it's too easy to mistakenly add None to a data set by mistake,
    > because functions return None by default.


    Yeps.

    > In favour of a dedicated MISSING singleton: it's obvious from context. It's
    > not a lot of work to implement compared to using None. Hard to accidentally
    > include it by mistake. If None does creep into the data by accident, you
    > get a nice explicit exception.
    >
    > Against MISSING: users may expect to be able to choose their own sentinel by
    > assigning to MISSING. I don't want to support that.


    What about allowing users to specify their own sentinel in the
    simplest pythonic way:

    # stevencalc.py
    MISSING = object()

    def mean(values, missing=MISSING):
        ...  # your code here


    Or, if you want to make it easier to specify the sentinel once for the
    whole API:

    # stevencalc.py
    MISSING = object()

    class Calc(object):
        def __init__(self, missing=MISSING):
            self._missing = missing

        def mean(self, values):
            ...  # your code here


    # default:
    _calc = Calc()
    mean = _calc.mean
    # etc...
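
    One way the placeholder might be filled in, as a rough sketch.
    Comparing with `is` rather than `==` is an assumption made here so
    that only the configured sentinel object is skipped; the post leaves
    the body open. `none_calc` is a hypothetical caller-side name.

```python
MISSING = object()


class Calc(object):
    def __init__(self, missing=MISSING):
        self._missing = missing

    def mean(self, values):
        # Identity test: only the configured sentinel object is skipped.
        kept = [v for v in values if v is not self._missing]
        if not kept:
            raise ValueError("no non-missing values")
        return sum(kept) / float(len(kept))


# Module-level default, as in the post.
_calc = Calc()
mean = _calc.mean

# A caller who prefers None as the sentinel builds their own Calc.
none_calc = Calc(missing=None)

print(mean([1, 2, MISSING, 3]))         # 2.0
print(none_calc.mean([1, 2, None, 3]))  # 2.0
```

    With the default instance a stray None is not skipped, so `sum()`
    fails with a TypeError.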

    My 2 cents...
    Guest, Jul 15, 2011
    #6
  7. * 2011-07-15T15:28:41+10:00 * Steven D'Aprano wrote:

    > I'm designing an API for some lightweight calculator-like statistics
    > functions, such as mean, standard deviation, etc., and I want to
    > support missing values. Missing values should be just ignored. E.g.:
    >
    > mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an
    > error.
    >
    > My question is, should I accept None as the missing value, or a
    > dedicated singleton?


    How about accepting anything but ignoring all non-numbers?
    Teemu Likonen, Jul 15, 2011
    #7
  8. Guest

    On Jul 15, 9:44 am, Cameron Simpson wrote:
    > On 15Jul2011 15:28, Steven D'Aprano wrote:
    > | Against MISSING: users may expect to be able to choose their own sentinel by
    > | assigning to MISSING. I don't want to support that.
    >
    > Well, we don't have readonly values to play with :-(
    > Personally I'd do what I did above: give it a "private" name like
    > _MISSING so that people should expect to have inside (and unsupported,
    > unguaranteed) knowledge if they fiddle with it.


    I think the point is to allow users to explicitly use MISSING in
    their data sets, so it does have to be public. But anyway: ALL_UPPER
    names are supposed to be treated as constants, so the "warranty void
    if messed with" rule still applies.
    Guest, Jul 15, 2011
    #8
  9. bruno Guest

    On Jul 15, 10:28 am, Teemu Likonen wrote:
    >
    > How about accepting anything but ignoring all non-numbers?


    Totally unpythonic. Better to be explicit about what you expect and
    crash as loudly as possible when you get anything unexpected.
    bruno, Jul 15, 2011
    #9
  10. Cameron Simpson wrote:

    > On 15Jul2011 15:28, Steven D'Aprano wrote:
    > | In favour of None: it's already there, no extra code required. People
    > | may expect it to work.
    >
    > Broadly, I like this one for the reasons you cite.
    >
    > | Against None: it's too easy to mistakenly add None to a data set by
    > | mistake, because functions return None by default.
    >
    > This is a hazard everywhere, but won't such a circumstance normally
    > break lots of stuff anyway?


    Maybe, maybe not. Either way, it has nothing to do with me -- I only care
    about what my library does if presented with None in a list of numbers.
    Should I treat it as a missing value, and ignore it, or treat it as an
    error?


    > What's an example scenario for getting None
    > by accident but still a bunch of non-None values? The main one I can
    > imagine is a function with a return path that accidentally fails to
    > return a value, e.g.:

    [code snipped]

    Yes, that's the main example I can think of. It doesn't really matter how it
    happens though, only that it is more likely for None to accidentally get
    inserted into a list than it is for a module-specific MISSING value.

    My thoughts are, if my library gets presented with two lists:

    [1, 2, 3, None, 5, 6]

    [1, 2, 3, mylibrary.MISSING, 5, 6]

    which is less likely to be an accident rather than deliberate? That's the
    one I should accept as the missing value. Does anyone think that's the
    wrong choice?


    > I suppose there's no scope for having the append-to-the-list step sanity
    > check for the sentinel (be it None or otherwise)?


    It is not my responsibility to validate data during construction, only to do
    the right thing when given that data. The right thing being: raise an
    exception if a value is not numeric, unless it is the explicit "missing"
    value (whatever that ends up being).


    > | Against MISSING: users may expect to be able to choose their own
    > | sentinel by assigning to MISSING. I don't want to support that.
    >
    > Well, we don't have readonly values to play with :-(
    > Personally I'd do what I did above: give it a "private" name like
    > _MISSING so that people should expect to have inside (and unsupported,
    > unguaranteed) knowledge if they fiddle with it. Or are you publishing
    > the sentinel's name to your callers i.e. may they really return _MISSING
    > legitimately from their functions?


    Assuming I choose against None, and go with MISSING, it will be a public
    part of the library API. The idea being that callers will be responsible
    for ensuring that if they have data with missing values, they insert the
    correct sentinel, rather than whatever random non-numeric value they
    started off with.



    --
    Steven
    Steven D'Aprano, Jul 15, 2011
    #10
  11. Rob Williscroft wrote:

    > Steven D'Aprano wrote in news:4e1fd009$0$29986$c3e8da3
    > $ in gmane.comp.python.general:
    >
    >> I'm designing an API for some lightweight calculator-like statistics
    >> functions, such as mean, standard deviation, etc., and I want to support
    >> missing values. Missing values should be just ignored. E.g.:
    >>
    >> mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.

    >
    > If you can't make your mind up then maybe you shouldn't:


    Heh, good point.

    It's not so much that I can't make up my mind -- I have a preferred solution
    in mind, but I want to hear what sort of interface for dealing with missing
    values others expect, and I don't want to prejudice others too greatly.


    > MISSING = MissingObject()
    > def mean( sequence, missing = MISSING ):


    So you think the right API is to allow the caller to specify what counts as
    a missing value at runtime? Are you aware of any other statistics packages
    that do that?


    --
    Steven
    Steven D'Aprano, Jul 15, 2011
    #11
  12. On 15Jul2011 20:17, Steven D'Aprano wrote:
    | Cameron Simpson wrote:
    | > I suppose there's no scope for having the append-to-the-list step sanity
    | > check for the sentinel (be it None or otherwise)?
    |
    | It is not my responsibility to validate data during construction, only to do
    | the right thing when given that data. The right thing being, raise an
    | exception if values are not numeric, unless an explicit "missing" value
    | (whatever that ends up being).

    Well there you go. You need to use MISSING, not None. As you say, None
    can easily be a mistake and you want to be sure. If what you describe as
    "right" is right, then I too would be using a special sentinel instead
    of None.
    --
    Cameron Simpson DoD#743
    http://www.cskk.ezoshosting.com/cs/

    The English language has a word to describe a group of anarcho-collectivists
    without resorting to spiffy hyphenated coined phrases: a mob.
    - Tim Mefford
    Cameron Simpson, Jul 15, 2011
    #12
  13. Chris Angelico wrote:

    > On Fri, Jul 15, 2011 at 3:28 PM, Steven D'Aprano wrote:
    >> My question is, should I accept None as the missing value, or a dedicated
    >> singleton?
    >>
    >> In favour of None: it's already there, no extra code required. People may
    >> expect it to work.
    >>
    >> Against None: it's too easy to mistakenly add None to a data set by
    >> mistake, because functions return None by default.

    >
    > I guess the question is: Why are the missing values there? If they're
    > there because some function returned None because it didn't have a
    > value to return, and therefore it's a missing value, then using None
    > as "missing" would make a lot of sense. But if it's a more explicit
    > concept of "here's a table of values, and the user said that this one
    > doesn't exist", it'd be better to have an explicit MISSING. (Which I
    > assume would be exposed as yourmodule.MISSING or something.)


    In general, you have missing values in statistics because somebody wouldn't
    answer a question, and the Ethics Committee frowns on researchers torturing
    their subjects to get information. They make you fill out forms.

    Seriously, missing data is just missing. Unknown. Lost. Not available. Like:

    Name     Age      Income   Years of schooling
    ==============================================
    Bill     42       150,000  16
    Susan    23       39,000   14
    Karen    unknown  89,000   15
    Bob      31       0        7
    George   79       12,000   unknown
    Sally    17       19,000   5
    Fred     66       unknown  11

    One might still like to calculate the average age as 43.
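
    The arithmetic behind that figure, as a small sketch. The `MISSING`
    object and the `column_mean` helper are stand-ins invented for this
    example; the rows are copied from the table above.

```python
MISSING = object()  # stand-in sentinel for "unknown"

# (name, age, income, years of schooling), copied from the table
rows = [
    ("Bill", 42, 150000, 16),
    ("Susan", 23, 39000, 14),
    ("Karen", MISSING, 89000, 15),
    ("Bob", 31, 0, 7),
    ("George", 79, 12000, MISSING),
    ("Sally", 17, 19000, 5),
    ("Fred", 66, MISSING, 11),
]


def column_mean(rows, index):
    """Mean of one column, ignoring rows where that column is MISSING."""
    known = [row[index] for row in rows if row[index] is not MISSING]
    return sum(known) / float(len(known))


print(column_mean(rows, 1))  # ages: 258 / 6 -> 43.0
```

    Note that Bob's income of 0 is a real value and is counted, whereas
    Fred's unknown income is skipped.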



    --
    Steven
    Steven D'Aprano, Jul 15, 2011
    #13
  14. * 2011-07-15T03:02:11-07:00 * bruno wrote:

    > On Jul 15, 10:28 am, Teemu Likonen wrote:
    >> How about accepting anything but ignoring all non-numbers?

    >
    > Totally unpythonic. Better to be explicit about what you expect and
    > crash as loudly as possible when you get anything unexpected.


    Sure, but sometimes an API can be "accept anything" if any kind of trash
    is expected. But that doesn't seem to be so in this case, so you're right.
    Teemu Likonen, Jul 15, 2011
    #14
  15. On Fri, Jul 15, 2011 at 8:46 PM, Steven D'Aprano wrote:
    > In general, you have missing values in statistics because somebody wouldn't
    > answer a question, and the Ethics Committee frowns on researchers torturing
    > their subjects to get information. They make you fill out forms.
    >


    Which, then, is in support of an explicit "User chose not to answer
    this question" MISSING value.

    ChrisA
    Chris Angelico, Jul 15, 2011
    #15
  16. Chris Angelico wrote:

    > On Fri, Jul 15, 2011 at 8:46 PM, Steven D'Aprano wrote:
    >> In general, you have missing values in statistics because somebody
    >> wouldn't answer a question, and the Ethics Committee frowns on
    >> researchers torturing their subjects to get information. They make you
    >> fill out forms.
    >>

    >
    > Which, then, is in support of an explicit "User chose not to answer
    > this question" MISSING value.


    Well yes, but None is an explicit missing value too. The question I have is
    whether I should support None as that value, or something else. Or, if anyone
    can make a good case for it, both, or neither and do something completely
    different.



    --
    Steven
    Steven D'Aprano, Jul 15, 2011
    #16
  17. Mel Guest

    Steven D'Aprano wrote:

    > Well yes, but None is an explicit missing value too. The question I have
    > is if I should support None as that value, or something else. Or if anyone
    > can put a good case for it, both, or neither and so something completely
    > different.


    If it's any help, I think (some of?) the database interface packages already
    do just that, returning None when they find NULL fields.


    Mel.
    Mel, Jul 15, 2011
    #17
  18. Eric Snow Guest

    On Thu, Jul 14, 2011 at 11:28 PM, Steven D'Aprano wrote:
    > Hello folks,
    >
    > I'm designing an API for some lightweight calculator-like statistics
    > functions, such as mean, standard deviation, etc., and I want to support
    > missing values. Missing values should be just ignored. E.g.:
    >
    > mean([1, 2, MISSING, 3]) => 6/3 = 2 rather than 6/4 or raising an error.
    >
    > My question is, should I accept None as the missing value, or a dedicated
    > singleton?
    >
    > In favour of None: it's already there, no extra code required. People may
    > expect it to work.
    >
    > Against None: it's too easy to mistakenly add None to a data set by mistake,
    > because functions return None by default.


    Good point.

    >
    > In favour of a dedicated MISSING singleton: it's obvious from context. It's
    > not a lot of work to implement compared to using None. Hard to accidentally
    > include it by mistake. If None does creep into the data by accident, you
    > get a nice explicit exception.


    Also good points.

    >
    > Against MISSING: users may expect to be able to choose their own sentinel by
    > assigning to MISSING. I don't want to support that.
    >
    >
    > I've considered what other packages do:-
    >
    > R uses a special value, NA, to stand in for missing values. This is more or
    > less the model I wish to follow.
    >
    > I believe that MATLAB treats float NANs as missing values. I consider this
    > an abuse of NANs and I won't be supporting that :p


    I was just thinking of this. :)

    >
    > Spreadsheets such as Excel, OpenOffice and Gnumeric generally ignore blank
    > cells, and give you a choice between ignoring text and treating it as zero.
    > E.g. with cells set to [1, 2, "spam", 3] the AVERAGE function returns 2 and
    > the AVERAGEA function returns 1.5.
    >
    > numpy uses masked arrays, which is probably over-kill for my purposes; I am
    > gratified to see it doesn't abuse NANs:
    >
    > >>> import numpy as np
    > >>> a = np.array([1, 2, float('nan'), 3])
    > >>> np.mean(a)
    > nan
    >
    > numpy also treats None as an error:
    >
    > >>> a = np.array([1, 2, None, 3])
    > >>> np.mean(a)
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    >   File "/usr/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 860, in mean
    >     return mean(axis, dtype, out)
    > TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
    >
    >
    > I would appreciate any comments, advice or suggestions.
    >


    Too bad there isn't a good way to "freeze" a name, i.e. indicate that
    any attempt to rebind it is an exception. Trying to rebind None is a
    SyntaxError, but a NameError or something would be fine. Then the
    downside of using your own sentinel here goes away.

    In reality, using Missing may be your best bet anyway. If there were
    a convention for indicating a name should not be re-bound (like a
    single leading underscore indicates "private"), you could use that
    (all caps?). Since "we're all consenting adults" it would probably be
    good enough to make sure others know that Missing should not be
    re-bound...
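
    A module-level name cannot be frozen directly, but an attribute can.
    The sketch below shows one possible approach; `_Constants` and `const`
    are hypothetical names, and the name `const` itself can of course
    still be rebound, so this moves the problem rather than removing it.

```python
class _Constants(object):
    """Hypothetical namespace whose attributes cannot be rebound."""

    def __init__(self, **names):
        # Bypass our own __setattr__ while populating the namespace.
        for name, value in names.items():
            object.__setattr__(self, name, value)

    def __setattr__(self, name, value):
        raise AttributeError("cannot rebind constant %r" % (name,))

    def __delattr__(self, name):
        raise AttributeError("cannot delete constant %r" % (name,))


const = _Constants(MISSING=object())

try:
    const.MISSING = None
except AttributeError as exc:
    print(exc)  # cannot rebind constant 'MISSING'
```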

    I might have said to use NotImplemented instead of None, but it can be
    re-bound and the name isn't as helpful for your use case.

    Another solution, perhaps ugly or confusing, is to use something like
    two underscores as the name for your sentinel:

    mean([1, 2, __, 3])

    Still it seems like using Missing (or whatever) would be better than None.

    -eric

    Eric Snow, Jul 15, 2011
    #18
  19. Ethan Furman Guest

    Mel wrote:
    > Steven D'Aprano wrote:
    >
    >> Well yes, but None is an explicit missing value too. The question I have
    >> is if I should support None as that value, or something else. Or if anyone
    >> can put a good case for it, both, or neither and so something completely
    >> different.

    >
    > If it's any help, I think (some of?) the database interface packages already
    > do just that, returning None when they find NULL fields.


    Indeed. I'm adding Null support to my dbf package now, and while some
    of the return values (Logical, Date, DateTime, and probably Character)
    will have their own dedicated singletons (Null, NullDate, NullDateTime,
    NullChar -- which will all compare equal to None) the numeric values
    will be None... although, now that I've seen this thread, I'll add the
    ability to choose what the numeric Null is returned as.

    ~Ethan~
    Ethan Furman, Jul 15, 2011
    #19
  20. Steven D'Aprano wrote:

    > Rob Williscroft wrote:
    >> MISSING = MissingObject()
    >> def mean( sequence, missing = MISSING ):

    >
    > So you think the right API is to allow the caller to specify what
    > counts as a missing value at runtime? Are you aware of any other
    > statistics packages that do that?


    R does it, not in the stats functions themselves but in, for instance,
    read.table. When reading data from an external file, you can specify a
    set of values that will be converted to NA in the resulting data frame.

    I think it's worth considering this approach, namely separating the
    input of the data into your system from the calculations on that
    data. You haven't said exactly how people are going to be using your
    API, but your example of "where missing data comes from" showed something
    like a table of data from a survey. If this is the case, and users are
    going to be importing sets of data from external files, it makes a lot
    of sense to let them specify "convert these particular values to MISSING
    when importing".

    Either way, my answer to your original question would be: if you
    want to err on the side of caution, use your own MISSING value and just
    provide a simple function that will MISSING-ize specified values:

    def cleanUp(data, missing=None):
        if missing is None:
            missing = []
        # Replace each listed value with MISSING; keep everything else.
        return [d if d not in missing else MISSING for d in data]

    (Yet another use of None here! :)

    Then if people find their functions are returning None (or any
    other value, such as an empty string) to mean a "genuine" missing value,
    they can just wrap the call in this cleanUp function. The reverse is
    harder to do: if you use None as your missing-value sentinel, you
    irrevocably lose the ability to tell it apart from other uses of None.
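
    As a usage sketch: `fetch_ages` below is a made-up data source, and
    `MISSING = object()` stands in for whatever sentinel the library ends
    up exporting.

```python
MISSING = object()  # stand-in for the library's sentinel


def cleanUp(data, missing=None):
    """Replace every value listed in `missing` with MISSING."""
    if missing is None:
        missing = []
    return [d if d not in missing else MISSING for d in data]


def fetch_ages():
    # Hypothetical data source that reports unknown ages as None or "".
    return [42, 23, None, 31, "", 17]


raw = fetch_ages()
clean = cleanUp(raw, missing=[None, ""])
known = [x for x in clean if x is not MISSING]
print(sum(known) / float(len(known)))  # 113 / 4 -> 28.25
```

    Because `d not in missing` compares with `==`, a value such as 0
    would also be replaced if False were listed as missing; an identity
    check would avoid that at the cost of some convenience.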

    --
    --OKB (not okblacke)
    Brendan Barnwell
    "Do not follow where the path may lead. Go, instead, where there is
    no path, and leave a trail."
    --author unknown
    OKB (not okblacke), Jul 15, 2011
    #20