Request for feedback on API design

Discussion in 'Python' started by Steven D'Aprano, Dec 9, 2010.

  1. I am soliciting feedback regarding the API of my statistics module:

    http://code.google.com/p/pycalcstats/


    Specifically the following couple of issues:

    (1) Multivariate statistics such as covariance have two obvious APIs:

    (A) pass the X and Y values as two separate iterable arguments, e.g.:
    cov([1, 2, 3], [4, 5, 6])

    (B) pass the X and Y values as a single iterable of tuples, e.g.:
    cov([(1, 4), (2, 5), (3, 6)])

    I currently support both APIs. Do people prefer one, or the other, or
    both? If there is a clear preference for one over the other, I may drop
    support for the other.
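
For illustration only, here is a minimal sketch of how a single function could accept both calling conventions. The name cov comes from the post; the body, the choice of population covariance (dividing by n), and the length check in form A are assumptions made for this example, not the pycalcstats implementation.

```python
def cov(xs, ys=None):
    """Population covariance (assumed definition: divide by n).

    cov(xs, ys)  -> form A: two separate iterables
    cov(pairs)   -> form B: one iterable of (x, y) tuples
    """
    if ys is None:
        # Form B: unpacking fails loudly if an item is not a 2-tuple.
        pairs = [(x, y) for x, y in xs]
    else:
        # Form A: zip() would silently truncate, so check lengths first.
        xs, ys = list(xs), list(ys)
        if len(xs) != len(ys):
            raise ValueError("xs and ys must have the same length")
        pairs = list(zip(xs, ys))

    n = len(pairs)
    if n == 0:
        raise ValueError("cov requires at least one data point")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n


# Both conventions give the same answer for the data in the post.
form_a = cov([1, 2, 3], [4, 5, 6])
form_b = cov([(1, 4), (2, 5), (3, 6)])
```

One cost of supporting both forms is the ambiguity of a single positional argument: the function has to treat it as pairs, so a caller who forgets the second argument of form A gets an unpacking error rather than a clear message.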


    (2) Statistics text books often give formulae in terms of sums and
    differences such as

    Sxx = n*Σ(x**2) - (Σx)**2

    There are quite a few of these: I count at least six common ones, all
    closely related and confusingly named:

    Sxx, Syy, Sxy, SSx, SSy, SPxy

    (the x and y should all be subscript).

    Are they useful, or would they just add unnecessary complexity? Would
    people like to see these included in the package?



    Thank you for your feedback.


    --
    Steven
    Steven D'Aprano, Dec 9, 2010
    #1

  2. Tim Chase

    On 12/09/2010 05:44 PM, Steven D'Aprano wrote:
    > (1) Multivariate statistics such as covariance have two obvious APIs:
    >
    > A pass the X and Y values as two separate iterable arguments, e.g.:
    > cov([1, 2, 3], [4, 5, 6])
    >
    > B pass the X and Y values as a single iterable of tuples, e.g.:
    > cov([(1, 4), (2, 5), (3, 6)])
    >
    > I currently support both APIs. Do people prefer one, or the other, or
    > both? If there is a clear preference for one over the other, I may drop
    > support for the other.


    I'm partial to the "B" form (iterable of 2-tuples) -- it
    indicates that the two data-sets (x_n and y_n) should be of the
    same length and paired. The "A" form leaves this less obvious
    that len(param1) should equal len(param2).

    I haven't poked at your code sufficiently to determine whether
    all the functions within can handle streamed data, or whether
    they keep the entire dataset internally, but handing off an
    iterable-of-pairs tends to be a little more straightforward:

    cov(humongous_dataset_iter)

    or

    cov(izip(humongous_dataset_iter1, humongous_dataset_iter2))

    The "A" form makes doing this a little less obvious than the "B"
    form.
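
As a rough sketch of what streaming over an iterable of pairs could look like, the function below computes the population covariance in a single pass without storing the data. The name cov_stream and the incremental update are assumptions chosen for illustration; the thread does not say how pycalcstats does this. In Python 3 the built-in zip is already lazy, so it plays the role of izip.

```python
def cov_stream(pairs):
    """Single-pass population covariance over an iterable of (x, y) pairs.

    Keeps only running means and a running co-moment, so the input can be
    a generator that is consumed once.
    """
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    for x, y in pairs:
        n += 1
        dx = x - mean_x              # deviation from the OLD mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        co_moment += dx * (y - mean_y)  # deviation from the NEW mean of y

    if n == 0:
        raise ValueError("cov_stream requires at least one pair")
    return co_moment / n


# Two separate one-shot iterators, paired lazily with zip().
xs = iter([1, 2, 3])
ys = iter([4, 5, 6])
result = cov_stream(zip(xs, ys))
```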

    > (2) Statistics text books often give formulae in terms of sums and
    > differences such as
    >
    > Sxx = n*Σ(x**2) - (Σx)**2
    >
    > There are quite a few of these: I count at least six common ones,


    When you take this count, is it across multiple text-books, or
    are they common in just a small sampling of texts? (I confess
    it's been a decade and a half since I last suffered a stats class)

    > all closely related and confusingly named:
    >
    > Sxx, Syy, Sxy, SSx, SSy, SPxy
    >
    > (the x and y should all be subscript).
    >
    > Are they useful, or would they just add unnecessary complexity?


    I think it depends on your audience: amateur statisticians or
    pros? I suspect that pros wouldn't blink at the distinctions
    while weekenders like myself would get a little bleary-eyed
    without at least a module docstring to clearly spell out the
    distinctions and the formulae used for determining them.

    Just my from-the-hip thoughts for whatever little they may be worth.

    -tkc
    Tim Chase, Dec 10, 2010
    #2

  3. On Thu, 09 Dec 2010 18:48:10 -0600, Tim Chase wrote:

    > On 12/09/2010 05:44 PM, Steven D'Aprano wrote:
    >> (1) Multivariate statistics such as covariance have two obvious APIs:
    >>
    >> A pass the X and Y values as two separate iterable arguments,
    >> e.g.:
    >> cov([1, 2, 3], [4, 5, 6])
    >>
    >> B pass the X and Y values as a single iterable of tuples, e.g.:
    >> cov([(1, 4), (2, 5), (3, 6)])
    >>
    >> I currently support both APIs. Do people prefer one, or the other, or
    >> both? If there is a clear preference for one over the other, I may drop
    >> support for the other.
    >
    > I'm partial to the "B" form (iterable of 2-tuples) -- it indicates that
    > the two data-sets (x_n and y_n) should be of the same length and paired.
    > The "A" form leaves this less obvious that len(param1) should equal
    > len(param2).



    Thanks for the comments Tim. To answer your questions:


    > I haven't poked at your code sufficiently to determine whether all the
    > functions within can handle streamed data, or whether they keep the
    > entire dataset internally,


    Where possible, the functions don't keep the entire dataset internally.
    Some functions have to (e.g. order statistics need to see the entire data
    sequence at once), but the rest are capable of dealing with streamed data.

    Also, there are a few functions such as standard deviation that have a
    single-pass algorithm, and a more accurate multiple-pass algorithm.
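
The trade-off between the two kinds of algorithm can be sketched as follows. Both functions are assumptions written for this example (the thread does not say which algorithms the module uses): a textbook single-pass formula based on running sums, and a two-pass version that computes the mean first. Both return the population standard deviation.

```python
import math


def pstdev_one_pass(data):
    """Single pass: accumulate n, sum(x) and sum(x*x) as the data streams by."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("no data")
    variance = (total_sq - total * total / n) / n
    # Rounding can push the difference slightly below zero; clamp it.
    return math.sqrt(max(variance, 0.0))


def pstdev_two_pass(data):
    """Two passes: needs the whole dataset, but avoids subtracting huge sums."""
    data = list(data)
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    mean = math.fsum(data) / n
    return math.sqrt(math.fsum((x - mean) ** 2 for x in data) / n)


small = [2, 4, 4, 4, 5, 5, 7, 9]      # population standard deviation is 2
shifted = [x + 1e9 for x in small]    # same spread, large offset

err_one = abs(pstdev_one_pass(shifted) - 2.0)
err_two = abs(pstdev_two_pass(shifted) - 2.0)
```

On the small data the two agree. On the shifted data the single-pass formula subtracts two sums of order 1e18, which is where the accuracy is lost, so err_one is never smaller than err_two here.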


    >> (2) Statistics text books often give formulae in terms of sums and
    >> differences such as
    >>
    >> Sxx = n*Σ(x**2) - (Σx)**2
    >>
    >> There are quite a few of these: I count at least six common ones,
    >
    > When you take this count, is it across multiple text-books, or are they
    > common in just a small sampling of texts? (I confess it's been a decade
    > and a half since I last suffered a stats class)


    I admit that I haven't done an exhaustive search of the literature, but
    it does seem quite common to extract common expressions from various
    stats formulae and give them names. The only use-case I can imagine for
    them is checking hand-calculations or doing schoolwork.


    --
    Steven
    Steven D'Aprano, Dec 10, 2010
    #3
  4. Steven D'Aprano writes:

    > I am soliciting feedback regarding the API of my statistics module:
    >
    > http://code.google.com/p/pycalcstats/
    >
    >
    > Specifically the following couple of issues:
    >
    > (1) Multivariate statistics such as covariance have two obvious APIs:
    >
    > A pass the X and Y values as two separate iterable arguments, e.g.:
    > cov([1, 2, 3], [4, 5, 6])
    >
    > B pass the X and Y values as a single iterable of tuples, e.g.:
    > cov([(1, 4), (2, 5), (3, 6)])
    >
    > I currently support both APIs. Do people prefer one, or the other, or
    > both? If there is a clear preference for one over the other, I may drop
    > support for the other.
    >


    I don't have an informed opinion on this.

    > (2) Statistics text books often give formulae in terms of sums and
    > differences such as
    >
    > Sxx = n*Σ(x**2) - (Σx)**2


    Interestingly, your Sxx is closely related to the variance:

    if x is a list of n numbers and var is the population variance, then

    Sxx == (n**2)*var(x)

    And more generally if x and y have the same length n, then Sxy (*) is
    related to the covariance

    Sxy == (n**2)*cov(x, y)

    So if you have a variance and covariance function, it would be redundant
    to include Sxx and Sxy. Another argument against including Sxx & co is
    that their definition is not universally agreed upon. For example, I
    have seen

    Sxx = Σ(x**2) - (Σx)**2/n

    HTH

    --
    Arnaud

    (*) Here I take Sxy to be n*Σ(xy) - (Σx)(Σy), generalising from your
    definition of Sxx.
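
The identities above can be checked numerically. The sketch below uses statistics.pvariance from the standard library and exact Fraction arithmetic; the helper names s_xx, s_xy, s_xx_alt and pcov are made up for this example, and the identities hold for the population variance and covariance (dividing by n), not the sample versions.

```python
from fractions import Fraction
from statistics import pvariance


def s_xx(xs):
    """Sxx = n*sum(x**2) - (sum x)**2, the definition from the first post."""
    n = len(xs)
    return n * sum(x * x for x in xs) - sum(xs) ** 2


def s_xy(xs, ys):
    """Sxy = n*sum(x*y) - (sum x)*(sum y), generalising s_xx."""
    n = len(xs)
    return n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)


def s_xx_alt(xs):
    """The other definition mentioned: sum(x**2) - (sum x)**2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n


def pcov(xs, ys):
    """Population covariance (divide by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Fractions keep every step exact, so == comparisons are safe.
xs = [Fraction(v) for v in (1, 2, 3, 4)]
ys = [Fraction(v) for v in (2, 4, 5, 9)]
n = len(xs)

assert s_xx(xs) == n ** 2 * pvariance(xs)     # 20 == 16 * 5/4
assert s_xy(xs, ys) == n ** 2 * pcov(xs, ys)  # 44 == 16 * 11/4
assert s_xx_alt(xs) == n * pvariance(xs)      # 5  == 4 * 5/4
```

The last line shows why the disagreement over definitions matters: the two versions of Sxx differ by a factor of n.
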
    Arnaud Delobelle, Dec 13, 2010
    #4
  5. Ethan Furman

    Steven D'Aprano wrote:
    > I am soliciting feedback regarding the API of my statistics module:
    >
    > http://code.google.com/p/pycalcstats/
    >
    >
    > Specifically the following couple of issues:
    >
    > (1) Multivariate statistics such as covariance have two obvious APIs:
    >
    > A pass the X and Y values as two separate iterable arguments, e.g.:
    > cov([1, 2, 3], [4, 5, 6])
    >
    > B pass the X and Y values as a single iterable of tuples, e.g.:
    > cov([(1, 4), (2, 5), (3, 6)])
    >
    > I currently support both APIs. Do people prefer one, or the other, or
    > both? If there is a clear preference for one over the other, I may drop
    > support for the other.
    >


    Don't currently need/use stats, but B seems clearer to me.

    ~Ethan~
    Ethan Furman, Dec 13, 2010
    #5
