Request for feedback on API design

Steven D'Aprano · Dec 9, 2010

I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/

Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.
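
For readers comparing the two call forms, here is a minimal sketch of one function accepting both. It is not the pycalcstats implementation: the signature `cov(xdata, ydata=None)` and the choice of population covariance (dividing by n) are assumptions made for illustration, and the code assumes Python 3 division.

```python
def cov(xdata, ydata=None):
    """Population covariance (hypothetical signature, for illustration only).

    Form A: cov(xs, ys) with two separate iterables.
    Form B: cov(pairs) with a single iterable of (x, y) tuples.
    """
    if ydata is None:
        # Form B: the data already arrives paired.
        pairs = list(xdata)
    else:
        # Form A: pair the values up, checking the lengths explicitly.
        xs, ys = list(xdata), list(ydata)
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
        pairs = list(zip(xs, ys))
    n = len(pairs)
    if n == 0:
        raise ValueError("cov requires at least one data point")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Average product of deviations from the two means.
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
```

With this sketch, `cov([1, 2, 3], [4, 5, 6])` and `cov([(1, 4), (2, 5), (3, 6)])` give the same result. Note that form A needs an explicit length check, while form B cannot express mismatched lengths at all.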

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

There are quite a few of these: I count at least six common ones, all
closely related and confusingly named:

Sxx, Syy, Sxy, SSx, SSy, SPxy

(the x and y should all be subscript).

Are they useful, or would they just add unnecessary complexity? Would
people like to see these included in the package?

Thank you for your feedback.

Tim Chase · Dec 10, 2010

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

I'm partial to the "B" form (iterable of 2-tuples) -- it
indicates that the two data-sets (x_n and y_n) should be of the
same length and paired. The "A" form makes it less obvious
that len(param1) should equal len(param2).

I haven't poked at your code sufficiently to determine whether
all the functions within can handle streamed data, or whether
they keep the entire dataset internally, but handing off an
iterable-of-pairs tends to be a little more straightforward:

cov(humongous_dataset_iter)

or

cov(izip(humongous_dataset_iter1, humongous_dataset_iter2))

The "A" form makes doing this a little less obvious than the "B"
form.
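
To make the streaming point concrete, here is a sketch of a covariance function that consumes an iterable of pairs in a single pass without storing the data. The name `streaming_cov` is hypothetical, the running-mean update is a standard one-pass co-moment recurrence rather than anything taken from the module, and it computes the population covariance (dividing by n).

```python
def streaming_cov(pairs):
    """One-pass population covariance over an iterable of (x, y) pairs.

    Only a few running totals are kept, so the input can be a generator.
    """
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    comoment = 0.0  # running sum of (x - mean_x) * (y - mean_y)
    for x, y in pairs:
        n += 1
        dx = x - mean_x              # deviation from the old mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        comoment += dx * (y - mean_y)  # uses the updated mean of y
    if n == 0:
        raise ValueError("streaming_cov requires at least one pair")
    return comoment / n
```

In Python 3 the built-in `zip` is lazy, so `streaming_cov(zip(iter1, iter2))` plays the role of the `izip` call above and never materialises either dataset.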

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

There are quite a few of these: I count at least six common ones,

When you take this count, is it across multiple text-books, or
are they common in just a small sampling of texts? (I confess
it's been a decade and a half since I last suffered a stats class)

all closely related and confusingly named:

Sxx, Syy, Sxy, SSx, SSy, SPxy

(the x and y should all be subscript).

Are they useful, or would they just add unnecessary complexity?

I think it depends on your audience: amateur statisticians or
pros? I suspect that pros wouldn't blink at the distinctions
while weekenders like myself would get a little bleary-eyed
without at least a module docstring to clearly spell out the
distinctions and the formulae used for determining them.

Just my from-the-hip thoughts for whatever little they may be worth.

-tkc

Steven D'Aprano · Dec 10, 2010

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments,
e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.


I'm partial to the "B" form (iterable of 2-tuples) -- it indicates that
the two data-sets (x_n and y_n) should be of the same length and paired.
The "A" form makes it less obvious that len(param1) should equal
len(param2).

Thanks for the comments, Tim. To answer your questions:

I haven't poked at your code sufficiently to determine whether all the
functions within can handle streamed data, or whether they keep the
entire dataset internally,

Where possible, the functions don't keep the entire dataset internally.
Some functions have to (e.g. order statistics need to see the entire data
sequence at once), but the rest are capable of dealing with streamed data.

Also, there are a few functions such as standard deviation that have a
single-pass algorithm, and a more accurate multiple-pass algorithm.
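
As an illustration of that trade-off (not the module's actual code), here is a sketch contrasting a textbook single-pass variance with a two-pass version. The function names are hypothetical and both compute the population variance; the standard deviation is the square root of either result.

```python
def one_pass_pvariance(data):
    """Single pass using running sums of x and x**2.

    Works on a stream, but subtracting two large, nearly equal
    numbers can lose precision when the mean is large.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("at least one data point is required")
    return (total_sq - total * total / n) / n


def two_pass_pvariance(data):
    """Two passes: first the mean, then the squared deviations.

    Needs the whole dataset in memory, but avoids the cancellation
    problem of the single-pass formula.
    """
    data = list(data)
    n = len(data)
    if n == 0:
        raise ValueError("at least one data point is required")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n
```

On small, well-scaled data such as `[1, 2, 3, 4]` the two agree. On data with a large offset, for example `[1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]`, the two-pass version recovers the true population variance of 22.5, while the single-pass version can be noticeably off.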

When you take this count, is it across multiple text-books, or are they
common in just a small sampling of texts? (I confess it's been a decade
and a half since I last suffered a stats class)

I admit that I haven't done an exhaustive search of the literature, but
it does seem quite common to extract common expressions from various
stats formulae and give them names. The only use-case I can imagine for
them is checking hand-calculations or doing schoolwork.

Arnaud Delobelle · Dec 13, 2010

Steven D'Aprano said:
I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/

Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

I don't have an informed opinion on this.

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

Interestingly, your Sxx is closely related to the variance:

if x is a list of n numbers then

Sxx == (n**2)*var(x)

And more generally if x and y have the same length n, then Sxy (*) is
related to the covariance

Sxy == (n**2)*cov(x, y)

So if you have a variance and covariance function, it would be redundant
to include Sxx and Sxy. Another argument against including Sxx & co is
that their definition is not universally agreed upon. For example, I
have seen

Sxx = Σ(x**2) - (Σx)**2/n
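
These identities are easy to check numerically. The sketch below assumes the population variance and covariance (dividing by n, not n - 1), which is what the n**2 relation requires, and it assumes Sxy is defined by analogy with Sxx as n*Σ(x*y) - Σx*Σy. The helper names are made up for the check; only `statistics.pvariance` comes from the standard library.

```python
from math import isclose
from statistics import pvariance


def sxx(x):
    """Sxx = n*Σ(x**2) - (Σx)**2, the first definition quoted above."""
    n = len(x)
    return n * sum(v * v for v in x) - sum(x) ** 2


def sxx_alt(x):
    """Sxx = Σ(x**2) - (Σx)**2/n, the alternative definition."""
    n = len(x)
    return sum(v * v for v in x) - sum(x) ** 2 / n


def sxy(x, y):
    """Assumed analogue of Sxx: n*Σ(x*y) - Σx*Σy."""
    n = len(x)
    return n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)


def pcov(x, y):
    """Population covariance: mean product of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


x = [1, 2, 3, 4]
y = [2, 4, 5, 9]
n = len(x)

print(isclose(sxx(x), n ** 2 * pvariance(x)))   # first definition: n**2 * variance
print(isclose(sxy(x, y), n ** 2 * pcov(x, y)))  # same relation for the covariance
print(isclose(sxx_alt(x), n * pvariance(x)))    # alternative: only n * variance
```

The two definitions of Sxx differ by a factor of n, which is exactly the kind of ambiguity that makes these names risky to expose in an API.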

HTH

Ethan Furman · Dec 13, 2010

Steven said:
I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/

Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

Don't currently need/use stats, but B seems clearer to me.

~Ethan~
