PEP 450: Adding a statistics module to Python

Skip Montanaro

I am seeking comments on PEP 450, Adding a statistics module to Python's
standard library:

http://www.python.org/dev/peps/pep-0450/

Please read the FAQs before asking anything :)

Given that installing numpy or scipy is generally no more difficult
than executing "pip install (scipy|numpy)" I'm not really feeling the
need for a battery here... (Of course, I use this stuff at work from
time-to-time, so maybe I'm more in the "nuclear reactor of batteries"
camp anyway.)

Skip
 
Stefan Behnel

Ben Finney, 10.08.2013 07:05:
See the Rationale of PEP 450 for more reasons why “install NumPy” is not
a feasible solution for many use cases, and why having ‘statistics’ as a
pure-Python, standard-library package is desirable.

The rationale suggests that the module is meant as a simple toolset for
non-NumPy users. Are the APIs (class model, function names, etc.) similar
enough to make it easy to switch, preferably in both directions?

It would be good if a stdlib statistics module could be used as a SciPy
fallback for the "simple" things, and if users of the stdlib module could
easily switch their code to SciPy if they need more speed/features/whatever
at some point, without having to relearn the name of each single function.

I'm not asking for compatibility (doesn't sound reasonable without NumPy
arrays), but I think that a similarity in terms of API naming (as far as it
makes sense) should be clearly stated, e.g. in the Design Decisions section.
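
For instance, the rough correspondence I have in mind would be something
like this (the statistics names are from the PEP's reference
implementation; the numpy equivalents are only approximate):

import statistics  # PEP 450 reference implementation
import numpy

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]

statistics.mean(data)       # numpy.mean(data)
statistics.median(data)     # numpy.median(data)
statistics.pvariance(data)  # numpy.var(data)
statistics.pstdev(data)     # numpy.std(data)
statistics.variance(data)   # numpy.var(data, ddof=1)
statistics.stdev(data)      # numpy.std(data, ddof=1)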

Stefan
 
Roy Smith

Skip Montanaro said:
Given that installing numpy or scipy is generally no more difficult
than executing "pip install (scipy|numpy)" I'm not really feeling the
need for a battery here...

I just tried installing numpy in a fresh virtualenv on an Ubuntu Precise
box. I ran "pip install numpy". It took 1.5 minutes. It printed
almost 1800 lines of build crap, including 383 warnings and 83 errors.
For a newbie, that can be pretty intimidating.

That's for the case where I've already installed numpy elsewhere on that
box, so I already had the fortran compiler, and the rest of the build
chain. For fun, I just spun up a new Ubuntu Precise instance in AWS.
It came pre-installed with Python 2.7.3. I tried "pip install numpy",
which told me that pip was not installed.

At least it told me what I needed to do to get pip installed.
Unfortunately, I didn't read the message carefully enough and typed
"sudo apt-get install pip", which of course got me another error because
the correct name of the package is python-pip. Doing "sudo apt-get
install python-pip" finally got me to the point where I could start to
install numpy.

Of course, if I didn't have sudo privs on the box (most corporate
environments), I never would have gotten that far.

At this point, "sudo pip install numpy" got me a bunch of errors
culminating in "RuntimeError: Broken toolchain: cannot link a simple C
program", and no indication of how to get any further.

At this point, most people would give up. I don't remember the full set
of steps I needed to do the first time. Obviously, I would start with
installing gcc, but I seem to remember there were additional steps
needed to get fortran support.

Having some simple statistics baked into the standard Python library
would be a big win. As shown above, installing numpy can be an
insurmountable hurdle for people with insufficient sysadmin-fu.

PEP-450 makes cogent arguments why rolling your own statistics routines
is fraught with peril. Looking over our source tree, I see we've
implemented std deviation in Python at least twice. I'm sure they're
both naive implementations of the sort PEP-450 warns about.
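
For instance, here is a minimal sketch (with made-up data) of the kind
of naive one-pass formula PEP-450 warns about, next to a straightforward
two-pass version:

import math

def naive_pstdev(data):
    # Textbook one-pass formula E[x^2] - E[x]^2. It suffers
    # catastrophic cancellation when the mean is large relative
    # to the spread.
    n = len(data)
    ex2 = sum(x * x for x in data) / n
    ex = sum(data) / n
    return math.sqrt(ex2 - ex * ex)

def two_pass_pstdev(data):
    # Compute the mean first, then average the squared deviations.
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)

data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
print(two_pass_pstdev(data))  # ~4.74, the correct answer
print(naive_pstdev(data))     # way off, or even a ValueError when
                              # cancellation leaves a negative operand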

And, yes, backporting to 2.7 would be a big win too. I know the goal is
to get everybody onto 3.x, but my pip external dependency list includes
40 modules. It's going to be a long and complicated road to get to the
point where I can move to 3.x, and I imagine most non-trivial projects
are in a similar situation.
 
Oscar Benjamin

Roy Smith said:
At this point, "sudo pip install numpy" got me a bunch of errors
culminating in "RuntimeError: Broken toolchain: cannot link a simple C
program", and no indication of how to get any further.

You should use apt-get for numpy/scipy on Ubuntu. Although
unfortunately IIRC this doesn't work as well as it should since Ubuntu
doesn't install the appropriate BLAS/LAPACK libraries by default
(leaving you with numpy's fallback libraries).

On Windows you should use the MSI installer (or easy_install).
Hopefully numpy/scipy will start distributing wheels soon and pip
install numpy will actually work.


Oscar
 
Roy Smith

I described the problems I had trying to follow that advice.

Oscar Benjamin said:
You should use apt-get for numpy/scipy on Ubuntu. Although
unfortunately IIRC this doesn't work as well as it should since Ubuntu
doesn't install the appropriate BLAS/LAPACK libraries by default
(leaving you with numpy's fallback libraries).

That really kind of proves my point. It's *not* easy to install.
There's a choice of methods, some of which work in some environments,
some of which work in others. And even if apt-get is the preferred
install method on Ubuntu, it's a method which is unavailable to people
without root access (and may be undesirable if you rely on virtualenv to
keep multiple projects cleanly separated).

And, what happens if you don't have the right libraries? Do you end up
with an install which is missing some functionality, or one where all
the calls work, but they're slower, or numerically unstable, or what?

All these questions go away if it's packaged with the standard library.

I'm not sure where the line should be drawn between "basic stuff that
should be included" and "advanced stuff that you need an add-on to get",
but certainly mean and std-dev should be in the default distribution.
 
Oscar Benjamin

Roy Smith said:
And, what happens if you don't have the right libraries? Do you end up
with an install which is missing some functionality, or one where all
the calls work, but they're slower, or numerically unstable, or what?

AFAIK not having separate BLAS/LAPACK libraries just means that
certain operations are a lot slower. If there are differences in
accuracy then they aren't significant enough that I've noticed.

I think that the reason Ubuntu doesn't install them by default is
because it's not sure which ones you want to use. Possibly the best
free setup comes from using ATLAS but this is optimised in a
CPU-specific way at build time. Ubuntu doesn't provide binaries for it
as using generic x86 executables would defeat much of the point of the
library (they do make it a lot easier by providing a source package
though).


Oscar
 
Dennis Lee Bieber

Skip Montanaro said:
Given that installing numpy or scipy is generally no more difficult
than executing "pip install (scipy|numpy)" I'm not really feeling the
need for a battery here... (Of course, I use this stuff at work from
time-to-time, so maybe I'm more in the "nuclear reactor of batteries"
camp anyway.)

And for the whole nuclear power plant, isn't there an interface module
that lets Python control the R-system? http://rpy.sourceforge.net/ for
example.
 
Skip Montanaro

Ben Finney said:
See the Rationale of PEP 450 for more reasons why “install NumPy” is not
a feasible solution for many use cases, and why having ‘statistics’ as a
pure-Python, standard-library package is desirable.

I read that before posting but am not sure I agree. I don't see the
screaming need for this package. Why can't it continue to live on
PyPI, where, once again, it is available as "pip install ..."?

S
 
Nicholas Cole

Skip Montanaro said:
I read that before posting but am not sure I agree. I don't see the
screaming need for this package. Why can't it continue to live on
PyPI, where, once again, it is available as "pip install ..."?


Well, I *do* think this module would be a wonderful addition to the
standard library. I've often used python to do analysis of data, nothing
complicated enough to need NumPy, but certainly things where I've needed to
find averages etc. I've rolled my own functions for these projects, and I'm
sure they are fragile. Besides, it was just a pain to do them.

PyPI is terrific. There are lots of excellent modules on there. It's a
wonderful resource. But I think that the standard library is also a
wonderful thing, and where there are clearly defined modules, that serve a
general, well-defined function and where development does not need to be
very rapid, I think they should go into the Standard Library.

I'm aware that my opinion is just that of one user, but I read this PEP and
I thought, "Thank Goodness! That looks great. About time too."

N.
 
Steven D'Aprano

Skip Montanaro said:
I read that before posting but am not sure I agree. I don't see the
screaming need for this package. Why can't it continue to live on PyPI,
where, once again, it is available as "pip install ..."?


The same could be said about any module, really. And indeed, some
languages have that philosophy: they provide no libraries to speak of;
if you want anything, you have to either write it yourself or get it
from somebody else.

Not everyone has the luxury of being able, or allowed, to run "pip
install" to get additional, non-standard packages. E.g. in corporate
environments. But I've already said that in the PEP.
 
Roy Smith

Ben Finney said:
See the Rationale of PEP 450 for more reasons why “install NumPy” is not
a feasible solution for many use cases, and why having ‘statistics’ as a
pure-Python, standard-library package is desirable.

Skip Montanaro said:
I read that before posting but am not sure I agree. I don't see the
screaming need for this package. Why can't it continue to live on
PyPI, where, once again, it is available as "pip install ..."?

My previous comments on this topic were along the lines of "installing
numpy is a non-starter if all you need are simple mean/std-dev". You
do, however, make a good point here. Running "pip install statistics"
is a much lower barrier to entry than getting numpy going, especially if
statistics is pure python and thus has no dependencies on compiler tool
chains which may be missing.

Still, I see two classes of function in PEP-450. Class 1 is the really
basic stuff:

* mean
* std-dev

Class 2 are the more complicated things like:

* linear regression
* median
* mode
* functions for calculating the probability of random variables
from the normal, t, chi-squared, and F distributions
* inference on the mean
* anything that differentiates between population and sample

I could see leaving class 2 stuff in an optional pure-python module to
be installed by pip, but for what the PEP calls the simplest and most
obvious statistical functions (into which I lump mean and std-dev),
having them in the standard library would be a big win.
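
For the class 1 stuff, usage would presumably be as simple as this
sketch (names as proposed in the PEP):

import statistics

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
statistics.mean(data)    # 2.875
statistics.pstdev(data)  # population standard deviation
statistics.stdev(data)   # sample standard deviation (n - 1 denominator)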
 
duncan smith

Roy Smith said:
Still, I see two classes of function in PEP-450. Class 1 is the really
basic stuff:

* mean
* std-dev

Class 2 are the more complicated things like:

* linear regression
* median
* mode
* functions for calculating the probability of random variables
from the normal, t, chi-squared, and F distributions
* inference on the mean
* anything that differentiates between population and sample

I would probably move other descriptive statistics (median, mode,
correlation, ...) into Class 1.

I roll my own statistical tests as I need them - simply to avoid having
a dependency on R. But I generally do end up with a dependency on scipy
because I need scipy.stats.distributions. So I guess a distinct library
for probability distributions would be handy - but maybe it should not
be in the standard library.
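
For example, the kind of thing that pulls in that dependency is a small
calculation like this (a sketch using scipy.stats):

from scipy.stats import chi2, norm

# Two-sided p-value for a z statistic of 1.96
p = 2 * (1 - norm.cdf(1.96))  # ~0.05

# 95% critical value of the chi-squared distribution with 3 df
crit = chi2.ppf(0.95, df=3)   # ~7.81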

Once we move on to statistical modelling (e.g. linear regression) I
think the case for inclusion in the standard library becomes weaker
still. Cheers.

Duncan
 
Wolfgang Keller

Skip Montanaro said:
I am seeking comments on PEP 450, Adding a statistics module to
Python's standard library:

I don't think that you want to re-implement RPy.

Sincerely,

Wolfgang
 
Steven D'Aprano

Wolfgang Keller said:
I don't think that you want to re-implement RPy.

I never suggested re-implementing RPy. When you read the PEP, you will
see that this proposal is to have a Python implementation of statistics
functions, not a thin wrapper around another language.
 
taldcroft

Skip Montanaro said:
I am seeking comments on PEP 450, Adding a statistics module to Python's
standard library:

http://www.python.org/dev/peps/pep-0450/

Please read the FAQs before asking anything :)

I think this is a super idea. Python is showing up in high-school and college intro programming courses here in the U.S. Having a solid statistics module built in would work well in that context and make it even more natural as a complement to math courses. Beyond the educational aspect, having a built-in module to *correctly* handle the frequent light-weight use cases would be useful across many professional disciplines.

I use NumPy on a daily basis and help scientists with installation problems frequently. I can emphatically state that NumPy is not easy to install for newbies. Open up a brand new Mac and look: no compilers! Even experienced users can have problems with gfortran vs. g77 etc. Anyone who has ever built BLAS/ATLAS from source will also tell you that SciPy is definitely not a simple "pip install" on many platforms (particularly if you don't have root).

- Tom
 
chris.barker

Skip Montanaro said:
I am seeking comments on PEP 450, Adding a statistics module to Python's
standard library:

The trick here is that numpy really is the "right" way to do this stuff.

I like to say:
"crunching numbers in python without numpy is like doing text processing without using the string object"

What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...


A few other comments:

1) the numpy folks have been VERY good at providing binaries for Windows and OS-X -- easy point and click installing.

2) I hope we're almost there with standardizing pip and binary wheels, at which point pip install will be painless.

even before (2) -- pip install works fine anywhere the system is set up to build Python extensions (granted, not a given on Windows and Mac, but pretty likely on Linux). The idea that pip install writing out a lot of text (but working!) is somehow a barrier to entry is absurd -- anyone building their own stuff on Linux is used to that.

(NOTE: you only need Fortran if you want highly optimized linear algebra stuff -- clearly this use-case is for folks that don't need that!)

3) The fact that the numpy functions have optional arguments is NOT a problem -- the simple calls work as expected, no one who doesn't need the optional arguments has to figure them out, and if they do need them, they had better be there!
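
For example (a small sketch):

import numpy as np

data = np.array([[1, 2], [3, 4]])
np.mean(data)          # 2.5 -- the simple call just works
np.mean(data, axis=0)  # array([2., 3.]) -- the optional axis argument
                       # is there when you need it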

All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.

If someone decides to build a stand-alone stats package -- building it on an ndarray-lite (PEP 3118 compatible) object would be a nice way to go.
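
For instance, array.array already exposes a PEP 3118 buffer, so data can
be shared with numpy without copying (a small sketch):

from array import array
import numpy as np

a = array('d', [1.0, 2.0, 3.0])
m = memoryview(a)       # PEP 3118 buffer view of the array's memory
arr = np.frombuffer(a)  # ndarray over the same buffer, no copy
arr[0] = 10.0
print(a[0])             # 10.0 -- both names see the shared buffer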


One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.


-Chris
 
Oscar Benjamin

chris.barker said:
The trick here is that numpy really is the "right" way to do this stuff.

Although it isn't mentioned in the PEP, a significant point that is
worth bearing in mind is that numpy is only for CPython, not PyPy,
IronPython, Jython etc. See here for a recent update on the status of
NumPyPy:
http://morepypy.blogspot.co.uk/2013_08_01_archive.html

chris.barker said:
I like to say:
"crunching numbers in python without numpy is like doing text processing without using the string object"

It depends what kind of number crunching you're doing. Numpy gives
efficient C-style number crunching but it doesn't really give
efficient ways to take advantage of the areas where Python is better
than C such as having efficient infinite range integers, and decimal
and rational arithmetic in the standard library. You can use
dtype=object to use all these things with numpy arrays but in my
experience this is typically not faster than working with Python lists
and is only really useful when you want numpy's multi-dimensional,
view-type slicing.
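
For instance, with dtype=object the arithmetic dispatches to the
elements' own types, so results stay exact, but every operation goes
through Python-level calls:

>>> import numpy
>>> from fractions import Fraction as F
>>> a = numpy.array([F(1, 7), F(2, 7), F(4, 7)], dtype=object)
>>> a.sum()
Fraction(1, 1)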

Here's an example where Steven's statistics module is more accurate:
>>> numpy.mean([-1e60, 100, 100, 1e60])
0.0
>>> statistics.mean([-1e60, 100, 100, 1e60])
50.0

Okay so that's a toy example but it illustrates that Steven is aiming
for ultra-high accuracy where numpy is primarily aimed at speed. He's
also tried to ensure that it works properly with e.g. fractions:
>>> from fractions import Fraction as F
>>> data = [F('1/7'), F('3/7')]
>>> numpy.mean(data)
0.2857142857142857
>>> statistics.mean(data)
Fraction(2, 7)

and decimals:
>>> from decimal import Decimal as D
>>> data = [D('0.1'), D('0.01'), D('0.001')]
>>> numpy.mean(data)
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for /: 'decimal.Decimal' and 'float'
>>> statistics.mean(data)
Decimal('0.037')

chris.barker said:
What this is really an argument for is a numpy-lite in the standard library, which could be used to build these sorts of things on. But that's been rejected before...

If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
provide much of what numpy provides. I would describe it as a Pythonic
implementation of elementary statistical computation rather than a
numpy-lite.

[snip]
All that being said -- if you do decide to do this, please use a PEP 3118 (enhanced buffer) supporting data type (probably array.array) -- compatibility with numpy and other packages for crunching numbers is very nice.

If someone decides to build a stand-alone stats package -- building it on an ndarray-lite (PEP 3118 compatible) object would be a nice way to go.

Why? Yes I'd also like an ndarray-lite or rather an ultra-lite
1-dimensional version but why would it be useful for the statistics
module over using standard Python containers? Note that numpy arrays
do work with the reference implementation of the statistics module
(they're just treated as iterables):
>>> import numpy
>>> import statistics
>>> statistics.mean(numpy.array([1, 2, 3]))
2.0
>>> statistics.mean(numpy.array([[1, 2, 3], [4, 5, 6]]))
array([ 2.5,  3.5,  4.5])
chris.barker said:
One other point -- for performance reasons, it would be nice to have some compiled code in there -- this adds incentive to put it in the stdlib -- external packages that need compiling are what make numpy unacceptable to some folks.

It might be good to have a C accelerator one day but actually I think
the pure-Python-ness of it is a strong reason to have it since it
provides accurate statistics functions to all Python implementations
(unlike numpy) at no additional cost.


Oscar
 
