Re: Compact Python library for math statistics

Discussion in 'Python' started by Gerrit, Apr 2, 2004.

  1. Gerrit

    Gerrit Guest

    The original poster wrote:
    > I'm looking for a Python library for math statistics. This must be a clear set of general statistics functions like 'average', 'variance', 'covariance' etc.


    The next version of Python will have a 'statistics' module. It is
    probably usable in Python 2.3 as well. You can find it in CVS:

    http://cvs.sourceforge.net/viewcvs....thon/nondist/sandbox/statistics/statistics.py

    I'm not sure whether it's usable in current CVS, though. You may have to
    tweak it a little.

    Gerrit.

    --
    Weather in Twenthe, Netherlands 02/04 11:55 UTC:
    16.0°C Broken clouds mostly cloudy wind 4.5 m/s ESE (57 m above NAP)
    --
    Experiences with Asperger's Syndrome:
    http://topjaklont.student.utwente.nl/english/
     
    Gerrit, Apr 2, 2004
    #1

  2. Gerrit <> wrote in message news:<>...
    > wrote:
    > > I'm looking for a Python library for math statistics. This must be a
    > > clear set of general statistics functions like 'average', 'variance',
    > > 'covariance' etc.
    >
    > The next version of Python will have a 'statistics' module. It is
    > probably usable in Python 2.3 as well. You can find it in CVS:
    >
    > http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/nondist/sandbox/statistics/statistics.py
    >
    > I'm not sure whether it's usable in current CVS, though. You may have to
    > tweak it a little.
    >
    > Gerrit.



    I'm hoping there will be more functions added to this module (e.g.
    median, quantiles, skewness, kurtosis). It wouldn't take much to
    include at least the basic summary stats. I would be more than happy
    to contribute.

    cjf
     
    Chris Fonnesbeck, Apr 6, 2004
    #2

  3. Gerrit

    TaeKyon Guest

    On Mon, 05 Apr 2004 19:41:52 -0700, Chris Fonnesbeck wrote:

    >> > I'm looking for a Python library for math statistics. This must be a
    >> > clear set of general statistics functions like 'average', 'variance',
    >> > 'covariance' etc.


    You can also use R from within python; take a look at:

    http://www.omegahat.org/RSPython/

    --
    Michele Alzetta
     
    TaeKyon, Apr 6, 2004
    #3
  4. Gerrit

    Guest

    Gerrit <> wrote in message news:<>...
    > wrote:
    > > I'm looking for a Python library for math statistics. This must be a
    > > clear set of general statistics functions like 'average', 'variance',
    > > 'covariance' etc.
    >
    > The next version of Python will have a 'statistics' module. It is
    > probably usable in Python 2.3 as well. You can find it in CVS:
    >
    > http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/nondist/sandbox/statistics/statistics.py
    >
    > I'm not sure whether it's usable in current CVS, though. You may have to
    > tweak it a little.


    <SNIP>

    It works for me, at least the mean function. A statistics module will
    be nice to have, although it is easy to write your own.

    Here is a minor suggestion. The functions 'mean' and 'variance' are
    separate, and the latter function requires a mean to be calculated. To
    save CPU time, it would be nice to have a single function that returns
    both the mean and variance, or a function to compute the variance with
    a known mean.

    Ideally there would be a function such as

    def stats(x,ss)

    where ss contains a list of statistics to be computed and the function
    returns a list of the same size. If you called it with

    y = stats(x,["mean","variance"])

    the function would compute the mean and variance efficiently.
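    A minimal sketch of what such a function might look like, written for
    Python 3. The name stats, the statistic names "mean" and "variance",
    and the choice of the sample (n - 1) variance are assumptions for
    illustration only; nothing like this is claimed to exist in the
    sandbox module. The point is only that the mean is computed once and
    reused.

```python
def stats(x, ss):
    """Return the statistics named in ss, in the same order.

    Hypothetical helper: the names "mean" and "variance" and the use of
    the sample (n - 1) variance are assumptions made for this sketch.
    """
    data = list(x)
    n = len(data)
    if n == 0:
        raise ValueError("stats() needs at least one data point")
    cache = {}

    def mean():
        # Computed at most once, then reused by variance().
        if "mean" not in cache:
            cache["mean"] = sum(data) / float(n)
        return cache["mean"]

    def variance():
        if "variance" not in cache:
            if n < 2:
                raise ValueError("sample variance needs two data points")
            m = mean()
            cache["variance"] = sum((v - m) ** 2 for v in data) / (n - 1)
        return cache["variance"]

    table = {"mean": mean, "variance": variance}
    return [table[name]() for name in ss]


y = stats([1.0, 2.0, 3.0, 4.0], ["mean", "variance"])
print(y)
```

    An unknown name in ss raises KeyError here; a fuller version would
    probably want a clearer error message.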

    Other comments:
    (1) In computing the median, there is a line of code

    return (select(data, n//2) + select(data, n//2-1)) / 2

    I think finding the 500th and 501st elements separately out of a 1000
    element array is inefficient. Isn't there a way to get consecutive
    ordered elements in about the same time needed to get a single
    element?

    (2) The following code crashes when median(x) is computed. Why?

    from statistics import mean,median
    x = [1.0,2.0,3.0,4.0]
    print mean(x)
    print median(x)

    (3) The standard deviation is computed as

    return variance(data, sample) ** 0.5

    I think the sqrt function should be used instead -- this may be
    implemented more efficiently than general exponentiation.
     
    , Apr 6, 2004
    #4
  5. In article <>, Chris Fonnesbeck wrote:
    > Gerrit <> wrote in message news:<>...
    >> wrote:
    >> > I'm looking for a Python library for math statistics. This must be a
    >> > clear set of general statistics functions like 'average', 'variance',
    >> > 'covariance' etc.
    >>
    >> The next version of Python will have a 'statistics' module. It is
    >> probably usable in Python 2.3 as well. You can find it in CVS:
    >>
    >> http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/nondist/sandbox/statistics/statistics.py
    >>
    >> I'm not sure whether it's usable in current CVS, though. You may have to
    >> tweak it a little.

    >
    > I'm hoping there will be more functions added to this module (e.g.
    > median, quantiles, skewness, kurtosis). It wouldn't take much to
    > include at least the basic summary stats. I would be more than happy
    > to contribute.


    I'd really like to see linear regression in the Python stats module. I've
    used the one from stats.py successfully - this may be a good source of
    ideas, too:

    http://www.nmr.mgh.harvard.edu/Neural_Systems_Group/gary/python.html

    (see stats.py)
    (apologies if this has already been pointed out somewhere)

    --
    ..:[ dave benjamin: ramen/[sp00] -:- spoomusic.com -:- ramenfest.com ]:.
    : please talk to your son or daughter about parametric polymorphism. :
     
    Dave Benjamin, Apr 6, 2004
    #5
  6. Gerrit

    Asier Guest

    > > I'm looking for a Python library for math statistics. This must be a
    > > clear set of general statistics functions like 'average', 'variance',
    > > 'covariance' etc.


    Have you looked at PyGSL? http://pygsl.sf.net

    I've programmed with the GSL library in C, and it works very well and
    is fast. It has code for a very long list of mathematical functions.
    Currently pygsl is a WIP, but it has some modules completed.

    --
    Asier.
     
    Asier, Apr 7, 2004
    #6
  7. Hi Gerrit,

    If you want an object-oriented version, try the SalStat stats module
    (salstat_stats.py). Features the descriptives you discussed plus a
    range of inferential tests (currently up to and including anova and
    nonparametric equivalents). Addy is http://salstat.sourceforge.net for
    the entire package. The CVS stats module is a little borked right now
    though as I've been making lots of changes, so get the stable
    downloadable one.

    Alan.

    Gerrit <> wrote in message news:<>...
    > wrote:
    > > I'm looking for a Python library for math statistics. This must be a
    > > clear set of general statistics functions like 'average', 'variance',
    > > 'covariance' etc.
    >
    > The next version of Python will have a 'statistics' module. It is
    > probably usable in Python 2.3 as well. You can find it in CVS:
    >
    > http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/nondist/sandbox/statistics/statistics.py
    >
    > I'm not sure whether it's usable in current CVS, though. You may have to
    > tweak it a little.
    >
    > Gerrit.
    >
    > --
    > Weather in Twenthe, Netherlands 02/04 11:55 UTC:
    > 16.0°C Broken clouds mostly cloudy wind 4.5 m/s ESE (57 m above NAP)
     
    Alan James Salmoni, Apr 8, 2004
    #7
  8. > A statistics module will
    > be nice to have, although it is easy to write your own.
    >
    > Here is a minor suggestion. The functions 'mean' and 'variance' are
    > separate, and the latter function requires a mean to be calculated. To
    > save CPU time, it would be nice to have a single function that returns
    > both the mean and variance, or a function to compute the variance with
    > a known mean.


    Like you said, that is easy enough to write on your own. This
    lightweight module is not meant to replace heavy-weights that already
    exist outside of the core distribution.

    The goals are to have a simple set of functions for daily use and for
    these data reduction functions to work as well as possible with
    generator expressions (one pass over the data wherever possible).
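    To illustrate the one-pass goal, here is a sketch in Python 3 that
    gets the mean and variance from a single pass over any iterable,
    including a generator expression. It uses Welford's running-update
    method, which is a choice made for this example and not necessarily
    what the sandbox module does. The name mean_variance and the meaning
    given to the sample flag are likewise assumptions.

```python
def mean_variance(iterable, sample=True):
    """Return (mean, variance) from one pass over iterable.

    Hypothetical helper using Welford's running update. With
    sample=True the divisor is n - 1, otherwise n.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in iterable:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("mean_variance() needs at least one data point")
    divisor = n - 1 if sample else n
    if divisor == 0:
        raise ValueError("sample variance needs two data points")
    return mean, m2 / divisor


# Works with a generator expression, so the data is never stored.
m, v = mean_variance(float(i) for i in range(1, 5))
print(m, v)
```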



    > (1) In computing the median, there is a line of code
    >
    > return (select(data, n//2) + select(data, n//2-1)) / 2
    >
    > I think finding the 500th and 501st elements separately out of a 1000
    > element array is inefficient. Isn't there a way to get consecutive
    > ordered elements in about the same time needed to get a single
    > element?


    Select uses an O(n) algorithm, so the penalty is not that much.
    Making it accommodate selecting a range would greatly complicate and
    slow down the code. If you need the low, high, and percentiles, then it
    may be better to just sort the data.
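    A sketch of the sort-once approach in Python 3: sorting costs
    O(n log n) instead of O(n), but afterwards both middle elements and
    the extremes are simple index lookups. The function name summary and
    the keys of the returned dictionary are made up for this example.

```python
def summary(data):
    """Return the low, median and high of data after a single sort.

    Hypothetical helper; the dictionary keys are chosen for illustration.
    """
    s = sorted(data)
    n = len(s)
    if n == 0:
        raise ValueError("summary() needs at least one data point")
    mid = n // 2
    if n % 2:
        median = s[mid]
    else:
        # Both middle elements come from the same sorted list.
        median = (s[mid - 1] + s[mid]) / 2.0
    return {"low": s[0], "median": median, "high": s[-1]}


result = summary([4.0, 1.0, 3.0, 2.0])
print(result)
```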



    > (2) The following code crashes when median(x) is computed. Why?
    >
    > from statistics import mean,median
    > x = [1.0,2.0,3.0,4.0]
    > print mean(x)
    > print median(x)


    Hmm, it works for me. What does your traceback look like?



    > (3) The standard deviation is computed as
    >
    > return variance(data, sample) ** 0.5
    >
    > I think the sqrt function should be used instead -- this may be
    > implemented more efficiently than general exponentiation.


    The timings show otherwise:

    C:\pydev>python timeit.py -r9 -n100000 -s "import math; sqrt=math.sqrt" "sqrt(7.0)"
    100000 loops, best of 9: 1.7 usec per loop

    C:\pydev>python timeit.py -r9 -n100000 -s "7.0 ** 0.5"
    100000 loops, best of 9: 0.237 usec per loop
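    For anyone who wants to repeat the comparison, here is a small sketch
    using the timeit module from a script, written for Python 3. Note
    that -s marks setup code in timeit, so the second command as written
    above may be timing an empty statement; in the sketch both
    expressions are passed as the timed statement. The operand is held in
    a variable x so that neither expression can be folded into a constant
    ahead of time. The helper name time_sqrt_variants is made up, and the
    numbers will differ from machine to machine and between Python
    versions.

```python
import timeit


def time_sqrt_variants(number=100000, repeat=5):
    """Return the best per-loop time in seconds for each expression.

    Hypothetical helper; the labels are the timed statements themselves.
    """
    setup = "import math; sqrt = math.sqrt; x = 7.0"
    results = {}
    for stmt in ("sqrt(x)", "x ** 0.5"):
        best = min(timeit.repeat(stmt, setup=setup,
                                 number=number, repeat=repeat))
        results[stmt] = best / number
    return results


timings = time_sqrt_variants()
for stmt, seconds in sorted(timings.items()):
    print("%-10s %.3f usec per loop" % (stmt, seconds * 1e6))
```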



    Raymond Hettinger
     
    Raymond Hettinger, Apr 9, 2004
    #8
  9. Gerrit

    Guest

    (Raymond Hettinger) wrote in message news:<>...

    <SNIP>

    > > (2) The following code crashes when median(x) is computed. Why?
    > >
    > > from statistics import mean,median
    > > x = [1.0,2.0,3.0,4.0]
    > > print mean(x)
    > > print median(x)

    >
    > Hmm, it works for me. What does your traceback look like?


    The module statistics.py imports a module 'random'. I have my own file
    random.py, and it was importing that. My mistake -- sorry.
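    A quick way to check for this kind of name clash is to ask Python
    which file a module was actually loaded from. A small sketch in
    Python 3; the helper name module_origin is made up for this example.

```python
import importlib


def module_origin(name):
    """Return the file a module was loaded from, or None if it has none.

    Hypothetical helper. Modules compiled into the interpreter have no
    __file__ attribute, hence the None fallback.
    """
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


# If this prints a path inside your own project rather than the standard
# library, a local random.py is shadowing the real module.
origin = module_origin("random")
print(origin)
```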

    > > (3) The standard deviation is computed as
    > >
    > > return variance(data, sample) ** 0.5
    > >
    > > I think the sqrt function should be used instead -- this may be
    > > implemented more efficiently than general exponentiation.

    >
    > The timings show otherwise:
    >
    > C:\pydev>python timeit.py -r9 -n100000 -s "import math; sqrt=math.sqrt" "sqrt(7.0)"
    > 100000 loops, best of 9: 1.7 usec per loop
    >
    > C:\pydev>python timeit.py -r9 -n100000 -s "7.0 ** 0.5"
    > 100000 loops, best of 9: 0.237 usec per loop


    For the Compaq and Lahey/Fujitsu Fortran 95 compilers I found that
    sqrt(x) and x**0.5 take the same time -- probably the compiler
    converts the latter to the former. On one compiler I found that
    computing x**0.49 takes about 10 times longer than sqrt(x), indicating
    that a sqrt function should be considerably faster than real
    exponentiation.

    I wonder if for Python, psyco eliminates the speed difference between
    sqrt(x) and x**0.5. Otherwise, the speed difference may indicate a
    fundamental problem in using a scripting language like Python for
    numerical work -- function calls take too much time. Because of that,
    sqrt is much slower than real exponentiation, when it should be much
    faster.

    Overall, the Python code below is about 100 times slower than the
    Fortran equivalent. This is a typical ratio I have found for code
    involving loops.

    from math import sqrt
    n = 10000000 + 1
    sum_sqrt = 0.0
    for i in range(1,n):
        sum_sqrt = sum_sqrt + (float(i))**0.5
    print sum_sqrt
     
    , Apr 9, 2004
    #9
  10. > Overall, the Python code below is about 100 times slower than the
    > Fortran equivalent. This is a typical ratio I have found for code
    > involving loops.
    >
    > from math import sqrt
    > n = 10000000 + 1
    > sum_sqrt = 0.0
    > for i in range(1,n):
    >     sum_sqrt = sum_sqrt + (float(i))**0.5
    > print sum_sqrt


    Yeah...you may want to consider doing some optimizations to the above
    code. Using 'xrange' instead of 'range' is significantly faster
    (especially when your machine can't hold 'n' integers in a Python list
    in memory), as is the removal of the 'float(i)' cast (which is unnecessary).

    As for Python being slow compared to Fortran, of course it is going to
    be slow in comparison. Fortran is compiled to assembly, and has fairly
    decent (if not amazing) optimizers. Python is bytecode compiled,
    interpreted, and lacks an even remotely equivalent optimizer.


    - Josiah
     
    Josiah Carlson, Apr 10, 2004
    #10
  11. Gerrit

    Guest

    Josiah Carlson <> wrote in message news:<c59k27$jbi$>...
    > > Overall, the Python code below is about 100 times slower than the
    > > Fortran equivalent. This is a typical ratio I have found for code
    > > involving loops.
    > >
    > > from math import sqrt
    > > n = 10000000 + 1
    > > sum_sqrt = 0.0
    > > for i in range(1,n):
    > >     sum_sqrt = sum_sqrt + (float(i))**0.5
    > > print sum_sqrt

    >
    > Yeah...you may want to consider doing some optimizations to the above
    > code. Using 'xrange' instead of 'range' is significantly faster
    > (especially when your machine can't hold 'n' integers in a Python list
    > in memory), as is the removal of the 'float(i)' cast (which is unnecessary).


    My original code, the code with range replaced by xrange, and the code
    with the further replacement of "float(i)" with "i" take 22.0, 20.5,
    and 14.4 seconds. So it looks like removing unnecessary casts can save
    substantial time. Thanks.
     
    , Apr 11, 2004
    #11
  12. Gerrit

    Andrew Dalke Guest

    <>
    > My original code, the code with range replaced by xrange, and the code
    > with the further replacement of "float(i)" with "i" take 22.0, 20.5,
    > and 14.4 seconds. So it looks like removing unnecessary casts can save
    > substantial time. Thanks.


    The following should be even faster

    def sum_sqrt(n):
        sum = 0.0
        for i in xrange(1, n):
            sum = sum + i ** 0.5
        return sum

    print sum_sqrt(10000000 + 1)

    Local variables have a fast lookup while module variables (that is,
    ones outside a function) have to look up the variable in the
    module dictionary.

    Your code (in module scope) takes 24.5 seconds on my box
    while my function version takes 17.5 seconds.
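    For readers on Python 3, a rough equivalent of the function version
    is sketched below. There range is already lazy, so xrange is no
    longer needed. The second function, sum_sqrt_builtin, is an extra
    variant added for this example that pushes the loop into the built-in
    sum; whether it is faster is something to measure on your own
    machine, and no timings are claimed here.

```python
import math


def sum_sqrt(n):
    """Sum the square roots of 1 .. n-1 using only local variables."""
    total = 0.0
    for i in range(1, n):
        total += i ** 0.5
    return total


def sum_sqrt_builtin(n):
    """Same sum, with the loop handed to the built-in sum().

    Hypothetical variant added for comparison only.
    """
    return sum(math.sqrt(i) for i in range(1, n))


small = sum_sqrt(5)
print(small)
```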

    Andrew
     
    Andrew Dalke, Apr 11, 2004
    #12
