numpy/scipy: correlation

Discussion in 'Python' started by robert, Nov 11, 2006.

  1. robert

    robert Guest

    Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
    m, m-err, o, o-err, r-coef,r-coef-err ?

    Or a formula to to compute the 3 error ranges?

    -robert

    PS:

    numpy.corrcoef computes only the bare coeff:
    >>> numpy.corrcoef((0,1,2,3.0),(2,5,6,7.0),)

    array([[ 1. , 0.95618289],
    [ 0.95618289, 1. ]])

    with ints it goes computes wrong:
    >>> numpy.corrcoef((0,1,2,3),(2,5,6,7),)

    array([[ 1. , 0.94491118],
    [ 0.94491118, 1. ]])
    robert, Nov 11, 2006
    #1
    1. Advertising

  2. robert

    Robert Kern Guest

    robert wrote:
    > Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
    > m, m-err, o, o-err, r-coef,r-coef-err ?


    numpy and scipy questions are best asked on their lists, not here. There are a
    number of people who know the capabilities of numpy and scipy through and
    through, but most of them don't hang out on comp.lang.python.

    http://www.scipy.org/Mailing_Lists

    scipy.optimize.leastsq() can be told to return the covariance matrix of the
    estimated parameters (m and o in your example; I have no idea what you think
    r-coeff is).

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Nov 11, 2006
    #2
    1. Advertising

  3. robert

    Robert Kern Guest

    Robert Kern wrote:
    > robert wrote:
    >> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
    >> m, m-err, o, o-err, r-coef,r-coef-err ?


    > scipy.optimize.leastsq() can be told to return the covariance matrix of the
    > estimated parameters (m and o in your example; I have no idea what you think
    > r-coeff is).


    Ah, the correlation coefficient itself. Since correlation coefficients are weird
    beasts constrained to [-1, 1], standard gaussian errors like you are expecting
    for m-err and o-err don't apply. No, there's currently no function in numpy or
    scipy that will do something sophisticated enough to be reliable. Here's an option:

    http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=155684

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Nov 11, 2006
    #3
  4. robert

    Robert Kern Guest

    robert wrote:
    > Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
    > m, m-err, o, o-err, r-coef,r-coef-err ?


    And of course, those three parameters are not particularly meaningful together.
    If your model is truly "y is a linear response given x with normal noise" then
    "y=m*x+o" is correct, and all of the information that you can get from the data
    will be found in the estimates of m and o and the covariance matrix of the
    estimates.

    On the other hand, if your model is that "(x, y) is distributed as a bivariate
    normal distribution" then "y=m*x+o" is not a particularly good representation of
    the model. You should instead estimate the mean vector and covariance matrix of
    (x, y). Your correlation coefficient will be the off-diagonal term after
    dividing out the marginal standard deviations.

    The difference between the two models is that the first places no restrictions
    on the distribution of x. The second does; both the x and y marginal
    distributions need to be normal. Under the first model, the correlation
    coefficient has no meaning.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Nov 11, 2006
    #4
  5. robert

    robert Guest

    Robert Kern wrote:
    > robert wrote:
    >> Is there a ready made function in numpy/scipy to compute the correlation y=mx+o of an X and Y fast:
    >> m, m-err, o, o-err, r-coef,r-coef-err ?

    >
    > And of course, those three parameters are not particularly meaningful together.
    > If your model is truly "y is a linear response given x with normal noise" then
    > "y=m*x+o" is correct, and all of the information that you can get from the data
    > will be found in the estimates of m and o and the covariance matrix of the
    > estimates.
    >
    > On the other hand, if your model is that "(x, y) is distributed as a bivariate
    > normal distribution" then "y=m*x+o" is not a particularly good representation of
    > the model. You should instead estimate the mean vector and covariance matrix of
    > (x, y). Your correlation coefficient will be the off-diagonal term after
    > dividing out the marginal standard deviations.
    >
    > The difference between the two models is that the first places no restrictions
    > on the distribution of x. The second does; both the x and y marginal
    > distributions need to be normal. Under the first model, the correlation
    > coefficient has no meaning.


    Think the difference is little in practice - when you head for usable diagonals.
    Looking at the bivar. coef first before going on to any models, seems to be a more stable approach for the first step in data mining. ( before you proceed to a model or to class-learning .. )

    Basically the first need is to analyse lots of x,y data and check for linear dependencies. No real model so far. I'd need a quality measure (coef**2) and to know how much I can rely on it (coef-err). coef alone is not enough. You get a perfect 1.0 with 2 ( or 3 - see below ) points.
    With big coef's and lots of distributed data the coef is very good by itself - its error range err(N) only approx ~ 1/sqrt(N)

    One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
    More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.

    Now think of adding 'boring data':

    >>> X=[1.,2,3,4]
    >>> Y=[1.,2,3,5]
    >>> sd.correlation((X,Y)) # my old func

    (1.3, -0.5, 0.982707629824) # m,o,coef
    >>> numpy.corrcoef((X,Y))

    array([[ 1. , 0.98270763],
    [ 0.98270763, 1. ]])
    >>> XX=[1.,1,1,1,1,2,3,4]
    >>> YY=[1.,1,1,1,1,2,3,5]
    >>> sd.correlation((XX,YY))

    (1.23684210526, -0.289473684211, 0.988433774639)
    >>>


    I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...

    I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.

    http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
    says:
    In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.

    Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.

    http://links.jstor.org/sici?sici=0162-1459(192906)24:166<170:FFPEOC>2.0.CO;2-Y
    tells:

    probable error of r = 0.6745*(1-r**2)/sqrt(N)

    A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from

    0.0115628571429 to 0.00548453410954 !

    And the absolute size of this error measure seems to be too low for just 4 points of data!

    The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...


    Robert



    PS:

    my old func is simply hands-on based on
    n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
    Guess its already fast for large data?


    Note: numpy.corrcoef strikes on 2 points:
    >>> numpy.corrcoef(([1,2],[1,2]))

    array([[ -1.#IND, -1.#IND],
    [ -1.#IND, -1.#IND]])
    >>> sd.correlation(([1,2],[1,2]))

    (1, 0, 1.0)
    >>>
    >>> numpy.corrcoef(([1,2,3],[1,2,3]))

    array([[ 1., 1.],
    [ 1., 1.]])
    >>> sd.correlation(([1,2,3],[1,2,3]))

    (1, 0, 1.0)


    PPS:

    A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
    robert, Nov 12, 2006
    #5
  6. robert

    sturlamolden Guest

    Robert Kern wrote:

    > The difference between the two models is that the first places no restrictions
    > on the distribution of x. The second does; both the x and y marginal
    > distributions need to be normal. Under the first model, the correlation
    > coefficient has no meaning.


    That is not correct. The correlation coefficient is meaningful in both
    models, but must be interpreted differently. However, in both cases a
    correlation coefficient of 1 or -1 indicates an exact linear
    relationship between x and y.

    Under the first model ("linear regression"), the squared correlation
    coefficient is the "explained variance", i.e. the the proportion of y's
    variance accounted for by the model y = m*x + o.

    Under the second model ("multivariate normal distribution"), the
    correlation coefficient is the covariance of y and x divided by the
    product of the standard deviations, cov(x,y)/(std(x)*std(y)).
    sturlamolden, Nov 12, 2006
    #6
  7. robert

    robert Guest

    robert wrote:
    > Robert Kern wrote:
    > http://links.jstor.org/sici?sici=0162-1459(192906)24:166<170:FFPEOC>2.0.CO;2-Y
    >
    > tells:
    > probable error of r = 0.6745*(1-r**2)/sqrt(N)
    >
    > A simple function of r and N - quite what I expected above roughly for
    > the N-only dep.. But thus it is not sensitive to above considerations
    > about 'boring' data. With above example it would spit a decrease of this
    > probable coef-err from
    > 0.0115628571429 to 0.00548453410954 !


    This 1929 formula for estimating the error of correlation coefficient seems to make some sense for r=0 .
    I do monte carlo on correlating random series:

    >>> X=numpy.random.random(10000)
    >>> l=[]
    >>> for i in range(200):

    .... Z=numpy.random.random(10000)
    .... l.append( sd.correlation((X,Z))[2] ) #collect coef's
    ....
    >>> mean(l)

    0.000327657082234
    >>> std(l)

    0.0109120766158 # thats how the coef jitters
    >>> std(l)/sqrt(len(l))

    0.000771600337185
    >>> len(l)

    200

    # now
    # 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
    # vs M.C. 0.0109120766158 ± 0.000771600337185


    but the fancy factor of 0.6745 is significantly just fancy for r=0.

    then for a higher (0.5) correlation:

    >>> l=[]
    >>> for i in range(200):

    .... Z=numpy.random.random(10000)+array(range(10000))/10000.0
    .... l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
    ....
    >>> mean(l)

    0.498905642552
    >>> std(l)

    0.00546979583163
    >>> std(l)/sqrt(len(l))

    0.000386772972425

    #now:
    # 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849)
    # vs M.C. 0.00546979583163 ± 0.000386772972425

    => there the 0.6745 factor and (1-r**2) seem to get the main effect ! There is something in it.

    --

    Now adding boring data:

    >>> boring=ones(10001)*0.5
    >>> X=numpy.random.random(10000)
    >>> l=[]
    >>> for i in range(200):

    .... Z=concatenate((numpy.random.random(10000)+array(range(10000))/10000.0,boring))
    .... l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
    ....
    >>> mean(l)

    0.712753628489 # r
    >>> std(l)

    0.00316163649888 # r_err
    >>> std(l)/sqrt(len(l))

    0.0002235614608

    # now:
    # 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461 #N=20000
    # vs M.C. streuung 0.00316163649888 ± 0.0002235614608

    => the boring data has an effect on coef-err which is significantly not reflected by the formula 0.6745*(1-r**2)/sqrt(N)

    => I'll use this formula to get a downside error estimate for the correlation coefficient:

    ------------------------------------------
    | r_err_down ~= 1.0 * (1-r**2)/sqrt(N) |
    ------------------------------------------

    (until I find a better one respecting the actual distribution of data)

    Would be interesting what MATLAB & Octave say ...


    -robert
    robert, Nov 12, 2006
    #7
  8. robert

    sturlamolden Guest

    First, are you talking about rounding error (due to floating point
    arithmetics) or statistical sampling error?

    If you are talking about the latter, I suggest you look it up in a
    statistics text book. E.g. if x and y are normally distributed, then

    t = r * sqrt( (n-2)/(1-r**2) )

    has a Student t-distribution with n-2 degrees of freedom. And if you
    don't know how to get the p-value from that, you should not be messing
    with statistics anyway.
    sturlamolden, Nov 12, 2006
    #8
  9. robert

    robert Guest

    sturlamolden wrote:
    > First, are you talking about rounding error (due to floating point
    > arithmetics) or statistical sampling error?


    About measured data. rounding and sampling errors with special distrutions are neglegible. Thus by default assuming gaussian noise in x and y.
    (This may explain that factor of ~0.7 in the rectangle M.C. test)
    The (x,y) points may not distribute "nicely" along the assumed regression diagonale.

    > If you are talking about the latter, I suggest you look it up in a
    > statistics text book. E.g. if x and y are normally distributed, then
    >
    > t = r * sqrt( (n-2)/(1-r**2) )
    >
    > has a Student t-distribution with n-2 degrees of freedom. And if you
    > don't know how to get the p-value from that, you should not be messing
    > with statistics anyway.


    yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big) ?
    that same "const. * (1-r**2)/sqrt(n)" which I found in that other document ?

    The const. ~1 is less the problem.

    My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line. E.g. due to the nature of the experiment more points are around (0,0) but only few distribute along the interesting part of the diagonale and thus few points have great effect on m & r. above formulas will possibly not respect that.
    I could try a weighting technique, but maybe there is a (commonly used) speedy formula for r/r_err respecting that directly?


    Robert
    robert, Nov 12, 2006
    #9
  10. On 11/12/06, robert <> wrote:
    > Robert Kern wrote:
    > > robert wrote:


    (...)
    > One would expect the error range to drop simply with # of points. Yet it depends more complexly on the mean value of the coef and on the distribution at all.
    > More interesting realworld cases: For example I see a lower correlation on lots of points - maybe coef=0.05 . Got it - or not? Thus lower coefs require naturally a coef-err to be useful in practice.
    >


    (...)

    > I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.
    >

    (...)
    >
    > Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.
    >
    >


    There is a transformation of the correlation coefficient that is
    distributed as a t-statistic under the null. This was derived a long
    way back, and is the usual, standard, way to test for significance of
    (attach a p-value to) correlation coefficients. I do not recall the
    formula from top of my head, but I am sure any google search will find
    it. And of course, any decent intro stats textbook will also provide
    it. You can look in your nearby library for popular textbooks such as
    Snedecor & Cochran, or Sokal & Rohlf, or the wonderful "Beyond Anova"
    by Miller, which do have the expression.

    Since this is such a common procedure all stats programs do provide
    for tests of significance of correlation coefficients. A whole bunch
    of them are free, and one is widely used: R (check
    http://cran.r-project.org). This is an easy way for you to test the
    output of your stats code in Python.

    Now, since you can do a significance test, you can invert the
    procedure and obtain a confidence interval (of whichever width you
    want) for your correlation coefficients. This CI will (almost always)
    be assymmetric. (And recall that CI do have a not-obvious
    interpretation). CI of correlation coefficients are not as common as
    p-values, but this is doable too.

    The expression (and "improved versions" of it, I think Sokal & Rohlf
    show several) is easy to compute. And I bet obtaining the density of a
    t-distribution with k d.f. is also available already for Python. So
    this part is definitely doable.


    Best,

    R.

    (...)
    > Now think of adding 'boring data':
    >
    > >>> X=[1.,2,3,4]
    > >>> Y=[1.,2,3,5]
    > >>> sd.correlation((X,Y)) # my old func

    > (1.3, -0.5, 0.982707629824) # m,o,coef
    > >>> numpy.corrcoef((X,Y))

    > array([[ 1. , 0.98270763],
    > [ 0.98270763, 1. ]])
    > >>> XX=[1.,1,1,1,1,2,3,4]
    > >>> YY=[1.,1,1,1,1,2,3,5]
    > >>> sd.correlation((XX,YY))

    > (1.23684210526, -0.289473684211, 0.988433774639)
    > >>>

    >
    > I'd expect: the little increase of r is ok. But this 'boring data' should not make the error to go down simply ~1/sqrt(N) ...
    >
    > http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
    > says:
    > In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.

    http://links.jstor.org/sici?sici=0162-1459(192906)24:166<170:FFPEOC>2.0.CO;2-Y
    > tells:
    >
    > probable error of r = 0.6745*(1-r**2)/sqrt(N)
    >
    > A simple function of r and N - quite what I expected above roughly for the N-only dep.. But thus it is not sensitive to above considerations about 'boring' data. With above example it would spit a decrease of this probable coef-err from
    >
    > 0.0115628571429 to 0.00548453410954 !
    >
    > And the absolute size of this error measure seems to be too low for just 4 points of data!
    >
    > The other formula which I remember seeing once was much more sophisticated and used things like sum_xxy etc...
    >
    >
    > Robert
    >
    >
    >
    > PS:
    >
    > my old func is simply hands-on based on
    > n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
    > Guess its already fast for large data?
    >
    >
    > Note: numpy.corrcoef strikes on 2 points:
    > >>> numpy.corrcoef(([1,2],[1,2]))

    > array([[ -1.#IND, -1.#IND],
    > [ -1.#IND, -1.#IND]])
    > >>> sd.correlation(([1,2],[1,2]))

    > (1, 0, 1.0)
    > >>>
    > >>> numpy.corrcoef(([1,2,3],[1,2,3]))

    > array([[ 1., 1.],
    > [ 1., 1.]])
    > >>> sd.correlation(([1,2,3],[1,2,3]))

    > (1, 0, 1.0)
    >
    >
    > PPS:
    >
    > A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >



    --
    Ramon Diaz-Uriarte
    Statistical Computing Team
    Structural Biology and Biocomputing Programme
    Spanish National Cancer Centre (CNIO)
    http://ligarto.org/rdiaz
    Ramon Diaz-Uriarte, Nov 12, 2006
    #10
  11. robert

    Robert Kern Guest

    sturlamolden wrote:
    > Robert Kern wrote:
    >
    >> The difference between the two models is that the first places no restrictions
    >> on the distribution of x. The second does; both the x and y marginal
    >> distributions need to be normal. Under the first model, the correlation
    >> coefficient has no meaning.

    >
    > That is not correct. The correlation coefficient is meaningful in both
    > models, but must be interpreted differently. However, in both cases a
    > correlation coefficient of 1 or -1 indicates an exact linear
    > relationship between x and y.
    >
    > Under the first model ("linear regression"), the squared correlation
    > coefficient is the "explained variance", i.e. the the proportion of y's
    > variance accounted for by the model y = m*x + o.


    Point.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Nov 12, 2006
    #11
  12. robert

    Robert Kern Guest

    robert wrote:

    > I remember once I saw somewhere a formula for an error range of the corrcoef. but cannot find it anymore.


    There is no such thing as "a formula for an error range" in a vacuum like that.
    Each formula has a model attached to it. If your data does not follow that
    model, then any such estimate of the error of your estimate is meaningless.

    sturlamolden pointed out to me that I was wrong in thinking that the correlation
    coefficient was meaningless in the linear regression case. However, the error
    estimate that is used for bivariate normal correlation coefficients will almost
    certainly not apply to "correlation coefficient qua sqare root of the fraction
    of y's variance explained by the linear model". In that context, the
    "correlation coefficient" is not an estimate of some parameter. It has no
    uncertainty attached to it.

    > http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Trivia
    > says:
    > In MATLAB, corr(X) calculates Pearsons correlation coefficient along with p-value.
    >
    > Does anybody know how this prob.-value is computed/motivated? Such thing would be very helpful for numpy/scipy too.


    scipy.stats.pearsonr()

    As with all such frequentist hypothesis testing nonsense, one takes the null
    hypothesis (in this case, a bivariate normal distribution of points with 0
    correlation), finds the distribution of the test statistic given the number of
    points sampled, and then finds the probability of getting a test statistic "at
    least as extreme" as the test statistic you actually got.

    > my old func is simply hands-on based on
    > n,sum_x,sum_y,sum_xy,sum_xx,sum_yy=len(vx),vx.sum(),vy.sum(),(vx*vy).sum(),(vx*vx).sum(),(vy*vy).sum()
    > Guess its already fast for large data?


    You really want to use a 2-pass algorithm to avoid numerical problems.

    > Note: numpy.corrcoef strikes on 2 points:
    >>>> numpy.corrcoef(([1,2],[1,2]))

    > array([[ -1.#IND, -1.#IND],
    > [ -1.#IND, -1.#IND]])


    This was fixed in SVN.

    > PPS:
    >
    > A compatible scipy binary (0.5.2?) for numpy 1.0 was announced some weeks back. Think currently many users suffer when trying to get started with incompatible most-recent libs of scipy and numpy.


    Well, go ahead and poke the people that said they would have a Windows binary
    ready. Or provide one yourself.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Nov 12, 2006
    #12
  13. robert

    sturlamolden Guest

    robert wrote:

    > > t = r * sqrt( (n-2)/(1-r**2) )


    > yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big) ?
    > that same "const. * (1-r**2)/sqrt(n)" which I found in that other document ?


    I gave you th formula. Solve for r and you get the confidence interval.
    You will need to use the inverse cumulative Student t distribution.

    Another quick-and-dirty solution is to use bootstrapping.

    from numpy import mean, std, sum, sqrt, sort
    from numpy.random import randint

    def bootstrap_correlation(x,y):
    idx = randint(len(x),size=(1000,len(x)))
    bx = x[idx] # reasmples x with replacement
    by = y[idx] # resamples y with replacement
    mx = mean(bx,1)
    my = mean(by,1)
    sx = std(bx,1)
    sy = std(by,1)
    r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
    (by - my.repeat(len(y),0).reshape(by.shape)), 1) /
    ((len(x)-1)*sx*sy))
    #bootstrap confidence interval (NB! biased)
    return (r[25],r[975])


    > My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line.


    The bootstrap is "non-parametric" in the sense that it is distribution
    free.
    sturlamolden, Nov 12, 2006
    #13
  14. robert

    sturlamolden Guest

    While I am at it, lets add the bootstrap estimate of the standard error
    as well.




    from numpy import mean, std, sum, sqrt, sort, corrcoef, tanh, arctanh
    from numpy.random import randint

    def bootstrap_correlation(x,y):
    idx = randint(len(x),size=(1000,len(x)))
    bx = x[idx]
    by = y[idx]
    mx = mean(bx,1)
    my = mean(by,1)
    sx = std(bx,1)
    sy = std(by,1)
    r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
    (by - my.repeat(len(y),0).reshape(by.shape)), 1)
    / ((len(x)-1)*sx*sy))
    #bootstrap confidence interval (NB! biased)
    conf_interval = (r[25],r[975])
    #bootstrap standard error using Fisher's z-transform (NB! biased)
    std_err = tanh(std(arctanh(r))*(len(r)/(len(r)-1.0)))
    return (std_err, conf_interval)

    Since we are not using bias corrected and accelerated bootstrap, the
    standard error are more meaningful, although both estimates are biased.

    I said that the boostrap is "distribution free". That is not quite the
    case either. The assumed distribution is the provided sample
    distribution, i.e. a sum of Dirac delta functions. Often this is not a
    very sound assumption, and the bootstrap must then be improved e.g.
    using the BCA procedure. That is particularly important for confidence
    intervals, but not so much for standard errors. However for a
    quick-and-dirty standard error estimate, we can just ignore the small
    bias.

    Now, if you wonder why I used the standard deviation to obtain the
    standard error (without dividing by sqrt(n) ), the answer is that the
    standard error is the standard deviation of an estimator. As we have
    1000 samples from the correlation's sampling distribution in r, we get
    the "standard error of the correlation" by taking the standard
    deviation of r. The z-transform is required before we can compute mean
    and standard deviations of correlations.
    sturlamolden, Nov 12, 2006
    #14
  15. robert

    robert Guest

    Re: numpy/scipy: error of correlation coefficient (clumpy data)

    sturlamolden wrote:
    > robert wrote:
    >
    >>> t = r * sqrt( (n-2)/(1-r**2) )

    >
    >> yet too lazy/practical for digging these things from there. You obviously got it - out of that, what would be a final estimate for an error range of r (n big) ?
    >> that same "const. * (1-r**2)/sqrt(n)" which I found in that other document ?

    >
    > I gave you th formula. Solve for r and you get the confidence interval.
    > You will need to use the inverse cumulative Student t distribution.
    >
    > Another quick-and-dirty solution is to use bootstrapping.
    >
    > from numpy import mean, std, sum, sqrt, sort
    > from numpy.random import randint
    >
    > def bootstrap_correlation(x,y):
    > idx = randint(len(x),size=(1000,len(x)))
    > bx = x[idx] # reasmples x with replacement
    > by = y[idx] # resamples y with replacement
    > mx = mean(bx,1)
    > my = mean(by,1)
    > sx = std(bx,1)
    > sy = std(by,1)
    > r = sort(sum( (bx - mx.repeat(len(x),0).reshape(bx.shape)) *
    > (by - my.repeat(len(y),0).reshape(by.shape)), 1) /
    > ((len(x)-1)*sx*sy))
    > #bootstrap confidence interval (NB! biased)
    > return (r[25],r[975])
    >
    >
    >> My main concern is, how to respect the fact, that the (x,y) points may not distribute well along the regression line.

    >
    > The bootstrap is "non-parametric" in the sense that it is distribution
    > free.



    thanks for the bootstrap tester. It confirms mainly the "r_stderr = (1-r**2)/sqrt(n)" formula. The assymetry of r (-1..+1) is less a problem.
    Yet my main problem, how to respect clumpy distribution in the data points, is still the same.
    In practice think of a situation where data out of an experiment has an unkown damping/filter (or whatever unkown data clumper) on it, thus lots of redundancy in effect.
    An extreme example is to just duplicate data:

    >>> x ,y =[0.,0,0,0,1]*10 ,[0.,1,1,1,1]*10
    >>> xx,yy=[0.,0,0,0,1]*100,[0.,1,1,1,1]*100
    >>> correlation(x,y)

    (0.25, 0.132582521472, 0.25, 0.75)
    >>> correlation(xx,yy)

    (0.25, 0.0419262745781, 0.25, 0.75)
    >>> bootstrap_correlation(array(x),array(y))

    (0.148447544378, 0.375391432338)
    >>> bootstrap_correlation(array(xx),array(yy))

    (0.215668822617, 0.285633303438)
    >>>


    here the bootstrap test will as well tell us, that the confidence intervall narrows down by a factor ~sqrt(10) - just the same as if there would be 10-fold more of well distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.

    The interesting task is probably this: to check for linear correlation but "weight clumping of data" somehow for the error estimation.
    So far I can only think of kind of geometric density approach...

    Or is there a commonly known straight forward approach/formula for this problem?
    In this formula which I can remember weakly somehow - I think there were other basic sum terms like sum_xxy, sum_xyy,.. in it (which are not needed for the formula for r itself )


    Robert
    robert, Nov 15, 2006
    #15
  16. robert

    sturlamolden Guest

    Re: numpy/scipy: error of correlation coefficient (clumpy data)

    robert wrote:

    > here the bootstrap test will as well tell us, that the confidence intervall narrows down by a factor ~sqrt(10) - just the same as if there would be 10-fold more of well distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.



    The confidence intervals narrows when the amount of independent data
    increases. If you don't understand why, then you lack a basic
    understanding of statistics. Particularly, it is a fundamental
    assumption in most statistical models that the data samples are
    "IDENTICALLY AND INDEPENDENTLY DISTRIBUTED", often abbreviated "i.i.d."
    And it certainly is assumed in this case. If you tell the computer (or
    model) that you have i.i.d. data, it will assume it is i.i.d. data,
    even when its not. The fundamental law of computer science also applies
    to statistics: shit in = shit out. If you nevertheless provide data
    that are not i.i.d., like you just did, you will simply obtain invalid
    results.

    The confidence interval concerns uncertainty about the value of a
    population parameter, not about the spread of your data sample. If you
    collect more INDEPENDENT data, you know more about the population from
    which the data was sampled. The confidence interval has the property
    that it will contain the unknown "true correlation" 95% of the times it
    is generated. Thus if you two samples WITH INDEPENDENT DATA from the
    same population, one small and one large, the large sample will
    generate a narrower confidence interval. Computer intensive methods
    like bootstrapping and asymptotic approximations derived analytically
    will behave similarly in this respect. However, if you are dumb enough
    to just provide duplications of your data, the computer is dumb enough
    to accept that they are obtained statistically independently. In
    statistical jargon this is called "pseudo-sampling", and is one of the
    most common fallacies among uneducated practitioners.

    Statistical software doesn't prevent the practitioner from shooting
    himself in the leg; it actually makes it a lot easier. Anyone can paste
    data from Excel into SPSS and hit "ANOVA" in the menu. Whether the
    output makes any sense is a whole other story. One can duplicate each
    sample three or four times, and SPSS would be ignorant of that fact. It
    cannot guess that you are providing it with crappy data, and prevent
    you from screwing up your analysis. The same goes for NumPy code. The
    statistical formulas you type in Python have certain assumptions, and
    when they are violated the output is of no value. The more severe the
    violation, the less valuable is the output.

    > The interesting task is probably this: to check for linear correlation but "weight clumping of data" somehow for the error estimation.


    If you have a pathological data sample, then you need to specify your
    knowledge in greater detail. Can you e.g. formulate a reasonable
    stochastic model for your data, fit the model parameters using the
    data, and then derive the correlation analytically?

    I am beginning to think your problem is ill defined because you lack a
    basic understanding of maths and statistics. For example, it seems you
    were confusing numerical error (rounding and truncation error) with
    statistical sampling error, you don't understand why standard errors
    decrease with sample size, you are testing with pathological data, you
    don't understand the difference between independent data and data
    duplications, etc. You really need to pick up a statistics textbook and
    do some reading, that's my advice.
    sturlamolden, Nov 16, 2006
    #16
  17. robert

    robert Guest

    Re: numpy/scipy: error of correlation coefficient (clumpy data)

    sturlamolden wrote:
    > robert wrote:
    >
    >> here the bootstrap test will as well tell us, that the confidence intervall narrows down by a factor ~sqrt(10) - just the same as if there would be 10-fold more of well distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.

    >
    >
    > The confidence intervals narrows when the amount of independent data
    > increases. If you don't understand why, then you lack a basic
    > understanding of statistics. Particularly, it is a fundamental
    > assumption in most statistical models that the data samples are
    > "IDENTICALLY AND INDEPENDENTLY DISTRIBUTED", often abbreviated "i.i.d."
    > And it certainly is assumed in this case. If you tell the computer (or
    > model) that you have i.i.d. data, it will assume it is i.i.d. data,
    > even when its not. The fundamental law of computer science also applies
    > to statistics: shit in = shit out. If you nevertheless provide data
    > that are not i.i.d., like you just did, you will simply obtain invalid
    > results.
    >
    > The confidence interval concerns uncertainty about the value of a
    > population parameter, not about the spread of your data sample. If you
    > collect more INDEPENDENT data, you know more about the population from
    > which the data was sampled. The confidence interval has the property
    > that it will contain the unknown "true correlation" 95% of the times it
    > is generated. Thus if you two samples WITH INDEPENDENT DATA from the
    > same population, one small and one large, the large sample will
    > generate a narrower confidence interval. Computer intensive methods
    > like bootstrapping and asymptotic approximations derived analytically
    > will behave similarly in this respect. However, if you are dumb enough
    > to just provide duplications of your data, the computer is dumb enough
    > to accept that they are obtained statistically independently. In
    > statistical jargon this is called "pseudo-sampling", and is one of the
    > most common fallacies among uneducated practitioners.


    that duplication is just an extreme example to show my need: When I get the data, there can be an inherent filter/damping or other mean of clumping on the data which I don't know of beforehand. My model is basically linear (its a preparation step for ranking valuable input data for a classification task, thus for data reduction), just the degree of clumping in the data is unknown. Thus the formula for r is ok, but that bare i.i.d. formula for the error "(1-r**2)/sqrt(n)" (or bootstrap_test same) is blind on that.

    > Statistical software doesn't prevent the practitioner from shooting
    > himself in the leg; it actually makes it a lot easier. Anyone can paste
    > data from Excel into SPSS and hit "ANOVA" in the menu. Whether the
    > output makes any sense is a whole other story. One can duplicate each
    > sample three or four times, and SPSS would be ignorant of that fact. It
    > cannot guess that you are providing it with crappy data, and prevent
    > you from screwing up your analysis. The same goes for NumPy code. The
    > statistical formulas you type in Python have certain assumptions, and
    > when they are violated the output is of no value. The more severe the
    > violation, the less valuable is the output.
    >
    >> The interesting task is probably this: to check for linear correlation but "weight clumping of data" somehow for the error estimation.

    >
    > If you have a pathological data sample, then you need to specify your
    > knowledge in greater detail. Can you e.g. formulate a reasonable
    > stochastic model for your data, fit the model parameters using the
    > data, and then derive the correlation analytically?


    no, its too complex. Or its just: additional clumping/fractality in the data.
    Thus, linear correlation is supposed, but the x,y data distribution may have "less than 2 dimensions". No better model.

    Think of such example: A drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses/slow motion. You get x,y coordinates in 1 per second. His speed and time pattern at all do not matter - you just want to know how well he keeps his track.
    ( My application data is even worse/blackbox, there is not even such "model" )

    > I am beginning to think your problem is ill defined because you lack a
    > basic understanding of maths and statistics. For example, it seems you
    > were confusing numerical error (rounding and truncation error) with
    > statistical sampling error, you don't understand why standard errors
    > decrease with sample size, you are testing with pathological data, you
    > don't understand the difference between independent data and data
    > duplications, etc. You really need to pick up a statistics textbook and
    > do some reading, that's my advice.


    I think I understand all this very well. Its not on this level. The problem has also nothing to do with rounding, sampling errors etc.
    Of course the error ~1/sqrt(n) is the basic assumption - not what I not know, but what I "complain" about :) (Thus I even guessed the "dumb" formula for r_err well before I saw it somewhere. This is all not the real question.).
    Yet I need a way to _NOT_ just fall on that ~1/sqrt(n) for the error, when there is unknown clumping in the data. It has to be a smarter - say automatic non-i.i.d. computation for a reasonable confidence intervall/error of the correlation - in absence of a final/total modell.

    Thats not even an exceptional application. In most measured data which is created by iso-timestep sampling (thus not "pathological" so far?), the space of some interesting 2 variables may be walked "non-iso". Think of any time series data, where most of the data is "boring"/redundant because the flux of the experiment is so, that interesting things happen only occasionally. In absence of a full model for the "whole history", one could try to preprocess the x,y data by attaching a density weight in order to make it "non-pathological" before feeding it into the formula for r,r_err. Yet this is expensive. Or one could think of computing a rough fractal dimension and decorate the error like fracconst * (1-r**2)/sqrt(n)

    The (fast) formula I'm looking for - possibly it doesn't exist - should do this in a rush.


    Robert
    robert, Nov 16, 2006
    #17
  18. robert

    sturlamolden Guest

    Re: numpy/scipy: error of correlation coefficient (clumpy data)

    robert wrote:

    > Think of such example: A drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses/slow motion. You get x,y coordinates in 1 per second. His speed and time pattern at all do not matter - you just want to know how well he keeps his track.



    In which case you have time series data, i.e. regular samples from p(t)
    = [ x(t), y(t) ]. Time series have some sort of autocorrelation in the
    samples as well, which must be taken into account. Even tough you could
    weight each point by the drunkard's speed, a correlation or linear
    regression would still not make any sense here, as such analyses are
    based on the assumption of no autocorrelation in the samples or the
    residuals. Correlation has no meaning if y[t] is correlated with
    y[t+1], and regression has no meaning if the residual e[t] is
    correlated with the residual e[t+1].

    A state space model could e.g. be applicable. You could estimate the
    path of the drunkard using a Kalman filter to compute a Taylor series
    expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... for the path at each step
    p(t). When you have estimates for the state parameters s, v, and a, you
    can compute some sort of measure for the drunkard's deviation from his
    ideal path.

    However, if you don't have time series data, you should not treat your
    data as such.

    If you don't know how your data is generated, there is no way to deal
    with them correctly. If the samples are time series they must be
    threated as such, if they are not they should not. If the samples are
    i.i.d. each point count equally much, if they are not they do not. If
    you have a clumped data due to time series or lack of i.i.d., you must
    deal with that. However, data can be i.i.d. and clumped, if the
    underlying distribution is clumped. In order to determine the cause,
    you must consider how your data are generated and how your data are
    sampled. You need meta-information about your data to determine this.
    Matlab or Octave will help you with this, and it is certainly not a
    weakness of NumPy as you implied in your original post. There is no way
    to put magic into any numerical computation. Statistics always require
    formulation of specific assumptions about the data. If you cannot think
    clearly about your data, then that is the problem you must solve.
    sturlamolden, Nov 16, 2006
    #18
  19. robert

    robert Guest

    Re: numpy/scipy: error of correlation coefficient (clumpy data)

    sturlamolden wrote:
    > robert wrote:
    >
    >> Think of such example: A drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses/slow motion. You get x,y coordinates in 1 per second. His speed and time pattern at all do not matter - you just want to know how well he keeps his track.

    >
    >
    > In which case you have time series data, i.e. regular samples from p(t)
    > = [ x(t), y(t) ]. Time series have some sort of autocorrelation in the
    > samples as well, which must be taken into account. Even tough you could
    > weight each point by the drunkard's speed, a correlation or linear
    > regression would still not make any sense here, as such analyses are
    > based on the assumption of no autocorrelation in the samples or the
    > residuals. Correlation has no meaning if y[t] is correlated with
    > y[t+1], and regression has no meaning if the residual e[t] is
    > correlated with the residual e[t+1].
    >
    > A state space model could e.g. be applicable. You could estimate the
    > path of the drunkard using a Kalman filter to compute a Taylor series
    > expansion p(t) = p0 + v*t + 0.5*a*t**2 + ... for the path at each step
    > p(t). When you have estimates for the state parameters s, v, and a, you
    > can compute some sort of measure for the drunkard's deviation from his
    > ideal path.
    >
    > However, if you don't have time series data, you should not treat your
    > data as such.
    >
    > If you don't know how your data is generated, there is no way to deal
    > with them correctly. If the samples are time series they must be
    > threated as such, if they are not they should not. If the samples are
    > i.i.d. each point count equally much, if they are not they do not. If
    > you have a clumped data due to time series or lack of i.i.d., you must
    > deal with that. However, data can be i.i.d. and clumped, if the
    > underlying distribution is clumped. In order to determine the cause,
    > you must consider how your data are generated and how your data are
    > sampled. You need meta-information about your data to determine this.
    > Matlab or Octave will help you with this, and it is certainly not a
    > weakness of NumPy as you implied in your original post. There is no way
    > to put magic into any numerical computation. Statistics always require
    > formulation of specific assumptions about the data. If you cannot think
    > clearly about your data, then that is the problem you must solve.


    yes, in the example of the drunkard time-series its possible to go to better model - yet even there it is very expensive (in relation to the stats-improvement) to worry too much about the best model for such guy :).

    In the field of datamining with many datatracks and typically digging first for multiple but smaller correlations - without a practical bottom-up modell, I think one falls regularly back to a certain basic case - maybe the most basic model for data at all: that basic modell is possibly that of a "hunter" which waits mostly, but only acts, if rare goodies are in front of him.
    Again in the most basic case, when having 2D x,y data in front of you without a relyable time-path or so, you see this: a density distribution of points. There is possibly a linear correlation on the highest scale - which you are interested in - but the points show also inhomgenity/clumping, and this rises the question of influence on r_err. What now? One sees clearly that its nonsense to make just plain average stats.
    I think this case is a most basic default for data - even compared to the common textbook-i.i.d. case. In fact, one can recognize such kind of stats, which repects mere (inhomogous) data-density itself, as (kind of simple/indepent) auto-bayesian stats vs. dumb averaging.

    I think one can almost always do this "bayesian density weighter/filter" as better option compared to mere average stats in that case of x,y correlation when there is obvioulsy interesting correlation but where you are too lazy..to..principally unable to itch out a modell on physics level. The latter requirement is in fact what any averaging stats cries for at any price - but how often can you do it in real world applications ...

    ( In reality there is anyway no way to eliminate auto-correlation in the composition of data. Everything and everybody lies :) )

    Thats where a top-down (model-free) bayesian stats approach will pay off: In the previous extreme example of criminal data duplication - I'm sure - it will totally neutralize the attack without question. In the drunkard time-series example it will tell me very reliably how well this guy will keep track - without need for a complex model. In the case of good i.i.d. data distribution it will tell me the same as simple stats. Just good news ...

    Thus I can possibly say it so now: I have the problem for guessing linear correlation (coefficient with error) on x,y data with respect to the (most general assupmtion) "bayesian" background of inhomogenous data distribution.
    Therefore I'm seeking a (fast/efficient/approx.) formula for r/r_err. I guess the formula for r does not change (much) compared to that for simple averaging stats, but the formula for r_err will.

    Maybe its easy with some existing means of numpy/scipy already. Maybe not. I'm far from finding the (efficient) math myself, but I know what I want - and can see if a formula really does it.


    Robert
    robert, Nov 16, 2006
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Terry Reedy

    Re: numpy and scipy for Python 2.3

    Terry Reedy, Jul 10, 2003, in forum: Python
    Replies:
    0
    Views:
    441
    Terry Reedy
    Jul 10, 2003
  2. Paxcal
    Replies:
    0
    Views:
    1,824
    Paxcal
    Feb 11, 2004
  3. robert
    Replies:
    2
    Views:
    1,506
    Travis E. Oliphant
    Oct 29, 2006
  4. Rob Clewley
    Replies:
    6
    Views:
    424
  5. vml
    Replies:
    3
    Views:
    678
Loading...

Share This Page