Can straight line best fit with error measure be made in a single pass?

Discussion in 'C++' started by Alf P. Steinbach, Oct 19, 2013.

  1. It would be nice if one could make do with a non-copyable forward
    iterator for the (x,y) data points, without storing the points.

    However, my current code requires the ability to iterate a second time,
    from a copy of the start iterator.

    Any suggestions for how to avoid the second iteration, or should I
    perhaps just forget it (I only need this to check algorithm complexity
    in unit test, so it's not a current problem, just "not ideal" code...)?


    Code:
    #pragma once
    // Copyright (c) 2013 Alf P. Steinbach
    
    // <url:
    http://terpconnect.umd.edu/~toh/spectrum/CurveFitting.html#MathDetails>
    //  n = number of x,y data points
    //  sumx = Σx
    //  sumy = Σy
    //  sumxy = Σx*y
    //  sumx2 = Σx*x
    //  meanx = sumx / n
    //  meany = sumy / n
    //  slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
    //  intercept = meany-(slope*meanx)
    //  ssy = Σ(y-meany)^2
    //  ssr = Σ(y-intercept-slope*x)^2
    //  R2 = 1-(ssr/ssy)
    //  Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
    //  Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
    //
    // Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>
    
    #include <rfc/cppx/calc/Statistics.h>       // cppx::calc::Statistics
    
    #include <cmath>                            // std::sqrt
    
    namespace cppx{ namespace calc{
    
        class Best_line_fit
            : private Statistics
        {
        private:
            double intercept_;
            double slope_;
            double ssy_;            // Sum of squares of y-expressions
            double ssr_;            // Sum of squares of residual expressions
    
        public:
            auto stats() const -> Statistics const&     { return *this; }
    
            auto slope() const -> double        { return slope_; }
            auto intercept() const -> double    { return intercept_; }
    
            auto ssy() const -> double          { return ssy_; }
            auto ssr() const -> double          { return ssr_; }
            auto r_squared() const -> double    { return (ssr_ == 0 || ssy_ == 0? 1.0 : 1.0 - ssr_/ssy_); }
            auto r() const -> double            { return std::sqrt( r_squared() ); }
    
            template< class It >
            Best_line_fit( It const first, It const end )
                : Statistics( first, end )
                , ssy_( 0.0 )
                , ssr_( 0.0 )
            {
                slope_ = (n()*sum_xy() - sum_x()*sum_y()) / (n()*sum_xx() - square( sum_x() ));
                intercept_ = mean_y() - slope_*mean_x();
    
                for( It it = first; it != end; ++it )
                {
                    double const x = x_of( *it );
                    double const y = y_of( *it );
    
                    //ssy_ += square( y - mean_y() );
                    ssr_ += square( y - (slope_*x + intercept_) );
                }
                ssy_ = sum_yy() - 2*sum_y()*mean_y() + n()*square( mean_y() );
            }
        };
    
    } }  // namespace cppx::calc
    
    Disclaimer 1: I do not KNOW that the code is correct, in particular the
    r_squared() implementation.

    Disclaimer 2: In the mathworld reference I do not understand eq. 17.


    Cheers,

    - Alf
    Alf P. Steinbach, Oct 19, 2013
    #1

  2. Re: Can straight line best fit with error measure be made in a single pass?

    On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
    > It would be nice if one could make do with a non-copyable forward
    > iterator for the (x,y) data points, without storing the points.


    Looking at the code below, I cannot really answer this question without
    knowing what 'Statistics' is. Presuming that it does iterate over the
    given range to collect the "sum of" and "mean" and the other values,
    your other choice is to fix the 'Statistics' class to either calculate
    your 'ssr' to boot, or to make it store the x and y values so you can
    extract them later at your leisure.

    > However, my current code requires the ability to iterate a second time,
    > from a copy of the start iterator.
    >
    > Any suggestions for how to avoid the second iteration, or should I
    > perhaps just forget it (I only need this to check algorithm complexity
    > in unit test, so it's not a current problem, just "not ideal" code...)?
    >
    >
    >
    [snip]
    > Disclaimer 1: I do not KNOW that the code is correct, in particular the
    > r_squared() implementation.
    >
    > Disclaimer 2: In the mathworld reference I do not understand eq. 17.


    It's the simplification of eq 16. X' (x with line over it) is the mean
    x value, isn't it?

    SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =

    = SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =

    = SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )

    = SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'

    Same with eq's 18 and 20.

    V
    --
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, Oct 19, 2013
    #2

  3. Re: Can straight line best fit with error measure be made in a single pass?

    On 19.10.2013 15:54, Victor Bazarov wrote:
    > On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
    >> It would be nice if one could make do with a non-copyable forward
    >> iterator for the (x,y) data points, without storing the points.

    >
    > Looking at the code below, I cannot really answer this question without
    > knowing what 'Statistics' is. Presuming that it does iterate over the
    > given range to collect the "sum of" and "mean" and the other values,


    Yes.


    > your other choice is to fix the 'Statistics' class to either calculate
    > your 'ssr' to boot,


    This would be preferable.

    However, the 'ssr' value depends on quantities that are themselves
    computed from the whole collection of numbers, plus the original
    numbers over again:

    > // slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
    > // intercept = meany-(slope*meanx)
    > // ssr = Σ(y-intercept-slope*x)^2


    I know there is a neat trick for updating a mean value incrementally, so
    I thought maybe someone knew about something similar here?

    It seems like a tradeoff math/C++: more of one to get less of the other.


    > or to make it store the x and y values so you can
    > extract them later at your leisure.


    Yes.


    [snip]
    >> Disclaimer 1: I do not KNOW that the code is correct, in particular the
    >> r_squared() implementation.
    >>
    >> Disclaimer 2: In the mathworld reference I do not understand eq. 17.

    >
    > It's the simplification of eq 16. X' (x with line over it) is the mean
    > x value, isn't it?
    >
    > SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =
    >
    > = SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =
    >
    > = SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )
    >
    > = SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'
    >
    > Same with eq's 18 and 20.


    Thanks! :)

    I got some help with that already this morning, on Facebook, and yes I
    felt pretty stupid... :-( Not to mention whether that feeling now
    reflects reality. Which it may.

    Anyway, for my unit testing this now works, using 40 000 time
    measurement points I get an R goodness between 0.99 and 1 (perfect) for
    linear complexity, and forcing O(n^2) behavior[1] I get an R around
    0.98. Apparently values tightly up against 0.99 or thereabouts are
    normal, i.e. apparently this is a very finely tuned goodness-of-fit
    criterion. But again I do not know, and the texts I've found neglect to
    say anything about which numerical values decide one way or other.


    Cheers,

    - Alf (still in uncharted territory)

    Notes:
    [1] By the simple expedient of allocating just the required size of
    internal buffer in the string, instead of using the doubling buffer
    size idea.
    Alf P. Steinbach, Oct 19, 2013
    #3
  4. Re: Can straight line best fit with error measure be made in a single pass?

    On 19.10.2013 20:57, Richard Damon wrote:
    > On 10/18/13 9:57 PM, Alf P. Steinbach wrote:
    >>

    > ...
    >> // ssy = Σ(y-meany)^2
    >> // ssr = Σ(y-intercept-slope*x)^2
    >> // R2 = 1-(ssr/ssy)
    >> // Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 -
    >> sumx*sumx))
    >> // Standard deviation of the intercept =
    >> SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
    >> //
    >> // Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>

    > ...
    >> - Alf

    >
    > Sounds like you just need to precompute any needed terms to compute ssy
    > and ssr.
    >
    > For example, ssy = sum( (y-my)^2 ) =
    > sum( y^2 - 2*my*y + my^2)
    >
    > since my is a constant over the sum =
    >
    > sum(y^2) - 2*my*sum(y) + my^2*sum(1)
    >
    > so you need to have computed sum(y^2) and sum(y)
    >
    >
    > Do the same math for ssr, and Bob's your uncle, you have your solution.
    >
    > First glance, ssr will need sums of y^2, y, xy, x, x^2, 1
    >


    Thanks, but the current (presented) code already does this for ssy, and
    ssr is the problem, not easily updated.

    Cheers,

    - Alf
    Alf P. Steinbach, Oct 20, 2013
    #4
  5. Re: Can straight line best fit with error measure be made in a single pass?

    On 19.10.2013 17:29, Paavo Helde wrote:
    > "Alf P. Steinbach" <> wrote in
    > news:l3souu$4j1$:
    >
    >> It would be nice if one could make do with a non-copyable forward
    >> iterator for the (x,y) data points, without storing the points.
    >>
    >> However, my current code requires the ability to iterate a second
    >> time, from a copy of the start iterator.
    >>
    >> Any suggestions for how to avoid the second iteration, or should I
    >> perhaps just forget it (I only need this to check algorithm complexity
    >> in unit test, so it's not a current problem, just "not ideal"
    >> code...)?

    >
    > Depends on what you mean by ideal? Code simplicity, performance, accuracy,
    > reducing the number of passes? These goals might be in conflict with each
    > other. For example std. deviation and variance can be easily calculated by
    > one-pass algorithm. However, a naive and simple implementation produces big
    > round-off errors; there is a numerically stable one-pass algorithm from
    > Knuth, but it involves a division operation inside loop, so might be slow.
    > There is a stable two-pass algorithm that can actually be faster, but well,
    > it has two passes.


    Well, unfortunately my Mr. Knuth (the three first volumes, I never got
    around to obtaining the TeX works or the current fourth volume stuff)
    resides at the bottom of some unidentified, anonymous-looking box, among
    numerous other unidentified anonymous-looking boxes.

    But thanks anyway, it really helps to have some background.


    > This was about variance; for linear regression it seems to be yet more
    > complicated. First google search turns up
    >
    > http://stats.stackexchange.com/ques...g-running-linear-or-logistic-regression-param


    Hm, among the answers there the last one, with 0 votes, seemed most
    promising.

    Still, the upshot seems to be "can't be done".


    Cheers,

    - Alf

    PS: Donald Knuth did most of the first volume work while he was (doing
    postdoc work?) at Norsk Regnesentral in Oslo. I can't help but connect
    that somehow to the Simula OO work being done at that time at the
    University of Oslo. Even though I never read about any connection. I
    once interviewed there, and mentioned this, just to show I'd done my bit
    about background research. But they had not even heard of him. :(
    Alf P. Steinbach, Oct 20, 2013
    #5
  6. Re: Can straight line best fit with error measure be made in a single pass?

    On 21.10.2013 02:48, Richard Damon wrote:
    > On 10/20/13 1:16 AM, Alf P. Steinbach wrote:
    >> On 19.10.2013 20:57, Richard Damon wrote:
    >>> [snip]

    >>
    >> Thanks, but the current (presented) code already does this for ssy, and
    >> ssr is the problem, not easily updated.
    >>
    >> Cheers,
    >>
    >> - Alf
    >>

    >
    > But if you do the math, you can derive similar equations for ssr.
    >
    > ssr = sum((y-b-m*x)^2)
    > = sum( y^2 + b^2 + m^2x^2 - 2*y*b - 2*m*x*y + 2*b*m*x )
    >
    > = sum(y^2) + b^2*sum(1) + m^2*sum(x^2) - 2*b*sum(y) - 2*m*sum(x*y) +
    > 2*b*m*sum(x)
    >
    > Remember, for THIS summation, intercept (call b above) and slope (called
    > m above) are constants and can be factored out of the loop.
    >
    > No promise I haven't made a dumb math error above, but the principle is
    > solid.


    Thanks a bunch! :)


    Cheers,

    - Alf

    Oh, why did that seem impossible to me, why didn't I think of it...
    Alf P. Steinbach, Oct 21, 2013
    #6
  7. Re: Can straight line best fit with error measure be made in a single pass?

    SG Guest

    On Saturday, October 19, 2013 3:57:36 AM UTC+2, Alf P. Steinbach wrote:
    > It would be nice if one could make do with a non-copyable forward
    > iterator for the (x,y) data points, without storing the points.


    Let me just mention that this extends to any linear least squares
    problem. The amount of memory you really need is just determined by
    the number of unknowns, not the number of "data points". Suppose

    A x = b

    is an overdetermined linear equation system with very few unknowns
    (and hence very few columns in A) but many equations (rows in A).
    Then, you can, for example, compute A^T A and A^T b of

    A^T A x = A^T b // Gauss' normal equation system

    using a single loop over all rows of A and b. A^T A will be a small
    square matrix with n^2 entries where n is the number of unknowns.
    This can then be solved with a Cholesky factorization, for example.

    Other approaches are possible, too. If you care about numerical
    accuracy you could "accumulate" a couple of rows and periodically
    reduce the whole system to O(n^2) memory via Householder
    transformations.

    Cheers!
    SG
    SG, Oct 23, 2013
    #7