# Can straight line est fit with error measure be made in single pass?

Discussion in 'C++' started by Alf P. Steinbach, Oct 19, 2013.

1. ### Alf P. Steinbach (Guest)

It would be nice if one could make do with a non-copyable forward
iterator for the (x,y) data points, without storing the points.

However, my current code requires the ability to iterate a second time,
from a copy of the start iterator.

Any suggestions for how to avoid the second iteration, or should I
perhaps just forget it (I only need this to check algorithm complexity
in unit test, so it's not a current problem, just "not ideal" code...)?

Code:
#pragma once
// Copyright (c) 2013 Alf P. Steinbach

// <url: http://terpconnect.umd.edu/~toh/spectrum/CurveFitting.html#MathDetails>
//  n = number of x,y data points
//  sumx = Σx
//  sumy = Σy
//  sumxy = Σx*y
//  sumx2 = Σx*x
//  meanx = sumx / n
//  meany = sumy / n
//  slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
//  intercept = meany-(slope*meanx)
//  ssy = Σ(y-meany)^2
//  ssr = Σ(y-intercept-slope*x)^2
//  R2 = 1-(ssr/ssy)
//  Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
//  Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
//
// Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>

#include <rfc/cppx/calc/Statistics.h>       // cppx::calc::Statistics
#include <cmath>                            // sqrt

namespace cppx{ namespace calc{

    class Best_line_fit
        : private Statistics
    {
    private:
        double intercept_;
        double slope_;
        double ssy_;            // Sum of squares of y-expressions
        double ssr_;            // Sum of squares of residual expressions

    public:
        auto stats() const -> Statistics const&     { return *this; }

        auto slope() const -> double        { return slope_; }
        auto intercept() const -> double    { return intercept_; }

        auto ssy() const -> double          { return ssy_; }
        auto ssr() const -> double          { return ssr_; }
        auto r_squared() const -> double
        { return (ssr_ == 0 || ssy_ == 0? 1.0 : 1.0 - ssr_/ssy_); }
        auto r() const -> double            { return sqrt( r_squared() ); }

        template< class It >
        Best_line_fit( It const first, It const end )
            : Statistics( first, end )
            , ssy_( 0.0 )
            , ssr_( 0.0 )
        {
            slope_ = (n()*sum_xy() - sum_x()*sum_y())
                / (n()*sum_xx() - square( sum_x() ));
            intercept_ = mean_y() - slope_*mean_x();

            for( It it = first; it != end; ++it )
            {
                double const x = x_of( *it );
                double const y = y_of( *it );

                //ssy_ += square( y - mean_y() );
                ssr_ += square( y - (slope_*x + intercept_) );
            }
            ssy_ = sum_yy() - 2*sum_y()*mean_y() + n()*square( mean_y() );
        }
    };

} }  // namespace cppx::calc

Disclaimer 1: I do not KNOW that the code is correct, in particular the
r_squared() implementation.

Disclaimer 2: In the mathworld reference I do not understand eq. 17.

Cheers,

- Alf
Alf P. Steinbach, Oct 19, 2013

2. ### Victor Bazarov (Guest)

Re: Can straight line est fit with error measure be made in single pass?

On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Looking at the code below, I cannot really answer this question without
knowing what 'Statistics' is. Presuming that it does iterate over the
given range to collect the "sum of" and "mean" and the other values,
your other choice is to fix the 'Statistics' class to either calculate
your 'ssr' to boot, or to make it store the x and y values so you can
extract them later at your leisure.

> However, my current code requires the ability to iterate a second time,
> from a copy of the start iterator.
>
> Any suggestions for how to avoid the second iteration, or should I
> perhaps just forget it (I only need this to check algorithm complexity
> in unit test, so it's not a current problem, just "not ideal" code...)?
>
> [code quoted in full - snipped]
>
> Disclaimer 1: I do not KNOW that the code is correct, in particular the
> r_squared() implementation.
>
> Disclaimer 2: In the mathworld reference I do not understand eq. 17.

It's the simplification of eq 16. X' (x with line over it) is the mean
x value, isn't it?

SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =

= SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =

= SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )

= SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'

Same with eq's 18 and 20.

V
--
Victor Bazarov, Oct 19, 2013

3. ### Alf P. Steinbach (Guest)

Re: Can straight line est fit with error measure be made in single pass?

On 19.10.2013 15:54, Victor Bazarov wrote:
> On 10/18/2013 9:57 PM, Alf P. Steinbach wrote:
>> It would be nice if one could make do with a non-copyable forward
>> iterator for the (x,y) data points, without storing the points.

>
> Looking at the code below, I cannot really answer this question without
> knowing what 'Statistics' is. Presuming that it does iterate over the
> given range to collect the "sum of" and "mean" and the other values,

Yes.

> your other choice is to fix the 'Statistics' class to either calculate

This would be preferable.

However, the 'ssr' value depends on quantities that themselves depend on
the whole collection of numbers, plus the original numbers again:

> // slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
> // intercept = meany-(slope*meanx)
> // ssr = Σ(y-intercept-slope*x)^2

I know there is a neat trick for updating a mean value incrementally, so
I thought maybe someone knew about something similar here?

It seems like a tradeoff math/C++: more of one to get less of the other.

> or to make it store the x and y values so you can
> extract them later at your leisure.

Yes.

[snip]
>> Disclaimer 1: I do not KNOW that the code is correct, in particular the
>> r_squared() implementation.
>>
>> Disclaimer 2: In the mathworld reference I do not understand eq. 17.

>
> It's the simplification of eq 16. X' (x with line over it) is the mean
> x value, isn't it?
>
> SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =
>
> = SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =
>
> = SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )
>
> = SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'
>
> Same with eq's 18 and 20.

Thanks!

I got some help with that already this morning, on Facebook, and yes I
felt pretty stupid... :-( Not to mention whether that feeling now
reflects reality. Which it may.

Anyway, for my unit testing this now works, using 40 000 time
measurement points I get an R goodness between 0.99 and 1 (perfect) for
linear complexity, and forcing O(n^2) behavior[1] I get an R around
0.98. Apparently values tightly up against 0.99 or thereabouts are
normal, i.e. apparently this is a very finely tuned goodness-of-fit
criterion. But again I do not know, and the texts I've found neglect to
say anything about which numerical values decide one way or other.

Cheers,

- Alf (still in uncharted territory)

Notes:
[1] By the simple expedient of allocating just the required size of the
internal buffer in the string, instead of using the doubling-buffer-size idea.
Alf P. Steinbach, Oct 19, 2013
4. ### Alf P. Steinbach (Guest)

Re: Can straight line est fit with error measure be made in single pass?

On 19.10.2013 20:57, Richard Damon wrote:
> On 10/18/13 9:57 PM, Alf P. Steinbach wrote:
>>

> ...
>> // ssy = Σ(y-meany)^2
>> // ssr = Σ(y-intercept-slope*x)^2
>> // R2 = 1-(ssr/ssy)
>> // Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
>> // Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
>> //
>> // Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>

> ...
>> - Alf

>
> Sounds like you just need to precompute any needed terms to compute ssy
> and ssr.
>
> For example, ssy = sum( (y-my)^2 ) =
> sum( y^2 - 2*my*y + my^2)
>
> since my is a constant over the sum =
>
> sum(y^2) - 2*my*sum(y) + my^2*sum(1)
>
> so you need to have computed sum(y^2) and sum(y)
>
>
> Do the same math for ssr, and Bob's your uncle, you have your solution.
>
> First glance, ssr will need sums of y^2, y, xy, x, x^2, 1
>

Thanks, but the current (presented) code already does this for ssy, and
ssr is the problem, not easily updated.

Cheers,

- Alf
Alf P. Steinbach, Oct 20, 2013
5. ### Alf P. Steinbach (Guest)

Re: Can straight line est fit with error measure be made in single pass?

On 19.10.2013 17:29, Paavo Helde wrote:
> "Alf P. Steinbach" <> wrote in
> news:l3souu$4j1$:
>
>> It would be nice if one could make do with a non-copyable forward
>> iterator for the (x,y) data points, without storing the points.
>>
>> However, my current code requires the ability to iterate a second
>> time, from a copy of the start iterator.
>>
>> Any suggestions for how to avoid the second iteration, or should I
>> perhaps just forget it (I only need this to check algorithm complexity
>> in unit test, so it's not a current problem, just "not ideal"
>> code...)?

>
> Depends on what you mean by ideal? Code simplicity, performance, accuracy,
> reducing the number of passes? These goals might be in conflict with each
> other. For example std. deviation and variance can be easily calculated by
> one-pass algorithm. However, a naive and simple implementation produces big
> round-off errors; there is a numerically stable one-pass algorithm from
> Knuth, but it involves a division operation inside loop, so might be slow.
> There is a stable two-pass algorithm that can actually be faster, but well,
> it has two passes.

Well, unfortunately my Mr. Knuth (the first three volumes; I never got
around to obtaining the TeX works or the current fourth-volume material)
resides at the bottom of some unidentified, anonymous-looking box, among
numerous other unidentified anonymous-looking boxes.

But thanks anyway, it really helps to have some background.

> This was about variance; for linear regression it seems to be yet more
> complicated. First google search turns up
>
> http://stats.stackexchange.com/ques...g-running-linear-or-logistic-regression-param

Hm, among the answers there, the last one (with 0 votes) seemed the most
promising.

Still, the upshot seems to be "can't be done".

Cheers,

- Alf

PS: Donald Knuth did most of the first volume work while he was (doing
postdoc work?) at Norsk Regnesentral in Oslo. I can't help but connect
that somehow to the Simula OO work being done at that time at the
University of Oslo. Even though I never read about any connection. I
once interviewed there, and mentioned this, just to show I'd done my bit
Alf P. Steinbach, Oct 20, 2013
6. ### Alf P. Steinbach (Guest)

Re: Can straight line est fit with error measure be made in single pass?

On 21.10.2013 02:48, Richard Damon wrote:
> On 10/20/13 1:16 AM, Alf P. Steinbach wrote:
>> On 19.10.2013 20:57, Richard Damon wrote:
>>> [earlier quoted code comments and ssy derivation - snipped]
>>
>> Thanks, but the current (presented) code already does this for ssy, and
>> ssr is the problem, not easily updated.
>>
>> Cheers,
>>
>> - Alf
>>

>
> But if you do the math, you can derive similar equations for ssr.
>
> ssr = sum((y-b-m*x)^2)
> = sum( y^2 + b^2 + m^2*x^2 - 2*y*b - 2*m*x*y + 2*b*m*x )
>
> = sum(y^2) + b^2*sum(1) + m^2*sum(x^2) - 2*b*sum(y) - 2*m*sum(x*y) +
> 2*b*m*sum(x)
>
> Remember, for THIS summation, intercept (call b above) and slope (called
> m above) are constants and can be factored out of the loop.
>
> No promise I haven't made a dumb math error above, but the principle is
> solid.

Thanks a bunch!

Cheers,

- Alf

Oh, why did that seem impossible to me, why didn't I think of it...
Alf P. Steinbach, Oct 21, 2013
7. ### SG (Guest)

On Saturday, October 19, 2013 3:57:36 AM UTC+2, Alf P. Steinbach wrote:
> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Let me just mention that this extends to any linear least squares
problem. The amount of memory you really need is just determined by
the number of unknowns, not the number of "data points". Suppose

A x = b

is an overdetermined linear equation system with very few unknowns
(and hence very few columns in A) but many equations (rows in A).
Then, you can, for example, compute A^T A and A^T b of

A^T A x = A^T b // Gauss' normal equation system

using a single loop over all rows of A and b. A^T A will be a small
square matrix with n^2 entries where n is the number of unknowns.
This can then be solved with a Cholesky factorization, for example.

Other approaches are possible, too. If you care about numerical
accuracy you could "accumulate" a couple of rows and periodically
reduce the whole system to O(n^2) memory via Householder
transformations.

Cheers!
SG
SG, Oct 23, 2013