Numpy outlier removal

Discussion in 'Python' started by Joseph L. Casale, Jan 6, 2013.

  1. I have a dataset that consists of a dict with text descriptions and values that are integers. If
    required, I collect the values into a list and create a numpy array, running it through a simple
    routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
    to include.
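
    (For context, here is that masking routine as a minimal runnable sketch; the sample values are
    made up for illustration:)

    import numpy as np
    from numpy import mean, std

    data = np.array([9, 10, 11, 10, 9, 11, 10, 10, 9, 50])   # made-up values; 50 is the outlier
    m = 2                                                     # number of std deviations to keep
    kept = data[abs(data - mean(data)) < m * std(data)]      # boolean mask drops the outlier
    print(kept)                                               # [ 9 10 11 10  9 11 10 10  9]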


    The problem is I lose track of which values were removed, so the original display of the dataset
    is misleading when the processed average is returned, as it still includes the removed key/values.


    Anyone know how I can maintain the relationship so that when I exclude a value, I also remove it
    from the dict?

    Thanks!
    jlc
    Joseph L. Casale, Jan 6, 2013
    #1

  2. Joseph L. Casale

    Hans Mulder Guest

    On 6/01/13 20:44:08, Joseph L. Casale wrote:
    > I have a dataset that consists of a dict with text descriptions and values that are integers. If
    > required, I collect the values into a list and create a numpy array running it through a simple
    > routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
    > to include.
    >
    >
    > The problem is I lose track of which values were removed, so the original display of the dataset
    > is misleading when the processed average is returned, as it still includes the removed key/values.
    >
    >
    > Anyone know how I can maintain the relationship so that when I exclude a value, I also remove it
    > from the dict?


    Assuming your data and the dictionary are keyed by a common set of keys:

    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            del data[key]
            del descriptions[key]


    Hope this helps,

    -- HansM
    Hans Mulder, Jan 6, 2013
    #2

  3. >Assuming your data and the dictionary are keyed by a common set of keys: 

    >
    >for key in descriptions:
    >    if abs(data[key] - mean(data)) >= m * std(data):
    >        del data[key]
    >        del descriptions[key]



    Heh, yeah sometimes the obvious is too simple to see. I used a dict comp to rebuild
    the results with the comparison, along the lines of the sketch below.
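
    (A minimal sketch of that kind of dict-comprehension rebuild, assuming the data lives in a plain
    dict called dataset; the names here are illustrative, not from the original post:)

    import numpy as np

    def drop_outliers(dataset, m=2):
        # Return a new dict keeping only entries within m std deviations of the mean.
        # "dataset" and "drop_outliers" are illustrative names, not from the thread.
        values = np.array(list(dataset.values()), dtype=float)
        centre, spread = values.mean(), values.std()
        return {key: value for key, value in dataset.items()
                if abs(value - centre) < m * spread}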


    Thanks!
    jlc
    Joseph L. Casale, Jan 6, 2013
    #3
  4. Joseph L. Casale

    MRAB Guest

    On 2013-01-06 22:33, Hans Mulder wrote:
    > On 6/01/13 20:44:08, Joseph L. Casale wrote:
    >> I have a dataset that consists of a dict with text descriptions and values that are integers. If
    >> required, I collect the values into a list and create a numpy array running it through a simple
    >> routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
    >> to include.
    >>
    >>
    >> The problem is I lose track of which values were removed, so the original display of the dataset
    >> is misleading when the processed average is returned, as it still includes the removed key/values.
    >>
    >>
    >> Anyone know how I can maintain the relationship so that when I exclude a value, I also remove it
    >> from the dict?

    >
    > Assuming your data and the dictionary are keyed by a common set of keys:
    >
    > for key in descriptions:
    >     if abs(data[key] - mean(data)) >= m * std(data):
    >         del data[key]
    >         del descriptions[key]
    >

    It's generally a bad idea to modify a collection over which you're
    iterating. It's better to, say, make a list of what you're going to
    delete and then iterate over that list to make the deletions:

    deletions = []

    for key in descriptions:
        if abs(data[key] - mean(data)) >= m * std(data):
            deletions.append(key)

    for key in deletions:
        del data[key]
        del descriptions[key]
    MRAB, Jan 6, 2013
    #4
  5. On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:

    > I have a dataset that consists of a dict with text descriptions and
    > values that are integers. If required, I collect the values into a list
    > and create a numpy array running it through a simple routine: 
    >
    > data[abs(data - mean(data)) < m * std(data)]
    >
    > where m is the number of std deviations to include.


    I'm not sure that this approach is statistically robust. No, let me be
    even more assertive: I'm sure that this approach is NOT statistically
    robust, and may be scientifically dubious.

    The above assumes your data is normally distributed. How sure are you
    that this is actually the case?

    For normally distributed data:

    Since both the mean and std calculations are affected by the presence of
    outliers, your test for what counts as an outlier can miss real outliers even
    in data from a normal distribution. For small N (sample size), it may be
    mathematically impossible for any data point to be greater than m*SD from
    the mean. For example, with N=5, no data point can be more than 1.789*SD
    from the mean. So for N=5, m=1 may throw away good data, and m=2 will
    fail to find any outliers no matter how outrageous they are.

    For large N, you will expect to find significant numbers of data points
    more than m*SD from the mean. With N=100000, and m=3, you will expect to
    throw away 270 perfectly good data points simply because they are out on
    the tails of the distribution.
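
    (A quick numpy check of both claims, offered as an illustrative sketch rather than part of the
    original post:)

    import numpy as np

    # Small N: no point can lie more than (N - 1) / sqrt(N) population SDs from the mean.
    n = 5
    print((n - 1) / np.sqrt(n))     # ~1.789, so m=2 can never flag anything at N=5

    # Large N: a normal sample has ~0.27% of its points beyond 3 SD by chance alone.
    np.random.seed(0)
    data = np.random.normal(size=100000)
    flagged = np.abs(data - data.mean()) > 3 * data.std()
    print(flagged.sum())            # roughly 270 perfectly good points get flagged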

    Worse, if the data is not in fact from a normal distribution, all bets
    are off. You may be keeping obvious outliers; or more often, your test
    will be throwing away perfectly good data that it misidentifies as
    outliers.

    In other words: this approach for detecting outliers is nothing more than
    a very rough, and very bad, heuristic, and should be avoided.

    Identifying outliers is fraught with problems even for experts. For
    example, the ozone hole over the Antarctic was ignored for many years
    because the software being used to analyse it misidentified the data as
    outliers.

    The best general advice I have seen is:

    Never automatically remove outliers except for values that are physically
    impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
    unless you have good, solid, physical reasons for justifying removal of
    outliers. Other than that, manually remove outliers with care, or not at
    all, and if you do so, always report your results twice, once with all
    the data, and once with supposed outliers removed.

    You can read up more about outlier detection, and the difficulties
    thereof, here:

    http://www.medcalc.org/manual/outliers.php

    https://secure.graphpad.com/guides/prism/6/statistics/index.htm

    http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

    http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations



    --
    Steven
    Steven D'Aprano, Jan 7, 2013
    #5
  6. > In other words: this approach for detecting outliers is nothing more than 

    > a very rough, and very bad, heuristic, and should be avoided.


    Heh, very true but the results will only be used for conversational purposes.
    I am making an assumption that the data is normally distributed and I do expect
    valid results to all be very nearly the same.

    > You can read up more about outlier detection, and the difficulties 
    > thereof, here:



    I much appreciate the links and the thought in the post. I'll admit I didn't
    realize outlier detection was so involved.


    Again, thanks!
    jlc
    Joseph L. Casale, Jan 7, 2013
    #6
  7. Joseph L. Casale

    Paul Simon Guest

    "Steven D'Aprano" <> wrote in message
    news:50ea28e7$0$30003$c3e8da3$...
    > On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
    >
    >> I have a dataset that consists of a dict with text descriptions and
    >> values that are integers. If required, I collect the values into a list
    >> and create a numpy array running it through a simple routine:
    >>
    >> data[abs(data - mean(data)) < m * std(data)]
    >>
    >> where m is the number of std deviations to include.

    >
    > I'm not sure that this approach is statistically robust. No, let me be
    > even more assertive: I'm sure that this approach is NOT statistically
    > robust, and may be scientifically dubious.
    >
    > The above assumes your data is normally distributed. How sure are you
    > that this is actually the case?
    >
    > For normally distributed data:
    >
    > Since both the mean and std calculations are affected by the presence of
    > outliers, your test for what counts as an outlier can miss real outliers even
    > in data from a normal distribution. For small N (sample size), it may be
    > mathematically impossible for any data point to be greater than m*SD from
    > the mean. For example, with N=5, no data point can be more than 1.789*SD
    > from the mean. So for N=5, m=1 may throw away good data, and m=2 will
    > fail to find any outliers no matter how outrageous they are.
    >
    > For large N, you will expect to find significant numbers of data points
    > more than m*SD from the mean. With N=100000, and m=3, you will expect to
    > throw away 270 perfectly good data points simply because they are out on
    > the tails of the distribution.
    >
    > Worse, if the data is not in fact from a normal distribution, all bets
    > are off. You may be keeping obvious outliers; or more often, your test
    > will be throwing away perfectly good data that it misidentifies as
    > outliers.
    >
    > In other words: this approach for detecting outliers is nothing more than
    > a very rough, and very bad, heuristic, and should be avoided.
    >
    > Identifying outliers is fraught with problems even for experts. For
    > example, the ozone hole over the Antarctic was ignored for many years
    > because the software being used to analyse it misidentified the data as
    > outliers.
    >
    > The best general advice I have seen is:
    >
    > Never automatically remove outliers except for values that are physically
    > impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
    > unless you have good, solid, physical reasons for justifying removal of
    > outliers. Other than that, manually remove outliers with care, or not at
    > all, and if you do so, always report your results twice, once with all
    > the data, and once with supposed outliers removed.
    >
    > You can read up more about outlier detection, and the difficulties
    > thereof, here:
    >
    > http://www.medcalc.org/manual/outliers.php
    >
    > https://secure.graphpad.com/guides/prism/6/statistics/index.htm
    >
    > http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
    >
    > http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
    >
    >
    >
    > --
    > Steven

    If you suspect that the data may not be normal, you might look at exploratory
    data analysis; see Tukey. It's descriptive rather than analytic, treats
    outliers respectfully, uses the median rather than the mean, and is very visual.
    Whenever I analyzed data both ways, Gaussian and with EDA, EDA always won.
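
    (In that median-based spirit, a robust outlier flag could look like the sketch below; the 0.6745
    scale factor and the 3.5 cutoff are conventional rule-of-thumb choices, not anything from this
    thread:)

    import numpy as np

    def mad_outliers(values, threshold=3.5):
        # Flag outliers using the median and the median absolute deviation (MAD),
        # a robust alternative to the mean/std test. Names and cutoff are illustrative.
        values = np.asarray(values, dtype=float)
        centre = np.median(values)
        mad = np.median(np.abs(values - centre))
        if mad == 0:
            return np.zeros(values.shape, dtype=bool)
        modified_z = 0.6745 * (values - centre) / mad
        return np.abs(modified_z) > threshold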

    Paul
    Paul Simon, Jan 7, 2013
    #7
  8. On 7 January 2013 01:46, Steven D'Aprano
    <> wrote:
    > On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
    >
    >> I have a dataset that consists of a dict with text descriptions and
    >> values that are integers. If required, I collect the values into a list
    >> and create a numpy array running it through a simple routine:
    >>
    >> data[abs(data - mean(data)) < m * std(data)]
    >>
    >> where m is the number of std deviations to include.

    >
    > I'm not sure that this approach is statistically robust. No, let me be
    > even more assertive: I'm sure that this approach is NOT statistically
    > robust, and may be scientifically dubious.


    Whether or not this is "statistically robust" requires more
    explanation about the OP's intention. Thus far, the OP has not given
    any reason/motivation for excluding data or even for having any data
    in the first place! It's hard to say whether any technique applied is
    really accurate/robust without knowing *anything* about the purpose of
    the operation.


    Oscar
    Oscar Benjamin, Jan 7, 2013
    #8
  9. On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:

    > On 7 January 2013 01:46, Steven D'Aprano
    > <> wrote:
    >> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
    >>
    >>> I have a dataset that consists of a dict with text descriptions and
    >>> values that are integers. If required, I collect the values into a
    >>> list and create a numpy array running it through a simple routine:
    >>>
    >>> data[abs(data - mean(data)) < m * std(data)]
    >>>
    >>> where m is the number of std deviations to include.

    >>
    >> I'm not sure that this approach is statistically robust. No, let me be
    >> even more assertive: I'm sure that this approach is NOT statistically
    >> robust, and may be scientifically dubious.

    >
    > Whether or not this is "statistically robust" requires more explanation
    > about the OP's intention.


    Not really. Statistical robustness is objectively defined, and the user's
    intention doesn't come into it. The mean is not a robust measure of
    central tendency, the median is, regardless of why you pick one or the
    other.

    There are sometimes good reasons for choosing non-robust statistics or
    techniques over robust ones, but some techniques are so dodgy that there
    is *never* a good reason for doing so. E.g. finding the line of best fit
    by eye, or taking more and more samples until you get a statistically
    significant result. Such techniques are not just non-robust in the
    statistical sense, but non-robust in the general sense, if not outright
    deceitful.



    --
    Steven
    Steven D'Aprano, Jan 7, 2013
    #9
  10. On 7 January 2013 05:11, Steven D'Aprano
    <> wrote:
    > On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
    >
    >> On 7 January 2013 01:46, Steven D'Aprano
    >> <> wrote:
    >>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
    >>>
    >>> I'm not sure that this approach is statistically robust. No, let me be
    >>> even more assertive: I'm sure that this approach is NOT statistically
    >>> robust, and may be scientifically dubious.

    >>
    >> Whether or not this is "statistically robust" requires more explanation
    >> about the OP's intention.

    >
    > Not really. Statistical robustness is objectively defined, and the user's
    > intention doesn't come into it. The mean is not a robust measure of
    > central tendency, the median is, regardless of why you pick one or the
    > other.


    Okay, I see what you mean. I wasn't thinking of robustness as a
    technical term but now I see that you are correct.

    Perhaps what I should have said is that whether or not this matters
    depends on the problem at hand (hopefully this isn't an important
    medical trial) and the particular type of data that you have; assuming
    normality is fine in many cases even if the data is not "really"
    normal.

    >
    > There are sometimes good reasons for choosing non-robust statistics or
    > techniques over robust ones, but some techniques are so dodgy that there
    > is *never* a good reason for doing so. E.g. finding the line of best fit
    > by eye, or taking more and more samples until you get a statistically
    > significant result. Such techniques are not just non-robust in the
    > statistical sense, but non-robust in the general sense, if not outright
    > deceitful.


    There are sometimes good reasons to get a line of best fit by eye. In
    particular if your data contains clusters that are hard to separate,
    sometimes it's useful to just pick out roughly where you think a line
    through a subset of the data is.


    Oscar
    Oscar Benjamin, Jan 7, 2013
    #10
  11. Joseph L. Casale

    Robert Kern Guest

    On 07/01/2013 15:20, Oscar Benjamin wrote:
    > On 7 January 2013 05:11, Steven D'Aprano
    > <> wrote:
    >> On Mon, 07 Jan 2013 02:29:27 +0000, Oscar Benjamin wrote:
    >>
    >>> On 7 January 2013 01:46, Steven D'Aprano
    >>> <> wrote:
    >>>> On Sun, 06 Jan 2013 19:44:08 +0000, Joseph L. Casale wrote:
    >>>>
    >>>> I'm not sure that this approach is statistically robust. No, let me be
    >>>> even more assertive: I'm sure that this approach is NOT statistically
    >>>> robust, and may be scientifically dubious.
    >>>
    >>> Whether or not this is "statistically robust" requires more explanation
    >>> about the OP's intention.

    >>
    >> Not really. Statistical robustness is objectively defined, and the user's
    >> intention doesn't come into it. The mean is not a robust measure of
    >> central tendency, the median is, regardless of why you pick one or the
    >> other.

    >
    > Okay, I see what you mean. I wasn't thinking of robustness as a
    > technical term but now I see that you are correct.
    >
    > Perhaps what I should have said is that whether or not this matters
    > depends on the problem at hand (hopefully this isn't an important
    > medical trial) and the particular type of data that you have; assuming
    > normality is fine in many cases even if the data is not "really"
    > normal.


    "Having outliers" literally means that assuming normality is not fine. If
    assuming normality were fine, then you wouldn't need to remove outliers.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Jan 7, 2013
    #11
  12. [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:

    > There are sometimes good reasons to get a line of best fit by eye. In
    > particular if your data contains clusters that are hard to separate,
    > sometimes it's useful to just pick out roughly where you think a line
    > through a subset of the data is.


    Cherry picking subsets of your data as well as line fitting by eye? Two
    wrongs do not make a right.

    If you're going to just invent a line based on where you think it should
    be, what do you need the data for? Just declare "this is the line I wish
    to believe in" and save yourself the time and energy of collecting the
    data in the first place. Your conclusion will be no less valid.

    How do you distinguish between "data contains clusters that are hard to
    separate" from "data doesn't fit a line at all"?

    Even if the data actually is linear, on what basis could we distinguish
    between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
    by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
    subjective judgement can be equally denied on the basis of subjective
    judgement.

    Anyone can fool themselves into placing a line through a subset of non-
    linear data. Or, sadly more often, *deliberately* cherry picking fake
    clusters in order to fool others. Here is a real world example of what
    happens when people pick out the data clusters that they like based on
    visual inspection:

    http://www.skepticalscience.com/images/TempEscalator.gif

    And not linear by any means, but related to the cherry picking theme:

    http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif


    To put it another way, when we fit patterns to data by eye, we can easily
    fool ourselves into seeing patterns that aren't there, or missing the
    patterns which are there. At best line fitting by eye is prone to honest
    errors; at worst, it is open to the most deliberate abuse. We have eyes
    and brains that evolved to spot the ripe fruit in trees, not to spot
    linear trends in noisy data, and fitting by eye is not safe or
    appropriate.


    --
    Steven
    Steven D'Aprano, Jan 7, 2013
    #12
  13. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
    <> wrote:
    > Anyone can fool themselves into placing a line through a subset of non-
    > linear data. Or, sadly more often, *deliberately* cherry picking fake
    > clusters in order to fool others. Here is a real world example of what
    > happens when people pick out the data clusters that they like based on
    > visual inspection:
    >
    > http://www.skepticalscience.com/images/TempEscalator.gif


    And sensible people will notice that, even drawn like that, it's only
    a ~0.6 deg increase across ~30 years. Hardly statistically
    significant, given that weather patterns have been known to follow
    cycles at least that long. But that's nothing to do with drawing lines
    through points, and more to do with how much data you collect before
    you announce a conclusion, and how easily a graph can prove any point
    you like.

    Statistical analysis is a huge science. So is lying. And I'm not sure
    most people can pick one from the other.

    ChrisA
    Chris Angelico, Jan 7, 2013
    #13
  14. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On 7 January 2013 17:58, Steven D'Aprano
    <> wrote:
    > On Mon, 07 Jan 2013 15:20:57 +0000, Oscar Benjamin wrote:
    >
    >> There are sometimes good reasons to get a line of best fit by eye. In
    >> particular if your data contains clusters that are hard to separate,
    >> sometimes it's useful to just pick out roughly where you think a line
    >> through a subset of the data is.

    >
    > Cherry picking subsets of your data as well as line fitting by eye? Two
    > wrongs do not make a right.


    It depends on what you're doing, though. I wouldn't use an eyeball fit
    to get numbers that were an important part of the conclusion of some
    or other study. I would very often use it while I'm just in the
    process of trying to understand something.

    > If you're going to just invent a line based on where you think it should
    > be, what do you need the data for? Just declare "this is the line I wish
    > to believe in" and save yourself the time and energy of collecting the
    > data in the first place. Your conclusion will be no less valid.


    An example: Earlier today I was looking at some experimental data. A
    simple model of the process underlying the experiment suggests that
    two variables x and y will vary in direct proportion to one another
    and the data broadly reflects this. However, at this stage there is
    some non-normal variability in the data, caused by experimental
    difficulties. A subset of the data appears to closely follow a well
    defined linear pattern but there are outliers and the pattern breaks
    down in an asymmetric way at larger x and y values. At some later time
    either the sources of experimental variation will be reduced, or they
    will be better understood but for now it is still useful to estimate
    the constant of proportionality in order to check whether it seems
    consistent with the observed values of z. With this particular dataset
    I would have wasted a lot of time if I had tried to find a
    computational method to match the line that to me was very visible so
    I chose the line visually.

    >
    > How do you distinguish between "data contains clusters that are hard to
    > separate" from "data doesn't fit a line at all"?
    >


    In the example I gave it isn't possible to make that distinction with
    the currently available data. That doesn't make it meaningless to try
    and estimate the parameters of the relationship between the variables
    using the preliminary data.

    > Even if the data actually is linear, on what basis could we distinguish
    > between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
    > by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
    > subjective judgement can be equally denied on the basis of subjective
    > judgement.


    It gets a bit easier if the line is constrained to go through the
    origin. You seem to be thinking that the important thing is proving
    that the line is "real", rather than identifying where it is. Both
    things are important but not necessarily in the same problem. In my
    example, the "real line" may not be straight and may not go through
    the origin, but it is definitely there and if there were no
    experimental problems then the data would all be very close to it.

    > Anyone can fool themselves into placing a line through a subset of non-
    > linear data. Or, sadly more often, *deliberately* cherry picking fake
    > clusters in order to fool others. Here is a real world example of what
    > happens when people pick out the data clusters that they like based on
    > visual inspection:
    >
    > http://www.skepticalscience.com/images/TempEscalator.gif
    >
    > And not linear by any means, but related to the cherry picking theme:
    >
    > http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif
    >
    >
    > To put it another way, when we fit patterns to data by eye, we can easily
    > fool ourselves into seeing patterns that aren't there, or missing the
    > patterns which are there. At best line fitting by eye is prone to honest
    > errors; at worst, it is open to the most deliberate abuse. We have eyes
    > and brains that evolved to spot the ripe fruit in trees, not to spot
    > linear trends in noisy data, and fitting by eye is not safe or
    > appropriate.


    This is all true. But the human brain is also in many ways much better
    than a typical computer program at recognising patterns in data when
    the data can be depicted visually. I would very rarely attempt to
    analyse data without representing it in some visual form. I also think
    it would be highly foolish to go so far with refusing to eyeball data
    that you would accept the output of some regression algorithm even
    when it clearly looks wrong.


    Oscar
    Oscar Benjamin, Jan 7, 2013
    #14
  15. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:

    > An example: Earlier today I was looking at some experimental data. A
    > simple model of the process underlying the experiment suggests that two
    > variables x and y will vary in direct proportion to one another and the
    > data broadly reflects this. However, at this stage there is some
    > non-normal variability in the data, caused by experimental difficulties.
    > A subset of the data appears to closely follow a well defined linear
    > pattern but there are outliers and the pattern breaks down in an
    > asymmetric way at larger x and y values. At some later time either the
    > sources of experimental variation will be reduced, or they will be
    > better understood but for now it is still useful to estimate the
    > constant of proportionality in order to check whether it seems
    > consistent with the observed values of z. With this particular dataset I
    > would have wasted a lot of time if I had tried to find a computational
    > method to match the line that to me was very visible so I chose the line
    > visually.



    If you mean:

    "I looked at the data, identified that the range a < x < b looks linear
    and the range x > b does not, then used least squares (or some other
    recognised, objective technique for fitting a line) to the data in that
    linear range"

    then I'm completely cool with that. That's fine, with the understanding
    that this is the first step in either fixing your measurement problems,
    fixing your model, or at least avoiding extrapolation into the non-linear
    range.

    But that is not fitting a line by eye, which is what I am talking about.

    If on the other hand you mean:

    "I looked at the data, identified that the range a < x < b looked linear,
    so I laid a ruler down over the graph and pushed it around until I was
    satisfied that the ruler looked more or less like it fitted the data
    points, according to my guess of what counts as a close fit"

    that *is* fitting a line by eye, and it is entirely subjective and
    extremely dodgy for anything beyond quick and dirty back of the envelope
    calculations[1]. That's okay if all you want is to get something within
    an order of magnitude or so, or a line roughly pointing in the right
    direction, but that's all.
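
    (For concreteness, a minimal sketch of that first, objective kind of fit over an identified
    linear range; the data and the bounds a and b are invented for illustration:)

    import numpy as np

    np.random.seed(1)
    x = np.linspace(0, 15, 200)
    y = 2.0 * x + np.random.normal(scale=0.5, size=x.size)
    y[x > 10] += (x[x > 10] - 10) ** 2                   # pattern breaks down at larger x

    a, b = 0.0, 10.0                                     # range judged linear by inspection
    mask = (x > a) & (x < b)
    slope, intercept = np.polyfit(x[mask], y[mask], 1)   # objective least-squares fit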


    [...]
    > I also think it would
    > be highly foolish to go so far with refusing to eyeball data that you
    > would accept the output of some regression algorithm even when it
    > clearly looks wrong.


    I never said anything of the sort.

    I said, don't fit lines to data by eye. I didn't say not to sanity check
    that your straight line fit is reasonable by eyeballing it.



    [1] Or if your data is so accurate and noise-free that you hardly have to
    care about errors, since there clearly is one and only one straight line
    that passes through all the points.


    --
    Steven
    Steven D'Aprano, Jan 8, 2013
    #15
  16. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On Tue, 08 Jan 2013 06:43:46 +1100, Chris Angelico wrote:

    > On Tue, Jan 8, 2013 at 4:58 AM, Steven D'Aprano
    > <> wrote:
    >> Anyone can fool themselves into placing a line through a subset of non-
    >> linear data. Or, sadly more often, *deliberately* cherry picking fake
    >> clusters in order to fool others. Here is a real world example of what
    >> happens when people pick out the data clusters that they like based on
    >> visual inspection:
    >>
    >> http://www.skepticalscience.com/images/TempEscalator.gif

    >
    > And sensible people will notice that, even drawn like that, it's only a
    > ~0.6 deg increase across ~30 years. Hardly statistically significant,


    Well, I don't know about "sensible people", but magnitude of an effect
    has little to do with whether or not something is statistically
    significant. Given noisy data, statistical significance relates to
    whether or not we can be confident that the effect is *real*, not whether
    it is a big effect or a small effect.

    Here's an example: assume that you are on a fixed salary with a constant
    weekly income. If you happen to win the lottery one day, and consequently
    your income for that week quadruples, that is a large effect that fails
    to have any statistical significance -- it's a blip, not part of any long-
    term change in income. You can't conclude that you'll win the lottery
    every week from now on.

    On the other hand, if the government changes the rules relating to tax,
    deductions, etc., even by a small amount, your weekly income might go
    down, or up, by a single dollar. Even though that is a tiny effect, it is
    *not* a blip, and will be statistically significant. In practice, it
    takes a certain number of data points to reach that confidence level.
    Your accountant, who knows the tax laws, will conclude that the change is
    real immediately, but a statistician who sees only the pay slips may take
    some months before she is convinced that the change is signal rather than
    noise. With only three weeks pay slips in hand, the statistician cannot
    be sure that the difference is not just some accounting error or other
    fluke, but each additional data point increases the confidence that the
    difference is real and not just some temporary aberration.

    The other meaning of "significant" has nothing to do with statistics, and
    everything to do with "a difference is only a difference if it makes a
    difference". 0.2° per decade doesn't sound like much, not when we
    consider daily or yearly temperatures that typically have a range of tens
    of degrees between night and day, or winter and summer. But that is
    misunderstanding the nature of long-term climate versus daily weather and
    glossing over the fact that we're only talking about an average and
    ignoring changes to the variability of the climate: a small increase in
    average can lead to a large increase in extreme events.


    > given that weather patterns have been known to follow cycles at least
    > that long.


    That is not a given. "Weather patterns" don't last for thirty years.
    Perhaps you are talking about climate patterns? In which case, well, yes,
    we can see a very strong climate pattern of warming on a time scale of
    decades, with no evidence that it is a cycle.

    There are, of course, many climate cycles that take place on a time frame
    of years or decades, such as the North Atlantic Oscillation and the El
    Nino Southern Oscillation. None of them are global, and as far as I know
    none of them are exactly periodic. They are noise in the system, and
    certainly not responsible for linear trends.



    --
    Steven
    Steven D'Aprano, Jan 8, 2013
    #16
  17. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
    <> wrote:
    >> given that weather patterns have been known to follow cycles at least
    >> that long.

    >
    > That is not a given. "Weather patterns" don't last for thirty years.
    > Perhaps you are talking about climate patterns?


    Yes, that's what I meant. In any case, debate about global warming is
    quite tangential to the point about statistical validity; it looks
    quite significant to show a line going from the bottom of the graph to
    the top, but sounds a lot less noteworthy when you see it as a
    half-degree increase on about (I think?) 30 degrees, and even less
    when you measure temperatures in absolute scale (Kelvin) and it's half
    a degree in three hundred. Those are principles worth considering,
    regardless of the subject matter. If your railway tracks have widened
    by a full eight millimeters due to increased pounding from heavier
    vehicles travelling over them, that's significant and dangerous on
    HO-scale model trains, but utterly insignificant on 5'3" gauge.

    ChrisA
    Chris Angelico, Jan 8, 2013
    #17
  18. Joseph L. Casale

    Terry Reedy Guest

    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On 1/7/2013 8:23 PM, Steven D'Aprano wrote:
    > On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
    >
    >> An example: Earlier today I was looking at some experimental data. A
    >> simple model of the process underlying the experiment suggests that two
    >> variables x and y will vary in direct proportion to one another and the
    >> data broadly reflects this. However, at this stage there is some
    >> non-normal variability in the data, caused by experimental difficulties.
    >> A subset of the data appears to closely follow a well defined linear
    >> pattern but there are outliers and the pattern breaks down in an
    >> asymmetric way at larger x and y values. At some later time either the
    >> sources of experimental variation will be reduced, or they will be
    >> better understood but for now it is still useful to estimate the
    >> constant of proportionality in order to check whether it seems
    >> consistent with the observed values of z. With this particular dataset I
    >> would have wasted a lot of time if I had tried to find a computational
    >> method to match the line that to me was very visible so I chose the line
    >> visually.

    >
    >
    > If you mean:
    >
    > "I looked at the data, identified that the range a < x < b looks linear
    > and the range x > b does not, then used least squares (or some other
    > recognised, objective technique for fitting a line) to the data in that
    > linear range"
    >
    > then I'm completely cool with that.


    If both x and y are measured values, then regressing x on y and y on x
    will give different answers and both will be wrong in that *neither*
    will be the best answer for the relationship between them. Oscar did not
    specify whether either was an experimentally set input variable.

    > But that is not fitting a line by eye, which is what I am talking about.


    With the line constrained to go through 0,0, a line eyeballed with a
    clear ruler could easily be better than either regression line, as a
    human will tend to minimize the deviations *perpendicular to the line*,
    which is the proper thing to do (assuming both variables are measured in
    the same units).
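
    (A minimal numpy sketch of that perpendicular, through-the-origin fit, i.e. total least squares
    via the SVD; this illustrates the idea and is not code from the thread:)

    import numpy as np

    def tls_slope_through_origin(x, y):
        # Slope k of the line y = k*x that minimizes perpendicular (orthogonal)
        # distances to the points: the top right singular vector of the uncentred
        # data matrix points along that best-fit direction.
        pts = np.column_stack([np.asarray(x, float), np.asarray(y, float)])
        _, _, vt = np.linalg.svd(pts, full_matrices=False)
        direction = vt[0]
        return direction[1] / direction[0]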

    --
    Terry Jan Reedy
    Terry Reedy, Jan 8, 2013
    #18
  19. Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On 8 January 2013 01:23, Steven D'Aprano
    <> wrote:
    > On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:
    >
    > [...]
    >> I also think it would
    >> be highly foolish to go so far with refusing to eyeball data that you
    >> would accept the output of some regression algorithm even when it
    >> clearly looks wrong.

    >
    > I never said anything of the sort.
    >
    > I said, don't fit lines to data by eye. I didn't say not to sanity check
    > your straight line fit is reasonable by eyeballing it.


    I should have been a little clearer. That was the situation when I
    decided to just use a (digital) ruler - although really it was more of
    a visual bisection (1, 2, 1.5, 1.25...). The regression result was
    clearly wrong (and also invalid for the reasons Terry has described).
    Some of the problems were easily fixable and others were not. I could
    have spent an hour getting the code to make the line go where I wanted
    it to, or I could just fit the line visually in about 2 minutes.


    Oscar
    Oscar Benjamin, Jan 8, 2013
    #19
  20. Joseph L. Casale

    Robert Kern Guest

    Re: [Offtopic] Line fitting [was Re: Numpy outlier removal]

    On 08/01/2013 06:35, Chris Angelico wrote:
    > On Tue, Jan 8, 2013 at 1:06 PM, Steven D'Aprano
    > <> wrote:
    >>> given that weather patterns have been known to follow cycles at least
    >>> that long.

    >>
    >> That is not a given. "Weather patterns" don't last for thirty years.
    >> Perhaps you are talking about climate patterns?

    >
    > Yes, that's what I meant. In any case, debate about global warming is
    > quite tangential to the point about statistical validity; it looks
    > quite significant to show a line going from the bottom of the graph to
    > the top, but sounds a lot less noteworthy when you see it as a
    > half-degree increase on about (I think?) 30 degrees, and even less
    > when you measure temperatures in absolute scale (Kelvin) and it's half
    > a degree in three hundred.


    Why on Earth do you think that the distance from nominal surface temperatures to
    freezing, much less absolute 0, is the right scale to compare global warming
    changes against? You need to compare against the size of global mean temperature
    changes that would cause large amounts of human suffering, and that scale is on
    the order of a *few* degrees, not hundreds. A change of half a degree over a few
    decades with no signs of slowing down *should* be alarming.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Jan 8, 2013
    #20
