NumPy outlier removal

  • Thread starter Joseph L. Casale

Joseph L. Casale

I have a dataset that consists of a dict with text descriptions and values that are integers. If
required, I collect the values into a list and create a numpy array, running it through a simple
routine: data[abs(data - mean(data)) < m * std(data)], where m is the number of std deviations
to include.


The problem is I lose track of which were removed, so the original display of the dataset is
misleading when the processed average is returned, as it includes the removed key/values.


Anyone know how I can maintain the relationship and, when I exclude a value, remove it from
the dict?

Thanks!
jlc
 

Hans Mulder

I have a dataset that consists of a dict with text descriptions and values that are integers. If
required, I collect the values into a list and create a numpy array running it through a simple
routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
to include.


The problem is I lose track of which were removed, so the original display of the dataset is
misleading when the processed average is returned, as it includes the removed key/values.


Anyone know how I can maintain the relationship and, when I exclude a value, remove it from
the dict?

Assuming your data and the dictionary are keyed by a common set of keys:

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        del data[key]
        del descriptions[key]


Hope this helps,

-- HansM
 

Joseph L. Casale

Assuming your data and the dictionary are keyed by a common set of keys: 
for key in descriptions:
   if abs(data[key] - mean(data)) >= m * std(data):
       del data[key]
       del descriptions[key]


Heh, yeah, sometimes the obvious is too simple to see. I used a dict comp to rebuild
the results with the comparison.
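
For reference, a minimal sketch of that dict-comprehension approach (an illustration
assuming the dict maps text descriptions to integer values; this is not the OP's
exact code):

import numpy as np

def drop_outliers(dataset, m=2):
    # dataset: dict mapping text descriptions to integer values (assumed shape)
    values = np.array(list(dataset.values()), dtype=float)
    mean = values.mean()
    std = values.std()
    # Rebuild the dict, keeping only entries within m standard deviations
    return {key: value for key, value in dataset.items()
            if abs(value - mean) < m * std}

Because the dict is rebuilt rather than mutated in place, the key/value
relationship is preserved for every surviving entry.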


Thanks!
jlc
 

MRAB

I have a dataset that consists of a dict with text descriptions and values that are integers. If
required, I collect the values into a list and create a numpy array running it through a simple
routine: data[abs(data - mean(data)) < m * std(data)] where m is the number of std deviations
to include.


The problem is I lose track of which were removed, so the original display of the dataset is
misleading when the processed average is returned, as it includes the removed key/values.


Anyone know how I can maintain the relationship and, when I exclude a value, remove it from
the dict?

Assuming your data and the dictionary are keyed by a common set of keys:

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        del data[key]
        del descriptions[key]
It's generally a bad idea to modify a collection over which you're
iterating. It's better to, say, make a list of what you're going to
delete and then iterate over that list to make the deletions:

deletions = []

for key in descriptions:
    if abs(data[key] - mean(data)) >= m * std(data):
        deletions.append(key)

for key in deletions:
    del data[key]
    del descriptions[key]
 

Steven D'Aprano

I have a dataset that consists of a dict with text descriptions and
values that are integers. If required, I collect the values into a list
and create a numpy array running it through a simple routine: 

data[abs(data - mean(data)) < m * std(data)]

where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you
that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of
outliers, your test for what counts as an outlier will miss outliers for
data from a normal distribution. For small N (sample size), it may be
mathematically impossible for any data point to be greater than m*SD from
the mean. For example, with N=5, no data point can be more than 1.789*SD
from the mean. So for N=5, m=1 may throw away good data, and m=2 will
fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points
more than m*SD from the mean. With N=100000, and m=3, you will expect to
throw away 270 perfectly good data points simply because they are out on
the tails of the distribution.
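
Both claims are easy to check numerically. A quick sketch (NumPy only; note that
the 1.789*SD bound quoted above assumes the sample standard deviation, i.e. ddof=1):

import numpy as np

# Small N: even the most extreme sample of size 5 puts no point more than
# (N-1)/sqrt(N) = 4/sqrt(5) ~ 1.789 sample standard deviations from the mean.
data = np.array([0, 0, 0, 0, 1], dtype=float)
z = np.abs(data - data.mean()) / data.std(ddof=1)
print(z.max())        # 1.7888..., exactly (5 - 1)/sqrt(5)

# Large N: for normal data, about 0.27% of points lie beyond 3 SD of the mean,
# so with N=100000 roughly 270 perfectly good points get flagged.
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
flagged = np.abs(sample - sample.mean()) >= 3 * sample.std(ddof=1)
print(flagged.sum())  # around 270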

Worse, if the data is not in fact from a normal distribution, all bets
are off. You may be keeping obvious outliers; or more often, your test
will be throwing away perfectly good data that it misidentifies as
outliers.

In other words: this approach for detecting outliers is nothing more than
a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For
example, the ozone hole over the Antarctic was ignored for many years
because the software being used to analyse it misidentified the data as
outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically
impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
unless you have good, solid, physical reasons for justifying removal of
outliers. Other than that, manually remove outliers with care, or not at
all, and if you do so, always report your results twice, once with all
the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties
thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/prism/6/statistics/index.htm

http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
 

Joseph L. Casale

In other words: this approach for detecting outliers is nothing more than 
a very rough, and very bad, heuristic, and should be avoided.

Heh, very true but the results will only be used for conversational purposes.
I am making an assumption that the data is normally distributed and I do expect
valid results to all be very nearly the same.
You can read up more about outlier detection, and the difficulties 
thereof, here:


I much appreciate the links and the thought in the post. I'll admit I didn't
realize outlier detection was so involved.


Again, thanks!
jlc
 

Paul Simon

Steven D'Aprano said:
I have a dataset that consists of a dict with text descriptions and
values that are integers. If required, I collect the values into a list
and create a numpy array running it through a simple routine:

data[abs(data - mean(data)) < m * std(data)]

where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.

The above assumes your data is normally distributed. How sure are you
that this is actually the case?

For normally distributed data:

Since both the mean and std calculations are affected by the presence of
outliers, your test for what counts as an outlier will miss outliers for
data from a normal distribution. For small N (sample size), it may be
mathematically impossible for any data point to be greater than m*SD from
the mean. For example, with N=5, no data point can be more than 1.789*SD
from the mean. So for N=5, m=1 may throw away good data, and m=2 will
fail to find any outliers no matter how outrageous they are.

For large N, you will expect to find significant numbers of data points
more than m*SD from the mean. With N=100000, and m=3, you will expect to
throw away 270 perfectly good data points simply because they are out on
the tails of the distribution.

Worse, if the data is not in fact from a normal distribution, all bets
are off. You may be keeping obvious outliers; or more often, your test
will be throwing away perfectly good data that it misidentifies as
outliers.

In other words: this approach for detecting outliers is nothing more than
a very rough, and very bad, heuristic, and should be avoided.

Identifying outliers is fraught with problems even for experts. For
example, the ozone hole over the Antarctic was ignored for many years
because the software being used to analyse it misidentified the data as
outliers.

The best general advice I have seen is:

Never automatically remove outliers except for values that are physically
impossible (e.g. "baby's weight is 95kg", "test score of 31 out of 20"),
unless you have good, solid, physical reasons for justifying removal of
outliers. Other than that, manually remove outliers with care, or not at
all, and if you do so, always report your results twice, once with all
the data, and once with supposed outliers removed.

You can read up more about outlier detection, and the difficulties
thereof, here:

http://www.medcalc.org/manual/outliers.php

https://secure.graphpad.com/guides/prism/6/statistics/index.htm

http://www.webapps.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
If you suspect that the data may not be normal, you might look at exploratory
data analysis (see Tukey). It's descriptive rather than analytic, treats
outliers respectfully, uses the median rather than the mean, and is very visual.
Whenever I analyzed data both with Gaussian methods and with EDA, EDA always won.

Paul
 

Oscar Benjamin

I have a dataset that consists of a dict with text descriptions and
values that are integers. If required, I collect the values into a list
and create a numpy array running it through a simple routine:

data[abs(data - mean(data)) < m * std(data)]

where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.

Whether or not this is "statistically robust" requires more
explanation about the OP's intention. Thus far, the OP has not given
any reason/motivation for excluding data or even for having any data
in the first place! It's hard to say whether any technique applied is
really accurate/robust without knowing *anything* about the purpose of
the operation.


Oscar
 

Steven D'Aprano

I have a dataset that consists of a dict with text descriptions and
values that are integers. If required, I collect the values into a
list and create a numpy array running it through a simple routine:

data[abs(data - mean(data)) < m * std(data)]

where m is the number of std deviations to include.

I'm not sure that this approach is statistically robust. No, let me be
even more assertive: I'm sure that this approach is NOT statistically
robust, and may be scientifically dubious.

Whether or not this is "statistically robust" requires more explanation
about the OP's intention.

Not really. Statistical robustness is objectively defined, and the user's
intention doesn't come into it. The mean is not a robust measure of
central tendency, the median is, regardless of why you pick one or the
other.
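
To illustrate the point (a small sketch, not from the original posts): one wild
value drags the mean a long way but barely moves the median:

import numpy as np

values = np.array([9, 10, 10, 11, 10, 9, 11, 10])
print(np.mean(values), np.median(values))        # 10.0 10.0

corrupted = np.append(values, 1000)              # add one gross outlier
print(np.mean(corrupted), np.median(corrupted))  # 120.0 10.0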

There are sometimes good reasons for choosing non-robust statistics or
techniques over robust ones, but some techniques are so dodgy that there
is *never* a good reason for doing so. E.g. finding the line of best fit
by eye, or taking more and more samples until you get a statistically
significant result. Such techniques are not just non-robust in the
statistical sense, but non-robust in the general sense, if not outright
deceitful.
 

Oscar Benjamin

Not really. Statistical robustness is objectively defined, and the user's
intention doesn't come into it. The mean is not a robust measure of
central tendency, the median is, regardless of why you pick one or the
other.

Okay, I see what you mean. I wasn't thinking of robustness as a
technical term but now I see that you are correct.

Perhaps what I should have said is that whether or not this matters
depends on the problem at hand (hopefully this isn't an important
medical trial) and the particular type of data that you have; assuming
normality is fine in many cases even if the data is not "really"
normal.
There are sometimes good reasons for choosing non-robust statistics or
techniques over robust ones, but some techniques are so dodgy that there
is *never* a good reason for doing so. E.g. finding the line of best fit
by eye, or taking more and more samples until you get a statistically
significant result. Such techniques are not just non-robust in the
statistical sense, but non-robust in the general sense, if not outright
deceitful.

There are sometimes good reasons to get a line of best fit by eye. In
particular if your data contains clusters that are hard to separate,
sometimes it's useful to just pick out roughly where you think a line
through a subset of the data is.


Oscar
 

Robert Kern

Okay, I see what you mean. I wasn't thinking of robustness as a
technical term but now I see that you are correct.

Perhaps what I should have said is that whether or not this matters
depends on the problem at hand (hopefully this isn't an important
medical trial) and the particular type of data that you have; assuming
normality is fine in many cases even if the data is not "really"
normal.

"Having outliers" literally means that assuming normality is not fine. If
assuming normality were fine, then you wouldn't need to remove outliers.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Steven D'Aprano

There are sometimes good reasons to get a line of best fit by eye. In
particular if your data contains clusters that are hard to separate,
sometimes it's useful to just pick out roughly where you think a line
through a subset of the data is.

Cherry picking subsets of your data as well as line fitting by eye? Two
wrongs do not make a right.

If you're going to just invent a line based on where you think it should
be, what do you need the data for? Just declare "this is the line I wish
to believe in" and save yourself the time and energy of collecting the
data in the first place. Your conclusion will be no less valid.

How do you distinguish between "data contains clusters that are hard to
separate" from "data doesn't fit a line at all"?

Even if the data actually is linear, on what basis could we distinguish
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
subjective judgement can be equally denied on the basis of subjective
judgement.

Anyone can fool themselves into placing a line through a subset of non-
linear data. Or, sadly more often, *deliberately* cherry picking fake
clusters in order to fool others. Here is a real world example of what
happens when people pick out the data clusters that they like based on
visual inspection:

http://www.skepticalscience.com/images/TempEscalator.gif

And not linear by any means, but related to the cherry picking theme:

http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif


To put it another way, when we fit patterns to data by eye, we can easily
fool ourselves into seeing patterns that aren't there, or missing the
patterns which are there. At best line fitting by eye is prone to honest
errors; at worst, it is open to the most deliberate abuse. We have eyes
and brains that evolved to spot the ripe fruit in trees, not to spot
linear trends in noisy data, and fitting by eye is not safe or
appropriate.
 

Chris Angelico

Anyone can fool themselves into placing a line through a subset of non-
linear data. Or, sadly more often, *deliberately* cherry picking fake
clusters in order to fool others. Here is a real world example of what
happens when people pick out the data clusters that they like based on
visual inspection:

http://www.skepticalscience.com/images/TempEscalator.gif

And sensible people will notice that, even drawn like that, it's only
a ~0.6 deg increase across ~30 years. Hardly statistically
significant, given that weather patterns have been known to follow
cycles at least that long. But that's nothing to do with drawing lines
through points, and more to do with how much data you collect before
you announce a conclusion, and how easily a graph can prove any point
you like.

Statistical analysis is a huge science. So is lying. And I'm not sure
most people can pick one from the other.

ChrisA
 

Oscar Benjamin

Cherry picking subsets of your data as well as line fitting by eye? Two
wrongs do not make a right.

It depends on what you're doing, though. I wouldn't use an eyeball fit
to get numbers that were an important part of the conclusion of some
study or other. I would very often use it while I'm just in the
process of trying to understand something.
If you're going to just invent a line based on where you think it should
be, what do you need the data for? Just declare "this is the line I wish
to believe in" and save yourself the time and energy of collecting the
data in the first place. Your conclusion will be no less valid.

An example: Earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that
two variables x and y will vary in direct proportion to one another
and the data broadly reflects this. However, at this stage there is
some non-normal variability in the data, caused by experimental
difficulties. A subset of the data appears to closely follow a well
defined linear pattern but there are outliers and the pattern breaks
down in an asymmetric way at larger x and y values. At some later time
either the sources of experimental variation will be reduced, or they
will be better understood, but for now it is still useful to estimate
the constant of proportionality in order to check whether it seems
consistent with the observed values of z. With this particular dataset
I would have wasted a lot of time if I had tried to find a computational
method to match the line that, to me, was very visible, so I chose the
line visually.
How do you distinguish between "data contains clusters that are hard to
separate" from "data doesn't fit a line at all"?

In the example I gave it isn't possible to make that distinction with
the currently available data. That doesn't make it meaningless to try
and estimate the parameters of the relationship between the variables
using the preliminary data.
Even if the data actually is linear, on what basis could we distinguish
between the line you fit by eye (say) y = 2.5x + 3.7, and the line I fit
by eye (say) y = 3.1x + 4.1? The line you assert on the basis of purely
subjective judgement can be equally denied on the basis of subjective
judgement.

It gets a bit easier if the line is constrained to go through the
origin. You seem to be thinking that the important thing is proving
that the line is "real", rather than identifying where it is. Both
things are important but not necessarily in the same problem. In my
example, the "real line" may not be straight and may not go through
the origin, but it is definitely there and if there were no
experimental problems then the data would all be very close to it.
Anyone can fool themselves into placing a line through a subset of non-
linear data. Or, sadly more often, *deliberately* cherry picking fake
clusters in order to fool others. Here is a real world example of what
happens when people pick out the data clusters that they like based on
visual inspection:

http://www.skepticalscience.com/images/TempEscalator.gif

And not linear by any means, but related to the cherry picking theme:

http://www.skepticalscience.com/pics/1_ArcticEscalator2012.gif


To put it another way, when we fit patterns to data by eye, we can easily
fool ourselves into seeing patterns that aren't there, or missing the
patterns which are there. At best line fitting by eye is prone to honest
errors; at worst, it is open to the most deliberate abuse. We have eyes
and brains that evolved to spot the ripe fruit in trees, not to spot
linear trends in noisy data, and fitting by eye is not safe or
appropriate.

This is all true. But the human brain is also in many ways much better
than a typical computer program at recognising patterns in data when
the data can be depicted visually. I would very rarely attempt to
analyse data without representing it in some visual form. I also think
it would be highly foolish to go so far with refusing to eyeball data
that you would accept the output of some regression algorithm even
when it clearly looks wrong.


Oscar
 

Steven D'Aprano

An example: Earlier today I was looking at some experimental data. A
simple model of the process underlying the experiment suggests that two
variables x and y will vary in direct proportion to one another and the
data broadly reflects this. However, at this stage there is some
non-normal variability in the data, caused by experimental difficulties.
A subset of the data appears to closely follow a well defined linear
pattern but there are outliers and the pattern breaks down in an
asymmetric way at larger x and y values. At some later time either the
sources of experimental variation will be reduced, or they will be
better understood but for now it is still useful to estimate the
constant of proportionality in order to check whether it seems
consistent with the observed values of z. With this particular dataset I
would have wasted a lot of time if I had tried to find a computational
method to match the line that to me was very visible so I chose the line
visually.


If you mean:

"I looked at the data, identified that the range a < x < b looks linear
and the range x > b does not, then used least squares (or some other
recognised, objective technique for fitting a line) to the data in that
linear range"

then I'm completely cool with that. That's fine, with the understanding
that this is the first step in either fixing your measurement problems,
fixing your model, or at least avoiding extrapolation into the non-linear
range.
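
A sketch of that workflow (the data and the bounds a, b here are invented for
illustration): eyeball only the *range* that looks linear, then let an objective
least-squares fit choose the line within it:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2.5 * x + rng.normal(scale=0.5, size=x.size)
y[x > 7] += (x[x > 7] - 7) ** 2          # model breaks down at larger x

a, b = 0.0, 7.0                          # range judged linear by inspection
mask = (x > a) & (x < b)
slope, intercept = np.polyfit(x[mask], y[mask], 1)
print(slope, intercept)                  # close to 2.5 and 0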

But that is not fitting a line by eye, which is what I am talking about.

If on the other hand you mean:

"I looked at the data, identified that the range a < x < b looked linear,
so I laid a ruler down over the graph and pushed it around until I was
satisfied that the ruler looked more or less like it fitted the data
points, according to my guess of what counts as a close fit"

that *is* fitting a line by eye, and it is entirely subjective and
extremely dodgy for anything beyond quick and dirty back of the envelope
calculations[1]. That's okay if all you want is to get something within
an order of magnitude or so, or a line roughly pointing in the right
direction, but that's all.


[...]
I also think it would
be highly foolish to go so far with refusing to eyeball data that you
would accept the output of some regression algorithm even when it
clearly looks wrong.

I never said anything of the sort.

I said, don't fit lines to data by eye. I didn't say not to sanity check
your straight line fit is reasonable by eyeballing it.



[1] Or if your data is so accurate and noise-free that you hardly have to
care about errors, since there clearly is one and only one straight line
that passes through all the points.
 

Steven D'Aprano

And sensible people will notice that, even drawn like that, it's only a
~0.6 deg increase across ~30 years. Hardly statistically significant,

Well, I don't know about "sensible people", but the magnitude of an effect
has little to do with whether or not something is statistically
significant. Given noisy data, statistical significance relates to
whether or not we can be confident that the effect is *real*, not whether
it is a big effect or a small effect.

Here's an example: assume that you are on a fixed salary with a constant
weekly income. If you happen to win the lottery one day, and consequently
your income for that week quadruples, that is a large effect that fails
to have any statistical significance -- it's a blip, not part of any long-
term change in income. You can't conclude that you'll win the lottery
every week from now on.

On the other hand, if the government changes the rules relating to tax,
deductions, etc., even by a small amount, your weekly income might go
down, or up, by a single dollar. Even though that is a tiny effect, it is
*not* a blip, and will be statistically significant. In practice, it
takes a certain number of data points to reach that confidence level.
Your accountant, who knows the tax laws, will conclude that the change is
real immediately, but a statistician who sees only the pay slips may take
some months before she is convinced that the change is signal rather than
noise. With only three weeks pay slips in hand, the statistician cannot
be sure that the difference is not just some accounting error or other
fluke, but each additional data point increases the confidence that the
difference is real and not just some temporary aberration.

The other meaning of "significant" has nothing to do with statistics, and
everything to do with "a difference is only a difference if it makes a
difference". 0.2° per decade doesn't sound like much, not when we
consider daily or yearly temperatures that typically have a range of tens
of degrees between night and day, or winter and summer. But that is
misunderstanding the nature of long-term climate versus daily weather and
glossing over the fact that we're only talking about an average and
ignoring changes to the variability of the climate: a small increase in
average can lead to a large increase in extreme events.

given that weather patterns have been known to follow cycles at least
that long.

That is not a given. "Weather patterns" don't last for thirty years.
Perhaps you are talking about climate patterns? In which case, well, yes,
we can see a very strong climate pattern of warming on a time scale of
decades, with no evidence that it is a cycle.

There are, of course, many climate cycles that take place on a time frame
of years or decades, such as the North Atlantic Oscillation and the El
Nino Southern Oscillation. None of them are global, and as far as I know
none of them are exactly periodic. They are noise in the system, and
certainly not responsible for linear trends.
 

Chris Angelico

That is not a given. "Weather patterns" don't last for thirty years.
Perhaps you are talking about climate patterns?

Yes, that's what I meant. In any case, debate about global warming is
quite tangential to the point about statistical validity; it looks
quite significant to show a line going from the bottom of the graph to
the top, but sounds a lot less noteworthy when you see it as a
half-degree increase on about (I think?) 30 degrees, and even less
when you measure temperatures in absolute scale (Kelvin) and it's half
a degree in three hundred. Those are principles worth considering,
regardless of the subject matter. If your railway tracks have widened
by a full eight millimeters due to increased pounding from heavier
vehicles travelling over them, that's significant and dangerous on
HO-scale model trains, but utterly insignificant on 5'3" gauge.

ChrisA
 

Terry Reedy

If you mean:

"I looked at the data, identified that the range a < x < b looks linear
and the range x > b does not, then used least squares (or some other
recognised, objective technique for fitting a line) to the data in that
linear range"

then I'm completely cool with that.

If both x and y are measured values, then regressing x on y and y on x
will give different answers, and both will be wrong in that *neither*
will be the best answer for the relationship between them. Oscar did not
specify whether either was an experimentally set input variable.
But that is not fitting a line by eye, which is what I am talking about.

With the line constrained to go through 0,0, a line eyeballed with a
clear ruler could easily be better than either regression line, as a
human will tend to minimize the deviations *perpendicular to the line*,
which is the proper thing to do (assuming both variables are measured in
the same units).
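
A sketch of the difference Terry describes (invented data, line constrained
through the origin, both variables in the same units): ordinary least squares
of y on x and of x on y give two different slopes, while a total (orthogonal)
least-squares fit minimizes the perpendicular deviations:

import numpy as np

rng = np.random.default_rng(2)
true_slope = 1.8
t = np.linspace(1, 10, 50)
x = t + rng.normal(scale=0.5, size=t.size)   # noise in x as well as y
y = true_slope * t + rng.normal(scale=0.5, size=t.size)

# Ordinary least squares through the origin, in each direction:
slope_y_on_x = np.sum(x * y) / np.sum(x * x)
slope_x_on_y = np.sum(x * y) / np.sum(y * y)     # x regressed on y

# Total least squares through the origin: the first right singular vector of
# the (n x 2) data matrix gives the direction minimizing perpendicular distance.
_, _, vt = np.linalg.svd(np.column_stack([x, y]))
slope_tls = vt[0, 1] / vt[0, 0]

# The two OLS slopes typically bracket the orthogonal fit:
print(slope_y_on_x, 1 / slope_x_on_y, slope_tls)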
 

Oscar Benjamin

On Mon, 07 Jan 2013 22:32:54 +0000, Oscar Benjamin wrote:

[...]
I also think it would
be highly foolish to go so far with refusing to eyeball data that you
would accept the output of some regression algorithm even when it
clearly looks wrong.

I never said anything of the sort.

I said, don't fit lines to data by eye. I didn't say not to sanity check
your straight line fit is reasonable by eyeballing it.

I should have been a little clearer. That was the situation when I
decided to just use a (digital) ruler - although really it was more of
a visual bisection (1, 2, 1.5, 1.25...). The regression result was
clearly wrong (and also invalid for the reasons Terry has described).
Some of the problems were easily fixable and others were not. I could
have spent an hour getting the code to make the line go where I wanted
it to, or I could just fit the line visually in about 2 minutes.


Oscar
 

Robert Kern

Yes, that's what I meant. In any case, debate about global warming is
quite tangential to the point about statistical validity; it looks
quite significant to show a line going from the bottom of the graph to
the top, but sounds a lot less noteworthy when you see it as a
half-degree increase on about (I think?) 30 degrees, and even less
when you measure temperatures in absolute scale (Kelvin) and it's half
a degree in three hundred.

Why on Earth do you think that the distance from nominal surface temperatures to
freezing, much less to absolute zero, is the right scale to compare global warming
changes against? You need to compare against the size of global mean temperature
changes that would cause large amounts of human suffering, and that scale is on
the order of a *few* degrees, not hundreds. A change of half a degree over a few
decades, with no signs of slowing down, *should* be alarming.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
