Riddle me this


Sharp Tool

Hi

Consider this list of numbers:

12.0
5.0
1.0
-0.1
-2.1
-124.0

What algorithm should I use to remove large negative values such as -124.0?
How do I determine a cutoff value that is statistically meaningful?

So far I have:

cutoff = smallest positive - smallest difference in negative pairs
       = 1.0 - (2.1 - 0.1)
       = 1.0 - 2.0
       = -1.0

The problem is that would eliminate -2.1!

Help appreciated.
Sharp Tool
 

Roedy Green

Sharp said:
What algorithm should I use to remove large negative values such as -124.0?
How do I determine a cutoff value that is statistically meaningful?

That is not usually a statistical question but a plausibility
question. If you are scanning data for temperatures in Honolulu, you
would look at the history, give yourself a safety factor, and chop
anything below or above a given range.

Readings for human temperatures would have a narrower range unless you
included corpses.

If your numbers fit a normal bell-shaped curve, you can compute the
mean and standard deviation. Then you could throw out numbers more
than n standard deviations from the mean.
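
Something along these lines in Java would do it (untested sketch; the
method name and the choice of n are mine, and it uses the sample
standard deviation):

// Reject values more than n standard deviations from the mean.
// Only really meaningful if the data is roughly normally distributed.
static double[] rejectOutliers(double[] data, double n) {
    double sum = 0;
    for (double d : data) sum += d;
    double mean = sum / data.length;

    double sq = 0;
    for (double d : data) sq += (d - mean) * (d - mean);
    double sd = Math.sqrt(sq / (data.length - 1)); // sample standard deviation

    java.util.List<Double> kept = new java.util.ArrayList<Double>();
    for (double d : data) {
        if (Math.abs(d - mean) <= n * sd) kept.add(d);
    }
    double[] out = new double[kept.size()];
    for (int i = 0; i < out.length; i++) out[i] = kept.get(i);
    return out;
}

Calling rejectOutliers(data, 3) would then drop anything beyond three
standard deviations from the mean.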
 

Thomas Hawtin

Sharp said:
What algorithm should I use to remove large negative values such as -124.0?
How do I determine a cutoff value that is statistically meaningful?

This newsgroup probably isn't the best place to find statisticians
(although I guess there are a few).

You could google for "outliers" or similar. "Grubbs' Test for Outliers"
seems like a step in the right direction.

Tom Hawtin
 

SDB

: Consider this list of numbers:
:
: 12.0
: 5.0
: 1.0
: -0.1
: -2.1
: -124.0

: What algorithm should I use to remove large negative values such as -124.0?
: How do I determine a cutoff value that is statistically meaningful?

: So far i have:

: cutoff = smallest positive - smallest difference in negative pairs
: = 1.0 - (2.1 - 0.1)
: = 1.0 - 2.0
: = -1.0

How sophisticated do you need to be? Consider using the absolute value so
you don't need to worry about positive or negative numbers.

If the numbers you gave are just an example and the problem you are trying
to solve is more generic, look at a statistics value called the 'Z-score', also
sometimes called the 'Z-value'. It is computed by subtracting the mean from
the number and then dividing by the standard deviation of the set. You can
throw out values outside a range of Z-scores.

From your set, the mean is about -18.03 and the sample standard deviation
is about 52.15.

The Z-score of the second value, 5.0, is about 0.44.
The Z-score of the last value, -124.0, is about -2.03.

In stats, the z-Score is your friend.
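
If it helps, here is a quick snippet to reproduce those figures (the class
name is arbitrary, and it uses the sample standard deviation, i.e. dividing
by n - 1):

public class ZScores {
    public static void main(String[] args) {
        double[] data = { 12.0, 5.0, 1.0, -0.1, -2.1, -124.0 };

        double sum = 0;
        for (double d : data) sum += d;
        double mean = sum / data.length;                // about -18.03

        double sq = 0;
        for (double d : data) sq += (d - mean) * (d - mean);
        double sd = Math.sqrt(sq / (data.length - 1));  // about 52.15

        for (double d : data) {
            // Z-score: how many standard deviations the value is from the mean
            System.out.println(d + " -> z = " + (d - mean) / sd);
        }
    }
}

-124.0 comes out at roughly -2.0, by far the largest in magnitude, which is
what makes it the outlier.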
 

Sharp Tool

Thomas Hawtin said:
This newsgroup probably isn't the best place to find statisticians
(although I guess there are a few).

You could google for "outliers" or similar. "Grubbs' Test for Outliers"
seems like a step in the right direction.

Tom Hawtin

Grubbs' test is only suitable for data that has a normal distribution - mine
does not.

Cheers
Sharp
 

Sharp Tool

SDB said:
: Consider this list of numbers:
:
: 12.0
: 5.0
: 1.0
: -0.1
: -2.1
: -124.0

: What algorithm should I use to remove large negative values such as -124.0?
: How do I determine a cutoff value that is statistically meaningful?

: So far i have:

: cutoff = smallest positive - smallest difference in negative pairs
: = 1.0 - (2.1 - 0.1)
: = 1.0 - 2.0
: = -1.0

How sophisticated do you need to be? Consider using the absolute value so
you don't need to worry about positive or negative numbers.

If the numbers you gave are just an example and the problem you are trying
to solve is more generic, look at a statistics value called the 'Z-score', also
sometimes called the 'Z-value'. It is computed by subtracting the mean from
the number and then dividing by the standard deviation of the set. You can
throw out values outside a range of Z-scores.

From your set, the mean is about -18.03 and the sample standard deviation
is about 52.15.

The Z-score of the second value, 5.0, is about 0.44.
The Z-score of the last value, -124.0, is about -2.03.

In stats, the z-Score is your friend.

My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values.
Z-scores only work with absolute values.
So what's the best way to go now? I'm not a statistician.

Cheers
Sharp Tool
 

Roedy Green

Sharp said:
My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values.
Z-scores only work with absolute values.
So what's the best way to go now? I'm not a statistician.

What distribution do they conform to?
 

Andrew Thompson

Sharp said:
My data does not fit a normal distribution.

What distribution/pattern/logic does it fit, because..
I only want to eliminate large negative values.

...knowing that will get us a lot closer to defining
(pinning down, and putting a value to) 'large'.

Beyond the hypothetical though, does this describe
an actual problem, or is it purely a mental exercise?
 

Sharp Tool

Andrew Thompson said:
Obvious to whom? What is the cut-off limit for 'obvious'?


..knowing that will get us a lot closer to defining
(pinning down, and putting a value to) 'large'.

A large value is one that is an obvious outlier.
I only want to eliminate large negative values.
By eyeballing the list of numbers, you can see that -124.0
doesn't 'fit in'. I'm wondering if there is a statistical method for this.
Beyond the hypothetical though, does this describe
an actual problem, or is it purely a mental exercise?

Mental exercise, but I think it could be useful for removing
negative outliers.

Sharp Tool
 

Chris Uppal

Sharp said:
[...]
My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values. [...]
So what's the best way to go now? I'm not a statistician.

If you really mean that you want it to be "statistically meaningful" then you'll
have to talk to a statistician. In order for that talk to be worthwhile you'll
need to know what distribution the numbers do follow (either as an analytic
description -- possibly an approximation -- or as empirical data). You will
also need to know whether the distribution is identical on each run, or whether
it is parameterised in some way. In the latter case the first part of the task
will be to estimate the parameters of the distribution based on the data from
that run (presumably including the positive values), then the second part of
the task will be eliminating data points that are "implausible" (in some fixed
sense) given the estimated distribution.

If the distribution is fixed across runs, then there is no need for the
curve-fitting step, and the question reduces to finding a single, fixed,
threshold beyond which data-points are unlikely to occur by natural chance, and
which can therefore be dismissed (with a certain confidence) as outliers. In
this case you can run some experiments to find what value 95% (say) of negative
values lie above. On subsequent runs, values lower than that can be rejected
as "implausible" (on the assumption that they are drawn from the same
underlying distribution as your test runs). I'm not a statistician, so I don't
know whether you would be able to claim 95% confidence in this case, nor how to
quantify how much test data you would need (nor, indeed, how the two
interrelate).
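
A crude sketch of that fixed-threshold idea in Java (the method name and
the 95% figure are placeholders, and it assumes the calibration data
contains at least one negative value):

// Estimate a cutoff from calibration runs such that roughly 'keepFraction'
// (e.g. 0.95) of the negative values lie above it.
static double negativeCutoff(double[] calibration, double keepFraction) {
    java.util.List<Double> negatives = new java.util.ArrayList<Double>();
    for (double d : calibration) {
        if (d < 0) negatives.add(d);
    }
    java.util.Collections.sort(negatives); // ascending: most negative first
    // Index of the empirical (1 - keepFraction) quantile of the negatives.
    int i = (int) Math.floor((1.0 - keepFraction) * (negatives.size() - 1));
    return negatives.get(i);
}

On subsequent runs you would then reject any value below
negativeCutoff(calibrationData, 0.95) as implausible.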

Googling for
outlier removal
shows up lots of promising looking hints.

OTOH, it might be simplest to punt the question to the user, and have a
configurable parameter. If you do that then you should follow hallowed
practice and:

a) Bury the parameter in an XML file somewhere. Read and write out the data on
each run so that no human-readable formatting is preserved.

b) Give the parameter as vague and ambiguous a name as possible. In this case
you should ensure that neither the parameter name nor its documentation give
any hint as to whether the value is intended to be an absolute cut-off value,
the negation of an absolute cut-off, a high percentile threshold, a low
percentile threshold, or the absolute number of datapoints to reject.

c) Attempt to ensure that the default value is unsuitable for use in any
real-world application.

If you want to "go the extra mile" and work to the very highest professional
standards, then you should also:

d) Ensure that this behaviour is controlled by several parameters. They should
be confusingly named (a reliable technique here is to give them names that are
the opposite of what they actually mean), and should interact in ways that are
neither obvious nor documented. You should further ensure that sensible
results can only be achieved by setting one of the parameters explicitly (no
combination of the other parameters has the same effect), and mark that as
"deprecated" in /some/ of the documentation, whilst also making heavy use of it
in any examples.

-- chris
 

Sharp Tool

Sharp said:
What algorithm should I use to remove large negative values such as -124.0?
How do I determine a cutoff value that is statistically meaningful?
[...]
My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values. [...]
So what's the best way to go now? I'm not a statistician.

If you really mean that you want it to be "statistically meaningful" then you'll
have to talk to a statistician. In order for that talk to be worthwhile you'll
need to know what distribution the numbers do follow (either as an analytic
description -- possibly an approximation -- or as empirical data). You will
also need to know whether the distribution is identical on each run, or whether
it is parameterised in some way. In the latter case the first part of the task
will be to estimate the parameters of the distribution based on the data from
that run (presumably including the positive values), then the second part of
the task will be eliminating data points that are "implausible" (in some fixed
sense) given the estimated distribution.

If the distribution is fixed across runs, then there is no need for the
curve-fitting step, and the question reduces to finding a single, fixed,
threshold beyond which data-points are unlikely to occur by natural chance, and
which can therefore be dismissed (with a certain confidence) as outliers. In
this case you can run some experiments to find what value 95% (say) of negative
values lie above. On subsequent runs, values lower than that can be rejected
as "implausible" (on the assumption that they are drawn from the same
underlying distribution as your test runs). I'm not a statistician, so I don't
know whether you would be able to claim 95% confidence in this case, nor how to
quantify how much test data you would need (nor, indeed, how the two
interrelate).

You sure seem to know a fair bit about statistics.
My question now is, how does one determine the distribution of data?
I haven't done much analysis, but I'd say it looks random.
The cutoff based on the value that 95% of negative values lie above
sounds good.
Looking at Google searches now.

Sharp Tool
 

Andrew Thompson

Sharp said:
A large value is one that is an obvious..

Obvious to whom? What is the cut-off limit for 'obvious'?

You quote '-124.0', but what about '-74.2', or '-24.0'?

To me, even '-24' could be an 'obvious' outlier.
But without some form of 'confidence level' and a
mathematically definable group, we cannot even
determine exactly what constitutes a cut-off limit.

With such vague descriptions of what the group represents,
there is really no way to progress the problem.
..outlier.
I only want to eliminate large negative values.
By eyeballing the list of numbers, you can see that -124.0
doesn't 'fit in'. I'm wondering if there is a statistical method for this.


Mental exercise, ..

I'll leave you with it.
 

Sharp Tool

In that case you can't make a case for tossing any of them. Keep in
mind even normal distributions are still random.

You're right.
The distribution looks like a bell-shaped curve skewed to the left, with an
initial plateau; then it slides to the right and suddenly makes a sharp
dip down.
So I guess that's not really a normal distribution.

Sharp Tool
 

Roedy Green

Sharp said:
My question now is, how does one determine the distribution of data?

One way is to do a histogram.

If you see a bell-shaped curve coming out, you likely have a normal
distribution.

Various other distributions have a characteristic shape.

One that comes up often is called Poisson. It looks like a skewed
bell-shaped curve with the right-hand side stretched out.
See http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
How long you wait for a bus might follow a Poisson distribution.

The geometric distribution just falls off. See
http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
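
To actually draw one, something quick and dirty along these lines is enough
to eyeball the shape (a Java sketch; the bin count is arbitrary and it
assumes the values are not all identical):

// Print one row per bin, with one '*' per value that falls in the bin.
static void histogram(double[] data, int bins) {
    double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
    for (double d : data) {
        min = Math.min(min, d);
        max = Math.max(max, d);
    }
    double width = (max - min) / bins;
    int[] counts = new int[bins];
    for (double d : data) {
        int bin = (int) ((d - min) / width);
        if (bin == bins) bin--;  // the maximum lands in the last bin
        counts[bin]++;
    }
    for (int i = 0; i < bins; i++) {
        System.out.printf("%10.1f .. %10.1f | ", min + i * width, min + (i + 1) * width);
        for (int j = 0; j < counts[i]; j++) System.out.print('*');
        System.out.println();
    }
}

A long tail on one side suggests something skewed like Poisson or geometric;
a lone bar far off to the left would be your -124.0 sticking out.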
 

Sharp Tool

Andrew Thompson said:
Obvious to whom? What is the cut-off limit for 'obvious'?

Obvious to me when I look (eyeballing) at the list of numbers I presented
in my first posting.
You quote '-124.0', but what about '-74.2', or '-24.0'?

There is no -74.2 or -24 in my original list.
If there were, I would say -74.2 is possibly another negative outlier.
Again, there is no statistical backing for this.
To me, even '-24' could be an 'obvious' outlier.
But without some form of 'confidence level' and a
mathematically definable group, we cannot even
determine exactly what constitutes a cut-off limit.

'-24' does not seem like an obvious outlier to me.
Again, without some sort of statistics it's all subjective.
With such vague descriptions of what the group represents,
there is really no way to progress the problem.

Not sure what you mean by a mathematically definable group.
But I assume you mean the distribution of the data.
The confidence level would be the standard 95% in the statistical world.
The question is how to get a cutoff that will give me that confidence level.
Should one look at Z-scores (as was suggested), or some other statistical
parameter, to establish a cutoff, or just look at the raw numbers to
establish a confidence level (as was also suggested)?

It's vague to you, Andrew, because it's not your area of expertise - not
because of my 'vague description'.

Sharp Tool
 

Roedy Green

Sharp said:
The distribution looks like a bell-shaped curve skewed to the left, with an
initial plateau; then it slides to the right and suddenly makes a sharp
dip down.
So I guess that's not really a normal distribution.

You may be able to analyse the physics of your readings to calculate
the expected distribution.

The classic shapes are not really clear until you have a lot of data.
You won't see the pattern with just 5 points.

This reminds me of something that happened when I was studying physics at
UBC circa 1968. We were doing a lab with an experiment that was
supposed to produce a normal distribution. But it obviously wasn't.
The machine was broken. Student after student complained, but they were
dismissed as incompetent. I keypunched the data and did a histogram
and produced it on the pen plotter -- a great novelty in that day.

It clearly showed a camel hump. The COMPUTER graph clinched it and
off the machine went for repair. You can't do that as easily today.
Back then anything that came from a computer was treated as divine
revelation.
 

Sharp Tool

Roedy Green said:
One way is to do a histogram.

If you see a bell-shaped curve coming out, you likely have a normal
distribution.

Various other distributions have a characteristic shape.

One that comes up often is called Poisson. It looks like a skewed
bell-shaped curve with the right-hand side stretched out.
See http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
How long you wait for a bus might follow a Poisson distribution.

The geometric distribution just falls off. See
http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn

That's a great link.
I plotted my data and it does look like a Poisson distribution.
But the large negative number makes it fall off a cliff.
None of these distributions include negative numbers?

Sharp Tool
 

Andrew Thompson

Sharp said:
It's vague to you, Andrew, because it's not your area of expertise - not
because of my 'vague description'.

Very sound assessment, coming from someone who first stated
the numbers had no 'normal distribution' and is now saying
they do, and that a confidence level of 95% 'sounds good'.
Sharp Tool

[ Seems a little 'blunt' at the moment.. ;-) ]
 
