# Riddle me this

Discussion in 'Java' started by Sharp Tool, Nov 6, 2005.

1. ### Sharp Tool (Guest)

Hi

Consider this list of numbers:

12.0
5.0
1.0
-0.1
-2.1
-124.0

what algorithm to use to remove large negative values such as -124.0?
how to determine a cutoff value that is statistically meaningful?

So far i have:

cut off = smallest positive - smallest difference in negative pairs
= 1.0 - (2.1 - 0.1)
= 1.0 - 2.0
= -1.0

Problem is that would eliminate -2.1!

Help appreciated.
Sharp Tool

Sharp Tool, Nov 6, 2005

2. ### Roedy Green (Guest)

On Sun, 06 Nov 2005 08:46:17 GMT, "Sharp Tool"
<> wrote, quoted or indirectly quoted someone
who said :

>what algorithm to use to remove large negative values such as -124.0?
>how to determine a cutoff value that is statistically meaningful?

That is not usually a statistical question but a plausibility
question. If you are scanning data for temperatures in Honolulu, you
would look at history, give yourself a safety factor, and chop below
and above a given range.

Readings for human temperatures would have a narrower range unless you
included corpses.

If your numbers fit a normal bell-shaped curve, you can compute the
mean and standard deviation. Then you could throw out numbers more
than n deviations from the mean.
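A rough Java sketch of that n-deviation filter (class and method names are invented here for illustration; it assumes the data is at least roughly normal):

```java
import java.util.ArrayList;
import java.util.List;

public class SigmaFilter {

    /** Keeps the values whose distance from the mean is at most n sample standard deviations. */
    public static List<Double> keepWithinSigma(double[] data, double n) {
        // mean
        double mean = 0.0;
        for (double x : data) mean += x;
        mean /= data.length;

        // sample standard deviation
        double sumSq = 0.0;
        for (double x : data) sumSq += (x - mean) * (x - mean);
        double sd = Math.sqrt(sumSq / (data.length - 1));

        // keep only the values within n deviations of the mean
        List<Double> kept = new ArrayList<Double>();
        for (double x : data) {
            if (Math.abs(x - mean) <= n * sd) kept.add(x);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
        // For this data the mean is about -18.0 and sd about 52.2,
        // so with n = 2 only -124.0 is dropped.
        System.out.println(keepWithinSigma(data, 2.0));
    }
}
```

With so few points the estimates of mean and deviation are themselves badly distorted by the outlier, so treat this as illustrative only.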

--
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green, Nov 6, 2005

3. ### Thomas Hawtin (Guest)

Sharp Tool wrote:
>
> what algorithm to use to remove large negative values such as -124.0?
> how to determine a cutoff value that is statistically meaningful?

This newsgroup probably isn't the best place to find statisticians
(although I guess there are a few).

You could google for "outliers" or similar. "Grubbs' Test for Outliers"
seems like a step in the right direction.

Tom Hawtin
--
Unemployed English Java programmer
http://jroller.com/page/tackline/

Thomas Hawtin, Nov 6, 2005
4. ### SDB (Guest)

"Sharp Tool" <> wrote in message
news:tpjbf.9940$...

: Consider this list of numbers:
:
: 12.0
: 5.0
: 1.0
: -0.1
: -2.1
: -124.0

: what algorithm to use to remove large negative values such as -124.0?
: how to determine a cutoff value that is statistically meaningful?

: So far i have:

: cut off = smallest positive - smallest difference in negative pairs
: = 1.0 - (2.1 - 0.1)
: = 1.0 - 2.0
: = -1.0

How sophisticated do you need to be? Consider using the absolute value so
you don't need to worry about positive or negative numbers.

If the numbers you gave are just an example and the problem you are trying
to solve is more generic, look at a statistic called the 'Z-score', also
sometimes called the 'Z-value'. It is computed by subtracting the mean from
the number, then dividing by the standard deviation of the set. You can
throw out values outside a range of Z-scores.

From your set, the standard deviation is 52.15.

The Z-score of the second one, 5.0, is about 0.44.
The Z-score of the last one, -124, is about -2.03.

In stats, the Z-score is your friend.

SDB, Nov 6, 2005
5. ### Sharp Tool (Guest)

>> Sharp Tool wrote:
> >
> > what algorithm to use to remove large negative values such as -124.0?
> > how to determine a cutoff value that is statistically meaningful?

>
> This newsgroup probably isn't the best place to find statisticians
> (although I guess there are a few).
>
> You could google for "outliers" or similar. "Grubbs' Test for Outliers"
> seems like a step in the right direction.
>
> Tom Hawtin

Grubbs' Test is only suitable for data that has a normal distribution - mine
does not.

Cheers
Sharp

Sharp Tool, Nov 7, 2005
6. ### Sharp Tool (Guest)

"SDB" <> wrote in message
news:...
> "Sharp Tool" <> wrote in message
> news:tpjbf.9940$...
>
> : Consider this list of numbers:
> :
> : 12.0
> : 5.0
> : 1.0
> : -0.1
> : -2.1
> : -124.0
>
> : what algorithm to use to remove large negative values such as -124.0?
> : how to determine a cutoff value that is statistically meaningful?
>
> : So far i have:
>
> : cut off = smallest positive - smallest difference in negative pairs
> : = 1.0 - (2.1 - 0.1)
> : = 1.0 - 2.0
> : = -1.0
>
> How sophisticated do you need to be? Consider using the absolute value so
> you don't need to worry about positive or negative numbers.
>
> If the numbers you gave are just an example and the problem you are trying
> to solve is more generic, look at a statistic called the 'Z-score', also
> sometimes called the 'Z-value'. It is computed by subtracting the mean from
> the number, then dividing by the standard deviation of the set. You can
> throw out values outside a range of Z-scores.
>
> From your set, the standard deviation is 52.15.
>
> The Z-score of the second one, 5.0, is about 0.44.
> The Z-score of the last one, -124, is about -2.03.
>
> In stats, the Z-score is your friend.

My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values.
Z-scores work only with absolute values.
So what's the best way to go now? I'm not a statistician.

Cheers
Sharp Tool

Sharp Tool, Nov 7, 2005
7. ### Roedy Green (Guest)

On Mon, 07 Nov 2005 08:42:24 GMT, "Sharp Tool"
<> wrote, quoted or indirectly quoted someone
who said :

>My data does not fit a normal distribution.
>I do not want to eliminate any positive values.
>I only want to eliminate large negative values.
>Z-scores work only with absolute values.
>So what's the best way to go now? I'm not a statistician.

What distribution do they conform to?
--
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green, Nov 7, 2005
8. ### Andrew Thompson (Guest)

Sharp Tool wrote:

> My data does not fit a normal distribution.

What distribution/pattern/logic does it fit, because..

> I only want to eliminate large negative values.

...knowing that will lead us a lot closer to defining
(pinning down, and putting a value to) 'large'.

Beyond the hypothetical though, does this describe
an actual problem, or is it purely a mental exercise?

Andrew Thompson, Nov 7, 2005
9. ### Sharp Tool (Guest)

> Sharp Tool wrote:
>
> > My data does not fit a normal distribution.

>
> What distribution/pattern/logic does it fit, because..
>
> > I only want to eliminate large negative values.

>
> ..knowing that will lead us a lot closer to defining
> (pinning down, and putting a value to) 'large'.

A large value is one that is an obvious outlier.
I only want to eliminate large negative values.
By eye-balling the list of numbers, you can see that -124.0
doesn't 'fit in'. Wondering if there is a statistical method for this.

> Beyond the hypothetical though, does this describe
> an actual problem, or is it purely a mental exercise?

Mental exercise, but I think it could be useful for removing
negative outliers.

Sharp Tool

Sharp Tool, Nov 7, 2005
10. ### Sharp Tool (Guest)

> <> wrote, quoted or indirectly quoted someone
> who said :
>
> >My data does not fit a normal distribution.
> >I do not want to eliminate any positive values.
> >I only want to eliminate large negative values.
> >Z-scores work only with absolute values.
> >So what's the best way to go now? I'm not a statistician.

>
> What distribution do they conform to?

Random I believe.

Sharp Tool

Sharp Tool, Nov 7, 2005
11. ### Chris Uppal (Guest)

Sharp Tool wrote:

> > > what algorithm to use to remove large negative values such as -124.0?
> > > how to determine a cutoff value that is statistically meaningful?

[...]
> My data does not fit a normal distribution.
> I do not want to eliminate any positive values.
> I only want to eliminate large negative values.

[...]
> So whats the best way to go now? I'm not a statistician.

If you really mean that you want it to be "statistically meaningful" then you'll
have to talk to a statistician. In order for that talk to be worthwhile you'll
need to know what distribution the numbers do follow (either as an analytic
description -- possibly an approximation -- or as empirical data). You will
also need to know whether the distribution is identical on each run, or whether
it is parameterised in some way. In the latter case the first part of the task
will be to estimate the parameters of the distribution based on the data from
that run (presumably including the positive values), then the second part of
the task will be eliminating data points that are "implausible" (in some fixed
sense) given the estimated distribution.

If the distribution is fixed across runs, then there is no need for the
curve-fitting step, and the question reduces to finding a single, fixed
threshold beyond which data-points are unlikely to occur by natural chance, and
which can therefore be dismissed (with a certain confidence) as outliers. In
this case you can run some experiments to find what value 95% (say) of negative
values lie above. On subsequent runs, values lower than that can be rejected
as "implausible" (on the assumption that they are drawn from the same
underlying distribution as your test runs). I'm not a statistician, so I don't
know whether you would be able to claim 95% confidence in this case, nor how to
quantify how much test data you would need (nor, indeed, how the two
interrelate).
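A rough Java sketch of that empirical-threshold idea (all names here are invented; it uses a simple nearest-rank percentile and makes no claim to statistical rigour):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PercentileCutoff {

    /**
     * Returns a threshold such that roughly 95% of the negative values in the
     * calibration data lie above it (nearest-rank method). Assumes the
     * calibration data contains at least one negative value.
     */
    public static double negativeCutoff(double[] calibration) {
        List<Double> negatives = new ArrayList<Double>();
        for (double x : calibration) {
            if (x < 0) negatives.add(x);
        }
        Collections.sort(negatives); // ascending: most negative first
        int idx = (int) Math.floor(0.05 * (negatives.size() - 1));
        return negatives.get(idx);
    }

    /** On a later run, a reading is kept unless it falls below the calibrated cutoff. */
    public static boolean plausible(double x, double cutoff) {
        return x >= cutoff;
    }
}
```

How much calibration data is needed before that 95% figure means anything is exactly the question to put to a statistician.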

Googling for
outlier removal
shows up lots of promising looking hints.

OTOH, it might be simplest to punt the question to the user, and have a
configurable parameter. If you do that then you should follow hallowed
practice and:

a) Bury the parameter in an XML file somewhere. Read and write out the data on
each run so that no human-readable formatting is preserved.

b) Give the parameter as vague and ambiguous a name as possible. In this case
you should ensure that neither the parameter name nor its documentation give
any hint as to whether the value is intended to be an absolute cut-off value,
the negation of an absolute cut-off, a high percentile threshold, a low
percentile threshold, or the absolute number of datapoints to reject.

c) Attempt to ensure that the default value is unsuitable for use in any
real-world application.

If you want to "go the extra mile" and work to the very highest professional
standards, then you should also:

d) Ensure that this behaviour is controlled by several parameters. They should
be confusingly named (a reliable technique here is to give them names that are
the opposite of what they actually mean), and should interact in ways that are
neither obvious nor documented. You should further ensure that sensible
results can only be achieved by setting one of the parameters explicitly (no
combination of the other parameters has the same effect), and mark that as
"deprecated" in /some/ of the documentation, whilst also making heavy use of it
in any examples.

-- chris

Chris Uppal, Nov 7, 2005
12. ### Roedy Green (Guest)

On Mon, 07 Nov 2005 09:19:19 GMT, "Sharp Tool"
<> wrote, quoted or indirectly quoted someone
who said :

>> What distribution do they conform to?

>
>Random I believe.

In that case you can't make a case for tossing any of them. Keep in
mind even normal distributions are still random.
--
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green, Nov 7, 2005
13. ### Sharp Tool (Guest)

> Sharp Tool wrote:
>
> > > > what algorithm to use to remove large negative values such as -124.0?
> > > > how to determine a cutoff value that is statistically meaningful?
>
> [...]
> > My data does not fit a normal distribution.
> > I do not want to eliminate any positive values.
> > I only want to eliminate large negative values.
>
> [...]
> > So what's the best way to go now? I'm not a statistician.
>
> If you really mean that you want it to be "statistically meaningful" then
> you'll have to talk to a statistician. In order for that talk to be
> worthwhile you'll need to know what distribution the numbers do follow
> (either as an analytic description -- possibly an approximation -- or as
> empirical data). You will also need to know whether the distribution is
> identical on each run, or whether it is parameterised in some way. In the
> latter case the first part of the task will be to estimate the parameters
> of the distribution based on the data from that run (presumably including
> the positive values), then the second part of the task will be eliminating
> data points that are "implausible" (in some fixed sense) given the
> estimated distribution.
>
> If the distribution is fixed across runs, then there is no need for the
> curve-fitting step, and the question reduces to finding a single, fixed
> threshold beyond which data-points are unlikely to occur by natural
> chance, and which can therefore be dismissed (with a certain confidence)
> as outliers. In this case you can run some experiments to find what value
> 95% (say) of negative values lie above. On subsequent runs, values lower
> than that can be rejected as "implausible" (on the assumption that they
> are drawn from the same underlying distribution as your test runs). I'm
> not a statistician, so I don't know whether you would be able to claim 95%
> confidence in this case, nor how to quantify how much test data you would
> need (nor, indeed, how the two interrelate).

You sure seem to know a fair bit about statistics.
My question now is, how does one determine the distribution of the data?
I haven't done much analysis but I'd say it looks random.
A cutoff that 95% of the negative values lie above sounds good.

Sharp Tool

Sharp Tool, Nov 7, 2005
14. ### Andrew Thompson (Guest)

Sharp Tool wrote:

>>Sharp Tool wrote:
>>
>>>My data does not fit a normal distribution.

>>
>>What distribution/pattern/logic does it fit, because..
>>
>>>I only want to eliminate large negative values.

>>
>>..knowing that will lead us a lot closer to defining
>>(pinning down, and putting a value to) 'large'.

>
> A large value is one that is an obvious

Obvious to whom? What is the cut-off limit for 'obvious'?

You quote '-124.0', but what about '-74.2' or '-24.0'?

To me, even '-24' could be an 'obvious' outlier.
But without some form of 'confidence level' and a
mathematically definable group, we cannot even
determine exactly what constitutes a cut-off limit.

With such vague descriptions of what the group represents,
there is really no way to progress the problem.

>..outlier.
> I only want to eliminate large negative values.
> By eye-balling the list of numbers, you can see that -124.0
> doesn't 'fit in'. Wondering if there is a statistical method for this.
>
>>Beyond the hypothetical though, does this describe
>>an actual problem, or is it purely a mental exercise?

>
> Mental exercise, ..

I'll leave you with it.

Andrew Thompson, Nov 7, 2005
15. ### Sharp Tool (Guest)

> <> wrote, quoted or indirectly quoted someone
> who said :
>
> >> What distribution do they conform to?

> >
> >Random I believe.

>
> In that case you can't make a case for tossing any of them. Keep in
> mind even normal distributions are still random.

The distribution looks like a bell-shaped curve skewed to the left, with an
initial plateau; then it slides to the right and suddenly makes a sharp
dip down.
So I guess that's not really a normal distribution.

Sharp Tool

Sharp Tool, Nov 7, 2005
16. ### Roedy Green (Guest)

On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
<> wrote, quoted or indirectly quoted someone
who said :

>My question now is, how does one determine the distribution of the data?

One way is to do a histogram.

If you see a bell-shaped curve coming out, you likely have a normal
distribution.

Various other distributions have a characteristic shape.

One that comes up often is called Poisson. It looks like a skewed
bell-shaped curve with the right-hand side stretched out.
See http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
How long you wait for a bus might follow a Poisson distribution.

Geometric is a falling-off curve. See
http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
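A quick text histogram along these lines (a hypothetical sketch; the bin width and all names are arbitrary choices for illustration):

```java
public class TextHistogram {

    /** Counts how many values fall into each bin of the given width. */
    public static int[] binCounts(double[] data, double binWidth) {
        double min = data[0], max = data[0];
        for (double x : data) {
            if (x < min) min = x;
            if (x > max) max = x;
        }
        int bins = (int) Math.floor((max - min) / binWidth) + 1;
        int[] counts = new int[bins];
        for (double x : data) {
            counts[(int) ((x - min) / binWidth)]++;
        }
        return counts;
    }

    /** Prints one row of '*' per bin -- usually enough to eyeball the shape. */
    public static void print(double[] data, double binWidth) {
        int[] counts = binCounts(data, binWidth);
        double min = data[0];
        for (double x : data) if (x < min) min = x;
        for (int i = 0; i < counts.length; i++) {
            double lo = min + i * binWidth;
            StringBuilder bar = new StringBuilder();
            for (int j = 0; j < counts[i]; j++) bar.append('*');
            System.out.printf("[%8.1f, %8.1f) %s%n", lo, lo + binWidth, bar);
        }
    }
}
```

With only a handful of points the histogram will be mostly empty bins, which is itself a hint that there is not enough data to name the distribution.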

--
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green, Nov 7, 2005
17. ### Sharp Tool (Guest)

> Sharp Tool wrote:
>
> >>Sharp Tool wrote:
> >>
> >>>My data does not fit a normal distribution.
> >>
> >>What distribution/pattern/logic does it fit, because..
> >>
> >>>I only want to eliminate large negative values.
> >>
> >>..knowing that will lead us a lot closer to defining
> >>(pinning down, and putting a value to) 'large'.

> >
> > A large value is one that is an obvious

>
> Obvious to who? What is the cut-off limit for 'obvious'?

Obvious to me when I look (eye-balling) at the list of numbers I presented
in my first posting.

> You quote '-124.0', but what about '-74.2', or '-24.0'.

There is no -74.2 or -24 in my original list.
If there were, I would say -74.2 is possibly another negative outlier.
Again, there is no statistical backing for this.

> To me, even '-24' could be an 'obvious' outlier.
> But without some form of 'confidence level' and a
> mathematically definable group, we cannot even
> determine exactly what constitutes a cut-off limit.

'-24' does not seem like an obvious outlier to me.
Again, without some sort of statistics it's all subjective.

> With such vague descriptions of what the group represents,
> there is really no way to progress the problem.

Not sure what you mean by a mathematically definable group.
But I assume you mean the distribution of the data.
The confidence level would be the standard 95% in the statistical world.
The question is how to get a cutoff that will give me that confidence level.
Should one look at Z-scores (as was suggested) or some other statistical
parameter to establish a cutoff, or just look at raw numbers to establish a
confidence level (as was also suggested)?

It's vague to you, Andrew, because it's not your area of expertise - not my
'vague description'.

Sharp Tool

Sharp Tool, Nov 7, 2005
18. ### Roedy Green (Guest)

On Mon, 07 Nov 2005 11:01:13 GMT, "Sharp Tool"
<> wrote, quoted or indirectly quoted someone
who said :

>The distribution looks like a bell-shaped curve skewed to the left, with an
>initial plateau; then it slides to the right and suddenly makes a sharp
>dip down.
>So I guess that's not really a normal distribution.

You may be able to analyse the physics of your readings to calculate
the expected distribution.

The classic shapes are not really clear until you have a lot of data.
You won't see the pattern with just a handful of points.

This reminds me of something that happened when I was studying physics at
UBC circa 1968. We were doing a lab with an experiment that was
supposed to produce a normal distribution. But it obviously wasn't.
The machine was broken. Student after student complained, but were
dismissed as incompetents. I keypunched the data and did a histogram
and produced it on the pen plotter -- a great novelty in that day.

It clearly showed a camel hump. The COMPUTER graph clinched it and
off the machine went for repair. You can't do that as easily today.
Back then anything that came from a computer was treated as divine
revelation.

--
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green, Nov 7, 2005
19. ### Sharp Tool (Guest)

> On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
> <> wrote, quoted or indirectly quoted someone
> who said :
>
> >My question now is, how does one determine the distribution of the data?

>
> One way is to do a histogram.
>
> If you see a bell shaped curve coming out, you likely have a normal
> distribution.
>
> Various other distributions have a characteristic shape.
>
> One that comes up often is called Poisson. It looks like a skewed
> bell shaped curve with the right hand side stretched out.
> see http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
> How long you wait for bus might follow a Poisson distribution.
>
> Geometric is a falling off. see
>

http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn

I plotted my data and it does look like a Poisson distribution.
But the large negative number makes it fall off a cliff.
All these distributions don't include negative numbers?

Sharp Tool

Sharp Tool, Nov 7, 2005
20. ### Andrew Thompson (Guest)

Sharp Tool wrote:

> Its vague to you Andrew because its not your area of expertise - not my
> 'vague description'.

Very sound assessment, coming from someone who first stated
the numbers did not fit a 'normal distribution', is now saying
they do, and that a confidence level of 95% 'sounds good'.

> Sharp Tool

[ Seems a little 'blunt' at the moment.. ;-) ]

Andrew Thompson, Nov 7, 2005