Sharp said:
[...]
My data does not fit a normal distribution.
I do not want to eliminate any positive values.
I only want to eliminate large negative values. [...]
So whats the best way to go now? I'm not a statistician.
If you really mean that you want it to be "statistically meaningful" then you'll
have to talk to a statistician. In order for that talk to be worthwhile you'll
need to know what distribution the numbers do follow (either as an analytic
description -- possibly an approximation -- or as empirical data). You will
also need to know whether the distribution is identical on each run, or whether
it is parameterised in some way. In the latter case the first part of the task
will be to estimate the parameters of the distribution based on the data from
that run (presumably including the positive values), then the second part of
the task will be eliminating data points that are "implausible" (in some fixed
sense) given the estimated distribution.
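Purely to illustrate the shape of that two-part task, here is a rough Python sketch. Since the real distribution is unknown, it substitutes the run's median and median absolute deviation (MAD) for the "parameters"; that choice, the function name reject_low_outliers, and the cutoff of 5 are all assumptions made for the example, not a statistically justified recipe.

```python
from statistics import median

def reject_low_outliers(values, cutoff=5.0):
    """Drop values far below the bulk of this run's data.

    Step 1: estimate per-run "parameters" from ALL values, positives
    included. Median and MAD are stand-ins chosen for this sketch only.
    Step 2: reject points that fall more than `cutoff` MADs below the
    median. Nothing on the high side is ever removed.
    """
    values = list(values)
    centre = median(values)
    spread = median([abs(v - centre) for v in values])
    if spread == 0:
        # No usable spread estimate, so reject nothing.
        return values
    limit = centre - cutoff * spread
    return [v for v in values if v >= limit]
```

Only the lower tail is tested, so a large positive value survives however extreme it is.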
If the distribution is fixed across runs, then there is no need for the
curve-fitting step, and the question reduces to finding a single, fixed,
threshold beyond which data-points are unlikely to occur by natural chance, and
which can therefore be dismissed (with a certain confidence) as outliers. In
this case you can run some experiments to find what value 95% (say) of negative
values lie above. On subsequent runs, values lower than that can be rejected
as "implausible" (on the assumption that they are drawn from the same
underlying distribution as your test runs). I'm not a statistician, so I don't
know whether you would be able to claim 95% confidence in this case, nor how to
quantify how much test data you would need (nor, indeed, how the two
interrelate).
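The fixed-threshold idea might be sketched in Python roughly as follows. The function names, the 5% default, and the nearest-rank way of picking the threshold are all assumptions for illustration; the code says nothing about how much calibration data is enough.

```python
def lower_threshold(calibration_values, reject_fraction=0.05):
    """Find a cut-off from test runs.

    Returns the value that roughly (1 - reject_fraction) of the
    NEGATIVE calibration values lie at or above (nearest-rank method).
    """
    if not 0 <= reject_fraction < 1:
        raise ValueError("reject_fraction must be in [0, 1)")
    negatives = sorted(v for v in calibration_values if v < 0)
    if not negatives:
        raise ValueError("no negative values in calibration data")
    # Number of calibration values allowed to fall below the threshold.
    k = int(reject_fraction * len(negatives))
    return negatives[k]

def reject_below(values, threshold):
    """Apply the fixed threshold on a later run; positives always pass."""
    return [v for v in values if v >= threshold]
```

For example, with twenty negative calibration values from -20 to -1, the default setting gives a threshold of -19, and a later value of -25 would be rejected while -19 itself is kept.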
Googling for
outlier removal
shows up lots of promising looking hints.
OTOH, it might be simplest to punt the question to the user, and have a
configurable parameter. If you do that then you should follow hallowed
practice and:
a) Bury the parameter in an XML file somewhere. Read and write out the data on
each run so that no human-readable formatting is preserved.
b) Give the parameter as vague and ambiguous a name as possible. In this case
you should ensure that neither the parameter name nor its documentation gives
any hint as to whether the value is intended to be an absolute cut-off value,
the negation of an absolute cut-off, a high percentile threshold, a low
percentile threshold, or the absolute number of datapoints to reject.
c) Attempt to ensure that the default value is unsuitable for use in any
real-world application.
If you want to "go the extra mile" and work to the very highest professional
standards, then you should also:
d) Ensure that this behaviour is controlled by several parameters. They should
be confusingly named (a reliable technique here is to give them names that are
the opposite of what they actually mean), and should interact in ways that are
neither obvious nor documented. You should further ensure that sensible
results can only be achieved by setting one of the parameters explicitly (no
combination of the other parameters has the same effect), and mark that as
"deprecated" in /some/ of the documentation, whilst also making heavy use of it
in any examples.
-- chris