Riddle me this

Discussion in 'Java' started by Sharp Tool, Nov 6, 2005.

  1. Sharp Tool

    Sharp Tool Guest

    Hi

    Consider this list of numbers:

    12.0
    5.0
    1.0
    -0.1
    -2.1
    -124.0

    what algorithm to use to remove large negative values such as -124.0?
    how to determine a cutoff value that is statistically meaningful?

    So far I have:

    cut-off = smallest positive - smallest difference in negative pairs
    = 1.0 - (2.1 - 0.1)
    = 1.0 - 2.0
    = -1.0

    Problem is that would eliminate -2.1!

    Help appreciated.
    Sharp Tool
     
    Sharp Tool, Nov 6, 2005
    #1

  2. Roedy Green

    Roedy Green Guest

    On Sun, 06 Nov 2005 08:46:17 GMT, "Sharp Tool"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >what algorithm to use to remove large negative values such as -124.0?
    >how to determine a cutoff value that is statistically meaningful?


    That is not usually a statistical question but a plausibility
    question. If you are scanning data for temperatures of Honolulu, you
    would look at history, give yourself a safety factor, and chop below
    and above a given range.

    Readings for human temperatures would have a narrower range unless you
    included corpses.

    If your numbers fit a normal bell-shaped curve, you can compute the
    mean and standard deviation. Then you could throw out numbers more
    than n deviations from the mean.
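
    For what it's worth, a minimal Java sketch of that n-deviations cut might
    look something like this (the class name and the 2-deviation threshold are
    just illustrative):

        import java.util.ArrayList;
        import java.util.List;

        /** A sketch of the "throw out anything more than n standard deviations
         *  from the mean" idea; names and the threshold are illustrative. */
        public class SigmaCut {
            public static void main(String[] args) {
                double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
                double nSigma = 2.0;

                double sum = 0.0;
                for (double x : data) sum += x;
                double mean = sum / data.length;               // about -18.0

                double ss = 0.0;
                for (double x : data) ss += (x - mean) * (x - mean);
                double sd = Math.sqrt(ss / (data.length - 1)); // about 52.2

                List<Double> kept = new ArrayList<Double>();
                for (double x : data) {
                    if (Math.abs(x - mean) <= nSigma * sd) kept.add(x);
                }
                // Only -124.0 lies more than two deviations from the mean,
                // so it is the only value dropped here.
                System.out.println(kept);
            }
        }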




    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 6, 2005
    #2

  3. Sharp Tool wrote:
    >
    > what algorithm to use to remove large negative values such as -124.0?
    > how to determine a cutoff value that is statistically meaningful?


    This newsgroup probably isn't the best place to find statisticians
    (although I guess there are a few).

    You could google for "outliers" or similar. "Grubbs' Test for Outliers"
    seems like a step in the right direction.
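
    As a rough illustration, the Grubbs statistic is just the largest absolute
    deviation from the mean measured in sample standard deviations; the sketch
    below computes it but leaves the critical value as a table lookup (and, as
    noted later in the thread, the test assumes roughly normal data):

        /** Rough sketch of the Grubbs test statistic: the largest absolute
         *  deviation from the mean in units of the sample standard deviation.
         *  Compare the result against a tabulated critical value for the
         *  sample size; no critical value is hard-coded here. */
        public class GrubbsSketch {
            public static void main(String[] args) {
                double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};

                double sum = 0.0;
                for (double x : data) sum += x;
                double mean = sum / data.length;

                double ss = 0.0;
                for (double x : data) ss += (x - mean) * (x - mean);
                double sd = Math.sqrt(ss / (data.length - 1));

                double g = 0.0;
                double suspect = data[0];
                for (double x : data) {
                    double dev = Math.abs(x - mean) / sd;
                    if (dev > g) { g = dev; suspect = x; }
                }
                // For these numbers G is about 2.03 and the suspect point
                // is -124.0; whether that exceeds the critical value for
                // n = 6 is a table lookup.
                System.out.println("G = " + g + " for suspect value " + suspect);
            }
        }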

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Nov 6, 2005
    #3
  4. SDB

    SDB Guest

    "Sharp Tool" <> wrote in message
    news:tpjbf.9940$...

    : Consider this list of numbers:
    :
    : 12.0
    : 5.0
    : 1.0
    : -0.1
    : -2.1
    : -124.0

    : what algorithm to use to remove large negative values such as -124.0?
    : how to determine a cutoff value that is statistically meaningful?

    : So far i have:

    : cut-off = smallest positive - smallest difference in negative pairs
    : = 1.0 - (2.1 - 0.1)
    : = 1.0 - 2.0
    : = -1.0

    How sophisticated do you need to be? Consider using the absolute value so
    you don't need to worry about positive or negative numbers.

    If the numbers you gave are just an example and the problem you are trying
    to solve is more generic, look at a statistic called the 'Z-Score', also
    sometimes called the 'Z-Value'. It is computed by subtracting the mean from
    the number and then dividing by the standard deviation of the set. You can
    throw out values outside a range of Z-scores.

    From your set, the mean is about -18.0 and the standard deviation is 52.15.

    The z-Score of the second one, 5.0, is about 0.44.
    The z-Score of the last one, -124, is about -2.03.

    In stats, the z-Score is your friend.
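
    A small Java sketch of a one-sided Z-score cut, which rejects only points
    far below the mean and so never touches the positive values (the -2.0
    threshold is purely illustrative):

        /** One-sided Z-score cut: reject only points whose Z-score falls below
         *  a negative threshold, so positive values are never removed.
         *  The -2.0 threshold is illustrative, not a recommendation. */
        public class ZScoreCut {
            public static void main(String[] args) {
                double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};

                double sum = 0.0;
                for (double x : data) sum += x;
                double mean = sum / data.length;               // about -18.0

                double ss = 0.0;
                for (double x : data) ss += (x - mean) * (x - mean);
                double sd = Math.sqrt(ss / (data.length - 1)); // about 52.15

                for (double x : data) {
                    double z = (x - mean) / sd;
                    String verdict = (z < -2.0) ? "reject" : "keep";
                    System.out.println(x + "  z = " + z + "  " + verdict);
                }
                // Only -124.0 (z about -2.03) falls below -2.0 and is rejected;
                // all the other values sit above the mean and have small
                // positive z-scores.
            }
        }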
     
    SDB, Nov 6, 2005
    #4
  5. Sharp Tool

    Sharp Tool Guest


    > Sharp Tool wrote:
    > >
    > > what algorithm to use to remove large negative values such as -124.0?
    > > how to determine a cutoff value that is statistically meaningful?

    >
    > This newsgroup probably isn't the best place to find statisticians
    > (although I guess there are a few).
    >
    > You could google for "outliers" or similar. "Grubbs' Test for Outliers"
    > seems like a step in the right direction.
    >
    > Tom Hawtin


    Grubbs' Test is only suitable for data that has a normal distribution - mine
    does not.

    Cheers
    Sharp
     
    Sharp Tool, Nov 7, 2005
    #5
  6. Sharp Tool

    Sharp Tool Guest

    "SDB" <> wrote in message
    news:...
    > "Sharp Tool" <> wrote in message
    > news:tpjbf.9940$...
    >
    > : Consider this list of numbers:
    > :
    > : 12.0
    > : 5.0
    > : 1.0
    > : -0.1
    > : -2.1
    > : -124.0
    >
    > : what algorithm to use to remove large negative values such as -124.0?
    > : how to determine a cutoff value that is statistically meaningful?
    >
    > : So far i have:
    >
    > : cut-off = smallest positive - smallest difference in negative pairs
    > : = 1.0 - (2.1 - 0.1)
    > : = 1.0 - 2.0
    > : = -1.0
    >
    > How sophisticated do you need to be? Consider using the absolute value so
    > you don't need to worry about positive or negative numbers.
    >
    > If the numbers you gave are just an example and the problem you are trying
    > to solve is more generic, look at a statistic called the 'Z-Score', also
    > sometimes called the 'Z-Value'. It is computed by subtracting the mean
    > from the number and then dividing by the standard deviation of the set.
    > You can throw out values outside a range of Z-scores.
    >
    > From your set, the mean is about -18.0 and the standard deviation is 52.15.
    >
    > The z-Score of the second one, 5.0, is about 0.44.
    > The z-Score of the last one, -124, is about -2.03.
    >
    > In stats, the z-Score is your friend.


    My data does not fit a normal distribution.
    I do not want to eliminate any positive values.
    I only want to eliminate large negative values.
    Z scores work only with absolute values.
    So what's the best way to go now? I'm not a statistician.

    Cheers
    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #6
  7. Roedy Green

    Roedy Green Guest

    On Mon, 07 Nov 2005 08:42:24 GMT, "Sharp Tool"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >My data does not fit a normal distribution.
    >I do not want to eliminate any positive values.
    >I only want to eliminate large negative values.
    >Z scores work with only with absolute values.
    >So whats the best way to go now? I'm not a statistician.


    What distribution do they conform to?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #7
  8. Sharp Tool wrote:

    > My data does not fit a normal distribution.


    What distribution/pattern/logic does it fit, because..

    > I only want to eliminate large negative values.


    ...knowing that will get us a lot closer to defining
    (pinning down, and putting a value to) 'large'.

    Beyond the hypothetical though, does this describe
    an actual problem, or is it purely a mental exercise?
     
    Andrew Thompson, Nov 7, 2005
    #8
  9. Sharp Tool

    Sharp Tool Guest

    > Sharp Tool wrote:
    >
    > > My data does not fit a normal distribution.

    >
    > What distribution/pattern/logic does it fit, because..
    >
    > > I only want to eliminate large negative values.

    >
    > ..knowing that will lead to a lot closer to defining
    > (pinning down, and putting a value to) 'large'.


    A large value is one that is an obvious outlier.
    I only want to eliminate large negative values.
    By eye-balling the list of numbers, you can see that -124.0
    doesn't 'fit in'. Wondering if there is a statistical method for this.

    > Beyond the hypothetical though, does this describe
    > an actual problem, or is it purely a mental exercise?


    Mental exercise, but I think it could be useful for removing
    negative outliers.

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #9
  10. Sharp Tool

    Sharp Tool Guest

    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    > >My data does not fit a normal distribution.
    > >I do not want to eliminate any positive values.
    > >I only want to eliminate large negative values.
    > >Z scores work with only with absolute values.
    > >So whats the best way to go now? I'm not a statistician.

    >
    > What distribution do they conform to?


    Random, I believe.

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #10
  11. Chris Uppal

    Chris Uppal Guest

    Sharp Tool wrote:

    > > > what algorithm to use to remove large negative values such as -124.0?
    > > > how to determine a cutoff value that is statistically meaningful?

    [...]
    > My data does not fit a normal distribution.
    > I do not want to eliminate any positive values.
    > I only want to eliminate large negative values.

    [...]
    > So whats the best way to go now? I'm not a statistician.


    If you really mean that you want it to be "statistically meaningful" then you'll
    have to talk to a statistician. In order for that talk to be worthwhile you'll
    need to know what distribution the numbers do follow (either as an analytic
    description -- possibly an approximation -- or as empirical data). You will
    also need to know whether the distribution is identical on each run, or whether
    it is parameterised in some way. In the latter case the first part of the task
    will be to estimate the parameters of the distribution based on the data from
    that run (presumably including the positive values), then the second part of
    the task will be eliminating data points that are "implausible" (in some fixed
    sense) given the estimated distribution.

    If the distribution is fixed across runs, then there is no need for the
    curve-fitting step, and the question reduces to finding a single, fixed
    threshold beyond which data-points are unlikely to occur by natural chance, and
    which can therefore be dismissed (with a certain confidence) as outliers. In
    this case you can run some experiments to find what value 95% (say) of negative
    values lie above. On subsequent runs, values lower than that can be rejected
    as "implausible" (on the assumption that they are drawn from the same
    underlying distribution as your test runs). I'm not a statistician, so I don't
    know whether you would be able to claim 95% confidence in this case, nor how to
    quantify how much test data you would need (nor, indeed, how the two
    interrelate).
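
    Something along these lines, perhaps (a sketch only; the calibration
    numbers and the 95% figure are placeholders, not real data):

        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.List;

        /** Sketch of the fixed-threshold idea: from calibration runs, find the
         *  value that (roughly) 95% of negative observations lie above, then
         *  reject anything below it on later runs. The calibration numbers are
         *  made-up placeholders, not data from this thread. */
        public class EmpiricalThreshold {
            static double negativeThreshold(double[] calibration, double keepFraction) {
                List<Double> negatives = new ArrayList<Double>();
                for (double x : calibration) {
                    if (x < 0.0) negatives.add(x);
                }
                Collections.sort(negatives);   // ascending: most negative first
                int cut = (int) Math.floor((1.0 - keepFraction) * negatives.size());
                return negatives.get(Math.min(cut, negatives.size() - 1));
            }

            public static void main(String[] args) {
                double[] calibration = {-0.1, -2.1, -1.5, -0.7, -3.2, -0.4,
                                        -2.8, -1.1, -0.9, -1.9, 4.0, 7.5};
                double threshold = negativeThreshold(calibration, 0.95);

                double[] newRun = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
                for (double x : newRun) {
                    System.out.println(x + (x < threshold ? "  -> implausible" : "  -> keep"));
                }
            }
        }

    The percentile arithmetic there is deliberately naive; a statistician would
    have a view on how much calibration data is needed before such a threshold
    means anything.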

    Googling for
    outlier removal
    shows up lots of promising looking hints.

    OTOH, it might be simplest to punt the question to the user, and have a
    configurable parameter. If you do that then you should follow hallowed
    practice and:

    a) Bury the parameter in an XML file somewhere. Read and write out the data on
    each run so that no human-readable formatting is preserved.

    b) Give the parameter as vague and ambiguous a name as possible. In this case
    you should ensure that neither the parameter name nor its documentation give
    any hint as to whether the value is intended to be an absolute cut-off value,
    the negation of an absolute cut-off, a high percentile threshold, a low
    percentile threshold, or the absolute number of datapoints to reject.

    c) Attempt to ensure that the default value is unsuitable for use in any
    real-world application.

    If you want to "go the extra mile" and work to the very highest professional
    standards, then you should also:

    d) Ensure that this behaviour is controlled by several parameters. They should
    be confusingly named (a reliable technique here is to give them names that are
    the opposite of what they actually mean), and should interact in ways that are
    neither obvious nor documented. You should further ensure that sensible
    results can only be achieved by setting one of the parameters explicitly (no
    combination of the other parameters has the same effect), and mark that as
    "deprecated" in /some/ of the documentation, whilst also making heavy use of it
    in any examples.

    -- chris
     
    Chris Uppal, Nov 7, 2005
    #11
  12. Roedy Green

    Roedy Green Guest

    On Mon, 07 Nov 2005 09:19:19 GMT, "Sharp Tool"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >> What distribution do they conform to?

    >
    >Random I believe.


    In that case you can't make a case for tossing any of them. Keep in
    mind even normal distributions are still random.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #12
  13. Sharp Tool

    Sharp Tool Guest


    > Sharp Tool wrote:
    >
    > > > > what algorithm to use to remove large negative values such as -124.0?
    > > > > how to determine a cutoff value that is statistically meaningful?

    > [...]
    > > My data does not fit a normal distribution.
    > > I do not want to eliminate any positive values.
    > > I only want to eliminate large negative values.

    > [...]
    > > So whats the best way to go now? I'm not a statistician.

    >
    > If you really mean that you want it to be "statistically meaningful" then
    > you'll have to talk to a statistician. In order for that talk to be
    > worthwhile you'll need to know what distribution the numbers do follow
    > (either as an analytic description -- possibly an approximation -- or as
    > empirical data). You will also need to know whether the distribution is
    > identical on each run, or whether it is parameterised in some way. In the
    > latter case the first part of the task will be to estimate the parameters
    > of the distribution based on the data from that run (presumably including
    > the positive values), then the second part of the task will be eliminating
    > data points that are "implausible" (in some fixed sense) given the
    > estimated distribution.
    >
    > If the distribution is fixed across runs, then there is no need for the
    > curve-fitting step, and the question reduces to finding a single, fixed
    > threshold beyond which data-points are unlikely to occur by natural chance,
    > and which can therefore be dismissed (with a certain confidence) as
    > outliers. In this case you can run some experiments to find what value 95%
    > (say) of negative values lie above. On subsequent runs, values lower than
    > that can be rejected as "implausible" (on the assumption that they are
    > drawn from the same underlying distribution as your test runs). I'm not a
    > statistician, so I don't know whether you would be able to claim 95%
    > confidence in this case, nor how to quantify how much test data you would
    > need (nor, indeed, how the two interrelate).

    You sure seem to know a fair bit about statistics.
    My question now is, how does one determine the distribution of the data?
    I haven't done much analysis, but I'd say it looks random.
    The cutoff based on the value that 95% of negative values lie above
    sounds good.
    Looking at Google searches now.

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #13
  14. Sharp Tool wrote:

    >>Sharp Tool wrote:
    >>
    >>>My data does not fit a normal distribution.

    >>
    >>What distribution/pattern/logic does it fit, because..
    >>
    >>>I only want to eliminate large negative values.

    >>
    >>..knowing that will lead to a lot closer to defining
    >>(pinning down, and putting a value to) 'large'.

    >
    > A large value is one that is an obvious


    Obvious to whom? What is the cut-off limit for 'obvious'?

    You quote '-124.0', but what about '-74.2' or '-24.0'?

    To me, even '-24' could be an 'obvious' outlier.
    But without some form of 'confidence level' and a
    mathematically definable group, we cannot even
    determine exactly what constitutes a cut-off limit.

    With such vague descriptions of what the group represents,
    there is really no way to progress the problem.

    >..outlier.
    > I only want to eliminate large negative values.
    > By eye-balling the list of numbers, you can see that -124.0
    > doesn't 'fit in'. Wondering if there a statistical method for this.
    >
    >>Beyond the hypothetical though, does this describe
    >>an actual problem, or is it purely a mental exercise?

    >
    > Mental exercise, ..


    I'll leave you with it.
     
    Andrew Thompson, Nov 7, 2005
    #14
  15. Sharp Tool

    Sharp Tool Guest


    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    > >> What distribution do they conform to?

    > >
    > >Random I believe.

    >
    > In that case you can't make a case for tossing any of them. Keep in
    > mind even normal distributions are still random.


    You're right.
    The distribution looks like a bell-shaped curve skewed to the left, with an
    initial plateau, then it slides to the right and then suddenly makes a sharp
    dip down.
    So I guess that's not really a normal distribution.

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #15
  16. Roedy Green

    Roedy Green Guest

    On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >My questions is now, how does one determine the distribution of data?


    One way is to do a histogram.

    If you see a bell shaped curve coming out, you likely have a normal
    distribution.

    Various other distributions have a characteristic shape.

    One that comes up often is called Poisson. It looks like a skewed
    bell-shaped curve with the right-hand side stretched out;
    see http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
    How long you wait for a bus might follow a Poisson distribution.

    Geometric is a falling off; see
    http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn
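
    A crude text histogram is only a few lines of Java, if you want to eyeball
    the shape (the bin width and the data here are illustrative):

        /** Crude console histogram for eyeballing the shape of a distribution.
         *  Bin width and the sample data are illustrative only. */
        public class Histogram {
            public static void main(String[] args) {
                double[] data = {12.0, 5.0, 1.0, -0.1, -2.1, -124.0};
                double binWidth = 25.0;

                double min = data[0], max = data[0];
                for (double x : data) {
                    if (x < min) min = x;
                    if (x > max) max = x;
                }
                int bins = (int) Math.ceil((max - min) / binWidth) + 1;
                int[] counts = new int[bins];
                for (double x : data) counts[(int) ((x - min) / binWidth)]++;

                for (int i = 0; i < bins; i++) {
                    double lo = min + i * binWidth;
                    StringBuilder bar = new StringBuilder();
                    for (int j = 0; j < counts[i]; j++) bar.append('*');
                    System.out.printf("%8.1f .. %8.1f | %s%n", lo, lo + binWidth, bar);
                }
            }
        }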

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #16
  17. Sharp Tool

    Sharp Tool Guest

    > Sharp Tool wrote:
    >
    > >>Sharp Tool wrote:
    > >>
    > >>>My data does not fit a normal distribution.
    > >>
    > >>What distribution/pattern/logic does it fit, because..
    > >>
    > >>>I only want to eliminate large negative values.
    > >>
    > >>..knowing that will lead to a lot closer to defining
    > >>(pinning down, and putting a value to) 'large'.

    > >
    > > A large value is one that is an obvious

    >
    > Obvious to who? What is the cut-off limit for 'obvious'?


    Obvious to me when I look (eyeballing) at the list of numbers I presented
    in my first posting.

    > You quote '-124.0', but what about '-74.2', or '-24.0'.


    There is no -74.2 or -24 in my original list.
    If there were, I would say -74.2 is possibly another negative outlier.
    Again, there is no statistical backing for this.

    > To me, even '-24' could be an 'obvious' outlier.
    > But without some form of 'confidence level' and a
    > mathematically definable group, we cannot even
    > determine exactly what constitutes a cut-off limit.


    '-24' does not seem like an obvious outlier to me.
    Again, without some sort of statistics it's all subjective.

    > With such vague descriptions of what the group represents,
    > there is really no way to progress the problem.


    Not sure what you mean by a mathematically definable group,
    but I assume you mean the distribution of the data.
    The confidence level would be the standard 95% in the statistical world.
    The question is how to get a cutoff that will give me that confidence level.
    Should one look at Z-scores (as was suggested), at some other statistical
    parameter to establish a cutoff, or
    just at the raw numbers to establish the confidence level (as was also suggested)?

    It's vague to you, Andrew, because it's not your area of expertise - not my
    'vague description'.

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #17
  18. Roedy Green

    Roedy Green Guest

    On Mon, 07 Nov 2005 11:01:13 GMT, "Sharp Tool"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >The distribution looks like a bell shape curve skewed to the left with an
    >initial platoe then it slides to the right and then suddenly makes a sharp
    >dip down.
    >so i guess thats not really a normal distribution.


    You may be able to analyse the physics of your readings to calculate
    the expected distribution.

    The classic shapes are not really clear until you have a lot of data.
    You won't see the pattern with just 5 points.

    This reminds me of something that happened when I was studying physics at
    UBC circa 1968. We were doing a lab with an experiment that was
    supposed to produce a normal distribution. But it obviously wasn't.
    The machine was broken. Student after student complained, but each was
    dismissed as incompetent. I keypunched the data and did a histogram
    and produced it on the pen plotter -- a great novelty in that day.

    It clearly showed a camel hump. The COMPUTER graph clinched it and
    off the machine went for repair. You can't do that as easily today.
    Back then anything that came from a computer was treated as divine
    revelation.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 7, 2005
    #18
  19. Sharp Tool

    Sharp Tool Guest


    > On Mon, 07 Nov 2005 10:47:02 GMT, "Sharp Tool"
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    > >My questions is now, how does one determine the distribution of data?

    >
    > One way is to do a histogram.
    >
    > If you see a bell shaped curve coming out, you likely have a normal
    > distribution.
    >
    > Various other distributions have a characteristic shape.
    >
    > One that comes up often is called Poisson. It looks like a skewed
    > bell shaped curve with the right hand side stretched out.
    > see http://www.math.csusb.edu/faculty/stanton/probstat/poisson.html
    > How long you wait for bus might follow a Poisson distribution.
    >
    > Geometric is a falling off. see
    > http://www.stats.gla.ac.uk/steps/glossary/probability_distributions.html#geomdistn

    That's a great link.
    I plotted my data and it does look like a Poisson distribution.
    But the large negative number makes it fall off a cliff.
    None of these distributions include negative numbers, though?

    Sharp Tool
     
    Sharp Tool, Nov 7, 2005
    #19
  20. Sharp Tool wrote:

    > Its vague to you Andrew because its not your area of expertise - not my
    > 'vague description'.


    Very sound assessment, coming from someone who first stated
    the numbers had no 'normal distribution' and is now saying
    they do, and that a confidence level of 95% 'sounds good'.

    > Sharp Tool


    [ Seems a little 'blunt' at the moment.. ;-) ]
     
    Andrew Thompson, Nov 7, 2005
    #20
