Programmatically finding "significant" data points

Discussion in 'Python' started by erikcw, Nov 14, 2006.

  1. erikcw

    erikcw Guest

    Hi all,

    I have a collection of ordered numerical data in a list. The numbers
    when plotted on a line chart make a low-high-low-high-high-low (random)
    pattern. I need an algorithm to extract the "significant" high and low
    points from this data.

    Here is some sample data:
    data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    0.10]

    In this data, some of the significant points include:
    data[0]
    data[2]
    data[4]
    data[6]
    data[8]
    data[9]
    data[13]
    data[14]
    .....

    How do I sort through this data and pull out these points of
    significance?

    Thanks for your help!

    Erik
    erikcw, Nov 14, 2006
    #1

  2. erikcw wrote:

    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.
    >

    ....
    >
    > How do I sort through this data and pull out these points of
    > significance?


    Get a book on statistics. One idea is as follows. If you expect the points
    to be centred around a single value, you can calculate the median or mean
    of the points, calculate their standard deviation (aka spread), and remove
    points which are more than N-times the standard deviation from the median.
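
    For example, a minimal sketch of that idea in plain Python (centring on
    the mean and using an arbitrary cut-off of N = 2; the centre could just
    as well be the median):

    # Flag points more than N standard deviations from the mean.
    data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
            1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
            0.45, 0.35, 0.10]

    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    N = 2
    flagged = [(i, x) for i, x in enumerate(data) if abs(x - mean) > N * std]
    print(flagged)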

    Jeremy

    --
    Jeremy Sanders
    http://www.jeremysanders.net/
    Jeremy Sanders, Nov 14, 2006
    #2

  3. "erikcw" wrote:

    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.
    >
    > Here is some sample data:
    > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > 0.10]


    silly solution:

    for i in range(1, len(data)-1):
        if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
            print i

    (the above doesn't handle the "edges", but that's easy to fix)
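
    One possible way to cover the edges (just a sketch of the fix hinted at
    above; treating an endpoint as an extremum whenever it differs from its
    single neighbour is an assumption):

    # Endpoints have only one neighbour, so count them when they differ
    # from that neighbour (one possible definition, not the only one).
    if len(data) >= 2:
        if data[0] != data[1]:
            print(0)
        if data[-1] != data[-2]:
            print(len(data) - 1)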

    </F>
    Fredrik Lundh, Nov 14, 2006
    #3
  4. erikcw <> wrote:
    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.


    I am not sure what you mean by 'ordered' in this context. As pointed
    out by Jeremy, you need to find an appropriate statistical test. The
    appropriate choice depends on how your data is (presumably) distributed
    and what exactly you are trying to test. E.g. do the data points come
    from different groups of some kind? Or are you just looking for extreme
    values (outliers, maybe)?

    So it's more of statistical question than a python one.

    cu
    Philipp

    --
    Dr. Philipp Pagel Tel. +49-8161-71 2131
    Dept. of Genome Oriented Bioinformatics Fax. +49-8161-71 2186
    Technical University of Munich
    http://mips.gsf.de/staff/pagel
    Philipp Pagel, Nov 14, 2006
    #4
  5. erikcw

    Peter Otten Guest

    erikcw wrote:

    > Hi all,
    >
    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.
    >
    > Here is some sample data:
    > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > 0.10]
    >
    > In this data, some of the significant points include:
    > data[0]
    > data[2]
    > data[4]
    > data[6]
    > data[8]
    > data[9]
    > data[13]
    > data[14]
    > ....
    >
    > How do I sort through this data and pull out these points of
    > significance?


    I think you are looking for "extrema":

    def w3(items):
        items = iter(items)
        view = None, items.next(), items.next()
        for item in items:
            view = view[1:] + (item,)
            yield view

    for i, (a, b, c) in enumerate(w3(data)):
        if a > b < c:
            print i+1, "min", b
        elif a < b > c:
            print i+1, "max", b
        else:
            print i+1, "---", b

    Peter
    Peter Otten, Nov 14, 2006
    #5
  6. If the order doesn't matter, you can sort the data and remove x * 0.5 *
    n where x is the proportion of numbers you want. If you have too many
    similar values though, this falls down. I suggest you check out
    quantiles in a good statistics book.
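
    As a rough sketch of that idea in Python (the cut-off fraction x = 0.2
    and the handling of duplicate values are arbitrary choices made here
    for illustration):

    # Take the x*0.5*n smallest and x*0.5*n largest values.
    def extreme_fraction(values, x=0.2):
        ranked = sorted(values)
        k = int(round(len(ranked) * x * 0.5))  # how many from each tail
        tails = set(ranked[:k] + ranked[len(ranked) - k:])
        return [(i, v) for i, v in enumerate(values) if v in tails]

    # usage with the sample list from the original post:
    # print(extreme_fraction(data))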

    Alan.

    Peter Otten wrote:

    > erikcw wrote:
    >
    > > Hi all,
    > >
    > > I have a collection of ordered numerical data in a list. The numbers
    > > when plotted on a line chart make a low-high-low-high-high-low (random)
    > > pattern. I need an algorithm to extract the "significant" high and low
    > > points from this data.
    > >
    > > Here is some sample data:
    > > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > > 0.10]
    > >
    > > In this data, some of the significant points include:
    > > data[0]
    > > data[2]
    > > data[4]
    > > data[6]
    > > data[8]
    > > data[9]
    > > data[13]
    > > data[14]
    > > ....
    > >
    > > How do I sort through this data and pull out these points of
    > > significance?

    >
    > I think you are looking for "extrema":
    >
    > def w3(items):
    >     items = iter(items)
    >     view = None, items.next(), items.next()
    >     for item in items:
    >         view = view[1:] + (item,)
    >         yield view
    >
    > for i, (a, b, c) in enumerate(w3(data)):
    >     if a > b < c:
    >         print i+1, "min", b
    >     elif a < b > c:
    >         print i+1, "max", b
    >     else:
    >         print i+1, "---", b
    >
    > Peter
    Alan J. Salmoni, Nov 14, 2006
    #6
  7. >>>>> Jeremy Sanders <> writes:

    >> How do I sort through this data and pull out these points of
    >> significance?


    > Get a book on statistics. One idea is as follows. If you expect the points
    > to be centred around a single value, you can calculate the median or mean
    > of the points, calculate their standard deviation (aka spread), and remove
    > points which are more than N-times the standard deviation from the median.


    Standard deviation was the first thought that jumped to my mind too.
    However, that's not what the OP is after. He seems to be looking for
    points where the direction changes.

    Ganesan

    --
    Ganesan Rajagopal
    Ganesan Rajagopal, Nov 14, 2006
    #7
  8. erikcw wrote:
    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.


    In calculus, you identify high and low points by looking where the
    derivative changes its sign. When working with discrete samples, you can
    look at the sign changes in finite differences:

    >>> data = [...]
    >>> diff = [data[i + 1] - data[i] for i in range(len(data) - 1)]
    >>> map(str, diff)

    ['0.4', '0.1', '-0.2', '-0.01', '0.11', '0.5', '-0.2', '-0.2', '0.6',
    '-0.1', '0.2', '0.1', '0.1', '-0.45', '0.15', '-0.3', '-0.2', '0.1',
    '-0.4', '0.05', '-0.1', '-0.25']

    The high points are those where diff changes from + to -, and the low
    points are those where diff changes from - to +.
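
    Continuing from the diff list above, one way to pick up those sign
    changes (a sketch; it ignores flat stretches where the difference is
    exactly zero):

    # Consecutive differences with opposite signs mark a turning point;
    # the extremum itself sits at index i + 1 of the original data.
    turning = [i + 1 for i in range(len(diff) - 1)
               if diff[i] * diff[i + 1] < 0]
    print(turning)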

    HTH,
    --
    Roberto Bonvallet
    Roberto Bonvallet, Nov 14, 2006
    #8
  9. erikcw

    Roy Smith Guest

    "erikcw" <> wrote:
    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.


    I think you want a control chart. A good place to start might be
    http://en.wikipedia.org/wiki/Control_chart. Even if you don't actually
    graph the data, understanding the math behind control charts might help you
    with your analysis.

    Wow. I think this is the first time I've actually used something I learned
    by sitting through those stupid Six Sigma training classes :)
    Roy Smith, Nov 14, 2006
    #9
  10. erikcw

    Beliavsky Guest

    erikcw wrote:
    > Hi all,
    >
    > I have a collection of ordered numerical data in a list.


    Called a "time series" in statistics.

    > The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.
    >
    > Here is some sample data:
    > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > 0.10]
    >
    > In this data, some of the significant points include:
    > data[0]
    > data[2]
    > data[4]
    > data[6]
    > data[8]
    > data[9]
    > data[13]
    > data[14]
    > ....
    >
    > How do I sort through this data and pull out these points of
    > significance?


    The best place to ask about an algorithm for this is not
    comp.lang.python -- maybe sci.stat.math would be better. Once you have
    an algorithm, coding it in Python should not be difficult. I'd suggest
    using a NumPy array rather than a native Python list, which is not
    designed for crunching numbers.
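
    If you do go the NumPy route, a turning-point search might look roughly
    like this (a sketch of the sign-change approach discussed elsewhere in
    the thread, not the only option):

    import numpy as np

    def turning_points(values):
        a = np.asarray(values, dtype=float)
        d = np.diff(a)                       # finite differences
        sign_change = d[:-1] * d[1:] < 0     # True where the slope flips
        return np.nonzero(sign_change)[0] + 1

    # usage with the sample list from the original post:
    # print(turning_points(data))
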
    Beliavsky, Nov 14, 2006
    #10
  11. erikcw

    robert Guest

    erikcw wrote:
    > Hi all,
    >
    > I have a collection of ordered numerical data in a list. The numbers
    > when plotted on a line chart make a low-high-low-high-high-low (random)
    > pattern. I need an algorithm to extract the "significant" high and low
    > points from this data.
    >
    > Here is some sample data:
    > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > 0.10]
    >
    > In this data, some of the significant points include:
    > data[0]
    > data[2]
    > data[4]
    > data[6]
    > data[8]
    > data[9]
    > data[13]
    > data[14]
    > ....
    >
    > How do I sort through this data and pull out these points of
    > significance?


    It's obviously a kind of time series, and you are searching for a
    "moving_max(data, t, window) > data(t)" / "moving_min(data, t, window) < data(t)"
    condition: an extremum within a certain (time) window. And obviously your
    time window is as low as 2 or 3 or so.

    Unfortunately a moving_max function is not yet in numpy and is probably
    not achievable from other existing array functions. You have to write
    slow looping code.
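
    A straightforward (slow) loop version of that moving-window test might
    look like this (a sketch; the window size and the handling of ties are
    assumptions):

    # A point is flagged when it is the maximum or minimum of the values
    # within `window` positions on either side (window=1 compares only
    # immediate neighbours; plateaus of equal values are all flagged).
    def window_extrema(values, window=1):
        hits = []
        for t in range(len(values)):
            lo = max(0, t - window)
            hi = min(len(values), t + window + 1)
            neighbourhood = values[lo:hi]
            if values[t] in (max(neighbourhood), min(neighbourhood)):
                hits.append(t)
        return hits

    # usage with the sample list from the original post:
    # print(window_extrema(data, window=1))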


    Robert
    robert, Nov 19, 2006
    #11
  12. erikcw

    Paul McGuire Guest

    "robert" <> wrote in message
    news:ejpf2r$p8g$...
    > erikcw wrote:
    >> Hi all,
    >>
    >> I have a collection of ordered numerical data in a list. The numbers
    >> when plotted on a line chart make a low-high-low-high-high-low (random)
    >> pattern. I need an algorithm to extract the "significant" high and low
    >> points from this data.
    >>
    >> Here is some sample data:
    >> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    >> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    >> 0.10]
    >>
    >> In this data, some of the significant points include:
    >> data[0]
    >> data[2]
    >> data[4]
    >> data[6]
    >> data[8]
    >> data[9]
    >> data[13]
    >> data[14]
    >> ....
    >>
    >> How do I sort through this data and pull out these points of
    >> significance?


    Using zip and map, it's easy to compute first and second derivatives of a
    time series of values. The first lambda computes
    Paul McGuire, Nov 19, 2006
    #12
  13. erikcw

    Paul McGuire Guest

    .... dang touchy keyboard!

    > Here is some sample data:
    > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    > 0.10]
    >
    > In this data, some of the significant points include:
    > data[0]
    > data[2]
    > data[4]
    > data[6]
    > data[8]
    > data[9]
    > data[13]
    > data[14]


    Using the first derivative, and looking for sign changes, finds many of the
    values you marked as "significant".

    -- Paul


    data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
    1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
    0.10]

    delta = lambda (x1,x2) : x2-x1
    dy_dx =[0]+map(delta,zip(data,data[1:]))
    d2y_dx2 = [0]+map(delta,zip(dy_dx,dy_dx[1:]))

    sgnChange = lambda (x1,x2) : x1*x2<0
    sigs = map(sgnChange,zip(dy_dx,dy_dx[1:]))
    print [i for i,v in enumerate(sigs) if v]
    [2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]
    Paul McGuire, Nov 19, 2006
    #13