# Programmatically finding "significant" data points

Discussion in 'Python' started by erikcw, Nov 14, 2006.

1. ### erikcw (Guest)

Hi all,

I have a collection of ordered numerical data in a list. The numbers
when plotted on a line chart make a low-high-low-high-high-low (random)
pattern. I need an algorithm to extract the "significant" high and low
points from this data.

Here is some sample data:
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
0.10]

In this data, some of the significant points include:
data[0]
data[2]
data[4]
data[6]
data[8]
data[9]
data[13]
data[14]
.....

How do I sort through this data and pull out these points of
significance?

Erik

erikcw, Nov 14, 2006

2. ### Jeremy Sanders (Guest)

erikcw wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>

....
>
> How do I sort through this data and pull out these points of
> significance?

Get a book on statistics. One idea is as follows. If you expect the points
to be centred around a single value, you can calculate the median or mean
of the points, calculate their standard deviation (aka spread), and remove
points which are more than N-times the standard deviation from the median.
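
A minimal Python 3 sketch of the idea just described, assuming the points really are centred on a single value. The function name `far_from_median` and the default `n_sigma=1.5` are illustrative choices, not part of the original post. It reports the far-away points rather than deleting them; the complement of the result is what you would keep if you wanted to remove them.

```python
import statistics

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def far_from_median(values, n_sigma=1.5):
    """Return (index, value) pairs lying more than n_sigma standard
    deviations away from the median of values."""
    centre = statistics.median(values)
    spread = statistics.pstdev(values)  # population standard deviation
    return [(i, v) for i, v in enumerate(values)
            if abs(v - centre) > n_sigma * spread]

outliers = far_from_median(data)
for index, value in outliers:
    print(index, value)
```

Note that this looks only at how far a value is from the centre, not at where the series changes direction.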

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/

Jeremy Sanders, Nov 14, 2006

3. ### Fredrik Lundh (Guest)

"erikcw" wrote:

> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]

silly solution:

for i in range(1, len(data)-1):
    if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
        print i

(the above doesn't handle the "edges", but that's easy to fix)
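
One possible way to handle the edges, sketched in Python 3. It assumes the first and last samples always count as significant, which matches `data[0]` appearing in the original list but is otherwise a guess; the function name `turning_points` is made up for this example.

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def turning_points(values):
    """Indices where the series changes direction, plus both endpoints."""
    if len(values) < 2:
        return list(range(len(values)))
    points = [0]  # assumption: the first sample always counts
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if prev < cur > nxt or prev > cur < nxt:
            points.append(i)
    points.append(len(values) - 1)  # assumption: so does the last sample
    return points

print(turning_points(data))
# [0, 2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20, 22]
```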

</F>

Fredrik Lundh, Nov 14, 2006
4. ### Philipp Pagel (Guest)

erikcw <> wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

I am not sure what you mean by 'ordered' in this context. As
pointed out by Jeremy, you need to find an appropriate statistical test.
The appropriateness depends on how your data is (presumably) distributed
and what exactly you are trying to test. E.g. do the data points come from
different groups of some kind? Or are you just looking for extreme
values (outliers maybe?)?

So it's more of a statistical question than a Python one.

cu
Philipp

--
Dr. Philipp Pagel Tel. +49-8161-71 2131
Dept. of Genome Oriented Bioinformatics Fax. +49-8161-71 2186
Technical University of Munich
http://mips.gsf.de/staff/pagel

Philipp Pagel, Nov 14, 2006
5. ### Peter Otten (Guest)

erikcw wrote:

> Hi all,
>
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?

I think you are looking for "extrema":

def w3(items):
    items = iter(items)
    view = None, items.next(), items.next()
    for item in items:
        view = view[1:] + (item,)
        yield view

for i, (a, b, c) in enumerate(w3(data)):
    if a > b < c:
        print i+1, "min", b
    elif a < b > c:
        print i+1, "max", b
    else:
        print i+1, "---", b
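
The code above is Python 2 (`items.next()` and the `print` statement). A rough Python 3 equivalent might look like the sketch below. It swaps the tuple-slicing window for a `collections.deque`, and the names `windows3` and `classify` are invented for this example.

```python
from collections import deque

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def windows3(items):
    """Yield successive overlapping triples (a, b, c) from any iterable."""
    window = deque(maxlen=3)  # the oldest item drops out automatically
    for item in items:
        window.append(item)
        if len(window) == 3:
            yield tuple(window)

def classify(values):
    """Return (index, label, value) for every interior point."""
    result = []
    # start=1 because the middle of the first triple is values[1]
    for i, (a, b, c) in enumerate(windows3(values), start=1):
        if a > b < c:
            label = "min"
        elif a < b > c:
            label = "max"
        else:
            label = "---"
        result.append((i, label, b))
    return result

for index, label, value in classify(data):
    print(index, label, value)
```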

Peter

Peter Otten, Nov 14, 2006
6. ### Alan J. Salmoni (Guest)

If the order doesn't matter, you can sort the data and remove x * 0.5 *
n where x is the proportion of numbers you want. If you have too many
similar values though, this falls down. I suggest you check out
quantiles in a good statistics book.

Alan.

Peter Otten wrote:

> erikcw wrote:
>
> > Hi all,
> >
> > I have a collection of ordered numerical data in a list. The numbers
> > when plotted on a line chart make a low-high-low-high-high-low (random)
> > pattern. I need an algorithm to extract the "significant" high and low
> > points from this data.
> >
> > Here is some sample data:
> > data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> > 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> > 0.10]
> >
> > In this data, some of the significant points include:
> > data[0]
> > data[2]
> > data[4]
> > data[6]
> > data[8]
> > data[9]
> > data[13]
> > data[14]
> > ....
> >
> > How do I sort through this data and pull out these points of
> > significance?

>
> I think you are looking for "extrema":
>
> def w3(items):
>     items = iter(items)
>     view = None, items.next(), items.next()
>     for item in items:
>         view = view[1:] + (item,)
>         yield view
>
> for i, (a, b, c) in enumerate(w3(data)):
>     if a > b < c:
>         print i+1, "min", b
>     elif a < b > c:
>         print i+1, "max", b
>     else:
>         print i+1, "---", b
>
> Peter

Alan J. Salmoni, Nov 14, 2006
7. ### Ganesan Rajagopal (Guest)

>>>>> Jeremy Sanders <> writes:

>> How do I sort through this data and pull out these points of
>> significance?

> Get a book on statistics. One idea is as follows. If you expect the points
> to be centred around a single value, you can calculate the median or mean
> of the points, calculate their standard deviation (aka spread), and remove
> points which are more than N-times the standard deviation from the median.

Standard deviation was the first thought that jumped to my mind
too. However, that's not what the OP is after. He seems to be looking for
points where the direction changes.

Ganesan

--
Ganesan Rajagopal

Ganesan Rajagopal, Nov 14, 2006
8. ### Roberto Bonvallet (Guest)

erikcw wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

In calculus, you identify high and low points by looking where the
derivative changes its sign. When working with discrete samples, you can
look at the sign changes in finite differences:

>>> data = [...]
>>> diff = [data[i + 1] - data[i] for i in range(len(data) - 1)]
>>> map(str, diff)

['0.4', '0.1', '-0.2', '-0.01', '0.11', '0.5', '-0.2', '-0.2', '0.6',
'-0.1', '0.2', '0.1', '0.1', '-0.45', '0.15', '-0.3', '-0.2', '0.1',
'-0.4', '0.05', '-0.1', '-0.25']

The high points are those where diff changes from + to -, and the low
points are those where diff changes from - to +.
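
As a Python 3 sketch of that rule (the function name `highs_and_lows` is hypothetical): a sign change between `diffs[i - 1]` and `diffs[i]` marks `values[i]` as a high or low point. Flat stretches, where a difference is exactly zero, are simply skipped here. That is a simplification, not something the post specifies.

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def highs_and_lows(values):
    """Return (highs, lows): indices where the finite difference
    changes from + to - and from - to +, respectively."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    highs, lows = [], []
    for i in range(1, len(diffs)):
        before, after = diffs[i - 1], diffs[i]
        if before > 0 and after < 0:
            highs.append(i)   # rising, then falling
        elif before < 0 and after > 0:
            lows.append(i)    # falling, then rising
    return highs, lows

highs, lows = highs_and_lows(data)
print("highs:", highs)  # highs: [2, 6, 9, 13, 15, 18, 20]
print("lows:", lows)    # lows: [4, 8, 10, 14, 17, 19]
```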

HTH,
--
Roberto Bonvallet

Roberto Bonvallet, Nov 14, 2006
9. ### Roy Smith (Guest)

"erikcw" <> wrote:
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.

I think you want a control chart. A good place to start might be
http://en.wikipedia.org/wiki/Control_chart. Even if you don't actually
graph the data, understanding the math behind control charts might help you.
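
For readers who want the gist without the chart, here is a deliberately simplified Python 3 sketch: a centre line at the mean, with control limits a fixed number of standard deviations either side. Real control charts often estimate the spread differently, so treat the use of `statistics.stdev` as an assumption; the function names are also made up for this example.

```python
import statistics

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def control_limits(values, n_sigma=3.0):
    """Return (lower limit, centre line, upper limit)."""
    centre = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: plain sample std dev
    return centre - n_sigma * sigma, centre, centre + n_sigma * sigma

def out_of_control(values, n_sigma=3.0):
    """Return (index, value) pairs falling outside the control limits."""
    lower, _, upper = control_limits(values, n_sigma)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

print(control_limits(data))
print(out_of_control(data))               # nothing is beyond 3 sigma here
print(out_of_control(data, n_sigma=1.5))  # tighter limits flag the extremes
```

With the sample data nothing falls outside the usual three-sigma limits, which suggests this approach answers a different question from the one about direction changes.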

Wow. I think this is the first time I've actually used something I learned
by sitting through those stupid Six Sigma training classes.

Roy Smith, Nov 14, 2006
10. ### Beliavsky (Guest)

erikcw wrote:
> Hi all,
>
> I have a collection of ordered numerical data in a list.

Called a "time series" in statistics.

> The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?

The best place to ask about an algorithm for this is not
comp.lang.python -- maybe sci.stat.math would be better. Once you have
an algorithm, coding it in Python should not be difficult. I'd suggest
using the NumPy array rather than the native Python list, which is not
designed for crunching numbers.

Beliavsky, Nov 14, 2006
11. ### robert (Guest)

erikcw wrote:
> Hi all,
>
> I have a collection of ordered numerical data in a list. The numbers
> when plotted on a line chart make a low-high-low-high-high-low (random)
> pattern. I need an algorithm to extract the "significant" high and low
> points from this data.
>
> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]
> ....
>
> How do I sort through this data and pull out these points of
> significance?

It's obviously a kind of time series, and you are searching for points where "moving_max(data,t,window)>data(t)" / "moving_min(data,t,window)<data(t)": an extremum within a certain (time) window. And obviously your time window is as small as 2 or 3 or so.

Unfortunately a moving_max func is not yet in numpy and probably not achievable from other existing array functions. You have to create slow looping code.
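
A plain-Python 3 sketch of the "slow looping code" version, under one reading of the expressions above: a point counts if it is strictly larger (or smaller) than every other sample within `half_width` positions of it. The name `window_extrema` and the `half_width` parameter are invented for this example, and `values` is assumed to be a list.

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def window_extrema(values, half_width=1):
    """Return (maxima, minima): indices whose value is strictly the
    largest or smallest among the neighbours within half_width
    positions. The window is clipped at both ends of the list."""
    maxima, minima = [], []
    for t, v in enumerate(values):
        lo = max(0, t - half_width)
        hi = min(len(values), t + half_width + 1)
        neighbours = values[lo:t] + values[t + 1:hi]
        if not neighbours:
            continue  # a single-element list has nothing to compare with
        if v > max(neighbours):
            maxima.append(t)
        elif v < min(neighbours):
            minima.append(t)
    return maxima, minima

print(window_extrema(data, half_width=1))
# ([2, 6, 9, 13, 15, 18, 20], [0, 4, 8, 10, 14, 17, 19, 22])
print(window_extrema(data, half_width=2))
# ([2, 6, 13], [0, 4, 8, 22])
```

Widening the window thins out the result, which is one way to make "significant" stricter.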

Robert

robert, Nov 19, 2006
12. ### Paul McGuire (Guest)

"robert" <> wrote in message
news:ejpf2r$p8g$...
> erikcw wrote:
>> Hi all,
>>
>> I have a collection of ordered numerical data in a list. The numbers
>> when plotted on a line chart make a low-high-low-high-high-low (random)
>> pattern. I need an algorithm to extract the "significant" high and low
>> points from this data.
>>
>> Here is some sample data:
>> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
>> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
>> 0.10]
>>
>> In this data, some of the significant points include:
>> data[0]
>> data[2]
>> data[4]
>> data[6]
>> data[8]
>> data[9]
>> data[13]
>> data[14]
>> ....
>>
>> How do I sort through this data and pull out these points of
>> significance?

Using zip and map, it's easy to compute first and second derivatives of a
time series of values. The first lambda computes

Paul McGuire, Nov 19, 2006
13. ### Paul McGuire (Guest)

.... dang touchy keyboard!

> Here is some sample data:
> data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
> 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
> 0.10]
>
> In this data, some of the significant points include:
> data[0]
> data[2]
> data[4]
> data[6]
> data[8]
> data[9]
> data[13]
> data[14]

Using the first derivative, and looking for sign changes, finds many of the
values you marked as "significant".

-- Paul

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35,
0.10]

delta = lambda (x1,x2) : x2-x1
dy_dx =[0]+map(delta,zip(data,data[1:]))
d2y_dx2 = [0]+map(delta,zip(dy_dx,dy_dx[1:]))

sgnChange = lambda (x1,x2) : x1*x2<0
sigs = map(sgnChange,zip(dy_dx,dy_dx[1:]))
print [i for i,v in enumerate(sigs) if v]
[2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]
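
The snippet above relies on Python 2 features (tuple-unpacking lambdas, `map` returning a list, and the `print` statement), so it will not run unchanged on Python 3. A possible Python 3 rendering of the same computation is sketched below; the helper name `deltas` is made up for this example.

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def deltas(values):
    """Differences between neighbours, padded with a leading 0 so that
    deltas(values)[i] lines up with values[i]."""
    return [0] + [b - a for a, b in zip(values, values[1:])]

dy_dx = deltas(data)
d2y_dx2 = deltas(dy_dx)  # computed in the original post, not used below

# True wherever two consecutive first differences have opposite signs
sigs = [d1 * d2 < 0 for d1, d2 in zip(dy_dx, dy_dx[1:])]
sign_change_indices = [i for i, changed in enumerate(sigs) if changed]
print(sign_change_indices)
# [2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]
```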

Paul McGuire, Nov 19, 2006