Identifying the start of good data in a list

tkpmep · Aug 26, 2008

I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
    i += 1
    if retHist[i] == 0:
        j = 0
    else:
        j += 1
        if j == 5:
            flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips

bearophileHUGS · Aug 26, 2008

First solutions I have found, not much tested beside the few doctests:

from itertools import islice

def start_good1(alist, good_ones=4):
    """
    Maybe more efficient for Python

    >>> start_good = start_good1
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    4
    >>> start_good([])
    -1
    >>> start_good([0, 0])
    -1
    >>> start_good([0, 0, 0])
    -1
    >>> start_good([0, 0, 0, 0, 1])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
    4
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
    4
    >>> start_good([1, 2, 3, 4])
    0
    >>> start_good([1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
    -1
    """
    for i in xrange(len(alist) - good_ones + 1):
        if all(islice(alist, i, i+good_ones)):
            return i
    return -1

def start_good2(alist, good_ones=4):
    """
    Maybe more efficient for Psyco

    >>> start_good = start_good2
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    4
    >>> start_good([])
    -1
    >>> start_good([0, 0])
    -1
    >>> start_good([0, 0, 0])
    -1
    >>> start_good([0, 0, 0, 0, 1])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
    4
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
    4
    >>> start_good([1, 2, 3, 4])
    0
    >>> start_good([1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
    -1
    """
    n_good = 0
    for i, el in enumerate(alist):
        if alist[i]:
            if n_good == good_ones:
                return i - good_ones
            else:
                n_good += 1
        else:
            n_good = 0
    if n_good == good_ones:
        return len(alist) - good_ones
    else:
        return -1

if __name__ == "__main__":
    import doctest
    doctest.testmod()
    print "Doctests done\n"

Bye,
bearophile

bearophileHUGS · Aug 26, 2008

Sorry, in the Psyco version replace this line:
for i, el in enumerate(alist):

With:
for i in xrange(len(alist)):

because Psyco doesn't digest enumerate well.

Bye,
bearophile

Matthew Fitzgibbons · Aug 27, 2008

Maybe this will do?

reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count = 0
for i, d in enumerate(reHist):
    if d == 0:
        count = 0
    else:
        count += 1
    if count == 5:
        break
else:
    raise Exception("No data found")
reHist = reHist[i-4:]
print reHist

-Matt

Mensanator · Aug 27, 2008

Here's my attempt:

LL = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

i = 0

while (i<len(LL)) and (0 in LL[i:i+5]):
    i += 1

print i, LL[i:i+5]

##
## 4 [1, 2, 3, 4, 5]
##

Emile van Sebille · Aug 27, 2008

>>> for ii,dummy in enumerate(retHist):
...     if 0 not in retHist[ii:ii+5]:
...         break
...
>>> del retHist[:ii]

Well, to the extent short and sweet is elegant...

Emile

tdmj · Aug 27, 2008

With regular expressions:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_str = ''.join(str(i) for i in hist)
match = re.search(r'[1-9]{5, }', hist_str)
hist = hist[-5:] if match is None else hist[match.start():]

Or slightly more concise:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
match = re.search(r'[1-9]{5, }', ''.join(str(i) for i in hist))
hist = hist[-5:] if match is None else hist[match.start():]

Tommy McDaniel

tkpmep · Aug 27, 2008

Emile said:

>>> for ii,dummy in enumerate(retHist):
...     if 0 not in retHist[ii:ii+5]:
...         break
...
>>> del retHist[:ii]

Well, to the extent short and sweet is elegant...

Emile

This is just what the doctor ordered. Thank you, everyone, for the
help.

Sincerely

Thomas Philips

Terry Reedy · Aug 27, 2008

Matthew said:

reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count = 0
for i, d in enumerate(reHist):
    if d == 0:
        count = 0
    else:
        count += 1
    if count == 5:
        break
else:
    raise Exception("No data found")
reHist = reHist[i-4:]
print reHist

This is what I would have suggested, except that the 'if count' test
should be left under the else clause, as in the original, so I consider
it the best of the responses ;-)
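
A minimal Python 3 sketch of the counting loop with the count test kept under the else clause, as suggested above. The function name `trim_to_good` and the `run` parameter are illustrative and not from the thread, and returning an empty list when no run is found is an assumption (Matthew's version raises an exception instead).

```python
def trim_to_good(values, run=5):
    """Return values from the first run of `run` consecutive non-zeros on."""
    count = 0
    for i, d in enumerate(values):
        if d == 0:
            count = 0
        else:
            count += 1
            # Only a non-zero item can complete a run, so the test lives here.
            if count == run:
                return values[i - run + 1:]
    return []  # assumption: no qualifying run means there is no good data


print(trim_to_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each item is visited once, so the cost does not depend on the run length.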

I thought of the repeated slicing alternative, but it would be slightly
slower. However, for occasional runs, the difference would be trivial.

Worrying about what Psyco does for this problem is rather premature
optimization.

My quarter's worth....

tjr

Steven D'Aprano · Aug 27, 2008

tdmj said:

With regular expressions:

Good grief. If you're suggesting that as a serious proposal, and not just
to prove it can be done, that's surely an example of "when all you have
is a hammer, everything looks like a nail" thinking.

In this particular case, your regex "solution" gives the wrong result,
indicating that you didn't test your code before posting. Hint:

re.search(r'[1-9]{5, }', "123456")

returns None.

The obvious fix for that specific bug is to use r'[1-9]{5,5}', but even
that will fail. Hint: what happens if an item has more than one digit?

Before posting another regex solution, make sure it does the right thing
with this:

[0, 0, 101, 0, 1002, 203, 3050, 4105, 5110, 623, 777]
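
For what it's worth, a regex can be made to handle that list if each item is first reduced to a single flag character, so that an offset in the string is also an index into the list. This is only a Python 3 sketch under that assumption; `start_of_good` and `run` are made-up names, and the -1 return value follows bearophile's doctests.

```python
import re


def start_of_good(hist, run=5):
    """Index of the first run of `run` consecutive non-zero items, or -1."""
    # One character per item: '1' for non-zero, '0' for zero.
    flags = ''.join('1' if x else '0' for x in hist)
    match = re.search('1{%d}' % run, flags)
    return -1 if match is None else match.start()


hist = [0, 0, 101, 0, 1002, 203, 3050, 4105, 5110, 623, 777]
start = start_of_good(hist)
print(start)         # → 4
print(hist[start:])  # → [1002, 203, 3050, 4105, 5110, 623, 777]
```

The plain counting loop is still simpler; this only shows that the multi-digit problem comes from the string encoding rather than from the regex itself.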

Gerard flanagan · Aug 27, 2008

data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def itergood(indata):
    indata = iter(indata)
    buf = []
    while len(buf) < 4:
        buf.append(indata.next())
        if buf[-1] == 0:
            buf[:] = []
    for x in buf:
        yield x
    for x in indata:
        yield x

for d in itergood(data):
    print d

George Sakkis · Aug 27, 2008

tkpmep said:
Emile said:

>>> for ii,dummy in enumerate(retHist):
...     if 0 not in retHist[ii:ii+5]:
...         break
...
>>> del retHist[:ii]

Well, to the extent short and sweet is elegant...

This is just what the doctor ordered. Thank you, everyone, for the
help.

Note that the version above (as well as most others posted) fails for
boundary cases; check out bearophile's doctests to see some of them.
Below are two more versions that pass all the doctests: the first
works only for lists and modifies them in place and the second works
for arbitrary iterables:

def clean_inplace(seq, good_ones=4):
    start = 0
    n = len(seq)
    while start < n:
        try: end = seq.index(0, start)
        except ValueError: end = n
        if end-start >= good_ones:
            break
        start = end+1
    del seq[:start]

def clean_iter(iterable, good_ones=4):
    from itertools import chain, islice, takewhile, dropwhile
    iterator = iter(iterable)
    is_zero = float(0).__eq__
    while True:
        # consume all zeros up to the next non-zero
        iterator = dropwhile(is_zero, iterator)
        # take up to `good_ones` non-zeros
        good = list(islice(takewhile(bool,iterator), good_ones))
        if not good: # iterator exhausted
            return iterator
        if len(good) == good_ones:
            # found `good_ones` consecutive non-zeros;
            # chain them to the rest items and return them
            return chain(good, iterator)
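
To make the boundary-case point concrete, here is a small Python 3 sketch comparing the slice test as posted with a variant that only looks at full-length windows. The names `slice_start` and `slice_start_checked` are hypothetical, and returning -1 for "not found" is borrowed from bearophile's doctests rather than from the slice version itself.

```python
def slice_start(data, run=5):
    """Slice test as posted: a short tail slice without zeros also passes."""
    for i in range(len(data)):
        if 0 not in data[i:i + run]:
            return i
    return -1


def slice_start_checked(data, run=5):
    """Same test, restricted to windows that are really `run` items long."""
    for i in range(len(data) - run + 1):
        if 0 not in data[i:i + run]:
            return i
    return -1


short = [0, 0, 1, 2, 3]            # only three non-zero items at the end
print(slice_start(short))          # → 2, the three-item tail is accepted
print(slice_start_checked(short))  # → -1, no full window qualifies
```

Near the end of the list the slice `data[i:i + run]` silently gets shorter, which is why the unchecked version reports a run that is not there.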

HTH,
George

George Sakkis · Aug 27, 2008

Gerard said:

data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def itergood(indata):
    indata = iter(indata)
    buf = []
    while len(buf) < 4:
        buf.append(indata.next())
        if buf[-1] == 0:
            buf[:] = []
    for x in buf:
        yield x
    for x in indata:
        yield x

for d in itergood(data):
    print d

This seems the most efficient so far for arbitrary iterables. With a
few micro-optimizations it becomes:

from itertools import chain

def itergood(indata, good_ones=4):
    indata = iter(indata); get_next = indata.next
    buf = []; append = buf.append
    while len(buf) < good_ones:
        next = get_next()
        if next: append(next)
        else: del buf[:]
    return chain(buf, indata)

$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
100 loops, best of 3: 3.09 msec per loop

And with Psyco enabled:
$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
1000 loops, best of 3: 466 usec per loop

George

bearophileHUGS · Aug 27, 2008

George Sakkis:

This seems the most efficient so far for arbitrary iterables.

This one probably scores well with Psyco ;-)

def start_good3(seq, good_ones=4):
    """
    >>> start_good = start_good3
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    4
    >>> start_good([])
    -1
    >>> start_good([0, 0])
    -1
    >>> start_good([0, 0, 0])
    -1
    >>> start_good([0, 0, 0, 0, 1])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
    4
    >>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
    4
    >>> start_good([1, 2, 3, 4])
    0
    >>> start_good([1, 2, 3])
    -1
    >>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
    -1
    """
    n_good = 0
    pos = 0
    for el in seq:
        if el:
            if n_good == good_ones:
                return pos - good_ones
            else:
                n_good += 1
        elif n_good:
            n_good = 0
        pos += 1
    if n_good == good_ones:
        return pos - good_ones
    else:
        return -1

Bye,
bearophile

castironpi · Aug 27, 2008

bearophile said:

This one probably scores well with Psyco ;-)

There, that's the regular machine for it. Too much thinking in
objects, and you can't even write a linked list anymore, right?

George Sakkis · Aug 28, 2008

bearophile said:

This one probably scores well with Psyco ;-)

I think if you update this so that it returns the "good" iterable
instead of the starting index, it is equivalent to Gerard's solution.

George

George Sakkis · Aug 28, 2008

castironpi said:

There, that's the regular machine for it. Too much thinking in
objects, and you can't even write a linked list anymore, right?

And you're still wondering why do people killfile you or think you're
a failed AI project...

castironpi · Aug 28, 2008

George said:

And you're still wondering why do people killfile you or think you're
a failed AI project...

Just jumping on the bandwagon, George. And you see, everyone else's
passed the doctests perfectly. Were all the running times O( n* k )?

Gerard flanagan · Aug 28, 2008

George said:

This seems the most efficient so far for arbitrary iterables. With a
few micro-optimizations it becomes:

from itertools import chain

def itergood(indata, good_ones=4):
    indata = iter(indata); get_next = indata.next
    buf = []; append = buf.append
    while len(buf) < good_ones:
        next = get_next()
        if next: append(next)
        else: del buf[:]
    return chain(buf, indata)

I always forget the 'del slice' method for clearing a list, thanks.

I think that returning a `chain` means that the function is not itself a
generator. And so if indata runs out before good_ones consecutive
non-zero items have been found (for example, when it is shorter than
the threshold), an unhandled StopIteration is raised before the return
statement is reached.
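
One way around this, sketched for Python 3 (where `indata.next()` becomes `next(indata)`), is to pass a sentinel default to `next()` and return an empty iterator when the input runs out. Returning an empty iterator in that case is an assumption about the desired behaviour, not something settled in the thread, and the `missing` sentinel is an illustrative name.

```python
from itertools import chain


def itergood(indata, good_ones=4):
    """Iterate from the first run of `good_ones` consecutive non-zero items."""
    indata = iter(indata)
    buf = []
    missing = object()  # sentinel: distinguishes exhaustion from any real item
    while len(buf) < good_ones:
        item = next(indata, missing)
        if item is missing:
            return iter(())  # input ran out before a full run was found
        if item:
            buf.append(item)
        else:
            del buf[:]
    return chain(buf, indata)


print(list(itergood([0, 0, 1, 0, 1, 2, 3, 4, 5])))  # → [1, 2, 3, 4, 5]
print(list(itergood([1, 2, 3])))                    # → []
```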

G.

castironpi · Aug 28, 2008

You gave me an idea-- maybe an arbitrary 'lookahead' iterable could be
useful. I haven't seen them that much on the newsgroup, but more than
once. IOW a buffered consumer. Something that you could check a
fixed number of next elements of. You might implement it as an
iterator with a __getitem__ method.

Example, unproduced:

>>> import itertools
>>> a= itertools.count( )
>>> a.next()
0
>>> a.next()
1
>>> a[ 3 ]
5
>>> a.next()
2
>>> a[ 3 ]
6

Does this make sense at all?
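
A rough Python 3 sketch of such a buffered wrapper, assuming that offset 0 means "the item the next call to next() would return", which is what the example above implies. The class name `Lookahead` is hypothetical and nothing like it is in the standard library; only non-negative integer offsets are handled.

```python
from collections import deque
import itertools


class Lookahead:
    """Iterator wrapper that can peek at upcoming items by offset."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buf = deque()  # items already fetched but not yet consumed

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf:
            return self._buf.popleft()
        return next(self._it)

    def __getitem__(self, offset):
        if offset < 0:
            raise IndexError("lookahead offset must be non-negative")
        # Pull items into the buffer until the requested one is available.
        while len(self._buf) <= offset:
            try:
                self._buf.append(next(self._it))
            except StopIteration:
                raise IndexError("lookahead past end of iterable") from None
        return self._buf[offset]


a = Lookahead(itertools.count())
print(next(a), next(a))  # → 0 1
print(a[3])              # → 5
print(next(a))           # → 2
print(a[3])              # → 6
```

Peeking never consumes anything: buffered items are handed out by later `next()` calls in their original order.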
