# Struggling for inspiration with lists

D

#### Denis McMahon

Hi

I have a list of data that presents as:

timestamp: value

Timestamps are used solely to determine the sequence of items in the list.

I want to find the longest repeated sequence of values in the list.
Example, in the following list:

data = { 0: "d", 1: "x", 2: "y", 3: "t", 4: "d", 5: "y", 77: "g"' 78:
"h", 79: "x", 80: "y", 206: "t", 210: "d", 211: "x" }

I would pull out the sequence x-y-t-d (starting at 1 and 79)

I need to keep the timestamp / data association because I need to
generate output that identifies (a) the longest repeated sequence (b) how
many elements in the longest repeated sequence (c) at what timestamps
each occurrence started.

I'm not sure of the best way, programatically, to aproach this task,
which means I'm unsure whether eg a list of tuples ( time, data ) or an
OrderedDict keyed on the timestamp is the best starting point.

I can make a list of tuples using:

d = [ (k,v) for k,v in data ]

and with the list of tuples, I can do something like:

d.sort( key=lambda tup: tup[0] )

max_start_a = 0
max_start_b = 0
max_len = 0
i = 0

while i < len( d ):

j = i + 1

while j < len( d ):

o = 0

while j+o < len( d ) and d[i+o][1] == d[j+o][1]:

o += 1

if o > max_len:

max_len = 0
max_start_a = i
max_start_b = j

j += 1

i += 1

print d[max_start_a][0], d[max_start_b][0], max_len

Is there a better way to do this?

T

#### Tim Chase

I need to keep the timestamp / data association because I need to
generate output that identifies (a) the longest repeated sequence
(b) how many elements in the longest repeated sequence (c) at what
timestamps each occurrence started.

I'm not sure of the best way, programatically, to aproach this
task, which means I'm unsure whether eg a list of tuples ( time,
data ) or an OrderedDict keyed on the timestamp is the best
starting point. [snip]
Is there a better way to do this?

You might have to define your criteria for "best". Things that
might play into it:

- what happens if there's more than one "longest" sequence?

- can you specify a minimum length that you don't have to track? (a
maximum sequence of length-1 might be uninteresting)

- how large can your input data grow to be? If it's huge, you're
looking at searching a problem space something like O(N^2) or
possibly O(N log N) if you're lucky. Which in turn can involve
lots of memory consumption.

- do you just want to know what that sequence is, or do you want to
know all the locations/offsets where it the longest sequence was
found?

I've played with two simple versions, the first is naive, while the
second tries to use a little bit of smarts to eliminate possibilities
when you know you have at least one match of a certain length. The
result of the function does allow for finding a plurality of longest
sequences, but my demo code just snags one with the max() function.
I leave the "find multiple maxes" as an exercise for the reader.

-tkc

def build1(d):
sequences = {} # sequence -> [list of start offsets]
for i in range(len(d)):
for j in range(i+1, len(d)):
key = tuple(value for timestamp, value in d[i:j])
sequences.setdefault(key, []).append(i)
return sequences

def build2(d):
sequences = {} # sequence -> [list of start offsets]
longest = 1
for i in range(len(d)):
for j in range(i+longest, len(d)):
key = tuple(value for timestamp, value in d[i:j])
if key in sequences:
sequences[key].append(i)
if len(sequences[key]) > longest:
longest = len(sequences[key])
print longest
else:
sequences[key] =
return sequences

data = {
0: "d", 1: "x", 2: "y", 3: "t", 4: "d", 5: "y", 77: "g",
78: "h", 79: "x", 80: "y", 206: "t", 210: "d", 211: "x",
}
d = data.items()
d.sort()

for fn in (build1, build2):
print fn.__name__
sequences = fn(d)
longest_key, offsets = max((
pair
for pair in sequences.iteritems()
if len(pair[1]) > 1
), key=lambda (k,v): len(k))
print "Longest sequence %r found at %s" % (
longest_key,
[d[0] for i in offsets],
)
for k,v in sequences.iteritems():
if len(v) > 1:
print "%s: %r" % (k,v)
print '=' * 50

..