finding repeated data sequences in a column

yadin · May 20, 2009

Good day everyone!
I have a a table, from where I can extract a column.
I wanna go down trough that column made of numbers
examine undetermined chunks of data and see or detect if that sequence
of chunk
of data has been repeated before
and if it has been repeated detect it by giving it a name in an
adjacent column.
Imagine someting like this but made of 1800 numbers... how can I build
up column 3(category)

Item price Category
400 1000028706 A
400 1000028707 A
400 1000028708 A
101 100 C
101 12 C
500 1000028706 A
500 1000028707 A
500 1000028708 A
500 1000028709 B
120 1000028706 A
120 1000028707 A
120 1000028708 A
120 100 C
120 12 C

bearophileHUGS · May 20, 2009

yadin, understanding what you want is probably 10 times harder than
writing down the code

I have a a table, from where I can extract a column.

You can extract it? Or do you want to extract it? Or do you want to
process it? Etc.

I wanna go down trough that column made of numbers
examine undetermined chunks of data and see or detect if that sequence
of chunk
of data has been repeated before

What do you mean by "undetermined"? What kind of data? Where is this
data? How is this "chunk" shaped? Are you talking about a string?

and if it has been repeated detect it by giving it a name in an
adjacent column.

What kind of name? So you just need 2 names, like N and S for New and
Seen?
You can use a built-in set data structure to know if you have already
seen some data, while you scan the records.

Imagine someting like this but made of 1800 numbers...

How are such 1800 disposed? Do you mean 1800 columns of 32 bit
numbers?

how can I build up column 3(category)<

What does A, B and C mean?

Bye,
bearophile

yadin · May 20, 2009

yadin, understanding what you want is probably 10 times harder than
writing down the code

You can extract it? Or do you want to extract it? Or do you want to
process it? Etc.

What do you mean by "undetermined"? What kind ofdata? Where is thisdata? How is this "chunk" shaped? Are you talking about a string?

What kind of name? So you just need 2 names, like N and S for New and
Seen?
You can use a built-in setdatastructure to know if you have already
seen somedata, while you scan the records.

How are such 1800 disposed? Do you mean 1800 columns of 32 bit
numbers?

What does A, B and C mean?

Bye,
bearophile

lets say you have this column of numbers

1000028706
1000028707
1000028708
100
12
1000028706
1000028707
1000028708
1000028709
1000028706
1000028707
1000028708
100
12
6

How can I build up a program that tells me that this sequence
1000028706
1000028707
1000028708
is repeated somewhere in the column, and how can i know where?

thank you very much!

Tim Chase · May 20, 2009

lets say you have this column of numbers

1000028706
1000028707
1000028708
100
12
1000028706
1000028707
1000028708
1000028709
1000028706
1000028707
1000028708
100
12
6

How can I build up a program that tells me that this sequence
1000028706
1000028707
1000028708
is repeated somewhere in the column, and how can i know where?

In your example, would "100,12" also be output, or do you only
care about the first find? Do you have a minimum or maximum
number of repeats you care about? Is a repeated number a
"sequence of length 1"? Can it be capped to know that if you
have more than N items in common, you don't have to compare more
than N items from the entire pair of sequences?

-tkc

bearophileHUGS · May 20, 2009

yadin:

How can I build up a program that tells me that this sequence
1000028706
1000028707
1000028708
is repeated somewhere in the column, and how can i know where?

Can such patterns nest? That is, can you have a repeated pattern made
of an already seen pattern plus something else?
If you don't want a complex program, then you may need to specify the
problem better.

You may want something like LZ77 or releated (LZ78, etc):
http://en.wikipedia.org/wiki/LZ77
This may have a bug:
http://code.activestate.com/recipes/117226/

Bye,
bearophile

norseman · May 20, 2009

yadin:

Can such patterns nest? That is, can you have a repeated pattern made
of an already seen pattern plus something else?
If you don't want a complex program, then you may need to specify the
problem better.

You may want something like LZ77 or releated (LZ78, etc):
http://en.wikipedia.org/wiki/LZ77
This may have a bug:
http://code.activestate.com/recipes/117226/

Bye,
bearophile

============================================
index on column
Ndx1 is set to index #1
Ndx2 is set to index #2
test Ndx1 against Ndx2
if equal write line number and column content to a file
(that's two things on one line: 15 1000028706
283 1000028706 )
Ndx1 is set to Ndx2
Ndx2 is set to index #next
loop to test writing out each duplicate set

Then use the outfile and index on line number

In similar manor, check if line current and next line line numbers are
sequential. If so scan forward to match column content of lower line
number and check first matched column's line number and next for
sequential. Print them out if so

everything in outfile has 1 or more duplicates

4 aa |--
5 bb |-- | thus 4/5 match 100/101
6 cc | |
.. | |
100 aa | |--
101 bb |--
102 ddd
103 cc there is a duplicate but not a sequence
200 ff

mark duplicate sequences as tested and proceed on through
seq1 may have more than one other seq in file.
the progress is from start to finish without looking back
thus each step forward has fewer lines to test.
marking already knowns eliminates redundant sequence testing.

By subseting on pass1 the expensive testing is greatly reduced.
If you know your subset data won't exceed memory then the "outfile"
can be held in memory to speed things up considerably.

Today is: 20090520
no code

Steve

yadin · May 21, 2009

============================================
index on column
Ndx1 is set to index #1
Ndx2 is set to index #2
test Ndx1 against Ndx2
if equal write line number and column content to a file
(that's two things on one line: 15 1000028706
283 1000028706 )
Ndx1 is set to Ndx2
Ndx2 is set to index #next
loop to test writing out each duplicate set

Then use the outfile and index on line number

In similar manor, check if line current and next line line numbers are
sequential. If so scan forward to match column content of lower line
number and check first matched column's line number and next for
sequential. Print them out if so

everything in outfile has 1 or more duplicates

4 aa |--
5 bb |-- | thus 4/5 match 100/101
6 cc | |
. | |
100 aa | |--
101 bb |--
102 ddd
103 cc there is a duplicate but not a sequence
200 ff

mark duplicate sequences as tested and proceed on through
seq1 may have more than one other seq in file.
the progress is from start to finish without looking back
thus each step forward has fewer lines to test.
marking already knowns eliminates redundant sequence testing.

By subseting on pass1 the expensive testing is greatly reduced.
If you know your subset data won't exceed memory then the "outfile"
can be held in memory to speed things up considerably.

Today is: 20090520
no code

Steve- Hide quoted text -

- Show quoted text -

this is the program...I wrote but is not working
I have a list of valves, and another of pressures;
If I am ask to find out which ones are the valves that are using all
this set of pressures, wanted best pressures
this is the program i wrote but is not working properly, it suppossed
to return in the case
find all the valves that are using pressures 1 "and" 2 "and" 3.
It returns me A, A2, A35....
The correct answer supposed to be A and A2...
if I were asked for pressures 56 and 78 the correct answer supossed to
be valves G and G2...

Valves = ['A','A','A','G', 'G', 'G',
'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
valve pressures
result = []

bestpress = [1,2,3] ##wanted base pressures
print bestpress,'len bestpress is' , len(bestpress)

print len(Valves)
print len(Valves)
for j in range(len(Valves)):
#for i in range(len(bestpress)):
#for j in range(len(Valves)):
for i in range(len(bestpress)-2):
if pressures [j]== bestpress and bestpress [i+1]
==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
result.append(Valves[j])
#i = i+1
#j = j+1
# print i, j, bestpress
print "common PSVs are", result

Peter Otten · May 21, 2009

yadin said:
============================================
index on column
Ndx1 is set to index #1
Ndx2 is set to index #2
test Ndx1 against Ndx2
if equal write line number and column content to a file
(that's two things on one line: 15 1000028706
283 1000028706 )
Ndx1 is set to Ndx2
Ndx2 is set to index #next
loop to test writing out each duplicate set

Then use the outfile and index on line number

In similar manor, check if line current and next line line numbers are
sequential. If so scan forward to match column content of lower line
number and check first matched column's line number and next for
sequential. Print them out if so

everything in outfile has 1 or more duplicates

4 aa |--
5 bb |-- | thus 4/5 match 100/101
6 cc | |
. | |
100 aa | |--
101 bb |--
102 ddd
103 cc there is a duplicate but not a sequence
200 ff

mark duplicate sequences as tested and proceed on through
seq1 may have more than one other seq in file.
the progress is from start to finish without looking back
thus each step forward has fewer lines to test.
marking already knowns eliminates redundant sequence testing.

By subseting on pass1 the expensive testing is greatly reduced.
If you know your subset data won't exceed memory then the "outfile"
can be held in memory to speed things up considerably.

Today is: 20090520
no code

Steve- Hide quoted text -

- Show quoted text -

Click to expand...

this is the program...I wrote but is not working
I have a list of valves, and another of pressures;
If I am ask to find out which ones are the valves that are using all
this set of pressures, wanted best pressures
this is the program i wrote but is not working properly, it suppossed
to return in the case
find all the valves that are using pressures 1 "and" 2 "and" 3.
It returns me A, A2, A35....
The correct answer supposed to be A and A2...
if I were asked for pressures 56 and 78 the correct answer supossed to
be valves G and G2...

Valves = ['A','A','A','G', 'G', 'G',
'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
valve pressures
result = []

bestpress = [1,2,3] ##wanted base pressures
print bestpress,'len bestpress is' , len(bestpress)

print len(Valves)
print len(Valves)
for j in range(len(Valves)):
#for i in range(len(bestpress)):
#for j in range(len(Valves)):
for i in range(len(bestpress)-2):
if pressures [j]== bestpress and bestpress [i+1]
==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
result.append(Valves[j])
#i = i+1
#j = j+1
# print i, j, bestpress
print "common PSVs are", result

If I understand your explanation correctly you don't actually care about the
sequence, you just want all valves that show the pressures specified in
bestpress. If that's correct your problem becomes easier. You can make a
lookup table that maps a given pressure to a set of all valves that had it
and then calculate the intersection.

from collections import defaultdict

def intersection(sets):
return reduce(set.intersection, sets)

valves = ['A', 'A', 'A', 'G', 'G', 'G', 'C', 'A2', 'A2', 'A2', 'F', 'G2',
'G2', 'G2', 'A35', 'A345', 'A4']
pressures = [1, 2, 3, 4235, 56, 78, 12, 1, 2, 3, 445, 45, 56, 78, 1, 23, 7]

lookup = defaultdict(set)
for v, p in zip(valves, pressures):
lookup[p].add(v)

result = []

wanted_pressures = [1, 2, 3]
print wanted_pressures, "-->", intersection(lookup[p] for p in
wanted_pressures)

Peter

norseman · May 21, 2009

yadin said:
this is the program...I wrote but is not working
I have a list of valves, and another of pressures;
If I am ask to find out which ones are the valves that are using all
this set of pressures, wanted best pressures
this is the program i wrote but is not working properly, it suppossed
to return in the case
find all the valves that are using pressures 1 "and" 2 "and" 3.
It returns me A, A2, A35....

looking at the data that seems correct.
there are 3 '1's in the list, 1-A, 1-A2, 1-A35
there are 2 '2's in the list, 2-A, 2-A2
there are 2 '3's in the list, 3-A, 3-A2
and so on

after the the two sets are paired
indexing on the right yields 1-A,2-A,3-A,1-A2,2-A2,3-A2,7-A4...
indexing on the left yiels1 1-A,1-A2,1-A35,2-A,2-A2,3-A,3-A2,7-A4...
and the two 78s would pair with a G and with a G2 (78-G, 78-G2)
beyond that I'm a bit lost.

20090521 Steve

The correct answer supposed to be A and A2...
if I were asked for pressures 56 and 78 the correct answer supossed to
be valves G and G2...

Valves = ['A','A','A','G', 'G', 'G',
'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
valve pressures
result = []

bestpress = [1,2,3] ##wanted base pressures
print bestpress,'len bestpress is' , len(bestpress)

print len(Valves)
print len(Valves)
for j in range(len(Valves)):
#for i in range(len(bestpress)):
#for j in range(len(Valves)):
for i in range(len(bestpress)-2):
if pressures [j]== bestpress and bestpress [i+1]
==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
result.append(Valves[j])
#i = i+1
#j = j+1
# print i, j, bestpress
print "common PSVs are", result

Rhodri James · May 21, 2009

this is the program...I wrote but is not working
I have a list of valves, and another of pressures;
If I am ask to find out which ones are the valves that are using all
this set of pressures, wanted best pressures
this is the program i wrote but is not working properly, it suppossed
to return in the case
find all the valves that are using pressures 1 "and" 2 "and" 3.

So if I understand you correctly, you actually want to split your
data up by valve name to find each valve that has listed pressures of
1, 2 and 3 in that order? That's a lot simpler, though it has to be
said that your data isn't in a terribly convenient format.

It returns me A, A2, A35....
The correct answer supposed to be A and A2...
if I were asked for pressures 56 and 78 the correct answer supossed to
be valves G and G2...

Ah, so the target "best" pressure sequence doesn't have to be all of the
values listed. Hmm. Here goes...

====HERE BE CODE====

from itertools import izip, groupby

VALVES = ['A','A','A','G', 'G', 'G',
'C','A2','A2','A2','F','G2',
'G2','G2','A35','A345','A4'] ##valve names
PRESSURES = [1,2,3,4235,56,78,
12, 1, 2, 3, 445, 45,
56,78, 1, 23,7] ## valve pressures
TARGET = [1, 2, 3]

target_len = len(TARGET) # Since we're using this a lot
result = []

for valve, p in groupby(izip(VALVES, PRESSURES),
key=lambda x: x[0]):
pressures = [x[1] for x in p]
for i in xrange((len(pressures) - target_len) + 1):
if pressures[i:i+target_len] == TARGET:
result.append(valve)
break

print "The answer you want is", result

====HERE ENDETH THE CODE====

Not terribly pretty largely because of having to do sublist
matching, but it should work for most "best pressures".

The unfamiliar looking stuff are functions from the iterator
toolkit that make this a lot simpler. If you don't get what's
going on here, I don't blame you. I just deleted my attempt
to explain it because it was confusing me

Reading the
descriptions of izip and groupby in the standard library
documentation should make things clearer.

Hey can anyone tell me why input data wont save in my database?	2	Jun 15, 2024
Data saving in condition of changing reality	0	Apr 29, 2022
StringChain -- a data structure for managing large sequences ofchunks of bytes	12	Mar 12, 2010
Pump data by row and column to dynamic created column in gridview	0	Feb 27, 2010
Help with code	0	Jun 11, 2022
Populating 'long' data column with Ruby OCI8 (trying again, need a solution badly)	2	Jan 17, 2006
copying the column values of a Datatable into another data table in vb.net	5	Jan 2, 2006
Finding the next file in a sequence	7	Jan 17, 2006

finding repeated data sequences in a column

yadin

bearophileHUGS

yadin

Tim Chase

bearophileHUGS

norseman

yadin

Peter Otten

norseman

Rhodri James

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads