finding repeated data sequences in a column

Discussion in 'Python' started by yadin, May 20, 2009.

  1. yadin

    yadin Guest

    Good day everyone!
    I have a a table, from where I can extract a column.
    I wanna go down trough that column made of numbers
    examine undetermined chunks of data and see or detect if that sequence
    of chunk
    of data has been repeated before
    and if it has been repeated detect it by giving it a name in an
    adjacent column.
    Imagine someting like this but made of 1800 numbers... how can I build
    up column 3(category)

    Item price Category
    400 1000028706 A
    400 1000028707 A
    400 1000028708 A
    101 100 C
    101 12 C
    500 1000028706 A
    500 1000028707 A
    500 1000028708 A
    500 1000028709 B
    120 1000028706 A
    120 1000028707 A
    120 1000028708 A
    120 100 C
    120 12 C
     
    yadin, May 20, 2009
    #1
    1. Advertisements

  2. yadin, understanding what you want is probably 10 times harder than
    writing down the code :)
    You can extract it? Or do you want to extract it? Or do you want to
    process it? Etc.

    What do you mean by "undetermined"? What kind of data? Where is this
    data? How is this "chunk" shaped? Are you talking about a string?

    What kind of name? So you just need 2 names, like N and S for New and
    Seen?
    You can use a built-in set data structure to know if you have already
    seen some data, while you scan the records.

    How are such 1800 disposed? Do you mean 1800 columns of 32 bit
    numbers?

    What does A, B and C mean?

    Bye,
    bearophile
     
    bearophileHUGS, May 20, 2009
    #2
    1. Advertisements

  3. yadin

    yadin Guest

    lets say you have this column of numbers

    1000028706
    1000028707
    1000028708
    100
    12
    1000028706
    1000028707
    1000028708
    1000028709
    1000028706
    1000028707
    1000028708
    100
    12
    6

    How can I build up a program that tells me that this sequence
    1000028706
    1000028707
    1000028708
    is repeated somewhere in the column, and how can i know where?

    thank you very much!
     
    yadin, May 20, 2009
    #3
  4. yadin

    Tim Chase Guest

    lets say you have this column of numbers
    In your example, would "100,12" also be output, or do you only
    care about the first find? Do you have a minimum or maximum
    number of repeats you care about? Is a repeated number a
    "sequence of length 1"? Can it be capped to know that if you
    have more than N items in common, you don't have to compare more
    than N items from the entire pair of sequences?

    -tkc
     
    Tim Chase, May 20, 2009
    #4
  5. yadin:
    Can such patterns nest? That is, can you have a repeated pattern made
    of an already seen pattern plus something else?
    If you don't want a complex program, then you may need to specify the
    problem better.

    You may want something like LZ77 or releated (LZ78, etc):
    http://en.wikipedia.org/wiki/LZ77
    This may have a bug:
    http://code.activestate.com/recipes/117226/

    Bye,
    bearophile
     
    bearophileHUGS, May 20, 2009
    #5
  6. yadin

    norseman Guest

    ============================================
    index on column
    Ndx1 is set to index #1
    Ndx2 is set to index #2
    test Ndx1 against Ndx2
    if equal write line number and column content to a file
    (that's two things on one line: 15 1000028706
    283 1000028706 )
    Ndx1 is set to Ndx2
    Ndx2 is set to index #next
    loop to test writing out each duplicate set

    Then use the outfile and index on line number

    In similar manor, check if line current and next line line numbers are
    sequential. If so scan forward to match column content of lower line
    number and check first matched column's line number and next for
    sequential. Print them out if so


    everything in outfile has 1 or more duplicates

    4 aa |--
    5 bb |-- | thus 4/5 match 100/101
    6 cc | |
    .. | |
    100 aa | |--
    101 bb |--
    102 ddd
    103 cc there is a duplicate but not a sequence
    200 ff

    mark duplicate sequences as tested and proceed on through
    seq1 may have more than one other seq in file.
    the progress is from start to finish without looking back
    thus each step forward has fewer lines to test.
    marking already knowns eliminates redundant sequence testing.

    By subseting on pass1 the expensive testing is greatly reduced.
    If you know your subset data won't exceed memory then the "outfile"
    can be held in memory to speed things up considerably.

    Today is: 20090520
    no code

    Steve
     
    norseman, May 20, 2009
    #6
  7. yadin

    yadin Guest

    this is the program...I wrote but is not working
    I have a list of valves, and another of pressures;
    If I am ask to find out which ones are the valves that are using all
    this set of pressures, wanted best pressures
    this is the program i wrote but is not working properly, it suppossed
    to return in the case
    find all the valves that are using pressures 1 "and" 2 "and" 3.
    It returns me A, A2, A35....
    The correct answer supposed to be A and A2...
    if I were asked for pressures 56 and 78 the correct answer supossed to
    be valves G and G2...

    Valves = ['A','A','A','G', 'G', 'G',
    'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
    pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
    valve pressures
    result = []

    bestpress = [1,2,3] ##wanted base pressures
    print bestpress,'len bestpress is' , len(bestpress)

    print len(Valves)
    print len(Valves)
    for j in range(len(Valves)):
    #for i in range(len(bestpress)):
    #for j in range(len(Valves)):
    for i in range(len(bestpress)-2):
    if pressures [j]== bestpress and bestpress [i+1]
    ==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
    result.append(Valves[j])
    #i = i+1
    #j = j+1
    # print i, j, bestpress
    print "common PSVs are", result
     
    yadin, May 21, 2009
    #7
  8. yadin

    Peter Otten Guest



    If I understand your explanation correctly you don't actually care about the
    sequence, you just want all valves that show the pressures specified in
    bestpress. If that's correct your problem becomes easier. You can make a
    lookup table that maps a given pressure to a set of all valves that had it
    and then calculate the intersection.

    from collections import defaultdict

    def intersection(sets):
    return reduce(set.intersection, sets)

    valves = ['A', 'A', 'A', 'G', 'G', 'G', 'C', 'A2', 'A2', 'A2', 'F', 'G2',
    'G2', 'G2', 'A35', 'A345', 'A4']
    pressures = [1, 2, 3, 4235, 56, 78, 12, 1, 2, 3, 445, 45, 56, 78, 1, 23, 7]

    lookup = defaultdict(set)
    for v, p in zip(valves, pressures):
    lookup[p].add(v)

    result = []

    wanted_pressures = [1, 2, 3]
    print wanted_pressures, "-->", intersection(lookup[p] for p in
    wanted_pressures)

    Peter
     
    Peter Otten, May 21, 2009
    #8
  9. yadin

    norseman Guest

    looking at the data that seems correct.
    there are 3 '1's in the list, 1-A, 1-A2, 1-A35
    there are 2 '2's in the list, 2-A, 2-A2
    there are 2 '3's in the list, 3-A, 3-A2
    and so on

    after the the two sets are paired
    indexing on the right yields 1-A,2-A,3-A,1-A2,2-A2,3-A2,7-A4...
    indexing on the left yiels1 1-A,1-A2,1-A35,2-A,2-A2,3-A,3-A2,7-A4...
    and the two 78s would pair with a G and with a G2 (78-G, 78-G2)
    beyond that I'm a bit lost.

    20090521 Steve
     
    norseman, May 22, 2009
    #9
  10. yadin

    Rhodri James Guest

    So if I understand you correctly, you actually want to split your
    data up by valve name to find each valve that has listed pressures of
    1, 2 and 3 in that order? That's a lot simpler, though it has to be
    said that your data isn't in a terribly convenient format.
    Ah, so the target "best" pressure sequence doesn't have to be all of the
    values listed. Hmm. Here goes...

    ====HERE BE CODE====

    from itertools import izip, groupby

    VALVES = ['A','A','A','G', 'G', 'G',
    'C','A2','A2','A2','F','G2',
    'G2','G2','A35','A345','A4'] ##valve names
    PRESSURES = [1,2,3,4235,56,78,
    12, 1, 2, 3, 445, 45,
    56,78, 1, 23,7] ## valve pressures
    TARGET = [1, 2, 3]

    target_len = len(TARGET) # Since we're using this a lot
    result = []

    for valve, p in groupby(izip(VALVES, PRESSURES),
    key=lambda x: x[0]):
    pressures = [x[1] for x in p]
    for i in xrange((len(pressures) - target_len) + 1):
    if pressures[i:i+target_len] == TARGET:
    result.append(valve)
    break

    print "The answer you want is", result

    ====HERE ENDETH THE CODE====

    Not terribly pretty largely because of having to do sublist
    matching, but it should work for most "best pressures".

    The unfamiliar looking stuff are functions from the iterator
    toolkit that make this a lot simpler. If you don't get what's
    going on here, I don't blame you. I just deleted my attempt
    to explain it because it was confusing me :) Reading the
    descriptions of izip and groupby in the standard library
    documentation should make things clearer.
     
    Rhodri James, May 22, 2009
    #10
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.