Wildcard String Comparisons: Set Pattern to a Wildcard Source

Discussion in 'Python' started by chaoticcranium@gmail.com, Oct 5, 2010.

  1. Guest

    So, I have a rather tricky string comparison problem: I want to search
    for a set pattern in a variable source.

    To give you the context, I am searching for set primer sequences
    within a variable gene sequence. In addition to the non-degenerate A/G/
    C/T, the gene sequence could have degenerate bases that could encode
    for more than one base (for example, R means A or G, N means A or G or
    C or T). One brute force way to do it would be to generate every
    single non-degenerate sequence the degenerate sequence could mean and
    do my comparison with all of those, but that would of course be very
    space and time inefficient.

    For the sake of simplicity, let's say I replace each degenerate base
    with a single wildcard character "?". We can do this because there are
    so many more non-degenerate bases that the probability of a degenerate
    mismatch is low if the nondegenerates in a primer match up.

    So, my goal is to search for a small, set pattern (the primer) inside
    a large source with single wildcard characters (my degenerate gene).

    The first thing that comes to my mind are regular expressions, but I'm
    rather n00bish when it comes to using them and I've only been able to
    find help online where the smaller search pattern has wildcards and
    the source is constant, such as here:
    http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html

    Of course, that's the reverse of my situation and the proposed
    solutions there won't work for me. So, could you help me out, oh great
    Python masters? *bows*
     
    , Oct 5, 2010
    #1
    1. Advertising

  2. MRAB Guest

    On 05/10/2010 20:03, wrote:
    > So, I have a rather tricky string comparison problem: I want to search
    > for a set pattern in a variable source.
    >
    > To give you the context, I am searching for set primer sequences
    > within a variable gene sequence. In addition to the non-degenerate A/G/
    > C/T, the gene sequence could have degenerate bases that could encode
    > for more than one base (for example, R means A or G, N means A or G or
    > C or T). One brute force way to do it would be to generate every
    > single non-degenerate sequence the degenerate sequence could mean and
    > do my comparison with all of those, but that would of course be very
    > space and time inefficient.
    >
    > For the sake of simplicity, let's say I replace each degenerate base
    > with a single wildcard character "?". We can do this because there are
    > so many more non-degenerate bases that the probability of a degenerate
    > mismatch is low if the nondegenerates in a primer match up.
    >
    > So, my goal is to search for a small, set pattern (the primer) inside
    > a large source with single wildcard characters (my degenerate gene).
    >
    > The first thing that comes to my mind are regular expressions, but I'm
    > rather n00bish when it comes to using them and I've only been able to
    > find help online where the smaller search pattern has wildcards and
    > the source is constant, such as here:
    > http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html
    >
    > Of course, that's the reverse of my situation and the proposed
    > solutions there won't work for me. So, could you help me out, oh great
    > Python masters? *bows*


    Stand back, I'm going to try regex. :)

    Both "A" and "R" in the variable sequence should match "A" in the
    primer sequence, so "A" in the primer sequence should be replaced by
    the character set "[AR]". The other bases should be replaced similarly.

    Use a simple dict lookup:

    wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

    and create the regex for the primer sequence:

    primer_pattern = re.compile("".join(wildcards[c] for c in primer))

    Would that work?
     
    MRAB, Oct 5, 2010
    #2
    1. Advertising

  3. Guest

    On Oct 5, 3:38 pm, MRAB <> wrote:
    > On 05/10/2010 20:03, wrote:
    >
    >
    >
    > > So, I have a rather tricky string comparison problem: I want to search
    > > for a set pattern in a variable source.

    >
    > > To give you the context, I am searching for set primer sequences
    > > within a variable gene sequence. In addition to the non-degenerate A/G/
    > > C/T, the gene sequence could have degenerate bases that could encode
    > > for more than one base (for example, R means A or G, N means A or G or
    > > C or T). One brute force way to do it would be to generate every
    > > single non-degenerate sequence the degenerate sequence could mean and
    > > do my comparison with all of those, but that would of course be very
    > > space and time inefficient.

    >
    > > For the sake of simplicity, let's say I replace each degenerate base
    > > with a single wildcard character "?". We can do this because there are
    > > so many more non-degenerate bases that the probability of a degenerate
    > > mismatch is low if the nondegenerates in a primer match up.

    >
    > > So, my goal is to search for a small, set pattern (the primer) inside
    > > a large source with single wildcard characters (my degenerate gene).

    >
    > > The first thing that comes to my mind are regular expressions, but I'm
    > > rather n00bish when it comes to using them and I've only been able to
    > > find help online where the smaller search pattern has wildcards and
    > > the source is constant, such as here:
    > >http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

    >
    > > Of course, that's the reverse of my situation and the proposed
    > > solutions there won't work for me. So, could you help me out, oh great
    > > Python masters? *bows*

    >
    > Stand back, I'm going to try regex. :)
    >
    > Both "A" and "R" in the variable sequence should match "A" in the
    > primer sequence, so "A" in the primer sequence should be replaced by
    > the character set "[AR]". The other bases should be replaced similarly.
    >
    > Use a simple dict lookup:
    >
    > wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
    >
    > and create the regex for the primer sequence:
    >
    > primer_pattern = re.compile("".join(wildcards[c] for c in primer))
    >
    > Would that work?



    Thank you for your response, MRAB.

    That's a rather clever way to do this sort of matching, but I actually
    forgot one other crucial thing in my problem description (and I'm
    hitting myself on the head for forgetting it!) - I need to know at
    what position in my gene the primer was found.

    As far as I know (and I'm a regex n00b, so please tell me if I'm
    wrong), you can't use string's find() on a regex and regex's match()
    does not return a position in the regex. I understand there are
    elements of in regular expressions that expand to variable numbers of
    characters so a "position number" in a regular expression is often a
    meaningless concept. Here, however, my regular expression has a 1 to 1
    correspondence since each degenerate base should occupy only one
    wildcard slot. In this particular case, a position number is
    meaningful AND I need to know it for my program.

    Now. . .is there anything we can do about that?
     
    , Oct 5, 2010
    #3
  4. Tim Chase Guest

    On 10/05/10 15:06, wrote:
    > On Oct 5, 3:38 pm, MRAB<> wrote:
    >> On 05/10/2010 20:03, wrote:
    >>
    >>
    >>
    >>> So, I have a rather tricky string comparison problem: I want to search
    >>> for a set pattern in a variable source.

    >>
    >>> To give you the context, I am searching for set primer sequences
    >>> within a variable gene sequence. In addition to the non-degenerate A/G/
    >>> C/T, the gene sequence could have degenerate bases that could encode
    >>> for more than one base (for example, R means A or G, N means A or G or
    >>> C or T). One brute force way to do it would be to generate every
    >>> single non-degenerate sequence the degenerate sequence could mean and
    >>> do my comparison with all of those, but that would of course be very
    >>> space and time inefficient.

    >>
    >>> For the sake of simplicity, let's say I replace each degenerate base
    >>> with a single wildcard character "?". We can do this because there are
    >>> so many more non-degenerate bases that the probability of a degenerate
    >>> mismatch is low if the nondegenerates in a primer match up.

    >>
    >>> So, my goal is to search for a small, set pattern (the primer) inside
    >>> a large source with single wildcard characters (my degenerate gene).

    >>
    >>> The first thing that comes to my mind are regular expressions, but I'm
    >>> rather n00bish when it comes to using them and I've only been able to
    >>> find help online where the smaller search pattern has wildcards and
    >>> the source is constant, such as here:
    >>> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

    >>
    >>> Of course, that's the reverse of my situation and the proposed
    >>> solutions there won't work for me. So, could you help me out, oh great
    >>> Python masters? *bows*

    >>
    >> Stand back, I'm going to try regex. :)
    >>
    >> Both "A" and "R" in the variable sequence should match "A" in the
    >> primer sequence, so "A" in the primer sequence should be replaced by
    >> the character set "[AR]". The other bases should be replaced similarly.
    >>
    >> Use a simple dict lookup:
    >>
    >> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
    >>
    >> and create the regex for the primer sequence:
    >>
    >> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
    >>
    >> Would that work?

    >
    >
    > Thank you for your response, MRAB.
    >
    > That's a rather clever way to do this sort of matching, but I actually
    > forgot one other crucial thing in my problem description (and I'm
    > hitting myself on the head for forgetting it!) - I need to know at
    > what position in my gene the primer was found.


    If you use the primer_pattern.search() method (which searches
    starting at all offsets) instead of .match() (which only
    searches from the beginning), it should return a match object
    that has a .start() method to let you know the offset:

    m = primer_pattern.search(my_data)
    if m is None:
    print "Not found"
    else:
    print "Found at %i" % m.start()

    -tkc
     
    Tim Chase, Oct 5, 2010
    #4
  5. MRAB Guest

    On 05/10/2010 21:06, wrote:
    > On Oct 5, 3:38 pm, MRAB<> wrote:
    >> On 05/10/2010 20:03, wrote:
    >>
    >>
    >>
    >>> So, I have a rather tricky string comparison problem: I want to search
    >>> for a set pattern in a variable source.

    >>
    >>> To give you the context, I am searching for set primer sequences
    >>> within a variable gene sequence. In addition to the non-degenerate A/G/
    >>> C/T, the gene sequence could have degenerate bases that could encode
    >>> for more than one base (for example, R means A or G, N means A or G or
    >>> C or T). One brute force way to do it would be to generate every
    >>> single non-degenerate sequence the degenerate sequence could mean and
    >>> do my comparison with all of those, but that would of course be very
    >>> space and time inefficient.

    >>
    >>> For the sake of simplicity, let's say I replace each degenerate base
    >>> with a single wildcard character "?". We can do this because there are
    >>> so many more non-degenerate bases that the probability of a degenerate
    >>> mismatch is low if the nondegenerates in a primer match up.

    >>
    >>> So, my goal is to search for a small, set pattern (the primer) inside
    >>> a large source with single wildcard characters (my degenerate gene).

    >>
    >>> The first thing that comes to my mind are regular expressions, but I'm
    >>> rather n00bish when it comes to using them and I've only been able to
    >>> find help online where the smaller search pattern has wildcards and
    >>> the source is constant, such as here:
    >>> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

    >>
    >>> Of course, that's the reverse of my situation and the proposed
    >>> solutions there won't work for me. So, could you help me out, oh great
    >>> Python masters? *bows*

    >>
    >> Stand back, I'm going to try regex. :)
    >>
    >> Both "A" and "R" in the variable sequence should match "A" in the
    >> primer sequence, so "A" in the primer sequence should be replaced by
    >> the character set "[AR]". The other bases should be replaced similarly.
    >>
    >> Use a simple dict lookup:
    >>
    >> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
    >>
    >> and create the regex for the primer sequence:
    >>
    >> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
    >>
    >> Would that work?

    >
    >
    > Thank you for your response, MRAB.
    >
    > That's a rather clever way to do this sort of matching, but I actually
    > forgot one other crucial thing in my problem description (and I'm
    > hitting myself on the head for forgetting it!) - I need to know at
    > what position in my gene the primer was found.
    >
    > As far as I know (and I'm a regex n00b, so please tell me if I'm
    > wrong), you can't use string's find() on a regex and regex's match()
    > does not return a position in the regex. I understand there are
    > elements of in regular expressions that expand to variable numbers of
    > characters so a "position number" in a regular expression is often a
    > meaningless concept. Here, however, my regular expression has a 1 to 1
    > correspondence since each degenerate base should occupy only one
    > wildcard slot. In this particular case, a position number is
    > meaningful AND I need to know it for my program.
    >
    > Now. . .is there anything we can do about that?


    A successful search returns a match object. That has methods including
    ..start(), which returns the start position of the match. It's all in
    the documentation.
     
    MRAB, Oct 5, 2010
    #5
  6. MRAB Guest

    [snip]

    Additional: I forgot to mention that you should understand the
    difference between the .match() and .search() mthods. .match() is
    anchored to the starting position, so you'll want to use .search()
    instead.
     
    MRAB, Oct 5, 2010
    #6
  7. "" <> writes:

    > So, I have a rather tricky string comparison problem: I want to search
    > for a set pattern in a variable source.
    >
    > To give you the context, I am searching for set primer sequences
    > within a variable gene sequence. In addition to the non-degenerate A/G/
    > C/T, the gene sequence could have degenerate bases that could encode
    > for more than one base (for example, R means A or G, N means A or G or
    > C or T). One brute force way to do it would be to generate every
    > single non-degenerate sequence the degenerate sequence could mean and
    > do my comparison with all of those, but that would of course be very
    > space and time inefficient.
    >
    > For the sake of simplicity, let's say I replace each degenerate base
    > with a single wildcard character "?". We can do this because there are
    > so many more non-degenerate bases that the probability of a degenerate
    > mismatch is low if the nondegenerates in a primer match up.
    >
    > So, my goal is to search for a small, set pattern (the primer) inside
    > a large source with single wildcard characters (my degenerate gene).
    >
    > The first thing that comes to my mind are regular expressions, but I'm
    > rather n00bish when it comes to using them and I've only been able to
    > find help online where the smaller search pattern has wildcards and
    > the source is constant, such as here:
    > http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html
    >
    > Of course, that's the reverse of my situation and the proposed
    > solutions there won't work for me. So, could you help me out, oh great
    > Python masters? *bows*


    Knuth-Morris-Pratt to the rescue! This is a well known algorithm for
    searching for occurences of a substring in a text. Just change the
    search algorithm slightly so that it takes wildcards in the text to
    search into account. Here is a quick-and-dirty implementation in
    Python:

    --------------------------------------------------
    def build_table(word):
    """build the Knuth-Morris-Pratt partial match table"""
    table = [-1, 0]
    wpos = 0
    while len(table) < len(word):
    if word[len(table) - 1] == word[wpos]:
    wpos += 1
    table.append(wpos)
    elif wpos > 0:
    wpos = table[wpos]
    else:
    table.append(0)
    return table


    def search(word, text, wildcard="?"):
    """Perform Knuth-Morris-Pratt search with wildcard for text"""
    table = build_table(word)
    ti, wi = 0, 0
    while ti <= len(text) - len(word):
    c = text[ti + wi]
    if c == wildcard or c == word[wi]:
    if wi + 1 == len(word):
    return ti
    wi += 1
    else:
    ti += wi - table[wi]
    if wi:
    wi = table[wi]
    return None

    examples = [
    ("spam", "sa?ps?a?mp"),
    ("ababc", "??aba?ba??b??a")
    ]

    for word, text in examples:
    i = search(word, text)
    found = text[i:i+len(word)]
    print "found [%s] in [%s] at %s: %s" % (word, text, i, found)
    --------------------------------------------------

    Output:

    found [spam] in [sa?ps?a?mp] at 4: s?a?
    found [ababc] in [??aba?ba??b??a] at 7: a??b?


    You could even change the line:

    if c == wildcard or c == word[wi]

    to something that checks that e.g. when c == "N", word[wi] == "A" or
    "G", etc...

    Knuth-Morris-Pratt is such that the main loop in "search()" is
    guaranteed to run no more than 2*n times, where n is the length of
    "text".

    OTOH Python is not the best language to implement this. If it was me,
    I'd do it in C. It wouldn't take long at all and it would probably be a
    lot faster (maybe 100x, that's a pure guess).

    Warning: I didn't really check my code more than the provided
    examples...

    HTH

    --
    Arnaud
     
    Arnaud Delobelle, Oct 5, 2010
    #7
  8. Guest

    Ah, very good, it's working perfectly now. Thank you so much for your
    help - regular expressions are very powerful!


    On Oct 5, 4:26 pm, MRAB <> wrote:
    > [snip]
    >
    > Additional: I forgot to mention that you should understand the
    > difference between the .match() and .search() mthods. .match() is
    > anchored to the starting position, so you'll want to use .search()
    > instead.
     
    , Oct 5, 2010
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    1
    Views:
    438
    Mike Schilling
    Jun 27, 2005
  2. Replies:
    17
    Views:
    1,890
    Chris Uppal
    Nov 16, 2005
  3. Horn

    string comparisons

    Horn, Oct 3, 2003, in forum: C++
    Replies:
    7
    Views:
    387
    Ashish
    Oct 6, 2003
  4. Paul Brettschneider

    std::set: gratuitous comparisons?

    Paul Brettschneider, Mar 14, 2008, in forum: C++
    Replies:
    11
    Views:
    510
    James Kanze
    Mar 18, 2008
  5. ChrisC
    Replies:
    4
    Views:
    181
    ChrisC
    Jun 25, 2010
Loading...

Share This Page