Wildcard String Comparisons: Set Pattern to a Wildcard Source

chaoticcranium · Oct 5, 2010

So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

MRAB · Oct 5, 2010

So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Stand back, I'm going to try regex.

Both "A" and "R" in the variable sequence should match "A" in the
primer sequence, so "A" in the primer sequence should be replaced by
the character set "[AR]". The other bases should be replaced similarly.

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?

chaoticcranium · Oct 5, 2010

So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

Click to expand...

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

Click to expand...

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

Click to expand...

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

Click to expand...

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

Click to expand...

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Click to expand...

Stand back, I'm going to try regex.

Both "A" and "R" in the variable sequence should match "A" in the
primer sequence, so "A" in the primer sequence should be replaced by
the character set "[AR]". The other bases should be replaced similarly.

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?

Thank you for your response, MRAB.

That's a rather clever way to do this sort of matching, but I actually
forgot one other crucial thing in my problem description (and I'm
hitting myself on the head for forgetting it!) - I need to know at
what position in my gene the primer was found.

As far as I know (and I'm a regex n00b, so please tell me if I'm
wrong), you can't use string's find() on a regex and regex's match()
does not return a position in the regex. I understand there are
elements of in regular expressions that expand to variable numbers of
characters so a "position number" in a regular expression is often a
meaningless concept. Here, however, my regular expression has a 1 to 1
correspondence since each degenerate base should occupy only one
wildcard slot. In this particular case, a position number is
meaningful AND I need to know it for my program.

Now. . .is there anything we can do about that?

Tim Chase · Oct 5, 2010

So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

Click to expand...

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

Click to expand...

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

Click to expand...

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

Click to expand...

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

Click to expand...

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Click to expand...

Stand back, I'm going to try regex.

Both "A" and "R" in the variable sequence should match "A" in the
primer sequence, so "A" in the primer sequence should be replaced by
the character set "[AR]". The other bases should be replaced similarly.

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?

Click to expand...

Thank you for your response, MRAB.

That's a rather clever way to do this sort of matching, but I actually
forgot one other crucial thing in my problem description (and I'm
hitting myself on the head for forgetting it!) - I need to know at
what position in my gene the primer was found.

If you use the primer_pattern.search() method (which searches
starting at all offsets) instead of .match() (which only
searches from the beginning), it should return a match object
that has a .start() method to let you know the offset:

m = primer_pattern.search(my_data)
if m is None:
print "Not found"
else:
print "Found at %i" % m.start()

-tkc

MRAB · Oct 5, 2010

So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

Click to expand...

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

Click to expand...

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

Click to expand...

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

Click to expand...

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...

Click to expand...

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Click to expand...

Stand back, I'm going to try regex.

Both "A" and "R" in the variable sequence should match "A" in the
primer sequence, so "A" in the primer sequence should be replaced by
the character set "[AR]". The other bases should be replaced similarly.

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?

Click to expand...

Thank you for your response, MRAB.

That's a rather clever way to do this sort of matching, but I actually
forgot one other crucial thing in my problem description (and I'm
hitting myself on the head for forgetting it!) - I need to know at
what position in my gene the primer was found.

As far as I know (and I'm a regex n00b, so please tell me if I'm
wrong), you can't use string's find() on a regex and regex's match()
does not return a position in the regex. I understand there are
elements of in regular expressions that expand to variable numbers of
characters so a "position number" in a regular expression is often a
meaningless concept. Here, however, my regular expression has a 1 to 1
correspondence since each degenerate base should occupy only one
wildcard slot. In this particular case, a position number is
meaningful AND I need to know it for my program.

Now. . .is there anything we can do about that?

A successful search returns a match object. That has methods including
..start(), which returns the start position of the match. It's all in
the documentation.

MRAB · Oct 5, 2010

[snip]

Additional: I forgot to mention that you should understand the
difference between the .match() and .search() mthods. .match() is
anchored to the starting position, so you'll want to use .search()
instead.

Arnaud Delobelle · Oct 5, 2010

[email protected] said:
So, I have a rather tricky string comparison problem: I want to search
for a set pattern in a variable source.

To give you the context, I am searching for set primer sequences
within a variable gene sequence. In addition to the non-degenerate A/G/
C/T, the gene sequence could have degenerate bases that could encode
for more than one base (for example, R means A or G, N means A or G or
C or T). One brute force way to do it would be to generate every
single non-degenerate sequence the degenerate sequence could mean and
do my comparison with all of those, but that would of course be very
space and time inefficient.

For the sake of simplicity, let's say I replace each degenerate base
with a single wildcard character "?". We can do this because there are
so many more non-degenerate bases that the probability of a degenerate
mismatch is low if the nondegenerates in a primer match up.

So, my goal is to search for a small, set pattern (the primer) inside
a large source with single wildcard characters (my degenerate gene).

The first thing that comes to my mind are regular expressions, but I'm
rather n00bish when it comes to using them and I've only been able to
find help online where the smaller search pattern has wildcards and
the source is constant, such as here:
http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html

Of course, that's the reverse of my situation and the proposed
solutions there won't work for me. So, could you help me out, oh great
Python masters? *bows*

Knuth-Morris-Pratt to the rescue! This is a well known algorithm for
searching for occurences of a substring in a text. Just change the
search algorithm slightly so that it takes wildcards in the text to
search into account. Here is a quick-and-dirty implementation in
Python:

--------------------------------------------------
def build_table(word):
"""build the Knuth-Morris-Pratt partial match table"""
table = [-1, 0]
wpos = 0
while len(table) < len(word):
if word[len(table) - 1] == word[wpos]:
wpos += 1
table.append(wpos)
elif wpos > 0:
wpos = table[wpos]
else:
table.append(0)
return table

def search(word, text, wildcard="?"):
"""Perform Knuth-Morris-Pratt search with wildcard for text"""
table = build_table(word)
ti, wi = 0, 0
while ti <= len(text) - len(word):
c = text[ti + wi]
if c == wildcard or c == word[wi]:
if wi + 1 == len(word):
return ti
wi += 1
else:
ti += wi - table[wi]
if wi:
wi = table[wi]
return None

examples = [
("spam", "sa?ps?a?mp"),
("ababc", "??aba?ba??b??a")
]

for word, text in examples:
i = search(word, text)
found = text[i:i+len(word)]
print "found [%s] in [%s] at %s: %s" % (word, text, i, found)
--------------------------------------------------

Output:

found [spam] in [sa?ps?a?mp] at 4: s?a?
found [ababc] in [??aba?ba??b??a] at 7: a??b?

You could even change the line:

if c == wildcard or c == word[wi]

to something that checks that e.g. when c == "N", word[wi] == "A" or
"G", etc...

Knuth-Morris-Pratt is such that the main loop in "search()" is
guaranteed to run no more than 2*n times, where n is the length of
"text".

OTOH Python is not the best language to implement this. If it was me,
I'd do it in C. It wouldn't take long at all and it would probably be a
lot faster (maybe 100x, that's a pure guess).

Warning: I didn't really check my code more than the provided
examples...

HTH

chaoticcranium · Oct 5, 2010

Ah, very good, it's working perfectly now. Thank you so much for your
help - regular expressions are very powerful!

Trying to use clangd with VSCodium, CMake_World_COMPILER not set	1	Nov 4, 2024
How to read a file as binary or hex "string" so that I can do regex search?	3	Dec 18, 2024
Counting elements in a list wildcard	14	Apr 24, 2006
Programming challenge: wildcard exclusion in cartesian products	78	Mar 16, 2006
(silly?) speed comparisons	1	Jul 8, 2008
Copy string from 2D array to a 1D array in C	1	Nov 1, 2023
std::set: gratuitous comparisons?	11	Mar 14, 2008
Question about Source Control	3	Mar 17, 2014

Wildcard String Comparisons: Set Pattern to a Wildcard Source

chaoticcranium

MRAB

chaoticcranium

Tim Chase

MRAB

MRAB

Arnaud Delobelle

chaoticcranium

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads