searching substrings with interpositions

borges2003xx · May 24, 2005

hi everyone.
a problem:
two binary strings, a="0101" b="000011110100";
i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
in this example a in contained cause 000<01>111<01>00 but also
0<0>001111<101>00"
but also <0>0001111<101>00 but also 000<0>1111<01>0<0> etc....
any idea?
Thanx in advance.
Giorgi Borghi

bplumhoff · May 24, 2005

Hello Giorgi,

I suggest to google for "python boyer moore" to get a fast
implementation of a string search algorithm in Python (the Boyer-Moore
algorithm).

One promising hit seems to be:
http://www.egenix.com/files/python/mxTextTools.html

HTH,
Bernd

Kent Johnson · May 24, 2005

hi everyone.
a problem:
two binary strings, a="0101" b="000011110100";
i search a function f(a,b) that gives 1 if a is "contained" in b with
any sub strings interposed.
in this example a in contained cause 000<01>111<01>00 but also
0<0>001111<101>00"
but also <0>0001111<101>00 but also 000<0>1111<01>0<0> etc....
any idea?
Thanx in advance.
Giorgi Borghi

You can do this easily with regular expressions though I guess may be poor with long strings

0, 10)
Put the chars of the search string in groups if you need to know where they were found.

Kent

Claudio Grondi · May 24, 2005

i search a function f(a,b) that gives 1 if a is "contained" in b with

any sub strings interposed.

If I understand it right, it should be something
like this:

def blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB):
intNoOfCharsFound = 0
intPtrToBeginOfSubsectionOfB = 0
intLenA = len(strA)
intLenB = len(strB)
blnStrAinB = False
for chrA in strA:
blnFoundChrA = False
# print chrA
for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
if(chrA == strB[indxToB]):
intNoOfCharsFound += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToBeginOfSubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
#:for
if(intNoOfCharsFound == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s'"%(strA, strB)
break
#:if
if(blnFoundChrA == False):
break
#:if
#:for
if blnStrAinB == False:
print "sequence '%s' not in '%s'"%(strA, strB)
#:if
#:def

strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB)

strA = "010101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB)

Note: code above is intended to help clarify things only,
so that a bunch of examples can be tested.
If you need production quality code, maybe someone else
can provide it.

Is it what you are looking for?

By the way:
it looks to me like a standard problem while
comparing DNA sequences, so I suppose
that there are ready to use fast libraries
providing such kind of function.

Claudio

borges2003xx · May 24, 2005

thanx everyone, is what i need.
As Claudio argues, it's a standard problem of dna sequences
comparation.
the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.
giorgio

Andrew Dalke · May 24, 2005

the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.

Kent Johnson had the right approach, with regular expressions.
For a bit of optimization, use non-greedy groups. That will
give you shorter matches.

Suppose you want no more than 10 bases between terms. You could
use this pattern.

a.{,10}?t.{,10}?c.{,10}?g.{,10}?

If you want to know the location of each of the bases, and
you'll have less than 100 of them (I think that's the limit)
then you can use groups in the regular expression language
.... if limit is None:
.... t = ".*?"
.... else:
.... t = ".{,%d}?" % (limit,)
.... text = []
.... for c in s:
.... text.append("(%s)%s" % (c, t))
.... return "".join(text)
.... .... print m.group(i), m.start(i), m.end(i)
....
a 3 4
t 9 10
c 16 17
g 27 28

Andrew
(e-mail address removed)

Claudio Grondi · May 24, 2005

thanx everyone, is what i need.
As Claudio argues, it's a standard problem of dna sequences
comparation.
the next step of my job is to make limits of lenght of interposed
sequences (if someone can help me in this way i'll apreciate a lot)
thanx everyone.
giorgio

Note: code below is intended to help to clarify things only,
so that a bunch of examples can be tested.
If you need bugfree production quality code, maybe
someone else can provide it.

I have introduced two additional parameter to the function.
If intMaxLenOfGap == 0 the gap size doesn't matter.
lstStartEndOfRangeOfBwithOccurenceOfA returns in its
0,1 elements the begin and end of the range strA was
found in strB.

Hope this does what you mean with
"make limits of lenght of interposed sequences",
does it?

Claudio
P.S. Here the code:

def blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB,
intMaxLenOfGap = 0, lstStartEndOfRangeOfBwithOccurenceOfA = []):

lstStartEndOfRangeOfBwithOccurenceOfA = []
intNoOfCharsFound = 0
intPtrToFirstCharFound = 0
intPtrToBeginOfSubsectionOfB = 0
intLenA = len(strA)
intLenB = len(strB)
blnStrAinB = False
indxToA = 0

while(indxToA < intLenA):
# print chrA
if(indxToA == 0):
blnFoundChrA = False
for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
if(strA[indxToA] == strB[indxToB]):
intNoOfCharsFound += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToFirstCharFound = indxToB
intPtrToBeginOfSubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
#:for
if(intNoOfCharsFound == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s'"%(strA, strB)
break
#:if
if(blnFoundChrA == False):
break
#:if
indxToA += 1
else:
intGapLen = 0
blnFoundChrA = False
for indxToB in range(intPtrToBeginOfSubsectionOfB, intLenB):
if(strA[indxToA] == strB[indxToB]):
intNoOfCharsFound += 1
# print " ",chrA, strB[indxToB], indxToB
intPtrToBeginOfSubsectionOfB = indxToB + 1
blnFoundChrA = True
break
#:if
intGapLen += 1
if(intMaxLenOfGap > 0 and intGapLen > intMaxLenOfGap):
indxToA = 0
blnFoundChrA = False
intPtrToBeginOfSubsectionOfB = intPtrToFirstCharFound + 1
intNoOfCharsFound = 0
break
#:if
#:for
if(intNoOfCharsFound == intLenA):
blnStrAinB = True
print "sequence '%s' found in '%s' at range(%i, %i)"%(strA, strB,
intPtrToFirstCharFound, indxToB+1)
lstStartEndOfRangeOfB.append(intPtrToFirstCharFound)
lstStartEndOfRangeOfB.append(indxToB+1)
break
#:if
if(blnFoundChrA == False):
break
#:if
indxToA += 1
#:if/else
#:while
if blnStrAinB == False:
if(intMaxLenOfGap > 0 and intGapLen > intMaxLenOfGap):
print "sequence '%s' not in '%s' (maybe allowed gap of %i chars was
too small?)"%(strA, strB, intMaxLenOfGap)
else:
print "sequence '%s' not in '%s'"%(strA, strB)
#:if
#:def

print

lstStartEndOfRangeOfB = []
strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB)

lstStartEndOfRangeOfB = []
strA = "0101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB, 2)

strA = "010101"
strB = "000011110100"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB, 6,
lstStartEndOfRangeOfB)

strA = "010101"
strB = "00001111010000001"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB, 4,
lstStartEndOfRangeOfB)

strA = "010101"
strB = "00001111010000001"
blnFindCharSequenceAevenIfSpreadOverEntireStringB(strA, strB, 5,
lstStartEndOfRangeOfB)
print
print "usage of lstStartEndOfRangeOfB parameter passed to function for use
as return value:"
print "sequence '%s' was found in '%s' at range(%i, %i)"%(strA, strB,
lstStartEndOfRangeOfB[0], lstStartEndOfRangeOfB[1])

Andrew Dalke · May 24, 2005

Claudio said:
Note: code below is intended to help to clarify things only,
so that a bunch of examples can be tested.
If you need bugfree production quality code, maybe
someone else can provide it.

Still not tested enough to ensure that it's bug free, but more
concise. Here's one the implements the algorithm directly and
another that uses a regexp. The latter should be the preferred
approach. My intent was that the algorithm implements the given
pattern so they should given identical results.

# Doing the work ourselves
def find_spread_substring(query, target, limit = None):
stack = []
ti = qi = 0
Nq = len(query)
Nt = len(target)
delta = 0

while ti < Nt:
# We have a match
if query[qi] == target[ti]:
stack.append( (qi, ti, delta) )
qi = qi + 1
if qi == Nq:
return [ti for (qi, ti, delta) in stack]
ti = ti + 1
delta = 0
else:
# No match
while 1:
# If we have a partial match, check if we've
# gone over the limit.
if stack:
delta = delta + 1
if limit is not None and delta > limit:
# backtrack, treating it as an invalid match
# (so retry this 'else:' block)
qi, ti, delta = stack.pop()
continue
# No backtracking needed
break
# Advance to check the next character in the target
ti = ti + 1

# Failure
return None

# Using regular expressions
import re
def find_spread_substring2(query, target, limit = None):
if limit is None:
template = "(%s).*?"
else:
template = "(%%s).{,%d}?" % (limit,)
terms = [template % c for c in query]
pattern = "".join(terms)

pat = re.compile(pattern)
m = pat.search(target)
if not m:
return None
return [m.start(i) for i in range(1, len(query)+1)]

def test():
for (q, t, limit, is_valid) in (
("1010", "10001001", None, True),
("1010", "100011", None, False),
("1010", "100010", 3, True),
("1010", "100010", 1, True),
("1010", "1000010", 1, False),
("1010", "01000010", 2, True),
("1010", "01000010", 1, False),
("1010", "0100000", None, False),

):
result = find_spread_substring(q, t, limit)
result2 = find_spread_substring2(q, t, limit)
if result != result2:
raise AssertionError( (result, result2) )

if result is not None:
if limit is not None:
# check that it's a proper subset
for (x, y) in zip(result[:-1], result[1:]):
# +1 because 'limit' is the maximum gap size
if (y-x) > limit+1:
raise AssertionError((q, t, limit, result, x, y))
s = "".join([t for i in result])
if s != q:
raise AssertionError((q, t, limit, result, s))

if result is None and not is_valid:
pass
elif result is not None and is_valid:
pass
else:
raise AssertionError( (q, t, limit, is_valid, result) )

if __name__ == "__main__":
test()
print "All tests passed."

Andrew
(e-mail address removed)

efficiently splitting up strings based on substrings	7	Sep 5, 2009
Data saving in condition of changing reality	0	Apr 29, 2022
Searching through two logfiles in parallel?	6	Jan 7, 2013
Html data exchange help	0	Jan 2, 2020
Substrings and so on	27	Jan 26, 2010
Help with my responsive home page	2	Dec 14, 2022
Working on mobile css menu with plenty of frustration!	2	Dec 29, 2022
First steps in setting up VSCode to work with Python.	2	Mar 13, 2023

searching substrings with interpositions

borges2003xx

bplumhoff

Kent Johnson

Claudio Grondi

borges2003xx

Andrew Dalke

Claudio Grondi

Andrew Dalke

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads