Regular Expression - Matching Multiples of 3 Characters exactly.

Discussion in 'Python' started by blaine, Apr 28, 2008.

  1. blaine

    blaine Guest

    Hey everyone,
    For the regular expression gurus...

    I'm trying to write a string matching algorithm for genomic
    sequences. I'm pulling out Genes from a large genomic pattern, with
    certain start and stop codons on either side. This is simple
    enough... for example:

    start = AUG stop=AGG
    BBBBBBAUGWWWWWWAGGBBBBBB

    So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
    This works great with my current regular expression.

    The problem, however, is that codons come in sets of 3 bases. So
    there are actually three different 'frames' I could be using. For
    example:
    ABCDEFGHIJ
    I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.

    So finally, my question. How can I represent this in a regular
    expression? :) This is what I'd like to do:
    (Find all groups of any three characters) (Find a start codon) (find
    any other codons) (Find an end codon)

    Is this possible? It seems that I'd want to do something like this: (\w
    \w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
    three non-whitespace characters, followed by AUG \s AGG, and then
    anything else. I hope I am making sense. Obviously, however, this
    will make sure that ANY set of three characters exist before a start
    codon. Is there a way to match exactly, to say something like 'Find
    all sets of three, then AUG and AGG, etc.'. This way, I could scan
    for genes, remove the first letter, scan for more genes, remove the
    first letter again, and scan for more genes. This would
    hypothetically yield different genes, since the frame would be
    shifted.

    This might be a lot of information... I appreciate any insight. Thank
    you!
    Blaine
    blaine, Apr 28, 2008
    #1
    1. Advertising

  2. blaine

    Guest

    On Apr 27, 8:31 pm, blaine <> wrote:
    > Hey everyone,
    >   For the regular expression gurus...
    >
    > I'm trying to write a string matching algorithm for genomic
    > sequences.  I'm pulling out Genes from a large genomic pattern, with
    > certain start and stop codons on either side.  This is simple
    > enough... for example:
    >
    > start = AUG stop=AGG
    > BBBBBBAUGWWWWWWAGGBBBBBB
    >
    > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
    > This works great with my current regular expression.
    >
    > The problem, however, is that codons come in sets of 3 bases.  So
    > there are actually three different 'frames' I could be using.  For
    > example:
    > ABCDEFGHIJ
    > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
    >
    > So finally, my question.  How can I represent this in a regular
    > expression? :)  This is what I'd like to do:
    > (Find all groups of any three characters) (Find a start codon) (find
    > any other codons) (Find an end codon)
    >
    > Is this possible? It seems that I'd want to do something like this: (\w
    > \w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
    > three non-whitespace characters, followed by AUG \s AGG, and then
    > anything else.  I hope I am making sense.  Obviously, however, this
    > will make sure that ANY set of three characters exist before a start
    > codon.  Is there a way to match exactly, to say something like 'Find
    > all sets of three, then AUG and AGG, etc.'.  This way, I could scan
    > for genes, remove the first letter, scan for more genes, remove the
    > first letter again, and scan for more genes.  This would
    > hypothetically yield different genes, since the frame would be
    > shifted.
    >
    > This might be a lot of information... I appreciate any insight.  Thank
    > you!
    > Blaine


    Here's one idea (untested):

    s= { }
    for x in range( len( genes )- 3 ):
    s[ x ]= genes[ x: x+ 3 ]

    You might like Python's 'string slicing' feature.
    , Apr 28, 2008
    #2
    1. Advertising

  3. blaine

    blaine Guest

    On Apr 27, 10:24 pm, wrote:
    > On Apr 27, 8:31 pm, blaine <> wrote:
    >
    >
    >
    > > Hey everyone,
    > > For the regular expression gurus...

    >
    > > I'm trying to write a string matching algorithm for genomic
    > > sequences. I'm pulling out Genes from a large genomic pattern, with
    > > certain start and stop codons on either side. This is simple
    > > enough... for example:

    >
    > > start = AUG stop=AGG
    > > BBBBBBAUGWWWWWWAGGBBBBBB

    >
    > > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
    > > This works great with my current regular expression.

    >
    > > The problem, however, is that codons come in sets of 3 bases. So
    > > there are actually three different 'frames' I could be using. For
    > > example:
    > > ABCDEFGHIJ
    > > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.

    >
    > > So finally, my question. How can I represent this in a regular
    > > expression? :) This is what I'd like to do:
    > > (Find all groups of any three characters) (Find a start codon) (find
    > > any other codons) (Find an end codon)

    >
    > > Is this possible? It seems that I'd want to do something like this: (\w
    > > \w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
    > > three non-whitespace characters, followed by AUG \s AGG, and then
    > > anything else. I hope I am making sense. Obviously, however, this
    > > will make sure that ANY set of three characters exist before a start
    > > codon. Is there a way to match exactly, to say something like 'Find
    > > all sets of three, then AUG and AGG, etc.'. This way, I could scan
    > > for genes, remove the first letter, scan for more genes, remove the
    > > first letter again, and scan for more genes. This would
    > > hypothetically yield different genes, since the frame would be
    > > shifted.

    >
    > > This might be a lot of information... I appreciate any insight. Thank
    > > you!
    > > Blaine

    >
    > Here's one idea (untested):
    >
    > s= { }
    > for x in range( len( genes )- 3 ):
    > s[ x ]= genes[ x: x+ 3 ]
    >
    > You might like Python's 'string slicing' feature.


    True - I could try something like that. In fact I have a 'codon'
    function that does exactly that. The problem is that I then have to
    go back through and loop over the list. I'm trying to use Regular
    Expressions so that my processing is quicker. Complexity is key since
    this genomic string is pretty large.

    Thanks for the suggestion though!
    blaine, Apr 28, 2008
    #3
  4. blaine

    Roy Smith Guest

    In article
    <>,
    blaine <> wrote:

    > Hey everyone,
    > For the regular expression gurus...
    >
    > I'm trying to write a string matching algorithm for genomic
    > sequences.


    I strongly suggest you stop trying to reinvent the wheel and read up on the
    Biopython project (http://biopython.org/wiki/Main_Page).
    Roy Smith, Apr 28, 2008
    #4
  5. blaine

    Jeff Guest

    Regular expressions for that sort of thing can get *really* big. The
    most efficient way would be to programmatically compose the regular
    expression to be as exact as possible.

    import re

    def permutation(lst):
    """"
    From http://labix.org/snippets/permutations/. Computes permutations
    of a
    list iteratively.
    """
    queue = [-1]
    lenlst = len(lst)
    while queue:
    i = queue[-1]+1
    if i == lenlst:
    queue.pop()
    elif i not in queue:
    queue[-1] = i
    if len(queue) == lenlst:
    yield [lst[j] for j in queue]
    queue.append(-1)
    else:
    queue[-1] = i

    def segment_re(a, b):
    """
    Creates grouped regular expression pattern to match text between all
    possibilies of three-letter sets a and b.
    """
    def pattern(n):
    return "(%s)" % '|'.join( [''.join(grp) for grp in permutation(n)] )

    return re.compile( r'%s(\w+?)%s' % (pattern(a), pattern(b)) )

    print segment_re(["a", "b", "c"], ["d", "e", "f"])

    You could extend segment_re to accept an integer to limit the (\w+?)
    to a definite quantifier. This will grow the compiled expression in
    memory but make matching faster (such as \w{3,n} to match from 3 to n
    characters).

    See http://artfulcode.net/articles/optimizing-regular-expressions/ for
    specifics on optimizing regexes.
    Jeff, Apr 28, 2008
    #5
  6. blaine wrote:
    > Hey everyone,
    > For the regular expression gurus...
    >
    > I'm trying to write a string matching algorithm for genomic
    > sequences. I'm pulling out Genes from a large genomic pattern, with
    > certain start and stop codons on either side. This is simple
    > enough... for example:
    >
    > start = AUG stop=AGG
    > BBBBBBAUGWWWWWWAGGBBBBBB
    >
    > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
    > This works great with my current regular expression.
    >
    > The problem, however, is that codons come in sets of 3 bases. So
    > there are actually three different 'frames' I could be using. For
    > example:
    > ABCDEFGHIJ
    > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.
    >
    > So finally, my question. How can I represent this in a regular
    > expression? :) This is what I'd like to do:
    > (Find all groups of any three characters) (Find a start codon) (find
    > any other codons) (Find an end codon)
    >
    > Is this possible? It seems that I'd want to do something like this: (\w
    > \w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
    > three non-whitespace characters, followed by AUG \s AGG, and then
    > anything else. I hope I am making sense. Obviously, however, this
    > will make sure that ANY set of three characters exist before a start
    > codon. Is there a way to match exactly, to say something like 'Find
    > all sets of three, then AUG and AGG, etc.'. This way, I could scan
    > for genes, remove the first letter, scan for more genes, remove the
    > first letter again, and scan for more genes. This would
    > hypothetically yield different genes, since the frame would be
    > shifted.
    >


    As an alternative - if you do need speed - have a look at

    http://www.egenix.com/products/python/mxBase/mxTextTools/


    Helmut.
    --
    Helmut Jarausch

    Lehrstuhl fuer Numerische Mathematik
    RWTH - Aachen University
    D 52056 Aachen, Germany
    Helmut Jarausch, Apr 28, 2008
    #6
  7. blaine

    blaine Guest

    On Apr 28, 6:30 am, Nick Craig-Wood <> wrote:
    > blaine <> wrote:
    > > I'm trying to write a string matching algorithm for genomic
    > > sequences. I'm pulling out Genes from a large genomic pattern, with
    > > certain start and stop codons on either side. This is simple
    > > enough... for example:

    >
    > > start = AUG stop=AGG
    > > BBBBBBAUGWWWWWWAGGBBBBBB

    >
    > > So I obviously want to pull out AUGWWWWWWAGG (and all other matches).
    > > This works great with my current regular expression.

    >
    > > The problem, however, is that codons come in sets of 3 bases. So
    > > there are actually three different 'frames' I could be using. For
    > > example:
    > > ABCDEFGHIJ
    > > I could have ABC DEF GHI or BCD EFG HIJ or CDE FGH IJx.... etc.

    >
    > > So finally, my question. How can I represent this in a regular
    > > expression? :) This is what I'd like to do:
    > > (Find all groups of any three characters) (Find a start codon) (find
    > > any other codons) (Find an end codon)

    >
    > > Is this possible? It seems that I'd want to do something like this: (\w
    > > \w\w)+(AUG)(\s)(AGG)(\s)* - where \w\w\w matches EXACTLY all sets of
    > > three non-whitespace characters, followed by AUG \s AGG, and then
    > > anything else.

    >
    > I'm not sure what the \s are doing in there - there doesn't appear to
    > be any whitespace in your examples.
    >
    > > I hope I am making sense. Obviously, however, this will make sure
    > > that ANY set of three characters exist before a start codon. Is
    > > there a way to match exactly, to say something like 'Find all sets
    > > of three, then AUG and AGG, etc.'.

    >
    > I think you want
    >
    > ^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)
    >
    > which will match up 0 or more triples, match AUG match 0 or more triples
    > then AGG. The ? makes it a minimum match otherwise you'll match more
    > than you expect if there are two AUG...AGG sequences in a given genome.
    >
    > >>> import re
    > >>> m=re.compile(r"^(\w\w\w)*(AUG)((\w\w\w)*?)(AGG)")
    > >>> m.search("BBBBBBAUGWWWWWWAGGBBBBBB").groups()

    > ('BBB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
    > >>> m.search("BBBQBBBAUGWWWWWWAGGBBBBBB")
    > >>> m.search("BBBQQBBBAUGWWWWWWAGGBBBBBB")
    > >>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB")

    > <_sre.SRE_Match object at 0xb7de33e0>
    > >>> m.search("BBBQQBBQBAUGWWWWWWAGGBBBBBB").groups()

    > ('BQB', 'AUG', 'WWWWWW', 'WWW', 'AGG')
    > >>> m.search("BBBQQBBQBAUGWQWWWWWAGGBBBBBB")
    > >>> m.search("BBBQQBBQBAUGWWWWQWWAGGBBBBBB")
    > >>> m.search("BBBQQBBQBAUGWWQWWQWWAGGBBBBBB")
    > >>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB")

    > <_sre.SRE_Match object at 0xb7de33e0>
    > >>> m.search("BBBQQBBQBAUGWWQWAWQWWAGGBBBBBB").groups()

    > ('BQB', 'AUG', 'WWQWAWQWW', 'QWW', 'AGG')
    > >>>

    >
    > > This way, I could scan for genes, remove the first letter, scan for
    > > more genes, remove the first letter again, and scan for more genes.
    > > This would hypothetically yield different genes, since the frame
    > > would be shifted.

    >
    > Of you could just unconstrain the first match and it will do them all
    > at once :-
    >
    > (AUG)((\w\w\w)*?)(AGG)
    >
    > You could run this with re.findall, but beware that this will only
    > return non-overlapping matches which may not be what you want.
    >
    > I'm not sure re's are the best tool for the job, but they should give
    > you a quick idea of what the answers might be.
    >
    > --
    > Nick Craig-Wood <> --http://www.craig-wood.com/nick


    Thank you! Your suggestion was overly helpful.

    Also thank you for the package suggestions. BioPython is on my plate
    to check out, but I needed a kind of quick fix for this one. The
    documentation for biopython seems pretty thick - I'm not a biologist
    so I'm not even sure what kind of packages I'm even looking for.

    thanks!
    Blaine
    blaine, Apr 28, 2008
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Richard Loupatty
    Replies:
    2
    Views:
    370
    Richard Loupatty
    Jul 18, 2003
  2. VSK
    Replies:
    2
    Views:
    2,279
  3. FD
    Replies:
    0
    Views:
    821
  4. =?iso-8859-1?B?bW9vcJk=?=

    Matching abitrary expression in a regular expression

    =?iso-8859-1?B?bW9vcJk=?=, Dec 1, 2005, in forum: Java
    Replies:
    8
    Views:
    838
    Alan Moore
    Dec 2, 2005
  5. =?ISO-8859-1?Q?Mickel_Gr=F6nroos?=

    Regular expression for matching IPA characters in Unicode?

    =?ISO-8859-1?Q?Mickel_Gr=F6nroos?=, Oct 11, 2004, in forum: Python
    Replies:
    2
    Views:
    707
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=
    Oct 12, 2004
Loading...

Share This Page