regexp with gaps

Discussion in 'Ruby' started by egrasso, Jul 14, 2008.

  1. egrasso

    egrasso Guest

    Hi, I need to find the positions of some substrings inside a long
    string. For this I'm using a loop that calls str.index(pattern,
    last_found_position + 1), so I find every position where the pattern
    matches. The pattern is a string of 20 chars, different each time I
    run the script. That worked perfectly. The problem is that now I need to
    find all positions where at least 12 of the pattern's 20 chars match.
    For example, for the pattern "aaaaaa", find substrings such as "aaaaaa",
    "aaabaa", "baaaaa", "ababaa", etc.
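
    The exact-match loop described above can be sketched roughly as follows. The method name all_positions and the sample strings are assumptions made for illustration; only String#index with a start offset comes from the post.

    ```ruby
    # Collect every start position where `pattern` occurs in `str`,
    # including overlapping occurrences.
    def all_positions(str, pattern)
      positions = []
      pos = str.index(pattern)
      while pos
        positions << pos
        pos = str.index(pattern, pos + 1) # resume one character past the last hit
      end
      positions
    end

    p all_positions("AAATTTGAAATTT", "AAATTT") # → [0, 7]
    ```

    This only finds exact occurrences, which is why it stops being enough once mismatches are allowed.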

    First I thought that I could create all possible patterns (with \w
    in the positions allowed to differ) and check them, but I realized that
    there would be a lot of different patterns to check (more than a few
    hundred, I think).
    Is there any way to do this without having to check a lot of
    patterns?
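
    To get a feel for how many patterns that would be, here is a rough count, assuming one pattern per choice of positions that are allowed to differ (replaced by \w) in a 20-char pattern with up to 8 mismatches. The helper name choose is hypothetical.

    ```ruby
    # Binomial coefficient C(n, k), computed with exact integer arithmetic.
    def choose(n, k)
      (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
    end

    length = 20
    min_matches = 12
    max_mismatches = length - min_matches # 8 positions may differ

    # One wildcard mask per set of at most 8 positions.
    masks = (0..max_mismatches).sum { |k| choose(length, k) }
    puts masks # → 263950
    ```

    Even counting only the masks with exactly 8 wildcards gives choose(20, 8) = 125970, so enumerating patterns is far more than a few hundred checks.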
    thanks
     
    egrasso, Jul 14, 2008
    #1

  2. egrasso

    phlip Guest

    > The problem is that now I need to
    > find all positions where the pattern matches 12 or more chars.
    > For example: For the pattern "aaaaaa", find substrings "aaaaaa",
    > "aaabaa", "baaaaa", "ababaa", etc
    >
    > First I thought that I could create all possible patterns (with \w)


    \w{12,}

    Right?

    Either that or \w{12}\w*
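
    For reference, a small sketch of what this regexp actually matches. The sample text is made up; the point is that \w{12,} accepts any run of 12 or more word characters and never compares anything against a given pattern.

    ```ruby
    text = "short words and AAATTTAAATTTGGG here"

    # Every run of 12 or more word characters, whatever they are.
    runs = text.scan(/\w{12,}/)
    p runs # → ["AAATTTAAATTTGGG"]

    # A string with no resemblance to a binding site matches just as well.
    unrelated = "CCCCCCCCCCCC".match?(/\w{12,}/)
    p unrelated # → true
    ```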
     
    phlip, Jul 14, 2008
    #2

  3. egrasso

    Guest

    Mmmmm... nope. I think I didn't explain the idea very well... I'm writing a
    script to find specific sequences of DNA (binding sites) inside a large
    sequence of DNA (for those who don't know, DNA sequences are made of 4
    different bases: A, T, C and G). The problem is that the binding sites don't
    need to be 100% exact to work. For example, the binding site for a protein
    X is "AAATTT", but the protein can also bind to the sequence "AAAGTT"
    or "AACGTT" and work fine. I need to find all these sites, but the only data
    I have is that "Protein X binds to AAATTT".
    I finally solved the problem without using str.index or a regexp; basically,
    I search manually:

    (Note: variables are in Spanish! buscarBS=find binding site,
    patron=pattern, semejanza=minimal similarity, from 0 to 1, cadena=string,
    respuesta=answer, largo=length)

    def buscarBS(patron, semejanza = 0.6, cadena = @secuencia)
      respuesta = ""
      largoc = cadena.length
      largop = patron.length

      # Slide a window of the pattern's length over the string.
      (0..(largoc - largop)).each do |i|
        puntos = 0 # matching positions in this window
        largop.times do |j|
          puntos += 1 if cadena[i + j] == patron[j]
          # Stop early once this window can no longer reach the threshold,
          # even if every remaining position matched.
          break if (puntos + largop - 1 - j).to_f / largop < semejanza
        end

        similitud = puntos.to_f / largop
        if similitud >= semejanza
          # Positions are reported 1-based ("desde" = from, "hasta" = to).
          respuesta += "desde: #{i + 1} hasta: #{i + largop} - similitud: - #{similitud * 100}%\n"
        end
      end

      if respuesta.empty?
        # "No similar sequence was found"
        "No se encontro ninguna secuencia similar (similitud: #{semejanza} - #{patron})"
      else
        # "The following similarities were found:"
        "\nSe encontraron las siguientes similitudes:\n\n" + respuesta
      end
    end
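
    A more compact variant of the same sliding-window idea, returning data instead of a formatted string, might look like this. The method name similar_sites, its English parameter names, and the sample sequence are hypothetical and not part of the original script.

    ```ruby
    # Hypothetical helper: one hash per window whose fraction of matching
    # positions is at least min_similarity. Positions are 1-based.
    def similar_sites(pattern, sequence, min_similarity = 0.6)
      len = pattern.length
      (0..(sequence.length - len)).each_with_object([]) do |start, sites|
        window = sequence[start, len]
        # Count positions where window and pattern carry the same base.
        matches = window.each_char.zip(pattern.each_char).count { |a, b| a == b }
        similarity = matches.to_f / len
        next if similarity < min_similarity
        sites << { from: start + 1, to: start + len, similarity: similarity }
      end
    end

    sites = similar_sites("AAATTT", "GGAAAGTTCC", 0.8)
    sites.each { |s| puts "#{s[:from]}-#{s[:to]}: #{(s[:similarity] * 100).round(1)}%" }
    ```

    Returning hashes keeps the search separate from the reporting, so the caller can sort or filter the hits before printing them.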

    I still need to polish and optimize the code, but it finds all possible
    sites with at least a specified similarity and tells me how similar they
    are. If anyone has another idea, needs more details about the code, or is
    interested in bioinformatics with Ruby, tell me.
    Thanks

     
    , Jul 15, 2008
    #3
  4. egrasso

    Axel Etzold Guest


    Hi ---

    You could make use of the McIlroy-Hunt longest common subsequence (LCS) algorithm,
    which will give you longest common subsequences, and also information such as

    'sequence AAATTT is transformed into AAAGTT by changing T to G at the fourth entry.'

    You can find a Ruby gem implementation here: http://raa.ruby-lang.org/project/diff-lcs/
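
    To illustrate what an LCS measures, here is a minimal sketch that computes only the LCS length with plain dynamic programming. It is not the McIlroy-Hunt algorithm and not the diff-lcs gem's API; the method name lcs_length is hypothetical.

    ```ruby
    # Length of the longest common subsequence of strings a and b,
    # using the textbook dynamic-programming table, one row at a time.
    def lcs_length(a, b)
      prev = Array.new(b.length + 1, 0)
      a.each_char do |ca|
        curr = [0]
        b.each_char.with_index(1) do |cb, j|
          curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
        end
        prev = curr
      end
      prev.last
    end

    puts lcs_length("AAATTT", "AAAGTT") # → 5
    ```

    Note that a subsequence may skip characters, so an LCS-based comparison also tolerates insertions and deletions, not only substitutions at fixed positions.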

    Best regards,

    Axel

     
    Axel Etzold, Jul 15, 2008
    #4
