Need feedback on ORF-extracting code

Discussion in 'Python' started by gb345, Aug 12, 2009.

  1. gb345

    gb345 Guest

    Hi everyone. I'm relatively new to Python, and could use your
    feedback on the code below.

    First some nomenclature. A "nucleotide" is one of A, C, G, T, or
    U. (In practice, a sequence of such nucleotides never contains
    both T and U, but this fact is not important in what follows.) A
    "codon" is a sequence of 3 of these. A "stop codon" is any one of
    UAA, UAG, UGA, TAA, TAG, or TGA. Two codons are said to be "in
    frame" in a containing sequence of nucleotides if their positions
    differ by a multiple of 3. An "open reading frame," or ORF, is
    defined as a maximal subsequence of nucleotides whose length is a
    multiple of 3, begins with either AUG or ATG, terminates right
    before a stop codon in the original sequence, and contains no stop
    codons that are in frame with the initial codon (AUG or ATG).

    The fact that ORFs have lengths that are multiples of 3 means that
    there are three possible "registers" for ORFs (depending the modulo
    3 of their starting positions), and that ORFs in different registers
    can overlap. I'll refer to these registers as 0, 1, and 2, because
    they contain ORFs that are in frame with the codons at positions
    0, 1, and 2 of the original sequence, respectively.

    In the code below, routine extract_orfs takes as input a string,
    assumed to be a sequence of nucleotides, and returns a tuple of
    tuples, describing ORFs. These ORFs can overlap.

    The helper routine _xo takes as input a string (a sequence of
    nucleotides), and an offset, and returns a tuple of tuples, again
    representing ORFs, but this time all in the same register, since
    they are all in frame with the position passed as the second argument
    to the function.

    I would appreciate your comments on this code. I feel that it is
    not as clear as it could be, but I don't know how to make it any
    clearer.

    (NOTE: I realize that, in all likelihood, publicly available Python
    code already exists for this. At the moment I'm more interested
    in improving my Python skills.)


    Many thanks in advance!

    Gabe



    # BEGINNING OF CODE
    import sys
    import re

    _start = r'A[TU]G'
    _stop = r'(?:[TU]A[AG]|[TU]GA)'
    _nonstop = r'(?:[CAG][TUCAG]{2}|[TU](?:[TUC][TUCAG]|[AG][TUC])|[TU]GG)'
    _codon = r'(?:[TUCAG]{3})'
    _orf_re = re.compile('(' + _codon + r'*?)(A[TU]G' +
    _nonstop + '*)(' + _stop + ')', flags=re.I)
    _lead_re = re.compile(r'[TUCAG]*?A[TU]G', flags=re.I)

    def _xo(seq, pos):
    """
    Helper routine that finds all the non-overlapping in-frame orfs
    starting from a specific position in the input sequence.

    input: a string of letters in the set 'tucagTUCAG', and a starting
    position;
    output: a tuple of tuples; each internal tuple consists of a
    starting position, an orf, and the stop codon that
    terminates it.
    """

    ret = []
    while True:
    m = _orf_re.match(seq, pos)
    if not m:
    break
    orf = m.group(2)
    stop = m.group(3)
    assert len(orf) % 3 == 0
    ret.append((m.start() + len(m.group(1)), orf, stop))
    pos = m.end()
    return ret

    def extract_orfs(seq):
    """
    Extracts all (possibly overlapping) maximal open reading frames,
    defined as sequences beginning with AUG (or ATG), ending with an
    in-frame stop codon, and containing no in-frame stop codons
    in-between.

    input: a string of letters in the set 'tucagTUCAG';
    output: a tuple of tuples; each internal tuple consists of a
    starting position, an orf, and the stop codon that
    terminates it.
    """

    m = _lead_re.match(seq)
    if not m:
    return ()
    pos = m.start()
    ret = []

    for i in range(min(3, len(seq))):
    ret.extend(_xo(seq, pos + i))
    ret.sort()
    return tuple(ret)

    # END OF CODE
     
    gb345, Aug 12, 2009
    #1
    1. Advertisements

  2. gb345

    Terry Reedy Guest

    I am not currently familiar with the re module so I cannot comment in
    detail. The assert should be a claim that the re should *never* match
    anything other than a multiple of 3, so that it is a program bug if it
    ever do so. If this is not true, you should use an if statement.

    tjr

     
    Terry Reedy, Aug 13, 2009
    #2
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.