processing the genetic code with python?

nuttydevil · Mar 6, 2006

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?
Thanks for any help recieved!

Diez B. Roggisch · Mar 6, 2006

nuttydevil said:
I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?
Thanks for any help recieved!

Show us your ways, show us where you got stuck - then we'd might be able to help you.

Diez

Steve Holden · Mar 6, 2006

Diez said:
Show us your ways, show us where you got stuck - then we'd might be able to help you.

Also, take a look at the biopython package, which will almost certainly
save you large amounts of time on tasks such as the one you describe.

http://www.biopython.org/

regards
Steve

Roy Smith · Mar 6, 2006

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents.

Start by googling for biopython.

David E. Konerding DSD staff · Mar 6, 2006

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

I don't really understand precisely what you're trying to do.

First off, those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Second, I don't know what the codon to value function is, is this frequency (IE number n occurences of codon
X out of N total codons)? Or is the lookup table provided for you?

Anyay, I can help you with most of the preprocessing. For example,

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

So, if the file is called "notepad.txt", I'd do what you did above as:

import string
o = open("notepad.txt")
l = o.readlines() ## read all lines
l = map(string.strip, l) ## strip newlines
l = "".join(l) ## join into one string (in case codon boundaries cross lines)
l = l[50:-60]
l = l.upper()
print l

codons = []
for i in range(0, len(l), 3):
codons.append(l[i:i+3])

print codons

That gets you about 30% of the way there.

Dave

Dennis Lee Bieber · Mar 6, 2006

read these documents. However, I want them to be read 3 base pairs at a

For some reason my memory thinks this same subject came up less than
a month ago. A Google session might even find them..
--

Dennis Lee Bieber · Mar 6, 2006

{talking to myself -- a sure sign I need to log off and go to work}

For some reason my memory thinks this same subject came up less than
a month ago. A Google session might even find them..

Second hit for "python codon" (first hit is /this/ thread):

http://groups.google.com/group/comp...&q=python+codon&rnum=2&hl=en#47b29145d99fc454
--

James Stroud · Mar 6, 2006

nuttydevil said:
I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?

Yes: use python.

Thanks for any help recieved!

I couldn't help myself. I strongly suggest you study this example. It
will cut your coding time way down in the future.

I'm writing your name down and this is the last time I'm doing homework
for you.

James

from operator import mul

table = { 'AUG' : 0.98999, 'CCC' : 0.9755 } # <== you fill this in
trim_front = 50
trim_back = 20

# Why I did this:
# Python >=1 line per thought; you have to love it
data = "".join([s.strip() for s in open(filename)])
data = data.upper().replace('T', 'U')
codons = [data[i:i+3] for i in xrange(0, len(data), 3)] # Alex Martelli
trimmed = codons[trim_front:-trim_back]
product = reduce(mul, [table[codon] for codon in codons])
value = product**(1.0/len(trimmed)) # <== is this really ALL codons?

print value # useless print statement

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

James Stroud · Mar 6, 2006

James said:
nuttydevil said:

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?

Click to expand...

Yes: use python.

Thanks for any help recieved!

Click to expand...

I couldn't help myself. I strongly suggest you study this example. It
will cut your coding time way down in the future.

I'm writing your name down and this is the last time I'm doing homework
for you.

James

from operator import mul

table = { 'AUG' : 0.98999, 'CCC' : 0.9755 } # <== you fill this in
trim_front = 50
trim_back = 20

# Why I did this:
# Python >=1 line per thought; you have to love it
data = "".join([s.strip() for s in open(filename)])
data = data.upper().replace('T', 'U')
codons = [data[i:i+3] for i in xrange(0, len(data), 3)] # Alex Martelli
trimmed = codons[trim_front:-trim_back]
product = reduce(mul, [table[codon] for codon in codons])
value = product**(1.0/len(trimmed)) # <== is this really ALL codons?

print value # useless print statement

I noticed a typo. Should be "Python <= 1 line per thought".

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Ralf Muschall · Mar 7, 2006

James said:
I'm writing your name down and this is the last time I'm doing homework
for you.

This won't help - she doesn't even know her name
(but google helps with that) ;-)

The fact that she uses E.coli ribosomal protein L1
strongly indicates that this is really homework.

....

data = data.upper().replace('T', 'U')

....

I'd apply the "inverse" of this line to the codon table
(or a copy thereof, or just adding the entries) in the hope
that changing 37 table entries is less work than changing
each codon in a base sequence.

Ralf

nuttydevil · Mar 7, 2006

Thanks so much guys for you all your help. I've had a month to learn
python and do this for my project, I had the basics down but just kept
getting unstuck. I won't message again - promise! Hehehe, thanks again
everyone.

Tim Roberts · Mar 9, 2006

David E. Konerding DSD staff said:
I don't really understand precisely what you're trying to do.

First off, those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Isn't that just a standard way to write DNA pairs? After all, every "a" is
paired with a "t", and every "c" is paired with a "g", so it is redundant
to specify both ends of the pair.

brainy_muppet · Mar 10, 2006

I'm writing your name down and this is the last time I'm doing homework
for you.

James

Wow, you are really a pretentious asshole. If you don't want to provide
people with help, don't bother.

And that code's incorrect anyway.

Steve Holden · Mar 10, 2006

Wow, you are really a pretentious asshole. If you don't want to provide
people with help, don't bother.

And that code's incorrect anyway.

So a smiley has to be explicitly included for you to perceive something
as being funny?

regards
Steve

Processing in Python help	0	Aug 31, 2022
Need feedback on ORF-extracting code	1	Aug 12, 2009
Processing text using python	12	Feb 20, 2006
[QUIZ] Genetic Programming (#212)	4	Jul 3, 2009
Problem with a login script, SESSION user rights and put this together so it works with the other pages and MySQL. Code examples.	2	May 5, 2023
SENTINEL CONTROL LOOP WHEN DEALING WITH TWO ARRAYS	1	Oct 26, 2023
C program: memory leak/ segmentation fault/ memory limit exceeded	0	Nov 12, 2022
PyPDF Processing Errors (ValueError: invalid literal for int() with	0	Aug 8, 2011

processing the genetic code with python?

nuttydevil

Diez B. Roggisch

Steve Holden

Roy Smith

David E. Konerding DSD staff

Dennis Lee Bieber

Dennis Lee Bieber

James Stroud

James Stroud

Ralf Muschall

nuttydevil

Tim Roberts

brainy_muppet

Steve Holden

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads