processing the genetic code with python?

N

nuttydevil

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?
Thanks for any help recieved!
 
D

Diez B. Roggisch

nuttydevil said:
I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?
Thanks for any help recieved!

Show us your ways, show us where you got stuck - then we'd might be able to help you.

Diez
 
S

Steve Holden

Diez said:
Show us your ways, show us where you got stuck - then we'd might be able to help you.

Also, take a look at the biopython package, which will almost certainly
save you large amounts of time on tasks such as the one you describe.

http://www.biopython.org/

regards
Steve
 
R

Roy Smith

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents.

Start by googling for biopython.
 
D

David E. Konerding DSD staff

I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

I don't really understand precisely what you're trying to do.

First off, those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Second, I don't know what the codon to value function is, is this frequency (IE number n occurences of codon
X out of N total codons)? Or is the lookup table provided for you?

Anyay, I can help you with most of the preprocessing. For example,
However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

So, if the file is called "notepad.txt", I'd do what you did above as:

import string
o = open("notepad.txt")
l = o.readlines() ## read all lines
l = map(string.strip, l) ## strip newlines
l = "".join(l) ## join into one string (in case codon boundaries cross lines)
l = l[50:-60]
l = l.upper()
print l

codons = []
for i in range(0, len(l), 3):
codons.append(l[i:i+3])

print codons


That gets you about 30% of the way there.

Dave
 
D

Dennis Lee Bieber

read these documents. However, I want them to be read 3 base pairs at a

For some reason my memory thinks this same subject came up less than
a month ago. A Google session might even find them..
--
 
J

James Stroud

nuttydevil said:
I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?

Yes: use python.
Thanks for any help recieved!

I couldn't help myself. I strongly suggest you study this example. It
will cut your coding time way down in the future.

I'm writing your name down and this is the last time I'm doing homework
for you.

James


from operator import mul

table = { 'AUG' : 0.98999, 'CCC' : 0.9755 } # <== you fill this in
trim_front = 50
trim_back = 20

# Why I did this:
# Python >=1 line per thought; you have to love it
data = "".join([s.strip() for s in open(filename)])
data = data.upper().replace('T', 'U')
codons = [data[i:i+3] for i in xrange(0, len(data), 3)] # Alex Martelli
trimmed = codons[trim_front:-trim_back]
product = reduce(mul, [table[codon] for codon in codons])
value = product**(1.0/len(trimmed)) # <== is this really ALL codons?

print value # useless print statement


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
J

James Stroud

James said:
nuttydevil said:
I have many notepad documents that all contain long chunks of genetic
code. They look something like this:

atggctaaactgaccaagcgcatgcgtgttatccgcgagaaagttgatgcaaccaaacag
tacgacatcaacgaagctatcgcactgctgaaagagctggcgactgctaaattcgtagaa
agcgtggacgtagctgttaacctcggcatcgacgctcgtaaatctgaccagaacgtacgt
ggtgcaactgtactgccgcacggtactggccgttccgttcgcgtagccgtatttacccaa

Basically, I want to design a program using python that can open and
read these documents. However, I want them to be read 3 base pairs at a
time (to analyse them codon by codon) and find the value that each
codon has a value assigned to it. An example of this is below:

** If the three base pairs were UUU the value assigned to it (from the
codon value table) would be 0.296

The program has to read all the sequence three pairs at a time, then I
want to get all the values for each codon, multiply them together and
put them to the power of 1 / the length of the sequence in codons
(which is the length of the whole sequence divided by three).

However, to make things even more complicated, the notebook sequences
are in lowercase and the codon value table is in uppercase, so the
sequences need to be converted into uppercase. Also, the Ts in the DNA
sequences need to be changed to Us (again to match the codon value
table). And finally, before the DNA sequences are read and analysed I
need to remove the first 50 codons (i.e. the first 150 letters) and the
last 20 codons (the last 60 letters) from the DNA sequence. I've also
been having problems ensuring the program reads ALL the sequence 3
letters at a time.

I've tried various ways of doing this but keep coming unstuck along the
way. Has anyone got any suggestions for how they would tackle this
problem?


Yes: use python.
Thanks for any help recieved!

I couldn't help myself. I strongly suggest you study this example. It
will cut your coding time way down in the future.

I'm writing your name down and this is the last time I'm doing homework
for you.

James


from operator import mul

table = { 'AUG' : 0.98999, 'CCC' : 0.9755 } # <== you fill this in
trim_front = 50
trim_back = 20

# Why I did this:
# Python >=1 line per thought; you have to love it
data = "".join([s.strip() for s in open(filename)])
data = data.upper().replace('T', 'U')
codons = [data[i:i+3] for i in xrange(0, len(data), 3)] # Alex Martelli
trimmed = codons[trim_front:-trim_back]
product = reduce(mul, [table[codon] for codon in codons])
value = product**(1.0/len(trimmed)) # <== is this really ALL codons?

print value # useless print statement

I noticed a typo. Should be "Python <= 1 line per thought".

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
R

Ralf Muschall

James said:
I'm writing your name down and this is the last time I'm doing homework
for you.

This won't help - she doesn't even know her name
(but google helps with that) ;-)

The fact that she uses E.coli ribosomal protein L1
strongly indicates that this is really homework.

....
data = data.upper().replace('T', 'U')
....

I'd apply the "inverse" of this line to the codon table
(or a copy thereof, or just adding the entries) in the hope
that changing 37 table entries is less work than changing
each codon in a base sequence.

Ralf
 
N

nuttydevil

Thanks so much guys for you all your help. I've had a month to learn
python and do this for my project, I had the basics down but just kept
getting unstuck. I won't message again - promise! Hehehe, thanks again
everyone.
 
T

Tim Roberts

David E. Konerding DSD staff said:
I don't really understand precisely what you're trying to do.

First off, those aren't base pairs, they're bases. Only when you have double-stranded
DNA (or RNA, or some other oddball cases) would they be base pairs.

Isn't that just a standard way to write DNA pairs? After all, every "a" is
paired with a "t", and every "c" is paired with a "g", so it is redundant
to specify both ends of the pair.
 
B

brainy_muppet

I'm writing your name down and this is the last time I'm doing homework
for you.

James

Wow, you are really a pretentious asshole. If you don't want to provide
people with help, don't bother.

And that code's incorrect anyway.
 
S

Steve Holden

Wow, you are really a pretentious asshole. If you don't want to provide
people with help, don't bother.

And that code's incorrect anyway.

So a smiley has to be explicitly included for you to perceive something
as being funny?

regards
Steve
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top