remove last 76 letters from string

PeroMHC

Hi All, So here is the problem... I have a FASTA file (used for DNA
analyses) that looks like this:

....
>gnl|SRA|SRR019045.10.1 SL-XAY_956090708:2:1:0:1028.1 length=152
NCTTTTTTTATTTTTTGTATAAATGAAGTTTCACTATATCGGACGAGCGGTTCAGCAGTCATTCCGAGAC
CGATATAGTGAAACTTCATTTCTACAAAAANTACCAAACGTCGCTCGGCAGAGCGTCGTGTTGGGCAAGA
GAGTAGCACTCG
>gnl|SRA|SRR019045.11.1 SL-XAY_956090708:2:1:0:1151.1 length=152
NGGTNTGGNNNNCNCCNTNCTNCNNCNTCANCCTCCNGTCNCANNCCNCNTNNNNNCNNNNNCNNTNCTT
CTNCNNTCTCCATTCCTTCTTNATAGCCTGCTCCANCGCACGTTGAACCTTCTGCACCACGAACGCACTC
ACACCACTCATC
>gnl|SRA|SRR019045.12.1 SL-XAY_956090708:2:1:0:1197.1 length=152
NGTCGGGTCTTCGCTATCACTGGACTGCTCCCATCAGCTATAGGTCCTCCCCGCCACACCCCATGCCCAC
CGCCTATCCACGTCTGTCACAACCTCATACATCAGACAGTCACACTTACCAACATATCCAAGCACCTCAA
GCAACACATCAT
....

This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem: I want to read in this data, cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

Any help on how to do this would be appreciated.
Thanks!
 
Jan Kaliszewski

On 06-08-2009 at 01:54:46, PeroMHC said:
This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem, I want to read in this data, and cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

If I understand correctly, you want something like this:


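with open(path_to_the_input_file) as fasta:
    # the output file must be opened for writing
    with open(path_to_the_output_file, 'w') as nucleotides:
        # note: this treats every input line as one sequence
        for seq in fasta:
            print >>nucleotides, '> foo bar length=76'
            # strip the newline, then slice the last 76 characters
            print >>nucleotides, seq.rstrip()[-76:]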


Cheers,
*j
 
MRAB

PeroMHC said:
Hi All, So here is the problem... I have a FASTA file (used for DNA
analyses) that looks like this:

...
NGTCGGGTCTTCGCTATCACTGGACTGCTCCCATCAGCTATAGGTCCTCCCCGCCACACCCCATGCCCAC
CGCCTATCCACGTCTGTCACAACCTCATACATCAGACAGTCACACTTACCAACATATCCAAGCACCTCAA
GCAACACATCAT
...

This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem, I want to read in this data, and cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

Any help on how to do this would be appreciated.
Thanks!

If the input file is large then you can reduce the amount of memory
needed by reading the input file a line at a time by iterating over the
file object:

input_file = open(input_path)
for line in input_file:
    ...
input_file.close()

Each line will end with '\n', so use the 'rstrip' method to remove it,
and then slice the last 76 characters:

last_part = line.rstrip()[-76 : ]
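
In the sample above each record spans several lines (70 + 70 + 12 letters for length=152), so slicing one line at a time would not give the last 76 letters of the whole sequence. A minimal Python 3 sketch that still reads line by line but joins the lines of one record before slicing could look like the following. The names `last_n_per_record` and `write_tails` are made up for illustration, and it assumes every header line starts with ">" as the OP describes. The header is copied unchanged, so its `length=` field is not updated.

```python
def last_n_per_record(lines, n=76):
    """Yield (header, tail) per FASTA record; tail is the last n letters."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(parts)[-n:]
            header, parts = line, []
        elif line and header is not None:
            # sequence line belonging to the current record
            parts.append(line)
    # emit the final record
    if header is not None:
        yield header, ''.join(parts)[-n:]


def write_tails(in_path, out_path, n=76):
    """Write each header followed by the last n letters of its sequence."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for header, tail in last_n_per_record(src, n):
            dst.write(header + '\n' + tail + '\n')
```

Only one record is held in memory at a time, so this should stay cheap even for a file with millions of sequences.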
 
Iain King

print >>nucleotides, seq[-76]
     last_part = line.rstrip()[-76 : ]

You all mean: seq[:-76] , right? (assuming you've already stripped
any junk off the end of the string)

Iain
 
MRAB

Iain said:
print >>nucleotides, seq[-76]
last_part = line.rstrip()[-76 : ]

You all mean: seq[:-76] , right? (assuming you've already stripped
any junk off the end of the string)

The OP said "cut out the last 76 letters (nucleotides) from each
individual sequence and send them to a new txt file with a similar
format.", i.e. extract the last 76 letters (in the same format) to a file.
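
For reference, the three expressions quoted in this exchange do different things. A small Python sketch on a made-up 10-letter string, with n = 4 standing in for 76:

```python
seq = "ACGTACGTAC"   # illustrative sequence, not from the OP's file
n = 4

one_letter = seq[-n]        # a single character, the n-th from the end
last_n = seq[-n:]           # the last n characters (what the OP asked for)
without_last_n = seq[:-n]   # everything except the last n characters

# Slicing is forgiving when the string is shorter than n; indexing is not.
short_tail = "AC"[-n:]      # returns the whole string "AC"
```

Indexing with `"AC"[-n]` would raise an IndexError instead, which matters if some sequences are shorter than 76 letters.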
 
Iain King

Iain said:
     print >>nucleotides, seq[-76]
     last_part = line.rstrip()[-76 : ]
You all mean:   seq[:-76]   , right? (assuming you've already stripped
any junk off the end of the string)

The OP said "cut out the last 76 letters (nucleotides) from each
individual sequence and send them to a new txt file with a similar
format.", i.e. extract the last 76 letters (in the same format) to a file.

So he did. Excuse me while I go eat my other foot.
 
