remove last 76 letters from string

PeroMHC · Aug 6, 2009

Hi All, So here is the problem... I have a FASTA file (used for DNA
analyses) that looks like this:

....

>gnl|SRA|SRR019045.10.1 SL-XAY_956090708:2:1:0:1028.1 length=152
NCTTTTTTTATTTTTTGTATAAATGAAGTTTCACTATATCGGACGAGCGGTTCAGCAGTCATTCCGAGAC
CGATATAGTGAAACTTCATTTCTACAAAAANTACCAAACGTCGCTCGGCAGAGCGTCGTGTTGGGCAAGA
GAGTAGCACTCG
>gnl|SRA|SRR019045.11.1 SL-XAY_956090708:2:1:0:1151.1 length=152
NGGTNTGGNNNNCNCCNTNCTNCNNCNTCANCCTCCNGTCNCANNCCNCNTNNNNNCNNNNNCNNTNCTT
CTNCNNTCTCCATTCCTTCTTNATAGCCTGCTCCANCGCACGTTGAACCTTCTGCACCACGAACGCACTC
ACACCACTCATC
>gnl|SRA|SRR019045.12.1 SL-XAY_956090708:2:1:0:1197.1 length=152
NGTCGGGTCTTCGCTATCACTGGACTGCTCCCATCAGCTATAGGTCCTCCCCGCCACACCCCATGCCCAC
CGCCTATCCACGTCTGTCACAACCTCATACATCAGACAGTCACACTTACCAACATATCCAAGCACCTCAA
GCAACACATCAT
....

This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem, I want to read in this data, and cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

Any help on how to do this would be appreciated.
Thanks!

Jan Kaliszewski · Aug 6, 2009

On 06-08-2009 at 01:54:46 PeroMHC said:
This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem, I want to read in this data, and cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

If I understand correctly, you want something like this:

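with open(path_to_the_input_file) as fasta:
    with open(path_to_the_output_file, 'w') as nucleotides:
        # assumes one complete sequence per line, with no header lines
        for seq in fasta:
            print('> foo bar length=76', file=nucleotides)
            # rstrip() drops the newline; [-76:] keeps the last 76 letters
            print(seq.rstrip()[-76:], file=nucleotides)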

Cheers,
*j

MRAB · Aug 6, 2009

PeroMHC said:
Hi All, So here is the problem... I have a FASTA file (used for DNA
analyses) that looks like this:

...
NGTCGGGTCTTCGCTATCACTGGACTGCTCCCATCAGCTATAGGTCCTCCCCGCCACACCCCATGCCCAC
CGCCTATCCACGTCTGTCACAACCTCATACATCAGACAGTCACACTTACCAACATATCCAAGCACCTCAA
GCAACACATCAT
...

This snippet represents 3 individual DNA sequences. Each sequence is
identified by the line starting with >
The complete file has about 10 million individual sequences.

A simple enough problem, I want to read in this data, and cut out the
last 76 letters (nucleotides) from each individual sequence and send
them to a new txt file with a similar format.

Any help on how to do this would be appreciated.
Thanks!

If the input file is large then you can reduce the amount of memory
needed by reading the input file a line at a time by iterating over the
file object:

input_file = open(input_path)
for line in input_file:
    ...
input_file.close()

Each line will end with '\n', so use the 'rstrip' method to remove it,
and then slice the last 76 characters:

last_part = line.rstrip()[-76 : ]
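
Putting those pieces together for a real FASTA file, where one sequence spans several lines, might look like the sketch below. It is only an illustration under a few assumptions: header lines start with '>', each header is copied unchanged (so its length= field is not updated), output is wrapped at 70 letters per line, and the names last_n_records and write_fasta are made up for this example. It is written for Python 3.

```python
import io


def last_n_records(lines, n=76):
    """Yield (header, tail) pairs; tail is the last n letters of each record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(chunks)[-n:]
            header, chunks = line, []
        elif line and header is not None:
            # Sequence line belonging to the current record.
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)[-n:]


def write_fasta(records, out, width=70):
    """Write (header, sequence) pairs, wrapping the sequence at width letters."""
    for header, seq in records:
        out.write(header + '\n')
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + '\n')


# Tiny in-memory demonstration with made-up reads and n=6 instead of 76.
sample = io.StringIO(">read1 length=10\nACGTA\nCGTAC\n>read2 length=4\nGGCC\n")
out = io.StringIO()
write_fasta(last_n_records(sample, n=6), out)
print(out.getvalue())
```

Only one record is held in memory at a time, so the same two functions should cope with a large file when handed open file objects (for example the input opened for reading and a second file opened with mode 'w') in place of the StringIO objects.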

Iain King · Aug 6, 2009

Jan Kaliszewski said:
print >>nucleotides, seq[-76]

MRAB said:
last_part = line.rstrip()[-76 : ]

You all mean: seq[:-76] , right? (assuming you've already stripped
any junk off the end of the string)

Iain

MRAB · Aug 6, 2009

Iain King said:
print >>nucleotides, seq[-76]

last_part = line.rstrip()[-76 : ]

You all mean: seq[:-76] , right? (assuming you've already stripped
any junk off the end of the string)

The OP said "cut out the last 76 letters (nucleotides) from each
individual sequence and send them to a new txt file with a similar
format.", i.e. extract the last 76 letters (in the same format) to a file.
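
For anyone reading along, the three spellings that have come up in this thread do different things. A small illustration, using a made-up 10-letter string and n = 4 standing in for 76:

```python
seq = "ACGTACGTAC"  # made-up 10-letter sequence
n = 4               # stands in for 76

one_letter = seq[-n]        # a single character, the n-th from the end
last_n = seq[-n:]           # the last n letters (what the OP asked for)
without_last_n = seq[:-n]   # everything except the last n letters

print(one_letter, last_n, without_last_n)  # prints: G GTAC ACGTAC
```

If the string is shorter than n, seq[-n:] simply returns the whole string, while seq[-n] raises an IndexError.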

Iain King · Aug 6, 2009

MRAB said:
Iain King said:
print >>nucleotides, seq[-76]
last_part = line.rstrip()[-76 : ]

You all mean: seq[:-76] , right? (assuming you've already stripped
any junk off the end of the string)

The OP said "cut out the last 76 letters (nucleotides) from each
individual sequence and send them to a new txt file with a similar
format.", ie extract the last 76 letters (in the same format) to a file.

So he did. Excuse me while I go eat my other foot.
