print given character range.

  • Thread starter Jayaprakash Rudraraju

Jayaprakash Rudraraju

Most bioinformatics files store their sequences in FASTA format.
FASTA files contain header lines followed by DNA sequence. I have
been using the following shortcut to extract a given range of the
sequence.

perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470

Is there a better and more convenient way to do it?
 

Andre Majorel

Other ways to do it would be:

grep -v '>' FASTA.TXT | tr -d '\n' | cut -c3450-3470

perl -ne '
    chomp;
    next if /^>/;                      # skip header lines
    $result .= $_;
    if (length($result) >= 3470) {     # enough sequence read to cover 3450-3470
        print substr($result, 3449, 21), "\n";
        exit 0;
    }' FASTA.TXT

Whether they're faster or more convenient than the above, I
don't know. But the solutions involving cut(1) may not do what
you want if FASTA.TXT is too big to be swallowed in one line.
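
One way to sidestep the long-line concern is to treat the joined sequence as a byte stream instead of a single line, and slice it with tail and head rather than cut. The following is only a sketch: the function name fasta_range is made up for illustration, it assumes single-byte (ASCII) residues and Unix line endings, and it relies on head -c, which is widely available but not required by POSIX.

```shell
# fasta_range FILE START END   (hypothetical helper, not from the thread)
# Print residues START..END (1-based, inclusive) of the sequence data in FILE.
# Assumes ASCII residues, Unix line endings, and headers that start with ">".
fasta_range() {
    file=$1 start=$2 end=$3
    len=$((end - start + 1))     # inclusive range: 3450-3470 is 21 characters
    grep -v '^>' "$file" |       # drop header lines
        tr -d '\n' |             # join the sequence into one unbroken byte stream
        tail -c "+$start" |      # skip ahead to byte START (tail -c +N is 1-based)
        head -c "$len"           # keep LEN bytes and stop reading
    echo                         # finish the output with a newline
}
```

With that defined, `fasta_range FASTA.TXT 3450 3470` should print the same 21 characters as the cut pipeline above. Like that pipeline, it concatenates every record in the file before counting.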
 

Cognition Peon

Yesterday, IP packets from Andre Majorel delivered:
Other ways to do it would be:

grep -v '>' FASTA.TXT | tr -d '\n' | cut -c3450-3470

Thanks for the solution. I wanted a simpler way to get a range of
sequence from a FASTA file. The headers in FASTA files always start
with '>', but I was not looking for a faster solution. I will use a
script if the FASTA file is too long.
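
For the "too long" case, a script along these lines might do. It is a sketch, not a tested tool: the name fasta_range_awk is invented here, and it assumes headers start with '>' and that all records in the file are counted as one continuous sequence, as in the one-liners above. It keeps only a running position counter, prints the part of each line that overlaps the requested range, and stops reading once the range is complete.

```shell
# fasta_range_awk FILE START END   (hypothetical helper, not from the thread)
# Stream FILE and print residues START..END (1-based, inclusive) without
# ever holding the whole sequence in memory.
fasta_range_awk() {
    awk -v start="$2" -v end="$3" '
        /^>/ { next }                       # skip header lines
        {
            sub(/\r$/, "")                  # tolerate DOS line endings
            first = pos + 1                 # position of the first residue on this line
            pos += length($0)               # position of the last residue on this line
            if (pos < start) next           # line lies entirely before the range
            from = (start > first) ? start - first + 1 : 1
            to   = (end < pos) ? end - first + 1 : length($0)
            printf "%s", substr($0, from, to - from + 1)
            if (pos >= end) exit            # range complete, stop reading
        }
        END { print "" }                    # finish the output with a newline
    ' "$1"
}
```

Usage would be `fasta_range_awk FASTA.TXT 3450 3470`. Because it exits early, the cost depends on where the range ends rather than on the size of the file.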
 

Adam Price

You could try looking at CPAN, with
http://search.cpan.org/~birney/bioperl-1.4/
as a place to start.
BioPerl seems to cover a lot of FASTA file handling.
Adam
 

Kevin Collins

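Try this, which works when each sequence is stored on a single line
(substr is applied to every input line separately):

perl -ne 'chomp; print substr($_, 3449, 21) unless (/^>/);' FASTA.TXT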

The "^" assumes (as you mentioned in another reply) that the header
starts with '>' - otherwise you can leave it out. However, if the
lines do start with '>', it is much faster (especially for the long
records) for the regexp engine if you anchor the RE with '^'.

This single perl command should always be faster than 'perl | cut'...

Kevin
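
Since substr on $_ looks at one line at a time, the one-liner above extracts the range per record only when a record's sequence sits on a single line. When records are wrapped over several lines, a per-record variant could look like the sketch below. The name fasta_range_per_record is hypothetical, and the sketch assumes headers start with '>' and that a single record fits in memory.

```shell
# fasta_range_per_record FILE START END   (hypothetical helper, not from the thread)
# For every record in FILE, print residues START..END (1-based, inclusive)
# counted from the start of that record.
fasta_range_per_record() {
    awk -v start="$2" -v end="$3" '
        function emit() {
            # print the requested slice of the record collected so far
            if (seen) print substr(seq, start, end - start + 1)
        }
        /^>/ { emit(); seq = ""; seen = 0; next }   # header: finish the previous record
        { seq = seq $0; seen = 1 }                  # sequence line: append to the record
        END { emit() }                              # finish the last record
    ' "$1"
}
```

Calling `fasta_range_per_record FASTA.TXT 3450 3470` prints one line per record. A record shorter than the requested start position yields an empty line.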
 
