print given character range.

Discussion in 'Perl Misc' started by Jayaprakash Rudraraju, Apr 5, 2004.

  1. Most files in bioinformatics store their sequences in FASTA
    format. FASTA files contain header lines followed by DNA
    sequence. I have been using the following shortcut to extract
    a given character range from the sequence.

    perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470

    Is there a better and more convenient way to do it?
     
    Jayaprakash Rudraraju, Apr 5, 2004
    #1

  2. On 2004-04-05, Jayaprakash Rudraraju wrote:

    > Most of the files in bioinformatics save their sequences in fasta
    > format. Fasta format files contain header lines followed by dna
    > sequence. I have been using the following short-cut to get sequence
    > given the range in the sequence.
    >
    > perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470
    >
    > Is there a better and more convenient way to do it?


    Other ways to do it would be:

    grep -v '>' FASTA.TXT | tr -d '\n' | cut -c3450-3470

    perl -ne '
        chomp;
        next if />/;                      # skip header lines
        $result .= $_;                    # accumulate sequence text
        if (length($result) >= 3470) {    # range is fully read
            print substr($result, 3449, 21), "\n";
            exit 0;
        }' FASTA.TXT

    Whether they're faster or more convenient than the above, I
    don't know. But the solutions involving cut(1) may not do what
    you want if FASTA.TXT is too big to be swallowed in one line.

    --
    André Majorel <URL:http://www.teaser.fr/~amajorel/>
    "Finally I am becoming stupider no more." -- Paul Erdös' epitaph
     
    Andre Majorel, Apr 6, 2004
    #2

  3. Yesterday, IP packets from Andre Majorel delivered:

    > On 2004-04-05, Jayaprakash Rudraraju wrote:
    >
    > > Most of the files in bioinformatics save their sequences in fasta
    > > format. Fasta format files contain header lines followed by dna
    > > sequence. I have been using the following short-cut to get sequence
    > > given the range in the sequence.
    > >
    > > perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470
    > >
    > > Is there a better and more convenient way to do it?

    >
    > Other ways to do it would be:
    >
    > grep -v '>' FASTA.TXT | tr -d '\n' | cut -c3450-3470


    Thanks for the solution. I wanted a simpler way to get a range of
    the sequence from a FASTA file. The headers in FASTA files always
    start with '>', but I was not looking for a faster solution. I will
    use a script if the FASTA file is too long.

    >
    > perl -ne '
    > chomp;
    > next if />/;
    > $result .= $_;
    > if (length ($result) >= 3470)
    > {
    > print substr ($result, 3449, 21), "\n";
    > exit 0
    > }'
    >
    > Whether they're faster or more convenient than the above, I
    > don't know. But the solutions involving cut(1) may not do what
    > you want if FASTA.TXT is too big to be swallowed in one line.
    >
    >


    --
    echo | perl -pe 'y/a-z/n-za-m/'

    If you want to make God laugh, tell him your future plans.
    -------------------------------------
    Printed using 100% recycled electrons
     
    Cognition Peon, Apr 6, 2004
    #3
  4. On Mon, 5 Apr 2004 15:09:36 -0700, Jayaprakash Rudraraju wrote:

    > Most of the files in bioinformatics save their sequences in fasta
    > format. Fasta format files contain header lines followed by dna
    > sequence. I have been using the following short-cut to get sequence
    > given the range in the sequence.
    >
    > perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470
    >
    > Is there a better and more convenient way to do it?


    You could try looking on CPAN. BioPerl, at
    http://search.cpan.org/~birney/bioperl-1.4/
    is a place to start; it seems to cover a lot of
    FASTA file handling.
    Adam
     
    Adam Price, Apr 8, 2004
    #4
  5. Jayaprakash Rudraraju wrote:
    > Most of the files in bioinformatics save their sequences in fasta
    > format. Fasta format files contain header lines followed by dna
    > sequence. I have been using the following short-cut to get sequence
    > given the range in the sequence.
    >
    > perl -ne 'chomp; next if />/; print' FASTA.TXT | cut -c3450-3470
    >
    > Is there a better and more convenient way to do it?


    Try this:

    perl -ne 'chomp; $seq .= $_ unless /^>/; END { print substr($seq, 3449, 21), "\n" }' FASTA.TXT

    The "^" assumes (as you mentioned in another reply) that the header
    starts with '>' - otherwise you can leave it out. However, if the
    lines do start with '>', it is much faster (especially for the long
    records) for the regexp engine if you anchor the RE with '^'.

    This single perl command avoids the extra cut process, so it should
    be faster than 'perl | cut'.

    Kevin
     
    Kevin Collins, Apr 8, 2004
    #5