remove sequence

Discussion in 'Perl Misc' started by Ross, Jul 2, 2005.

  1. Ross

    Ross Guest

    Dear All,
    For a file with many records like the following, I would like to remove each
    record, if any, containing "XXXXX and the ATCCAAT... follows". If the
    sequence is a single line, I can simply use

    if (line =~ ^>.*)
    if (line =~ (.*)X(.*) )
    newline = $1;

    Does anybody have a good idea for solving the problem? Thanks in advance.

    >9P01P10A.y putative DD1A protein [Oryza sativa (japonica
    >cultivar-group)] HSP:949

    GAACAATTAGTATAAACTTTAGTTGAATCTCGTTACTATATTAGCTTCGG
    AGCTCAATTACAAACAGCTAGCAAAAAATGCCAGGTCCCCCATAAAAGAA
    ACCATCATGTTCATAATCAGACACACGGTAGCAATTTGATATATATCCGA
    GAGCAGAATTGATTTGATGGGTGTTGCCGCCTGCATCAAAAAACTTGACG
    CCACTAAATGATGCAGCGTTTTTGATTGGAGCATTCCCACTGCCCATCGG
    AGGACTTGGTTTATGTCCCTTTTTCAAGGCATAGCCACCAAACATTATTG
    TCACTGGTTTATTTGACAAGCTTGTAAACAGAGATCTTGGATAGTAACCT
    GTAAGTCTGGCTTCACCATTAAACCCATAGTAGACTTGCCAATCACCAGA
    AATTTGATCCTTGGATACTCTGACTGTAGTGTATCGTTTGTCGCTAGAGG
    TGGTGGAAACAGGGTTAATCACCATTCCTGGAACGATTTCTGAGCTAAAT
    ACACTTTCGAATCCAGGACAACGCATATCAGGACAGGCATTAGATCCTTG
    AGTAAACCAAGTACTGAAGTGCGTCTGTGAATCATTGTATGATTCAGGCT
    CAATATTCCATCCAGCTATAACATTATTTATGGCGGATGCTTCATCCTTA
    TTATAAATCGAAATGAAACCTCCTGTTTGTTGTCCATGCTCTAGATTAAA
    GCATAAACATCCATGGTGGCCTCTACTCCATAATACGTTATAGCATTATC
    TGAAGGACCCCATCCATATACTGCAAGATACAACGTGCCAGCTTGATTTG
    ATTCATGACCCGACGAAGATAAATTCACATCAAGTATAAGGGGCATCAAT
    GTTTGCCATTTCTTTTGGACCTCTTCTTCCATAGGAGGGAAGCAACTCCT
    ACTGTTTGACTACCAAGGAACACACACAGAGCAGTGCAGATTGATTAAAA
    ATTTCTCCATATTATATTTGGGGATGGAGAGGGTATATGTTTGAGTTCCC
    CGGCGTTAGGCCGATTTCCGGGTACACAAAATGCGGGCTTCCGAGAAAAA
    AAATTCCCCCAACCTTGGATTTGTTTTTTTTTTTCTCTTCTTCTTCTACT
    CTATTTTTATTTCTTGTGTTTGTTTCTGTACTTTTCTTGTTGTTTTTTGT
    GTGTTCTTTTTGTTGTGTTTGTTTTTTTTCTTTTCTTTTTGTTTTTATGT
    ATCTATCCTTTCTTATTGTTTGTATTTTTTTTTTGTTATTTTTGTATGTT
    TTCTTTGTTGTGTTATTTTTTTGGTTTTCTTTTTTTGTTTTTATCACTTT
    CTCTTTGTATTGAGTGCTTTTCTTGTTTTTATTTTGTTGATTCTTTTGTC
    TTGTCTCTGTCTTTTTTTTCCGTATATGCTTTGTTTGTTTCTTATCCTTT
    GCTTG
    >9P01P10B.y prolamin precursor (clone pX24) - rice emb|CAA37850.1|
    >prolamin [Oryza sativa (japonica cultivar-group)] HSP:418

    GTTGCTATGAAAGCACTTTATTTCTATTTATATCACCCAAAGTTTCACAT
    GTCACATATGATGATATCTGAGCTTATTTTTAACTTCCGAACCACTATAC
    TGTTAAAACTCATTACAAGACACCGCCAAGGGTGGTAATGGTACTGGGTG
    CACCATAGTACCTAGGGTAGATACCATATCTAGATGGCACGTTAAAAGCC
    AATAGAGCTTGAGCTTGAGCCAGATTCCGATCAAAGTAGAGATCACCAAA
    CTGCTGGAGTTGTAGCTGCTGCGCTATGGCCTGAACAATGTTAATGTCCT
    GATAGTGAGATTGTTGCGCCACCAGCGCGAGATGTTGCCAGACTTGGTTG
    TTTCTCAGTTGAAACGCAGCTGATTGCAAGAAGGGGCTTGCCGCTATGCC
    ATACTGCTGCCTTACGAACTCATTATATGGGCTAAGCACCTGTTGCTGTA
    GCAGGACAGGCGACTGCAGCTGATATTGCCTATAACTTTGACCTAAAACA
    TCAAACTGCGCAGAGGCGCTGCATGCAGCAATAGCAAGGAGAGCAAAGAC
    GAAAATGATCTTCATTGCTGCGGGACACTANATCTTTCTATTTTTCTGTA
    TAATGCTTGAACTGTGTGAACGATCXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCTCTTCAAT
    CTCGGGAANNNNNTGTNGGGGTGTTGGGAAATCCCCCCCTTGTTGGGGTT
    TTTCTTGGTTAAACACAAGTGTCCCTTCTCTTTAAAAAAAACCCCTTTTC
    CTGTTGGGGGGGTNNTTTTTTTTTTTTTCTTTTTTTTTTTTTTNTTTTTT
    TCCTTTTTTTTTTTTTTTTTTTTTTCTTTTTGTTCCTTTCTTGTTTCTGT
    TTCTCTTTTTTTTTTTTTTTTTTTTTTTGTTTTCTTTTTTTTTTTTTCTG
    Ross, Jul 2, 2005
    #1

  2. Ross

    John Bokma Guest

    "Ross" <> wrote:

    > Dear All,
    > For a file with many records like follows, I would like to remove each
    > record, if any, containing "XXXXX and the ATCCAAT... follows". If the
    > sequence is a single line, i can just simply use
    >
    > if (line =~ ^>.*)
    > if (line =~ (.*)X(.*) )
    > newline = $1;
    >
    > anybody has good idea to solve the problem? thanks in advance


    Either read the file record by record, or slurp everything into one scalar
    and try to match records.
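
    A minimal sketch of the record-by-record option, under the assumption that a record is one or more header lines starting with ">" followed by its sequence lines (the posted sample has two ">" lines per header). The helper name read_records and the sample data are made up for illustration; this is not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of records, each a hash reference
# with the header lines and the sequence lines kept as separate strings.
sub read_records {
    my ($fh) = @_;
    my @records;
    my $current;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*$/;    # ignore blank lines
        $line =~ s/^\s+//;           # tolerate indented input
        if ( $line =~ /^>/ ) {
            # A ">" line at the very start, or right after sequence data,
            # opens a new record; otherwise it continues the header.
            if ( !$current || length $current->{seq} ) {
                $current = { header => '', seq => '' };
                push @records, $current;
            }
            $current->{header} .= $line;
        }
        elsif ($current) {
            $current->{seq} .= $line;
        }
    }
    return @records;
}

# Usage with an in-memory file handle and made-up data.
my $sample = ">rec1 first header line\n>wrapped header\n\nACGT\nACXXGT\n"
           . ">rec2\n\nGGCC\nTTAA\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

for my $r (@records) {
    my $status = $r->{seq} =~ /X/ ? 'contains X' : 'clean';
    print "$status: $r->{header}";
}
```

    Keeping header and sequence apart means an X in a header (such as "pX24" in the sample) does not affect the test on the sequence.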

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Jul 2, 2005
    #2

  3. Ross

    Ross Guest

    "John Bokma" <> wrote in message
    news:Xns9686EAF3FA39Dcastleamber@130.133.1.4...
    > "Ross" <> wrote:
    >
    >> Dear All,
    >> For a file with many records like follows, I would like to remove each
    >> record, if any, containing "XXXXX and the ATCCAAT... follows". If the
    >> sequence is a single line, i can just simply use
    >>
    >> if (line =~ ^>.*)
    >> if (line =~ (.*)X(.*) )
    >> newline = $1;
    >>
    >> anybody has good idea to solve the problem? thanks in advance

    >
    > read either record for record, or slurp everything in one scalar and try
    > to
    > match records.
    >
    > --
    > John Small Perl scripts: http://johnbokma.com/perl/
    > Perl programmer available: http://castleamber.com/
    > Happy Customers: http://castleamber.com/testimonials.html
    >


    Thanks John. However, a record is not in a single-line format, and there
    is the awkward problem that one doesn't know how many X's there are.
    Also, I don't know how to read character by character in PERL, thanks.
    Ross, Jul 2, 2005
    #3
  4. Ross

    John Bokma Guest

    "Ross" <> wrote:

    > "John Bokma" <> wrote in message


    > Thanks John. However, each record is not a single-line format and it
    > encounters an embarrassing situation that one doesn't know how many
    > X's there are.


    I was aware of that. What might work is:

    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    my $record = '';

    while ( my $line = <$fh> ) {

        # replace the pattern with one that matches the start of a record
        if ( $line =~ /^begin of a new record/ and length $record ) {

            # check if it has XXXX
            # and if so, drop it
            $record = '';
        }

        $record .= $line;
    }

    close $fh or die "Cannot close $filename: $!";

    if ( length $record ) {

        # check if it has XXXX
        # and if so, drop it
    }

    ( BTW: the language is Perl, the interpreter is perl; there is no such
    thing as PERL )

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Jul 2, 2005
    #4
  5. Ross <> wrote:
    : Dear All,
    : For a file with many records like follows, I would like to remove each
    : record, if any, containing "XXXXX and the ATCCAAT... follows". If the
    : sequence is a single line, i can just simply use

    : if (line =~ ^>.*)
    : if (line =~ (.*)X(.*) )
    : newline = $1;

    : anybody has good idea to solve the problem? thanks in advance

    I'm not sure that I'm interpreting what you're trying to remove
    correctly, but is this perhaps what you're looking for?

    #!/usr/bin/perl -w

    use strict;

    # atcx is the file of lines that you're trying to process
    open RD, "atcx" or die "Couldn't open input file: $!\n";

    my @cleanlines = ();
    my $truline = "";
    foreach my $line (<RD>) {
        chomp $line;
        if ($line =~ /^>/) {
            if ($truline ne "") {
                push @cleanlines, $truline;  # Add next cleaned record when
                $truline = "";               # signal for new record [>] is found
            }
        }
        else {
            $line =~ s/X.*//;                # Cut from the first X to end of line
            $truline .= $line."\n";          # Append good part of line
        }
    }
    close RD;

    if ($truline ne "") { push @cleanlines, $truline; }

    print "\n\n\n";
    foreach (@cleanlines) { print "$_\n\n"; }
    print "\n";

    You might be trying to go even further and skip everything from the
    first X to the next new-record signal, in which case you'd stop
    appending after that first X, use the loop only to find the next
    ">", and then start the process again for the next record.


    --thelma
    Thelma Lubkin, Jul 2, 2005
    #5
  6. Ross <> wrote:

    > i can just simply use
    >
    > if (line =~ ^>.*)
    > if (line =~ (.*)X(.*) )
    > newline = $1;



    You can?

    How did you get it to compile?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jul 2, 2005
    #6
  7. Ross

    Guest

    >Ross said
    >For a file with many records like follows, I would like to remove each
    >record, if any, containing "XXXXX and the ATCCAAT... follows". If the
    >sequence is a single line, i can just simply use



    >if (line =~ ^>.*)
    > if (line =~ (.*)X(.*) )
    > newline = $1;



    >anybody has good idea to solve the problem? thanks in advance


    Hi Ross

    If the file can be slurped into memory, the following might be an
    approach that will work (provided the format is as you posted).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $data = do {local $/; <DATA>};

    my @parts = grep {!/X+/} split /\n\n(?=>)/, $data;
    print Dumper \@parts;

    __DATA__
    >9P01P10A.y putative DD1A protein [Oryza sativa (japonica
    >cultivar-group)] HSP:949



    GAACAATTAGTATAAACTTTAGTTGAATCTCGTTACTATATTAGCTTCGG
    AGCTCAATTACAAACAGCTAGCAAAAAATGCCAGGTCCCCCATAAAAGAA
    ACCATCATGTTCATAATCAGACACACGGTAGCAATTTGATATATATCCGA
    GAGCAGAATTGATTTGATGGGTGTTGCCGCCTGCATCAAAAAACTTGACG
    CCACTAAATGATGCAGCGTTTTTGATTGGAGCATTCCCACTGCCCATCGG
    AGGACTTGGTTTATGTCCCTTTTTCAAGGCATAGCCACCAAACATTATTG
    TCACTGGTTTATTTGACAAGCTTGTAAACAGAGATCTTGGATAGTAACCT
    GTAAGTCTGGCTTCACCATTAAACCCATAGTAGACTTGCCAATCACCAGA
    AATTTGATCCTTGGATACTCTGACTGTAGTGTATCGTTTGTCGCTAGAGG
    TGGTGGAAACAGGGTTAATCACCATTCCTGGAACGATTTCTGAGCTAAAT
    ACACTTTCGAATCCAGGACAACGCATATCAGGACAGGCATTAGATCCTTG
    AGTAAACCAAGTACTGAAGTGCGTCTGTGAATCATTGTATGATTCAGGCT
    CAATATTCCATCCAGCTATAACATTATTTATGGCGGATGCTTCATCCTTA
    TTATAAATCGAAATGAAACCTCCTGTTTGTTGTCCATGCTCTAGATTAAA
    GCATAAACATCCATGGTGGCCTCTACTCCATAATACGTTATAGCATTATC
    TGAAGGACCCCATCCATATACTGCAAGATACAACGTGCCAGCTTGATTTG
    ATTCATGACCCGACGAAGATAAATTCACATCAAGTATAAGGGGCATCAAT
    GTTTGCCATTTCTTTTGGACCTCTTCTTCCATAGGAGGGAAGCAACTCCT
    ACTGTTTGACTACCAAGGAACACACACAGAGCAGTGCAGATTGATTAAAA
    ATTTCTCCATATTATATTTGGGGATGGAGAGGGTATATGTTTGAGTTCCC
    CGGCGTTAGGCCGATTTCCGGGTACACAAAATGCGGGCTTCCGAGAAAAA
    AAATTCCCCCAACCTTGGATTTGTTTTTTTTTTTCTCTTCTTCTTCTACT
    CTATTTTTATTTCTTGTGTTTGTTTCTGTACTTTTCTTGTTGTTTTTTGT
    GTGTTCTTTTTGTTGTGTTTGTTTTTTTTCTTTTCTTTTTGTTTTTATGT
    ATCTATCCTTTCTTATTGTTTGTATTTTTTTTTTGTTATTTTTGTATGTT
    TTCTTTGTTGTGTTATTTTTTTGGTTTTCTTTTTTTGTTTTTATCACTTT
    CTCTTTGTATTGAGTGCTTTTCTTGTTTTTATTTTGTTGATTCTTTTGTC
    TTGTCTCTGTCTTTTTTTTCCGTATATGCTTTGTTTGTTTCTTATCCTTT
    GCTTG

    >9P01P10B.y prolamin precursor (clone pX24) - rice emb|CAA37850.1|
    >prolamin [Oryza sativa (japonica cultivar-group)] HSP:418



    GTTGCTATGAAAGCACTTTATTTCTATTTATATCACCCAAAGTTTCACAT
    GTCACATATGATGATATCTGAGCTTATTTTTAACTTCCGAACCACTATAC
    TGTTAAAACTCATTACAAGACACCGCCAAGGGTGGTAATGGTACTGGGTG
    CACCATAGTACCTAGGGTAGATACCATATCTAGATGGCACGTTAAAAGCC
    AATAGAGCTTGAGCTTGAGCCAGATTCCGATCAAAGTAGAGATCACCAAA
    CTGCTGGAGTTGTAGCTGCTGCGCTATGGCCTGAACAATGTTAATGTCCT
    GATAGTGAGATTGTTGCGCCACCAGCGCGAGATGTTGCCAGACTTGGTTG
    TTTCTCAGTTGAAACGCAGCTGATTGCAAGAAGGGGCTTGCCGCTATGCC
    ATACTGCTGCCTTACGAACTCATTATATGGGCTAAGCACCTGTTGCTGTA
    GCAGGACAGGCGACTGCAGCTGATATTGCCTATAACTTTGACCTAAAACA
    TCAAACTGCGCAGAGGCGCTGCATGCAGCAATAGCAAGGAGAGCAAAGAC
    GAAAATGATCTTCATTGCTGCGGGACACTANATCTTTCTATTTTTCTGTA
    TAATGCTTGAACTGTGTGAACGATCXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCTCTTCAAT
    CTCGGGAANNNNNTGTNGGGGTGTTGGGAAATCCCCCCCTTGTTGGGGTT
    TTTCTTGGTTAAACACAAGTGTCCCTTCTCTTTAAAAAAAACCCCTTTTC
    CTGTTGGGGGGGTNNTTTTTTTTTTTTTCTTTTTTTTTTTTTTNTTTTTT
    TCCTTTTTTTTTTTTTTTTTTTTTTCTTTTTGTTCCTTTCTTGTTTCTGT
    TTCTCTTTTTTTTTTTTTTTTTTTTTTTGTTTTCTTTTTTTTTTTTTCTG
    , Jul 2, 2005
    #7
  8. Ross

    Guest

    Chris wrote:
    >If the file can be slurped into memory, the following might be an
    >approach that will work (provided the format is as you posted).
    >
    >
    >#!/usr/bin/perl
    >use strict;
    >use warnings;
    >use Data::Dumper;
    >
    >
    >my $data = do {local $/; <DATA>};
    >
    >
    >my @parts = grep {!/X+/} split /\n\n(?=>)/, $data;
    >print Dumper \@parts;


    Hi again Ross

    I saw a possible problem with my solution. If there are any uppercase
    X's in the header lines, then the code above will not work. If you knew
    that there would be a minimum number of X's in the fasta body proper,
    (and none in the header), you might use the following in the grep
    regular expression:

    ! /X{3,}/ or whatever minimum amount of X's would possibly be in the
    fasta sequence.
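
    If no minimum run length is known, another option might be to test only the non-header lines of each record. The sketch below assumes every header line starts with ">"; the helper name body_has_x and the two short records are hypothetical, chosen so that one record has an X in its header only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any non-header line of the record contains X.
sub body_has_x {
    my ($record) = @_;
    my $count = grep { !/^>/ && /X/ } split /\n/, $record;
    return $count > 0;
}

# Made-up data in the same layout as the posted sample.
my $data = join "\n",
    '>rec1 clone pX24',      # X appears in the header only
    '',
    'ACGTACGT',
    '',
    '>rec2 plain header',
    '',
    'ACGTXXXXACGT',
    '';

# Same split as above, but the filter ignores header lines.
my @parts = grep { !body_has_x($_) } split /\n\n(?=>)/, $data;
print join( "\n\n", @parts ), "\n";
```

    Here rec1 is kept even though its header contains an X, and rec2 is dropped because of the X's in its sequence.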

    Also, the use of Data::Dumper was just to show the legit fastas
    w/headers held in the @parts array.

    To print them you could write:

    print join "\n\n", @parts;

    Hope this clears things up.

    Chris
    , Jul 2, 2005
    #8
  9. Ross

    John Bokma Guest

    John Bokma, Jul 2, 2005
    #9
  10. Ross

    Ross Guest

    Small-potato feedback and new question: remove sequence

    Dear all,
    Thanks to all your comments, and with my beginner's knowledge of Perl, I
    wrote the solution at the end of this message myself. This time, I'm asking
    about reading character by character in Perl. The problem arises in
    situations like:

    CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA
    AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA

    where I need to count a continuous run of 'A's as a control. I don't want to
    transform the data file into a single-line format. Does Perl have a
    getchar()-like function so I can count easily? Thanks again for so many
    responses.

    Gratefully,
    Ross


    ====================================================

    use strict;
    use warnings;

    my $file = $ARGV[0];
    unless (defined $file && length $file) {
        print "Usage: $0 input\n";
        exit;
    }
    my $output = "$file.cleaned";
    open(my $out, '>', $output) || die "Could not open $output: $!\n";

    my $skipping = 0;    # true while discarding the rest of a truncated record
    while (my $line = <>) {
        if ($line =~ /^>/) {
            $skipping = 0;          # a header line starts a new record
        }
        elsif ($skipping) {
            next;                   # still inside the discarded tail
        }
        elsif ($line =~ /^([^X]*)X/) {
            print $out "$1\n";      # keep only the part before the first X
            $skipping = 1;
            next;
        }
        print $out $line;
    }
    close($out);
    Ross, Jul 3, 2005
    #10
  11. Ross

    Paul Lalli Guest

    Re: Small-potato feedback and new question: remove sequence

    Ross wrote:
    > Dear all,
    > Indeed with all your comments and based on my beginner ability of perl,
    > the solution i wrote myself to solve the problem is at the end of this
    > message. This time, i'm asking about reading character by character in perl.
    > Again the problem arises whenever situation like:
    >
    > CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA
    > AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA
    >
    > that i need to count a continuous number of 'A' for control. I don't wanna
    > transform the data file into a single line format. Has perl any getchar()
    > like function so i can count easily? Thanks again for so many responses.

    ^^^^^

    This is a FAQ:
    perldoc -q count
    "How can I count the number of occurrences of a substring within a
    string?"

    The first example in the answer deals with counting single-character
    substrings.
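
    For the "continuous run" part of the question, a rough sketch follows. Perl does have a getc function, but reading line by line and carrying a counter across line breaks is usually simpler. The helper name longest_run is hypothetical, and skipping lines that start with ">" is an assumption about the file layout.

```perl
use strict;
use warnings;

# Hypothetical helper: longest run of $char in the lines read from $fh,
# treating the line breaks as if they were not there.
sub longest_run {
    my ( $fh, $char ) = @_;
    my ( $run, $best ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^>/;    # assumption: header lines are not sequence data
        for my $c ( split //, $line ) {
            $run = $c eq $char ? $run + 1 : 0;    # the run survives the line break
            $best = $run if $run > $best;
        }
    }
    return $best;
}

# The two lines from the question.
my $sample = "CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA\n"
           . "AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA\n";

# Total number of A's, using the tr/// counting idiom.
my $total = ( $sample =~ tr/A// );

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $longest = longest_run( $fh, 'A' );
close $fh;

print "total A: $total, longest run: $longest\n";
```

    Because $run is reset only by a non-matching character, a run that ends one line and continues on the next is counted as one run, so the file never has to be joined into a single line.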

    Paul Lalli
    Paul Lalli, Jul 3, 2005
    #11
