remove sequence

Ross

Dear All,
For a file with many records like the ones below, I would like to remove each
record, if any, containing "XXXXX and the ATCCAAT... that follows". If the
sequence were on a single line, I could simply use

if ($line =~ /^>/)
if ($line =~ /(.*)X(.*)/)
$newline = $1;

Does anybody have a good idea for solving this problem? Thanks in advance.
>9P01P10A.y putative DD1A protein [Oryza sativa (japonica cultivar-group)] HSP:949
GAACAATTAGTATAAACTTTAGTTGAATCTCGTTACTATATTAGCTTCGG
AGCTCAATTACAAACAGCTAGCAAAAAATGCCAGGTCCCCCATAAAAGAA
ACCATCATGTTCATAATCAGACACACGGTAGCAATTTGATATATATCCGA
GAGCAGAATTGATTTGATGGGTGTTGCCGCCTGCATCAAAAAACTTGACG
CCACTAAATGATGCAGCGTTTTTGATTGGAGCATTCCCACTGCCCATCGG
AGGACTTGGTTTATGTCCCTTTTTCAAGGCATAGCCACCAAACATTATTG
TCACTGGTTTATTTGACAAGCTTGTAAACAGAGATCTTGGATAGTAACCT
GTAAGTCTGGCTTCACCATTAAACCCATAGTAGACTTGCCAATCACCAGA
AATTTGATCCTTGGATACTCTGACTGTAGTGTATCGTTTGTCGCTAGAGG
TGGTGGAAACAGGGTTAATCACCATTCCTGGAACGATTTCTGAGCTAAAT
ACACTTTCGAATCCAGGACAACGCATATCAGGACAGGCATTAGATCCTTG
AGTAAACCAAGTACTGAAGTGCGTCTGTGAATCATTGTATGATTCAGGCT
CAATATTCCATCCAGCTATAACATTATTTATGGCGGATGCTTCATCCTTA
TTATAAATCGAAATGAAACCTCCTGTTTGTTGTCCATGCTCTAGATTAAA
GCATAAACATCCATGGTGGCCTCTACTCCATAATACGTTATAGCATTATC
TGAAGGACCCCATCCATATACTGCAAGATACAACGTGCCAGCTTGATTTG
ATTCATGACCCGACGAAGATAAATTCACATCAAGTATAAGGGGCATCAAT
GTTTGCCATTTCTTTTGGACCTCTTCTTCCATAGGAGGGAAGCAACTCCT
ACTGTTTGACTACCAAGGAACACACACAGAGCAGTGCAGATTGATTAAAA
ATTTCTCCATATTATATTTGGGGATGGAGAGGGTATATGTTTGAGTTCCC
CGGCGTTAGGCCGATTTCCGGGTACACAAAATGCGGGCTTCCGAGAAAAA
AAATTCCCCCAACCTTGGATTTGTTTTTTTTTTTCTCTTCTTCTTCTACT
CTATTTTTATTTCTTGTGTTTGTTTCTGTACTTTTCTTGTTGTTTTTTGT
GTGTTCTTTTTGTTGTGTTTGTTTTTTTTCTTTTCTTTTTGTTTTTATGT
ATCTATCCTTTCTTATTGTTTGTATTTTTTTTTTGTTATTTTTGTATGTT
TTCTTTGTTGTGTTATTTTTTTGGTTTTCTTTTTTTGTTTTTATCACTTT
CTCTTTGTATTGAGTGCTTTTCTTGTTTTTATTTTGTTGATTCTTTTGTC
TTGTCTCTGTCTTTTTTTTCCGTATATGCTTTGTTTGTTTCTTATCCTTT
GCTTG
>9P01P10B.y prolamin precursor (clone pX24) - rice emb|CAA37850.1| prolamin [Oryza sativa (japonica cultivar-group)] HSP:418
GTTGCTATGAAAGCACTTTATTTCTATTTATATCACCCAAAGTTTCACAT
GTCACATATGATGATATCTGAGCTTATTTTTAACTTCCGAACCACTATAC
TGTTAAAACTCATTACAAGACACCGCCAAGGGTGGTAATGGTACTGGGTG
CACCATAGTACCTAGGGTAGATACCATATCTAGATGGCACGTTAAAAGCC
AATAGAGCTTGAGCTTGAGCCAGATTCCGATCAAAGTAGAGATCACCAAA
CTGCTGGAGTTGTAGCTGCTGCGCTATGGCCTGAACAATGTTAATGTCCT
GATAGTGAGATTGTTGCGCCACCAGCGCGAGATGTTGCCAGACTTGGTTG
TTTCTCAGTTGAAACGCAGCTGATTGCAAGAAGGGGCTTGCCGCTATGCC
ATACTGCTGCCTTACGAACTCATTATATGGGCTAAGCACCTGTTGCTGTA
GCAGGACAGGCGACTGCAGCTGATATTGCCTATAACTTTGACCTAAAACA
TCAAACTGCGCAGAGGCGCTGCATGCAGCAATAGCAAGGAGAGCAAAGAC
GAAAATGATCTTCATTGCTGCGGGACACTANATCTTTCTATTTTTCTGTA
TAATGCTTGAACTGTGTGAACGATCXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCTCTTCAAT
CTCGGGAANNNNNTGTNGGGGTGTTGGGAAATCCCCCCCTTGTTGGGGTT
TTTCTTGGTTAAACACAAGTGTCCCTTCTCTTTAAAAAAAACCCCTTTTC
CTGTTGGGGGGGTNNTTTTTTTTTTTTTCTTTTTTTTTTTTTTNTTTTTT
TCCTTTTTTTTTTTTTTTTTTTTTTCTTTTTGTTCCTTTCTTGTTTCTGT
TTCTCTTTTTTTTTTTTTTTTTTTTTTTGTTTTCTTTTTTTTTTTTTCTG
 
John Bokma

Ross said:
Dear All,
For a file with many records like follows, I would like to remove each
record, if any, containing "XXXXX and the ATCCAAT... follows". If the
sequence is a single line, i can just simply use

if (line =~ ^>.*)
if (line =~ (.*)X(.*) )
newline = $1;

anybody has good idea to solve the problem? thanks in advance

Either read the file record by record, or slurp everything into one scalar
and try to match records.
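
A minimal sketch of the record-by-record option, assuming every record starts with a '>' header line as in the pattern Ross posted. The helper name read_records and the two-record sample are made up for illustration. It works by setting the input record separator $/ to "\n>", so each read returns one whole record:

```perl
use strict;
use warnings;

# Hypothetical helper: read '>'-headed records from a filehandle one at a
# time by making "\n>" the input record separator.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n>";          # each read now stops just before the next header
    my @records;
    while ( my $rec = <$fh> ) {
        chomp $rec;            # strips a trailing "\n>" if one is present
        $rec =~ s/\n+\z//;     # the last record ends in plain newlines instead
        $rec =~ s/^>//;        # only the first record still has its leading '>'
        next unless length $rec;
        push @records, ">$rec\n";
    }
    return @records;
}

# Usage with an in-memory file (made-up sample data)
my $sample = ">rec1 first\nACGT\nACGT\n>rec2 second\nACXXXXGT\nTTTT\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @kept = grep { !/X/ } read_records($fh);    # drop records that contain an X
close $fh;
print @kept;
```

Note that the grep here also looks at the header line, so a header containing an uppercase X would cause the record to be dropped as well.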
 
Ross

John Bokma said:
read either record for record, or slurp everything in one scalar and try to
match records.

Thanks, John. However, the records are not in a single-line format, and the
awkward part is that I don't know in advance how many X's there are. Also,
I don't know how to read character by character in PERL. Thanks.
 
John Bokma

Ross said:
Thanks John. However, each record is not a single-line format and it
encounters an embarrassing situation that one doesn't know how many
X's there are.

I was aware of that. What might work is:

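# $filename is assumed to hold the path of the input file
open my $fh, '<', $filename or die "Cannot open $filename: $!";
my $record = '';

while ( my $line = <$fh> ) {

    # a line starting with '>' begins a new record
    if ( $line =~ /^>/ and length $record ) {

        # check if the finished record has XXXX
        # and if so, drop it; otherwise print it
        print $record unless $record =~ /XXXX/;
        $record = '';
    }

    $record .= $line;
}

close $fh or die "Cannot close $filename: $!";

# the last record is not followed by a header, so check it here
if ( length $record ) {
    print $record unless $record =~ /XXXX/;
}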

( BTW: the language is Perl and the interpreter is perl; there is no such
thing as PERL )
 
Thelma Lubkin

: Dear All,
: For a file with many records like follows, I would like to remove each
: record, if any, containing "XXXXX and the ATCCAAT... follows". If the
: sequence is a single line, i can just simply use

: if (line =~ ^>.*)
: if (line =~ (.*)X(.*) )
: newline = $1;

: anybody has good idea to solve the problem? thanks in advance

I'm not sure that I'm interpreting what you're trying to remove
correctly, but is this perhaps what you're looking for?

#!/usr/bin/perl
use strict;
use warnings;

# atcx is the file of lines that you're trying to process
open my $rd, '<', 'atcx' or die "Couldn't open input file: $!\n";

my @cleanlines;
my $truline = "";
while ( my $line = <$rd> ) {
    chomp $line;
    if ( $line =~ /^>/ ) {                 # Header lines themselves are not kept
        if ( $truline ne "" ) {
            push @cleanlines, $truline;    # Add next cleaned record when
            $truline = "";                 # signal for new record [>] is found
        }
    }
    else {
        $line =~ s/X.*//;                  # Cut from the first X to end of line
        $truline .= $line . "\n";          # Append good part of line
    }
}
close $rd;

push @cleanlines, $truline if $truline ne "";

print "\n\n\n";
print "$_\n\n" foreach @cleanlines;
print "\n";

You might be trying to go even further and skip everything from the
first X to the next new-record signal, in which case you'd stop
appending after that first X and use the loop only to find the next
'>', and then start the process again for the next record.

--thelma
 
charley

Ross said
For a file with many records like follows, I would like to remove each
record, if any, containing "XXXXX and the ATCCAAT... follows". If the
sequence is a single line, i can just simply use

if (line =~ ^>.*)
if (line =~ (.*)X(.*) )
newline = $1;

anybody has good idea to solve the problem? thanks in advance

Hi Ross

If the file can be slurped into memory, the following might be an
approach that will work (provided the format is as you posted).

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $data = do {local $/; <DATA>};

my @parts = grep {!/X+/} split /\n\n(?=>)/, $data;
print Dumper \@parts;

__DATA__
>9P01P10A.y putative DD1A protein [Oryza sativa (japonica cultivar-group)] HSP:949

GAACAATTAGTATAAACTTTAGTTGAATCTCGTTACTATATTAGCTTCGG
AGCTCAATTACAAACAGCTAGCAAAAAATGCCAGGTCCCCCATAAAAGAA
ACCATCATGTTCATAATCAGACACACGGTAGCAATTTGATATATATCCGA
GAGCAGAATTGATTTGATGGGTGTTGCCGCCTGCATCAAAAAACTTGACG
CCACTAAATGATGCAGCGTTTTTGATTGGAGCATTCCCACTGCCCATCGG
AGGACTTGGTTTATGTCCCTTTTTCAAGGCATAGCCACCAAACATTATTG
TCACTGGTTTATTTGACAAGCTTGTAAACAGAGATCTTGGATAGTAACCT
GTAAGTCTGGCTTCACCATTAAACCCATAGTAGACTTGCCAATCACCAGA
AATTTGATCCTTGGATACTCTGACTGTAGTGTATCGTTTGTCGCTAGAGG
TGGTGGAAACAGGGTTAATCACCATTCCTGGAACGATTTCTGAGCTAAAT
ACACTTTCGAATCCAGGACAACGCATATCAGGACAGGCATTAGATCCTTG
AGTAAACCAAGTACTGAAGTGCGTCTGTGAATCATTGTATGATTCAGGCT
CAATATTCCATCCAGCTATAACATTATTTATGGCGGATGCTTCATCCTTA
TTATAAATCGAAATGAAACCTCCTGTTTGTTGTCCATGCTCTAGATTAAA
GCATAAACATCCATGGTGGCCTCTACTCCATAATACGTTATAGCATTATC
TGAAGGACCCCATCCATATACTGCAAGATACAACGTGCCAGCTTGATTTG
ATTCATGACCCGACGAAGATAAATTCACATCAAGTATAAGGGGCATCAAT
GTTTGCCATTTCTTTTGGACCTCTTCTTCCATAGGAGGGAAGCAACTCCT
ACTGTTTGACTACCAAGGAACACACACAGAGCAGTGCAGATTGATTAAAA
ATTTCTCCATATTATATTTGGGGATGGAGAGGGTATATGTTTGAGTTCCC
CGGCGTTAGGCCGATTTCCGGGTACACAAAATGCGGGCTTCCGAGAAAAA
AAATTCCCCCAACCTTGGATTTGTTTTTTTTTTTCTCTTCTTCTTCTACT
CTATTTTTATTTCTTGTGTTTGTTTCTGTACTTTTCTTGTTGTTTTTTGT
GTGTTCTTTTTGTTGTGTTTGTTTTTTTTCTTTTCTTTTTGTTTTTATGT
ATCTATCCTTTCTTATTGTTTGTATTTTTTTTTTGTTATTTTTGTATGTT
TTCTTTGTTGTGTTATTTTTTTGGTTTTCTTTTTTTGTTTTTATCACTTT
CTCTTTGTATTGAGTGCTTTTCTTGTTTTTATTTTGTTGATTCTTTTGTC
TTGTCTCTGTCTTTTTTTTCCGTATATGCTTTGTTTGTTTCTTATCCTTT
GCTTG

>9P01P10B.y prolamin precursor (clone pX24) - rice emb|CAA37850.1| prolamin [Oryza sativa (japonica cultivar-group)] HSP:418

GTTGCTATGAAAGCACTTTATTTCTATTTATATCACCCAAAGTTTCACAT
GTCACATATGATGATATCTGAGCTTATTTTTAACTTCCGAACCACTATAC
TGTTAAAACTCATTACAAGACACCGCCAAGGGTGGTAATGGTACTGGGTG
CACCATAGTACCTAGGGTAGATACCATATCTAGATGGCACGTTAAAAGCC
AATAGAGCTTGAGCTTGAGCCAGATTCCGATCAAAGTAGAGATCACCAAA
CTGCTGGAGTTGTAGCTGCTGCGCTATGGCCTGAACAATGTTAATGTCCT
GATAGTGAGATTGTTGCGCCACCAGCGCGAGATGTTGCCAGACTTGGTTG
TTTCTCAGTTGAAACGCAGCTGATTGCAAGAAGGGGCTTGCCGCTATGCC
ATACTGCTGCCTTACGAACTCATTATATGGGCTAAGCACCTGTTGCTGTA
GCAGGACAGGCGACTGCAGCTGATATTGCCTATAACTTTGACCTAAAACA
TCAAACTGCGCAGAGGCGCTGCATGCAGCAATAGCAAGGAGAGCAAAGAC
GAAAATGATCTTCATTGCTGCGGGACACTANATCTTTCTATTTTTCTGTA
TAATGCTTGAACTGTGTGAACGATCXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCTCTTCAAT
CTCGGGAANNNNNTGTNGGGGTGTTGGGAAATCCCCCCCTTGTTGGGGTT
TTTCTTGGTTAAACACAAGTGTCCCTTCTCTTTAAAAAAAACCCCTTTTC
CTGTTGGGGGGGTNNTTTTTTTTTTTTTCTTTTTTTTTTTTTTNTTTTTT
TCCTTTTTTTTTTTTTTTTTTTTTTCTTTTTGTTCCTTTCTTGTTTCTGT
TTCTCTTTTTTTTTTTTTTTTTTTTTTTGTTTTCTTTTTTTTTTTTTCTG
 
charley

Chris said:
If the file can be slurped into memory, the following might be an
approach that will work (provided the format is as you posted).


#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;


my $data = do {local $/; <DATA>};


my @parts = grep {!/X+/} split /\n\n(?=>)/, $data;
print Dumper \@parts;

Hi again Ross

I saw a possible problem with my solution. If there are any uppercase
X's in the header lines, then the code above will not work. If you knew
that there would be a minimum number of X's in the fasta body proper,
(and none in the header), you might use the following in the grep
regular expression:

! /X{3,}/ or whatever minimum amount of X's would possibly be in the
fasta sequence.
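
Another option, if no safe minimum can be assumed, might be to test only the sequence body and ignore the header line. The following is only a sketch: the helper name body_has_x and the two short sample records are invented for illustration, and it assumes the header is the first line of each record.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the sequence body (everything after the
# first line of the record) contains an X, and 0 otherwise. The header line
# is never inspected.
sub body_has_x {
    my ($record) = @_;
    my ( $header, $body ) = split /\n/, $record, 2;
    return ( defined $body && $body =~ /X/ ) ? 1 : 0;
}

# Made-up sample records
my @records = (
    ">a (clone pX24)\nACGT\nTTGA",        # X only in the header: keep
    ">b plain header\nACXXXXGT\nTTGA",    # X in the sequence: drop
);

my @parts = grep { !body_has_x($_) } @records;
print join( "\n\n", @parts ), "\n";
```

With this check the grep no longer depends on how many X's the masked region has, only on where they appear.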

Also, the use of Data::Dumper was just to show the legit fastas
w/headers held in the @parts array.

To print them you could write:

print join "\n\n", @parts;

Hope this clears things up.

Chris
 
Ross

Dear all,
With all your comments, and with my beginner-level Perl, I wrote my own
solution to the problem; it is at the end of this message. This time I'm
asking about reading character by character in Perl. The problem arises
again in situations like this:

CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA
AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA

where I need to count a continuous run of 'A's as a control. I don't want to
transform the data file into a single-line format. Does Perl have a
getchar()-like function so I can count easily? Thanks again for so many responses.

Gratefully,
Ross


====================================================

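use strict;
use warnings;

my $file = shift or die "Usage: $0 input\n";
my $output = "$file.cleaned";
open my $in,  '<', $file   or die "Could not open $file: $!\n";
open my $out, '>', $output or die "Could not open $output: $!\n";

my $skipping = 0;    # true while discarding the rest of a masked record
while ( my $line = <$in> ) {
    if ( $line =~ /^>/ ) {
        $skipping = 0;                 # a header always starts a new record
    }
    elsif ($skipping) {
        next;                          # still inside the part being dropped
    }
    elsif ( $line =~ /^([^X]*)X/ ) {
        # keep only the bases before the first X, then skip to the next header
        $line = length($1) ? "$1\n" : '';
        $skipping = 1;
    }
    print {$out} $line;
}
close $in;
close $out or die "Could not close $output: $!\n";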
 
Paul Lalli

Ross said:
Dear all,
Indeed with all your comments and based on my beginner ability of perl,
the solution i wrote myself to solve the problem is at the end of this
message. This time, i'm asking about reading character by character in perl.
Again the problem arises whenever situation like:

CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA
AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA

that i need to count a continuous number of 'A' for control. I don't wanna
transform the data file into a single line format. Has perl any getchar()
like function so i can count easily? Thanks again for so many responses.
^^^^^

This is a FAQ:
perldoc -q count
"How can I count the number of occurrences of a substring within a
string?"

The first example in the answer deals with counting single-character
substrings.
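
Perl does have getc and read for character-level input, but for this task it is probably simpler to work on the record as a string. Here is a rough sketch, assuming the wrapped sequence is already held in a scalar; the helper name longest_run is made up, and the tr/// count follows the style of the FAQ answer:

```perl
use strict;
use warnings;

# Hypothetical helper: length of the longest run of $base in a sequence that
# may be wrapped over several lines. The line breaks are removed from a copy,
# so the original data keeps its multi-line format.
sub longest_run {
    my ( $seq, $base ) = @_;
    ( my $flat = $seq ) =~ tr/\n\r //d;
    my $longest = 0;
    while ( $flat =~ /(\Q$base\E+)/g ) {
        my $len = length $1;
        $longest = $len if $len > $longest;
    }
    return $longest;
}

my $seq = <<'END';
CTCTTTTTAGCAAAGAGGAATAATAAAATTGTGTGTTGCCAAAAAAAAAA
AAAAAAAAAAAAAAAAACTTTGTGGGGCCCCCCGGGCCAATTCCCCTCCA
END

my $total_a = ( $seq =~ tr/A// );    # counts every A without changing $seq
print "total A: $total_a, longest run of A: ", longest_run( $seq, 'A' ), "\n";
```

Because the newlines are stripped before matching, a run of A's that continues from the end of one line onto the next is counted as a single run.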

Paul Lalli
 
