How to take action upon a pattern of nth occurrence?

Ross

Dear all,
For the sequence below (in fact a single line), when I use the conditional
check

if ($line =~ /(.*)A{10,}(.*)/ ) {
    $tmpline = $1;
}

to try to remove the substring after 10 or more consecutive A's, Perl seems to
recognize the last poly-A run and leave the former ones intact. What can I do?
In general, how to take action upon a pattern of nth occurrence?

TCCTCAGTGGGAATTCGGCATTACGGCCGGGGCACCACAATGAATGATCATTTTC
TTCTTTGCTCTCCTTGCTATTGCTGCATGCAGCGCCTCTGCGCAGTTTGATGCTG
TTACTCAAGTTTACAGGCAATATCAGCTGCAGCCGCATCTCATGCTGCAGCAACA
GATGCTTAGCCCATGCGGTGAGTTCGTAAGGCAGCAGTGCAGCACAGTGGCAACC
CCCTTCTTCCAATCACCCGTGTTTCAACTGAGAAACTGCCAAGTCATGCAGCAGC
AGTGCTGCCAACAGCTCAGGATGATCGCGCAACAGTCTCACCGCCAGGCCATTAG
TAGTGTTCAGGCGATTGTGCAGCAGCTACAGCTACAACAGTTTGCTGGCGTCTAC
TTCGATCAGACTCAAGCTCAAGCCCAAGCTATGTTGGCCCTAAACTTGCTGTCAA
TATGCGGTATCTACCCAAGCTACAACACTGCTCCCTGTAGCATTCCCACCGTCGG
TGGTATCTGGTACTGAATTGTAGCAGTATAGTAGTACAGGAGAGAAAAATAAAGT
CATGCATCATCGTGTGTGACAAGTTGAAACATCGGGGTGATACAAATCTGAATAA
AAATGTCATGCAAGTTTAAACANNNNANANNNANNNNAAANAAAAAAAAAAAAAA
AAAANANAAAAAAAAAAAAAAAAAAAAAAAAAAANAAAAANAAAAAAAAAAAAAA
AAAAANNNNNNANANNNNNNAAAAAAAAAAAAAAAAANNNNNNNNNNGGGGGGGG
GGGGGGGCGGGAAGAAAAAAAAAAA
 
Gunnar Hjalmarsson

Ross said:
> For the sequence below (in fact a single line), when I use the conditional
> check
>
> if ($line =~ /(.*)A{10,}(.*)/ ) {
------------------------^-^^^^
Why the comma?
Why do you have Perl capture the part you are not interested in?

> $tmpline = $1;
> }
>
> to try to remove the substring after 10 or more consecutive A's, Perl seems to
> recognize the last poly-A run and leave the former ones intact. What can I do?

Did you try making .* non-greedy?

/(.*?A{10})/

(Including A{10} in the capturing parentheses matches your description.)

Please read about greediness in "perldoc perlre".

> In general, how to take action upon a pattern of nth occurrence?

Specifically, what is a "pattern of nth occurrence"?
 
Ross

Gunnar Hjalmarsson said:
------------------------^-^^^^
Why the comma?
Why do you have Perl capture the part you are not interested in?


Did you try making .* non-greedy?

/(.*?A{10})/

(including A{10} in the capturing parentheses matches your description).

Please read about greediness in "perldoc perlre".


Specifically, what is a "pattern of nth occurrence"?

Dear Gunnar Hjalmarsson,
I am a beginner. According to a Perl bible, {10,} means a minimum number with
no maximum. What I'm trying to specify with the pattern is to find the 1st
occurrence of poly A's (10 or more of them) and then retain only the
substring in front of it. To generalize, I wonder if there is a convenient
built-in syntax to test for nth occurrence (I do know that a loop could do
the task). The .*, according to the bible, just represents a substring of
any pattern. Thanks for your response.

--Ross
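
On the nth-occurrence question, there is no single regex switch for it as far as the standard documentation shows, but a short loop over a /g match gets there. The following is only a minimal sketch under stated assumptions: the helper name before_nth_poly_a and the sample sequence are made up for illustration, and a long unbroken run of A's is counted as one occurrence.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): return the part of $seq that
# precedes the $n-th run of $min or more As, or nothing if there is no such run.
sub before_nth_poly_a {
    my ($seq, $n, $min) = @_;
    my $count = 0;
    # In scalar context, /g resumes after the previous match on each iteration.
    while ( $seq =~ /A{$min,}/g ) {
        # $-[0] is the offset at which the current match starts.
        return substr( $seq, 0, $-[0] ) if ++$count == $n;
    }
    return;
}

# Made-up sequence with two poly-A runs (10 and 15 As).
my $seq = 'CCGT' . ('A' x 10) . 'NGT' . ('A' x 15) . 'GG';

print before_nth_poly_a( $seq, 1, 10 ), "\n";   # text before the 1st run
print before_nth_poly_a( $seq, 2, 10 ), "\n";   # text before the 2nd run
```

The match offsets come from the @- array described in "perldoc perlvar"; the scalar-context behaviour of /g is covered in "perldoc perlop".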
 
Gunnar Hjalmarsson

Ross said:
> According to a Perl bible, {10,} means a minimum number with
> no maximum.

I agree that {10,} means minimum, but {10} means exactly 10, not maximum
(which would have been written {0,10}). If your "bible" says something
else, please drop it and use the Perl documentation instead.

The reason I asked about the comma, i.e. matching at least 10, is that
it seems to be unnecessary considering what you are trying to do. But
it's not wrong. Just trying to avoid a redundant character. ;-)

> What I'm trying to specify with the pattern is to find the 1st
> occurrence of poly A's (10 or more of them) and then retain only the
> substring in front of it.

That differs slightly from how you explained it in your original post,
so I'd better change my suggested regex to

/(.*?)A{10}/

Did you try it?

Did you read about greediness in "perldoc perlre"?

> To generalize, I wonder if there is a convenient
> built-in syntax to test for nth occurrence

And I still don't understand the meaning of "nth occurrence".

> The .*, according to the bible, just represents a
> substring of any pattern

Not newline by default, according to the Perl documentation.

Greediness, greediness...
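
The newline remark can be checked with a small experiment. This is a sketch with a made-up two-line string, not the poster's data; it only assumes the documented behaviour that "." skips "\n" unless the /s modifier is given.

```perl
use strict;
use warnings;

# Hypothetical two-line input: the poly-A run sits on the second line.
my $text = "GATTC\nCCGT" . ('A' x 10) . "GG";

# Without /s, "." cannot cross the newline, so the match has to begin on
# the second line and the capture holds only that line's prefix.
my ($default) = $text =~ /(.*?)A{10}/;

# With /s, "." also matches "\n", so the capture spans both lines.
my ($dotall) = $text =~ /(.*?)A{10}/s;

print "without /s: [$default]\n";
print "with /s:    [$dotall]\n";
```

For a sequence that has already been joined into one line, as in this thread, the two forms behave the same.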
 
A. Sinan Unur

> Dear all,
> For the sequence below (in fact a single line), when I use the
> conditional check
>
> if ($line =~ /(.*)A{10,}(.*)/ ) {
> $tmpline = $1;
> }
>
> to try to remove the substring after 10 or more consecutive A's, Perl
> seems to recognize the last poly-A run and leave the former ones intact.
> What can I do? In general, how to take action upon a pattern of nth
> occurrence?

It seems to me that you should be using index rather than regular
expressions, although I am not sure what you mean by "nth occurrence".

If I understand you correctly, you want to find the first string of at
least 10 As, and only keep the substring up to and including the last
character before that string of at least 10 As. That can be translated
directly to Perl in a very straightforward way:
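
#!/usr/bin/perl

use strict;
use warnings;

# Join the wrapped DATA lines into a single sequence string.
my $s;

while( <DATA> ) {
    chomp;
    $s .= $_;
}

# index returns the offset of the first run of ten As, or -1 if there is none.
my $pos = index $s, 'AAAAAAAAAA';

# Keep everything before that run; keep the whole string when no run is found
# (a length of -1 passed to substr would silently drop the last character).
my $r = $pos >= 0 ? substr( $s, 0, $pos ) : $s;

print "$r\n";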


__END__
TCCTCAGTGGGAATTCGGCATTACGGCCGGGGCACCACAATGAATGATCATTTTC
TTCTTTGCTCTCCTTGCTATTGCTGCATGCAGCGCCTCTGCGCAGTTTGATGCTG
TTACTCAAGTTTACAGGCAATATCAGCTGCAGCCGCATCTCATGCTGCAGCAACA
GATGCTTAGCCCATGCGGTGAGTTCGTAAGGCAGCAGTGCAGCACAGTGGCAACC
CCCTTCTTCCAATCACCCGTGTTTCAACTGAGAAACTGCCAAGTCATGCAGCAGC
AGTGCTGCCAACAGCTCAGGATGATCGCGCAACAGTCTCACCGCCAGGCCATTAG
TAGTGTTCAGGCGATTGTGCAGCAGCTACAGCTACAACAGTTTGCTGGCGTCTAC
TTCGATCAGACTCAAGCTCAAGCCCAAGCTATGTTGGCCCTAAACTTGCTGTCAA
TATGCGGTATCTACCCAAGCTACAACACTGCTCCCTGTAGCATTCCCACCGTCGG
TGGTATCTGGTACTGAATTGTAGCAGTATAGTAGTACAGGAGAGAAAAATAAAGT
CATGCATCATCGTGTGTGACAAGTTGAAACATCGGGGTGATACAAATCTGAATAA
AAATGTCATGCAAGTTTAAACANNNNANANNNANNNNAAANAAAAAAAAAAAAAA
AAAANANAAAAAAAAAAAAAAAAAAAAAAAAAAANAAAAANAAAAAAAAAAAAAA
AAAAANNNNNNANANNNNNNAAAAAAAAAAAAAAAAANNNNNNNNNNGGGGGGGG
GGGGGGGCGGGAAGAAAAAAAAAAA
 
Brandon

Try adding ? to the first .* to make it (.*?). The * quantifier is greedy: it
takes up as much as it can while still letting the whole pattern match, so
(.*) runs all the way to the last run of 10 consecutive A's, not the first.
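
A quick way to see the difference is to run both forms on a short string. This is a minimal sketch with a made-up sequence containing two poly-A runs (12 and 11 As); the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical short sequence with two poly-A runs of at least 10 As.
my $line = 'GATTC' . ('A' x 12) . 'NNGG' . ('A' x 11) . 'CT';

# Greedy: .* grabs as much as it can and backs off only until A{10,} can
# still match, so the capture reaches into the LAST run.
my ($greedy) = $line =~ /(.*)A{10,}/;

# Non-greedy: .*? grabs as little as it can, so the capture stops right
# before the FIRST run.
my ($lazy) = $line =~ /(.*?)A{10,}/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Note that the greedy capture even swallows the first A of the 11-A run, because the remaining 10 still satisfy A{10,}.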
 
Debo

On Wed, 6 Jul 2005, Ross wrote:
R> Dear all,
<snip code>
R> to try to remove substring after 10 or more consecutive A's, perl seems to
R> recognize the last poly A's and leave the former ones intact. what can i do?
R> In general, how to take action upon a pattern of nth occurrence?

Generally, if you're trying to do something that seems fairly common --
such as trimming a poly-A tail -- it has already been done in BioPerl :)

This seems to be somewhat along the lines of what you're trying to do:

http://doc.bioperl.org/bioperl-run/Bio/Tools/Run/PiseApplication/trimest.html

If that's not helpful, let me know and I'll see if I can dig up something
better.

-Debo
 
Ross

#!/usr/bin/perl
use strict;
use warnings;

my $s;

while( <DATA> ) {
chomp;
$s .= $_;
}

What is .= ?

It is very difficult to find an explanation on Google, as .= cannot be
searched for. Any suggestions for a novice? Also, does __END__ declare the
beginning of DATA? index can't be searched for either, as web search engines
interpret it with another meaning. Anyway, thanks to all the responders for
their replies. :)
 
Debo

On Wed, 6 Jul 2005, Ross wrote:

R> > #!/usr/bin/perl
R> >
R> > use strict;
R> > use warnings;
R> >
R> > my $s;
R> >
R> > while( <DATA> ) {
R> > chomp;
R> > $s .= $_;
R> > }
R>
R> what is .= ?
R>
R> It is most difficult to find explanation from Google as .= cannot be
R> searched.

If you have questions about perl's operators, 'perldoc perlop' should help
you out.

-Debo
 
A. Sinan Unur

what is .= ?

perldoc perlop

Please do not quote signatures unless you have something to say about
the signature itself. On the other hand, you would benefit from reading
the posting guidelines mentioned above. They contain invaluable
information on how you can help yourself, and help others help you.

With this message, you have given a strong signal that you are not
willing to do much work yourself. I hope, for your sake, that you will
send a strong signal in the opposite direction in your next post.

Sinan
 
Paul Lalli

Ross said:
> What is .= ?
>
> It is very difficult to find an explanation on Google, as .= cannot be
> searched for. Any suggestions for a novice? Also, does __END__ declare the
> beginning of DATA?
> index can't be searched for either, as web search engines
> interpret it with another meaning. Anyway, thanks to all the responders for
> their replies. :)

You seem to be under the impression that searching the entire web, via
Google, is your only hope of understanding Perl. I would suggest that
rather than look for random explanations, you read the actual
documentation that comes with perl. From your command line, type:
perldoc perl

That will get you started with the manual. Once you understand what
you're seeing, the following specific "chapters" of the manual will be
helpful towards answering each of the questions above:

for .= and all other operators:
perldoc perlop

for the index function:
perldoc -f index

for the __END__ and DATA markers:
perldoc perldata

for a Table Of Contents for all topics available in the documentation:
perldoc perltoc

for help using the manual itself (explains the -f option above):
perldoc perldoc

Paul Lalli
 
Ross

Thanks! Your guidance is instructive and comprehensive.
 
Sherm Pendley

Ross said:
what is .= ?

Other folks have mentioned perlop, which explains the "how" but doesn't say
much about the "why". I'd imagine that's because perlop is more of a reference
than a tutorial, and the "why" actually predates Perl anyway.

Many languages that relate to C in some way have operators like these, .=,
-=, *=, |=, etc., because it's *very* common to write something like this:

$foo = $foo + $bar;

It's so common, in fact, that a shortcut operator was created just for that
purpose, combining the assignment and addition:

$foo += $bar;

Back in the day of straightforward, non-optimizing C compilers, the short form
was an optimization. The long form would generate instructions to perform the
addition, store the results, and immediately re-read that same memory location
to do the assignment. The C compilers of the day wouldn't optimize that into a
simpler set of instructions that would operate directly on the target, which is
what the abbreviated form would do.

Another, even shorter set of operators is ++ and --, which replace this:

$foo += 1;

with this:

$foo++;

Again, at one time that was an optimization. Early C compilers would generate
the same addition instructions every time += was used. Many CPUs have a direct
"increment by one" instruction that would be generated instead, whenever ++
was used.
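
To tie this back to the original question about .=, here is a small sketch showing the long and short forms side by side. The chunk strings are just made-up sample data.

```perl
use strict;
use warnings;

my @chunks = ('TCCTCA', 'GTGGGA', 'ATTCGG');

# Long form: concatenate, then assign the result back to the same variable.
my $long = '';
$long = $long . $_ for @chunks;

# Short form: .= appends to the variable in place.
my $short = '';
$short .= $_ for @chunks;

# += and ++ follow the same pattern for numbers.
my $count = 0;
$count += length $_ for @chunks;   # same as $count = $count + length $_
$count++;                          # same as $count += 1

print "$long\n$short\n$count\n";
```

Both strings end up identical, which is all that .= does in the earlier script: it appends each chomped line to $s.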

sherm--
 
Sherm Pendley

A. Sinan Unur said:
IMHO, it is a cognitive optimization for the programmer as well. Once
one gets used to the notation, the difference between "Add $bar to $foo
and store the result in $foo" and "add $bar to $foo, and store the
result somewhere else" becomes easier to see, helping (at least me) with
comprehension.

I agree, but that's something that was discovered later, when C became more
widely adopted and used. That's why it's still in common use today, now that
optimizing compilers and cheap hardware have made the original purpose moot.

sherm--
 
A. Sinan Unur

> Other folks have mentioned perlop, which explains the "how" but
> doesn't say much about the "why". I'd imagine that's because perlop is
> more of a reference than a tutorial, and the "why" actually predates
> Perl anyway.
>
> Many languages that relate to C in some way have operators like these,
> .=, -=, *=, |=, etc., because it's *very* common to write something
> like this:
>
> $foo = $foo + $bar;
>
> It's so common, in fact, that a shortcut operator was created just for
> that purpose, combining the assignment and addition:
>
> $foo += $bar;
>
> Back in the day of straightforward, non-optimizing C compilers, the
> short form was an optimization.

IMHO, it is a cognitive optimization for the programmer as well. Once
one gets used to the notation, the difference between "Add $bar to $foo
and store the result in $foo" and "add $bar to $foo, and store the
result somewhere else" becomes easier to see, helping (at least me) with
comprehension.

Sinan
 
John W. Kennedy

Sherm said:
I agree, but that's something that was discovered later, when C became more
widely adopted and used. That's why it's still in common use today, now that
optimizing compilers and cheap hardware have made the original purpose moot.

It also has a semantic function:
$a[function_call()] = $a[function_call()] + $b;
and
$a[function_call()] += $b;
have different results if function_call() is not idempotent, because the
long form calls it twice and the short form calls it only once.
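
A minimal sketch of that point, assuming a made-up function next_index() whose return value changes on every call (it is hypothetical, as are the array contents):

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical function with a side effect: each call returns the next index.
sub next_index { return $calls++ }

my $inc = 5;

# Long form: next_index() runs twice, so the element that is read and the
# element that is written are different ones.
my @long = (10, 20, 30);
$long[ next_index() ] = $long[ next_index() ] + $inc;
my $long_calls = $calls;

# Short form: next_index() runs once; the same element is read and updated.
$calls = 0;
my @short = (10, 20, 30);
$short[ next_index() ] += $inc;
my $short_calls = $calls;

print "long:  @long ($long_calls calls)\n";
print "short: @short ($short_calls calls)\n";
```

The two arrays end up different even though the statements look equivalent at first glance.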

--
John W. Kennedy
"You can, if you wish, class all science-fiction together; but it is
about as perceptive as classing the works of Ballantyne, Conrad and W.
W. Jacobs together as the 'sea-story' and then criticizing _that_."
-- C. S. Lewis. "An Experiment in Criticism"
 
