Problem with global text search using Regular Expressions

Jayal

Hi
I am trying to perform a global search in a string using regular
expressions.

Consider the string

"attgccgccgccatt"

If I search for the string "ccgcc" within it using the operation given
below, I get only one match:

$a =~ /ccgcc/g;
$location = pos......

which would be the first match.
What do I do if I want to get both occurrences?

I also tried the index() function but that does not allow use of
regular expressions. Is there any function available for this?

Thank you in advance

Jayal
 
A. Sinan Unur

You'll need to read up on your Perl regexes. Counting only non-overlapping
matches, 'ccgcc' occurs just once in the string above: once a match is made,
the search for the next one starts right after it, at the third 'g' in the
source string.

That is,

my $s = 'attgccgccgccatt';

while ( $s =~ /ccgcc/g ) {
    print 'Position: ', pos $s, "\n";
}

will only print "Position: 9".

I remember seeing something to deal with this kind of a situation in the
Cookbook, but I cannot remember how it is done.
 
Matt Garrish

You can reset pos after the match. In this case, you'd set it back two
characters to match the next ccgcc:

my $s = 'attgccgccgccatt';

while ( $s =~ /ccgcc/g ) {
    print 'Position: ', pos $s, "\n";
    pos($s) -= 2;
}

Matt
 
Josef Moellers

Matt said:
You can reset pos after the match. In this case, you'd set it back two
characters

I'd suggest rewinding by length('ccgcc') - 1 instead;
then, e.g., a search for 'cccc' in 'cccccc' would produce the "expected" 3
matches.
Agreed, it wouldn't produce any more matches in the original case and
would take slightly longer, but for the sake of generality ...
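
The rewind-by-length-minus-one idea can be sketched as a small helper. This is only an illustration: the sub name overlapping_starts is made up for this example, and it assumes the search term is a literal string (hence the \Q...\E quoting) rather than an arbitrary pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of all (possibly overlapping)
# occurrences of the literal string $needle in $s.
sub overlapping_starts {
    my ($s, $needle) = @_;
    my @starts;
    return @starts if $needle eq '';    # an empty needle is not meaningful here

    while ( $s =~ /\Q$needle\E/g ) {
        # pos($s) is the offset just past the match, so subtract the length
        # to get the offset where the match began.
        push @starts, pos($s) - length($needle);

        # Rewind so the next search starts one character after the
        # beginning of this match.
        pos($s) -= length($needle) - 1;
    }
    return @starts;
}

print join(',', overlapping_starts('cccccc', 'cccc')), "\n";            # 0,1,2
print join(',', overlapping_starts('attgccgccgccatt', 'ccgcc')), "\n";  # 4,7
```

Note that these are start offsets, whereas the loops above print pos, which is the offset just past each match.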
 
Anno Siegel

A. Sinan Unur said:
I remember seeing something to deal with this kind of a situation in the
Cookbook, but I cannot remember how it is done.

The Cookbook probably uses lookahead in place of a plain match. Lookahead
doesn't consume the characters it looks at, so it can detect overlapping
matches. It is still possible to capture the match "inside" the
lookahead. So

print "$_\n" for $s =~ /(?=(ccgcc))/g;

prints "ccgcc" twice.
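
If the offsets are wanted as well, the same lookahead can be used in a while loop. The following is a minimal sketch, assuming the string from the original question; the @hits array is just a name chosen for this example. Because the lookahead matches zero characters, pos($s) is left at the offset where the captured text begins.

```perl
use strict;
use warnings;

my $s = 'attgccgccgccatt';

my @hits;    # each entry is [ start offset, matched text ]
while ( $s =~ /(?=(ccgcc))/g ) {
    # The overall match is empty, so pos($s) is the start of the
    # text captured inside the lookahead.
    push @hits, [ pos($s), $1 ];
}

printf "offset=%d match=%s\n", @$_ for @hits;
# offset=4 match=ccgcc
# offset=7 match=ccgcc
```

Perl does not allow a second empty match at the same position, so the loop moves forward by itself and no manual adjustment of pos is needed. $-[1] should report the same offset as pos($s) here.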

Anno
 
A. Sinan Unur

Anno Siegel wrote:
It probably uses lookahead in place of a plain match.

Thanks Anno. Should've thought of that but I rarely use anything other than
the most basic regexes. Thanks also to Matt for reminding me that pos
returns an lvalue.

Sinan.
 
Anno Siegel

A. Sinan Unur said:
Thanks Anno. Should've thought of that but I rarely use anything other than
the most basic regexes.

...which is as it should be. Only a small regex is a good regex :)

Anno
 
Charles DeRykus

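With a slight modification to Anno's solution, you can also get the
match positions:

my $s = 'attgccgccgccatt';
my $pos = 0;
while ( $s =~ /(.+?)(?=(ccgcc))/g ) {
    # $1 is the text skipped since the previous match start
    print "pos=", ($pos += length $1), " match=$2\n";
}

This prints:

pos=4 match=ccgcc
pos=7 match=ccgcc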


hth,
 
Charles DeRykus

Needs a mod to the mod, since (.+?) requires at least one character
before the match and so misses a match at position 0:

# mod to handle match string at pos 0
my $pos = 0;
while ( $s =~ /(^|.+?)(?=(ccgcc))/g ) {
    print "pos=", ($pos += length $1), " match=$2\n";
}
 
