Match a number of repeated chars, but NO MORE.

usenet

One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?
 
Eric J. Roode

I can get you part of the way there. Perhaps someone better at regexes
can take you the rest of the way.

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+)   # Your match
            \1{2}         # Two more of it
            (?!\1)        # But not another one
            (?<!\1{4})    # Not preceded by 4 of \1 at this point
           /x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

I'm not sure where to take it from here, sorry.

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`
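
A minimal sketch of the lookahead-only behaviour described in this post, assuming a single-character capture rather than the "+" version. The helper name match_offset is hypothetical; it only reports where the pattern first matches.

```perl
use strict;
use warnings;

# Single-character capture plus a negative lookahead.
my $lookahead_only = qr/(\w)\1{2}(?!\1)/;

# Hypothetical helper: offset of the first match, or -1 if there is none.
sub match_offset {
    my ($string) = @_;
    return $string =~ $lookahead_only ? $-[0] : -1;
}

# For the five-C string the match starts two characters into the run,
# i.e. the lookahead alone still accepts the tail of a longer run.
printf "%-12s => %d\n", $_, match_offset($_) for qw(abCCCdefg abCCCCCdefg);
```

The offset for 'abCCCCCdefg' is what shows the problem: the regex slides along until only three C's remain before the 'd'.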
 
it_says_BALLS_on_your_forehead

i believe that \w already includes \d as well as '_', so [\w-] would be
the char class you want.
 
it_says_BALLS_on_your_forehead

Eric said:
[...]
But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

hmm, i'm aware of that constraint with lookbehinds. maybe it's too
early in the morning, but would you need lookbehinds? don't the matches
on the string occur from left to right, so you only need the negative
lookahead?
 
Anno Siegel

One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

/(\w)\1{2}/;

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

Following this vein leads to something like this

my $re = qr/
    (.)       # any character
    (?!\1)    # ...followed by a different character
    (\w)      # ...which is a word character
    \2{2}     # ...followed by exactly two copies of itself
    (?!\2)    # ...followed by a different character
/x;

That works with the given examples, but only if there is actual text
before and after the repeated group, not if the repetitions appear
in the beginning or end of the string. Not to mention elegance...

Conclusion: It probably can be done in a single regex, but I doubt it
is worth the effort.

/((\w)\2{2,})/ and length( $1) == 3

Anno
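
One way the commented regex above might be extended to cope with a run at the very start of the string is to make the leading "different character" optional via an alternation with ^. This is only a sketch under that assumption, not something proposed in the thread; the names $exactly_three and has_exact_triple are hypothetical. As far as I can tell the trailing (?!\2) is already satisfied at the end of the string, so only the beginning needs the extra case.

```perl
use strict;
use warnings;

my $exactly_three = qr/
    (?: ^ | (.) (?!\1) )   # start of string, or a character that is not repeated next
    (\w)                   # the candidate character
    \2{2}                  # two more copies of it
    (?!\2)                 # and no fourth copy
/xs;

# Hypothetical helper: 1 if the string contains a run of exactly three, else 0.
sub has_exact_triple {
    my ($string) = @_;
    return $string =~ $exactly_three ? 1 : 0;
}

printf "%-12s => %d\n", $_, has_exact_triple($_)
    for qw(abCCCdefg abCCCCCdefg CCCdef abCCC CCCC);
```

Whether this counts as elegant is another matter; the length check at the end of the post is still easier to read.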
 
Anno Siegel

Eric J. Roode said:
(e-mail address removed) wrote in @g14g2000cwa.googlegroups.com:
But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>
[...]

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+)   # Your match
            \1{2}         # Two more of it
            (?!\1)        # But not another one
            (?<!\1{4})    # Not preceded by 4 of \1 at this point
           /x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

Capturing multiple characters isn't right anyway, the "+" ought to
be outside the parentheses. (With 6 or more "C", the difference shows.)
But that doesn't solve the problem with variable-length lookbehind.
It complains if you try to interpolate a backreference, even if the
backreference can logically only have one definite length.

Anno
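
A short sketch of the difference mentioned above for six or more "C". The test string and variable names are made up for illustration; the character class is shortened to [\w-].

```perl
use strict;
use warnings;

my $string = "abCCCCCCd";    # six C's

# "+" inside the capture: the group may grab two C's and repeat that pair.
my ($multi)  = $string =~ /([\w-]+)\1{2}(?!\1)/;

# One character captured: only ever three of the same character.
my ($single) = $string =~ /([\w-])\1{2}(?!\1)/;

print "multi-character capture:  '$multi'\n";
print "single-character capture: '$single'\n";
```

With the "+" inside the parentheses, the six C's are accepted as three copies of a two-character group, which is not what the OP asked for.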
 
Anno Siegel

[...]

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

/(\w)\1{2}/;

Anno
 
it_says_BALLS_on_your forehead

Anno said:
[...]

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

are you sure it doesn't match underscore?

my $string2 = '_';
if ( $string2 =~ m/\w/ ) {
    print "underscore matched.\n";
}
else {
    print "underscore did not match.\n";
}

__OUTPUT__
underscore matched.
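
The same check can be widened a little to cover the other characters in the OP's class. A minimal sketch; the helper name is_word_char is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: does a single character belong to \w?
sub is_word_char {
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

# Letter, digit and underscore are expected to match; the hyphen is not,
# which is why [\w-] is enough to replace [\w\d_-].
for my $char ('a', '7', '_', '-') {
    printf "'%s' %s \\w\n", $char, is_word_char($char) ? 'matches' : 'does not match';
}
```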
 
it_says_BALLS_on_your forehead

Anno said:
[...]

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when
using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
    print "$1. matched\n";
}
else {
    print "no match\n";
}

__OUTPUT__
CCC. matched
 
Anno Siegel

it_says_BALLS_on_your forehead said:
[...]

actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when

No, it is a character escape. In a non-regex double-quotish string, as in
the interior of [] in a regex, "\1" is the character chr(1), etc.

using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
    print "$1. matched\n";
}
else {
    print "no match\n";
}

__OUTPUT__
CCC. matched

So? It matched the last three "C" before "d", as enforced by the
lookahead:

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)(.*)/) {
    print "$1. matched before $3\n";
}
else {
    print "no match\n";
}

CCC. matched before d

Anno
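
A small sketch of the character-escape point, using the OP's original idea with a single-character capture. The helper name class_version_matches and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Inside [...] the \1 is read as an octal escape for chr(1),
# not as a backreference to the first capture group.
my $re = qr/(\w)\1{2}[^\1]/;

# Hypothetical helper: 1 if the string matches, 0 otherwise.
sub class_version_matches {
    my ($string) = @_;
    return $string =~ $re ? 1 : 0;
}

my @cases = (
    [ 'abCCCd',         "abCCCd" ],
    [ 'abCCCCCd',       "abCCCCCd" ],       # a fourth C is accepted by [^\1]
    [ 'abCCC + chr(1)', "abCCC\x01" ],      # only chr(1) is rejected
);
printf "%-16s => %d\n", $_->[0], class_version_matches($_->[1]) for @cases;
```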
 
it_says_BALLS_on_your forehead

ahh, i suspected that was happening, but hadn't pursued it further--my
fault for being lazy. thanks for the illumination Anno.
 
xicheng

A test string for some proposed solutions:
$_="asCCCCCCChwCCCsad";
------------------------------------------------------
# it_says_BALLS_on_your forehead's solution:
print $` if /((\w)\2{2})(?!\2)/;
#asCCCC ==> no
-------------------------------------------------------
#Anno's solution:
print $` if(/((\w)\2{2,})/ and length( $1) == 3);
#empty ==> no
#Anno's thought:
while (/((\w)\2{2,})/g) {
    print $` if (length($1) == 3);
}
#asCCCCCCChw => ok
--------------------------------------------------------
#Steven's solution:
print $` if /
    (\w)                                # a char
    (??{ '(?<=' . ("$1" x 3) .')' })    # as the third in a series
    (??{ '(?<!' . ("$1" x 4) .')' })    # but not the fourth
    (?!\1)/x;                           # not followed by the same
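
The loop version of Anno's idea can be wrapped in a small sub so it is easy to try against other test strings. This is only a sketch: the name exact_run_offsets and the choice to return offsets are assumptions, and the pattern uses \2* so that each run is consumed whole whatever length is requested.

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets of every run of exactly $n identical
# \w characters in $string.
sub exact_run_offsets {
    my ($string, $n) = @_;
    my @offsets;
    # Each iteration consumes one complete run of a repeated character.
    while ($string =~ /((\w)\2*)/g) {
        push @offsets, $-[1] if length($1) == $n;
    }
    return @offsets;
}

my @hits = exact_run_offsets("asCCCCCCChwCCCsad", 3);
print "runs of exactly three start at: @hits\n";
```

Because the whole run is consumed before its length is checked, the tail of a longer run is never mistaken for a run of three.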
 
it_says_BALLS_on_your forehead

A test string for some proposed solutions:
$_="asCCCCCCChwCCCsad";

actually, that was NOT my solution. i stated that the above regex did
NOT work.
 
usenet

Agreed. The latter is much better than this:

print if /
    (\w)                                # a char
    (??{ '(?<=' . ("$1" x 3) .')' })    # as the third in a series
    (??{ '(?<!' . ("$1" x 4) .')' })    # but not the fourth
    (?!\1)/x;                           # not followed by same char

I also agree that Anno's solution is probably the most practical
solution that's been proposed, but this is a VERY interesting approach
(and I learned something today!) Thanks!
 
robic0

I also agree that Anno's solution is probably the most practical
solution that's been proposed, but this is a VERY interesting approach
(and I learned something today!) Thanks!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Perl sucks?
 
usenet

robic0 said:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Perl sucks?

Actually, my appreciation of Perl was raised a little. But I wasn't
really thinking about Perl at all when I wrote those comments. I was
thinking of the logic of Steven's algorithm, which is independent of
the programming language which expresses it. I really do believe that
Donald Knuth himself would admire Steven's approach.
 
attn.steven.kuo

Credit should go to Anno, who analyzed the problem in a
succinct and logical way. I just worked the problem a little
bit further.

To allow the regex to match a sequence of any character, one
should add the \Q escape sequence -- that's something that I
previously neglected:

print if /(.)
          (??{ '(?<=' . ("\Q$1\E" x 3) . ')' })
          (??{ '(?<!' . ("\Q$1\E" x 4) . ')' })
          (?!\1)/x;


And if I ever were to make the same offer as Knuth -- to pay
a small finder's fee to others who could find a bug in my programs
-- I'd end up owing more than the U.S. National Debt. :)
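
For anyone who wants to try the quoted version on strings with regex metacharacters, here is a minimal sketch that wraps it in a sub. It assumes a reasonably modern Perl where $1 is visible inside the (??{ }) blocks, and it swaps "\Q$1\E" for the equivalent quotemeta($1); the names $third_of_exactly_three and has_run_of_three are hypothetical.

```perl
use strict;
use warnings;

my $third_of_exactly_three = qr/
    (.)                                           # any character
    (??{ '(?<=' . (quotemeta($1) x 3) . ')' })    # it ends a run of three
    (??{ '(?<!' . (quotemeta($1) x 4) . ')' })    # the run is not four long
    (?!\1)                                        # and the run stops here
/xs;

# Hypothetical helper: 1 if the string holds a run of exactly three of
# any character, 0 otherwise.
sub has_run_of_three {
    my ($string) = @_;
    return $string =~ $third_of_exactly_three ? 1 : 0;
}

printf "%-10s => %d\n", $_, has_run_of_three($_) for ('ab+++cd', 'ab++++cd', '+++');
```

Without the quoting, a character such as "+" would be pasted into the generated lookbehind as a quantifier rather than as a literal.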
 
