Match a number of repeated chars, but NO MORE.

usenet · Dec 2, 2005

One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?

Eric J. Roode · Dec 2, 2005

(e-mail address removed) wrote in @g14g2000cwa.googlegroups.com:

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?

I can get you part of the way there. Perhaps someone better at regexes
can take you the rest of the way.

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+) # Your match
\1{2} # Two more of it
(?!\1) # But not another one
(?<!\1{4}) # Not preceded by 4 of \1 at this point
/x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

I'm not sure where to take it from here, sorry.

--
Eric
`$=`;$_=\%!;($_)=/(.)/;$==++$|;($.,$/,$,,$\,$",$;,$^,$#,$~,$*,$:,@%)=(
$!=~/(.)(.).(.)(.)(.)(.)..(.)(.)(.)..(.)......(.)/,$"),$=++;$.++;$.++;
$_++;$_++;($_,$\,$,)=($~.$"."$;$/$%[$?]$_$\$,$:$%[$?]",$"&$~,$#,);$,++
;$,++;$^|=$";`$_$\$,$/$:$;$~$*$%[$?]$.$~$*${#}$%[$?]$;$\$"$^$~$*.>&$=`

it_says_BALLS_on_your_forehead · Dec 2, 2005

One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?

i believe that \w includes \d as well as '_', [\w-] would be the char
class you want.

it_says_BALLS_on_your_forehead · Dec 2, 2005

Eric said:
(e-mail address removed) wrote in @g14g2000cwa.googlegroups.com:

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

Are there any elegant ideas?

I can get you part of the way there. Perhaps someone better at regexes
can take you the rest of the way.

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+) # Your match
\1{2} # Two more of it
(?!\1) # But not another one
(?<!\1{4}) # Not preceded by 4 of \1 at this point
/x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

I'm not sure where to take it from here, sorry.

hmm, i'm aware of that constraint with lookbehinds. maybe it's too
early in the morning, but would you need lookbehinds? don't the matches
on the string occur from left to right, so you only need the negative
lookahead?

Anno Siegel · Dec 2, 2005

Eric J. Roode said:
(e-mail address removed) wrote in @g14g2000cwa.googlegroups.com:

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

[...]

First, use a "negative lookahead assertion": (and use the x
modifier!)

$string =~ / ([\w\d_-]+) \1{2} (?!\1) /x;

But there's still a problem: Though it won't match the first or second
"CCC" in your above string, it will match the third "CCC". In other
words, it'll match the "CCC" that begins after "abCC".

So you'll need to use a negative lookbehind assertion, too:

$string =~ /([\w\d_-]+) # Your match
\1{2} # Two more of it
(?!\1) # But not another one
(?<!\1{4}) # Not preceded by 4 of \1 at this point
/x;

But there's a problem: since your match is variable-length (due to the +
quantifier), the negative lookbehind is variable-length, and that is
unfortunately not yet implemented in Perl.

Capturing multiple characters isn't right anyway, the "+" ought to
be outside the parentheses. (With 6 or more "C", the difference shows.)
But that doesn't solve the problem with variable-length lookbehind.
It complains if you try to interpolate a backreference, even if the
backreference can logically only have one definite length.
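
A small sketch of the difference mentioned above, assuming a test string with six "C" characters; the helper names plus_inside and single_char are made up for this illustration. With the "+" inside the parentheses the group is free to capture "CC", so six in a row are accepted as three copies of "CC":

```perl
use strict;
use warnings;

# Hypothetical helper: "+" inside the capture, as in the original attempt.
# Returns (captured group, whole match), or an empty list on failure.
sub plus_inside {
    my ($string) = @_;
    return $string =~ /(([\w\d_-]+)\2{2}(?!\2))/ ? ($2, $1) : ();
}

# Hypothetical helper: capture exactly one character instead.
sub single_char {
    my ($string) = @_;
    return $string =~ /(([\w\d_-])\2{2}(?!\2))/ ? ($2, $1) : ();
}

my $six = "abCCCCCCdefg";
printf "plus inside: group '%s', match '%s'\n", plus_inside($six);
printf "single char: group '%s', match '%s'\n", single_char($six);
```

The single-character version still matches here (the last three "C"), so this only illustrates what the capture holds, not a fix for the original question.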

Anno

Anno Siegel · Dec 2, 2005

One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

/(\w)\1{2}/;

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).
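
To make the point about [^\1] concrete, here is a minimal sketch; the helper names naive_match and class_matches_chr1 are invented for this example, and the capture is reduced to a single character. Inside a bracketed class \1 is read as the character chr(1), so the class only excludes that control character and both test strings match:

```perl
use strict;
use warnings;

# Hypothetical helper: the attempted pattern, with a single-character capture.
sub naive_match {
    my ($string) = @_;
    return $string =~ /([\w-])\1{2}[^\1]/ ? 1 : 0;
}

# Inside [...] the sequence \1 stands for chr(1), not a backreference.
sub class_matches_chr1 {
    return "\x01" =~ /^[\1]$/ ? 1 : 0;
}

print "chr(1) matches [\\1]: ", class_matches_chr1(), "\n";
for my $s ("abCCCdefg", "abCCCCCdefg") {
    printf "%-12s %s\n", $s, naive_match($s) ? "matches" : "no match";
}
```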

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.
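
A quick way to see this, as a sketch with a made-up helper name (lookahead_only_offset): report the offset at which the lookahead-only pattern matches. For the five-"C" string the match simply slides right until the lookahead is satisfied.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset where the lookahead-only
# pattern matches, or -1 if it does not match at all.
sub lookahead_only_offset {
    my ($string) = @_;
    return $string =~ /([\w-])\1{2}(?!\1)/ ? $-[0] : -1;
}

for my $s ("abCCCdefg", "abCCCCCdefg") {
    printf "%-12s offset %d\n", $s, lookahead_only_offset($s);
}
```

In "abCCCCCdefg" the run starts at offset 2, but the reported match starts at offset 4, i.e. the last three "C".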

Following this vein leads to something like this

my $re = qr/
(.) # any character
(?!\1) # ...followed by a different character
(\w) # ...which is a word character
\2{2} # ...followed by exactly two copies of itself
(?!\2) # ...followed by a different character
/x;

That works with the given examples, but only if there is a character
before the repeated group: it fails when the repetition starts the
string. (The trailing (?!\2) is harmless at the end of the string, since
a negative lookahead succeeds when nothing follows.) Not to mention
elegance...
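
The boundary behaviour can be checked with a short sketch. The first pattern is the one above; the second is only one possible variation (an assumption of this note, not something proposed in the thread) that also accepts the start of the string in place of the leading character. The helper names are hypothetical.

```perl
use strict;
use warnings;

# The pattern from the post: needs a real character before the run.
my $needs_context = qr/
    (.)      # any character
    (?!\1)   # ...followed by a different character
    (\w)     # ...which is a word character
    \2{2}    # ...followed by exactly two copies of itself
    (?!\2)   # ...not followed by another copy
/x;

# Assumed variation: start of string OR a differing character before the run.
my $anchored_variant = qr/
    (?: ^ | (.) (?!\1) )   # start of string, or a character unlike the next one
    (\w)                   # the repeated word character
    \2{2}                  # two more copies
    (?!\2)                 # not followed by another copy
/xs;

sub needs_context    { my ($s) = @_; return $s =~ $needs_context    ? 1 : 0 }
sub anchored_variant { my ($s) = @_; return $s =~ $anchored_variant ? 1 : 0 }

for my $s ("abCCCdefg", "abCCCCCdefg", "CCCdefg", "abCCC", "CCCC") {
    printf "%-12s post: %d  variant: %d\n",
        $s, needs_context($s), anchored_variant($s);
}
```

Note that the variation consumes the character before the run, so the overall match is one character wider than the run itself; the run is in group 2 repeated, not in the whole match.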

Conclusion: It probably can be done in a single regex, but I doubt it
is worth the effort.

/((\w)\2{2,})/ and length( $1) == 3
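
A minimal sketch of this capture-then-check idea, extended with /g so that every run in the string is examined rather than only the first one. The function name exact_runs and its return shape are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, run] for every run of a repeated
# word character whose length is exactly $n.
sub exact_runs {
    my ($string, $n) = @_;
    my @hits;
    # (\w)\2* greedily swallows one whole run per iteration.
    while ($string =~ /((\w)\2*)/g) {
        push @hits, [ $-[1], $1 ] if length($1) == $n;
    }
    return @hits;
}

for my $s ("abCCCdefg", "abCCCCCdefg", "asCCCCCCChwCCCsad") {
    my @hits = exact_runs($s, 3);
    printf "%-18s %s\n", $s,
        @hits ? join(", ", map { "$_->[1] at $_->[0]" } @hits)
              : "no run of exactly 3";
}
```

Because each iteration consumes a complete run, a long run can never be mistaken for a short one, which is the problem the lookahead-only patterns run into.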

Anno

it_says_BALLS_on_your forehead · Dec 2, 2005

Anno said:
One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

are you sure it doesn't match underscore?

my $string2 = '_';
if ( $string2 =~ m/\w/ ) {
print "underscore matched.\n";
}
else {
print "underscore did not match.\n";
}

__OUTPUT__
underscore matched.

it_says_BALLS_on_your forehead · Dec 2, 2005

Anno said:
One particular aspect of a question in another newsgroup
(http://tinyurl.com/cbakx) interested me; I played around with some
solutions but couldn't come up with one that I thought was elegant. So
I thought I would introduce the question to this group for further
enlightenment.

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

#!/usr/bin/perl
use warnings; use strict;
my $string = "abCCCdefg";
print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}/;
__END__

That regex isn't quite correct, it should only capture one occurrence
of the repeated character, not more. Also, \w already matches digits
and underscore:

[Later correction: It doesn't match underscore. I'm not correcting the
code, it doesn't matter to the discussion]

/(\w)\1{2}/;

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when
using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
print "$1. matched\n";
}
else {
print "no match\n";
}

__OUTPUT__
CCC. matched

Anno Siegel · Dec 2, 2005

it_says_BALLS_on_your forehead said:
Anno said:

(e-mail address removed) wrote in comp.lang.perl.misc:

[...]
[...]

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when

No, it is a character escape. In a non-regex double-quotish string, as in
the interior of [] in a regex, "\1" is the character chr(1), etc.

using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
print "$1. matched\n";
}
else {
print "no match\n";
}

__OUTPUT__
CCC. matched

So? It matched the last three "C" before "d", as enforced by the
lookahead:

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)(.*)/) {
print "$1. matched before $3\n";
}
else {
print "no match\n";
}

CCC. matched before d

Anno

it_says_BALLS_on_your forehead · Dec 2, 2005

Anno said:
it_says_BALLS_on_your forehead said:

Anno said:

(e-mail address removed) wrote in comp.lang.perl.misc:

[...]

<paraphrase> of the OP's question:

Suppose I have a string of characters: "abCCCdefg". I want to match
three consecutive occurrences of any character in a class. In this
example, my expression would match 'CCC'. OK, that's easy:

[...]

But, suppose I wanted to constrain the match so that it would match
three consecutive occurrences, but NO MORE than three. In other words,
'abCCCCCdefg' would NOT match. </paraphrase>

I thought I could propose an 'elegant' answer like this:

print "Match!\n" if $string =~ /([\w\d_-]+)\1{2}[^\1]/;

but that doesn't work (it seems that \1 gets "used up" somehow). Of
course, I could write a bunch of code to do it... that's trivial to do
(but ugly, IMHO).

\1 doesn't get consumed, it is interpolated as a character escape, not
a backreference. [^\1] matches all characters except chr(1).

A negative lookahead works as intended, but still doesn't solve the
problem:

qr/([\w\d_-])\1{2}(?!\1)/;

This forces the following character to be different from \1, but
then the regex just moves on and matches the last three "C" in
"abCCCCCdefg". I don't see a way to force it to match only if
the preceding character is different from the repeated one.

actually, does the negative lookahead even work? it doesn't seem to. i
appear to get the same results as the OP, although for a different
reason perhaps, since you say that in the context of a character class,
\1 simply is an escaped 1, which is the same as the number 1. when

No, it is a character escape. In a non-regex double-quotish string, as in
the interior of [] in a regex, "\1" is the character chr(1), etc.

using the negative lookahead, it appears that the \1 is 'consumed'
already.
(in the example below, it would be \2).

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)/) {
print "$1. matched\n";
}
else {
print "no match\n";
}

__OUTPUT__
CCC. matched

So? It matched the last three "C" before "d", as enforced by the
lookahead:

my $testString = "abCCCCd";
if ($testString =~ m/((\w)\2{2})(?!\2)(.*)/) {
print "$1. matched before $3\n";
}
else {
print "no match\n";
}

CCC. matched before d

ahh, i suspected that was happening, but hadn't pursued it further--my
fault for being lazy. thanks for the illumination Anno.

xicheng · Dec 2, 2005

A test string for some proposed solutions:
$_="asCCCCCCChwCCCsad";
------------------------------------------------------
# it_says_BALLS_on_your forehead's solution:
print $` if/((\w)\2{2})(?!\2)/;
#asCCCC ==> no
-------------------------------------------------------
#Anno's solution:
print $` if(/((\w)\2{2,})/ and length( $1) == 3);
#empty ==> no
#Anno's thought:
while(/((\w)\2{2,})/g) {
print $` if(length( $1) == 3);
}
#asCCCCCCChw => ok
--------------------------------------------------------
#Steven's solution:
print $` if /
(\w) # a char
(??{ '(?<=' . ("$1" x 3) .')' }) # as the third in a series
(??{ '(?<!' . ("$1" x 4) .')' }) # but not the fourth
(?!\1)/x; # not followed by the same

it_says_BALLS_on_your forehead · Dec 2, 2005

A test string for some proposed solutions:
$_="asCCCCCCChwCCCsad";

actually, that was NOT my solution. i stated that the above regex did
NOT work.

usenet · Dec 2, 2005

Agreed. The latter is much better than this:

print if /
(\w) # a char
(??{ '(?<=' . ("$1" x 3) .')' }) # as the third in a series
(??{ '(?<!' . ("$1" x 4) .')' }) # but not the fourth
(?!\1)/x; # not followed by same char

I also agree that Anno's solution is probably the most practical
solution that's been proposed, but this is a VERY interesting approach
(and I learned something today!) Thanks!

robic0 · Dec 3, 2005

I also agree that Anno's solution is probably the most practical
solution that's been proposed, but this is a VERY interesting approach
(and I learned something today!) Thanks!

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Perl sucks?

usenet · Dec 3, 2005

robic0 said:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Perl sucks?

Actually, my appreciation of Perl was raised a little. But I wasn't
really thinking about Perl at all when I wrote those comments. I was
thinking of the logic of Steven's algorithm, which is independent of
the programming language which expresses it. I really do believe that
Donald Knuth himself would admire Steven's approach.

attn.steven.kuo · Dec 3, 2005

Actually, my appreciation of Perl was raised a little. But I wasn't
really thinking about Perl at all when I wrote those comments. I was
thinking of the logic of Steven's algorithm, which is independent of
the programming language which expresses it. I really do believe that
Donald Knuth himself would admire Steven's approach.

Credit should go to Anno, who analyzed the problem in a
succinct and logical way. I just worked the problem a little
bit further.

To allow the regex to match a sequence of any character, one
should add the \Q escape sequence -- that's something that I
previously neglected:

print if /(.)
(??{ '(?<=' . ("\Q$1\E" x 3) . ')' })
(??{ '(?<!' . ("\Q$1\E" x 4) . ')' })
(?!\1)/x;
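
For anyone who wants to try the idea, here is a small harness as a sketch. It assumes a Perl recent enough that $1 is visible inside (??{ }) blocks, uses quotemeta() in place of \Q...\E, and wraps the pattern in a made-up function, run_of_three_offset, that reports where the run starts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset at which a run of exactly three
# identical characters starts, or -1 if there is none. The match itself
# is only the LAST character of the run; the lookbehinds, built at match
# time from the captured character, check what precedes it.
sub run_of_three_offset {
    my ($string) = @_;
    return $string =~ /
        (.)                                           # a character
        (??{ '(?<=' . (quotemeta($1) x 3) . ')' })    # third in a series
        (??{ '(?<!' . (quotemeta($1) x 4) . ')' })    # but not the fourth
        (?!\1)                                        # not followed by itself
    /x ? $-[0] - 2 : -1;
}

for my $s ("abCCCdefg", "abCCCCCdefg", "asCCCCCCChwCCCsad", "a...b") {
    printf "%-18s %d\n", $s, run_of_three_offset($s);
}
```

Since each lookbehind is generated from a literal string, it has a fixed width, which sidesteps the variable-length lookbehind restriction discussed earlier in the thread.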

And if I ever were to make the same offer as Knuth -- to pay
a small finder's fee to others who could find a bug in my programs
-- I'd end up owing more than the U.S. National Debt.
