Regular expressions, capture repeated groups

Iain Barnett · Jul 8, 2010

I'm trying to emulate something I did in .NET many moons ago: capture a named group not just once, but across all its repetitions, and then be able to see all those repetitions. I think they call them GroupCollections in C#. This is the kind of code I'm trying to emulate with Ruby (1.9.1):

using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main ()
    {
        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
            RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.
        string text = "The the quick brown fox fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n {1}",
            matches.Count,
            text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",
                groups["word"].Value,
                groups[0].Index,
                groups[1].Index);
        }
    }
}
// The example produces the following output to the console:
// 3 matches found in:
// The the quick brown fox fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54

For example, if I had the string "11 12" I could have a regex like
/
(?<first> \d+ ) \s \g<first>
/x
that captured "11" and then the repetition "12" and put them in an array (or some kind of collection) referenced by the name.

I think my attempts to get this to work are better explanations. What I want is the result
#<MatchData "11 12" first:["11", "12"]> or something like it. At the moment all my attempts end with the named capture only keeping the last match it made, i.e. 12 with no mention of 11.

I know I could do this a different way, perhaps with split or something, but I'd like to know if it's possible with just regex. I understand the Oniguruma engine is used now but I can't find any good docs for it.

These are my attempts, $ is my prompt.

$ md1 = /
(?<first> \d+ )
\s \g<first>
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
(?<first> \d+ )
(?: \s \g<first> )?
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
(?<first> \d+ )
(?: \s
(?<second> \g<first> )
)?
/x.match( "11 12" )
#<MatchData "11 12" first:"12" second:"12">

$ md1[:first]
"12"

$ md1[:second]
"12"

$ md1 = /
(?: (?<first> \d+ )\s* )+
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

Iain
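
One detail behind the results above, sketched on the assumption that a backreference like the one in the C# example was intended: in Ruby's Oniguruma syntax, \g<first> is a subexpression call that re-runs the group's pattern and overwrites its capture, while \k<first> is the backreference. The variable names below are illustrative only.

```ruby
# \g<first> re-runs the pattern \d+ and stores the newly matched text in `first`,
# so the capture ends up holding the second number.
call = /(?<first>\d+)\s\g<first>/.match("11 12")
puts call[:first]

# \k<first> must match the same text that `first` already captured.
# The \b anchors stop the engine from settling for a partial match such as "1 1".
backref_same = /\b(?<first>\d+)\s\k<first>\b/.match("11 11")
backref_diff = /\b(?<first>\d+)\s\k<first>\b/.match("11 12")
puts backref_same[:first]
p backref_diff   # nil, because "12" is not a repeat of "11"
```

Either way, a single MatchData keeps one value per group, which is why every attempt reports only "12".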

w_a_x_man · Jul 8, 2010

"The the quick brown fox fox jumped over the lazy dog dog.".
scan(/((\w+) +\2)/i){|x| puts "#{ x[0] } #{ $~.offset(0)[0]}"}
The the 0
fox fox 20
dog dog 50
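
The same idea can be spelled out with named groups, as a rough sketch rather than a definitive version. It assumes a second group, here called again (a name made up for this example), wrapped around the backreference so that both positions can be read back, much like the C# output. The shortened test string is also an assumption.

```ruby
text = "The the quick brown fox fox jumped"

# (?<word>...) names the first occurrence; (?<again>...) wraps the
# backreference so the position of the repeat is available too.
re = /\b(?<word>\w+)\s+(?<again>\k<word>)\b/i

report = []
text.scan(re) do
  m = Regexp.last_match   # the MatchData for the current match, same as $~
  report << "'#{m[:word]}' repeated at positions #{m.begin(:word)} and #{m.begin(:again)}"
end
puts report
# 'The' repeated at positions 0 and 4
# 'fox' repeated at positions 20 and 24
```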

Iain Barnett · Jul 8, 2010

Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call

$ md1[:first]

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not MatchData.

Iain

botp · Jul 8, 2010

Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call

$ md1[:first]

wait till you call the 21st ;-)

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not matchdata.

w_a_x_man hinted at $~ (the MatchData of the most recent match).

Try, e.g.,

s
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m=[]
#=> []
s.scan(/((\w+) +\2)/i){|x| m << $~}
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m.size
#=> 3
m[0]
#=> #<MatchData "The the" 1:"The the" 2:"The">
m[0].offset 0
#=> [0, 7]
m[0].offset

... and so forth.
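
Building on the $~ approach, here is a minimal sketch of how every repetition of a named group could be gathered under its name. The helper captures_by_name is hypothetical, written for this example and not part of Ruby. It assumes the pattern matches one repetition at a time and lets scan do the repeating.

```ruby
# Hypothetical helper: collect every value each named group took across
# all matches of `re` in `str`, keyed by group name.
def captures_by_name(str, re)
  result = Hash.new { |hash, key| hash[key] = [] }
  str.scan(re) do
    m = Regexp.last_match                       # MatchData for this match ($~)
    m.names.each { |name| result[name] << m[name] }
  end
  result
end

found = captures_by_name("11 12", /(?<first>\d+)/)
p found["first"]   # => ["11", "12"]
```

This gets close to the first:["11", "12"] shape asked for at the top of the thread, at the cost of doing the repetition in Ruby rather than inside a single regex.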

best regards -botp

Iain Barnett · Jul 8, 2010

OK, I get it now. Thanks for the extra nudge (bang on the head).

Iain
