Regular expressions, capture repeated groups


Iain Barnett

I'm trying to emulate something I did in .NET many moons ago: capture a named group, not just once, but every time it repeats, and then be able to see all those repetitions. I think they call them GroupCollections in C#. This is the kind of code I'm trying to emulate with Ruby (1.9.1):

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}",
                          matches.Count,
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",
                              groups["word"].Value,
                              groups[0].Index,
                              groups[1].Index);
        }

    }

}

// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54


For example, if I had the string "11 12" I could have a regex like
/
 (?<first> \d+ ) \s \g<first>
/x
that captured "11" and then the repetition "12" and put them in an array (or some kind of collection) referenced by the name.

I think my attempts to get this to work are better explanations. What I want is the result
#<MatchData "11 12" first:["11", "12"]> or something like it. At the moment all my attempts end with the named capture only keeping the last match it made, i.e. 12 with no mention of 11.

I know I could do this a different way, perhaps with split or something, but I'd like to know if it's possible with just regex. I understand the Oniguruma engine is used now but I can't find any good docs for it.


These are my attempts, $ is my prompt.

$ md1 = /
        (?<first> \d+ )
        \s \g<first>
      /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"


$ md1 = /
        (?<first> \d+ )
        (?: \s \g<first> )?
      /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"


$ md1 = /
        (?<first> \d+ )
        (?: \s
            (?<second> \g<first> )
        )?
      /x.match( "11 12" )
#<MatchData "11 12" first:"12" second:"12">


$ md1[:first]
"12"

$ md1[:second]
"12"


$ md1 = /
        (?: (?<first> \d+ )\s* )+
      /x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

Iain
 

w_a_x_man

"The the quick brown fox  fox jumped over the lazy dog dog.".
scan(/((\w+) +\2)/i){|x| puts "#{ x[0] } #{ $~.offset(0)[0]}"}
The the 0
fox  fox 20
dog dog 50
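
For comparison with the C# example in the question, here is a minimal sketch of the same scan approach using named groups, so that both positions can be reported. It assumes the test string has two spaces between "fox" and "fox", as the reported positions imply. The group name "again" and the variable names are made up for illustration.

```ruby
# Input from the thread; note the two spaces between "fox" and "fox".
text = "The the quick brown fox  fox jumped over the lazy dog dog."

# Both groups are named, because Ruby does not capture unnamed groups
# when a pattern also contains named ones. The name "again" is made up here.
pattern = /\b(?<word>\w+)\s+(?<again>\k<word>)\b/i

results = []
text.scan(pattern) do
  md = Regexp.last_match            # MatchData for the current match
  results << [md[:word], md.begin(:word), md.begin(:again)]
end

results.each do |word, first_pos, second_pos|
  puts "'#{word}' repeated at positions #{first_pos} and #{second_pos}"
end
```

Regexp.last_match is the same object as $~, just spelled out. Inside the block it refers to the match that scan has just found, which is why the named groups and their offsets are available there.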
 

Iain Barnett


Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call

$ md1[:first]

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not MatchData.


Iain
 

botp

> Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can't call
>
> $ md1[:first]

wait till you call the 21st ;-)

> and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not matchdata.

waxman hinted at $~

try, e.g.,


s
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m=[]
#=> []
s.scan(/((\w+) +\2)/i){|x| m << $~}
#=> "The the quick brown fox fox jumped over the lazy dog dog."
m.size
#=> 3
m[0]
#=> #<MatchData "The the" 1:"The the" 2:"The">
m[0].offset 0
#=> [0, 7]
m[0].offset

... and so forth.

best regards -botp
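
Applied to the "11 12" example from the question, the idea above can be sketched as follows. This is only an illustration: it assumes the pattern is reduced to a single number and scanned repeatedly, rather than one regex capturing every repetition. The helper name match_data_for is hypothetical.

```ruby
# Hypothetical helper: collects one MatchData object per match of
# pattern in string, using the $~ (Regexp.last_match) that scan sets
# for each iteration of its block.
def match_data_for(string, pattern)
  matches = []
  string.scan(pattern) { matches << Regexp.last_match }
  matches
end

matches = match_data_for("11 12", /(?<first>\d+)/)

# Each element is a MatchData, so named captures and offsets both work.
firsts    = matches.map { |md| md[:first] }
positions = matches.map { |md| md.begin(:first) }

p firsts      # prints ["11", "12"]
p positions   # prints [0, 3]
```

A single MatchData keeps only the last capture of a repeated group, which is what the attempts in the question ran into. Collecting one MatchData per match gives an array per name instead.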
 

Iain Barnett


OK, I get it now. Thanks for the extra nudge (bang on the head :)

Iain
 
