match/scan does not return multiple matches

Michal Suchanek · Feb 6, 2010

Hello

I tried scanning for multiple occurrences of a group in a string and
match/scan would return only one.

"ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
=> #<MatchData "ajabcabck" 1:"a">

"ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
=> [["a"]]

Clearly the a+ group must match twice to match the string from ^ to $,
but only a single match is returned.

It is possible to use split instead but using a single match would be
much nicer.

Any workaround?

ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

Thanks

Michal

Ralf Mueller · Feb 6, 2010

Hi
As far as I know, nested groups are not allowed. Regular expressions do
not form a language.

regards
ralf

Michal Suchanek · Feb 6, 2010

Ralf said:
as far as i know, nested groups are not allowed. regular expressions do not
form a language.

Actually they are allowed, otherwise I would not get a match at all.
Note also that I have manually unnested them in the example. The
problem is that repeated matches of the group are not returned.

Thanks

Michal

Ben Bleything · Feb 6, 2010

Actually they are allowed, otherwise I would not get a match at all.
Note also that I have manually unnested them in the example. The
problem is that repeated matches of the group are not returned.

Even so, I still think that there is a bug in your regex. I can't
find it, but I tried the same regular expression in Perl and in Reggy,
a regex tool for OS X (http://reggyapp.com/). In both cases only
the one a was matched.

Ben

Brian Candler · Feb 6, 2010

Michal said:
"ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
=> #<MatchData "ajabcabck" 1:"a">

"ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
=> [["a"]]

clearly the a+ group must match twice to match the string from ^ to $
but only single match is returned.

But the regular expression you're passing is anchored, so the entire
regexp is only matched once, and it only contains one capturing group.

Perhaps this is clearer:

"abcd".scan /^a(b)(c)d$/ => [["b", "c"]]
"abcd".scan /^a(?:(b|c)+)d$/ => [["c"]]

In both cases the result is an array containing a single element,
because the regexp was matched exactly once.

The first gives [$1,$2] because there are two capture groups in its
regexp.

The second gives only [$1] because there is a single capture group. It
happens to have matched multiple times, but you get only the last value
for $1.

If multiple values were inserted into the result, then you wouldn't know
if ["foo","bar","baz"] came from [$1,$2,$3] or [$1,$1,$2] or [$1,$1,$1]
or [$1,$2,$2]
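
The shape of the result can be checked with a small sketch. The helper name scan_shape is made up for illustration and is not part of Ruby or of this thread; it only counts how many times the whole regexp matched and how many capture slots each match carries.

```ruby
# Hypothetical helper (not from the thread): report how many times the whole
# regexp matched and how many capture slots each of those matches has.
def scan_shape(str, re)
  rows = str.scan(re)                 # one row per match of the whole regexp
  [rows.length, rows.map(&:length)]   # [match count, captures per match]
end

p scan_shape("abcd", /^a(b)(c)d$/)      # [1, [2]]    one match, two groups
p scan_shape("abcd", /^a(?:(b|c)+)d$/)  # [1, [1]]    one match, one group
p scan_shape("abcd", /(b|c)/)           # [2, [1, 1]] two matches, one group each
```

Only the unanchored pattern yields more than one row, because it is the only one that matches the string more than once.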

Rick DeNatale · Feb 6, 2010

Michal Suchanek wrote:

"ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
=> [["a"]]

clearly the a+ group must match twice to match the string from ^ to $
but only single match is returned.

But the regular expression you're passing is anchored, so the entire
regexp is only matched once, and it only contains one capturing group.

Well, I think I understand what the OP is saying. Let's break the
match down:

"ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/

/^a*j/ matches "aj" leaving "abcabck"
/(?:b*(a+)b+c*)+/ matches "abcabc" leaving "k"
/k$/ matches "k" and we're done

Now there's a capture group inside that second part, a non-capture
group which can repeat (and in this case does).

Since it repeats, one might think that there would be one capture for
each repetition, but there isn't. Only the last one is actually
kept.

Here's a simpler example:

/^(a)+$/.match("aa").to_a
=> ["aa", "a"]

Also see http://www.regular-expressions.info/captureall.html
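
With /^(a)+$/ every repetition captures the same text, so the example cannot show which repetition is kept. A minimal sketch with two distinguishable alternatives makes that visible. The helper name surviving_capture is hypothetical and exists only for this illustration.

```ruby
# Hypothetical helper: return what group 1 holds after a match, or nil if
# the regexp does not match at all.
def surviving_capture(str, re)
  m = re.match(str)
  m && m[1]
end

p surviving_capture("aa", /^(a)+$/)    # "a"  (both repetitions look alike)
p surviving_capture("ab", /^(a|b)+$/)  # "b"  (the last repetition is kept)
```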
--
Rick DeNatale

Michal Suchanek · Feb 6, 2010

Thanks for the explanations. As mentioned on the page Rick linked and
also explained in Brian's reply, this is a design limitation of the
return value of the match method. It could return the additional
matches, but then the return value would have to be structured
differently for the result to make sense. Since scan most likely uses
match internally, or at least returns results consistent with match, it
shares the limitation.

So something like split has to be used to slice the string into pieces
where either a shorter non-anchored regex can match repeatedly or only
one match can be found.

The case which causes problems and is not actually well captured by
the example is something like

ab=cd,ef, ...

where the regexes for 'ab', 'cd' and the rest are slightly different,
and so is the interpretation.
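
A rough sketch of the split-based slicing for input of that shape follows. The real patterns for the three parts are not given in this thread, so KEY_RE, FIRST_RE and REST_RE are placeholders, and parse_assignment is a made-up name.

```ruby
# Placeholder patterns: the thread does not show the real ones, so all three
# simply accept lowercase words here.
KEY_RE   = /\A[a-z]+\z/
FIRST_RE = /\A[a-z]+\z/
REST_RE  = /\A[a-z]+\z/

# Hypothetical parser for "key=first,rest,rest,...": split the string into
# pieces first, then check each piece with its own short regexp.
def parse_assignment(line)
  key, values = line.split("=", 2)
  return nil unless key && values && key =~ KEY_RE

  first, *rest = values.split(",").map(&:strip)
  return nil unless first && first =~ FIRST_RE
  return nil unless rest.all? { |v| v =~ REST_RE }

  { :key => key, :first => first, :rest => rest }
end

p parse_assignment("ab=cd,ef, gh")
p parse_assignment("no assignment here")  # nil
```

Each piece is validated separately, so a different interpretation can be attached to the key, the first value and the remaining values.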

Thanks

Michal

Robert Klemme · Feb 7, 2010

Michal said:
It is possible to use split instead but using a single match would be
much nicer.

I would only use #split if you really want to split the string.
Otherwise please see below.

Ralf said:
as far as i know, nested groups are not allowed. regular expressions do
not form a language.

Nested groups *are* allowed. However, one must understand how group
matching works: for each matching group only at most *one* capture is
recorded:

irb(main):001:0> s="abaab"
=> "abaab"
irb(main):002:0> /(?:(a+)b)+/.match s
=> #<MatchData "abaab" 1:"aa">
irb(main):003:0> md = /(?:(a+)b)+/.match s
=> #<MatchData "abaab" 1:"aa">
irb(main):004:0> md.to_a
=> ["abaab", "aa"]
irb(main):005:0> md[1]
=> "aa"
irb(main):006:0>

As you can see from this 1.9.1 test, it is the *last* match. I cannot
provide an official rationale for this, but one likely reason is that the
memory overhead of storing an arbitrary number of matches per group can be
significant. Also, the number of groups is known when a regular expression
is compiled, while the number of matches of each group is only known at
match time. This makes it easier to allocate the memory needed for storing
a single capture per group, because it can be done when the regular
expression is compiled. Please also note that all regular expression
engines I know handle it that way, i.e. you get at most one capture per
group.

In those cases I usually employ a two-level approach:

irb(main):015:0> s = "ajabcaabck"
=> "ajabcaabck"
irb(main):016:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
irb(main):017:1> $1.scan(/b*(a+)b+c*/){|m| p m, $1}
irb(main):018:1> end
["a"]
"a"
["aa"]
"aa"
=> "abcaabc"
irb(main):019:0>

Because of the way #scan works, we can simply do:

irb(main):022:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
irb(main):023:1> $1.scan(/b*(a+)b+c*/){|m| p m}
irb(main):024:1> end
["a"]
["aa"]
=> "abcaabc"
irb(main):025:0>
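
The two-level approach can also be wrapped in a small method. This is only a sketch: repeated_captures is a made-up name, and it assumes the outer regexp captures the whole repeated part as group 1 and the inner regexp has exactly one capture group.

```ruby
# Hypothetical helper for the two-level approach.
# outer: validates the whole string and captures the repeated part as group 1.
# inner: is scanned over that part and captures one value per repetition.
def repeated_captures(str, outer, inner)
  m = outer.match(str)
  return [] unless m        # the string as a whole did not match
  m[1].scan(inner).flatten  # one row per repetition, one capture per row
end

outer = /^a*j((?:b*a+b+c*)+)k$/
inner = /b*(a+)b+c*/

p repeated_captures("ajabcaabck", outer, inner)  # ["a", "aa"]
p repeated_captures("ajabcabck", outer, inner)   # ["a", "a"]
p repeated_captures("xyz", outer, inner)         # []
```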

Kind regards

robert

Ralf Mueller · Feb 9, 2010

Sorry, I mixed up grouping and capturing. Concerning grouping, regexps
act like a language, but not concerning capturing, and for this reason
you have to use that two-level trick. Nested capturing would lead to a
tree of results with bad performance, I guess.

regards
ralf

Michal Suchanek · Feb 9, 2010

Ralf said:
Sorry, I mixed up grouping and capturing. Concerning grouping, regexps
act like a language, but not concerning capturing, and for this reason
you have to use that two-level trick. Nested capturing would lead to a
tree of results with bad performance, I guess.

Actually, nested capturing is also supported as you can see from the
examples here. What is not supported is returning multiple matches for
a group that matches multiple times.

Thanks

Michal

Ben Bleything · Feb 9, 2010

Actually, nested capturing is also supported as you can see from the
examples here. What is not supported is returning multiple matches for
a group that matches multiple times.

Are you sure it matches multiple times? As I mentioned earlier in the
thread, I can't get it to do so.

Ben

Michal Suchanek · Feb 9, 2010

Are you sure it matches multiple times? As I mentioned earlier in the
thread, I can't get it to do so.

(stuff)+ matches multiple stuffs but returns only one.

"stuffstuffstuff".match /^(stuff)+$/
=> #<MatchData "stuffstuffstuff" 1:"stuff">

Groups can still be nested.

"stuffstuffstuff".match /^(stu(ff))+$/
=> #<MatchData "stuffstuffstuff" 1:"stuff" 2:"ff">

Thanks

Michal
