match/scan does not return multiple matches

Discussion in 'Ruby' started by Michal Suchanek, Feb 6, 2010.

  1. Hello

    I tried scanning for multiple occurrences of a group in a string and
    match/scan would return only one.


    "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    => #<MatchData "ajabcabck" 1:"a">

    "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    => [["a"]]


    Clearly the a+ group must match twice to match the string from ^ to $,
    but only a single match is returned.

    It is possible to use split instead but using a single match would be
    much nicer.

    Any workaround?

    ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]


    Thanks

    Michal
    Michal Suchanek, Feb 6, 2010
    #1

  2. Ralf Mueller (Guest)

    Michal Suchanek wrote:
    > Hello
    >
    > I tried scanning for multiple occurrences of a group in a string and
    > match/scan would return only one.
    >
    >
    > "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    > => #<MatchData "ajabcabck" 1:"a">
    >
    > "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    > => [["a"]]
    >
    >
    > clearly the a+ group must match twice to match the string from ^ to $
    > but only single match is returned.
    >
    > It is possible to use split instead but using a single match would be
    > much nicer.
    >
    > Any workaround?
    >
    > ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
    >
    >
    > Thanks
    >
    > Michal
    >
    >

    Hi
    As far as I know, nested groups are not allowed. Regular expressions do
    not form a language.

    regards
    ralf
    Ralf Mueller, Feb 6, 2010
    #2

  3. On 6 February 2010 19:57, Ralf Mueller <> wrote:
    > Michal Suchanek wrote:
    >>
    >> Hello
    >>
    >> I tried scanning for multiple occurrences of a group in a string and
    >> match/scan would return only one.
    >>
    >>
    >> "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    >> => #<MatchData "ajabcabck" 1:"a">
    >>
    >> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >> => [["a"]]
    >>
    >>
    >> clearly the a+ group must match twice to match the string from ^ to $
    >> but only single match is returned.
    >>
    >> It is possible to use split instead but using a single match would be
    >> much nicer.
    >>
    >> Any workaround?
    >>
    >> ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
    >>
    >>
    >> Thanks
    >>
    >> Michal
    >>
    >>

    >
    > Hi
    > as far as i know, nested groups are not allowed. regular expressions do not
    > form a language.


    Actually they are allowed, otherwise I would not get a match at all.
    Note also that I have manually unnested them in the example. The
    problem is that repeated matches of the group are not returned.

    Thanks

    Michal
    Michal Suchanek, Feb 6, 2010
    #3
  4. On Sat, Feb 6, 2010 at 11:47 AM, Michal Suchanek <> wrote:
    > Actually they are allowed, otherwise I would not get a match at all.
    > Note also that I have manually unnested them in the example. The
    > problem is that repeated matches of the group are not returned.


    Even so, I still think that there is a bug in your regex. I can't
    find it, but I tried the same regular expression in Perl and in Reggy,
    a regex tool for OS X (http://reggyapp.com/). In both cases only
    the one a was matched.

    Ben
    Ben Bleything, Feb 6, 2010
    #4
  5. Michal Suchanek wrote:
    > Hello
    >
    > I tried scanning for multiple occurrences of a group in a string and
    > match/scan would return only one.
    >
    >
    > "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    > => #<MatchData "ajabcabck" 1:"a">
    >
    > "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    > => [["a"]]
    >
    >
    > clearly the a+ group must match twice to match the string from ^ to $
    > but only single match is returned.


    But the regular expression you're passing is anchored, so the entire
    regexp is only matched once, and it only contains one capturing group.

    Perhaps this is clearer:

    >> "abcd".scan /^a(b)(c)d$/
    => [["b", "c"]]
    >> "abcd".scan /^a(?:(b|c)+)d$/
    => [["c"]]


    In both cases the result is an array containing a single element,
    because the regexp was matched exactly once.

    The first gives [$1,$2] because there are two capture groups in its
    regexp.

    The second gives only [$1] because there is a single capture group. It
    happens to have matched multiple times, but you get only the last value
    for $1.

    If multiple values were inserted into the result, then you wouldn't know
    if ["foo","bar","baz"] came from [$1,$2,$3] or [$1,$1,$2] or [$1,$1,$1]
    or [$1,$2,$2]
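
    A minimal sketch of that distinction, using a made-up input string
    "abcabc" that is not from the original post: an unanchored pattern lets
    scan find several matches and returns one sub-array per match, while a
    group that repeats inside a single anchored match still contributes
    only one value.

    ```ruby
    # Unanchored: the whole pattern matches twice, so scan returns two
    # sub-arrays, each holding one element per capture group.
    unanchored = "abcabc".scan(/(a)(b)c/)
    p unanchored   # => [["a", "b"], ["a", "b"]]

    # Anchored: the whole pattern matches once. The group repeats inside
    # that single match, but only its last repetition is kept.
    anchored = "abcd".scan(/^a(?:(b|c)+)d$/)
    p anchored     # => [["c"]]
    ```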
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Feb 6, 2010
    #5
  6. On Sat, Feb 6, 2010 at 3:23 PM, Brian Candler <> wrote:
    > Michal Suchanek wrote:


    >> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >> => [["a"]]
    >>
    >>
    >> clearly the a+ group must match twice to match the string from ^ to $
    >> but only single match is returned.

    >
    > But the regular expression you're passing is anchored, so the entire
    > regexp is only matched once, and it only contains one capturing group.


    Well, I think that I understand what the OP is saying. Let's break the
    match down:

    "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/

    /^a*j/ matches "aj" leaving "abcabck"
    /(?:b*(a+)b+c*)+/ matches "abcabc" leaving "k"
    /k$/ matches "k" and we're done

    Now there's a capture group inside that second part, a non-capture
    group which can (and in this case does) repeat.

    Since it repeats, one might think that there would be one capture for
    each repetition, but there isn't. Only the last repetition actually
    gets captured.

    Here's a simpler example:

    /^(a)+$/.match("aa").to_a
    => ["aa", "a"]


    Also see http://www.regular-expressions.info/captureall.html
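
    The simpler example above cannot show which repetition is kept, because
    both repetitions capture "a". A small sketch with distinguishable
    repetitions (the input "ab" is only an illustrative assumption):

    ```ruby
    # The group (a|b) matches twice: first "a", then "b".
    md = /^(a|b)+$/.match("ab")

    p md.to_a      # => ["ab", "b"]
    # MatchData#begin reports where the retained capture starts.
    p md.begin(1)  # => 1, i.e. the second character
    ```

    So the value that survives in group 1 is the one from the final pass
    through the repeated group.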
    --
    Rick DeNatale

    Blog: http://talklikeaduck.denhaven2.com/
    Twitter: http://twitter.com/RickDeNatale
    WWR: http://www.workingwithrails.com/person/9021-rick-denatale
    LinkedIn: http://www.linkedin.com/in/rickdenatale
    Rick DeNatale, Feb 6, 2010
    #6
  7. On 6 February 2010 21:47, Rick DeNatale <> wrote:
    > On Sat, Feb 6, 2010 at 3:23 PM, Brian Candler <> wrote:
    >> Michal Suchanek wrote:

    >
    >>> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >>> => [["a"]]
    >>>
    >>>
    >>> clearly the a+ group must match twice to match the string from ^ to $
    >>> but only single match is returned.

    >>
    >> But the regular expression you're passing is anchored, so the entire
    >> regexp is only matched once, and it only contains one capturing group.

    >
    > Well I think that I understand what the OP is saying, let's break the
    > match down:
    >
    > "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    >
    > /^a*j/ matches "aj" leaving "abcabck"
    > /(?:b*(a+)b+c*)+/ matches "abcabc" leaving "k"
    > /k$/ matches "k" and we're done
    >
    > Now there's a capture group inside that second part a non-capture
    > group which can (and does in this case repeat).
    >
    > Since it repeats one might think that there would be one capture for
    > each repetition, but there isn't. Only the last repetition actually
    > gets captured.
    >
    > Here's a simpler example:
    >
    > /^(a)+$/.match("aa").to_a
    > => ["aa", "a"]
    >
    >
    > Also see http://www.regular-expressions.info/captureall.html


    Thanks for the explanations. As mentioned on the page and also
    explained in Brian's reply this is a design limitation of the return
    value of the match method. It could return the additional matches but
    then the return value would have to be structured differently than it
    is now for the result to make sense. Since scan most likely uses match
    internally, or at least returns results consistent with match, it shares
    the limitation.

    So something like split has to be used to slice the string into pieces
    where either a shorter non-anchored regex can match repeatedly or only
    one match can be found.

    The case which causes problems and is not actually well captured by
    the example is something like

    ab=cd,ef, ...

    where the regexes for 'ab', 'cd' and the rest are slightly different,
    and so is the interpretation.
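
    A rough sketch of that split-based approach, under loud assumptions:
    the sample line "ab=cd,ef,gh" and both patterns (key_re, value_re) are
    hypothetical stand-ins, since the real regexes are not shown here.

    ```ruby
    line = "ab=cd,ef,gh"          # illustrative input only

    # Hypothetical per-piece patterns; the real ones would differ.
    key_re   = /\A[a-z]+\z/
    value_re = /\A[a-z]+\z/

    # First cut: key on the left of the first "=", the rest on the right.
    key, rest = line.split("=", 2)
    # Second cut: every comma-separated value becomes its own string.
    values = rest.split(",")

    key_ok    = !!(key =~ key_re)
    values_ok = values.all? { |v| v =~ value_re }

    p key        # => "ab"
    p values     # => ["cd", "ef", "gh"]
    p key_ok, values_ok
    ```

    Each piece is then checked with its own, shorter regex, so no single
    pattern has to report a repeated group.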


    Thanks

    Michal
    Michal Suchanek, Feb 6, 2010
    #7
  8. On 02/06/2010 07:57 PM, Ralf Mueller wrote:
    > Michal Suchanek wrote:
    >> Hello
    >>
    >> I tried scanning for multiple occurrences of a group in a string and
    >> match/scan would return only one.
    >>
    >>
    >> "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    >> => #<MatchData "ajabcabck" 1:"a">
    >>
    >> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >> => [["a"]]
    >>
    >>
    >> clearly the a+ group must match twice to match the string from ^ to $
    >> but only single match is returned.
    >>
    >> It is possible to use split instead but using a single match would be
    >> much nicer.


    I would only use #split if you really want to split the string.
    Otherwise please see below.

    >> Any workaround?
    >>
    >> ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]


    > as far as i know, nested groups are not allowed. regular expressions do
    > not form a language.


    Nested groups *are* allowed. However, one must understand how group
    matching works: for each capturing group at most *one* capture is
    recorded:

    irb(main):001:0> s="abaab"
    => "abaab"
    irb(main):002:0> /(?:(a+)b)+/.match s
    => #<MatchData "abaab" 1:"aa">
    irb(main):003:0> md = /(?:(a+)b)+/.match s
    => #<MatchData "abaab" 1:"aa">
    irb(main):004:0> md.to_a
    => ["abaab", "aa"]
    irb(main):005:0> md[1]
    => "aa"
    irb(main):006:0>

    As you can see from this 1.9.1 test, it is the *last* match. I cannot
    provide an official rationale for this, but one likely reason: The
    memory overhead for storing an arbitrary number of matches per group
    can be significant. Also, the number of groups is known at compile time of a
    regular expression while the number of matches of each group is only
    known at match time. This makes it easier to allocate the memory needed
    for storing a single capture per group because it can be done when the
    regular expression is compiled. Please also note that all regular
    expression engines I know handle it that way, i.e. you get at most one
    capture per group.

    In those cases I usually employ a two level approach:

    irb(main):015:0> s = "ajabcaabck"
    => "ajabcaabck"
    irb(main):016:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    irb(main):017:1> $1.scan(/b*(a+)b+c*/){|m| p m, $1}
    irb(main):018:1> end
    ["a"]
    "a"
    ["aa"]
    "aa"
    => "abcaabc"
    irb(main):019:0>

    Because of the way #scan works, we can simplify this to:

    irb(main):022:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    irb(main):023:1> $1.scan(/b*(a+)b+c*/){|m| p m}
    irb(main):024:1> end
    ["a"]
    ["aa"]
    => "abcaabc"
    irb(main):025:0>
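
    The same two-level idea can be sketched as a small helper that returns
    the captures as an array instead of printing them. The method name
    repeated_captures is made up for illustration; the two patterns are the
    ones from the transcript above.

    ```ruby
    # Level 1: validate the whole string and capture the repeated part.
    # Level 2: scan that part with the unanchored inner pattern.
    def repeated_captures(str)
      outer = /^a*j((?:b*a+b+c*)+)k$/.match(str)
      return [] unless outer               # no overall match, no captures

      # scan yields one sub-array per match; flatten gives plain strings.
      outer[1].scan(/b*(a+)b+c*/).flatten
    end

    p repeated_captures("ajabcaabck")  # => ["a", "aa"]
    p repeated_captures("ajabcabck")   # => ["a", "a"]
    p repeated_captures("xyz")         # => []
    ```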


    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Feb 7, 2010
    #8
  9. Ralf Mueller (Guest)

    Robert Klemme wrote:
    > On 02/06/2010 07:57 PM, Ralf Mueller wrote:
    >> Michal Suchanek wrote:
    >>> Hello
    >>>
    >>> I tried scanning for multiple occurrences of a group in a string and
    >>> match/scan would return only one.
    >>>
    >>>
    >>> "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    >>> => #<MatchData "ajabcabck" 1:"a">
    >>>
    >>> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >>> => [["a"]]
    >>>
    >>>
    >>> clearly the a+ group must match twice to match the string from ^ to $
    >>> but only single match is returned.
    >>>
    >>> It is possible to use split instead but using a single match would be
    >>> much nicer.

    >
    > I would only use #split if you really want to split the string.
    > Otherwise please see below.
    >
    >>> Any workaround?
    >>>
    >>> ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

    >
    >> as far as i know, nested groups are not allowed. regular expressions
    >> do not form a language.

    >
    > Nested groups *are* allowed. However, one must understand how group
    > matching works: for each matching group only at most *one* capture is
    > recorded:
    >
    > irb(main):001:0> s="abaab"
    > => "abaab"
    > irb(main):002:0> /(?:(a+)b)+/.match s
    > => #<MatchData "abaab" 1:"aa">
    > irb(main):003:0> md = /(?:(a+)b)+/.match s
    > => #<MatchData "abaab" 1:"aa">
    > irb(main):004:0> md.to_a
    > => ["abaab", "aa"]
    > irb(main):005:0> md[1]
    > => "aa"
    > irb(main):006:0>
    >
    > As you can see from this 1.9.1 test, it is the *last* match. I cannot
    > provide an official rationale for this, but one likely reason: The
    > memory overhead for storing arbitrary amount of matches per group can
    > be significant. Also, the number of groups is known at compile time
    > of a regular expression while the number of matches of each group is
    > only known at match time. This makes it easier to allocate the memory
    > needed for storing a single capture per group because it can be done
    > when the regular expression is compiled. Please also note that all
    > regular expression engines I know handle it that way, i.e. you get at
    > most one capture per group.
    >
    > In those cases I usually employ a two level approach:
    >
    > irb(main):015:0> s = "ajabcaabck"
    > => "ajabcaabck"
    > irb(main):016:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    > irb(main):017:1> $1.scan(/b*(a+)b+c*/){|m| p m, $1}
    > irb(main):018:1> end
    > ["a"]
    > "a"
    > ["aa"]
    > "aa"
    > => "abcaabc"
    > irb(main):019:0>
    >
    > Because of the way how #scan works we can do:
    >
    > irb(main):022:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    > irb(main):023:1> $1.scan(/b*(a+)b+c*/){|m| p m}
    > irb(main):024:1> end
    > ["a"]
    > ["aa"]
    > => "abcaabc"
    > irb(main):025:0>

    Sorry, I mixed up grouping and capturing. Concerning grouping, regexp
    acts like a language, but not concerning capturing, and for this reason
    you have to use that two-level trick. Nested capturing would lead to a
    tree of results with bad performance, I guess.

    regards
    ralf
    Ralf Mueller, Feb 9, 2010
    #9
  10. On 9 February 2010 11:49, Ralf Mueller <> wrote:
    > Robert Klemme wrote:
    >>
    >> On 02/06/2010 07:57 PM, Ralf Mueller wrote:
    >>>
    >>> Michal Suchanek wrote:
    >>>>
    >>>> Hello
    >>>>
    >>>> I tried scanning for multiple occurrences of a group in a string and
    >>>> match/scan would return only one.
    >>>>
    >>>>
    >>>> "ajabcabck".match /^a*j(?:b*(a+)b+c*)+k$/
    >>>> => #<MatchData "ajabcabck" 1:"a">
    >>>>
    >>>> "ajabcabck".scan /^a*j(?:b*(a+)b+c*)+k$/
    >>>> => [["a"]]
    >>>>
    >>>>
    >>>> clearly the a+ group must match twice to match the string from ^ to $
    >>>> but only single match is returned.
    >>>>
    >>>> It is possible to use split instead but using a single match would be
    >>>> much nicer.

    >>
    >> I would only use #split if you really want to split the string. Otherwise
    >> please see below.
    >>
    >>>> Any workaround?
    >>>>
    >>>> ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

    >>
    >>> as far as i know, nested groups are not allowed. regular expressions do
    >>> not form a language.

    >>
    >> Nested groups *are* allowed. However, one must understand how group
    >> matching works: for each matching group only at most *one* capture is
    >> recorded:
    >>
    >> irb(main):001:0> s="abaab"
    >> => "abaab"
    >> irb(main):002:0> /(?:(a+)b)+/.match s
    >> => #<MatchData "abaab" 1:"aa">
    >> irb(main):003:0> md = /(?:(a+)b)+/.match s
    >> => #<MatchData "abaab" 1:"aa">
    >> irb(main):004:0> md.to_a
    >> => ["abaab", "aa"]
    >> irb(main):005:0> md[1]
    >> => "aa"
    >> irb(main):006:0>
    >>
    >> As you can see from this 1.9.1 test, it is the *last* match. I cannot
    >> provide an official rationale for this, but one likely reason: The memory
    >> overhead for storing arbitrary amount of matches per group can be
    >> significant. Also, the number of groups is known at compile time of a
    >> regular expression while the number of matches of each group is only known
    >> at match time. This makes it easier to allocate the memory needed for
    >> storing a single capture per group because it can be done when the regular
    >> expression is compiled. Please also note that all regular expression
    >> engines I know handle it that way, i.e. you get at most one capture per
    >> group.
    >>
    >> In those cases I usually employ a two level approach:
    >>
    >> irb(main):015:0> s = "ajabcaabck"
    >> => "ajabcaabck"
    >> irb(main):016:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    >> irb(main):017:1> $1.scan(/b*(a+)b+c*/){|m| p m, $1}
    >> irb(main):018:1> end
    >> ["a"]
    >> "a"
    >> ["aa"]
    >> "aa"
    >> => "abcaabc"
    >> irb(main):019:0>
    >>
    >> Because of the way how #scan works we can do:
    >>
    >> irb(main):022:0> if /^a*j((?:b*a+b+c*)+)k$/ =~ s
    >> irb(main):023:1> $1.scan(/b*(a+)b+c*/){|m| p m}
    >> irb(main):024:1> end
    >> ["a"]
    >> ["aa"]
    >> => "abcaabc"
    >> irb(main):025:0>

    >
    > Sorry, I mixed grouping and capturing. Concerning grouping, regexp acts like
    > a language, but not concerning the capturing and for this reason you have to
    > make that two level trick. Nested capturing would lead to a tree of results
    > with bad performance, I guess.
    >

    Actually, nested capturing is also supported as you can see from the
    examples here. What is not supported is returning multiple matches for
    a group that matches multiple times.

    Thanks

    Michal
    Michal Suchanek, Feb 9, 2010
    #10
  11. On Tue, Feb 9, 2010 at 9:36 AM, Michal Suchanek <> wrote:
    > Actually, nested capturing is also supported as you can see from the
    > examples here. What is not supported is returning multiple matches for
    > a group that matches multiple times.


    Are you sure it matches multiple times? As I mentioned earlier in the
    thread, I can't get it to do so.

    Ben
    Ben Bleything, Feb 9, 2010
    #11
  12. On 9 February 2010 19:27, Ben Bleything <> wrote:
    > On Tue, Feb 9, 2010 at 9:36 AM, Michal Suchanek <> wrote:
    >> Actually, nested capturing is also supported as you can see from the
    >> examples here. What is not supported is returning multiple matches for
    >> a group that matches multiple times.

    >
    > Are you sure it matches multiple times? As I mentioned earlier in the
    > thread, I can't get it to do so.


    (stuff)+ matches multiple stuffs but returns only one.

    "stuffstuffstuff".match /^(stuff)+$/
    => #<MatchData "stuffstuffstuff" 1:"stuff">

    Still can be nested.

    "stuffstuffstuff".match /^(stu(ff))+$/
    => #<MatchData "stuffstuffstuff" 1:"stuff" 2:"ff">
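
    For comparison, a short sketch of getting every repetition back for the
    same string: drop the anchors and the + and let scan repeat the nested
    pattern instead. This is only one possible workaround, and it does not
    check that the whole string consists of nothing but "stuff".

    ```ruby
    s = "stuffstuffstuff"

    # One anchored match: each group keeps a single value.
    md = s.match(/^(stu(ff))+$/)
    p md.captures   # => ["stuff", "ff"]

    # scan repeats the unanchored pattern: one sub-array per repetition,
    # with the outer and the nested group in each.
    all = s.scan(/(stu(ff))/)
    p all           # => [["stuff", "ff"], ["stuff", "ff"], ["stuff", "ff"]]
    ```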

    Thanks

    Michal
    Michal Suchanek, Feb 9, 2010
    #12
