multiple regexp matches

Discussion in 'Ruby' started by Kevin Howe, Aug 17, 2004.

  1. Kevin Howe

    Kevin Howe Guest

    I want to get multiple results of a regexp pattern match, offsets included.
    The following code gets the proper results, but does not return offsets:

    str = '<span id="1"> <span>...</span> </span>'
    re = Regexp.new('(<(\/?)span>)', true) # match start or end tag
    print str.scan(re).inspect

    The Regexp module will return offsets, but Regexp::match only returns the
    first match, so I'm not sure how to get the full list of matches?
    Kevin Howe, Aug 17, 2004
    #1

  2. Zach Dennis

    Zach Dennis Guest

    Instead of using str.scan(re) use:

    re.match( str );

    which returns a MatchData object. You can use the MatchData object's
    offset method to find your results.....i do believe.

    Zach

    Zach Dennis, Aug 17, 2004
    #2

  3. Kevin Howe

    Kevin Howe Guest

    "Zach Dennis" <> wrote in message
    news:...
    > Instead of using str.scan(re) use:
    >
    > re.match( str );
    >
    > which returns a MatchData object. You can use the MatchData object's
    > offset method to find your results.....i do believe.


    Yes that's true, but if you read the second part of my message, I'd already
    tried this:

    >The Regexp module will return offsets, but Regexp::match only returns the
    >first match, so I'm not sure how to get the full list of matches?
    Kevin Howe, Aug 17, 2004
    #3
  4. Zach Dennis

    Zach Dennis Guest

    According to rdoc you are mistaken.

    I also think you are mistaken:

    #!/usr/bin/ruby
    t = "This is my 1 text"

    re = /([^\s]*\s).*(\d)(\s.)/
    md = re.match( t );
    puts md.offset(0);
    puts ""
    puts md.offset( 1 );
    puts ""
    puts md.offset( 2 );
    puts ""
    puts md.offset( 3 );


    It returns the correct offsets of the matches. offset(0) being the whole
    regex, offset(1) does the first subexpression, offset(2) does the second
    subexpression. It works.

    Zach



    Zach Dennis, Aug 17, 2004
    #4
  5. Hi --

    On Wed, 18 Aug 2004, Zach Dennis wrote:

    > According to rdoc you are mistaken.
    >
    > I also think you are mistaken:
    >
    > #!/usr/bin/ruby
    > t = "This is my 1 text"
    >
    > re = /([^\s]*\s).*(\d)(\s.)/
    > md = re.match( t );
    > puts md.offset(0);
    > puts ""
    > puts md.offset( 1 );
    > puts ""
    > puts md.offset( 2 );
    > puts ""
    > puts md.offset( 3 );
    >
    >
    > It returns the correct offsets of the matches. offset(0) being the whole
    > regex, offset(1) does the first subexpression, offset(2) does the second
    > subexpression. It works.


    The problem is that Kevin wanted to scan a string more than once with
    the same regex:

    str = "abc abc abc"
    re = /(\w+)/ # not /(\w+) (\w+) (\w+)/

    re will scan against str three times. The difficulty is getting hold
    of the offsets of all the matches from all three times, in relation to
    the total length of the string.

    Someone will probably post a simple or elegant solution; in the
    meantime, here's mine:

    def find_offsets(str,re)
      offsets = []
      first = 0

      loop do
        break unless m = re.match(str[first..-1])
        break if m.captures.empty?
        m.captures.each_with_index do |c,i|
          of = m.offset(i+1)
          res = [c, [of[0]+first, of[1]+first ]]
          yield res if block_given?
          offsets << res
        end
        # Resume after the whole match; resuming at the last capture's start
        # never advances when that capture begins at offset 0.
        first += m.end(0)
      end

      offsets
    end

    # Little test:

    str = '<span id="1"> <span>...</span> </span>'
    re = /(<(\/?)span>)/i

    puts str
    (str.size/9).times { print "0123456789" }
    puts; puts

    find_offsets(str,re).each do |capture, (start, stop)|
      puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
    end

    # Output:
    <span id="1"> <span>...</span> </span>
    0123456789012345678901234567890123456789

    "<span>" starts at 14, ends at 20
    "" starts at 15, ends at 15
    "</span>" starts at 23, ends at 30
    "/" starts at 24, ends at 25
    "</span>" starts at 31, ends at 38
    "/" starts at 32, ends at 33
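
    A variant of the loop above, as a minimal sketch: it assumes a Ruby
    version where Regexp#match accepts a start position as its second
    argument (1.9 and later), so the offsets already refer to the whole
    string and no slicing is needed. The name find_offsets_from is
    hypothetical, chosen only for this illustration.

```ruby
# Hypothetical helper: collect [capture, [start, stop]] pairs for every match.
# Assumes Regexp#match(str, pos) is available (Ruby 1.9 and later).
def find_offsets_from(str, re)
  offsets = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    m.captures.each_with_index do |capture, i|
      # Offsets are relative to the full string, not to a slice.
      offsets << [capture, m.offset(i + 1)]
    end
    # Step past the match; move at least one character on an empty match.
    pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
  offsets
end

str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i
result = find_offsets_from(str, re)
result.each do |capture, (start, stop)|
  puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
end
```

    For the test string this should print the same six lines as the
    output shown above.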


    David

    --
    David A. Black
    David A. Black, Aug 17, 2004
    #5
  6. This should do what you want.

    -austin

    str = '<span id="1"> <span>...</span> </span>'
    re = /(<(\/?)span>)/i

    str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

    matches = []
    str.scan(re) do
      matches << Regexp.last_match
    end

    matches.each do |match|
      match.captures.each_with_index do |capture, ii|
        soff, eoff = match.offset(ii + 1)
        puts %Q("#{capture}" #{soff} .. #{eoff})
      end
    end
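
    The collection step can also be written in one line. This is a
    sketch of a commonly used idiom rather than anything from the
    original post; it assumes String#to_enum is available and relies on
    scan setting the last match before each block call. The variable
    names are illustrative.

```ruby
str = '<span id="1"> <span>...</span> </span>'
re  = /(<(\/?)span>)/i

# Enumerate the scan and keep the MatchData that scan leaves behind
# for each hit.
matches = str.to_enum(:scan, re).map { Regexp.last_match }

# Pair each matched text with its [start, stop] offsets in the string.
spans = matches.map { |m| [m[0], m.offset(0)] }
p spans
```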
    Austin Ziegler, Aug 17, 2004
    #6
  7. Zach Dennis

    Zach Dennis Guest

    Ah...thanks for the clarification David. I was mistaken.

    Sorry for the confusion Kevin.

    Zach

    Zach Dennis, Aug 17, 2004
    #7
  8. Kevin Howe

    Kevin Howe Guest

    Awesome, that works great, thank you. I have to wonder why Ruby doesn't have
    this built in; it's simple enough to add a method that yields each match as
    a MatchData object, as follows:

    class MultiRegexp < Regexp
      def matches(str)
        str.scan(self) do
          yield Regexp.last_match
        end
      end
    end

    str = '<span id="1"> <span>...</span> </span>'
    re = MultiRegexp.new('(<(\/?)span>)', true)
    re.matches(str) { |i|
      capture = i.captures[0]
      start,stop = i.offset(0)
      puts "\"#{capture}\" starts at #{start}, ends at #{stop}"
    }

    An even nicer alternative would be to add a Regexp::MULTIMATCH constant:

    str = '<span id="1"> <span>...</span> </span>'
    re = Regexp.new('(<(\/?)span>)', Regexp::MULTIMATCH)
    matches = re.match(str)

    Just a thought :)


    Kevin Howe, Aug 17, 2004
    #8
  9. Regexp scanning with MatchData (Re: multiple regexp matches)

    "Austin Ziegler" <> schrieb im Newsbeitrag
    news:...
    > This should do what you want.
    >
    > -austin
    >
    > str = '<span id="1"> <span>...</span> </span>'
    > re = /(<(\/?)span>)/i
    >
    > str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]
    >
    > matches = []
    > str.scan(re) do
    > matches << Regexp.last_match
    > end
    >
    > matches.each do |match|
    > match.captures.each_with_index do |capture, ii|
    > soff, eoff = match.offset(ii + 1)
    > puts %Q("#{capture}" #{soff} .. #{eoff})
    > end
    > end
    >


    While that works, isn't it ridiculous that one has to resort to a class
    method ("Regexp.last_match")? I mean, there should rather be something like

    /o/.each( "foo" ) do |md|
    # md is MatchData
    end

    Or even

    /o/.matcher( "foo" ).each do |md|
    # md is MatchData
    end

    That way Matcher could implement Enumerable.

    Sounds like a candidate for a RCR. Any comments?
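
    To make the second form concrete, here is a minimal sketch of such
    a Matcher. The class is hypothetical and not part of Ruby; it is
    built with Matcher.new rather than a Regexp#matcher method so that
    no core class is patched, and it assumes Regexp#match accepts a
    start position (Ruby 1.9 and later).

```ruby
# Hypothetical Matcher: not part of Ruby, just a sketch of the idea.
class Matcher
  include Enumerable

  def initialize(re, str)
    @re = re
    @str = str
  end

  # Yields one MatchData per non-overlapping match, left to right.
  def each
    return to_enum(:each) unless block_given?
    pos = 0
    while pos <= @str.length && (md = @re.match(@str, pos))
      yield md
      # Advance past the match, or by one character if it was empty.
      pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
    end
    self
  end
end

# Enumerable supplies map, select, to_a and friends on top of each.
starts = Matcher.new(/o/, "foo").map { |md| md.begin(0) }
p starts
```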

    robert
    Robert Klemme, Aug 18, 2004
    #9
  10. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    On Wednesday 18 August 2004 12:31, Robert Klemme wrote:
    > "Austin Ziegler" <> schrieb im Newsbeitrag
    > news:...
    >
    > > This should do what you want.
    > >
    > > -austin
    > >
    > > str = '<span id="1"> <span>...</span> </span>'
    > > re = /(<(\/?)span>)/i
    > >
    > > str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]
    > >
    > > matches = []
    > > str.scan(re) do
    > > matches << Regexp.last_match
    > > end
    > >
    > > matches.each do |match|
    > > match.captures.each_with_index do |capture, ii|
    > > soff, eoff = match.offset(ii + 1)
    > > puts %Q("#{capture}" #{soff} .. #{eoff})
    > > end
    > > end

    >
    > While that works, isn't it ridiculous that one has to resort to a class
    > method ("Regexp.last_match")? I mean, there should rather be something
    > like
    >
    > /o/.each( "foo" ) do |md|
    > # md is MatchData
    > end
    >
    > Or even
    >
    > /o/.matcher( "foo" ).each do |md|
    > # md is MatchData
    > end
    >
    > That way Matcher could implement Enumerable.
    >
    > Sounds like a candidate for a RCR. Any comments?



    What about $~ ?

    bash-2.05b$ ruby a.rb
    [[0, 13], [14, 20], [23, 30], [31, 38]]
    bash-2.05b$ expand -t2 a.rb
    str = '<span id="1"> <span>...</span> </span>'
    re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
    positions = []
    str.scan(re) do
      positions << [$~.begin(0), $~.end(0)]
    end
    p positions
    bash-2.05b$


    --
    Simon Strandgaard
    Simon Strandgaard, Aug 18, 2004
    #10
  11. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    "Simon Strandgaard" <> schrieb im Newsbeitrag
    news:...
    > On Wednesday 18 August 2004 12:31, Robert Klemme wrote:
    > > "Austin Ziegler" <> schrieb im Newsbeitrag
    > > news:...
    > >
    > > > This should do what you want.
    > > >
    > > > -austin
    > > >
    > > > str = '<span id="1"> <span>...</span> </span>'
    > > > re = /(<(\/?)span>)/i
    > > >
    > > > str.scan(re) # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]
    > > >
    > > > matches = []
    > > > str.scan(re) do
    > > > matches << Regexp.last_match
    > > > end
    > > >
    > > > matches.each do |match|
    > > > match.captures.each_with_index do |capture, ii|
    > > > soff, eoff = match.offset(ii + 1)
    > > > puts %Q("#{capture}" #{soff} .. #{eoff})
    > > > end
    > > > end

    > >
    > > While that works, isn't it ridiculous that one has to resort to a class
    > > method ("Regexp.last_match")? I mean, there should rather be something
    > > like
    > >
    > > /o/.each( "foo" ) do |md|
    > > # md is MatchData
    > > end
    > >
    > > Or even
    > >
    > > /o/.matcher( "foo" ).each do |md|
    > > # md is MatchData
    > > end
    > >
    > > That way Matcher could implement Enumerable.
    > >
    > > Sounds like a candidate for a RCR. Any comments?

    >
    >
    > What about $~ ?
    >
    > bash-2.05b$ ruby a.rb
    > [[0, 13], [14, 20], [23, 30], [31, 38]]
    > bash-2.05b$ expand -t2 a.rb
    > str = '<span id="1"> <span>...</span> </span>'
    > re = Regexp.new('(<(\/?)span[^\n/]*?>)', true) # match start or end tag
    > positions = []
    > str.scan(re) do
    > positions << [$~.begin(0), $~.end(0)]
    > end
    > p positionsbash-2.05b$


    This has the same problem, only that in this case you don't use a class
    method but a global variable. Neither of them is connected in any way to
    the regexp you use, other than through a hidden side effect of the matching
    process. I would like a more explicit connection, similar to the one I
    suggested.

    Kind regards

    robert
    Robert Klemme, Aug 18, 2004
    #11
  12. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    On Wed, 18 Aug 2004 19:31:01 +0900, Robert Klemme <> wrote:
    > "Austin Ziegler" <> schrieb im Newsbeitrag
    > news:...
    >> str = '<span id="1"> <span>...</span> </span>'
    >> re = /(<(\/?)span>)/i
    >>
    >> str.scan(re)
    >> # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]
    >>
    >> matches = []
    >> str.scan(re) do
    >> matches << Regexp.last_match
    >> end
    >>
    >> matches.each do |match|
    >> match.captures.each_with_index do |capture, ii|
    >> soff, eoff = match.offset(ii + 1)
    >> puts %Q("#{capture}" #{soff} .. #{eoff})
    >> end
    >> end

    > While that works, isn't it ridiculous that one has to resort to a
    > class method ("Regexp.last_match")? I mean, there should rather be
    > something like
    >
    > /o/.each( "foo" ) do |md|
    > # md is MatchData
    > end


    There's a simple solution, and I'll probably open an RCR about this
    if others agree with it. String#scan, #sub, and #gsub should yield
    MatchData objects, not Strings. There are probably others, but those
    are the ones that come to mind. This *will* break some code,
    unfortunately, but that can be mitigated by adding #to_str. IMO,
    this will make #gsub much easier to deal with, as you won't have to
    resort to either Regexp.last_match or $[0-9] variables to be able to
    work with captures. My Regexp.last_match call only presumes that
    Regexp.last_match is actually threadsafe, whereas we know that the
    ugly Perlish $ variables are threadsafe. I think this is an
    acceptable level of incompatibility because of the use of #to_str
    and the amount of flexibility that would be gained. As far as I
    know, it wouldn't require *that* big a change, because for
    Regexp.last_match to work, there must still be a MatchData object
    *somewhere*.

    What do you think?

    -austin
    --
    Austin Ziegler *
    * Alternate:
    Austin Ziegler, Aug 18, 2004
    #12
  13. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    On Wednesday 18 August 2004 16:55, Austin Ziegler wrote:
    > There's a simple solution, and I'll probably open an RCR about this
    > if others agree with it. String#scan, #sub, and #gsub should yield
    > MatchData objects, not Strings.

    [snip]

    Agreed, this would be nice. I think I saw an RCR about it a long time
    ago (but I cannot locate that RCR).

    btw: my Ruby regexp engine does so; it yields MatchData instead of a String.
    http://raa.ruby-lang.org/project/regexp/


    --
    Simon Strandgaard
    Simon Strandgaard, Aug 18, 2004
    #13
  14. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    Austin Ziegler wrote:

    > There's a simple solution, and I'll probably open an RCR about this
    > if others agree with it. String#scan, #sub, and #gsub should yield
    > MatchData objects, not Strings.


    I agree with this, and it seems the only reason matz hasn't done it yet
    is backwards compatibility.

    I'm referring to this posting of his:

    http://groups.google.com/groups?selm=

    > What do you think?


    I strongly agree with this. It's the way it should have been from the
    beginning. #to_str sounds like a way that shouldn't break too much code,
    and Ruby could issue a migration warning when it is called.

    Rite was said to sacrifice compatibility for the sake of more elegance,
    so now might be a good time for switching.

    Regards,
    Florian Gross
    Florian Gross, Aug 18, 2004
    #14
  15. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    "Austin Ziegler" <> schrieb im Newsbeitrag
    news:...
    > On Wed, 18 Aug 2004 19:31:01 +0900, Robert Klemme <>

    wrote:
    > > "Austin Ziegler" <> schrieb im Newsbeitrag
    > > news:...
    > >> str = '<span id="1"> <span>...</span> </span>'
    > >> re = /(<(\/?)span>)/i
    > >>
    > >> str.scan(re)
    > >> # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]
    > >>
    > >> matches = []
    > >> str.scan(re) do
    > >> matches << Regexp.last_match
    > >> end
    > >>
    > >> matches.each do |match|
    > >> match.captures.each_with_index do |capture, ii|
    > >> soff, eoff = match.offset(ii + 1)
    > >> puts %Q("#{capture}" #{soff} .. #{eoff})
    > >> end
    > >> end

    > > While that works, isn't it ridiculous that one has to resort to a
    > > class method ("Regexp.last_match")? I mean, there should rather be
    > > something like
    > >
    > > /o/.each( "foo" ) do |md|
    > > # md is MatchData
    > > end

    >
    > There's a simple solution, and I'll probably open an RCR about this
    > if others agree with it. String#scan, #sub, and #gsub should yield
    > MatchData objects, not Strings. There are probably others, but those
    > are the ones that come to mind. This *will* break some code,
    > unfortunately, but that can be mitigated by adding #to_str. IMO,
    > this will make #gsub much easier to deal with, as you won't have to
    > resort to either Regexp.last_match or $[0-9] variables to be able to
    > work with captures. My Regexp.last_match call only presumes that
    > Regexp.last_match is actually threadsafe, whereas we know that the
    > ugly Perlish $ variables are threadsafe. I think this is an
    > acceptable level of incompatibility because of the use of #to_str
    > and the amount of flexibility that would be gained. As far as I
    > know, it wouldn't require *that* big a change, because for
    > Regexp.last_match to work, there must still be a MatchData object
    > *somewhere*.
    >
    > What do you think?


    I like the functionality very much, but I'd prefer to *not* change the
    behavior of String#scan, #sub, and #gsub. I'd rather have Regexp#scan(str,
    &block), Regexp#sub(str, replace=nil, &block) and Regexp#gsub(str,
    replace=nil, &block) that yield MatchData if there is a block. There might
    be other names but since the behavior is quite similar to those methods in
    String these names are probably good. The only drawback I can see is that
    they might cause confusion ("Which were the ones that yielded MatchData?"),
    but IMHO people can cope with this - especially since old behavior does not
    change. (Personally I would find it easy to remember that Regexp <->
    MatchData and String <-> String or Array of String.)
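
    As a rough sketch of this split, the following keeps String#scan
    and String#gsub untouched and puts the MatchData-yielding variants
    in a separate place. The module name RegexpWithMatchData and its
    method signatures are hypothetical stand-ins for the proposed
    Regexp#scan and Regexp#gsub, written as module functions so that no
    core class is modified.

```ruby
# Hypothetical helpers standing in for the proposed Regexp#scan and
# Regexp#gsub; String#scan and String#gsub are left untouched.
module RegexpWithMatchData
  module_function

  # Like String#scan with a block, but the block receives MatchData.
  def scan(re, str)
    str.scan(re) { yield Regexp.last_match }
    str
  end

  # Like String#gsub with a block, but the block receives MatchData.
  # The block's result is converted to a String and used as replacement.
  def gsub(re, str)
    str.gsub(re) { yield(Regexp.last_match).to_s }
  end
end

tags = []
RegexpWithMatchData.scan(/<(\/?)span>/i, '<span>x</span>') do |md|
  tags << [md[0], md.begin(0)]
end
p tags

swapped = RegexpWithMatchData.gsub(/(\w+)@(\w+)/, 'user@host') do |md|
  "#{md[2]}@#{md[1]}"
end
p swapped
```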

    Kind regards

    robert
    Robert Klemme, Aug 19, 2004
    #15
  16. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    Here is the RCR I will be submitting. There is a server error on
    rcrchive that prevents me from submitting it there.

    Make String#scan, #gsub, and #sub yield MatchData objects
    backwards compatibility [x]

    Abstract:
    A "least-break" change to <code>String#scan</code>,
    <code>#gsub</code>, and <code>#sub</code> to provide the MatchData to
    attached code blocks.

    Problem:
    <code>String#scan</code>, <code>#gsub</code>, and <code>#sub</code>
    yield the string value of the matched regular expression to a provided
    block, which is of very limited value. Currently, we must rely upon
    either ugly numeric match variables (<code>$1</code> - <code>$9</code>,
    etc.) or a class method (<code>Regexp.last_match</code>) to
    obtain the match.

    <pre>str = '<span id="1"> <span>...</span> </span>'
    re = /(<(\/?)span>)/i

    str.scan(re)
    # => [["<span>", ""], ["</span>", "/"], ["</span>", "/"]]

    matches = []
    str.scan(re) do
      matches << Regexp.last_match
    end

    matches.each do |match|
      match.captures.each_with_index do |capture, ii|
        soff, eoff = match.offset(ii + 1)
        puts %Q("#{capture}" #{soff} .. #{eoff})
      end
    end</pre>

    Proposal:
    <code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
    yield MatchData objects instead of Strings. I think that this could be
    achieved while breaking the least amount of code by adding a #to_str
    implementation to MatchData.

    Analysis:
    I have written code as noted in the problem section; it feels
    unnecessarily complex and fragile. This change will work in all cases
    where a single string is provided; it will require a change to code
    that deals with array values (e.g., String#scan when groups are
    provided, because of the use of rb_reg_nth_match in scan_once);
    switching to the use of MatchData#captures by the developers will work
    just fine.

    Implementation:
    I *think* that the changes look something like this:
    <pre>
    --- re.c.old 2004-08-22 00:24:09 Eastern Daylight Time
    +++ re.c 2004-08-22 00:18:50 Eastern Daylight Time

    @@ -2320,6 +2320,7 @@
    rb_define_method(rb_cMatch, "pre_match", rb_reg_match_pre, 0);
    rb_define_method(rb_cMatch, "post_match", rb_reg_match_post, 0);
    rb_define_method(rb_cMatch, "to_s", match_to_s, 0);
    + rb_define_method(rb_cMatch, "to_str", match_to_s, 0);
    rb_define_method(rb_cMatch, "inspect", rb_any_to_s, 0); /* in object.c */
    rb_define_method(rb_cMatch, "string", match_string, 0);
    }

    --- string.c.old 2004-08-22 00:24:10 Eastern Daylight Time
    +++ string.c 2004-08-22 00:20:35 Eastern Daylight Time

    @@ -1928,7 +1928,7 @@

    if (iter) {
    rb_match_busy(match);
    - repl = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
    + repl = rb_obj_as_string(rb_yield(match));
    rb_backref_set(match);
    }
    else {
    @@ -2043,7 +2043,7 @@
    regs = RMATCH(match)->regs;
    if (iter) {
    rb_match_busy(match);
    - val = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
    + val = rb_obj_as_string(rb_yield(match));
    rb_backref_set(match);
    }
    else {
    @@ -4164,15 +4164,7 @@
    else {
    *start = END(0);
    }
    - if (regs->num_regs == 1) {
    - return rb_reg_nth_match(0, match);
    - }
    - result = rb_ary_new2(regs->num_regs);
    - for (i=1; i < regs->num_regs; i++) {
    - rb_ary_push(result, rb_reg_nth_match(i, match));
    - }
    -
    - return result;
    + return match;
    }
    return Qnil;
    }
    </pre>

    I'm not 100% sure that this is right, and I haven't tested it. The
    equivalent Ruby code would be (note: this code appears to work, but
    it does cause problems with irb):

    <pre>class MatchData
      def to_str
        self.to_s
      end
    end

    class String
      alias_method :old_scan, :scan
      alias_method :old_gsub!, :gsub!
      alias_method :old_sub!, :sub!

      def scan(pattern)
        if block_given?
          old_scan(pattern) { yield Regexp.last_match }
        else
          old_scan(pattern)
        end
      end

      def gsub(pattern, repl = nil, &block)
        s = self.dup
        s.gsub!(pattern, repl, &block)
        s
      end

      def gsub!(pattern, repl = nil)
        if block_given? and repl.nil?
          old_gsub!(pattern) { yield Regexp.last_match }
        elsif repl.nil?
          old_gsub!(pattern)
        else
          old_gsub!(pattern, repl)
        end
      end

      def sub(pattern, repl = nil, &block)
        s = self.dup
        s.sub!(pattern, repl, &block)
        s
      end

      def sub!(pattern, repl = nil)
        if block_given? and repl.nil?
          old_sub!(pattern) { yield Regexp.last_match }
        elsif repl.nil?
          old_sub!(pattern)
        else
          old_sub!(pattern, repl)
        end
      end
    end</pre>
    Austin Ziegler, Aug 22, 2004
    #16
  17. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    On Sun, 22 Aug 2004 01:00:39 -0400, Austin Ziegler <> wrote:
    > Here is the RCR I will be submitting. There is a server error on
    > rcrchive that prevents me from submitting it there.


    This has been resolved. This is now RCR 276.

    http://rcrchive.net/rcr/RCR/RCR276

    -austin
    --
    Austin Ziegler *
    * Alternate:
    Austin Ziegler, Aug 22, 2004
    #17
  18. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    "Austin Ziegler" <> schrieb im Newsbeitrag
    news:...
    > On Sun, 22 Aug 2004 01:00:39 -0400, Austin Ziegler <>

    wrote:
    > > Here is the RCR I will be submitting. There is a server error on
    > > rcrchive that prevents me from submitting it there.

    >
    > This has been resolved. This is now RCR 276.
    >
    > http://rcrchive.net/rcr/RCR/RCR276
    >
    > -austin
    > --
    > Austin Ziegler *
    > * Alternate:
    >
    >

    Thx for including my comment. I was about to add it myself but saw it just
    in time. :)

    robert
    Robert Klemme, Aug 22, 2004
    #18
  19. Kevin Howe

    Guest

    Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    Hi,

    At Sun, 22 Aug 2004 14:00:45 +0900,
    Austin Ziegler wrote in [ruby-talk:110110]:
    > Proposal:
    > <code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
    > yield MatchData objects instead of Strings. I think that this could be
    > achieved while breaking the least amount of code by adding a #to_str
    > implementation to MatchData.


    #to_str doesn't solve everything. MatchData#[] returns a matched
    portion for sub-patterns, whereas String#[] returns a byte at
    the position.
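
    A small illustration of that mismatch, as a sketch with arbitrary
    sample data: indexing a MatchData with an integer selects a capture
    group, while indexing the matched String selects a position in the
    text (a byte value on Ruby 1.8, a one-character String on later
    versions). Code written for one breaks silently with the other.

```ruby
md = /(\w+)@(\w+)/.match("user@host")

whole = md.to_s   # the full matched text, what #to_str would hand back
first = md[1]     # MatchData#[] indexes capture groups
plain = whole[1]  # String#[] indexes into the text itself

p whole
p first
p plain
```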

    --
    Nobu Nakada
    , Aug 22, 2004
    #19
  20. Re: Regexp scanning with MatchData (Re: multiple regexp matches)

    On Mon, 23 Aug 2004 07:33:18 +0900,
    <> wrote:
    > At Sun, 22 Aug 2004 14:00:45 +0900,
    > Austin Ziegler wrote in [ruby-talk:110110]:
    > > Proposal:
    > > <code>String#scan</code>, <code>#sub</code>, and <code>#gsub</code>
    > > yield MatchData objects instead of Strings. I think that this could be
    > > achieved while breaking the least amount of code by adding a #to_str
    > > implementation to MatchData.

    > #to_str doesn't solve everything. MatchData#[] returns a matched
    > portion for sub-patterns, whereas String#[] returns a byte at
    > the position.


    Agreed. It is also 100% incompatible with #scan when the regexp has
    groups (e.g., "foobar".scan(/(..)(.)/) will yield [["fo", "o"], ["ba",
    "r"]]). This is the argument for Regexp#scan instead of modifying
    String#scan. However, this is something that I believe should be
    changed. An alternative is to yield both the normal values and the
    match -- but that itself will be incompatible with #scan and most
    current uses of #gsub and #sub that use the match value.
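
    For reference, a short sketch of what #scan hands out today when
    the regexp has groups, and how the same data can be read from the
    MatchData instead. The variable names are illustrative only.

```ruby
# Without a block, scan returns one array of captures per match.
plain = "foobar".scan(/(..)(.)/)

# With a block, each of those arrays is yielded in turn.
yielded = []
"foobar".scan(/(..)(.)/) { |pair| yielded << pair }

# The same captures are available from the MatchData of each match.
via_match = []
"foobar".scan(/(..)(.)/) { via_match << Regexp.last_match.captures }

p plain
p yielded
p via_match
```

    All three arrays hold the same pairs, which is why code that
    expects arrays from #scan would need MatchData#captures after the
    proposed change.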

    Yet another alternative is to add an optional parameter in all cases.
    String#gsub currently expects a regexp and a replace pattern OR a
    regexp and a block. #gsub could be modified such that when it gets a
    regexp, a "boolean", and a block, it yields something different. This
    could be, for example:

    String#gsub(pattern, true) { |match_data| ... }
    String#gsub(pattern) { |string| ... }

    I would actually rather see the opposite form, if we do this:

    String#gsub(pattern, true) { |string| ... }
    String#gsub(pattern) { |match_data| ... }

    This would encourage the use of the new form. By doing it this way, a
    transition period can be introduced (e.g., in 1.8.3 it may warn that
    the current form will be changed to yield a match_data instead of a
    string; in 1.9 it yields a match_data instead of a string).

    I have *not* analysed code out there that uses #gsub/#scan/#sub, but I
    think that this is an ideal change.

    -austin (I'm also adding this to the discussion on RCR276)
    --
    Austin Ziegler *
    * Alternate:
    Austin Ziegler, Aug 23, 2004
    #20