Match a pattern multiple times, returning matches, captures and offset?

Discussion in 'Ruby' started by Markus Fischer, Apr 5, 2011.

  1. Hi,

    I'm used to being able to use the following in PHP. What it basically
    does is return all matches, including the captures, ordered by matching
    set, together with their offsets.

    $ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches,
    PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
    array(2) {
    [0]=>
    array(2) {
    [0]=>
    array(2) {
    [0]=>
    string(5) "_foo_"
    [1]=>
    int(0)
    }
    [1]=>
    array(2) {
    [0]=>
    string(3) "foo"
    [1]=>
    int(1)
    }
    }
    [1]=>
    array(2) {
    [0]=>
    array(2) {
    [0]=>
    string(5) "_bar_"
    [1]=>
    int(6)
    }
    [1]=>
    array(2) {
    [0]=>
    string(3) "bar"
    [1]=>
    int(7)
    }
    }
    }

    I've found two ways in Ruby that go in this direction, String#match and
    String#scan, but each gives me only part of the information. I guess I
    can combine the two, but before attempting this I wanted to check that I
    haven't overlooked something. Here are my Ruby attempts:

    ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
    => #<MatchData "_foo_" 1:"foo">
    ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
    => ["_foo_", "foo"]
    ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
    => [0, 1]

    But here I'm missing the further possible matches, "_bar_" and "bar". Or
    the #scan approach:

    ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
    => [["foo"], ["bar"]]

    But in this case I have even less information: the full match (_foo_ or
    _bar_) is not present, and I can't get the offsets either.

    I re-read the documentation for Regexp#match and found out that you can
    pass an offset into the string as second parameter, so I guess I can
    iterate over the string in a loop until I find no further matches ...?
    Considering this I came up with:

    $ cat test_match_all.rb
    require 'pp'

    class String
      def match_all(pattern)
        matches = []
        offset = 0
        while m = match(pattern, offset) do
          matches << m
          offset = m.begin(0) + m[0].length
        end
        matches
      end
    end

    pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)


    $ ruby test_match_all.rb
    [#<MatchData "_foo_" 1:"foo">,
    #<MatchData "_bar_" 1:"bar">,
    #<MatchData "_baz_" 1:"baz">]
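
    One caveat with the loop above: if the pattern can match the empty
    string, the offset never advances and the loop does not terminate. A
    minimal sketch of a guarded variant follows. It is written as a
    standalone method rather than a String patch, and the name match_all and
    its argument order are illustrative assumptions, not a core Ruby API.

```ruby
# Hypothetical helper (not part of Ruby's core API): collect every MatchData
# for pattern in str, stepping past zero-length matches.
def match_all(str, pattern)
  matches = []
  offset = 0
  while offset <= str.length && (m = pattern.match(str, offset))
    matches << m
    # Resume after the match; move one extra character if the match was empty.
    offset = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

all = match_all("_foo_ _bar_ _baz_", /_(\w+)_/)
p all.map { |m| [m[0], m[1], m.begin(0), m.begin(1)] }

# A pattern that can match the empty string still terminates.
p match_all("ab", /x*/).map { |m| m.begin(0) }
```

    Each element is a full MatchData, so the whole match, the captures and
    all offsets stay available to the caller.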


    I've lots of data to parse so I could foresee that this approach can
    become a bottleneck. Is there a more direct solution to it?

    thanks,
    - Markus
     
    Markus Fischer, Apr 5, 2011
    #1

  2. String#scan with a block may do what you want:

    >> "_foo_ _bar_".scan(/_(\w+)_/) { |x| puts "Offset #{$`.size}, captures #{x.inspect}" }
    Offset 0, captures ["foo"]
    Offset 6, captures ["bar"]
    => "_foo_ _bar_"

    But it doesn't give you offsets to the individual captures, just to the
    start of the whole match. (You also get the full match in $& and the
    rest of the string after the match in $')
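
    As a rough illustration of those special variables, the sketch below
    gathers the offset, the full match and the remaining text for each match.
    The method name match_context is made up for this example.

```ruby
# Hypothetical helper: for each match, record the length of the text before
# it ($`), the full match ($&) and the text after it ($').
def match_context(str, pattern)
  rows = []
  str.scan(pattern) { rows << [$`.size, $&, $'] }
  rows
end

p match_context("_foo_ _bar_", /_(\w+)_/)
```

    For the sample string this yields one row per match, with offsets 0 and 6
    as in the output above.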

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 5, 2011
    #2

  3. 7stud -- Guest

    Markus Fischer wrote in post #991092:
    >
    > But here I'm missing the further possible matches, "_bar_" and "bar". Or
    > the #scan approach:
    >
    > ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
    > => [["foo"], ["bar"]]
    >
    > But in this case I've even less information, the match including _foo_
    > or _bar_ is not present and I can't get the offsets too.
    >
    > I re-read the documentation for Regexp#match


    If you look at the preamble in the docs for the MatchData class, you can
    retrieve a MatchData object using Regexp.last_match, which you can call
    inside a scan() block:

    str = "_foo_ _bar_"

    str.scan(/_(\w+)_/) do |match|
      md = Regexp.last_match
      p [md[0], md[1], md.offset(1)]
    end

    --output:--
    ["_foo_", "foo", [1, 4]]
    ["_bar_", "bar", [7, 10]]
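
    Building on this, a sketch that shapes the result roughly like PHP's
    PREG_SET_ORDER|PREG_OFFSET_CAPTURE output from the original question. The
    method name match_all_with_offsets is an assumption for illustration.
    Note that Ruby reports character offsets while PHP reports byte offsets,
    so the numbers can differ for multibyte text.

```ruby
# Hypothetical helper: one entry per match; each entry lists [text, offset]
# for the whole match (group 0) followed by every capture group.
def match_all_with_offsets(str, pattern)
  results = []
  str.scan(pattern) do
    md = Regexp.last_match
    results << (0...md.size).map { |i| [md[i], md.begin(i)] }
  end
  results
end

p match_all_with_offsets("_foo_ _bar_", /_(\w+)_/)
```

    A group that did not take part in a match shows up as [nil, nil].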

    --
    Posted via http://www.ruby-forum.com/.
     
    7stud --, Apr 6, 2011
    #3
  4. On Wed, Apr 6, 2011 at 3:37 AM, 7stud -- wrote:
    > Markus Fischer wrote in post #991092:
    >>
    >> But here I'm missing the further possible matches, "_bar_" and "bar". Or
    >> the #scan approach:
    >>
    >> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
    >>  => [["foo"], ["bar"]]
    >>
    >> But in this case I've even less information, the match including _foo_
    >> or _bar_ is not present and I can't get the offsets too.
    >>
    >> I re-read the documentation for Regexp#match

    >
    > If you look at the preamble in the docs for the MatchData class, you can
    > retrieve a MatchData object using Regexp.last_match, which you can call
    > inside a scan() block:


    When doing nested matching it may be better to use $~ because that is
    local to the current stack frame which Regexp.last_match isn't.
    Example with relative offsets as well:

    irb(main):022:0> str.scan /_(\w+)_/ do
    irb(main):023:1* 2.times {|i| p [$~[i], $~.offset(i), $~.offset(i).map {|o| o - $~.offset(0)[0]}]}
    irb(main):024:1> end
    ["_foo_", [0, 5], [0, 5]]
    ["foo", [1, 4], [1, 4]]
    ["_bar_", [6, 11], [0, 5]]
    ["bar", [7, 10], [1, 4]]
    => "_foo_ _bar_"

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 6, 2011
    #4
  5. 7stud -- Guest

    You can also get the relative offset like this:

    str = "_foo_ _bar_"

    str.scan(/_(\w+)_/) do |curr_match|
      md = Regexp.last_match
      whole_match = md[0]
      captures = md.captures
      captures.each do |capture|
        p [whole_match, capture, whole_match.index(capture)]
      end
    end

    --
    Posted via http://www.ruby-forum.com/.
     
    7stud --, Apr 7, 2011
    #5
  6. On Thu, Apr 7, 2011 at 1:58 AM, 7stud -- wrote:
    > You can also get the relative offset like this:
    >
    > str = "_foo_ _bar_"
    >
    > str.scan(/_(\w+)_/) do |curr_match|
    >  md = Regexp.last_match
    >  whole_match = md[0]
    >  captures = md.captures
    >  captures.each do |capture|
    >    p [whole_match, capture, whole_match.index(capture)]
    > end


    That's nice! I wasn't aware of this. Thanks for sharing!

    I also just read this in the docs:

    "Note that the last_match is local to the thread and method scope of the
    method that did the pattern match."

    So forget my point about $~ being safer.
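
    The quoted note can be checked directly. In this sketch (both method
    names are made up for illustration), a match performed inside a called
    method leaves the caller's last match untouched:

```ruby
# Hypothetical helper: performs its own match and returns the matched text.
def inner_match
  "xyz" =~ /y/
  Regexp.last_match(0)
end

# Hypothetical helper: matches, calls inner_match, then reads its own
# last match again to show it was not overwritten by the callee.
def outer_match
  "abc" =~ /b/
  inner = inner_match
  [inner, Regexp.last_match(0)]
end

p outer_match
```

    The second element is still the caller's own match, so $~ and
    Regexp.last_match behave the same way in this respect.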

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 7, 2011
    #6
  7. 7stud -- wrote in post #991338:
    > You can also get relative beginning offsets like this:
    >
    > str = "_foo_ _bar_"
    >
    > str.scan(/_(\w+)_/) do |curr_match|
    > md = Regexp.last_match
    > whole_match = md[0]
    > captures = md.captures
    >
    > captures.each do |capture|
    > p [whole_match, capture, whole_match.index(capture)]
    > end
    >
    > end


    Using 'index' doesn't work if you have multiple captures which have the
    same pattern, or one is a substring of the other.

    Use md.begin(n) and md.end(n) on the MatchData instead.

    >> md = /(...)(...)/.match "foofoo"

    => #<MatchData "foofoo" 1:"foo" 2:"foo">
    >> md.captures

    => ["foo", "foo"]
    >> md.begin(1)

    => 0
    >> md.begin(2)

    => 3
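
    To make the difference concrete, here is a small sketch comparing the
    two approaches on the same match. The helper names offsets_by_index and
    offsets_by_begin are hypothetical.

```ruby
# Hypothetical helpers contrasting two ways of locating captures relative
# to the start of the whole match.
def offsets_by_index(md)
  # Searches for the capture's text, so it finds the first occurrence only.
  md.captures.map { |capture| md[0].index(capture) }
end

def offsets_by_begin(md)
  # Uses the positions recorded by the regexp engine for each group.
  (1...md.size).map { |i| md.begin(i) - md.begin(0) }
end

md = /(...)(...)/.match("foofoo")
p offsets_by_index(md)
p offsets_by_begin(md)
```

    With "foofoo", the index-based version reports the same position for
    both captures, while the begin-based version separates them.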

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 7, 2011
    #7
  8. 7stud -- Guest

    Brian Candler wrote in post #991406:
    > 7stud -- wrote in post #991338:
    >> You can also get relative beginning offsets like this:
    >>
    >> str = "_foo_ _bar_"
    >>
    >> str.scan(/_(\w+)_/) do |curr_match|
    >> md = Regexp.last_match
    >> whole_match = md[0]
    >> captures = md.captures
    >>
    >> captures.each do |capture|
    >> p [whole_match, capture, whole_match.index(capture)]
    >> end
    >>
    >> end

    >
    > Using 'index' doesn't work if you have multiple captures which have the
    > same pattern, or one is a substring of the other.
    >
    > Use md.begin(n) and md.end(n) on the MatchData instead.
    >


    begin() and end() are the two elements of offset(), which we've already
    discussed above.

    The idea was to get the relative offsets within a match, not the
    absolute offsets within the string.

    --
    Posted via http://www.ruby-forum.com/.
     
    7stud --, Apr 7, 2011
    #8
  9. 7stud -- wrote in post #991546:
    > However, note that
    > begin() and end() are the two elements of offset(), which we've already
    > discussed above. The idea was to additionally provide the relative
    > offsets within a match, not just the absolute offsets within the string.


    That's easy - subtract begin(0) which is the absolute offset of the
    start of the match.

    >> "foo bar" =~ /ba(.)/

    => 4
    >> $~.captures

    => ["r"]
    >> $~.begin(1)

    => 6
    >> $~.begin(1) - $~.begin(0)

    => 2
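
    The same subtraction can be wrapped up for all capture groups at once. A
    minimal sketch, assuming a helper named relative_offsets (not a built-in
    method):

```ruby
# Hypothetical helper: [start, end] of every capture group, measured from
# the start of the whole match instead of the start of the string.
def relative_offsets(md)
  base = md.begin(0)
  (1...md.size).map do |i|
    # Groups that did not participate have nil positions; keep them nil.
    md.offset(i).map { |pos| pos && pos - base }
  end
end

md = /ba(.)/.match("foo bar")
p relative_offsets(md)
```

    For the example above this gives a start of 2 for the single capture,
    matching the manual subtraction.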

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Apr 8, 2011
    #9
  10. 7stud -- Guest

    Brian Candler wrote in post #991686:
    > 7stud -- wrote in post #991546:
    >> However, note that
    >> begin() and end() are the two elements of offset(), which we've already
    >> discussed above. The idea was to additionally provide the relative
    >> offsets within a match, not just the absolute offsets within the string.

    >
    > That's easy - subtract begin(0) which is the absolute offset of the
    > start of the match.


    The "subtraction method" was thoroughly vetted earlier.

    --
    Posted via http://www.ruby-forum.com/.
     
    7stud --, Apr 8, 2011
    #10