Named groups in regexp matches?

Discussion in 'Ruby' started by Tom Pollard, Jan 29, 2007.

  1. Tom Pollard

    Tom Pollard Guest

    Hi,

    Does Ruby support regexps that assign names to specific matched
    groups? In Python, for instance, if you write a regexp like this,

    TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
    (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
    re.VERBOSE)

    the match object will provide a hash with keys 'temp' and 'dewpt',
    containing the values matched by the corresponding '(?P<temp>...)'
    and '(?P<dewpt>...)' groups. This is another solution to the problem
    of creating regexps where you want to use parentheses both for grouping
    and substring capture. I know Perl and Ruby support the '(?:...)'
    syntax to let you use parens for specifying alternatives without
    capturing that group, but the Python scheme for labeling capture
    groups produces more readable code, and I've used it heavily in some
    code I was hoping to port to Ruby. I was hoping that perhaps I'd
    just overlooked something in the Ruby docs.

    Thanks,

    Tom
     
    Tom Pollard, Jan 29, 2007
    #1

  2. On Tue, Jan 30, 2007 at 03:24:40AM +0900, Tom Pollard wrote:
    > Hi,
    >
    > Does Ruby support regexps that assign names to specific matched
    > groups? In Python, for instance, if you write a regexp like this,
    >
    > TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
    > (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
    > re.VERBOSE)
    >
    > the match object will provide a hash with keys 'temp' and 'dewpt',

    [...]

    Take a look at this thread:
    http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/80270

    > Thanks,
    > Tom

    --Greg
     
    Gregory Seidman, Jan 29, 2007
    #2

  3. Tom Pollard

    Tom Pollard Guest

    On Jan 29, 2007, at 1:28 PM, Gregory Seidman wrote:
    > On Tue, Jan 30, 2007 at 03:24:40AM +0900, Tom Pollard wrote:
    >> Does Ruby support regexps that assign names to specific matched
    >> groups? In Python, for instance, if you write a regexp like this,
    >>
    >> TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
    >> (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
    >> re.VERBOSE)
    >>
    >> the match object will provide a hash with keys 'temp' and 'dewpt',

    > [...]
    >
    > Take a look at this thread:
    > http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/80270


    Thanks very much for the quick response. It sounds like the answer
    is that Ruby does not support named captures, but that the Oniguruma
    library supplies this feature. I think it would be a nice feature to
    add in 1.9. My experience is this is very useful (if not necessary)
    for composing non-trivial regexps. Without them, it's just too easy
    to mess up the capture-group numbers when adding or removing
    parenthesized subexpressions in your regexp.
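
    The renumbering problem described above can be illustrated with a small
    sketch. The patterns, constant names, and helper below are invented for this
    illustration and are not taken from the code being ported. Adding one
    optional group in front silently changes what index 2 refers to:

```ruby
# Hypothetical patterns, made up for this illustration.
RANGE_RE          = /(\d+)-(\d+)/          # group 2 is the upper bound
RANGE_RE_PREFIXED = /(M|P)?(\d+)-(\d+)/    # same idea, one group added in front

# Return the text of capture group 2, or nil if the pattern does not match.
def second_group(pattern, text)
  m = pattern.match(text)
  m && m[2]
end

puts second_group(RANGE_RE, "10-20")           # the upper bound
puts second_group(RANGE_RE_PREFIXED, "10-20")  # now the lower bound
```

    Any code that still reads group 2 keeps running after the change, which is
    why this kind of mistake is easy to miss.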

    Tom
     
    Tom Pollard, Jan 29, 2007
    #3
  5. Tom Pollard wrote:
    > Does Ruby support regexps that assign names to specific matched groups?

    In Ruby 1.9 it works. I wrote some articles with many examples on
    http://www.ruby-mine.de (the site may be down for maintenance for the next two
    days), especially the following one: http://www.ruby-mine.de?p=130. Unfortunately
    it is only available in German at the moment, but the examples are Ruby code and
    irb sessions, so they should be understandable without reading the German text.

    But Ruby 1.9 is still under development, so details may change in the future.

    Some examples:

    irb(main):001:0> md="abba".match(/(?<a1>.)(?<a2>.)\k<a2>\k<a1>/)
    => #<MatchData:0x2bf0488>
    irb(main):002:0> md[0]
    => "abba"
    irb(main):003:0> md[1]
    => "a"
    irb(main):004:0> md[2]
    => "b"
    irb(main):005:0> md[:a1]
    => "a"
    irb(main):006:0> md[:a2]
    => "b"
    irb(main):007:0> md['a1']
    => "a"
    irb(main):008:0> md['a2']
    => "b"

    As this shows, the contents of a matched group are accessible by number, by
    name as a symbol, and by name as a string. However, it is not allowed to mix
    named groups and numbered backreferences to ordinary capturing groups in the
    same regular expression:

    irb(main):001:0> "abba".match(/(?<a1>.)(.)\2\k<a1>/)
    SyntaxError: compile error
    (irb):1: numbered backref/call is not allowed. (use name): /(?<a1>.)(.)\2\k<a1>/
    from (irb):1:in `Kernel#binding'
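
    A minimal sketch of the same restriction, assuming a released Ruby (1.9 or
    later). As far as I know, a pattern that contains named groups treats plain
    parentheses as non-capturing there, and the error shows up as a RegexpError
    when the pattern is compiled at run time with Regexp.new. The constant
    MIXED_RE and the helper compiles? are hypothetical names:

```ruby
# Plain parentheses next to a named group: only the named group is captured.
MIXED_RE = /(?<first>.)(.)/

# Hypothetical helper: true if the pattern source compiles, false otherwise.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

m = MIXED_RE.match("ab")
p m.names                                    # names of the named groups
p m.captures                                 # the unnamed (.) does not appear
p compiles?('(?<a1>.)(.)\2\k<a1>')           # numbered backref next to a named group
p compiles?('(?<a1>.)(?<a2>.)\k<a2>\k<a1>')  # all references by name
```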

    -----

    When using "sub", "gsub", "sub!", or "gsub!" without a block, it is only possible
    to access the groups by name; positional access returns the empty string:


    irb(main):001:0> puts 'axbx'.sub(/(?<r>.)x(?<s>.)x/, '\k<s>\k<r>')
    ba
    => nil
    irb(main):002:0> puts 'axbx'.sub(/(?<r>.)x(?<s>.)x/, '\2\1')

    => nil
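
    The named form of the replacement can be wrapped in a small method. This is
    only a sketch; SWAP_RE and swap_pair are names chosen for the example. The
    replacement string is single-quoted so that the backslashes reach "sub"
    unchanged:

```ruby
# Two named single-character groups, each followed by a literal "x".
SWAP_RE = /(?<r>.)x(?<s>.)x/

# Replace the first match with the two captures in swapped order.
def swap_pair(text)
  text.sub(SWAP_RE, '\k<s>\k<r>')
end

puts swap_pair('axbx')
```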

    -----

    Inside a block, direct access to the group names is not possible; at least I
    have not found a way to do it directly. The positional variables "$1" etc. can
    be used. Another possibility is to use the MatchData object "$~" inside the
    block, which offers the same access methods as described for "match":

    irb(main):001:0> 'axbx'.sub(/(?<i>.)x(?<j>.)x/){|k|p k;p $1;p $2;'u'+$2}
    "axbx"
    "a"
    "b"
    => "ub"
    irb(main):002:0> 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/){|k|p k;p $1;p $2;'u'+$2}
    "axbx"
    "a"
    "b"
    "cxdx"
    "c"
    "d"
    => "ubud"

    and using the MatchData object:

    irb(main):003:0> 'axbxcxdx'.gsub(/(?<i>.)x(?<j>.)x/){|k|p $~[:i]}
    "a"
    "c"
    => ""
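
    A slightly fuller sketch of the same idea, using Regexp.last_match (which
    returns the same object as "$~") to build the replacement from the named
    groups. PAIR_RE and swap_pairs are names invented for the example:

```ruby
PAIR_RE = /(?<i>.)x(?<j>.)x/

# Swap the two captures of every match, reading them by name.
def swap_pairs(text)
  text.gsub(PAIR_RE) do
    m = Regexp.last_match   # the MatchData of the current match, same as $~
    m[:j] + m[:i]
  end
end

puts swap_pairs('axbxcxdx')
```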

    ------

    There are special situations where the capabilities of Oniguruma in Ruby 1.9
    allow solutions that are not as simple to express in Ruby 1.8.

    Ruby 1.8:

    irb(main):001:0> "rasbuavb".scan(/(.)a|(.)b/){|i|p i}
    ["r", nil]
    [nil, "s"]
    ["u", nil]
    [nil, "v"]
    => "rasbuavb"

    Ruby 1.9:

    irb(main):002:0> "rasbuavb".scan(/(.){0}\g<1>a|\g<1>b/){|i|p i}
    ["r"]
    ["s"]
    ["u"]
    ["v"]
    => "rasbuavb"

    The key feature here is not the named group but the ability to call a
    subexpression. It is a very powerful feature, which allows recursive
    constructs. In the article I built a pocket calculator as an example, but it may
    also be useful for checking complex input fields in a GUI, or even later on in Rails:

    pattern = / (?<e>\g<t>\+\g<e>|\g<t>-\g<e>|\g<t>){0}
    (?<t>|\g<f>\*\g<t>|\g<f>\/\g<t>|\g<f>){0}
    (?<f>[-+]?\g<id>|\(\g<e>\)){0}
    (?<id>\g<n>|\g<v>){0}
    (?<n>[a-zA-Z_]\w*){0}
    (?<v>\d+(\.\d+)?){0}
    ^((?<var>\g<n>)=)?(?<expr>\g<e>)$
    /x

    vars = Hash.new(0)
    basbind = binding

    # print 'input> ' # for interactive usage
    while (!(inp = DATA.gets).chomp.match(/^quit$/i))
      if (md = inp.chomp.gsub(/\s+/,'').match(pattern))
        expr = md[:expr].gsub(/([a-zA-Z_]\w*)/, 'vars["\1"]')
        erg = eval(expr, basbind)
        vars[md[:var]] = erg if md[:var]
        puts "#{inp.chomp}, result> #{(md[:var])?(md[:var]+'='):''}#{erg}"
      else
        puts "+++++ incorrect input: '#{inp.chomp}'"
      end
      # print 'input> ' # for interactive usage
    end
    puts '***** variables *****'
    vars.keys.sort.each{|v|puts "#{v}=#{vars[v]}"}
    puts '******* End ********'
    __END__
    30+12
    a = 30 + 12
    b = 2*a
    c = -(a*a+5)
    d = (6+5*a)*c
    quit

    results in:

    30+12, result> 42
    a = 30 + 12, result> a=42
    b = 2*a, result> b=84
    c = -(a*a+5), result> c=-1769
    d = (6+5*a)*c, result> d=-382104
    ***** variables *****
    a=42
    b=84
    c=-1769
    d=-382104
    ******* End ********
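
    The recursive subexpression call can also be seen in isolation in a much
    smaller sketch than the calculator. The pattern below is not from the
    article; BALANCED_RE and balanced? are names made up here. It accepts a
    string that consists of exactly one balanced parenthesized group:

```ruby
# \g<group> calls the named subexpression "group" from inside itself,
# so nested parentheses of any depth are accepted.
BALANCED_RE = /\A(?<group>\((?:[^()]|\g<group>)*\))\z/

# True if the whole text is a single balanced parenthesized group.
def balanced?(text)
  !BALANCED_RE.match(text).nil?
end

p balanced?("(a(b)c)")
p balanced?("(a(b)c")
```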

    -----

    Summary - in the near future you will have a lot of powerful new features in
    Ruby's pattern matching facilities.

    Wolfgang Nádasi-Donner
     
    Wolfgang Nádasi-Donner, Jan 29, 2007
    #5
  6. Tom Pollard

    Tom Pollard Guest

    On Jan 29, 2007, at 5:25 PM, Wolfgang Nádasi-Donner wrote:
    > Summary - in the near future you will have a lot of powerful new
    > features in Ruby's pattern matching facilities.


    Thanks very much for that report! Now I'll just have to decide whether
    to wait until 1.9 rolls out, or find some other way to port my code
    in the meantime.

    Tom
     
    Tom Pollard, Jan 30, 2007
    #6
  7. Paddy3118

    Paddy3118 Guest

    On Jan 29, 10:14 pm, Wolfgang Nádasi-Donner wrote:
    > Tom Pollard wrote:
    >> Does Ruby support regexps that assign names to specific matched groups?
    >
    > In Ruby 1.9 it works. I wrote some articles with many examples on
    > http://www.ruby-mine.de (the site may be down for maintenance for the next two
    > days), especially the following one: http://www.ruby-mine.de?p=130. Unfortunately
    > it is only available in German at the moment, but the examples are Ruby code and
    > irb sessions, so they should be understandable without reading the German text.
    >
    > But Ruby 1.9 is still under development, so details may change in the future.
    >
    > Some examples:
    >
    > irb(main):001:0> md="abba".match(/(?<a1>.)(?<a2>.)\k<a2>\k<a1>/)
    > => #<MatchData:0x2bf0488>
    > irb(main):002:0> md[0]
    > => "abba"
    > irb(main):003:0> md[1]
    > => "a"
    > irb(main):004:0> md[2]
    > => "b"
    > irb(main):005:0> md[:a1]
    > => "a"
    > irb(main):006:0> md[:a2]
    > => "b"
    > irb(main):007:0> md['a1']
    > => "a"
    > irb(main):008:0> md['a2']
    > => "b"
    >
    > As this shows, the contents of a matched group are accessible by number, by
    > name as a symbol, and by name as a string. However, it is not allowed to mix
    > named groups and numbered backreferences to ordinary capturing groups in the
    > same regular expression:
    >


    Hi Wolfgang,
    I was going to ask why you did not use the (?P<name>...) syntax
    used in Python, but found that, according to
    http://www.amk.ca/python/howto/regex/regex.html#SECTION000530000000000000000,
    the P is for Python extensions.

    But then, if the Ruby extension is quite like the Python one but is not
    seen as the canonical implementation of grouped expressions, maybe it
    should be (?R<name>...), showing that this is a Ruby-specific extension?

    Thanks,
    - Paddy.

    > Summary - in the near future you will have a lot of powerful new features in
    > Ruby's pattern matching facilities.
    >
    > Wolfgang Nádasi-Donner
     
    Paddy3118, Feb 3, 2007
    #7
  8. Paddy3118 wrote:
    > On Jan 29, 10:14 pm, Wolfgang Nádasi-Donner wrote:
    >> irb(main):001:0> md="abba".match(/(?<a1>.)(?<a2>.)\k<a2>\k<a1>/)

    > Hi Wolfgang,
    > I was going to ask why you did not use the syntax of (?P<name>...) as
    > used in Python, but found that, according to
    > http://www.amk.ca/python/howto/regex/regex.html#SECTION000530000000000000000,
    > the P is for Python extensions.


    It's not that easy. The regular expression engine used in Ruby 1.9 is not an
    integral part of Ruby. It is a standalone regular expression engine called
    "Oniguruma" (http://www.geocities.jp/kosako3/oniguruma/).

    Oniguruma currently exists in three variants: "2.x.y" can be used in Ruby 1.6
    and 1.8, but it is not the standard engine of Ruby 1.6/1.8; "4.x.y" will be used
    in Ruby 1.9 and later; and "5.x.y" is not related to Ruby.

    The syntax of the regular expressions is not defined by Ruby; it is defined
    by Oniguruma.

    Wolfgang Nádasi-Donner
     
    Wolfgang Nádasi-Donner, Feb 3, 2007
    #8
  9. Tom Pollard

    Tom Pollard Guest

    On Feb 3, 2007, at 11:23 AM, Robert Dober wrote:
    > What I do might be enough for your purpose
    >
    > irb(main):001:0> S=Struct.new(:key, :value)
    > => S
    > irb(main):002:0> r=%r{(\w+)\s*=\s*(.*)}
    > => /(\w+)\s*=\s*(.*)/
    > irb(main):003:0> m= r.match("name = Tom Pollard")
    > => #<MatchData:0xb7dfbb5c>
    > irb(main):004:0> s=S.new(*m.captures)
    > => #<struct S key="name", value="Tom Pollard">
    > irb(main):005:0> s.key
    > => "name"
    > irb(main):006:0> s.value
    > => "Tom Pollard"
    >
    > this could easily be wrapped into a class BTW.


    Thanks. That's not a bad idea, but it only addresses half of my
    problem, because I still need to be careful to use non-capturing
    groups for the things I don't want to capture. In Python, I can
    ignore that - labeling the groups I /do/ want to capture is enough.
    Here are a few examples from my Python code:

    WIND_RE = re.compile(r"""^(?P<dir>[\dO]{3}|[0O]|///|MMM|VRB)
    (?P<speed>P?[\dO]{2,3}|[0O]+|[/M]{2,3})
    (G(?P<gust>P?(\d{1,3}|[/M]{1,3})))?
    (?P<units>KTS?|LT|K|T|KMH|MPS)?
    (\s+(?P<varfrom>\d\d\d)V
    (?P<varto>\d\d\d))?\s+""",
    re.VERBOSE)
    VISIBILITY_RE = re.compile(r"""^(?P<vis>(?P<dist>M?(\d\s+)?\d/\d\d?|
    M?\d+)
    ( \s*(?P<units>SM|KM|M|U) |
    (?P<dir>[NSEW][EW]?) )? |
    CAVOK )\s+""",
    re.VERBOSE)
    RUNWAY_RE = re.compile(r"""^(RVRNO |
    R(?P<name>\d\d(RR?|LL?|C)?)/
    (?P<low>(M|P)?\d\d\d\d)
    (V(?P<high>(M|P)?\d\d\d\d))?
    (?P<unit>FT)?[/NDU]*)\s+""",
    re.VERBOSE)
    TEMP_RE = re.compile(r"""^(?P<temp>(M|-)?\d+|//|XX|MM)/
    (?P<dewpt>(M|-)?\d+|//|XX|MM)?\s+""",
    re.VERBOSE)

    To port these to pre-1.9 Ruby, I'll need to remove the '?P<name>'
    labels and change the other groups from '(...)' to '(?:...)'. Once
    I've done that I can worry about assigning labels to the captured
    groups, after they're matched, or just using the index captures.
    There's nothing about that that's not straightforward; I'm mostly
    struggling with my motivation for going through this effort at all,
    simply to port a working, well-debugged and fairly fast Python module
    to Ruby, especially since I'm fairly sure the resulting Ruby module
    will be much slower and harder to maintain.
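
    For comparison, here is a rough sketch of both porting routes for TEMP_RE.
    It is not the ported module: the Temp struct, the helper method names, and
    the sample inputs are invented for illustration. The first pattern follows
    the pre-1.9 steps described above; the second assumes Ruby 1.9 or later,
    where (?<name>...) takes the place of Python's (?P<name>...):

```ruby
# Pre-1.9 style: helper groups become (?:...), labels are attached afterwards.
TEMP_RE_NUMBERED = %r{^((?:M|-)?\d+|//|XX|MM)/
                      ((?:M|-)?\d+|//|XX|MM)?\s+}x
Temp = Struct.new(:temp, :dewpt)   # hypothetical holder for the labels

# Returns a Temp, or nil if the text does not start with a temperature group.
def parse_temp_numbered(text)
  m = TEMP_RE_NUMBERED.match(text)
  m && Temp.new(*m.captures)
end

# Ruby 1.9+ style: named groups, plain (M|-) parentheses left as they were.
TEMP_RE_NAMED = %r{^(?<temp>(M|-)?\d+|//|XX|MM)/
                   (?<dewpt>(M|-)?\d+|//|XX|MM)?\s+}x

def parse_temp_named(text)
  m = TEMP_RE_NAMED.match(text)
  m && Temp.new(m[:temp], m[:dewpt])
end

p parse_temp_numbered("M05/M10 ")   # made-up sample input
p parse_temp_named("12/ ")          # dew point missing
```

    In the named version the helper groups do not need to be rewritten, because
    only the named groups are looked up.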

    Tom
     
    Tom Pollard, Feb 3, 2007
    #9