Short question on regex in Ruby

Discussion in 'Ruby' started by Chris Ro, Sep 26, 2008.

  1. Chris Ro

    Chris Ro Guest

    Hi,

    I have a little problem with a regex in Ruby:

    I have two strings:

    string1 = "He is the 20th."
    string2 = "25th"

    I wrote this to "extract" the place (20 or 25 respectively):

    place1 = string1.gsub(/.*(\d+)th.*/,'\1')
    place2 = string2.gsub(/.*(\d+)th.*/,'\1')
    pp place1
    pp place2

    => "0"
    => "5"

    Of course, I would like to get all the digits before "th". Why is only
    the last one captured?

    If anyone could please explain this, and help me come up with a regex
    that captures 20 and 25, respectively, this would be greatly
    appreciated.

    Cheers, Chris
    --
    Posted via http://www.ruby-forum.com/.
     
    Chris Ro, Sep 26, 2008
    #1

  2. Mark Thomas

    Mark Thomas Guest

    On Sep 26, 10:02 am, Chris Ro <> wrote:
    > Hi,
    >
    > I have a little problem with a regex in Ruby:
    >
    > I have two strings:
    >
    > string1 = "He is the 20th."
    > string2 = "25th"
    >
    > I wrote this to "extract" the place (20 or 25 respectively):
    >
    > place1 = string1.gsub(/.*(\d+)th.*/,'\1')
    > place2 = string2.gsub(/.*(\d+)th.*/,'\1')
    > pp place1
    > pp place2
    >
    > => "0"
    > => "5"
    >
    > Of course, I would like to get all the digits before "th". Why is only
    > the last one captured?


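    Because the leading .* is greedy: it takes as much as it can while still
    letting the rest of the pattern match. \d+ needs only one digit, so .*
    swallows everything before the last digit and leaves just that digit
    for the capture.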
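
A minimal sketch of the greedy behaviour described above, using the strings from the original post. The helper name `greedy_capture` and the extra capture group around `.*` are added purely for illustration.

```ruby
# The original pattern: a greedy `.*` in front of the capture group.
GREEDY = /.*(\d+)th.*/

# Returns what group 1 captured, or nil when the pattern does not match.
def greedy_capture(text)
  m = GREEDY.match(text)
  m && m[1]
end

p greedy_capture("He is the 20th.")  # "0"
p greedy_capture("25th")             # "5"

# Wrapping the leading `.*` in its own group shows what it consumed:
# it backtracks only far enough to leave a single digit for `\d+`.
m = /(.*)(\d+)th.*/.match("He is the 20th.")
p m[1]  # "He is the 2"
p m[2]  # "0"
```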

    > If anyone could please explain this, and help me come up with a regex
    > that captures 20 and 25, respectively, this would be greatly


    place = string[/\d+(?=th)/]
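
For reference, a small sketch of the lookahead one-liner above wrapped in a method. The name `place_of` is hypothetical; the pattern is the one from the post. The lookahead `(?=th)` checks that "th" follows the digits without making it part of the match.

```ruby
# Returns the digits that are immediately followed by "th", or nil.
def place_of(text)
  text[/\d+(?=th)/]
end

p place_of("He is the 20th.")  # "20"
p place_of("25th")             # "25"
p place_of("no number here")   # nil
```

Note that the pattern only looks for "th", so strings like "21st" or "3rd" return nil.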

    -- Mark.
     
    Mark Thomas, Sep 26, 2008
    #2

  3. Thomas B.

    Thomas B. Guest

    Chris Ro wrote:
    > place1 = string1.gsub(/.*(\d+)th.*/,'\1')


    Hello. I think your approach with using gsub is not the best possible
    here. It's better to simply find the matching part using match and
    substitute it for the whole string, like this:
    place1 = string1.match(/(\d+)th\b/)[1]
    The \b ensures that the next character after 'th' is not a word
    character (\b is word boundary), and [1] at the end is extracting the
    first bracketed group. It also makes it possible to skip the .* at both
    ends, which is a bit ugly.
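
One thing to watch with the match approach: `match` returns nil when the pattern is not found, so calling `[1]` on the result directly raises NoMethodError. A hedged sketch with a nil check follows; the method name `place_via_match` is made up for this example.

```ruby
# Returns the digits before a word-final "th", or nil when there is no match.
def place_via_match(text)
  m = text.match(/(\d+)th\b/)
  m ? m[1] : nil
end

p place_via_match("He is the 20th.")  # "20"
p place_via_match("25th")             # "25"
p place_via_match("20thing")          # nil, because \b fails between "h" and "i"
p place_via_match("no number")        # nil
```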

    Apart from that, a useful piece of knowledge about regexps:
    /.*?(\d+)th.*/ will match what you want, because the first .*? will be
    reluctant to eat up more characters, so it will pass to \d+ as many
    digits as it can.
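
A short sketch of the reluctant variant plugged back into the original gsub call. The method name `place_via_gsub` is hypothetical. Keep in mind that gsub returns the string unchanged when the pattern does not match, so the caller cannot tell "no match" from "the input was already just the number".

```ruby
# `.*?` gives up characters as early as possible, so `\d+` sees every digit.
def place_via_gsub(text)
  text.gsub(/.*?(\d+)th.*/, '\1')
end

p place_via_gsub("He is the 20th.")  # "20"
p place_via_gsub("25th")             # "25"
p place_via_gsub("none")             # "none" (no match, string returned as-is)
```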

    TPR.
    --
    Posted via http://www.ruby-forum.com/.
     
    Thomas B., Sep 26, 2008
    #3
  4. Robert Klemme

    2008/9/26 Thomas B. <>:
    > Chris Ro wrote:
    >> place1 = string1.gsub(/.*(\d+)th.*/,'\1')

    >
    > Hello. I think your approach with using gsub is not the best possible
    > here.


    Agree.

    > It's better to simply find the matching part using match and
    > substitute it for the whole string, like this:
    > place1 = string1.match(/(\d+)th\b/)[1]


    For extraction there is a simpler solution

    irb(main):002:0> "He is the 20th."[/(\d+)th\b/, 1]
    => "20"
    irb(main):003:0> "25th"[/(\d+)th\b/, 1]
    => "25"
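
Building on the irb session above, a sketch that also converts the result to an Integer and guards against nil. The method names and the group name "place" are assumptions for this example; named groups need Ruby 1.9 or later.

```ruby
# Index 1 selects the first capture group; the slice is nil when nothing matches.
def place_number(text)
  digits = text[/(\d+)th\b/, 1]
  digits && digits.to_i
end

# Same idea with a named group instead of a numeric index.
def place_number_named(text)
  digits = text[/(?<place>\d+)th\b/, "place"]
  digits && digits.to_i
end

p place_number("He is the 20th.")  # 20
p place_number_named("25th")       # 25
p place_number("nothing here")     # nil
```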

    > The \b ensures that the next character after 'th' is not a word
    > character (\b is word boundary), and [1] at the end is extracting the
    > first bracketed group. It also makes it possible to skip the .* at both
    > ends, which is a bit ugly.


    Right.

    > Apart from that, a useful piece of knowledge about regexps:
    > /.*?(\d+)th.*/ will match what you want, because the first .*? will be
    > reluctant to eat up more characters, so it will pass to \d+ as many
    > digits as it can.


    But reluctant is slow (see my benchmark from a few days ago).

    Cheers

    robert

    --
    use.inject do |as, often| as.you_can - without end
     
    Robert Klemme, Sep 26, 2008
    #4
  5. Thomas B.

    Thomas B. Guest

    Robert Klemme wrote:
    >> It's better to simply find the matching part using match and
    >> substitute it for the whole string, like this:
    >> place1 = string1.match(/(\d+)th\b/)[1]

    >
    > For extraction there is a simpler solution
    >
    > irb(main):002:0> "He is the 20th."[/(\d+)th\b/, 1]
    > => "20"
    > irb(main):003:0> "25th"[/(\d+)th\b/, 1]
    > => "25"


    Yes, I forgot about this one. +1

    >> Apart from that, a useful piece of knowledge about regexps:
    >> /.*?(\d+)th.*/ will match what you want, because the first .*? will be
    >> reluctant to eat up more characters, so it will pass to \d+ as many
    >> digits as it can.

    >
    > But reluctant is slow (see my benchmark from a few days ago).


    OK. I guess reluctant is slow especially when the string that it has to
    cover is long. And I agree that it's not a very good idea to use
    reluctant regexps in time-critical applications, and the first solution
    is much better here. I mentioned them just to let the original poster
    gain some knowledge. I use reluctant patterns when not in a hurry, because
    they make things much easier sometimes.
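
To check the speed claim on your own machine, a rough sketch using the standard Benchmark library is given below. This is not the benchmark Robert refers to; the test string, the iteration count, and the method names are all assumptions, and the timings will vary with the Ruby version and the input.

```ruby
require "benchmark"

# A long prefix forces the reluctant `.*?` to expand one character at a time.
LONG_TEXT = ("word " * 2_000) + "20th."

# Direct extraction with String#[] and a capture index.
def slice_capture(text)
  text[/(\d+)th\b/, 1]
end

# Extraction by replacing the whole string with the captured group.
def reluctant_gsub(text)
  text.gsub(/.*?(\d+)th.*/, '\1')
end

Benchmark.bm(16) do |bm|
  bm.report("slice capture")  { 200.times { slice_capture(LONG_TEXT) } }
  bm.report("reluctant gsub") { 200.times { reluctant_gsub(LONG_TEXT) } }
end
```

Both methods return "20" for this input, so the comparison is between two ways of getting the same answer.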

    TPR.

    --
    Posted via http://www.ruby-forum.com/.
     
    Thomas B., Sep 26, 2008
    #5
  6. Patrick He

    Patrick He Guest

    [Note: parts of this message were removed to make it a legal post.]

    IMO, lookahead is the best solution for the problem.

    Mark Thomas wrote:
    > On Sep 26, 10:02 am, Chris Ro <> wrote:
    >
    >> Hi,
    >>
    >> I have a little problem with a regex in Ruby:
    >>
    >> I have two strings:
    >>
    >> string1 = "He is the 20th."
    >> string2 = "25th"
    >>
    >> I wrote this to "extract" the place (20 or 25 respectively):
    >>
    >> place1 = string1.gsub(/.*(\d+)th.*/,'\1')
    >> place2 = string2.gsub(/.*(\d+)th.*/,'\1')
    >> pp place1
    >> pp place2
    >>
    >> => "0"
    >> => "5"
    >>
    >> Of course, I would like to get all the digits before "th". Why is only
    >> the last one captured?
    >>

    >
    > Because the .* is greedy and will get all it can, which is all but the
    > last digit.
    >
    >
    >> If anyone could please explain this, and help me come up with a regex
    >> that captures 20 and 25, respectively, this would be greatly
    >>

    >
    > place = string[/\d+(?=th)/]
    >
    > -- Mark.
    >
    >
    >
     
    Patrick He, Sep 26, 2008
    #6
  7. Nit Khair

    Nit Khair Guest

    If you need to get multiple numbers out you could try scan().

    > d="9,45, 567"

    => "9,45, 567"

    > d.scan(/\d+/)

    => ["9", "45", "567"]
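
Combining scan with the lookahead from earlier in the thread pulls out only the numbers that are followed by "th". This is a sketch; the method names and the sample sentence are made up for illustration.

```ruby
# With no capture groups, scan returns an array of the matched strings.
def all_places(text)
  text.scan(/\d+(?=th\b)/)
end

# With a capture group, scan returns an array of arrays, hence the flatten.
def all_places_grouped(text)
  text.scan(/(\d+)th\b/).flatten
end

sentence = "She was 4th, he was 20th and I was 25th out of 30."
p all_places(sentence)          # ["4", "20", "25"]
p all_places_grouped(sentence)  # ["4", "20", "25"]
```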

    --
    Posted via http://www.ruby-forum.com/.
     
    Nit Khair, Sep 27, 2008
    #7