Short question on regex in Ruby

Chris Ro · Sep 26, 2008

Hi,

I have a little problem with a regex in Ruby:

I have two strings:

string1 = "He is the 20th."
string2 = "25th"

I wrote this to "extract" the place (20 or 25 respectively):

place1 = string1.gsub(/.*(\d+)th.*/,'\1')
place2 = string2.gsub(/.*(\d+)th.*/,'\1')
pp place1
pp place2

=> "0"
=> "5"

Of course, I would like to get all the digits before "th". Why is only
the last one captured?

If anyone could please explain this, and help me come up with a regex
that captures 20 and 25, respectively, this would be greatly
appreciated.

Cheers, Chris

Mark Thomas · Sep 26, 2008

Chris said:
Of course, I would like to get all the digits before "th". Why is only
the last one captured?

Because the .* is greedy and will get all it can, which is all but the
last digit.
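
To make the backtracking visible, here is a minimal sketch that wraps the leading .* in a capture group of its own. The extra group and the helper name greedy_split are additions for illustration only; they are not part of the original pattern.

```ruby
# Hypothetical helper: shows how the original pattern divides the string
# between the greedy .* and the \d+ group.
def greedy_split(string)
  m = string.match(/(.*)(\d+)th.*/)
  # m[1] is what the greedy .* kept, m[2] is what was left for \d+
  m && [m[1], m[2]]
end

p greedy_split("He is the 20th.")  # ["He is the 2", "0"]
p greedy_split("25th")             # ["2", "5"]
```

The .* first swallows the whole string, then gives back one character at a time until \d+ followed by "th" can match, which happens as soon as a single digit is free.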

Chris said:
If anyone could please explain this, and help me come up with a regex
that captures 20 and 25, respectively, this would be greatly
appreciated.

place = string[/\d+(?=th)/]
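
A quick check of the lookahead pattern above against both strings from the question. The helper name place_before_th is made up for this sketch.

```ruby
# Hypothetical helper: returns the digits directly before "th", or nil.
def place_before_th(string)
  # (?=th) is a lookahead: "th" must follow, but it is not part of the match,
  # so String#[] returns only the digits.
  string[/\d+(?=th)/]
end

p place_before_th("He is the 20th.")  # "20"
p place_before_th("25th")             # "25"
p place_before_th("no ordinal here")  # nil
```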

-- Mark.

Thomas B. · Sep 26, 2008

Chris said:
place1 = string1.gsub(/.*(\d+)th.*/,'\1')

Hello. I think gsub is not the best tool here. It's simpler to find the
matching part with match and take the captured group directly, like
this:
place1 = string1.match(/(\d+)th\b/)[1]
The \b ensures that the character after 'th' is not a word character
(\b is a word boundary), and the [1] at the end extracts the first
capture group. This also lets you drop the .* at both ends, which is a
bit ugly.
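
A small sketch of the match-based approach, with one addition that is not in the line above: a guard for strings that do not match, since match returns nil in that case. The helper name place_via_match is hypothetical.

```ruby
# Hypothetical helper around the match-based approach.
def place_via_match(string)
  m = string.match(/(\d+)th\b/)
  # match returns nil when nothing matches, so guard before indexing
  m && m[1]
end

p place_via_match("He is the 20th.")  # "20"
p place_via_match("25th")             # "25"
# \b rejects "th" when a word character follows it:
p place_via_match("20thing")          # nil
```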

Apart from that, a useful piece of knowledge about regexps:
/.*?(\d+)th.*/ will match what you want, because the first .*? will be
reluctant to eat up more characters, so it will pass to \d+ as many
digits as it can.
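
For comparison, a minimal sketch running the greedy and the reluctant pattern through the same gsub call as in the question. The constant names are made up for this example.

```ruby
GREEDY    = /.*(\d+)th.*/   # the pattern from the question
RELUCTANT = /.*?(\d+)th.*/  # same pattern with a reluctant leading .*?

p "He is the 20th.".gsub(GREEDY, '\1')     # "0"
p "He is the 20th.".gsub(RELUCTANT, '\1')  # "20"
p "25th".gsub(RELUCTANT, '\1')             # "25"

# Note: when nothing matches, gsub returns the string unchanged.
p "no ordinal".gsub(RELUCTANT, '\1')       # "no ordinal"
```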

TPR.

Robert Klemme · Sep 26, 2008

2008/9/26 Thomas B. said:
Chris said:

place1 = string1.gsub(/.*(\d+)th.*/,'\1')

Hello. I think your approach with using gsub is not the best possible
here.

Agreed.

It's better to simply find the matching part using match and
substitute it for the whole string, like this:
place1 = string1.match(/(\d+)th\b/)[1]

For extraction there is a simpler solution

irb(main):002:0> "He is the 20th."[/(\d+)th\b/, 1]
=> "20"
irb(main):003:0> "25th"[/(\d+)th\b/, 1]
=> "25"
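
One practical difference between the two forms, sketched under the assumption that some inputs may contain no ordinal at all: String#[] returns nil when nothing matches, while match(...)[1] raises because match itself returns nil. The constant name PLACE is made up for this example.

```ruby
PLACE = /(\d+)th\b/
text  = "no ordinal here"

# String#[] with a capture index simply returns nil on no match.
p text[PLACE, 1]  # nil

# match returns nil here, so indexing the result raises NoMethodError.
begin
  text.match(PLACE)[1]
rescue NoMethodError => e
  puts "match(...)[1] raised #{e.class}"
end
```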

The \b ensures that the next character after 'th' is not a word
character (\b is word boundary), and [1] at the end is extracting the
first bracketed group. It also makes it possible to skip the .* at both
ends, which is a bit ugly.

Right.

Apart from that, a useful piece of knowledge about regexps:
/.*?(\d+)th.*/ will match what you want, because the first .*? will be
reluctant to eat up more characters, so it will pass to \d+ as many
digits as it can.

But reluctant is slow (see my benchmark from a few days ago).

Cheers

robert

Thomas B. · Sep 26, 2008

Robert said:
It's better to simply find the matching part using match and
substitute it for the whole string, like this:
place1 = string1.match(/(\d+)th\b/)[1]

For extraction there is a simpler solution

irb(main):002:0> "He is the 20th."[/(\d+)th\b/, 1]
=> "20"
irb(main):003:0> "25th"[/(\d+)th\b/, 1]
=> "25"

Yes, I forgot about this one. +1

But reluctant is slow (see my benchmark from a few days ago).

OK. I guess reluctant matching is slow especially when the string it has
to cover is long. And I agree that it's not a very good idea to use
reluctant regexps in time-critical applications, and the first solution
is much better here. I mentioned them just to let the original poster
gain some knowledge. I use reluctant patterns when not in a hurry,
because they make things much easier sometimes.
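
For anyone who wants to measure this themselves, here is a rough timing sketch using Benchmark from the standard library. It is not Robert's benchmark; the input string, the run count and the helper name time_extraction are assumptions chosen for illustration, and the numbers will vary by Ruby version and machine.

```ruby
require 'benchmark'

# Hypothetical test input: a long prefix in front of the ordinal.
LONG_INPUT = ("x" * 10_000) + " 20th."

# Hypothetical helper: seconds taken to run the extraction `runs` times.
def time_extraction(pattern, input, runs = 200)
  Benchmark.realtime do
    runs.times { input[pattern, 1] }
  end
end

reluctant_time = time_extraction(/.*?(\d+)th/, LONG_INPUT)
direct_time    = time_extraction(/(\d+)th\b/, LONG_INPUT)

printf("reluctant: %.4fs\n", reluctant_time)
printf("direct:    %.4fs\n", direct_time)
```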

TPR.

Patrick He · Sep 26, 2008

IMO, lookahead is the best solution for the problem.

Mark said:
place = string[/\d+(?=th)/]

-- Mark.

Nit Khair · Sep 27, 2008

If you need to get multiple numbers out you could try scan().

d="9,45, 567"

=> "9,45, 567"

d.scan(/\d+/)

=> ["9", "45", "567"]
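
scan also combines with the patterns from earlier in the thread. A minimal sketch, assuming you want every number that is followed by "th"; the helper name ordinals_in and the sample string are made up for this example.

```ruby
# Hypothetical helper: all digit runs directly followed by "th".
def ordinals_in(text)
  text.scan(/\d+(?=th)/)
end

p ordinals_in("1st, 20th, 25th")  # ["20", "25"]

# With a capture group, scan returns one array per match instead:
p "20th and 25th".scan(/(\d+)th\b/)          # [["20"], ["25"]]
p "20th and 25th".scan(/(\d+)th\b/).flatten  # ["20", "25"]
```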
