Interpreting "(.*?)" and "(?:\d+ [.]?)" in REs

RichardOnRails

Hi All,

I want to extract numbers from records with a leading string of period-
separated numbers. I got great responses to this on the thread
http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/a811f41d733125f3#,
including the program below (stripped of all error handling).

My question is the meaning of a couple of constructs in the regular
expressions (and where I can find on-line documentation for them, if
possible):

1. "(.*?)", or specifically, the "?" in that expression.

2. "(?:\d+ [.]?)", or the two question marks in this case.

Thanks in advance,
Richard

Program
=======
input = <<DATA
2.002.1Topic 2.2.1
2.1Topic 2.1
2.2.02Topic 2.2.2
DATA

input.each_line do |line|  # String#each is gone since Ruby 1.9; use each_line
  puts "\n" + "="*10 + "DBG", line, "="*10 + "DBG\n"
  if line =~ /^ (.*?) [a-zA-Z] /x # Question 1
    prefix = $1
    if prefix =~ /^ (?:\d+ [.]?)+ $ /x # Question 2
      arr = prefix.split('.')
      print " Numbers: ", arr.join(', '), "\n"
    end
  end
end # input
 
Peña, Botp

From: RichardOnRails
# 1. "(.*?)", or specifically, the "?" in that expression.
# 2. "(?:\d+ [.]?)", or the two question marks in this case.

consider regex as another language on its own. basically, it describes
string patterns, like a metastring, a string about a string... :)

besides the mastering book, the online free perl doc is very informative
(and you can download the pdf too; in fact, i'm even tempted to copy it
and convert the samples to ruby. is that illegal? :)

start here:
http://perldoc.perl.org/perlrequick.html

then here:
http://perldoc.perl.org/perlretut.html
http://perldoc.perl.org/perlre.html



# input.each do |line|

btw, in ruby, if you put the data after an __END__ line at the bottom of
the script (instead of in a heredoc), you can do

DATA.each do ...

and you can even do DATA.rewind :)

kind regards -botp
 
Phrogz

1. "(.*?)", or specifically, the "?" in that expression.

The ? in this case makes the match non-greedy. For example:

irb(main):007:0> s = "aaaaaaae"
=> "aaaaaaae"
irb(main):008:0> s[ /a+[aeiou]/ ]
=> "aaaaaaae"
irb(main):009:0> s[ /a+?[aeiou]/ ]
=> "aa"

By default, the ?, *, +, and {n,m} quantifiers are all greedy,
attempting to match the longest substring possible while still
allowing the regular expression to succeed. As seen above, /a+/ keeps
matching a's until it cannot find any more, and then goes on to try to
match the rest of the pattern.
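
To tie this back to Question 1, here is a small sketch that runs the
first pattern from the original program in both its non-greedy and its
greedy form. The method name prefix_before_letter is made up for this
illustration, and the pattern is written without the /x spacing.

```ruby
# Hypothetical helper (not from the original program): returns whatever
# the first group captures before a letter, or nil if nothing matches.
def prefix_before_letter(line, lazy: true)
  pattern = lazy ? /^(.*?)[a-zA-Z]/ : /^(.*)[a-zA-Z]/
  m = pattern.match(line)
  m && m[1]
end

line = "2.002.1Topic 2.2.1"
puts prefix_before_letter(line)               # prints 2.002.1
puts prefix_before_letter(line, lazy: false)  # prints 2.002.1Topi
```

The non-greedy version stops at the first letter it can find ("T"),
while the greedy version runs to the end of the line and backs up only
as far as the last letter (the "c" in "Topic").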

Adding a ? after one of those quantifiers makes it non-greedy. For
example:

a?? - match zero or one 'a' characters (prefer to match zero)
a*? - match zero or more 'a' characters (prefer as few as possible)
a+? - match one or more 'a' characters (prefer as few as possible)
a{3,}? - match at least 3 'a' characters (prefer as few as possible)

As seen in the irb example above, /a+?/ matched a single 'a', and then
checked to see if it could find a vowel afterwards. The next 'a' is
itself a vowel, which is why the result was "aa".
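
The greedy and non-greedy forms can be compared side by side with a
quick sketch like the one below. The helper name quantifier_demo is
invented for this example; it just applies each pattern on its own to
the same string.

```ruby
# Hypothetical helper: maps each quantifier (as text) to the substring
# it matches at the start of str when used on its own.
def quantifier_demo(str)
  {
    "a?"     => str[/a?/],
    "a??"    => str[/a??/],
    "a*"     => str[/a*/],
    "a*?"    => str[/a*?/],
    "a+"     => str[/a+/],
    "a+?"    => str[/a+?/],
    "a{3,}"  => str[/a{3,}/],
    "a{3,}?" => str[/a{3,}?/]
  }
end

quantifier_demo("aaaaaaae").each do |pattern, matched|
  puts "#{pattern.ljust(7)} => #{matched.inspect}"
end
```

With nothing after the quantifier to force it further, each non-greedy
form settles for its minimum: "" for a?? and a*?, "a" for a+?, and
"aaa" for a{3,}?. The greedy forms take "a" for a? and all seven a's
for the rest.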


You'll often see this non-greedy matching used in simple non-nested
pairing, like with HTML tags.
%r{<p>(.*?)</p>}
will match "<p>", followed by the fewest number of characters until it
sees "</p>".
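
A short sketch of the difference on a string with two paragraphs. The
method name paragraphs and the sample HTML are assumptions made up for
this example.

```ruby
# Hypothetical helper: collects the text captured between <p> and </p>,
# using either the non-greedy (.*?) or the greedy (.*) group.
def paragraphs(html, lazy: true)
  pattern = lazy ? %r{<p>(.*?)</p>} : %r{<p>(.*)</p>}
  html.scan(pattern).flatten
end

html = "<p>one</p><p>two</p>"
p paragraphs(html)               # prints ["one", "two"]
p paragraphs(html, lazy: false)  # prints ["one</p><p>two"]
```

The non-greedy group stops at the first "</p>" each time, so scan finds
two separate paragraphs. The greedy group runs from the first "<p>" to
the last "</p>" and returns everything in between as one match.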

Without the non-greedy quantifier, the .* could skim right over other
"</p>" tags and match all the way to the last "</p>" in the string.

2. "(?:\d+ [.]?)", or the two question marks in this case.

The first one is part of the (?:...) construct. While the parentheses
in /(xxx)/ will save the match group for later matching or
substitution, putting a ?: pair at the front tells the regexp to not
bother saving the contents as a numbered group. For example:
/(?:foo|fu)?bar/
will match "foobar", "fubar", or "bar", without saving "foo", "fu", or
"" as a group.
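
The effect on the saved groups can be checked with MatchData#captures.
This is only a sketch; the helper name captures_for is made up for the
illustration.

```ruby
# Hypothetical helper: returns the array of captured groups for a match,
# or nil when the pattern does not match at all.
def captures_for(pattern, str)
  m = pattern.match(str)
  m ? m.captures : nil
end

p captures_for(/(foo|fu)?bar/, "foobar")    # prints ["foo"]
p captures_for(/(?:foo|fu)?bar/, "foobar")  # prints []
p captures_for(/(foo|fu)?bar/, "bar")       # prints [nil]
p captures_for(/(?:foo|fu)?bar/, "bar")     # prints []
```

Both patterns match the same strings; the only difference is that the
(?:...) version leaves no numbered group behind, so $1 would be nil
after it matches.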

The second question mark follows a character set [...], which itself
matches a single character from the options inside the set. The
question mark in this case (and in my "fubar" example above) means
"match zero or one of the preceding characters/group expressions".
Since the character set has a single period inside it, this:
[.]?
means "And there may or may not be a period here."

This is identical to the regexp:
\.?
where the backslash escapes the traditional meaning of a period (match
any character [except possibly a newline]), and instead causes it to
mean a literal period.
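
As a rough check of that equivalence on the Question 2 pattern, the
sketch below compares the original spelling with a compact one that
uses \.? instead. The spaces in the original are ignored because of the
/x flag, so they do not match spaces in the input. The constant and
method names here are invented for the example.

```ruby
# The Question 2 pattern as posted (with /x, so pattern whitespace is ignored),
# and an assumed-equivalent compact spelling using an escaped period.
BRACKET_FORM = /^ (?:\d+ [.]?)+ $ /x
ESCAPED_FORM = /^(?:\d+\.?)+$/

# Hypothetical helper: true if str matches the given pattern.
def numeric_prefix?(str, pattern = BRACKET_FORM)
  !!(str =~ pattern)
end

["2.002.1", "2.1", "2..1", "2 .1", "abc"].each do |s|
  puts "#{s.inspect}: #{numeric_prefix?(s)} / #{numeric_prefix?(s, ESCAPED_FORM)}"
end

p "2.002.1".split('.')  # prints ["2", "002", "1"]
```

Both forms accept "2.002.1" and "2.1" and reject the other three
samples: "2..1" has a period with no digits before it, "2 .1" contains
a real space, and "abc" has no digits.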
 
