Regular expression matches last occurrence instead of first

andyo

I've found an anomaly in the way Ruby handles non-greedy regular
expressions and wonder whether it's been discussed before. A search of
the documentation and a general Internet search didn't turn up
information on this issue.

When I want to match the first quoted string in a string such as:

"aaaaa""bbb""ccc"

I match the last quoted string instead. The exact characters don't
matter.

Here's the sample code; note that (.*?) and ([^"]+) behave the same
way--and not the way I'd expect:

str = '"aaaaa""bbb""ccc"'

str.scan(/"(.*?)"/)
puts $1
# ccc

str.scan(/"([^"]+)"/)
puts $1
# ccc

str.scan(/"(.*?)"(.*)/)
puts $1
# aaaaa

Adding an extra (.*) to the end produces the result I want, but I
don't believe it should make any difference.

Here is the equivalent Perl, which works as expected:

$str = q{"aaaaa""bbb""ccc"};
$str =~ /"(.*?)"/;
print $1 , "\n";
# aaaaa

$str =~ /"([^"]+)"/;
print $1 , "\n";
# aaaaa

$str =~ /"(.*?)"(.*)/;
print $1 , "\n";
# aaaaa

And the equivalent PHP:

<?php

$str = '"aaaaa""bbb""ccc"';
preg_match('/"(.*?)"/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

preg_match('/"([^"]+)"/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

preg_match('/"(.*?)"(.*)/', $str, $matches);
echo $matches[1] , "\n";
// aaaaa

?>

Andy Oram
 
Vincent Fourmond

andyo said:
Here's the sample code; note that (.*?) and ([^"]+) behave the same
way--and not the way I'd expect:

str = '"aaaaa""bbb""ccc"'

str.scan(/"(.*?)"/)
puts $1
# ccc

Normal... #scan is not what you're looking for:

------------------------------------------------------------ String#scan
str.scan(pattern) => array
str.scan(pattern) {|match, ...| block } => str
------------------------------------------------------------------------
Both forms iterate through str, matching the pattern (which may be
a Regexp or a String). For each match, a result is generated and
either added to the result array or passed to the block. [...]

scan finds all successive matches for the pattern, and sets the
capture group variables every time it finds one. So, here, you simply
get the $1 for the last match, i.e. "ccc".
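
To make that concrete, here is a small sketch of what scan actually hands back for this string. The helper name quoted_strings is made up for the example and is not part of Ruby; the point is only that every quoted string is matched in turn, and the first one is still there if you look at the return value rather than at $1.

```ruby
# quoted_strings is a hypothetical helper name, not a Ruby built-in.
def quoted_strings(str)
  # With one capture group, scan returns an array of one-element arrays,
  # one per match; flatten turns that into a flat list of captures.
  str.scan(/"(.*?)"/).flatten
end

str = '"aaaaa""bbb""ccc"'
all = quoted_strings(str)

p all        # => ["aaaaa", "bbb", "ccc"]
p all.first  # => "aaaaa"
p all.last   # => "ccc"
```

So if you really do want scan, take the first element of its result instead of reading $1 afterwards.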

What you're looking for is simply =~, as in Perl:

irb(main):001:0> str = '"aaaaa""bbb""ccc"'
=> "\"aaaaa\"\"bbb\"\"ccc\""
irb(main):002:0> str =~ /"(.*?)"/
=> 0
irb(main):003:0> $1
=> "aaaaa"
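
If you would rather not depend on the $1 global at all, the same single match can be done with String#match or String#[]. This is only a sketch; first_quoted is a hypothetical helper name chosen for the example.

```ruby
# first_quoted is a hypothetical helper name, not a Ruby built-in.
def first_quoted(str)
  # match stops at the leftmost match and returns a MatchData, or nil
  # when the pattern does not match at all.
  m = str.match(/"(.*?)"/)
  m && m[1]
end

str = '"aaaaa""bbb""ccc"'

p first_quoted(str)               # => "aaaaa"
p first_quoted('no quotes here')  # => nil

# String#[] with a regexp and a group index returns that capture directly.
p(str[/"(.*?)"/, 1])              # => "aaaaa"
```

Both forms look for one match only, which is the behaviour the Perl and PHP versions have.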

Cheers,

Vincent
 
