Regex ^ beginning not strong?

Iain Barnett

Hi,

I've some more regex questions. I wrote a pattern to check for valid
regexes and inspect the parts (we all have our reasons for the things we
do :) It wasn't working so I went down to simpler and simpler patterns,
but I'm a bit surprised at the way Ruby 1.9 is handling the regexes. I
tested the same pattern in Perl and it came out with the answers I'd
expect.

Is this down to me using Perl regexes for so long, or is there something
I'm missing about Ruby's implementation? It appears ^ at the beginning
of a string doesn't bind as strongly as I'd expect.
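
A side note for anyone arriving from Perl, offered as a sketch rather than an explanation of the results in this post: in Ruby, ^ matches at the start of every line, and \A is the anchor that means start of string. The test strings here contain no newline, so this difference on its own would not change any of the outputs shown, but it is worth ruling out. The sample text below is made up for illustration.

```ruby
# Hypothetical two-line input: the slash sits at the start of the
# second line, not at the start of the string.
text = "first line\n/pattern/"

line_start   = text =~ %r{^/}    # ^ anchors at the start of any line
string_start = text =~ %r{\A/}   # \A anchors only at the start of the string

p line_start     # → 11
p string_start   # → nil
```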


I believe this test should fail as <delim> should be bound to the
beginning of the string by the ^, and the match result is a little bit
crazy - shouldn't the main capture be "d\\d" if it's following the
logical route it's chosen?

$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d! )
puts md.inspect
'
#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">


Here I add on a trailing slash to the string, and (I believe) it should
bring me back what's between the / / :

$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

Here's the first string in Perl 5.12:

$ perl -e '
if ( q(/\d\d\\d) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'
<nothing here, what I'd expect>

And here it is with the "valid" string:

$ perl -e '
if ( q(/\d\d\\d/) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'
pat => \d\d\d
delim => /

These are the answers I'd expect.


Even this seems unexpected to me: if I remove <mors>, then surely ^
should bind <delim> to the beginning?

$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d" delim:"d" pat:"\\">


These work as I'd expect by using the end of line $ :

$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" delim:"/" pat:"\\d\\d\\d">

$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">

And finally, if I remove the caret but leave the $ I get the answer I'd
expect (or that I'm looking for):

$ ruby -e '
md = /(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">
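
One way to probe whether the trailing $ really makes the delimiter check work is to feed the same pattern a string whose last character differs from its first. The sketch below does that; the input "/abc#" is made up for illustration and is not one of the strings tested above. If \g<delim> were comparing against the captured "/", this should not match.

```ruby
# Same shape as the $ variant above, without the optional <mors> group.
re = /^(?<delim>.)(?<pat>.+?)\g<delim>$/

# Hypothetical input: opens with "/" but closes with "#".
md = re.match("/abc#")

p md && md[0]      # → "/abc#"
p md && md[:pat]   # → "abc"
```

Since this mismatched string is accepted, the $ variants seem to succeed because the lazy .+? is forced to stretch up to the second-to-last character, not because the closing delimiter is being compared with the opening one.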


Regards,
Iain
 
Robert Klemme

> I believe this test should fail as <delim> should be bound to the
> beginning of the string by the ^, and the match result is a little
> bit crazy - shouldn't the main capture be "d\\d" if it's following
> the logical route it's chosen?
>
> $ ruby -e '
> md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d! )
> puts md.inspect
> '
> #<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

I think you found a bug - probably related to back references to
named capturing groups:

irb(main):013:0> s = %q!/\d\d\\d!
=> "/\\d\\d\\d"

irb(main):027:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
irb(main):028:0> md = r.match s
=> #<MatchData "/\\" mors:nil delim:"/" pat:"\\">

This should not match at all:

irb(main):029:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
irb(main):030:0> md = r.match s
=> #<MatchData "/\\d" mors:nil delim:"d" pat:"\\">
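
A possible reading of this result, offered as a sketch and not as a verdict on the engine: in Ruby's regex syntax \g<name> is a subexpression call, which re-runs the named group's pattern (here a bare .) at the current position, while \k<name> is the named backreference that matches the text the group captured. Perl's \g{name} is a backreference, so the two spellings would not be equivalent. Under that reading the call simply consumes the next character, which would account for the match "/\\d" above. The variable names below are made up for illustration.

```ruby
s = %q!/\d\d\\d!   # the string /\d\d\d with no closing slash

# \g<delim> calls the group's pattern (.) again; it does not compare text.
call    = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
# \k<delim> is a backreference: it must match what <delim> captured.
backref = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\k<delim>/

p call.match(s)            # matches "/\\d", as in the session above
p backref.match(s)         # → nil, no closing delimiter
p backref.match(s + "/")   # matches once the closing "/" is present
```

With the closing slash added, the \k form captures "/" as delim and \d\d\d as pat, which is the result the Perl pattern gives.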

It seems to work better with numbered capturing groups

Normal greediness:

irb(main):035:0> r = /^(m)?(.)(.+)\2/
=> /^(m)?(.)(.+)\2/
irb(main):036:0> md = r.match s
=> nil

This works:

irb(main):038:0> /^(m)?(.)(.+)\2/.match 'abbba'
=> #<MatchData "abbba" 1:nil 2:"a" 3:"bbb">

Maybe the numbering gets out of order if we try to mix:

irb(main):039:0> /^(?<delim>m)?(.)(.+)\2/.match 'abbba'
SyntaxError: (irb):39: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\2/
from /usr/local/bin/irb19:12:in `<main>'
irb(main):040:0> /^(?<delim>m)?(.)(.+)\k<2>/.match 'abbba'
SyntaxError: (irb):40: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\k<2>/
from /usr/local/bin/irb19:12:in `<main>'
irb(main):041:0>
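
The error message's hint (use name) points at \k<name>: once any group in a pattern is named, backreferences have to go through names. Below is a minimal sketch of the 'abbba' test with every group named, reusing the group names from earlier in the thread; the pattern is an adaptation for illustration, not one of the patterns tested above. Regexp.new is used for the mixed pattern so the failure can be rescued at run time instead of stopping the script at parse time.

```ruby
# All groups named, backreference written as \k<delim>.
named = /^(?<mors>m)?(?<delim>.)(?<pat>.+)\k<delim>/
md = named.match('abbba')

p md[:delim]   # → "a"
p md[:pat]     # → "bbb"

# Mixing a named group with a numbered backreference, built at run time.
begin
  Regexp.new('^(?<delim>m)?(.)(.+)\2')
rescue RegexpError => e
  puts "rejected: #{e.message}"
end
```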

irb(main):047:0> RUBY_VERSION
=> "1.9.1"
irb(main):048:0> RUBY_PATCHLEVEL
=> 376

Frankly, I have never used named capturing groups so far (simply out of
habit and for compatibility). That was probably a good choice.

Kind regards

robert
 
Iain Barnett

> I think you found a bug - probably related to referring to back
> references to named capturing groups:
>
> Frankly, I never used named capturing groups yet (simply for habit and
> compatibility). It was probably a good choice so far.
>
> Kind regards
>
> robert

Thanks for checking that. While searching for more information on the
Oniguruma engine I noticed that there was a CPAN library for running it
under Perl, so I installed it and ran the same regexes against the Perl
engine, and it had the same results as Ruby. This indicates that it's a
problem with the engine and not something Ruby is doing along the way,
so I'll file a report with the Oniguruma team and include all your tests
too and see what happens.

With Oniguruma:

$ perl -Mre::engine::Oniguruma -e '
if ( q(/\d\d\\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'
<nothing here>


Usual Perl engine:

$ perl -e '
if ( q(/\d\d\\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
  while ( my ($key, $value) = each(%+) ) {
    print "$key => $value\n";
  }
}
'
pat => \d\d\d
delim => /

Regards,
Iain
 
