Regexp equality

J

Jamis Buck

Here's an oddity I recently came across in a unit test. Identical
regexps declared with /.../ will be equivilent to each other, but not to
the same regexps created with %r{...}:

$ ruby -ve 'p( /\/blah/ == /\/blah/ )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

$ ruby -ve 'p( %r{/blah} == %r{/blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

$ ruby -ve 'p( /\/blah/ == %r{/blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
false

This appears to only be the case when there is a forward slash in the
regexp. In other words:

$ ruby -ve 'p( /blah/ == %r{blah} )'
ruby 1.8.2 (2004-07-29) [i686-linux]
true

Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

- Jamis
 
Y

Yukihiro Matsumoto

Hi,

In message "Re: Regexp equality"

|Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

\/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

matz.
 
J

Jamis Buck

Yukihiro said:
Hi,

In message "Re: Regexp equality"

|Can anyone else confirm this? Can anyone *explain* it? Or is this a bug?

Regexp#== compares literal appearance of regex, e.g.

\/blah vs /blah

I consider this as a feature. I'm open for the discussion (as always).

Is there a way to test regexen for equality based on the strings they
would match? Perhaps that's too general, since it depends on the strings
they would be given, but I guess I expected /\/blah/ to be equal to
%r{/blah}.

I suppose /\/blah/.to_s would equal %r{/blah}.to_s? Would that be a
better way to compare regexen?

- Jamis
 
J

James Edward Gray II

Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that
way. Maybe some internal representation, but I would but super
impressed to see that.

Not at all saying I don't like the idea, just that it's a tall order.

James Edward Gray II
 
L

Lloyd Zusman

James Edward Gray II said:
Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.
 
J

Jamis Buck

Lloyd said:
Is there a way to test regexen for equality based on the strings they
would match?

Wow. I bet that's a pretty tall order. Pulling from the current Ruby
Quiz, here are several ways to write a regex to match 1..12:

1|2|3|4|5|6|7|8|9|10|11|12

\d|1[012]

1?[12]|[03-9]

You're certainly not going to be able to compare like patterns that way.
Maybe some internal representation, but I would but super impressed to
see that.

Not at all saying I don't like the idea, just that it's a tall order.


Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

If I'm not mistaken, I believe that this can be shown to be a reduction
of the Halting Problem.

In other words, short of feeding every possible character string to both
regex's and comparing the two sets of matches, I don't think that there
is an algorithmic way to measure equivalence of two regex's which don't
have the same external representation.

I realize that particular question is most likely impossible to answer
for the general case. What I was wanting, though, was to be able to know
whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same
result, so the strings, at least, are comparable with expected results.

- Jamis
 
L

Lloyd Zusman

Jamis Buck said:
[ ... ]

I realize that particular question is most likely impossible to answer
for the general case. What I was wanting, though, was to be able to know
whether /\/blah/ is equal to %r{/blah}. Surely that's not too hard?

As I said, though, converting them both to strings gives the same
result, so the strings, at least, are comparable with expected results.

OK. Now I better understand your question. I agree with you that
something seems kind of counter-intuitive here:

irb(main):001:0> x = /\/blah/
=> /\/blah/
irb(main):002:0> y = %r{/blah}
=> /\/blah/
irb(main):003:0> z = %r{\/blah}
=> /\/blah/
irb(main):004:0> x == y
=> false
irb(main):005:0> x == z
=> true
irb(main):006:0> x.source
=> "\\/blah"
irb(main):007:0> y.source
=> "/blah"
irb(main):008:0> z.source
=> "\\/blah"

In line 006, why is the source of /\/blah/ returned as "\\/blah"?

It seems to me that it's incorrect to return the leading backslash
doubled. Isn't the backslash in the initial definition of x just an
escape character for enabling the second forward slash to be treated as
part of the regexp? Why is that initial escape character treated as a
bona fide part of the regexp source?

Also, consider this:

irb(main):001:0> x = /\/blah/
=> /\/blah/
irb(main):002:0> y = %r{/blah}
=> /\/blah/
irb(main):003:0> z = %r{\/blah}
=> /\/blah/
irb(main):004:0> x.to_s
=> "(?-mix:\\/blah)"
irb(main):005:0> y.to_s
=> "(?-mix:\\/blah)"
irb(main):006:0> z.to_s
=> "(?-mix:\\/blah)"
irb(main):007:0> x.to_s == y.to_s
=> true
irb(main):008:0> x.to_s == z.to_s
=> true

In other words, the string representation of the 3 regexp's (as opposed
to their _source_ representations) are identical.

So this brings up two questions:

1. Why is the backslash in a statement like this /\/blah/ _not_ just
treated as an escape character when determining the source?

2. Why does the `==' operator compare the source and not the internal
representations? It seems to me that the internal representation is
a much more meaningful piece of information for use in a comparison.
 
E

Eivind Eklund

Not only is it a tall order to determine whether any two regex's will
match exactly the same set of strings, but in the general case, I'm
pretty sure that this question is undecidable.

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

Eivind.
 
E

Eivind Eklund

As far as I can tell: If you run Antimirov's algorithm for NFA
construction, you can compare each extracted partial derivative and
see if it is equal. If all extracted partial derivatives are equal,
your regular expressions are identical. If any differ - they're
different.

I should probably include some suitable references so somebody can run
off an implement this:

Subtyping with Antimirov's algorithm:
http://www.inf.uni-konstanz.de/dbis/teaching/current/pathfinder/download/hohenade-ausarbeitung.pdf

Partial derivatives of regular expressions and finite automata
http://citeseer.ist.psu.edu/antimirov95partial.html

Eivind.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
474,266
Messages
2,571,072
Members
48,772
Latest member
Backspace Studios

Latest Threads

Top