Comparing Regular Expressions

R

Robert Inder

I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

So, for instance, suppose I believe that if an object has a
descripiton string matching regexp R1, it is of type A.
Someone now tells me that an object with a
descripiton string matching regexp R2 is of type B. What can I tell
about (the sets of things of type) A and B by looking at R1 and R2?

So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.

Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.

But how do I formalise (i.e. program) this comparison process?

This strikes me as a very general problem/issue and there
just has to be some work on formalising and solving it. But I don't
know how to find it: Google searches for things like "regular
expression comparison" are swamped by algorithms for matching regexps
against strings and so forth, which isn't what I want.

Can anyone point me in the right direction?

Robert.
 
A

Anno Siegel

Robert Inder said:
I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

[snip]

That will be utterly non-trivial.

Consider the partial problem of finding a string that a given regex
matches. Since there are patterns that don't match anything (/$.^/),
there can't be a general solution. At the very least you'll need a
complete regex parser before you can tackle this problem.

Anno
 
B

Brian McCauley

Robert said:
I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

My instinct says that will be "hard". Not just "hard" in the common
sense, but hard in the CompSci sense of being equivalent to the halting
problem (i.e. insoluble in the general case in finite time).

Note I'm reading this in clp.misc. I suspect the nettizens of
comp.theory will be much more clued-up on this.
 
J

Jean-Marc Bourguet

Robert Inder said:
I'm interested in comparing the "coverage" of regular expressions.

For which definition or regular expressions? I see this is crossposted
to comp.lang.perl.misc and comp.theory and the default meaning of
regular expression is quite different in the two groups.
In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match both,
or indeed that will match one and not the other.

For the comp.theory adapted definition, it is possible to construct
FSM accepting the language described by a regular expressions.

Starting from the FSMs and using a variant of the algorithm used to
minimize them it is easy to construct an FSM which accept the union of
the two languages. As there is a know relation between the states in
the new FSM and the states in the old, it is easy to classify the
accepting states of the new FSM as having correspondance in only one
or both of the original FSM.

Yours,
 
T

Torben Ægidius Mogensen

Robert Inder said:
I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

Regular languages are closed under intersection and set difference, so
you can not only find strings (if any exist) in the intersection and
difference sets, you can construct regular expressions or DFA's that
describe all such strings.
So, for instance, suppose I believe that if an object has a
descripiton string matching regexp R1, it is of type A.
Someone now tells me that an object with a
descripiton string matching regexp R2 is of type B. What can I tell
about (the sets of things of type) A and B by looking at R1 and R2?

So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.

Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.

Say what? The set containing only the string "foobar" is not a subset
of the set containing only "foo". Or do you consider a string to
match a regular expression if a prefix of the string does?
But how do I formalise (i.e. program) this comparison process?

This strikes me as a very general problem/issue and there
just has to be some work on formalising and solving it. But I don't
know how to find it: Google searches for things like "regular
expression comparison" are swamped by algorithms for matching regexps
against strings and so forth, which isn't what I want.

If you have a DFA D1 with states s_0 ... s_m for A and a DFA D2 with
states t_0 ... t_n for B, you can construct a DFA D3 for the
intersection of A and B in the following way:

1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.

2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.

3. If s_i is accepting in D1 and t_k is accepting in D2, (s_i,t_k) is
accepting in D3.

For A\B (A minus B), the construction is:

0. Add a non-accepting state t_(n+1) to D2. If there is a state t_k
in D2 and a symbol c such that t_k has no transition on c, add a
transition from t_k to t_(n+1) on c. There is a transition from
t_(n+1) to t_(n+1) on all symbols. This way, D2 has transitions
on all symbols from all states.

1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.

2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.

3. If s_i is accepting in D1 and t_k is not accepting in D2,
(s_i,t_k) is accepting in D3.


Torben
 
S

Steven N. Hirsch

Robert said:
I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

So, for instance, suppose I believe that if an object has a
descripiton string matching regexp R1, it is of type A.
Someone now tells me that an object with a
descripiton string matching regexp R2 is of type B. What can I tell
about (the sets of things of type) A and B by looking at R1 and R2?

So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.

Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.

But how do I formalise (i.e. program) this comparison process?

This strikes me as a very general problem/issue and there
just has to be some work on formalising and solving it. But I don't
know how to find it: Google searches for things like "regular
expression comparison" are swamped by algorithms for matching regexps
against strings and so forth, which isn't what I want.

Can anyone point me in the right direction?

There's a wonderful treatment of the subject in "Higher Order Perl",
written by Mark Jason Dominus. This is easily one of the most
thought-provoking books on Perl to date (IMHO) and a highly-recommended
read.
 
M

Mark Jason Dominus

My instinct says that will be "hard". Not just "hard" in the common
sense, but hard in the CompSci sense of being equivalent to the halting
problem (i.e. insoluble in the general case in finite time).

My recollection is that this is not the case. Certainly equivalence
of recursive languages is undecidable, but regular languages are
highly restricted special cases.

I may be remembering wrong, but I think regex equivalence is
NP-complete. If so, this would mean that there would be an
exponential-time algorithm for solving it. A quick google search
finds various mentions of various versions of the problem being in
PSPACE or being EXP-complete, which means that it is not undecidable.

So the problem is hard, but not impossible.
 
J

Jürgen Exner

Mark said:
My recollection is that this is not the case. Certainly equivalence
of recursive languages is undecidable, but regular languages are
highly restricted special cases.

Except that Perl's REs (as most other REs that are in common use today) are
not"regular" in the Computer Science classification any longer but have been
extended.

jue
 
R

Robert Inder

Subject: Re: Comparing Regular Expressions
Date: Wed, 13 Apr 2005 23:22:52 GMT
Except that Perl's REs (as most other REs that are in common use
today) are not"regular" in the Computer Science classification
any longer but have been extended.

Right now, I'm just interested in the what can and can't be done.

So if some restrictions on the regexps would let me get a handle on
comparing them, I'd be intrigued.

Even some pointers to work that "merely" dealt with regular RE's
would be a great help.

Robert.
 
J

Jean-Marc Bourguet

Robert Inder said:
Even some pointers to work that "merely" dealt with regular RE's
would be a great help.

I think both Torben Ægidius Mogensen
([email protected]) and me
([email protected]) have outlined the
algorithms you want. It would help to give you better
responses if you state what you think is missing.

The algorithm for going from a RE to a FSM is probably in
every introduction book on compiler.

Yours,
 
T

Torben Ægidius Mogensen

Robert Inder said:
Right now, I'm just interested in the what can and can't be done.

So if some restrictions on the regexps would let me get a handle on
comparing them, I'd be intrigued.

Even some pointers to work that "merely" dealt with regular RE's
would be a great help.

I think this has been covered before:

1) Make minimal DFA's for both RE's.

2) Compare the DFA's for equality (equivalent graphs).

Alternatively,

1) Make NFA's for both RE's.

2) Check bisimulation of the two NFA's.

Both have exponential worst-case time.

The above is compairing for equality. If you want to comapre for
subset, you just replace bisimulation with simulation in the second
method or you construct the difference automaton between the two DFA's
and check for emptyness.

Torben
 
J

Jean-Marc Bourguet

Jean-Marc Bourguet said:
I think both Torben Ægidius Mogensen
([email protected]) and me
([email protected]) have outlined the
algorithms you want. It would help to give you better
responses if you state what you think is missing.

The algorithm for going from a RE to a FSM is probably in
every introduction book on compiler.

flex gives a warning when a rule can never be matched. So
you have an easy way to check if two regular expressions are
identical: input two files consisting simply of the two
expressions in both order. If flex gives that warning for
both files, they are identicals, if flex gives a warning
only for one file, the second in that file regexp is a
subset of the first, if flex never give a warning, both
expressions are able to match strings that the other can
not.

Yours,
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top