Comparing Regular Expressions

Discussion in 'Perl Misc' started by Robert Inder, Apr 6, 2005.

  1. Robert Inder

    Robert Inder Guest

    I'm interested in comparing the "coverage" of regular expressions.

    In particular, if I have two regular expressions, I want to know
    whether it is possible to construct a string that will match
    both, or indeed that will match one and not the other.

    So, for instance, suppose I believe that if an object has a
    descripiton string matching regexp R1, it is of type A.
    Someone now tells me that an object with a
    descripiton string matching regexp R2 is of type B. What can I tell
    about (the sets of things of type) A and B by looking at R1 and R2?

    So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.

    Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.

    But how do I formalise (i.e. program) this comparison process?

    This strikes me as a very general problem/issue and there
    just has to be some work on formalising and solving it. But I don't
    know how to find it: Google searches for things like "regular
    expression comparison" are swamped by algorithms for matching regexps
    against strings and so forth, which isn't what I want.

    Can anyone point me in the right direction?

    Robert.

    --
    __ To avoid the spam trap, mail me
    |_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
    | \(_)|_)(-'| |_ || |(_|(-'| '
    Best viewed in Ebriated.
     
    Robert Inder, Apr 6, 2005
    #1
    1. Advertising

  2. Robert Inder

    Anno Siegel Guest

    Robert Inder <> wrote in comp.lang.perl.misc:
    >
    > I'm interested in comparing the "coverage" of regular expressions.
    >
    > In particular, if I have two regular expressions, I want to know
    > whether it is possible to construct a string that will match
    > both, or indeed that will match one and not the other.


    [snip]

    That will be utterly non-trivial.

    Consider the partial problem of finding a string that a given regex
    matches. Since there are patterns that don't match anything (/$.^/),
    there can't be a general solution. At the very least you'll need a
    complete regex parser before you can tackle this problem.

    Anno
     
    Anno Siegel, Apr 6, 2005
    #2
    1. Advertising

  3. Robert Inder wrote:

    > I'm interested in comparing the "coverage" of regular expressions.
    >
    > In particular, if I have two regular expressions, I want to know
    > whether it is possible to construct a string that will match
    > both, or indeed that will match one and not the other.


    My instinct says that will be "hard". Not just "hard" in the common
    sense, but hard in the CompSci sense of being equivalent to the halting
    problem (i.e. insoluble in the general case in finite time).

    Note I'm reading this in clp.misc. I suspect the nettizens of
    comp.theory will be much more clued-up on this.
     
    Brian McCauley, Apr 6, 2005
    #3
  4. Robert Inder <> writes:

    > I'm interested in comparing the "coverage" of regular expressions.


    For which definition or regular expressions? I see this is crossposted
    to comp.lang.perl.misc and comp.theory and the default meaning of
    regular expression is quite different in the two groups.

    > In particular, if I have two regular expressions, I want to know
    > whether it is possible to construct a string that will match both,
    > or indeed that will match one and not the other.


    For the comp.theory adapted definition, it is possible to construct
    FSM accepting the language described by a regular expressions.

    Starting from the FSMs and using a variant of the algorithm used to
    minimize them it is easy to construct an FSM which accept the union of
    the two languages. As there is a know relation between the states in
    the new FSM and the states in the old, it is easy to classify the
    accepting states of the new FSM as having correspondance in only one
    or both of the original FSM.

    Yours,

    --
    Jean-Marc
     
    Jean-Marc Bourguet, Apr 6, 2005
    #4
  5. Robert Inder <> writes:

    > I'm interested in comparing the "coverage" of regular expressions.
    >
    > In particular, if I have two regular expressions, I want to know
    > whether it is possible to construct a string that will match
    > both, or indeed that will match one and not the other.


    Regular languages are closed under intersection and set difference, so
    you can not only find strings (if any exist) in the intersection and
    difference sets, you can construct regular expressions or DFA's that
    describe all such strings.

    > So, for instance, suppose I believe that if an object has a
    > descripiton string matching regexp R1, it is of type A.
    > Someone now tells me that an object with a
    > descripiton string matching regexp R2 is of type B. What can I tell
    > about (the sets of things of type) A and B by looking at R1 and R2?
    >
    > So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
    >
    > Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.


    Say what? The set containing only the string "foobar" is not a subset
    of the set containing only "foo". Or do you consider a string to
    match a regular expression if a prefix of the string does?

    > But how do I formalise (i.e. program) this comparison process?
    >
    > This strikes me as a very general problem/issue and there
    > just has to be some work on formalising and solving it. But I don't
    > know how to find it: Google searches for things like "regular
    > expression comparison" are swamped by algorithms for matching regexps
    > against strings and so forth, which isn't what I want.


    If you have a DFA D1 with states s_0 ... s_m for A and a DFA D2 with
    states t_0 ... t_n for B, you can construct a DFA D3 for the
    intersection of A and B in the following way:

    1. The start state of D3 is the pair (s_0,t_0) of starting states of
    D1 and D2.

    2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
    transition on symbol c from t_k to t_l, D3 has a transition from
    (s_i,t_k) to (s_j,t_l) on c.

    3. If s_i is accepting in D1 and t_k is accepting in D2, (s_i,t_k) is
    accepting in D3.

    For A\B (A minus B), the construction is:

    0. Add a non-accepting state t_(n+1) to D2. If there is a state t_k
    in D2 and a symbol c such that t_k has no transition on c, add a
    transition from t_k to t_(n+1) on c. There is a transition from
    t_(n+1) to t_(n+1) on all symbols. This way, D2 has transitions
    on all symbols from all states.

    1. The start state of D3 is the pair (s_0,t_0) of starting states of
    D1 and D2.

    2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
    transition on symbol c from t_k to t_l, D3 has a transition from
    (s_i,t_k) to (s_j,t_l) on c.

    3. If s_i is accepting in D1 and t_k is not accepting in D2,
    (s_i,t_k) is accepting in D3.


    Torben
     
    Torben Ægidius Mogensen, Apr 6, 2005
    #5
  6. Robert Inder wrote:
    > I'm interested in comparing the "coverage" of regular expressions.
    >
    > In particular, if I have two regular expressions, I want to know
    > whether it is possible to construct a string that will match
    > both, or indeed that will match one and not the other.
    >
    > So, for instance, suppose I believe that if an object has a
    > descripiton string matching regexp R1, it is of type A.
    > Someone now tells me that an object with a
    > descripiton string matching regexp R2 is of type B. What can I tell
    > about (the sets of things of type) A and B by looking at R1 and R2?
    >
    > So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
    >
    > Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.
    >
    > But how do I formalise (i.e. program) this comparison process?
    >
    > This strikes me as a very general problem/issue and there
    > just has to be some work on formalising and solving it. But I don't
    > know how to find it: Google searches for things like "regular
    > expression comparison" are swamped by algorithms for matching regexps
    > against strings and so forth, which isn't what I want.
    >
    > Can anyone point me in the right direction?


    There's a wonderful treatment of the subject in "Higher Order Perl",
    written by Mark Jason Dominus. This is easily one of the most
    thought-provoking books on Perl to date (IMHO) and a highly-recommended
    read.
     
    Steven N. Hirsch, Apr 12, 2005
    #6
  7. In article <d30g15$729$>,
    Brian McCauley <> wrote:
    >My instinct says that will be "hard". Not just "hard" in the common
    >sense, but hard in the CompSci sense of being equivalent to the halting
    >problem (i.e. insoluble in the general case in finite time).


    My recollection is that this is not the case. Certainly equivalence
    of recursive languages is undecidable, but regular languages are
    highly restricted special cases.

    I may be remembering wrong, but I think regex equivalence is
    NP-complete. If so, this would mean that there would be an
    exponential-time algorithm for solving it. A quick google search
    finds various mentions of various versions of the problem being in
    PSPACE or being EXP-complete, which means that it is not undecidable.

    So the problem is hard, but not impossible.
     
    Mark Jason Dominus, Apr 13, 2005
    #7
  8. Mark Jason Dominus wrote:
    > In article <d30g15$729$>,
    > Brian McCauley <> wrote:
    >> My instinct says that will be "hard". Not just "hard" in the common
    >> sense, but hard in the CompSci sense of being equivalent to the
    >> halting problem (i.e. insoluble in the general case in finite time).

    >
    > My recollection is that this is not the case. Certainly equivalence
    > of recursive languages is undecidable, but regular languages are
    > highly restricted special cases.


    Except that Perl's REs (as most other REs that are in common use today) are
    not"regular" in the Computer Science classification any longer but have been
    extended.

    jue
     
    Jürgen Exner, Apr 14, 2005
    #8
  9. Robert Inder

    Robert Inder Guest


    >>>>> Jürgen Exner writes:

    > Subject: Re: Comparing Regular Expressions
    > Date: Wed, 13 Apr 2005 23:22:52 GMT


    > Mark Jason Dominus wrote:
    >> In article <d30g15$729$>,
    >> Brian McCauley <> wrote:
    >>> My instinct says that will be "hard". Not just "hard" in the common
    >>> sense, but hard in the CompSci sense of being equivalent to the
    >>> halting problem (i.e. insoluble in the general case in finite time).

    >>
    >> My recollection is that this is not the case. Certainly equivalence
    >> of recursive languages is undecidable, but regular languages are
    >> highly restricted special cases.


    > Except that Perl's REs (as most other REs that are in common use
    > today) are not"regular" in the Computer Science classification
    > any longer but have been extended.


    Right now, I'm just interested in the what can and can't be done.

    So if some restrictions on the regexps would let me get a handle on
    comparing them, I'd be intrigued.

    Even some pointers to work that "merely" dealt with regular RE's
    would be a great help.

    > jue


    Robert.

    --
    __ To avoid the spam trap, mail me
    |_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
    | \(_)|_)(-'| |_ || |(_|(-'| '
    Best viewed in Ebriated.
     
    Robert Inder, Apr 14, 2005
    #9
  10. Robert Inder <> writes:

    > Even some pointers to work that "merely" dealt with regular RE's
    > would be a great help.


    I think both Torben Ægidius Mogensen
    () and me
    () have outlined the
    algorithms you want. It would help to give you better
    responses if you state what you think is missing.

    The algorithm for going from a RE to a FSM is probably in
    every introduction book on compiler.

    Yours,

    --
    Jean-Marc
     
    Jean-Marc Bourguet, Apr 15, 2005
    #10
  11. Robert Inder <> writes:

    > >>>>> Jürgen Exner writes:

    > > Subject: Re: Comparing Regular Expressions
    > > Date: Wed, 13 Apr 2005 23:22:52 GMT

    >
    > > Mark Jason Dominus wrote:
    > >> In article <d30g15$729$>,
    > >> Brian McCauley <> wrote:
    > >>> My instinct says that will be "hard". Not just "hard" in the common
    > >>> sense, but hard in the CompSci sense of being equivalent to the
    > >>> halting problem (i.e. insoluble in the general case in finite time).
    > >>
    > >> My recollection is that this is not the case. Certainly equivalence
    > >> of recursive languages is undecidable, but regular languages are
    > >> highly restricted special cases.

    >
    > > Except that Perl's REs (as most other REs that are in common use
    > > today) are not"regular" in the Computer Science classification
    > > any longer but have been extended.

    >
    > Right now, I'm just interested in the what can and can't be done.
    >
    > So if some restrictions on the regexps would let me get a handle on
    > comparing them, I'd be intrigued.
    >
    > Even some pointers to work that "merely" dealt with regular RE's
    > would be a great help.


    I think this has been covered before:

    1) Make minimal DFA's for both RE's.

    2) Compare the DFA's for equality (equivalent graphs).

    Alternatively,

    1) Make NFA's for both RE's.

    2) Check bisimulation of the two NFA's.

    Both have exponential worst-case time.

    The above is compairing for equality. If you want to comapre for
    subset, you just replace bisimulation with simulation in the second
    method or you construct the difference automaton between the two DFA's
    and check for emptyness.

    Torben
     
    Torben Ægidius Mogensen, Apr 15, 2005
    #11
  12. Jean-Marc Bourguet <> writes:

    > Robert Inder <> writes:
    >
    > > Even some pointers to work that "merely" dealt with regular RE's
    > > would be a great help.

    >
    > I think both Torben Ægidius Mogensen
    > () and me
    > () have outlined the
    > algorithms you want. It would help to give you better
    > responses if you state what you think is missing.
    >
    > The algorithm for going from a RE to a FSM is probably in
    > every introduction book on compiler.


    flex gives a warning when a rule can never be matched. So
    you have an easy way to check if two regular expressions are
    identical: input two files consisting simply of the two
    expressions in both order. If flex gives that warning for
    both files, they are identicals, if flex gives a warning
    only for one file, the second in that file regexp is a
    subset of the first, if flex never give a warning, both
    expressions are able to match strings that the other can
    not.

    Yours,

    --
    Jean-Marc
     
    Jean-Marc Bourguet, Apr 17, 2005
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Jay Douglas

    Custom Regular Expressions in ASP.net

    Jay Douglas, Nov 2, 2003, in forum: ASP .Net
    Replies:
    3
    Views:
    608
    mikeb
    Nov 3, 2003
  2. mark

    Regular expressions

    mark, Jun 30, 2003, in forum: Perl
    Replies:
    4
    Views:
    1,723
  3. Dustin D.
    Replies:
    1
    Views:
    11,215
  4. Jay Douglas
    Replies:
    0
    Views:
    610
    Jay Douglas
    Aug 15, 2003
  5. Noman Shapiro
    Replies:
    0
    Views:
    235
    Noman Shapiro
    Jul 17, 2013
Loading...

Share This Page