Inverting a regular expression?

Harry Ohlsen · Apr 29, 2005

A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

It seems like quite a difficult problem to me, but maybe I'm just not
looking at it the right way.

Obviously simple cases are easy. Eg

re = /[a-z]/
inverse = /[^a-z]/

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region

.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

Inquisitively yours,

Harry O.

************************************************************************

If you have received this e-mail in error, please delete it and notify the sender as soon as possible. The contents of this e-mail may be confidential and the unauthorized use, copying, or dissemination of it and any attachments to it, is prohibited.

Internet communications are not secure and Hyperion does not, therefore, accept legal responsibility for the contents of this message nor for any damage caused by viruses. The views expressed here do not necessarily represent those of Hyperion.

For more information about Hyperion, please visit our Web site at www.hyperion.com

Wolfgang Nádasi-Donner · Apr 29, 2005

Harry Ohlsen said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

It is impossible in general terms if you are talking about the usual "extended regular expressions" (I didn't
think about the "real regular expressions"). Take the RegEx

/(.{4})xy\1/

First of all - what does it mean to invert this expression?
Second - the pattern, that is produced by the RegEx is not static, it is build on results from applying parts
of the pattern.

I assume you can handle this by applying the RegEx (=pattern) to a string and use the MatchData-Object to find
out, what parts of the string are not matched.

Wolfgang Nádasi-Donner
(e-mail address removed)

Florian Groß · Apr 29, 2005

Harry said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

I think the most general way of doing this is this:

/(?!#{re}).{#{match_length}}/

But that will of course only work for constant-length expressions.
Variable, but finite length sets can be done through alternation, but
that would not be enough to work for the above.

It can also be quite hard to find the match length of an arbitrary
Regexp so this *is* fairly esoteric stuff.

I have needed this before to generally match Strings that are delimited
by some sort of String which can also be escaped by another String --
the classic example is 'foo\'bar', but you might as well decide to do
DELIMtexttextESCAPEDELIMtextDELIM. Of course the match length is always
known in this case because the delimiter is going to be a String.

This also won't produce overlapping matches when used in .scan, but that
is usually the case for Regexps.

James Edward Gray II · Apr 29, 2005

A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

Switch from =~ to !~?

I know that's not what you asked, I'm just making sure we covered the
obvious.

James Edward Gray II

Nikolai Weibull · Apr 29, 2005

Harry Ohlsen, April 29:

A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

Well, we have to remember that regular languages are closed under
set-difference, so it can be done. However, this doesn't mean that we
have any way of expressing such a regular language with a regular
expression. The grammar of Ruby's regular expressions doesn't provide a
set-difference operator, so there's really no way of doing it in Ruby.

It can be done, as you show, for a single literal, but that's not the
same thing. That's taking the set-difference of the input alphabet and
a given set of literals, i.e., [^a-z] is equivalent to Î£ - {a, b, â€¦, z}.
One should also remember that [â€¦] is only syntactic sugar and is
equivalent to the the union of all members (with provisions for removing
literals with ^ and expanding aâ€“b ranges).

What you want is a complementation operator, i.e., one that gives you
the regular language Î£* - L, where L is the regular language denoted by
your regular expression and - is the set-difference operator.

The issue of matching an inverted regular expression also depends on
what kind of finite state automaton is used. It's simple to do for
deterministic finite automatons; not so simple for nondeterministic ones
(actually, I'm not even sure if it's possible at all).

So, in summary, Ruby has no way of letting you express complementation.
However, it may not be what you actually want in practice. Perhaps what
you really want is to match inbetween matches of a regular expression.
This is simple enough to do. All it takes is to keep track of a couple
of input positions,
nikolai

Nikolai Weibull · Apr 29, 2005

Nikolai Weibull, April 29:

Perhaps what you really want is to match inbetween matches of a
regular expression. This is simple enough to do.

And it can be done with String#split, as Warren Brown pointed out,
nikolai

Chris Pine · Apr 29, 2005

The issue of matching an inverted regular expression also depends on

what kind of finite state automaton is used. It's simple to do for
deterministic finite automatons; not so simple for nondeterministic ones
(actually, I'm not even sure if it's possible at all).

Hmm... I'm not sure this is true. As I recall, a language accepted
by a regular expression can always be described as the language
accepted by some FA, or by some non-deterministic FA; in other words,
that they are all equivalent.

I seem to remember that if A, B, and C are the sets of languages
accepted by RE's, FA's, and NFA's (not necessarily respectively), that
the easiest way to prove this was to show A as a subset of B, and B of
C, and C of A. I don't remember which ones are A, B, or C, but I
remember it was a cute proof that way.

But the RE's we use in programming languages are different from what
we use in CS, or at least in *how* we use them. Surely if I was
trying to match substrings which are not 'Chris' in some large string,
I would find many, many matches.

I guess I'm saying it should always be possible, but maybe not so
useful. What's the situation where you want to do this?

Chris

Nikolai Weibull · Apr 29, 2005

Chris Pine, April 30:

Hmm... I'm not sure this is true. As I recall, a language accepted
by a regular expression can always be described as the language
accepted by some FA, or by some non-deterministic FA; in other words,
that they are all equivalent.

Sure, they're equivalent. That parentheses was false, sorry. However,
it's a lot easier to do for deterministic FAs than for nondeterministic
FAs, as far as I know,
nikolai

Assaph Mehr · May 3, 2005

Harry said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

It seems like quite a difficult problem to me, but maybe I'm just not
looking at it the right way.

Obviously simple cases are easy. Eg

re = /[a-z]/
inverse = /[^a-z]/

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

How about splitting on the regex instead of scanning for it? It'll give
you the longest sequences between matches -> everything that didn't
match.

irb(main):002:0> s = 'DELIMtexttextDELIMtextDELIM'
=> "DELIMtexttextDELIMtextDELIM"
irb(main):003:0> s.scan /DELIM/
=> ["DELIM", "DELIM", "DELIM"]
irb(main):004:0> s.split /DELIM/
=> ["", "texttext", "text"]

Regular expression	7	Mar 23, 2009
Regular expression syntax error	1	Dec 20, 2015
Regular Expression Trouble	4	Dec 26, 2008
substitution in a regular expression	3	Mar 11, 2011
Inverting a Regular Expression in Validator	1	Jan 28, 2005
regular expression	1	Aug 6, 2008
About a regular expression	5	Nov 26, 2007
Multi-line regular expression match question	5	Nov 19, 2010

Inverting a regular expression?

Harry Ohlsen

Wolfgang Nádasi-Donner

Florian Groß

James Edward Gray II

Nikolai Weibull

Nikolai Weibull

Chris Pine

Nikolai Weibull

Assaph Mehr

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads