Inverting a regular expression?

H

Harry Ohlsen

A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

It seems like quite a difficult problem to me, but maybe I'm just not
looking at it the right way.

Obviously simple cases are easy. Eg

re = /[a-z]/
inverse = /[^a-z]/

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region:cool:.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

Inquisitively yours,

Harry O.


************************************************************************

If you have received this e-mail in error, please delete it and notify the sender as soon as possible. The contents of this e-mail may be confidential and the unauthorized use, copying, or dissemination of it and any attachments to it, is prohibited.

Internet communications are not secure and Hyperion does not, therefore, accept legal responsibility for the contents of this message nor for any damage caused by viruses. The views expressed here do not necessarily represent those of Hyperion.

For more information about Hyperion, please visit our Web site at www.hyperion.com
 
W

Wolfgang Nádasi-Donner

Harry Ohlsen said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.
It is impossible in general terms if you are talking about the usual "extended regular expressions" (I didn't
think about the "real regular expressions"). Take the RegEx

/(.{4})xy\1/

First of all - what does it mean to invert this expression?
Second - the pattern, that is produced by the RegEx is not static, it is build on results from applying parts
of the pattern.

I assume you can handle this by applying the RegEx (=pattern) to a string and use the MatchData-Object to find
out, what parts of the string are not matched.

Wolfgang Nádasi-Donner
(e-mail address removed)
 
F

Florian Groß

Harry said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region:cool:.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

I think the most general way of doing this is this:

/(?!#{re}).{#{match_length}}/

But that will of course only work for constant-length expressions.
Variable, but finite length sets can be done through alternation, but
that would not be enough to work for the above.

It can also be quite hard to find the match length of an arbitrary
Regexp so this *is* fairly esoteric stuff.

I have needed this before to generally match Strings that are delimited
by some sort of String which can also be escaped by another String --
the classic example is 'foo\'bar', but you might as well decide to do
DELIMtexttextESCAPEDELIMtextDELIM. Of course the match length is always
known in this case because the delimiter is going to be a String.

This also won't produce overlapping matches when used in .scan, but that
is usually the case for Regexps.
 
J

James Edward Gray II

A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

Switch from =~ to !~? :)

I know that's not what you asked, I'm just making sure we covered the
obvious.

James Edward Gray II
 
N

Nikolai Weibull

Harry Ohlsen, April 29:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

Well, we have to remember that regular languages are closed under
set-difference, so it can be done. However, this doesn't mean that we
have any way of expressing such a regular language with a regular
expression. The grammar of Ruby's regular expressions doesn't provide a
set-difference operator, so there's really no way of doing it in Ruby.

It can be done, as you show, for a single literal, but that's not the
same thing. That's taking the set-difference of the input alphabet and
a given set of literals, i.e., [^a-z] is equivalent to Σ - {a, b, …, z}.
One should also remember that […] is only syntactic sugar and is
equivalent to the the union of all members (with provisions for removing
literals with ^ and expanding a–b ranges).

What you want is a complementation operator, i.e., one that gives you
the regular language Σ* - L, where L is the regular language denoted by
your regular expression and - is the set-difference operator.

The issue of matching an inverted regular expression also depends on
what kind of finite state automaton is used. It's simple to do for
deterministic finite automatons; not so simple for nondeterministic ones
(actually, I'm not even sure if it's possible at all).

So, in summary, Ruby has no way of letting you express complementation.
However, it may not be what you actually want in practice. Perhaps what
you really want is to match inbetween matches of a regular expression.
This is simple enough to do. All it takes is to keep track of a couple
of input positions,
nikolai
 
N

Nikolai Weibull

Nikolai Weibull, April 29:
Perhaps what you really want is to match inbetween matches of a
regular expression. This is simple enough to do.

And it can be done with String#split, as Warren Brown pointed out,
nikolai
 
C

Chris Pine

The issue of matching an inverted regular expression also depends on
what kind of finite state automaton is used. It's simple to do for
deterministic finite automatons; not so simple for nondeterministic ones
(actually, I'm not even sure if it's possible at all).

Hmm... I'm not sure this is true. As I recall, a language accepted
by a regular expression can always be described as the language
accepted by some FA, or by some non-deterministic FA; in other words,
that they are all equivalent.

I seem to remember that if A, B, and C are the sets of languages
accepted by RE's, FA's, and NFA's (not necessarily respectively), that
the easiest way to prove this was to show A as a subset of B, and B of
C, and C of A. I don't remember which ones are A, B, or C, but I
remember it was a cute proof that way.

:)

But the RE's we use in programming languages are different from what
we use in CS, or at least in *how* we use them. Surely if I was
trying to match substrings which are not 'Chris' in some large string,
I would find many, many matches.

I guess I'm saying it should always be possible, but maybe not so
useful. What's the situation where you want to do this?

Chris
 
N

Nikolai Weibull

Chris Pine, April 30:
Hmm... I'm not sure this is true. As I recall, a language accepted
by a regular expression can always be described as the language
accepted by some FA, or by some non-deterministic FA; in other words,
that they are all equivalent.

Sure, they're equivalent. That parentheses was false, sorry. However,
it's a lot easier to do for deterministic FAs than for nondeterministic
FAs, as far as I know,
nikolai
 
A

Assaph Mehr

Harry said:
A colleague of mine just asked me whether it was possible to invert an
arbitrary regular expression. Ie, create a new regular expression that
matches whatever the original *didn't*.

It seems like quite a difficult problem to me, but maybe I'm just not
looking at it the right way.

Obviously simple cases are easy. Eg

re = /[a-z]/
inverse = /[^a-z]/

However, I think it would be much harder for arbitrary cases. Eg, how
would one automatically invert something like the following

/<[^<>]*id="!?(region:cool:.*?>/

Note that I'm not talking about finding lines that don't contain the
pattern (a la "grep -v"); I'm talking about finding all the occurrences
within a single line that don't match it.

How about splitting on the regex instead of scanning for it? It'll give
you the longest sequences between matches -> everything that didn't
match.

irb(main):002:0> s = 'DELIMtexttextDELIMtextDELIM'
=> "DELIMtexttextDELIMtextDELIM"
irb(main):003:0> s.scan /DELIM/
=> ["DELIM", "DELIM", "DELIM"]
irb(main):004:0> s.split /DELIM/
=> ["", "texttext", "text"]
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top