Inverting a regular expression?

Discussion in 'Ruby' started by Harry Ohlsen, Apr 29, 2005.

  1. Harry Ohlsen

    Harry Ohlsen Guest

    A colleague of mine just asked me whether it was possible to invert an
    arbitrary regular expression. Ie, create a new regular expression that
    matches whatever the original *didn't*.

    It seems like quite a difficult problem to me, but maybe I'm just not
    looking at it the right way.

    Obviously simple cases are easy. Eg

    re = /[a-z]/
    inverse = /[^a-z]/

    However, I think it would be much harder for arbitrary cases. Eg, how
    would one automatically invert something like the following

    /<[^<>]*id="!?(region:cool:.*?>/

    Note that I'm not talking about finding lines that don't contain the
    pattern (a la "grep -v"); I'm talking about finding all the occurrences
    within a single line that don't match it.

    Inquisitively yours,

    Harry O.


    ************************************************************************

    If you have received this e-mail in error, please delete it and notify the sender as soon as possible. The contents of this e-mail may be confidential and the unauthorized use, copying, or dissemination of it and any attachments to it, is prohibited.

    Internet communications are not secure and Hyperion does not, therefore, accept legal responsibility for the contents of this message nor for any damage caused by viruses. The views expressed here do not necessarily represent those of Hyperion.

    For more information about Hyperion, please visit our Web site at www.hyperion.com
    Harry Ohlsen, Apr 29, 2005
    #1
    1. Advertising

  2. "Harry Ohlsen" <> schrieb im Newsbeitrag
    news:...
    > A colleague of mine just asked me whether it was possible to invert an
    > arbitrary regular expression. Ie, create a new regular expression that
    > matches whatever the original *didn't*.
    >

    It is impossible in general terms if you are talking about the usual "extended regular expressions" (I didn't
    think about the "real regular expressions"). Take the RegEx

    /(.{4})xy\1/

    First of all - what does it mean to invert this expression?
    Second - the pattern, that is produced by the RegEx is not static, it is build on results from applying parts
    of the pattern.

    I assume you can handle this by applying the RegEx (=pattern) to a string and use the MatchData-Object to find
    out, what parts of the string are not matched.

    Wolfgang Nádasi-Donner
    Wolfgang Nádasi-Donner, Apr 29, 2005
    #2
    1. Advertising

  3. Harry Ohlsen wrote:

    > A colleague of mine just asked me whether it was possible to invert an
    > arbitrary regular expression. Ie, create a new regular expression that
    > matches whatever the original *didn't*.
    >
    > However, I think it would be much harder for arbitrary cases. Eg, how
    > would one automatically invert something like the following
    >
    > /<[^<>]*id="!?(region:cool:.*?>/
    >
    > Note that I'm not talking about finding lines that don't contain the
    > pattern (a la "grep -v"); I'm talking about finding all the occurrences
    > within a single line that don't match it.


    I think the most general way of doing this is this:

    /(?!#{re}).{#{match_length}}/

    But that will of course only work for constant-length expressions.
    Variable, but finite length sets can be done through alternation, but
    that would not be enough to work for the above.

    It can also be quite hard to find the match length of an arbitrary
    Regexp so this *is* fairly esoteric stuff.

    I have needed this before to generally match Strings that are delimited
    by some sort of String which can also be escaped by another String --
    the classic example is 'foo\'bar', but you might as well decide to do
    DELIMtexttextESCAPEDELIMtextDELIM. Of course the match length is always
    known in this case because the delimiter is going to be a String.

    This also won't produce overlapping matches when used in .scan, but that
    is usually the case for Regexps.
    Florian Groß, Apr 29, 2005
    #3
  4. On Apr 29, 2005, at 12:56 AM, Harry Ohlsen wrote:

    > A colleague of mine just asked me whether it was possible to invert an
    > arbitrary regular expression. Ie, create a new regular expression that
    > matches whatever the original *didn't*.


    Switch from =~ to !~? :)

    I know that's not what you asked, I'm just making sure we covered the
    obvious.

    James Edward Gray II
    James Edward Gray II, Apr 29, 2005
    #4
  5. Harry Ohlsen, April 29:

    > A colleague of mine just asked me whether it was possible to invert an
    > arbitrary regular expression. Ie, create a new regular expression that
    > matches whatever the original *didn't*.


    Well, we have to remember that regular languages are closed under
    set-difference, so it can be done. However, this doesn't mean that we
    have any way of expressing such a regular language with a regular
    expression. The grammar of Ruby's regular expressions doesn't provide a
    set-difference operator, so there's really no way of doing it in Ruby.

    It can be done, as you show, for a single literal, but that's not the
    same thing. That's taking the set-difference of the input alphabet and
    a given set of literals, i.e., [^a-z] is equivalent to Σ - {a, b, …, z}.
    One should also remember that […] is only syntactic sugar and is
    equivalent to the the union of all members (with provisions for removing
    literals with ^ and expanding a–b ranges).

    What you want is a complementation operator, i.e., one that gives you
    the regular language Σ* - L, where L is the regular language denoted by
    your regular expression and - is the set-difference operator.

    The issue of matching an inverted regular expression also depends on
    what kind of finite state automaton is used. It's simple to do for
    deterministic finite automatons; not so simple for nondeterministic ones
    (actually, I'm not even sure if it's possible at all).

    So, in summary, Ruby has no way of letting you express complementation.
    However, it may not be what you actually want in practice. Perhaps what
    you really want is to match inbetween matches of a regular expression.
    This is simple enough to do. All it takes is to keep track of a couple
    of input positions,
    nikolai

    --
    Nikolai Weibull: now available free of charge at http://bitwi.se/!
    Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
    main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
    Nikolai Weibull, Apr 29, 2005
    #5
  6. Nikolai Weibull, April 29:

    > Perhaps what you really want is to match inbetween matches of a
    > regular expression. This is simple enough to do.


    And it can be done with String#split, as Warren Brown pointed out,
    nikolai

    --
    Nikolai Weibull: now available free of charge at http://bitwi.se/!
    Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
    main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
    Nikolai Weibull, Apr 29, 2005
    #6
  7. Harry Ohlsen

    Chris Pine Guest

    > The issue of matching an inverted regular expression also depends on
    > what kind of finite state automaton is used. It's simple to do for
    > deterministic finite automatons; not so simple for nondeterministic ones
    > (actually, I'm not even sure if it's possible at all).


    Hmm... I'm not sure this is true. As I recall, a language accepted
    by a regular expression can always be described as the language
    accepted by some FA, or by some non-deterministic FA; in other words,
    that they are all equivalent.

    I seem to remember that if A, B, and C are the sets of languages
    accepted by RE's, FA's, and NFA's (not necessarily respectively), that
    the easiest way to prove this was to show A as a subset of B, and B of
    C, and C of A. I don't remember which ones are A, B, or C, but I
    remember it was a cute proof that way.

    :)

    But the RE's we use in programming languages are different from what
    we use in CS, or at least in *how* we use them. Surely if I was
    trying to match substrings which are not 'Chris' in some large string,
    I would find many, many matches.

    I guess I'm saying it should always be possible, but maybe not so
    useful. What's the situation where you want to do this?

    Chris
    Chris Pine, Apr 29, 2005
    #7
  8. Chris Pine, April 30:

    > > The issue of matching an inverted regular expression also depends on
    > > what kind of finite state automaton is used. It's simple to do for
    > > deterministic finite automatons; not so simple for nondeterministic ones
    > > (actually, I'm not even sure if it's possible at all).

    >
    > Hmm... I'm not sure this is true. As I recall, a language accepted
    > by a regular expression can always be described as the language
    > accepted by some FA, or by some non-deterministic FA; in other words,
    > that they are all equivalent.


    Sure, they're equivalent. That parentheses was false, sorry. However,
    it's a lot easier to do for deterministic FAs than for nondeterministic
    FAs, as far as I know,
    nikolai

    --
    Nikolai Weibull: now available free of charge at http://bitwi.se/!
    Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
    main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}
    Nikolai Weibull, Apr 29, 2005
    #8
  9. Harry Ohlsen

    Assaph Mehr Guest

    Harry Ohlsen wrote:
    > A colleague of mine just asked me whether it was possible to invert

    an
    > arbitrary regular expression. Ie, create a new regular expression

    that
    > matches whatever the original *didn't*.
    >
    > It seems like quite a difficult problem to me, but maybe I'm just not
    > looking at it the right way.
    >
    > Obviously simple cases are easy. Eg
    >
    > re = /[a-z]/
    > inverse = /[^a-z]/
    >
    > However, I think it would be much harder for arbitrary cases. Eg,

    how
    > would one automatically invert something like the following
    >
    > /<[^<>]*id="!?(region:cool:.*?>/
    >
    > Note that I'm not talking about finding lines that don't contain the
    > pattern (a la "grep -v"); I'm talking about finding all the

    occurrences
    > within a single line that don't match it.


    How about splitting on the regex instead of scanning for it? It'll give
    you the longest sequences between matches -> everything that didn't
    match.

    irb(main):002:0> s = 'DELIMtexttextDELIMtextDELIM'
    => "DELIMtexttextDELIMtextDELIM"
    irb(main):003:0> s.scan /DELIM/
    => ["DELIM", "DELIM", "DELIM"]
    irb(main):004:0> s.split /DELIM/
    => ["", "texttext", "text"]
    Assaph Mehr, May 3, 2005
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. arctan
    Replies:
    6
    Views:
    1,015
    arctan
    Oct 16, 2003
  2. VSK
    Replies:
    2
    Views:
    2,268
  3. MWells
    Replies:
    1
    Views:
    634
    spalding
    Jan 28, 2005
  4. spalding
    Replies:
    1
    Views:
    396
    Peter Blum
    Jan 28, 2005
  5. Niels Dybdahl

    inverting custom cursor

    Niels Dybdahl, Nov 21, 2005, in forum: Java
    Replies:
    0
    Views:
    387
    Niels Dybdahl
    Nov 21, 2005
Loading...

Share This Page