Negate a character sequence in a regular expression?

Discussion in 'Ruby' started by crm_114@mac.com, Nov 29, 2007.

  1. Guest

    For the following string:

    'cat sheep horse cat tac dog'

    I would like to write a regular expression that matches any substring
    that is prefixed by the word 'cat', is then followed by any characters
    as long as those characters do not comprise the word 'cat', and then
    finally suffixed by the string 'dog'. Therefore, this expression
    should match the substring 'cat tac dog' in the above string.

    Obviously, if I write an expression like:

    irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
    => "cat sheep horse cat tac dog"

    it will match the entire string.

    And the non-greedy Kleene doesn't buy me anything either since the
    expression matches the first cat found anyway:

    irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
    => "cat sheep horse cat tac dog"


    What I think I want to do is to negate a sequence of characters,
    rather than just a character class, but I have looked around and not
    found anything quite right.

    Of course, there are ways of hacking this out, e.g. I could reverse
    the string first and match 'god' followed by the first instance of
    'tac', but I am hoping there is a more elegant way to do this with a
    single regular expression.

    Thanks--
     
    , Nov 29, 2007
    #1
    1. Advertising

  2. James Moore Guest

    I think this gets you close to where you want to be:

    irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
    => 16
    irb(main):046:0> $1
    => " tac "
    irb(main):047:0>

    The (?! bit is a nonmatching lookahead.

    On Nov 29, 2007 2:07 PM, <> wrote:
    > For the following string:
    >
    > 'cat sheep horse cat tac dog'
    >
    > I would like to write a regular expression that matches any substring
    > that is prefixed by the word 'cat', is then followed by any characters
    > as long as those characters do not comprise the word 'cat', and then
    > finally suffixed by the string 'dog'. Therefore, this expression
    > should match the substring 'cat tac dog' in the above string.


    I think this gets you closer to where you want to be:

    irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
    => 16
    irb(main):046:0> $1
    => " tac "
    irb(main):047:0>

    The (?! bit is a nonmatching lookahead.

    --
    James Moore |
    Ruby and Ruby on Rails consulting
    blog.restphone.com
     
    James Moore, Nov 29, 2007
    #2
    1. Advertising

  3. MonkeeSage Guest

    On Nov 29, 3:47 pm, wrote:
    > For the following string:
    >
    > 'cat sheep horse cat tac dog'
    >
    > I would like to write a regular expression that matches any substring
    > that is prefixed by the word 'cat', is then followed by any characters
    > as long as those characters do not comprise the word 'cat', and then
    > finally suffixed by the string 'dog'. Therefore, this expression
    > should match the substring 'cat tac dog' in the above string.
    >
    > Obviously, if I write an expression like:
    >
    > irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
    > => "cat sheep horse cat tac dog"
    >
    > it will match the entire string.
    >
    > And the non-greedy Kleene doesn't buy me anything either since the
    > expression matches the first cat found anyway:
    >
    > irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
    > => "cat sheep horse cat tac dog"
    >
    > What I think I want to do is to negate a sequence of characters,
    > rather than just a character class, but I have looked around and not
    > found anything quite right.
    >
    > Of course, there are ways of hacking this out, e.g. I could reverse
    > the string first and match 'god' followed by the first instance of
    > 'tac', but I am hoping there is a more elegant way to do this with a
    > single regular expression.
    >
    > Thanks--


    If you just want the right-most match of the substring prefixed by
    'cat' and suffixed by 'dog', this should work: /.*(cat.*dog)/.match(s)
    [1].

    Regards,
    Jordan
     
    MonkeeSage, Nov 30, 2007
    #3
  4. yermej Guest

    On Nov 29, 5:06 pm, James Moore <> wrote:
    > On Nov 29, 2007 2:07 PM, <> wrote:
    >
    > > For the following string:

    >
    > > 'cat sheep horse cat tac dog'

    >
    > > I would like to write a regular expression that matches any substring
    > > that is prefixed by the word 'cat', is then followed by any characters
    > > as long as those characters do not comprise the word 'cat', and then
    > > finally suffixed by the string 'dog'. Therefore, this expression
    > > should match the substring 'cat tac dog' in the above string.

    >
    > I think this gets you closer to where you want to be:
    >
    > irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
    > => 16
    > irb(main):046:0> $1
    > => " tac "
    > irb(main):047:0>
    >
    > The (?! bit is a nonmatching lookahead.


    Nonmatching (or negative) lookahead is what you want, and with some
    adjustment of the capture you get:

    > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/

    => 16
    > $1

    => "cat tac dog"
     
    yermej, Nov 30, 2007
    #4
  5. > For the following string:
    >=20
    > 'cat sheep horse cat tac dog'
    >=20
    > I would like to write a regular expression that matches any substring
    > that is prefixed by the word 'cat', is then followed by any characters
    > as long as those characters do not comprise the word 'cat', and then
    > finally suffixed by the string 'dog'. Therefore, this expression
    > should match the substring 'cat tac dog' in the above string.


    Working out negative regular expressions is normally best avoided.

    One step is hard. Two steps is not:

    x =3D 'cat sheep horse cat tac dog'
    /(cat.*?dog)/.match(x) && $1.sub(/.*cat/,'cat')

    Or if you want multiple matches:

    x =3D 'cat sheep horse cat tac dog cat cat sheep dog'
    x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
    =3D> ["cat tac dog", "cat sheep dog"]

    Dan.
     
    Daniel Sheppard, Nov 30, 2007
    #5
  6. yermej wrote:
    >
    > Nonmatching (or negative) lookahead is what you want, and with some
    > adjustment of the capture you get:
    >
    > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
    > => 16
    > $1
    > => "cat tac dog"


    Negative lookaheads that contain '.*' are hard to comprehend (at least
    for me). It is enough to add a 'cat' at the end and the regexp does not
    find any more the 'cat tac dog' that should be matched:

    'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
    => nil

    Also notice that, when it works, it will always give the last expression
    present:
    'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
    => 33
    p $1 # => "cat dog"


    Daniel Sheppard wrote:
    > Working out negative regular expressions is normally best avoided.


    I would say that it is true when they contain '.*' type expressions;
    else they can be extremely useful.

    > Or if you want multiple matches:
    >
    > x = 'cat sheep horse cat tac dog cat cat sheep dog'
    > x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
    > => ["cat tac dog", "cat sheep dog"]


    Very ingenious...

    Raul
    --
    Posted via http://www.ruby-forum.com/.
     
    Raul Parolari, Dec 1, 2007
    #6
  7. yermej Guest

    On Dec 1, 2:19 am, Raul Parolari <> wrote:
    > yermej wrote:
    >
    > > Nonmatching (or negative) lookahead is what you want, and with some
    > > adjustment of the capture you get:

    >
    > > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
    > > => 16
    > > $1
    > > => "cat tac dog"

    >
    > Negative lookaheads that contain '.*' are hard to comprehend (at least
    > for me). It is enough to add a 'cat' at the end and the regexp does not
    > find any more the 'cat tac dog' that should be matched:
    >
    > 'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
    > => nil
    >
    > Also notice that, when it works, it will always give the last expression
    > present:
    > 'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
    > => 33
    > p $1 # => "cat dog"
    >
    > Daniel Sheppard wrote:
    > > Working out negative regular expressions is normally best avoided.

    >
    > I would say that it is true when they contain '.*' type expressions;
    > else they can be extremely useful.
    >
    > > Or if you want multiple matches:

    >
    > > x = 'cat sheep horse cat tac dog cat cat sheep dog'
    > > x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
    > > => ["cat tac dog", "cat sheep dog"]

    >
    > Very ingenious...
    >
    > Raul
    > --
    > Posted viahttp://www.ruby-forum.com/.


    Thanks, Raul, for the clarification on that.

    To the original poster, I apologize for the misinformation. I guess
    when I'm not completely sure about such things, I should start a new
    thread, but attempting to answer questions here (some of which I do
    get right) has been a great help to me in my own learning. I think
    that starting new threads on all occasions would get to be too much
    and I probably wouldn't get many responses. In the future, I may just
    keep quiet until I'm either sure I have the correct answer or at least
    don't understand why my answer isn't correct.
     
    yermej, Dec 1, 2007
    #7
  8. MonkeeSage Guest

    On Dec 1, 3:10 am, yermej <> wrote:
    > On Dec 1, 2:19 am, Raul Parolari <> wrote:
    >
    >
    >
    > > yermej wrote:

    >
    > > > Nonmatching (or negative) lookahead is what you want, and with some
    > > > adjustment of the capture you get:

    >
    > > > 'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
    > > > => 16
    > > > $1
    > > > => "cat tac dog"

    >
    > > Negative lookaheads that contain '.*' are hard to comprehend (at least
    > > for me). It is enough to add a 'cat' at the end and the regexp does not
    > > find any more the 'cat tac dog' that should be matched:

    >
    > > 'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
    > > => nil

    >
    > > Also notice that, when it works, it will always give the last expression
    > > present:
    > > 'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
    > > => 33
    > > p $1 # => "cat dog"

    >
    > > Daniel Sheppard wrote:
    > > > Working out negative regular expressions is normally best avoided.

    >
    > > I would say that it is true when they contain '.*' type expressions;
    > > else they can be extremely useful.

    >
    > > > Or if you want multiple matches:

    >
    > > > x = 'cat sheep horse cat tac dog cat cat sheep dog'
    > > > x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
    > > > => ["cat tac dog", "cat sheep dog"]

    >
    > > Very ingenious...

    >
    > > Raul
    > > --
    > > Posted viahttp://www.ruby-forum.com/.

    >
    > Thanks, Raul, for the clarification on that.
    >
    > To the original poster, I apologize for the misinformation. I guess
    > when I'm not completely sure about such things, I should start a new
    > thread, but attempting to answer questions here (some of which I do
    > get right) has been a great help to me in my own learning. I think
    > that starting new threads on all occasions would get to be too much
    > and I probably wouldn't get many responses. In the future, I may just
    > keep quiet until I'm either sure I have the correct answer or at least
    > don't understand why my answer isn't correct.


    Not to blow my own horn, but I think the behavior requested by the OP
    is still /.*(cat.*dog)/...("a regular expression that matches any
    substring
    that is prefixed by the word 'cat', is then followed by any characters
    as long as those characters do not comprise the word 'cat', and then
    finally suffixed by the string 'dog'").

    But don't feel like you have to be 100% correct in order to help out.
    My answers are often bass-ackwards (not really a bragging point, heh).
    But the point is, like you say, to learn from each other and help
    where we can. None of us knows it all; all we can do is our best, and
    pray it might help someone here or there. :)

    Regards,
    Jordan
     
    MonkeeSage, Dec 1, 2007
    #8
  9. Tanaka Akira Guest

    In article <>,
    writes:

    > For the following string:
    >
    > 'cat sheep horse cat tac dog'
    >
    > I would like to write a regular expression that matches any substring
    > that is prefixed by the word 'cat', is then followed by any characters
    > as long as those characters do not comprise the word 'cat', and then
    > finally suffixed by the string 'dog'. Therefore, this expression
    > should match the substring 'cat tac dog' in the above string.


    % /usr/bin/ruby -e 'p /cat((?!cat).)*dog/.match("cat sheep horse cat tac dog").to_s'
    "cat tac dog"
    --
    Tanaka Akira
     
    Tanaka Akira, Dec 1, 2007
    #9
  10. yermej wrote:

    >> Raul
    >> --
    >> Posted viahttp://www.ruby-forum.com/.

    >
    > Thanks, Raul, for the clarification on that.
    >
    > To the original poster, I apologize for the misinformation.
    > In the future, I may just keep quiet until I'm either sure I have
    > the correct answer or at least
    > don't understand why my answer isn't correct.


    yermej,

    I often follow your postings, that I find very interesting. But do not
    take hard this; no one can be perfect all of the time.

    In this case, it is very easy to fall in the alluring trap of the
    negative lookahead with '.*' (see how many people keep posting a variant
    of that solution.. ?!).

    However, I totally share one's feeling of dismay when we give an advice
    that turns out not to be right (or just not completely right); we just
    have to accept that we are fallible.

    In this case hats off to Daniel Sheppard, who showed us how to do it (I
    rewrite it to reinstate it, after all the emails that followed):

    x = 'cat sheep horse cat tac dog cat cat sheep dog cat'

    x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}

    => ["cat tac dog", "cat sheep dog"]

    Absolutely brilliant

    Raul



    --
    Posted via http://www.ruby-forum.com/.
     
    Raul Parolari, Dec 1, 2007
    #10
  11. Guest

    Hello all - this was my first post, so thank you James, Jordan,
    yermej, Daniel, Raul and Tanaka. I really appreciate the help. I've
    learned quite a bit - especially about negative lookahead. Tricky
    stuff.

    As you might expect, I am working on a script to scan a big file for
    multiple cat...dog pairs. Well not cats and dogs, more like time
    stamps, etc.

    I've tried all the solutions posted here and they work very nicely for
    the string I posted. But I botched the original specification in at
    least two ways:

    1. I should have mentioned that there might be multiple cat...dog
    pairs. Then James would probably have given me something looking like
    Tanaka's solution, which uses negative lookahead but only consumes one
    character at a time.

    2. I should have mentioned that the characters between cat and dog
    should contain neither 'cat' nor 'dog'. I'm really looking for the
    'innermost' cat...dog pair.

    So I tried Daniel's solution and that works pretty well. If there is
    a way to break it I did not find it.

    I also tweaked Tanaka's solution a little bit after a trip back to the
    pickaxe book. I added the (?: to keep the group from generating
    backreferences, and I added an alternate 'dog' to the negative
    assertion. This is confusing but it seems to work:


    irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
    dog '
    => "cat sheep horse cat tac dog cat cat sheep dog dog "
    irb(main):002:0> x.scan(/cat(?:(?!cat|dog).)*dog/)
    => ["cat tac dog", "cat sheep dog"]


    Here is my attempt at breaking it down, for anyone who is interested:
    /cat(?:(?!cat|dog).)*dog/
    1 2 3 4 5 6

    1. The engine starts scanning through the string until it matches
    'cat'.

    2. At the first position after the t in cat, the engine tries to
    match this expression (?:(?!cat|dog).), zero or more times. The (?:
    is nothing special. It is just a grouping that does not generate
    backreferences.

    3. The negative lookahead assertion. If the regular expression
    following the ! matches the string, starting from the current position
    of the engine, then the expression to which the assertion belongs will
    fail. But since the negative assertion belongs to a grouping that can
    match 0 or more times, the assertion can fail but the full regular
    expression can succeed.
    The hard part for me is that the negative lookahead assertion does
    not consume any characters. It looks ahead, tries to match its
    regular expression, and whether it returns 'MATCH' or 'NO MATCH', the
    current position of the engine stays the same.

    4. The regexp for the negative assertion. The stuff between each
    cat...dog pair should contain neither 'cat' nor 'dog'.

    5. One character is consumed, the engine moves down the string one
    position. Thanks to the negative lookahead, we know that that the
    consumed character is not the start of a 'cat' or 'dog' substring.

    6. Eventually, the (?:(?!cat|dog).)* fails when the current position
    of the engine reaches the beginning of a 'dog' substring. But then
    the last part of the regular expression (6) will match.

    ---

    Also, another way of getting the 'innermost' cat...dog pair is to use
    non-greedy kleene star(*?). This way, the engine will take the first
    'dog' suffix that it finds, rather than the last possible:

    irb(main):003:0> x.scan(/cat(?:(?!cat).)*?dog/)
    => ["cat tac dog", "cat sheep dog"]

    So there is more than one way to scan a cat! (Sorry!!!)

    ....In any event, I think I still like Daniel's solution better,
    because I can look at it and feel fairly certain that it will do
    exactly what it should do.


    Thanks--
    Josh
     
    , Dec 2, 2007
    #11
  12. unknown wrote:

    > I also tweaked Tanaka's solution a little bit after a trip back to the
    > pickaxe book. I added the (?: to keep the group from generating
    > backreferences,

    Great

    > and I added an alternate 'dog' to the negative assertion.

    Just using a non-greedy expression removes this complication (as you
    realized later), although it made the following even more interesting.

    > Here is my attempt at breaking it down, for anyone who is interested:
    > /cat(?:(?!cat|dog).)*dog/
    > 1 2 3 4 5 6
    >
    > 1. The engine starts scanning through the string until matches 'cat'.
    > 2. At the first position...


    This analysis was great (I had missed Tanaka's solution).

    > ...In any event, I think I still like Daniel's solution better,
    > because I can look at it and feel fairly certain that it will do
    > exactly what it should do.


    Josh
    I think both solutions are indestructible. I was curious to benchmark
    them; here, they showed interesting differences:

    a) for one-line sentences, where each substring cat...dog contains a few
    words, Tanaka's solution is approx 25% faster.

    b) for long paragraphs where cat...dog are separated by (say) 2 dozen
    words, Daniels' solution becomes faster by 20-25% (for longer intervals,
    the advantage can go up to 50% and more).

    c) For long paragraphs, where cat and dog were separated by only a few
    words, the performance is almost identical. So, it is not so much the
    length of the string, but the frequency of appearance of the keywords
    that is important.

    [Note: you need to add the /m modifier if your strings are paragraphs.
    In fact, it may be better to always write /m, so that you do not need to
    touch the regexp].

    The reason for the performance discrepancies is clear: for short gaps
    between the keywords (ie cat, dog), the trio scan+map+gsub exacts a
    toll. But for long paragraphs with many words between the keywords, it
    is instead the negative lookahead on 'cat' (at every character) which
    suffers.

    You may want to take this in account, depending on the type of text
    configurations you expect.

    In any case, both methods seem technically perfect.

    All the best, and thanks for that great summary,

    Raul

    --
    Posted via http://www.ruby-forum.com/.
     
    Raul Parolari, Dec 2, 2007
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Summercool
    Replies:
    13
    Views:
    703
    Dr.Ruud
    Feb 1, 2008
  2. Summercool
    Replies:
    22
    Views:
    426
    Ryan Holmes
    Aug 6, 2010
  3. Neil Morris
    Replies:
    1
    Views:
    112
    Lasse Reichstein Nielsen
    Jul 15, 2003
  4. Sherm Pendley

    need to negate regex in middle of expression

    Sherm Pendley, Jun 20, 2005, in forum: Perl Misc
    Replies:
    8
    Views:
    160
    Tad McClellan
    Jun 20, 2005
  5. Summercool
    Replies:
    14
    Views:
    199
    Dr.Ruud
    Feb 1, 2008
Loading...

Share This Page