Negate a character sequence in a regular expression?

crm_114 · Nov 29, 2007

For the following string:

'cat sheep horse cat tac dog'

I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

Obviously, if I write an expression like:

irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

it will match the entire string.

And the non-greedy Kleene doesn't buy me anything either since the
expression matches the first cat found anyway:

irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

What I think I want to do is to negate a sequence of characters,
rather than just a character class, but I have looked around and not
found anything quite right.

Of course, there are ways of hacking this out, e.g. I could reverse
the string first and match 'god' followed by the first instance of
'tac', but I am hoping there is a more elegant way to do this with a
single regular expression.

Thanks--

James Moore · Nov 29, 2007

I think this gets you close to where you want to be:

irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

For the following string:

'cat sheep horse cat tac dog'

I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

I think this gets you closer to where you want to be:

irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

MonkeeSage · Nov 30, 2007

For the following string:

'cat sheep horse cat tac dog'

I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

Obviously, if I write an expression like:

irb(main):002:0> /cat.*dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

it will match the entire string.

And the non-greedy Kleene doesn't buy me anything either since the
expression matches the first cat found anyway:

irb(main):003:0> /cat.*?dog/.match('cat sheep horse cat tac dog').to_s
=> "cat sheep horse cat tac dog"

What I think I want to do is to negate a sequence of characters,
rather than just a character class, but I have looked around and not
found anything quite right.

Of course, there are ways of hacking this out, e.g. I could reverse
the string first and match 'god' followed by the first instance of
'tac', but I am hoping there is a more elegant way to do this with a
single regular expression.

Thanks--

If you just want the right-most match of the substring prefixed by
'cat' and suffixed by 'dog', this should work: /.*(cat.*dog)/.match(s)
[1].

Regards,
Jordan

yermej · Nov 30, 2007

I think this gets you closer to where you want to be:

irb(main):045:0> 'cat sheep horse cat tac dog' =~ /cat(?!.*cat)(.*)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/ => 16
$1

=> "cat tac dog"

Daniel Sheppard · Nov 30, 2007

For the following string:

=20
'cat sheep horse cat tac dog'
=20
I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

Working out negative regular expressions is normally best avoided.

One step is hard. Two steps is not:

x =3D 'cat sheep horse cat tac dog'
/(cat.*?dog)/.match(x) && $1.sub(/.*cat/,'cat')

Or if you want multiple matches:

x =3D 'cat sheep horse cat tac dog cat cat sheep dog'
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
=3D> ["cat tac dog", "cat sheep dog"]

Dan.

Raul Parolari · Dec 1, 2007

yermej said:
Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
=> 16
$1
=> "cat tac dog"

Negative lookaheads that contain '.*' are hard to comprehend (at least
for me). It is enough to add a 'cat' at the end and the regexp does not
find any more the 'cat tac dog' that should be matched:

'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
=> nil

Also notice that, when it works, it will always give the last expression
present:
'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
=> 33
p $1 # => "cat dog"

Daniel said:
Working out negative regular expressions is normally best avoided.

I would say that it is true when they contain '.*' type expressions;
else they can be extremely useful.

Or if you want multiple matches:

x = 'cat sheep horse cat tac dog cat cat sheep dog'
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
=> ["cat tac dog", "cat sheep dog"]

Very ingenious...

Raul

yermej · Dec 1, 2007

yermej said:
yermej said:

Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

Click to expand...

'cat sheep horse cat tac dog' =~ /(cat(?!.*cat).*dog)/
=> 16
$1
=> "cat tac dog"

Click to expand...

Negative lookaheads that contain '.*' are hard to comprehend (at least
for me). It is enough to add a 'cat' at the end and the regexp does not
find any more the 'cat tac dog' that should be matched:

'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
=> nil

Also notice that, when it works, it will always give the last expression
present:
'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
=> 33
p $1 # => "cat dog"

Daniel said:

Working out negative regular expressions is normally best avoided.

Click to expand...

I would say that it is true when they contain '.*' type expressions;
else they can be extremely useful.

Or if you want multiple matches:

Click to expand...

x = 'cat sheep horse cat tac dog cat cat sheep dog'
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
=> ["cat tac dog", "cat sheep dog"]

Click to expand...

Very ingenious...

Raul

Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation. I guess
when I'm not completely sure about such things, I should start a new
thread, but attempting to answer questions here (some of which I do
get right) has been a great help to me in my own learning. I think
that starting new threads on all occasions would get to be too much
and I probably wouldn't get many responses. In the future, I may just
keep quiet until I'm either sure I have the correct answer or at least
don't understand why my answer isn't correct.

MonkeeSage · Dec 1, 2007

yermej wrote:

Click to expand...

Negative lookaheads that contain '.*' are hard to comprehend (at least
for me). It is enough to add a 'cat' at the end and the regexp does not
find any more the 'cat tac dog' that should be matched:

Click to expand...

'cat sheep horse cat tac dog lion cat' =~ /(cat(?!.*cat).*dog)/
=> nil

Click to expand...

Also notice that, when it works, it will always give the last expression
present:
'cat sheep horse cat tac dog lion cat dog' =~ /(cat(?!.*cat).*dog)/
=> 33
p $1 # => "cat dog"

Click to expand...

I would say that it is true when they contain '.*' type expressions;
else they can be extremely useful.

Or if you want multiple matches:
x = 'cat sheep horse cat tac dog cat cat sheep dog'
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}
=> ["cat tac dog", "cat sheep dog"]

Click to expand...

Click to expand...

Very ingenious...

Click to expand...

Raul

Click to expand...

Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation. I guess
when I'm not completely sure about such things, I should start a new
thread, but attempting to answer questions here (some of which I do
get right) has been a great help to me in my own learning. I think
that starting new threads on all occasions would get to be too much
and I probably wouldn't get many responses. In the future, I may just
keep quiet until I'm either sure I have the correct answer or at least
don't understand why my answer isn't correct.

Not to blow my own horn, but I think the behavior requested by the OP
is still /.*(cat.*dog)/...("a regular expression that matches any
substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'").

But don't feel like you have to be 100% correct in order to help out.
My answers are often bass-ackwards (not really a bragging point, heh).
But the point is, like you say, to learn from each other and help
where we can. None of us knows it all; all we can do is our best, and
pray it might help someone here or there.

Regards,
Jordan

Tanaka Akira · Dec 1, 2007

For the following string:

'cat sheep horse cat tac dog'

I would like to write a regular expression that matches any substring
that is prefixed by the word 'cat', is then followed by any characters
as long as those characters do not comprise the word 'cat', and then
finally suffixed by the string 'dog'. Therefore, this expression
should match the substring 'cat tac dog' in the above string.

% /usr/bin/ruby -e 'p /cat((?!cat).)*dog/.match("cat sheep horse cat tac dog").to_s'
"cat tac dog"

Raul Parolari · Dec 1, 2007

yermej said:
Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation.
In the future, I may just keep quiet until I'm either sure I have
the correct answer or at least
don't understand why my answer isn't correct.

yermej,

I often follow your postings, that I find very interesting. But do not
take hard this; no one can be perfect all of the time.

In this case, it is very easy to fall in the alluring trap of the
negative lookahead with '.*' (see how many people keep posting a variant
of that solution.. ?!).

However, I totally share one's feeling of dismay when we give an advice
that turns out not to be right (or just not completely right); we just
have to accept that we are fallible.

In this case hats off to Daniel Sheppard, who showed us how to do it (I
rewrite it to reinstate it, after all the emails that followed):

x = 'cat sheep horse cat tac dog cat cat sheep dog cat'

x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,'cat')}

=> ["cat tac dog", "cat sheep dog"]

Absolutely brilliant

Raul

crm_114 · Dec 2, 2007

Hello all - this was my first post, so thank you James, Jordan,
yermej, Daniel, Raul and Tanaka. I really appreciate the help. I've
learned quite a bit - especially about negative lookahead. Tricky
stuff.

As you might expect, I am working on a script to scan a big file for
multiple cat...dog pairs. Well not cats and dogs, more like time
stamps, etc.

I've tried all the solutions posted here and they work very nicely for
the string I posted. But I botched the original specification in at
least two ways:

1. I should have mentioned that there might be multiple cat...dog
pairs. Then James would probably have given me something looking like
Tanaka's solution, which uses negative lookahead but only consumes one
character at a time.

2. I should have mentioned that the characters between cat and dog
should contain neither 'cat' nor 'dog'. I'm really looking for the
'innermost' cat...dog pair.

So I tried Daniel's solution and that works pretty well. If there is
a way to break it I did not find it.

I also tweaked Tanaka's solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, and I added an alternate 'dog' to the negative
assertion. This is confusing but it seems to work:

irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
dog '
=> "cat sheep horse cat tac dog cat cat sheep dog dog "
irb(main):002:0> x.scan(/cat(?

?!cat|dog).)*dog/)
=> ["cat tac dog", "cat sheep dog"]

Here is my attempt at breaking it down, for anyone who is interested:
/cat(?

?!cat|dog).)*dog/
1 2 3 4 5 6

1. The engine starts scanning through the string until it matches
'cat'.

2. At the first position after the t in cat, the engine tries to
match this expression (?

?!cat|dog).), zero or more times. The (?:
is nothing special. It is just a grouping that does not generate
backreferences.

3. The negative lookahead assertion. If the regular expression
following the ! matches the string, starting from the current position
of the engine, then the expression to which the assertion belongs will
fail. But since the negative assertion belongs to a grouping that can
match 0 or more times, the assertion can fail but the full regular
expression can succeed.
The hard part for me is that the negative lookahead assertion does
not consume any characters. It looks ahead, tries to match its
regular expression, and whether it returns 'MATCH' or 'NO MATCH', the
current position of the engine stays the same.

4. The regexp for the negative assertion. The stuff between each
cat...dog pair should contain neither 'cat' nor 'dog'.

5. One character is consumed, the engine moves down the string one
position. Thanks to the negative lookahead, we know that that the
consumed character is not the start of a 'cat' or 'dog' substring.

6. Eventually, the (?

?!cat|dog).)* fails when the current position
of the engine reaches the beginning of a 'dog' substring. But then
the last part of the regular expression (6) will match.

---

Also, another way of getting the 'innermost' cat...dog pair is to use
non-greedy kleene star(*?). This way, the engine will take the first
'dog' suffix that it finds, rather than the last possible:

irb(main):003:0> x.scan(/cat(?

?!cat).)*?dog/)
=> ["cat tac dog", "cat sheep dog"]

So there is more than one way to scan a cat! (Sorry!!!)

....In any event, I think I still like Daniel's solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.

Thanks--
Josh

Raul Parolari · Dec 2, 2007

unknown said:
I also tweaked Tanaka's solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, Great

and I added an alternate 'dog' to the negative assertion.

Just using a non-greedy expression removes this complication (as you
realized later), although it made the following even more interesting.

Here is my attempt at breaking it down, for anyone who is interested:
/cat(??!cat|dog).)*dog/
1 2 3 4 5 6

1. The engine starts scanning through the string until matches 'cat'.
2. At the first position...

This analysis was great (I had missed Tanaka's solution).

...In any event, I think I still like Daniel's solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.

Josh
I think both solutions are indestructible. I was curious to benchmark
them; here, they showed interesting differences:

a) for one-line sentences, where each substring cat...dog contains a few
words, Tanaka's solution is approx 25% faster.

b) for long paragraphs where cat...dog are separated by (say) 2 dozen
words, Daniels' solution becomes faster by 20-25% (for longer intervals,
the advantage can go up to 50% and more).

c) For long paragraphs, where cat and dog were separated by only a few
words, the performance is almost identical. So, it is not so much the
length of the string, but the frequency of appearance of the keywords
that is important.

[Note: you need to add the /m modifier if your strings are paragraphs.
In fact, it may be better to always write /m, so that you do not need to
touch the regexp].

The reason for the performance discrepancies is clear: for short gaps
between the keywords (ie cat, dog), the trio scan+map+gsub exacts a
toll. But for long paragraphs with many words between the keywords, it
is instead the negative lookahead on 'cat' (at every character) which
suffers.

You may want to take this in account, depending on the type of text
configurations you expect.

In any case, both methods seem technically perfect.

All the best, and thanks for that great summary,

Raul

Help in hangman game	1	Jul 24, 2023
Regular Expression for the special character "\|" pipe	7	May 27, 2014
Unwanted collector in regular expression	2	Apr 1, 2011
Regular expression	12	May 29, 2009
substitution in a regular expression	3	Mar 11, 2011
FAQ 6.20 What good is "\G" in a regular expression?	0	Mar 3, 2011
Regular expressions, capture repeated groups	4	Jul 8, 2010
About a regular expression	5	Nov 26, 2007

Negate a character sequence in a regular expression?

crm_114

James Moore

MonkeeSage

yermej

Daniel Sheppard

Raul Parolari

yermej

MonkeeSage

Tanaka Akira

Raul Parolari

crm_114

Raul Parolari

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads