"negative" regex matching?

seven.reeds · Dec 4, 2009

Hi,

I have a regex question. I have arbitrary text and I want to search
it for a set of terms/substrings. In the simple case of one term
it is easy to find the match(es) and then mark them up with HTML
"span" tags. My issue is with more than one term.

Here is an example to illustrate. If I have the string:

Sarah likes Johnny's cooking

and the single term: "john" then I can match and highlight the match
resulting in:

Sarah likes <span>John</span>ny's cooking

Now what if I have two terms: "Johnny" & "john" -- in that order? I
can easily let myself end up with (in sequence):

<apply Johnny match>
Sarah likes <span>Johnny</span>'s cooking
<apply john match>
Sarah likes <span><span>John</span>ny</span>'s cooking

Ok, so what I want is to be able to search for and mark each term in
the string as long as that term is not already in a "span" clause.

I've done some digging in Friedl's RegEx book but I'm not sure if I
know enough to know what I am looking for?

ideas?
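
The double-marking problem described above can be reproduced with a short sketch. The helper name mark_sequentially is hypothetical, and case-insensitive matching is an assumption, made because the term "john" is expected to hit "Johnny".

```perl
use strict;
use warnings;

# Hypothetical helper: apply each term in its own pass, wrapping every
# hit in <span> tags. Later passes can match inside earlier markup.
sub mark_sequentially {
    my ($text, @terms) = @_;
    for my $term (@terms) {
        # \Q...\E treats the term as literal text, not as a regex
        $text =~ s{(\Q$term\E)}{<span>$1</span>}gi;
    }
    return $text;
}

print mark_sequentially("Sarah likes Johnny's cooking", 'Johnny', 'john'), "\n";
# prints: Sarah likes <span><span>John</span>ny</span>'s cooking
```

The second pass finds "John" inside the span produced by the first pass, which is exactly the nesting the question wants to avoid.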

sln · Dec 5, 2009

Is this what you are trying to do?

rxhtml.pl
-sln

----------------
use strict;
use warnings;

## globs ..

my $string = "
<apply Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
";

## code ..

# use terms: Johnny,john
if ( getMatch( $string,'span','Johnny|john')) # add mods in term's
{ print "Matched:\n'$string'\n\n" }
else
{ print "No match.\n\n" }

# use terms: King,john .. case insensitive
if ( getMatch( $string,'span','(?i)King|john'))
{ print "Matched:\n'$string'\n\n" }
else
{ print "No match.\n\n" }

exit(0);

## subs ..

sub getMatch {
my ($tag,$terms) = @_[1,2];
$_[0] =~ s {(?<!<$tag>)(.*)($terms)(?!.*</?$tag>)}
{$1<$tag>$2</$tag>}g;
}
__END__

Matched:
'
<apply Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
'

Matched:
'
<apply Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
'

sln · Dec 6, 2009

Is this what you are trying to do?

Yeah, but don't do this; it doesn't work.
-sln

sln · Dec 6, 2009

Earlier I posted a regex built on plain look-ahead/look-behind assertions.
That won't work, because look-behind has to be fixed width, so it cannot
check whether a term sits somewhere inside an already open tag.
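
To make the look-behind limitation concrete, here is a small sketch. The helper name regex_compile_error is hypothetical; the patterns are passed as strings so that they are compiled at run time and a failure can be trapped with eval. The exact error text differs between Perl versions, so it is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: try to compile a pattern given as a string and
# return the error text ('' on success) instead of dying.
sub regex_compile_error {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? '' : $@;
}

# Fixed-width look-behind compiles, but it only inspects the six
# characters immediately before the term.
print "fixed-width: ",
    (regex_compile_error('(?<!<span>)john') ? "rejected" : "ok"), "\n";

# "Not preceded by an open <span> anywhere" would need an unbounded
# look-behind, which Perl refuses to compile.
print "unbounded:   ",
    (regex_compile_error('(?<!<span>.*)john') ? "rejected" : "ok"), "\n";
```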

So this, friend, is a bulletproof way to do what you want.
Finally, a use for the new 5.10 regex recursion, which allows
for nested tags.

I've thoroughly tested this code, within the usual constraints of
parsing markup with a regex (i.e. the markup must be valid), but that's
the compromise you are making for speed.

The regex will go along happily matching tags (in a nested fashion),
or, the terms you specify.

If any terms are inside of the tags (even nested), they are consumed
without any substitution (ie: they are left alone). The only thing
left to match are the terms themselves.

Both are matched, nested tags or terms, in a single alternation (one or the other).
The tags don't need to be substituted back in from their capture group,
because the new '\K' excludes everything matched before it (i.e. the tags)
from the text that gets replaced.
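
As a minimal illustration of '\K' on its own, separate from the tag problem, consider the sketch below. The helper name mask_after_label and the "id=" sample text are made up for the example.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper: "id=" must be present for a match, but \K drops
# it from the matched span, so only the digits are replaced.
sub mask_after_label {
    my ($text) = @_;
    $text =~ s/id=\K\d+/N/g;
    return $text;
}

print mask_after_label("id=42 id=7 x=9"), "\n";
# prints: id=N id=N x=9
```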

Read about the new extended expressions
here -> 'perlre' in perldocs.

Also, in addition to tags, tag-attribute form is included as well:
<$tag></$tag> or <$tag attrib></$tag>.
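
The "consume the tags, or else match a term" idea can be sketched in a simplified form. This is not the posted code: it assumes tags are never nested, uses neither recursion nor '\K', and the helper name mark_outside_tags is hypothetical. It also returns the new text instead of modifying its argument in place.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each term hit in <$tag>...</$tag>, but leave
# anything already inside a (non-nested) <$tag> element untouched.
# $terms is a regex fragment such as 'Johnny|john'.
sub mark_outside_tags {
    my ($text, $tag, $terms) = @_;
    my $count = 0;
    $text =~ s{
        ( <\Q$tag\E\b[^>]*> .*? </\Q$tag\E> )   # 1: an existing element, consumed whole
      |
        ( $terms )                              # 2: a term outside any element
    }{
        defined $1 ? $1 : do { $count++; "<$tag>$2</$tag>" }
    }xsge;
    return ($text, $count);
}

my ($out, $n) = mark_outside_tags(
    "Sarah likes <span>Johnny</span>'s cooking, and so does john",
    'span', 'Johnny|john');
print "$n: $out\n";
# prints: 1: Sarah likes <span>Johnny</span>'s cooking, and so does <span>john</span>
```

Because the existing element is the first alternative, the engine swallows it in one step and never gets a chance to match a term inside it.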

Good luck!
-sln

-------------------
Output:
String =
'
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking

Because Johnny does good cooking

King John
'

Terms =

Johnny|john - replaced 5
'
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking

Because Johnny does good cooking

King John
'

(?i)King|john - replaced 4
'
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking

Because Johnny does good cooking

King John
'
---------------------------------

use strict;
use warnings;
require 5.010_000;

## globs ..

my ($string, $result) =
qq{
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking

Because Johnny does good cooking

King John
};

## code ..

print "\nString = \n'$string'\n\nTerms =\n";

print "\nJohnny|john - replaced ";
#
$result = getMatch( $string, 'span', 'Johnny|john');
print "$result\n";
print "'$string'\n" if $result;

print "\n(?i)King|john - replaced ";
#
$result = getMatch( $string, 'span', '(?i)King|john'); # case insensitive
print "$result\n";
print "'$string'\n" if $result;

exit(0);

## subs ..

sub getMatch
{
#* USES REGEX RECURSION '(?PARNO)', e.g. '(?1)', new to 5.10
#* Start/End tags must have this specific form:
#* <$tag></$tag> or <$tag attrib></$tag>
#* --------------------------------------
my ($tag,$terms) = @_[1,2];
my $start = "<$tag(?:\\s+|>)"; # allow <tag> or <tag attribute>
my $end = "</$tag>";

my $replaced = 0;

$_[0] =~ s
{ # match ..

( # 1
$start
(?:
(?:(?!$start|$end).)++ # no backtracking
|
(?1) # recurse group 1
)*
$end
)
\K # efficient -- don't include tag data in match
|
( # 2
$terms
)
}

{ # replace ..
$replaced++, "<$tag>".$2."</$tag>" if defined $2
}xsge;

return $replaced;
}

__END__

seven.reeds · Dec 11, 2009

s{(Johnny|john)} {<span>$1</span>}gi;

Hi Ted, this was perfect. I was way over-thinking this.

Thanks
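
For readers landing here later, the single-pass approach in the quoted substitution can be generalised to a list of terms. The sketch below is an assumption-laden illustration: the helper name highlight_terms is hypothetical, the hits are assumed to be wrapped in plain <span> tags, and terms are treated as literal text.

```perl
use strict;
use warnings;

# Hypothetical helper: highlight all terms in ONE pass, so no term can
# match inside markup added for another term.
sub highlight_terms {
    my ($text, @terms) = @_;
    return $text unless @terms;    # an empty alternation would match everywhere

    # Longest first, so "Johnny" is tried before "john" at the same position
    my $alt = join '|',
        map  { quotemeta $_ }
        sort { length($b) <=> length($a) } @terms;

    $text =~ s{($alt)}{<span>$1</span>}gi;
    return $text;
}

print highlight_terms("Sarah likes Johnny's cooking, says john", 'john', 'Johnny'), "\n";
# prints: Sarah likes <span>Johnny</span>'s cooking, says <span>john</span>
```

Since the regex engine moves past each match before looking for the next, every stretch of text is marked at most once, which is why no "negative" matching is needed.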
