"negative" regex matching?

Discussion in 'Perl Misc' started by seven.reeds, Dec 4, 2009.

  1. seven.reeds

    seven.reeds Guest

    Hi,

    I have a regex question. I have arbitrary text and I want to search
    it for a set of terms/substrings. In the simple case of one term
    it is easy to find the match(es) and then mark them up with HTML
    "span" tags. My issue is with more than one term.

    Here is an example to illustrate. If I have the string:

    Sarah likes Johnny's cooking

    and the single term: "john" then I can match and highlight the match
    resulting in:

    Sarah likes <span>John</span>ny's cooking

    Now what if I have two terms: "Johnny" & "john" -- in that order? I
    can easily let myself end up with (in sequence):

    <apply Johnny match>
    Sarah likes <span>Johnny</span>'s cooking
    <apply john match>
    Sarah likes <span><span>John</span>ny</span>'s cooking

    Ok, so what I want is to be able to search for and mark each term in
    the string as long as that term is not already in a "span" clause.

    I've done some digging in Friedl's RegEx book but I'm not sure if I
    know enough to know what I am looking for?

    ideas?
     
    seven.reeds, Dec 4, 2009
    #1

  2. seven.reeds

    Guest

    On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <> wrote:

    >Hi,
    >
    >I have a regex question. I have arbitrary text and I want to search
    >it for a set of terms/substrings. In the simple case of one term
    >it is easy to find the match(es) and then mark them up with HTML
    >"span" tags. My issue is with more than one term.
    >
    >Here is an example to illustrate. If I have the string:
    >
    > Sarah likes Johnny's cooking
    >
    >and the single term: "john" then I can match and highlight the match
    >resulting in:
    >
    > Sarah likes <span>John</span>ny's cooking
    >
    >Now what if I have two terms: "Johnny" & "john" -- in that order? I
    >can easily let myself end up with (in sequence):
    >
    > <apply Johnny match>
    > Sarah likes <span>Johnny</span>'s cooking
    > <apply john match>
    > Sarah likes <span><span>John</span>ny</span>'s cooking
    >
    >Ok, so what I want is to be able to search for and mark each term in
    >the string as long as that term is not already in a "span" clause.
    >
    >I've done some digging in Friedl's RegEx book but I'm not sure if I
    >know enough to know what I am looking for?
    >
    >ideas?


    Is this what you are trying to do?

    rxhtml.pl
    -sln

    ----------------
    use strict;
    use warnings;

    ## globs ..

    my $string = "
    <apply Johnny match>
    Sarah likes Johnny's cooking
    <apply john match>
    Sarah likes Johnny's cooking
    ";

    ## code ..

    # use terms: Johnny,john
    if ( getMatch( $string,'span','Johnny|john')) # modifiers such as (?i) can be embedded in the terms
    { print "Matched:\n'$string'\n\n" }
    else
    { print "No match.\n\n" }

    # use terms: King,john .. case insensitive
    if ( getMatch( $string,'span','(?i)King|john'))
    { print "Matched:\n'$string'\n\n" }
    else
    { print "No match.\n\n" }

    exit(0);

    ## subs ..

    sub getMatch {
        my ($tag,$terms) = @_[1,2];
        # $_[0] aliases the caller's string, so it is modified in place
        $_[0] =~ s {(?<!<$tag>)(.*)($terms)(?!.*</?$tag>)}
                   {$1<$tag>$2</$tag>}g;
    }
    __END__

    Matched:
    '
    <apply <span>Johnny</span> match>
    Sarah likes <span>Johnny</span>'s cooking
    <apply <span>john</span> match>
    Sarah likes <span>Johnny</span>'s cooking
    '

    Matched:
    '
    <apply <span>Johnny</span> match>
    Sarah likes <span>Johnny</span>'s coo<span>king</span>
    <apply <span>john</span> match>
    Sarah likes <span>Johnny</span>'s coo<span>king</span>
    '
     
    , Dec 5, 2009
    #2

  3. seven.reeds

    Guest

    On Sat, 05 Dec 2009 12:45:14 -0800, wrote:

    >On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <> wrote:
    >
    >>ideas?

    >
    >This what you are trying to do?
    >


    Yeah, but don't do this; it doesn't work.
    -sln
     
    , Dec 6, 2009
    #3
  4. seven.reeds

    Guest

    On Fri, 4 Dec 2009 14:50:59 -0800 (PST), "seven.reeds" <> wrote:

    >Hi,
    >
    >I have a regex question. I have arbitrary text and I want to search
    >it for a set of terms/substrings. In the simple case of one term
    >it is easy to find the match(es) and then mark them up with HTML
    >"span" tags. My issue is with more than one term.
    >

    [snip]
    >
    >Ok, so what I want is to be able to search for and mark each term in
    >the string as long as that term is not already in a "span" clause.
    >
    >I've done some digging in Friedl's RegEx book but I'm not sure if I
    >know enough to know what I am looking for?
    >
    >ideas?


    Earlier I posted a regex built on plain look-ahead/look-behind assertions.
    It won't work, because look-behind has to be fixed width.
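
    The fixed-width restriction can be seen with a minimal sketch. The pattern below is a hypothetical one (not the regex from the earlier post) that tries to look back over an arbitrary amount of text for an opening tag. It is built from a string so that it compiles at run time, where eval can trap the failure. Newer perls accept some bounded variable-length look-behinds, but an unbounded `.*` is still expected to be rejected.

```perl
use strict;
use warnings;

# Hypothetical pattern: "john" not preceded, anywhere earlier, by "<span>".
# The unbounded .* inside the look-behind is what perl refuses to compile.
my $pattern = '(?<!<span>.*)john';

# Interpolating a variable defers compilation to run time,
# so the compile error becomes a trappable exception.
my $compiled = eval { qr/$pattern/ };

print defined $compiled
    ? "pattern compiled\n"
    : "pattern rejected: $@";
```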

    So this, friend, is a bulletproof way to do what you want.
    Finally, a use for the new 5.10 regex recursion code, which allows
    for nested tags.

    I've thoroughly tested this code, within the usual constraints of
    parsing markup with a regex (i.e. the markup must be valid), but that's
    the compromise you are making for speed.

    The regex goes along happily matching either tags (including nested
    ones) or the terms you specify.

    Any terms inside the tags (even nested ones) are consumed as part of
    the tag match without substitution (i.e. they are left alone). The only
    terms left to match are the ones outside the tags.

    The two cases, nested tags or terms, form an alternation (one or the other).
    The matched tags are not rewritten, even though they sit in a capture group,
    because the new '\K' excludes them from the text that gets replaced.
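
    The "skip the tag, or capture the term" alternation can be tried in a smaller form. This is only a sketch under simplifying assumptions: `mark_terms` is a hypothetical helper, the tag name is hard-coded to span, and nested spans are not handled (that is what the recursion in the full version is for).

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper, not from the thread: wrap each term in <span>
# unless it already sits inside a (non-nested) <span>...</span> element.
sub mark_terms {
    my ($text, $terms) = @_;
    $text =~ s{
        <span\b[^>]*> .*? </span> \K   # skip an existing span and keep it out of the match
      |
        ($terms)                       # otherwise capture a bare term
    }{ defined $1 ? "<span>$1</span>" : "" }xsge;
    return $text;
}

my $marked = mark_terms(
    "Sarah likes <span>Johnny</span>'s cooking, says john",
    'Johnny|john',
);
print "$marked\n";
# prints: Sarah likes <span>Johnny</span>'s cooking, says <span>john</span>
```

    Running the helper a second time over its own output should change nothing, since every term is by then inside a span and is consumed by the first alternative.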

    Read about the new extended patterns in 'perlre' in the perldocs.

    Besides bare tags, the tag-with-attributes form is handled as well:
    <$tag></$tag> or <$tag attrib></$tag>.

    Good luck!
    -sln

    -------------------
    Output:
    String =
    '
    <apply john Johnny match>
    Sarah likes Johnny's cooking
    <apply john match>
    Sarah likes Johnny's cooking
    <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
    </span>
    King John
    '

    Terms =

    Johnny|john - replaced 5
    '
    <apply <span>john</span> <span>Johnny</span> match>
    Sarah likes <span>Johnny</span>'s cooking
    <apply <span>john</span> match>
    Sarah likes <span>Johnny</span>'s cooking
    <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
    </span>
    King John
    '

    (?i)King|john - replaced 4
    '
    <apply <span>john</span> <span>Johnny</span> match>
    Sarah likes <span>Johnny</span>'s coo<span>king</span>
    <apply <span>john</span> match>
    Sarah likes <span>Johnny</span>'s coo<span>king</span>
    <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
    </span>
    <span>King</span> <span>John</span>
    '
    ---------------------------------

    use strict;
    use warnings;
    require 5.010_000;

    ## globs ..

    my ($string, $result) =
    qq{
    <apply john Johnny match>
    Sarah likes Johnny's cooking
    <apply john match>
    Sarah likes Johnny's cooking
    <span id="medium_rectangle" class="_fwph">
    Because Johnny does good cooking
    </span>
    King John
    };

    ## code ..

    print "\nString = \n'$string'\n\nTerms =\n";

    print "\nJohnny|john - replaced ";
    #
    $result = getMatch( $string, 'span', 'Johnny|john');
    print "$result\n";
    print "'$string'\n" if $result;

    print "\n(?i)King|john - replaced ";
    #
    $result = getMatch( $string, 'span', '(?i)King|john'); # case insensitive
    print "$result\n";
    print "'$string'\n" if $result;

    exit(0);


    ## subs ..

    sub getMatch
    {
        #* USES RX RECURSION '(?1)', new to 5.10
        #* Start/End tags must have this specific form:
        #* <$tag></$tag> or <$tag attrib></$tag>
        #* --------------------------------------
        my ($tag,$terms) = @_[1,2];
        my $start = "<$tag(?:\\s+|>)"; # allow <tag> or <tag attribute>
        my $end = "</$tag>";

        my $replaced = 0;

        # $_[0] aliases the caller's string, so it is modified in place
        $_[0] =~ s
        { # match ..

            ( # 1
                $start
                (?:
                    (?:(?!$start|$end).)++ # possessive: no backtracking
                    |
                    (?1) # recurse group 1
                )*
                $end
            )
            \K # keep the tag text out of the match so it is left untouched
            |
            ( # 2
                $terms
            )
        }

        { # replace ..
            $replaced++, "<$tag>".$2."</$tag>" if defined $2
        }xsge;

        return $replaced;
    }

    __END__
     
    , Dec 6, 2009
    #4
  5. seven.reeds

    seven.reeds Guest

    >
    >     s{(Johnny|john)}  {<span>$1</span>}gi;
    >


    Hi Ted, this was perfect. I was way over-thinking this.

    Thanks
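
    The one-pass substitution quoted above avoids nested spans because each position in the text is matched at most once, and listing "Johnny" before "john" lets the longer term win. A minimal sketch of the same idea for an arbitrary term list follows. The `highlight` helper and the sort by length are assumptions added for illustration, not part of the quoted answer.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation, longest terms first,
# so "Johnny" is tried before "john" at the same position.
sub highlight {
    my ($text, @terms) = @_;
    my $alt = join '|',
              map  { quotemeta }                      # treat terms as literal text
              sort { length($b) <=> length($a) }      # longest first
              @terms;
    $text =~ s{($alt)}{<span>$1</span>}gi;            # single pass, case-insensitive
    return $text;
}

my $out = highlight("Sarah likes Johnny's cooking", 'john', 'Johnny');
print "$out\n";
# prints: Sarah likes <span>Johnny</span>'s cooking
```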
     
    seven.reeds, Dec 11, 2009
    #5
