Regex losing <br> (different from the earlier topic about losing $1)

Discussion in 'Perl Misc' started by Jason C, Jun 22, 2012.

  1. Jason C

    Jason C Guest

    I'm building a profanity filter, and I'm using the following subroutine to replace matched words with XXXX:

    while (($original, $converted) = @profanityArr) {
        if (!$converted) {
            $len = length($original);
            $converted = "X" x $len;
        }

        $original = quotemeta($original);

        $text =~ s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i;
    }


    # When I feed:
    $original = "daym";

    $text = "<br><br>daym<br><br>";
    ###

    I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1 and $2.

    # When I feed:
    $original = "jason";
    $converted = "brainfried";

    $text = "<br><br>jason<br><br>";
    ###

    I'm getting "<br>brainfried<br>". Again, it loses the matched <br> in both $1 and $2.


    # When I feed:
    $original = "dammit";
    $converted = "XXXXit";

    $text = "<br><br>dammit<br><br>";
    ###

    I'm getting "<br>XXXXit<br><br>". Meaning, it loses the matched <br> in $1, but keeps it in $2.

    It's the same if I change $1 and $2 to \1 and \2.

    Any suggestions on how to correct the sub to keep the matched <br>?
    Jason C, Jun 22, 2012
    #1

  2. Jan Pluntke

    Jan Pluntke Guest

    "Jason C" <> wrote:

    > $text =~
    > s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i;

    [...]
    > # When I feed:
    > $original = "daym";
    >
    > $text = "<br><br>daym<br><br>";
    > ###
    >
    > I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1
    > and $2.


    You will want to capture the * also, otherwise $1 and $2 will
    contain only one (the last) match for that part of the string:

    ((?:\r|\n|\r\n|<br>|\s)*)

    The ?: will make the inner () non-capturing.

    I think (but might be wrong - did not test) that \s contains
    \r and \n, so you can remove them:

    ((?:<br>|\s)*)
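
A minimal sketch of the difference described above, using the first sample input from the question. The variable names `$lossy` and `$kept` are only illustrative; the pattern is shortened to `<br>|\s` on the assumption that `\s` already covers `\r` and `\n`.

```perl
use strict;
use warnings;

my $text = "<br><br>daym<br><br>";

# Quantifier outside the capture group: each repetition overwrites the
# capture, so $1 and $2 hold only the last <br> that was matched.
(my $lossy = $text) =~ s/(<br>|\s)*daym(<br>|\s)*/${1}XXXX${2}/i;

# Quantifier inside the capture group (inner group is non-capturing):
# $1 and $2 hold the whole run of delimiters.
(my $kept = $text) =~ s/((?:<br>|\s)*)daym((?:<br>|\s)*)/${1}XXXX${2}/i;

print "$lossy\n";   # <br>XXXX<br>
print "$kept\n";    # <br><br>XXXX<br><br>
```

The first substitution reproduces the output reported in the question; the second keeps every `<br>` on both sides.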

    Regards,
    Jan
    Jan Pluntke, Jun 22, 2012
    #2

  3. Jason C

    Jason C Guest

    On Friday, June 22, 2012 1:37:51 AM UTC-4, Jan Pluntke wrote:
    > You will want to capture the * also, otherwise $1 and $2 will
    > contain only one (the last) match for that part of the string:
    >
    > ((?:\r|\n|\r\n|<br>|\s)*)
    >
    > The ?: will make the inner () non-capturing.


    Excellent! I was not familiar with the ?:, so I'll have to make a note of that for future reference.


    > I think (but might be wrong - did not test) that \s contains
    > \r and \n, so you can remove them:
    >
    > ((?:<br>|\s)*)
    >
    > Regards,
    > Jan


    Correct again! I thought that \s just captured the space, and didn't realize that it includes line breaks (and apparently tabs, too). I can modify all of my scripts for that, now, and save a little bandwidth :)
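
A quick way to check which characters `\s` covers is to test each one on its own. This is only a sketch; the labels and the `%is_whitespace` hash are illustrative names, not anything from the original script.

```perl
use strict;
use warnings;

# Each pair is a label and the single character to test against \s.
my @samples = (
    [ 'space'           => " "  ],
    [ 'tab'             => "\t" ],
    [ 'newline'         => "\n" ],
    [ 'carriage return' => "\r" ],
    [ 'form feed'       => "\f" ],
    [ 'letter x'        => "x"  ],
);

my %is_whitespace;
for my $sample (@samples) {
    my ($label, $char) = @$sample;
    # \A and \z anchor the test to the whole one-character string.
    $is_whitespace{$label} = $char =~ /\A\s\z/ ? 1 : 0;
    printf "%-16s %s\n", $label,
        $is_whitespace{$label} ? 'matches \s' : 'does not match \s';
}

# A Windows-style line ending is simply two whitespace characters in a row.
my $crlf_ok = "\r\n" =~ /\A\s+\z/ ? 1 : 0;
print "CRLF matches \\s+: $crlf_ok\n";   # CRLF matches \s+: 1
```

Everything in the list except the letter matches, which is why `\r|\n|\r\n` can be dropped from an alternation that already contains `\s`.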

    Thanks for the help!
    Jason C, Jun 22, 2012
    #3
  4. Jason C

    Jason C Guest

    On Friday, June 22, 2012 5:07:40 AM UTC-4, Ben Morrow wrote:
    > Unless $original is supposed to be a regex, you want \Q\E around it.


    I originally did this in the function:

    $original = quotemeta($original);
    $text =~ ...;

    Is there a difference between quotemeta() and \Q\E?
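
For what it's worth, `\Q...\E` inside a pattern or double-quoted string applies `quotemeta` to the interpolated text, so the two approaches should produce the same pattern. A small sketch, with `$word` as a made-up example containing regex metacharacters:

```perl
use strict;
use warnings;

my $word = "a.b*c";   # hypothetical filter word with regex metacharacters

my $via_function = quotemeta($word);   # escape up front
my $via_escape   = "\Q$word\E";        # escape during interpolation

my $same = $via_function eq $via_escape ? 1 : 0;

# Unquoted, "." and "*" act as regex operators and match unrelated text.
my $unquoted_hit = "axbbbc" =~ /$word/     ? 1 : 0;
my $quoted_hit   = "axbbbc" =~ /\Q$word\E/ ? 1 : 0;
my $literal_hit  = "a.b*c"  =~ /\Q$word\E/ ? 1 : 0;

print "$via_function\n";              # a\.b\*c
print "same: $same\n";                # same: 1
print "unquoted: $unquoted_hit\n";    # unquoted: 1
print "quoted: $quoted_hit\n";        # quoted: 0
print "literal: $literal_hit\n";      # literal: 1
```

The practical difference is only where the escaping happens: `quotemeta` changes the variable itself, while `\Q...\E` leaves it untouched and escapes it at the point of use.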


    > You don't really need the final capture, you can just use lookahead.
    > Similarly you don't need to capture more than one \s just to put it back
    > again:
    >
    > s/(\s|<br>) \Q$original\E (?= \s|<br>)/$1$converted/ix;
    >
    > Turning the initial capture into lookbehind is harder, since Perl
    > doesn't support variable-length lookbehind and the two branches of the
    > alternation are different lengths. However, if you have at least 5.10
    > (which you do, I hope), you can use \K like this:
    >
    > s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;


    I'm afraid that you went just a little over my head on that one. What does the \K do? And what does (?=\s|<br>) do differently from (?:\s|<br>)? Or are they the same?
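
The three constructs can be compared side by side on one input. This is a sketch with a made-up sample string and illustrative variable names; `\K` assumes Perl 5.10 or later, as noted in the quoted post.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

my $text = "<br>daym<br>ok";   # hypothetical sample input

# (?:...) groups without capturing but still consumes what it matches,
# so both <br> tags become part of the match and are replaced away.
(my $consumed = $text) =~ s/(?:\s|<br>)daym(?:\s|<br>)/XXXX/i;

# (?=...) is a lookahead: it checks what follows without consuming it,
# so only the leading delimiter has to be captured and put back.
(my $captured = $text) =~ s/(\s|<br>)daym(?=\s|<br>)/${1}XXXX/i;

# \K drops everything matched so far from the part being replaced,
# so the leading delimiter survives without any capture group.
(my $kept = $text) =~ s/(?:\s|<br>)\Kdaym(?=\s|<br>)/XXXX/i;

print "$consumed\n";   # XXXXok
print "$captured\n";   # <br>XXXX<br>ok
print "$kept\n";       # <br>XXXX<br>ok
```

So `(?:...)` and `(?=...)` are not the same: the first is just grouping, the second is an assertion about the text that comes next.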

    This is slightly different, but how do I include "or at the beginning of the string" in that regex?

    I don't think that this would work, would it?

    ((?:^|\s|<br>)*)

    For this purpose, I'm specifically converting a string of "www.example.com" to "http://www.example.com". A string like

    $text = "Go to www.example.com";

    matches, but

    $text = "www.example.com<br><br>click here";

    doesn't.

    Further, in this case I don't want it to match when the www is between other characters (so that it doesn't change "http://www" to "http://http://www"), so I think I'll have to use a totally different regex without the trailing part. But I still need to figure out how to make it match if it follows a \s, a <br>, or is at the beginning of the string.
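
One possible approach, sketched under the assumption that Perl 5.10+ is available: put `^` inside the alternation as one more "what may come before" option, and require exactly one of the options rather than zero or more. With a `*` on the group, as in `((?:^|\s|<br>)*)`, the group can match nothing at all, so the pattern would also fire in the middle of "http://www". The helper name `add_scheme` is hypothetical.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

# Hypothetical helper: prefix "http://" to any "www." that starts the
# string or directly follows whitespace or a <br> tag.
sub add_scheme {
    my ($text) = @_;
    # Exactly one of: start of string, a whitespace character, or <br>.
    # \K keeps that prefix out of the replaced text.
    $text =~ s{(?:^|\s|<br>)\Kwww\.}{http://www.}g;
    return $text;
}

print add_scheme("Go to www.example.com"), "\n";
# Go to http://www.example.com
print add_scheme("www.example.com<br><br>click here"), "\n";
# http://www.example.com<br><br>click here
print add_scheme("see http://www.example.com"), "\n";
# see http://www.example.com
```

The last call is left unchanged because the `www.` there follows a `/`, which is none of the three allowed prefixes.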
    Jason C, Jun 23, 2012
    #4
  5. Morty Abzug

    Morty Abzug Guest

    In article <>,
    "Ben Morrow " <> spake thusly:
    >
    > s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;


    In addition to Jan and Ben's excellent suggestions, please note that
    the patterns don't need to be applied in a loop. You can do something
    like this:

    my $regex=join "|", map quotemeta, @profanityArr;
    $text =~ s{ (?:\s|<br>) \K ($regex) (?=\s|<br>) }{"X" x length $1}ex;

    The "|" lets you match alternatives in a single regex, while the /e
    flag evaluates the replacement as a Perl expression and substitutes
    its result.
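
A fuller sketch of the single-pass idea, with several assumptions that are not in the posts above: the word list is held in a hypothetical hash `%replacement_for` (an undefined value means "mask with X"), `/g` and `/i` are added so every occurrence is replaced regardless of case, and `\A` and `\z` let a word match at the very start or end of the text.

```perl
use strict;
use warnings;
use 5.010;   # \K and the // operator need Perl 5.10 or later

# Hypothetical word list: a defined value is the custom replacement,
# undef means "replace with one X per character".
my %replacement_for = (
    daym   => undef,
    jason  => 'brainfried',
    dammit => 'XXXXit',
);

# Longest words first, so a longer word wins over a shorter one that
# it happens to start with. quotemeta keeps each word literal.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %replacement_for;

sub filter_text {
    my ($text) = @_;
    $text =~ s{ (?: \A | \s | <br> ) \K ($alternation) (?= \s | <br> | \z ) }
              { $replacement_for{lc $1} // ('X' x length $1) }gexi;
    return $text;
}

print filter_text("<br><br>daym<br><br>"), "\n";   # <br><br>XXXX<br><br>
print filter_text("Jason said dammit"), "\n";      # brainfried said XXXXit
```

Because the delimiters are never part of the replaced text, the `<br>` tags on both sides are left exactly as they were.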

    As I'm sure you know, folks who want to bypass the filters can usually
    figure out ways around them. :)

    - Morty
    Morty Abzug, Jun 26, 2012
    #5
