Regex losing <br> (different from the earlier topic about losing $1)

Jason C

I'm building a profanity filter, and I'm using the following subroutine to replace matched words with XXXX:

while (($original, $converted) = @profanityArr) {
if (!$converted) {
$len = length($original);
$converted = "X" x $len;
}

$original = quotemeta($original);

$text =~ s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i;
}


# When I feed:
$original = "daym";

$text = "<br><br>daym<br><br>";
###

I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1 and $2.

# When I feed:
$original = "jason";
$converted = "brainfried";

$text = "<br><br>jason<br><br>";
###

I'm getting "<br>brainfried<br>". Again, it loses the matched <br> in both $1 and $2.


# When I feed:
$original = "dammit";
$converted = "XXXXit";

$text = "<br><br>dammit<br><br>";
###

I'm getting "<br>XXXXit<br><br>". Meaning, it loses the matched <br> in $1, but keeps it in $2.

It's the same if I change $1 and $2 to \1 and \2.

Any suggestions on how to correct the sub to keep the matched <br>?
 
Jan Pluntke

Jason C said:
$text =~
s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i; [...]
# When I feed:
$original = "daym";

$text = "<br><br>daym<br><br>";
###

I'm getting "<br>XXXX<br>". Meaning, it loses the matched <br> in both $1
and $2.

You will want to capture the * also, otherwise $1 and $2 will
contain only one (the last) match for that part of the string:

((?:\r|\n|\r\n|<br>|\s)*)

The ?: will make the inner () non-capturing.

I think (but might be wrong - did not test) that \s contains
\r and \n, so you can remove them:

((?:<br>|\s)*)
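
As a minimal sketch of the difference (using the "daym" input from the original post; the names $lossy and $kept are only for illustration), a quantifier outside the capturing group keeps only the last repetition, while a quantifier inside it keeps the whole run:

```perl
use strict;
use warnings;

my $text = "<br><br>daym<br><br>";

# Quantifier OUTSIDE the capture: every repetition overwrites $1 / $2,
# so only the last <br> on each side is put back.
(my $lossy = $text) =~ s/(<br>|\s)*daym(<br>|\s)*/${1}XXXX${2}/i;

# Quantifier INSIDE the capture: the outer group captures the whole run,
# and the inner (?:...) groups without capturing.
(my $kept = $text) =~ s/((?:<br>|\s)*)daym((?:<br>|\s)*)/${1}XXXX${2}/i;

print "$lossy\n";   # <br>XXXX<br>
print "$kept\n";    # <br><br>XXXX<br><br>
```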

Regards,
Jan
 
Jason C

Jan Pluntke said:
You will want to capture the * also, otherwise $1 and $2 will
contain only one (the last) match for that part of the string:

((?:\r|\n|\r\n|<br>|\s)*)

The ?: will make the inner () non-capturing.

Excellent! I was not familiar with the ?:, so I'll have to make a note of that for future reference.

I think (but might be wrong - did not test) that \s contains
\r and \n, so you can remove them:

((?:<br>|\s)*)

Regards,
Jan

Correct again! I thought that \s just captured the space, and didn't realize that it includes line breaks (and apparently tabs, too). I can modify all of my scripts for that, now, and save a little bandwidth :)

Thanks for the help!
 
Jason C

Unless $original is supposed to be a regex, you want \Q\E around it.

I originally did this in the function:

$original = quotemeta($original);
$text =~ ...;

Is there a difference between quotemeta() and \Q\E?
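
As far as the documentation goes, quotemeta() is the function that implements \Q...\E, so the two should produce the same escaped string. A small sketch to check that (the sample word "a.b*c" and the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $original = "a.b*c";    # hypothetical word containing regex metacharacters

my $quoted = quotemeta($original);    # function form
my $inline = "\Q$original\E";         # escape form, inside a double-quoted string

my $re_func   = qr/$quoted/;
my $re_inline = qr/\Q$original\E/;

print "same escaped string\n" if $quoted eq $inline;

# Both patterns match the literal text only; unescaped, /a.b*c/ would
# also match something like "aXbbc".
print "literal match only\n"
    if "a.b*c" =~ $re_inline && "aXbbc" !~ $re_inline && "aXbbc" !~ $re_func;
```

The practical difference is mostly where the escaping happens: quotemeta() changes the variable once, while \Q...\E escapes at the point of interpolation and leaves $original untouched.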

You don't really need the final capture, you can just use lookahead.
Similarly you don't need to capture more than one \s just to put it back
again:

s/(\s|<br>) \Q$original\E (?= \s|<br>)/$1$converted/ix;

Turning the initial capture into lookbehind is harder, since Perl
doesn't support variable-length lookbehind and the two branches of the
alternation are different lengths. However, if you have at least 5.10
(which you do, I hope), you can use \K like this:

s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;
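
A rough sketch of what the \K version does, run against the "daym" input from the start of the thread (the variable $out is only for illustration, and Perl 5.10 or later is assumed):

```perl
use strict;
use warnings;

my $text      = "<br><br>daym<br><br>";
my $original  = "daym";
my $converted = "X" x length($original);

# (?:\s|<br>)  has to match right before the word, but \K then drops
#              everything matched so far from the part that is replaced.
# (?=\s|<br>)  is a lookahead: it checks what follows without consuming it,
#              so the trailing <br> is never part of the match either.
(my $out = $text) =~ s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;

print "$out\n";   # <br><br>XXXX<br><br>
```

Since only the word itself is replaced, nothing needs to be captured and put back with $1 or $2.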

I'm afraid that you went just a little over my head on that one. What does the \K do? And what does (?=\s|<br>) do differently from (?:\s|<br>)? Or are they the same?

This is slightly different, but how do I include "or at the beginning of the string" in that regex?

I don't think that this would work, would it?

((?:^|\s|<br>)*)

For this purpose, I'm specifically converting a string of "www.example.com" to "http://www.example.com". A string like

$text = "Go to www.example.com";

matches, but

$text = "www.example.com<br><br>click here";

doesn't.

Further, in this case I don't want it to match when the www is between other characters (so that it doesn't change "http://www" to "http://http://www"), so I think I'll have to use a totally different regex without the trailing part. But I still need to figure out how to make it match if it follows a \s or <br>, or is at the beginning of the string.
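
One possible approach, sketched under some assumptions: ^ can sit inside an ordinary non-capturing alternation, and leaving off the * means one of the three alternatives has to match, so a www in the middle of other characters is skipped. The helper name add_scheme and the URL shape (www. followed by characters that are neither whitespace nor "<") are made up for illustration, and \K needs Perl 5.10 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: prefix bare www. addresses with http://
sub add_scheme {
    my ($text) = @_;

    # (?: ^ | \s | <br> )  start of string, whitespace, or a <br> must come first
    # \K                   keep that prefix out of the replaced part
    # (www\.[^\s<]+)       assumed URL shape: runs until whitespace or "<"
    $text =~ s{ (?: ^ | \s | <br> ) \K (www\.[^\s<]+) }{http://$1}gix;

    return $text;
}

print add_scheme("Go to www.example.com"), "\n";
print add_scheme("www.example.com<br><br>click here"), "\n";
print add_scheme("See http://www.example.com"), "\n";   # left unchanged
```

In the third call the www follows a "/", which is none of the three allowed prefixes, so the existing http:// is not doubled.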
 
Morty Abzug

s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;

In addition to Jan and Ben's excellent suggestions, please note that
the patterns don't need to be applied in a loop. You can do something
like this:

my $regex=join "|", map quotemeta, @profanityArr;
$text =~ s{ (?:\s|<br>) \K ($regex) (?=\s|<br>) }{"X" x length $1}ex;

The "|" lets you match alternatives in a single regex, while the /e
flag is used to eval an expression before performing a substitution.
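
A slightly fuller sketch of the same idea, for the case where some words have their own replacement. The hash %replacement, the sub filter_text, and the added ^ and \z alternatives (so words at the very start or end of the text are also caught) are assumptions for illustration, not part of the code above; Perl 5.10 or later is assumed for \K and //:

```perl
use strict;
use warnings;

# Hypothetical word list: undef means "mask the word with X's".
my %replacement = (
    daym   => undef,
    jason  => "brainfried",
    dammit => "XXXXit",
);

# Longest words first, so a longer entry wins over a shorter one
# that happens to be its prefix.
my $alternatives = join "|",
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %replacement;

sub filter_text {
    my ($text) = @_;

    # /e evaluates the replacement as Perl code: use the custom
    # replacement if one is defined, otherwise X's of the same length.
    $text =~ s{ (?: ^ | \s | <br> ) \K ($alternatives) (?= \s | <br> | \z ) }
              { $replacement{lc $1} // "X" x length($1) }gixe;

    return $text;
}

print filter_text("<br><br>daym<br><br>"), "\n";   # <br><br>XXXX<br><br>
print filter_text("jason said dammit"), "\n";      # brainfried said XXXXit
```

The /g and /i flags are added here so that every occurrence is replaced regardless of case; drop them if only the first exact match should change.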

As I'm sure you know, folks who want to bypass the filters can usually
figure out ways around them. :)

- Morty
 
