Regex losing <br> (different from the earlier topic about losing $1)

Jason C · Jun 22, 2012

I'm building a profanity filter, and I'm using the following subroutine to replace matched words with XXXX:

while (($original, $converted) = @profanityArr) {
    if (!$converted) {
        $len = length($original);
        $converted = "X" x $len;
    }

    $original = quotemeta($original);

    $text =~ s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i;
}

# When I feed:
$original = "daym";

$text = " daym ";
###

I'm getting " XXXX ", meaning that it loses the matched <br> in both $1 and $2.

# When I feed:
$original = "jason";
$converted = "brainfried";

$text = " jason ";
###

I'm getting " brainfried ". Again, it loses the matched <br> in both $1 and $2.

# When I feed:
$original = "dammit";
$converted = "XXXXit";

$text = " dammit ";
###

I'm getting " XXXXit ", meaning that it loses the matched <br> in $1, but keeps it in $2.

It's the same if I change $1 and $2 to \1 and \2.

Any suggestions on how to correct the sub to keep the matched <br>?

Jan Pluntke · Jun 22, 2012

Jason C said:
$text =~ s/(\r|\n|\r\n|<br>|\s)*$original(\r|\n|\r\n|<br>|\s)*/$1$converted$2/i; [...]
# When I feed:
$original = "daym";

$text = " daym ";
###

I'm getting " XXXX ". Meaning, it loses the matched <br> in both $1
and $2.

You will want to capture the * also, otherwise $1 and $2 will
contain only one (the last) match for that part of the string:

((?:\r|\n|\r\n|<br>|\s)*)

The ?: will make the inner () non-capturing.

I think (but might be wrong - did not test) that \s contains
\r and \n, so you can remove them:

((?:<br>|\s)*)

Regards,
Jan
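
The difference between the two groupings can be seen in a small sketch. The input string here is a hypothetical example (two <br> tags on each side of the word), not the data from the original post:

```perl
use strict;
use warnings;

# Hypothetical input: two <br> tags on each side of the word.
my $text = "<br><br>daym<br><br>";

# Quantifier outside the capturing group: $1 and $2 keep only the
# last repetition, so one <br> on each side is lost.
(my $outside = $text) =~ s/(<br>|\s)*daym(<br>|\s)*/${1}XXXX${2}/i;

# Quantifier inside an outer capturing group, with a non-capturing
# inner group: $1 and $2 keep the whole run.
(my $inside = $text) =~ s/((?:<br>|\s)*)daym((?:<br>|\s)*)/${1}XXXX${2}/i;

print "$outside\n";    # <br>XXXX<br>
print "$inside\n";     # <br><br>XXXX<br><br>
```

Note that when a group matches zero times, $1 or $2 is undefined, which produces a warning under "use warnings" in the replacement.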

Jason C · Jun 22, 2012

Jan Pluntke said:
You will want to capture the * also, otherwise $1 and $2 will
contain only one (the last) match for that part of the string:

((?:\r|\n|\r\n|<br>|\s)*)

The ?: will make the inner () non-capturing.

Excellent! I was not familiar with the ?:, so I'll have to make a note of that for future reference.

Jan Pluntke said:
I think (but might be wrong - did not test) that \s contains
\r and \n, so you can remove them:

((?:<br>|\s)*)

Regards,
Jan

Correct again! I thought that \s matched only the space character, and didn't realize that it also matches line breaks (and apparently tabs, too). I can modify all of my scripts for that now, and save a little bandwidth.

Thanks for the help!

Jason C · Jun 23, 2012

Ben said:
Unless $original is supposed to be a regex, you want \Q\E around it.

I originally did this in the function:

$original = quotemeta($original);
$text =~ ...;

Is there a difference between quotemeta() and \Q\E?
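
For what it's worth, the Perl documentation describes quotemeta() as the function that implements the \Q escape, so the two should behave the same inside a pattern. A minimal sketch, using a hypothetical word that contains a regex metacharacter:

```perl
use strict;
use warnings;

my $original = "a.b";      # hypothetical word containing a metacharacter
my $text     = "axb a.b";  # hypothetical input

my $quoted = quotemeta($original);    # backslash-escapes the "."

# Unquoted: "." matches any character, so "axb" is found first.
my ($unquoted)     = $text =~ /($original)/;

# Both quoting styles match only the literal "a.b".
my ($via_function) = $text =~ /($quoted)/;
my ($via_escape)   = $text =~ /(\Q$original\E)/;

print "$unquoted\n";        # axb
print "$via_function\n";    # a.b
print "$via_escape\n";      # a.b
```

The practical difference is only where the escaping happens: quotemeta() changes the variable itself, while \Q...\E leaves $original untouched and escapes it at interpolation time.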

You don't really need the final capture; you can just use lookahead.
Similarly you don't need to capture more than one \s just to put it back
again:

s/(\s|<br>) \Q$original\E (?= \s|<br>)/$1$converted/ix;

Turning the initial capture into lookbehind is harder, since Perl
doesn't support variable-length lookbehind and the two branches of the
alternation are different lengths. However, if you have at least 5.10
(which you do, I hope), you can use \K like this:

s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;
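
A small sketch of how that last pattern behaves, with hypothetical inputs. The (?:...) group must match and is consumed, but \K drops everything matched so far from the part that gets replaced; the (?=...) lookahead only checks what follows and consumes nothing:

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

my $original  = "daym";
my $converted = "XXXX";

# Hypothetical inputs. The second must stay untouched, because "daym"
# is followed by "a" rather than whitespace or <br>.
my $hit  = "<br>daym<br>";
my $miss = "<br>daymare<br>";

for ($hit, $miss) {
    # The leading <br> is matched but kept (via \K); the trailing <br>
    # is only looked at, so neither needs to be put back.
    s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;
}

print "$hit\n";     # <br>XXXX<br>
print "$miss\n";    # <br>daymare<br>
```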

I'm afraid that you went just a little over my head on that one. What does the \K do? And what does (?=\s|<br>) do differently from (?:\s|<br>)? Or are they the same?

This is slightly different, but how do I include "or at the beginning of the string" in that regex?

I don't think that this would work, would it?

((?:^|\s|<br>)*)

For this purpose, I'm specifically converting a string of "www.example.com" to "http://www.example.com". A string like

$text = "Go to www.example.com";

matches, but

$text = "www.example.com click here";

doesn't.

Further, in this case I don't want it to match when the www is between other characters (so that it doesn't change "http://www" to "http://http://www"), so I think I'll have to use a totally different regex without the trailing part. But I still need to figure out how to make it match if it follows a \s or a <br>, or is at the beginning of the string.
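
One possible approach, offered only as a sketch and not taken from the replies in this thread: ^ can sit in the alternation next to \s and <br>, as long as the group is not followed by *. With the *, the group may match zero times, so it no longer requires any boundary at all. The helper name add_scheme is hypothetical, and the pattern assumes Perl 5.10 or later for \K:

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper: prefix "http://" to a "www." that starts the
# string or directly follows whitespace or a <br>.
sub add_scheme {
    my ($text) = @_;
    $text =~ s{ (?: ^ | \s | <br> ) \K www\. }{http://www.}gx;
    return $text;
}

print add_scheme("Go to www.example.com"), "\n";
# Go to http://www.example.com
print add_scheme("www.example.com click here"), "\n";
# http://www.example.com click here
print add_scheme("http://www.example.com"), "\n";
# http://www.example.com (unchanged, since "www." follows "/")
```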

Morty Abzug · Jun 26, 2012

s/ (?:\s|<br>) \K \Q$original\E (?=\s|<br>) /$converted/ix;

In addition to Jan and Ben's excellent suggestions, please note that
the patterns don't need to be applied in a loop. You can do something
like this:

my $regex=join "|", map quotemeta, @profanityArr;
$text =~ s{ (?:\s|<br>) \K ($regex) (?=\s|<br>) }{"X" x length $1}ex;

The "|" lets you match alternatives in a single regex, while the /e
flag is used to eval an expression before performing a substitution.
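
A self-contained sketch of that single-pass idea follows. The word list and input are hypothetical, and the /g and /i flags are additions assumed here (to replace every occurrence and ignore case); they are not part of the snippet above:

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical word list; each word is escaped before being joined.
my @words = ("daym", "dammit");
my $regex = join "|", map { quotemeta($_) } @words;

my $text = "<br>daym<br> dammit <br>";    # hypothetical input

# /e evaluates the replacement as Perl code, so each word becomes a
# run of X characters of the same length.
(my $clean = $text) =~
    s{ (?:\s|<br>) \K ($regex) (?=\s|<br>) }{"X" x length $1}gexi;

print "$clean\n";    # <br>XXXX<br> XXXXXX <br>
```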

As I'm sure you know, folks who want to bypass the filters can usually
figure out ways around them.

- Morty
