Parsing user-entered content to remove rude words

Greg

Hi,

I would like to make life easier for myself by automating (as best as
possible) the removal of messages supplied by users.

If my incoming string is $input, I originally thought of searching as
follows:

<Pseudocode>
foreach my $rudeword in @RudeWordList {
if ($input =~ s/$rudeword/i) {
REJECT;
}
}
</Pseudocode>


However, this seems a rather unoptimised method of searching. Is there a
more optimised way of doing this?

Cheers,


Greg
 
Tad McClellan

Greg said:
if ($input =~ s/$rudeword/i) {


The s/// operator has *two* parts.

You realize that automating this accurately is nearly impossible?

You should probably have word boundaries in your pattern.

Does the word list contain all of the ways that folks will
use to circumvent censorship?

Will it generate "false hits" and delete messages that don't
really contain the "bad words"?

Greg said:
Is there a
more optimised way of doing this?


Your Question is Asked Frequently:

perldoc -q match

How do I efficiently match many regular expressions at once?
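
One common reading of that FAQ advice, combined with the word-boundary suggestion above, is to build a single compiled pattern from the whole list rather than looping over it for every message. The following is only a sketch under assumptions: the word list is a deliberately mild stand-in, and `is_rude` is a hypothetical helper name, not something from the FAQ.

```perl
use strict;
use warnings;

# Hypothetical, deliberately mild word list; substitute your own.
my @rude_words = ('darn', 'heck', 'blast it');

# Build one alternation, longest entries first so longer phrases are tried
# first, and quote each entry so any regex metacharacters are taken literally.
my $alternation = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @rude_words;

# Compile once with qr//; \b anchors to word boundaries, /i ignores case.
my $rude_re = qr/\b(?:$alternation)\b/i;

# Returns 1 if the message contains a listed entry as a whole word, else 0.
sub is_rude {
    my ($input) = @_;
    return $input =~ $rude_re ? 1 : 0;
}

for my $msg ('Well, heck!', 'Please check the checkout page', 'Oh BLAST IT all') {
    printf "%-32s %s\n", $msg, is_rude($msg) ? 'REJECT' : 'ok';
}
```

Because the pattern is compiled once, each incoming message costs a single match. The word boundaries are what keep "check" and "checkout" from being rejected.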
 
Juha Laiho

Greg said:
I would like to make life easier for myself by automating (as best as
possible) the removal of messages supplied by users.

If my incoming string is $input, I originally thought of searching as
follows:

<Pseudocode>
foreach my $rudeword in @RudeWordList {
if ($input =~ s/$rudeword/i) {
REJECT;
}
}
</Pseudocode>

However, this seems a rather unoptimised method of searching. Is there a
more optimised way of doing this?

Not quite -- some points, though:
- s// is substituting, whereas you only need matching (m//, or just //)
- you realize you're playing 'catch' -- the rude words can be altered
enough to make your filter fail, yet have the result be readable for
users - then, of course, you can add these alterations to your word
list after you see them in use -- but, read on..
- if you still want to do this, use word boundaries around your regular
expressions -- otherwise (remembering the above), you'll trigger
a false positive with just 'AMD Sempron'
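
To make the first and last points concrete, here is a small sketch that contrasts a plain substring match with a word-boundary match. The word `heck` is a mild, hypothetical stand-in for a list entry, and both sub names are invented for illustration.

```perl
use strict;
use warnings;

my $rudeword = 'heck';   # hypothetical stand-in for one list entry

# Plain match (m//, here written as //): tests the string without changing it.
# \Q...\E quotes the word so it is matched literally.
sub matches_anywhere {
    my ($input) = @_;
    return $input =~ /\Q$rudeword\E/i ? 1 : 0;
}

# Same test, but only when the word stands alone between word boundaries.
sub matches_whole_word {
    my ($input) = @_;
    return $input =~ /\b\Q$rudeword\E\b/i ? 1 : 0;
}

for my $input ('Check your order', 'What the heck') {
    printf "%-18s anywhere=%d whole_word=%d\n",
        $input, matches_anywhere($input), matches_whole_word($input);
}
```

The first input is a false positive for the substring test only, since "Check" merely contains the word; the second is caught by both.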

You might consider using a soundex algorithm, but that's also prone
to false positives - it's somewhat just an automated way to match
a set of alterations at a single go.
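
As a rough illustration of both the idea and its weakness, here is a hand-rolled, simplified American Soundex in plain Perl. It is a sketch, not the code of any particular module; `soundex_sketch` is a hypothetical name and the sample words are mild stand-ins.

```perl
use strict;
use warnings;

# Simplified American Soundex: first letter plus up to three digits.
sub soundex_sketch {
    my ($word) = @_;
    (my $w = uc $word) =~ tr/A-Z//cd;      # keep ASCII letters only
    return '' if $w eq '';

    # Vowels and Y -> 0, H and W -> 9, consonant groups -> 1..6.
    (my $codes = $w) =~ tr/AEIOUYHWBFPVCGJKQSXZDTLMNR/00000099111122222222334556/;

    my $prev = substr($codes, 0, 1);       # code of the first letter
    my $out  = '';
    for my $c (split //, substr($codes, 1)) {
        next if $c eq '9';                 # H/W never separate equal codes
        $out .= $c if $c ne '0' && $c ne $prev;
        $prev = $c;                        # a vowel (0) resets the run
    }
    # Pad with zeros and cut to the letter plus three digits.
    return substr(substr($w, 0, 1) . $out . '000', 0, 4);
}

# Respellings collapse to one code, but so does an innocent word.
for my $w (qw(darn dern durn drain)) {
    printf "%-6s %s\n", $w, soundex_sketch($w);
}
```

All four words get the same code, so a filter keyed on the code for "darn" would also catch the respellings "dern" and "durn", and would wrongly reject "drain".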
 
Mark Clements

Greg said:
Hi,

I would like to make life easier for myself by automating (as best as
possible) the removal of messages supplied by users.

If my incoming string is $input, I originally thought of searching as
follows:

<Pseudocode>
foreach my $rudeword in @RudeWordList {
if ($input =~ s/$rudeword/i) {
REJECT;
}
}
</Pseudocode>

You could check out Regexp::Common, specifically Regexp::Common::profanity.

Mark
 
Greg

Tad said:
The s/// operator has *two* parts.

I put the word "Pseudocode" for a reason :p Well done for noticing :)

Tad said:
You realize that automating this accurately is nearly impossible?

Oh yes. I just want to catch as many as I can automatically before
resorting to manual approaches.

Tad said:
You should probably have word boundaries in your pattern.

Cheers!

Tad said:
Does the word list contain all of the ways that folks will
use to circumvent censorship?

I've considered this. As I mention above, I just want to catch the
"Usual suspects" :D

Tad said:
Will it generate "false hits" and delete messages that don't
really contain the "bad words"?

Could do - I don't want to upset the residents of the lovely UK town of
Scunthorpe :D

Tad said:
Your Question is Asked Frequently:

perldoc -q match

How do I efficiently match many regular expressions at once?

You're a star! Thank you :D
 
Greg

Mark said:
You could check out Regexp::Common, specifically Regexp::Common::profanity.

Brilliant! I didn't even know it existed - time for me to start googling!

Cheers Mark!
 
Greg

Juha said:
Not quite -- some points, though:
- s// is substituting, whereas you only need matching (m//, or just //)

Yeah, It's <cough> pseudocode - just a typo on my part - but thanks for
noticing :D

Juha said:
- you realize you're playing 'catch' -- the rude words can be altered
enough to make your filter fail, yet have the result be readable for
users - then, of course, you can add these alterations to your word
list after you see them in use -- but, read on..

You're absolutely right - it's just like trying to write a piece of code
to catch people entering silly names and addresses - almost impossible.

I just want to catch the "Usual Suspects"

Juha said:
- if you still want to do this, use word boundaries around your regular
expressions -- otherwise (remembering the above), you'll trigger
a false positive with just 'AMD Sempron'

Good point!

Juha said:
You might consider using a soundex algorithm, but that's also prone
to false positives - it's somewhat just an automated way to match
a set of alterations at a single go.

Ahhh! That's what it was called... I remembered reading about it a few
months ago, but couldn't think of the name :)

Cheers Juha and thank you for the advice!
 
Anno Siegel

Greg said:
Yeah, It's <cough> pseudocode - just a typo on my part - but thanks for
noticing :D

Labelling something "pseudocode" is not a license to write anything
you please and let the reader figure it out. "s//" is nonsense,
pseudocode or not.

Anno
 
