How to remove all duplications of characters

Ignoramus21673 · Apr 24, 2006

I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

i

David Squire · Apr 24, 2006

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

DS

Lukas Mai · Apr 24, 2006

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

Not a regexp, but you can use tr/// with the s modifier. See perldoc
perlop.

HTH, Lukas

David Squire · Apr 24, 2006

David said:
Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

OK. Here's an example of one:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g; print}}'

(assuming that you are only interested in alphabetic characters being
duplicated)

DS
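For reference, a minimal sketch of the same substitution as a standalone script rather than a one-liner. The helper name squeeze_letters is hypothetical (it does not appear in the thread), and the sketch assumes only ASCII letters matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of one repeated ASCII letter
# down to a single occurrence of that letter.
sub squeeze_letters {
    my ($text) = @_;
    # ([A-Za-z]) captures one letter; \1+ matches one or more repeats of
    # exactly that letter (same case), and the run is replaced by $1.
    $text =~ s/([A-Za-z])\1+/$1/g;
    return $text;
}

print squeeze_letters('Heelllooo WWWoorrld'), "\n";   # prints "Helo World"
```

Because the backreference must match the captured character exactly, a mixed-case run such as "Aa" is left alone.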

Ignoramus21673 · Apr 24, 2006

Yes.

What have you tried so far?

perldoc perlre

Also, many English words contain perfectly valid double letters (there's
one now). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

The corrected text is intended for the consumption of the filter, not
humans.

I need to filter certain spams, one is a sex spammer who sends emails
with subjects similar to the above, and another is a medications
spammer who sends messages with lines like

X a n @ x

etc. I want to write something smart that would detect it.

i

David Squire · Apr 24, 2006

Lukas said:
Not a regexp, but you can use tr/// with the s modifier. See perldoc
perlop.

Yes. This is indeed nicer:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

DS

Ignoramus21673 · Apr 24, 2006

David said:
David said:

Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

OK. Here's an example of one:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g; print}}'

(assuming that you are only interested in alphabetic characters being
duplicated)

DS

Thanks, works beautifully.

i

Tad McClellan · Apr 24, 2006

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;

Note that 'Mississippi' becomes 'Misisipi' ...
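
A short sketch of the tr/// form applied to a variable, to make the 'Mississippi' caveat concrete. The helper name squeeze_tr is hypothetical; the transliteration itself is the one quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the tr///s idiom.
sub squeeze_tr {
    my ($text) = @_;
    # With an empty replacement list every listed letter maps to itself;
    # the /s modifier then squeezes each run of a repeated letter to one.
    $text =~ tr/a-zA-Z//s;
    return $text;
}

print squeeze_tr('Mississippi'), "\n";   # prints "Misisipi"
```

Characters outside the list (digits, spaces, punctuation) are not squeezed, so "1100  bees" would come out as "1100  bes".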

Ignoramus21673 · Apr 24, 2006

Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;

Note that 'Mississippi' becomes 'Misisipi' ...

Thanks. Someone suggested to use a regexp like this

$s =~ s/([A-Za-z])\1+/$1/g;

which actually works. If tr is somehow better (not sure why), I can
switch to using tr.

i

David Squire · Apr 24, 2006

Tad said:
Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;

Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

DS

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

Mintcake · Apr 24, 2006

David said:
Yes. This is indeed nicer:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

DS

Or even...

echo 'Heelllooo WWWoorrld' | perl -pe 'tr/A-Za-z//s'

Dr.Ruud · Apr 24, 2006

Tad McClellan wrote:

Ignoramus21673:

Yes, but a regex is not the Right Tool for this job.

Well, it is if you would rather use [:alpha:].

(there can be more in [[:alpha:]] than is in [A-Za-z])
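
A minimal sketch of that difference, assuming the input is a Perl character string that may hold letters beyond ASCII. The helper name squeeze_alpha is hypothetical, and the Greek letters are written as \x{...} escapes so the script does not depend on the source file's encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of any character that the POSIX
# class [[:alpha:]] matches, not just A-Z and a-z.
sub squeeze_alpha {
    my ($text) = @_;
    $text =~ s/([[:alpha:]])\1+/$1/g;
    return $text;
}

# Two Greek alphas followed by two Greek betas (code points above 255,
# so Perl applies Unicode rules when matching).
my $greek = "\x{3b1}\x{3b1}\x{3b2}\x{3b2}";

# The ASCII-only transliteration leaves the string untouched.
(my $ascii_only = $greek) =~ tr/A-Za-z//s;

printf "tr/A-Za-z//s leaves %d characters, [[:alpha:]] leaves %d\n",
    length($ascii_only), length(squeeze_alpha($greek));
```

This prints 4 for the tr/// version and 2 for the [[:alpha:]] version.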

Anno Siegel · Apr 24, 2006

David Squire said:
Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

tr/\x00-\x7f//s;

covers the ASCII range. Any set of character ranges can be covered.
See tr/// in perlop.

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

Do look up tr///. The similarity with s/// is rather superficial. In
particular, "." doesn't do in tr/// what it does in a regex.

Anno
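
To illustrate the point about "." in tr///, here is a small sketch comparing the two transliterations on the same string. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $input = 'etc... aab';

# In tr/// a dot is just a dot: only runs of literal '.' are squeezed.
(my $dots_only = $input) =~ tr/.//s;

# A range covering all of ASCII squeezes runs of any ASCII character.
(my $any_ascii = $input) =~ tr/\x00-\x7f//s;

print "$dots_only\n";   # prints "etc. aab"
print "$any_ascii\n";   # prints "etc. ab"
```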

Anno Siegel · Apr 24, 2006

Dr.Ruud said:
Tad McClellan wrote:

Ignoramus21673:

Yes, but a regex is not the Right Tool for this job.

Well, it is if you would rather use [:alpha:].

(there can be more in [[:alpha:]] than is in [A-Za-z])

$_ = 'Heelllooo WWWoorrld';
do {
    my $alpha = join '' =>
        grep /[[:alpha:]]/,
        map chr, 0 .. 255; # or whatever
    eval "sub { tr/$alpha//s }";
}->();
print "$_\n";

Anno

Dr.Ruud · Apr 24, 2006

Anno Siegel wrote:

Dr.Ruud:

(there can be more in [[:alpha:]] than is in [A-Za-z])

$_ = 'Heelllooo WWWoorrld';
do {
    my $alpha = join '' =>
        grep /[[:alpha:]]/,
        map chr, 0 .. 255; # or whatever
    eval "sub { tr/$alpha//s }";
}->();
print "$_\n";

Heheh, I actually had this technique of yours (!) in mind while posting.

The 'whatever' can be quite big:

#!/usr/bin/perl
use strict;
use warnings;

my ($alpha, $i, $n) = ('', 0, 0);

for (0x0000..0xD7FF, 0xE000..0xFDCF, 0xFDF0..0xFFFD) {
    ++$i;
    $_ = chr;
    $alpha .= $_ if /[[:alpha:]]/;
}
printf "%d / %d = %d%%\n", $n = length $alpha, $i, 100 * $n / $i;
printf "%s\n", substr( $alpha, 0, 160 );
__END__

47276 / 63454 = 74%
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz...

Tad McClellan · Apr 24, 2006

Ignoramus21673 said:
If tr is somehow better

It is.

(not sure why),

1) it is more self-documenting. s/// is for *patterns*, tr/// is
for characters, and you want to operate on characters not on patterns.

2) it is a lot faster than s///g

I can
switch to using tr.

Good.
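
The speed claim can be checked locally with the core Benchmark module. The following is only a sketch: the sub names and the test input are made up for illustration, and the timings will vary by machine and Perl version, so no figures are quoted here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative input: the thread's sample string repeated many times.
my $input = 'Heelllooo WWWoorrld ' x 1000;

# Each sub works on its own copy so every iteration sees the same data.
sub with_regex { my $t = $input; $t =~ s/([A-Za-z])\1+/$1/g; return $t }
sub with_tr    { my $t = $input; $t =~ tr/A-Za-z//s;         return $t }

# Sanity check: both approaches should agree before they are timed.
die "results differ\n" unless with_regex() eq with_tr();

# Run each candidate for at least one CPU second and print a comparison.
cmpthese(-1, {
    's_regex'    => \&with_regex,
    'tr_squeeze' => \&with_tr,
});
```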

Tad McClellan · Apr 24, 2006

David Squire said:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

You do not need any regular expressions for that either:

tr/\000-\011\013-\377//s; # No regex here!

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

That must be because you tested on a string without any of
the tr-listed characters in it. It works for me:

perl -le '$_="etc..."; tr/.//s; print'

Note the underlined part above. tr/// is NOT a regular expression
(so a dot is a dot, not a "wildcard").

Jürgen Exner · Apr 25, 2006

Ignoramus21673 wrote:
[something]

Would you mind sticking to one email alias?
Or are you suffering from multiple schizophrenia?

jue

Lukas Mai · Apr 25, 2006

David Squire said:
Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

DS

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

That's because tr doesn't do patterns. '.' matches '.' and nothing else.
Try perl -pe "tr///cs".

HTH, Lukas
