How to remove all duplications of characters

Ignoramus21673

I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

i
 
David Squire

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now :) ). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

DS
 
Lukas Mai

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?

Not a regexp, but you can use tr/// with the s modifier. See perldoc
perlop.

HTH, Lukas
 
David Squire

David said:
Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now :) ). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

OK. Here's an example of one:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g;
print}}'

(assuming that you are only interested in alphabetic characters being
duplicated)

DS
 
Ignoramus21673

David Squire said:
Yes.

What have you tried so far?

perldoc perlre

Also, many English words contain perfectly valid double letters (there's
one now :) ). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

The corrected text is intended for the consumption of the filter, not
humans.

I need to filter certain spams, one is a sex spammer who sends emails
with subjects similar to the above, and another is a medications
spammer who sends messages with lines like


X a n @ x

etc. I want to write something smart that would detect it.


i
 
David Squire

Lukas said:
Not a regexp, but you can use tr/// with the s modifier. See perldoc
perlop.

Yes. This is indeed nicer:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

DS
 
Ignoramus21673

David said:
Yes.

What have you tried so far?

Also, many English words contain perfectly valid double letters (there's
one now :) ). If you want your filtered results to be human-readable,
you will need to take that into account. If you intend just to reduce
things to a standard form before feeding to a filter, then this will not
matter.

OK. Here's an example of one:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g;
print}}'

(assuming that you are only interested in alphabetic characters being
duplicated)

DS

Thanks, works beautifully.

i
 
Tad McClellan

Ignoramus21673 said:
I am writing a little mail filter:

I receive messages with Subjects such as:

Hardcoore incesst Content

I want to replace that with "Hardcore incest Content" (note the removal of
duplicate characters). Is there some regexp that would let me do that?


Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;


Note that 'Mississippi' becomes 'Misisipi' ...
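
For anyone comparing the two approaches in this thread, here is a minimal sketch that runs both on the same input. The helper names squeeze_tr and squeeze_re are made up for illustration; only the tr/// and s/// lines come from the posts.

```perl
use strict;
use warnings;

# Squeeze runs of the same ASCII letter with tr///s (no regex involved).
sub squeeze_tr {
    my ($text) = @_;
    $text =~ tr/a-zA-Z//s;
    return $text;
}

# The same effect using a backreference in a substitution.
sub squeeze_re {
    my ($text) = @_;
    $text =~ s/([A-Za-z])\1+/$1/g;
    return $text;
}

for my $subject ('Hardcoore incesst Content', 'Mississippi') {
    printf "%-28s tr: %-26s s///: %s\n",
        $subject, squeeze_tr($subject), squeeze_re($subject);
}
```

Both helpers should give "Hardcore incest Content" and "Misisipi", so legitimate double letters are collapsed along with the spammer's padding either way.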
 
Ignoramus21673

Tad McClellan said:
Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;


Note that 'Mississippi' becomes 'Misisipi' ...

Thanks. Someone suggested using a regexp like this:

$s =~ s/([A-Za-z])\1+/$1/g;


which actually works. If tr is somehow better (not sure why), I can
switch to using tr.

i
 
David Squire

Tad said:
Yes, but a regex is not the Right Tool for this job.

You can do it fine without any regular expressions:

tr/a-zA-Z//s;

Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

DS

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.
 
Mintcake

David said:
Yes. This is indeed nicer:

echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

DS

Or even...

echo 'Heelllooo WWWoorrld' | perl -pe 'tr/A-Za-z//s'
 
Dr.Ruud

Tad McClellan wrote:
Ignoramus21673:

Yes, but a regex is not the Right Tool for this job.

Well, it is if you would rather use [:alpha:].

(there can be more in [[:alpha:]] than is in [A-Za-z])
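
A small sketch of that point, assuming the input is a character string (not raw bytes) so that Unicode rules apply to the POSIX class. The helper name squeeze_alpha is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of any alphabetic character,
# using the POSIX class instead of the ASCII-only range A-Za-z.
sub squeeze_alpha {
    my ($text) = @_;
    $text =~ s/([[:alpha:]])\1+/$1/g;
    return $text;
}

# "\x{3b1}" is GREEK SMALL LETTER ALPHA, so this is a character string.
my $greek = "\x{3b1}\x{3b1}\x{3b1}bb";

# tr/A-Za-z//s only knows the ASCII letters and leaves the Greek run alone.
(my $ascii_only = $greek) =~ tr/A-Za-z//s;

printf "alpha class: %d chars, A-Za-z only: %d chars\n",
    length(squeeze_alpha($greek)), length($ascii_only);
```

Here the [[:alpha:]] version should shrink the string to 2 characters, while the A-Za-z version leaves 4, because only the "bb" run is squeezed.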
 
Anno Siegel

David Squire said:
Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

tr/\x00-\x7f//s;

covers the ASCII range. Any set of character ranges can be covered.
See tr/// in perlop.

David Squire said:
PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

Do look up tr///. The similarity with s/// is rather superficial. In
particular, "." doesn't do in tr/// what it does in a regex.

Anno
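
To make the difference concrete, a minimal sketch with a made-up sample string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = 'Wait....  whaaat??';

# In tr///, "." is only a full stop, so just the run of dots is squeezed.
(my $dots_only = $text) =~ tr/.//s;

# Listing the whole ASCII range squeezes runs of any ASCII character.
(my $all_ascii = $text) =~ tr/\x00-\x7f//s;

print "$dots_only\n";
print "$all_ascii\n";
```

The first line printed should be "Wait.  whaaat??" (spaces and letters untouched) and the second "Wait. what?".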
 
Anno Siegel

Dr.Ruud said:
Tad McClellan wrote:
Ignoramus21673:

Yes, but a regex is not the Right Tool for this job.

Well, it is if you would rather use [:alpha:].

(there can be more in [[:alpha:]] than is in [A-Za-z])

$_ = 'Heelllooo WWWoorrld';
do {
    my $alpha = join '' =>
        grep /[[:alpha:]]/,
        map chr, 0 .. 255; # or whatever
    eval "sub { tr/$alpha//s }";
}->();
print "$_\n";

:)

Anno
 
Dr.Ruud

Anno Siegel wrote:
Dr.Ruud:
(there can be more in [[:alpha:]] than is in [A-Za-z])

$_ = 'Heelllooo WWWoorrld';
do {
    my $alpha = join '' =>
        grep /[[:alpha:]]/,
        map chr, 0 .. 255; # or whatever
    eval "sub { tr/$alpha//s }";
}->();
print "$_\n";

:)

Heheh, I actually had this technique of yours (!) in mind while posting.

The 'whatever' can be quite big:

#!/usr/bin/perl
use strict;
use warnings;

my ($alpha, $i, $n) = ('', 0, 0);

for (0x0000..0xD7FF, 0xE000..0xFDCF, 0xFDF0..0xFFFD) {
    ++$i;
    $_ = chr;
    $alpha .= $_ if /[[:alpha:]]/;
}
printf "%d / %d = %d%%\n", $n = length $alpha, $i, 100 * $n / $i;
printf "%s\n", substr( $alpha, 0, 160 );
__END__

47276 / 63454 = 74%
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz...
 
Tad McClellan

David Squire said:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?


You do not need any regular expressions for that either:

tr/\000-\011\013-\377//s; # No regex here!

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.


That must be because you tested on a string without any of
the tr-listed characters in it. It works for me:

perl -le '$_="etc..."; tr/.//s; print'


Note the underlined part above. tr/// is NOT a regular expression
(so a dot is a dot, not a "wildcard").
 
Jürgen Exner

Ignoramus21673 wrote:
[something]

Would you mind sticking to one email alias?
Or are you suffering from multiple schizophrenia?

jue
 
Lukas Mai

David Squire said:
Out of interest, can tr handle more general cases, such as:

s/(.)\1+/$1/g;

or is a regex necessary for this?

DS

PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.

That's because tr doesn't do patterns. '.' matches '.' and nothing else.
Try perl -pe "tr///cs".

HTH, Lukas
 
