How to remove all duplications of characters

Discussion in 'Perl Misc' started by Ignoramus21673, Apr 24, 2006.

  1. I am writing a little mail filter:

    I receive messages with Subjects such as:

    Hardcoore incesst Content

    I want to replace that with "Hardcore incest Content" (note the removal of
    duplicate characters). Is there some regexp that would let me do that?

    i
     
    Ignoramus21673, Apr 24, 2006
    #1

  2. Ignoramus21673

    David Squire Guest

    Ignoramus21673 wrote:
    > I am writing a little mail filter:
    >
    > I receive messages with Subjects such as:
    >
    > Hardcoore incesst Content
    >
    > I want to replace that with "Hardcore incest Content" (note removal of
    > duplicate characters. Is there some regexp that would let me do that.


    Yes.

    What have you tried so far?

    Also, many English words contain perfectly valid double letters (there's
    one now :) ). If you want your filtered results to be human-readable,
    you will need to take that into account. If you intend just to reduce
    things to a standard form before feeding to a filter, then this will not
    matter.

    DS
     
    David Squire, Apr 24, 2006
    #2

  3. Ignoramus21673

    Lukas Mai Guest

    Ignoramus21673 <ignoramus21673@nospam.21673.invalid> schrob:
    > I am writing a little mail filter:
    >
    > I receive messages with Subjects such as:
    >
    > Hardcoore incesst Content
    >
    > I want to replace that with "Hardcore incest Content" (note removal of
    > duplicate characters. Is there some regexp that would let me do that.


    Not a regexp, but you can use tr/// with the s modifier. See perldoc
    perlop.

    HTH, Lukas
     
    Lukas Mai, Apr 24, 2006
    #3
  4. Ignoramus21673

    David Squire Guest

    David Squire wrote:
    > Ignoramus21673 wrote:
    >> I am writing a little mail filter:
    >>
    >> I receive messages with Subjects such as:
    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal of
    >> duplicate characters. Is there some regexp that would let me do that.

    >
    > Yes.
    >
    > What have you tried so far?
    >
    > Also, many English words contain perfectly valid double letters (there's
    > one now :) ). If you want your filtered results to be human-readable,
    > you will need to take that into account. If you intend just to reduce
    > things to a standard form before feeding to a filter, then this will not
    > matter.


    OK. Here's an example of one:

    echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g; print}}'

    (assuming that you are only interested in alphabetic characters being
    duplicated)

    DS
     
    David Squire, Apr 24, 2006
    #4
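The backreference substitution from post #4 can be wrapped in a small subroutine for use inside a filter script. This is only a minimal sketch; the name `squeeze_letters_re` is hypothetical and does not come from the thread. Note that `\1` is case-sensitive, so "WWw" would collapse to "Ww", not "W".

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of the same ASCII letter to one letter.
sub squeeze_letters_re {
    my ($text) = @_;
    # ([A-Za-z]) captures one letter; \1+ matches one or more repeats of it.
    $text =~ s/([A-Za-z])\1+/$1/g;
    return $text;
}

print squeeze_letters_re('Hardcoore incesst Content'), "\n";   # Hardcore incest Content
print squeeze_letters_re('Heelllooo WWWoorrld'), "\n";         # Helo World
```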
  5. On Mon, 24 Apr 2006 15:51:59 +0100, David Squire <> wrote:
    > Ignoramus21673 wrote:
    >> I am writing a little mail filter:
    >>
    >> I receive messages with Subjects such as:
    >>
    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal of
    >> duplicate characters. Is there some regexp that would let me do that.

    >
    > Yes.
    >
    > What have you tried so far?


    perldoc perlre


    > Also, many English words contain perfectly valid double letters (there's
    > one now :) ). If you want your filtered results to be human-readable,
    > you will need to take that into account. If you intend just to reduce
    > things to a standard form before feeding to a filter, then this will not
    > matter.


    The corrected text is intended for the consumption of the filter, not
    humans.

    I need to filter certain spams, one is a sex spammer who sends emails
    with subjects similar to the above, and another is a medications
    spammer who sends messages with lines like


    X a n @ x

    etc. I want to write something smart that would detect it.


    i
     
    Ignoramus21673, Apr 24, 2006
    #5
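The thread does not answer the "X a n @ x" case from post #5. One possible approach, offered purely as an assumption, is to join runs of single characters separated by single spaces before the text reaches the filter. The helper name `join_spaced_chars` and the threshold of three characters are hypothetical choices, picked so that ordinary phrases such as "plan a b" are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: join runs of three or more single characters that are
# separated by single spaces, e.g. "X a n @ x" becomes "Xan@x".
sub join_spaced_chars {
    my ($text) = @_;
    # (?<!\S) and (?!\S) make sure the run is not part of a longer word.
    $text =~ s{(?<!\S)(\S(?: \S){2,})(?!\S)}{
        (my $run = $1) =~ tr/ //d;   # delete the separating spaces
        $run;
    }ge;
    return $text;
}

print join_spaced_chars('buy X a n @ x now'), "\n";   # buy Xan@x now
```

A run of single-letter words in legitimate text would also be joined, so this is better suited to producing input for a filter than to producing readable output.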
  6. Ignoramus21673

    David Squire Guest

    Lukas Mai wrote:
    > Ignoramus21673 <ignoramus21673@nospam.21673.invalid> schrob:
    >> I am writing a little mail filter:
    >>
    >> I receive messages with Subjects such as:
    >>
    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal of
    >> duplicate characters. Is there some regexp that would let me do that.

    >
    > Not a regexp, but you can use tr/// with the s modifier. See perldoc
    > perlop.


    Yes. This is indeed nicer:

    echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'

    DS
     
    David Squire, Apr 24, 2006
    #6
  7. On Mon, 24 Apr 2006 16:01:25 +0100, David Squire <> wrote:
    > David Squire wrote:
    >> Ignoramus21673 wrote:
    >>> I am writing a little mail filter:
    >>>
    >>> I receive messages with Subjects such as:
    >>> Hardcoore incesst Content
    >>>
    >>> I want to replace that with "Hardcore incest Content" (note removal of
    >>> duplicate characters. Is there some regexp that would let me do that.

    >>
    >> Yes.
    >>
    >> What have you tried so far?
    >>
    >> Also, many English words contain perfectly valid double letters (there's
    >> one now :) ). If you want your filtered results to be human-readable,
    >> you will need to take that into account. If you intend just to reduce
    >> things to a standard form before feeding to a filter, then this will not
    >> matter.

    >
    > OK. Here's an example of one:
    >
    > echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {s/([A-Za-z])\1+/$1/g;
    > print}}'
    >
    > (assuming that you are only interested in alphabetic characters being
    > duplicated)
    >
    > DS


    Thanks, works beautifully.

    i
     
    Ignoramus21673, Apr 24, 2006
    #7
  8. Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
    > I am writing a little mail filter:
    >
    > I receive messages with Subjects such as:
    >
    > Hardcoore incesst Content
    >
    > I want to replace that with "Hardcore incest Content" (note removal of
    > duplicate characters. Is there some regexp that would let me do that.



    Yes, but a regex is not the Right Tool for this job.

    You can do it fine without any regular expressions:

    tr/a-zA-Z//s;


    Note that 'Mississippi' becomes 'Misisipi' ...


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 24, 2006
    #8
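As post #8 notes, squeezing is lossy: 'Mississippi' becomes 'Misisipi'. If the result is only fed to a filter, one way around this is to squeeze the keyword in the same way before comparing. The sketch below assumes that approach; `squeeze_letters` and `subject_matches` are hypothetical names, and lowercasing both sides is an added assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of repeated ASCII letters using tr///s.
sub squeeze_letters {
    my ($text) = @_;
    $text =~ tr/a-zA-Z//s;   # empty replacement list: characters map to themselves
    return $text;
}

# Hypothetical helper: compare subject and keyword in the same squeezed form.
sub subject_matches {
    my ($subject, $keyword) = @_;
    my $canon_subject = squeeze_letters(lc $subject);
    my $canon_keyword = squeeze_letters(lc $keyword);
    return index($canon_subject, $canon_keyword) >= 0;
}

print squeeze_letters('Mississippi'), "\n";                                      # Misisipi
print subject_matches('Viisit Missisippi', 'Mississippi') ? "hit\n" : "miss\n";  # hit
```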
  9. On Mon, 24 Apr 2006 10:30:39 -0500, Tad McClellan <> wrote:
    > Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
    >> I am writing a little mail filter:
    >>
    >> I receive messages with Subjects such as:
    >>
    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal of
    >> duplicate characters. Is there some regexp that would let me do that.

    >
    >
    > Yes, but a regex is not the Right Tool for this job.
    >
    > You can do it fine without any regular expressions:
    >
    > tr/a-zA-Z//s;
    >
    >
    > Note that 'Mississippi' becomes 'Misisipi' ...
    >
    >


    Thanks. Someone suggested to use a regexp like this

    $s =~ s/([A-Za-z])\1+/$1/g;


    which actually works. If tr is somehow better (not sure why), I can
    switch to using tr.

    i
     
    Ignoramus21673, Apr 24, 2006
    #9
  10. Ignoramus21673

    David Squire Guest

    Tad McClellan wrote:
    > Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
    >> I am writing a little mail filter:
    >>
    >> I receive messages with Subjects such as:
    >>
    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal of
    >> duplicate characters. Is there some regexp that would let me do that.

    >
    >
    > Yes, but a regex is not the Right Tool for this job.
    >
    > You can do it fine without any regular expressions:
    >
    > tr/a-zA-Z//s;
    >
    >


    Out of interest, can tr handle more general cases, such as:

    s/(.)\1+/$1/g;

    or is a regex necessary for this?

    DS

    PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.
     
    David Squire, Apr 24, 2006
    #10
  11. Ignoramus21673

    Mintcake Guest

    David Squire wrote:
    > Lukas Mai wrote:
    > > Ignoramus21673 <ignoramus21673@nospam.21673.invalid> schrob:
    > >> I am writing a little mail filter:
    > >>
    > >> I receive messages with Subjects such as:
    > >>
    > >> Hardcoore incesst Content
    > >>
    > >> I want to replace that with "Hardcore incest Content" (note removal of
    > >> duplicate characters. Is there some regexp that would let me do that.

    > >
    > > Not a regexp, but you can use tr/// with the s modifier. See perldoc
    > > perlop.

    >
    > Yes. This is indeed nicer:
    >
    > echo 'Heelllooo WWWoorrld' | perl -e '{while (<>) {tr/A-Za-z//s; print}}'
    >
    > DS

    Or even...

    echo 'Heelllooo WWWoorrld' | perl -pe 'tr/A-Za-z//s'
     
    Mintcake, Apr 24, 2006
    #11
  12. Ignoramus21673

    Dr.Ruud Guest

    Tad McClellan schreef:
    > Ignoramus21673:


    >> Hardcoore incesst Content
    >>
    >> I want to replace that with "Hardcore incest Content" (note removal
    >> of duplicate characters. Is there some regexp that would let me do
    >> that.

    >
    > Yes, but a regex is not the Right Tool for this job.


    Well, it is if you would rather use [:alpha:].

    (there can be more in [[:alpha:]] than is in [A-Za-z])

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 24, 2006
    #12
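The point in post #12 can be illustrated with a letter outside A-Za-z. This is a sketch under the assumption that the input is a decoded character string (here built with a `\x{...}` escape); both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Regex version: [[:alpha:]] also covers letters beyond ASCII.
sub squeeze_alpha_re {
    my ($text) = @_;
    $text =~ s/([[:alpha:]])\1+/$1/g;
    return $text;
}

# tr version: only the letters listed in the ranges are squeezed.
sub squeeze_ascii_tr {
    my ($text) = @_;
    $text =~ tr/A-Za-z//s;
    return $text;
}

# "\x{434}" is CYRILLIC SMALL LETTER DE, a letter that A-Za-z does not cover.
my $input = "\x{434}\x{434}\x{434}aaa";

printf "regex: %d characters left\n", length squeeze_alpha_re($input);   # 2
printf "tr:    %d characters left\n", length squeeze_ascii_tr($input);   # 4
```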
  13. Ignoramus21673

    Anno Siegel Guest

    David Squire <> wrote in comp.lang.perl.misc:
    > Tad McClellan wrote:
    > > Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
    > >> I am writing a little mail filter:
    > >>
    > >> I receive messages with Subjects such as:
    > >>
    > >> Hardcoore incesst Content
    > >>
    > >> I want to replace that with "Hardcore incest Content" (note removal of
    > >> duplicate characters. Is there some regexp that would let me do that.

    > >
    > >
    > > Yes, but a regex is not the Right Tool for this job.
    > >
    > > You can do it fine without any regular expressions:
    > >
    > > tr/a-zA-Z//s;
    > >
    > >

    >
    > Out of interest, can tr handle more general cases, such as:
    >
    > s/(.)\1+/$1/g;
    >
    > or is a regex necessary for this?


    tr/\x00-\x7f//s;

    covers the ASCII range. Any set of character ranges can be covered.
    See tr/// in perlop.

    > PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.


    Do look up tr///. The similarity with s/// is rather superficial. In
    particular, "." doesn't do in tr/// what it does in a regex.

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Apr 24, 2006
    #13
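The two points in post #13 can be checked side by side: a dot in tr/// is only a dot, while a range such as \x00-\x7f covers every ASCII character. A minimal sketch, with hypothetical subroutine names:

```perl
use strict;
use warnings;

# "." has no special meaning in tr///, so only literal dots are squeezed.
sub squeeze_dots {
    my ($text) = @_;
    $text =~ tr/.//s;
    return $text;
}

# The range \x00-\x7f lists every ASCII character, so any repeated ASCII
# character (letters, spaces, punctuation) is squeezed.
sub squeeze_ascii {
    my ($text) = @_;
    $text =~ tr/\x00-\x7f//s;
    return $text;
}

print squeeze_dots('aa....bb'), "\n";                    # aa.bb
print squeeze_ascii('Heelllooo,,  WWWoorrld!!'), "\n";   # Helo, World!
```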
  14. Ignoramus21673

    Anno Siegel Guest

    Dr.Ruud <> wrote in comp.lang.perl.misc:
    > Tad McClellan schreef:
    > > Ignoramus21673:

    >
    > >> Hardcoore incesst Content
    > >>
    > >> I want to replace that with "Hardcore incest Content" (note removal
    > >> of duplicate characters. Is there some regexp that would let me do
    > >> that.

    > >
    > > Yes, but a regex is not the Right Tool for this job.

    >
    > Well, it is if you would rather use [:alpha:].
    >
    > (there can be more in [[:alpha:]] than is in [A-Za-z])


    $_ = 'Heelllooo WWWoorrld';
    do {
        my $alpha = join '' =>
            grep /[[:alpha:]]/,
            map chr, 0 .. 255; # or whatever
        eval "sub { tr/$alpha//s }";
    }->();
    print "$_\n";

    :)

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Apr 24, 2006
    #14
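The eval in post #14 is there because tr/// builds its character list at compile time and does not interpolate variables. The same trick can be packaged as a reusable function. This is a sketch only: `make_squeezer` is a hypothetical name, and it assumes the argument is a literal list of characters (ranges are not expanded, because "-" is escaped along with "/" and "\").

```perl
use strict;
use warnings;

# Hypothetical helper: return a subroutine that squeezes repeats of the
# characters in $chars. $chars is a literal list, not a set of ranges.
sub make_squeezer {
    my ($chars) = @_;
    # Escape every non-word character so "/", "\" and "-" stay literal in tr.
    (my $safe = $chars) =~ s/(\W)/\\$1/g;
    my $code = eval "sub { (my \$s = \$_[0]) =~ tr/$safe//s; \$s }"
        or die "could not compile tr: $@";
    return $code;
}

# Build the list of ASCII letters the way post #14 does, then squeeze.
my $letters = join '', grep { /[[:alpha:]]/ } map { chr } 0 .. 127;
my $squeeze = make_squeezer($letters);

print $squeeze->('Heelllooo WWWoorrld'), "\n";   # Helo World
```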
  15. Ignoramus21673

    Dr.Ruud Guest

    Anno Siegel schreef:
    > Dr.Ruud:


    >> (there can be more in [[:alpha:]] than is in [A-Za-z])

    >
    > $_ = 'Heelllooo WWWoorrld';
    > do {
    > my $alpha = join '' =>
    > grep /[[:alpha:]]/,
    > map chr, 0 .. 255; # or whatever
    > eval "sub { tr/$alpha//s }";
    > }->();
    > print "$_\n";
    >
    > :)


    Heheh, I actually had this technique of yours (!) in mind while posting.

    The 'whatever' can be quite big:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($alpha, $i, $n) = ('', 0, 0);

    for (0x0000..0xD7FF, 0xE000..0xFDCF, 0xFDF0..0xFFFD) {
        ++$i;
        $_ = chr;
        $alpha .= $_ if /[[:alpha:]]/;
    }
    printf "%d / %d = %d%%\n", $n = length $alpha, $i, 100 * $n / $i;
    printf "%s\n", substr( $alpha, 0, 160 );
    __END__

    47276 / 63454 = 74%
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz...

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 24, 2006
    #15
  16. Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:
    > On Mon, 24 Apr 2006 10:30:39 -0500, Tad McClellan <> wrote:
    >> Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:


    >>> Hardcoore incesst Content
    >>>
    >>> I want to replace that with "Hardcore incest Content" (note removal of
    >>> duplicate characters.


    >> You can do it fine without any regular expressions:
    >>
    >> tr/a-zA-Z//s;



    > If tr is somehow better



    It is.


    > (not sure why),



    1) it is more self-documenting. s/// is for *patterns*, tr/// is
    for characters, and you want to operate on characters not on patterns.

    2) it is a lot faster than s///g


    > I can
    > switch to using tr.



    Good.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 24, 2006
    #16
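The speed claim in post #16 can be checked locally with the core Benchmark module. This is a rough sketch, not a rigorous measurement; the test string and subroutine names are assumptions, and the numbers will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input: the subject from the thread, repeated to give both
# operators some work to do.
my $subject = 'Hardcoore incesst Content ' x 50;

sub squeeze_re { (my $s = $_[0]) =~ s/([A-Za-z])\1+/$1/g; return $s }
sub squeeze_tr { (my $s = $_[0]) =~ tr/A-Za-z//s;          return $s }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    's///g'  => sub { squeeze_re($subject) },
    'tr///s' => sub { squeeze_tr($subject) },
});
```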
  17. David Squire <> wrote:
    > Tad McClellan wrote:
    >> Ignoramus21673 <ignoramus21673@NOSPAM.21673.invalid> wrote:



    >>> (note removal of
    >>> duplicate characters.



    >> You can do it fine without any regular expressions:

                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >> tr/a-zA-Z//s;
    >>
    >>

    >
    > Out of interest, can tr handle more general cases, such as:
    >
    > s/(.)\1+/$1/g;
    >
    > or is a regex necessary for this?



    You do not need any regular expressions for that either:

    tr/\000-\011\013-\377//s; # No regex here!


    > PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.



    That must be because you tested on a string without any of
    the tr-listed characters in it. It works for me:

    perl -le '$_="etc..."; tr/.//s; print'


    Note the underlined part above. tr/// is NOT a regular expression
    (so a dot is a dot, not a "wildcard").


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 24, 2006
    #17
  18. Ignoramus21673 wrote:
    [something]

    Would you mind sticking to one email alias?
    Or are you suffering from multiple schizophrenia?

    jue
     
    Jürgen Exner, Apr 25, 2006
    #18
  19. Ignoramus21673

    Lukas Mai Guest

    David Squire <> schrob:
    >
    > Out of interest, can tr handle more general cases, such as:
    >
    > s/(.)\1+/$1/g;
    >
    > or is a regex necessary for this?
    >
    > DS
    >
    > PS. Yes, I have tested 'tr/.//s;', and it doesn't remove any dupes.


    That's because tr doesn't do patterns. '.' matches '.' and nothing else.
    Try perl -pe "tr///cs".

    HTH, Lukas
     
    Lukas Mai, Apr 25, 2006
    #19
