Data cleaning issue involving bad wide characters in what ought to be ascii data

Discussion in 'Perl Misc' started by Ted Byers, Sep 3, 2009.

  1. Ted Byers

    Ted Byers Guest

    Again, I am trying to automatically process data I receive by email,
    so I have no control over the data that is coming in.

    The data is supposed to be plain text/HTML, but there are quite a
    number of records where the contraction "rec'd" is misrepresented when
    written to standard out as "Rec\342\200\231d"

    When the data is written to a file, these characters are represented
    by the character ' when it is opened using notepad, but by the string
    '’' when it is opened by open office.

    So how do I tell what character it is when in three different contexts
    it is displayed in three different ways? How can I make certain that
    when I either print it or store it in my DB, I get the correct
    "rec'd" (or, better, "received")?

    I suspect a minor glitch in the software that makes and sends the email
    as this is the ONLY string where what ought to be an ascii ' character
    is identified as a wide character. Regardless of how that happens (as
    I don't control that), I need to clean this. And it gets confusing
    when different applications handle the i18n differently (Notepad is
    undoubtedly using the OS i18n support and Open Office is handling it
    differently, and Emacs is doing it differently from both).

    A little enlightenment would be appreciated.

    Thanks

    Ted
    Ted Byers, Sep 3, 2009
    #1

  2. Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    Ted Byers <> wrote:
    >Again, I am trying to automatically process data I receive by email,
    >so I have no control over the data that is coming in.
    >
    >The data is supposed to be plain text/HTML, but there are quite a
    >number of records where the contraction "rec'd" is misrepresented when
    >written to standard out as "Rec\342\200\231d"
    >
    >When the data is written to a file, these characters are represented
    >by the character ' when it is opened using notepad, but by the string
    >'’' when it is opened by open office.
    >
    >So how do I tell what character it is when in three different contexts
    >it is displayed in three different ways?


    By explicitly telling the displaying program the encoding that was used
    to create/save the file. In your case it very much looks like UTF-8.

    >How can I make certain that
    >when I either print it or store it in my DB, I get the correct
    >"rec'd" (or, better, "received")?
    >
    >I suspect a minor glitch in the software that makes and send the email
    >as this is the ONLY string where what ought to be an ascii ' character
    >is identified as a wide character.


    That's not a wide character. A wide character is something totally
    different.

    >Regardless of how that happens (as
    >I don't control that), I need to clean this. And it gets confusing
    >when different applications handle the i18n differently (Notepad is
    >undoubtedly using the OS i18n support and Open Office is handling it
    >differently, and Emacs is doing it differently from both).


    Yep. If the file doesn't contain information about the encoding and/or
    the application either doesn't support this encoding or misinterprets it
    or cannot guess the encoding correctly then you will have to tell the
    application which encoding to use (or use a different application).

    Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
    file is in UTF-8, even though UTF-8 is a plain byte sequence and UTF-8
    files typically neither have nor need a BOM.

    jue
    Jürgen Exner, Sep 3, 2009
    #2

  3. Ted Byers

    Ted Byers Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Sep 3, 11:51 am, Jürgen Exner <> wrote:
    > Ted Byers <> wrote:
    > >Again, I am trying to automatically process data I receive by email,
    > >so I have no control over the data that is coming in.

    >
    > >The data is supposed to be plain text/HTML, but there are quite a
    > >number of records where the contraction "rec'd" is misrepresented when
    > >written to standard out as "Rec\342\200\231d"

    >
    > >When the data is written to a file, these characters are represented
    > >by the character ' when it is opened using notepad, but by the string
    > >'’' when it is opened by open office.

    >
    > >So how do I tell what character it is when in three different contexts
    > >it is displayed in three different ways?

    >
    > By explicitely telling the displaying program the encoding that was used
    > to create/save the file. In your case it very much looks like UTF-8.
    >

    My program needs to store the data as plain ascii regardless of how
    the original data was encoded. And apart from this string, it looks
    like all the data can be safely treated as ascii. The data comes as a
    text/html attachment to the emails, so I am wondering if the headers
    to the email might tell me something about the encoding ...

    > >How can I make certain that
    > >when I either print it or store it in my DB, I get the correct
    > >"rec'd" (or, better, "received")?

    >
    > >I suspect a minor glitch in the software that makes and send the email
    > >as this is the ONLY string where what ought to be an ascii ' character
    > >is identified as a wide character.

    >
    > That's not a wide character. A wide character is something totally
    > different.
    >

    I have done almost no programming dealing with i18n, so I called it a
    wide character because that's what Emacs called it when my program
    wrote the data to standard out.

    > >Regardless of how that happens (as
    > >I don't control that), I need to clean this.  And it gets confusing
    > >when different applications handle the i18n differently (Notepad is
    > >undoubtedly using the OS i18n support and Open Office is handling it
    > >differently, and Emacs is doing it differently from both).

    >
    > Yep. If the file doesn't contain information about the encoding and/or
    > the application either doesn't support this encoding or misinterprets it
    > or cannot guess the encoding correctly then you will have to tell the
    > application which encoding to use (or use a different application).
    >
    > Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
    > file is in UTF-8 in disregard of UTF-8 being a byte sequence and thus
    > files in UTF-8 typically neither having nor needing a BOM.
    >
    > jue

    I don't know what a BOM is, let alone how to tell if a file has one.

    Is there a safe way to ensure that all the data that is being
    processed is plain ascii? I have seen email clients displaying this
    data so I know that there are never characters in it, as displayed,
    that would not be valid ascii.

    I thought I'd have to resort to a regex, if I could figure out what to
    scan for, but if there is a perl package that will make it easier to
    deal with this odd character, great.

    Thanks
    Ted
    Ted Byers, Sep 3, 2009
    #3
  4. Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    Ted Byers <> wrote:
    >On Sep 3, 11:51 am, Jürgen Exner <> wrote:
    >> Ted Byers <> wrote:

    >My program needs to store the data as plain ascii


    I dare to question the wisdom of this requirement. In today's world
    restricting your data to ASCII only is a severe limitation and will more
    often than not backfire when you least expect it. Does your data contain
    e.g. any names? Customers, employees, places, tools or equipment named
    after people or places? Can you guarantee that it will never be used
    outside of the English-speaking world, not even for Spanish names in the
    US?
    A much more robust way is to finally accept that ASCII is almost 50
    years old, obsolete, and completely inadequate for today's world and to
    use Unicode/UTF-8 as the standard throughout.

    >regardless of how the original data was encoded.


    If you insist on limiting yourself to ASCII only then obviously you will
    have to deal with any non-ASCII character in some way. What do you
    propose to do with e.g. my first name?

    >And apart from this string, it looks
    >like all the data can be safely treated as ascii. The data comes as a
    >text/html attachment to the emails, so I am wondering if the headers
    >to the email might tell me something about the encoding ...


    Sorry, I'm not a MIME expert.

    >> >How can I make certain that
    >> >when I either print it or store it in my DB, I get the correct
    >> >"rec'd" (or, better, "received")?


    Convert it, transform it, remove it, reject it, ....
    If it's really, really, really only this one instance ever, then
    probably a simple s/// will do. But that will work only until some other
    non-ASCII character shows up at your doorstep.

    >> Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
    >> file is in UTF-8 in disregard of UTF-8 being a byte sequence and thus
    >> files in UTF-8 typically neither having nor needing a BOM.
    >>

    >I don't know what a BOM is, let alone how to tell if a file has one.


    See http://en.wikipedia.org/wiki/Byte-order_mark. You might be able to
    use it to determine the encoding of your data.
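    A BOM check only needs the first three bytes of the file, read in raw
    mode. This is a minimal sketch, not part of the original posts; the
    path you pass in is whatever file you want to inspect:

```perl
# Minimal sketch: report whether a file starts with the UTF-8 BOM (EF BB BF).
use strict;
use warnings;

sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $head = '';
    my $n = read($fh, $head, 3);    # first three raw bytes
    close $fh;
    return defined $n && $n == 3 && $head eq "\xEF\xBB\xBF";
}
```

    If this returns true, Notepad-style guessing will treat the file as
    UTF-8; most UTF-8 files in the wild carry no BOM at all.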

    >Is there a safe way to ensure that all the data that is being
    >processed is plain ascii?


    Only if the character set is explicitly specified as ASCII. Every other
    character set does contain non-ASCII characters which you will have to
    handle.

    >I have seen email clients displaying this
    >data so I know that there are never characters in it, as displayed,
    >that would not be valid ascii.


    Would you bet your house on it?

    jue
    Jürgen Exner, Sep 3, 2009
    #4
  5. Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    Ted Byers <> wrote:
    >I thought I'd have to resort to a regex, if I could figure out what to
    >scan for, but if there is a perl package that will make it easier to
    >deal with this odd character, great.


    Forgot to mention:
    There is Text::Iconv (see
    http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
    convert text between different encodings. However I have no idea what it
    does with characters that do not exist in the target character set.

    jue
    Jürgen Exner, Sep 3, 2009
    #5
  6. Ted Byers

    Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <> wrote:

    >Again, I am trying to automatically process data I receive by email,
    >so I have no control over the data that is coming in.
    >
    >The data is supposed to be plain text/HTML, but there are quite a
    >number of records where the contraction "rec'd" is misrepresented when
    >written to standard out as "Rec\342\200\231d"
    >
    >When the data is written to a file, these characters are represented
    >by the character ' when it is opened using notepad, but by the string
    >'’' when it is opened by open office.
    >
    >So how do I tell what character it is when in three different contexts
    >it is displayed in three different ways? How can I make certain that
    >when I either print it or store it in my DB, I get the correct
    >"rec'd" (or, better, "received")?
    >
    >I suspect a minor glitch in the software that makes and send the email
    >as this is the ONLY string where what ought to be an ascii ' character
    >is identified as a wide character. Regardless of how that happens (as
    >I don't control that), I need to clean this. And it gets confusing
    >when different applications handle the i18n differently (Notepad is
    >undoubtedly using the OS i18n support and Open Office is handling it
    >differently, and Emacs is doing it differently from both).
    >
    >A little enlightenment would be appreciated.
    >
    >Thanks
    >
    >Ted



    What you have there is an encoded UTF-8 character with
    code point \x{2019}.

    It is NOT an ascii single quote, but rather a Unicode curly
    single quote (right). See this table and this web site:

    copyright sign 00A9 \u00A9
    registered sign 00AE \u00AE
    trademark sign 2122 \u2122
    em-dash 2014 \u2014
    euro sign 20AC \u20AC
    curly single quotation mark (left) 2018 \u2018
    curly single quotation mark (right) 2019 \u2019
    curly double quotation mark (left) 201C \u201C
    curly double quotation mark (right) 201D \u201D

    http://moock.org/asdg/technotes/usingSpecialCharacters/

    By the way, it displays fine in Notepad and Word. It is
    not ascii, so you need a font and an app that can display
    utf-8 characters.

    If you want to convert these special characters, use a regex
    to strip them from your system.

    First, before you do that: apparently the embedding is done
    in raw octets 'Rec\342\200\231d' that need to be decoded into
    utf-8; then you can use code points in the regex.

    You can strip these after you decode. Something like this:

    $str = decode ('utf8', "your received string"); # utf-8 octets
    $str =~ s/\x{2018}/'/g;
    $str =~ s/\x{2019}/'/g;
    $str =~ s/\x{201C}/"/g;
    $str =~ s/\x{201D}/"/g;

    etc, ...

    Find a more efficient way to do the substitutions though.

    See below for an example.
    -sln
    ===========================
    use strict;
    use warnings;
    use Encode;

    my $str = decode ('utf8', "Rec\342\200\231d"); # utf-8 octets

    my $data = "Rec\x{2019}d"; # Unicode Code Point

    if ($str eq $data) {
        print "yes, they're equal\n";
    }
    open my $fh, '>', 'chr1.txt' or die "can't open chr1.txt: $!";

    print $fh $data;
    exit;

    sub ordsplit
    {
        my $string = shift;
        my $buf = '';
        for (map {ord $_} split //, $string) {
            $buf .= sprintf ("%c %02x ", $_, $_);
        }
        return $buf;
    }
    __END__
    , Sep 4, 2009
    #6
  7. Ted Byers

    Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Thu, 03 Sep 2009 16:07:07 -0700, wrote:

    >On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <> wrote:
    >
    >You can strip these after you decode. Something like this:
    >
    >$str = decode ('utf8', "your recieved string"); # utf-8 octets
    >$str =~ s/\x{2018}/'/g;
    >$str =~ s/\x{2019}/'/g;
    >$str =~ s/\x{201C}/"/g;
    >$str =~ s/\x{201D}/"/g;
    >
    >etc, ...
    >

    -sln
    ------------------
    use strict;
    use warnings;
    use Encode;

    binmode (STDOUT, ':utf8');

    my $str = decode ('utf8', "Rec\342\200\231d"); # utf8 octets
    my $data = "Rec\x{2019}d"; # Unicode Code Point

    if ($str eq $data) {
        print "yes, they're equal\n";
    }
    print ordsplit($data),"\n";

    # Substitute select Unicode to ascii equivalent
    my %unisub = (
        "\x{2018}" => "'",
        "\x{2019}" => "'",
        "\x{201C}" => '"',
        "\x{201D}" => '"',
    );
    $str =~ s/$_/$unisub{$_}/ge for keys (%unisub);
    print $str,"\n";

    # OR -- Substitute all Unicode code points, 100 - 1fffff, with a ? character
    $data =~ s/[\x{100}-\x{1fffff}]/?/g;
    print $data,"\n";

    exit;

    sub ordsplit {
        my $string = shift;
        my $buf = '';
        for (map {ord $_} split //, $string) {
            $buf .= sprintf ("%c %02x ", $_, $_);
        }
        return $buf;
    }
    __END__

    output:

    yes, they're equal
    R 52 e 65 c 63 ’ 2019 d 64
    Rec'd
    Rec?d
    , Sep 4, 2009
    #7
  8. Ted Byers

    Ted Byers Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Sep 4, 1:44 am, Mart van de Wege <> wrote:
    > Jürgen Exner <> writes:
    > > Ted Byers <> wrote:
    > >>I thought I'd have to resort to a regex, if I could figure out what to
    > >>scan for, but if there is a perl package that will make it easier to
    > >>deal with this odd character, great.

    >
    > > Forgot to mention:
    > > There is Text::Iconv (see
    > >http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
    > > convert text between different encodings. However I have no idea what it
    > > does with characters that do not exist in the target character set.

    >
    > If it uses iconv, or works the same as iconv, it'll drop them.
    >
    > Mart
    >
    > --
    > "We will need a longer wall when the revolution comes."
    > --- AJS, quoting an uncertain source.


    Does it work on Windows? I don't find it on any of the repositories
    identified in Activestate's PPM, and haven't had much luck installing
    packages from cpan that aren't in at least one of those PPM
    repositories. The documentation for it says nothing about
    dependencies.

    Thanks,

    Ted
    Ted Byers, Sep 4, 2009
    #8
  9. Ted Byers

    Ted Byers Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Sep 3, 8:22 pm, wrote:
    > On Thu, 03 Sep 2009 16:07:07 -0700, wrote:
    > >On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <> wrote:

    >
    > >You can strip these after you decode. Something like this:

    >
    > >$str = decode ('utf8', "your recieved string"); # utf-8 octets
    > >$str =~ s/\x{2018}/'/g;
    > >$str =~ s/\x{2019}/'/g;
    > >$str =~ s/\x{201C}/"/g;
    > >$str =~ s/\x{201D}/"/g;

    >
    > >etc, ...

    >
    > -sln
    > ------------------
    > use strict;
    > use warnings;
    > use Encode;
    >
    > binmode (STDOUT, ':utf8');
    >
    > my $str = decode ('utf8', "Rec\342\200\231d"); # utf8 octets
    > my $data  = "Rec\x{2019}d"; # Unicode Code Point
    >
    > if ($str eq $data) {
    >         print "yes thier equal\n";}
    >
    > print ordsplit($data),"\n";
    >
    > # Substitute select Unicode to ascii equivalent
    > my %unisub = (
    > "\x{2018}" => "'",
    > "\x{2019}" => "'",
    > "\x{201C}" => '"',
    > "\x{201D}" => '"',
    > );  
    > $str =~ s/$_/$unisub{$_}/ge for keys (%unisub);
    > print $str,"\n";
    >
    > # OR -- Substitute all Unicode code points, 100 - 1fffff with ? character
    > $data =~ s/[\x{100}-\x{1fffff}]/?/g;
    > print $data,"\n";
    >
    > exit;
    >
    > sub ordsplit {
    >         my $string = shift;
    >         my $buf = '';
    >         for (map {ord $_} split //, $string) {
    >                 $buf.= sprintf ("%c %02x  ",$_,$_);
    >         }
    >         return $buf;}
    >
    > __END__
    >
    > output:
    >
    > yes, they're equal
    > R 52  e 65  c 63  ’ 2019  d 64
    > Rec'd
    > Rec?d


    Thank you very much. Brilliant. I learned plenty from this, and
    from Jue's posts about this.

    Cheers,

    Ted
    Ted Byers, Sep 4, 2009
    #9
  10. Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    Ted Byers <> wrote:
    >On Sep 4, 1:44 am, Mart van de Wege <> wrote:
    >> Jürgen Exner <> writes:
    >> > Ted Byers <> wrote:
    >> >>I thought I'd have to resort to a regex, if I could figure out what to
    >> >>scan for, but if there is a perl package that will make it easier to
    >> >>deal with this odd character, great.

    >>
    >> > Forgot to mention:
    >> > There is Text::Iconv (see
    >> >http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
    >> > convert text between different encodings. However I have no idea what it
    >> > does with characters that do not exist in the target character set.

    >>
    >> If it uses iconv, or works the same as iconv, it'll drop them.
    >>
    >> Mart
    >>
    >> --
    >> "We will need a longer wall when the revolution comes."
    >> --- AJS, quoting an uncertain source.

    >
    >Does it work on Windows?


    What "it" are you referring to? According to your quoting style it must
    be the revolution in Mart's signature. However I find that rather
    unlikely. There has never been anything revolutionary about Windows.

    Or are you referring to the iconv tool that Mart mentioned? I know
    nothing about that.

    Or are you referring to the Text::Iconv module that I mentioned?
    I used it a lot several years ago on Windows.

    >I don't find it on any of the repositories
    >identified in Activestate's PPM, and haven't had much luck installing
    >packages from cpan that aren't in at least one of those PPM
    >repositories. The documentation for it says nothing about
    >dependencies.


    I had no problems installing Text::Iconv from CPAN on Windows (XP and
    Server 2000). However, as I mentioned, that was several years ago, no
    recent experience.

    jue
    Jürgen Exner, Sep 4, 2009
    #10
  11. Re: Data cleaning issue involving bad wide characters in what oughtto be ascii data

    On 2009-09-03 16:09, Ted Byers <> wrote:
    > On Sep 3, 11:51 am, Jürgen Exner <> wrote:
    >> Ted Byers <> wrote:
    >> >Again, I am trying to automatically process data I receive by email,
    >> >so I have no control over the data that is coming in.

    >>
    >> >The data is supposed to be plain text/HTML, but there are quite a
    >> >number of records where the contraction "rec'd" is misrepresented when
    >> >written to standard out as "Rec\342\200\231d"

    >>
    >> >When the data is written to a file, these characters are represented
    >> >by the character ' when it is opened using notepad, but by the string
    >> >'’' when it is opened by open office.

    >>
    >> >So how do I tell what character it is when in three different contexts
    >> >it is displayed in three different ways?

    >>
    >> By explicitely telling the displaying program the encoding that was used
    >> to create/save the file. In your case it very much looks like UTF-8.
    >>

    > My program needs to store the data as plain ascii regardless of how
    > the original data was encoded. And apart from this string, it looks
    > like all the data can be safely treated as ascii. The data comes as a
    > text/html attachment to the emails, so I am wondering if the headers
    > to the email might tell me something about the encoding ...


    Don't wonder, look! If you look at the source code of the email you will
    probably see a header like

    Content-Type: text/html; charset=utf-8

    This tells you that the encoding is UTF-8.

    Or maybe the HTML part itself contains a meta element.
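    For the simple, unquoted case the charset parameter can be pulled out
    with a one-line regex. This is only a sketch of the idea; a real mail
    pipeline should hand the message to a proper MIME parser instead:

```perl
# Minimal sketch: extract the charset parameter from a Content-Type header.
# Covers only the plain case; quoted-printable or folded headers need a
# real MIME module.
use strict;
use warnings;

sub charset_from_content_type {
    my ($header) = @_;
    # charset=utf-8 or charset="ISO-8859-1", case-insensitive
    return $header =~ /charset\s*=\s*"?([A-Za-z0-9_.-]+)"?/i ? lc($1) : undef;
}
```

    If this yields "utf-8", decode the attachment body with Encode before
    doing any character-level cleanup.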

    >> >How can I make certain that when I either print it or store it in my
    >> >DB, I get the correct "rec'd" (or, better, "received")?

    >>
    >> >I suspect a minor glitch in the software that makes and send the email
    >> >as this is the ONLY string where what ought to be an ascii ' character
    >> >is identified as a wide character.


    Looks like somebody tried to be cute and used a right single quotation
    mark ("\x{2019}", "’") instead of an apostrophe ("\x{27}", "'").


    >> That's not a wide character. A wide character is something totally
    >> different.
    >>

    > I have done almost no programming dealing with i18n, so I called it a
    > wide character because that's what Emacs called it when my program
    > wrote the data to standard out.


    In Perl jargon a "wide character" is usually a character with a code
    greater than 255, although sometimes it is used to refer to any character
    in a character string. "\x{2019}" (RIGHT SINGLE QUOTATION MARK) is a wide
    character by both definitions. So emacs is right, although I suspect that
    it uses a different definition.
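    The code-greater-than-255 definition is easy to check with ord. A tiny
    sketch (the helper name is made up for illustration):

```perl
# Tiny sketch: a character is "wide" (in the code > 255 sense) when its
# ordinal value exceeds 255.
use strict;
use warnings;

sub is_wide { ord($_[0]) > 255 }
```

    is_wide("\x{2019}") is true (8217 > 255), while is_wide("'") is false
    (the ASCII apostrophe is code 39).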


    >> >Regardless of how that happens (as
    >> >I don't control that), I need to clean this.  And it gets confusing
    >> >when different applications handle the i18n differently (Notepad is
    >> >undoubtedly using the OS i18n support and Open Office is handling it
    >> >differently,


    The OpenOffice import filter for text files is absolutely horrible.
    In this case it obviously interprets the file as ISO-8859-1 (or
    something similar) instead of UTF-8.

    >> >and Emacs is doing it differently from both).


    Emacs 22.2.1 handles UTF-8 files fine on Linux. I think it has done so
    for quite a while, although I don't normally use it. Either your Emacs
    is very old or the Windows port is broken or there is some setting which
    you need to change.

    >> Yep. If the file doesn't contain information about the encoding and/or
    >> the application either doesn't support this encoding or misinterprets it
    >> or cannot guess the encoding correctly then you will have to tell the
    >> application which encoding to use (or use a different application).
    >>
    >> Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
    >> file is in UTF-8 in disregard of UTF-8 being a byte sequence and thus
    >> files in UTF-8 typically neither having nor needing a BOM.
    >>
    >> jue

    > I don't know what a BOM is, let alone how to tell if a file has one.
    >
    > Is there a safe way to ensure that all the data that is being
    > processed is plain ascii? I have seen email clients displaying this
    > data so I know that there are never characters in it, as displayed,
    > that would not be valid ascii.


    RIGHT SINGLE QUOTATION MARK is not valid ASCII. It may look very
    similar to APOSTROPHE, but it is not the same character. From the
    context you know that it should be an apostrophe and not a quotation
    mark, but that is your knowledge about the English language and has
    nothing to do whether an email client can display it (most email clients
    today will happily display characters from all the major languages in
    the world).

    > I thought I'd have to resort to a regex, if I could figure out what to
    > scan for, but if there is a perl package that will make it easier to
    > deal with this odd character, great.


    Text::Unidecode replaces non-ASCII characters with ASCII sequences. The
    result may or may not be usable (in your case it is, because it replaces
    ’ with '). Or you could just read the file character by character (*not*
    byte by byte!) and replace all characters with a code >= 128 with a
    useful substitute (since there are about 100000 characters you probably
    want to define substitutions for only a few and let your script complain
    about all others).

    In both cases you need to decode your file properly (see perldoc -f
    binmode and perldoc -f open).
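    Reading the file through an :encoding layer gives you characters rather
    than bytes, after which the curly quotes can be mapped back in one tr.
    A minimal sketch (the sub name and file path are hypothetical):

```perl
# Minimal sketch: slurp a UTF-8 file as characters (not bytes), then map
# the four curly quotes to their ASCII lookalikes.
use strict;
use warnings;

sub clean_utf8_file {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path or die "can't open $path: $!";
    local $/;                      # slurp mode
    my $text = <$fh>;
    close $fh;
    # U+2018/U+2019 -> ', U+201C/U+201D -> "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}
```

    The same :encoding(UTF-8) layer on the output handle (or binmode) keeps
    any remaining non-ASCII characters from triggering "wide character"
    warnings when printing.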

    hp
    Peter J. Holzer, Sep 4, 2009
    #11
  12. Ted Byers

    Guest

    Re: Data cleaning issue involving bad wide characters in what ought to be ascii data

    On Fri, 4 Sep 2009 10:59:59 -0700 (PDT), Ted Byers <> wrote:

    >On Sep 3, 8:22 pm, wrote:
    >> On Thu, 03 Sep 2009 16:07:07 -0700, wrote:
    >> >On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <> wrote:

    >>

    >
    > I learned plenty from this, and
    >Jue's posts about this.
    >
    >Cheers,
    >
    >Ted


    Looking back, it can for the most part be boiled down to this:
    a roll-your-own, simple regex that covers all cases.

    Good luck!
    -sln
    -------------
    use strict;
    use warnings;
    use Encode;

    binmode (STDOUT, ':utf8');

    #my $charset = 'utf8'; # Decode raw bytes that are in $charset encoding
    #my $str = decode ($charset, "Your received string"); # encoded octets

    # Example: $str is utf8 via decoding the received sample and is like this:
    my $str = "Rec\x{2019}d, copyright \x{00A9} 2009, trademark\x{2122} affixed";

    # Select Unicode to ascii char-to-string substitutions
    # ----
    my %unisub = (
        "\x{00A9}" => '(c)',
        "\x{2018}" => "'",
        "\x{2019}" => "'",
        "\x{201C}" => '"',
        "\x{201D}" => '"',
    );

    # Substitute non-ascii (code points 80 - 1fffff) with ascii equivalent
    # (or blank if not in hash)
    # ----
    $str =~ s/([\x{80}-\x{1fffff}])/ exists $unisub{$1} ? $unisub{$1} : ''/ge;
    print $str,"\n";

    __END__

    Output:

    Rec'd, copyright (c) 2009, trademark affixed
    , Sep 4, 2009
    #12
