Data cleaning issue involving bad wide characters in what ought to be ascii data


Ted Byers

Again, I am trying to automatically process data I receive by email,
so I have no control over the data that is coming in.

The data is supposed to be plain text/HTML, but there are quite a
number of records where the contraction "rec'd" is misrepresented when
written to standard out as "Rec\342\200\231d"

When the data is written to a file, these characters are represented
by the character ’ when it is opened using notepad, but by the string
'â€™' when it is opened by open office.

So how do I tell what character it is when in three different contexts
it is displayed in three different ways? How can I make certain that
when I either print it or store it in my DB, I get the correct
"rec'd" (or, better, "received")?

I suspect a minor glitch in the software that makes and sends the email
as this is the ONLY string where what ought to be an ascii ' character
is identified as a wide character. Regardless of how that happens (as
I don't control that), I need to clean this. And it gets confusing
when different applications handle the i18n differently (Notepad is
undoubtedly using the OS i18n support and Open Office is handling it
differently, and Emacs is doing it differently from both).

A little enlightenment would be appreciated.

Thanks

Ted
 

Jürgen Exner

Ted Byers said:
Again, I am trying to automatically process data I receive by email,
so I have no control over the data that is coming in.

The data is supposed to be plain text/HTML, but there are quite a
number of records where the contraction "rec'd" is misrepresented when
written to standard out as "Rec\342\200\231d"

When the data is written to a file, these characters are represented
by the character ’ when it is opened using notepad, but by the string
'â€™' when it is opened by open office.

So how do I tell what character it is when in three different contexts
it is displayed in three different ways?

By explicitly telling the displaying program the encoding that was used
to create/save the file. In your case it very much looks like UTF-8.
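jue's suggestion can be sketched in a few lines of Perl: instead of letting each viewer guess, decode the bytes yourself with an explicit encoding layer. A minimal, self-contained sketch (the filename 'sample.txt' is made up for the demo):

```perl
use strict;
use warnings;

# Write the raw UTF-8 octets from the original post to a scratch file
# ('sample.txt' is a made-up name for the demo) ...
open my $out, '>:raw', 'sample.txt' or die "can't write sample.txt: $!";
print $out "Rec\342\200\231d";
close $out;

# ... then read it back, explicitly telling Perl the encoding.
open my $in, '<:encoding(UTF-8)', 'sample.txt' or die "can't read sample.txt: $!";
my $str = <$in>;
close $in;

# The 7 octets on disk decode to 5 characters; the odd one out is U+2019.
printf "%d characters, code point U+%04X\n", length($str), ord(substr($str, 3, 1));
```

This prints `5 characters, code point U+2019`: the seven octets on disk become five characters once decoded as UTF-8.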
How can I make certain that
when I either print it or store it in my DB, I get the correct
"rec'd" (or, better, "received")?

I suspect a minor glitch in the software that makes and sends the email
as this is the ONLY string where what ought to be an ascii ' character
is identified as a wide character.

That's not a wide character. A wide character is something totally
different.
Regardless of how that happens (as
I don't control that), I need to clean this. And it gets confusing
when different applications handle the i18n differently (Notepad is
undoubtedly using the OS i18n support and Open Office is handling it
differently, and Emacs is doing it differently from both).

Yep. If the file doesn't contain information about the encoding and/or
the application either doesn't support this encoding or misinterprets it
or cannot guess the encoding correctly then you will have to tell the
application which encoding to use (or use a different application).

Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
file is in UTF-8, even though UTF-8 is a byte sequence and files in
UTF-8 typically neither have nor need a BOM.
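For anyone wondering how to check: a UTF-8 BOM is simply the three octets EF BB BF at the very start of the file, so a few lines of Perl answer the question (the file names here are made up for the demo):

```perl
use strict;
use warnings;

# A UTF-8 BOM is the three octets EF BB BF at the very start of a file.
sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $got = read $fh, my $head, 3;
    close $fh;
    return defined $got && $got == 3 && $head eq "\xEF\xBB\xBF";
}

# Demo with two scratch files: one with a BOM, one without.
open my $fh1, '>:raw', 'bom.txt'   or die $!;
print $fh1 "\xEF\xBB\xBFRec'd";
close $fh1;
open my $fh2, '>:raw', 'nobom.txt' or die $!;
print $fh2 "Rec'd";
close $fh2;

printf "bom.txt: %s\n",   has_utf8_bom('bom.txt')   ? 'BOM found' : 'no BOM';
printf "nobom.txt: %s\n", has_utf8_bom('nobom.txt') ? 'BOM found' : 'no BOM';
```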

jue
 

Ted Byers

By explicitly telling the displaying program the encoding that was used
to create/save the file. In your case it very much looks like UTF-8.
My program needs to store the data as plain ascii regardless of how
the original data was encoded. And apart from this string, it looks
like all the data can be safely treated as ascii. The data comes as a
text/html attachment to the emails, so I am wondering if the headers
to the email might tell me something about the encoding ...
That's not a wide character. A wide character is something totally
different.
I have done almost no programming dealing with i18n, so I called it a
wide character because that's what Emacs called it when my program
wrote the data to standard out.
Yep. If the file doesn't contain information about the encoding and/or
the application either doesn't support this encoding or misinterprets it
or cannot guess the encoding correctly then you will have to tell the
application which encoding to use (or use a different application).

Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
file is in UTF-8, even though UTF-8 is a byte sequence and files in
UTF-8 typically neither have nor need a BOM.

jue
I don't know what a BOM is, let alone how to tell if a file has one.

Is there a safe way to ensure that all the data that is being
processed is plain ascii? I have seen email clients displaying this
data so I know that there are never characters in it, as displayed,
that would not be valid ascii.

I thought I'd have to resort to a regex, if I could figure out what to
scan for, but if there is a perl package that will make it easier to
deal with this odd character, great.

Thanks
Ted
 

Jürgen Exner

Ted Byers said:
My program needs to store the data as plain ascii

I dare to question the wisdom of this requirement. In today's world
restricting your data to ASCII only is a severe limitation and will more
often than not backfire when you least expect it. Does your data contain
e.g. any names? Customers, employees, places, tools or equipment named
after people or places? Can you guarantee that it will never be used
outside of the English-speaking world, not even for Spanish names in the
US?
A much more robust way is to finally accept that ASCII is almost 50
years old, obsolete, and completely inadequate for today's world and to
use Unicode/UTF-8 as the standard throughout.
regardless of how the original data was encoded.

If you insist on limiting yourself to ASCII only then obviously you will
have to deal with any non-ASCII character in some way. What do you
propose to do with e.g. my first name?
And apart from this string, it looks
like all the data can be safely treated as ascii. The data comes as a
text/html attachment to the emails, so I am wondering if the headers
to the email might tell me something about the encoding ...

Sorry, I'm not a MIME expert.

Convert it, transform it, remove it, reject it, ....
If it's really, really, really only this one instance ever, then
probably a simple s/// will do. But that will work only until some other
non-ASCII character shows up at your doorstep.
I don't know what a BOM is, let alone how to tell if a file has one.

See http://en.wikipedia.org/wiki/Byte-order_mark. You might be able to
use it to determine the encoding of your data.
Is there a safe way to ensure that all the data that is being
processed is plain ascii?

Only if the character set is explicitly specified as ASCII. Every other
character set does contain non-ASCII characters which you will have to
handle.
I have seen email clients displaying this
data so I know that there are never characters in it, as displayed,
that would not be valid ascii.

Would you bet your house on it?

jue
 

Jürgen Exner

Ted Byers said:
I thought I'd have to resort to a regex, if I could figure out what to
scan for, but if there is a perl package that will make it easier to
deal with this odd character, great.

Forgot to mention:
There is Text::Iconv (see
http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
convert text between different encodings. However I have no idea what it
does with characters that do not exist in the target character set.

jue
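For what it's worth, one can simply try it and see. The sketch below is an example under stated assumptions, not gospel: it asks Text::Iconv for an "ASCII//TRANSLIT" target, which on GNU iconv requests transliteration of unmappable characters (the curly quote comes out as a plain '). Other iconv implementations may behave differently, and without //TRANSLIT the conversion of an unmappable character simply fails and convert() returns undef.

```perl
use strict;
use warnings;
use Text::Iconv;

# "//TRANSLIT" asks the underlying iconv library to approximate characters
# that don't exist in the target set; support varies by implementation.
my $conv  = Text::Iconv->new('UTF-8', 'ASCII//TRANSLIT');
my $ascii = $conv->convert("Rec\342\200\231d");   # raw UTF-8 octets in

print defined $ascii ? "$ascii\n" : "conversion failed\n";
```

On a GNU iconv system this is expected to print Rec'd; treat the result as something to verify on your own platform.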
 

sln

Again, I am trying to automatically process data I receive by email,
so I have no control over the data that is coming in.

The data is supposed to be plain text/HTML, but there are quite a
number of records where the contraction "rec'd" is misrepresented when
written to standard out as "Rec\342\200\231d"

When the data is written to a file, these characters are represented
by the character ’ when it is opened using notepad, but by the string
'â€™' when it is opened by open office.

So how do I tell what character it is when in three different contexts
it is displayed in three different ways? How can I make certain that
when I either print it or store it in my DB, I get the correct
"rec'd" (or, better, "received")?

I suspect a minor glitch in the software that makes and sends the email
as this is the ONLY string where what ought to be an ascii ' character
is identified as a wide character. Regardless of how that happens (as
I don't control that), I need to clean this. And it gets confusing
when different applications handle the i18n differently (Notepad is
undoubtedly using the OS i18n support and Open Office is handling it
differently, and Emacs is doing it differently from both).

A little enlightenment would be appreciated.

Thanks

Ted


What you have there is a UTF-8 encoded character with
code point \x{2019}.

It is NOT an ascii single quote but rather a Unicode curly
single quote (right). See this table and this web site:

copyright sign 00A9 \u00A9
registered sign 00AE \u00AE
trademark sign 2122 \u2122
em-dash 2014 \u2014
euro sign 20AC \u20AC
curly single quotation mark (left) 2018 \u2018
curly single quotation mark (right) 2019 \u2019
curly double quotation mark (left) 201C \u201C
curly double quotation mark (right) 201D \u201D

http://moock.org/asdg/technotes/usingSpecialCharacters/

By the way, it displays fine in Notepad and Word. It is
not ascii, so you need a font and an app that can display
utf-8 characters.

If you want to convert these special characters, use a regex
to strip them from your system.

But first: apparently the embedding is done
in raw octets 'Rec\342\200\231d' that need to be decoded into
utf-8; then you can use code points in the regex.

You can strip these after you decode. Something like this:

$str = decode ('utf8', "your received string"); # utf-8 octets
$str =~ s/\x{2018}/'/g;
$str =~ s/\x{2019}/'/g;
$str =~ s/\x{201C}/"/g;
$str =~ s/\x{201D}/"/g;

etc, ...

Find a more efficient way to do the substitutions though.

See below for an example.
-sln
===========================
use strict;
use warnings;
use Encode;

my $str = decode ('utf8', "Rec\342\200\231d"); # utf-8 octets

my $data = "Rec\x{2019}d"; # Unicode Code Point

if ($str eq $data) {
    print "yes, they're equal\n";
}

open my $fh, '>', 'chr1.txt' or die "can't open chr1.txt: $!";
binmode $fh, ':utf8';   # $data contains a wide character

print $fh $data;
exit;

sub ordsplit
{
    my $string = shift;
    my $buf = '';
    for (map { ord $_ } split //, $string) {
        $buf .= sprintf("%c %02x ", $_, $_);
    }
    return $buf;
}
__END__
 

sln

You can strip these after you decode. Something like this:

$str = decode ('utf8', "your received string"); # utf-8 octets
$str =~ s/\x{2018}/'/g;
$str =~ s/\x{2019}/'/g;
$str =~ s/\x{201C}/"/g;
$str =~ s/\x{201D}/"/g;

etc, ...
-sln
------------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':utf8');

my $str = decode ('utf8', "Rec\342\200\231d"); # utf8 octets
my $data = "Rec\x{2019}d"; # Unicode Code Point

if ($str eq $data) {
    print "yes, they're equal\n";
}
print ordsplit($data),"\n";

# Substitute select Unicode to ascii equivalent
my %unisub = (
"\x{2018}" => "'",
"\x{2019}" => "'",
"\x{201C}" => '"',
"\x{201D}" => '"',
);
$str =~ s/$_/$unisub{$_}/ge for keys (%unisub);
print $str,"\n";

# OR -- Substitute all Unicode code points, 100 - 10FFFF, with ? character
$data =~ s/[\x{100}-\x{10FFFF}]/?/g;
print $data,"\n";

exit;

sub ordsplit {
    my $string = shift;
    my $buf = '';
    for (map { ord $_ } split //, $string) {
        $buf .= sprintf("%c %02x ", $_, $_);
    }
    return $buf;
}
__END__

output:

yes, they're equal
R 52 e 65 c 63 ’ 2019 d 64
Rec'd
Rec?d
 

Ted Byers

If it uses iconv, or works the same as iconv, it'll drop them.

Mart

Does it work on Windows? I don't find it on any of the repositories
identified in Activestate's PPM, and haven't had much luck installing
packages from cpan that aren't in at least one of those PPM
repositories. The documentation for it says nothing about
dependencies.

Thanks,

Ted
 

Ted Byers

You can strip these after you decode. Something like this:
$str = decode ('utf8', "your received string"); # utf-8 octets
$str =~ s/\x{2018}/'/g;
$str =~ s/\x{2019}/'/g;
$str =~ s/\x{201C}/"/g;
$str =~ s/\x{201D}/"/g;

-sln
------------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':utf8');

my $str = decode ('utf8', "Rec\342\200\231d"); # utf8 octets
my $data  = "Rec\x{2019}d"; # Unicode Code Point

if ($str eq $data) {
        print "yes, they're equal\n";
}

print ordsplit($data),"\n";

# Substitute select Unicode to ascii equivalent
my %unisub = (
"\x{2018}" => "'",
"\x{2019}" => "'",
"\x{201C}" => '"',
"\x{201D}" => '"',
);  
$str =~ s/$_/$unisub{$_}/ge for keys (%unisub);
print $str,"\n";

# OR -- Substitute all Unicode code points, 100 - 10FFFF, with ? character
$data =~ s/[\x{100}-\x{10FFFF}]/?/g;
print $data,"\n";

exit;

sub ordsplit {
        my $string = shift;
        my $buf = '';
        for (map { ord $_ } split //, $string) {
                $buf .= sprintf("%c %02x  ", $_, $_);
        }
        return $buf;
}

__END__

output:

yes, they're equal
R 52  e 65  c 63  ’ 2019  d 64
Rec'd
Rec?d

Thank you very much. Brilliant. I learned plenty from this, and
Jue's posts about this.

Cheers,

Ted
 

Jürgen Exner

Ted Byers said:
Does it work on Windows?

What "it" are you referring to? According to your quoting style it must
be the revolution in Mart's signature. However I find that rather
unlikely. There has never been anything revolutionary about Windows.

Or are you referring to the iconv tool that Mart mentioned? I know
nothing about that.

Or are you referring to the Text::Iconv module that I mentioned?
I used it a lot several years ago on Windows.
I don't find it on any of the repositories
identified in Activestate's PPM, and haven't had much luck installing
packages from cpan that aren't in at least one of those PPM
repositories. The documentation for it says nothing about
dependencies.

I had no problems installing Text::Iconv from CPAN on Windows (XP and
Server2000). However as I mentioned that was several years ago, no
recent experience.

jue
 

Peter J. Holzer

My program needs to store the data as plain ascii regardless of how
the original data was encoded. And apart from this string, it looks
like all the data can be safely treated as ascii. The data comes as a
text/html attachment to the emails, so I am wondering if the headers
to the email might tell me something about the encoding ...

Don't wonder, look! If you look at the source code of the email you will
probably see a header like

Content-Type: text/html; charset=utf-8

This tells you that the encoding is UTF-8.

Or maybe the HTML part itself contains a meta element.
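To make that concrete, here is a rough sketch (not a real MIME parser; CPAN modules such as Email::MIME or MIME::Parser do this properly) that fishes the charset parameter out of a Content-Type header line:

```perl
use strict;
use warnings;

# Quick-and-dirty extraction of the charset parameter; real mail should go
# through a proper MIME parser, as header syntax has many corner cases.
sub charset_from_content_type {
    my ($header) = @_;
    return $header =~ /charset\s*=\s*"?([\w.-]+)"?/i ? lc($1) : undef;
}

print charset_from_content_type('Content-Type: text/html; charset=utf-8'), "\n";
print charset_from_content_type('Content-Type: text/html; charset="ISO-8859-1"'), "\n";
```

This prints `utf-8` and then `iso-8859-1`.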

Looks like somebody tried to be cute and used a right single quotation
mark ("\x{2019}", "’") instead of an apostrophe ("\x{27}", "'").

I have done almost no programming dealing with i18n, so I called it a
wide character because that's what Emacs called it when my program
wrote the data to standard out.

In Perl jargon a "wide character" is usually a character with a code
greater than 255, although sometimes it is used to refer to a character
in a character string. "\x{2019}" (RIGHT SINGLE QUOTATION MARK) is a wide
character by both definitions. So emacs is right, although I suspect that
it uses a different definition.
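Perl's own use of the term is easy to demonstrate: print a string containing a character above 255 to a handle that has no encoding layer and Perl emits the well-known "Wide character in print" warning (the scratch filename here is made up for the demo):

```perl
use strict;
use warnings;

my $s = "Rec\x{2019}d";   # contains a character with code > 255

# Capture the warning that printing to a layer-less handle produces.
my $warning = '';
{
    local $SIG{__WARN__} = sub { $warning = shift };
    open my $fh, '>', 'scratch.txt' or die "can't open scratch.txt: $!";
    print $fh $s;   # no :utf8/:encoding layer -> "Wide character" warning
    close $fh;
}
unlink 'scratch.txt';

print $warning ? "caught: $warning" : "no warning\n";
```

Adding `binmode $fh, ':encoding(UTF-8)'` before the print makes the warning go away, because Perl then knows how to serialize the character.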


The OpenOffice import filter for text files is absolutely horrible.
In this case it obviously interprets the file as ISO-8859-1 (or
something similar) instead of UTF-8.

Emacs 22.2.1 handles UTF-8 files fine on Linux. I think it has done so
for quite a while, although I don't normally use it. Either your Emacs
is very old or the Windows port is broken or there is some setting which
you need to change.
I don't know what a BOM is, let alone how to tell if a file has one.

Is there a safe way to ensure that all the data that is being
processed is plain ascii? I have seen email clients displaying this
data so I know that there are never characters in it, as displayed,
that would not be valid ascii.

RIGHT SINGLE QUOTATION MARK is not valid ASCII. It may look very
similar to APOSTROPHE, but it is not the same character. From the
context you know that it should be an apostrophe and not a quotation
mark, but that is your knowledge about the English language and has
nothing to do whether an email client can display it (most email clients
today will happily display characters from all the major languages in
the world).
I thought I'd have to resort to a regex, if I could figure out what to
scan for, but if there is a perl package that will make it easier to
deal with this odd character, great.

Text::Unidecode replaces Non-ASCII characters with ASCII sequences. The
result may or may not be usable (in your case it is because it replaces
’ with '). Or you could just read the file character by character (*not*
byte by byte!) and replace all characters with a code >= 128 with a
useful substitute (since there are about 100000 characters you probably
want to define substitutions for only a few and let your script
complain about all others).

In both cases you need to decode your file properly (see perldoc -f
binmode and perldoc -f open).
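The character-by-character approach above can be sketched with nothing beyond the core Encode module; the substitution table here is an example, not a complete list:

```perl
use strict;
use warnings;
use Encode;

# Map the handful of known non-ASCII characters; anything else becomes '?'
# and triggers a warning so new offenders don't slip through silently.
my %subst = (
    "\x{2018}" => "'",  "\x{2019}" => "'",
    "\x{201C}" => '"',  "\x{201D}" => '"',
);

sub to_ascii {
    my ($str) = @_;
    $str =~ s{([^\x00-\x7F])}{
        exists $subst{$1}
            ? $subst{$1}
            : do { warn sprintf("unhandled U+%04X\n", ord $1); '?' }
    }ge;
    return $str;
}

my $str = decode('UTF-8', "Rec\342\200\231d");   # raw octets -> characters
print to_ascii($str), "\n";                      # prints: Rec'd
```

Note the decode step comes first: the regex matches characters, not bytes, so it only works on a properly decoded string.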

hp
 

sln

I learned plenty from this, and
Jue's posts about this.

Cheers,

Ted

Looking back, it can for the most part be boiled down to this:
a roll-your-own, simple regex that covers all cases.

Good luck!
-sln
-------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':utf8');

#my $charset = 'utf8'; # Decode raw bytes that are in $charset encoding
#my $str = decode ($charset, "Your received string"); # encoded octets

# Example: $str is utf8 via decoding the received sample and looks like this:
my $str = "Rec\x{2019}d, copyright \x{00A9} 2009, trademark\x{2122} affixed";

# Select Unicode to ascii char-to-string substitutions
# ----
my %unisub = (
"\x{00A9}" => '(c)',
"\x{2018}" => "'",
"\x{2019}" => "'",
"\x{201C}" => '"',
"\x{201D}" => '"',
);

# Substitute non-ascii (code points 80 - 10FFFF) with ascii equivalent
# (or blank if not in hash)
# ----
$str =~ s/([\x{80}-\x{10FFFF}])/ exists $unisub{$1} ? $unisub{$1} : ''/ge;
print $str,"\n";

__END__

Output:

Rec'd, copyright (c) 2009, trademark affixed
 
