Unicode help please

Dave Saville

I have a perl script that does sanity checking on a mail store. The
"house keeping" files of the mail store used to be lines with the
fields separated by some hex character. This proved difficult if not
impossible to expand so the lead developer decided to change to XML
files in UTF8. One of the checks now compares the old and new
versions of a file, because if the XML files are not found they are
generated from the old ones - and we have had some "interesting"
problems :)

One of the files holds file system folder names and one of the checks
is that the name is the same in both files - we had a bug where they
weren't. It has been working fine until a German user came along with
an Umlaut in the folder name. :)

The base code page of the system is 850. So the true file system name
has the cp850 umlaut as does the old housekeeping file but of course
the XML version has a double byte version of the character.

My problem is getting them to compare equal. Just playing with a test
script and I just can't figure it out.

use strict;
use warnings;
use Unicode::Normalize;

open my $INI, '<', 'folder.ini' or die $!;
my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/, <$INI> )[ 0, 1, 10 ];
print $ini_folder_name, "\n";

open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
#open my $XML, '<', 'folderpr.xml' or die $!;
local $/ = undef;
my $xml = <$XML>;
my $XML_folder_name;
if ( $xml =~ m{>([^<]+)</profile>}s )
{
    $XML_folder_name = $1;
}
print $XML_folder_name, "\n" if NFD($ini_folder_name) eq NFD($XML_folder_name);

If I don't open the XML file with :utf8 they obviously don't test
equal, but neither do they when it is opened with :utf8.

In the latter case a hex dump of the output shows that all the utf8
seems to have done is drop the first of the two characters making up
the unicode. The XML file has the correct UTF8 code for the cp850
umlaut.

TIA
 
Bjoern Hoehrmann

* Dave Saville wrote in comp.lang.perl.misc:
The base code page of the system is 850. So the true file system name
has the cp850 umlaut as does the old housekeeping file but of course
the XML version has a double byte version of the character.

My problem is getting them to compare equal. Just playing with a test
script and I just can't figure it out.

Based on the description above, you have to decode both using the Encode
module, Encode::decode('cp850', ...) and Encode::decode('utf-8', ...),
and then simply use `eq` on the results. Do keep in mind that you might
well be dealing with Windows-1252 instead; on a semi-modern Windows
system, only the console might be using CP850. Using `Unicode::Normalize`
is incorrect for this purpose. You would use it when, for example, the OS
or file system modifies file names, but that's Apple's ballpark. Unicode
normalisation helps you if you want to compare U+00F6 ("ö") and the
two-character sequence U+006F ("o") followed by U+0308 (combining
diaeresis), where NFC(...) generates the short and NFD(...) generates the
long form.

In the latter case a hex dump of the output shows that all the utf8
seems to have done is drop the first of the two characters making up
the unicode. The XML file has the correct UTF8 code for the cp850
umlaut.

If the above does not help, you should tell us the hex codes and actual
characters involved.
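
The decode-then-compare advice above can be sketched as follows. This is only an illustration under assumptions: the sample name "Müll" and the byte strings are made up for the example (CP850 stores u-umlaut as 0x81, UTF-8 stores it as C3 BC), and the variable names are hypothetical. The last three lines show the separate case where Unicode::Normalize does apply.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical sample data: "Müll" as raw bytes in two encodings.
my $cp850_bytes = "M\x81ll";       # CP850 stores u-umlaut as 0x81
my $utf8_bytes  = "M\xC3\xBCll";   # UTF-8 stores U+00FC as C3 BC

# Decode each byte string with its own encoding, then compare characters.
my $from_ini = decode('cp850', $cp850_bytes);
my $from_xml = decode('UTF-8', $utf8_bytes);
print "decoded names match\n" if $from_ini eq $from_xml;

# Normalisation is a separate concern: precomposed vs. combining forms.
my $precomposed = "\x{F6}";        # U+00F6
my $decomposed  = "o\x{308}";      # U+006F followed by U+0308
print "NFD expands the short form\n" if NFD($precomposed) eq $decomposed;
print "NFC composes the long form\n" if NFC($decomposed)  eq $precomposed;
```

If the old file turns out to be Windows-1252 rather than CP850, only the first argument of the first decode call would change.
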
 
Dave Saville

Hi Ben

If this file really is in cp850 you need to tell perl that:

open my $INI, "<:encoding(cp850)", "folder.ini" or die $!;

I had assumed it was cp850 because that is what Western OS/2 systems
default to. But looking at the data I see hex4DFC6C6C which is
ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
looks OK to me.

However, if, as Bjoern suggests, it's actually in cp1252, then this
isn't the cause of the problem, since all the umlauted characters are in
the same places in cp1252 and ISO8859-1, and perl assumes ISO8859-1 if
you don't tell it otherwise.

What character are you dealing with, and what byte is actually used to
represent it in the file?

my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/, <$INI> )[ 0, 1, 10 ];
print $ini_folder_name, "\n";
open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;

You should never use :utf8 for input. It does no validity checking, and
the rest of perl tends to assume Unicode strings will be valid, which
can lead to segfaults if they're not. Always use :encoding(utf8)
instead. (:utf8 is generally safe for output.)
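
To illustrate what validity checking means here, the sketch below decodes one valid and one deliberately broken byte string. It calls Encode::decode directly rather than going through a PerlIO layer so that it runs without any files; the sample strings are invented. Note that Encode treats the name 'UTF-8' as the strict encoding and 'utf8' as a laxer one, so the strict name is used here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: "Müll" in valid UTF-8, and a truncated sequence
# (the lead byte C3 without its continuation byte).
my $good = "M\xC3\xBCll";
my $bad  = "M\xC3ll";

# FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps the
# source string untouched.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $name = decode('UTF-8', $good, $check);
printf "valid input decoded to %d characters\n", length($name);

my $ok = eval { decode('UTF-8', $bad, $check); 1 };
print "malformed input was rejected\n" unless $ok;

# The file-based form of the same idea (file name taken from the script above):
# open my $XML, '<:encoding(UTF-8)', 'folderpr.xml' or die $!;
```
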

Ah, thanks. Copied from the perl cookbook :)

I really really don't get this stuff. :-(

Internally perl uses utf8 - yes? So if no code is specified it assumes
ISO8859-1 for input and output and converts to utf8 to store. I
presume it does not do any conversion if opened binary.

So the ini file is read assuming ISO8859-1 and converted to utf8.
The xml file is already utf8 so by telling perl that then no
conversion is done.

So if everything is in utf8 why can't I compare them?
 
Bjoern Hoehrmann

* Dave Saville wrote in comp.lang.perl.misc:
I had assumed it was cp850 because that is what Western OS/2 systems
default to. But looking at the data I see hex4DFC6C6C which is
ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
looks OK to me.

If that is 4D C2 B3 6C 6C then

% perl -MEncode -Mcharnames=:full -e "print charnames::viacode(ord decode('utf-8', qq(\xc2\xb3)))"
SUPERSCRIPT THREE

You probably need something like this:

% perl -MEncode -Mcharnames=:full -e "print charnames::viacode(ord
    decode('Windows-1252',
      encode('cp850',
        decode('utf-8', qq(\xc2\xb3)))))"
LATIN SMALL LETTER U WITH DIAERESIS

This seems to have multiple character encodings applied to it
incorrectly, and the sequence above might undo that, but more
test samples would be needed to determine that for sure.

I really really don't get this stuff. :-(

Internally perl uses utf8 - yes? So if no code is specified it assumes
ISO8859-1 for input and output and converts to utf8 to store. I
presume it does not do any conversion if opened binary.

So the ini file is read assuming ISO8859-1 and converted to utf8.
The xml file is already utf8 so by telling perl that then no
conversion is done.

So if everything is in utf8 why can't I compare them?

Perl internals are more complicated than the above. The Encode module,
and the features built upon it, like the `:encoding(...)` layer, know
how to turn bytes into something more character-ish and this is needed
pretty much always, and your original code did it only on one sequence
of bytes but not the other.
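
A small sketch of the bytes-versus-characters distinction, assuming the folder name is "Müll" (which is what the hex 4D FC 6C 6C spells in ISO-8859-1). The byte strings stand in for what would be read from the two files with no decoding at all; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the same folder name "Müll" as it might arrive
# from each file when neither is decoded.
my $ini_bytes = "M\xFCll";         # one byte per character (ISO-8859-1)
my $xml_bytes = "M\xC3\xBCll";     # UTF-8: u-umlaut takes two bytes

# Undecoded, perl sees five "characters" in the XML name, so eq fails.
printf "raw lengths: %d vs %d\n", length($ini_bytes), length($xml_bytes);
print "raw strings differ\n" if $ini_bytes ne $xml_bytes;

# Decoding both sides yields the same four characters.
my $ini_name = decode('ISO-8859-1', $ini_bytes);
my $xml_name = decode('UTF-8',      $xml_bytes);
print "decoded strings match\n" if $ini_name eq $xml_name;
```
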
 
Dave Saville

You probably need something like this:

% perl -MEncode -Mcharnames=:full -e "print charnames::viacode(ord
    decode('Windows-1252',
      encode('cp850',
        decode('utf-8', qq(\xc2\xb3)))))"
LATIN SMALL LETTER U WITH DIAERESIS

Hmm I get encode(<somecode>, $foo) will take $foo and return it
encoded in somecode. But what does decode('utf-8', $foo) decode
*into*?
 
Ben Bacarisse

Dave Saville said:
Hmm I get encode(<somecode>, $foo) will take $foo and return it
encoded in somecode. But what does decode('utf-8', $foo) decode
*into*?

It decodes a sequence of octets (\xc2\xb3) into Perl's internal form.
The 'utf-8' tells decode how to interpret the octets. The result is the
Unicode character U+00B3 -- superscript 3.

encode('cp850', ...) takes a string in Perl's internal character format
and produces a sequence of octets. In this case, just one: \xfc -- the
code for superscript 3 in CP-850.

Finally, decode('Windows-1252', ...) takes this octet stream (just the
one: \xfc) and turns it into a Perl string using the Windows-1252 code
table to decide what character each octet refers to. In this case,
\xfc is a lower-case u with dieresis.
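
The three steps just described can be run one at a time to see each intermediate value. This is only a sketch; the variable names $step1 to $step3 are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC2\xB3";                       # as found in the XML file

my $step1 = decode('UTF-8', $octets);          # characters: U+00B3
my $step2 = encode('cp850', $step1);           # octets: FC
my $step3 = decode('Windows-1252', $step2);    # characters: U+00FC

printf "after decode utf-8:        U+%04X\n", ord($step1);
printf "after encode cp850:        %02X\n",   ord($step2);
printf "after decode Windows-1252: U+%04X\n", ord($step3);
```
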

What may be more interesting is the reverse of this. Something happened
as the XML was generated that caused the wrong code to be used. Exactly
what is hard to say, but it stems from the fact that CP-850 and
Windows-1252 differ about what \xfc means. I think the simplest
explanation is that the data was always originally in Windows-1252
encoding (u-dieresis being \xfc) but when the data was converted to
UTF-8, it was incorrectly assumed to be CP-850. Thus the \xfc was taken
to be a superscript three, which was, in some sense, correctly rendered
in UTF-8 in the XML file.
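
The suspected mistake can be reproduced in a few lines. The sketch assumes the original data was "Müll" in Windows-1252, as conjectured above; whether the real converter worked this way is not known. It then applies the three-step repair to confirm that the round trip restores the name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed original data: "Müll" stored as Windows-1252 octets.
my $original = "M\xFCll";

# The suspected mistake: read the octets as CP850, then write UTF-8.
my $xml_octets = encode('UTF-8', decode('cp850', $original));
print join(' ', map { sprintf '%02X', ord } split //, $xml_octets), "\n";

# Undoing it: UTF-8 -> characters -> CP850 octets -> read as Windows-1252.
my $repaired = decode('Windows-1252',
    encode('cp850', decode('UTF-8', $xml_octets)));
print "repair restores the name\n"
    if $repaired eq decode('Windows-1252', $original);
```

The hex dump printed by the first step matches the 4D C2 B3 6C 6C reported from the XML file.
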
 
