Unicode help please

Discussion in 'Perl Misc' started by Dave Saville, Oct 19, 2013.

  1. Dave Saville

    Dave Saville Guest

    I have a perl script that does sanity checking on a mail store. The
    "house keeping" files of the mail store used to be lines with the
    fields separated by some hex character. This proved difficult if not
    impossible to expand so the lead developer decided to change to XML
    files in UTF8. One of the checks compares the old and new versions
    of a file, because if the XML files are not found they are
    generated from the old ones - and we have had some "interesting"
    problems :)

    One of the files holds file system folder names and one of the checks
    is that the name is the same in both files - we had a bug where they
    weren't. It has been working fine until a German user came along with
    an Umlaut in the folder name. :)

    The base code page of the system is 850. So the true file system name
    has the cp850 umlaut as does the old housekeeping file but of course
    the XML version has a double byte version of the character.

    My problem is getting them to compare equal. Just playing with a test
    script and I just can't figure it out.

    use strict;
    use warnings;
    use Unicode::Normalize;
    open my $INI, '<', 'folder.ini' or die $!;
    my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/,
    <$INI> )[ 0, 1, 10 ];
    print $ini_folder_name, "\n";
    open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
    #open my $XML, '<', 'folderpr.xml' or die $!;
    local $/ = undef;
    my $xml = <$XML>;
    my $XML_folder_name;
    if ( $xml =~ m{>([^<]+)</profile>}s )
    {
    $XML_folder_name = $1;
    }
    print $XML_folder_name, "\n" if NFD($ini_folder_name) eq
    NFD($XML_folder_name);

    If I don't open the XML file with :utf8 they obviously don't test
    equal, but neither do they when it is opened with :utf8.

    In the latter case a hex dump of the output shows that all the utf8
    seems to have done is drop the first of the two characters making up
    the unicode. The XML file has the correct UTF8 code for the cp850
    umlaut.

    TIA
    --
    Regards
    Dave Saville
    Dave Saville, Oct 19, 2013
    #1

  2. * Dave Saville wrote in comp.lang.perl.misc:
    >The base code page of the system is 850. So the true file system name
    >has the cp850 umlaut as does the old housekeeping file but of course
    >the XML version has a double byte version of the character.
    >
    >My problem is getting them to compare equal. Just playing with a test
    >script and I just can't figure it out.


    Based on the description above, you have to decode both using the Encode
    module, Encode::decode('cp850', ...) and Encode::decode('utf-8', ...),
    and then simply use `eq` on the results. Do keep in mind that you might
    well be dealing with Windows-1252 instead: on a semi-modern Windows
    system, only the console might be using CP850. Using `Unicode::Normalize`
    is incorrect for this purpose. You would use it when, for example, the OS
    or file system modifies file names, but that's Apple's ballpark. Unicode
    normalisation helps if you want to compare U+00F6 ("ö") with the
    two-character sequence U+006F ("o") followed by U+0308 (combining
    diaeresis): NFC(...) generates the short form and NFD(...) the long form.
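
As a minimal sketch of the advice above: the folder name "Müll" and its byte values are hypothetical samples, not data from the thread. The first half decodes each byte string with its own encoding before comparing. The second half shows the separate problem that normalisation addresses.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical sample: the folder name "Müll" as raw bytes in each encoding.
my $cp850_bytes = "M\x81ll";        # cp850 stores u-umlaut as 0x81
my $utf8_bytes  = "M\xC3\xBCll";    # UTF-8 stores U+00FC as C3 BC

# Decode each byte string with its own encoding, then compare characters.
my $from_ini = decode('cp850', $cp850_bytes);
my $from_xml = decode('UTF-8', $utf8_bytes);
print "decoded names match\n" if $from_ini eq $from_xml;

# Normalisation addresses a different problem: composed vs decomposed forms.
my $composed   = "\x{FC}";      # U+00FC, one character
my $decomposed = "u\x{308}";    # U+0075 followed by U+0308
print "NFC forms match\n" if NFC($composed) eq NFC($decomposed);
print "NFD length: ", length(NFD($composed)), "\n";
```

Comparing the raw byte strings directly would fail, because 0x81 and the pair C3 BC only denote the same character once each has been decoded.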

    >In the latter case a hex dump of the output shows that all the utf8
    >seems to have done is drop the first of the two characters making up
    >the unicode. The XML file has the correct UTF8 code for the cp850
    >umlaut.


    If the above does not help, you should tell us the hex codes and actual
    characters involved.
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Oct 19, 2013
    #2

  3. Dave Saville

    Dave Saville Guest

    On Sat, 19 Oct 2013 13:13:37 UTC, Ben Morrow <> wrote:

    Hi Ben

    >
    > Quoth "Dave Saville" <>:
    > > I have a perl script that does sanity checking on a mail store. The
    > > "house keeping" files of the mail store used to be lines with the
    > > fields separated by some hex character. This proved difficult if not
    > > impossible to expand so the lead developer decided to change to XML
    > > files in UTF8. Now one of the checks is between the old and new
    > > versions of a file as if XML files are not found then they are
    > > generated from the old ones - and we have had some "interesting"
    > > problems :)
    > >
    > > One of the files holds file system folder names and one of the checks
    > > is that the name is the same in both files - we had a bug where they
    > > weren't. It has been working fine until a German user came along with
    > > an Umlaut in the folder name. :)
    > >
    > > The base code page of the system is 850. So the true file system name
    > > has the cp850 umlaut as does the old housekeeping file but of course
    > > the XML version has a double byte version of the character.
    > >
    > > My problem is getting them to compare equal. Just playing with a test
    > > script and I just can't figure it out.
    > >
    > > use strict;
    > > use warnings;
    > > use Unicode::Normalize;
    > > open my $INI, '<', 'folder.ini' or die $!;

    >
    > If this file really is in cp850 you need to tell perl that:
    >
    > open my $INI, "<:encoding(cp850)", "folder.ini" or die $!;
    >


    I had assumed it was cp850 because that is what Western OS/2 systems
    default to. But looking at the data I see hex4DFC6C6C which is
    ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
    looks OK to me.

    > However, if, as Bjoern suggests, it's actually in cp1252, then this
    > isn't the cause of the problem, since all the umlauted characters are in
    > the same places in cp1252 and ISO8859-1, and perl assumes ISO8859-1 if
    > you don't tell it otherwise.
    >
    > What character are you dealing with, and what byte is actually used to
    > represent it in the file?
    >
    > > my ( $ini_folder_name, $id, $ini_is_archived ) = ( split /\xDE/,
    > > <$INI> )[ 0, 1, 10 ];
    > > print $ini_folder_name, "\n";
    > > open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;

    >
    > You should never use :utf8 for input. It does no validity checking, and
    > the rest of perl tends to assume Unicode strings will be valid, which
    > can lead to segfaults if they're not. Always use :encoding(utf8)
    > instead. (:utf8 is generally safe for output.)
    >


    Ah, thanks. Copied from the perl cookbook :)
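
To illustrate the validity-checking point, here is a small sketch using the Encode module directly rather than an I/O layer. The malformed byte string is a made-up sample. It contrasts a strict decode, which refuses bad input, with the default lenient decode, which substitutes U+FFFD.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up malformed input: C3 opens a two-byte sequence, but the next
# byte (6C, "l") is not a continuation byte.
my $bad = "M\xC3ll";

# Strict decode: croak on malformed input. LEAVE_SRC keeps $bad unmodified.
my $strict_ok = eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print $strict_ok ? "strict decode accepted the input\n"
                 : "strict decode rejected the input\n";

# Default (lenient) decode: malformed bytes become U+FFFD.
my $lenient = decode('UTF-8', $bad);
printf "lenient result contains U+FFFD: %s\n",
    ($lenient =~ /\x{FFFD}/ ? "yes" : "no");
```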

    I really really don't get this stuff. :-(

    Internally perl uses utf8 - yes? So if no code is specified it assumes
    ISO8859-1 for input and output and converts to utf8 to store. I
    presume it does not do any conversion if opened binary.

    So the ini file is read assuming ISO8859-1 and converted to utf8.
    The xml file is already utf8 so by telling perl that then no
    conversion is done.

    So if everything is in utf8 why can't I compare them?

    --
    Regards
    Dave Saville
    Dave Saville, Oct 19, 2013
    #3
  4. * Dave Saville wrote in comp.lang.perl.misc:
    >I had assumed it was cp850 because that is what Western OS/2 systems
    >default to. But looking at the data I see hex4DFC6C6C which is
    >ISO8859-1 lower case u umlaut. The xml file has hex4DC2B36C 6C. Which
    >looks OK to me.


    If that is 4D C2 B3 6C 6C then

    % perl -MEncode -Mcharnames=:full -e "print charnames::viacode(ord decode('utf-8', qq(\xc2\xb3)))"
    SUPERSCRIPT THREE

    You probably need something like this:

    % perl -MEncode -Mcharnames=:full -e "print charnames::viacode(ord
        decode('Windows-1252',
            encode('cp850',
                decode('utf-8', qq(\xc2\xb3)))))"
    LATIN SMALL LETTER U WITH DIAERESIS

    This seems to have multiple character encodings applied to it
    incorrectly, and the sequence above might undo that, but more
    test samples would be needed to determine that for sure.

    >I really really don't get this stuff. :-(
    >
    >Internally perl uses utf8 - yes? So if no code is specified it assumes
    >ISO8859-1 for input and output and converts to utf8 to store. I
    >presume it does not do any conversion if opened binary.
    >
    >So the ini file is read assuming ISO8859-1 and converted to utf8.
    >The xml file is already utf8 so by telling perl that then no
    >conversion is done.
    >
    >So if everything is in utf8 why can't I compare them?


    Perl internals are more complicated than the above. The Encode module,
    and the features built upon it, like the `:encoding(...)` layer, know
    how to turn bytes into something more character-ish and this is needed
    pretty much always, and your original code did it only on one sequence
    of bytes but not the other.
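
A sketch of decoding both inputs with `:encoding(...)` layers follows. It assumes the old file is in Windows-1252 and the XML holds correctly encoded UTF-8, which is not the case for the damaged file in this thread. The file contents and the name "Müll" are hypothetical, written to a temporary directory so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Write hypothetical sample files as raw bytes.
open my $out, '>:raw', "$dir/folder.ini" or die $!;
print {$out} "M\xFCll\xDE42\n";                   # FC is u-umlaut in cp1252
close $out;
open $out, '>:raw', "$dir/folderpr.xml" or die $!;
print {$out} "<profile>M\xC3\xBCll</profile>\n";  # C3 BC is U+00FC in UTF-8
close $out;

# Decode the old file as cp1252. The 0xDE separator byte becomes U+00DE,
# which /\xDE/ still matches on the decoded string.
open my $ini, '<:encoding(cp1252)', "$dir/folder.ini" or die $!;
my ($ini_name) = split /\xDE/, scalar <$ini>;
close $ini;

# Decode the XML file as UTF-8 and slurp it.
open my $xml, '<:encoding(UTF-8)', "$dir/folderpr.xml" or die $!;
my $content = do { local $/; <$xml> };
close $xml;
my ($xml_name) = $content =~ m{>([^<]+)</profile>};

print "names match\n" if $ini_name eq $xml_name;
```

If the old file were decoded as cp850 instead, the separator byte 0xDE would become a different character, so the split pattern would need to change with the encoding.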
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Oct 19, 2013
    #4
  5. Dave Saville

    Dave Saville Guest

    On Sat, 19 Oct 2013 16:44:16 UTC, Bjoern Hoehrmann
    <> wrote:

    > You probably need something like this:
    >
    > % perl -MEncode -Mcharnames=:full -e
    > "print charnames::viacode(ord
    > decode('Windows-1252',
    > encode('cp850',
    > decode('utf-8', qq(\xc2\xb3)))))"
    > LATIN SMALL LETTER U WITH DIAERESIS
    >


    Hmm I get encode(<somecode>, $foo) will take $foo and return it
    encoded in somecode. But what does decode('utf-8', $foo) decode
    *into*?

    --
    Regards
    Dave Saville
    Dave Saville, Oct 19, 2013
    #5
  6. "Dave Saville" <> writes:

    > On Sat, 19 Oct 2013 16:44:16 UTC, Bjoern Hoehrmann
    > <> wrote:
    >
    >> You probably need something like this:
    >>
    >> % perl -MEncode -Mcharnames=:full -e
    >> "print charnames::viacode(ord
    >> decode('Windows-1252',
    >> encode('cp850',
    >> decode('utf-8', qq(\xc2\xb3)))))"
    >> LATIN SMALL LETTER U WITH DIAERESIS
    >>

    >
    > Hmm I get encode(<somecode>, $foo) will take $foo and return it
    > encoded in somecode. But what does decode('utf-8', $foo) decode
    > *into*?


    It decodes a sequence of octets (\xc2\xb3) into Perl's internal form.
    The 'utf-8' tells decode how to interpret the octets. The result is the
    Unicode character U+00B3 -- superscript 3.

    encode('cp850', ...) takes a string in Perl's internal character format
    and produces a sequence of octets. In this case, just one: \xfc -- the
    code for superscript 3 in CP-850.

    Finally, decode('Windows-1252', ...) takes this octet stream (just the
    one: \xfc) and turns it into a Perl string using the Windows-1252 code
    table to decide what character each octet refers to. In this case,
    \xfc is a lower-case u with dieresis.
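
The three steps can be traced one at a time. This sketch prints the intermediate value at each stage; the variable names are only for illustration, and 'cp1252' is used as the Encode name for Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC2\xB3";                   # the two octets from the XML file
my $chars  = decode('UTF-8', $octets);     # one character: U+00B3
my $cp850  = encode('cp850', $chars);      # one octet: 0xFC
my $fixed  = decode('cp1252', $cp850);     # one character: U+00FC

printf "UTF-8 octets in : %d\n",      length $octets;
printf "decoded         : U+%04X\n",  ord $chars;
printf "cp850 octet     : 0x%02X\n",  ord $cp850;
printf "final character : U+%04X\n",  ord $fixed;
```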

    What may be more interesting is the reverse of this. Something happened
    as the XML was generated that caused the wrong code to be used. Exactly
    what is hard to say, but it stems from the fact that CP-850 and
    Windows-1252 differ about what \xfc means. I think the simplest
    explanation is that the data was always originally in Windows-1252
    encoding (u-dieresis being \xfc) but when the data was converted to
    UTF-8, it was incorrectly assumed to be CP-850. Thus the \xfc was taken
    to be a superscript three, which was, in some sense, correctly rendered
    in UTF-8 in the XML file.
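
The suspected conversion error can be simulated to see whether it reproduces the bytes reported from the XML file. This is only a sketch of that hypothesis: it assumes the original data was Windows-1252 and was wrongly read as CP-850, and the trailing "ll" is taken from the hex dump quoted earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes reported from the old file: 4D FC 6C 6C.
my $original_bytes = "M\xFCll";

# Hypothesised faulty conversion: read as cp850, written out as UTF-8.
my $xml_bytes = encode('UTF-8', decode('cp850', $original_bytes));
print uc(unpack('H*', $xml_bytes)), "\n";    # hex dump of the result

# Undo it: UTF-8 -> characters -> cp850 octets -> read as cp1252.
my $repaired = decode('cp1252',
    encode('cp850', decode('UTF-8', $xml_bytes)));
my $expected = decode('cp1252', $original_bytes);
print "repair restores the original name\n" if $repaired eq $expected;
```

The hex dump comes out as 4DC2B36C6C, the same bytes reported from the XML file, which is consistent with the explanation above but does not prove it.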

    --
    Ben.
    Ben Bacarisse, Oct 19, 2013
    #6
