Dave Saville
I have a Perl script that does sanity checking on a mail store. The
"housekeeping" files of the mail store used to be lines with the
fields separated by a single hex character. This proved difficult, if
not impossible, to extend, so the lead developer decided to change to
XML files in UTF-8. One of the checks now compares the old and new
versions of a file, because if the XML files are not found they are
generated from the old ones, and we have had some "interesting"
problems.
One of the files holds file system folder names, and one of the checks
is that the name is the same in both files (we had a bug where they
weren't). It worked fine until a German user came along with an umlaut
in the folder name.
The base code page of the system is 850. So the true file system name
has the cp850 umlaut, as does the old housekeeping file, but of course
the XML version has the two-byte UTF-8 encoding of the character.
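To make the difference concrete, here is a small sketch that prints the
bytes each encoding produces. It assumes the character is a u-umlaut
(U+00FC), which is only an example; the actual folder name may use a
different umlaut.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed example character: u-umlaut, U+00FC.
my $char = "\x{FC}";

# Hex of the bytes each encoding produces for that one character.
my $cp850_hex = unpack 'H*', encode( 'cp850', $char );
my $utf8_hex  = unpack 'H*', encode( 'UTF-8', $char );

print "cp850: $cp850_hex\n";    # one byte
print "UTF-8: $utf8_hex\n";     # two bytes
```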
My problem is getting them to compare equal. I have been playing with
a test script and just can't figure it out.
use strict;
use warnings;
use Unicode::Normalize;

open my $INI, '<', 'folder.ini' or die $!;
my ( $ini_folder_name, $id, $ini_is_archived )
    = ( split /\xDE/, <$INI> )[ 0, 1, 10 ];
print $ini_folder_name, "\n";

open my $XML, '<:raw:utf8', 'folderpr.xml' or die $!;
#open my $XML, '<', 'folderpr.xml' or die $!;
local $/ = undef;
my $xml = <$XML>;
my $XML_folder_name;
if ( $xml =~ m{>([^<]+)</profile>}s )
{
    $XML_folder_name = $1;
}
print $XML_folder_name, "\n"
    if NFD($ini_folder_name) eq NFD($XML_folder_name);
If I don't open the XML file with :utf8 they obviously don't test
equal, but neither do they when it is opened with :utf8.
In the latter case a hex dump of the output shows that all the :utf8
layer seems to have done is drop the first of the two bytes making up
the UTF-8 sequence. The XML file has the correct UTF-8 encoding for
the cp850 umlaut.
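One approach that may be worth trying, as a minimal sketch: decode both
sides to Perl character strings before comparing. With :utf8 the XML
name is already a character string, but the name from the old file is
still raw cp850 bytes, so the two cannot match. Printing a decoded
U+00FC without an output layer writes it as the single byte 0xFC, which
might explain the "dropped" byte in the hex dump. The sample data below
is hypothetical and stands in for folder.ini and folderpr.xml, and the
sketch assumes the old file really is cp850 throughout.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical sample data: "B<u-umlaut>ro" as cp850 bytes (0x81)
# followed by the 0xDE separator, and the same name as UTF-8 bytes
# (0xC3 0xBC) inside an XML element.
my $ini_line  = "B\x81ro\xDE42\xDE";
my $xml_bytes = "<profile>B\xC3\xBCro</profile>";

# Split on the raw separator byte first, then decode only the name
# field, so the separator is still matched as a byte.
my ($ini_raw) = split /\xDE/, $ini_line;
my $ini_name  = decode( 'cp850', $ini_raw );

# Decode the XML bytes to characters before matching.
my $xml = decode( 'UTF-8', $xml_bytes );
my ($xml_name) = $xml =~ m{>([^<]+)</profile>}s;

# Both names are now character strings; normalise and compare.
my $same = NFC($ini_name) eq NFC($xml_name);
print $same ? "match\n" : "differ\n";
```

When reading from the real files, the equivalent would presumably be to
keep the old file open in raw mode and decode the field after the
split, and to open the XML file with the :encoding(UTF-8) layer.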
TIA