John
After an upgrade from Perl 5.6 to Perl 5.8, strings with character semantics
stopped sorting according to the locale settings. In the following script,
I sort a Latin-1-encoded string and a UTF-8-encoded string. The first sorts
correctly; the second doesn't. In Perl 5.6, strings that had character
semantics sorted just fine as long as I used 'locale'.
I found a bug notice in perlunicode.html, under "Interaction with Locales",
advising against the use of locales with Unicode in Perl 5.8. So, if I want to
correctly sort a list of Spanish words encoded in UTF-8, what do I do? Do I
have to convert back to Latin-1 every time I want to do a collating
operation?
Or does someone out there know of a better solution?
use locale;
use charnames ':full';
use Encode qw (from_to);
###Latin1-encoded literals.
my @data1 = split //, "eáú";
###Same characters via \N{...}, which gives a string with character semantics.
my @data2 = split //, "e\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}";
print "Data 1: ".join(', ', sort {$a cmp $b} @data1)."\n";
print "Data 2: ".join(', ', sort {$a cmp $b} @data2)."\n";
OUTPUT:
Data 1: á, e, ú
Data 2: á, ú, e
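For reference, the "convert back to Latin-1 for every collation" workaround I mention above would look roughly like this. It is only a sketch: sort_via_latin1 is a name I made up, the locale name 'es_ES.ISO-8859-1' is an assumption about what is installed on the system, and any character outside Latin-1 would be turned into '?' by encode() and so sort wrongly.

```perl
use strict;
use warnings;
use locale;
use charnames ':full';
use POSIX qw(setlocale LC_COLLATE);
use Encode qw(encode);

# Hypothetical helper: order character strings by the locale's collation
# by comparing their Latin-1 byte encodings, then return the originals.
sub sort_via_latin1 {
    my @strings = @_;
    # Encode each string once, compare the byte strings under 'use locale',
    # and hand back the untouched character strings.
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ encode('iso-8859-1', $_), $_ ] } @strings;
}

# Assumed locale name; setlocale returns undef if it is not installed.
setlocale(LC_COLLATE, 'es_ES.ISO-8859-1')
    or warn "Spanish Latin-1 locale not available; order may differ\n";

my @data2 = split //, "e\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER U WITH ACUTE}";
binmode STDOUT, ':encoding(UTF-8)';
print "Data 2: ".join(', ', sort_via_latin1(@data2))."\n";
```

The order this prints depends entirely on the locale that is actually in effect, so I have not listed an expected output. It also means one encode() call per element on every sort, which is exactly the overhead I was hoping to avoid.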