character classes, locale and utf8 - strange behaviour


Michal Jankowski

This simple program:
----------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use utf8;
use open ':utf8', ':std';
my $s = 'abąćłóµ_,.';
for (split('', $s)) {
    print "$_ ";
    {
        # columns 1-2: locale pragma explicitly off
        no locale;
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    {
        # columns 3-4: locale pragma on
        use locale;
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    {
        # columns 5-6: locale pragma on, locale set to "C"
        use locale;
        use POSIX;
        setlocale(LC_ALL, "C");
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    print "\n";
}
----------------------------------------------------------------------
Produces the following result:
a T T T T T T
b T T T T T T
ą T T T T T T
ć T T T T T T
ł T T T T T T
ó T T T F T F
µ T T T F T F
_ T T T T T T
, F F F F F F
. F F F F F F
----------------------------------------------------------------------
The first surprise is that with 'no locale' in force, non-ASCII
accented or Greek characters belong to the class [:word:] (or [:alpha:]).
I'd expect [:word:] to be equivalent to [a-zA-Z0-9_] in this case (and
in the 'C' locale, too).
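
For comparison, here is a minimal sketch of the ASCII-only behaviour expected above, written with an explicit bracketed class so that it does not depend on how [:word:] or \w are defined. The helper name is_ascii_word is made up for this illustration; it is not part of Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 source text

# Hypothetical helper (not a Perl builtin): true only for a single
# character from the ASCII set [A-Za-z0-9_].
sub is_ascii_word {
    my ($ch) = @_;
    return $ch =~ /\A[A-Za-z0-9_]\z/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $ch (split //, 'abąćłóµ_,.') {
    printf "%s %s\n", $ch, is_ascii_word($ch) ? 'T' : 'F';
}
```

With this class only "a", "b" and "_" print T. Every accented character, "µ", "," and "." print F.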

The second surprise is that once 'use locale' is in effect, the classes
[:word:] and \w are no longer equivalent: some characters, notably
"ó", are no longer matched by \w. This seems independent of the actual
locale used (I've tried pl_PL, de_DE and C, with identical results). This
clearly looks like a bug to me.
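
One detail that may help narrow this down: the two characters that \w stops matching under 'use locale' ("ó" and "µ") have code points below 256, while "ą", "ć" and "ł" lie above 255. The sketch below only prints the code points and does not show the cause. It assumes the "µ" in the test string is U+00B5 (MICRO SIGN) rather than the Greek letter U+03BC, and codepoint_label is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a Perl builtin): format the code point of a
# single character as U+XXXX.
sub codepoint_label {
    my ($ch) = @_;
    return sprintf 'U+%04X', ord $ch;
}

binmode STDOUT, ':encoding(UTF-8)';

# The non-ASCII characters of the test string, written as escapes so the
# result does not depend on the encoding of this file.
# Assumption: the "µ" is U+00B5, not U+03BC.
for my $ch (split //, "\x{105}\x{107}\x{142}\x{F3}\x{B5}") {
    printf "%s %s %s\n", $ch, codepoint_label($ch),
        ord($ch) < 256 ? 'below 256' : 'above 255';
}
```

The first three lines report "above 255" and the last two "below 256". That split matches the characters for which \w printed F in columns 4 and 6 of the result table.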

Tested on perl 5.10 (and 5.8).

Any comments?

Michał Jankowski
 