character classes, locale and utf8 - strange behaviour

Michal Jankowski · Apr 29, 2011

This simple program:
----------------------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use utf8;
use open ':utf8', ':std';
my $s = 'abąćłóµ_,.';
for (split('', $s)) {
    print "$_ ";
    {
        no locale;
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    {
        use locale;
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    {
        use locale;
        use POSIX;
        setlocale(LC_ALL, "C");
        print /[[:word:]]/? 'T' : 'F', ' ', /\w/? 'T' : 'F', ' ';
    }
    print "\n";
}
----------------------------------------------------------------------
Produces the following result:
a T T T T T T
b T T T T T T
ą T T T T T T
ć T T T T T T
ł T T T T T T
ó T T T F T F
µ T T T F T F
_ T T T T T T
, F F F F F F
. F F F F F F
----------------------------------------------------------------------
The first surprise is that with 'no locale' in force, non-ASCII
accented or Greek characters belong to the class [:word:] (or
[:alpha:]). I'd expect [:word:] to be equivalent to [a-zA-Z0-9_] in
this case (and in the 'C' locale, too).
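
For comparison, here is a minimal sketch of the ASCII-only behaviour I expected, with the class spelled out explicitly instead of using [:word:] or \w. The helper name is_ascii_word is made up for this illustration and is not part of the program above; the sketch assumes no 'use locale' is in scope.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not in the original program): true only when $ch
# is exactly one character from the explicit ASCII set [A-Za-z0-9_].
sub is_ascii_word {
    my ($ch) = @_;
    return $ch =~ /\A[A-Za-z0-9_]\z/ ? 1 : 0;
}

# Encode output as UTF-8 so the non-ASCII characters print cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

# Same test string as in the program above.
for my $ch (split //, 'abąćłóµ_,.') {
    printf "%s %s\n", $ch, is_ascii_word($ch) ? 'T' : 'F';
}
```

With this check only a, b and _ come out as T; every accented letter and the µ come out as F, which is what I expected [:word:] to do without a locale.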

The second surprise is that after switching locale on, the classes
[:word:] and \w are no longer equivalent: some characters, notably
"ó", are no longer matched by \w. This seems independent of the actual
locale used (I've tried pl.PL, de.DE and C, with identical results).
This clearly looks like a bug to me.
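
One pattern in the output may be relevant: the two characters that stop matching \w under 'use locale' (ó and µ) are the only non-ASCII ones in my string with code points below 256. Whether that is the actual cause is only a guess on my part. The sketch below just prints the code points so the split is visible; describe_char is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: format a character's code point and say on which
# side of 256 it falls. This only reports numbers; it does not test
# any regex or locale behaviour.
sub describe_char {
    my ($ch) = @_;
    my $cp = ord $ch;
    return sprintf 'U+%04X %s', $cp, $cp < 256 ? 'below 256' : '256 or above';
}

# Encode output as UTF-8 so the characters themselves print cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

# The non-ASCII characters from the test string above.
print "$_ ", describe_char($_), "\n" for split //, 'ąćłóµ';
```

Here ą, ć and ł are reported as 256 or above, while ó and µ are below 256, which lines up with the F entries in the \w columns.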

Tested on perl 5.10 (and 5.8).

Any comments?

Michał Jankowski

