Cannot have locale word characters in a variable

fmassion · Sep 2, 2013

My test file:

höheneinstellbar 1234
bedienbar 5678
1111 Müller
größer 8765

My script:
#!/usr/bin/perl -w
use locale;
open(FILE,'test.txt') ;
@sentence = <FILE>;
foreach $sentence (@sentence) {
chomp $sentence;
if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
}}

Instead of "use locale" I have also tried unsuccessfully:
(1)
use utf8;
(2)
use POSIX qw(locale_h);
(3)
use POSIX qw(locale_h);
my $locale = setlocale(LC_ALL, "de_DE");

Result (words broken at German special characters):

heneinstellbar (instead of the expected "höheneinstellbar")
bedienbar
ßer (instead of the expected "größer")

The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.

klaus03 · Sep 2, 2013

On 02/09/2013 19:34, (e-mail address removed) wrote:

My test file:
höheneinstellbar 1234
[...]
if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
[...]
Result (words broken at German special characters):
heneinstellbar (instead of the expected "höheneinstellbar")
[...]
The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.

What is the perl version you are using ?

My very simple test.pl with perl 5.018...

( no "use locale", no "use utf8", no "setlocale()" ):

======================================
use 5.018;
use warnings;

my $sentence = 'höheneinstellbar 1234';

if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
}
======================================

....shows:

höheneinstellbar

Charles DeRykus · Sep 2, 2013

binmode(STDOUT, ":utf8");

Horst-W. Radners · Sep 2, 2013

It depends on the encoding of your input file.
Perl assumes Latin-1 encoding unless told otherwise.
If your input encoding is UTF-8, you'll need
open(my $FILE, '<:encoding(utf8)', 'test.txt') or die;
and don't use locale.

Furthermore, on the output side, if your terminal encoding is UTF-8 too,
you'll need
binmode(STDOUT, ':utf8');
to get the output right.
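
Put together, the two steps above might look like the following minimal sketch. It assumes the input really is UTF-8. To stay self-contained it reads the sample lines from an in-memory byte string ($bytes, a stand-in for test.txt that is not part of the original script) rather than from a file on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: test.txt is UTF-8 encoded. These are its raw bytes,
# held in memory so the sketch runs without a file on disk.
my $bytes = "h\xC3\xB6heneinstellbar 1234\n"
          . "bedienbar 5678\n"
          . "1111 M\xC3\xBCller\n"
          . "gr\xC3\xB6\xC3\x9Fer 8765\n";

# Decode on input: the :encoding layer turns bytes into characters.
# For a real file, replace \$bytes with 'test.txt'.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "Cannot open input: $!";

my @words;
while (my $line = <$fh>) {
    chomp $line;
    # On decoded text, \w matches the umlauts and the sharp s as well.
    push @words, $1 if $line =~ /(\w+)\s(\d+)/;
}
close $fh;

# Encode on output (assumes a UTF-8 terminal).
binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @words;
```

The line "1111 Müller" is skipped because the pattern wants a word followed by digits, so three words are collected.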

Please read at least
perldoc perluniintro

Regards, Horst

Peter J. Holzer · Sep 2, 2013

Which character encoding does the file use?

binmode(STDOUT, ":utf8");

Maybe, but that's secondary. First the file must be read correctly, then
you can worry about printing the results correctly.

So he needs to apply the correct encoding filter to FILE:

open(FILE, "<:encoding($encoding)", 'test.txt')

or

binmode FILE, ":encoding($encoding)";

(of course, $encoding must be set to the correct value first, e.g. "UTF-8" or
"ISO-8859-15")
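
As a rough illustration of why $encoding has to match the file, here is a small sketch that decodes the same word from two different byte encodings via binmode. The helper read_lines and the in-memory byte strings are made up for this example. A real script would open the file by name instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read all lines from a byte buffer, decoding them
# with the named encoding via binmode on an already opened handle.
sub read_lines {
    my ($bytes, $encoding) = @_;
    open(my $fh, '<', \$bytes) or die "Cannot open input: $!";
    binmode($fh, ":encoding($encoding)") or die "Cannot set layer: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# The same line (a German word with o-umlaut and sharp s, then a number)
# as bytes in two different encodings.
my ($latin9) = read_lines("gr\xF6\xDFer 8765\n", 'ISO-8859-15');
my ($utf8)   = read_lines("gr\xC3\xB6\xC3\x9Fer 8765\n", 'UTF-8');

# With the matching layer both decode to the same characters,
# and \w+ captures the whole word in either case.
my ($word_latin9) = $latin9 =~ /(\w+)\s\d+/;
my ($word_utf8)   = $utf8   =~ /(\w+)\s\d+/;

binmode(STDOUT, ':encoding(UTF-8)');
print "$word_latin9\n$word_utf8\n";
```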

perldoc perlunitut.

hp

PS: Lexical file handles are preferred over bare filehandles.

klaus03 · Sep 2, 2013

On 02/09/2013 22:40, Ben Morrow wrote:

If you want de_DE rather than Unicode \w semantics

de_DE semantics is probably not needed, the usual Unicode semantics of
\w should by default include all German umlauts + other special German
characters.

you also need perl 5.14,

Yes, Unicode semantics requires a recent perl.

and you need to call setlocale and either 'use locale' or use the
/l regex flag.

That's not necessarily needed:

My understanding is that Unicode takes precedence over any locales.

However, you might have to call setlocale, 'use locale' or /l regex
flag, but only if you don't have Unicode semantics (that is: only if
your perl is older than 5.014)
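
For reference, perl 5.14 also added regex modifiers that pick the semantics of \w explicitly. The following is only a small sketch, not something posted in this thread: /u asks for Unicode rules and /a restricts \w to ASCII. The strings are written with code point escapes so the encoding of the script file plays no role.

```perl
use 5.014;    # the /u and /a modifiers need perl 5.14 or later
use strict;
use warnings;

# Characters given by code point, so the source encoding does not matter.
my $german = "gr\x{F6}\x{DF}er";    # o-umlaut and sharp s
my $greek  = "\x{3B1}\x{3B2}";      # Greek alpha and beta

# /u: Unicode rules, so letters from any script count as word characters.
my $unicode_german = $german =~ /\A\w+\z/u ? 1 : 0;
my $unicode_greek  = $greek  =~ /\A\w+\z/u ? 1 : 0;

# /a: \w is limited to ASCII letters, digits and underscore.
my $ascii_german   = $german =~ /\A\w+\z/a ? 1 : 0;

print "unicode_german=$unicode_german unicode_greek=$unicode_greek ascii_german=$ascii_german\n";
```

Under /u both words match in full, while under /a the German word does not, because the umlaut and the sharp s are outside ASCII.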

Charles DeRykus · Sep 3, 2013

With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output but maybe there are potential shortcomings since locale can be
problematic.

IIUC, doesn't Perl internally store strings as Latin-1, e.g., and seamlessly
upgrade to Unicode as needed? It seems clunky then to nail down the input
encoding as well, although perhaps the idea is to throw an error if the
specified encoding doesn't validate?

fmassion · Sep 3, 2013

Thanks to all of you for your support. The suggestion below didn't work for me, for whatever reason. I am using Perl v5.14.1 (on Windows 7).

With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output but maybe there are potential shortcomings since locale can be
problematic.

I had also tried without success:

use utf8;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
open(FILE,'testfile.txt') or die;

Finally, the following was successful:

open(FILE, '<:encoding(utf8)', 'testfile.txt') or die;
binmode STDOUT, ":utf8"; # output
@sentence = <FILE>;

Francois

Peter J. Holzer · Sep 4, 2013

Yes. However, Unicode will include (for example) non-Latin letter
characters as letters, which I would not expect a German locale to do.

Your expectation would be wrong on Linux (at least with glibc 2.11-2.13).
I've tested various locales and AFAICS all of them except C and POSIX
use the unicode semantics for wide characters.

Here's a test program in C:

---8<------8<------8<------8<------8<------8<------8<------8<------8<---
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void) {
    setlocale(LC_ALL, "");
    wint_t c[] = {
        0x30, 0x41, 0xD8, 0x03B1, 0x304B, 0x65e0
    };
    int n = sizeof(c) / sizeof(c[0]);
    for (int i = 0; i < n; i++) {
        /* classify each code point under the current locale */
        printf("%04x", (unsigned)c[i]);
        printf(" %s", iswalpha(c[i]) ? "alpha" : "-----");
        printf(" %s", iswdigit(c[i]) ? "digit" : "-----");
        printf("\n");
    }
    return 0;
}
---8<------8<------8<------8<------8<------8<------8<------8<------8<---

Your understanding is out of date. Up until 5.12, whether regexes
matched with Unicode, ISO8859-1 or locale semantics was rather
unpredictable, though in general if either the pattern or the string was
Unicode then Unicode rules were used. In 5.12 the unpredictability was
fixed, so Unicode semantics were (IIRC) always used.

Really always or only if the unicode_strings feature is used? I would
expect such a change to break rather a lot of code.

hp

Peter J. Holzer · Sep 4, 2013

Which character encoding does the file use?

[...]
With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output

Since you are using UTF-8 on the terminal I am assuming that your
test.txt is encoded in UTF-8, too (This may or may not be true for the
OP: AFAICS he hasn't answered that question yet).

I don't see how there can be correct output in this case. "use locale"
doesn't affect open, so the file will be read as a byte stream.

The first line is then "h\303\266heneinstellbar 1234". "\266" isn't a
word character in any locale AFAIK, so the regexp will match
"heneinstellbar 1234", which is wrong.

Even if it did match the whole line, writing the string to a stream with
the utf8 layer results in encoding the already UTF-8-encoded string a
second time, so the result is "h\303\203\302\266heneinstellbar 1234" or
"hÃ¶heneinstellbar 1234", which is also not correct.

(this is for Perl 5.14. Maybe something changed after that, but I doubt
it)
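
Both effects described above can be reproduced without a file or a terminal. This is a minimal sketch using the core Encode module, with the raw bytes written out by hand as an assumption about what an undecoded read of the first line delivers. It runs under perl's default rules, that is, without "use locale" and without the unicode_strings feature.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes as they arrive from a handle opened without an encoding layer.
my $raw = "h\xC3\xB6heneinstellbar 1234";

# Byte semantics: neither \xC3 nor \xB6 counts as a word character here,
# so the first possible match starts after them.
my ($broken) = $raw =~ /(\w+)\s\d+/;

# Printing undecoded bytes through a :utf8 layer treats each byte as a
# Latin-1 character and encodes it; encode() on the raw bytes shows the
# same double encoding.
my $double = encode('UTF-8', $raw);

# Decoding first yields characters. \w then matches the whole word, and
# encoding once gives back the original bytes.
my $text   = decode('UTF-8', $raw);
my ($word) = $text =~ /(\w+)\s\d+/;
my $once   = encode('UTF-8', $text);

printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $double;
```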

IIUC doesn't Perl internally store as Latin-1,eg, and seamlessly upgrade
to Unicode as needed..

You shouldn't care about how perl stores strings internally.

It seems clunky then to nail down the input encoding as well

You always[1] need to decode on input to convert from a sequence of
bytes to a sequence of characters. Only for Latin-1 this is an identity
mapping. If you don't specify the encoding, Perl can't know it (it can't
just assume that all files are text files in the current locale's
encoding: They might use a different one or not be text at all).

hp

[1] Not quite: Sometimes it is better to process text files as a byte
stream, but that's rare in my experience. As a rule of thumb,
always decode on input and always encode on output.
