Cannot have locale word characters in a variable

fmassion

My test file:

höheneinstellbar 1234
bedienbar 5678
1111 Müller
größer 8765


My script:
#!/usr/bin/perl -w
use locale;
open(FILE,'test.txt') ;
@sentence = <FILE>;
foreach $sentence (@sentence) {
chomp $sentence;
if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
}}

Instead of "use locale" I have also tried unsuccessfully:
(1)
use utf8;
(2)
use POSIX qw(locale_h);
(3)
use POSIX qw(locale_h);
my $locale = setlocale(LC_ALL, "de_DE");

Result (words broken at German special characters):

heneinstellbar (instead of the expected "höheneinstellbar")
bedienbar
ßer (instead of the expected "größer")

The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.
 
klaus03

On 02/09/2013 19:34, (e-mail address removed) wrote:
My test file:
höheneinstellbar 1234
[...]
if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
[...]
Result (words broken at German special characters):
heneinstellbar (instead of the expected "höheneinstellbar")
[...]
The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.

Which Perl version are you using?

My very simple test.pl with perl 5.018
(no "use locale", no "use utf8", no "setlocale()"):

======================================
use 5.018;
use warnings;

my $sentence = 'höheneinstellbar 1234';

if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
print "$1\n";
}
======================================

...shows:

höheneinstellbar
 
Charles DeRykus

binmode(STDOUT, ":utf8");
 
Horst-W. Radners

It depends on the encoding of your input file.
Perl assumes Latin-1 encoding unless told otherwise.
If your input encoding is UTF-8, you'll need
open(my $FILE, '<:encoding(utf8)', 'test.txt') or die;
and don't use locale.

Furthermore, on the output side, if your terminal encoding is UTF-8 too,
you'll need
binmode(STDOUT, ':utf8');
to get the output right.
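
Putting the two steps together, a minimal self-contained sketch might look like the following. It assumes the input really is UTF-8; the temporary file is only a hypothetical stand-in for test.txt so that the script can be run as is.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 encoded test file (stand-in for test.txt).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "h\x{F6}heneinstellbar 1234\n",
             "bedienbar 5678\n",
             "1111 M\x{FC}ller\n",
             "gr\x{F6}\x{DF}er 8765\n";
close($tmp) or die "Cannot close $path: $!";

# Decode on input ...
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
# ... and encode on output.
binmode(STDOUT, ':encoding(UTF-8)');

my @words;
while (my $line = <$in>) {
    chomp $line;
    # \w now sees decoded characters, so umlauts and sharp s count as word characters.
    if ($line =~ /(\w+)\s\d+/) {
        push @words, $1;
        print "$1\n";
    }
}
close($in);
```

The line "1111 Müller" is skipped because the pattern wants the word first and the number second, just as in the original script.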

Please read at least
perldoc perluniintro

Regards, Horst
 
Peter J. Holzer

Which character encoding does the file use?

binmode(STDOUT, ":utf8");

Maybe, but that's secondary. First the file must be read correctly, then
you can worry about printing the results correctly.

So he needs to apply the correct encoding filter to FILE:

open(FILE, "<:encoding($encoding)", 'test.txt')

or

binmode FILE, ":encoding($encoding)";

(of course, $encoding must be set to the correct value first, e.g. "UTF-8" or
"ISO-8859-15")

perldoc perlunitut.
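
A rough sketch of the binmode variant with a lexical handle, assuming for illustration that the file was saved as ISO-8859-15 (the temporary file and its content are made up for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: this must match the encoding the file was actually saved in.
my $encoding = 'ISO-8859-15';

# Write the raw ISO-8859-15 bytes for "größer 8765" to a temporary file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "gr\x{F6}\x{DF}er 8765\n";
close($tmp) or die "Cannot close $path: $!";

# Open first, then add the decoding layer to the already open handle.
open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode($fh, ":encoding($encoding)");

my $line = <$fh>;
close($fh);
chomp $line;

my ($word) = $line =~ /(\w+)\s\d+/;

binmode(STDOUT, ':encoding(UTF-8)');
print "$word\n";
```

If $encoding does not match the file, the bytes are decoded to the wrong characters (or not at all), and \w will split the word again.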

hp

PS: Lexical file handles are preferred over bare filehandles.
 
klaus03

On 02/09/2013 22:40, Ben Morrow wrote:
If you want de_DE rather than Unicode \w semantics

de_DE semantics is probably not needed, the usual Unicode semantics of
\w should by default include all German umlauts + other special German
characters.

you also need perl 5.14,

Yes, Unicode semantics requires a recent perl.

and you need to call setlocale and either 'use locale' or use the
/l regex flag.

That's not necessarily needed:

My understanding is that Unicode takes precedence over any locales.

However, you might have to call setlocale, 'use locale' or /l regex
flag, but only if you don't have Unicode semantics (that is: only if
your perl is older than 5.014)
 
Charles DeRykus

With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output but maybe there are potential shortcomings since locale can be
problematic.

IIUC, doesn't Perl internally store strings as, e.g., Latin-1 and seamlessly
upgrade to Unicode as needed? It seems clunky, then, to nail down the input
encoding as well, although perhaps the idea is to throw an error if the
specified encoding doesn't validate?
 
fmassion

Thanks to all of you for your support. The suggestion quoted below didn't work for me, for whatever reason. I am using Perl v5.14.1 (on Windows 7).
With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output but maybe there are potential shortcomings since locale can be
problematic.

I had also tried without success:

use utf8;
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
open(FILE,'testfile.txt') or die;

Finally, the following was successful:

open(FILE, '<:encoding(utf8)', 'testfile.txt') or die;
binmode STDOUT, ":utf8"; # output
@sentence = <FILE>;

Francois
 
Peter J. Holzer

Yes. However, Unicode will include (for example) non-Latin letter
characters as letters, which I would not expect a German locale to do.

Your expectation would be wrong on Linux (at least with glibc 2.11-2.13).
I've tested various locales and AFAICS all of them except C and POSIX
use the unicode semantics for wide characters.

Here's a test program in C:

---8<------8<------8<------8<------8<------8<------8<------8<------8<---
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

int main(void) {
    /* Use the locale from the environment (LANG, LC_ALL, ...). */
    setlocale(LC_ALL, "");
    /* A digit, Latin letters, a Greek letter, a Hiragana and a CJK character. */
    wint_t c[] = {
        0x30, 0x41, 0xD8, 0x03B1, 0x304B, 0x65e0
    };
    int n = sizeof(c) / sizeof(c[0]);
    for (int i = 0; i < n; i++) {
        printf("%04x", (unsigned int)c[i]);
        printf(" %s", iswalpha(c[i]) ? "alpha" : "-----");
        printf(" %s", iswdigit(c[i]) ? "digit" : "-----");
        printf("\n");
    }
    return 0;
}
---8<------8<------8<------8<------8<------8<------8<------8<------8<---

Your understanding is out of date. Up until 5.12, whether regexes
matched with Unicode, ISO8859-1 or locale semantics was rather
unpredictable, though in general if either the pattern or the string was
Unicode then Unicode rules were used. In 5.12 the unpredictability was
fixed, so Unicode semantics were (IIRC) always used.

Really always or only if the unicode_strings feature is used? I would
expect such a change to break rather a lot of code.
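
One way to check this is a small experiment along the following lines. It is only a sketch and needs perl 5.14 or later for the /u flag; note that it deliberately avoids "use 5.012" or later, since that would switch on the unicode_strings feature for the whole file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A one-character byte string: o-umlaut in Latin-1, not stored as UTF-8 internally.
my $byte_string = "\xF6";

# Default rules: a non-UTF-8 string is matched with native (ASCII-only) \w.
my $default = ($byte_string =~ /\w/) ? 1 : 0;

# Explicit Unicode rules via the /u flag.
my $unicode = ($byte_string =~ /\w/u) ? 1 : 0;

# Same pattern, but compiled in the scope of the unicode_strings feature.
my $feature;
{
    use feature 'unicode_strings';
    $feature = ($byte_string =~ /\w/) ? 1 : 0;
}

printf "%-28s %d\n", 'default rules:',          $default;
printf "%-28s %d\n", '/u flag:',                $unicode;
printf "%-28s %d\n", 'feature unicode_strings:', $feature;
```

With the default rules the byte string does not match \w; with /u or within the scope of the feature it does. A string that has been decoded on input is matched with Unicode rules in all three cases.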

hp
 
Peter J. Holzer

Which character encoding does the file use?
[...]
With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
output

Since you are using UTF-8 on the terminal I am assuming that your
test.txt is encoded in UTF-8, too (This may or may not be true for the
OP: AFAICS he hasn't answered that question yet).

I don't see how there can be correct output in this case. “use locale”
doesn't affect open, so the file will be read as a byte stream.

The first line is then "h\303\266heneinstellbar 1234". "\266" isn't a
word character in any locale AFAIK, so the regexp will match
"heneinstellbar 1234", which is wrong.

Even if it did match the whole line, writing the string to a stream with
the utf8 layer results in encoding the already UTF-8-encoded string a
second time, so the result is "h\303\203\302\266heneinstellbar 1234" or
"höheneinstellbar 1234", which is also not correct.

(this is for Perl 5.14. Maybe something changed after that, but I doubt
it)
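
Both effects can be reproduced without a file. The following sketch uses the Encode module to build the byte string that an undecoded read of a UTF-8 file would deliver (no locale in effect, variable names chosen for illustration only):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The first line of the test file as a character string ...
my $chars = "h\x{F6}heneinstellbar 1234";
# ... and as the UTF-8 bytes that a read without a decoding layer returns.
my $bytes = encode('UTF-8', $chars);

# Matching on the undecoded bytes: \303 and \266 are not word characters here.
my ($word) = $bytes =~ /(\w+)\s\d+/;
print "matched on raw bytes: $word\n";

# Encoding the already encoded bytes a second time, as a :utf8 output layer would.
my $twice = encode('UTF-8', $bytes);
my $octal = join '', map { sprintf('\\%03o', ord($_)) } split(//, substr($twice, 1, 4));
print "bytes after the 'h':  $octal\n";
```

The first print shows the truncated word, the second shows the four bytes that replace the original two after double encoding.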

IIUC doesn't Perl internally store as Latin-1,eg, and seamlessly upgrade
to Unicode as needed..

You shouldn't care about how perl stores strings internally.

It seems clunky then to nail down the input encoding as well

You always[1] need to decode on input to convert from a sequence of
bytes to a sequence of characters. Only for Latin-1 this is an identity
mapping. If you don't specify the encoding, Perl can't know it (it can't
just assume that all files are text files in the current locale's
encoding: They might use a different one or not be text at all).

hp

[1] Not quite: Sometimes it is better to process text files as a byte
stream, but that's rare in my experience. As a rule of thumb,
always decode on input and always encode on output.
 
