Cannot have locale word characters in a variable

Discussion in 'Perl Misc' started by fmassion@web.de, Sep 2, 2013.

  1. Guest

    My test file:

    höheneinstellbar 1234
    bedienbar 5678
    1111 Müller
    größer 8765


    My script:
    #!/usr/bin/perl -w
    use locale;
    open(FILE,'test.txt') ;
    @sentence = <FILE>;
    foreach $sentence (@sentence) {
    chomp $sentence;
    if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    print "$1\n";
    }}

    Instead of "use locale" I have also tried unsuccessfully:
    (1)
    use utf8;
    (2)
    use POSIX qw(locale_h);
    (3)
    use POSIX qw(locale_h);
    my $locale = setlocale(LC_ALL, "de_DE");

    Result (words broken at German special characters):

    heneinstellbar (instead of the expected "höheneinstellbar")
    bedienbar
    ßer (instead of the expected "größer")

    The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.
    , Sep 2, 2013
    #1

  2. klaus03 Guest

    On 02/09/2013 19:34, wrote:
    > My test file:
    > höheneinstellbar 1234
    > [...]
    > if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    > print "$1\n";
    > [...]
    > Result (words broken at German special characters):
    > heneinstellbar (instead of the expected "höheneinstellbar")
    > [...]
    > The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.


    Which Perl version are you using?

    My very simple test.pl with perl 5.018...

    ( no "use locale", no "use utf8", no "setlocale()" ):

    ======================================
    use 5.018;
    use warnings;

    my $sentence = 'höheneinstellbar 1234';

    if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    print "$1\n";
    }
    ======================================

    ...shows:

    höheneinstellbar
    klaus03, Sep 2, 2013
    #2

  3. On 9/2/2013 10:34 AM, wrote:
    > My test file:
    >
    > höheneinstellbar 1234
    > bedienbar 5678
    > 1111 Müller
    > größer 8765
    >
    >
    > My script:
    > #!/usr/bin/perl -w
    > use locale;
    > open(FILE,'test.txt') ;
    > @sentence = <FILE>;
    > foreach $sentence (@sentence) {
    > chomp $sentence;
    > if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    > print "$1\n";
    > }}
    >
    > Instead of "use locale" I have also tried unsuccessfully:
    > (1)
    > use utf8;
    > (2)
    > use POSIX qw(locale_h);
    > (3)
    > use POSIX qw(locale_h);
    > my $locale = setlocale(LC_ALL, "de_DE");
    >
    > Result (words broken at German special characters):
    >
    > heneinstellbar (instead of the expected "höheneinstellbar")
    > bedienbar
    > ßer (instead of the expected "größer")
    >
    > The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.
    >



    binmode(STDOUT, ":utf8");


    --
    Charles DeRykus
    Charles DeRykus, Sep 2, 2013
    #3
  4. wrote on 02.09.2013 19:34:
    > My test file:
    >
    > höheneinstellbar 1234
    > bedienbar 5678
    > 1111 Müller
    > größer 8765
    >
    >
    > My script:
    > #!/usr/bin/perl -w
    > use locale;
    > open(FILE,'test.txt') ;
    > @sentence = <FILE>;
    > foreach $sentence (@sentence) {
    > chomp $sentence;
    > if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    > print "$1\n";
    > }}
    >
    > Instead of "use locale" I have also tried unsuccessfully:
    > (1)
    > use utf8;
    > (2)
    > use POSIX qw(locale_h);
    > (3)
    > use POSIX qw(locale_h);
    > my $locale = setlocale(LC_ALL, "de_DE");
    >
    > Result (words broken at German special characters):
    >
    > heneinstellbar (instead of the expected "höheneinstellbar")
    > bedienbar
    > ßer (instead of the expected "größer")
    >
    > The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is a better solution.
    >


    It depends on the encoding of your input file.
    Perl assumes Latin-1 encoding unless told otherwise.
    If your input encoding is UTF-8, you'll need
    open(my $FILE, '<:encoding(utf8)', 'test.txt') or die;
    and don't use locale.

    Furthermore on the output side, if your terminal-encoding is UTF-8 too,
    you'll need
    binmode(STDOUT, ':utf8');
    to get the output right.

    Please read at least
    perldoc perluniintro

    Regards, Horst
    --
    <remove S P A M 2x from my email address to get the real one>
    Horst-W. Radners, Sep 2, 2013
    #4
  5. On 2013-09-02 19:45, Charles DeRykus <> wrote:
    > On 9/2/2013 10:34 AM, wrote:
    >> My test file:
    >>
    >> höheneinstellbar 1234
    >> bedienbar 5678
    >> 1111 Müller
    >> größer 8765


    Which character encoding does the file use?


    >> My script:
    >> #!/usr/bin/perl -w
    >> use locale;
    >> open(FILE,'test.txt') ;
    >> @sentence = <FILE>;
    >> foreach $sentence (@sentence) {
    >> chomp $sentence;
    >> if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    >> print "$1\n";
    >> }}
    >>
    >> Instead of "use locale" I have also tried unsuccessfully:
    >> (1)
    >> use utf8;
    >> (2)
    >> use POSIX qw(locale_h);
    >> (3)
    >> use POSIX qw(locale_h);
    >> my $locale = setlocale(LC_ALL, "de_DE");
    >>
    >> Result (words broken at German special characters):
    >>
    >> heneinstellbar (instead of the expected "höheneinstellbar")
    >> bedienbar
    >> ßer (instead of the expected "größer")
    >>
    >> The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is
    >> a better solution.
    >>

    >
    >
    > binmode(STDOUT, ":utf8");


    Maybe, but that's secondary. First the file must be read correctly, then
    you can worry about printing the results correctly.

    So he needs to apply the correct encoding filter to FILE:

    open(FILE, "<:encoding($encoding)", 'test.txt')

    or

    binmode FILE, ":encoding($encoding)";

    (of course, $encoding must be set to the correct value first, e.g. "UTF-8" or
    "ISO-8859-15")

    perldoc perlunitut.

    hp

    PS: Lexical file handles are preferred over bare filehandles.
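
    To illustrate why $encoding has to be the right one, here is a small sketch that decodes the same bytes twice with the binmode form and a lexical filehandle. The helper first_line_as is hypothetical, and an in-memory file stands in for test.txt so the snippet is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line from a string of raw bytes,
# decoding it with the named encoding via a layer added after open().
sub first_line_as {
    my ($bytes, $encoding) = @_;
    open(my $fh, '<', \$bytes) or die "Cannot open in-memory file: $!";
    binmode($fh, ":encoding($encoding)");
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $bytes = "M\xC3\xBCller\n";                      # "Müller" encoded as UTF-8
my $right = first_line_as($bytes, 'UTF-8');         # decoded correctly
my $wrong = first_line_as($bytes, 'ISO-8859-15');   # each byte becomes a character
printf "%d %d\n", length $right, length $wrong;     # prints "6 7"
```

    With the matching encoding the name has six characters; with the wrong one the two bytes of "ü" turn into two separate characters, which is the kind of input the regex then trips over.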


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Sep 2, 2013
    #5
  6. klaus03 Guest

    On 02/09/2013 22:40, Ben Morrow wrote:
    >
    > Quoth "Peter J. Holzer" <>:
    >> On 2013-09-02 19:45, Charles DeRykus <> wrote:
    >>> On 9/2/2013 10:34 AM, wrote:
    >>>>
    >>>> use locale;
    >>>> open(FILE,'test.txt') ;
    >>>> @sentence = <FILE>;
    >>>> foreach $sentence (@sentence) {
    >>>> chomp $sentence;
    >>>> if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    >>>> print "$1\n";
    >>>> }}
    >>>>
    >>>> Instead of "use locale" I have also tried unsuccessfully:
    >>>> (1)
    >>>> use utf8;
    >>>> (2)
    >>>> use POSIX qw(locale_h);
    >>>> (3)
    >>>> use POSIX qw(locale_h);
    >>>> my $locale = setlocale(LC_ALL, "de_DE");
    >>>
    >>> binmode(STDOUT, ":utf8");

    >>
    >> Maybe, but that's secondary. First the file must be read correctly, then
    >> you can worry about printing the results correctly.
    >>
    >> So he needs to apply the correct encoding filter to FILE:
    >>
    >> open(FILE, "<:encoding($encoding)", 'test.txt')
    >>
    >> or
    >>
    >> binmode FILE, ":encoding($encoding)";
    >>
    >> (of course, $encoding must be set to the correct value first, e.g. "UTF-8" or
    >> "ISO-8859-15")

    >
    > If you want de_DE rather than Unicode \w semantics


    de_DE semantics is probably not needed, the usual Unicode semantics of
    \w should by default include all German umlauts + other special German
    characters.

    > you also need perl 5.14,


    Yes, Unicode semantics requires a recent perl.

    > and you need to call setlocale and either 'use locale' or use the
    > /l regex flag.


    That's not necessarily needed:

    My understanding is that Unicode takes precedence over any locales.

    However, you might have to call setlocale, 'use locale' or /l regex
    flag, but only if you don't have Unicode semantics (that is: only if
    your perl is older than 5.014)
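
    For reference, perl 5.14 and later let a pattern state its semantics explicitly: /u for Unicode rules, /a for ASCII-only \w, and /l for the current locale. The sketch below compares /u and /a only, since /l depends on the machine's locale. The helper name word_semantics is made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # the string literals below contain non-ASCII characters

# Hypothetical helper: report whether the whole string consists of "word"
# characters under Unicode rules (/u) and under ASCII-only rules (/a).
# Both modifiers need perl 5.14 or later.
sub word_semantics {
    my ($text) = @_;
    return {
        unicode => ($text =~ /\A\w+\z/u ? 1 : 0),
        ascii   => ($text =~ /\A\w+\z/a ? 1 : 0),
    };
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $word ('bedienbar', 'größer', 'αβγ') {
    my $r = word_semantics($word);
    print "$word: unicode=$r->{unicode} ascii=$r->{ascii}\n";
}
```

    Under /u all three words count as \w+, including the Greek one; under /a only "bedienbar" does.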
    klaus03, Sep 2, 2013
    #6
  7. On 9/2/2013 1:08 PM, Peter J. Holzer wrote:
    > On 2013-09-02 19:45, Charles DeRykus <> wrote:
    >> On 9/2/2013 10:34 AM, wrote:
    >>> My test file:
    >>>
    >>> höheneinstellbar 1234
    >>> bedienbar 5678
    >>> 1111 Müller
    >>> größer 8765

    >
    > Which character encoding does the file use?
    >
    >
    >>> My script:
    >>> #!/usr/bin/perl -w
    >>> use locale;
    >>> open(FILE,'test.txt') ;
    >>> @sentence = <FILE>;
    >>> foreach $sentence (@sentence) {
    >>> chomp $sentence;
    >>> if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    >>> print "$1\n";
    >>> }}
    >>>
    >>> Instead of "use locale" I have also tried unsuccessfully:
    >>> (1)
    >>> use utf8;
    >>> (2)
    >>> use POSIX qw(locale_h);
    >>> (3)
    >>> use POSIX qw(locale_h);
    >>> my $locale = setlocale(LC_ALL, "de_DE");
    >>>
    >>> Result (words broken at German special characters):
    >>>
    >>> heneinstellbar (instead of the expected "höheneinstellbar")
    >>> bedienbar
    >>> ßer (instead of the expected "größer")
    >>>
    >>> The script works with [\wöäüßÄÖÜ] instead of \w but I assume there is
    >>> a better solution.
    >>>

    >>
    >>
    >> binmode(STDOUT, ":utf8");

    >
    > Maybe, but that's secondary. First the file must be read correctly, then
    > you can worry about printing the results correctly.
    >
    > So he needs to apply the correct encoding filter to FILE:
    >
    > open(FILE, "<:encoding($encoding)", 'test.txt')
    >
    > or
    >
    > binmode FILE, ":encoding($encoding)";
    >
    > (of course, $encoding must be set to the correct value first, e.g. "UTF-8" or
    > "ISO-8859-15")
    > ...


    With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
    output but maybe there are potential shortcomings since locale can be
    problematic.

    IIUC, doesn't Perl internally store strings as Latin-1, for example, and
    seamlessly upgrade them to Unicode as needed? It seems clunky, then, to
    nail down the input encoding as well, although perhaps the idea is to
    throw an error if the input doesn't validate against the specified
    encoding?

    --
    Charles DeRykus
    Charles DeRykus, Sep 3, 2013
    #7
  8. Guest

    Thanks to all of you for your support. The suggestion below didn't work for me, for whatever reason. I am using Perl v5.14.1 (on Windows 7).

    > With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
    > output but maybe there are potential shortcomings since locale can be
    > problematic.
    >


    I had also tried without success:

    use utf8;
    binmode STDIN, ":utf8";
    binmode STDOUT, ":utf8";
    open(FILE,'testfile.txt') or die;

    Finally, the following was successful:

    open(FILE, '<:encoding(utf8)', 'testfile.txt') or die;
    binmode STDOUT, ":utf8"; # output
    @sentence = <FILE>;

    Francois
    , Sep 3, 2013
    #8
  9. On 2013-09-02 23:50, Ben Morrow <> wrote:
    > Quoth klaus03 <>:
    >> On 02/09/2013 22:40, Ben Morrow wrote:
    >> >
    >> > If you want de_DE rather than Unicode \w semantics

    >>
    >> de_DE semantics is probably not needed, the usual Unicode semantics of
    >> \w should by default include all German umlauts + other special German
    >> characters.

    >
    > Yes. However, Unicode will include (for example) non-Latin letter
    > characters as letters, which I would not expect a German locale to do.


    Your expectation would be wrong on Linux (at least with glibc 2.11-2.13).
    I've tested various locales and AFAICS all of them except C and POSIX
    use the unicode semantics for wide characters.

    Here's a test program in C:

    ---8<------8<------8<------8<------8<------8<------8<------8<------8<---
    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void) {
        /* Use the locale from the environment (LANG, LC_ALL, ...). */
        setlocale(LC_ALL, "");
        wint_t c[] = {
            0x30, 0x41, 0xD8, 0x03B1, 0x304B, 0x65e0
        };
        int n = sizeof(c) / sizeof(c[0]);
        for (int i = 0; i < n; i++) {
            /* Classify each code point under the current locale. */
            printf("%04x", (unsigned)c[i]);
            printf(" %s", iswalpha(c[i]) ? "alpha" : "-----");
            printf(" %s", iswdigit(c[i]) ? "digit" : "-----");
            printf("\n");
        }
        return 0;
    }
    ---8<------8<------8<------8<------8<------8<------8<------8<------8<---

    >> > and you need to call setlocale and either 'use locale' or use the
    >> > /l regex flag.

    >>
    >> That's not necessarily needed:
    >>
    >> My understanding is that Unicode takes precedence over any locales.
    >>
    >> However, you might have to call setlocale, 'use locale' or /l regex
    >> flag, but only if you don't have Unicode semantics (that is: only if
    >> your perl is older than 5.014)

    >
    > Your understanding is out of date. Up until 5.12, whether regexes
    > matched with Unicode, ISO8859-1 or locale semantics was rather
    > unpredictable, though in general if either the pattern or the string was
    > Unicode then Unicode rules were used. In 5.12 the unpredictability was
    > fixed, so Unicode semantics were (IIRC) always used.


    Really always or only if the unicode_strings feature is used? I would
    expect such a change to break rather a lot of code.
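
    One way to probe this question is to match a single non-ASCII character that is stored without perl's internal UTF-8 flag, once outside and once inside the scope of the feature. This is only a sketch, assuming perl 5.14 or later, an ASCII platform and no "use locale"; the two sub names are made up.

```perl
use strict;
use warnings;

# A one-character string holding code point 0xF6 ("ö"), stored without
# perl's internal UTF-8 flag.
my $char = chr(0xF6);

# Pattern compiled outside the scope of the unicode_strings feature.
sub matches_by_default { my ($s) = @_; return $s =~ /\A\w\z/ ? 1 : 0 }

# Same pattern compiled inside the scope of the feature.
sub matches_with_feature {
    use feature 'unicode_strings';
    my ($s) = @_;
    return $s =~ /\A\w\z/ ? 1 : 0;
}

printf "default=%d feature=%d\n",
    matches_by_default($char), matches_with_feature($char);
```

    On such a perl this prints "default=0 feature=1", which suggests the Unicode rules are not applied unconditionally to byte strings. Note that "use 5.012" or higher switches the feature on implicitly, which is why the sketch avoids a version declaration.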

    hp


    Peter J. Holzer, Sep 4, 2013
    #9
  10. On 2013-09-03 05:21, Charles DeRykus <> wrote:
    > On 9/2/2013 1:08 PM, Peter J. Holzer wrote:
    >> On 2013-09-02 19:45, Charles DeRykus <> wrote:
    >>> On 9/2/2013 10:34 AM, wrote:
    >>>> My test file:
    >>>>
    >>>> höheneinstellbar 1234
    >>>> bedienbar 5678
    >>>> 1111 Müller
    >>>> größer 8765

    >>
    >> Which character encoding does the file use?
    >>
    >>
    >>>> My script:
    >>>> #!/usr/bin/perl -w
    >>>> use locale;
    >>>> open(FILE,'test.txt') ;
    >>>> @sentence = <FILE>;
    >>>> foreach $sentence (@sentence) {
    >>>> chomp $sentence;
    >>>> if ($sentence =~ m/(\w+)(\s)(\d+)/gx) {
    >>>> print "$1\n";
    >>>> }}

    [...]
    > With 'use locale' plus 'binmode(STDOUT,":utf8")', there is correct
    > output


    Since you are using UTF-8 on the terminal I am assuming that your
    test.txt is encoded in UTF-8, too (This may or may not be true for the
    OP: AFAICS he hasn't answered that question yet).

    I don't see how there can be correct output in this case. “use locale”
    doesn't affect open, so the file will be read as a byte stream.

    The first line is then "h\303\266heneinstellbar 1234". "\266" isn't a
    word character in any locale AFAIK, so the regexp will match
    "heneinstellbar 1234", which is wrong.

    Even if it did match the whole line, writing the string to a stream with
    the utf8 layer results in encoding the already UTF-8-encoded string a
    second time, so the result is "h\303\203\302\266heneinstellbar 1234" or
    "höheneinstellbar 1234", which is also not correct.

    (this is for Perl 5.14. Maybe something changed after that, but I doubt
    it)
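
    Both effects described above can be reproduced without a file or a terminal. The sketch below assumes no "use locale" and uses Encode's encode to stand in for what a :utf8 output layer does; first_word and as_utf8_output are hypothetical helper names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The first line of test.txt as raw UTF-8 bytes, i.e. read without a layer.
my $raw = "h\303\266heneinstellbar 1234";

# Hypothetical helper: apply the thread's pattern and return the first capture.
sub first_word {
    my ($line) = @_;
    return $line =~ /(\w+)(\s)(\d+)/ ? $1 : undef;
}

# Hypothetical helper: roughly what a :utf8 output layer does to a string.
sub as_utf8_output {
    my ($string) = @_;
    return encode('UTF-8', $string);
}

my $word = first_word($raw);
print "$word\n";                        # prints "heneinstellbar"

# Show the first five output bytes in octal, as in the explanation above.
my $twice = as_utf8_output($raw);
my @octal = map { sprintf('%03o', ord($_)) } split(//, substr($twice, 0, 5));
print "@octal\n";                       # prints "150 303 203 302 266"
```

    The bytes \303 and \266 are each treated as a Latin-1 character and encoded again, giving \303\203 and \302\266.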

    > IIUC, doesn't Perl internally store strings as Latin-1, for example, and
    > seamlessly upgrade them to Unicode as needed?


    You shouldn't care about how perl stores strings internally.

    > It seems clunky then to nail down the input encoding as well


    You always[1] need to decode on input to convert from a sequence of
    bytes to a sequence of characters. Only for Latin-1 this is an identity
    mapping. If you don't specify the encoding, Perl can't know it (it can't
    just assume that all files are text files in the current locale's
    encoding: They might use a different one or not be text at all).

    hp

    [1] Not quite: Sometimes it is better to process text files as a byte
    stream, but that's rare in my experience. As a rule of thumb,
    always decode on input and always encode on output.
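
    The same rule of thumb can be followed without I/O layers by calling the Encode module directly. This is a minimal sketch, with upcase_bytes as a made-up helper and uc standing in for whatever character-level processing is actually needed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode bytes, work on characters, encode the result.
sub upcase_bytes {
    my ($octets, $encoding) = @_;
    my $text  = decode($encoding, $octets);   # bytes -> characters
    my $upper = uc $text;                     # character-level processing
    return encode($encoding, $upper);         # characters -> bytes
}

my $in  = "m\xC3\xBCller";                    # "müller" as UTF-8 bytes
my $out = upcase_bytes($in, 'UTF-8');         # "MÜLLER" as UTF-8 bytes
print $out eq "M\xC3\x9CLLER" ? "ok\n" : "not ok\n";
```

    Because the uppercasing happens on decoded characters, "ü" becomes "Ü" as a whole instead of its two UTF-8 bytes being handled separately.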

    Peter J. Holzer, Sep 4, 2013
    #10
