utf8 and chomp

Discussion in 'Perl Misc' started by Josef Feit, Feb 22, 2009.

  1. Josef Feit

    Josef Feit Guest

    Hi,

    I have run across a Perl behaviour which I do not
    understand:

    I am trying to analyze some text with UTF-8 characters,
    e.g. a file containing "nXlXx", where the 'X' stands for
    some UTF-8 encoded character, e.g. "náláx"
    (not sure whether it gets through).

    Please replace the 'X' in %ascii with a
    UTF-8 character (it should be 'á').


    #!/usr/bin/perl
    # -----------------------------------------------------------
    use warnings;
    use strict;
    use encoding 'utf-8';
    use 5.010;

    my %ascii = (
        'X' => 'a',
    );

    my $line = <>;
    chomp $line; # to chomp or not to chomp
    print length($line), ": ";
    for( my $i = 0; $i < length($line); $i++ ){
        my $znak = substr($line, $i, 1); # znak = "character"
        if( exists( $ascii{$znak} ) ){
            print "+";
        }else{
            print "-";
        }
    }
    print "\n";

    ---
    The problem is with the chomp:

    In case I chomp the $line, the output is as
    expected: 5: -+-+-

    If I comment out the chomp, the result is
    8: --------
    so the Perl does not consider the $line to be
    utf8 encoded.

    Is this a side effect of chomp, or do I have it
    wrong? I need to skip the chomp and still get utf8.

    perl -v
    This is perl, v5.10.0 built for x86_64-linux-thread-multi

    Thanks
    Josef
     
    Josef Feit, Feb 22, 2009
    #1

  2. On 2009-02-22, Josef Feit <> wrote:
    *SKIP*
    > The problem is with the chomp:
    >
    > In case I chomp the $line, the output is as
    > expected: 5: -+-+-
    >
    > If I comment out the chomp, the result is
    > 8: --------
    > so the Perl does not consider the $line to be
    > utf8 encoded.
    >
    > Is this a side effect of chomp or do I have it
    > wrong? I need not to chomp and get the utf8.


    Just checked -- I can't recreate that. I have C<5: -+-+-> with B<chomp>
    and C<6: -+-+--> without. Consider forcing I<$line> to be utf8
    (C<perldoc Encode> has more).
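
    For illustration, forcing the string to utf8 by hand might look like the
    sketch below. It assumes the line arrived as raw UTF-8 bytes; the byte
    literal stands in for the file contents and is not part of the original
    script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the raw bytes of "náláx\n" as stored in a UTF-8 file.
my $bytes      = "n\xC3\xA1l\xC3\xA1x\n";
my $byte_count = length $bytes;    # counts bytes, the string is still undecoded

# decode() turns the UTF-8 bytes into a Perl character string.
my $chars      = decode('UTF-8', $bytes);
my $char_count = length $chars;    # counts characters

print "$byte_count bytes, $char_count characters\n";
```

    The 8 versus 6 here corresponds to the 8 versus 6 reported for the
    unchomped line elsewhere in the thread.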

    p.s. And rewrite your C in Perl.


    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Feb 23, 2009
    #2

  3. Josef Feit

    Josef Feit Guest

    Utf8 and chomp problem:

    Thank you for the replies.
    I tried to rewrite the script, but the problem seems
    to persist.
    The UTF-8 displayed OK, so I am sending the improved script.

    I tried it on my OpenSuse 11.0 Linux under the cs_CZ.UTF-8
    locale and on the server (Debian, I think, with
    LANG=en_US.UTF-8 etc. and Perl v5.8.8).

    The results are the same: the strings produced
    are different. I will try to force the utf8 etc.,
    but it seems strange anyway.

    Josef


    #!/usr/bin/perl
    # ----------------------------
    # echo "náláx" >text.txt
    # thisscript text.txt
    # ----------------------------
    use warnings;
    use strict;
    use encoding 'utf-8';

    my %ascii = (
        'á' => 'a',
    );

    my $line = <>;
    my $linech = $line;
    chomp $linech;

    for my $l ( $line, $linech ){
        print length($l), ": ";
        for my $char (split //, $l){
            if( exists( $ascii{$char} ) ){
                print "+";
            }else{
                print "-";
            }
        }
        print "\n";
    }

    Output (orig/chomped):
    8: --------
    5: -+-+-
     
    Josef Feit, Feb 23, 2009
    #3
  4. Josef Feit <> wrote:

    > Utf8 and chomp problem:
    >
    > Thank you for replies.
    > I tried to rewrite the script, but the problem seems
    > to persist.
    > UTF8 displayed OK, so I am sending the improved script.
    >
    > I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
    > locale and on the server (Debian I think, with
    > LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
    >
    > The results are the same: the strings produced
    > are different. I will try to force the utf8 etc,
    > but it seems strange anyway.
    >
    > Josef
    >
    >
    > #!/usr/bin/perl
    > # ----------------------------
    > # echo "náláx" >text.txt
    > # thisscript text.txt
    > # ----------------------------
    > use warnings;
    > use strict;
    > use encoding 'utf-8';
    >
    > my %ascii = (
    > 'á' => 'a',
    > );
    >
    > my $line = <>;
    > my $linech = $line;
    > chomp $linech;
    >
    > for my $l ( $line, $linech ){
    > print length($l), ": ";
    > for my $char (split //, $l){
    > if( exists( $ascii{$char} ) ){
    > print "+";
    > }else{
    > print "-";
    > }
    > }
    > print "\n";
    > }
    >
    > Output (orig/chomped):
    > 8: --------
    > 5: -+-+-


    Have you tried marking STDIN as a utf8 stream?

    thisscript < text.txt

    binmode( STDIN, ':utf8') or die;
    my $line = <STDIN>;
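
    A self-contained way to try the binmode idea is sketched below. It
    assumes an in-memory handle as a stand-in for STDIN (that substitution is
    only for the demo), so it runs without redirected input.

```perl
use strict;
use warnings;

# Assumption: an in-memory handle over the raw bytes of "náláx\n" stands in
# for STDIN here.
my $bytes = "n\xC3\xA1l\xC3\xA1x\n";
open my $in, '<', \$bytes or die "Cannot open in-memory handle: $!";

# Push the decoding layer before the first read from the handle.
binmode $in, ':encoding(UTF-8)' or die "binmode failed: $!";

my $line = <$in>;
close $in;

print length($line), " characters read\n";
```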

    --
    [pl>en Andrew] Andrzej Adam Filip : :
    We have met the enemy, and he is us.
    -- Walt Kelly
     
    Andrzej Adam Filip, Feb 23, 2009
    #4
  5. Josef Feit

    Josef Feit Guest

    Andrzej Adam Filip wrote:
    > Josef Feit <> wrote:
    >
    >> Utf8 and chomp problem:
    >>
    >> Thank you for replies.
    >> I tried to rewrite the script, but the problem seems
    >> to persist.
    >> UTF8 displayed OK, so I am sending the improved script.
    >>
    >> I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
    >> locale and on the server (Debian I think, with
    >> LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
    >>
    >> The results are the same: the strings produced
    >> are different. I will try to force the utf8 etc,
    >> but it seems strange anyway.
    >>
    >> Josef
    >>
    >>
    >> #!/usr/bin/perl
    >> # ----------------------------
    >> # echo "náláx" >text.txt
    >> # thisscript text.txt
    >> # ----------------------------
    >> use warnings;
    >> use strict;
    >> use encoding 'utf-8';
    >>
    >> my %ascii = (
    >> 'á' => 'a',
    >> );
    >>
    >> my $line = <>;
    >> my $linech = $line;
    >> chomp $linech;
    >>
    >> for my $l ( $line, $linech ){
    >> print length($l), ": ";
    >> for my $char (split //, $l){
    >> if( exists( $ascii{$char} ) ){
    >> print "+";
    >> }else{
    >> print "-";
    >> }
    >> }
    >> print "\n";
    >> }
    >>
    >> Output (orig/chomped):
    >> 8: --------
    >> 5: -+-+-

    >
    > Have you tried to use STDIN marked as utf8 stream?
    >
    > thisscript < text.txt
    >
    > binmode( STDIN, ':utf8') or die;
    > my $line = <STDIN>;
    >

    I have tried it now - no change in the output.
    However when the $line is set directly in the program,
    the results are as expected (my $line = "náláx";)

    And if I run it as
    thisscript < text.txt

    (with <) it works OK as well, even without the binmode setting:

    thisscript < text.txt
    6: -+-+--
    5: -+-+-

    thisscript text.txt
    8: --------
    5: -+-+-


    Regards
    Josef
     
    Josef Feit, Feb 23, 2009
    #5
  6. On 2009-02-23, Josef Feit <> wrote:
    > Utf8 and chomp problem:
    >
    > Thank you for replies.
    > I tried to rewrite the script, but the problem seems
    > to persist.
    > UTF8 displayed OK, so I am sending the improved script.
    >
    > I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
    > locale and on the server (Debian I think, with
    > LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).
    >
    > The results are the same: the strings produced
    > are different. I will try to force the utf8 etc,
    > but it seems strange anyway.
    >
    > Josef
    >
    >
    > #!/usr/bin/perl
    > # ----------------------------
    > # echo "náláx" >text.txt
    > # thisscript text.txt
    > # ----------------------------


    Snap! That's the problem -- everyone here is just way too lazy to dump
    the string into a file, and runs your script through something like this
    instead:

    echo someutf8 | thisscript

    I've just gone through your original script with the debugger, and found out
    that after C<$line = <>;> I<$line> is a pure byte string. And then after
    C<chomp $line;> it automagically decodes into a utf8 character(!) string.
    Should I keep on explaining? (No, no spoiler this time.)

    *CUT*

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Feb 23, 2009
    #6
  7. On 2009-02-23 17:05, Josef Feit <> wrote:
    > The results are the same: the strings produced
    > are different. I will try to force the utf8 etc,
    > but it seems strange anyway.
    >
    > Josef
    >
    >
    > #!/usr/bin/perl
    > # ----------------------------
    > # echo "náláx" >text.txt
    > # thisscript text.txt
    > # ----------------------------
    > use warnings;
    > use strict;
    > use encoding 'utf-8';


    I already wanted to advise against using "use encoding", because it
    behaves rather unintuitively. But I couldn't see what was wrong until you
    mentioned that reading from stdin works for you.

    Then it became clear.

    From perldoc encoding:

    The encoding pragma also modifies the filehandle layers of STDIN
    and STDOUT to the specified encoding.

    If you call your script like

    > # thisscript text.txt


    it does *not* read from STDIN, so the file will *not* automatically be
    decoded from UTF-8. You should either explicitly open the file with the
    correct encoding layer, or use "use open".
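
    A minimal sketch of the first option, opening the file explicitly with an
    encoding layer. Assumptions: the temporary file is a hypothetical
    stand-in for text.txt, and "use utf8" (not the "use encoding" from the
    script) marks the source literals as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8
use File::Temp qw(tempfile);

# Hypothetical fixture standing in for text.txt from the thread.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)' or die "binmode failed: $!";
print {$tmp} "náláx\n";
close $tmp or die "Cannot close $path: $!";

# The layer in the open mode decodes the bytes as they are read.
open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$fh>;
close $fh;

my %ascii = ( 'á' => 'a' );
my $marks = join '', map { exists $ascii{$_} ? '+' : '-' } split //, $line;
print length($line), ": $marks\n";
```

    Because the handle decodes on read, the unchomped line is already a
    character string and the hash lookup matches 'á' directly.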

    hp
     
    Peter J. Holzer, Feb 24, 2009
    #7
  8. Marc Lucksch

    Marc Lucksch Guest

    Eric Pozharski wrote:
    > I've just gone through your original script with debugger, and found out
    > that after C<$line = <>;> I<$line> is pure byte string. And then after
    > C<chomp $line;> it automagically decodes into utf8 character(!) string.
    > Should I keep on explaining? (No, no spoiler this time.)


    Ok now I am confused, do please explain.

    Marc "Maluku" Lucksch
     
    Marc Lucksch, Feb 24, 2009
    #8
  9. Josef Feit

    Josef Feit Guest

    Marc Lucksch wrote:
    > Eric Pozharski wrote:
    >> I've just gone through your original script with debugger, and found out
    >> that after C<$line = <>;> I<$line> is pure byte string. And then after
    >> C<chomp $line;> it automagically decodes into utf8 character(!) string.
    >> Should I keep on explaining? (No, no spoiler this time.)

    >
    > Ok now I am confused, do please explain.
    >
    > Marc "Maluku" Lucksch


    ----

    Please spoil us... :)

    Yes, in the docs (encoding) is:
    Sets the script encoding to I<ENCNAME>. And unless ${^UNICODE}
    exists and non-zero, PerlIO layers of STDIN and STDOUT are set to
    ":encoding(I<ENCNAME>)".

    Note that STDERR WILL NOT be changed.

    Also note that non-STD file handles remain unaffected. Use C<use
    open> or C<binmode> to change layers of those.

    ---

    I tried checking the utf8 flag (with the built-in utf8::is_utf8):
    print "UTFline: ", utf8::is_utf8($line), "\n";
    print "UTFlinech: ", utf8::is_utf8($linech), "\n";

    and indeed $linech is utf8, while $line is not.

    Combination of

    use encoding 'utf-8';
    use open IO => ':encoding(utf8)';

    solves the problem, thank you all.
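
    As a rough sketch of the "use open" part working together with <>:
    the example below assumes "use utf8" in place of "use encoding" for the
    source literals, and fills @ARGV with a hypothetical temporary file so
    that it can run on its own.

```perl
use strict;
use warnings;
use utf8;                              # assumption: source saved as UTF-8
use open qw(:std :encoding(UTF-8));    # default layers for this lexical scope
use File::Temp qw(tempfile);

# Hypothetical fixture standing in for text.txt from the thread.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)' or die "binmode failed: $!";
print {$tmp} "náláx\n";
close $tmp or die "Cannot close $path: $!";

# Same effect as running: thisscript text.txt
@ARGV = ($path);
my $line = <>;

my %ascii = ( 'á' => 'a' );
my $marks = join '', map { exists $ascii{$_} ? '+' : '-' } split //, $line;
print length($line), ": $marks\n";
```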

    ---
    But still:
    1. Why does chomp change the string to utf8 as a side effect?
    2. Can I tell Perl that <> is utf8 when it is not STDIN?
    (I cannot figure out the syntax. OK, getting the file
    name through @ARGV should be possible.)


    Thank you
    Josef
     
    Josef Feit, Feb 24, 2009
    #9
  10. On 2009-02-24, Marc Lucksch <> wrote:
    > Eric Pozharski wrote:
    >> I've just gone through your original script with debugger, and found out
    >> that after C<$line = <>;> I<$line> is pure byte string. And then after
    >> C<chomp $line;> it automagically decodes into utf8 character(!) string.
    >> Should I keep on explaining? (No, no spoiler this time.)

    >
    > Ok now I am confused, do please explain.


    A long and boring way -- C<perldoc perlvar> then look for section
    C<ARGV> (it's the first one among many), read 4 of them thoroughly.
    Then return to C<perldoc encoding> and C<perldoc Encode> (it seems to be
    used internally by B<encoding> pragma anyway). Then think a lot and
    finally see the light.

    p.s. A quick and dirty way --

    perl -wle '
    while(<>) {
    system qq|ls -l /proc/$$/fd|;
    exit;
    };
    ' /etc/passwd
    total 0
    lrwx------ 1 whynot whynot 64 2009-02-24 22:47 0 -> /dev/pts/0
    lrwx------ 1 whynot whynot 64 2009-02-24 22:47 1 -> /dev/pts/0
    lrwx------ 1 whynot whynot 64 2009-02-24 22:47 2 -> /dev/pts/0
    lr-x------ 1 whynot whynot 64 2009-02-24 22:47 3 -> /etc/passwd
    lr-x------ 1 whynot whynot 64 2009-02-24 22:47 4 -> pipe:[7056143]
    l-wx------ 1 whynot whynot 64 2009-02-24 22:47 5 -> pipe:[7056143]

    Pay a bit of attention to I<fileno> #3
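
    The same point can be seen without /proc. The sketch below assumes a
    hypothetical temporary file in place of /etc/passwd, and asks Perl
    which descriptor and which PerlIO layers the ARGV handle has once <>
    has opened it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical fixture: a throwaway file to read through <>.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "some line\n";
close $tmp or die "Cannot close $path: $!";

# Reading through <> makes Perl open the file on the ARGV handle.
@ARGV = ($path);
my $line = <>;

my $argv_fd     = fileno(ARGV);
my @argv_layers = PerlIO::get_layers(*ARGV);

# For contrast, a handle opened with an explicit decoding layer.
open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my @fh_layers = PerlIO::get_layers($fh);
close $fh;

print "ARGV is fd $argv_fd with layers: @argv_layers\n";
print "explicit handle layers: @fh_layers\n";
```

    ARGV is its own handle with its own layers, so whatever was set up for
    STDIN does not carry over to it.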

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Feb 24, 2009
    #10
  11. Dr.Ruud

    Dr.Ruud Guest

    Eric Pozharski wrote:

    > I've just gone through your original script with debugger, and found out
    > that after C<$line = <>;> I<$line> is pure byte string. And then after
    > C<chomp $line;> it automagically decodes into utf8 character(!) string.
    > Should I keep on explaining? (No, no spoiler this time.)


    Spoiler:

    $ perl -Mencoding=utf8 -wle '
    my $c;
    { use bytes;
    $c = "EUR:\xE2\x82\xAC";
    print length $c;
    }
    $c .= "";
    print length $c;
    '
    7
    5

    --
    Ruud
     
    Dr.Ruud, Feb 25, 2009
    #11
  12. Dr.Ruud

    Dr.Ruud Guest

    Dr.Ruud wrote:
    > Eric Pozharski:


    >> I've just gone through your original script with debugger, and found out
    >> that after C<$line = <>;> I<$line> is pure byte string. And then after
    >> C<chomp $line;> it automagically decodes into utf8 character(!) string.
    >> Should I keep on explaining? (No, no spoiler this time.)

    >
    > Spoiler:
    >
    > $ perl -Mencoding=utf8 -wle '
    > my $c;
    > { use bytes;
    > $c = "EUR:\xE2\x82\xAC";
    > print length $c;
    > }
    > $c .= "";
    > print length $c;
    > '
    > 7
    > 5


    Even more impressive:

    $ perl -Mencoding=utf8 -wle '
    my $c;
    { use bytes;
    $c = "EUR:\xE2\x82\xAC";
    print length $c;
    }
    print length $c;
    $c .= "";
    print length $c;
    '
    7
    7
    5

    (perl 5.8.5)

    --
    Ruud
     
    Dr.Ruud, Feb 25, 2009
    #12
  13. On 2009-02-25, Dr.Ruud <> wrote:
    *SKIP*
    > Even more impressive:
    >
    > $ perl -Mencoding=utf8 -wle '
    > my $c;
    > { use bytes;
    > $c = "EUR:\xE2\x82\xAC";
    > print length $c;
    > }
    > print length $c;
    > $c .= "";
    > print length $c;
    > '
    > 7
    > 7
    > 5
    >
    > (perl 5.8.5)
    >


    And I'm not impressed (any more) that it's undocumented.


    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Feb 25, 2009
    #13
  14. Josef Feit

    Josef Feit Guest

    Thanks to all who helped.
    Now some of my (rather long lasting) utf8 problems
    should be solved.

    JF
     
    Josef Feit, Feb 27, 2009
    #14
