utf8 and chomp

Josef Feit · Feb 22, 2009

Hi,

I have run across a Perl behaviour which I do not
understand:

I am trying to analyze some text with utf8 characters,
e.g. a file with "nXlXx", where the 'X' stands for
some utf8 encoded character, e.g. "náláx"
(not sure whether it gets through).

Please change the 'X' in the %ascii for some
utf8 character (should be 'á').

#!/usr/bin/perl
# -----------------------------------------------------------
use warnings;
use strict;
use encoding 'utf-8';
use 5.010;

my %ascii = (
    'X' => 'a',
);

my $line = <>;
chomp $line; # to chomp or not to chomp
print length($line), ": ";
for( my $i = 0; $i < length($line); $i++ ){
    my $znak = substr($line, $i, 1);
    if( exists( $ascii{$znak} ) ){
        print "+";
    }else{
        print "-";
    }
}
print "\n";

---
The problem is with the chomp:

In case I chomp the $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so Perl does not consider the $line to be
utf8 encoded.

Is this a side effect of chomp or do I have it
wrong? I need to skip the chomp and still get utf8.

perl -v
This is perl, v5.10.0 built for x86_64-linux-thread-multi

Thanks
Josef

Eric Pozharski · Feb 22, 2009

On 2009-02-22, Josef Feit said:
The problem is with the chomp:

In case I chomp the $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so Perl does not consider the $line to be
utf8 encoded.

Is this a side effect of chomp or do I have it
wrong? I need to skip the chomp and still get utf8.

Just checked -- I can't recreate that. I have C<5: -+-+-> with B<chomp>
and C<6: -+-+--> without. Consider forcing I<$line> to be utf8
(C<perldoc Encode> has more).

p.s. And rewrite your C in Perl.

Josef Feit · Feb 23, 2009

Utf8 and chomp problem:

Thank you for the replies.
I tried to rewrite the script, but the problem seems
to persist.
UTF8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
locale and on the server (Debian I think, with
LANG=en_US.UTF-8 etc. and Perl v5.8.8).

The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef

#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

my %ascii = (
    'á' => 'a',
);

my $line = <>;
my $linech = $line;
chomp $linech;

for my $l ( $line, $linech ){
    print length($l), ": ";
    for my $char (split //, $l){
        if( exists( $ascii{$char} ) ){
            print "+";
        }else{
            print "-";
        }
    }
    print "\n";
}

Output (orig/chomped):
8: --------
5: -+-+-

Andrzej Adam Filip · Feb 23, 2009

Have you tried to use STDIN marked as utf8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;

Josef Feit · Feb 23, 2009

Andrzej Adam Filip wrote:

Have you tried to use STDIN marked as utf8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;

I have tried it now - no change in the output.
However, when the $line is set directly in the program
(my $line = "náláx"), the results are as expected.

And if I run it as
thisscript < text.txt

(with <) it works OK as well, even without the binmode setting:

thisscript < text.txt
6: -+-+--
5: -+-+-

thisscript text.txt
8: --------
5: -+-+-

Regards
Josef

Eric Pozharski · Feb 23, 2009

Josef Feit said:
#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------

Snap! That's the problem -- everyone here is just too lazy to dump the
string into a file, and runs your script through something like this
instead:

echo someutf8 | thisscript

I've just gone through your original script with debugger, and found out
that after C<$line = <>;> I<$line> is pure byte string. And then after
C<chomp $line;> it automagically decodes into utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

*CUT*

Peter J. Holzer · Feb 23, 2009

Josef Feit said:
The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef

#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

I already wanted to advise against using "use encoding", because it
behaves rather unintuitively. But I couldn't see what was wrong until you
mentioned that reading from stdin works for you.

Then it became clear.

From perldoc encoding:

The encoding pragma also modifies the filehandle layers of STDIN
and STDOUT to the specified encoding.

If you call your script like

# thisscript text.txt

it does *not* read from STDIN, so the file will *not* automatically be
decoded from UTF-8. You should either explicitly open the file with the
correct encoding layer, or use "use open".
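
To illustrate the first option, here is a minimal sketch (not from the thread) that opens the file explicitly with an :encoding(UTF-8) layer. It assumes the script itself is saved as UTF-8; the helper names mark_chars and first_line_decoded are made up for this example, and the script writes its own temporary sample file so that it runs stand-alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                         # this source file is saved as UTF-8
use File::Temp qw(tempfile);

binmode STDOUT, ':encoding(UTF-8)';

my %ascii = ( 'á' => 'a' );

# Return one '+' or '-' per character of $text, as in the original script.
sub mark_chars {
    my ($text) = @_;
    return join '', map { exists $ascii{$_} ? '+' : '-' } split //, $text;
}

# Read the first line of $path, decoding it from UTF-8 on the way in.
sub first_line_decoded {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Write a sample file (stands in for text.txt).
my ($out, $path) = tempfile( UNLINK => 1 );
binmode $out, ':encoding(UTF-8)';
print {$out} "náláx\n";
close $out;

my $line = first_line_decoded($path);
print length($line), ": ", mark_chars($line), "\n";   # 6: -+-+--
chomp $line;
print length($line), ": ", mark_chars($line), "\n";   # 5: -+-+-
```

With the layer on the file handle, the unchomped and the chomped line differ only by the trailing newline, which is the behaviour seen when reading from STDIN.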

hp

Marc Lucksch · Feb 24, 2009

Eric said:
I've just gone through your original script with debugger, and found out
that after C<$line = <>;> I<$line> is pure byte string. And then after
C<chomp $line;> it automagically decodes into utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

Ok now I am confused, do please explain.

Marc "Maluku" Lucksch

Josef Feit · Feb 24, 2009

Marc Lucksch wrote:

Ok now I am confused, do please explain.

Marc "Maluku" Lucksch

----

Please spoil us...

Yes, the docs (encoding) say:
Sets the script encoding to I<ENCNAME>. And unless ${^UNICODE}
exists and non-zero, PerlIO layers of STDIN and STDOUT are set to
":encoding(I<ENCNAME>)".

Note that STDERR WILL NOT be changed.

Also note that non-STD file handles remain unaffected. Use C<use
open> or C<binmode> to change layers of those.

---

I tried to use utf8::is_utf8 (a core function; Encode offers an equivalent is_utf8):
print "UTFline: ", utf8::is_utf8($line), "\n";
print "UTFlinech: ", utf8::is_utf8($linech), "\n";

and indeed $linech is utf8, while $line is not.

Combination of

use encoding 'utf-8';
use open IO => ':encoding(utf8)';

solves the problem, thank you all.

---
But still:
1. Why does chomp change the string to utf8 as a side effect?
2. Can I tell Perl that <> is utf8 if it is not STDIN?
(I cannot figure out the syntax - OK, getting the file
name through @ARGV should be possible).

Thank you
Josef

Eric Pozharski · Feb 24, 2009

Ok now I am confused, do please explain.

A long and boring way -- C<perldoc perlvar> then look for section
C<ARGV> (it's the first one among many), read 4 of them thoroughly.
Then return to C<perldoc encoding> and C<perldoc Encode> (it seems to be
used internally by B<encoding> pragma anyway). Then think a lot and
finally see the light.

p.s. A quick and dirty way --

perl -wle '
while(<>) {
system qq|ls -l /proc/$$/fd|;
exit;
};
' /etc/passwd
total 0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 0 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 1 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 2 -> /dev/pts/0
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 3 -> /etc/passwd
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 4 -> pipe:[7056143]
l-wx------ 1 whynot whynot 64 2009-02-24 22:47 5 -> pipe:[7056143]

Pay a bit of attention to I<fileno> #3
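
The same point can be seen from inside Perl: for a file named on the command line, <> reads through the ARGV handle, which has its own descriptor and is not STDIN, so layers set on STDIN do not apply to it. The sketch below is only an illustration under that assumption; read_via_diamond is a hypothetical helper, the sample file is created by the script itself, and the line is decoded by hand with Encode::decode instead of relying on any pragma.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Sample file holding the raw UTF-8 bytes of "náláx\n" (illustration only).
my ($out, $path) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "n\xC3\xA1l\xC3\xA1x\n";
close $out;

# Read the first line of $file through <>, as if the file had been named
# on the command line. Returns the raw line and the descriptor number of
# the ARGV handle that <> opened for it.
sub read_via_diamond {
    my ($file) = @_;
    local @ARGV = ($file);
    my $raw = <>;
    my $fd  = fileno(ARGV);
    close ARGV;
    return ($raw, $fd);
}

my ($raw, $fd) = read_via_diamond($path);
my $line = decode('UTF-8', $raw);     # bytes -> characters, done by hand

print "ARGV descriptor: $fd, STDIN descriptor: ",
      fileno(STDIN) // 'closed', "\n";
print "bytes: ", length($raw), ", characters: ", length($line), "\n";
# last line prints: bytes: 8, characters: 6
```

The 8 and the 6 are the same two lengths reported earlier in the thread for "thisscript text.txt" and "thisscript < text.txt".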

Dr.Ruud · Feb 25, 2009

Eric said:
I've just gone through your original script with debugger, and found out
that after C<$line = <>;> I<$line> is pure byte string. And then after
C<chomp $line;> it automagically decodes into utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

Spoiler:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
$c .= "";
print length $c;
'
7
5

Dr.Ruud · Feb 25, 2009

Even more impressive:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
print length $c;
$c .= "";
print length $c;
'
7
7
5

(perl 5.8.5)
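
For comparison, the byte/character distinction can be made explicit without the encoding pragma (which has since been deprecated). This is a minimal sketch, not part of the original posts: it decodes the same seven bytes with Encode::decode and reports the length and the internal utf8 flag before and after.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Seven bytes: "EUR:" plus the three-byte UTF-8 encoding of U+20AC.
my $bytes = "EUR:\xE2\x82\xAC";

# Explicit decoding turns those bytes into five characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes:      length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "characters: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# With no pragma re-interpreting the data behind the scenes, appending
# an empty string leaves the byte string exactly as it was.
$bytes .= "";
print length($bytes), "\n";   # 7
```

Here the change from 7 to 5 happens only at the decode call, so it is visible in the code instead of being a side effect of an unrelated operation.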

Eric Pozharski · Feb 25, 2009

On 2009-02-25, Dr.Ruud said:
Even more impressive:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
print length $c;
$c .= "";
print length $c;
'
7
7
5

(perl 5.8.5)

And I'm not impressed (any more) that it's undocumented.

Josef Feit · Feb 27, 2009

Thanks to all who helped.
Now some of my (rather long lasting) utf8 problems
should be solved.

JF
