utf8 and chomp

Josef Feit

Hi,

I have run across a Perl behaviour which I do not
understand:

I am trying to analyze some text with UTF-8 characters,
e.g. a file containing "nXlXx", where the 'X' stands for
some UTF-8 encoded character, e.g. "náláx"
(not sure whether it gets through).

Please change the 'X' in %ascii to some
UTF-8 character (it should be 'á').


#!/usr/bin/perl
# -----------------------------------------------------------
use warnings;
use strict;
use encoding 'utf-8';
use 5.010;

my %ascii = (
    'X' => 'a',
);

my $line = <>;
chomp $line; # to chomp or not to chomp
print length($line), ": ";
for( my $i = 0; $i < length($line); $i++ ){
    my $znak = substr($line, $i, 1);
    if( exists( $ascii{$znak} ) ){
        print "+";
    }else{
        print "-";
    }
}
print "\n";

---
The problem is with the chomp:

In case I chomp the $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so Perl does not consider the $line to be
utf8 encoded.

Is this a side effect of chomp or do I have it
wrong? I need to skip the chomp and still get utf8.

perl -v
This is perl, v5.10.0 built for x86_64-linux-thread-multi

Thanks
Josef
 
Eric Pozharski

On 2009-02-22 said:
The problem is with the chomp:

In case I chomp the $line, the output is as
expected: 5: -+-+-

If I comment out the chomp, the result is
8: --------
so the Perl does not consider the $line to be
utf8 encoded.

Is this a side effect of chomp or do I have it
wrong? I need not to chomp and get the utf8.

Just checked -- I can't recreate that. I have C<5: -+-+-> with B<chomp>
and C<6: -+-+--> without. Consider forcing I<$line> to be utf8
(C<perldoc Encode> has more).
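
Forcing the line to be utf8 could look roughly like the sketch below. It assumes the input really is UTF-8; the byte literal stands in for what <> returns without a decoding layer, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "náláx\n", as read from a handle with no decoding layer.
my $raw_line = "n\xC3\xA1l\xC3\xA1x\n";

# Decode the bytes into a Perl character string.
my $decoded_line = decode('UTF-8', $raw_line);

printf "bytes: %d, characters: %d\n", length($raw_line), length($decoded_line);
```

The two accented letters take two bytes each, which is why the undecoded string is two units longer than the decoded one.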

p.s. And rewrite your C in Perl.
 
Josef Feit

Utf8 and chomp problem:

Thank you for the replies.
I tried to rewrite the script, but the problem seems
to persist.
UTF8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under the cs_CZ.UTF-8
locale and on the server (Debian, I think, with
LANG=en_US.UTF-8 etc. and Perl v5.8.8).

The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

my %ascii = (
    'á' => 'a',
);

my $line = <>;
my $linech = $line;
chomp $linech;

for my $l ( $line, $linech ){
    print length($l), ": ";
    for my $char (split //, $l){
        if( exists( $ascii{$char} ) ){
            print "+";
        }else{
            print "-";
        }
    }
    print "\n";
}

Output (orig/chomped):
8: --------
5: -+-+-
 
Andrzej Adam Filip

Josef Feit said:
Utf8 and chomp problem:

Thank you for replies.
I tried to rewrite the script, but the problem seems
to persist.
UTF8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
locale and on the server (Debian I think, with
LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).

The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

my %ascii = (
'á' => 'a',
);

my $line = <>;
my $linech = $line;
chomp $linech;

for my $l ( $line, $linech ){
print length($l), ": ";
for my $char (split //, $l){
if( exists( $ascii{$char} ) ){
print "+";
}else{
print "-";
}
}
print "\n";
}

Output (orig/chomped):
8: --------
5: -+-+-

Have you tried marking STDIN as a utf8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;
 
Josef Feit

Andrzej Adam Filip wrote:
Have you tried to use STDIN marked as utf8 stream?

thisscript < text.txt

binmode( STDIN, ':utf8') or die;
my $line = <STDIN>;

I have tried it now - no change in the output.
However, when $line is set directly in the program,
the results are as expected (my $line = "náláx";)

And if I run it as
thisscript < text.txt

(with <) it works OK as well, even without the binmode setting:

thisscript < text.txt
6: -+-+--
5: -+-+-

thisscript text.txt
8: --------
5: -+-+-
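
The three outputs (8, 6 and 5) can be reproduced without the encoding pragma. The sketch below is only an illustration: the classify helper is a hypothetical name, and the byte literal is assumed to match the contents of text.txt.

```perl
use strict;
use warnings;

my %ascii = ( "\x{E1}" => 'a' );   # U+00E1 is the character 'á'

# Hypothetical helper: builds the "length: +/-" summary printed by the script.
sub classify {
    my ($str) = @_;
    my $marks = join '', map { exists $ascii{$_} ? '+' : '-' } split //, $str;
    return length($str) . ": $marks";
}

my $bytes = "n\xC3\xA1l\xC3\xA1x\n";   # undecoded UTF-8 bytes of "náláx\n"
my $chars = $bytes;
utf8::decode($chars);                  # decode the copy in place

print classify($bytes), "\n";   # 8: --------
print classify($chars), "\n";   # 6: -+-+--
chomp $chars;
print classify($chars), "\n";   # 5: -+-+-
```

Undecoded, the line is 5 letters plus 2 extra bytes for the two accented characters plus the newline, so 8 units, and none of the single bytes equals 'á'.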


Regards
Josef
 
Eric Pozharski

Utf8 and chomp problem:

Thank you for replies.
I tried to rewrite the script, but the problem seems
to persist.
UTF8 displayed OK, so I am sending the improved script.

I tried it on my OpenSuse 11.0 Linux under cs_CZ.UTF-8
locale and on the server (Debian I think, with
LANG=en_US.UTF-8 etc. (and v5.8.8 Perl).

The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------

Snap! That's the problem -- everyone here is just way too lazy to dump
a string into a file, and runs your script through something like this
instead:

echo someutf8 | thisscript

I've just gone through your original script with the debugger, and found out
that after C<$line = <>;> I<$line> is a pure byte string. And then after
C<chomp $line;> it automagically decodes into a utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

*CUT*
 
Peter J. Holzer

The results are the same: the strings produced
are different. I will try to force the utf8 etc,
but it seems strange anyway.

Josef


#!/usr/bin/perl
# ----------------------------
# echo "náláx" >text.txt
# thisscript text.txt
# ----------------------------
use warnings;
use strict;
use encoding 'utf-8';

I already wanted to advise against using "use encoding", because it
behaves rather unintuitively. But I couldn't see what was wrong until you
mentioned that reading from stdin works for you.

Then it became clear.

From perldoc encoding:

The encoding pragma also modifies the filehandle layers of STDIN
and STDOUT to the specified encoding.

If you call your script like
# thisscript text.txt

it does *not* read from STDIN, so the file will *not* automatically be
decoded from UTF-8. You should either explicitly open the file with the
correct encoding layer, or use "use open".
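
Opening the file explicitly with an encoding layer might look like the sketch below. The temporary file is only a stand-in for text.txt, and the variable names are assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a sample file holding the UTF-8 bytes of "náláx\n" (stand-in for text.txt).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "n\xC3\xA1l\xC3\xA1x\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Open with an explicit decoding layer instead of relying on "use encoding".
open my $in, '<:encoding(UTF-8)', $tmp_name
    or die "Cannot open $tmp_name: $!";
my $line = <$in>;
close $in;

print length($line), "\n";   # counts characters, newline included
```

With the layer in place the line arrives already decoded, so length counts characters whether or not chomp is called.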

hp
 
Marc Lucksch

Eric said:
I've just gone through your original script with debugger, and found out
that after C<$line = <>;> I<$line> is pure byte string. And then after
C<chomp $line;> it automagically decodes into utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

Ok now I am confused, do please explain.

Marc "Maluku" Lucksch
 
Josef Feit

Marc Lucksch wrote:
Ok now I am confused, do please explain.

Marc "Maluku" Lucksch

----

Please spoil us... :)

Yes, the docs (perldoc encoding) say:
Sets the script encoding to I<ENCNAME>. And unless ${^UNICODE}
exists and non-zero, PerlIO layers of STDIN and STDOUT are set to
":encoding(I<ENCNAME>)".

Note that STDERR WILL NOT be changed.

Also note that non-STD file handles remain unaffected. Use C<use
open> or C<binmode> to change layers of those.

---

I tried to use (utf8::is_utf8 is built in; Encode offers an equivalent is_utf8):
print "UTFline: ", utf8::is_utf8($line), "\n";
print "UTFlinech: ", utf8::is_utf8($linech), "\n";

and indeed $linech is utf8, while $line is not.

Combination of

use encoding 'utf-8';
use open IO => ':encoding(utf8)';

solves the problem, thank you all.

---
But still:
1. Why does chomp change the string to utf8 as a side effect?
2. Can I tell Perl that <> is utf8 if it is not STDIN?
(I cannot figure out the syntax - OK, getting the file
name through @ARGV should be possible).
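
For question 2, one possible approach is to walk the file names from @ARGV and open each with a decoding layer. This is a sketch only: read_utf8_lines is a hypothetical helper, and the STDIN fallback merely approximates what <> does.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all lines of the named files, decoded as UTF-8.
# With no file names it falls back to STDIN, roughly as <> would.
sub read_utf8_lines {
    my @paths = @_;
    if (!@paths) {
        binmode STDIN, ':encoding(UTF-8)';
        return <STDIN>;
    }
    my @lines;
    for my $path (@paths) {
        open my $fh, '<:encoding(UTF-8)', $path
            or die "Cannot open $path: $!";
        push @lines, <$fh>;
        close $fh;
    }
    return @lines;
}

# Typical call from a script:
#   my @lines = read_utf8_lines(@ARGV);
```

This avoids depending on how the encoding pragma treats the magic ARGV handle, at the cost of reading all lines into memory.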


Thank you
Josef
 
Eric Pozharski

Ok now I am confused, do please explain.

A long and boring way -- C<perldoc perlvar> then look for section
C<ARGV> (it's the first one among many), read 4 of them thoroughly.
Then return to C<perldoc encoding> and C<perldoc Encode> (it seems to be
used internally by B<encoding> pragma anyway). Then think a lot and
finally see the light.

p.s. A quick and dirty way --

perl -wle '
while(<>) {
system qq|ls -l /proc/$$/fd|;
exit;
};
' /etc/passwd
total 0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 0 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 1 -> /dev/pts/0
lrwx------ 1 whynot whynot 64 2009-02-24 22:47 2 -> /dev/pts/0
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 3 -> /etc/passwd
lr-x------ 1 whynot whynot 64 2009-02-24 22:47 4 -> pipe:[7056143]
l-wx------ 1 whynot whynot 64 2009-02-24 22:47 5 -> pipe:[7056143]

Pay a bit of attention to I<fileno> #3
 
Dr.Ruud

Eric said:
I've just gone through your original script with debugger, and found out
that after C<$line = <>;> I<$line> is pure byte string. And then after
C<chomp $line;> it automagically decodes into utf8 character(!) string.
Should I keep on explaining? (No, no spoiler this time.)

Spoiler:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
$c .= "";
print length $c;
'
7
5
 
Dr.Ruud

Dr.Ruud said:
Eric Pozharski:

Spoiler:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
$c .= "";
print length $c;
'
7
5

Even more impressive:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
print length $c;
$c .= "";
print length $c;
'
7
7
5

(perl 5.8.5)
 
Eric Pozharski

On 2009-02-25 said:
Even more impressive:

$ perl -Mencoding=utf8 -wle '
my $c;
{ use bytes;
$c = "EUR:\xE2\x82\xAC";
print length $c;
}
print length $c;
$c .= "";
print length $c;
'
7
7
5

(perl 5.8.5)

And I'm not impressed (any more) it's undocumented.
 
Josef Feit

Thanks to all who helped.
Now some of my (rather long-lasting) utf8 problems
should be solved.

JF
 
