Why "Wide character in print"?


tcgo

Hi!
I just wrote some test code in Perl, using the Pi symbol with Unicode/UTF-8. Here's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a warning message: "Wide character in print at ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it shown before I added the binmode?

Thanks!
~tcgo~
 
Rainer Weikusat

tcgo said:
I just made a test code with Perl, using the Pi symbol with
Unicode/UTF-8. That's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before of adding the
binmode?

Because the people who nowadays work on perl unicode support have
decided that it should behave as if the encoding used by it was some
super secret sauce shrouded in eternal mystery: All data flowing into
a Perl program is supposed to be converted to this super secret
internal mystery encoding before being used and all data flowing out
of a Perl program is supposed to be converted to something software
other than perl understands beforehand. De facto, the situation is
such that everything is fine when perl is used in an environment where
UTF-8 is the 'native' method for supporting wide characters because
this is also what perl uses itself, and anyone using something
else is essentially fucked. De jure, perl is supposed to be nasty to
everyone, or at least try as hard as possible without breaking
backwards compatibility.
 
Alan Curry

Hi!
I just made a test code with Perl, using the Pi symbol with
Unicode/UTF-8. That's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning
disappears, but why was it showing before of adding the binmode?

The binmode documents your assumption that nobody will ever run your program
on a non-UTF8-mode terminal.
 
Peter J. Holzer

I just made a test code with Perl, using the Pi symbol with
Unicode/UTF-8. That's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before of adding the
binmode?

Because, unless you tell it with binmode, Perl doesn't know what
encoding it is supposed to use. It could get the encoding from the
locale settings, but that would only work for text written to a
terminal, not for arbitrary data written to a file, so perl doesn't
make assumptions and asks you to set the encoding explicitly.

(If you want to get the encoding from the locale, use I18N::Langinfo.
Unfortunately this doesn't work on all platforms; at least it didn't
work on Windows last time I looked, but that was a few years ago.)

hp
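
A minimal sketch of the point above, assuming UTF-8 is the encoding wanted on output. An in-memory filehandle stands in for STDOUT here (an assumption made only so the resulting bytes can be inspected); the binmode call is what tells perl how to turn characters into bytes:

```perl
use strict;
use warnings;
use utf8;    # the script itself is UTF-8, so the literal below holds characters

my $cosa = "Here is my ☺ résúmé \x{2639}!";

# Collect warnings so we can see whether "Wide character" shows up.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory handle standing in for STDOUT (illustration only).
open my $out, '>', \my $bytes or die "open failed: $!";
binmode $out, ':encoding(UTF-8)';    # set the output encoding explicitly
print {$out} "$cosa\n";
close $out or die "close failed: $!";

printf "%d characters became %d bytes, %d warning(s)\n",
    length($cosa), length($bytes), scalar(@warnings);
```

In a real script the same binmode call would be applied to STDOUT or to the file handle being written.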
 
johndelacour

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";

And it gives me a "warning" message: "Wide character in print at
./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
warning disappears, but why was it showing before of adding the
binmode?

“use utf8” means only that the script file itself is UTF-8-encoded;
it doesn’t say how to manage the output to STDOUT.

JD
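
A small sketch of what the pragma does and does not do, assuming this file is saved as UTF-8. The pragma is lexically scoped, so the same literal can be compared with and without it; neither case touches any output handle:

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;     # literals in this block are decoded from UTF-8 into characters
    $with_pragma = "é";
}
{
    no utf8;      # literals in this block are the raw bytes found in the file
    $without_pragma = "é";
}

printf "with use utf8:    length %d\n", length($with_pragma);
printf "without use utf8: length %d\n", length($without_pragma);
```

With a UTF-8 source file the first length is 1 (one character) and the second is 2 (two bytes).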
 
C.DeRykus

Hi!
I just made a test code with Perl, using the Pi symbol with Unicode/UTF-8. That's the code:

#!/usr/bin/perl
use utf8;
my $cosa = "Here is my ☺ résúmé \x{2639}!";
print "$cosa\n";
...

(Here's a follow-on with an observation/question for someone more knowledgeable about Perl Unicode.)

I don't know how 'use locale' affects this but I
only see the OP's expected display of characters
by using the "\N{U+...}" notation to force character
semantics:

#use utf8;
my $cosa = "Here is my \N{U+263A} résúmé \N{U+03C0}!";

Output: Here is my ☺ résúmé π!
 
Eric Pozharski

*SKIP*
(In theory you can 'use encoding' to specify a different source
character encoding, but in practice that pragma has always been buggy
and is better avoided.)

Stop spreading FUD. They need

use encoding ENCNAME Filter => 1;

(what I<ENCNAME> could possibly be?) but

* "use utf8" is implicitly declared so you no longer have to "use
utf8" to "${"\x{4eba}"}++".

what pretty much defies the purpose of C<use encoding;>.

*SKIP*
The lexer converts the "å" into a 1-character string which eventually
gets passed to 'say', which appends a newline (that is, a character
with ordinal 0a) and passes it to the STDOUT filehandle for writing.

That's not a whole story.

{2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
SV = PV(0x927a750) at 0x9295fac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
CUR = 2
LEN = 12
{2936:14} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "å" ; Dump $aa'
SV = PV(0x9af4750) at 0x9b0ffac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x9b0ba08 "\303\245"\0 [UTF8 "\x{e5}"]
CUR = 2
LEN = 12

At first glance I wondered what the heck was up with your
C<use warnings;>. Now I feel much better.

*CUT*
 
Eric Pozharski

with said:
That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see
for instance

https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923

C said:


Double encoding.

Monkey wrench.

Works just as expected, see below.
which suggests that 'use utf8' is also broken; I didn't know that
until just now, and I'm not sure I entirely believe it. If you have
newer information than me, I'd be happy to change my opinion.

Probably that's not safe to state things like this below unprivately,
but:

not perl->isa( 'fool-proof' ) or die

(I'm trying to speak Perl here). IOW, Perl has an entry level. And
it's quite high. And one of steps to get behind is ability to read. I
don't mind ability to read code, I mean ability to RTFM. Three former
examples are clearly (for me) of that type. I have a couple of scripts
that have C<use encoding 'utf8';> (I<STDIN>, I<STDOUT>, and quote-like
operators) and C<use open ':locale';> (other filehandles, quite risky,
but those scripts are not for distribution thus I'm safe here). Those
scripts were started 4.5 years ago (according to logs, I can't believe
it was sarge (thus 5.8.8?)). Anyway, 5.10.0, 5.10.1, 5.14.2 -- because
I've made those right. Because I've read carefully, all the unicode
documentation that comes with perl (namely perluniintro.pod,
perlunicode.pod, utf8.pod, encoding.pm, Encode.pm (perlunifaq.pod,
perlunitut, and perluniprops.pod weren't distributed five years ago,
should read them too)). I've found that I don't need utf8.pm (those
scripts and modules should be us-ascii anyway).

I feel utf8-safe because, first of all, I can read. If I can, they can
too, can't they? Apparently, they don't, maybe because they can't.
That installs a source filter; I'm not sure what the effects of that
are, but I wouldn't be surprised if you get the union of any bugs in
'use encoding' and any bugs in 'use utf8'.


I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is
that isn't terribly useful.)

Let me rephrase one famous proverb:

If an answer you've got is 'filter', you probably asking wrong
question.

*SKIP*
In any case, the result is exactly what I said: the string contains
one (logical) character. If you apply length() to that string it will
return 1. (This character happens to be represented internally as two
bytes; that is none of your business.) What do you think I omitted
from the story?

Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.

{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)

{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]

I must have added those braces, because:

{10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
à
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à
{10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops

{10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
à
{10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops

{10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
à

But watch this:

{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
�
{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à

Except the middle one (what I should think about), I think encoding.pm
wins again.
 
Peter J. Holzer

That was certainly not my intention. My understanding is that 'use
encoding' is liable to cause incorrect behaviour and segfaults; see for
instance

https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html

Incidentally, while looking for those I also found

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html

which suggests that 'use utf8' is also broken; I didn't know that until
just now, and I'm not sure I entirely believe it.

That doesn't look like a bug in "use utf8" to me, but like a bug in the
code which generates the warnings.

It doesn't help that Tom just dumped a load of gibberish into his mail
without specifying which encoding he was using. I had to guess that he
was using CP1252.

Anyway, with use utf8, the qw[] section of his program is parsed correctly as

("élite", "Ævar", "μῦθος", "mío")

In the error message each character (even those in the printable ASCII
range U+0020 ... U+007E) is "helpfully" given in hex which I agree is
.... suboptimal.

If you have newer information than me, I'd be happy to change my opinion.

Me too, although frankly I see no reason to use encoding even if it
works. It mixes up encoding of the source code and the I/O, which is not
a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
see why I should write my perl scripts in a different encoding than
UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
"use open".

I don't believe this is safe either. The pad code (which handles 'my'
variables) isn't utf8-safe, so you can't create 'my' variables with
Unicode names. (The above is a symref to a global; I don't know if the
code handling the names of globals is utf8-safe, but even if it is that
isn't terribly useful.)

I'm puzzled about this part of the documentation, too. Why would anybody
want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
is really supposed to be $人, i.e., there is a Han character in the
source code, not a symref.

Is this unsafe? I have occasionally used non-ascii characters in
variable names (mostly Greek characters in physical formulas) together
with use utf8 since 5.8.x and I never noticed a problem. (The only
"problem" I noticed is that the euro sign isn't a word character, so you
can't have a variable $amount_in_€. But then you can't have a variable
$amount_in_$ either, so I guess this is fair ;-))

hp
 
Peter J. Holzer

Right. And that's closely related to your last example (the one about
utf8.pm being unsafe). I've tried to make a point that *characters*
from different *ranges* happen to be of different length in bytes.

Then maybe you shouldn't have chosen two examples which both are same
length in bytes.
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)

In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream. It could be
argued that this assumption is a mistake, but for better or worse we are
stuck with that decision. But for string elements > 255, that just isn't
possible. It can't be a byte, it must be a character, and to convert a
character into bytes, the encoding needs to be known.
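
A sketch of that rule, assuming an in-memory handle with no encoding layer as a stand-in for an unconfigured STDOUT. A string element of 255 or less goes out as one byte without complaint; an element above 255 triggers the warning:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# A handle with no encoding layer (in-memory so the bytes can be inspected).
open my $raw, '>', \my $buf or die "open failed: $!";

print {$raw} "\x{E0}";     # code 0xE0 <= 255: written as the single byte 0xE0
my $after_first = @warnings;

print {$raw} "\x{430}";    # code 0x430 > 255: cannot be a byte, so perl warns
my $after_second = @warnings;

close $raw or die "close failed: $!";

printf "warnings after first print: %d, after second: %d\n",
    $after_first, $after_second;
printf "bytes written: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
```

The second print falls back to writing the UTF-8 form of the string, which is why the buffer ends with the two bytes D0 B0.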

{10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
[à]
{10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
Wide character in print at -e line 1.
[а]

... as these examples demonstrate.

I must have added those braces, because:

{10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
à

Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
two bytes, \303\240.
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it to a byte
stream without specifying the encoding. Perl writes the single byte
0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
question mark in a dark circle)

{10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
à

Huh? What version of Perl on what platform is this? The string is
"\x{E0}\x{20}". All elements of the string are <= 255, so the string is
output as a byte string. This isn't valid UTF-8, and your terminal
shouldn't be able to interpret it as "à" anymore than it was able to
interpret "\x{E0}\x{0A}" above.

[more equivalent examples snipped]

If your program does character I/O, you *need* to specify the encoding
of the I/O channels. For one-liners, the -C option is sufficient:

hrunkner:~/tmp 20:40 :) 195% perl -CS -Mutf8 -wle 'print "à"'
à

For scripts you would use binmode or 'use open'.

(Didn't you praise yourself on your ability to read? This is documented
and it has been repeated by several people in this newsgroup for years)

But watch this:

{10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
à
{10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
�
{10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
à

Except the middle one (what I should think about), I think encoding.pm
wins again.

Excellent example, it shows exactly one of the pitfalls of using "use
encoding". One would expect "\x{E0}" to result in a string with a single
element with code 0xE0. At least you seem to have expected it, and for a
moment I was confused, too. But 'use encoding' doesn't work that way. It
was designed to convert string constants from the specified encoding to
Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
REPLACEMENT CHARACTER used to mark invalid characters).

If you use a correct UTF-8 encoded string, it works as expected (well,
expected by somebody who's read the documentation and remembers that
little pitfall):

hrunkner:~/tmp 20:47 :) 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
à
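
The same substitution can be observed without the encoding pragma. This sketch uses Encode::decode instead (a swapped-in technique, not what 'use encoding' does internally) to show that a lone 0xE0 byte is not valid UTF-8 and is replaced by U+FFFD, while the two-byte sequence decodes to U+00E0:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xE0" on its own is not a complete UTF-8 sequence ...
my $invalid = decode('UTF-8', "\xE0");

# ... whereas "\303\240" (0xC3 0xA0) is the UTF-8 encoding of U+00E0.
my $valid = decode('UTF-8', "\303\240");

printf "invalid input -> U+%04X\n", ord($invalid);
printf "valid input   -> U+%04X\n", ord($valid);
```

By default decode substitutes the replacement character for malformed input; passing Encode::FB_CROAK as a third argument would make it die instead.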


For one-liners like this, using the same encoding for the script and the
I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
maybe you don't have a UTF-8 capable terminal). However, for real
programs, I think tying the encoding of the source code to the encoding
of I/O-streams the script is supposed to handle is foolish. My scripts
are always encoded in UTF-8, but they frequently have to handle files in
CP-1252.

hp
 
Helmut Richter

But this isn't what "wide character" in the warning means. In the
warning, it means a string element with a code > 255. For string
elements <= 255, perl can assume that they are supposed to be bytes, not
characters, when you try to write them to a byte stream.

You have to distinguish what may work sometimes or always, and what is
part of the interface which *should* work. If it does not work in the
latter case, it is an error; if it does not work in the former case you
have made a bad guess about how it is implemented. So do not rely on your
guesses but use the documented interface.

There are two ways to use the interface:

- You regard all strings, both during the run of the script and on
input/output, as bytes (=groups of 8 bits) without any meaning as
characters (=member of an alphabet for writing text). This will work if
all devices, and the script itself, use the same character code, which
must not have bytes with value >255. This *can* be a viable option if
you can either guarantee this restriction, or if your bytes do not
have a character meaning.

In this case, strings in the program text with characters that are not
contained in the common character code are meaningless, and will yield
errors.

- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.

In this case, strings in the program text may contain any characters
whether or not they are representable in the codes used in input/output.
The "use utf8" pragma tells perl to interpret the program text itself as a
sequence of UTF-8 characters which will make a difference only for literal
strings in the program.

A third way does *not* work:

- You do input and output on strings of bytes and assume that perl will guess
correctly what characters these bytes represent in your opinion.
Unfortunately that will *often* work (because perl assumes ISO-8859-1 on
many systems which may be what you are actually using), but it will also
often break (if you use other codes, or if you mix strings which happen to
contain only ISO-8859-1 characters with strings containing also other
characters). But if it breaks, it is your fault: it is nowhere guaranteed
how text strings map to byte strings and vice versa, the sole exception
being the documented encode and decode functions.

This is fairly well explained in
http://search.cpan.org/~dom/perl-5.14.3/pod/perlunitut.pod
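
The second way described above (decode on input, work with characters, encode on output) can be sketched as follows. The input bytes and the choice of CP-1252 in and UTF-8 out are assumptions made for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a CP-1252 file (hypothetical input):
# 0xE9 is "é" and 0x80 is the euro sign in that code page.
my $input_bytes = "r\xE9sum\xE9 \x80";

my $text         = decode('cp1252', $input_bytes);   # bytes -> characters on input
my $shout        = uc $text;                         # character-level work in between
my $output_bytes = encode('UTF-8', $shout);          # characters -> bytes on output

printf "input:  %d bytes\n",      length($input_bytes);
printf "text:   %d characters\n", length($text);
printf "output: %d bytes\n",      length($output_bytes);
```

For whole files the same conversions are usually attached to the handle with an :encoding() layer rather than called by hand.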
 
Rainer Weikusat

[...]
- You regard the data during the run of the script as sequences of
characters, and the data on input and output as sequences of bytes. Then
you have to convert bytes into textstrings on input and textstrings into
bytes on output -- in both cases you can specify the conversion once and
for all for each file. This is the only working way when the restrictions
of the last item are not fulfilled.

This is the only 'working way' when the assumption that perl uses a
'secret mystery encoding' different from any other encoding known to
man is taken for granted. But this assumption is wrong and the concept
makes precious little sense since it requires an additional copy of
all input data and all output data (possibly, times the number of perl
processes in a 'long' pipeline since not even perl is supposed to be
able to talk to perl natively). Considering the way perl is
implemented, this is a real problem for users of Windows (and Mac OS
X, AFAIK) because in both cases, perl uses something other than the
native encoding. That some people would like to inflict the same
damage onto users of platforms where the problem doesn't exist is
certainly very laudable but IMNSHO, best ignored.
 
Peter J. Holzer

You have to distinguish what may work sometimes or always, and what is
part of the interface which *should* work. If it does not work in the
latter case, it is an error; if it does not work in the former case you
have made a bad guess about how it is implemented. So do not rely on your
guesses but use the documented interface.

I was careful to use the term "string element" and avoid the terms
"byte" and "character" when talking about the things a string is
composed of.

Perl has two types of strings: Character strings (often called utf8
strings in the documentation) and byte strings. Character strings are
composed of 32-bit entities, each denoting a unicode code point. So
"\x{1f42a}" is a string with the single character DROMEDARY CAMEL.
Byte strings are just that: Strings of uninterpreted bytes. Any
semantics assigned to them is semantics of the program, not of the Perl
language (this isn't quite correct: character oriented functions like lc
or character classes in regexps do work on them, but only for ASCII).

These differences are documented, and I consider them part of the
interface, although some members of p5p consider the distinction a bug
and try to remove it.

However, for the warning "Wide character in print" this is irrelevant.

Perl doesn't distinguish between character and byte strings when writing
them to a file handle. For both the strings "\x{E0}" (a byte string) and
"\N{U+00E0}" (a character string), if you write them to a raw file
handle, the single byte 0xE0 will be written. Both will be converted to
two bytes 0xC3 0xA0 if you write them to a file handle with the
":encoding(UTF-8)" layer. And so on. But for strings with elements >
255, it simply isn't possible to write a single byte with this value to
a byte stream, because a byte has only 8 bits (on the platforms we care
about). So Perl prints a warning and encodes the string in UTF-8 (or
just copies its internal representation, which happens to be the same
thing). I would argue that perl should die() instead, but this has been
the observed and documented behaviour since 5.8.0, so I doubt it will
change.
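
A sketch of the claim that the internal string type does not matter on output. Here utf8::upgrade is used to obtain a character-string copy of "\x{E0}" (an assumption made for illustration; the post uses "\N{U+00E0}" for that), and both versions are written to an in-memory handle with no encoding layer:

```perl
use strict;
use warnings;

my $byte_string = "\x{E0}";     # stored internally as one byte
my $char_string = "\x{E0}";
utf8::upgrade($char_string);    # same content, stored internally as UTF-8

my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# Write a string to a handle with no encoding layer and return the bytes produced.
sub raw_bytes {
    my ($string) = @_;
    open my $fh, '>', \my $buf or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buf;
}

my $from_bytes = raw_bytes($byte_string);
my $from_chars = raw_bytes($char_string);

printf "strings compare equal: %s\n", $byte_string eq $char_string ? 'yes' : 'no';
printf "byte string wrote %d byte(s), character string wrote %d byte(s)\n",
    length($from_bytes), length($from_chars);
```

Both calls write the single byte 0xE0, and neither produces a warning, because every element of the string fits in a byte.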


[Rest snipped. All true, but IMHO not very relevant to this thread].

hp
 
Peter J. Holzer

This is the only 'working way' when the assumption that perl uses a
'secret mystery encoding' different from any other encoding known to
man is taken for granted.

The encoding isn't a 'secret mystery'. It is well documented that it
is Unicode.

perl -CS -MEncode -E 'say ord(Encode::decode("utf-8", "\xE2\x82\xAC"))'

is defined to print "8364".

It is a 'secret mystery' (wink, wink, nudge, nudge) how this is
represented internally, just like the representation of numbers is a
'secret mystery'.

However, for most programs you don't have to know that Perl character
strings are Unicode strings. It is sufficient to know that Perl has the
concept of a "character" which is different from the concept of a
"byte", that a character has certain properties (e.g. it can be a letter
or an ideograph, it may have an associated uppercase or lowercase
letter, ...) and to convert a sequence of characters into a sequence of
bytes you have to encode them. Whether the Euro sign has the numeric
code 8364 or 4711 is rarely significant.

But this assumption is wrong and the concept
makes precious little sense since it requires an additional copy of
all input data and all output data

This is an unsubstantiated claim. It is possible that the current
implementation of I/O layers does indeed perform an additional copy (I
haven't checked the code), but this is certainly not required.

And even if it is true, it is almost certainly lost in the noise as soon
as your script does something more complex than "cat" with your input -
almost any string operation in perl performs a copy.
(possibly, times the number of perl processes in a 'long' pipeline
since not even perl is supposed to be able to talk to perl natively).
Considering the way perl is implemented, this is a real problem for
users of Windows (and Mac OS X, AFAIK) because in both cases, perl
uses something other than the native encoding.

Why is this a real problem?
That some people would like to inflict the same damage onto users of
platforms where the problem doesn't exist is certainly very laudable
but IMNSHO, best ignored.

Whatever "the problem" may be. The problem that characters and bytes
aren't the same and that most programmers prefer to think of text as a
sequence of characters, not a sequence of bytes exists on every
platform.

hp
 
Helmut Richter

However, for most programs you don't have to know that Perl character
strings are Unicode strings.

Are they? They are strings of characters that are contained in Unicode. They
are not necessarily internally encoded as Unicode. People run into problems
when they make assumptions about the way they are implemented. I would have
worded:

For all programs you must not pretend to know that Perl character strings
are Unicode strings.

It may be true, it may be false -- either way, it is not part of the
documented interface. Hence, it must not be used even if it be true.
 
Eric Pozharski

with said:
Then maybe you shouldn't have chosen two examples which both are same
length in bytes.

(Last night I've reread loads of perlunicode and friends, I feel much
better now) No, they are the same length *if* encoding of stream is set:

{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%

But latin1 is special (I've reread perlunicode and friends), *if*
there's no reason (printing isn't reason) to upgrade to utf8 then
*characters* of latin1 script (and latin1 only) stay *bytes*:

{7466:24} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a ..
{7795:25} [0:0]% perl -Mutf8 -wle 'print "а"' | xxd
Wide character in print at -e line 1.
0000000: d0b0 0a ...

But even if encoding of stream isn't set concatenation with non-latin1
script upgrades latin1 too:

{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

Please rewind the thread. That's exactly what happened couple of posts
ago (specifically: <[email protected]> and
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

No. Because it's not UTF-8, it's utf8. As long as utf8 semantics isn't
set, anything scalar stays plain bytes:

{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12

However, when utf8 semantics is set, then those codepoints that fit
latin1 script become special Perl-latin1:

{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12

Upgrade to UTF-8 encoding or staying with latin1 encoding depends on
concatenation with already upgraded to UTF-8 codepoints and/or encoding of
output stream.

*SKIP*
{10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
Now you have one character (because of -Mutf8, the two bytes \303\240
are decoded to the character U+00e0), but you are trying to write it
to a byte stream without specifying the encoding. Perl writes the
single byte 0xE0, which your UTF-8 terminal cannot interpret. (Mine
displays a question mark in a dark circle)

{42:1} [0:0]% perl -Mutf8 -wle 'print "à"'
à
{1903:2} [0:0]% perl -Mutf8 -wle 'print "à"'

{1933:3} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a

Instead it does. Once. It wasn't typing, it was a search through
history. Now I'm bothered. Does anyone here know how to list
extensions enabled in running instance of urxvt?

*SKIP*
For one-liners like this, using the same encoding for the script and
the I/O is useful ("-CS -Mutf8" is even shorter than
"-Mencoding=utf8", but maybe you don't have a UTF-8 capable terminal).

{14999:29} [0:0]% perl -mencoding -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
{15017:30} [0:0]% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

Golf?
However, for real programs, I think tying the encoding of the source
code to the encoding of I/O-streams the script is supposed to handle
is foolish. My scripts are always encoded in UTF-8, but they
frequently have to handle files in CP-1252.

Mine are us-ascii, I have open.pm for rest.
 
Rainer Weikusat

Helmut Richter said:
However, for most programs you don't have to know that Perl character
strings are Unicode strings.
[...]

For all programs you must not pretend to know that Perl character strings
are Unicode strings.

It may be true, it may be false -- either way, it is not part of the
documented interface. Hence, it must not be used even if it be true.

At best, that's a part of the interface which was meanwhile
'undocumented' because the implementation choices which were made
weren't the implementation choices that should have been made,
according to the opinions of some people who didn't make the
decision. But independently of that, inventing the 'Perl is an
island!' character encoding - no matter how hypothetical - remains a
stupid idea. Perl is not an island and it has to interact with code
written in other programming languages, although maybe not in the
fantasy universe of people who implement 'wepp fremmwuergs' and
'ohpscheckt suesstemms' who are generally not troubled by the minor
consideration of making their stuff do something actually useful in
the real world. Consequently, Perl should be compatible with some
existing convention, ideally, with all existing 'local'
conventions. If this isn't possible, the next best choice is not 'make
everyone bleed'.
 
Peter J. Holzer

(Last night I've reread loads of perlunicode and friends, I feel much
better now) No, they are the same length *if* encoding of stream is set:

You posted the output of Devel::Peek::Dump, so I thought you were
talking about the *internal* representation.

How many bytes they occupy in an I/O stream depends on the encoding.

LATIN SMALL LETTER A WITH GRAVE is one byte in ISO-8859-1, CP850, ...
LATIN SMALL LETTER A WITH GRAVE is two bytes in UTF-8, UTF-16, ...
LATIN SMALL LETTER A WITH GRAVE is four bytes in UTF-32, ...

CYRILLIC SMALL LETTER A is one byte in ISO-8859-5, KOI-8, ...
CYRILLIC SMALL LETTER A is two bytes in UTF-8, UTF-16, ...
CYRILLIC SMALL LETTER A is four bytes in UTF-32, ...

(And of course, both characters cannot be represented at all in some
encodings: There is no LATIN SMALL LETTER A WITH GRAVE in ISO-8859-5,
and no CYRILLIC SMALL LETTER A in ISO-8859-1)
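
The byte counts listed above can be checked with Encode::encode. In this sketch the big-endian variants UTF-16BE and UTF-32BE are used (a choice made here, not in the post) so that no byte-order mark is added to the count:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $a_grave    = "\x{E0}";     # LATIN SMALL LETTER A WITH GRAVE
my $cyrillic_a = "\x{430}";    # CYRILLIC SMALL LETTER A

# Number of bytes each character occupies in each encoding.
my %bytes;
for my $enc ('ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-32BE') {
    $bytes{a_grave}{$enc} = length(encode($enc, $a_grave));
}
for my $enc ('ISO-8859-5', 'UTF-8', 'UTF-16BE', 'UTF-32BE') {
    $bytes{cyrillic_a}{$enc} = length(encode($enc, $cyrillic_a));
}

for my $name (sort keys %bytes) {
    for my $enc (sort keys %{ $bytes{$name} }) {
        printf "%-10s %-10s %d byte(s)\n", $name, $enc, $bytes{$name}{$enc};
    }
}
```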
{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%

But latin1 is special (I've reread perlunicode and friends), *if*
there's no reason (printing isn't reason) to upgrade to utf8 then
*characters* of latin1 script (and latin1 only) stay *bytes*:

I already explained that. When writing to a file handle, perl doesn't
care whether a string is composed of bytes or characters.

If the file handle has no :encoding() layer, it will try to write each
element of the string as a single byte.

If the file has an :encoding() layer, it will interpret each element of
the string as a character and convert that to a byte sequence according
to that encoding.

So without an encoding layer "\x{E0}" will always be written as the single byte
0xE0, regardless of whether the string is a byte string or a character
string. With an ":encoding(UTF-8)" layer it will always be written as
two bytes 0xC3 0xA0; and with an ":encoding(CP850)" layer, it will
always be written as a single byte 0x85.
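
A small sketch of these three cases, writing to an in-memory file handle so the resulting bytes can be inspected. The helper bytes_written is a made-up name for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory file handle opened
# with the given layer (empty string means no encoding layer) and return
# the bytes that arrived in the buffer, as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

for my $layer ('', ':encoding(UTF-8)', ':encoding(cp850)') {
    printf "%-18s %s\n", $layer || '(none)', bytes_written($layer, "\x{E0}");
}
```

This should show E0 without a layer, C3 A0 with the UTF-8 layer and 85 with the CP850 layer, as described above.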

What is apparently confusing you is what happens if that fails.

Obviously you can't write a single byte with the value 0x430, you can't
encode CYRILLIC SMALL LETTER A in ISO-8859-1 and you can't encode LATIN
SMALL LETTER A WITH GRAVE in ISO-8859-5.

So what does perl do? It prints a warning to STDERR and writes
a more or less reasonable approximation to the stream. The details
depend on the I/O layer:

If there is no :encoding() layer, the warning is "Wide character in
print" and the utf-8 representation is sent to the stream. And to
confuse matters further, this is done for the whole string, not just
this particular string element:

% perl -Mutf8 -E 'say "->\x{E0}\x{430}<-"'
Wide character in say at -e line 1.
->àа<-

(one string: \x{E0} and \x{430} converted to UTF-8)

% perl -Mutf8 -E 'say "->\x{E0}<-", "->\x{430}<-"'
Wide character in say at -e line 1.
->�<-->а<-

(two strings: \x{E0} printed as a single byte, \x{430} converted to UTF-8)
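
The same two cases as a script, for anyone who prefers to look at the bytes rather than at a terminal. It is only a sketch: raw_bytes is a made-up helper, and an in-memory handle stands in for STDOUT.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Hypothetical helper: print each argument to an in-memory handle that
# has no encoding layer and return the bytes written, as hex.
sub raw_bytes {
    my @strings = @_;
    my $buffer  = '';
    open my $fh, '>', \$buffer or die "open failed: $!";
    print {$fh} @strings;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

# One string: the whole string is written in its UTF-8 representation.
print "one string:  ", raw_bytes("\x{E0}\x{430}"), "\n";

# Two strings: the first is written as a single byte, the second as UTF-8.
print "two strings: ", raw_bytes("\x{E0}", "\x{430}"), "\n";

print "warning: $_" for @warnings;
```

The first line should contain C3 A0 for the "à", the second a lone E0, and each call should have triggered a "Wide character in print" warning.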

If there is an :encoding() layer, the warning is "\x{....} does not map
to $charset" and a \x{....} escape sequence is sent to the stream:

% perl -Mutf8 -E 'binmode STDOUT, ":encoding(iso-8859-5)"; say "->\x{E0}<-"'
"\x{00e0}" does not map to iso-8859-5 at -e line 1.
->\x{00e0}<-
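
The same behaviour can be reproduced in a script by capturing both the warning and the stream contents. This is a sketch under the assumption that an in-memory handle behaves like STDOUT here.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# An in-memory "cyrillic stream".
my $buffer = '';
open my $fh, '>:encoding(iso-8859-5)', \$buffer or die "open failed: $!";
print {$fh} "->\x{E0}<-";    # U+00E0 has no mapping in ISO-8859-5
close $fh;

print "stream contents: $buffer\n";
print "warning: $_" for @warnings;
```

The buffer should contain the escape sequence in place of the character, and the captured warning should be of the "does not map to" kind.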

But these are responses to an *error* condition. You shouldn't try to
write codepoints > 255 to a byte stream (actually, you shouldn't write
any characters to a byte stream, a byte stream is for bytes), and you
shouldn't try to write latin accented characters to a cyrillic stream.
Or at least you shouldn't be terribly surprised if the result is a
little confusing - garbage in, garbage out.

But even if encoding of stream isn't set concatenation with non-latin1
script upgrades latin1 too:

The term "upgrade" has a rather specific meaning in Perl in context with
byte and character strings, and I don't think you are talking about
that.

{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

You have a single string "[à][а]" here. As I wrote above, print treats
the string as a unit and in the absence of an :encoding() layer just dumps
it in UTF-8 encoding. So, yes, both the "à" and the "а" within this
single string will be UTF-8-encoded (as will be the square brackets, but
for them the UTF-8 encoding is the same as for US-ASCII, so you don't
notice that).

And I repeat it again: You are doing something which just doesn't make
sense (writing characters to a byte stream), so don't be surprised if
the result is a little surprising. Do it right and the result will make
sense.

Please rewind the thread. That's exactly what happened couple of posts
ago (specifically: <[email protected]> and
<[email protected]>).

I've read these postings but I don't know what you are referring to. If
you are referring to other postings (especially long ones), please cite
the relevant part.

{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12

*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

No. Because it's not UTF-8, it's utf8.

I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function). But as I've written
repeatedly, this is completely irrelevant for I/O. A string will be
treated completely identically, whether it has this bit set or not. It is
only the value of the string which is important, not its internal type
and representation.
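
A minimal sketch of that claim: two strings with the same value but different internal representation, written through the same layers. The helper bytes_written is a made-up name, and in-memory handles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory file handle opened
# with the given layer and return the bytes written, as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

my $downgraded = "\x{E0}";     # UTF8 bit not set
my $upgraded   = "\x{E0}";
utf8::upgrade($upgraded);      # same value, UTF8 bit now set

printf "UTF8 bit: %d vs %d, values equal: %s\n",
    (utf8::is_utf8($downgraded) ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0),
    ($downgraded eq $upgraded ? 'yes' : 'no');

for my $layer ('', ':encoding(UTF-8)') {
    printf "%-18s %s | %s\n", $layer || '(none)',
        bytes_written($layer, $downgraded),
        bytes_written($layer, $upgraded);
}
```

Both columns should be identical for each layer, even though utf8::is_utf8() reports different flags for the two strings.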

(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk about
something else. Please try to organize your postings in a way that one
can understand what you are talking about. It is very likely that this
exercise will also clear up the confusion in your mind)

As long as utf8 semantics isn't set, anything scalar stays plain
bytes:

{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12

However, when utf8 semantics is set, then those codepoints that fit
latin1 script become special Perl-latin1:

{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12

Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?

Mine are us-ascii, I have open.pm for rest.

US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
of mine don't contain non-ASCII characters either) What I meant is that
I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
encode non-ASCII characters, so I don't have any need for "use
encoding". If your scripts are all in ASCII and you use open.pm for
"rest", what do you need "use encoding" for? Remember, this subthread
started when you berated Ben for discouraging the use of "use encoding".

hp
 
H

Helmut Richter

But independently of that, inventing the 'Perl is an
island!' character encoding - no matter how hypothetical - remains a
stupid idea.

Every program is an "island" within its code. No matter what I use, I do not
normally know the internals, and if I happen to know them I should not use my
knowledge because the internals may change at any time.

Perl is not an island as far as interaction with other programs is
concerned. It is documented how to read and write byte data, and how to read
and write character data whose code and encoding is known. If desired, it is
also not really difficult to write code that tries to guess an unknown code --
with all the pitfalls such a behaviour entails.

There is one interface decision perl has made: it does not by default use the
locale settings to determine the default code and encoding, rather it requires
that these be specified in the script. Opinions may be divided; I like this
decision because my experience is that often the locale settings appear to be
randomly uncorrelated to the codes actually used.
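
As a rough illustration of specifying the encodings in the script: the sketch below reads KOI8-U input and writes UTF-8 output, with both layers named explicitly. The sample bytes and the in-memory handles are assumptions made to keep the example self-contained; a real script would open actual files.

```perl
use strict;
use warnings;

# Assumed sample input: four Cyrillic letters encoded in KOI8-U.
my $koi8_bytes = "\xC6\xD9\xD7\xC1\n";

# Reading: the layer decodes the bytes into characters.
open my $in, '<:encoding(koi8-u)', \$koi8_bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

# Operations now work on characters, not bytes.
my ($first_two) = $line =~ /(\w\w)/;

# Writing: the layer encodes the characters into UTF-8 bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open failed: $!";
print {$out} $first_two;
close $out;

printf "characters read: %d\n", length $line;
printf "bytes written:   %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $out_bytes;
```

Because the output handle has an encoding layer, no "Wide character" warning is expected, and nothing depends on the locale settings.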

The implementation decisions that are not part of the interface, in particular
the internal representation of values of different types including strings,
concern future developers but not users. If perl decides to store characters
internally as a 37-bit EBCDIC enhancement, it does not really bother me as
long as the program still interacts correctly with the outside world in
standardised codes.
 
E

Eric Pozharski

with said:
*SKIP*
I've read these postings but I don't know what you are referring to.
If you are referring to other postings (especially long ones), please
cite the relevant part.

[quoting <[email protected]> on]

$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000 0ae5
345 \n
0000002

[quote off]

*SKIP*
I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function).

If "you" above refers to me then you're wrong.
But as I've written repeatedly, this is completely irrelevant for I/O.
A string will be treated completely identically, whether it has this bit
set or not. It is only the value of the string which is important, not
its internal type and representation.

Try to read it again. Slowly.
(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk
about something else. Please try to organize your postings in a way
that one can understand what you are talking about.

Indeed, only FLAGS and PV are relevant. Sadly, Devel::Peek::Dump
doesn't provide a means to filter arbitrary parts of its output (however,
that's not the purpose of D::P). And I consider editing copy-pastes
bad taste.

*SKIP*
Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?

It's not about understanding. I'm trying to make a point that latin1 is
special.
US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
of mine don't contain non-ASCII characters either) What I meant is that
I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
encode non-ASCII characters, so I don't have any need for "use
encoding". If your scripts are all in ASCII and you use open.pm for
"rest", what do you need "use encoding" for?

Many years ago to get operations to work on characters instead of bytes
some strings must have been pulled. encoding.pm pulled right strings.
utf8.pm pulled irrelevant strings. Those days text related operations
worked for you because they fitted in latin1 script or you didn't hit
edge cases. However I did (more years ago, in 5.6.0, B<lcfirst()>
worked *only* on bytes, no matter what).

Guess what? I've just figured out I don't need either any more:

{40710:255} [0:0]% xxd foo.koi8-u
0000000: c6d9 d7c1 0a .....
{40731:262} [0:0]% perl -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
Wide character in print at -e line 5.
фы
Remember, this subthread started when you berated Ben for discouraging
the use of "use encoding".

It becomes clear to me now what made you both (you and Ben) believe in
bugginess of F<encoding.pm>. I'm fine with that.
 
