Why "Wide character in print"?


Peter J. Holzer

with said:
*SKIP*
I've read these postings but I don't know what you are referring to.
If you are referring to other postings (especially long ones), please
cite the relevant part.

[quoting <[email protected]> on]

$ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
0000000    0ae5
          345  \n
0000002

Then I don't understand what you meant by "that" in the quoted
paragraph, since that seemed to refer to something else.

If "you" above refers to me

Yes, of course. You used the term "utf8", so I was wondering what you
meant by it.
then you're wrong.

Then I don't know what you meant by "utf8". Care to explain?

Try to read it again. Slowly.

Read *what* again? The paragraph you quoted is correct and explains the
behaviour you are seeing.

Indeed, only FLAGS and PV are relevant. Sadly, Devel::Peek::Dump doesn't
provide a means to filter arbitrary parts of the output off (however,
that's not the purpose of D::P). And I consider editing copy-pastes to
be in bad taste.

That's not the problem. The problem is that you gave the output of
Devel::Peek::Dump which clearly showed a latin-1 character occupying
*two* bytes and then claimed that it was only one byte long. Which it
clearly wasn't. What you probably meant was that the latin-1 character
would be only 1 byte long if written to an output stream without an
encoding layer. But you didn't write that. You just made an assertion
which clearly contradicted the example you had just given and gave no
indication that you had even noticed the contradiction.

It's not about understanding. I'm trying to make a point that latin1 is
special.

It is only special in the sense that all its codepoints have a value <=
255. So if you are writing to a byte stream, it can be directly
interpreted as a string of bytes and written to the stream without
modification.

The point that *I* am trying to make is that an I/O stream without an
:encoding() layer isn't for I/O of *characters*, it is for I/O of
*bytes*.

Thus, when you write the string "Käse" to such a stream, you aren't
writing an upper-case K, a lower-case umlaut a, etc. You are writing 4
bytes with the values 0x4B, 0xE4, 0x73, 0x65. The I/O code doesn't care
whether the string is a character string (with the UTF8 bit set) or a
byte string, it just interprets every element of the string as a byte.
Those four bytes could be pixels in an image, for all the Perl I/O code
knows.
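
To make that concrete, here is a minimal sketch (the byte values are the
ones from the paragraph above; the script itself is just my
illustration):

#!/usr/bin/perl
use strict;
use warnings;

my $latin1 = "K\x{E4}se";          # every element <= 0xFF
my $wide   = "K\x{E4}se\x{444}";   # contains U+0444, an element > 0xFF

# STDOUT has no :encoding() layer here, so every element is taken as a byte.
print $latin1, "\n";   # emits exactly the bytes 4B E4 73 65 0A
print $wide,   "\n";   # warns "Wide character in print": 0x444 is not a byte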

OTOH, if there is an :encoding() layer, the string is taken to be
composed of (unicode) characters. If there is an element with the
codepoint \x{E4} in the string, it is interpreted as a lower-case
umlaut a, and converted to the proper encoding (e.g. one byte 0x84 for
CP850, two bytes 0xC3 0xA4 for UTF-8 and one byte 0xE4 for latin-1). But
again, this happens *always*. The Perl I/O layer doesn't care whether
the string is a character string (with the UTF8 bit set) or not.
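
And the same character pushed through an :encoding() layer, again only
as a sketch (the in-memory handle is just my way of capturing the
bytes):

#!/usr/bin/perl
use strict;
use warnings;

# Push the same character through an :encoding() layer and look at the
# bytes that actually come out.
my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die $!;
print {$fh} "K\x{E4}se";
close $fh;

print uc unpack("H*", $out), "\n";   # 4BC3A47365 -- the element 0xE4
                                     # became the two bytes C3 A4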

Many years ago, to get operations to work on characters instead of
bytes, some strings must have been pulled. encoding.pm pulled the right
strings. utf8.pm pulled irrelevant strings. In those days text-related
operations worked for you because they fit in the latin1 script or you
didn't hit edge cases. However, I did (more years ago, in 5.6.0,
lcfirst() worked *only* on bytes, no matter what).

Perl acquired Unicode support in its current form only in 5.8.0. 5.6.0
did have some experimental support for UTF-8-encoded strings, but it was
different and widely regarded as broken (that's why it was changed for
5.8.0). So what Perl 5.6.0 did or didn't do is irrelevant for this
discussion.

With some luck I managed to skip the 5.6 days and went directly from the
<=5.005 "bytestrings only" era to the modern >=5.8.0 "character
strings" era. However, in the early days of 5.8.x, the documentation was
quite bad and it took a lot of reading, experimenting and thinking to
arrive at a consistent understanding of the Perl string model.

But once you have this understanding, it is really quite simple and
consistent.
Guess what? I've just figured out I don't need either any more:

{40710:255} [0:0]% xxd foo.koi8-u
0000000: c6d9 d7c1 0a .....
{40731:262} [0:0]% perl -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
Wide character in print at -e line 5.
фы

This example doesn't have any non-ascii characters in the source code,
so of course it doesn't need 'use utf8'. The only effect of use utf8 is
to tell the perl compiler that the source code is encoded in UTF-8.

But you *do* need some indication of the encoding of STDOUT (did you
notice the warning "Wide character in print at -e line 5."?). As long as
you get this warning, your code is wrong.

You could use "use encoding 'UTF-8'":

% perl -wle '
use encoding "UTF-8";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы

Or you could use -C on the command line:

% perl -CS -wle '
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы


Or you could use "use open":

% perl -wle '
use open ":locale";
open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
read $fh, $fh, -s $fh;
$fh =~ m{(\w\w)};
print $1
'
фы


Note: No warning in any of the three cases. The last one takes the
encoding from the environment, which hopefully matches your terminal
settings. So it
works on a UTF-8 or ISO-8859-5 or KOI-8 terminal. But of course it
doesn't work on a latin-1 terminal and you get an appropriate warning:

"\x{0444}" does not map to iso-8859-1 at -e line 6.
"\x{044b}" does not map to iso-8859-1 at -e line 6.
\x{0444}\x{044b}


It's become clear to me now what made you both (you and Ben) believe in
the bugginess of encoding.pm. I'm fine with that.

I don't know whether encoding.pm is broken in the sense that it doesn't
do what it is documented to do (it was, but it is possible that all of
those bugs have been fixed). I do think that it is "broken as designed",
because it conflates two different things:

* The encoding of the source code of the script
* The default encoding of some I/O streams

and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR) and finally, because it is too
complex and that will lead to surprising results.
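
For comparison, this is how I would keep the two concerns separate (a
sketch of the alternative I'm arguing for, not something encoding.pm
gives you):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # concern 1: the *source code* is UTF-8

# concern 2: the encoding of the I/O streams, stated explicitly
# (including STDERR, which "use encoding" leaves alone)
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

my $cheese = "Käse";               # a character string, thanks to use utf8
print "$cheese\n";                 # no "Wide character" warning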

hp
 

Eric Pozharski

*SKIP*
Then I don't know what you meant by "utf8". Care to explain?

Do you know the difference between utf-8 and utf8 for Perl? (For a long
time, up to yesterday, I believed that utf-8 is all-caps; I was wrong,
it's caseless.)

*SKIP*
* The encoding of the source code of the script

Wrong.

[quote perldoc encoding on]

* Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
"tr///" and "DATA" pseudo-filehandle are also converted.

[quote off]

In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.
* The default encoding of some I/O streams

We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
However:

[quote perldoc encoding on]

* Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
specified.

[quote off]

That's not saying anything about 'default'. It's about 'encoding
specified'.
and it does so even in an inconsistent manner (e.g. the encoding is
applied to STDOUT, but not to STDERR)

No problems with that here. STDERR is us-ascii, point.
and finally, because it is too
complex and that will lead to surprising results.

In your elitist latin1 world -- may be so. But we, down here, are
barbarians, you know.
 

Peter J. Holzer

*SKIP*

Do you know difference between utf-8 and utf8 for Perl?

UTF-8 is the "UCS Transformation Format, 8-bit form" as defined by the
Unicode consortium. It defines a mapping from unicode characters to
bytes and back. When you use it as an encoding in Perl, there will be
some checks that the input is actually a valid unicode character. For
example, you can't encode a surrogate character:

$s2 = encode("utf-8", "\x{D812}");

results in the string "\xef\xbf\xbd", which is UTF-8 for U+FFFD (the
replacement character used to signal invalid characters).


utf8 may mean (at least) three different things in a Perl context:

* It is a perl-proprietary encoding (actually two encodings, but EBCDIC
support in perl has been dead for several years and I doubt it will
ever come back, so I'll ignore that) for storing strings. The
encoding is based on UTF-8, but it can represent code points with up
to 64 bits[1], while UTF-8 is limited to 31 bits by design and to
values <= 0x10FFFF by fiat. It also doesn't check for surrogates, so

$s2 = encode("utf8", "\x{D812}");

results in the string "\xed\xa0\x92", as one would naively expect.

You should never use this encoding when reading or writing files.
It's only for perl internal use and AFAIK it isn't documented
anywhere except possibly in the source code.

* Since the perl interpreter uses the format to store strings with
Unicode character semantics (marked with the UTF8 flag), such strings
are often called "utf8 strings" in the documentation. This is
somewhat unfortunate, because "utf8" looks very similar to "utf-8",
which can cause confusion and because it exposes an implementation
detail (there are several other possible storage formats a perl
interpreter could reasonably use) to the user.

I avoid this usage. I usually talk about "byte strings" or "character
strings", or use even more verbose language to make clear what I am
talking about. For example, in this thread the distinction between
byte strings and character strings is almost irrelevant; it is only
important whether a string contains an element > 0xFF or not.

* There is also an I/O layer “:utf8”, which is subtly different from
both “:encoding(utf8)” and “:encoding(utf-8)”.
(For a long time, up to yesterday, I believed that utf-8 is
all-caps; I was wrong, it's caseless.)

Yes, the encoding names (as used in Encode::encode, Encode::decode and
the :encoding() I/O-Layers) are case-insensitive.
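
A small script that shows both points at once (it assumes a reasonably
recent perl and Encode; the surrogate is the same code point as above):

#!/usr/bin/perl
use strict;
use warnings;
no warnings 'surrogate';   # perl >= 5.14: silence the warning for \x{D812}
use Encode qw(encode);

# Strict "utf-8" refuses the surrogate and substitutes U+FFFD ...
print "utf-8: ", unpack("H*", encode('utf-8', "\x{D812}")), "\n";  # efbfbd
# ... while the lax, perl-internal "utf8" encodes it as-is.
print "utf8:  ", unpack("H*", encode('utf8',  "\x{D812}")), "\n";  # eda092

# And the encoding names are indeed case-insensitive:
print "same:  ",
    encode('UTF-8', "\x{E5}") eq encode('utf-8', "\x{E5}") ? 1 : 0, "\n";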

* The encoding of the source code of the script

Wrong.

[quote perldoc encoding on]

* Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
"tr///" and "DATA" pseudo-filehandle are also converted.

[quote off]

How is this proving me wrong? It confirms what I wrote.

If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences (either
literally or via escape sequences) in your source code. For example, if
you store this program in KOI8-U encoding:


#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';

my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__

(i.e. the string literal on line 7 is stored as the byte sequence 0x22
0xE2 0x22), the program will print 1041 twice, because:

* The perl compiler knows that the source code is in KOI-8, so a single
byte 0xE2 in the source code represents the character “U+0411
CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
\ooo and \xXX are taken to denote bytes in the source character set
and translated to unicode. So both the literal Б on line 7 and the
\x{E2} on line 9 are translated to U+0411.

* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.

In pre-all-utf8 times qr// was working on bytes without being told to
behave otherwise. That's different now.

Yes, I think I wrote that before. I don't know what this has to do with
the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the brave new
unicode-centered world with minimal effort. (I don't think it met that
goal: Over the years I have encountered a lot of people who had problems
with “use encoding”, but I don't remember ever reading from someone who
successfully converted their scripts by slapping “use encoding '...'”
at the beginning.)
* The default encoding of some I/O streams

We here, in our barbaric world, had (and still have) to process any
binary encoding except latin1 (guess what, CP866 is still alive).
However:

[quote perldoc encoding on]

* Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
specified.

[quote off]

That's not saying anything about 'default'. It's about 'encoding
specified'.

You misunderstood what I meant by "default". When the perl interpreter
creates the STDIN and STDOUT file handles, these have some I/O layers
applied to them, without the user explicitly having to call binmode().
These are applied by default, and hence I call them the default layers.
The list of default layers varies between systems (Windows adds the
:crlf layer, Linux doesn't), with command-line settings (-CS adds the
:utf8 layer, IIRC), and of course it can also be manipulated by modules
like “encoding”. “use encoding 'CP866';” pushes the layer
“:encoding(CP866)” onto the STDIN and STDOUT handles. You can still
override them with binmode(), but they are there by default, you don't
have to call “binmode STDIN, ":encoding(CP866)"” explicitly (but you do
have to call it explicitly for STDERR, which IMNSHO is inconsistent).
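
If you want to see what your perl actually puts there,
PerlIO::get_layers shows the current stack (the lists in the comments
below are only what I'd expect on a typical Linux box):

#!/usr/bin/perl
use strict;
use warnings;

# The default layers on STDOUT ...
print join(' ', PerlIO::get_layers(*STDOUT)), "\n";  # e.g. "unix perlio"

# ... and the stack after pushing an encoding layer, the way
# "use encoding 'CP866'" or an explicit binmode would.
binmode STDOUT, ':encoding(CP866)';
print STDERR join(' ', PerlIO::get_layers(*STDOUT)), "\n";
                              # e.g. "unix perlio encoding(cp866) utf8"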

No problems with that here. STDERR is us-ascii, point.

If my scripts handle non-ascii characters, I want those characters also
in my error messages. If a script is intended for normal users (not
sysadmins), I might even want the error messages to be in their native
language instead of English. German can be expressed in pure US-ASCII,
although it's awkward. Russian or Chinese is harder.
In your elitist latin1 world -- may be so. But we, down here, are
barbarians, you know.

May I remind you that it was you who was surprised by the behaviour of
“use encoding” in this thread, not me?


| {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
| à
| {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
| �
| {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
| à
|
| Except the middle one (what I should think about), I think encoding.pm
| wins again.

You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.

Still, although I think I understand “use encoding” fairly well (because
I spent a lot of time reading the docs and playing with it when I still
thought it would be a useful tool, and later because I spent a lot of
time arguing on usenet that it isn't useful) I think it is too complex.
I would be afraid of making stupid mistakes like writing "\x{E0}" when I
meant chr(0xE0), and even if I don't make them, the next guy who has to
maintain the scripts probably understands much less about “use encoding”
than I do and is likely to misunderstand my code and introduce errors.

hp


[1] I admit that I was surprised by this. It is documented that strings
consist of 64-bit elements on 64-bit machines, but I thought this
was an obvious documentation error until I actually tried it.
 

Rainer Weikusat

[...]
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)

The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

Independently of this, the UTF-8 encoding was designed to have a
representation of the Unicode character set which was backwards
compatible with 'ASCII-based systems', and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' for UNIX(*) and
similar systems but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html). As such, supporting it
natively in a programming language closely associated with UNIX(*), at
least at that time, should have been pretty much a no-brainer. "But
Microsoft did it differently !!1" is the ultimate argument for some
people but - thankfully - these didn't get to piss into Perl until
very much later and thus, the damage they can still do is mostly
limited to 'propaganda'.
 

Rainer Weikusat

Rainer Weikusat said:
[...]
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)

The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

I would also like to point out that this is an inherent deficiency of
the idea of representing all glyphs of all conceivable scripts with a
single encoding scheme, and that the practical consequences of that are
mostly 'anything which restricts itself to the US typewriter character
set is fine' (and everyone else is going to have no end of problems
because of that).

I actually stopped using German characters like a-umlaut years ago
exactly because of this.
 

Peter J. Holzer

Careful. You're conflating the existing-only-in-the-programmer's-head
concept of 'do I consider this string to contain bytes for IO or
characters for manipulation' with the perl-internal SvUTF8 flag, which
is exactly the mistake we have been trying to stop people making since
5.8.0 was released

Who is "we"? Before 5.12, you had to make the distinction.
Strings without the SvUTF8 flag simply didn't have Unicode semantics.
Now there is the unicode_strings feature, but

1) it still isn't default
2) it will be years before I can rely on perl 5.12+ being installed on
a sufficient number of machines to use it. I'm not even sure if most
of our machines have 5.10 yet (the Debian machines have, but most of
the RHEL machines have 5.8.x)

So, that distinction has existed for at least 8 years (2002-07-18 to
2010-04-12) and for many of us it will exist for at least another few
years.

So enforcing the concept I have in my head in the Perl code is simply
defensive programming.
and we realised the 3rd-Camel model where Perl keeps track of the
characters/bytes distinction isn't workable.

It worked for me ;-).
It's entirely possible and sensible for a 'byte string', that is, a
string containing only characters <256 intended for raw IO, to happen
to have SvUTF8 set internally, with byte values >127 represented as 2
bytes.

Theoretically yes. In practice it almost always means that the
programmer forgot to call encode() somewhere.

And the other way around didn't work at all: You couldn't keep a string
with characters > 127 but < 256 in a string without the SvUTF8 flag set
and expect it to work.
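
A minimal illustration of that difference (my own example; it needs a
perl >= 5.12 to show both behaviours side by side):

#!/usr/bin/perl
use strict;
use warnings;

my $s = "\xE4";   # one latin-1 a-umlaut, without the SvUTF8 flag

{
    no feature 'unicode_strings';
    # byte semantics: \w does not match the 0xE4 byte
    printf "without the feature: %d\n", $s =~ /\w/ ? 1 : 0;   # 0
}
{
    use feature 'unicode_strings';
    # the 5.12+ fix: the same string gets Unicode semantics
    printf "with the feature:    %d\n", $s =~ /\w/ ? 1 : 0;   # 1
}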

hp
 

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat:
Ben Morrow <[email protected]> writes:
[...]

But then, you've never really understood the concept of abstraction,
have you?

This mostly means that I cannot possibly be a self-conscious human
being capable of interacting with the world in some kind of
'intelligent' (meaning, influencing it such that it changes according
to some desired outcome) way but must be some kind of lifeform below
the level of a dog or a bird. Yet, I'm capable of using written
language to communicate with you (with some difficulties), using a
computer connected to 'the internet' in order to run a program on a
completely different computer 9 miles away from my present location,
utilizing a server I have to pay for once a year from my bank account
which resides (AFAIK) in Berlin.

How can this possibly be?
 

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat <[email protected]>:
[...]
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).

With the most naive implementation, this would mean that moving 100G
of text data through Perl (and that's a small number for some jobs I'm
thinking of) requires copying 400G of data into Perl and 400G out of
it. What you consider 'smart' would only penalize people who actually
used non-ASCII-scripts to some (possibly serious) degree.
There is a very big difference between a sensible *internal*
representation and a sensible *external* representation.

This notion of 'internal' and 'external' representation is nonsense:
In order to cooperate sensibly, a number of different processes need
to use the same 'representation' for text data to avoid repeated
decoding and encoding whenever data needs to cross a process
boundary. And for 'external representation', using a proper
compression algorithm for data which doesn't need to be usable in its
stored form will yield better results than any 'encoding scheme'
biased towards making the important things (deal with US-english texts)
simple and resting comfortably on the notion that everything else is
someone else's problem.
 

Eric Pozharski

with said:
*SKIP*
If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences
(either literally or via escape sequences) in your source code. For
example, if you store this program in KOI8-U encoding:


#!/usr/bin/perl
use warnings;
use strict;
use 5.010;
use encoding 'KOI8-U';

my $s1 = "Б";
say ord($s1);
my $s2 = "\x{E2}";
say ord($s2);
__END__

(i.e. the string literal on line 7 is stored as the byte sequence 0x22
0xE2 0x22), the program will print 1041 twice, because:

* The perl compiler knows that the source code is in KOI-8, so a
single byte 0xE2 in the source code represents the character “U+0411
CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
\ooo and \xXX are taken to denote bytes in the source character set
and translated to unicode. So both the literal Б on line 7 and the
\x{E2} on line 9 are translated to U+0411.

* At run time, the bytecode interpreter sees a string with the single
unicode character U+0411. How this character was represented in the
source code is irrelevant (and indeed, unknowable) to the byte code
interpreter at this stage. It just prints the decimal representation
of 0x0411, which happens to be 1041.

Indeed, that renders perl somewhat lame. "They" could invent some
property, attached at will to any scalar, that would reflect some
byte-encoding somewhat connected with this scalar. Then make every other
operation pay attention to that property. However, that hasn't been
done. Because on the way to all-utf8 Perl, sacrifices have to be made.
Now, if that source were saved as UTF-8, the output wouldn't be any
different.

I had no use for ord() (and I don't have one now), but it wouldn't
surprise me if at some point in perl development ord() (in this script)
returned 208. And the only thing that could be done to make it work
would be an upgrade, sometime later.

Look, *literals* are converted to utf8 with the UTF8 flag on. Maybe
that's what made (and makes) qr// work as expected:

{41393:56} [0:0]% perl -wlE '"фыва" =~ m{(\w)}; print $1'

{42187:57} [0:0]% perl -Mutf8 -wle '"фыва" =~ m{(\w)}; print $1'
Wide character in print at -e line 1.
ф
{42203:58} [0:0]% perl -Mencoding=utf8 -wle '"фыва" =~ m{(\w)}; print $1'
ф

For an explanation of what happens in the 1st example, see below. I may
be wrong here, but I think that in the 2nd and 3rd examples it all turns
on $^H anyway.
Yes, I think I wrote that before. I don't know what this has to do
with the behaviour of “use encoding”, except that historically, “use
encoding” was intended to convert old byte-oriented scripts to the
brave new unicode-centered world with minimal effort. (I don't think
it met that goal: Over the years I have encountered a lot of people
who had problems with “use encoding”, but I don't remember ever
reading from someone who successfully converted their scripts by
slapping “use encoding '...'” at the beginning.)

I didn't convert anything. So I don't pretend you can count me in.
Just now I've come to the conclusion that C<use encoding 'utf8';>
(that's what I've ever used) is the effects of C<use utf8;> plus
binmode() on streams, minus the possibility to make non-us-ascii
literals. I've always been told that I *must* C<use utf8;> and then
manually do binmode()s myself. Nobody ever explained why I can't do
that with C<use encoding 'utf8';>.

Now, C<use encoding 'binary-enc';> behaves as above (they have a fully
functional UTF-8 script, limited by the advance of perl to all-utf8),
except the actual source isn't UTF-8. I can imagine reasons why that
could be necessary. Indeed, such circumstances would be rare. I myself
am in approximately full control of my environment, thus it's not a
problem for me.

As for 'a lot of people', I'll tell you who I've met. I've seen loads of
13-year-old boys (those are called snowflakes these days) who don't know
how to deal with shit. For those, who don't know how to deal with shit,
jobs.perl.org is the way.

*SKIP*
(but you do have to call it explicitly for STDERR, which IMNSHO is
inconsistent).

Think about it. What the terminal presents (in fonts) is locale-dependent.
That locale could be 'POSIX'. There's no 'POSIX.UTF-8'. And see below.

*SKIP*
You didn't understand why the middle one produced this particular
result. So you were surprised by the way “use encoding” translates
string literals. I wasn't surprised. I knew how it works and explained
it to you in my followup.

It's nice that you brought that back. I've already figured it all out.

----
{0:1} [0:0]% perl -Mutf8 -wle 'print "à"'
�
{23:2} [0:0]% perl -Mutf8 -wle 'print "à "'
�
----
{36271:17} [0:0]% perl -Mutf8 -wle 'print "à"'

{36280:18} [0:0]% perl -Mutf8 -wle 'print "à "'
à
----

What's common in those two pairs: it's special Perl-latin1, with the
UTF8 flag off, and no utf8-concerned layer set on output. What's
different: the former is xterm, the latter is urxvt. In either case,
this is what is actually output:

{36831:20} [0:1]% perl -Mutf8 -wle 'print "à"' | xxd
0000000: e00a ..
{37121:21} [0:0]% perl -Mutf8 -wle 'print "à "' | xxd
0000000: e020 0a . .

So, 0xe0 has no business in utf-8 output. xterm replaces it with the
replacement character (which makes sense). In contrast, urxvt applies
some weird heuristic (and it's really weird):

{37657:28} [0:0]% perl -Mutf8 -wle 'print "àá"'
à
{37663:29} [0:0]% perl -Mutf8 -wle 'print "àáâ"'
àá
{37666:30} [0:0]% perl -Mutf8 -wle 'print "àáâã"'
àáâ

*If* it's xterm vs. urxvt then, I think, it's religious (that means it's
not going to change). However, it doesn't look configurable, or at least
documented, while obviously it could be usable (configurability
provided). Then again, it may be some weird interaction with fontconfig,
or xft, or some unnamed perl extension, or whatever else. If I don't
forget, I'll investigate it later after upgrades.

As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.) In the
case of C<use encoding 'utf8';> it happens to be decoding from utf-8 to
utf8. Encode.pm tries to decode a byte with a value above 0x7F and falls
back to the replacement character.

That may be undesired. And considering this:

encoding - allows you to write your script in non-ascii or non-utf8

C<use encoding 'utf8';> may constitute abuse. What can I say? I'm
abusing it. Maybe that's why it works.

*CUT*
 

Rainer Weikusat

Rainer Weikusat said:
Ben Morrow said:
Quoth Rainer Weikusat <[email protected]>:
[...]
The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints. Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode.
[...]

With the most naive implementation, this would mean that moving 100G
of text data through Perl (and that's a small number for some jobs I'm
thinking of) requires copying 400G of data into Perl and 400G out of
it.

And - of course - this still wouldn't help since a 'character'
as it appears in some script doesn't necessarily map 1:1 to a Unicode
codepoint. Eg, the German a-umlaut can either be represented as the
ISO-8859-1 code for that (IIRC) or as 'a' followed by a 'combining
diaeresis' (and the policy of the Unicode consortium is actually to avoid
adding more 'precombined characters' in favor of 'grapheme
construction sequences', at least, that's what it was in 2005, when I
last had a closer look at this).
 

Peter J. Holzer

[...]
(In practice it would break XS, so it probably won't happen, which is a
shame. UTF-8 was a very bad choice of internal representation, in
retrospect, though it seemed to make sense at the time. It makes a great
many internal operations much more complicated than they need to be,
because you can no longer index into an array to find a particular
character in the string.)

The only way to provide that is to store all characters as integer
values large enough to encompass all conceivably existing Unicode
codepoints.

Not necessarily. As Ben already pointed out, not all strings have to
have the same representation. There is at least one programming language
(Pike) which uses 1, 2, or 4 bytes per character depending on the
"widest" character in the string. IIRC, Pike had Unicode code before
Perl, so Perl could have "stolen" that idea.

Otherwise, you're going to have multibyte characters and
consequently, 'indexing into the array to find a particular character
in the string' won't work anymore.

There are other tradeoffs, too: UTF-8 is quite compact for latin text,
but it takes about 2 bytes per character for most other alphabetic
scripts (e.g. Cyrillic, Greek, Devanagari) and 3 for CJK and some other
scripts (e.g. Hiragana and Katakana). So the size problem you
mentioned may be reversed if you are mainly processing Asian text.
Plus scanning a text may be quite a bit faster if you can do it in 16
bit quantities instead of 8 bit quantities.
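
A quick way to see those sizes for yourself (the sample strings are my
own):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # the literals below are UTF-8 in the source
use Encode qw(encode);

for my $s ("Latin", "Кириллица", "ひらがな", "漢字") {
    printf "%d characters -> %d UTF-8 bytes\n",
        length($s), length(encode('UTF-8', $s));
}
# Latin: 1 byte/char, Cyrillic: 2 bytes/char, kana and CJK: 3 bytes/char.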

Independently of this, the UTF-8 encoding was designed to have
represenation of the Unicode character set which was backwards
compatible with 'ASCII-based systems' and it is not only a widely
supported internet standard (http://tools.ietf.org/html/rfc3629) and
the method of choice for dealing with 'Unicode' for UNIX(*) and
similar system but formed the 'basic character encoding' of complete
operating systems as early as 1992
(http://plan9.bell-labs.com/plan9/about.html).

However, the Plan 9 C API has exactly the distinction you are
criticizing: Internally, strings are arrays of 16-bit quantities;
externally, they are read and written as UTF-8.

From the well-known "Hello world" paper:

| All programs in Plan 9 now read and write text as UTF, not ASCII.
| This change breaks two deep-rooted symmetries implicit in most C
| programs:
|
| 1. A character is no longer a char.
|
| 2. The internal representation (Rune) of a character now differs from
| its external representation (UTF).

(The paper was written before Unicode 2.0, so all characters were 16
bit. I don't know the current state of Plan 9)

hp
 

Peter J. Holzer

I don't know what Win32's internal representation is (I suspect 32bit
int, the same as Unix), but its default external representation is
UTF-16, which is about the most braindead concoction anyone has ever
come up with.

I guess you haven't seen Punycode ;-) [There seems to be no "barf"
emoticon in Unicode - I'm disappointed]
The only possible justification for its existence is
backwards-compatibility with systems which started implementing
Unicode before it was finished,

What do you mean by "finished"? There is a new version of the Unicode
standard about once per year, so it probably won't be "finished" as long
as the unicode consortium exists.

Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).

So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.

and even then I'm *certain* they could have made it less grotesquely
ugly if they'd tried (a UTF-8-like scheme, for instance).

UTF-16 has a few things in common with UTF-8:

* both are backward compatible with an existing shorter encoding
(UTF-8: US-ASCII, UTF-16: UCS-2)
* both are variable width
* both are self-terminating
* Both use some high bits to distinguish between a single unit (8 resp.
16 bits), the first unit and subsequent unit(s)

The main differences are

* UTF-16 is based on 16-bit units instead of bytes (well, duh!)
* There was no convenient free block at the top of the value range,
so the surrogate areas are somewhere in the middle.
* and therefore ordering isn't preserved (but that wouldn't be
meaningful anyway)

The main problem I have with UTF-16 is of a psychological nature: It is
extremely tempting to assume that it's a constant-width encoding because
"nobody uses those funky characters above U+FFFF anyway". Basically the
"all the world uses US-ASCII" trap reloaded.

hp
 

Rainer Weikusat

Ben Morrow said:
Quoth "Peter J. Holzer" <[email protected]>:
[...]
Unicode was originally intended to be a 16 bit code, and Unicode 1.0
reflected this: It was 16 bit only and there was no intention to expand
it. That was only added in 2.0, about 4 years later (and at that time it
was theoretical: The first characters outside of the BMP were defined in
Unicode 3.1 in 2001, 9 years after the first release).

So of course anybody who implemented Unicode between 1992 and 1996
implemented it as a 16 bit code, because that was what the standard
said. Those early adopters include Plan 9, Windows NT, and Java.

Yeah, fair enough, I suppose. It seems obvious in hindsight that 16 bits
weren't going to be enough, but maybe that isn't fair.

It should have been obvious 'in foresight' that the '16 bit code' of
today will turn into a 22 bit code tomorrow, a 56 bit code a fortnight
from now and then slip back to 18.5 bit two weeks later[*] (the 0.5 bit
introduced by some guy who used to work with MPEG who transferred to the
Unicode consortium), much in the same way the W3C keeps changing the
name of HTML 4.01 strict to give the impression of development beyond
aimlessly moving in circles in the hope that - some day - someone might
choose to adopt it (web developers have shown a remarkable common sense
in this respect).

BTW, there's another aspect of the "all the world is external to perl
and doesn't matter [to us]" nonsense: perl can be embedded. Eg, I
spent a sizable part of my day yesterday writing some Perl code
supposed to run inside of postgres, as part of a UTF-8-based
database. In practice, it is possible to choose a database encoding
which can represent everything which needs to be represented in this
database and which is also compatible with Perl, making it feasible to
use it for data manipulation. In theory, that's another "Thing which
must not be done", which - in this case - simply means that avoiding
Perl for such code in favour of a language which gives its users fewer
gratuitous headaches is preferable.

[*] I keep wondering why the letter T isn't defined as 'vertical
bar' + 'combining overline' (or why A isn't 'greek delta' + 'combining
hyphen' ...)
 

Dr.Ruud

Yes. That's called 'a 32-bit int', and is the standard wchar_t C
representation of Unicode. A sensible alternative would be a 1/2/4-byte
upgrade scheme somewhat similar to the current Perl scheme, but with all
the alternatives being constant width; a smarter alternative would be to
represent a string as a series of pieces, each of which could make a
different choice (and, potentially, some of which could be shared or CoW
with other strings).

Let's invent the byte-oriented utf-2d.

The bytes for the special (i.e. non-ASCII) characters have the high bit
on, and further still have a meaningful value, such that they can be
matched as a (cased) word-character / digit / whitespace, punctuation, etc.
Per special character position there can be an entry in the side table,
that defines the real data for that position.

The 80-8F bytes are for future extensions. A 9E byte can prepend a data
part. A 9F byte (ends a data part and) starts a table part.

An ASCII buffer remains as is. A latin1 buffer also remains as is,
unless it contains a code point between 80 and 9F.


Possible usage of 90-9F, assuming " 0Aa." collation:

90: .... space
91: ...# digit
92: ..#. upper
93: ..## upper|digit
94: .#.. lower
95: .#.# lower|digit
96: .##. alpha
97: .### alnum
98: #... punct
99: #..# numeric?
9A: #.#. ...
9B: #.## ...
9C: ##.. ...
9D: ##.# ...
9E: ###. SOD (start-of-data)
9F: #### SOT (start-of-table)
 

Peter J. Holzer

The main problem *I* have is the fact the surrogates are allocated out
of the Unicode character space, so everyone doing anything with Unicode
has to take account of them, even if they won't ever be touching UTF-16
data. UTF-8 doesn't do that: it has magic bits indicating the
variable-length sections, but they are kept away from the data bits
representing the actual characters encoded.

The same could have been done with UTF-16. If I'm reading the charts
right, Unicode 1.1.5 (the last version before the change) allocated
characters from 0000-9FA5 and from F900-FFFF, which leaves Axxx-Exxx
free to represent multi-word characters. So, for instance, they could
have used the following scheme: A word matching one of

0xxxxxxxxxxxxxxx
1001xxxxxxxxxxxx
1111xxxxxxxxxxxx

is a single-word character. Other characters are represented as two
words, encoded as

101ppppphhhhhhhh 110pppppllllllll

which represents the 26-bit character

pppppppppphhhhhhhhllllllll

That takes a huge chunk (25%, or even 37.5% if you include the ranges
which you have omitted above) out of the BMP. These codepoints would
either not be assigned at all (same as with UTF-16) or have to be
represented as four bytes. By comparison, the UTF-16 scheme reduces the
number of codepoints representable in 16 bits only by 3.1%. So there was
a tradeoff: Number of characters representable in 16 bits (63488 :
40960 or 49152) versus total number of representable characters (1112064
: 67108864). Clearly they thought 1112064 ought to be enough for
everyone and opted for a denser representation of common characters.
(That doesn't mean that they considered exactly your encoding: but
surely they considered several different encodings before settling on
what is now known as UTF-16.)
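
A quick check of those numbers (my own arithmetic, nothing from the
standard):

#!/usr/bin/perl
use strict;
use warnings;

# UTF-16 reserves 2048 surrogate code points out of the 65536 in the BMP:
printf "BMP loss: %.1f%%\n", 100 * 2048 / 65536;   # 3.1%
# and it can reach 17 planes minus those surrogates:
printf "total:    %d\n", 17 * 65536 - 2048;        # 1112064
# while the two-word scheme above has 26 payload bits:
printf "scheme:   %d\n", 2 ** 26;                  # 67108864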

I know that at that point they were intending to extend the character
set to 31 bits,

Yes, but certainly not with UTF-16: That encoding is limited to ~ 20
bits (codepoints U+0000 .. U+10FFFF).
but IMHO reducing that to 26 would have been a lesser evil than
stuffing a whole lot of encoding rubbish into the application-visible
character set.

The only thing that's visible in the character set is that there is a
chunk of 2048 reserved code points which will never be assigned. How is
that different from other chunks of unassigned code points which may or
may not be assigned in the future?

hp
 

Peter J. Holzer

Indeed, that renders perl somewhat lame. "They" could invent some
property attached at will to any scalar that would reflect some
byte-encoding somewhat connected with this scalar. Then make every
other operation pay attention to that property.

Well, "they" could do all kinds of shit (to borrow your use of
language), but why should they?

Look, *literals* are converted to utf8 with the UTF8 flag on. Maybe
that's what made (and makes) qr// work as expected:

{41393:56} [0:0]% perl -wlE '"фыва" =~ m{(\w)}; print $1'

{42187:57} [0:0]% perl -Mutf8 -wle '"фыва" =~ m{(\w)}; print $1'
Wide character in print at -e line 1.
ф
{42203:58} [0:0]% perl -Mencoding=utf8 -wle '"фыва" =~ m{(\w)}; print $1'
ф

For an explanation of what happens in the 1st example, see below. I may
be wrong here, but I think that in the 2nd and 3rd examples it all turns
on $^H anyway.

You are making this way too complicated. You don't need to know about $^H
to understand this. It's really very simple.

In the first example, you are dealing with a string of 8 bytes
"\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0". Depending on the version of Perl you
are using, either none of them are word characters, or several of them
are. You don't get a warning, so I assume you use a perl >= 5.12, where
“use feature unicode_strings” exists and is turned on by -E. In this
case, the first byte of your string is a word character (U+00D1 LATIN
CAPITAL LETTER N WITH TILDE), so the script prints "\xd1\x0a".

In the second and third examples, you have a string of 4 characters,
"\x{0444}\x{044b}\x{0432}\x{0430}", all of which are word
characters, so the script prints "\x{0444}\x{0a}" (which then gets
encoded by the I/O layers, but I've explained that already and won't
explain it again).
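
Spelled out as a self-contained script (my restatement of the two cases,
without the command-line switches):

#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';   # part of what -E turns on in perl >= 5.12
use Encode qw(decode);

# The same four Cyrillic letters, once as 8 raw UTF-8 bytes and once
# decoded into a 4-character string.
my $bytes = "\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0";
my $chars = decode('UTF-8', $bytes);

$bytes =~ /(\w)/;
printf "byte string:      U+%04X\n", ord $1;   # U+00D1, N with tilde
$chars =~ /(\w)/;
printf "character string: U+%04X\n", ord $1;   # U+0444, CYRILLIC SMALL EF
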
I didn't convert anything. So I don't pretend you can count me in.
Just now I've come to the conclusion that C<use encoding 'utf8';>
(that's what I've ever used) is the effects of C<use utf8;> plus
binmode() on streams, minus the possibility to make non-us-ascii
literals.

Congratulations on figuring that out (except the last one: You can make
non-us-ascii literals with “use encoding” (that's one of the reasons why
it was written), the rules are just a bit different than with “use utf8”).
And of course I explicitly wrote that 10 days ago (and Ben possibly
wrote it before that but I'm not going to reread the whole thread).

I've always been told that I *must* C<use utf8;> and then manually do
binmode()s myself. Nobody ever explained why I can't do that with
C<use encoding 'utf8';>.

I don't know who told you that and who didn't explain that. It wasn't
me, that's for sure ;-). I have explained (in this thread and various
others over the last 10 years) what use encoding does and why I think
it's a bad idea to use it. If you understand it and are aware of the
tradeoffs, feel free to use it. (And of course there is no reason to use
“use utf8” unless your *source code* contains non-ascii characters).

Think about it. What the terminal presents (in fonts) is locale-dependent.
That locale could be 'POSIX'. There's no 'POSIX.UTF-8'. And see below.

And how is this only relevant for STDERR but not for STDIN and STDOUT?


That's nice you brought that back. I've already figured it all out.
[...]

Uh, no. That was a completely different problem.

So, 0xe0 has no business in utf-8 output. xterm replaces it with the
replacement character (which makes sense). In contrast, urxvt applies
some weird heuristic (and it's really weird)

Yes, we've been through that already.

As for your explanation: it's not precise. encoding.pm does what it
always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
(and friends) to decode from the encoding specified to utf8. (How
Encode.pm comes into play is beyond my understanding for now.)

Maybe you should be less confident about stuff which is beyond your
understanding.

hp
 

Peter J. Holzer

3. 5.8.7 is the last Perl release available on IBM's EBCDIC
operating systems, e.g., z/OS.

True. But what does that have to do with the paragraph you quoted?
I don't know whether there is a similar issue with Unisys.

It is my understanding that modern perl versions don't work on any
EBCDIC-based platform, so that would include Unisys[1], HP/MPE and other
EBCDIC-based platforms. Especially since these platforms are quite dead,
unlike z/OS which is still maintained.

hp

[1] Not all Unisys systems used EBCDIC. I think at least the 1100 series
used ASCII.
 

Peter J. Holzer

That paragraph appears to suggest upgrading to 5.12;

No that wasn't the intention. I was questioning Ben's assertion that
"we've been trying to stop people making this mistake since 5.8.0",
because before 5.12.0 it wasn't a mistake, it was a correct
understanding of how perl/Perl worked.

Unless of course by "people" he didn't mean Perl programmers but the
p5p team, and by "stop making this mistake" he meant "introducing the
unicode_strings feature and including it in 'use v5.12'". It is indeed
possible that the so-called "Unicode bug" was identified shortly after
5.8.0 and that Ben and others were trying to fix it since then.

I was pointing out that that is not always an option.

I mentioned that it wasn't an option for me just a few lines further
down. Of course in my case "not an option" just means "more hassle than
it's worth", not "impossible", I could install and maintain a current
Perl version on the 40+ servers I administer. But part of the appeal of
Perl is that it's part of the normal Linux infrastructure. Rolling my
own subverts that.

So, I hope I'll get rid of perl 5.8.x in 2017 (when the support for RHEL
5.x ends) and of perl 5.10.x in 2020 (EOL for RHEL 6.x). Then I can
write "use v5.12" into my scripts and enjoy a world without the Unicode
bug.

hp
 

Eric Pozharski

with said:
*SKIP*
Maybe you should be less confident about stuff which is beyond your
understanding.

Here's the deal. Explain to me what's complicated in this:

[quote encoding.pm on]
[producing $enc and $name goes above]
unless ( $arg{Filter} ) {
    DEBUG and warn "_exception($name) = ", _exception($name);
    _exception($name) or ${^ENCODING} = $enc;
    $HAS_PERLIO or return 1;
}
[dealing with the Filter option and STDIN/STDOUT goes below]
[quote encoding.pm off]

and I grant you and Ben the unlimited right to spread FUD on encoding.pm.
 

Peter J. Holzer

with said:
Maybe you should be less confident about stuff which is beyond your
understanding.

Here's the deal. Explain to me what's complicated in this:

[quote encoding.pm on]
[producing $enc and $name goes above]
unless ( $arg{Filter} ) {
    DEBUG and warn "_exception($name) = ", _exception($name);
    _exception($name) or ${^ENCODING} = $enc;
    $HAS_PERLIO or return 1;
}
[dealing with the Filter option and STDIN/STDOUT goes below]
[quote encoding.pm off]

So after reading 400 lines of perldoc encoding (presumably not for the
first time) and a rather long discussion thread you are starting to read
the source code to find out what “use encoding” does?

I think you are proving my point that “use encoding” is too complicated
rather nicely.

You will have to read the source code of perl, however. AFAICS
encoding.pm is just a frontend which sets up stuff for the parser to
use. I'm not going to follow you there - the perl parser has too many
tentacles for my taste.

hp
 
