Why "Wide character in print"?

Discussion in 'Perl Misc' started by tcgo, Sep 30, 2012.

  1. Then I don't understand what you meant by "that" in the quoted
    paragraph, since that seemed to refer to something else.

    Yes, of course. You used the term "utf8", so I was wondering what you
    meant by it.
    Then I don't know what you meant by "utf8". Care to explain?

    Read *what* again? The paragraph you quoted is correct and explains the
    behaviour you are seeing.

    That's not the problem. The problem is that you gave the output of
    Devel::Peek::Dump which clearly showed a latin-1 character occupying
    *two* bytes and then claimed that it was only one byte long. Which it
    clearly wasn't. What you probably meant was that the latin1 character
    would be only 1 byte long if written to an output stream without an
    encoding layer. But you didn't write that. You just made an assertion
    which clearly contradicted the example you had just given and didn't
    even give any indication that you had even noticed the contradiction.

    It is only special in the sense that all its codepoints have a value <=
    255. So if you are writing to a byte stream, it can be directly
    interpreted as a string of bytes and written to the stream without
    modification.

    The point that *I* am trying to make is that an I/O stream without an
    :encoding() layer isn't for I/O of *characters*, it is for I/O of
    *bytes*.

    Thus, when you write the string "Käse" to such a stream, you aren't
    writing Upper Case K, lower case umlaut a, etc. You are writing 4 bytes
    with the values 0x4B, 0xE4, 0x73, 0x65. The I/O-code doesn't care about
    whether the string is a character string (with the UTF8 bit set) or a byte
    string, it just interprets every element of the string as a byte. Those
    four bytes could be pixels in an image, for all the Perl I/O code knows.

    OTOH, if there is an :encoding() layer, the string is taken to be
    composed of (unicode) characters. If there is an element with the
    codepoint \x{E4} in the string, it is interpreted as a lower case
    umlaut a, and converted to the proper encoding (e.g. one byte 0x84 for
    CP850, two bytes 0xC3 0xA4 for UTF-8 and one byte 0xE4 for latin-1). But
    again, this happens *always*. The Perl I/O layer doesn't care whether
    the string is a character string (with the UTF8 bit set) or not.
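
    A minimal sketch of that difference (not from the original post; the xxd
    output is abbreviated):

    % perl -wle 'print "K\x{E4}se"' | xxd
    0000000: 4be4 7365 0a  K.se.
    % perl -wle 'binmode STDOUT, ":encoding(UTF-8)"; print "K\x{E4}se"' | xxd
    0000000: 4bc3 a473 650a  K..se.

    The only difference is the 0xE4 versus 0xC3 0xA4 for the ä (plus the
    newline added by -l).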

    Perl acquired unicode support in its current form only in 5.8.0. 5.6.0
    did have some experimental support for UTF-8-encoded strings, but it was
    different and widely regarded as broken (that's why it was changed for
    5.8.0). So what Perl 5.6.0 did or didn't do is irrelevant for this
    discussion.

    With some luck I managed to skip the 5.6 days and went directly from the
    <=5.005 "bytestrings only" era to the modern >=5.8.0 "character
    strings" era. However, in the early days of 5.8.x, the documentation was
    quite bad and it took a lot of reading, experimenting and thinking to
    arrive at a consistent understanding of the Perl string model.

    But once you have this understanding, it is really quite simple and
    consistent.
    This example doesn't have any non-ascii characters in the source code,
    so of course it doesn't need 'use utf8'. The only effect of use utf8 is
    to tell the perl compiler that the source code is encoded in UTF-8.

    But you *do* need some indication of the encoding of STDOUT (did you
    notice the warning "Wide character in print at -e line 5."? As long as
    you get this warning, your code is wrong).

    You could use "use encoding 'utf-8'":

    % perl -wle '
    use encoding "UTF-8";
    open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
    read $fh, $fh, -s $fh;
    $fh =~ m{(\w\w)};
    print $1
    '
    фы

    Or you could use -C on the command line:

    % perl -CS -wle '
    open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
    read $fh, $fh, -s $fh;
    $fh =~ m{(\w\w)};
    print $1
    '
    фы


    Or could use "use open":

    % perl -wle '
    use open ":locale";
    open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
    read $fh, $fh, -s $fh;
    $fh =~ m{(\w\w)};
    print $1
    '
    фы


    Note: No warning in all three cases. The latter takes the encoding from
    the environment, which hopefully matches your terminal settings. So it
    works on a UTF-8 or ISO-8859-5 or KOI-8 terminal. But of course it
    doesn't work on a latin-1 terminal and you get an appropriate warning:

    "\x{0444}" does not map to iso-8859-1 at -e line 6.
    "\x{044b}" does not map to iso-8859-1 at -e line 6.
    \x{0444}\x{044b}
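
    For completeness, a plain binmode on STDOUT (not among the three variants
    above, just a sketch) also avoids the warning, assuming a UTF-8 terminal:

    % perl -wle '
    binmode STDOUT, ":encoding(UTF-8)";
    print "\x{0444}\x{044b}"
    '
    фы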


    I don't know whether encoding.pm is broken in the sense that it doesn't
    do what it is documented to do (it was, but it is possible that all of
    those bugs have been fixed). I do think that it is "broken as designed",
    because it conflates two different things:

    * The encoding of the source code of the script
    * The default encoding of some I/O streams

    and it does so even in an inconsistent manner (e.g. the encoding is
    applied to STDOUT, but not to STDERR), and finally because it is too
    complex, which leads to surprising results.

    hp
     
    Peter J. Holzer, Nov 1, 2012
    #21

  2. *SKIP*
    Do you know the difference between utf-8 and utf8 for Perl? (For a long
    time, up to yesterday, I believed that utf-8 is all-caps; I was wrong,
    it's caseless.)

    *SKIP*
    Wrong.

    [quote perldoc encoding on]

    * Internally converts all literals ("q//,qq//,qr//,qw///, qx//") from
    the encoding specified to utf8. In Perl 5.8.1 and later, literals in
    "tr///" and "DATA" pseudo-filehandle are also converted.

    [quote off]

    In pre-all-utf8 times qr// was working on bytes without being told to
    behave otherwise. That's different now.
    We here, in our barbaric world, had (and still have) to process any
    binary encoding except latin1 (guess what, CP866 is still alive).
    However:

    [quote perldoc encoding on]

    * Changing PerlIO layers of "STDIN" and "STDOUT" to the encoding
    specified.

    [quote off]

    That's not saying anything about 'default'. It's about 'encoding
    specified'.
    No problems with that here. STDERR is us-ascii, point.
    In your elitist latin1 world -- may be so. But we, down here, are
    barbarians, you know.
     
    Eric Pozharski, Nov 2, 2012
    #22

  3. UTF-8 is the "UCS Transformation Format, 8-bit form" as defined by the
    Unicode consortium. It defines a mapping from unicode characters to
    bytes and back. When you use it as an encoding in Perl, there will be
    some checks that the input is actually a valid unicode character. For
    example, you can't encode a surrogate character:

    $s2 = encode("utf-8", "\x{D812}");

    results in the string "\xef\xbf\xbd", which is UTF-8 for U+FFFD (the
    replacement character used to signal invalid characters).


    utf8 may mean (at least) three different things in a Perl context:

    * It is a perl-proprietary encoding (actually two encodings, but EBCDIC
    support in perl has been dead for several years and I doubt it will
    ever come back, so I'll ignore that) for storing strings. The
    encoding is based on UTF-8, but it can represent code points with up
    to 64 bits[1], while UTF-8 is limited to 36 bits by design and to
    values <= 0x10FFFF by fiat. It also doesn't check for surrogates, so

    $s2 = encode("utf8", "\x{D812}");

    results in the string "\xed\xa0\x92", as one would naively expect.

    You should never use this encoding when reading or writing files.
    It's only for perl internal use and AFAIK it isn't documented
    anywhere except possibly in the source code.

    * Since the perl interpreter uses the format to store strings with
    Unicode character semantics (marked with the UTF8 flag), such strings
    are often called "utf8 strings" in the documentation. This is
    somewhat unfortunate, because "utf8" looks very similar to "utf-8",
    which can cause confusion and because it exposes an implementation
    detail (there are several other possible storage formats a perl
    interpreter could reasonably use) to the user.

    I avoid this usage. I usually talk about "byte strings" or "character
    strings", or use even more verbose language to make clear what I am
    talking about. For example, in this thread the distinction between
    byte strings and character strings is almost irrelevant; it is only
    important whether a string contains an element > 0xFF or not.

    * There is also an I/O layer “:utf8”, which is subtly different from
    both “:encoding(utf8)” and “:encoding(utf-8)”.
    Yes, the encoding names (as used in Encode::encode, Encode::decode and
    the :encoding() I/O-Layers) are case-insensitive.
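
    A small sketch to make that concrete (the byte values are the ones from
    the examples above):

    use Encode qw(encode);
    my $lone_surrogate = "\x{D812}";
    my $strict = encode("UTF-8", $lone_surrogate); # "\xef\xbf\xbd" (U+FFFD)
    my $lax    = encode("utf8",  $lone_surrogate); # "\xed\xa0\x92" (as is)
    my $again  = encode("Utf-8", $lone_surrogate); # same as $strict -- the
                                                   # names are case-insensitive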

    How is this proving me wrong? It confirms what I wrote.

    If you use “use encoding 'KOI8-U';”, you can use KOI8 sequences (either
    literally or via escape sequences) in your source code. For example, if
    you store this program in KOI8-U encoding:


    #!/usr/bin/perl
    use warnings;
    use strict;
    use 5.010;
    use encoding 'KOI8-U';

    my $s1 = "Б";
    say ord($s1);
    my $s2 = "\x{E2}";
    say ord($s2);
    __END__

    (i.e. the string literal on line 7 is stored as the byte sequence 0x22
    0xE2 0x22), the program will print 1041 twice, because:

    * The perl compiler knows that the source code is in KOI-8, so a single
    byte 0xE2 in the source code represents the character “U+0411
    CYRILLIC CAPITAL LETTER BE”. Similarly, escape sequences of the form
    \ooo and \xXX are taken to denote bytes in the source character set
    and translated to unicode. So both the literal Б on line 7 and the
    \x{E2} on line 9 are translated to U+0411.

    * At run time, the bytecode interpreter sees a string with the single
    unicode character U+0411. How this character was represented in the
    source code is irrelevant (and indeed, unknowable) to the byte code
    interpreter at this stage. It just prints the decimal representation
    of 0x0411, which happens to be 1041.

    Yes, I think I wrote that before. I don't know what this has to do with
    the behaviour of “use encoding”, except that historically, “use
    encoding” was intended to convert old byte-oriented scripts to the brave new
    unicode-centered world with minimal effort. (I don't think it met that
    goal: Over the years I have encountered a lot of people who had problems
    with “use encoding”, but I don't remember ever reading from someone who
    successfully converted their scripts by slapping “use encoding '...'”
    at the beginning.)
    You misunderstood what I meant by "default". When the perl interpreter
    creates the STDIN and STDOUT file handles, these have some I/O layers
    applied to them, without the user explicitly having to call
    binmode(). These are applied by default, and hence I call them the
    default layers. The list of default layers varies between systems
    (Windows adds the :crlf layer, Linux doesn't), on command line settings
    (-CS adds the :utf8 layer, IIRC), and of course it can also be
    manipulated by modules like “encoding”. “use encoding 'CP866';” pushes
    the layer “:encoding(CP866)” onto the STDIN and STDOUT handles. You can
    still override them with binmode(), but they are there by default, you
    don't have to call “binmode STDIN, ":encoding(CP866)"” explicitly
    (but you do have to call it explicitly for STDERR, which IMNSHO is
    inconsistent).
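
    You can check which layers a handle actually has with PerlIO::get_layers
    (a sketch; the exact list is platform- and build-dependent, Windows for
    example inserts :crlf):

    % perl -wle 'print join " ", PerlIO::get_layers(*STDOUT)'
    unix perlio
    % perl -CS -wle 'print join " ", PerlIO::get_layers(*STDOUT)'
    unix perlio utf8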

    If my scripts handle non-ascii characters, I want those characters also
    in my error messages. If a script is intended for normal users (not
    sysadmins), I might even want the error messages to be in their native
    language instead of English. German can be expressed in pure US-ASCII,
    although it's awkward. Russian or Chinese is harder.
    May I remind you that it was you who was surprised by the behaviour of
    “use encoding” in this thread, not me?


    | {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
    | à
    | {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
    | �
    | {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hoora
    | à
    |
    | Except the middle one (what I should think about), I think encoding.pm
    | wins again.

    You didn't understand why the middle one produced this particular
    result. So you were surprised by the way “use encoding” translates
    string literals. I wasn't surprised. I knew how it works and explained
    it to you in my followup.

    Still, although I think I understand “use encoding” fairly well (because
    I spent a lot of time reading the docs and playing with it when I still
    thought it would be a useful tool, and later because I spent a lot of
    time arguing on usenet that it isn't useful) I think it is too complex.
    I would be afraid of making stupid mistakes like writing "\x{E0}" when I
    meant chr(0xE0), and even if I don't make them, the next guy who has to
    maintain the scripts probably understands much less about “use encoding”
    than I do and is likely to misunderstand my code and introduce errors.

    hp


    [1] I admit that I was surprised by this. It is documented that strings
    consist of 64-bit elements on 64-bit machines, but I thought this
    was an obvious documentation error until I actually tried it.
     
    Peter J. Holzer, Nov 3, 2012
    #23
  4. [...]
    The only way to provide that is to store all characters as integer
    values large enough to encompass all conceivably existing Unicode
    codepoints. Otherwise, you're going to have multibyte characters and
    consequently, 'indexing into the array to find a particular character
    in the string' won't work anymore.

    Independently of this, the UTF-8 encoding was designed to have a
    representation of the Unicode character set which was backwards
    compatible with 'ASCII-based systems' and it is not only a widely
    supported internet standard (http://tools.ietf.org/html/rfc3629) and
    the method of choice for dealing with 'Unicode' for UNIX(*) and
    similar systems but formed the 'basic character encoding' of complete
    operating systems as early as 1992
    (http://plan9.bell-labs.com/plan9/about.html). As such, supporting it
    natively in a programming language closely associated with UNIX(*), at
    least at that time, should have been pretty much a no brainer. "But
    Microsoft did it differently !!1" is the ultimate argument for some
    people but - thankfully - these didn't get to piss into Perl until
    very much later and thus, the damage they can still do is mostly
    limited to 'propaganda'.
     
    Rainer Weikusat, Nov 5, 2012
    #24
  5. I would also like to point out that this is an inherent deficiency of
    the idea to represent all glyphs of all conceivable scripts with a
    single encoding scheme, and that the practical consequences of that are
    mostly 'anything which restricts itself to the US typewriter character
    set is fine' (and everyone else is going to have no end of problems
    because of that).

    I actually stopped using German characters like a-umlaut years ago
    exactly because of this.
     
    Rainer Weikusat, Nov 5, 2012
    #25
  6. Who is "we"? Before 5.12, you had to make the distinction.
    Strings without the SvUTF8 flag simply didn't have Unicode semantics.
    Now there is the unicode_strings feature, but

    1) it still isn't default
    2) it will be years before I can rely on perl 5.12+ being installed on
    a sufficient number of machines to use it. I'm not even sure if most
    of our machines have 5.10 yet (the Debian machines have, but most of
    the RHEL machines have 5.8.x)

    So, that distinction has existed for at least 8 years (2002-07-18 to
    2010-04-12) and for many of us it will exist for at least another few years.

    So enforcing the concept I have in my head in the Perl code is simply
    defensive programming.
    It worked for me ;-).
    Theoretically yes. In practice it almost always means that the
    programmer forgot to call encode() somewhere.

    And the other way around didn't work at all: You couldn't keep
    characters > 127 but < 256 in a string without the SvUTF8 flag set
    and expect it to work.
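
    A small sketch of that difference (what exactly happens depends on the
    perl version and on whether unicode_strings is in effect; this is the
    typical behaviour without the feature):

    % perl -wle '
    my $byte = "\xE4";        # a-umlaut as a byte string (no SvUTF8 flag)
    my $char = "\xE4";
    utf8::upgrade($char);     # same codepoint, but with the SvUTF8 flag set
    print "byte string: ", $byte =~ /\w/ ? "word char" : "not a word char";
    print "char string: ", $char =~ /\w/ ? "word char" : "not a word char";
    '
    byte string: not a word char
    char string: word char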

    hp
     
    Peter J. Holzer, Nov 5, 2012
    #26
  7. This mostly means that I cannot possibly be a self-conscious human
    being capable of interacting with the world in some kind of
    'intelligent' (meaning, influencing it such that it changes according
    to some desired outcome) way but must be some kind of lifeform below
    the level of a dog or a bird. Yet, I'm capable of using written
    language to communicate with you (with some difficulties), using a
    computer connected to 'the internet' in order to run a program on a
    completely different computer 9 miles away from my present location,
    utilizing a server I have to pay for once a year from my bank account
    which resides (AFAIK) in Berlin.

    How can this possibly be?
     
    Rainer Weikusat, Nov 5, 2012
    #27
  8. With the most naive implementation, this would mean that moving 100G
    of text data through Perl (and that's a small number for some jobs I'm
    thinking of) requires copying 400G of data into Perl and 400G out of
    it. What you consider 'smart' would only penalize people who actually
    used non-ASCII-scripts to some (possibly serious) degree.
    This notion of 'internal' and 'external' representation is nonsense:
    In order to cooperate sensibly, a number of different processes need
    to use the same 'representation' for text data to avoid repeated
    decoding and encoding whenever data needs to cross a process
    boundary. And for 'external representation', using a proper
    compression algorithm for data which doesn't need to be usable in its
    stored form will yield better results than any 'encoding scheme'
    biased towards making the important things (deal with US-english texts)
    simple and resting comfortably on the notion that everything else is
    someone else's problem.
     
    Rainer Weikusat, Nov 6, 2012
    #28
  9. Indeed, that renders perl somewhat lame. "They" could invent some
    property attached at will to any scalar that would reflect some
    byte-encoding somewhat connected with this scalar. Then make every other
    operation pay attention to that property. However, that hasn't been
    done. Because on the way to all-utf8 Perl sacrifices have to be made.
    Now, if that source were saved as UTF-8, then the output wouldn't be any
    different.

    I had no use for ord() (and I don't have any now) but it wouldn't surprise
    me if at some point in perl development ord() (in this script) would
    return 208. And the only thing that could be done to make it work would
    be to upgrade, sometime later.

    Look, *literals* are converted to utf8 with the UTF8 flag on. Maybe that's
    what made (and makes) qr// work as expected:

    {41393:56} [0:0]% perl -wlE '"фыва" =~ m{(\w)}; print $1'

    {42187:57} [0:0]% perl -Mutf8 -wle '"фыва" =~ m{(\w)}; print $1'
    Wide character in print at -e line 1.
    ф
    {42203:58} [0:0]% perl -Mencoding=utf8 -wle '"фыва" =~ m{(\w)}; print $1'
    ф

    For an explanation of what happens in the 1st example, see below. I may be
    wrong here, but I think that in the 2nd and 3rd examples it all turns
    around $^H anyway.
    I didn't convert anything. So I don't pretend you can count me in.
    Just now I've come to conclusion that C<use encoding 'utf8';> (that's
    what I've ever used) is effects of C<use utf8;> plus binmode() on
    streams minus the possibility to make non us-ascii literals. I've
    always been told that I *must* C<use utf8;> and then manually do binmode()s
    myself. Nobody ever explained why I can't do that with C<use encoding
    'utf8';>.

    Now, C<use encoding 'binary-enc';> behaves as above (they have fully
    functional UTF-8 script limited by advance of perl to all-utf8), except
    actual source isn't UTF-8. I can imagine reasons why that could be
    necessary. Indeed, such circumstances would be rare. I'm in
    approximately full control of the environment, thus it's not a problem for me.

    As of 'lot of people', I'll tell you who I've met. I've seen loads of
    13-year-old boys (those are called snowflakes these days) who don't know
    how to deal with shit. For those, who don't know how to deal with shit,
    jobs.perl.org is the way.

    *SKIP*
    Think about it. What terminal presents (in fonts) is locale dependent.
    That locale could be 'POSIX'. There's no 'POSIX.UTF-8'. And see below.

    *SKIP*
    That's nice you brought that back. I've already figured it all out.

    ----
    {0:1} [0:0]% perl -Mutf8 -wle 'print "à"'
    �
    {23:2} [0:0]% perl -Mutf8 -wle 'print "à "'
    �
    ----
    {36271:17} [0:0]% perl -Mutf8 -wle 'print "à"'

    {36280:18} [0:0]% perl -Mutf8 -wle 'print "à "'
    à
    ----

    What's common in those two pairs: it's special Perl-latin1, with the UTF8
    flag off, and no utf8-concerned layer is set on output. What's different:
    the former is xterm, the latter is urxvt. In either case, that's what
    is actually output:

    {36831:20} [0:1]% perl -Mutf8 -wle 'print "à"' | xxd
    0000000: e00a ..
    {37121:21} [0:0]% perl -Mutf8 -wle 'print "à "' | xxd
    0000000: e020 0a . .

    So, 0xe0 has no business in utf-8 output. xterm replaces it with the
    replacement character (which makes sense). In contrast, urxvt applies some
    weird heuristic (and it's really weird):

    {37657:28} [0:0]% perl -Mutf8 -wle 'print "àá"'
    à
    {37663:29} [0:0]% perl -Mutf8 -wle 'print "àáâ"'
    àá
    {37666:30} [0:0]% perl -Mutf8 -wle 'print "àáâã"'
    àáâ

    *If* it's xterm vs. urxvt then, I think, it's religious (that means it's
    not going to change). However, it doesn't look configurable or at least
    documented while obviously it could be usable (configurability
    provided). Then it may be some weird interaction with fontconfig, or
    xft, or some unnamed perl extension, or whatever else. If I don't
    forget I'll investigate it later after upgrades.

    As for your explanation: it's not precise. encoding.pm does what it
    always does. It doesn't mangle scalars itself, it *hints* to Encode.pm
    (and friends) to decode from the encoding specified to utf8. (How
    Encode.pm comes into play is beyond my understanding for now.) In case
    of C<use encoding 'utf8';> it happens to be decoding from utf-8 to utf8.
    Encode.pm tries to decode a byte with a value above 0x7F and falls back
    to the replacement character.
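
    A one-liner illustrating that fallback (a sketch, assuming Encode's
    default CHECK behaviour):

    % perl -MEncode=decode -wle 'printf "U+%04X\n", ord decode("UTF-8", "\xE0")'
    U+FFFD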

    That may be undesired. And considering this:

    encoding - allows you to write your script in non-ascii or non-utf8

    C<use encoding 'utf8';> may constitute abuse. What can I say? I'm
    abusing it. Maybe that's why it works.

    *CUT*
     
    Eric Pozharski, Nov 6, 2012
    #29
  10. And - of course - this still wouldn't help since a 'character'
    as it appears in some script doesn't necessarily map 1:1 to a Unicode
    codepoint. Eg, the German a-umlaut can either be represented as the
    ISO-8859-1 code for that (IIRC) or as 'a' followed by a 'combining
    diaeresis' (and the policy of the Unicode consortium is actually to avoid
    adding more 'precombined characters' in favor of 'grapheme
    construction sequences', at least, that's what it was in 2005, when I
    last had a closer look at this).
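
    A small illustration of that distinction with the core Unicode::Normalize
    module (a sketch, not from the original post):

    % perl -MUnicode::Normalize -wle '
    my $pre = "\x{E4}";           # a-umlaut as one precombined codepoint
    my $dec = NFD($pre);          # "a" followed by U+0308 COMBINING DIAERESIS
    print length($pre), " vs ", length($dec);
    print "round-trips" if NFC($dec) eq $pre;
    '
    1 vs 2
    round-trips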
     
    Rainer Weikusat, Nov 6, 2012
    #30
  11. Not necessarily. As Ben already pointed out, not all strings have to
    have the same representation. There is at least one programming language
    (Pike) which uses 1, 2, or 4 bytes per character depending on the
    "widest" character in the string. IIRC, Pike had Unicode code before
    Perl, so Perl could have "stolen" that idea.

    There are other tradeoffs, too: UTF-8 is quite compact for latin text,
    but it takes about 2 bytes per character for most other alphabetic
    scripts (e.g. Cyrillic, Greek, Devanagari) and 3 for CJK and some other
    alphabetic scripts (e.g. Hiragana and Katakana). So the size problem you
    mentioned may be reversed if you are mainly processing Asian text.
    Plus scanning a text may be quite a bit faster if you can do it in 16
    bit quantities instead of 8 bit quantities.
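
    A quick sketch of those size differences (sample words made up for the
    purpose):

    % perl -Mutf8 -MEncode=encode -wle '
    binmode STDOUT, ":encoding(UTF-8)";
    for my $s ("Latin", "Кириллица", "漢字かな") {
        printf "%s: %d characters, %d UTF-8 bytes\n",
            $s, length($s), length(encode("UTF-8", $s));
    }
    '
    Latin: 5 characters, 5 UTF-8 bytes
    Кириллица: 9 characters, 18 UTF-8 bytes
    漢字かな: 4 characters, 12 UTF-8 bytes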

    However, the Plan 9 C API has exactly the distinction you are
    criticizing: Internally, strings are arrays of 16-bit quantities;
    externally, they are read and written as UTF-8.

    From the well-known "Hello world" paper:

    | All programs in Plan 9 now read and write text as UTF, not ASCII.
    | This change breaks two deep-rooted symmetries implicit in most C
    | programs:
    |
    | 1. A character is no longer a char.
    |
    | 2. The internal representation (Rune) of a character now differs from
    | its external representation (UTF).

    (The paper was written before Unicode 2.0, so all characters were 16
    bit. I don't know the current state of Plan 9)

    hp
     
    Peter J. Holzer, Nov 6, 2012
    #31
  12. I guess you haven't seen Punycode ;-) [There seems to be no "barf"
    emoticon in Unicode - I'm disappointed]
    What do you mean by "finished"? There is a new version of the Unicode
    standard about once per year, so it probably won't be "finished" as long
    as the unicode consortium exists.

    Unicode was originally intended to be a 16 bit code, and Unicode 1.0
    reflected this: It was 16 bit only and there was no intention to expand
    it. That was only added in 2.0, about 4 years later (and at that time it
    was theoretical: The first characters outside of the BMP were defined in
    Unicode 3.1 in 2001, 9 years after the first release).

    So of course anybody who implemented Unicode between 1992 and 1996
    implemented it as a 16 bit code, because that was what the standard
    said. Those early adopters include Plan 9, Windows NT, and Java.

    UTF-16 has a few things in common with UTF-8:

    * both are backward compatible with an existing shorter encoding
    (UTF-8: US-ASCII, UTF-16: UCS-2)
    * both are variable width
    * both are self-terminating
    * Both use some high bits to distinguish between a single unit (8 resp.
    16 bits), the first unit and subsequent unit(s)

    The main differences are

    * UTF-16 is based on 16-bit units instead of bytes (well, duh!)
    * There was no convenient free block at the top of the value range,
    so the surrogate areas are somewhere in the middle.
    * and therefore ordering isn't preserved (but that wouldn't be
    meaningful anyway)

    The main problem I have with UTF-16 is of a psychological nature: It is
    extremely tempting to assume that it's a constant-width encoding because
    "nobody uses those funky characters above U+FFFF anyway". Basically the
    "all the world uses US-ASCII" trap reloaded.

    hp
     
    Peter J. Holzer, Nov 6, 2012
    #32
  13. It should have been obvious 'in foresight' that the '16 bit code' of
    today will turn into a 22 bit code tomorrow, a 56 bit code a fortnight
    from now and then slip back to 18.5 bit two weeks later[*] (the 0.5 bit
    introduced by some guy who used to work with MPEG who transferred to the
    Unicode consortium), much in the same way the W3C keeps changing the
    name of HTML 4.01 strict to give the impression of development beyond
    aimlessly moving in circles in the hope that - some day - someone might
    choose to adopt it (web developers have shown a remarkable common sense
    in this respect).

    BTW, there's another aspect of the "all the world is external to perl
    and doesn't matter [to us]" nonsense: perl can be embedded. Eg, I
    spend a sizable part of my day yesterday writing some Perl code
    supposed to run inside of postgres, as part of a UTF-8 based
    database. In practice, it is possible to chose a database encoding
    which can represent everything which needs to be represented in this
    database which is also compatible with Perl, making it feasible to use
    it for data manipulation. In theory, that's another "Thing which must
    not be done" which - in this case - simply means that avoiding Perl
    for such code in favour of a language which gives its users less
    gratuitous headaches is preferable.

    [*] I keep wondering why the letter T isn't defined as 'vertical
    bar' + 'combining overline' (or why A isn't 'greek delta' + 'combining
    hyphen' ...)
     
    Rainer Weikusat, Nov 7, 2012
    #33
  14. Let's invent the byte-oriented utf-2d.

    The bytes for the special (i.e. non-ASCII) characters have the high bit
    on, and further still have a meaningful value, such that they can be
    matched as a (cased) word-character / digit / whitespace, punctuation, etc.
    Per special character position there can be an entry in the side table,
    that defines the real data for that position.

    The 80-8F bytes are for future extensions. A 9E byte can prepend a data
    part. A 9F byte (ends a data part and) starts a table part.

    An ASCII buffer remains as is. A latin1 buffer also remains as is,
    unless it contains a code point between 80 and 9F.


    Possible usage of 90-9F, assuming " 0Aa." collation:

    90: .... space
    91: ...# digit
    92: ..#. upper
    93: ..## upper|digit
    94: .#.. lower
    95: .#.# lower|digit
    96: .##. alpha
    97: .### alnum
    98: #... punct
    99: #..# numeric?
    9A: #.#. ...
    9B: #.## ...
    9C: ##.. ...
    9D: ##.# ...
    9E: ###. SOD (start-of-data)
    9F: #### SOT (start-of-table)
     
    Dr.Ruud, Nov 8, 2012
    #34
  15. That takes a huge chunk (25%, or even 37.5% if you include the ranges
    which you have omitted above) out of the BMP. These codepoints would
    either not be assigned at all (same as with UTF-16) or have to be
    represented as four bytes. By comparison, the UTF-16 scheme reduces the
    number of codepoints representable in 16 bits only by 3.1%. So there was
    a tradeoff: Number of characters representable in 16 bits (63488 :
    40960 or 49152) versus total number of representable characters (1112064
    : 67108864). Clearly they thought 1112064 ought to be enough for
    everyone and opted for a denser representation of common characters.
    (That doesn't mean that they considered exactly your encoding: But
    surely they considered several different encodings before settling on
    what is now known as UTF-16.)

    Yes, but certainly not with UTF-16: That encoding is limited to ~ 20
    bits (codepoints U+0000 .. U+10FFFF).
    The only thing that's visible in the character set is that there is a
    chunk of 2048 reserved code points which will never be assigned. How is
    that different from other chunks of unassigned code points which may or
    may not be assigned in the future?

    hp
     
    Peter J. Holzer, Nov 11, 2012
    #35
  16. Well, "they" could do all kinds of shit (to borrow your use of
    language), but why should they?

    You are thinking way too complicated. You don't need to know about $^H
    to understand this. It's really very simple.

    In the first example, you are dealing with a string of 8 bytes
    "\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0". Depending on the version of Perl you
    are using, either none of them are word characters, or several of them
    are. You don't get a warning, so I assume you use a perl >= 5.12, where
    “use feature unicode_strings” exists and is turned on by -E. In this
    case, the first byte of your string is a word character (U+00D1 LATIN
    CAPITAL LETTER N WITH TILDE), so the script prints "\xd1\x0a".

    In the second and third example, you have a string of 4 characters,
    "\x{0444}\x{044b}\x{0432}\x{0430}", all of which are word
    characters, so the script prints "\x{0444}\x{0a}" (which then gets
    encoded by the I/O layers, but I've explained that already and won't
    explain it again).
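
    For the record, a sketch of doing the byte-string case explicitly (decode
    the bytes first, then print through an :encoding layer):

    % perl -MEncode=decode -wle '
    my $bytes = "\xd1\x84\xd1\x8b\xd0\xb2\xd0\xb0";   # "фыва" as UTF-8 bytes
    my $chars = decode("UTF-8", $bytes);              # now 4 characters
    binmode STDOUT, ":encoding(UTF-8)";
    $chars =~ m{(\w)};
    print $1
    '
    ф
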
    Congratulations on figuring that out (except the last one: You can make
    non us-ascii literals with “use encoding” (that's one of the reasons why
    it was written), the rules are just a bit different than with “use utf8”).
    And of course I explicitly wrote that 10 days ago (and Ben possibly
    wrote it before that but I'm not going to reread the whole thread).

    I don't know who told you that and who didn't explain that. It wasn't
    me, that's for sure ;-). I have explained (in this thread and various
    others over the last 10 years) what use encoding does and why I think
    it's a bad idea to use it. If you understand it and are aware of the
    tradeoffs, feel free to use it. (And of course there is no reason to use
    “use utf8” unless your *source code* contains non-ascii characters).

    And how is this only relevant for STDERR but not for STDIN and STDOUT?


    [...]

    Uh, no. That was a completely different problem.

    Yes, we've been through that already.

    Maybe you should be less confident about stuff which is beyond your
    understanding.

    hp
     
    Peter J. Holzer, Nov 11, 2012
    #36
  17. True. But what does that have to do with the paragraph you quoted?
    It is my understanding that modern perl versions don't work on any
    EBCDIC-based platform, so that would include Unisys[1], HP/MPE and other
    EBCDIC-based platforms. Especially since these platforms are quite dead,
    unlike z/OS which is still maintained.

    hp

    [1] Not all Unisys systems used EBCDIC. I think at least the 1100 series
    used ASCII.
     
    Peter J. Holzer, Nov 11, 2012
    #37
  18. No that wasn't the intention. I was questioning Ben's assertion that
    "we've been trying to stop people making this mistake since 5.8.0",
    because before 5.12.0 it wasn't a mistake, it was a correct
    understanding of how perl/Perl worked.

    Unless of course by "people" he didn't mean Perl programmers but the
    p5p team and by "stop making this mistake" he meant "introducing the
    unicode_strings feature and including it in 'use v5.12'". It is indeed
    possible that the so-called "Unicode bug" was identified shortly after
    5.8.0 and that Ben and others were trying to fix it since then.

    I mentioned that it wasn't an option for me just a few lines further
    down. Of course in my case "not an option" just means "more hassle than
    it's worth", not "impossible", I could install and maintain a current
    Perl version on the 40+ servers I administer. But part of the appeal of
    Perl is that it's part of the normal Linux infrastructure. Rolling my
    own subverts that.

    So, I hope I'll get rid of perl 5.8.x in 2017 (when the support for RHEL
    5.x ends) and of perl 5.10.x in 2020 (EOL for RHEL 6.x). Then I can
    write "use v5.12" into my scripts and enjoy a world without the Unicode
    bug.

    hp
     
    Peter J. Holzer, Nov 11, 2012
    #38
    Here's the deal. Explain to me what's complicated in this:

    [quote encoding.pm on]
    [producing $enc and $name goes above]
    unless ( $arg{Filter} ) {
        DEBUG and warn "_exception($name) = ", _exception($name);
        _exception($name) or ${^ENCODING} = $enc;
        $HAS_PERLIO or return 1;
    }
    [dealing with Filter option and STDIN/STDOUT goes below]
    [quote encoding.pm off]

    and I grant you and Ben unlimited right to spread FUD on encoding.pm
     
    Eric Pozharski, Nov 12, 2012
    #39
  20. So after reading 400 lines of perldoc encoding (presumably not for the
    first time) and a rather long discussion thread you are starting to read
    the source code to find out what “use encoding” does?

    I think you are proving my point that “use encoding” is too complicated
    rather nicely.

    You will have to read the source code of perl, however. AFAICS
    encoding.pm is just a frontend which sets up stuff for the parser to
    use. I'm not going to follow you there - the perl parser has too many
    tentacles for my taste.

    hp
     
    Peter J. Holzer, Nov 12, 2012
    #40
