Why "Wide character in print"?

Discussion in 'Perl Misc' started by tcgo, Sep 30, 2012.

  1. tcgo

    tcgo Guest

    Hi!
    I just made a small test script with Perl, using the Pi symbol with Unicode/UTF-8. Here's the code:

    #!/usr/bin/perl
    use utf8;
    my $cosa = "Here is my ☺ résúmé \x{2639}!";
    print "$cosa\n";

    And it gives me a "warning" message: "Wide character in print at ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it showing up before I added the binmode?

    Thanks!
    ~tcgo~
     
    tcgo, Sep 30, 2012
    #1

  2. Because the people who nowadays work on perl unicode support have
    decided that it should behave as if the encoding used by it was some
    super secret sauce shrouded in eternal mystery: All data flowing into
    a Perl program is supposed to be converted to this super secret
    internal mystery encoding before being used and all data flowing out
    of a Perl program is supposed to be converted to something software
    other than perl understands beforehand. De facto, the situation is
    such that everything is fine when perl is used in an environment where
    UTF-8 is the 'native' method for supporting wide characters because
    this is also what perl uses itself, and anyone using something
    else is essentially fucked. De jure, perl is supposed to be nasty to
    everyone, or at least try as hard as possible without breaking
    backwards compatibility.
     
    Rainer Weikusat, Sep 30, 2012
    #2

  3. Alan Curry

    Alan Curry Guest

    The binmode documents your assumption that nobody will ever run your program
    on a non-UTF8-mode terminal.
     
    Alan Curry, Sep 30, 2012
    #3
  4. Because, unless you tell it with binmode, Perl doesn't know what
    encoding it is supposed to use. It could get the encoding from the
    locale settings, but that would only work for text written to a
    terminal, not for arbitrary data written to a file, so perl doesn't
    make assumptions and asks you to set the encoding explicitly.
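
    For reference, a minimal sketch of the OP's script with that binmode added
    (":encoding(UTF-8)" is just the stricter spelling of the ":utf8" the OP used):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                            # the source file itself is UTF-8-encoded

    binmode STDOUT, ':encoding(UTF-8)';  # tell perl how to encode what it prints

    my $cosa = "Here is my ☺ résúmé \x{2639}!";
    print "$cosa\n";                     # no "Wide character" warning any more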

    (If you want to get the encoding from the locale, use I18N::Langinfo.
    Unfortunately this doesn't work on all platforms; at least it didn't
    work on Windows last time I looked, but that was a few years ago.)
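
    A sketch of that locale-derived setup, assuming a POSIX-ish platform; the
    setlocale() call is needed so langinfo() reports the environment's locale
    rather than the default "C" locale:

    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_CTYPE, "");                  # adopt the locale from the environment
    my $codeset = langinfo(CODESET);          # e.g. "UTF-8", "ISO-8859-1", ...
    binmode STDOUT, ":encoding($codeset)";    # encode output for that charset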

    hp
     
    Peter J. Holzer, Sep 30, 2012
    #4
  5. johndelacour

    johndelacour Guest

    “use utf8” means only that the script file itself is UTF-8-encoded;
    it doesn’t say how to manage the output to STDOUT.
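
    A minimal sketch of that division of labour; the binmode line is one way to
    handle the output side, not something "use utf8" does for you:

    use utf8;                            # only says: literals in this file are UTF-8
    my $str = "résúmé ☺";                # so this literal becomes a character string

    binmode STDOUT, ':encoding(UTF-8)';  # this, separately, says how to encode output
    print "$str\n";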

    JD
     
    johndelacour, Oct 23, 2012
    #5
  6. C.DeRykus

    C.DeRykus Guest

    Here's a follow-on with an observation/question for someone more knowledgeable about Perl Unicode.

    I don't know how 'use locale' affects this but I
    only see the OP's expected display of characters
    by using the "\N{U+...}" notation to force character
    semantics:

    #use utf8;
    my $cosa = "Here is my \N{U+263A} résúmé \N{U+03C0}!";

    Output: Here is my ☺ résúmé π!
     
    C.DeRykus, Oct 24, 2012
    #6
  7. *SKIP*
    Stop spreading FUD. They need

    use encoding ENCNAME Filter => 1;

    (what could I<ENCNAME> possibly be?) but

    * "use utf8" is implicitly declared so you no longer have to "use
    utf8" to "${"\x{4eba}"}++".

    which pretty much defeats the purpose of C<use encoding;>.

    *SKIP*
    That's not the whole story.

    {2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
    SV = PV(0x927a750) at 0x9295fac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
    CUR = 2
    LEN = 12
    {2936:14} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "å" ; Dump $aa'
    SV = PV(0x9af4750) at 0x9b0ffac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x9b0ba08 "\303\245"\0 [UTF8 "\x{e5}"]
    CUR = 2
    LEN = 12

    At first glance, I wondered: what the heck is going on with your
    C<use warnings;>? Now I feel much better.

    *CUT*
     
    Eric Pozharski, Oct 27, 2012
    #7
  8. Double encoding.
    Monkey wrench.
    Works just as expected, see below.
    Probably it's not safe to state things like the following in public, but:

    not perl->isa( 'fool-proof' ) or die

    (I'm trying to speak Perl here). IOW, Perl has an entry level. And
    it's quite high. And one of the steps to getting past it is the ability to
    read. I don't mean the ability to read code, I mean the ability to RTFM.
    The three former examples are clearly (to me) of that type. I have a couple
    of scripts that have C<use encoding 'utf8';> (I<STDIN>, I<STDOUT>, and quote-like
    operators) and C<use open ':locale';> (other filehandles, quite risky,
    but those scripts are not for distribution, thus I'm safe here). Those
    scripts were started 4.5 years ago (according to logs; I can't believe
    it was sarge (thus 5.8.8?)). Anyway, 5.10.0, 5.10.1, 5.14.2 -- because
    I've made those right. Because I've carefully read all the unicode
    documentation that comes with perl (namely perluniintro.pod,
    perlunicode.pod, utf8.pod, encoding.pm, Encode.pm (perlunifaq.pod,
    perlunitut, and perluniprops.pod weren't distributed five years ago,
    I should read them too)). I've found that I don't need utf8.pm (those
    scripts and modules should be us-ascii anyway).

    I feel utf8-safe because, first of all, I can read. If I can, they can
    too, can't they? Apparently they don't, maybe because they can't.
    Let me rephrase one famous proverb:

    If the answer you've got is 'filter', you're probably asking the wrong
    question.

    *SKIP*
    Right. And that's closely related to your last example (the one about
    utf8.pm being unsafe). I've tried to make the point that *characters*
    from different *ranges* happen to have different lengths in bytes.

    {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
    SV = PV(0xa06f750) at 0xa08afac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
    CUR = 5
    LEN = 12

    *Characters* of latin1 aren't wide (even if they are characters, they
    are still one byte long)

    {10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
    [à]
    {10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
    Wide character in print at -e line 1.
    [а]

    I must have added those brackets, because:

    {10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
    à
    {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

    {10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
    à
    {10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops

    {10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
    à
    {10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops

    {10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
    à

    But watch this:

    {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
    à
    {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
    �
    {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
    à

    Except for the middle one (which I should think about), I think encoding.pm
    wins again.
     
    Eric Pozharski, Oct 28, 2012
    #8
  9. That doesn't look like a bug in "use utf8" to me, but like a bug in the
    code which generates the warnings.

    It doesn't help that Tom just dumped a load of gibberish into his mail
    without specifying which encoding he was using. I had to guess that he
    was using CP1252.

    Anyway, with use utf8, the qw[] section of his program is parsed correctly as

    ("élite", "Ævar", "μῦθος", "mío")

    In the error message each character (even those in the printable ASCII
    range U+0020 ... U+007E) is "helpfully" given in hex, which I agree is
    ... suboptimal.

    Me too, although frankly I see no reason to use encoding even if it
    works. It mixes up encoding of the source code and the I/O, which is not
    a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
    see why I should write my perl scripts in a different encoding than
    UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
    "use open".

    I'm puzzled about this part of the documentation, too. Why would anybody
    want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
    is really supposed to be $人, i.e., there is a Han character in the
    source code, not a symref.

    Is this unsafe? I have occasionally used non-ascii characters in
    variable names (mostly Greek characters in physical formulas) together
    with use utf8 since 5.8.x and I never noticed a problem. (The only
    "problem" I noticed is that the euro sign isn't a word character, so you
    can't have a variable $amount_in_€. But then you can't have a variable
    $amount_in_$ either, so I guess this is fair ;-))

    hp
     
    Peter J. Holzer, Oct 28, 2012
    #9
    Then maybe you shouldn't have chosen two examples which are both the same
    length in bytes.
    In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
    characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
    GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

    But this isn't what "wide character" in the warning means. In the
    warning, it means a string element with a code > 255. For string
    elements <= 255, perl can assume that they are supposed to be bytes, not
    characters, when you try to write them to a byte stream. It could be
    argued that this assumption is a mistake, but for better or worse we are
    stuck with that decision. But for string elements > 255, that just isn't
    possible. It can't be a byte, it must be a character, and to convert a
    character into bytes, the encoding needs to be known.
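
    To see the two cases side by side (a sketch, assuming a shell with od(1)
    available; the warning goes to stderr):

    % perl -wle 'print "\x{E0}"' | od -An -tx1    # element <= 255: silently written as one byte
     e0 0a
    % perl -wle 'print "\x{430}"' | od -An -tx1   # element > 255: cannot be one byte
    Wide character in print at -e line 1.
     d0 b0 0a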

    ... as these examples demonstrate.

    Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
    two bytes, \303\240.
    Now you have one character (because of -Mutf8, the two bytes \303\240
    are decoded to the character U+00e0), but you are trying to write it to a byte
    stream without specifying the encoding. Perl writes the single byte
    0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
    question mark in a dark circle)

    Huh? What version of Perl on what platform is this? The string is
    "\x{E0}\x{20}". All elements of the string are <= 255, so the string is
    output as a byte string. This isn't valid UTF-8, and your terminal
    shouldn't be able to interpret it as "à" anymore than it was able to
    interpret "\x{E0}\x{0A}" above.

    [more equivalent examples snipped]

    If your program does character I/O, you *need* to specify the encoding
    of the I/O channels. For one-liners, the -C option is sufficient:

    hrunkner:~/tmp 20:40 :) 195% perl -CS -Mutf8 -wle 'print "à"'
    à

    For scripts you would use binmode or 'use open'.
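
    For example, a small script set up that way (a sketch; the file name is made up):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                              # the source file is UTF-8-encoded
    use open qw(:std :encoding(UTF-8));    # STDIN/STDOUT/STDERR and newly opened handles

    print "Ævar says ☺\n";                 # no "Wide character" warning
    open my $fh, '<', 'input.txt' or die "open: $!";   # gets the :encoding layer too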

    (Didn't you pride yourself on your ability to read? This is documented
    and it has been repeated by several people in this newsgroup for years)

    Excellent example, it shows exactly one of the pitfalls of using "use
    encoding". One would expect "\x{E0}" to result in a string with a single
    element with code 0xE0. At least you seem to have expected it, and for a
    moment I was confused, too. But 'use encoding' doesn't work that way. It
    was designed to convert string constants from the specified encoding to
    Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
    isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
    REPLACEMENT CHARACTER used to mark invalid characters).

    If you use a correct UTF-8 encoded string, it works as expected (well,
    expected by somebody who's read the documentation and remembers that
    little pitfall):

    hrunkner:~/tmp 20:47 :) 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
    à


    For one-liners like this, using the same encoding for the script and the
    I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
    maybe you don't have a UTF-8 capable terminal). However, for real
    programs, I think tying the encoding of the source code to the encoding
    of I/O-streams the script is supposed to handle is foolish. My scripts
    are always encoded in UTF-8, but they frequently have to handle files in
    CP-1252.

    hp
     
    Peter J. Holzer, Oct 28, 2012
    #10
  11. You have to distinguish what may work sometimes or always, and what is
    part of the interface which *should* work. If it does not work in the
    latter case, it is an error; if it does not work in the former case you
    have made a bad guess about how it is implemented. So do not rely on your
    guesses but use the documented interface.

    There are two ways to use the interface:

    - You regard all strings, both during the run of the script and on
    input/output, as bytes (= groups of 8 bits) without any meaning as
    characters (= members of an alphabet for writing text). This will work if
    all devices, and the script itself, use the same character code, which
    must not have characters with values above 255. This *can* be a viable option if
    you can either guarantee this restriction, or if your bytes do not
    have a character meaning.

    In this case, strings in the program text with characters that are not
    contained in the common character code are meaningless, and will yield
    errors.

    - You regard the data during the run of the script as sequences of
    characters, and the data on input and output as sequences of bytes. Then
    you have to convert bytes into text strings on input and text strings into
    bytes on output -- in both cases you can specify the conversion once and
    for all for each file (a sketch using Encode follows after this list). This is
    the only working way when the restrictions of the previous item are not fulfilled.

    In this case, strings in the program text may contain any characters
    whether or not they are representable in the codes used in input/output.
    The "use utf8" pragma tells perl to interpret the program text itself as a
    sequence of UTF-8 characters which will make a difference only for literal
    strings in the program.

    A third way does *not* work:

    - You do input and output on strings of bytes and assume that perl will guess
    correctly which characters these bytes represent in your opinion.
    Unfortunately that will *often* work (because perl assumes ISO-8859-1 on
    many systems, which may be what you are actually using), but it will also
    often break (if you use other codes, or if you mix strings which happen to
    contain only ISO-8859-1 characters with strings that also contain other
    characters). But if it breaks, it is your fault: it is nowhere guaranteed
    how text strings map to byte strings and vice versa, the sole exception
    being the documented encode and decode functions.

    This is fairly well explained in
    http://search.cpan.org/~dom/perl-5.14.3/pod/perlunitut.pod
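
    A sketch of the second way, using Encode directly; the file names and the
    CP1252 input encoding are only examples:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    open my $in, '<:raw', 'letter.cp1252' or die "open: $!";
    my $octets = do { local $/; <$in> };          # read bytes
    my $text   = decode('cp1252', $octets);       # bytes -> characters, once, on input

    $text = uc $text;                             # work with characters in between

    open my $out, '>:raw', 'letter.utf8' or die "open: $!";
    print {$out} encode('UTF-8', $text);          # characters -> bytes, once, on output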
     
    Helmut Richter, Oct 28, 2012
    #11
  12. [...]
    This is the only 'working way' when the assumption that perl uses a
    'secret mystery encoding' different from any other encoding known to
    man is taken for granted. But this assumption is wrong and the concept
    makes precious little sense since it requires an additional copy of
    all input data and all output data (possibly, times the number of perl
    processes in a 'long' pipeline since not even perl is supposed to be
    able to talk to perl natively). Considering the way perl is
    implemented, this is a real problem for users of Windows (and Mac OS
    X, AFAIK) because in both cases, perl uses something other than the
    native encoding. That some people would like to inflict the same
    damage onto users of platforms where the problem doesn't exist is
    certainly very laudable but IMNSHO, best ignored.
     
    Rainer Weikusat, Oct 28, 2012
    #12
  13. I was careful to use the term "string element" and avoid the terms
    "byte" and "character" when talking about the things a string is
    composed of.

    Perl has two types of strings: Character strings (often called utf8
    strings in the documentation) and byte strings. Character strings are
    composed of 32-bit entities, each denoting a unicode code point. So
    "\x{1f42a}" is a string with the single character DROMEDARY CAMEL.
    Byte strings are just that: Strings of uninterpreted bytes. Any
    semantics assigned to them is semantics of the program, not of the Perl
    language (this isn't quite correct: character oriented functions like lc
    or character classes in regexps do work on them, but only for ASCII).

    These differences are documented, and I consider them part of the
    interface, although some members of p5p consider the distinction a bug
    and try to remove it.

    However, for the warning "Wide character in print" this is irrelevant.

    Perl doesn't distinguish between character and byte strings when writing
    them to a file handle. For both the strings "\x{E0}" (a byte string) and
    "\N{U+00E0}" (a character string), if you write them to a raw file
    handle, the single byte 0xE0 will be written. Both will be converted to
    two bytes 0xC3 0xA0 if you write them to a file handle with the
    ":encoding(UTF-8)" layer. And so on. But for strings with elements >
    255, it simply isn't possible to write a single byte with this value to
    a byte stream, because a byte has only 8 bits (on the platforms we care
    about). So Perl prints a warning and encodes the string in UTF-8 (or
    just copies its internal representation, which happens to be the same
    thing). I would argue that perl should die() instead, but this has been
    the observed and documented behaviour since 5.8.0, so I doubt it will
    change.


    [Rest snipped. All true, but IMHO not very relevant to this thread].

    hp
     
    Peter J. Holzer, Oct 29, 2012
    #13
  14. The encoding isn't a 'secret mystery'. It is well documented that it
    is Unicode.

    perl -CS -MEncode -E 'say ord(Encode::decode("utf-8", "\xE2\x82\xAC"))'

    is defined to print "8364".

    It is a 'secret mystery' (wink, wink, nudge, nudge) how this is
    represented internally, just like the representation of numbers is a
    'secret mystery'.

    However, for most programs you don't have to know that Perl character
    strings are Unicode strings. It is sufficient to know that Perl has the
    concept of a "character" which is different from the concept of a
    "byte", that a character has certain properties (e.g. it can be a letter
    or an ideograph, it may have an associated uppercase or lowercase
    letter, ...) and to convert a sequence of characters into a sequence of
    bytes you have to encode them. Whether the Euro sign has the numeric
    code 8364 or 4711 is rarely significant.
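
    A small sketch of that character/byte distinction with no I/O involved (the
    string is just an example):

    use strict;
    use warnings;
    use Encode qw(decode);

    my $octets = "\xc3\xa9lite";             # the six UTF-8 bytes of "élite", not decoded
    my $chars  = decode('UTF-8', $octets);   # five characters, the first being U+00E9

    print length($octets), "\n";             # 6 -- perl sees uninterpreted bytes
    print length($chars),  "\n";             # 5 -- perl sees characters
    print $chars =~ /^\w+$/ ? "word\n" : "no\n";   # "word" -- é is a letter here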

    This is an unsubstantiated claim. It is possible that the current
    implementation of I/O layers does indeed perform an additional copy (I
    haven't checked the code), but this is certainly not required.

    And even if it is true, it is almost certainly lost in the noise as soon
    as your script does something more complex than "cat" with your input -
    almost any string operation in perl performs a copy.
    Why is this a real problem?
    Whatever "the problem" may be. The problem that characters and bytes
    aren't the same and that most programmers prefer to think of text as a
    sequence of characters, not a sequence of bytes exists on every
    platform.

    hp
     
    Peter J. Holzer, Oct 29, 2012
    #14
  15. Are they? They are strings of characters that are contained in Unicode. They
    are not necessarily internally encoded as Unicode. People run into problems
    when they make assumptions about the way they are implemented. I would have
    worded it:

    For all programs you must not pretend to know that Perl character strings
    are Unicode strings.

    It may be true, it may be false -- either way, it is not part of the
    documented interface. Hence, it must not be used even if it be true.
     
    Helmut Richter, Oct 29, 2012
    #15
    (Last night I reread loads of perlunicode and friends; I feel much
    better now.) No, they are the same length *if* the encoding of the stream is set:

    {7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
    0000000: c3a0 0a ...
    {7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
    0000000: d0b0 0a ...
    {7466:24} [0:0]%

    But latin1 is special (I've reread perlunicode and friends): *if*
    there's no reason (printing isn't a reason) to upgrade to utf8, then
    *characters* of the latin1 script (and latin1 only) stay *bytes*:

    {7466:24} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
    0000000: e00a ..
    {7795:25} [0:0]% perl -Mutf8 -wle 'print "а"' | xxd
    Wide character in print at -e line 1.
    0000000: d0b0 0a ...

    But even if the encoding of the stream isn't set, concatenation with
    non-latin1 characters upgrades the latin1 ones too:

    {7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
    Wide character in print at -e line 1.
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

    Please rewind the thread. That's exactly what happened a couple of posts
    ago (specifically: <eli$-neck.ny.us> and
    No. Because it's not UTF-8, it's utf8. As long as utf8 semantics isn't
    set, any scalar stays plain bytes:

    {2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
    SV = PV(0x9d0e878) at 0x9d29f28
    REFCNT = 1
    FLAGS = (PADTMP,POK,READONLY,pPOK)
    PV = 0x9d2ddc8 "\303\240"\0
    CUR = 2
    LEN = 12

    However, when utf8 semantics is set, then those codepoints that fit the
    latin1 script become special Perl-latin1:

    {5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
    SV = PV(0x9b92880) at 0x9badf10
    REFCNT = 1
    FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
    PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
    CUR = 2
    LEN = 12

    Upgrading to the UTF-8 encoding or staying with the latin1 encoding depends on
    concatenation with codepoints already upgraded to UTF-8 and/or on the
    encoding of the output stream.
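
    What can be observed from outside is the internal UTF8 flag flipping on the
    result of the concatenation; a sketch (note that, as discussed further down,
    the flag does not change how the value is treated on output):

    use strict;
    use warnings;

    my $latin = "\xE0";      # code point 0xE0, kept in the byte representation
    my $cyr   = "\x{430}";   # code point 0x430, can only be stored as UTF-8 internally

    print utf8::is_utf8($latin) ? "upgraded\n" : "bytes\n";   # bytes
    my $both = $latin . $cyr;                                 # concatenation upgrades the result
    print utf8::is_utf8($both)  ? "upgraded\n" : "bytes\n";   # upgraded
    print utf8::is_utf8($latin) ? "upgraded\n" : "bytes\n";   # still bytes -- $latin unchanged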

    *SKIP*
    {42:1} [0:0]% perl -Mutf8 -wle 'print "à"'
    à
    {1903:2} [0:0]% perl -Mutf8 -wle 'print "à"'

    {1933:3} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
    0000000: e00a

    Instead it does. Once. It wasn't typing, it was a search through
    history. Now I'm bothered. Does anyone here know how to list the
    extensions enabled in a running instance of urxvt?

    *SKIP*
    {14999:29} [0:0]% perl -mencoding -wle 'print "[à][а]"' | xxd
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
    {15017:30} [0:0]% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

    Golf?
    Mine are us-ascii; I have open.pm for the rest.
     
    Eric Pozharski, Oct 29, 2012
    #16
  17. At best, that's a part of the interface which was meanwhile
    'undocumented' because the implementation choices which were made
    weren't the implementation choices that should have been made,
    according to the opinions of some people who didn't make the
    decision. But independently of that, inventing the 'Perl is an
    island!' character encoding - no matter how hypothetical - remains a
    stupid idea. Perl is not an island and it has to interact with code
    written in other programming languages, although maybe not in the
    fantasy universe of people who implement 'wepp fremmwuergs' and
    'ohpscheckt suesstemms' who are generally not troubled by the minor
    consideration of making their stuff do something actually useful in
    the real world. Consequently, Perl should be compatible with some
    existing convention, ideally, with all existing 'local'
    conventions. If this isn't possible, the next best choice is not 'make
    everyone bleed'.
     
    Rainer Weikusat, Oct 29, 2012
    #17
    You posted the output of Devel::Peek::Dump, so I thought you were
    talking about the *internal* representation.

    How many bytes they occupy in an I/O stream depends on the encoding.

    LATIN SMALL LETTER A WITH GRAVE is one byte in ISO-8859-1, CP850, ...
    LATIN SMALL LETTER A WITH GRAVE is two bytes in UTF-8, UTF-16, ...
    LATIN SMALL LETTER A WITH GRAVE is four bytes in UTF-32, ...

    CYRILLIC SMALL LETTER A is one byte in ISO-8859-5, KOI-8, ...
    CYRILLIC SMALL LETTER A is two bytes in UTF-8, UTF-16, ...
    CYRILLIC SMALL LETTER A is four bytes in UTF-32, ...

    (And of course, both characters cannot be represented at all in some
    encodings: There is no LATIN SMALL LETTER A WITH GRAVE in ISO-8859-5,
    and no CYRILLIC SMALL LETTER A in ISO-8859-1)
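
    The byte counts are easy to check with Encode; a sketch:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $agrave = "\x{E0}";    # LATIN SMALL LETTER A WITH GRAVE
    my $acyr   = "\x{430}";   # CYRILLIC SMALL LETTER A

    printf "%d %d %d\n",
        length(encode('iso-8859-1', $agrave)),   # 1 byte
        length(encode('UTF-8',      $agrave)),   # 2 bytes
        length(encode('UTF-32BE',   $agrave));   # 4 bytes
    printf "%d %d %d\n",
        length(encode('iso-8859-5', $acyr)),     # 1 byte
        length(encode('UTF-8',      $acyr)),     # 2 bytes
        length(encode('UTF-32BE',   $acyr));     # 4 bytes
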
    I already explained that. When writing to a file handle, perl doesn't
    care whether a string is composed of bytes or characters.

    If the file handle has no :encoding() layer, it will try to write each
    element of the string as a single byte.

    If the file has an :encoding() layer, it will interpret each element of
    the string as a character and convert that to a byte sequence according
    to that encoding.

    So without an encoding layer "\x{E0}" will always be written as the single byte
    0xE0, regardless of whether the string is a byte string or a character
    string. With an ":encoding(UTF-8)" layer it will always be written as
    two bytes 0xC3 0xA0; and with an ":encoding(CP850)" layer, it will
    always be written as a single byte 0x85.
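
    This is easy to verify without a terminal, for instance with in-memory
    handles; a sketch:

    use strict;
    use warnings;

    my $str = "\x{E0}";                                   # one string element, code 0xE0

    open my $raw, '>', \my $rawbuf or die $!;             # no :encoding() layer
    print {$raw} $str;  close $raw;
    printf "raw:   %vX\n", $rawbuf;                       # E0

    open my $utf8, '>:encoding(UTF-8)', \my $utf8buf or die $!;
    print {$utf8} $str;  close $utf8;
    printf "utf8:  %vX\n", $utf8buf;                      # C3.A0

    open my $cp850, '>:encoding(cp850)', \my $cp850buf or die $!;
    print {$cp850} $str;  close $cp850;
    printf "cp850: %vX\n", $cp850buf;                     # 85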

    What is apparently confusing you is what happens when that fails.

    Obviously you can't write a single byte with the value 0x430, you can't
    encode CYRILLIC SMALL LETTER A in ISO-8859-1 and you can't encode LATIN
    SMALL LETTER A WITH GRAVE in ISO-8859-5.

    So what does perl do? It prints a warning to STDERR and writes
    a more or less reasonable approximation to the stream. The details
    depend on the I/O layer:

    If there is no :encoding() layer, the warning is "Wide character in
    print" and the utf-8 representation is sent to the stream. And to
    confuse matters further, this is done for the whole string, not just
    this particular string element:

    % perl -Mutf8 -E 'say "->\x{E0}\x{430}<-"'
    Wide character in say at -e line 1.
    ->àа<-

    (one string: \x{E0} and \x{430} converted to UTF-8)

    % perl -Mutf8 -E 'say "->\x{E0}<-", "->\x{430}<-"'
    Wide character in say at -e line 1.
    ->�<-->а<-

    (two strings: \x{E0} printed as a single byte, \x{430} converted to UTF-8)

    If there is an :encoding() layer, the warning is "\x{....} does not map
    to $charset" and a \x{....} escape sequence is sent to the stream:

    % perl -Mutf8 -E 'binmode STDOUT, ":encoding(iso-8859-5)"; say "->\x{E0}<-"'
    "\x{00e0}" does not map to iso-8859-5 at -e line 1.
    ->\x{00e0}<-

    But these are responses to an *error* condition. You shouldn't try to
    write codepoints > 255 to a byte stream (actually, you shouldn't write
    any characters to a byte stream, a byte stream is for bytes), and you
    shouldn't try to write latin accented characters to a cyrillic stream.
    Or at least you shouldn't be terribly surprised if the result is a
    little confusing - garbage in, garbage out.

    The term "upgrade" has a rather specific meaning in Perl in context with
    byte and character strings, and I don't think you are talking about
    that.

    You have a single string "[à][а]" here. As I wrote above, print treats
    the string as a unit and in the absence of an :encoding() layer just dumps
    it in UTF-8 encoding. So, yes, both the "à" and the "а" within this
    single string will be UTF-8-encoded (as will be the square brackets, but
    for them the UTF-8 encoding is the same as for US-ASCII, so you don't
    notice that).

    And I repeat it again: You are doing something which just doesn't make
    sense (writing characters to a byte stream), so don't be surprised if
    the result is a little surprising. Do it right and the result will make
    sense.

    I've read these postings but I don't know what you are referring to. If
    you are referring to other postings (especially long ones), please cite
    the relevant part.

    I presume that by "utf8" you mean a string with the UTF8 bit set
    (testable with the utf8::is_utf8() function). But as I've written
    repeatedly, this is completely irrelevant for I/O. A string will be
    treated completely identically, whether it has this bit set or not. It is
    only the value of the string which is important, not its internal type
    and representation.
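
    A sketch of that point: same value, different internal representation,
    identical observable behaviour:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $plain    = "\xE0";       # byte representation internally
    my $upgraded = "\xE0";
    utf8::upgrade($upgraded);    # same value, now stored as UTF-8 internally

    print $plain eq $upgraded ? "equal\n" : "different\n";     # equal
    print encode('UTF-8', $plain) eq encode('UTF-8', $upgraded)
        ? "same bytes out\n" : "different bytes out\n";        # same bytes out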

    (Also, I find it very confusing that you post the output of
    Devel::peek::Dump, but then apparently don't refer to it but talk about
    something else. Please try to organize your postings in a way that one
    can understand what you are talking about. It is very likely that this
    exercise will also clear up the confusion in your mind)

    Yes. We've been through that. Ben explained it in excruciating detail.
    What don't you understand here?

    US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
    of mine don't contain non-ASCII characters either) What I meant is that
    I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
    encode non-ASCII characters, so I don't have any need for "use
    encoding". If your scripts are all in ASCII and you use open.pm for
    "rest", what do you need "use encoding" for? Remember, this subthread
    started when you berated Ben for discouraging the use of "use encoding".

    hp
     
    Peter J. Holzer, Oct 30, 2012
    #18
  19. Every program is an "island" within its code. No matter what I use, I do not
    normally know the internals, and if I happen to know them I should not use my
    knowledge because the internals may change at any time.

    Perl is not an island as far as interaction with other programs is
    concerned. It is documented how to read and write byte data, and how to read
    and write character data whose code and encoding is known. If desired, it is
    also not really difficult to write code that tries to guess an unknown code --
    with all the pitfalls such a behaviour entails.

    There is one interface decision perl has made: it does not by default use the
    locale settings to determine the default code and encoding, rather it requires
    that these be specified in the script. Opinions may be divided; I like this
    decision because my experience is that often the locale settings appear to be
    randomly uncorrelated to the codes actually used.

    The implementation decisions that are not part of the interface, in particular
    the internal representation of values of different types including strings,
    concern future developers but not users. If perl decides to store characters
    internally as a 37-bit EBCDIC enhancement, it does not really bother me as
    long as the program still interacts correctly with the outside world in
    standardised codes.
     
    Helmut Richter, Oct 30, 2012
    #19
  20. [quoting <eli$-neck.ny.us> on]

    $ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
    0000000 0ae5
    345 \n
    0000002

    [quote off]

    *SKIP*
    If "you" above refers to me then you're wrong.
    Try to read it again. Slowly.
    Indeed, only FLAGS and PV are relevant. Sadly that Devel::peek::Dump
    doesn't provide means to filter arbitrary parts of output off (however,
    that's not the purpose of D::p). And I consider editing copypastes a
    bad taste.

    *SKIP*
    It's not about understanding. I'm trying to make the point that latin1 is
    special.
    Many years ago, to get operations to work on characters instead of bytes,
    some strings had to be pulled. encoding.pm pulled the right strings.
    utf8.pm pulled irrelevant strings. In those days text-related operations
    worked for you because they fit in the latin1 script or you didn't hit
    edge cases. However, I did (more years ago, in 5.6.0, B<lcfirst()>
    worked *only* on bytes, no matter what).

    Guess what? I've just figured out I don't need either any more:

    {40710:255} [0:0]% xxd foo.koi8-u
    0000000: c6d9 d7c1 0a .....
    {40731:262} [0:0]% perl -wle '
    open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
    read $fh, $fh, -s $fh;
    $fh =~ m{(\w\w)};
    print $1
    '
    Wide character in print at -e line 5.
    фы
    It has become clear to me now what made you both (you and Ben) believe in the
    bugginess of F<encoding.pm>. I'm fine with that.
     
    Eric Pozharski, Oct 31, 2012
    #20
