Why "Wide character in print"?

Discussion in 'Perl Misc' started by tcgo, Sep 30, 2012.

  1. tcgo

    tcgo Guest

    Hi!
    I just wrote some test code in Perl, using the Pi symbol with Unicode/UTF-8. Here's the code:

    #!/usr/bin/perl
    use utf8;
    my $cosa = "Here is my ☺ résúmé \x{2639}!";
    print "$cosa\n";

    And it gives me a "warning" message: "Wide character in print at ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning disappears, but why was it showing before I added the binmode?

    Thanks!
    ~tcgo~
     
    tcgo, Sep 30, 2012
    #1

  2. tcgo <> writes:
    > I just made a test code with Perl, using the Pi symbol with
    > Unicode/UTF-8. That's the code:
    >
    > #!/usr/bin/perl
    > use utf8;
    > my $cosa = "Here is my ☺ résúmé \x{2639}!";
    > print "$cosa\n";
    >
    > And it gives me a "warning" message: "Wide character in print at
    > ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
    > warning disappears, but why was it showing before of adding the
    > binmode?


    Because the people who nowadays work on perl unicode support have
    decided that it should behave as if the encoding used by it was some
    super secret sauce shrouded in eternal mystery: All data flowing into
    a Perl program is supposed to be converted to this super secret
    internal mystery encoding before being used and all data flowing out
    of a Perl program is supposed to be converted to something software
    other than perl understands beforehand. De facto, the situation is
    such that everything is fine when perl is used in an environment where
    UTF-8 is the 'native' method for supporting wide characters because
    this is also what perl uses itself, and anyone using something
    else is essentially fucked. De jure, perl is supposed to be nasty to
    everyone, or at least try as hard as possible without breaking
    backwards compatibility.
     
    Rainer Weikusat, Sep 30, 2012
    #2

  3. Alan Curry

    In article <>,
    tcgo <> wrote:
    >Hi!
    >I just made a test code with Perl, using the Pi symbol with
    >Unicode/UTF-8. That's the code:
    >
    >#!/usr/bin/perl
    >use utf8;
    >my $cosa = "Here is my ☺ résúmé \x{2639}!";
    >print "$cosa\n";
    >
    >And it gives me a "warning" message: "Wide character in print at
    >./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the warning
    >disappears, but why was it showing before of adding the binmode?


    The binmode documents your assumption that nobody will ever run your program
    on a non-UTF8-mode terminal.

    --
    Alan Curry
     
    Alan Curry, Sep 30, 2012
    #3
  4. On 2012-09-30 17:57, tcgo <> wrote:
    > I just made a test code with Perl, using the Pi symbol with
    > Unicode/UTF-8. That's the code:
    >
    > #!/usr/bin/perl
    > use utf8;
    > my $cosa = "Here is my ☺ résúmé \x{2639}!";
    > print "$cosa\n";
    >
    > And it gives me a "warning" message: "Wide character in print at
    > ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
    > warning disappears, but why was it showing before of adding the
    > binmode?


    Because, unless you tell it with binmode, Perl doesn't know what
    encoding it is supposed to use. It could get the encoding from the
    locale settings, but that would only work for text written to a
    terminal, not for arbitrary data written to a file, so perl doesn't
    make assumptions and asks you to set the encoding explicitly.
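
    A minimal sketch of what setting the encoding explicitly changes (not part of the original post). It writes to an in-memory handle instead of STDOUT so that the bytes and the warning can be inspected; capture() is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Print $text to an in-memory handle and report what happened.
# $layer is an optional PerlIO layer such as ':encoding(UTF-8)'.
# capture() is a hypothetical helper, used only to make the effect visible.
sub capture {
    my ($text, $layer) = @_;
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    my $bytes = '';
    open(my $fh, '>', \$bytes) or die "Cannot open in-memory handle: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $text;
    close($fh);
    return ($bytes, $warnings);
}

my $smiley = "\x{263A}";    # WHITE SMILING FACE, a code point above 255

my ($raw_bytes, $raw_warnings) = capture($smiley);
my ($enc_bytes, $enc_warnings) = capture($smiley, ':encoding(UTF-8)');

printf "no layer:         %d byte(s), %d warning(s)\n",
    length $raw_bytes, $raw_warnings;
printf ":encoding(UTF-8): %d byte(s), %d warning(s)\n",
    length $enc_bytes, $enc_warnings;
```

    In a real script the same layer would go on the handle itself, for example binmode(STDOUT, ':encoding(UTF-8)').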

    (If you want to get the encoding from the locale, use I18N::Langinfo.
    Unfortunately this doesn't work on all platforms; at least it didn't
    work on Windows last time I looked, but that was a few years ago.)
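
    A possible way to do that, sketched under assumptions: locale_layer() is a hypothetical helper, and the fallback to UTF-8 when the locale's codeset is unknown to Encode is a choice made for this example, not something Perl does by itself.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Build a PerlIO layer string from the locale's character set.
# Falls back to UTF-8 when the codeset is missing or unknown to Encode
# (the fallback is an assumption of this sketch).
sub locale_layer {
    my $codeset  = eval { langinfo(CODESET) };
    my $encoding = (defined $codeset && length $codeset)
        ? find_encoding($codeset)
        : undef;
    my $name = $encoding ? $encoding->name : 'UTF-8';
    return ":encoding($name)";
}

my $layer = locale_layer();
binmode(STDOUT, $layer) or die "Cannot set $layer on STDOUT: $!";
print "STDOUT now uses $layer\n";
```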

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, Sep 30, 2012
    #4
  5. Guest

    On Sunday, September 30, 2012 6:57:38 PM UTC+1, tcgo wrote:

    > #!/usr/bin/perl
    > use utf8;
    > my $cosa = "Here is my ☺ résúmé \x{2639}!";
    > print "$cosa\n";
    >
    > And it gives me a "warning" message: "Wide character in print at
    > ./unicode line 4". After adding "binmode(STDOUT, ":utf8");" the
    > warning disappears, but why was it showing before of adding the
    > binmode?


    "use utf8" means only that the script file itself is UTF-8-encoded;
    it doesn't say how to manage the output to STDOUT.
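
    As a rough illustration of the input side only: what "use utf8" does for string literals is approximately what Encode::decode does for bytes at run time. The sketch below uses escapes rather than literal non-ASCII characters so that it does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that a UTF-8 encoded source file contains for "é".
my $bytes = "\xC3\xA9";

# Without decoding, Perl sees two one-byte string elements.
my $byte_length = length $bytes;

# Decoding turns them into the single character U+00E9; this is roughly
# the step that "use utf8" performs for literals in the source file.
my $chars       = decode('UTF-8', $bytes);
my $char_length = length $chars;

printf "undecoded: %d element(s), decoded: %d character(s), U+%04X\n",
    $byte_length, $char_length, ord $chars;
```

    Neither step says anything about STDOUT; that still needs its own layer.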

    JD
     
    , Oct 23, 2012
    #5
  6. C.DeRykus

    On Sunday, September 30, 2012 10:57:38 AM UTC-7, tcgo wrote:
    > Hi!
    > I just made a test code with Perl, using the Pi symbol with Unicode/UTF-8. That's the code:
    >
    > #!/usr/bin/perl
    > use utf8;
    > my $cosa = "Here is my ☺ résúmé \x{2639}!";
    > print "$cosa\n";
    > ...


    (Here's a follow-on with an observation/question for someone more knowledgeable about Perl Unicode.)

    I don't know how 'use locale' affects this but I
    only see the OP's expected display of characters
    by using the "\N{U+...}" notation to force character
    semantics:

    #use utf8;
    my $cosa = "Here is my \N{U+263A} résúmé \N{U+03C0}!";

    Output: Here is my ☺ résúmé π!

    --
    Charles DeRykus
     
    C.DeRykus, Oct 24, 2012
    #6
  7. with <> Ben Morrow wrote:

    *SKIP*

    > (In theory you can 'use encoding' to specify a different source
    > character encoding, but in practice that pragma has always been buggy
    > and is better avoided.)


    Stop spreading FUD. They need

    use encoding ENCNAME Filter => 1;

    (what could I<ENCNAME> possibly be?) but

    * "use utf8" is implicitly declared so you no longer have to "use
    utf8" to "${"\x{4eba}"}++".

    which pretty much defeats the purpose of C<use encoding;>.

    *SKIP*

    > The lexer converts the "å" into a 1-character string which eventually
    > gets passed to 'say', which appends a newline (that is, a character
    > with ordinal 0a) and passes it to the STDOUT filehandle for writing.


    That's not the whole story.

    {2754:13} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "а" ; Dump $aa'
    SV = PV(0x927a750) at 0x9295fac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x9291a08 "\320\260"\0 [UTF8 "\x{430}"]
    CUR = 2
    LEN = 12
    {2936:14} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "å" ; Dump $aa'
    SV = PV(0x9af4750) at 0x9b0ffac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x9b0ba08 "\303\245"\0 [UTF8 "\x{e5}"]
    CUR = 2
    LEN = 12

    At first glance I wondered what the heck was up with your
    C<use warnings;>. Now I feel much better.

    *CUT*

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Oct 27, 2012
    #7
  8. with <> Ben Morrow wrote:
    > Quoth Eric Pozharski <>:
    >> with <> Ben Morrow wrote:
    >>
    >>> (In theory you can 'use encoding' to specify a different source
    >>> character encoding, but in practice that pragma has always been
    >>> buggy and is better avoided.)

    >>
    >> Stop spreading FUD.

    >
    > That was certainly not my intention. My understanding is that 'use
    > encoding' is liable to cause incorrect behaviour and segfaults; see
    > for instance
    >
    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923


    C<use threads;> and C<use encoding 'utf8';>. Unexpected(?) edge case?

    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248


    C<use utf8;>, C<use encoding 'utf8';>, and C<use Encode;>. Panic mode?

    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526


    Double encoding.

    > http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html


    Monkey wrench.

    > http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html


    Works just as expected, see below.

    > which suggests that 'use utf8' is also broken; I didn't know that
    > until just now, and I'm not sure I entirely believe it. If you have
    > newer information than me, I'd be happy to change my opinion.


    It's probably not safe to say things like this in public,
    but:

    not perl->isa( 'fool-proof' ) or die

    (I'm trying to speak Perl here). IOW, Perl has an entry level. And
    it's quite high. And one of the steps to get past it is the ability to
    read. I don't mean the ability to read code, I mean the ability to
    RTFM. The first three examples are clearly (for me) of that type. I have
    a couple of scripts that have C<use encoding 'utf8';> (I<STDIN>,
    I<STDOUT>, and quote-like operators) and C<use open ':locale';> (other
    filehandles, quite risky, but those scripts are not for distribution
    thus I'm safe here). Those scripts were started 4.5 years ago
    (according to logs, I can't believe it was sarge (thus 5.8.8?)).
    Anyway, they have kept working on 5.10.0, 5.10.1, 5.14.2 -- because
    I've made those right. Because I've read carefully all the Unicode
    documentation that comes with perl (namely perluniintro.pod,
    perlunicode.pod, utf8.pod, encoding.pm, Encode.pm (perlunifaq.pod,
    perlunitut, and perluniprops.pod weren't distributed five years ago,
    should read them too)). I've found that I don't need utf8.pm (those
    scripts and modules should be us-ascii anyway).

    I feel utf8-safe because, first of all, I can read. If I can, they can
    too, can't they? Apparently, they don't, maybe because they can't.

    >> They need
    >>
    >> use encoding ENCNAME Filter => 1;

    >
    > That installs a source filter; I'm not sure what the effects of that
    > are, but I wouldn't be surprised if you get the union of any bugs in
    > 'use encoding' and any bugs in 'use utf8'.
    >
    >> (what I<ENCNAME> could possibly be?) but
    >>
    >> * "use utf8" is implicitly declared so you no longer have to
    >> "use utf8" to "${"\x{4eba}"}++".


    BTW, I've checked. There's no C<use utf8>. It's B<require utf8> and no
    import. A whole different story.

    > I don't believe this is safe either. The pad code (which handles 'my'
    > variables) isn't utf8-safe, so you can't create 'my' variables with
    > Unicode names. (The above is a symref to a global; I don't know if the
    > code handling the names of globals is utf8-safe, but even if it is
    > that isn't terribly useful.)


    Let me rephrase one famous proverb:

    If the answer you've got is 'filter', you're probably asking the wrong
    question.

    *SKIP*
    > In any case, the result is exactly what I said: the string contains
    > one (logical) character. If you apply length() to that string it will
    > return 1. (This character happens to be represented internally as two
    > bytes; that is none of your business.) What do you think I omitted
    > from the story?


    Right. And that's closely related to your last example (the one about
    utf8.pm being unsafe). I've tried to make a point that *characters*
    from different *ranges* happen to be of different length in bytes.

    {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
    SV = PV(0xa06f750) at 0xa08afac
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
    CUR = 5
    LEN = 12

    *Characters* of latin1 aren't wide (even if they are characters, they
    are still one byte long)

    {10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
    [à]
    {10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
    Wide character in print at -e line 1.
    [а]

    I had to add those brackets, because:

    {10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
    à
    {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

    {10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
    à
    {10522:70} [0:0]% perl -Mutf8 -wle 'print "\x{E0}"' # oops

    {10532:71} [0:0]% perl -Mutf8 -wle 'print "\x{E0} "' # stupid
    à
    {10602:79} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0}"' # oops

    {10608:80} [0:0]% perl -Mutf8 -wle 'print "\N{U+00E0} "' # stupid
    à

    But watch this:

    {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
    à
    {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
    �
    {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
    à

    Except for the middle one (which I should think about), I think
    encoding.pm wins again.

     
    Eric Pozharski, Oct 28, 2012
    #8
  9. On 2012-10-27 23:37, Ben Morrow <> wrote:
    > Quoth Eric Pozharski <>:
    >> with <> Ben Morrow wrote:
    >>
    >> > (In theory you can 'use encoding' to specify a different source
    >> > character encoding, but in practice that pragma has always been buggy
    >> > and is better avoided.)

    >>
    >> Stop spreading FUD.

    >
    > That was certainly not my intention. My understanding is that 'use
    > encoding' is liable to cause incorrect behaviour and segfaults; see for
    > instance
    >
    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=31923
    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=36248
    > https://rt.perl.org/rt3/Public/Bug/Display.html?id=37526
    > http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2009-09/msg00669.html
    >
    > Incidentally, while looking for those I also found
    >
    > http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2011-03/msg00255.html
    >
    > which suggests that 'use utf8' is also broken; I didn't know that until
    > just now, and I'm not sure I entirely believe it.


    That doesn't look like a bug in "use utf8" to me, but like a bug in the
    code which generates the warnings.

    It doesn't help that Tom just dumped a load of gibberish into his mail
    without specifying which encoding he was using. I had to guess that he
    was using CP1252.

    Anyway, with use utf8, the qw[] section of his program is parsed correctly as

    ("élite", "Ævar", "μῦθος", "mío")

    In the error message each character (even those in the printable ASCII
    range U+0020 ... U+007E) is "helpfully" given in hex which I agree is
    ... suboptimal.


    > If you have newer information than me, I'd be happy to change my opinion.


    Me too, although frankly I see no reason to use encoding even if it
    works. It mixes up encoding of the source code and the I/O, which is not
    a good idea, IMSHO, and my editor handles UTF-8 just fine, so I don't
    see why I should write my perl scripts in a different encoding than
    UTF-8. I/O can be handled explicitly by I/O layers or implicitly by
    "use open".


    >> (what I<ENCNAME> could possibly be?) but
    >>
    >> * "use utf8" is implicitly declared so you no longer have to "use
    >> utf8" to "${"\x{4eba}"}++".

    >
    > I don't believe this is safe either. The pad code (which handles 'my'
    > variables) isn't utf8-safe, so you can't create 'my' variables with
    > Unicode names. (The above is a symref to a global; I don't know if the
    > code handling the names of globals is utf8-safe, but even if it is that
    > isn't terribly useful.)


    I'm puzzled about this part of the documentation, too. Why would anybody
    want to use a variable ${"\x{4eba}"} ? I am guessing that the variable
    is really supposed to be $人, i.e., there is a Han character in the
    source code, not a symref.

    Is this unsafe? I have occasionally used non-ascii characters in
    variable names (mostly Greek characters in physical formulas) together
    with use utf8 since 5.8.x and I never noticed a problem. (The only
    "problem" I noticed is that the euro sign isn't a word character, so you
    can't have a variable $amount_in_€. But then you can't have a variable
    $amount_in_$ either, so I guess this is fair ;-))

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Oct 28, 2012
    #9
  10. On 2012-10-28 11:45, Eric Pozharski <> wrote:
    > with <> Ben Morrow wrote:
    >> In any case, the result is exactly what I said: the string contains
    >> one (logical) character. If you apply length() to that string it will
    >> return 1. (This character happens to be represented internally as two
    >> bytes; that is none of your business.) What do you think I omitted
    >> from the story?

    >
    > Right. And that's closely related to your last example (the one about
    > utf8.pm being unsafe). I've tried to make a point that *characters*
    > from different *ranges* happen to be of different length in bytes.


    Then maybe you shouldn't have chosen two examples which both are same
    length in bytes.

    >
    > {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
    > SV = PV(0xa06f750) at 0xa08afac
    > REFCNT = 1
    > FLAGS = (POK,pPOK,UTF8)
    > PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
    > CUR = 5
    > LEN = 12
    >
    > *Characters* of latin1 aren't wide (even if they are characters, they
    > are still one byte long)


    In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
    characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
    GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
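
    The byte sequences can be checked without Devel::Peek. This is a small sketch; utf8_octal() is a hypothetical helper that prints the UTF-8 encoding of a character in the same octal notation the dump uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of a string in the octal notation that
# Devel::Peek uses, e.g. '\303\240'. Hypothetical helper for illustration.
sub utf8_octal {
    my ($char) = @_;
    return join '',
        map { sprintf('\\%03o', ord($_)) }
        split //, encode('UTF-8', $char);
}

my $latin    = utf8_octal("\x{E0}");     # LATIN SMALL LETTER A WITH GRAVE
my $cyrillic = utf8_octal("\x{430}");    # CYRILLIC SMALL LETTER A

print "U+00E0 -> $latin\n";
print "U+0430 -> $cyrillic\n";
```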

    But this isn't what "wide character" in the warning means. In the
    warning, it means a string element with a code > 255. For string
    elements <= 255, perl can assume that they are supposed to be bytes, not
    characters, when you try to write them to a byte stream. It could be
    argued that this assumption is a mistake, but for better or worse we are
    stuck with that decision. But for string elements > 255, that just isn't
    possible. It can't be a byte, it must be a character, and to convert a
    character into bytes, the encoding needs to be known.
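
    A sketch of that 255 threshold, using an in-memory handle with no encoding layer as a stand-in for a byte stream. write_raw() is a hypothetical helper introduced only for this example.

```perl
use strict;
use warnings;

# Write $text to an in-memory handle that has no encoding layer and
# return the bytes written (as hex) plus the number of warnings raised.
# write_raw() is a hypothetical helper for this illustration.
sub write_raw {
    my ($text) = @_;
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    my $bytes = '';
    open(my $fh, '>', \$bytes) or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh);
    return (unpack('H*', $bytes), $warnings);
}

# Code 224 fits in a byte, so it is written as the single byte 0xE0.
my ($narrow_hex, $narrow_warnings) = write_raw("\x{E0}");

# Code 1072 does not fit, so Perl warns and writes the UTF-8 form.
my ($wide_hex, $wide_warnings) = write_raw("\x{430}");

print "U+00E0: bytes $narrow_hex, warnings $narrow_warnings\n";
print "U+0430: bytes $wide_hex, warnings $wide_warnings\n";
```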


    > {10406:65} [0:0]% perl -Mutf8 -wle 'print "[à]"'
    > [à]
    > {10415:66} [0:0]% perl -Mutf8 -wle 'print "[а]"'
    > Wide character in print at -e line 1.
    > [а]


    ... as these examples demonstrate.


    > I must have added those braces, because:
    >
    > {10421:67} [0:0]% perl -wle 'print "à"' # no problems, just a byte
    > à


    Assuming you use a UTF-8 terminal here: No, this isn't one byte. These are
    two bytes, \303\240.

    > {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops
    >


    Now you have one character (because of -Mutf8, the two bytes \303\240
    are decoded to the character U+00e0), but you are trying to write it to a byte
    stream without specifying the encoding. Perl writes the single byte
    0xE0, which your UTF-8 terminal cannot interpret. (Mine displays a
    question mark in a dark circle)


    > {10520:69} [0:0]% perl -Mutf8 -wle 'print "à "' # stupid
    > à


    Huh? What version of Perl on what platform is this? The string is
    "\x{E0}\x{20}". All elements of the string are <= 255, so the string is
    output as a byte string. This isn't valid UTF-8, and your terminal
    shouldn't be able to interpret it as "à" any more than it was able to
    interpret "\x{E0}\x{0A}" above.

    [more equivalent examples snipped]

    If your program does character I/O, you *need* to specify the encoding
    of the I/O channels. For one-liners, the -C option is sufficient:

    hrunkner:~/tmp 20:40 :) 195% perl -CS -Mutf8 -wle 'print "à"'
    à

    For scripts you would use binmode or 'use open'.
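
    A sketch of the 'use open' variant, assuming UTF-8 is the encoding wanted for output files. It writes to a temporary file so the result can be measured; the pragma applies to handles opened without explicit layers in the same lexical scope.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layer for every handle opened for output in this lexical scope.
use open OUT => ':encoding(UTF-8)';

my $warnings = 0;
local $SIG{__WARN__} = sub { $warnings++ };

# Reserve a temporary file name, then reopen it with a plain open()
# so that the default output layer from "use open" applies.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "\x{430}\n";    # CYRILLIC SMALL LETTER A, a code point above 255
close($out) or die "Cannot close $path: $!";

my $size = -s $path;
print "wrote $size byte(s) with $warnings warning(s)\n";
```

    To cover STDIN, STDOUT and STDERR as well, the pragma accepts the :std subpragma, as in use open qw(:std :encoding(UTF-8)).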

    (Didn't you praise yourself on your ability to read? This is documented
    and it has been repeated by several people in this newsgroup for years)


    > But watch this:
    >
    > {10613:81} [0:0]% perl -Mencoding=utf8 -wle 'print "à"' # hooray!
    > à
    > {10645:82} [0:0]% perl -Mencoding=utf8 -wle 'print "\x{E0}"' # oops
    > �
    > {10654:83} [0:0]% perl -Mencoding=utf8 -wle 'print "\N{U+00E0}"' # hooray!
    > à
    >
    > Except the middle one (what I should think about), I think encoding.pm
    > wins again.


    Excellent example, it shows exactly one of the pitfalls of using "use
    encoding". One would expect "\x{E0}" to result in a string with a single
    element with code 0xE0. At least you seem to have expected it, and for a
    moment I was confused, too. But 'use encoding' doesn't work that way. It
    was designed to convert string constants from the specified encoding to
    Unicode, so it tries to interpret "\x{E0}" as UTF-8, but of course this
    isn't valid UTF-8. So you get "\x{FFFD}" instead (U+FFFD is the
    REPLACEMENT CHARACTER used to mark invalid characters).

    If you use a correct UTF-8 encoded string, it works as expected (well,
    expected by somebody who's read the documentation and remembers that
    little pitfall):

    hrunkner:~/tmp 20:47 :) 197% perl -Mencoding=utf8 -wle 'print "\303\240"'
    à
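
    The same substitution can be observed with Encode directly, without the encoding pragma (which later perl releases deprecated). This is only a sketch of the decoding step described above, not of what encoding.pm does internally.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A lone 0xE0 byte starts a multi-byte UTF-8 sequence but never finishes
# it, so decoding substitutes U+FFFD REPLACEMENT CHARACTER.
my $broken = decode('UTF-8', "\xE0");

# The byte pair 0xC3 0xA0 (octal \303\240) is the valid UTF-8 form
# of U+00E0.
my $valid = decode('UTF-8', "\xC3\xA0");

printf "lone 0xE0      -> U+%04X\n", ord $broken;
printf "0xC3 0xA0 pair -> U+%04X\n", ord $valid;
```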


    For one-liners like this, using the same encoding for the script and the
    I/O is useful ("-CS -Mutf8" is even shorter than "-Mencoding=utf8", but
    maybe you don't have a UTF-8 capable terminal). However, for real
    programs, I think tying the encoding of the source code to the encoding
    of I/O-streams the script is supposed to handle is foolish. My scripts
    are always encoded in UTF-8, but they frequently have to handle files in
    CP-1252.

    hp


     
    Peter J. Holzer, Oct 28, 2012
    #10
  11. On Sun, 28 Oct 2012, Peter J. Holzer wrote:

    > But this isn't what "wide character" in the warning means. In the
    > warning, it means a string element with a code > 255. For string
    > elements <= 255, perl can assume that they are supposed to be bytes, not
    > characters, when you try to write them to a byte stream.


    You have to distinguish what may work sometimes or always, and what is
    part of the interface which *should* work. If it does not work in the
    latter case, it is an error; if it does not work in the former case you
    have made a bad guess about how it is implemented. So do not rely on your
    guesses but use the documented interface.

    There are two ways to use the interface:

    - You regard all strings, both during the run of the script and on
    input/output, as bytes (=groups of 8 bits) without any meaning as
    characters (=member of an alphabet for writing text). This will work if
    all devices, and the script itself, use the same character code, which
    must not have bytes with value >255. This *can* be a viable option if
    you can either guarantee this restriction, or if your bytes do not
    have a character meaning.

    In this case, strings in the program text with characters that are not
    contained in the common character code are meaningless, and will yield
    errors.

    - You regard the data during the run of the script as sequences of
    characters, and the data on input and output as sequences of bytes. Then
    you have to convert bytes into textstrings on input and textstrings into
    bytes on output -- in both cases you can specify the conversion once and
    for all for each file. This is the only working way when the restrictions
    of the last item are not fulfilled.

    In this case, strings in the program text may contain any characters
    whether or not they are representable in the codes used in input/output.
    The "use utf8" pragma tells perl to interpret the program text itself as a
    sequence of UTF-8 characters which will make a difference only for literal
    strings in the program.

    A third way does *not* work:

    - You do input and output on strings of bytes and assume that perl will guess
    correctly what characters these bytes represent in your opinion.
    Unfortunately that will *often* work (because perl assumes ISO-8859-1 on
    many systems which may be what you are actually using), but it will also
    often break (if you use other codes, or if you mix strings which happen to
    contain only ISO-8859-1 characters with strings containing also other
    characters). But if it breaks, it is your fault: it is nowhere guaranteed
    how text strings map to byte strings and vice versa, the sole exception
    being the documented encode and decode functions.
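
    The second way can be sketched with the documented decode and encode functions. The choice of CP-1252 for input and UTF-8 for output is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input side: bytes arrive in a known encoding (CP-1252 is assumed here).
my $input_bytes = "\x80 100";                      # 0x80 is the euro sign in CP-1252
my $text        = decode('cp1252', $input_bytes);  # now a character string

# Inside the program, work on characters, not bytes.
my $euro_code = ord substr($text, 0, 1);

# Output side: turn characters back into bytes in the target encoding.
my $output_bytes = encode('UTF-8', $text);

printf "first character: U+%04X\n", $euro_code;
printf "output bytes:    %s\n", unpack('H*', $output_bytes);
```

    For whole files the same conversion is usually attached to the handle as an I/O layer instead of being called by hand.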

    This is fairly well explained in
    http://search.cpan.org/~dom/perl-5.14.3/pod/perlunitut.pod

    --
    Helmut Richter
     
    Helmut Richter, Oct 28, 2012
    #11
  12. Helmut Richter <> writes:

    [...]

    > - You regard the data during the run of the script as sequences of
    > characters, and the data on input and output as sequences of bytes. Then
    > you have to convert bytes into textstrings on input and textstrings into
    > bytes on output -- in both cases you can specify the conversion once and
    > for all for each file. This is the only working way when the restrictions
    > of the last item are not fulfilled.


    This is the only 'working way' when the assumption that perl uses a
    'secret mystery encoding' different from any other encoding known to
    man is taken for granted. But this assumption is wrong and the concept
    makes precious little sense since it requires an additional copy of
    all input data and all output data (possibly, times the number of perl
    processes in a 'long' pipeline since not even perl is supposed to be
    able to talk to perl natively). Considering the way perl is
    implemented, this is a real problem for users of Windows (and Mac OS
    X, AFAIK) because in both cases, perl uses something other than the
    native encoding. That some people would like to inflict the same
    damage onto users of platforms where the problem doesn't exist is
    certainly very laudable but IMNSHO, best ignored.
     
    Rainer Weikusat, Oct 28, 2012
    #12
  13. On 2012-10-28 20:57, Helmut Richter <> wrote:
    > On Sun, 28 Oct 2012, Peter J. Holzer wrote:
    >
    >> But this isn't what "wide character" in the warning means. In the
    >> warning, it means a string element with a code > 255. For string
    >> elements <= 255, perl can assume that they are supposed to be bytes, not
    >> characters, when you try to write them to a byte stream.

    >
    > You have to distinguish what may work sometimes or always, and what is
    > part of the interface which *should* work. If it does not work in the
    > latter case, it is an error; if it does not work in the former case you
    > have made a bad guess about how it is implemented. So do not rely on your
    > guesses but use the documented interface.


    I was careful to use the term "string element" and avoid the terms
    "byte" and "character" when talking about the things a string is
    composed of.

    Perl has two types of strings: Character strings (often called utf8
    strings in the documentation) and byte strings. Character strings are
    composed of 32-bit entities, each denoting a unicode code point. So
    "\x{1f42a}" is a string with the single character DROMEDARY CAMEL.
    Byte strings are just that: Strings of uninterpreted bytes. Any
    semantics assigned to them is semantics of the program, not of the Perl
    language (this isn't quite correct: character oriented functions like lc
    or character classes in regexps do work on them, but only for ASCII).

    These differences are documented, and I consider them part of the
    interface, although some members of p5p consider the distinction a bug
    and try to remove it.

    However, for the warning "Wide character in print" this is irrelevant.

    Perl doesn't distinguish between character and byte strings when writing
    them to a file handle. For both the strings "\x{E0}" (a byte string) and
    "\N{U+00E0}" (a character string), if you write them to a raw file
    handle, the single byte 0xE0 will be written. Both will be converted to
    two bytes 0xC3 0xA0 if you write them to a file handle with the
    ":encoding(UTF-8)" layer. And so on. But for strings with elements >
    255, it simply isn't possible to write a single byte with this value to
    a byte stream, because a byte has only 8 bits (on the platforms we care
    about). So Perl prints a warning and encodes the string in UTF-8 (or
    just copies its internal representation, which happens to be the same
    thing). I would argue that perl should die() instead, but this has been
    the observed and documented behaviour since 5.8.0, so I doubt it will
    change.
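
    A sketch of the first half of that claim. It uses utf8::upgrade to get an internally UTF-8 copy of the same one-element string, and in-memory handles in place of files; written_hex() is a hypothetical helper.

```perl
use strict;
use warnings;

# Write $text through an in-memory handle, optionally with a layer,
# and return the resulting bytes as a hex string.
# written_hex() is a hypothetical helper for this illustration.
sub written_hex {
    my ($text, $layer) = @_;
    my $bytes = '';
    open(my $fh, '>', \$bytes) or die "Cannot open in-memory handle: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $text;
    close($fh);
    return unpack('H*', $bytes);
}

my $byte_string = "\x{E0}";      # stored internally as one byte
my $char_string = "\x{E0}";
utf8::upgrade($char_string);     # same content, stored internally as UTF-8

my @raw = (written_hex($byte_string), written_hex($char_string));
my @encoded = (
    written_hex($byte_string, ':encoding(UTF-8)'),
    written_hex($char_string, ':encoding(UTF-8)'),
);

print "raw handle:       @raw\n";
print ":encoding(UTF-8): @encoded\n";
```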


    [Rest snipped. All true, but IMHO not very relevant to this thread].

    hp


     
    Peter J. Holzer, Oct 29, 2012
    #13
  14. On 2012-10-28 21:39, Rainer Weikusat <> wrote:
    > Helmut Richter <> writes:
    >> - You regard the data during the run of the script as sequences of
    >> characters, and the data on input and output as sequences of bytes. Then
    >> you have to convert bytes into textstrings on input and textstrings into
    >> bytes on output -- in both cases you can specify the conversion once and
    >> for all for each file. This is the only working way when the restrictions
    >> of the last item are not fulfilled.

    >
    > This is the only 'working way' when the assumption that perl uses a
    > 'secret mystery encoding' different from any other encoding known to
    > man is taken for granted.


    The encoding isn't a 'secret mystery'. It is well documented that it
    is Unicode.

    perl -CS -MEncode -E 'say ord(Encode::decode("utf-8", "\xE2\x82\xAC"))'

    is defined to print "8364".

    It is a 'secret mystery' (wink, wink, nudge, nudge) how this is
    represented internally, just like the representation of numbers is a
    'secret mystery'.

    However, for most programs you don't have to know that Perl character
    strings are Unicode strings. It is sufficient to know that Perl has the
    concept of a "character" which is different from the concept of a
    "byte", that a character has certain properties (e.g. it can be a letter
    or an ideograph, it may have an associated uppercase or lowercase
    letter, ...) and to convert a sequence of characters into a sequence of
    bytes you have to encode them. Whether the Euro sign has the numeric
    code 8364 or 4711 is rarely significant.


    > But this assumption is wrong and the concept
    > makes precious little sense since it requires an additional copy of
    > all input data and all output data


    This is an unsubstantiated claim. It is possible that the current
    implementation of I/O layers does indeed perform an additional copy (I
    haven't checked the code), but this is certainly not required.

    And even if it is true, it is almost certainly lost in the noise as soon
    as your script does something more complex than "cat" with your input -
    almost any string operation in perl performs a copy.

    > (possibly, times the number of perl processes in a 'long' pipeline
    > since not even perl is supposed to be able to talk to perl natively).
    > Considering the way perl is implemented, this is a real problem for
    > users of Windows (and Mac OS X, AFAIK) because in both cases, perl
    > uses something other than the native encoding.


    Why is this a real problem?

    > That some people would like to inflict the same damage onto users of
    > platforms where the problem doesn't exist is certainly very laudable
    > but IMNSHO, best ignored.


    Whatever "the problem" may be. The problem that characters and bytes
    aren't the same, and that most programmers prefer to think of text as a
    sequence of characters rather than a sequence of bytes, exists on every
    platform.

    hp



    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
     
    Peter J. Holzer, Oct 29, 2012
    #14
  15. On Mon, 29 Oct 2012, Peter J. Holzer wrote:

    > However, for most programs you don't have to know that Perl character
    > strings are Unicode strings.


    Are they? They are strings of characters that are contained in Unicode. They
    are not necessarily internally encoded as Unicode. People run into problems
    when they make assumptions about the way they are implemented. I would have
    worded it:

    For all programs you must not pretend to know that Perl character strings
    are Unicode strings.

    It may be true, it may be false -- either way, it is not part of the
    documented interface. Hence, it must not be used even if it be true.

    --
    Helmut Richter
     
    Helmut Richter, Oct 29, 2012
    #15
  16. with <> Peter J. Holzer wrote:
    > On 2012-10-28 11:45, Eric Pozharski <> wrote:
    >> with <> Ben Morrow wrote:


    >>> In any case, the result is exactly what I said: the string contains
    >>> one (logical) character. If you apply length() to that string it
    >>> will return 1. (This character happens to be represented internally
    >>> as two bytes; that is none of your business.) What do you think I
    >>> omitted from the story?

    >> Right. And that's closely related to your last example (the one
    >> about utf8.pm being unsafe). I've tried to make a point that
    >> *characters* from different *ranges* happen to be of different length
    >> in bytes.

    > Then maybe you shouldn't have chosen two examples which both are same
    > length in bytes.


    (Last night I reread loads of perlunicode and friends; I feel much
    better now.) No, they are the same length only *if* the encoding of the
    stream is set:

    {7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
    0000000: c3a0 0a ...
    {7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
    0000000: d0b0 0a ...
    {7466:24} [0:0]%

    But latin1 is special (I've reread perlunicode and friends): *if*
    there's no reason to upgrade to utf8 (printing isn't a reason), then
    *characters* of the latin1 script (and latin1 only) stay *bytes*:

    {7466:24} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
    0000000: e00a ..
    {7795:25} [0:0]% perl -Mutf8 -wle 'print "а"' | xxd
    Wide character in print at -e line 1.
    0000000: d0b0 0a ...

    But even if the encoding of the stream isn't set, concatenation with a
    non-latin1 script upgrades latin1 too:

    {7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
    Wide character in print at -e line 1.
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

    Please rewind the thread. That's exactly what happened a couple of posts
    ago (specifically: <eli$-neck.ny.us> and
    <>).

    >>
    >> {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
    >> SV = PV(0xa06f750) at 0xa08afac
    >> REFCNT = 1
    >> FLAGS = (POK,pPOK,UTF8)
    >> PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
    >> CUR = 5
    >> LEN = 12
    >>
    >> *Characters* of latin1 aren't wide (even if they are characters, they
    >> are still one byte long)

    > In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
    > characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
    > GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".


    No. Because it's not UTF-8, it's utf8. As long as utf8 semantics isn't
    set, anything scalar stays plain bytes:

    {2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
    SV = PV(0x9d0e878) at 0x9d29f28
    REFCNT = 1
    FLAGS = (PADTMP,POK,READONLY,pPOK)
    PV = 0x9d2ddc8 "\303\240"\0
    CUR = 2
    LEN = 12

    However, when utf8 semantics is set, then those codepoints that fit
    latin1 script become special Perl-latin1:

    {5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
    SV = PV(0x9b92880) at 0x9badf10
    REFCNT = 1
    FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
    PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
    CUR = 2
    LEN = 12

    Whether a string is upgraded to UTF-8 encoding or stays in latin1
    encoding depends on concatenation with codepoints already upgraded to
    UTF-8 and/or the encoding of the output stream.

    *SKIP*
    >> {10477:68} [0:0]% perl -Mutf8 -wle 'print "à"' # oops

    > Now you have one character (because of -Mutf8, the two bytes \303\240
    > are decoded to the character U+00e0), but you are trying to write it
    > to a byte stream without specifying the encoding. Perl writes the
    > single byte 0xE0, which your UTF-8 terminal cannot interpret. (Mine
    > displays a question mark in a dark circle)


    {42:1} [0:0]% perl -Mutf8 -wle 'print "à"'
    à
    {1903:2} [0:0]% perl -Mutf8 -wle 'print "à"'

    {1933:3} [0:0]% perl -Mutf8 -wle 'print "à"' | xxd
    0000000: e00a

    But it does. Once. I wasn't typing; I was searching through
    history. Now I'm bothered. Does anyone here know how to list the
    extensions enabled in a running instance of urxvt?

    *SKIP*
    > For one-liners like this, using the same encoding for the script and
    > the I/O is useful ("-CS -Mutf8" is even shorter than
    > "-Mencoding=utf8", but maybe you don't have a UTF-8 capable terminal).


    {14999:29} [0:0]% perl -mencoding -wle 'print "[à][а]"' | xxd
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
    {15017:30} [0:0]% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
    0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].

    Golf?

    > However, for real programs, I think tying the encoding of the source
    > code to the encoding of I/O-streams the script is supposed to handle
    > is foolish. My scripts are always encoded in UTF-8, but they
    > frequently have to handle files in CP-1252.


    Mine are us-ascii, I have open.pm for rest.


    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Oct 29, 2012
    #16
  17. Helmut Richter <> writes:
    > On Mon, 29 Oct 2012, Peter J. Holzer wrote:
    >> However, for most programs you don't have to know that Perl character
    >> strings are Unicode strings.


    [...]

    > For all programs you must not pretend to know that Perl character strings
    > are Unicode strings.
    >
    > It may be true, it may be false -- either way, it is not part of the
    > documented interface. Hence, it must not be used even if it be true.


    At best, that's a part of the interface which was meanwhile
    'undocumented' because the implementation choices which were made
    weren't the implementation choices that should have been made,
    according to the opinions of some people who didn't make the
    decision. But independently of that, inventing the 'Perl is an
    island!' character encoding - no matter how hypothetical - remains a
    stupid idea. Perl is not an island and it has to interact with code
    written in other programming languages, although maybe not in the
    fantasy universe of people who implement 'wepp fremmwuergs' and
    'ohpscheckt suesstemms' who are generally not troubled by the minor
    consideration of making their stuff do something actually useful in
    the real world. Consequently, Perl should be compatible with some
    existing convention, ideally, with all existing 'local'
    conventions. If this isn't possible, the next best choice is not 'make
    everyone bleed'.
     
    Rainer Weikusat, Oct 29, 2012
    #17
  18. On 2012-10-29 12:52, Eric Pozharski <> wrote:
    > with <> Peter J. Holzer wrote:
    >> On 2012-10-28 11:45, Eric Pozharski <> wrote:
    >>> with <> Ben Morrow wrote:

    >
    >>>> In any case, the result is exactly what I said: the string contains
    >>>> one (logical) character. If you apply length() to that string it
    >>>> will return 1. (This character happens to be represented internally
    >>>> as two bytes; that is none of your business.) What do you think I
    >>>> omitted from the story?
    >>> Right. And that's closely related to your last example (the one
    >>> about utf8.pm being unsafe). I've tried to make a point that
    >>> *characters* from different *ranges* happen to be of different length
    >>> in bytes.

    >> Then maybe you shouldn't have chosen two examples which both are same
    >> length in bytes.

    >
    > (Last night I've reread loads of perlunicode and friends, I feel much
    > better now) No, they are the same length *if* encoding of stream is set:


    You posted the output of Devel::Peek::Dump, so I thought you were
    talking about the *internal* representation.

    How many bytes they occupy in an I/O stream depends on the encoding.

    LATIN SMALL LETTER A WITH GRAVE is one byte in ISO-8859-1, CP850, ...
    LATIN SMALL LETTER A WITH GRAVE is two bytes in UTF-8, UTF-16, ...
    LATIN SMALL LETTER A WITH GRAVE is four bytes in UTF-32, ...

    CYRILLIC SMALL LETTER A is one byte in ISO-8859-5, KOI-8, ...
    CYRILLIC SMALL LETTER A is two bytes in UTF-8, UTF-16, ...
    CYRILLIC SMALL LETTER A is four bytes in UTF-32, ...

    (And of course, both characters cannot be represented at all in some
    encodings: There is no LATIN SMALL LETTER A WITH GRAVE in ISO-8859-5,
    and no CYRILLIC SMALL LETTER A in ISO-8859-1)

    > {7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
    > 0000000: c3a0 0a ...
    > {7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
    > 0000000: d0b0 0a ...
    > {7466:24} [0:0]%
    >
    > But latin1 is special (I've reread perlunicode and friends), *if*
    > there's no reason (printing isn't reason) to upgrade to utf8 then
    > *characters* of latin1 script (and latin1 only) stay *bytes*:


    I already explained that. When writing to a file handle, perl doesn't
    care whether a string is composed of bytes or characters.

    If the file handle has no :encoding() layer, it will try to write each
    element of the string as a single byte.

    If the file has an :encoding() layer, it will interpret each element of
    the string as a character and convert that to a byte sequence according
    to that encoding.

    So without an encoding layer "\x{E0}" will always be written as the single byte
    0xE0, regardless of whether the string is a byte string or a character
    string. With an ":encoding(UTF-8)" layer it will always be written as
    two bytes 0xC3 0xA0; and with an ":encoding(CP850)" layer, it will
    always be written as a single byte 0x85.
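
    A minimal sketch of those three cases, writing to in-memory files so
    the bytes can be inspected afterwards. The helper name bytes_written
    is made up for illustration, and ":raw" stands in for "no encoding
    layer":

    ```perl
    use strict;
    use warnings;

    # Print $string through the given layer into an in-memory file and
    # return the bytes that arrived there, formatted as hex.
    sub bytes_written {
        my ($layer, $string) = @_;
        my $buffer = '';
        open my $fh, ">$layer", \$buffer or die "open failed: $!";
        print {$fh} $string;
        close $fh or die "close failed: $!";
        return join ' ', map { sprintf '%02x', ord } split //, $buffer;
    }

    my $a_grave = "\x{E0}";    # LATIN SMALL LETTER A WITH GRAVE

    for my $layer (':raw', ':encoding(UTF-8)', ':encoding(cp850)') {
        printf "%-18s %s\n", $layer, bytes_written($layer, $a_grave);
    }
    ```

    The string is the same in all three calls; only the layer on the file
    handle decides which bytes end up in the file.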

    What is apparently confusing you is what happens if that fails.

    Obviously you can't write a single byte with the value 0x430, you can't
    encode CYRILLIC SMALL LETTER A in ISO-8859-1 and you can't encode LATIN
    SMALL LETTER A WITH GRAVE in ISO-8859-5.

    So what does perl do? It prints a warning to STDERR and writes
    a more or less reasonable approximation to the stream. The details
    depend on the I/O layer:

    If there is no :encoding() layer, the warning is "Wide character in
    print" and the utf-8 representation is sent to the stream. And to
    confuse matters further, this is done for the whole string, not just
    this particular string element:

    % perl -Mutf8 -E 'say "->\x{E0}\x{430}<-"'
    Wide character in say at -e line 1.
    ->àа<-

    (one string: \x{E0} and \x{430} converted to UTF-8)

    % perl -Mutf8 -E 'say "->\x{E0}<-", "->\x{430}<-"'
    Wide character in say at -e line 1.
    ->�<-->а<-

    (two strings: \x{E0} printed as a single byte, \x{430} converted to UTF-8)

    If there is an :encoding() layer, the warning is "\x{....} does not map
    to $charset" and a \x{....} escape sequence is sent to the stream:

    % perl -Mutf8 -E 'binmode STDOUT, ":encoding(iso-8859-5)"; say "->\x{E0}<-"'
    "\x{00e0}" does not map to iso-8859-5 at -e line 1.
    ->\x{00e0}<-

    But these are responses to an *error* condition. You shouldn't try to
    write codepoints > 255 to a byte stream (actually, you shouldn't write
    any characters to a byte stream, a byte stream is for bytes), and you
    shouldn't try to write latin accented characters to a cyrillic stream.
    Or at least you shouldn't be terribly surprised if the result is a
    little confusing - garbage in, garbage out.
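
    The "whole string, not just this element" behaviour of the layer-less
    case can be sketched without a terminal by printing to an in-memory
    file and catching the warning. The helper name raw_print is made up
    for illustration; this only demonstrates the fallback described above,
    it is not a recommendation to rely on it:

    ```perl
    use strict;
    use warnings;

    # Print the given strings to an in-memory handle that has no encoding
    # layer. Returns the bytes written (as hex) and the number of warnings
    # perl raised along the way.
    sub raw_print {
        my @strings = @_;
        my @warnings;
        local $SIG{__WARN__} = sub { push @warnings, $_[0] };
        my $buffer = '';
        open my $fh, '>:raw', \$buffer or die "open failed: $!";
        print {$fh} @strings;
        close $fh or die "close failed: $!";
        my $hex = join ' ', map { sprintf '%02x', ord } split //, $buffer;
        return ($hex, scalar @warnings);
    }

    # One string containing both characters ...
    my ($one, $warned_one) = raw_print("\x{E0}\x{430}");
    # ... versus the same two characters passed as two separate strings.
    my ($two, $warned_two) = raw_print("\x{E0}", "\x{430}");

    print "one string:  $one (warnings: $warned_one)\n";
    print "two strings: $two (warnings: $warned_two)\n";
    ```

    In the first call the \x{E0} is dragged along into UTF-8 with its
    neighbour; in the second it goes out as a single byte on its own.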


    > But even if encoding of stream isn't set concatenation with non-latin1
    > script upgrades latin1 too:


    The term "upgrade" has a rather specific meaning in Perl in context with
    byte and character strings, and I don't think you are talking about
    that.


    > {7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
    > Wide character in print at -e line 1.
    > 0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].


    You have a single string "[à][а]" here. As I wrote above, print treats
    the string as a unit and in the absence of an :encoding() layer just dumps
    it in UTF-8 encoding. So, yes, both the "à" and the "а" within this
    single string will be UTF-8-encoded (as will be the square brackets, but
    for them the UTF-8 encoding is the same as for US-ASCII, so you don't
    notice that).

    And I repeat it again: You are doing something which just doesn't make
    sense (writing characters to a byte stream), so don't be surprised if
    the result is a little surprising. Do it right and the result will make
    sense.


    > Please rewind the thread. That's exactly what happened couple of posts
    > ago (specifically: <eli$-neck.ny.us> and
    ><>).


    I've read these postings but I don't know what you are referring to. If
    you are referring to other postings (especially long ones), please cite
    the relevant part.


    >>> {9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
    >>> SV = PV(0xa06f750) at 0xa08afac
    >>> REFCNT = 1
    >>> FLAGS = (POK,pPOK,UTF8)
    >>> PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
    >>> CUR = 5
    >>> LEN = 12
    >>>
    >>> *Characters* of latin1 aren't wide (even if they are characters, they
    >>> are still one byte long)

    >> In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
    >> characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
    >> GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".

    >
    > No. Because it's not UTF-8, it's utf8.


    I presume that by "utf8" you mean a string with the UTF8 bit set
    (testable with the utf8::is_utf8() function). But as I've written
    repeatedly, this is completely irrelevant for I/O. A string will be
    treated completely identically, whether it has this bit set or not. It is
    only the value of the string which is important, not its internal type
    and representation.
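
    A minimal sketch of that claim: two strings with the same value but
    different internal representations, printed through the same layers.
    The helper name printed_bytes is made up for illustration;
    utf8::upgrade, utf8::downgrade and utf8::is_utf8 are the core
    functions for changing and inspecting the internal flag:

    ```perl
    use strict;
    use warnings;

    # Print $string through the given layer into an in-memory file and
    # return the resulting bytes as hex.
    sub printed_bytes {
        my ($layer, $string) = @_;
        my $buffer = '';
        open my $fh, ">$layer", \$buffer or die "open failed: $!";
        print {$fh} $string;
        close $fh or die "close failed: $!";
        return join ' ', map { sprintf '%02x', ord } split //, $buffer;
    }

    my $plain    = "\x{E0}";
    my $upgraded = "\x{E0}";
    utf8::downgrade($plain);     # stored as one byte, UTF8 flag off
    utf8::upgrade($upgraded);    # stored as two bytes, UTF8 flag on

    printf "same value: %s\n", $plain eq $upgraded ? 'yes' : 'no';
    printf "UTF8 flag:  plain=%d upgraded=%d\n",
        utf8::is_utf8($plain) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;

    for my $layer (':raw', ':encoding(UTF-8)') {
        printf "%-18s plain=[%s] upgraded=[%s]\n", $layer,
            printed_bytes($layer, $plain), printed_bytes($layer, $upgraded);
    }
    ```

    For each layer the two columns should agree, even though the flag
    differs between the two strings.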

    (Also, I find it very confusing that you post the output of
    Devel::Peek::Dump, but then apparently don't refer to it but talk about
    something else. Please try to organize your postings in a way that one
    can understand what you are talking about. It is very likely that this
    exercise will also clear up the confusion in your mind)


    > As long as utf8 semantics isn't set, anything scalar stays plain
    > bytes:
    >
    > {2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
    > SV = PV(0x9d0e878) at 0x9d29f28
    > REFCNT = 1
    > FLAGS = (PADTMP,POK,READONLY,pPOK)
    > PV = 0x9d2ddc8 "\303\240"\0
    > CUR = 2
    > LEN = 12
    >
    > However, when utf8 semantics is set, then those codepoints that fit
    > latin1 script become special Perl-latin1:
    >
    > {5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
    > SV = PV(0x9b92880) at 0x9badf10
    > REFCNT = 1
    > FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
    > PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
    > CUR = 2
    > LEN = 12


    Yes. We've been through that. Ben explained it in excruciating detail.
    What don't you understand here?


    >> However, for real programs, I think tying the encoding of the source
    >> code to the encoding of I/O-streams the script is supposed to handle
    >> is foolish. My scripts are always encoded in UTF-8, but they
    >> frequently have to handle files in CP-1252.

    >
    > Mine are us-ascii, I have open.pm for rest.


    US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
    of mine don't contain non-ASCII characters either) What I meant is that
    I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
    encode non-ASCII characters, so I don't have any need for "use
    encoding". If your scripts are all in ASCII and you use open.pm for
    "rest", what do you need "use encoding" for? Remember, this subthread
    started when you berated Ben for discouraging the use of "use encoding".

    hp


     
    Peter J. Holzer, Oct 30, 2012
    #18
  19. On Mon, 29 Oct 2012, Rainer Weikusat wrote:

    > But independently of that, inventing the 'Perl is an
    > island!' character encoding - no matter how hypothetical - remains a
    > stupid idea.


    Every program is an "island" within its code. No matter what I use, I do not
    normally know the internals, and if I happen to know them I should not use my
    knowledge because the internals may change at any time.

    Perl is not an island as far as interaction with other programs is
    concerned. It is documented how to read and write byte data, and how to read
    and write character data whose code and encoding is known. If desired, it is
    also not really difficult to write code that tries to guess an unknown code --
    with all the pitfalls such a behaviour entails.
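
    A minimal sketch of "reading and writing character data whose
    encoding is known", using in-memory files in place of real ones.
    CP-1252 on input and UTF-8 on output are picked only as an example;
    the variable names are made up for illustration:

    ```perl
    use strict;
    use warnings;

    # Bytes as they might sit in a CP-1252 file: "résumé €" plus a newline.
    my $cp1252_bytes = "r\xE9sum\xE9 \x80\n";

    # Reading: name the encoding on the input handle, get characters back.
    open my $in, '<:encoding(cp1252)', \$cp1252_bytes
        or die "open failed: $!";
    my $line = <$in>;
    close $in or die "close failed: $!";

    # Inside the program the text is just characters.
    chomp $line;
    my $char_count = length($line);    # 8 characters

    # Writing: name the encoding on the output handle, get bytes out.
    my $utf8_bytes = '';
    open my $out, '>:encoding(UTF-8)', \$utf8_bytes
        or die "open failed: $!";
    print {$out} "$line\n";
    close $out or die "close failed: $!";

    printf "%d characters, %d bytes in, %d bytes out\n",
        $char_count, length($cp1252_bytes), length($utf8_bytes);
    ```

    Nothing in between the two open calls needs to know which encodings
    are in use on either side.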

    There is one interface decision perl has made: it does not by default use the
    locale settings to determine the default code and encoding, rather it requires
    that these be specified in the script. Opinions may be divided; I like this
    decision because my experience is that often the locale settings appear to be
    randomly uncorrelated to the codes actually used.

    The implementation decisions that are not part of the interface, in particular
    the internal representation of values of different types including strings,
    concern future developers but not users. If perl decides to store characters
    internally as a 37-bit EBCDIC enhancement, it does not really bother me as
    long as the program still interacts correctly with the outside world in
    standardised codes.

    --
    Helmut Richter
     
    Helmut Richter, Oct 30, 2012
    #19
  20. with <> Peter J. Holzer wrote:
    > On 2012-10-29 12:52, Eric Pozharski <> wrote:
    >> with <> Peter J. Holzer wrote:


    *SKIP*
    >> Please rewind the thread. That's exactly what happened couple of
    >> posts ago (specifically: <eli$-neck.ny.us> and
    >> <>).

    > I've read these postings but I don't know what you are referring to.
    > If you are referring to other postings (especially long ones), please
    > cite the relevant part.


    [quoting <eli$-neck.ny.us> on]

    $ echo 'a' | perl -Mutf8 -wne 's/a/å/;print' | od -xc
    0000000 0ae5
    345 \n
    0000002

    [quote off]

    *SKIP*
    >>> In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as
    >>> cyrillic characters. Your example shows this: "à" (LATIN SMALL
    >>> LETTER A WITH GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A)
    >>> is "\320\260".

    >> No. Because it's not UTF-8, it's utf8.

    > I presume that by "utf8" you mean a string with the UTF8 bit set
    > (testable with the utf8::is_utf8() function).


    If "you" above refers to me then you're wrong.

    > But as I've written repeatedly, this is completely irrelevant for I/O.
    > A string will be treated completely identically, whether it has this bit
    > set or not. It is only the value of the string which is important, not
    > its internal type and representation.


    Try to read it again. Slowly.

    > (Also, I find it very confusing that you post the output of
    > Devel::Peek::Dump, but then apparently don't refer to it but talk
    > about something else. Please try to organize your postings in a way
    > that one can understand what you are talking about.


    Indeed, only FLAGS and PV are relevant. Sadly, Devel::Peek::Dump
    doesn't provide a means to filter arbitrary parts of its output (however,
    that's not the purpose of D::P). And I consider editing copy-pastes
    bad taste.

    *SKIP*
    > Yes. We've been through that. Ben explained it in excruciating detail.
    > What don't you understand here?


    It's not about understanding. I'm trying to make a point that latin1 is
    special.

    >>> However, for real programs, I think tying the encoding of the source
    >>> code to the encoding of I/O-streams the script is supposed to handle
    >>> is foolish. My scripts are always encoded in UTF-8, but they
    >>> frequently have to handle files in CP-1252.

    >> Mine are us-ascii, I have open.pm for rest.

    > US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
    > of mine don't contain non-ASCII characters either) What I meant is that
    > I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
    > encode non-ASCII characters, so I don't have any need for "use
    > encoding". If your scripts are all in ASCII and you use open.pm for
    > "rest", what do you need "use encoding" for?


    Many years ago, to get operations to work on characters instead of bytes,
    some strings had to be pulled. encoding.pm pulled the right strings;
    utf8.pm pulled irrelevant ones. In those days text-related operations
    worked for you because they fitted in the latin1 script or you didn't hit
    edge cases. However, I did (even more years ago, in 5.6.0, B<lcfirst()>
    worked *only* on bytes, no matter what).

    Guess what? I've just figured out I don't need either any more:

    {40710:255} [0:0]% xxd foo.koi8-u
    0000000: c6d9 d7c1 0a .....
    {40731:262} [0:0]% perl -wle '
    open $fh, "<:encoding(koi8-u)", "foo.koi8-u";
    read $fh, $fh, -s $fh;
    $fh =~ m{(\w\w)};
    print $1
    '
    Wide character in print at -e line 5.
    фы

    > Remember, this subthread started when you berated Ben for discouraging
    > the use of "use encoding".


    It has become clear to me now what made you both (you and Ben) believe in
    the bugginess of F<encoding.pm>. I'm fine with that.

     
    Eric Pozharski, Oct 31, 2012
    #20