utf8

Discussion in 'Perl Misc' started by George Mpouras, May 13, 2013.

  1. Is there any easy way to decide if a string is valid UTF-8?
    George Mpouras, May 13, 2013
    #1

  2. Manfred Lotz (Guest)

    On Mon, 13 May 2013 14:05:00 +0300
    George Mpouras <> wrote:

    > Is there any easy way to decide if a string is valid UTF-8?


    Minimal example:

    #! /usr/bin/perl

    use strict;
    use warnings;

    use utf8;
    use Encode;

    my $string = 'Hä';   # non-ASCII literal; 'use utf8' sets the UTF8 flag

    Encode::is_utf8($string) or die "bad string";   # passes

    my $bad_string = 0x123456;   # a plain number; stringifies to ASCII
    Encode::is_utf8($bad_string) or die "bad string";   # dies here


    --
    Manfred
    Manfred Lotz, May 13, 2013
    #2

  3. On 13/5/2013 15:51, Manfred Lotz wrote:
    > On Mon, 13 May 2013 14:05:00 +0300
    > George Mpouras <> wrote:
    >
    >> Is there any easy way to decide if a string is valid UTF-8?

    >
    > Minimal example:
    >
    > #! /usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use utf8;
    > use Encode;
    >
    > my $string = 'Hä';
    >
    > Encode::is_utf8($string) or die "bad string";
    >
    > my $bad_string = 0x123456;
    > Encode::is_utf8($bad_string) or die "bad string";
    >
    >




    Thanks, it is working.
    I had tried the same thing, but my mistake was that I had not used
    the line "use utf8;"!
    George Mpouras, May 13, 2013
    #3
  4. Manfred Lotz (Guest)

    On Mon, 13 May 2013 16:22:36 +0300
    George Mpouras <> wrote:

    > On 13/5/2013 15:51, Manfred Lotz wrote:
    > > On Mon, 13 May 2013 14:05:00 +0300
    > > George Mpouras <>
    > > wrote:
    > >
    > >> Is there any easy way to decide if a string is valid UTF-8?

    > >
    > > Minimal example:
    > >
    > > #! /usr/bin/perl
    > >
    > > use strict;
    > > use warnings;
    > >
    > > use utf8;
    > > use Encode;
    > >
    > > my $string = 'Hä';
    > >
    > > Encode::is_utf8($string) or die "bad string";
    > >
    > > my $bad_string = 0x123456;
    > > Encode::is_utf8($bad_string) or die "bad string";
    > >
    > >

    >
    >
    >
    > Thanks, it is working.
    > I had tried the same thing, but my mistake was that I had not used
    > the line "use utf8;"!
    >
    >


    Yes, that is important.


    --
    Manfred
    Manfred Lotz, May 13, 2013
    #4
  5. On 2013-05-13 12:51, Manfred Lotz <> wrote:
    > On Mon, 13 May 2013 14:05:00 +0300
    > George Mpouras <> wrote:
    >> Is there any easy way to decide if a string is valid UTF-8?

    >
    > Minimal example:
    >
    > #! /usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use utf8;
    > use Encode;
    >
    > my $string = 'Hä';


    This string is not UTF-8 in any useful sense. It consists of two
    characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
    LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would consist
    of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former string has
    length 2, the latter has length 3.


    > Encode::is_utf8($string) or die "bad string";


    This tests whether the internal representation of the string is
    utf-8-like, which you almost never want to know in a Perl program. It
    also tells you whether the string has character semantics (unless you
    use a rather new version of perl with the unicode_strings feature),
    which is sometimes useful.

    If you want to know whether a string is a correctly encoded UTF-8
    sequence, try to decode it:

    $decoded = eval { decode('UTF-8', $string, FB_CROAK) };

    (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
    catch that. All other check parameters are even less convenient).
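
    A hypothetical helper along these lines (the name is made up here;
    Encode::LEAVE_SRC keeps decode() from modifying its input when a
    CHECK argument is given):

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    # Returns true if $octets is a well-formed UTF-8 byte sequence.
    sub is_valid_utf8 {
        my ($octets) = @_;
        return defined eval {
            decode('UTF-8', $octets, FB_CROAK | Encode::LEAVE_SRC)
        };
    }

    print is_valid_utf8("\x48\xc3\xa4") ? "ok\n" : "bad\n";   # ok
    print is_valid_utf8("\x48\xe4")     ? "ok\n" : "bad\n";   # bad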

    hp


    --
    _ | Peter J. Holzer | The curse of electronic word processing:
    |_|_) | Sysadmin WSR | you keep filing away at your text until
    | | | | the parts of the sentence no longer
    __/ | http://www.hjp.at/ | fit together. -- Ralph Babel
    Peter J. Holzer, May 14, 2013
    #5
  6. >
    > If you want to know whether a string is a correctly encoded UTF-8
    > sequence, try to decode it:
    >
    > $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
    >
    > (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
    > catch that. All other check parameters are even less convenient).
    >


    nice!
    George Mpouras, May 14, 2013
    #6
  7. Manfred Lotz (Guest)

    On Tue, 14 May 2013 01:10:59 +0200
    "Peter J. Holzer" <> wrote:

    > On 2013-05-13 12:51, Manfred Lotz <> wrote:
    > > On Mon, 13 May 2013 14:05:00 +0300
    > > George Mpouras <>
    > > wrote:
    > >> Is there any easy way to decide if a string is valid UTF-8?

    > >
    > > Minimal example:
    > >
    > > #! /usr/bin/perl
    > >
    > > use strict;
    > > use warnings;
    > >
    > > use utf8;
    > > use Encode;
    > >
    > > my $string = 'Hä';

    >
    > This string is not UTF-8 in any useful sense. It consists of two
    > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
    > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
    > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
    > string has length 2, the latter has length 3.
    >


    This is only the email. In my test script it is this:

    00000050  20 27 48 c3 a4 27 3b 0a 0a 45 6e 63 6f 64 65 3a  | 'H..';..Encode:|




    > > Encode::is_utf8($string) or die "bad string";

    >
    > This tests whether the internal representation of the string is
    > utf-8-like, which you almost never want to know in a Perl program. It
    > also tells you whether the string has character semantics (unless you
    > use a rather new version of perl with the unicode_strings feature),
    > which is sometimes useful.
    >
    > If you want to know whether a string is a correctly encoded UTF-8
    > sequence, try to decode it:
    >
    > $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
    >
    > (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
    > to catch that. All other check parameters are even less convenient).
    >


    Aaah, thanks. Didn't know that.

    #! /usr/bin/perl
    use strict;
    use warnings;

    use utf8;
    use 5.010;

    use Encode qw( decode FB_CROAK );

    my $string = 'Hä'; # = 0x48c3a4


    my $decoded = decode('utf8', $string, FB_CROAK);


    Nevertheless, I'm confused. The above script, where 'Hä' is definitely
    0x48c3a4 (verified by hexdump), croaks. Why?

    At any rate I have to read perlunitut, perluniintro etc. to understand
    what's going on.


    --
    Manfred
    Manfred Lotz, May 14, 2013
    #7
  8. Manfred Lotz (Guest)

    On Tue, 14 May 2013 21:27:49 +0100
    Ben Morrow <> wrote:

    >
    > Quoth Manfred Lotz <>:
    > > On Tue, 14 May 2013 01:10:59 +0200
    > > "Peter J. Holzer" <> wrote:
    > > > On 2013-05-13 12:51, Manfred Lotz <> wrote:
    > > > >
    > > > > use utf8;
    > > > > use Encode;
    > > > >
    > > > > my $string = 'Hä';
    > > >
    > > > This string is not UTF-8 in any useful sense. It consists of two
    > > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
    > > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
    > > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
    > > > string has length 2, the latter has length 3.

    > [...]
    > >
    > > use utf8;
    > > use 5.010;
    > >
    > > use Encode qw( decode FB_CROAK );
    > >
    > > my $string = 'Hä'; # = 0x48c3a4
    > >
    > >
    > > my $decoded = decode('utf8', $string, FB_CROAK);
    > >
    > >
    > > Nevertheless, I'm confused. The above script, where 'Hä' is
    > > definitely 0x48c3a4 (verified by hexdump), croaks. Why?

    >
    > That is exactly what Peter was trying to explain. Because of the 'use
    > utf8', perl has already decoded the UTF-8 in the source code file into
    > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":


    My mistake was that I believed that perl's internal representation is
    utf8 instead of Unicode code points. I thought I had read this in some
    perl man page.
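
    The distinction can be demonstrated with core facilities only; a small
    sketch in which the same logical string has two different internal
    representations:

    use strict;
    use warnings;
    use Encode ();

    my $s = "\x{e4}";       # one character, stored as a single byte
    my $t = $s;
    utf8::upgrade($t);      # same character, internal UTF-8 storage

    print $s eq $t ? "equal\n" : "differ\n";    # equal
    print length($s), ' ', length($t), "\n";    # 1 1

    # Only the internal flag differs:
    print Encode::is_utf8($s) ? 1 : 0, ' ',
          Encode::is_utf8($t) ? 1 : 0, "\n";    # 0 1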


    --
    Manfred
    Manfred Lotz, May 15, 2013
    #8
  9. Manfred Lotz (Guest)

    On Tue, 14 May 2013 21:27:49 +0100
    Ben Morrow <> wrote:

    >
    > Quoth Manfred Lotz <>:
    > > On Tue, 14 May 2013 01:10:59 +0200
    > > "Peter J. Holzer" <> wrote:
    > > > On 2013-05-13 12:51, Manfred Lotz <> wrote:
    > > > >
    > > > > use utf8;
    > > > > use Encode;
    > > > >
    > > > > my $string = 'Hä';
    > > >
    > > > This string is not UTF-8 in any useful sense. It consists of two
    > > > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
    > > > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
    > > > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
    > > > string has length 2, the latter has length 3.

    > [...]
    > >
    > > use utf8;
    > > use 5.010;
    > >
    > > use Encode qw( decode FB_CROAK );
    > >
    > > my $string = 'Hä'; # = 0x48c3a4
    > >
    > >
    > > my $decoded = decode('utf8', $string, FB_CROAK);
    > >
    > >
    > > Nevertheless, I'm confused. The above script, where 'Hä' is
    > > definitely 0x48c3a4 (verified by hexdump), croaks. Why?

    >
    > That is exactly what Peter was trying to explain. Because of the 'use
    > utf8', perl has already decoded the UTF-8 in the source code file into
    > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
    > instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
    > character, has ordinal 0xe4. This string, which happens to contain
    > only bytes though it could easily not have done, is not valid UTF-8,
    > so decode croaks.
    >


    Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
    unicode \x{e4}.

    Nevertheless the ä is a valid utf8 char.

    This means that the test to check for valid utf8 which Peter proposed
    is wrong as it croaks.

    The following snippet:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use utf8;

    use Test::utf8;

    binmode STDOUT, ":utf8";

    my $ae = 'ä';

    show_char($ae);

    sub show_char {
        my $ch = shift;

        print '-' x 80;
        print "\n";
        print "Char: $ch\n";
        is_valid_string($ch);     # check the string is valid
        is_sane_utf8($ch);        # check not double encoded

        # check the string has certain attributes
        is_flagged_utf8($ch);     # has utf8 flag set
        is_within_ascii($ch);     # only has ascii chars in it
        is_within_latin_1($ch);   # only has latin-1 chars in it
    }

    yields:
    --------------------------------------------------------------------------------
    Char: ä
    ok 1 - valid string test
    ok 2 - sane utf8
    ok 3 - flagged as utf8
    not ok 4 - within ascii
    # Failed test 'within ascii'
    # at ./unicode04.pl line 27.
    # Char 1 not ASCII (it's 228 dec / e4 hex)
    ok 5 - within latin-1
    # Tests were run but no plan was declared and done_testing() was not seen.

    which is what I would have assumed.


    --
    Manfred
    Manfred Lotz, May 15, 2013
    #9
  10. Manfred Lotz <> writes:
    > On Tue, 14 May 2013 21:27:49 +0100


    [...]

    > My mistake was that I believed that perl's internal representation is
    > utf8 instead of unicode code point.


    perl's internal representation is utf8 which is supposed to be decoded
    on demand as necessary. That's not an uncommon implementation choice
    for software supposed to interact with 'the real world' (here supposed
    to mean 'everything out there on the internet', have a look at the
    Mozilla Rust FAQ for a cogent and succinct explanation why this makes
    sense) but that's an implementation choice the people who presently
    work on this code strongly disagree with: They would prefer a model
    where, prior to each internal processing step, a pass over the
    complete input data has to be made in order to transform it into "the
    super-secret internal perl encoding" and after any internal processing
    has been completed, a second pass over all of the data has to be made
    in order to decode the 'super secrete internal perl encoding' into
    something which is useful for anyhing except being 'super secret' and
    'internal to Perl'.

    This sort-of makes sense when assuming that perl is an island located
    in strange waters and that it will usually keep mostly to itself
    (figuratively speaking) and it makes absolutely no sense when 'some perl
    code' performs one step of a multi-stage processing pipeline which may
    possibly even include other perl code (since not even 'output of perl'
    is supposed to be suitable to become 'input of perl').
    Rainer Weikusat, May 15, 2013
    #10
  11. Manfred Lotz (Guest)

    On Wed, 15 May 2013 13:27:05 +0100
    Ben Morrow <> wrote:

    >
    > Quoth Manfred Lotz <>:
    > > On Tue, 14 May 2013 21:27:49 +0100
    > > Ben Morrow <> wrote:
    > > >
    > > > That is exactly what Peter was trying to explain. Because of the
    > > > 'use utf8', perl has already decoded the UTF-8 in the source code
    > > > file into Unicode characters, so $string does *not* contain
    > > > "\x48\xc3\xa4": instead it contains "\x48\xe4". The e4 is because
    > > > 'ä', as a Unicode character, has ordinal 0xe4. This string, which
    > > > happens to contain only bytes though it could easily not have
    > > > done, is not valid UTF-8, so decode croaks.
    > > >

    > >
    > > Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the
    > > file) to unicode \x{e4}.
    > >
    > > Nevertheless the ä is a valid utf8 char.

    >
    > No, you're confused about the difference between 'UTF-8' and
    > 'Unicode'.
    >
    > Unicode is a big list of characters, with names and associated
    > semantics (like 'the lowercase of character 'A' is character 'a'').
    > Each of these characters has been given a number; some of these
    > numbers are >255, so it isn't possible to represent a string of
    > Unicode characters directly with a string of bytes, the way you can
    > with ASCII or Latin-1.
    >
    > This is a problem, given that files (on most systems) and TCP
    > connections and so on are defined as strings of bytes. To solve it,
    > various 'Unicode Transformation Formats' have been invented. The one
    > usually used on Unix systems and in Internet protocols is called
    > 'UTF-8'; if you feed a string of Unicode characters into a UTF-8
    > encoder you get a string of bytes out, and if you feed a string of
    > bytes into a UTF-8 decoder you either get a string of Unicode
    > characters or you get an error, if the string of bytes wasn't valid
    > UTF-8.
    >
    > Perl strings are always strings of Unicode characters[0]. If you want
    > to represent a string of bytes in Perl, you do so by using a string of
    > characters all of which happen to have an ordinal value less than 256.
    > Perl does not make any attempt to keep track of whether a given string
    > was supposed to be 'a string of bytes' or not: you have to do this
    > yourself[1].
    >
    > If you read a string from a file (without doing anything special to
    > the filehandle first), you will always get a string of bytes, because
    > the Unix file-reading APIs only support files that consist of strings
    > of bytes. If that string of bytes was supposed to be UTF-8, and you
    > want to manipulate it as a string of Unicode characters, you have to
    > pass it through Encode::decode. Since not all strings of bytes are
    > valid UTF-8, this function can fail; this is what Peter posted.
    >
    > If you write a string to a file (without...), the characters in the
    > string are written out directly as bytes. If they all have ordinals
    > below 256 this will effectively leave the file encoded in ISO8859-1,
    > since the first 256 Unicode characters have the same numbers as the
    > 256 ISO8859-1 characters. If you try to write a character with
    > ordinal 256 or greater, you will get a warning and stupid behaviour,
    > because there simply isn't any way to write a byte to a file with a
    > value greater than 255[2]. If you want to write UTF-8 to a file, you
    > have to encode your string of characters (which may have ordinals
    > >255) using Encode::encode, which will return a string with all
    > ordinals <256 which you can write to the file.
    >
    >
    > So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
    > characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
    >


    I did not decode it.

    > What are you actually trying to do here? That is, why do you think you
    > need to check if a string is valid UTF-8?
    >


    I'm not trying anything. However, the OP asked if there is any easy way
    to decide if a string is valid UTF-8. I answered him pointing to
    Encode::is_utf8(), which, as Peter rightly told me, is the wrong way.

    Peter said that $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
    is correct which I don't believe.

    Let me repeat from my last example: 'ä' is Unicode code point 0xe4 and
    UTF-8 0xc3a4. In the script file (which itself is a utf8-encoded file)
    ä is 0xc3a4. Why should perl kill this when I have specified 'use
    utf8;'? My only statement is that $ae in the script below is a valid
    utf8 string.


    #!/usr/bin/perl

    use strict;
    use warnings;

    use utf8;

    use Test::utf8;
    use Devel::Peek;

    binmode STDOUT, ":utf8";

    my $ae = 'ä';

    show_char($ae);

    sub show_char {
        my $ch = shift;

        print '-' x 80;
        print "\n";
        Dump $ch;
        print "Char: $ch\n";
        is_valid_string($ch);     # check the string is valid
        is_sane_utf8($ch);        # check not double encoded

        # check the string has certain attributes
        is_flagged_utf8($ch);     # has utf8 flag set
        is_within_ascii($ch);     # only has ascii chars in it
        is_within_latin_1($ch);   # only has latin-1 chars in it
    }


    then I get:


    --------------------------------------------------------------------------------
    SV = PV(0x1b86dd0) at 0x1bd7470
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
      CUR = 2
      LEN = 16
    Char: ä
    ok 1 - valid string test
    ok 2 - sane utf8
    ok 3 - flagged as utf8
    not ok 4 - within ascii
    # Failed test 'within ascii'
    # at ./unicode05.pl line 29.
    # Char 1 not ASCII (it's 228 dec / e4 hex)
    ok 5 - within latin-1
    # Tests were run but no plan was declared and done_testing() was not seen.


    This IMHO shows that $ae in the above script is a valid utf8 string.
    That is the only thing I am stating.

    What is your argument for saying $ae is not utf8? If you disagree, you
    should tell me where the above script is wrong, or how to interpret
    its output differently than I did.


    --
    Manfred
    Manfred Lotz, May 15, 2013
    #11
  12. Manfred Lotz (Guest)

    On Wed, 15 May 2013 15:37:14 +0100
    Ben Morrow <> wrote:

    >
    > Quoth Manfred Lotz <>:
    > > On Wed, 15 May 2013 13:27:05 +0100
    > > Ben Morrow <> wrote:
    > >
    > > > So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
    > > > characters, you get the string "\x48\xe4", which is *not* valid
    > > > UTF-8.

    > >
    > > I did not decode it.

    >
    > Yes you did. You passed Perl a file containing the bytes 0x27 0x48
    > 0xc3 0xa4 0x27 (that is, 'Hä', encoded in UTF-8), and you also said
    > 'use utf8;' which asks Perl to decode the rest of the file from
    > UTF-8. Perl did so, and so you ended up with the string "\x48\xe4"
    > which, though it happens to still be a string of bytes, is not valid
    > UTF-8.
    >
    > Until you understand this a bit better you should probably stay away
    > from the 'utf8' pragma. Write your source files in ASCII-only (that
    > is, don't use 8-bit ISO8859-1 characters either), and if you need
    > strings with Unicode in them, stick to "\x{...}" or "\N{...}".
    >
    > > > What are you actually trying to do here? That is, why do you
    > > > think you need to check if a string is valid UTF-8?

    > >
    > > I'm not trying anything. However, the OP asked if there is any easy
    > > way to decide if a string is valid UTF-8. I answered him pointing to
    > > Encode::is_utf8(), which, as Peter rightly told me, is the wrong way.

    >
    > I thought you were the OP... oh God, this is a George Mpouras thread.
    > He's in my killfile for a reason...
    >
    > > Peter said that $decoded = eval { decode('UTF-8', $string,
    > > FB_CROAK) }; is correct which I don't believe.
    > >
    > > Let me repeat from my last example: 'ä' is Unicode code point 0xe4
    > > and UTF-8 0xc3a4. In the script file (which itself is a utf8-encoded
    > > file) ä is 0xc3a4. Why should perl kill this when I have specified
    > > 'use utf8;'? My only statement is that $ae in the script below is a
    > > valid utf8 string.

    >
    > Take out the 'use utf8;' and run the program again. Does that give you
    > the result you expected?
    >


    In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
    stuff in my script which is outside of ASCII. The only requirement I
    have is that 'ä' won't change, whatever perl does with it internally.
    This works fine, so I have no complaints.



    > Now write the source file out in ISO8859-1 and run it again. Barring
    > bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
    > and the equivalent source file written in UTF-8 *with* 'use utf8' will
    > have exactly the same effect.
    >
    > (In principle you can rewrite the file in any encoding you like, add
    > an equivalent 'use encoding' directive, and get the same effect. In
    > practice the implementation of 'encoding' is rather buggy, so that
    > doesn't entirely work.)
    >
    > Perl does not remember that the string happened to come from a file
    > which happened to have been in UTF-8. All it knows is that the string
    > has two characters, "\x48\xe4", and that that string is *not* valid
    > UTF-8.
    >
    > > SV = PV(0x1b86dd0) at 0x1bd7470
    > > REFCNT = 1
    > > FLAGS = (PADMY,POK,pPOK,UTF8)
    > > PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
    > > CUR = 2
    > > LEN = 16

    > [...]
    > >
    > > This IMHO shows that $ae in above script is a valid utf8 string.
    > > This is the only thing I state.

    >
    > Which of these questions are you trying to answer?
    >
    > If I write this string to a file, will that file be valid UTF-8?


    This was not asked by the OP. But if I write $ae to stdout using
    binmode STDOUT, ":utf8" then I'm fine.
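
    (A minimal sketch of that output path; ':encoding(UTF-8)' is the
    stricter spelling of the ':utf8' layer used above:)

    use strict;
    use warnings;
    use utf8;                              # source saved as UTF-8

    my $ae = 'ä';                          # one character, U+00E4

    binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output
    print "$ae\n";                         # emits the bytes 0xC3 0xA4 0x0A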


    > Is the perl-internal SvUTF8 flag set?
    >


    I only tried to answer the question of whether a string is valid utf8.
    After the discussions we had, the new question seems to be whether that
    is a meaningful question at all. If the string contained something that
    is invalid utf8 (which can happen when there is some hex garbage),
    Emacs would have complained at the latest when I tried to save the
    buffer.



    --
    Manfred
    Manfred Lotz, May 15, 2013
    #12
  13. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Manfred Lotz <> writes:
    >> > On Tue, 14 May 2013 21:27:49 +0100

    >>
    >> [...]
    >>
    >> > My mistake was that I believed that perl's internal representation is
    >> > utf8 instead of unicode code point.

    >>
    >> perl's internal representation is utf8 which is supposed to be decoded
    >> on demand as necessary. That's not an uncommon implementation choice
    >> for software supposed to interact with 'the real world' (here supposed
    >> to mean 'everything out there on the internet', have a look at the
    >> Mozilla Rust FAQ for a cogent and succinct explanation why this makes
    >> sense) but that's an implementation choice the people who presently
    >> work on this code strongly disagree with: They would prefer a model
    >> where, prior to each internal processing step, a pass over the
    >> complete input data has to be made in order to transform it into "the
    >> super-secret internal perl encoding" and after any internal processing
    >> has been completed, a second pass over all of the data has to be made
    >> in order to decode the 'super-secret internal perl encoding' into
    >> something which is useful for anything except being 'super-secret' and
    >> 'internal to Perl'.

    >
    > You are confusing semantics with internal representation.


    I'm not 'confusing' anything. I described this (AFAICT) correctly from
    the abstract viewpoint a 'language user' is supposed to assume.

    BTW: This 'stock reply' to any kind of justified criticism (attack the
    person who wrote it as 'clueless' by substituting an alternate,
    more-or-less related topic) is really getting long in the tooth.

    > Encode is privy to perl's internal representation; it knows that if
    > you are encoding into (loose) "utf8" and the string is internally
    > represented as SvUTF8 then all it has to do is flip the flag, and
    > similarly that if you are encoding into "ISO8859-1" and the string
    > is not internally SvUTF8 that it doesn't need to do
    > anything. Decoding is not quite so simple, since it isn't safe to
    > assume input which was supposed to be in UTF-8 is actually valid,
    > but decoding a non-SvUTF8 string from "utf8" still doesn't do any
    > actual decoding, it just validates the string and copies it out.


    The idea that the programmer should be forced to do useless stuff but
    that otherwise useless code can be used to detect that the computer
    can skip this useless request doesn't exactly make sense: Despite
    being useless, the useless request code (uselessly) needs to be
    written, debugged and maintained and human time is much more expensive
    than computer time.

    [...]

    >> This sort-of makes sense when assuming that perl is an island located
    >> in strange waters and that it will usually keep mostly to itself
    >> (figuratively speaking) and it makes absolutely no sense when 'some perl
    >> code' performs one step of a multi-stage processing pipeline which may
    >> possibly even include other perl code (since not even 'output of perl'
    >> is supposed to be suitable to become 'input of perl').

    >
    > Unix IPC is defined in terms of bytes. There is no way to represent an
    > arbitrary Unicode character as a sequence of bytes without some sort of
    > encoding step.


    Quoting the document I already mentioned in the original posting:

    Why are strings UTF-8 by default? Why not UCS2 or UCS4?

    The str type is UTF-8 because we observe more text in the wild
    in this encoding -- particularly in network transmissions,
    which are endian-agnostic -- and we think it's best that the
    default treatment of I/O not involve having to recode
    codepoints in each direction.
    https://github.com/mozilla/rust/wik...strings-utf-8-by-default-why-not-ucs2-or-ucs4

    NB: That's the exact argument I made and I guess the correct 'open
    source response' should be that 'the Perl5 tribe' goes on the warpath
    in order to exterminate 'the Mozilla Rust tribe' and thus, rid the
    world of these "fundamentally mistaken" dissenting opinions ...
    Rainer Weikusat, May 15, 2013
    #13
  14. On Wed, 15 May 2013, Rainer Weikusat wrote:

    > The idea that the programmer should be forced to do useless stuff but
    > that otherwise useless code can be used to detect that the computer
    > can skip this useless request doesn't exactly make sense: Despite
    > being useless, the useless request code (uselessly) needs to be
    > written, debugged and maintained and human time is much more expensive
    > than computer time.


    The idea is to separate things that belong to the interface from those
    that do not. The latter things may change at any time or from one
    implementation to another without doing any harm to people who have only
    used the documented interface and not arbitrary implementation decisions
    of one particular implementation. This is a wise way to proceed.

    The internal representation of character strings in perl does *not* belong
    to the interface. If you happen to know how it is done (in particular that
    the same character string may have different representations in the same
    implementation), don't use it because it may change at any time without
    warning. This is so in all programming languages. If you try to exploit
    your knowledge of the bitwise representation of a Fortran real number your
    code may break when you go from one implementation to another.

    By the way, this kind of defined interface made it possible to expand perl
    strings beyond ISO-8859-1 without breaking existing applications.

    --
    Helmut Richter
    Helmut Richter, May 15, 2013
    #14
  15. Helmut Richter <> writes:
    > On Wed, 15 May 2013, Rainer Weikusat wrote:
    >> The idea that the programmer should be forced to do useless stuff but
    >> that otherwise useless code can be used to detect that the computer
    >> can skip this useless request doesn't exactly make sense: Despite
    >> being useless, the useless request code (uselessly) needs to be
    >> written, debugged and maintained and human time is much more expensive
    >> than computer time.

    >
    > The idea is to separate things that belong to the interface from those
    > that do not. The latter things may change at any time or from one
    > implementation to another without doing any harm to people who have only
    > used the documented interface and not arbitrary implementation decisions
    > of one particular implementation. This is a wise way to proceed.


    That's a completely general statement about "good programming
    practices". The sole purpose it is supposed to fulfil here is to
    suggest that an opinion about something which happens to conflict with
    some other opinion would somehow conflict with the mentioned 'good
    programming practice' without detailing how exactly.

    > The internal representation of character strings in perl does *not* belong
    > to the interface.


    The people who are presently concerned with this think that perl
    should have a 'super-secret internal character representation' which
    isn't useful for anything except 'perl-internal processing' (and not
    compatible with anything, including different instances of perl
    itself). As far as I know, the reason why they think this is that
    'implementation convenience' trumps 'real-world usability'. Other
    people working on similar stuff in other programming languages
    (including older versions of Perl) think that the character string
    representation used by $language should be documented and follow a
    'sensibly chosen existing convention' even if this might cause
    'implementation inconveniences'.

    [...]

    > If you try to exploit your knowledge of the bitwise representation
    > of a Fortran real number your code may break when you go from one
    > implementation to another.


    I have no knowledge about the 'bitwise representation of
    Fortran-anything', and 'Fortran floating-point data types' and
    'representation of unicode strings' are two very different things
    (in particular, I doubt that many web pages or other existing 'text
    files' contain 'Fortran floating point numbers' represented in
    binary). Apart from that, there are standards for representing
    'floating point values'.
    Rainer Weikusat, May 15, 2013
    #15
  16. On Wed, 15 May 2013, Rainer Weikusat wrote:

    > Helmut Richter <> writes:


    > > The idea is to separate things that belong to the interface from those
    > > that do not. The latter things may change at any time or from one
    > > implementation to another without doing any harm to people who have only
    > > used the documented interface and not arbitrary implementation decisions
    > > of one particular implementation. This is a wise way to proceed.


    > That's a completely general statement about "good programming
    > practices".


    Indeed. And it is meant as such.

    Implementing something in a way that the arbitrary choice of implementation
    details becomes part of the interface and thus can never again be changed
    would be a major blunder, and I am glad the perl implementers have not done
    so.

    > As far as I know, the reason why they think this is that
    > 'implementation convenience' trumps 'real-world usability'. Other
    > people working on similar stuff in other programming languages
    > (including older versions of Perl) think that the character string
    > representation used by $language should be documented and follow a
    > 'sensibly chosen existing convention' even if this might cause
    > 'implementation inconveniences'.


    Would you have found it better programming practice if, decades ago, perl had
    decided to publish as an interface that ISO-8859-1 (the most advanced
    character standard then), one byte per character, is the internal
    representation for all time? Or should they have taken such a decision at
    the time when character code points were restricted to 16 bits? Why should
    they do it just now?

    It is by no means mandatory to do it the way the perl people did. They could
    have chosen a *more* strict separation between character strings and byte
    strings so that all input/output is to and from byte strings, only byte
    strings can be decoded and only character strings can be encoded. This would
    have disallowed some programming mistakes people are now making. I, too, have
    doubts that they chose the best solution. But allowing the programmer access
    to the internal representation would have been a major design blunder.

    And what do you positively get from direct access to the internal
    representation? You talked about efficiency. Is it really a major efficiency
    issue to let perl decide by inspection of one bit whether the internal
    representation of a particular string happens to be already utf-8 so that the
    encoding/decoding is practically a null operation?

    --
    Helmut Richter
    Helmut Richter, May 16, 2013
    #16
  17. Dr.Ruud (Guest)

    On 15/05/2013 17:48, Manfred Lotz wrote:

    > In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
    > stuff in my script which is outside of ASCII.


    Sure, if your source file is "in 'utf8' format" (and of course a fully
    ASCII file is 'utf8' (and 'UTF-8') as well), then it does no harm.

    But still be aware of the consequences. If you save the file as latin1
    at some point, you break it, exactly because of the "use utf8;".


    I prefer my source files to be ASCII, so I use code like "\x{1234}".
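
    A minimal sketch of that ASCII-only style (charnames is a core
    module; it provides the \N{...} escapes):

    use strict;
    use warnings;
    use charnames ':full';

    # No 'use utf8' needed: the source itself contains only ASCII.
    my $ae1 = "H\x{E4}";                                   # 'H' plus U+00E4 by code point
    my $ae2 = "H\N{LATIN SMALL LETTER A WITH DIAERESIS}";  # same string, by name
    print $ae1 eq $ae2 ? "same\n" : "different\n";         # same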


    Now read what the module's documentation states:

    utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
    code [...]

    The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
    program text in the current lexical scope [...]

    Do not use this pragma for anything else than telling Perl that your
    script is written in UTF-8.

    --
    Ruud
    Dr.Ruud, May 16, 2013
    #17
  18. Helmut Richter <> writes:
    > On Wed, 15 May 2013, Rainer Weikusat wrote:
    >
    >> Helmut Richter <> writes:

    >
    >> > The idea is to separate things that belong to the interface from those
    >> > that do not. The latter things may change at any time or from one
    >> > implementation to another without doing any harm to people who have only
    >> > used the documented interface and not arbitrary implementation decisions
    >> > of one particular implementation. This is a wise way to proceed.

    >
    >> That's a completely general statement about "good programming
    >> practices".

    >
    > Indeed. And it is meant as such.


    Doesn't this 'delete content and reply to a more convenient
    fabrication' trick become boring over time?

    ,----
    | The sole purpose it is supposed to fulfil here is to
    | suggest that an opinion about something which happens to conflict with
    | some other opinion would somehow conflict with the mentioned 'good
    | programming practice' without detailing how exactly.
    `----

    What you should realize here is that this is not a dogma, i.e., a
    statement detailing an unquestionable truth made by a (by definition)
    infallible entity, but a generalized guideline supposed to be of
    _demonstrable_ practical usefulness in 'certain situations'.
    Consequently, quoting it as if it were akin to "Thou shalt not bear
    false witness against thy neighbour" is not sufficient as an argument
    in favor of or against anything, even more so when the actual existence
    of a 'violation of the principle' is merely implied but not described.
    Yet more so when the statement is demonstrably wrong: the Perl
    programming language is only a part of the 'interface' to perl; the
    other is the extension facility, which has direct access to everything
    inside the Perl core, including the mechanics of character handling.

    > Implementing something in a way that the arbitrary choice of implementation
    > details becomes part of the interface and thus can never again be changed
    > would be a major blunder,


    'The interface' itself is nothing but the cumulative effect of a set
    of perfectly arbitrary implementation choices: Every perl operator
    could have been implemented in a different way or not implemented at
    all.

    I'm going to ignore the rest of this text because you aren't telling
    the truth, you know that, I know that, and you know that I know that.
    Rainer Weikusat, May 16, 2013
    #18
  19. Rainer Weikusat <> writes:

    [...]

    > I'm going to ignore the rest of this text because you aren't telling
    > the truth, you know that, I know that, and you know that I know that.


    Addition: A discussion of the relative merits of either approach for
    handling 'extended characters' could be interesting. However, I'm not
    interested in trying to argue for both sides, i.e., against my own
    standpoint, and these "the Gods have chosen wisely and now it is for
    the mortals to obey" declarations of faith (or fandom) are pointless.
    Rainer Weikusat, May 16, 2013
    #19
  20. Manfred Lotz (Guest)

    On Thu, 16 May 2013 11:34:15 +0200
    "Dr.Ruud" <> wrote:

    > On 15/05/2013 17:48, Manfred Lotz wrote:
    >
    > > In my opinion it makes no sense to leave out 'use utf8;' if I have
    > > utf8 stuff in my script which is outside of ASCII.

    >
    > Sure, if your source file is "in 'utf8' format" (and of course a
    > fully ASCII file is 'utf8' (and 'UTF-8') as well), then it does no
    > harm.
    >
    > But still be aware of the consequences. If you save the file as
    > latin1 at some point, you break it, exactly because of the "use
    > utf8;".
    >


    Yep, this is true. However, Emacs wouldn't do this. :)


    >
    > I prefer my source files to be ASCII, so I use code like "\x{1234}".
    >
    >
    > Now read what the module's documentation states:
    >
    > utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
    > code [...]
    >
    > The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
    > program text in the current lexical scope [...]
    >
    > Do not use this pragma for anything else than telling Perl that your
    > script is written in UTF-8.
    >


    In any case, I would use the utf8 pragma only if I really need it.



    --
    Manfred
    Manfred Lotz, May 16, 2013
    #20
