George Mpouras
Is there any easy way to decide if a string is valid UTF-8?
Minimal example:
#! /usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
my $string = 'Hä';
Encode::is_utf8($string) or die "bad string";      # passes: the internal UTF8 flag is set
my $bad_string = 0x123456;                         # a number, not a UTF-8 string
Encode::is_utf8($bad_string) or die "bad string";  # dies: the flag is not set
On 13/5/2013 15:51, Manfred Lotz wrote:
Thanks, it is working. I had tried the same thing, but my mistake was
that I had not used the line "use utf8;"!
Minimal example:
#! /usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
my $string = 'Hä';
Encode::is_utf8($string) or die "bad string";
This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
string has length 2, the latter has length 3.
This tests whether the internal representation of the string is
utf-8-like, which you almost never want to know in a Perl program. It
also tells you whether the string has character semantics (unless you
use a rather new version of perl with the unicode_strings feature),
which is sometimes useful.
If you want to know whether a string is a correctly encoded UTF-8
sequence, try to decode it:
$decoded = eval { decode('UTF-8', $string, FB_CROAK) };
(decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
to catch that. All other check parameters are even less convenient).
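Wrapped up as a complete script, that check looks like this (a minimal
sketch; the sample byte strings and the is_valid_utf8 helper name are
mine):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $decoded = eval { decode('UTF-8', $octets, FB_CROAK) };
    return defined $decoded;
}

print is_valid_utf8("\x48\xc3\xa4") ? "valid\n" : "invalid\n";  # valid
print is_valid_utf8("\xc3\x28")     ? "valid\n" : "invalid\n";  # invalid: 0x28 is not a continuation byte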
Quoth Manfred Lotz: [...]
use utf8;
use Encode;
my $string = 'Hä';
This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
string has length 2, the latter has length 3.
use utf8;
use 5.010;
use Encode qw( decode FB_CROAK );
my $string = 'Hä'; # = 0x48c3a4
my $decoded = decode('utf8', $string, FB_CROAK);
Nevertheless, I'm confused. The above script, where 'Hä' is definitely
0x48c3a4 in the file (verified by hexdump), croaks. Why?
That is exactly what Peter was trying to explain. Because of the 'use
utf8', perl has already decoded the UTF-8 in the source code file into
Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
character, has ordinal 0xe4. This string, which happens to contain
only bytes (though it could easily not have done), is not valid UTF-8,
so decode croaks.
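You can watch that happen by printing the ordinals (a minimal sketch;
the output formats are mine):

use utf8;
use Encode qw(encode);

my $string = 'Hä';
printf "characters: %d (%vX)\n", length($string), $string;  # 2 (48.E4)

my $bytes = encode('UTF-8', $string);
printf "bytes:      %d (%vX)\n", length($bytes), $bytes;    # 3 (48.C3.A4)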
Manfred Lotz said:
My mistake was that I believed that perl's internal representation is
UTF-8 rather than Unicode code points.
Quoth Manfred Lotz: Ok, I agree that perl decodes 'ä' (which is utf8
x'c3a4' in the file) to unicode \x{e4}.
Nevertheless the ä is a valid utf8 char.
No, you're confused about the difference between 'UTF-8' and
'Unicode'.
Unicode is a big list of characters, with names and associated
semantics (like 'the lowercase of character 'A' is character 'a'').
Each of these characters has been given a number; some of these
numbers are >255, so it isn't possible to represent a string of
Unicode characters directly with a string of bytes, the way you can
with ASCII or Latin-1.
This is a problem, given that files (on most systems) and TCP
connections and so on are defined as strings of bytes. To solve it,
various 'Unicode Transformation Formats' have been invented. The one
usually used on Unix systems and in Internet protocols is called
'UTF-8'; if you feed a string of Unicode characters into a UTF-8
encoder you get a string of bytes out, and if you feed a string of
bytes into a UTF-8 decoder you either get a string of Unicode
characters or you get an error, if the string of bytes wasn't valid
UTF-8.
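For instance (a minimal sketch; the sample characters are mine):

use Encode qw(encode decode);

my $string = "H\x{E4}\x{20AC}";              # H, ä, €: one character is > 255

my $bytes = encode('UTF-8', $string);        # encoder: characters in, bytes out
printf "%d characters became %d bytes\n", length($string), length($bytes);

my $back = decode('UTF-8', $bytes);          # decoder: bytes in, characters out
print "round trip ok\n" if $back eq $string;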
Perl strings are always strings of Unicode characters[0]. If you want
to represent a string of bytes in Perl, you do so by using a string of
characters all of which happen to have an ordinal value less than 256.
Perl does not make any attempt to keep track of whether a given string
was supposed to be 'a string of bytes' or not: you have to do this
yourself[1].
If you read a string from a file (without doing anything special to
the filehandle first), you will always get a string of bytes, because
the Unix file-reading APIs only support files that consist of strings
of bytes. If that string of bytes was supposed to be UTF-8, and you
want to manipulate it as a string of Unicode characters, you have to
pass it through Encode::decode. Since not all strings of bytes are
valid UTF-8, this function can fail; this is what Peter posted.
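Put together, reading and validating a file might look like this (a
minimal sketch; the file name is hypothetical):

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

open my $fh, '<', 'input.txt' or die "open: $!";
my $octets = do { local $/; <$fh> };                 # a string of bytes

my $text = eval { decode('UTF-8', $octets, FB_CROAK) };
defined $text or die "input.txt is not valid UTF-8: $@";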
If you write a string to a file (without...), the characters in the
string are written out directly as bytes. If they all have ordinals
below 256 this will effectively leave the file encoded in ISO8859-1,
since the first 256 Unicode characters have the same numbers as the
256 ISO8859-1 characters. If you try to write a character with
ordinal 256 or greater, you will get a warning and stupid behaviour,
because there simply isn't any way to write a byte to a file with a
value greater than 255[2]. If you want to write UTF-8 to a file, you
have to encode your string of characters (which may have ordinals
>255) using Encode::encode, which will return a string with all
ordinals <256 which you can write to the file.
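For example (a minimal sketch; the file name is hypothetical):

use strict;
use warnings;
use Encode qw(encode);

my $string = "H\x{E4} \x{263A}";                   # characters, one well above 255

open my $fh, '>', 'output.txt' or die "open: $!";
print {$fh} encode('UTF-8', $string);              # every ordinal is now < 256
close $fh or die "close: $!";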
So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
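That is easy to verify (a minimal sketch):

use Encode qw(decode FB_CROAK);

my $bytes = "\x48\xc3\xa4";                        # valid UTF-8
my $chars = decode('UTF-8', $bytes, FB_CROAK);     # gives "\x48\xe4"

# "\x48\xe4" is not valid UTF-8: 0xe4 opens a three-byte sequence
# that never gets its continuation bytes.
my $again = eval { decode('UTF-8', $chars, FB_CROAK) };
print defined $again ? "still valid\n" : "second decode croaks\n";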
What are you actually trying to do here? That is, why do you think you
need to check if a string is valid UTF-8?
Yes you did. You passed Perl a file containing the bytes 0x22 0x48
0xc3 0xa4 0x22 (that is, "Hä", encoded in UTF-8), and you also said
'use utf8;' which asks Perl to decode the rest of the file from
UTF-8. Perl did so, and so you ended up with the string "\x48\xe4"
which, though it happens to still be a string of bytes, is not valid
UTF-8.
Until you understand this a bit better you should probably stay away
from the 'utf8' pragma. Write your source files in ASCII-only (that
is, don't use 8-bit ISO8859-1 characters either), and if you need
strings with Unicode in them, stick to "\x{...}" or "\N{...}".
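For example (a minimal sketch), both of these produce 'Hä' without a
single non-ASCII byte in the source file:

use charnames ':full';

my $by_number = "H\x{E4}";                                   # by code point
my $by_name   = "H\N{LATIN SMALL LETTER A WITH DIAERESIS}";  # by name
print "same string\n" if $by_number eq $by_name;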
I thought you were the OP... oh God, this is a George Mpouras thread.
He's in my killfile for a reason...
Take out the 'use utf8;' and run the program again. Does that give you
the result you expected?
Now write the source file out in ISO8859-1 and run it again. Barring
bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
and the equivalent source file written in UTF-8 *with* 'use utf8' will
have exactly the same effect.
(In principle you can rewrite the file in any encoding you like, add
an equivalent 'use encoding' directive, and get the same effect. In
practice the implementation of 'encoding' is rather buggy, so that
doesn't entirely work.)
Perl does not remember that the string happened to come from a file
which happened to have been in UTF-8. All it knows is that the string
has two characters, "\x48\xe4", and that that string is *not* valid
UTF-8.
SV = PV(0x1b86dd0) at 0x1bd7470
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
CUR = 2
LEN = 16 [...]
This IMHO shows that $ae in the above script is a valid utf8 string.
That is the only thing I am saying.
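(For reference, a sketch of how a dump like that is produced, assuming
$ae was assigned under 'use utf8' as in the script under discussion:)

use utf8;
use Devel::Peek;

my $ae = 'ä';    # under 'use utf8' this is the one character U+00E4
Dump($ae);       # prints the SV flags, including UTF8, and the PV shown above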
Which of these questions are you trying to answer?
- If I write this string to a file, will that file be valid UTF-8?
- Is the perl-internal SvUTF8 flag set?
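Those two questions can have different answers for the same data,
which utf8::is_utf8 (the internal-flag test) makes visible (a minimal
sketch):

use Encode qw(decode);

my $bytes = "\x48\xc3\xa4";            # valid UTF-8 on disk, SvUTF8 flag off
my $chars = decode('UTF-8', $bytes);   # "\x48\xe4": flag on, not valid UTF-8

printf "bytes: SvUTF8=%d\n", utf8::is_utf8($bytes) ? 1 : 0;  # 0
printf "chars: SvUTF8=%d\n", utf8::is_utf8($chars) ? 1 : 0;  # 1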
Ben Morrow said:
Quoth Rainer Weikusat:
Manfred Lotz said:
My mistake was that I believed that perl's internal representation is
UTF-8 rather than Unicode code points.
perl's internal representation is utf8 which is supposed to be decoded
on demand as necessary. That's not an uncommon implementation choice
for software supposed to interact with 'the real world' (here supposed
to mean 'everything out there on the internet', have a look at the
Mozilla Rust FAQ for a cogent and succinct explanation why this makes
sense) but that's an implementation choice the people who presently
work on this code strongly disagree with: They would prefer a model
where, prior to each internal processing step, a pass over the
complete input data has to be made in order to transform it into "the
super-secret internal perl encoding" and after any internal processing
has been completed, a second pass over all of the data has to be made
in order to decode the 'super-secret internal perl encoding' into
something which is useful for anything except being 'super secret' and
'internal to Perl'.
You are confusing semantics with internal representation.
Encode is privy to perl's internal representation; it knows that if
you are encoding into (loose) "utf8" and the string is internally
represented as SvUTF8 then all it has to do is flip the flag, and
similarly that if you are encoding into "ISO8859-1" and the string
is not internally SvUTF8 that it doesn't need to do
anything. Decoding is not quite so simple, since it isn't safe to
assume input which was supposed to be in UTF-8 is actually valid,
but decoding a non-SvUTF8 string from "utf8" still doesn't do any
actual decoding, it just validates the string and copies it out.
Unix IPC is defined in terms of bytes. There is no way to represent an
arbitrary Unicode character as a sequence of bytes without some sort of
encoding step.
The idea that the programmer should be forced to do useless stuff but
that otherwise useless code can be used to detect that the computer
can skip this useless request doesn't exactly make sense: Despite
being useless, the useless request code (uselessly) needs to be
written, debugged and maintained and human time is much more expensive
than computer time.
Helmut Richter said:
The idea is to separate things that belong to the interface from those
that do not. The latter things may change at any time or from one
implementation to another without doing any harm to people who have only
used the documented interface and not arbitrary implementation decisions
of one particular implementation. This is a wise way to proceed.
The internal representation of character strings in perl does *not* belong
to the interface.
If you try to exploit your knowledge of the bitwise representation
of a Fortran real number your code may break when you go from one
implementation to another.
That's a completely general statement about "good programming
practices".
As far as I know, the reason why they think this is that
'implementation convenience' trumps 'real-world usability'. Other
people working on similar stuff in other programming languages
(including older versions of Perl) think that the character string
representation used by $language should be documented and follow a
'sensibly chosen existing convention' even if this might cause
'implementation inconveniences'.
In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
stuff in my script which is outside of ASCII.
Helmut Richter said:
Indeed. And it is meant as such.
Implementing something in a way that the arbitrary choice of implementation
details becomes part of the interface and thus can never again be changed
would be a major blunder.
I'm going to ignore the rest of this text because you aren't telling
the truth, you know that, I know that, and you know that I know that.
Sure, if your source file is "in 'utf8' format" (and of course a
fully ASCII file is 'utf8' (and 'UTF-8') as well), then it shouldn't
do any harm.
But still be aware of the consequences. If you save the file as
latin1 at some point, you break it, exactly because of the "use
utf8;".
I prefer my source files to be ASCII, so I use code like "\x{1234}".
Now read what the module's documentation states:
utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
code [...]
The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope [...]
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.