George Mpouras
Is there any easy way to decide if a string is valid UTF-8?
Minimal example:
#! /usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
my $string = 'Hä';
Encode::is_utf8($string) or die "bad string";      # passes: the internal UTF8 flag is set
my $bad_string = 0x123456;                         # a number, not a UTF-8 string
Encode::is_utf8($bad_string) or die "bad string";  # dies: the flag is not set
On 13/5/2013 15:51, Manfred Lotz wrote:
Thanks, it is working. I had tried the same thing, but my mistake was
that I had not used the line "use utf8;"!
Minimal example:
#! /usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
my $string = 'Hä';
Encode::is_utf8($string) or die "bad string";
This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
string has length 2, the latter has length 3.
This tests whether the internal representation of the string is
utf-8-like, which you almost never want to know in a Perl program. It
also tells you whether the string has character semantics (unless you
use a rather new version of perl with the unicode_strings feature),
which is sometimes useful.
If you want to know whether a string is a correctly encoded UTF-8
sequence, try to decode it:
$decoded = eval { decode('UTF-8', $string, FB_CROAK) };
(decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
to catch that. All other check parameters are even less convenient).
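Wrapped up as a complete script, that check looks like this (a minimal
sketch; the sample byte strings and the is_valid_utf8 helper name are
mine):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $decoded = eval { decode('UTF-8', $octets, FB_CROAK) };
    return defined $decoded;
}

print is_valid_utf8("\x48\xc3\xa4") ? "valid\n" : "invalid\n";  # valid
print is_valid_utf8("\xc3\x28")     ? "valid\n" : "invalid\n";  # invalid: 0x28 is not a continuation byte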
Quoth Manfred Lotz: [...]
use utf8;
use Encode;
my $string = 'Hä';
This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
string has length 2, the latter has length 3.
use utf8;
use 5.010;
use Encode qw( decode FB_CROAK );
my $string = 'Hä'; # = 0x48c3a4
my $decoded = decode('utf8', $string, FB_CROAK);
Nevertheless, I'm confused. The above script, where 'Hä' is definitely
0x48c3a4 in the file (verified by hexdump), croaks. Why?
That is exactly what Peter was trying to explain. Because of the 'use
utf8', perl has already decoded the UTF-8 in the source code file into
Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
character, has ordinal 0xe4. This string, which happens to contain
only bytes (though it could easily not have done), is not valid UTF-8,
so decode croaks.
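You can watch that happen by printing the ordinals (a minimal sketch;
the output formats are mine):

use utf8;
use Encode qw(encode);

my $string = 'Hä';
printf "characters: %d (%vX)\n", length($string), $string;  # 2 (48.E4)

my $bytes = encode('UTF-8', $string);
printf "bytes:      %d (%vX)\n", length($bytes), $bytes;    # 3 (48.C3.A4)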
Manfred Lotz said:
My mistake was that I believed that perl's internal representation is
UTF-8 rather than Unicode code points.
Quoth Manfred Lotz: Ok, I agree that perl decodes 'ä' (which is utf8
x'c3a4' in the file) to unicode \x{e4}.
Nevertheless the ä is a valid utf8 char.
No, you're confused about the difference between 'UTF-8' and
'Unicode'.
Unicode is a big list of characters, with names and associated
semantics (like 'the lowercase of character 'A' is character 'a'').
Each of these characters has been given a number; some of these
numbers are >255, so it isn't possible to represent a string of
Unicode characters directly with a string of bytes, the way you can
with ASCII or Latin-1.
This is a problem, given that files (on most systems) and TCP
connections and so on are defined as strings of bytes. To solve it,
various 'Unicode Transformation Formats' have been invented. The one
usually used on Unix systems and in Internet protocols is called
'UTF-8'; if you feed a string of Unicode characters into a UTF-8
encoder you get a string of bytes out, and if you feed a string of
bytes into a UTF-8 decoder you either get a string of Unicode
characters or you get an error, if the string of bytes wasn't valid
UTF-8.
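For instance (a minimal sketch; the sample characters are mine):

use Encode qw(encode decode);

my $string = "H\x{E4}\x{20AC}";              # H, ä, €: one character is > 255

my $bytes = encode('UTF-8', $string);        # encoder: characters in, bytes out
printf "%d characters became %d bytes\n", length($string), length($bytes);

my $back = decode('UTF-8', $bytes);          # decoder: bytes in, characters out
print "round trip ok\n" if $back eq $string;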
Perl strings are always strings of Unicode characters[0]. If you want
to represent a string of bytes in Perl, you do so by using a string of
characters all of which happen to have an ordinal value less than 256.
Perl does not make any attempt to keep track of whether a given string
was supposed to be 'a string of bytes' or not: you have to do this
yourself[1].
If you read a string from a file (without doing anything special to
the filehandle first), you will always get a string of bytes, because
the Unix file-reading APIs only support files that consist of strings
of bytes. If that string of bytes was supposed to be UTF-8, and you
want to manipulate it as a string of Unicode characters, you have to
pass it through Encode::decode. Since not all strings of bytes are
valid UTF-8, this function can fail; this is what Peter posted.
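Put together, reading and validating a file might look like this (a
minimal sketch; the file name is hypothetical):

use strict;
use warnings;
use Encode qw(decode FB_CROAK);

open my $fh, '<', 'input.txt' or die "open: $!";
my $octets = do { local $/; <$fh> };                 # a string of bytes

my $text = eval { decode('UTF-8', $octets, FB_CROAK) };
defined $text or die "input.txt is not valid UTF-8: $@";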
If you write a string to a file (without...), the characters in the
string are written out directly as bytes. If they all have ordinals
below 256 this will effectively leave the file encoded in ISO8859-1,
since the first 256 Unicode characters have the same numbers as the
256 ISO8859-1 characters. If you try to write a character with
ordinal 256 or greater, you will get a warning and stupid behaviour,
because there simply isn't any way to write a byte to a file with a
value greater than 255[2]. If you want to write UTF-8 to a file, you
have to encode your string of characters (which may have ordinals
>255) using Encode::encode, which will return a string with all
ordinals <256 which you can write to the file.
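For example (a minimal sketch; the file name is hypothetical):

use strict;
use warnings;
use Encode qw(encode);

my $string = "H\x{E4} \x{263A}";                   # characters, one well above 255

open my $fh, '>', 'output.txt' or die "open: $!";
print {$fh} encode('UTF-8', $string);              # every ordinal is now < 256
close $fh or die "close: $!";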
So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
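That is easy to verify (a minimal sketch):

use Encode qw(decode FB_CROAK);

my $bytes = "\x48\xc3\xa4";                        # valid UTF-8
my $chars = decode('UTF-8', $bytes, FB_CROAK);     # gives "\x48\xe4"

# "\x48\xe4" is not valid UTF-8: 0xe4 opens a three-byte sequence
# that never gets its continuation bytes.
my $again = eval { decode('UTF-8', $chars, FB_CROAK) };
print defined $again ? "still valid\n" : "second decode croaks\n";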
What are you actually trying to do here? That is, why do you think you
need to check if a string is valid UTF-8?
Yes you did. You passed Perl a file containing the bytes 0x22 0x48
0xc3 0xa4 0x22 (that is, "Hä", encoded in UTF-8), and you also said
'use utf8;' which asks Perl to decode the rest of the file from
UTF-8. Perl did so, and so you ended up with the string "\x48\xe4"
which, though it happens to still be a string of bytes, is not valid
UTF-8.
Until you understand this a bit better you should probably stay away
from the 'utf8' pragma. Write your source files in ASCII-only (that
is, don't use 8-bit ISO8859-1 characters either), and if you need
strings with Unicode in them, stick to "\x{...}" or "\N{...}".
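For example (a minimal sketch), both of these produce 'Hä' without a
single non-ASCII byte in the source file:

use charnames ':full';

my $by_number = "H\x{E4}";                                   # by code point
my $by_name   = "H\N{LATIN SMALL LETTER A WITH DIAERESIS}";  # by name
print "same string\n" if $by_number eq $by_name;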
I thought you were the OP... oh God, this is a George Mpouras thread.
He's in my killfile for a reason...
Take out the 'use utf8;' and run the program again. Does that give you
the result you expected?
Now write the source file out in ISO8859-1 and run it again. Barring
bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
and the equivalent source file written in UTF-8 *with* 'use utf8' will
have exactly the same effect.
(In principle you can rewrite the file in any encoding you like, add
an equivalent 'use encoding' directive, and get the same effect. In
practice the implementation of 'encoding' is rather buggy, so that
doesn't entirely work.)
Perl does not remember that the string happened to come from a file
which happened to have been in UTF-8. All it knows is that the string
has two characters, "\x48\xe4", and that that string is *not* valid
UTF-8.
SV = PV(0x1b86dd0) at 0x1bd7470
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
CUR = 2
LEN = 16 [...]
This IMHO shows that $ae in the above script is a valid utf8 string.
That is the only thing I am saying.
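(For reference, a sketch of how a dump like that is produced, assuming
$ae was assigned under 'use utf8' as in the script under discussion:)

use utf8;
use Devel::Peek;

my $ae = 'ä';    # under 'use utf8' this is the one character U+00E4
Dump($ae);       # prints the SV flags, including UTF8, and the PV shown above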
Which of these questions are you trying to answer?
- If I write this string to a file, will that file be valid UTF-8?
- Is the perl-internal SvUTF8 flag set?
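Those two questions can have different answers for the same data,
which utf8::is_utf8 (the internal-flag test) makes visible (a minimal
sketch):

use Encode qw(decode);

my $bytes = "\x48\xc3\xa4";            # valid UTF-8 on disk, SvUTF8 flag off
my $chars = decode('UTF-8', $bytes);   # "\x48\xe4": flag on, not valid UTF-8

printf "bytes: SvUTF8=%d\n", utf8::is_utf8($bytes) ? 1 : 0;  # 0
printf "chars: SvUTF8=%d\n", utf8::is_utf8($chars) ? 1 : 0;  # 1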
Ben Morrow said:
Quoth Rainer Weikusat:
Manfred Lotz said:
My mistake was that I believed that perl's internal representation is
UTF-8 rather than Unicode code points.
perl's internal representation is utf8 which is supposed to be decoded
on demand as necessary. That's not an uncommon implementation choice
for software supposed to interact with 'the real world' (here supposed
to mean 'everything out there on the internet', have a look at the
Mozilla Rust FAQ for a cogent and succinct explanation why this makes
sense) but that's an implementation choice the people who presently
work on this code strongly disagree with: They would prefer a model
where, prior to each internal processing step, a pass over the
complete input data has to be made in order to transform it into "the
super-secret internal perl encoding" and after any internal processing
has been completed, a second pass over all of the data has to be made
in order to decode the 'super-secret internal perl encoding' into
something which is useful for anything except being 'super secret' and
'internal to Perl'.
You are confusing semantics with internal representation.
Encode is privy to perl's internal representation; it knows that if
you are encoding into (loose) "utf8" and the string is internally
represented as SvUTF8 then all it has to do is flip the flag, and
similarly that if you are encoding into "ISO8859-1" and the string
is not internally SvUTF8 that it doesn't need to do
anything. Decoding is not quite so simple, since it isn't safe to
assume input which was supposed to be in UTF-8 is actually valid,
but decoding a non-SvUTF8 string from "utf8" still doesn't do any
actual decoding, it just validates the string and copies it out.
Unix IPC is defined in terms of bytes. There is no way to represent an
arbitrary Unicode character as a sequence of bytes without some sort of
encoding step.
The idea that the programmer should be forced to do useless stuff but
that otherwise useless code can be used to detect that the computer
can skip this useless request doesn't exactly make sense: Despite
being useless, the useless request code (uselessly) needs to be
written, debugged and maintained and human time is much more expensive
than computer time.
Helmut Richter said:
The idea is to separate things that belong to the interface from those
that do not. The latter things may change at any time or from one
implementation to another without doing any harm to people who have only
used the documented interface and not arbitrary implementation decisions
of one particular implementation. This is a wise way to proceed.
The internal representation of character strings in perl does *not* belong
to the interface.
If you try to exploit your knowledge of the bitwise representation
of a Fortran real number your code may break when you go from one
implementation to another.
That's a completely general statement about "good programming
practices".
As far as I know, the reason why they think this is that
'implementation convenience' trumps 'real-world usability'. Other
people working on similar stuff in other programming languages
(including older versions of Perl) think that the character string
representation used by $language should be documented and follow a
'sensibly chosen existing convention' even if this might cause
'implementation inconveniences'.
In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
stuff in my script which is outside of ASCII.
Helmut Richter said:
Indeed. And it is meant as such.
Implementing something in a way that the arbitrary choice of implementation
details becomes part of the interface and thus can never again be changed
would be a major blunder.
I'm going to ignore the rest of this text because you aren't telling
the truth, you know that, I know that, and you know that I know that.
Sure, if your source file is "in 'utf8' format" (and of course a
fully ASCII file is 'utf8' (and 'UTF-8') as well), then it shouldn't
do any harm.
But still be aware of the consequences. If you save the file as
latin1 at some point, you break it, exactly because of the "use
utf8;".
I prefer my source files to be ASCII, so I use code like "\x{1234}".
Now read what the module's documentation states:
utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
code [...]
The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope [...]
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.