XML::LibXML UTF-8 toString() -vs- nodeValue()

MaggotChild

I need to send data across the network and I'm confused by the
UTF-8ness of the values returned by toString() and nodeValue().

I know that toString() will give me what I need (octets, regardless of
the underlying encoding), yet I can't understand how the character is
represented by the output of each method.

For example (note that the mangled character is the starting single
curly quote):

use strict;
use warnings;
use XML::LibXML;
use Encode;

$\="\n";

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file(shift);
my $node = ($dom->getElementsByTagName('title'))[0];

print $dom->actualEncoding;
print 'is utf-8: ' . Encode::is_utf8($node->firstChild->nodeValue,1);
print "node value";
print $node->firstChild->nodeValue;
print "to string";
my $txt = $node->firstChild->toString(0,1);
print $txt;
print 'is utf-8: ' . Encode::is_utf8($txt,1);

Outputs:

UTF-8
is utf-8: 1
txt content
Wide character in print at ./utf8-lib-xml.pl line 18.
‘ER’
to string
‘ER’
is utf-8:


Why is toString no longer UTF-8?

And, since the wide char has been broken down into octets, how does
one know that it's composed of 2 octets when it's interpreted on the
receiving end (or even in my terminal)?

On the surface it seems as if I'd be breaking the UTF-8.

Is the toString() method the preferred way to send the value of a
TextNode across the network?
 
Ben Bullock

Wide character in print at ./utf8-lib-xml.pl line 18.

You need to use binmode like this:

binmode STDOUT, ":encoding(cp932)";

etc. If you don't tell Perl what you want it to do with the wide
characters, it moans about wide characters to STDERR.
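
For instance, a minimal sketch (using UTF-8 rather than cp932, and a
hard-coded euro sign rather than your XML data):

#!/usr/bin/perl
use strict;
use warnings;

my $euro = "\x{20AC}";   # one character, code point U+20AC

# Without an encoding layer, printing this warns "Wide character in print".
# With binmode, Perl knows how to serialize the character into octets.
binmode STDOUT, ":encoding(UTF-8)";
print $euro, "\n";       # prints the euro sign, no warning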
‘ER’
to string
‘ER’
is utf-8:


Why is toString no longer UTF-8?

It is UTF-8, but it no longer has the flag telling Perl it is UTF-8.

You can get a similar effect using Encode's encode_utf8 and decode_utf8.
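
A minimal sketch of that round trip (the euro sign is just an example
character):

use strict;
use warnings;
use Encode;

my $chars  = "\x{20AC}";                   # character string, flag on
my $octets = Encode::encode_utf8($chars);  # byte string, flag off
my $back   = Encode::decode_utf8($octets); # character string again

print length($chars),  "\n";  # 1
print length($octets), "\n";  # 3 (the octets e2 82 ac)
print length($back),   "\n";  # 1
print Encode::is_utf8($octets) ? "flag on\n" : "flag off\n";  # flag off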
And, since the wide char has been broken down into octets, how does
one know that it's composed of 2 octets when it's interpreted on the
receiving end (or even in my terminal)?

I'm not sure I understand your question. The length of a UTF-8 character
is unambiguous provided you know that it's UTF-8.
On the surface it seems as if I'd be breaking the UTF-8.

There are two different and yet confusingly similar things here, Perl's
internal representation of Unicode and UTF-8 encoded text which Perl has
not been told is UTF-8.
Is the toString() method the preferred way to send the value of a
TextNode across the network?

I have no idea.
 
MaggotChild

It is UTF-8, but it no longer has the flag telling Perl it is UTF-8.

Hi, thanks for your reply.

How can one tell that the string is UTF-8 without checking Perl's utf8
flag?
Or is this not possible?
You can get a similar effect using Encode's encode_utf8 and decode_utf8.

OK. The receiver would have to be UTF-8 aware, unlike my terminal, and
check the encoding info in each byte (or the first), sucking up any
additional bytes and building the character accordingly.
I'm not sure I understand your question. The length of a UTF-8 character
is unambiguous provided you know that it's UTF-8.

I think I'm confusing the character representation (comprising more
than one byte, causing the wide char error) and the toString() (or
binmode) representation, which outputs the actual code points.
 
Ben Bullock

How can one tell that the string is UTF-8 without checking Perl's utf8
flag?
Or is this not possible?

You can check whether a string is correctly encoded as UTF-8 by using
the "decode_utf8" routine of the Encode module. If it is not UTF-8 you will
get an error message.

If you have a random stream of bytes and you want to check whether they
might be correct UTF-8 or some other encoding, you could try Encode::Guess.
(You can read my review of this module at
"cpanratings.perl.org/dist/Encode". My cpanratings name is "BKB".).

If you want to know whether Perl has marked a string as being UTF-8, in
other words whether Perl thinks that a string is UTF-8 or not, the only
way of doing this is to look at Perl's flag, using utf8::is_utf8 or
the similar routine in Encode.

The important point here is that there are three types of strings:

1) Strings which are not UTF-8, e.g. strings in Latin-1 or CP932, etc.
2) Strings which are UTF-8 but Perl is not aware of this
3) Strings which are UTF-8 and Perl is aware of this
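
To tell type 1) apart from types 2) and 3) in a byte string, a sketch
along the lines described above (the helper name is made up):

use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
sub looks_like_utf8 {
    my ($octets) = @_;   # copy, since strict decoding may modify its argument
    return defined eval { Encode::decode_utf8($octets, Encode::FB_CROAK) };
}

print looks_like_utf8("\xE2\x82\xAC") ? "valid\n" : "invalid\n";  # valid
print looks_like_utf8("\xE2\x82")     ? "valid\n" : "invalid\n";  # truncated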
OK. The receiver would have to be UTF-8 aware, unlike my terminal, and
check the encoding info in each byte (or the first), sucking up any
additional bytes and building the character accordingly.

I use Cygwin on a Japanese-language Microsoft Windows and I have to
convert every output to the terminal using

binmode STDOUT, ":encoding(cp932)";

Also I have to convert every file name using something like

open my $input, "<:utf8", encode("cp932", "filename")

I'm not too sure why Perl is not set up to use the Windows "wchar" versions
of file system functions when the strings contain internally-encoded UTF-8,
but unfortunately Perl programmers on Windows are still stuck with using
code pages.
I think I'm confusing the character representation (comprising more
than one byte, causing the wide char error) and the toString() (or
binmode) representation, which outputs the actual code points.

You have a string of type 3) described above, but Perl does not know
what to do with it.

 
Peter J. Holzer

You can check whether a string is correctly encoded as UTF-8 by using
the "decode_utf8" routine of the Encode module. If it is not UTF-8 you will
get an error message.

If you have a random stream of bytes and you want to check whether they
might be correct UTF-8 or some other encoding, you could try Encode::Guess.
(You can read my review of this module at
"cpanratings.perl.org/dist/Encode". My cpanratings name is "BKB".).

If you want to know whether Perl has marked a string as being UTF-8, in
other words whether Perl thinks that a string is UTF-8 or not, the only
way of doing this is to look at Perl's flag, using utf8::is_utf8 or
the similar routine in Encode.

The (stupidly named) utf8 flag doesn't indicate whether perl thinks that
a string is UTF-8 or not. It indicates that the string is a character
string, i.e., each element of the string is a character, not a byte.

(The utf8 flag is called the utf8 flag because character strings are
encoded internally as UTF-8. But as a Perl programmer you don't have to
know that, just like you don't have to know how numbers are stored or
how the hash algorithm works. You just have to know that characters are
32-bit values and that ord() of a character returns the unicode code
point.)
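
For example, a tiny sketch with a code point above 0xFFFF:

use strict;
use warnings;

my $s = "\x{10400}";   # one character outside the 16-bit range
printf "length=%d ord=0x%X\n", length($s), ord($s);   # length=1 ord=0x10400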

hp
 
Ben Bullock

The (stupidly named) utf8 flag doesn't indicate whether perl thinks that
a string is UTF-8 or not.

Yes, it does. If I have a string

my $p = "XYZ";

where "XYZ" are three bytes of a single character encoded as UTF-8 in the
text of the program (ie I have asked my text editor to save in the UTF-8
form), then if I have the line

use utf8;

at the top of my program, Perl will set the "utf8 flag" on $p and will
act as if this multibyte UTF-8 character is a single item, for example
length ($p) == 1. If I do not have the use utf8; line in the program,
Perl will not set the utf8 flag on $p and I will get length ($p) == 3
instead of 1. Thus the string may actually be UTF-8 even when Perl's flag
is not set.
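
A complete sketch of that (save the file as UTF-8; the euro sign stands
in for "XYZ"):

#!/usr/bin/perl
use utf8;       # comment this out and length() reports 3, not 1
use strict;
use warnings;

my $p = "€";    # three octets e2 82 ac in the source file
print length($p), "\n";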
 
Peter J. Holzer

Yes, it does. If I have a string

my $p = "XYZ";

where "XYZ" are three bytes of a single character encoded as UTF-8 in the
text of the program (ie I have asked my text editor to save in the UTF-8
form), then if I have the line

use utf8;

at the top of my program, Perl will set the "utf8 flag" on $p

Right. If you put "use utf8" at the top of your program you tell the perl
compiler that the *source code* of your program is written in UTF-8.
This affects variable names (you can now use a variable like "$käse" or
$κόσμε or $こんにちは) and string constants. The latter are
converted from UTF-8 to Perl's internal character string format. Since
these are now character strings, the (still stupidly named) utf8 flag is
set.

use encoding "X" has a similar effect: You tell the compiler that the
source code is in encoding X, and it will convert any string constant
from encoding X to Perl's internal character string format. Again, the
fact that the string is now a character string (and not a byte string)
will be signified by the utf8 flag, even though the string in the source
file was not UTF-8.
and will act as if this multibyte UTF-8 character is a single item,
for example length ($p) == 1.

And this shows that the string is *not* UTF-8. UTF-8 is by definition a
serialization format for unicode. The Unicode character U+20AC, for
example, is serialized into 3 octets: e2 82 ac. So if the string "€" was
a UTF-8 string, then

length("€") == 3,
ord(substr("€", 0, 1)) == 0xE2,
ord(substr("€", 1, 1)) == 0x82,
ord(substr("€", 2, 1)) == 0xAC

would all be true. However, if the string was parsed from a source with use
utf8 or use encoding in effect, or read from a file with an encoding
layer, or by some other method which yields Perl character strings, all
of these are false. Instead,

length("€") == 1,
ord(substr("€", 0, 1)) == 0x20AC

are true. So the string is a Unicode string, but it is not a UTF-8
string.
If I do not have the use utf8; line in the program, Perl will not set
the utf8 flag on $p and I will get length ($p) == 3 instead of 1.

And then you really have a UTF-8 string.
Thus the string may actually be UTF-8 even when Perl's flag
is not set.

It can *only* be UTF-8 if the flag is *not* set, unless your program is
broken or you are dealing with double-encoded input.

hp
 
sln

Right. If you put "use utf8" at the top of your program you tell the perl
compiler that the *source code* of your program is written in UTF-8.
This affects variable names (you can now use a variable like "$käse" or
$κόσμε or $こんにちは) and string constants. The latter are
converted from UTF-8 to Perl's internal character string format. Since
these are now character strings, the (still stupidly named) utf8 flag is
set.

use encoding "X" has a similar effect: You tell the compiler that the
source code is in encoding X,

How do you tell the compiler what encoding to use if the encoding
declaration itself can't be decoded? Do you have to encode the encoding?
Where does it stop? I mean, where does it begin?

-sln
 
sln

How do you tell the compiler what encoding to use if the encoding
declaration itself can't be decoded? Do you have to encode the encoding?
Where does it stop? I mean, where does it begin?

-sln

Utf-16 and utf-32 have merits. Unfortunately, Perl won't do that.
Imagine Perl doing utf-32. Why then you could do Regular Expressions on
a binary stream. Byte is ok, int is slower, but rx on binary has merits.

Oh no, couldn't have that, no no...
UTF-8 it is then, can't have other choices.

-sln
 
Peter J. Holzer

How do you tell the compiler what encoding to use if the encoding
declaration itself can't be decoded?

If the encoding cannot be decoded, then the compiler will complain and
stop:

encoding: Unknown encoding 'X' at foo2 line 1
BEGIN failed--compilation aborted at foo2 line 1.

Naturally, you can only use encodings which are known to the compiler.
There are quite a lot of them, so I don't think this is a serious
problem.

Or do you mean what happens if the compiler doesn't even get to the "use
encoding 'X'" line because that line itself is encoded? This is only a
problem if you use an encoding which isn't a superset of US-ASCII (or
EBCDIC on some platforms). So you can't use UTF-16, because the extra
0x00 octets would confuse the parser, which expects US-ASCII, and you
can't use EBCDIC on a US-ASCII platform, but you can use UTF-8,
ISO-8859-X, BIG5, euc-jp, as long as you use only ASCII characters
before the use directive (which is easy since that should be the first
line (after the shebang) anyway).

hp
 
Peter J. Holzer

Utf-16 and utf-32 have merits. Unfortunately, Perl won't do that.

Actually, for all practical purposes, Perl character strings *are*
UTF-32. Each character is a 32-bit value.

Both UTF-16 and UTF-32 are supported for I/O, of course.
Imagine Perl doing utf-32.

I don't have to imagine that, it does.
Why then you could do Regular Expressions on
a binary stream.

You can't do Regexps on streams, whether binary or not (would be nice if
we could).

You can do Regexps on *strings*, whether they are binary or text.

I don't know what that has to do with UTF-32. Binary strings consist of
octets. Treating them as UTF-32 is almost always a mistake.

hp
 
sln

Actually, for all practical purposes, Perl character strings *are*
UTF-32. Each character is a 32-bit value.

Both UTF-16 and UTF-32 are supported for I/O, of course.


I don't have to imagine that, it does.


You can't do Regexps on streams, whether binary or not (would be nice if
we could).

You can do Regexps on *strings*, whether they are binary or text.

I don't know what that has to do with UTF-32. Binary strings consist of
octets. Treating them as UTF-32 is almost always a mistake.

hp

If you can't do regexes on streams, then you can't parse XML.
I think you're missing what Unicode is.
I have already posted, some time back, pack/unpack on regex streams.
I can repost the code if you need. Or you can read a few docs on it.
I doubt you'll capitulate no matter what.

perlunicode.html and some others.

-sln
 
sln

If the encoding cannot be decoded, then the compiler will complain and
stop:

encoding: Unknown encoding 'X' at foo2 line 1
BEGIN failed--compilation aborted at foo2 line 1.

Naturally, you can only use encodings which are known to the compiler.
There are quite a lot of them, so I don't think this is a serious
problem.

Or do you mean what happens if the compiler doesn't even get to the "use
encoding 'X'" line because that line itself is encoded? This is only a
problem if you use an encoding which isn't a superset of US-ASCII (or
EBCDIC on some platforms). So you can't use UTF-16, because the extra
0x00 octets would confuse the parser, which expects US-ASCII, and you
can't use EBCDIC on a US-ASCII platform, but you can use UTF-8,
ISO-8859-X, BIG5, euc-jp, as long as you use only ASCII characters
before the use directive (which is easy since that should be the first
line (after the shebang) anyway).

hp

So there is a base 'code' line. Isn't it stupid to interpret the rest of
the code in an encoding that is itself interpreted with another code? The
code is then broken!

-sln
 
sln

If you can't do regexes on streams, then you can't parse XML.
I think you're missing what Unicode is.
I have already posted, some time back, pack/unpack on regex streams.
I can repost the code if you need. Or you can read a few docs on it.
I doubt you'll capitulate no matter what.

perlunicode.html and some others.

-sln

Btw, just try to pack or unpack UTF-16 or UTF-32.
Hey, or even UTF-8 that is out of range.
Try to do regex on them next.
I did. I didn't pack/unpack utf-16 or utf-32.
Let me know if you can do that.

-sln
 
Peter J. Holzer

use encoding "X" has a similar effect: You tell the compiler that the
source code is in encoding X,

How do you tell the compiler what encoding to use if the encoding
declaration itself can't be decoded?
[...]
Or do you mean what happens if the compiler doesn't even get to the "use
encoding 'X'" line because that line itself is encoded? This is only a
problem if you use an encoding which isn't a superset of US-ASCII (or
EBCDIC on some platforms). So you can't use UTF-16, because the extra
0x00 octets would confuse the parser, which expects US-ASCII, and you
can't use EBCDIC on a US-ASCII platform, but you can use UTF-8,
ISO-8859-X, BIG5, euc-jp, as long as you use only ASCII characters
before the use directive (which is easy since that should be the first
line (after the shebang) anyway).

So there is a base 'code' line. Isn't it stupid to interpret the rest of
the code in an encoding that is itself interpreted with another code? The
code is then broken!

No. Almost all encodings today are supersets of US-ASCII.

Consider these two programs:

#!/usr/bin/perl
use utf8;
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
print "$greeting\n";
__END__

#!/usr/bin/perl
use encoding "iso-8859-7";
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
print "$greeting\n";
__END__

where the first is encoded in UTF-8 and the second is encoded in
ISO-8859-7.

When the compiler starts to parse each program it doesn't know which
encoding is used. But it doesn't have to, because all the octets in the
first two lines are from the common subset of both these encodings: 0x65
is an "e" in both UTF-8 and ISO-8859-7, 0x22 is a double quote in both,
etc. So it can parse those two lines just fine assuming US-ASCII. And
after it has parsed those lines, it knows that the real encoding is not
just US-ASCII, but a specific superset: UTF-8 or ISO-8859-7,
respectively.

But you can't do something like that:

#!/usr/bin/perl
use Greeting "Καλημέρα κόσμε";
use encoding "iso-8859-7";
use warnings;
use strict;

hello();
__END__

because now the use encoding comes too late: The compiler would have to
go back to the start to parse "Καλημέρα κόσμε" correctly.

hp
 
Peter J. Holzer

If you can't do regexes on streams, then you can't parse XML.

You don't need regexps at all to parse XML (or any other language).
And you certainly don't need to do them on streams, since you can always
read the next block or line from the stream and append it to your
buffer.
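
A sketch of that read-and-append approach (file name and pattern are
invented for illustration):

use strict;
use warnings;

open my $fh, '<', 'test.xml' or die "test.xml: $!";

my $buffer = '';
while (read $fh, my $chunk, 4096) {
    $buffer .= $chunk;                        # append the next block
    while ($buffer =~ s{<title>([^<]*)</title>}{}) {
        print "title: $1\n";                  # consume each complete match
    }
}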
I think you're missing what Unicode is.

I know quite well what Unicode is - I have found character-set issues
fascinating ever since I turned on an Apple ][ in 1984 and it identified
itself as "Apple ÜÄ". I've read Rob Pike's paper in the early 90s and
the full unicode standard (version 2.0) in the late 90s. And I've
discussed character encoding matters (including Unicode) a lot on
various newsgroups and mailinglists over the years and fixed a few
encoding related problems in various pieces of software.

On the other hand, I think you don't know what a stream is:

open my $fh, '<', 'test.xml' or die;

Now $fh refers to a stream. Please show me how you can apply a regexp to
this stream. Solutions which don't count:

* reading chunks from the stream into a scalar variable and then
applying the regexp to this variable (because then you apply it to a
string (as I wrote), not a stream).
* writing your own regexp engine (since Perl is a general purpose
programming language, you can of course write that, but we were
talking about Perl's builtin regexps).

I have already posted sometime back pack/unpack on regex streams.

pack and unpack are Perl functions. They can only be applied to strings,
not streams. If you don't mean these functions but something else, be
more specific. And I have no idea what a "regex stream" might be. A
stream composed of regexps? A stream with special support for regexps? A
stream split into records with a regexp?
I can repost the code if you need.

Code is always nice because it is unambiguous (unlike the English
language). However, keep in mind that this is a discussion group, not a
code repository. Any code example longer than 50 lines or so is unlikely
to be read.
Or you can read a few docs on it.
perlunicode.html and some others.

I've read that several times (and criticized it here, too).
I doubt you'll capitulate no matter what.

If you think this is a fight where one of us has to win and the other to
capitulate, I'll stop now.

hp
 
Peter J. Holzer

Btw, just try to pack or unpack UTF-16 or UTF-32.

Wrong tool. Use encode/decode for that.
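
A sketch with Encode's encode/decode (the string is arbitrary):

use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{20AC}ab";                 # three characters
my $utf16 = encode("UTF-16BE", $chars);   # six octets: 20 ac 00 61 00 62
my $back  = decode("UTF-16BE", $utf16);   # three characters again

print join(" ", map { sprintf "%02x", ord } split(//, $utf16)), "\n";
print $back eq $chars ? "round trip ok\n" : "mismatch\n";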
Hey or even UTF-8 that is out of range.

What is "UTF-8 that is out of range"? A UTF-8 sequence which would
be decoded to a Unicode value > 0xFFFF_FFFF? That wasn't well-formed
UTF-8 to begin with since Unicode/ISO-10464 is by definition only 32 bit
(and it is unlikely that there will ever be characters beyond 0x10FFFF
defined since that would break UTF-16).
Try to do regex on them next.

You can do that, but it would be stupid. You decode them first and use
regexps on the result.

Why am I not surprised?
I didn't pack/unpack utf-16 or utf-32.
Let me know if you can do that.

I could. But since there's a better way, I wouldn't.

hp
 
Eric Pozharski

No. Almost all encodings today are supersets of US-ASCII.

Consider these two programs:

#!/usr/bin/perl
use utf8;
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
print "$greeting\n";
__END__

Show your code, don't master it

$ perl -Mutf8 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
Wide character in print at -e line 1.
фыва
�
$ echo $LC_ALL
en_US.UTF-8
#!/usr/bin/perl
use encoding "iso-8859-7";
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
print "$greeting\n";
__END__

Show your $ENV{LC_ALL}, please

{2775:24} [0:0]$ perl -Mencoding=latin1 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
фыва
�
where the first is encoded in UTF-8 and the second is encoded in
ISO-8859-7.

When the compiler starts to parse each program it doesn't know which
encoding is used. But it doesn't have to, because all the octets in the
first two lines are from the common subset of both these encodings: 0x65
is an "e" in both UTF-8 and ISO-8859-7, 0x22 is a double quote in both,
etc. So it can parse those two lines just fine assuming US-ASCII. And
after it has parsed those lines, it knows that the real encoding is not
just US-ASCII, but a specific superset: UTF-8 or ISO-8859-7,
respectively.

But you can't do something like that:

#!/usr/bin/perl
use Greeting "Καλημέρα κόσμε";
use encoding "iso-8859-7";
use warnings;
use strict;

hello();
__END__

because now the use encoding comes too late: The compiler would have to
go back to the start to parse "Καλημέρα κόσμε" correctly.

You've messed everything up. Since the compiler wasn't told about the
encoding of C<use Greeting>'s argument, it's treated as latin1, then
F<Greeting.pm> is fed that *byte* string, and it's F<Greeting.pm>'s
problem what to do with that stuff.

In case there were C<use utf8> or C<use encoding 'utf8'>, then the
*utf8* flag would be set, and then it would be F<Greeting.pm>'s problem
what to do with the *character* string.

You missed one important thing -- I dislike this feature, I hate it
already. Hopefully, since c.l.p.m. isn't that public, that dangerous
fact will stay unnoticed; see this:

{4579:37} [0:0]$ perl -wle '$фыва++; print $фыва'
Unrecognized character \x84 in column 3 at -e line 1.
{4601:39} [0:2]$ perl -Mutf8 -wle '$фыва++; print $фыва'
1
{4605:40} [0:0]$ perl -Mencoding=utf8 -wle '$фыва++; print $фыва'
Unrecognized character \x84 in column 3 at -e line 1.

That's what C<use utf8> is fscking for.

I should agree, 'UTF-8 flag' is somewhat misleading since it's about
characters, not utf8 by itself (I hope).

But,.. here be dragons...

{3335:27} [0:0]$ echo 'фыва' | xxd
0000000: d184 d18b d0b2 d0b0 0a .........
{3356:28} [0:0]$ echo 'фыва' | recode utf8..ucs-2-internal |xxd
0000000: 4404 4b04 3204 3004 0a00 D.K.2.0...
{3414:29} [0:1]$ perl -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Wide character in print at -e line 1.
䐄䬄㈄〄
{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.
 
Peter J. Holzer

Show your code, don't master it

$ perl -Mutf8 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
Wide character in print at -e line 1.
фыва

Yes, there should be a

binmode STDOUT, ":encoding(whatever)";

before the print. But I was only talking about compile time, not run
time, so this is irrelevant.

In fact, that you *do* get this warning shows my point: $greeting now
contains not a byte string (which can be sent directly to the
byte-oriented world outside) but a character string, which needs to be
encoded first.

�
$ echo $LC_ALL
en_US.UTF-8
#!/usr/bin/perl
use encoding "iso-8859-7";
use warnings;
use strict;

my $greeting = "Καλημέρα κόσμε";
print "$greeting\n";
__END__

Show your $ENV{LC_ALL}, please

{2775:24} [0:0]$ perl -Mencoding=latin1 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
фыва
�

use encoding also sets the binmode for STDOUT and STDERR, so you won't
get a warning here. Again, I was talking only about compile time
effects, not run time, so I didn't mention that (you can read the manual
yourself).
You've messed everything up. Since the compiler wasn't told about the
encoding of C<use Greeting>'s argument, it's treated as latin1,

Wrong: It is treated as an unspecified superset of US-ASCII.
then F<Greeting.pm>
is fed that *byte* string,
Right,

and it's F<Greeting.pm>'s problem what
to do with that stuff.

Which is irrelevant for the example. The point is that in this case the
use encoding directive comes too late: at the point the string is
compiled, the compiler still expects some unspecified superset of
US-ASCII and produces byte strings. If you want to tell the compiler
that your source code is in iso-8859-7 (and that is the purpose of the
use encoding directive) then you have to do it *before* the first
element which requires that knowledge. The compiler won't go back and
start over.
In case there would be C<use utf8> or C<use encoding 'utf8'>,

then the compiler would complain about a malformed UTF-8 character if
the source file was actually in ISO-8859-7.

The use encoding or use utf8 *must* match the encoding of the source
file. (And don't think about mixing several encodings in the same file
unless you want to enter your program in an obfu contest).

then the *utf8* flag would be set, and then it would be
F<Greeting.pm>'s problem what to do with the *character* string.

The assumption was of course that Greeting.pm would expect a character
string.

You missed one important thing -- I dislike this feature,

which feature?
I hate it already. Hopefully, since c.l.p.m. isn't that public, that
dangerous fact will stay unnoticed; see this:

{4579:37} [0:0]$ perl -wle '$фыва++; print $фыва'
Unrecognized character \x84 in column 3 at -e line 1.
{4601:39} [0:2]$ perl -Mutf8 -wle '$фыва++; print $фыва'
1
{4605:40} [0:0]$ perl -Mencoding=utf8 -wle '$фыва++; print $фыва'
Unrecognized character \x84 in column 3 at -e line 1.

Yes, you can't use "use encoding" for non-ASCII variables. "use
encoding" was intended as a cheap way to get pre-5.8 programs with
hard-coded non-ASCII strings into the new character string semantics, not
as a general purpose "write your code in any character encoding" tool.

I would *not* advise any one to use "use encoding" in new code, and if
you use it for porting old code, you *must* read the manual. Thoroughly.
Several times. There are dragons here.

That's what C<use utf8> is fscking for.

What is it for?
I should agree, 'UTF-8 flag' is somewhat misleading since it's about
characters, not utf8 by itself (I hope).

But,.. here be dragons...

{3335:27} [0:0]$ echo 'фыва' | xxd
0000000: d184 d18b d0b2 d0b0 0a .........
{3356:28} [0:0]$ echo 'фыва' | recode utf8..ucs-2-internal |xxd
0000000: 4404 4b04 3204 3004 0a00 D.K.2.0...
{3414:29} [0:1]$ perl -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'

You've mixed up the endianness. 'ф' is U+0444, not U+4404.

% echo 'фыва' | iconv -t UTF-16BE | xxd
0000000: 0444 044b 0432 0430 000a .D.K.2.0..
% perl -CO -wle 'print "\x{0444}\x{044b}\x{0432}\x{0430}"'
фыва

(And another word of warning: -CO only works on the command line in
5.10.0 - in real code always use binmode)

hp
 
Eric Pozharski

Before anything else, I beg your and everyone else's pardon. For some
weird reason, I'd called "tokens" "literals". Now I feel much better.

No. Almost all encodings today are supersets of US-ASCII.

Consider these two programs:
*SKIP*
$ perl -Mutf8 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
Wide character in print at -e line 1.
фыва
� *SKIP*
{2775:24} [0:0]$ perl -Mencoding=latin1 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
фыва
�

use encoding also sets the binmode for STDOUT and STDERR, so you won't

No, it doesn't (s/STDERR/STDIN/)

{5665:37} [0:0]$ perl -Mencoding=utf8 -wle 'print STDERR "фыва"'
Wide character in print at -e line 1.
фыва
get a warning here. Again, I was talking only about compile time
effects, not run time, so I didn't mention that (you can read the manual
yourself).

I fail to see any compile time effects -- either in those two above or
this one below

{2259:8} [0:0]$ perl -Mstrict -wle 'my $x = "фыва"; $x = "\x{C0}\x{B0}"'
{2264:9} [0:0]$
Wrong: It is treated as an unspecified superset of US-ASCII.

My understanding is based on this -- C<perldoc perlunicode>

"use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but
Unicode strings are downgraded with UTF-8 encoding. This happens
because the first 256 codepoints in Unicode happens to agree
with Latin-1.

If encoding is unknown, it's treated as latin1, even if it's not.
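
That upgrade rule is easy to demonstrate (a sketch; the octet is
arbitrary):

use strict;
use warnings;

my $bytes = "\xE9";      # byte string, one octet
my $wide  = "\x{100}";   # character string

# Concatenation forces an upgrade of $bytes, which is assumed Latin-1:
my $joined = $bytes . $wide;
printf "ord=0x%X\n", ord($joined);   # 0xE9, i.e. U+00E9, 'é'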

*SKIP*
then the compiler would complain about a malformed UTF-8 character if
the source file was actually in ISO-8859-7.

The use encoding or use utf8 *must* match the encoding of the source
file. (And don't think about mixing several encodings in the same file
unless you want to enter your program in an obfu contest).

But it didn't. You want to say C<"\x{C0}\x{B0}"> is well-formed UTF-8?
Despite its not being well-formed UTF-8, the compiler ignores it. However,
I've made a file with real bytes with the high bit set -- it compiles OK.
The warnings are delayed to run time.

That's not the compiler complaining; that's C<use warnings;>.

*SKIP*
which feature?

Have you ever seen a program text where tokens are a mix of ASCII and
non-ASCII characters? I have.

*SKIP*
What is it for?

Quoting C<perldoc utf8>

Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8. The utility functions described below
are directly usable without "use utf8;".

My understanding of "script" is the program text outside of any quotes.
I should agree, 'UTF-8 flag' is somewhat misleading since it's about
characters, not utf8 by itself (I hope).

But,.. here be dragons...

{3335:27} [0:0]$ echo 'фыва' | xxd
0000000: d184 d18b d0b2 d0b0 0a .........
{3356:28} [0:0]$ echo 'фыва' | recode utf8..ucs-2-internal |xxd
0000000: 4404 4b04 3204 3004 0a00 D.K.2.0...
{3414:29} [0:1]$ perl -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'

You've mixed up the endianness. 'ф' is U+0444, not U+4404.

Yes, my fault. And why did you skip the next line? It behaves the same
way with the endianness fixed.

*CUT*
 
