MaggotChild
I need to send data across the network and I'm confused by the
UTF-8ness of the values returned by toString() and nodeValue().
I know that toString() will give me what I need (octets, regardless of
the underlying encoding), yet I can't understand how the character is
represented in the output of each method.
For example (note that the mangled char is the opening single-quote
character):
use strict;
use warnings;
use XML::LibXML;
use Encode;
$\="\n";
my $parser = XML::LibXML->new;
my $dom = $parser->parse_file(shift);
my $node = ($dom->getElementsByTagName('title'))[0];
print $dom->actualEncoding;
print 'is utf-8: ' . Encode::is_utf8($node->firstChild->nodeValue,1);
print "node value";
print $node->firstChild->nodeValue;
print "to string";
my $txt = $node->firstChild->toString(0,1);
print $txt;
print 'is utf-8: ' . Encode::is_utf8($txt,1);
Outputs:
UTF-8
is utf-8: 1
txt content
Wide character in print at ./utf8-lib-xml.pl line 18.
âERâ
to string
âERâ
is utf-8:
Why is the toString() output no longer UTF-8?
And, since the wide char has been broken down into octets, how does
one know that it's composed of 2 octets when it's interpreted on the
receiving end (or even in my terminal)?
On the surface it seems as if I'd be breaking the UTF-8.
Is the toString() method the preferred way to send the value of a
TextNode across the network?
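
To make the characters-versus-octets distinction concrete, here is a minimal sketch that uses only the Encode module and leaves XML::LibXML out. It assumes the title text is ER wrapped in curly quotes (U+2018 and U+2019), which is only a guess based on the mangled output above, and all variable names are hypothetical. It shows that a character string carries Perl's internal UTF8 flag, that encoding it yields unflagged octets which are still valid UTF-8 on the wire, and that the receiver gets the characters back by decoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Assumed sample text: U+2018 + "ER" + U+2019 (a guess, not taken from the real file)
my $chars = "\x{2018}ER\x{2019}";

# A character string: length counts characters, and the internal flag is on
my $char_len      = length $chars;              # 4 characters
my $chars_flagged = is_utf8($chars) ? 1 : 0;    # 1

# Encoding produces octets suitable for a socket or file
my $octets         = encode('UTF-8', $chars);
my $octet_len      = length $octets;            # 8 octets: 3 + 1 + 1 + 3
my $octets_flagged = is_utf8($octets) ? 1 : 0;  # 0, yet the bytes are valid UTF-8

# The receiving side turns the octets back into characters
my $round_trip = decode('UTF-8', $octets);

print "characters: $char_len (flag $chars_flagged)\n";
print "octets:     $octet_len (flag $octets_flagged)\n";
print "round trip: ", ($round_trip eq $chars ? "ok" : "mismatch"), "\n";
```

The flag reported by is_utf8() describes how Perl stores the scalar internally, not whether the bytes form valid UTF-8. The receiver does not need to be told how many octets each character occupies, because the lead byte of every UTF-8 sequence encodes the sequence length.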