Can XML::Simple handle Chinese characters?

havel.zhang

Hi everyone,

I found that XML::Simple cannot handle Chinese characters. For example:
part1.xml:
<?xml version="1.0" encoding="utf-8"?>
<config>
<user>和平</user>
<passwd>longNails</passwd>
<books>
<book author="Steinbeck" title="Cannery Row"/>
<book author="Faulkner" title="Soldier's Pay"/>
<book author="Steinbeck" title="East of Eden"/>
</books>
</config>

----------------------------------------
my program:

#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;
print Dumper(XML::Simple->new()->XMLin('part1.xml', ForceArray => 1, KeepRoot => 1));
----------------------------------------
then the result is:
not well-formed (invalid token) at line 2, column 8, byte 17 at C:/Perl/site/lib/XML/Parser.pm line 187

So it seems to be caused by the Chinese characters.

Can anyone help me? Thank you :)

havel
 
asimsuter

Try XML::Parser

from http://search.cpan.org/~msergeant/XML-Parser/Parser.pm

===========================================================================

XML documents may be encoded in character sets other than Unicode as
long as they may be mapped into the Unicode character set. Expat has
further restrictions on encodings. Read the xmlparse.h header file in
the expat distribution to see details on these restrictions.

Expat has built-in encodings for: UTF-8, ISO-8859-1, UTF-16, and
US-ASCII. Encodings are set either through the XML declaration encoding
attribute or through the ProtocolEncoding option to XML::Parser or
XML::Parser::Expat.

For encodings other than the built-ins, expat calls the function
load_encoding in the Expat package with the encoding name. This
function looks for a file in the path list
@XML::Parser::Expat::Encoding_Path, that matches the lower-cased name
with a '.enc' extension. The first one it finds, it loads.

If you wish to build your own encoding maps, check out the
XML::Encoding module from CPAN.

===========================================================================

Regards.

Asim Suter
 
mirod

havel.zhang said:
hi everyone:

I found XML::Simple can not handling chinese character. for example:
part1.xml:
<?xml version="1.0" encoding="utf-8"?>
<config>
<user>和平</user>
</config>
#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;
print Dumper (XML::Simple->new()->XMLin('part1.xml',ForceArray =>
1,KeepRoot => 1));

Actually the example works perfectly on my machine. There must be
something either in the format of your file (but I copied it as is, so I
can't see what could cause a problem there) or something in your
environment. What versions of perl, XML::Simple, but also the parser
(XML::Parser in your case, but if you installed XML::LibXML it would be
used instead) are you using?
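
For anyone answering the version question above, here is a minimal sketch that reports the perl version and the versions of the modules involved. The helper name module_versions is made up for this example, and the module list is only an assumption about which parsers might be installed.

```perl
use strict;
use warnings;

# Return a hash ref mapping each module name to its version,
# or to undef if the module cannot be loaded.
sub module_versions {
    my @modules = @_;
    my %version;
    for my $module (@modules) {
        (my $file = "$module.pm") =~ s{::}{/}g;    # XML::Simple -> XML/Simple.pm
        $version{$module} = eval { require $file; $module->VERSION };
    }
    return \%version;
}

# The list below is an assumption; adjust it to the parsers you care about.
my $found = module_versions(qw(XML::Simple XML::Parser XML::SAX XML::LibXML));

printf "perl %vd\n", $^V;
for my $module (sort keys %$found) {
    my $v = $found->{$module};
    printf "%-12s %s\n", $module, defined $v ? $v : 'not installed (or no version)';
}
```

Posting this output along with the error message makes it easier for others to reproduce the environment.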
 
havel.zhang

hi mirod:
When I replace the Chinese characters with English words, it works fine.
My version of Perl is 5.8.8.

havel
 
Mumia W.

I also ran your program without problems on Perl 5.8.4 / Linux. You
should enable a utf8 locale on your computer and tell Perl to use that
encoding when reading from the file.

When I tested your program, I first saved part1.xml to a file in utf8
format; then I copied your script to a file in utf8 format. I also added
the "encoding" pragma to tell Perl that the script was written in utf8.
And my locale is currently set to utf8.

So there's no way for Perl to be unprepared to deal with utf8 encoded
data on my system right now, and Chinese characters should be stored in
either utf8 or gb2312 files.

I suspect your problem is encoding confusion. Either you don't have a
suitable locale installed (e.g. utf8), or you stored the file in one
encoding (e.g. gb2312), but you're trying to read it in another encoding
(utf8 ?).
 
Adrian Ulrich

You told XML::Something that this XML-File will be utf8 encoded
<?xml version="1.0" encoding="utf-8"?>
<user>和平</user>

...so is '和平' UTF-8 encoded? I'd recommend a real Unicode editor like
yudit (http://www.yudit.org) to edit/create UTF-8 files.

havel.zhang said:
when i changed chinese character with english word, it works fine.

UTF-8 is a superset of ASCII. A normal ASCII string will always be valid UTF-8.

Regards,
Adrian
 
Peter J. Holzer

I also ran your program without problems on Perl 5.8.4 / Linux. You
should enable a utf8 locale on your computer and tell Perl to use that
encoding when reading from the file.

No, you should not (well, using a utf8 locale may be a good idea anyway,
but it doesn't have anything to do with his problem). Telling perl to
use a specific encoding when reading XML files is at best ineffectual,
or it may cause problems.

When I tested your program, I first saved part1.xml to a file in utf8
format;

This is obviously necessary as the XML file starts with
<?xml version="1.0" encoding="utf-8"?>

then I copied your script to a file in utf8 format.

The script doesn't contain any non-ASCII characters so there is no
difference between ASCII format, Latin-1 format, UTF-8 format, etc.

I also added the "encoding" pragma to tell Perl that the script was
written in utf8.

The script is pure ASCII. Of course that means it's UTF-8, too, but it's
also a dozen other charsets which are supersets of ASCII.

And my locale is currently set to utf8.

Irrelevant. XML files contain their own encoding. They *must* *not* be
read differently depending on the locale. If the XML declaration
contains encoding="utf-8", the file must be parsed as UTF-8, regardless
of the charset of the current locale. Since you can't know the encoding
of an XML file before parsing it, it is the responsibility of the XML
parser to determine the encoding.

So there's no way for Perl to be unprepared to deal with utf8 encoded
data on my system right now,

Nothing you described above "prepared your system to deal with utf8
encoded" XML files.

and Chinese characters should be stored in either utf8 or gb2312
files.

Or GB18030 or EUC-CN or whatever contains the necessary characters. It
is only necessary that the XML declaration matches the contents of the
file.

I suspect your problem is encoding confusion. Either you don't have a
suitable locale installed (e.g. utf8),

I don't think you can install perl 5.8.8 without support for UTF-8,
regardless of any system-specific locales.

or you stored the file in one encoding (e.g. gb2312), but you're
trying to read it in another encoding (utf8 ?).

The parser must read it in UTF-8 encoding since that's what the file
says it is. Your suspicion that the file really is in some other
encoding seems likely (especially since Havel posted in gb2312).
It's also possible that the parser used by XML::Simple is broken, but
judging from the error message it is XML::Parser which in turn uses
expat, so I think that's unlikely.

hp
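
One way to test the suspicion above is to check whether the bytes decode cleanly as UTF-8 or as EUC-CN (the form GB2312 usually takes in files). This is only a sketch: try_decode is a hypothetical helper, and the four bytes BA CD C6 BD are an assumption about what the file holds inside <user>, based on how 和平 is encoded in GB2312.

```perl
use strict;
use warnings;
use Encode ();

# Return the decoded string if $bytes is valid in $encoding, undef otherwise.
sub try_decode {
    my ($encoding, $bytes) = @_;
    return eval {
        Encode::decode($encoding, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
}

# Assumed content of <user>...</user> in the problem file (GB2312 bytes).
my $bytes = "\xBA\xCD\xC6\xBD";

for my $encoding ('UTF-8', 'euc-cn') {
    my $ok = defined try_decode($encoding, $bytes) ? 'yes' : 'no';
    print "valid as $encoding: $ok\n";
}
```

If the bytes fail as UTF-8 but succeed as EUC-CN, the file does not match its own XML declaration, which would explain the "not well-formed (invalid token)" error.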
 
xhoster

havel.zhang said:
hi everyone:

I found XML::Simple can not handling chinese character. for example:
part1.xml:
<?xml version="1.0" encoding="utf-8"?>
<config>
<user>=BA=CD=C6=BD</user>
<passwd>longNails</passwd>
<books>
<book author="Steinbeck" title="Cannery Row"/>
<book author="Faulkner" title="Soldier's Pay"/>
<book author="Steinbeck" title="East of Eden"/>
</books>
</config>

Hi Havel,

I'm not sure that the Chinese characters in your post survived their
trip through usenet, so I can't use the above to serve as a realistic test.
Can you post a bit of Perl code (using chr(), for example) which is coded
in ASCII but would, when run, properly create the characters you are trying
to express?

Xho
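
Along the lines requested above, here is a sketch in pure ASCII that produces a test document with chr(). It assumes the two characters are U+548C and U+5E73 (和平); build_test_xml is a name invented for this example, and the document is cut down to the <user> element.

```perl
use strict;
use warnings;
use Encode ();

# Build the test document from ASCII-only source. chr() supplies the two
# Chinese characters by code point; the result is returned as UTF-8 bytes
# so that it matches the encoding="utf-8" declaration.
sub build_test_xml {
    my $user = chr(0x548C) . chr(0x5E73);
    my $xml  = qq{<?xml version="1.0" encoding="utf-8"?>\n}
             . qq{<config>\n<user>$user</user>\n</config>\n};
    return Encode::encode('UTF-8', $xml);
}

binmode STDOUT;    # emit the bytes unchanged (matters on Windows)
print build_test_xml();
```

Redirect the output to a file (for example part1.xml) and run the original program against it; if that parses, the parser is fine and the problem lies in how the original file was saved.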
 
Ian Wilson

xhoster said:
I'm not sure that the Chinese characters in your post survived their
trip through usenet, so I can't use the above to serve as a realistic test.

The two Chinese characters displayed OK in my newsreader.

The OP's posting had this header
Content-Type: text/plain; charset="gb2312"

Could it be that your newsreader doesn't support GB2312 encoding?


As others have said, it seems likely that the OP's XML file is actually
encoded in GB2312, not in UTF-8 as specified in its XML declaration.
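
If the file really is GB2312, one possible fix is to re-encode it so that it matches its declaration. A minimal sketch, assuming the bytes are EUC-CN and using a hypothetical helper named gb2312_xml_to_utf8:

```perl
use strict;
use warnings;
use Encode ();

# Re-encode XML bytes from EUC-CN (the usual file form of GB2312) to UTF-8,
# so the content agrees with an encoding="utf-8" declaration.
# Dies if the input is not valid EUC-CN.
sub gb2312_xml_to_utf8 {
    my ($bytes) = @_;
    my $text = Encode::decode('euc-cn', $bytes,
                              Encode::FB_CROAK | Encode::LEAVE_SRC);
    return Encode::encode('UTF-8', $text);
}

# Demonstration on an in-memory document whose <user> content is assumed
# to be the GB2312 bytes BA CD C6 BD.
my $gb = qq{<?xml version="1.0" encoding="utf-8"?>\n}
       . qq{<config>\n<user>\xBA\xCD\xC6\xBD</user>\n</config>\n};

binmode STDOUT;
print gb2312_xml_to_utf8($gb);
```

The alternative is to leave the bytes alone and change the declaration to name the real encoding, which only helps if the parser has a map for that encoding.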
 
Bart Lateur

Ian said:
As others have said, it seems likely that the OP's XML file is actually
encoded in GB2312, not in UTF-8 as specified in its XML declaration.

Which would be an excellent reason for any XML parser to barf.
 
