Can XML::Simple handle Chinese characters?

Discussion in 'Perl Misc' started by havel.zhang, Jun 17, 2007.

  1. havel.zhang

    havel.zhang Guest

    hi everyone:

    I found that XML::Simple cannot handle Chinese characters. For example:
    part1.xml:
    <?xml version="1.0" encoding="utf-8"?>
    <config>
    <user>和平</user>
    <passwd>longNails</passwd>
    <books>
    <book author="Steinbeck" title="Cannery Row"/>
    <book author="Faulkner" title="Soldier's Pay"/>
    <book author="Steinbeck" title="East of Eden"/>
    </books>
    </config>

    ----------------------------------------
    my program:

    #!/usr/bin/perl -w
    use strict;
    use XML::Simple;
    use Data::Dumper;
    print Dumper (XML::Simple->new()->XMLin('part1.xml',ForceArray =>
    1,KeepRoot => 1));
    ----------------------------------------
    then the result is:
    >not well-formed (invalid token) at line 2, column 8, byte 17 at C:/Perl/site/lib/XML/Parser.pm line 187


    So it seems to fail just because of the Chinese characters.

    Can anyone help me? Thank you :)

    havel
    havel.zhang, Jun 17, 2007
    #1

  2. havel.zhang

    Guest

    Try XML::Parser

    from http://search.cpan.org/~msergeant/XML-Parser/Parser.pm

    ===========================================================================

    XML documents may be encoded in character sets other than Unicode as
    long as they may be mapped into the Unicode character set. Expat has
    further restrictions on encodings. Read the xmlparse.h header file in
    the expat distribution to see details on these restrictions.

    Expat has built-in encodings for: UTF-8, ISO-8859-1, UTF-16, and
    US-ASCII. Encodings are set either through the XML declaration encoding
    attribute or through the ProtocolEncoding option to XML::Parser or
    XML::Parser::Expat.

    For encodings other than the built-ins, expat calls the function
    load_encoding in the Expat package with the encoding name. This
    function looks for a file in the path list
    @XML::Parser::Expat::Encoding_Path, that matches the lower-cased name
    with a '.enc' extension. The first one it finds, it loads.

    If you wish to build your own encoding maps, check out the
    XML::Encoding module from CPAN.

    ===========================================================================

    Regards.

    Asim Suter
    , Jun 17, 2007
    #2

  3. havel.zhang

    mirod Guest

    havel.zhang wrote:
    > hi everyone:
    >
    > I found that XML::Simple cannot handle Chinese characters. For example:
    > part1.xml:
    > <?xml version="1.0" encoding="utf-8"?>
    > <config>
    > <user>和平</user>
    > </config>


    > #!/usr/bin/perl -w
    > use strict;
    > use XML::Simple;
    > use Data::Dumper;
    > print Dumper (XML::Simple->new()->XMLin('part1.xml',ForceArray =>
    > 1,KeepRoot => 1));
    > ----------------------------------------
    > then the result is:
    >> not well-formed (invalid token) at line 2, column 8, byte 17 at C:/Perl/site/lib/XML/Parser.pm line 187

    >
    > So it seems to fail just because of the Chinese characters.


    Actually the example works perfectly on my machine. There must be
    something either in the format of your file (but I copied it as is, so I
    can't see what could cause a problem there) or something in your
    environment. What versions of perl, XML::Simple, and also the parser
    (XML::Parser in your case, but if you installed XML::LibXML it would be
    used instead) are you using?

    --
    mirod
    mirod, Jun 17, 2007
    #3
  4. havel.zhang

    havel.zhang Guest

    hi mirod:
    When I replace the Chinese characters with an English word, it works fine.
    My version of Perl is 5.8.8.

    havel
    havel.zhang, Jun 17, 2007
    #4
  5. havel.zhang

    Mumia W. Guest

    On 06/17/2007 01:10 AM, havel.zhang wrote:
    >
    > hi mirod:
    > When I replace the Chinese characters with an English word, it works fine.
    > My version of Perl is 5.8.8.
    >
    > havel
    >


    I also ran your program without problems on Perl 5.8.4 / Linux. You
    should enable a utf8 locale on your computer and tell Perl to use that
    encoding when reading from the file.

    When I tested your program, I first saved part1.xml to a file in utf8
    format; then I copied your script to a file in utf8 format. I also added
    the "encoding" pragma to tell Perl that the script was written in utf8.
    And my locale is currently set to utf8.

    So there's no way for Perl to be unprepared to deal with utf8 encoded
    data on my system right now, and Chinese characters should be stored in
    either utf8 or gb2312 files.

    I suspect your problem is encoding confusion. Either you don't have a
    suitable locale installed (e.g. utf8), or you stored the file in one
    encoding (e.g. gb2312), but you're trying to read it in another encoding
    (utf8 ?).
    Mumia W., Jun 17, 2007
    #5
  6. You told XML::Something that this XML-File will be utf8 encoded
    > <?xml version="1.0" encoding="utf-8"?>
    > <user>和平</user>


    ...so is '和平' UTF-8 encoded? I'd recommend a real unicode editor like
    yudit (http://www.yudit.org) to edit/create utf8 files.

    > When I replace the Chinese characters with an English word, it works fine.


    UTF-8 is a superset of ASCII. A normal ASCII string will always be valid UTF-8.
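
To check that claim on the sample data, here is a minimal sketch using only the core Encode module. The helper name is_valid_utf8 is made up for this illustration, and the byte strings are hand-written from the two characters in the example (their UTF-8 form, and the GB2312 form the thread suspects the file contains).

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return 1 if the byte string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my @samples = (
    [ 'ASCII text',   'longNails' ],
    [ 'UTF-8 bytes',  "\xE5\x92\x8C\xE5\xB9\xB3" ],  # U+548C U+5E73 encoded as UTF-8
    [ 'GB2312 bytes', "\xBA\xCD\xC6\xBD" ],          # the same characters as GB2312 bytes
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-12s %s\n", $label,
        is_valid_utf8($bytes) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

The ASCII string and the UTF-8 bytes pass, while the GB2312 bytes do not, which would be consistent with a parser accepting the English word and rejecting the Chinese one.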

    Regards,
    Adrian
    Adrian Ulrich, Jun 17, 2007
    #6
  7. On 2007-06-17 08:09, Mumia W. <> wrote:
    > On 06/17/2007 01:10 AM, havel.zhang wrote:
    >> hi mirod:
    >> When I replace the Chinese characters with an English word, it works fine.
    >> My version of Perl is 5.8.8.

    >
    > I also ran your program without problems on Perl 5.8.4 / Linux. You
    > should enable a utf8 locale on your computer and tell Perl to use that
    > encoding when reading from the file.


    No, you should not (well, using a utf8 locale may be a good idea anyway,
    but it doesn't have anything to do with his problem). Telling perl to
    use a specific encoding when reading XML files is at best ineffectual,
    or it may cause problems.


    > When I tested your program, I first saved part1.xml to a file in utf8
    > format;


    This is obviously necessary, as the XML file starts with

    <?xml version="1.0" encoding="utf-8"?>


    > then I copied your script to a file in utf8 format.


    The script doesn't contain any non-ASCII characters so there is no
    difference between ASCII format, Latin-1 format, UTF-8 format, etc.


    > I also added the "encoding" pragma to tell Perl that the script was
    > written in utf8.


    The script is pure ASCII. Of course that means it's UTF-8, too, but it's
    also a dozen other charsets which are supersets of ASCII.


    > And my locale is currently set to utf8.


    Irrelevant. XML files contain their own encoding. They *must* *not* be
    read differently depending on the locale. If the XML declaration
    contains encoding="utf-8", the file must be parsed as UTF-8, regardless
    of the charset of the current locale. Since you can't know the encoding
    of an XML file before parsing it, it is the responsibility of the XML
    parser to determine the encoding.


    > So there's no way for Perl to be unprepared to deal with utf8 encoded
    > data on my system right now,


    Nothing you described above "prepared your system to deal with utf8
    encoded" XML files.

    > and Chinese characters should be stored in either utf8 or gb2312
    > files.


    Or GB18030 or EUC-CN or whatever contains the necessary characters. It
    is only necessary that the XML declaration matches the contents of the
    file.
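
A rough way to test whether a declaration matches the contents, sketched with the core Encode module only. The function names declared_encoding and matches_declaration are invented for this example, and the regex is a simplification that assumes an ASCII-compatible encoding (it would not work for UTF-16). It only checks that the bytes decode cleanly under the declared name; whether expat itself can parse a given non-built-in encoding still depends on an encoding map being available.

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK LEAVE_SRC);

# Hypothetical helper: pull the encoding name out of the XML declaration.
# Falls back to UTF-8 when no encoding attribute is present.
sub declared_encoding {
    my ($bytes) = @_;
    if ($bytes =~ /\A(?:\xEF\xBB\xBF)?<\?xml\b[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return $2;
    }
    return 'UTF-8';
}

# Hypothetical helper: 1 if the document bytes decode cleanly in the
# encoding that the document itself declares, 0 otherwise.
sub matches_declaration {
    my ($bytes) = @_;
    my $enc = find_encoding(declared_encoding($bytes)) or return 0;
    return eval { $enc->decode($bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# The <user> element holds GB2312 bytes in both documents;
# only the declaration differs.
my $body     = qq{<config><user>\xBA\xCD\xC6\xBD</user></config>\n};
my $says_utf = qq{<?xml version="1.0" encoding="utf-8"?>\n} . $body;
my $says_gb  = qq{<?xml version="1.0" encoding="gb2312"?>\n} . $body;

print "declared utf-8:  ", (matches_declaration($says_utf) ? "consistent" : "mismatch"), "\n";
print "declared gb2312: ", (matches_declaration($says_gb)  ? "consistent" : "mismatch"), "\n";
```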

    > I suspect your problem is encoding confusion. Either you don't have a
    > suitable locale installed (e.g. utf8),


    I don't think you can install perl 5.8.8 without support for UTF-8,
    regardless of any system-specific locales.

    > or you stored the file in one encoding (e.g. gb2312), but you're
    > trying to read it in another encoding (utf8 ?).


    The parser must read it in UTF-8 encoding since that's what the file
    says it is. Your suspicion that the file really is in some other
    encoding seems likely (especially since Havel posted in gb2312).
    It's also possible that the parser used by XML::Simple is broken, but
    judging from the error message it is XML::Parser which in turn uses
    expat, so I think that's unlikely.

    hp

    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
    Peter J. Holzer, Jun 17, 2007
    #7
  8. havel.zhang

    Guest

    "havel.zhang" <> wrote:
    > hi everyone:
    >
    > I found that XML::Simple cannot handle Chinese characters. For example:
    > part1.xml:
    > <?xml version=3D"1.0" encoding=3D"utf-8"?>
    > <config>
    > <user>=BA=CD=C6=BD</user>
    > <passwd>longNails</passwd>
    > <books>
    > <book author=3D"Steinbeck" title=3D"Cannery Row"/>
    > <book author=3D"Faulkner" title=3D"Soldier's Pay"/>
    > <book author=3D"Steinbeck" title=3D"East of Eden"/>
    > </books>
    > </config>


    Hi Havel,

    I'm not sure that the Chinese characters in your post survived their
    trip through usenet, so I can't use the above to serve as a realistic test.
    Can you post a bit of Perl code (using chr(), for example) which is coded
    in ASCII but would, when run, properly create the characters you are trying
    to express?
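
One possible answer to that request, as a sketch: the script below is pure ASCII and builds the two characters from their code points, assuming they are U+548C and U+5E73 (which is what the bytes BA CD C6 BD in the quoted text give when read as GB2312). The helper hex_bytes is made up for this example; 'euc-cn' is the name Encode uses for the usual byte form of GB2312.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build the user name from code points, so the script itself stays ASCII.
my $name = chr(0x548C) . chr(0x5E73);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $as_utf8   = encode('UTF-8',  $name);
my $as_gb2312 = encode('euc-cn', $name);

print "UTF-8:  ", hex_bytes($as_utf8),   "\n";   # E5 92 8C E5 B9 B3
print "GB2312: ", hex_bytes($as_gb2312), "\n";   # BA CD C6 BD
```

Only the first byte sequence is what a file declared as encoding="utf-8" should contain between the user tags.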

    Xho

    , Jun 17, 2007
    #8
  9. havel.zhang

    Ian Wilson Guest

    xhoster wrote:
    > havel zhang wrote:


    >> <?xml version=3D"1.0" encoding=3D"utf-8"?>
    >> <user>=BA=CD=C6=BD</user>

    >
    >
    > I'm not sure that the Chinese characters in your post survived their
    > trip through usenet, so I can't use the above to serve as a realistic test.


    The two chinese characters displayed OK in my newsreader.

    The OP's posting had this header
    Content-Type: text/plain; charset="gb2312"

    Could it be that your newsreader doesn't support GB2312 encoding?


    As others have said, it seems likely that the OP's XML file is actually
    encoded in GB2312, not in UTF-8 as specified in its XML declaration.


    > Can you post a bit of Perl code (using chr(), for example) which is coded
    > in ASCII but would, when run, properly create the characters you are trying
    > to express?
    Ian Wilson, Jun 18, 2007
    #9
  10. havel.zhang

    Bart Lateur Guest

    Ian Wilson wrote:

    >As others have said, it seems likely that the OP's XML file is actually
    >encoded in GB2312, not in UTF-8 as specified in its XML declaration.


    Which would be an excellent reason for any XML parser to barf.
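
If the file really is GB2312, one way out is to re-encode it so that it matches its utf-8 declaration. A minimal sketch with the core Encode module follows; the function names gb2312_to_utf8 and convert_file and the output file name are assumptions for this example, and it presumes the input truly is GB2312.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: convert a GB2312 byte string to UTF-8 bytes.
sub gb2312_to_utf8 {
    my ($bytes) = @_;
    from_to($bytes, 'euc-cn', 'UTF-8');   # converts our local copy in place
    return $bytes;
}

# Hypothetical helper: re-encode a whole file, reading and writing raw bytes.
sub convert_file {
    my ($in, $out) = @_;
    open my $ifh, '<:raw', $in or die "Cannot read $in: $!";
    my $bytes = do { local $/; <$ifh> };
    close $ifh;
    open my $ofh, '>:raw', $out or die "Cannot write $out: $!";
    print {$ofh} gb2312_to_utf8($bytes);
    close $ofh or die "Cannot close $out: $!";
    return;
}

# In-memory demonstration on the element from the example file.
my $fixed = gb2312_to_utf8("<user>\xBA\xCD\xC6\xBD</user>");
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $fixed)), "\n";

# To convert the file itself (the output name is only a placeholder):
# convert_file('part1.xml', 'part1.utf8.xml');
```

The other option mentioned in the thread is to leave the bytes alone and change the declaration to name the real encoding, but that only helps if the parser has a map for that encoding.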

    --
    Bart.
    Bart Lateur, Jun 18, 2007
    #10