XML::Simple and utf8 woes

Discussion in 'Perl Misc' started by Guest, Mar 18, 2006.

  1. Guest

    Guest Guest

    Dear wizards,

    I use XML::Simple to parse an XML file and
    also to write it out. The problem lies in the
    utf8 character data contained in the XML
    source. While the XMLin() function seems
    to read them properly, the XMLout() function
    tries to replace utf8 material by multibyte
    nonsense.

    Below is my minimal example, run under perl 5.8.5
    on a Fedora C3 box. Just compare the output
    of the script (in w.xml) with its input, in DATA.

    Please advise on how to fix the broken utf8 output.

    Thanks in advance,
    Oliver.

    #!/usr/bin/perl
    use XML::Simple;
    print "Reading data from XML source...\n";
    $data = XMLin(\*DATA,
        ForceArray => ['manju', 'hauer'],
        ContentKey => '-content',
        KeyAttr    => ['name'],
    );
    print "Retrieve and display data example:\n";
    $k = '0004.1';
    print $k . ": " .
        $data->{lemma}->{$k}->{manju}->[0] .
        "\n";
    print "Writing data to XML file...\n";
    XMLout($data,
        NumericEscape => 0,
        RootName      => 'wuti',
        XMLDecl       => 1,
        OutputFile    => 'w.xml',
    );
    __DATA__
    <?xml version='1.0' encoding='utf-8' standalone='yes'?>
    <wuti>
    <lemma name="0004.1">
    <hauer>in der Morgendämmerung (H).</hauer>
    <manju>farhûn suwaliyame</manju>
    </lemma>
    <lemma name="0004.2">
    <hauer>Morgendämmerung.</hauer>
    <manju>gersi fersi</manju>
    </lemma>
    </wuti>

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 18, 2006
    #1

  2. Guest

    ngoc Guest


    > Below is my minimal example, run under perl 5.8.5
    > on a Fedora C3 box. Just compare the output
    > of the script (in w.xml) with its input, in DATA.

    I tried your code on Windows XP, and it gives UTF-8 output. But if I use
    RootName => 'unicode here', only the root name in the output is changed
    (a manual fix will help); the other parts are in UTF-8. I suggest you:

    1. Save your Perl program in UTF-8 encoding.

    2. In theory this step is not necessary, but maybe it helps:

    open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
    XMLout($ref, OutputFile => $fh);

    3. Try it in a Windows XP or 2000 environment to see whether it behaves differently.
    ngoc, Mar 18, 2006
    #2

  3. Guest

    Guest Guest

    ngoc <> wrote:
    : I tried your code in Windows XP. It gives utf-8 output.

    Really? I'll have to try tomorrow, don't have an XP box here right now.

    : RootName => 'unicode here', only the output of rootname is changed
    : (manual fix will help), other parts are in utf-8.

    Sounds interesting, I'll try this one, too.

    : 1. To save your perl program in utf-8 encoding.

    Doesn't make sense, I write everything in utf-8 environment. Did you
    notice the a-umlaut and u-caret in the data?

    : 2. This step in theory is not necessary. But maybe it helps

    : open my $fh, '>:encoding(UTF-8)', $path or die "open($path): $!";
    : XMLout($ref, OutputFile => $fh);

    I had tried this already before posting, but to no avail.

    : 3. Try in Windows XP or 2000 environment to see it is different

    Tomorrow.

    Thanks, Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 18, 2006
    #3
  4. Guest

    Guest Guest

    -berlin.de wrote:

    : Really? I'll have to try tomorrow, don't have an XP box here right now.

    I still don't have an XP system at hand.

    If you run the code with the -CS flag given to perl, even the innocent
    print statement in the middle of the code will output two characters
    instead of one utf8-encoded character, and this doesn't change the broken
    output of the XMLout() statement.

    This is beyond any expectation created after reading the perlrun manpage.

    However, if XML::Simple is instructed in the XMLout statement to escape
    all non-ASCII characters, then, miraculously, the correct utf8 replacements
    appear. It really drives me nuts.

    Oliver.
    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 20, 2006
    #4
  5. Guest

    fhscobey Guest

    Hi,
    You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
    issues crop back up for some reason. Our application deals 100% in
    UTF-8 data, but all source code is ISO-8859-1. We really had some
    issues getting UTF-8 stuff to work (we started back when 5.8.0 came
    out) and found that using 5.8.1, with some well placed ...

    Encode::_utf8_on($content);
    Encode::_utf8_off($content);

    ... seemed to do the trick for us. So you might try to make sure the
    UTF-8 flag is turned on for your XML data, and then try and parse it.
    We are using some older versions of modules, which at the time, were
    just starting to deal with the change in Perl 5.8 to treat content
    internally as UTF-8 encoded. Note: I believe Perl 5.8.7 has some
    issues with the Encode module specifically with UTF-8, check with
    bugs.perl.org for more information.

    All of this may seem strange, but I can tell you when we wrote our
    application, it worked fine with Perl 5.8.0 and 5.8.1. I've tried
    5.8.3|5|7 and all versions are giving us garbled data out.

    Also, if you are reading your data in from a handle, you absolutely
    have to declare the handle to be UTF-8 encoded. [i.e. open(FH,
    "<:utf8", "file");].
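
A minimal sketch of the input-layer advice above. The sample string and the in-memory handle are assumptions standing in for a real file; the point is only to show that the same octets are seen as seven bytes without a layer and as six characters with one.

```perl
use strict;
use warnings;

# UTF-8 octets for "farhûn": the û (U+00FB) occupies two octets, C3 BB.
my $bytes = "farh\xC3\xBBn\n";

# Read without a layer: Perl hands back the raw octets.
open my $raw, '<', \$bytes or die "open raw: $!";
my $as_bytes = <$raw>;
close $raw;
chomp $as_bytes;

# Read through a decoding layer: Perl hands back characters.
open my $dec, '<:encoding(UTF-8)', \$bytes or die "open decoded: $!";
my $as_chars = <$dec>;
close $dec;
chomp $as_chars;

printf "raw: %d units, decoded: %d units\n",
    length $as_bytes, length $as_chars;
```

The `:encoding(UTF-8)` layer is used here rather than `:utf8` because it also validates the input; either one marks the handle as carrying UTF-8.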

    Not sure if this helps you at all,
    - Jeff
    fhscobey, Mar 20, 2006
    #5
  6. Guest

    Guest Guest

    fhscobey <> wrote:
    : Hi,
    : You might try Perl 5.8.1 too. 5.8.3 and above have had some UTF-8
    : issues crop back up for some reason. Our application deals 100% in
    : UTF-8 data, but all source code is ISO-8859-1. We really had some
    : issues getting UTF-8 stuff to work (we started back when 5.8.0 came
    : out) and found that using 5.8.1, with some well placed ...

    : Encode::_utf8_on($content);
    : Encode::_utf8_off($content);

    : issues with the Encode module specifically with UTF-8, check with
    : bugs.perl.org for more information.

    Hi Jeff,

    You've really saved my day. So it's _not_ my personal failure to
    understand how utf8 in Perl works, but really a problem, version-
    dependent too. Thank you.

    Anyway, of course, when using file handles, I make sure the line
    discipline is set to :utf8, but it does not always help. See my other
    answer to the Perl and UTF8 posting.

    Best regards,
    Oliver.


    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 21, 2006
    #6
  7. [Whoops, meant to post, not mail]

    -berlin.de wrote:
    > Dear wizards,
    >
    > I use XML::Simple to parse an XML file and
    > also to write it out. The problem lies in the
    > utf8 character data contained in the XML
    > source. While the XMLin() function seems
    > to read them properly, the XMLout() function
    > tries to replace utf8 material by multibyte
    > nonsense.
    >
    > Below is my minimal example, run under perl 5.8.5
    > on a Fedora C3 box. Just compare the output
    > of the script (in w.xml) with its input, in DATA.
    >
    > Please advise on how to fix the broken utf8 output.
    >
    > Thanks in advance,
    > Oliver.
    >
    > #!/usr/bin/perl
    > use XML::Simple;
    > print "Reading data from XML source...\n";
    > $data=XMLin(\*DATA,
    > ForceArray=>[manju,hauer],
    > ContentKey=>'-content',
    > KeyAttr=>[name],
    > );
    > print "Retrieve and display data example:\n";
    > $k='0004.1';
    > print $k.": ".
    > $data->{lemma}->{$k}->{manju}->[0].
    > "\n";
    > print "Writing data to XML file...\n";
    > XMLout($data,
    > NumericEscape=>0,
    > RootName=>'wuti',
    > XMLDecl=>1,
    > OutputFile=>'w.xml',
    > );
    > __DATA__
    > <?xml version='1.0' encoding='utf-8' standalone='yes'?>
    > <wuti>
    > <lemma name="0004.1">
    > <hauer>in der Morgendämmerung (H).</hauer>
    > <manju>farhûn suwaliyame</manju>
    > </lemma>
    > <lemma name="0004.2">
    > <hauer>Morgendämmerung.</hauer>
    > <manju>gersi fersi</manju>
    > </lemma>
    > </wuti>
    >


    The problem seems to be the absence of a "use utf8;" pragma. Perl is
    assuming that your code (including the __DATA__ section) is in ISO-8859-1.

    [Addendum: FWIW, your newsreader is also making the same assumption.]

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Chronos Tachyon, Mar 23, 2006
    #7
  8. Guest

    fhscobey Guest

    Donald brings up a good point. If your source is not ISO-8859-1 (which
    I believe you mentioned), you have to use the utf8 pragma. But I also
    believe that if you were to try Perl 5.8.0, you would have to use this
    pragma even if only the data your script was dealing with was UTF-8.
    Starting with 5.8.1, that use of the pragma was deprecated; it is now
    only meant for telling Perl what encoding your source is in.

    See http://perldoc.perl.org/utf8.html for more information.
    fhscobey, Mar 23, 2006
    #8
  9. Guest

    Guest Guest

    Chronos Tachyon <> wrote:

    : The problem seems to be the absence of a "use utf8;" pragma. Perl is
    : assuming that your code (including the __DATA__ section) is in ISO-8859-1.

    No, I don't think so, as inserting the utf8 pragma doesn't change anything.
    I tried it, and the output is still not in utf8.

    : [Addendum: FWIW, your newsreader is also making the same assumption.]

    That is a different story, on a different machine. My production code
    runs in a true utf8 environment, this one here is only used for
    communications. Thank you for the hint, nonetheless!

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 24, 2006
    #9
  10. Guest

    Guest Guest

    fhscobey <> wrote:
    : Donald brings up a good point. If your source is not ISO-8859-1(which
    : I believe you mentioned), you have to use the utf8 pragma. But, I also
    : believe if you were to try using Perl 5.8.0, you would have to use this
    : pragma even if it was only the data your script was dealing with.
    : Starting with 5.8.1+, they deprecated the use of this pragma, to only
    : be used for telling Perl what encoding your source was in.

    : See http://perldoc.perl.org/utf8.html for more information.

    I read that, and also studied the various options to switch -C (see
    perlrun for that), and I am really confused why the behaviour of my
    system is so out of sync with the descriptions in the documentation.

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 24, 2006
    #10
  11. Guest

    Donald King Guest

    [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]

    -berlin.de wrote:
    > Chronos Tachyon <> wrote:
    >
    > : The problem seems to be the absence of a "use utf8;" pragma. Perl is
    > : assuming that your code (including the __DATA__ section) is in ISO-8859-1.
    >
    > No, I don't think so, as inserting the utf8 pragma doesn't change anything.
    > I tried it, and the output is still not in utf8.
    >


    FWIW, I've taken your original test case from the top of the thread and
    fixed it up. It's now properly encoded in UTF-8, it uses both "use
    utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
    to run under "use strict" and "use warnings" while I was in there. You
    can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.

    BTW, during my testing, I found that, if script.pl has a "#!perl -CS"
    shebang line, "./script.pl" uses the -CS but "perl script.pl" doesn't.
    I guess I was so used to -w and -T being automatically picked up from
    the shebang line, I didn't realize that Perl doesn't interpret *all* the
    flags there. I'd recommend explicit binmode() calls or the "open"
    pragma instead of the -C flag, due to the confusion it can cause.
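
A minimal sketch of the explicit-layer approach recommended above, as opposed to relying on `-C`. The in-memory handle is an assumption standing in for STDOUT or a real output file, and the character is written as an escape so the example does not depend on how the script itself is saved.

```perl
use strict;
use warnings;

# A character string: "Morgendämmerung", with ä written as an escape.
my $text = "Morgend\x{E4}mmerung";

# Put the encoding layer on the handle itself. An in-memory handle
# stands in for a real output file here.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $text;
close $out;

# For an already open handle such as STDOUT the equivalent is:
#   binmode(STDOUT, ':encoding(UTF-8)');
# and for every handle opened in the current lexical scope plus the
# standard handles:
#   use open qw(:std :utf8);

printf "%d characters in, %d octets out\n", length $text, length $buffer;
```

Because the layer travels with the handle, the result does not depend on how the script is invoked.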

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Donald King, Mar 25, 2006
    #11
  12. Guest

    Guest Guest

    Donald King <> wrote:
    : [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]

    : FWIW, I've taken your original test case from the top of the thread and
    : fixed it up. It's now properly encoded in UTF-8, it uses both "use
    : utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
    : to run under "use strict" and "use warnings" while I was in there. You
    : can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.

    Hi Donald,

    Thank you _very_ much for the fixed code. I ran it, and to no avail. The
    problems remain. Can you tell me which environment the code worked for you?

    My environment:

    perl -v states:
    perl v5.8.5 built for i386-linux-thread-multi

    echo $LANG states:
    en_US.UTF-8

    in vim, opening the file in utf8 encoding succeeds (and displays correctly)

    When running the file from the command line
    ./corff.pl

    I get:
    1) broken output of the print statement
    2) over-interpreted representations of utf8 data in the output file w.xml.

    If I disable _both_ the
    # use utf8;
    ...
    # binmode(STDOUT, ":utf8");

    lines,

    the output of the print statement is _correct_ (accented characters
    show properly), whereas the output of the w.xml file is still garbage.

    If, in a fit of desperation, I modify the output of XMLout() with
    NumericEscape=>2, all I get in the output is that, e.g., the a-umlaut of
    Morgend&auml;mmerung (sorry for this encoding-independent symbolic
    notation here!) is represented as &#195;&#164;, which happens to be the
    decimal values of the two octets comprising the UTF-8 encoding of
    U+00E4, or Latin small letter a with umlaut.
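
The symptom described above, two numeric entities where one is expected, can be reproduced without XML::Simple. A minimal sketch, assuming a hypothetical `numeric_escape` helper that only imitates a numeric-escape pass (it is not XML::Simple's implementation):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "ä" (U+00E4) as they sit in the source file.
my $octets = "\xC3\xA4";

# Hypothetical helper: replace every non-ASCII character with a decimal
# character reference.
sub numeric_escape {
    my ($string) = @_;
    $string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $string;
}

# Correct path: decode the octets into a single character first.
my $decoded = decode('UTF-8', $octets);
print numeric_escape($decoded), "\n";

# Faulty path: the octets were never decoded, so each one is treated as
# a separate Latin-1 character and escaped on its own.
print numeric_escape($octets), "\n";

# Writing the undecoded string through a UTF-8 layer encodes it again,
# turning two octets into four.
my $double = encode('UTF-8', $octets);
print join(' ', map { sprintf '%02X', ord } split //, $double), "\n";
```

If the escaped output shows one entity per octet, the string reached the writer undecoded, which points at the input side rather than at XMLout().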

    I've already considered suffering silently from now on and writing
    a small filter that replaces all these bytes in the final output, but
    then, I think this is deeply unsatisfying.

    Thanks again,
    Oliver.

    PS: A small truth table when using utf8 and the binmode statements:

    use utf8   binmode   print      XMLout
    yes        yes       fails      fails
    no         yes       fails      fails
    yes        no        succeeds   fails
    no         no        succeeds   fails

    We see that the utf8 pragma doesn't change anything even though the
    data section of my script is utf8-material whereas binmode (STDOUT,':utf8')
    seems to have the opposite effect of what it claims.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 26, 2006
    #12
  13. -berlin.de wrote:
    > Donald King <> wrote:
    > : <http://chronos-tachyon.net/~chronos/corff.pl>.
    >
    > Thank you _very_ much for the fixed code. I ran it, and to no avail.
    > The problems remain. Can you tell me which environment the code worked
    > for you?
    >
    > My environment:
    >
    > perl -v states:
    > perl v5.8.5 built for i386-linux-thread-multi
    >
    > echo $LANG states:
    > en_US.UTF-8
    >
    > in vim, opening the file in utf8 encoding succeeds (and displays
    > correctly)


    I assume that your terminal is in UTF-8 mode, too, then. You could
    verify that by invoking "cat corff.pl" and checking whether it looks
    correct.


    > When running the file from the command line
    > ./corff.pl
    >
    > I get:
    > 1) broken output of the print statement
    > 2) over-interpreted representations of utf8 data in the output file
    > w.xml.


    It works for me on FC3 (which is what you are using, too, if I remember
    one of your previous posts correctly).

    XML::Simple doesn't parse XML itself. Which XML parser are you using?
    I use XML::LibXML.

    For reference, here is the output of
    rpm -qa | grep perl-XML
    on this machine.

    perl-XML-NamespaceSupport-1.08-6
    perl-XML-LibXML-1.58-1
    perl-XML-Dumper-0.71-2
    perl-XML-SAX-0.12-7
    perl-XML-Encoding-1.01-26
    perl-XML-Grove-0.46alpha-27
    perl-XML-Parser-2.34-5
    perl-XML-Twig-3.13-6
    perl-XML-LibXML-Common-0.13-7


    > If, in a fit of desperation, I modify the output of XMLout() with
    > NumericEscape=>2, all I get in the output is that, eg. a umlaut of
    > Morgend&auml;mmerung (sorry for this encoding-independent symbolic
    > notation here!) is represented as &#195;&#164;


    This is definitely wrong. It should be only one entity (&#228; in this
    case). So probably your parser parses the file as ISO-8859-1 instead of
    UTF-8 or passes the "raw" strings on instead of converting them into
    perl's internal utf-8 representation.

    > I've already considered to suffer silently from now onwards and to
    > write a small filter that replaces all these bytes in the final
    > output, but then, I think this is deeply unsatisfying.


    Don't. It looks like the error happens on input, so if you have to
    resort to such crude hacks, replace the bytes just after reading the
    input.
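
A minimal sketch of repairing the data right after input, as suggested above. The `decode_leaves` helper and the sample structure are hypothetical; the sketch assumes that any string made up only of characters below U+0100 that is also valid UTF-8 is undecoded input, which can misfire on genuine Latin-1 text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: walk a nested hash/array structure (like the one
# XMLin() returns) and decode every string that still holds raw UTF-8
# octets. Hash keys are left untouched.
sub decode_leaves {
    my ($node) = @_;
    if (ref $node eq 'HASH') {
        for my $key (keys %$node) {
            $node->{$key} = decode_leaves($node->{$key});
        }
        return $node;
    }
    if (ref $node eq 'ARRAY') {
        for my $item (@$node) {
            $item = decode_leaves($item);
        }
        return $node;
    }

    # Skip undef, other references, pure ASCII, and strings that already
    # contain characters above U+00FF (those are clearly decoded).
    return $node if !defined $node || ref $node;
    return $node if $node !~ /[\x80-\xFF]/ || $node =~ /[^\x00-\xFF]/;

    # Try a strict decode; keep the original string if it is not valid UTF-8.
    my $octets = $node;
    utf8::downgrade($octets);
    my $chars = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? $chars : $node;
}

# Sample data shaped like the thread's example, with undecoded octets.
my $data = {
    lemma => {
        '0004.1' => {
            hauer => ["in der Morgend\xC3\xA4mmerung (H)."],
            manju => ["farh\xC3\xBBn suwaliyame"],
        },
    },
};

decode_leaves($data);
printf "fifth character of manju: U+%04X\n",
    ord substr($data->{lemma}{'0004.1'}{manju}[0], 4, 1);
```

This is a workaround only; replacing the parser that delivers undecoded strings removes the need for it.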

    hp

    --
    _ | Peter J. Holzer | Löschung von at.usenet.schmankerl?
    |_|_) | Sysadmin WSR/LUGA |
    | | | | Diskussion derzeit in at.usenet.gruppen
    __/ | http://www.hjp.at/ |
    Peter J. Holzer, Mar 26, 2006
    #13
  14. Guest

    Donald King Guest

    -berlin.de wrote:
    > Donald King <> wrote:
    > : [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]
    >
    > : FWIW, I've taken your original test case from the top of the thread and
    > : fixed it up. It's now properly encoded in UTF-8, it uses both "use
    > : utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
    > : to run under "use strict" and "use warnings" while I was in there. You
    > : can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.
    >
    > Hi Donald,
    >
    > Thank you _very_ much for the fixed code. I ran it, and to no avail. The
    > problems remain. Can you tell me which environment the code worked for you?
    >
    > My environment:
    >
    > perl -v states:
    > perl v5.8.5 built for i386-linux-thread-multi
    >
    > echo $LANG states:
    > en_US.UTF-8
    >
    > in vim, opening the file in utf8 encoding succeeds (and displays correctly)
    >


    My environment:

    Perl: v5.8.8 built for i486-linux-gnu-thread-multi
    LANG: en_US.UTF-8
    Vim: encoding=utf-8 fileencoding=utf-8 termencoding=utf-8
    Terminal: Gnome-Terminal w/ encoding set to "Current Locale (UTF-8)"

    Typing "cat corff.pl" prints the source code, complete with funny German
    scribbles over the vowels. ;-)

    Oh, and since it may be relevant:
    XML::Simple version 2.14
    XML::Parser version 2.34
    XML::SAX version 0.12

    Whoops, I think I just found the problem. When I was checking version
    numbers, I went ahead and checked CPAN for newer versions. After
    installing XML::SAX version 0.13, the code broke. Try downgrading to
    0.12 and see if that fixes things. (You can find a copy at
    <http://search.cpan.org/~msergeant/XML-SAX-0.12/>.)

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Donald King, Mar 26, 2006
    #14
  15. Guest

    Donald King Guest

    Donald King wrote:
    > -berlin.de wrote:
    >
    >> Donald King <> wrote:
    >> : [Whoops, I did the post-vs-mail thing again. Bad coder, no cookie.]
    >>
    >> : FWIW, I've taken your original test case from the top of the thread and
    >> : fixed it up. It's now properly encoded in UTF-8, it uses both "use
    >> : utf8" and "binmode(STDOUT, ':utf8')" to fix the problem, and I fixed it
    >> : to run under "use strict" and "use warnings" while I was in there. You
    >> : can download it at <http://chronos-tachyon.net/~chronos/corff.pl>.
    >>
    >> Hi Donald,
    >>
    >> Thank you _very_ much for the fixed code. I ran it, and to no avail. The
    >> problems remain. Can you tell me which environment the code worked for
    >> you?
    >>
    >> My environment:
    >>
    >> perl -v states:
    >> perl v5.8.5 built for i386-linux-thread-multi
    >>
    >> echo $LANG states:
    >> en_US.UTF-8
    >>
    >> in vim, opening the file in utf8 encoding succeeds (and displays
    >> correctly)
    >>

    >
    > My environment:
    >
    > Perl: v5.8.8 built for i486-linux-gnu-thread-multi
    > LANG: en_US.UTF-8
    > Vim: encoding=utf-8 fileencoding=utf-8 termencoding=utf-8
    > Terminal: Gnome-Terminal w/ encoding set to "Current Locale (UTF-8)"
    >
    > Typing "cat corff.pl" prints the source code, complete with funny German
    > scribbles over the vowels. ;-)
    >
    > Oh, and since it may be relevant:
    > XML::Simple version 2.14
    > XML::Parser version 2.34
    > XML::SAX version 0.12
    >
    > Whoops, I think I just found the problem. When I was checking version
    > numbers, I went ahead and checked CPAN for newer versions. After
    > installing XML::SAX version 0.13, the code broke. Try downgrading to
    > 0.12 and see if that fixes things. (You can find a copy at
    > <http://search.cpan.org/~msergeant/XML-SAX-0.12/>.)
    >


    FWIW, I've been on a goose chase through the guts of XML::SAX::purePerl,
    and it seems both versions are horribly buggy with UTF-8. As a quick
    fix, install either XML::SAX::Expat, XML::SAX::ExpatXS, or
    XML::LibXML::SAX. All 3 seem to work just fine.
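
A minimal sketch of checking which SAX parsers are registered and asking XML::Simple to prefer one that is not the pure-Perl parser. It assumes XML::SAX and XML::Simple may or may not be installed, so both are loaded defensively; the selection rule (first registered non-PurePerl parser) is an illustration, not a recommendation from the thread.

```perl
use strict;
use warnings;

# Load the modules defensively; either may be missing on a given system.
my $have_sax    = eval { require XML::SAX;    1 };
my $have_simple = eval { require XML::Simple; 1 };

# Ask XML::SAX which parsers are registered in its ParserDetails.ini.
my @parsers;
if ($have_sax) {
    my $known = eval { XML::SAX->parsers } || [];
    @parsers = map { $_->{Name} } @$known;
}
print "Registered SAX parsers: @parsers\n";

# Pick the first registered parser that is not the pure-Perl one.
my ($preferred) = grep { $_ ne 'XML::SAX::PurePerl' } @parsers;

if ($have_simple && defined $preferred) {
    # XML::Simple consults this package variable when choosing a parser.
    no warnings 'once';
    $XML::Simple::PREFERRED_PARSER = $preferred;
    print "XML::Simple will be asked to use $preferred\n";
}
else {
    print "No alternative SAX parser found (or XML::Simple is missing)\n";
}
```

Setting the variable after XML::Simple has been loaded, and before the first XMLin() call, avoids any question about the module resetting it at load time.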

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
    Donald King, Mar 26, 2006
    #15
  16. Guest

    Guest Guest

    Donald King <> wrote:
    : >
    : > Typing "cat corff.pl" prints the source code, complete with funny German
    : > scribbles over the vowels. ;-)

    Yes, it does so.

    : > Oh, and since it may be relevant:
    : > XML::Simple version 2.14
    : > XML::Parser version 2.34
    : > XML::SAX version 0.12
    : >

    I'll look into that later today.

    : FWIW, I've been on a goose chase through the guts of XML::SAX::purePerl,
    : and it seems both versions are horribly buggy with UTF-8. As a quick
    : fix, install either XML::SAX::Expat, XML::SAX::ExpatXS, or
    : XML::LibXML::SAX. All 3 seem to work just fine.

    That sounds appalling, all the more so as XML claims to use Unicode/utf8
    as its encoding of choice. Very obviously, though, the developers of
    the above-mentioned packages have potentially never tested their packages
    with some true utf-8 data (perhaps including umlauts and Chinese).

    Thank you very much for your efforts!

    Oliver.
    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 26, 2006
    #16
  17. Guest

    Guest Guest

    Dennis Roesler <> wrote:

    : I've been following this thread because I have been struggling with
    : XML::Simple writing/sourcing an XML file in cp932 encoding. The
    : NumericEscape is what resolved the writing and setting the encoding in
    : the xml declaration of the cp932 encoded file to x-sjis-cp932 so
    : XML::Simple would source it properly took me awhile to figure out :-(.

    [ good examples snipped ]

    Hi Dennis and all others who have contributed to this thread,

    Thank you very much for your input.

    I followed up on the idea of the broken SAX module and decided to make
    other parsers available to XML::SAX. After simply installing
    XML::SAX::Expat as well as XML::LibXML (my code now uses the latter,
    automagically), the script finally runs flawlessly. What a difficult
    delivery it was!

    Thanks again to all,

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, Mar 29, 2006
    #17