Different results parsing an XML file with XML::Simple (XML::SAX vs. XML::Parser)

Discussion in 'Perl Misc' started by Erik Wasser, Mar 2, 2006.

  1. Erik Wasser

    Erik Wasser Guest

    Hello Usenet.

    I'm somewhat confused about XML and UTF-8. I'm working with
    XML::Simple and trying to decode some XML containing German umlauts
    (ISO-8859-1). The first line of the XML declares the encoding correctly
    (see the code below), but I get different results from XML::Simple
    depending on whether it uses the default parser, XML::SAX, or a second
    parser, XML::Parser. The following code decodes a minimal XML document
    and prints the UTF-8 flags of the resulting strings.

    Could someone run this code on their machine and post the results? Thanks.
    The results on my machine are:

    ÃÃÃäöüà (0) cmp ÄÖÜäöüß (0) = -1
    ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0

    The first line was parsed by XML::SAX and the second line was parsed by
    XML::Parser. My conclusions:

    1) Line 1 is wrong, line 2 is correct.
    2) The output should be line 2 both times.
    3) There is a bug in XML::SAX.

    Your opinion?

    The code (saved on disk as ISO-8859-1):

    #!/usr/bin/perl -w

    use strict;
    use warnings;

    use XML::Simple;
    use Encode;

    foreach (1..2)
    {
        my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>ÄÖÜäöüß</a>");
        my $q2 = "ÄÖÜäöüß";

        printf "%s (%d) cmp %s (%d) = %d\n"
            , $q1, Encode::is_utf8($q1)
            , $q2, Encode::is_utf8($q2)
            , $q1 cmp $q2;

        # and again with the non-default parser
        $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
    }

    PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
    expat-1.95.8.
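
    As an aside, here is a minimal sketch of what the two output lines
    could correspond to. It is not taken from the code above and makes no
    claim about what XML::SAX actually does internally; it only uses the
    core Encode module. The helper name code_points and the three-letter
    sample string are assumptions made for illustration. It prints code
    points instead of raw characters, so the result does not depend on the
    terminal or newsreader encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a string as a list of code points, so the
# result does not depend on how the terminal displays the characters.
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $latin1_octets = "\xC4\xD6\xDC";                        # "ÄÖÜ" as ISO-8859-1 octets
my $characters    = decode('ISO-8859-1', $latin1_octets);  # decoded character string
my $utf8_octets   = encode('UTF-8', $characters);          # UTF-8 octets, never decoded

printf "%-13s %s\n", 'characters:',   code_points($characters);
printf "%-13s %s\n", 'UTF-8 octets:', code_points($utf8_octets);

# cmp compares code points, so only the decoded string equals the literal.
printf "characters   cmp literal = %d\n", $characters  cmp $latin1_octets;
printf "UTF-8 octets cmp literal = %d\n", $utf8_octets cmp $latin1_octets;
```

    In this sketch the decoded string compares equal to the literal, while
    the undecoded UTF-8 octets compare as -1, which is at least consistent
    with the pattern of the two result lines above.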

    --
    So long... Fuzz
     
    Erik Wasser, Mar 2, 2006
    #1

  2. (Erik Wasser) wrote in
    news::

    > I'm somewhat confused about XML and UTF-8. I'm working with
    > XML::Simple and trying to decode some XML containing German umlauts
    > (ISO-8859-1). The first line of the XML declares the encoding correctly
    > (see the code below), but I get different results from XML::Simple
    > depending on whether it uses the default parser, XML::SAX, or a second
    > parser, XML::Parser. The following code decodes a minimal XML document
    > and prints the UTF-8 flags of the resulting strings.
    >
    > Could someone run this code on their machine and post the results? Thanks.
    > The results on my machine are:
    >
    > ÃÃÃäöüà (0) cmp ÄÖÜäöüß (0) = -1
    > ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0
    >
    > The first line was parsed by XML::SAX and the second line was parsed
    > by XML::Parser. My conclusions:
    >
    > 1) Line 1 is wrong, line 2 is correct.
    > 2) The output should be line 2 both times.
    > 3) There is a bug in XML::SAX.
    >
    > Your opinion?
    >
    > The code (saved on disk as ISO-8859-1):
    >
    > #!/usr/bin/perl -w
    >
    > use strict;
    > use warnings;
    >
    > use XML::Simple;
    > use Encode;
    >
    > foreach (1..2)
    > {
    >     my $q1 = XMLin("<?xml version='1.0' encoding='iso-8859-1'?>\n<a>ÄÖÜäöüß</a>");
    >     my $q2 = "ÄÖÜäöüß";
    >
    >     printf "%s (%d) cmp %s (%d) = %d\n"
    >         , $q1, Encode::is_utf8($q1)
    >         , $q2, Encode::is_utf8($q2)
    >         , $q1 cmp $q2;
    >
    >     # and again with the non-default parser
    >     $XML::Simple::PREFERRED_PARSER = 'XML::Parser';
    > }
    >
    > PS: I'm using perl v5.8.7, XML-SAX-0.13, XML-Parser-2.34 and
    > expat-1.95.8.


    First off, let me say I don't know much about this stuff. I am on the US
    English version of XP. I copied and pasted the code above into Gvim, and
    then ran it. I got:


    D:\Home\asu1\UseNet\clpmisc> r > results.txt

    D:\Home\asu1\UseNet\clpmisc> cat results.txt
    ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0
    ÄÖÜäöüß (1) cmp ÄÖÜäöüß (0) = 0

    I would be inclined to look at what changed in XML-SAX between versions
    0.12 and 0.13, but then, as I said, I don't know much about encodings
    etc.

    I have XML-SAX-0.12 and XML-Parser-2.34 and

    D:\Home\asu1\UseNet\clpmisc> perl -v

    This is perl, v5.8.7 built for MSWin32-x86-multi-thread
    (with 14 registered patches, see perl -V for more detail)

    Copyright 1987-2005, Larry Wall

    Binary build 815 [211909] provided by ActiveState
    http://www.ActiveState.com
    ActiveState is a division of Sophos.
    Built Nov 2 2005 08:44:52

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Mar 2, 2006
    #2

  3. Erik Wasser

    robic0 Guest

    On Thu, 02 Mar 2006 15:57:53 GMT, "A. Sinan Unur" <> wrote:

    > (Erik Wasser) wrote in
    >news::
    >
    >> I'm somewhat confused about XML and UTF-8. I'm working with
    >> XML::Simple and trying to decode some XML containing German umlauts
    >> (ISO-8859-1). The first line of the XML declares the encoding correctly
    >> (see the code below), but I get different results from XML::Simple
    >> depending on whether it uses the default parser, XML::SAX, or a second
    >> parser, XML::Parser. The following code decodes a minimal XML document
    >> and prints the UTF-8 flags of the resulting strings.
    >>
    >> Could someone run this code on their machine and post the results? Thanks.
    >> The results on my machine are:
    >>


    You didn't try to decode in German! You might have changed the "code page"
    to German to get a different character set. It doesn't matter. I'm looking at
    your characters in whatever "code page" is on my machine. UTF8 is Unicode.
    It's not discernible unless you have a Unicode-"aware" renderer. You can't
    just change the characters on the page via cut & paste and have them turn
    into Unicode. If you open or save a Unicode document in a Unicode-aware
    editor, the displayed characters will not be noticeable as Unicode, so it's
    not something that can be "cut 'n pasted" into a newsgroup as code to be
    tested! UTF8, even "multi-byte", is transparent to the user and known only
    to the renderer. Unicode data that is read from a file into a parser (or a
    Perl program that is UTF8-aware) is treated as Unicode in its variable
    representation and in its interaction with other variables. If a regex
    is applied to Unicode data from an aware Perl parser, it works
    every time.
     
    robic0, Mar 5, 2006
    #3
  4. Erik Wasser

    robic0 Guest

    On Sat, 04 Mar 2006 17:30:09 -0800, robic0 wrote:

    >On Thu, 02 Mar 2006 15:57:53 GMT, "A. Sinan Unur" <> wrote:
    >
    >> (Erik Wasser) wrote in
    >>news::
    >>
    >>> I'm somewhat confused about XML and UTF-8. I'm working with
    >>> XML::Simple and trying to decode some XML containing German umlauts
    >>> (ISO-8859-1). The first line of the XML declares the encoding correctly
    >>> (see the code below), but I get different results from XML::Simple
    >>> depending on whether it uses the default parser, XML::SAX, or a second
    >>> parser, XML::Parser. The following code decodes a minimal XML document
    >>> and prints the UTF-8 flags of the resulting strings.
    >>>
    >>> Could someone run this code on their machine and post the results? Thanks.
    >>> The results on my machine are:
    >>>

    >
    >You didn't try to decode in German! You might have changed the "code page"
    >to German to get a different character set. It doesn't matter. I'm looking at
    >your characters in whatever "code page" is on my machine. UTF8 is Unicode.
    >It's not discernible unless you have a Unicode-"aware" renderer. You can't
    >just change the characters on the page via cut & paste and have them turn
    >into Unicode. If you open or save a Unicode document in a Unicode-aware
    >editor, the displayed characters will not be noticeable as Unicode, so it's
    >not something that can be "cut 'n pasted" into a newsgroup as code to be
    >tested! UTF8, even "multi-byte", is transparent to the user and known only
    >to the renderer. Unicode data that is read from a file into a parser (or a
    >Perl program that is UTF8-aware) is treated as Unicode in its variable
    >representation and in its interaction with other variables. If a regex
    >is applied to Unicode data from an aware Perl parser, it works
    >every time.


    Just a follow-up: I know your question was about XML, but if you want to
    use Unicode "outside" the 0-127 range in a regex, you might want to use the
    codes as in this simple example (which just uses various "ranges"):

    @UC_Nstart = (
        "\\x{C0}-\\x{D6}",
        "\\x{D8}-\\x{F6}",
        "\\x{F8}-\\x{2FF}",
        "\\x{370}-\\x{37D}",
        "\\x{37F}-\\x{1FFF}",
        "\\x{200C}-\\x{200D}",
        "\\x{2070}-\\x{218F}",
        "\\x{2C00}-\\x{2FEF}",
        "\\x{3001}-\\x{D7FF}",
        "\\x{F900}-\\x{FDCF}",
        "\\x{FDF0}-\\x{FFFD}",
        "\\x{10000}-\\x{EFFFF}",
    );
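
    The post does not show how such a list is meant to be used. One
    plausible reading, sketched below under that assumption, is to join the
    range strings into a single bracketed character class. The helper name
    starts_with_range_char is hypothetical, and only the first three
    ranges are repeated to keep the example short.

```perl
use strict;
use warnings;

# Shortened, illustrative subset of the range strings from the list above.
my @UC_Nstart = (
    "\\x{C0}-\\x{D6}",
    "\\x{D8}-\\x{F6}",
    "\\x{F8}-\\x{2FF}",
);

# Hypothetical helper: join the range strings into one character class and
# test whether the first character of $string falls inside any range.
sub starts_with_range_char {
    my ($string, @ranges) = @_;
    my $class = join '', @ranges;
    return $string =~ /^[${class}]/ ? 1 : 0;
}

# U+00C4 lies inside the first range; U+00D7 sits in the gap between the
# first two ranges; "a" is plain ASCII and below all of them.
for my $word ("\x{C4}pfel", "\x{D7}", "apple") {
    printf "U+%04X: %s\n", ord($word),
        starts_with_range_char($word, @UC_Nstart) ? "in range" : "not in range";
}
```

    The doubled backslashes in the list matter here: each string holds the
    literal text \x{C0}-\x{D6}, which the regex engine interprets as a
    range only after it has been interpolated into the pattern.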
     
    robic0, Mar 5, 2006
    #4
  5. Erik Wasser

    Erik Wasser Guest

    robic0 wrote:

    > Just a follow-up: I know your question was about XML, but if you want to
    > use Unicode "outside" the 0-127 range in a regex, you might want to use the
    > codes as in this simple example (which just uses various "ranges"):


    My question was: why do two XML parsers produce different results? It is
    the different results that confuse me, not Unicode itself.

    --
    So long... Fuzz
     
    Erik Wasser, Mar 5, 2006
    #5
  6. Erik Wasser wrote:

    [XML::Simple gives correct results with XML::Parser, but wrong results
    with XML::SAX]

    > My question was: why do two XML parsers produce different results? It is
    > the different results that confuse me, not Unicode itself.


    Looks like a bug in XML::SAX or one of the libraries it uses.
    However, like Sinan, I cannot reproduce it here on a Debian Sarge
    system:

    perl, v5.8.4 built for i386-linux-thread-multi
    XML::Simple version 2.14
    XML::SAX version 0.12
    XML::Parser version 2.34
    libexpat1 1.95.8-3

    So it may be caused by something weird in your environment.

    hp

    --
    This is not a signature
     
    Peter J. Holzer, Mar 5, 2006
    #6