LibXML element->toString vs document->toString

Discussion in 'Perl Misc' started by Fergus McMenemie, Jul 12, 2012.

  1. Hi, I have been driven mad by the following, which took ages to track
    down. What is going on? It appears it is invalid to use toString on the
    document object.


    #! /usr/local/bin/perl -w
    use strict;
    use warnings;
    use utf8;
    use Encode;
    use XML::LibXML;
    binmode(STDOUT, ":utf8");

    my $src= join("",<DATA>);
    print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
    my $parser = XML::LibXML->new();
    my $x = $parser->parse_string($src)->documentElement();
    my $str=$x->toString(1);
    print "$str\n";
    print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

    $x = $parser->parse_string($src);
    $str=$x->toString(1);
    print "$str\n";
    print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

    __DATA__
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <plugin name="\xc5\x81"></plugin>
     
    Fergus McMenemie, Jul 12, 2012
    #1

  2. Ben Morrow <> wrote:

    > Quoth (Fergus McMenemie):
    > > Hi, I have been driven mad by the following, which took ages to track
    > > down. What is going on? It appears it is invalid to use toString on the
    > > document object.
    > >
    > >
    > > #! /usr/local/bin/perl -w
    > > use strict;
    > > use warnings;
    > > use utf8;
    > > use Encode;
    > > use XML::LibXML;
    > > binmode(STDOUT, ":utf8");
    > >
    > > my $src= join("",<DATA>);
    > > print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );

    >
    > Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
    > which is internal to perl and none of your business. (The Encode
    > documentation is not as clear about this as it might be, because it only
    > became clear through experience that this is the only approach which
    > works.)


    Agreed, the warnings are there. However, it did appear to make the
    issue clearer. This example is rather goofy, and posting it to Usenet
    added a few more wrinkles. My original code and the real program
    contained the actual characters. However, my Usenet reader would not
    let me post the real chars, hence the octets.

    My issue is that document->toString does not appear to work. Please
    ignore the use of is_utf8.

    > What are you actually trying to find out?

    I have to pass references to DOM objects around all over the
    place. I find I am having to make use of either documentElement()
    or ownerDocument() depending on what I am doing. I would like to have
    a consistent "pattern" for doing this. I would like to settle on
    passing the document object around, but it is annoying that I can't then
    use toString.
     
    Fergus McMenemie, Jul 13, 2012
    #2
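
On the is_utf8 point raised in this post, here is a minimal sketch, using only the core Encode module, of why the internal flag says nothing about validity, and of how validity could be checked by attempting a strict decode instead. The helper name octets_are_valid_utf8 is made up for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): true if the octets decode
# cleanly as strict UTF-8. FB_CROAK makes decode die on malformed input;
# LEAVE_SRC stops decode from modifying its argument.
sub octets_are_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $good = "\xc5\x81";    # the two octets that encode U+0141
my $bad  = "\xff\xfe";    # not valid UTF-8

# Both are plain byte strings, so the internal flag is off for each,
# yet only one of them is well-formed UTF-8.
for my $octets ($good, $bad) {
    printf "flag=%d valid=%d\n",
        Encode::is_utf8($octets) ? 1 : 0,
        octets_are_valid_utf8($octets);
}
```

The flag reports how perl happens to store the string, so a test based on it can come out the same for good and bad input, as it does here.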

  3. Ben Morrow <> wrote:

    > > > What are you actually trying to find out?

    > > I have to pass references to DOM objects around all over the
    > > place. I find I am having to make use of either documentElement()
    > > or ownerDocument() depending on what I am doing. I would like to have
    > > a consistent "pattern" for doing this. I would like to settle on
    > > passing the document object around, but it is annoying that I can't then
    > > use toString.

    >
    > I'm afraid I don't understand. When I run the original program I get the
    > results I would have expected: the first prints the XML without the
    > <?xml?>, the second prints it with it. What is going wrong for you?


    Thanks for the tip. My code now reads:-

    use strict;
    use warnings;
    use Encode;
    use XML::LibXML;
    binmode(STDOUT, ":utf8");

    my $src= join("",<DATA>);
    $src =~ s/\\x([0-9a-f][0-9a-f])/chr hex $1/egi;
    $src = Encode::decode "utf8", $src;
    print "LibXML VERSION=$XML::LibXML::VERSION\n";
    print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
    my $parser = XML::LibXML->new();
    my $x = $parser->parse_string($src)->documentElement();
    my $str=$x->toString(1);
    print "$str\n";
    print "string 1 is invalid \n" unless ( Encode::is_utf8($str,1) );

    $x = $parser->parse_string($src);
    $str=$x->toString(1);
    print "$str\n";
    print "string 2 is invalid \n" unless ( Encode::is_utf8($str,1) );

    __DATA__
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <plugin name="\xef\xbd\xb1\xef\xbd\xb2\xef\xbd\xb3\xef\xbd\xb4\xef\xbd\xb5"></plugin>


    It fails on my Mac running OS X Snow Leopard. The 'real' version is
    running with perl 5.12 on CentOS and also fails there. Not sure about the
    version of LibXML.

    Does it work for you?
     
    Fergus McMenemie, Jul 14, 2012
    #3
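
For the goal quoted in this post of passing one kind of DOM object around, one possible pattern is a small normalizing helper. This is only a sketch: doc_of and root_of are hypothetical names introduced here, while isa, ownerDocument, documentElement, firstChild and nodeName are existing Perl or XML::LibXML methods.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: given a document or any node inside one,
# return the owning XML::LibXML::Document.
sub doc_of {
    my ($node) = @_;
    return $node->isa('XML::LibXML::Document')
        ? $node
        : $node->ownerDocument;
}

# Hypothetical helper: the root element, whatever node we were handed.
sub root_of {
    my ($node) = @_;
    return doc_of($node)->documentElement;
}

my $doc   = XML::LibXML->new->parse_string('<plugin><item/></plugin>');
my $child = $doc->documentElement->firstChild;

# Callers can hand over either object and get the same root back.
print root_of($doc)->nodeName,   "\n";
print root_of($child)->nodeName, "\n";
```

With helpers like these, functions can accept whichever node the caller has and normalize it at the point of use.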
  4. Ben Morrow <> wrote:

    > Quoth (Fergus McMenemie):
    > > Ben Morrow <> wrote:
    > > > Quoth (Fergus McMenemie):

    > > > > Hi, I have been driven mad by the following, which took ages to track
    > > > > down. What is going on? It appears it is invalid to use toString on the
    > > > > document object.
    > > > >
    > > > >
    > > > > #! /usr/local/bin/perl -w
    > > > > use strict;
    > > > > use warnings;
    > > > > use utf8;
    > > > > use Encode;
    > > > > use XML::LibXML;
    > > > > binmode(STDOUT, ":utf8");
    > > > >
    > > > > my $src= join("",<DATA>);
    > > > > print "string \$src is invalid \n" unless ( Encode::is_utf8($src,1) );
    > > >
    > > > Don't do that. Encode::is_utf8 checks the state of the SvUTF8 flag,
    > > > which is internal to perl and none of your business. (The Encode
    > > > documentation is not as clear about this as it might be, because it only
    > > > became clear through experience that this is the only approach which
    > > > works.)

    > >
    > > Agreed, the warnings are there. However, it did appear to make the
    > > issue clearer. This example is rather goofy, and posting it to Usenet
    > > added a few more wrinkles. My original code and the real program
    > > contained the actual characters. However, my Usenet reader would not
    > > let me post the real chars, hence the octets.

    >
    > It can certainly be difficult, given that Usenet officially doesn't
    > support anything but ASCII. Unofficially, if you can get your newsreader
    > to produce it, articles in UTF-8 with 'Content-type: text/plain;
    > charset=UTF-8' seem to work perfectly well.
    >
    > Another thing you can do is explicitly decode the data in the program
    > you post; possibly something like
    >
    > my $str = <DATA>;
    > $str =~ s/%([0-9a-f][0-9a-f])/chr hex $1/egi;
    > $str = Encode::decode "utf8", $str;
    >
    > This uses URL-encoding rather than backslashes; you can pick whatever is
    > convenient for the data you are trying to post.
    >
    > > My issue is that document->toString does not appear to work. Please
    > > ignore the use of is_utf8.

    >
    > OK.
    >
    > > > What are you actually trying to find out?

    > > I have to pass references to DOM objects around all over the
    > > place. I find I am having to make use of either documentElement()
    > > or ownerDocument() depending on what I am doing. I would like to have
    > > a consistent "pattern" for doing this. I would like to settle on
    > > passing the document object around, but it is annoying that I can't then
    > > use toString.

    >
    > I'm afraid I don't understand. When I run the original program I get the
    > results I would have expected: the first prints the XML without the
    > <?xml?>, the second prints it with it. What is going wrong for you?
    >
    > Ben
     
    Fergus McMenemie, Jul 14, 2012
    #4
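
The decoding tip quoted in this post can be sketched as a small standalone routine. It uses the backslash-x escapes from the thread's test data rather than URL-encoding; unescape_utf8 is a hypothetical helper name, and only the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn ASCII-safe "\xNN" escapes back into a
# character string, so that non-ASCII test data can be posted as ASCII.
sub unescape_utf8 {
    my ($text) = @_;
    # Replace each \xNN escape with the single octet it names.
    $text =~ s/\\x([0-9a-f]{2})/chr hex $1/egi;
    # Interpret the resulting octets as UTF-8, giving characters.
    return Encode::decode('UTF-8', $text);
}

# Single quotes keep the backslashes literal, as they would be in <DATA>.
my $name = unescape_utf8('\xef\xbd\xb1\xef\xbd\xb2');
printf "%d characters, first is U+%04X\n", length($name), ord($name);
# prints "2 characters, first is U+FF71"
```

Six escaped octets become two characters, which is the form a parser or string function would normally be given.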
  5. Ben Morrow <> wrote:

    > > What gives you that idea? RFC 5536 explicitly allows MIME-encoded
    > > data, e.g.,

    >
    > Ooh, they've actually published an update. I didn't know that.


    My newsreader does not properly support UTF-8. I guess lots of others still
    don't either.

    MacSoup - my soup's gone off!
     
    Fergus McMenemie, Jul 17, 2012
    #5
  6. Ben Morrow <> wrote:

    > Yes, it works as documented for me. Are you getting confused by the fact
    > that ->toString produces a byte string for whole documents, but a
    > character string for just an element? Read the 'ENCODINGS SUPPORT'
    > section in perldoc XML::LibXML: you don't want a :utf8 layer if you're
    > printing a whole document, because the document isn't necessarily in
    > UTF-8.


    Duh!
    Thanks, I don't know how I managed to miss that bit.
     
    Fergus McMenemie, Jul 17, 2012
    #6
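
The distinction described in this post can be sketched as follows. It assumes a UTF-8 terminal or file as the destination. The variable names are illustrative, and the source uses a character reference so that the example itself stays ASCII.

```perl
use strict;
use warnings;
use Encode ();
use XML::LibXML;

# ASCII-only source; &#x141; is a character reference for U+0141.
my $src = '<?xml version="1.0" encoding="utf-8"?>'
        . '<plugin name="&#x141;"/>';
my $doc = XML::LibXML->new->parse_string($src);

# Whole document: toString returns octets, already encoded in the
# document's own encoding.
my $doc_octets = $doc->toString(1);

# Single element: toString returns a character string, which still has
# to be encoded before it leaves the program.
my $elem_chars  = $doc->documentElement->toString(1);
my $elem_octets = Encode::encode('UTF-8', $elem_chars);

# With a raw handle, both outputs are written as octets and neither
# is encoded a second time.
binmode(STDOUT, ':raw');
print $doc_octets;
print $elem_octets, "\n";
```

Printing $doc_octets through a :utf8 layer, as the original program did, would encode the already-encoded document a second time. That double encoding is consistent with the garbled second output reported at the start of the thread.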
