xml entity problem

Discussion in 'XML' started by Jos van Uden, Aug 26, 2004.

  1. Jos van Uden

    Jos van Uden Guest

    Can somebody explain why the following file
    has the wrong output:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <test>
    <elem>&#8216;bla bla bla&#8217;</elem>
    </test>

    Expected: ‘bla bla bla’
    output: ?bla bla bla?

    It's not caused by the browser, but
    by the (expat) xml parser.

    Thanks.


    test script:


    <?php

    $file = "test.xml";
    $testdata = "";   // character data collected from <elem>
    $tagname = "";    // name of the most recently opened element

    // Remember which element we are in. Names arrive upper-cased
    // because case folding is on by default.
    function startElement($parser, $name, $attrs) {
        global $tagname;
        $tagname = $name;
    }

    function endElement($parser, $name) {
    }

    // Append non-blank character data found inside <elem>.
    function characterData($parser, $data) {
        global $testdata, $tagname;
        if (trim($data) != "") {
            switch ($tagname) {
                case 'ELEM':
                    $testdata .= $data;
                    break;
            }
        }
    }

    $xml_parser = xml_parser_create();
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");
    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }

    // Feed the file to the parser in 4 KB chunks.
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
        }
    }
    fclose($fp);
    xml_parser_free($xml_parser);

    print "output : " . $testdata;

    ?>
    Jos van Uden, Aug 26, 2004
    #1

  2. Derek Harmon

    Derek Harmon Guest

    "Jos van Uden" <> wrote in message news:cgl6aq$rvt$1.nb.home.nl...
    > Can somebody explain why the following file
    > has the wrong output:
    >
    > <?xml version="1.0" encoding="iso-8859-1"?>
    > <test>
    > <elem>&#8216;bla bla bla&#8217;</elem>
    > </test>
    >
    > Expected: ‘bla bla bla’
    > output: ?bla bla bla?


    The output is correct. The file has the wrong encoding.
    ISO-8859-1 is essentially ANSI; it does not have 8,217
    code points for its characters.

    Try UTF-16 encoding instead (and then, as it sounds like
    you're aware, user agents can introduce '?' as well if they
    are not displayed with an appropriate Unicode code page
    and font.)


    Derek Harmon
    Derek Harmon, Aug 27, 2004
    #2

  3. Jos van Uden

    Jos van Uden Guest

    Derek Harmon wrote:
    > "Jos van Uden" <> wrote in message news:cgl6aq$rvt$1.nb.home.nl...
    >
    >>Can somebody explain why the following file
    >>has the wrong output:
    >>
    >><?xml version="1.0" encoding="iso-8859-1"?>
    >><test>
    >> <elem>&#8216;bla bla bla&#8217;</elem>
    >></test>
    >>
    >>Expected: ‘bla bla bla’
    >>output: ?bla bla bla?


    > The output is correct. The file has the wrong encoding.
    > ISO-8859-1 is essentially ANSI; it does not have 8,217
    > code points for its characters.


    > Try UTF-16 encoding instead


    Unfortunately, the xml_parser doesn't support UTF-16.
    The supported encodings are ISO-8859-1, US-ASCII
    and UTF-8, so I can't try this.

    >(and then, as it sounds like
    > you're aware, user agents can introduce '?' as well if they
    > are not displayed with an appropriate Unicode code page
    > and font.)


    I've tested this, of course.

    Thanks for your response.

    Jos
    Jos van Uden, Aug 27, 2004
    #3
  4. David Carlisle

    > Unfortunately, the xml_parser doesn't support UTF-16.
    > The supported encodings are ISO-8859-1, US-ASCII
    > and UTF-8, so I can't try this.

    Then it's not an XML parser as UTF8 and UTF16 are both required
    encodings in any conformant XML parser.

    8216 is LEFT SINGLE QUOTATION MARK (8217 is the right one). You say you
    expect this to be output as byte octal 221 (dec 145). You haven't said
    what you were using to output your parsed file, so I don't know what
    output encoding you have requested, but there is no character with such
    a byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft
    code page. (The character in question in your posting is displayed as
    \221 in my mail reader, which defaults to showing octal codes for
    unknown bytes.)
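
To make the byte values above concrete, here is a minimal sketch (not part of the original posts) that decodes the single byte octal 221 under two different assumptions about its encoding. It assumes a current PHP build with the iconv extension available; the variable names are purely illustrative.

```php
<?php
// Byte octal 221 = decimal 145 = hex 91, as it appears in a windows-1252 posting.
$byte = "\x91";

// Read as windows-1252, the byte is U+2018 LEFT SINGLE QUOTATION MARK.
$fromCp1252 = iconv('WINDOWS-1252', 'UTF-8', $byte);
echo bin2hex($fromCp1252), "\n";   // e28098 (UTF-8 form of U+2018)

// Read as ISO-8859-1, the same byte is the control code U+0091, not a quote.
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);
echo bin2hex($fromLatin1), "\n";   // c291 (UTF-8 form of U+0091)
```

The same byte therefore only "is" a quote if the reader assumes the Microsoft code page.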

    David
    David Carlisle, Aug 27, 2004
    #4
  5. Jos van Uden

    Jos van Uden Guest

    David Carlisle wrote:
    > Unfortunately, the xml_parser doesn't support UTF-16.
    > The supported encodings are ISO-8859-1, US-ASCII
    > and UTF-8, so I can't try this.
    >
    > Then it's not an XML parser as UTF8 and UTF16 are both required
    > encodings in any conformant XML parser.


    I see. From the PHP 4 manual:

    "This PHP extension implements support for James Clark's expat™ in PHP.
    (...) It supports three source character encodings also provided by PHP:
    US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

    > 8216 is LEFT SINGLE QUOTATION MARK (8217 is the right one). You say you
    > expect this to be output as byte octal 221 (dec 145). You haven't said
    > what you were using to output your parsed file, so I don't know what
    > output encoding you have requested, but there is no character with such
    > a byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft
    > code page. (The character in question in your posting is displayed as
    > \221 in my mail reader, which defaults to showing octal codes for
    > unknown bytes.)


    Also:

    "(...)There are two types of character encodings, source encoding and
    target encoding. PHP's internal representation of the document is always
    encoded with UTF-8.

    (...) The default source encoding used by PHP is ISO-8859-1.

    (...) When an XML parser is created, the target encoding is set to the
    same as the source encoding, (...)

    If PHP encounters characters in the parsed XML document that can not be
    represented in the chosen target encoding, the problem characters will
    be "demoted". Currently, this means that such characters are replaced by
    a question mark. "
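
The "demotion" described in the manual excerpt can be reproduced with a short sketch. It is an illustration only: it assumes the test document carried the numeric character references &#8216; and &#8217;, it targets a current PHP (7.4 or later) with the xml extension rather than PHP 4, and `collectText` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: parse $xml and return all character data,
// delivered to the handler in $targetEncoding.
function collectText(string $xml, string $targetEncoding): string
{
    $text = '';
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;   // data may arrive in several chunks
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

// Assumed test document: the quotes are numeric character references.
$xml = '<?xml version="1.0" encoding="iso-8859-1"?>'
     . '<test><elem>&#8216;bla bla bla&#8217;</elem></test>';

// ISO-8859-1 cannot represent U+2018/U+2019, so they are demoted.
echo collectText($xml, 'ISO-8859-1'), "\n";   // ?bla bla bla?

// UTF-8 can represent them, so the quotes survive as UTF-8 bytes.
echo collectText($xml, 'UTF-8'), "\n";
```

Only the target encoding differs between the two calls, which is what decides whether the quotes survive.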

    So it seems it's PHP 4 that's the limiting factor here. Strange thing
    is: if I replace the encoding with the original characters, it shows
    up fine (using ISO-8859-1 as charset).

    We're having this problem with an RSS feeder called zfeeder. We've
    already contacted the author, but haven't received any response. So
    I thought I'd try and fix it myself. :(

    Thanks for your help.

    Jos
    Jos van Uden, Aug 27, 2004
    #5
  6. Martin Honnen

    Jos van Uden wrote:

    > Can somebody explain why the following file
    > has the wrong output:
    >
    > <?xml version="1.0" encoding="iso-8859-1"?>
    > <test>
    > <elem>&#8216;bla bla bla&#8217;</elem>
    > </test>
    >
    > Expected: ‘bla bla bla’
    > output: ?bla bla bla?
    >
    > It's not caused by the browser, but
    > by the (expat) xml parser.
    >
    > Thanks.
    >
    >
    > test script:
    >
    >
    > <?php
    >
    > $file = "test.xml";
    > $testdata;
    > $tagname;
    >
    > function startElement($parser, $name, $attrs) {
    > global $tagname;
    > $tagname = $name;
    > }
    >
    > function endElement($parser, $name) {
    > }
    >
    > function characterData(&$parser, $data) {
    > global $testdata, $tagname;
    > if(trim($data) != "") {
    > switch($tagname) {
    > case 'ELEM' :
    > $testdata .= $data;
    > break;
    > }
    > }
    > }
    >
    > $xml_parser = xml_parser_create();
    > xml_set_element_handler($xml_parser, "startElement", "endElement");
    > xml_set_character_data_handler($xml_parser, 'characterData');
    > if (!($fp = fopen($file, "r"))) {
    > die("could not open XML input");
    > }
    >
    > while ($data = fread($fp, 4096)) {
    > if (!xml_parse($xml_parser, $data, feof($fp))) {
    > die(sprintf("XML error: %s at line %d",
    > xml_error_string(xml_get_error_code($xml_parser)),
    > xml_get_current_line_number($xml_parser)));
    > }
    > }
    > xml_parser_free($xml_parser);
    >
    > print "output : " .$testdata;


    Have you tried using
    print "output : " . utf8_decode($testdata);
    here? Or try
    header('Content-Type: text/plain; charset=UTF-8');
    before you print out the data assembled with the XML parser.



    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Aug 27, 2004
    #6
  7. David Carlisle

    > So it seems it's PHP 4 that's the limiting factor here. Strange thing
    > is: if I replace the encoding with the original characters, it shows
    > up fine (using ISO-8859-1 as charset).

    ISO-8859-1 doesn't have any such quotes and has nothing at all in that
    position, which is why the characters in your posting don't display for
    me. E.g.

    > James Clark's expat™ in PHP

    comes out as

    James Clark's expat\231 in PHP

    as my news reader (emacs) is defaulting to latin-1 (iso-8859-1), but your
    posting is in the Microsoft-specific encoding
    charset=windows-1252
    which is properly declared in the headers, but that doesn't help me as my
    system apparently doesn't know (or at least can't display in) that encoding.

    So if you ask for iso-8859-1 output, then the Unicode character for a
    left quote is likely to map to some kind of missing-glyph marker, as the
    specified encoding doesn't have such a character. On the other hand, if
    the system just believes that the first 256 Unicode slots should map
    straight to latin-1 (which is more or less true) and doesn't raise an
    error on "non-characters" in the control positions, then you may find
    that bytes corresponding to Microsoft-encoded quote characters do happen
    to be output, but that is more by luck and lack of error detection than
    anything else.

    Of the encodings you list as supported,

    > It supports three source character encodings also provided by PHP:
    > US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported.

    only UTF-8 has the left and right quotes, so you would have to output as
    UTF-8.

    David
    David Carlisle, Aug 27, 2004
    #7
  8. Jos van Uden

    Jos van Uden Guest

    Martin Honnen wrote:

    (...)

    > Have you tried here to use
    > print "output : " . utf8_decode($testdata);
    > ?
    > Or try
    > header('Content-Type: text/plain; charset=UTF-8');
    > before you print out the data assembled with the XML parser.


    Ok, it works now.

    1) I removed the iso-8859-1 encoding from the xml file, which
    makes it default to UTF-8
    2) I set the encoding of the parser to UTF-8 explicitly, otherwise
    it will default to iso-8859-1. The target encoding follows the source
    encoding.
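
The two steps above can be sketched roughly as follows. This is not the poster's actual script: it uses an inline string instead of test.xml, assumes the element content was written with the numeric references &#8216; and &#8217;, and is written for a current PHP (7.4 or later) with the xml extension, where defaults may differ from PHP 4.

```php
<?php
// Step 1: no encoding="iso-8859-1" in the declaration, so the
// document is read as UTF-8 (the XML default).
$xml = '<?xml version="1.0"?>' . "\n"
     . "<test>\n<elem>&#8216;bla bla bla&#8217;</elem>\n</test>\n";

$testdata = '';
$insideElem = false;

// Step 2: request UTF-8 explicitly when creating the parser.
$parser = xml_parser_create('UTF-8');

xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$insideElem): void {
        // Element names are upper-cased by the default case folding.
        $insideElem = ($name === 'ELEM');
    },
    function ($parser, string $name) use (&$insideElem): void {
        $insideElem = false;
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$testdata, &$insideElem): void {
        if ($insideElem) {
            $testdata .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(sprintf(
        'XML error: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}

// $testdata now holds the quotes as UTF-8 bytes; the page that prints
// it must also be served and viewed as UTF-8.
print "output : " . $testdata . "\n";
```

Tracking an "inside `<elem>`" flag instead of the last opened tag name also avoids picking up whitespace that follows the closing tag.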

    Thanks David, Martin and Derek

    What's with this iso-8859-1? Why are we still using that? I use
    it because it seems to be common practice, and I figure there
    must be a reason for it. But what is that reason? Backward
    compatibility?

    If I start using UTF-8 as charset in my meta tags will there be
    undesirable side-effects? Currently I use iso-8859-1 and simply
    convert special characters to entities. My xhtml 1.0 transitional
    always validates (eventually). So I guess no harm is done.

    I'm not even sure if the meta tag does anything in a valid xhtml
    file.

    Anyway, I guess I'll have some googling ahead of me :)

    Thanks again
    Jos van Uden, Aug 27, 2004
    #8
