xml entity problem

Jos van Uden · Aug 26, 2004

Can somebody explain why the following file
has the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

It's not caused by the browser, but
by the (expat) xml parser.

Thanks.

test script:

<?php

$file = "test.xml";
$testdata = "";   // accumulated text content of <elem>
$tagname = "";    // name of the element currently being parsed

// Remember which element we are in. Case folding is on by default,
// so element names arrive upper-cased.
function startElement($parser, $name, $attrs) {
    global $tagname;
    $tagname = $name;
}

function endElement($parser, $name) {
}

// Collect the character data of <elem>, skipping whitespace-only chunks.
function characterData($parser, $data) {
    global $testdata, $tagname;
    if (trim($data) != "") {
        switch ($tagname) {
            case 'ELEM':
                $testdata .= $data;
                break;
        }
    }
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

// Feed the file to the parser in 4 KB chunks.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
fclose($fp);
xml_parser_free($xml_parser);

print "output : " . $testdata;

?>

Derek Harmon · Aug 27, 2004

Jos van Uden said:
Can somebody explain why the following file
has the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

The output is correct. The file has the wrong encoding.
ISO-8859-1 is essentially ANSI; it has only 256 code points,
so it has no character numbered 8,217.

Try UTF-16 encoding instead (and then, as it sounds like
you're aware, user agents can introduce '?' as well if they
are not displayed with an appropriate Unicode code page
and font.)

Derek Harmon

Jos van Uden · Aug 27, 2004

The output is correct. The file has the wrong encoding.
ISO-8859-1 is essentially ANSI; it has only 256 code points,
so it has no character numbered 8,217.

Try UTF-16 encoding instead

Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

(and then, as it sounds like
you're aware, user agents can introduce '?' as well if they
are not displayed with an appropriate Unicode code page
and font.)

I've tested this, of course.

Thanks for your response.

Jos

David Carlisle · Aug 27, 2004

Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

Then it's not an XML parser, as UTF-8 and UTF-16 are both required
encodings in any conformant XML parser.

8216 is LEFT SINGLE QUOTATION MARK (8217 is the matching RIGHT one). You
say you expect this to be output as byte octal 221 (dec 145). You haven't
said what you were using to output your parsed file, so I don't know what
output encoding you have requested; there is no character with such a byte
encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code page.
(The character in question in your posting is displayed as \221 in my mail
reader, which defaults to showing octal codes for unknown bytes.)
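
To make the byte values concrete, here is a minimal sketch (not from the original posts) that prints how the left quote is represented in each encoding. It assumes PHP 7 or later and the mbstring extension; the variable names are only illustrative.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Use "?" for characters the target encoding cannot represent.
mb_substitute_character(0x3F);

$leftQuote = "\u{2018}"; // U+2018 (decimal 8216), stored as UTF-8 bytes

$utf8Hex = bin2hex($leftQuote);
$cp1252  = mb_convert_encoding($leftQuote, 'Windows-1252', 'UTF-8');
$latin1  = mb_convert_encoding($leftQuote, 'ISO-8859-1', 'UTF-8');

echo "UTF-8 bytes (hex):  ", $utf8Hex, "\n";
echo "Windows-1252 byte:  ", ord($cp1252), " decimal, ",
     decoct(ord($cp1252)), " octal\n";
echo "ISO-8859-1 result:  ", $latin1, "\n";
```

The Windows-1252 line should show the same 145 / 221 values mentioned above, while ISO-8859-1 has no slot for the character and falls back to the substitute.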

David

Jos van Uden · Aug 27, 2004

David said:
Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

Then it's not an XML parser, as UTF-8 and UTF-16 are both required
encodings in any conformant XML parser.

I see. From the PHP 4 manual:

"This PHP extension implements support for James Clark's expat™ in PHP.
(...) It supports three source character encodings also provided by PHP:
US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

8216 is LEFT SINGLE QUOTATION MARK (8217 is the matching RIGHT one). You
say you expect this to be output as byte octal 221 (dec 145). You haven't
said what you were using to output your parsed file, so I don't know what
output encoding you have requested; there is no character with such a byte
encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code page.
(The character in question in your posting is displayed as \221 in my mail
reader, which defaults to showing octal codes for unknown bytes.)

Also:

"(...)There are two types of character encodings, source encoding and
target encoding. PHP's internal representation of the document is always
encoded with UTF-8.

(...) The default source encoding used by PHP is ISO-8859-1.

(...) When an XML parser is created, the target encoding is set to the
same as the source encoding, (...)

If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will
be "demoted". Currently, this means that such characters are replaced by
a question mark. "

So it seems it's PHP 4 that's the limiting factor here. The strange thing
is: if I replace the entities with the original characters, it shows
up fine (using ISO-8859-1 as charset).

We're having this problem with an RSS feeder called zfeeder. We've
already contacted the author, but haven't received any response. So
I thought I'd try to fix it myself.

Thanks for your help.

Jos

Martin Honnen · Aug 27, 2004

Jos said:
Can somebody explain why the following file
has the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

It's not caused by the browser, but
by the (expat) xml parser.

Thanks.

test script:

<?php

$file = "test.xml";
$testdata;
$tagname;

function startElement($parser, $name, $attrs) {
global $tagname;
$tagname = $name;
}

function endElement($parser, $name) {
}

function characterData(&$parser, $data) {
global $testdata, $tagname;
if(trim($data) != "") {
switch($tagname) {
case 'ELEM' :
$testdata .= $data;
break;
}
}
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, 'characterData');
if (!($fp = fopen($file, "r"))) {
die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
if (!xml_parse($xml_parser, $data, feof($fp))) {
die(sprintf("XML error: %s at line %d",
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
}
}
xml_parser_free($xml_parser);

print "output : " .$testdata;

Have you tried using
print "output : " . utf8_decode($testdata);
here?
Or try
header('Content-Type: text/plain; charset=UTF-8');
before you print out the data assembled with the XML parser.

David Carlisle · Aug 27, 2004

So it seems it's PHP 4 that's the limiting factor here. The strange thing
is: if I replace the entities with the original characters, it shows
up fine (using ISO-8859-1 as charset).

ISO-8859-1 doesn't have any such quotes and has nothing at all in that
position, which is why the characters in your posting don't display for
me. For example,

James Clark's expat™ in PHP

comes out as
James Clark's expat\231 in PHP

as my news reader (emacs) is defaulting to latin-1 (iso-8859-1) but your
posting is in the Microsoft-specific encoding
charset=windows-1252
which is properly declared in the headers but that doesn't help me as my
system apparently doesn't know (or at least can't display in) that encoding.

So if you ask for ISO-8859-1 output, then the Unicode character for a
left quote is likely to map to some kind of missing glyph marker, as the
specified encoding doesn't have such a character. On the other hand, if
the system just believes that the first 256 Unicode slots should map
straight to latin-1 (which is more or less true) and doesn't raise an
error on "non-characters" in the control positions, then you may find
that bytes corresponding to Microsoft-encoded quote characters do happen
to be output, but that is more by luck and lack of error detection than
anything else.

Of these encodings you list as supported

It supports three source character encodings also provided by PHP:
US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

only UTF-8 has the left and right quotes, so you would have to output as
utf-8.

David

Jos van Uden · Aug 27, 2004

Martin Honnen wrote:

(...)

Have you tried using
print "output : " . utf8_decode($testdata);
here?
Or try
header('Content-Type: text/plain; charset=UTF-8');
before you print out the data assembled with the XML parser.

Ok, it works now.

1) I removed the iso-8859-1 encoding from the xml file, which
makes it default to UTF-8
2) I set the encoding of the parser to UTF-8 explicitly, otherwise
it will default to iso-8859-1. The target encoding follows the source
encoding.
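
For anyone finding this later, here is a minimal sketch of the difference, assuming PHP 7 or later with the XML extension. The helper name parseText is made up for this example and is not part of the test script above. It sets the target encoding explicitly for both runs, since newer PHP versions are documented as defaulting to UTF-8 output rather than ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the thread): collect all character data
// from $xml, using the given target (output) encoding.
function parseText(string $xml, string $targetEncoding): string
{
    $text = '';
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$text) {
            // The handler may fire several times per text node, so append.
            $text .= $data;
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            'XML error: ' . xml_error_string(xml_get_error_code($parser))
        );
    }
    return $text;
}

$xml = '<test><elem>&#8216;bla bla bla&#8217;</elem></test>';

$latin1 = parseText($xml, 'ISO-8859-1'); // quotes cannot be represented
$utf8   = parseText($xml, 'UTF-8');      // quotes survive as UTF-8 bytes

echo bin2hex($latin1), "\n";
echo bin2hex($utf8), "\n";
```

With the ISO-8859-1 target the two quotes should come back as question marks; with UTF-8 they come back as three-byte sequences, so the page also has to be served with charset=UTF-8 for them to display.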

Thanks David, Martin and Derek

What's with this iso-8859-1? Why are we still using that? I use
it because it seems to be common practice, and I figure there
must be a reason for it. But what is that reason? Backward
compatibility?

If I start using UTF-8 as charset in my meta tags will there be
undesirable side-effects? Currently I use iso-8859-1 and simply
convert special characters to entities. My xhtml 1.0 transitional
always validates (eventually). So I guess no harm is done.

I'm not even sure if the meta tag does anything in a valid xhtml
file.

Anyway, I guess I'll have some googling ahead of me

Thanks again
