xml entity problem

Jos van Uden

Can somebody explain why the following file
has the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

It's not caused by the browser, but
by the (expat) xml parser.

Thanks.


test script:


<?php

$file = "test.xml";
$testdata = "";   // text collected from <elem>
$tagname = "";    // name of the element currently being parsed

// Remember which element we are inside (case folding upper-cases names by default).
function startElement($parser, $name, $attrs) {
    global $tagname;
    $tagname = $name;
}

function endElement($parser, $name) {
}

// Collect the non-blank character data found inside <elem>.
function characterData($parser, $data) {
    global $testdata, $tagname;
    if (trim($data) != "") {
        switch ($tagname) {
            case 'ELEM':
                $testdata .= $data;
                break;
        }
    }
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, 'characterData');
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

// Feed the file to the parser in 4 KB chunks.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);

print "output : " . $testdata;


?>
 
Derek Harmon

Jos van Uden said:
Can somebody explain why the following file
has the wrong output:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
<elem>&#8216;bla bla bla&#8217;</elem>
</test>

Expected: ‘bla bla bla’
output: ?bla bla bla?

The output is correct. The file has the wrong encoding.
ISO-8859-1 is a single-byte encoding, essentially ANSI; it has
only 256 code points, so it cannot represent character 8,217.

Try UTF-16 encoding instead (and then, as it sounds like
you're aware, user agents can introduce '?' as well if they
are not displayed with an appropriate Unicode code page
and font.)


Derek Harmon
 
Jos van Uden

The output is correct. The file has the wrong encoding.
ISO-8859-1 is a single-byte encoding, essentially ANSI; it has
only 256 code points, so it cannot represent character 8,217.
Try UTF-16 encoding instead

Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.
(and then, as it sounds like
you're aware, user agents can introduce '?' as well if they
are not displayed with an appropriate Unicode code page
and font.)

I've tested this, of course.

Thanks for your response.

Jos
 
David Carlisle

Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

Then it's not an XML parser, as UTF-8 and UTF-16 are both required
encodings in any conformant XML parser.

8216 is LEFT SINGLE QUOTATION MARK (8217 is the matching right quote). You
say you expect this to be output as byte octal 221 (dec 145). You haven't
said what you were using to output your parsed file, so I don't know what
output encoding you have requested, but there is no character with such a
byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code
page. (The character in question in your posting is displayed as \221 in
my mail reader, which defaults to showing octal codes for unknown bytes.)

David
 
Jos van Uden

David said:
Unfortunately, the xml_parser doesn't support UTF-16.
The supported encodings are ISO-8859-1, US-ASCII
and UTF-8, so I can't try this.

Then it's not an XML parser as UTF8 and UTF16 are both required
encodings in any conformant XML parser.

I see. From the PHP 4 manual:

"This PHP extension implements support for James Clark's expat™ in PHP.
(...) It supports three source character encodings also provided by PHP:
US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

8216 is LEFT SINGLE QUOTATION MARK (8217 is the matching right quote). You
say you expect this to be output as byte octal 221 (dec 145). You haven't
said what you were using to output your parsed file, so I don't know what
output encoding you have requested, but there is no character with such a
byte encoding in ISO-8859-1 or UTF-8. Perhaps you want some Microsoft code
page. (The character in question in your posting is displayed as \221 in
my mail reader, which defaults to showing octal codes for unknown bytes.)

Also:

"(...)There are two types of character encodings, source encoding and
target encoding. PHP's internal representation of the document is always
encoded with UTF-8.

(...) The default source encoding used by PHP is ISO-8859-1.

(...) When an XML parser is created, the target encoding is set to the
same as the source encoding, (...)

If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will
be "demoted". Currently, this means that such characters are replaced by
a question mark. "
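
The "demotion" described in the manual text above can be reproduced with a small sketch. It assumes a current PHP (7 or later, for closures and type hints) rather than the PHP 4 discussed here, and characterDataAs() is a hypothetical helper name, not part of the xml extension.

```php
<?php
// Hypothetical helper (not from the thread): parse $xml and return all of
// its character data, delivered in the requested target encoding.
function characterDataAs(string $xml, string $target): string
{
    $text = '';
    $parser = xml_parser_create();
    // The target encoding decides how text is handed to the handlers.
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $target);
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text) {
            $text .= $data;
        }
    );
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            xml_error_string(xml_get_error_code($parser))
        );
    }
    return $text;
}

// The same document as in the question, with the quotes as character references.
$xml = '<elem>&#8216;bla bla bla&#8217;</elem>';

// ISO-8859-1 cannot hold U+2018/U+2019, so they should be demoted to "?".
echo characterDataAs($xml, 'ISO-8859-1'), "\n";
// UTF-8 can represent both quotes, so they should come through intact.
echo characterDataAs($xml, 'UTF-8'), "\n";
```

The only difference between the two calls is the target encoding, which is what the manual says governs the replacement.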

So it seems it's PHP 4 that's the limiting factor here. Strange thing
is: if I replace the character references with the original characters,
it shows up fine (using ISO-8859-1 as charset).

We're having this problem with an RSS feeder called zfeeder. We've
already contacted the author, but haven't received any response. So
I thought I'd try and fix it myself. :(

Thanks for your help.

Jos
 
Martin Honnen

Jos said:
(...)

print "output : " .$testdata;

Have you tried using
print "output : " . utf8_decode($testdata);
here? Or try
header('Content-Type: text/plain; charset=UTF-8');
before you print out the data assembled with the XML parser.
 
David Carlisle

So it seems it's php 4 that's the limiting factor here. Strange thing
is: if I replace the encoding with the original characters, it shows
up fine. (using ISO-8859-1 as charset).

ISO-8859-1 doesn't have any such quotes and has nothing at all in that
position, which is why the characters in your posting don't display for
me. For example,
James Clark's expat™ in PHP
comes out as
James Clark's expat\231 in PHP

as my news reader (emacs) is defaulting to latin-1 (iso-8859-1) but your
posting is in the Microsoft-specific encoding
charset=windows-1252
which is properly declared in the headers, but that doesn't help me as my
system apparently doesn't know (or at least can't display in) that encoding.

So if you ask for iso-8859-1 output, then the Unicode character for a
left quote is likely to map to some kind of missing glyph marker, as the
specified encoding doesn't have such a character. On the other hand, if
the system just believes that the first 256 Unicode slots should map
straight to latin-1 (which is more or less true) and doesn't raise an
error on "non-characters" in the control positions, then you may find
that bytes corresponding to Microsoft-encoded quote characters do happen
to be output, but that is more by luck and lack of error detection than
anything else.

Of these encodings you list as supported

It supports three source character encodings also provided by PHP:
US-ASCII, ISO-8859-1 and UTF-8. UTF-16 is not supported."

only UTF-8 has the left and right quotes, so you would have to output as
utf-8.
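
The arithmetic behind this can be sketched in a few lines. The helper utf8Encode() is hypothetical, written by hand so that no extension is needed, and it only covers code points up to U+FFFF, which is enough for the two quotes.

```php
<?php
// Hypothetical helper (not from the thread): encode one Unicode code point
// in the range U+0000..U+FFFF as UTF-8 bytes.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0xFFFF) {
        throw new InvalidArgumentException('code point out of range for this sketch');
    }
    if ($cp < 0x80) {
        return chr($cp);                       // one byte: plain ASCII
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))          // two bytes
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))             // three bytes
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// 8216 and 8217 are the left and right single quotes from the test file.
foreach ([8216, 8217] as $cp) {
    printf(
        "U+%04X fits in one ISO-8859-1 byte: %s, UTF-8 bytes: %s\n",
        $cp,
        $cp <= 0xFF ? 'yes' : 'no',
        bin2hex(utf8Encode($cp))
    );
}
```

Both code points are above 255, so no single ISO-8859-1 byte can carry them, while UTF-8 spends three bytes on each.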

David
 
Jos van Uden

Martin Honnen wrote:

(...)
Have you tried using
print "output : " . utf8_decode($testdata);
here? Or try
header('Content-Type: text/plain; charset=UTF-8');
before you print out the data assembled with the XML parser.

Ok, it works now.

1) I removed the iso-8859-1 encoding from the xml file, which
makes it default to UTF-8
2) I set the encoding of the parser to UTF-8 explicitly, otherwise
it will default to iso-8859-1. The target encoding follows the source
encoding.
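
The two steps above might look roughly like this. It is a sketch that assumes a current PHP (7 or later) instead of PHP 4, reads the XML from a string instead of test.xml, and uses a hypothetical helper name, parseElemText().

```php
<?php
// Hypothetical helper (not from the thread): return the text of every <elem>.
function parseElemText(string $xml): string
{
    $text = '';
    $current = '';

    // Step 2: ask for UTF-8 explicitly, for the source and for the target.
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current) {
            $current = $name;   // names arrive upper-cased (case folding)
        },
        function ($p, $name) use (&$current) {
            $current = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text, &$current) {
            if ($current === 'ELEM') {
                $text .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(sprintf(
            'XML error: %s at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        ));
    }
    return $text;
}

// Step 1: no encoding attribute in the declaration, so XML defaults to UTF-8.
$xml = "<?xml version=\"1.0\"?>\n<test>\n<elem>&#8216;bla bla bla&#8217;</elem>\n</test>";

echo 'output : ' . parseElemText($xml), "\n";
```

The page that prints the result still has to be served as UTF-8, for instance with the header() call Martin suggested, or the browser may misread the bytes.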

Thanks David, Martin and Derek

What's with this iso-8859-1? Why are we still using that? I use
it because it seems to be common practice, and I figure there
must be a reason for it. But what is that reason? Backward
compatibility?

If I start using UTF-8 as charset in my meta tags, will there be
undesirable side-effects? Currently I use iso-8859-1 and simply
convert special characters to entities. My XHTML 1.0 Transitional
always validates (eventually). So I guess no harm is done.

I'm not even sure if the meta tag does anything in a valid XHTML
file.

Anyway, I guess I'll have some googling ahead of me :)

Thanks again
 
