How do XML parsers handle encoding?

billsahiker

If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

Bill
 
Martin Honnen

billsahiker said:
If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

If there is no XML declaration declaring an encoding, an XML parser
checks for a BOM (byte order mark) to find out whether the document is
UTF-8 or UTF-16. With neither a declaration nor a BOM, it assumes UTF-8.
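The BOM check described above can be sketched roughly as follows. This is only an illustration, not how any particular parser implements it: the function name `sniff_encoding` is hypothetical, and the sketch ignores UTF-32 and external information such as HTTP headers.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document that has no encoding declaration."""
    # A UTF-8 BOM is the byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # A UTF-16 BOM is FE FF (big-endian) or FF FE (little-endian).
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"
    # No BOM and no declaration: the document is expected to be UTF-8.
    return "utf-8"
```

The returned names are Python codec names, so `data.decode(sniff_encoding(data))` would consume the BOM where one is present.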

And XML parsers are required to check that documents are properly
encoded. However, browsers like Firefox or Opera, I think, might not
report any such violation. For instance, I saved an XML document as
UTF-8 but with an XML declaration saying encoding="UTF-16" and then
loaded it with Firefox 2.0 and Opera 9. Neither reported an error;
both treated the document as UTF-8. IE 6 reported an error.
 
Martin Honnen

Martin said:
And XML parsers are required to check that documents are properly
encoded. However, browsers like Firefox or Opera, I think, might not
report any such violation. For instance, I saved an XML document as
UTF-8 but with an XML declaration saying encoding="UTF-16" and then
loaded it with Firefox 2.0 and Opera 9. Neither reported an error;
both treated the document as UTF-8. IE 6 reported an error.

For Mozilla, the FAQ
http://developer.mozilla.org/en/doc...rom_the_treatment_of_text.2Fhtml_documents.3F
says:
"Most well-formedness constraints are enforced. (Currently Mozilla
does not catch character encoding errors, because the document is
re-encoded using a lenient encoding converter before the document
reaches the XML parser. This is a bug.)"
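To illustrate why a lenient converter in front of the parser hides encoding errors, here is a small Python sketch. It uses the standard library's ElementTree parser as a stand-in, not Mozilla's code, and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
raw = b"<doc>caf\xe9</doc>"


def parse_strict(data: bytes):
    """Hand the raw bytes to the parser, so the parser sees the bad byte."""
    return ET.fromstring(data)


def parse_after_lenient_decode(data: bytes):
    """Decode leniently first: bad bytes become U+FFFD before parsing."""
    text = data.decode("utf-8", errors="replace")
    return ET.fromstring(text)


try:
    parse_strict(raw)
    strict_failed = False
except ET.ParseError:
    strict_failed = True

# The parser never sees the encoding error here, so it cannot report it.
lenient_text = parse_after_lenient_decode(raw).text
```

In the second path the parser only ever receives clean text, which is the situation the FAQ entry describes.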
 
Joseph J. Kesselman

The rules for how they're *supposed* to handle it are spelled out in the
XML Recommendation. Not all parsers are in strict compliance with all
parts of the Recommendation, alas. Bugs happen.

If you're asking whether you can get away with cheating: the brief
answer is that it's extremely bad practice to try. If you're asking
whether you can be certain a particular parser will or won't let
something through, you can ask its development/user community... but be
aware that the next release may fix this, and it's a very bad idea to
write code that depends on bugs in specific versions.
 
billsahiker

So how do they do that? Do they check every character, or do they just
convert? If the encoding declaration says UTF-8 and the file has a
character that is not UTF-8, does the browser report an error, convert
it, or what? For example, if a Korean character is in a file that says
it is UTF-8.

Bill
 
Richard Tobin

billsahiker said:
So how do they do that? Do they check every character?

Yes.

billsahiker said:
For example, if a Korean character is in a file that says it is UTF-8.

UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.
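The two checks can be sketched in Python like this. The function names are hypothetical, and the character ranges are those of the Char production in XML 1.0; a real parser does this work internally while tokenizing rather than as a separate pass.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_document_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Return 'ok', 'bad encoding', or 'illegal character'."""
    # Check 1: the bytes must be legal for the encoding (strict decode).
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return "bad encoding"
    # Check 2: every decoded character must be allowed in XML.
    if not all(is_xml_char(ch) for ch in text):
        return "illegal character"
    return "ok"


# Three Korean syllables, written as escapes to keep the source ASCII.
korean = "\ud55c\uad6d\uc5b4".encode("utf-8")
```

With these definitions, `check_document_bytes(korean)` passes both checks, `b"\xc0\xaf"` fails the first (it is not a legal UTF-8 sequence), and `b"a\x00b"` decodes fine but fails the second, since U+0000 is not an XML character.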

-- Richard
 
billsahiker

Richard Tobin said:
UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.

-- Richard

OK. I don't know if you are a .NET programmer or not (Martin is, so
maybe he can respond to this too), but if I use StreamReader to read an
XML file with the encoding specified as UTF-8, and I set the
StreamReader's encoding to UTF-8, will StreamReader throw an exception
if a character is not UTF-8? Or do I have to parse every character and
check its value to see if it is in the UTF-8 range?

Bill
 
Martin Honnen

billsahiker said:
OK. I don't know if you are a .NET programmer or not (Martin is, so
maybe he can respond to this too), but if I use StreamReader to read an
XML file with the encoding specified as UTF-8, and I set the
StreamReader's encoding to UTF-8, will StreamReader throw an exception
if a character is not UTF-8? Or do I have to parse every character and
check its value to see if it is in the UTF-8 range?

As far as I know, StreamReader does not throw an exception.
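The difference between a reader that silently substitutes and one that throws can be shown with a Python analogue (this is not .NET code, and `read_text` is a made-up helper). My understanding is that .NET's default UTF-8 decoding behaves like the lenient case below and that constructing the encoding as `new UTF8Encoding(false, true)` asks for exceptions instead, but treat that as an assumption to verify against the .NET documentation.

```python
import os
import tempfile


def read_text(path: str, errors: str) -> str:
    """Read a file as UTF-8 using the given error policy."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return f.read()


# Write a file containing 0xFF, a byte that can never appear in UTF-8.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(b"<doc>\xff</doc>")

# Lenient policy: the bad byte becomes U+FFFD and no exception is raised.
lenient = read_text(path, "replace")

# Strict policy: the same file raises UnicodeDecodeError.
try:
    read_text(path, "strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

os.remove(path)
```

Either way, there is no need to inspect each character by hand; the choice is which error policy the decoder is configured with.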
 
Joseph J. Kesselman

billsahiker said:
So how do they do that? Do they check every character, or do they just
convert?

Most hand it off to an appropriate encoding-aware stream reader library
and let that code do the work. Why build a wheel when you can buy one?
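As a rough picture of what such a library takes care of, here is a sketch using Python's standard incremental decoder. `decode_chunks` is a hypothetical helper; the point is that multi-byte characters split across read buffers, and truncated input at the end, are handled by the decoder rather than by the parser's own code.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, yielding text as it becomes available."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes a truncated trailing sequence raise an error.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# One Korean syllable (U+D55C) is three bytes in UTF-8: ED 95 9C.
syllable = "\ud55c".encode("utf-8")

# Split the character across two chunks, as a buffered read might.
pieces = [b"<a>" + syllable[:1], syllable[1:] + b"</a>"]
result = "".join(decode_chunks(pieces))

# A stream that ends in the middle of a character is rejected.
try:
    list(decode_chunks([syllable[:2]]))
    truncated_raised = False
except UnicodeDecodeError:
    truncated_raised = True
```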
 
