How do XML parsers handle encoding?

billsahiker · Apr 30, 2008

If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

Bill

Martin Honnen · Apr 30, 2008

billsahiker said:
If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
XML editors read and verify each character in the file to make sure it
is UTF-16, and throw an error if it is not? Or do they automatically
filter/convert to UTF-16, or do they do something else?

Do they default to UTF-8 if the XML file does not specify an encoding?

If there is no XML declaration declaring an encoding, an XML parser
checks for a BOM (byte order mark) to find out whether the document is
UTF-8 or UTF-16. With neither a BOM nor an encoding declaration, the
document has to be UTF-8.
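
A minimal sketch of the BOM check described above, in Python. The function names `sniff_bom` and `guess_encoding` are made up for illustration, and the sketch deliberately ignores UTF-32 and the declaration-based detection that a real parser also performs:

```python
import codecs

# Byte order marks a parser looks for at the very start of the file.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def guess_encoding(data: bytes) -> str:
    """Assumed fallback: no BOM and no encoding declaration means UTF-8."""
    name, _ = sniff_bom(data)
    return name or "utf-8"
```

For example, `guess_encoding(b"<a/>")` falls back to `"utf-8"`, while bytes produced by `"<a/>".encode("utf-16")` start with a UTF-16 BOM and are reported as one of the two UTF-16 byte orders.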

And XML parsers are required to check that documents are properly
encoded. However, I think browsers like Firefox or Opera might not
report such a violation. For instance, I saved an XML document as UTF-8
but with an XML declaration saying encoding="UTF-16", then loaded it
with Firefox 2.0 and Opera 9. Neither reported an error; both treated
the document as UTF-8. IE 6 reported an error.
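
The same experiment can be repeated against a non-browser parser. The sketch below feeds UTF-8 bytes that claim to be UTF-16 to Python's `xml.etree.ElementTree` (which sits on top of expat). The helper name `try_parse` is hypothetical, and the exact error text may vary between versions, so only accept/reject is recorded:

```python
import xml.etree.ElementTree as ET

# The bytes are UTF-8, but the declaration claims UTF-16.
doc = '<?xml version="1.0" encoding="UTF-16"?><root>text</root>'.encode("utf-8")

def try_parse(data: bytes) -> str:
    """Report whether the parser accepts or rejects the given bytes."""
    try:
        ET.fromstring(data)
        return "accepted"
    except ET.ParseError as exc:
        return f"rejected: {exc}"

result = try_parse(doc)
print(result)
```

Note that the input has to be `bytes`. If a Python `str` is passed instead, the text has already been decoded and the declaration is no longer checked against the raw bytes.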

Martin Honnen · Apr 30, 2008

Martin said:
And XML parsers are required to check that documents are properly
encoded. However, I think browsers like Firefox or Opera might not
report such a violation. For instance, I saved an XML document as UTF-8
but with an XML declaration saying encoding="UTF-16", then loaded it
with Firefox 2.0 and Opera 9. Neither reported an error; both treated
the document as UTF-8. IE 6 reported an error.

For Mozilla, the FAQ
http://developer.mozilla.org/en/doc...rom_the_treatment_of_text.2Fhtml_documents.3F
says:
"Most well-formedness constraints are enforced. (Currently Mozilla
does not catch character encoding errors, because the document is
re-encoded using a lenient encoding converter before the document
reaches the XML parser. This is a bug.)"

Joseph J. Kesselman · Apr 30, 2008

The rules for how they're *supposed* to handle it are spelled out in the
XML Recommendation. Not all parsers are in strict compliance with all
parts of the recommendation, alas. Bug Happens.

If you're asking whether you can get away with cheating: the brief
answer is that it's extremely bad practice to try. If you're asking
whether you can be certain a particular parser will or won't let
something through, you can ask its development/user community... but be
aware that the next release may fix this, and it's a very bad idea to
write code that depends on bugs in specific versions.

billsahiker · Apr 30, 2008

So how do they do that? Do they check every character, or do they just
convert? If the declared encoding is UTF-8 and the file contains a
character that is not UTF-8, does the browser raise an error, convert
it, or what? For example, if a Korean character is in a file that says
it is UTF-8.

Bill

Richard Tobin · Apr 30, 2008

billsahiker said:
So how do they do that? Do they check every character?

Yes.

billsahiker said:
Like if a Korean character is in a file that says it is UTF-8.

UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.

-- Richard
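
The two checks Richard describes can be sketched in Python as follows. The function names are hypothetical, and the character ranges are the `Char` production from XML 1.0; a real parser performs these checks incrementally while tokenizing rather than in a separate pass:

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def check_document_bytes(data: bytes, encoding: str = "utf-8"):
    """Return None if the data passes both checks, else a description of the problem."""
    # Check 1: the bytes must be legal for the encoding.
    try:
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return f"illegal byte sequence at offset {exc.start}"
    # Check 2: every decoded character must be allowed in XML.
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return f"character U+{ord(ch):04X} at index {index} is not allowed in XML"
    return None
```

Korean text encoded as UTF-8 passes both checks, the byte pair `b"\xc3\x28"` fails the first one, and a NUL character fails the second even though it is a perfectly legal UTF-8 byte.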

billsahiker · Apr 30, 2008

Richard Tobin said:
UTF-8 covers all of Unicode, so it includes Korean characters.

A parser has to check two things: that the data is legal for the
encoding (for example, some sequences of bytes are not legal in
UTF-8), and that the character it encodes is allowed in XML.

-- Richard

OK. I don't know if you are a .NET programmer or not (Martin is, so
maybe he can respond to this too), but if I use StreamReader to read an
XML file with the encoding specified as UTF-8, and I set the
StreamReader's encoding to UTF-8, will StreamReader throw an exception
if a character is not UTF-8, or do I have to parse every character and
check its value to see if it is in the UTF-8 range?

Bill

Martin Honnen · Apr 30, 2008

billsahiker said:
OK. I don't know if you are a .NET programmer or not (Martin is, so
maybe he can respond to this too), but if I use StreamReader to read an
XML file with the encoding specified as UTF-8, and I set the
StreamReader's encoding to UTF-8, will StreamReader throw an exception
if a character is not UTF-8, or do I have to parse every character and
check its value to see if it is in the UTF-8 range?

As far as I know StreamReader does not throw an exception.
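
The usual reason a text reader does not throw is that its decoder is lenient: invalid bytes are silently replaced with U+FFFD instead of being reported. As far as I know, .NET's default UTF-8 decoding behaves this way, and a throwing decoder has to be requested explicitly (for instance through the `throwOnInvalidBytes` argument of the `UTF8Encoding` constructor). The sketch below shows the same lenient/strict distinction in Python as an analogy only; it is not .NET code, and the two function names are made up:

```python
def decode_lenient(data: bytes) -> str:
    """Replace undecodable bytes with U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")

def decode_strict(data: bytes) -> str:
    """Raise UnicodeDecodeError on the first undecodable byte."""
    return data.decode("utf-8", errors="strict")

bad = b"caf\xe9"  # Latin-1 bytes for "café"; 0xE9 alone is not valid UTF-8

print(repr(decode_lenient(bad)))  # the bad byte becomes a replacement character
try:
    decode_strict(bad)
except UnicodeDecodeError as exc:
    print("strict decoder refused the input:", exc.reason)
```

With a lenient decoder, the only trace of the problem is the replacement character in the output, so checking for U+FFFD afterwards cannot distinguish a decoding error from a document that legitimately contains that character.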

Joseph J. Kesselman · Apr 30, 2008

billsahiker said:
So how do they do that? Do they check every character, or do they just
convert?

Most hand it off to an appropriate encoding-aware stream reader library
and let that code do the work. Why build a wheel when you can buy one?
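
As an illustration of what such a library does for the parser, here is a sketch using Python's incremental decoders. The generator name `decode_stream` is hypothetical. The point is that the decoder keeps state between chunks, so a multi-byte character split across two reads is still decoded correctly, and illegal bytes raise an error instead of passing through:

```python
import codecs

def decode_stream(chunks, encoding="utf-8"):
    """Yield decoded text for an iterable of byte chunks, failing on illegal bytes."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush: raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# The Korean syllable U+D55C is three bytes in UTF-8; split it across two chunks.
pieces = [b"<a>\xed", b"\x95\x9c</a>"]
print("".join(decode_stream(pieces)))
```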
