Detect XML document encodings with SAX


Stanimir Stamenkov

Wed, 21 Nov 2012 15:32:19 +0100, /Sebastian/:
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

Sorry if this has already been answered elsewhere in the thread.

The XML specification has a guideline for detecting the source encoding:

http://www.w3.org/TR/xml/#sec-guessing

and this is essentially what parsers do. Single-byte encodings are
indistinguishable from one another by their byte patterns alone, so
they can only be detected reliably when explicit encoding information
is present, such as an encoding declaration.
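
To make the guideline concrete, here is a minimal sketch in Java of the kind of sniffing it describes: check for a byte order mark, then for the byte pattern of "<?" in UTF-16, and otherwise read the encoding name from the XML declaration. The class name XmlEncodingSniffer and the method guess are made up for illustration and are not part of SAX or Xerces. The sketch deliberately skips UTF-32 and EBCDIC, which the specification also covers, and it is not how any particular parser is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any XML library.
final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._\\-]*)[\"']");

    /** Guesses the encoding from the first bytes of an XML document. */
    static String guess(byte[] head) {
        // 1. Byte order marks (UTF-32 is ignored in this sketch).
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. No BOM: "<?" encoded as two-byte code units.
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";

        // 3. ASCII-compatible family: the declaration itself is plain ASCII,
        //    so decoding byte-for-byte as ISO-8859-1 is enough to read it.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);

        // 4. Without a declaration the XML default is UTF-8.
        return m.find() ? m.group(1) : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the first hundred or so bytes of a file to XmlEncodingSniffer.guess returns the declared name when a declaration is present. Note that step 3 only reports what the document claims: if a single-byte file has no declaration, or declares the wrong one, nothing in the bytes lets the code tell ISO-8859-1 from ISO-8859-15 or windows-1252.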
 
