Detect XML document encodings with SAX

Gene Wirchenko · Dec 13, 2012

[snip]

If you don't want input files, then ask for a MSSSCCE and link
                                              ^^^^^^^
to the rules for that.


Please expand your new acronym.


MarkSpace SSCCE

Thank you.

Sincerely,

Gene Wirchenko

Lew · Dec 13, 2012

Arne said:
????

Steven Simpson solved the problem with the provided information.

And OP acknowledged it.

I stand corrected.

Stanimir Stamenkov · Dec 16, 2012

Wed, 21 Nov 2012 15:32:19 +0100, /Sebastian/:

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

Sorry if this has been answered already elsewhere in the thread.
The XML specification has a guideline for detecting the source encoding:

http://www.w3.org/TR/xml/#sec-guessing

and this is basically what parsers do. Single-byte encodings are
essentially indistinguishable from one another, so they can be
reliably detected only when explicit encoding information (such as
an encoding declaration) is present.
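
To make the guideline concrete, here is a rough sketch of that kind of detection in Java. It is not what Xerces or any other parser actually does internally, and the class name EncodingSniffer and method guess are made up for illustration. It checks for a byte order mark, then for a BOM-less UTF-16 "<?", and otherwise reads the encoding declaration as if it were Latin-1, falling back to UTF-8 when there is none. UTF-32 and EBCDIC, which the specification also covers, are left out to keep it short.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: a simplified take on the XML spec's encoding guessing. */
final class EncodingSniffer {

    // Matches encoding="..." or encoding='...' inside a leading XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses the encoding name from the first bytes of a document. */
    static String guess(byte[] head) {
        // 1. A byte order mark settles the encoding family
        //    (UTF-32 BOMs are not handled in this sketch).
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. No BOM, but "<?" encoded as two-byte code units.
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";

        // 3. ASCII-compatible start: the declaration itself is plain ASCII,
        //    so decoding the prefix as ISO-8859-1 cannot fail or lose bytes.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(text);
        if (m.find()) return m.group(1);

        // 4. Nothing explicit: the XML default is UTF-8. The bytes alone
        //    cannot tell one single-byte encoding from another.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the first hundred or so bytes of a file to EncodingSniffer.guess should be enough, since the declaration has to come first in the document. Step 4 is the point made above: without a declaration, a Latin-1 file and a Windows-1252 file look the same to this code, and both are reported as UTF-8.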
