Detect XML document encodings with SAX

Sebastian · Nov 21, 2012

Hello there,

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

-- Sebastian
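
The byte-count argument in the post above can be checked directly. This is a minimal sketch, assuming the JVM ships the windows-1250 charset (it is not one of the charsets every Java platform is required to provide). The class name ByteCount and the sample string are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteCount {
    // Number of bytes the text occupies when encoded in the given charset.
    static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String sample = "\u00e4\u20ac"; // a-umlaut followed by the Euro sign
        Charset cp1250 = Charset.forName("windows-1250"); // assumed to be installed
        System.out.println("UTF-8:        " + encodedLength(sample, StandardCharsets.UTF_8));
        System.out.println("windows-1250: " + encodedLength(sample, cp1250));
    }
}
```

In UTF-8 the umlaut needs two bytes and the Euro sign three, so a file in which both take a single byte cannot be UTF-8.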

Lew · Nov 21, 2012

Sebastian said:
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods

Don't do attachments on Usenet.

output an encoding of UTF-8, while looking at the file

as they should. XML should be encoded in UTF-8 nearly always.

But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

http://sscce.org/

Does anyone have an idea why that is so? And how I could

You used the default encoding in your Writer.

go about making some XML parser determine the correct encoding?

Your problem is writing the file, no? That has nothing to do with parsing.

If your problem is with reading the file, then the encoding in the XML declaration
should suffice to guide the parser. But then why do you talk about methods that
"output an encoding"?

However, according to
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
as you would have learned had you researched your question.

So it looks like you must not accept XML documents with such a non-standard
encoding.

Show us the code, or at least an SSCCE of it.

Sebastian · Nov 21, 2012

On 21.11.2012 20:31, Lew wrote:

Sebastian said:

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

[snip]

Your problem is writing the file, no? That has nothing to do with parsing.

No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.

If your problem is with reading the file, then the encoding in the XML declaration
should suffice to guide the parser.

My question is exactly why in this case this does not suffice.

But then why do you talk about methods that
"output an encoding"?

I meant the System.out.println() statements in the code.

[snip]

Show us the code, or at least an SSCCE of it.

I was referring to the code in the IBM developerworks article that I
linked to. Perhaps I should simply have copied out that code into my
original post. So here goes:

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    /**
     * Print the encodings of all URLs given on the command line.
     */
    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                // startDocument() always throws, so this is the normal exit path
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        throw new SAXException("Early termination");
    }

}

Lew · Nov 22, 2012

Sebastian said:
Lew wrote:

Sebastian said:

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

[snip]

Your problem is writing the file, no? That has nothing to do with parsing.

No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.

Not clear from your phrasing.

My question is exactly why in this case this does not suffice.

Did my answer to that question not suffice?

I notice you didn't address my answer in your response; in fact you snipped it.

Sebastian · Nov 22, 2012

On 22.11.2012 01:37, Lew wrote:

Sebastian said:

Lew wrote:

Sebastian wrote:
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).
[snip]

Your problem is writing the file, no? That has nothing to do with parsing.

No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.

Not clear from your phrasing.

My question is exactly why in this case this does not suffice.

Did my answer to that question not suffice?

I notice you didn't address my answer in your response; in fact you snipped it.

The answer cannot be that windows-1250 is non-standard. In fact, the
declared encoding of the XML file does not seem to matter. The code will
always output "UTF-8".

I am using Java 7 on Windows XP.

-- Sebastian

markspace · Nov 22, 2012

The answer cannot be that windows-1250 is non-standard. In fact, the
declared encoding of the XML file does not seem to matter. The code will
always output "UTF-8".

Maybe this quote from the article will help you out:

"This approach works 90 percent of the time, maybe a little more. But
SAX parsers aren't required to support the Locator interface, much less
Locator2, and a few don't. A second option, if you know you're using
Xerces, is to work with XNI"

Since the output of the program is "unknown", I'd guess that this
particular SAX parser doesn't support Locator2, like it says.

Steven Simpson · Nov 22, 2012

Maybe this quote from the article will help you out:

"This approach works 90 percent of the time, maybe a little more. But
SAX parsers aren't required to support the Locator interface, much
less Locator2, and a few don't. A second option, if you know you're
using Xerces, is to work with XNI"

Since the output of the program is "unknown", I'd guess that this
particular SAX parser doesn't support Locator2, like it says.

Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
is getting a Locator2.

Roedy Green · Nov 22, 2012

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

See http://mindprod.com/products2.html#ENCODINGRECOGNISER

This is a manual assist tool to help you guess the encoding.

Encodings are not embedded in any way in files. You just have to know.

ARGHHH!

See http://mindprod.com/jgloss/encoding.html
for how to use native2ascii to interconvert encodings.

The XML world likes UTF-8. Using anything else is just asking for
trouble.

markspace · Nov 22, 2012

Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
is getting a Locator2.

Oh, well mine doesn't. I guess we have two different implementations.
Sorry can't guess what is up with yours.

Peter J. Holzer · Nov 23, 2012

See http://mindprod.com/products2.html#ENCODINGRECOGNISER

This is a manual assist tool to help you guess the encoding.

No need to guess.

Encodings are not embedded in any way in files. You just have to know.

Not true for XML. The file Sebastian posted starts with

<?xml version="1.0" encoding="windows-1250"?>

hp

Arne Vajhøj · Nov 24, 2012

Sebastian said:
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file

I tried, and I cannot get it to work either.

SAX reports UTF-8 no matter what the encoding really is.

StAX never seems to detect anything, and W3C DOM always seems to
detect it correctly.

I cannot offer an explanation. Obviously the parsers
must detect the encoding correctly internally; otherwise they
could not parse correctly.

Code below.

Arne

====

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlEncodingDectect {
    private static final String FNM1 = "/work/foobar1.xml";
    private static final String FNM2 = "/work/foobar2.xml";
    private static final String FNM3 = "/work/foobar3.xml";

    private static void gen1() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM1));
        pw.println("<?xml version='1.0' encoding='UTF-8'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen2() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM2));
        pw.println("<?xml version='1.0' encoding='ISO-8859-1'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen3() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM3));
        pw.println("<?xml version='1.0'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static String encoding;

    private static String detectSAX(String fnm) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            private Locator2 locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                if (locator instanceof Locator2) {
                    this.locator = (Locator2) locator;
                } else {
                    encoding = "Unknown";
                }
            }

            @Override
            public void startDocument() throws SAXException {
                if (locator != null) {
                    encoding = locator.getEncoding();
                }
            }
        });
        parser.parse(new InputSource(new FileInputStream(fnm)));
        return encoding;
    }

    private static String detectW3CDOM(String fnm) throws ParserConfigurationException,
            FileNotFoundException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new FileInputStream(fnm)));
        String encoding = doc.getXmlEncoding();
        return encoding != null ? encoding : "Unknown";
    }

    private static String detectStAX(String fnm) throws FileNotFoundException,
            XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream(fnm));
        String encoding = null;
        while (xsr.hasNext()) {
            xsr.next();
            switch (xsr.getEventType()) {
                case XMLStreamReader.START_DOCUMENT:
                    encoding = xsr.getEncoding();
                    break;
                default:
                    break;
            }
        }
        return encoding != null ? encoding : "Unknown";
    }

    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, XMLStreamException {
        gen1();
        System.out.println(detectSAX(FNM1));
        System.out.println(detectW3CDOM(FNM1));
        System.out.println(detectStAX(FNM1));
        gen2();
        System.out.println(detectSAX(FNM2));
        System.out.println(detectW3CDOM(FNM2));
        System.out.println(detectStAX(FNM2));
        gen3();
        System.out.println(detectSAX(FNM3));
        System.out.println(detectW3CDOM(FNM3));
        System.out.println(detectStAX(FNM3));
    }
}
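
A possible reason the StAX variant above never reports anything: an XMLStreamReader is already positioned on START_DOCUMENT when it is created, so once next() has been called that event is not seen again. The sketch below queries the reader before advancing it. getCharacterEncodingScheme() is documented to return the encoding named in the XML declaration, or null if none is declared. The class and method names are illustrative and not from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxDeclaredEncoding {
    // Returns the encoding named in the XML declaration, or "Unknown" if none is declared.
    static String declaredEncoding(InputStream in) {
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // The reader starts on START_DOCUMENT, so ask before calling next().
                String declared = xsr.getCharacterEncodingScheme();
                return declared != null ? declared : "Unknown";
            } finally {
                xsr.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input could not be read as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version='1.0' encoding='ISO-8859-1'?>\n<root/>\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(xml)));
    }
}
```

Only the declaration is queried here; the rest of the document is never iterated.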

Arne Vajhøj · Nov 24, 2012

Don't do attachments on Usenet.

as they should.

No.

If the XML prolog specifies another encoding than UTF-8,
then it should not return UTF-8.

XML should be encoded in UTF-8 nearly always.

XML allows for other encodings.

And Java XML parsers support them.

So it should always work.

But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

Output usually means System.out.println - that works fine with a parser.

If your problem is with reading the file, then the encoding in the XML declaration
should suffice to guide the parser. But then why do you talk about methods that
"output an encoding"?

Because he wants to know what it is.

However, according to
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
as you would have learned had you researched your question.

So it looks like you must not accept XML documents with such a non-standard
encoding.

Those who have done the research would know that the XML spec does not
limit the encodings at all. An XML processor must support UTF-8
and UTF-16, but is free to support others.

Arne
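
Whether a particular JVM can decode a declared encoding at all can be checked with java.nio.charset.Charset. This is a small sketch only: it reports what the JVM's charset registry knows, which may differ from the set of encoding names a given XML parser accepts, since that mapping is parser-specific. The class name EncodingSupport is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class EncodingSupport {
    // True if this JVM has a charset registered under the given name or alias.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name is not even a syntactically legal charset name
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "UTF-16", "ISO-8859-1", "windows-1250"}) {
            System.out.println(name + " -> " + isSupported(name));
        }
    }
}
```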

Arne Vajhøj · Nov 24, 2012

Maybe this quote from the article will help you out:

"This approach works 90 percent of the time, maybe a little more. But
SAX parsers aren't required to support the Locator interface, much less
Locator2, and a few don't. A second option, if you know you're using
Xerces, is to work with XNI"

Since the output of the program is "unknown", I'd guess that this
particular SAX parser doesn't support Locator2, like it says.

Except that it does not return Unknown - it returns UTF-8.

Arne

Arne Vajhøj · Nov 24, 2012

No need to guess.

Not true for XML. The file Sebastian posted starts with

<?xml version="1.0" encoding="windows-1250"?>

New around here?

Don't expect Roedy's posts to relate that much to what he is
replying to.

Arne

Lew · Nov 24, 2012

Arne said:
Lew said:

Sebastian wrote: [snip]

output an encoding of UTF-8, while looking at the file

as they should.

No.

If the XML prolog specifies another encoding than UTF-8,
then it should not return UTF-8.

True, but I'm saying they should specify UTF-8 in the prolog.

See?

XML allows for other encodings.

So? You should use UTF-8 nearly always, i.e., unless there's a compelling
reason not to.

And Java XML parsers support them.

For those rare times when you deviate from the usual UTF-8.

So it should always work.

Output usually means System.out.println - that works fine with a parser.

His phrasing wasn't clear to me. That's why I asked for clarification.

I could have guessed, too.

See? You're preaching to the choir.

Because he wants to know what it is.

Those who have done the research would know that the XML spec does not
limit the encodings at all. An XML processor must support UTF-8
and UTF-16, but is free to support others.

Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms.

'sall I'm sayin'.

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.

Sebastian · Nov 24, 2012

Sebastian said:
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file

On 24.11.2012 11:14, Lew wrote:
[snip]

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.

Here's the background to my question:
I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.

The XML files come from several sources in different encodings, and I
cannot dictate anything there either.

So I thought, well, why don't I add a little preprocessor to discover
the encoding to give to that terrible file processor I'm stuck with.
Shouldn't be that hard, because, as Arne said:

On 24.11.2012 03:11, Arne Vajhøj wrote:
Obviously the parsers
must detect the encoding correctly internally; otherwise they
could not parse correctly.

The only approach that seems to work (at least for Arne), namely
W3C DOM, is out of the question for me, because the files are
potentially huge and I cannot keep a complete document model in memory.
I need something along the lines of SAX. I'll have to look around some more.

-- Sebastian

PS: The author of that article from which I took the code isn't just
anyone. Elliotte Rusty Harold hosts the XML web site
http://www.cafeconleche.org/ and is affiliated with the University of
North Carolina. Perhaps I could try to get in touch with him.
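
Since the requirement is a streaming approach, one thing that may be worth trying is to ask the Locator2 for the encoding later than startDocument(). The Xerces XNI locator documentation appears to say that the encoding is only final once the encoding declaration has been read, and startDocument() seems to be reported before that point, so the value seen there would be the parser's initial guess. This is a hypothesis about the JDK's bundled parser, not something confirmed in this thread. The sketch below reads the encoding at the first startElement() and then aborts the parse. The class name and the "stopped" flag are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

final class LateEncodingDetector extends DefaultHandler {
    private Locator locator;
    private String encoding = "unknown";
    private boolean stopped; // set when the handler aborts the parse on purpose

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // The root element comes after the XML declaration, so the declaration
        // (if any) has been processed by the time this callback runs.
        if (locator instanceof Locator2) {
            String e = ((Locator2) locator).getEncoding();
            if (e != null) {
                encoding = e;
            }
        }
        stopped = true;
        throw new SAXException("encoding captured, stopping early");
    }

    static String detect(InputStream in) {
        LateEncodingDetector handler = new LateEncodingDetector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(in));
        } catch (SAXException e) {
            if (!handler.stopped) { // a real parse error, not our early exit
                throw new IllegalStateException("could not parse input", e);
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.encoding;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version='1.0' encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(new ByteArrayInputStream(xml)));
    }
}
```

Because the parse is abandoned at the root element, the size of the rest of the file should not matter.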

Arne Vajhøj · Nov 24, 2012

On 24.11.2012 11:14, Lew wrote:
[snip]

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.

Here's the background to my question:
I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.

The XML files come from several sources in different encodings, and I
cannot dictate anything there either.

I would consider it tempting to rewrite that app to use a standard
XML parser.

It would solve this problem and possibly also some future problems.

So I thought, well, why don't I add a little preprocessor to discover
the encoding to give to that terrible file processor I'm stuck with.
Shouldn't be that hard, because, as Arne said:

The only approach that seems to work (at least for Arne), namely
W3C DOM, is out of the question for me, because the files are
potentially huge and I cannot keep a complete document model in memory.
I need something along the lines of SAX. I'll have to look around some more.

What about just reading the first few lines until you have the
XML declaration?

Parsing the encoding out of that should be simple.

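import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern encpat =
    Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");

private static String detectSimple(String fnm) throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(fnm))) {
        // Collect lines until the first '>' (end of the declaration) or end of file.
        StringBuilder firstpart = new StringBuilder();
        String line;
        while (firstpart.indexOf(">") < 0 && (line = br.readLine()) != null) {
            firstpart.append(line).append(' ');
        }
        Matcher m = encpat.matcher(firstpart);
        if (m.find()) {
            return m.group(1);
        } else {
            return "Unknown";
        }
    }
}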

I do not like the solution, but given the restrictions in the
context, then maybe it is what you need.

PS: The author of that article from which I took the code isn't just
anyone. Elliotte Rusty Harold hosts the XML web site
http://www.cafeconleche.org/ and is affiliated with the University of
North Carolina. Perhaps I could try to get in touch with him.

Teaching at a university is no guarantee of good practical
programming skills.

Arne

Arne Vajhøj · Nov 24, 2012

His phrasing wasn't clear to me. That's why I asked for clarification.

Then maybe we need "How to ask for clarifications the smart way".

Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms.

There is nothing in the OP's symptoms that indicates a lack of support
for encodings.

The OP's symptom is that it parses fine with encoding XYZ, but when asked
by the caller it wrongly claims to be using UTF-8.

Arne

Arne Vajhøj · Nov 24, 2012

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal.

CP-1252 is just another encoding. It is no more or less funky than
any other encoding.

In fact it is identical to ISO-8859-1 for all characters except
128-159, which are control characters/unmapped in ISO-8859-1 but are
assigned various extra characters in CP-1252.

So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

If one can choose the data files and the software, then life is easy.

Arne
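
The claim about the 128-159 range can be checked by decoding every byte value in both charsets and listing where the results differ. This is a minimal sketch, assuming the JVM provides windows-1252. The class name Cp1252Diff is made up. The same comparison can be run for windows-1250, the encoding the file under discussion actually declares, by changing the charset name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Cp1252Diff {
    // Byte values (0..255) that decode to different characters in the two charsets.
    static List<Integer> differingBytes(Charset a, Charset b) {
        List<Integer> diff = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            byte[] one = { (byte) i };
            if (!new String(one, a).equals(new String(one, b))) {
                diff.add(i);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        System.out.println(differingBytes(StandardCharsets.ISO_8859_1, cp1252));
    }
}
```

Byte 0x80 is one of the differing values: CP-1252 decodes it to the Euro sign, while ISO-8859-1 decodes it to a control character.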

markspace · Nov 25, 2012

I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.

That's not a big deal. Several of the Java components work this way.
Open the file with an assumed encoding, and test the encoding. If you
are wrong, throw an exception, which causes the stream to be re-opened
with the correct encoding (now that the correct encoding has been detected).

Be careful you're not subverting an established, working process here.

I personally am still looking for an SSCCE, as your last one didn't
reproduce the error for me.
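
The open, test, re-open pattern described above can be sketched roughly as follows. The first pass decodes the leading bytes as ISO-8859-1, which accepts any byte value, and looks for the encoding named in the XML declaration. The second pass re-opens the file with that charset. It assumes an ASCII-compatible encoding and no byte order mark, so UTF-16 input would need extra handling. The class and method names are hypothetical, and no exception is used for control flow here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReopenWithDeclaredEncoding {
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // First pass: find the declared encoding in the leading bytes.
    // Falls back to UTF-8 when there is no declaration. Charset.forName
    // throws an unchecked exception if the JVM does not know the name.
    static Charset sniffCharset(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    // Second pass: re-open the file with the charset found in the first pass.
    static Reader open(Path file) throws IOException {
        byte[] head = new byte[256];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        Charset cs = sniffCharset(Arrays.copyOf(head, n));
        return new InputStreamReader(Files.newInputStream(file), cs);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ReopenWithDeclaredEncoding <file.xml>");
            return;
        }
        try (Reader r = open(Paths.get(args[0]))) {
            System.out.println("first char code: " + r.read());
        }
    }
}
```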
