XML parser and character encoding

Ghislain Benrais

Hello,
I am new to Java and I run a short program processing XML files.
Everything ran very well until I received XML files with the character
itself instead of its numerical reference (for instance 'é' instead of
'&#233;'). I thought Java would handle it but unexpectedly, it handles it
under DOS but doesn't handle it under Linux!
Do you have any explanations?
Input file :
=======
<?xml version="1.0" encoding="ISO-8859-1" ?>
<values>
<value>détail</value>
<value>d&#233;tail</value>
</values>
Java program :
==========
package javaapplication2;
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
public class Main extends DefaultHandler {
    private String CData;
    // Encoding
    static String encoding;
    private Writer out;
    public Main(String[] args) {
        super();
        encoding = "ISO-8859-15";
        try {
            XMLReader xr = XMLReaderFactory.createXMLReader();
            xr.setContentHandler(this);
            out = new OutputStreamWriter(new FileOutputStream("out.txt"), encoding);
            InputSource input = null;
            input = new InputSource(new FileReader("file.xml"));
            xr.parse(input);
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        // TODO code application logic here
        Main main = new Main(args);
    }
    //--------------------------------------------------------------------------------------
    // Parser methods
    //--------------------------------------------------------------------------------------
    public void startElement(String namespaceURI, String localName, String qName,
            Attributes attr) throws SAXException {
        CData = new String("");
    }
    public void characters(char[] chars, int iStart, int iLen) {
        CData = CData + new String(chars, iStart, iLen);
    }
    public void endElement(String namespaceURI, String localName, String qName)
            throws SAXException {
        if (localName.equals("value")) {
            try {
                out.write(CData + "\n");
            } catch (Exception e) {
                e.printStackTrace();
            }
            return;
        }
    }
}
Result if run from DOS
================
détail
détail
Result if run from Linux
=================
d?tail
détail


Thanks in advance,
Ghislain
 
Oliver Wong

Ghislain Benrais said:
Hello,
I am new to Java and I run a short program processing XML files.
Everything ran very well until I received XML files with the character
itself instead of its numerical reference (for instance 'é' instead of
'&#233;'). I thought Java would handle it but unexpectedly, it handles it
under DOS but doesn't handle it under Linux!
Do you have any explanations?
Input file :
=======
<?xml version="1.0" encoding="ISO-8859-1" ?> [most of the code snipped]
input = new InputSource(new FileReader("file.xml"));

From http://java.sun.com/j2se/1.5.0/docs/api/java/io/FileReader.html:

<quote>
The constructors of this class assume that the default character encoding
and the default byte-buffer size are appropriate. To specify these values
yourself, construct an InputStreamReader on a FileInputStream.
</quote>

In other words, you're not specifying the encoding in the reader, so it
falls back on the platform's default encoding, and that encoding evidently
doesn't match the encoding used in your XML file.

Did you try using the constructor of InputSource which takes a byte
stream instead of a character stream?
http://java.sun.com/j2se/1.5.0/docs...tSource.html#InputSource(java.io.InputStream)
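
For illustration, here is a minimal sketch of the two ways around the
default-encoding problem: hand the parser raw bytes so that it can read the
encoding declaration itself, or keep a Reader but name the charset
explicitly. The class name ByteStreamInput, the helper parseText, and the
in-memory byte array standing in for file.xml are assumptions made for this
example; they are not part of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ByteStreamInput {
    // Parses the source and returns all character data the parser reported.
    static String parseText(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for file.xml: the document as ISO-8859-1 bytes.
        // \u00e9 is 'é', written as an escape so the source encoding does not matter.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><value>d\u00e9tail</value>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Option 1: byte stream. The parser decodes the bytes itself,
        // using the encoding named in the XML declaration.
        String fromBytes = parseText(new InputSource(new ByteArrayInputStream(xml)));

        // Option 2: character stream with an explicit charset instead of
        // the platform default that FileReader would use.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xml), StandardCharsets.ISO_8859_1);
        String fromReader = parseText(new InputSource(reader));

        System.out.println("same text from both inputs: " + fromBytes.equals(fromReader));
    }
}
```

With a real file, new FileInputStream("file.xml") would take the place of
the ByteArrayInputStream in both options.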

- Oliver
 
Ghislain Benrais

I tried :
input = new InputSource(new FileInputStream("file.xml"));
and it works !
Thank you Oliver !
 
Chris Uppal

Ghislain said:
I tried :
input = new InputSource(new FileInputStream("file.xml"));
and it works !

But now you are overriding the encoding specified in the input file with the
one used by the FileInputStream -- and that will be whatever your Java system
default is.

As far as I can see, your earlier code would have used the charset specified in
the XML file, and -- as far as I can tell that /ought/ to work correctly. I
have no idea why it doesn't.

-- chris
 
Oliver Wong

Chris Uppal said:
But now you are overriding the encoding specified in the input file with the
one used by the FileInputStream -- and that will be whatever your Java system
default is.

As far as I can see, your earlier code would have used the charset specified in
the XML file, and -- as far as I can tell that /ought/ to work correctly. I
have no idea why it doesn't.

The original code specifies the *OUTPUT* encoding, but not the input
one.

- Oliver
 
Oliver Wong

Oliver Wong said:
The original code specifies the *OUTPUT* encoding, but not the input
one.

Oops, sorry, I misread your post, Chris.

Here's what I suspect is happening in the original code: A FileReader is
created with no specified encoding. A FileReader doesn't know anything about
XML, so it's not going to look for an XML declaration and check its encoding
attribute. Instead, the FileReader just uses the system default encoding,
reads a stream of bytes from the disk, transforms them into a stream of
characters, and passes these characters to the XMLReader. By the time the
XMLReader receives these characters, they've already been decoded under some
specific encoding, so it's "too late" for the XMLReader to try to use the
encoding information specified in the XML file.

That's why I suggested the OP use the constructor which takes in a
stream of bytes instead. The XMLReader will probably decode the first few
bytes using ASCII or UTF-8, until it finds an encoding specified in the
file, in which case it does whatever magic it needs to do to switch encoding
mid-stream.

And it turns out that's what the OP actually did. FileInputStream
processes files as a stream of bytes, and not as a stream of characters, so
no encoding/decoding is done by FileInputStream.
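
The "too late" effect can be reproduced without any XML parser. The sketch
below decodes the ISO-8859-1 bytes of "détail" the way a Reader would and
then re-encodes them as ISO-8859-15, as the original program does for
out.txt. It assumes the Linux machine defaulted to UTF-8 and the DOS box to
a Latin-1 compatible code page; the thread never states either default. The
class name DefaultCharsetPitfall and the helper roundTrip are made up for
this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetPitfall {
    // Decodes the file bytes with the Reader's charset, then encodes the
    // result with the output charset, mimicking FileReader -> OutputStreamWriter.
    static String roundTrip(byte[] fileBytes, Charset readerCharset, Charset outputCharset) {
        String decoded = new String(fileBytes, readerCharset);
        byte[] written = decoded.getBytes(outputCharset);
        return new String(written, outputCharset);
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; in ISO-8859-1 it is the single byte 0xE9.
        byte[] fileBytes = "d\u00e9tail".getBytes(StandardCharsets.ISO_8859_1);
        Charset out = Charset.forName("ISO-8859-15");

        // Reader charset matches the file: the accented character survives.
        String matching = roundTrip(fileBytes, StandardCharsets.ISO_8859_1, out);

        // Reader charset assumed to be UTF-8: 0xE9 followed by 't' is not
        // valid UTF-8, so it decodes to U+FFFD, which ISO-8859-15 cannot
        // represent and therefore writes as '?'.
        String mismatched = roundTrip(fileBytes, StandardCharsets.UTF_8, out);

        System.out.println("matching reader keeps the accent: " + matching.equals("d\u00e9tail"));
        System.out.println("UTF-8 reader yields: " + mismatched);
        System.out.println("default charset on this machine: " + Charset.defaultCharset());
    }
}
```

Under those assumptions the mismatched case comes out as "d?tail", the same
text the OP saw on Linux.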

- Oliver
 
Chris Uppal

Oliver said:
The original code specifies the *OUTPUT* encoding, but not the input
one.

Yes, precisely. And if the input encoding is not specified from code, then (as
I understand it) the SAX implementation is /supposed/ to take it from the XML
(where, in the OP's example, it was declared as "ISO-8859-1"). Using a
FileInputStream means that the input is decoded by that stream before the XML
parser sees it -- which may not be what is desired. More specifically, the
code I commented on uses the Java system default decoder (whatever that happens
to be) -- which is almost certainly not what is desired.

-- chris
 
Chris Uppal

I said:
Yes, precisely. And if the input encoding is not specified from code,
then (as I understand it) the SAX implementation is /supposed/ to take it
from the XML (where, in the OP's example, it was declared as "ISO-8859-1").
Using a FileInputStream means that the input is decoded by that stream
before the XML parser sees it -- which may not be what is desired. More
specifically, the code I commented on uses the Java system default
decoder (whatever that happens to be) -- which is almost certainly not
what is desired.

Oops, sorry, I misread your post Oliver.

;-) (But the "sorry" is real)

I misread both your post and the OP, in fact. I was under the impression that
he was originally using a FileInputStream, and you were "correcting" that to a
FileReader. My mistake.

-- chris
 
Dale King

Oliver said:
That's why I suggested the OP use the constructor which takes in a
stream of bytes instead. The XMLReader will probably decode the first
few bytes using ASCII or UTF-8, until it finds an encoding specified in
the file, in which case it does whatever magic it needs to do to switch
encoding mid-stream.

It is UTF-8 by the way. XML can get encoding information from:

- an external transport protocol (e.g. HTTP or MIME), which is really the
only reason to use a Reader as input to XMLReader
- an encoding declaration, as in <?xml version='1.0' encoding='UTF-8'?>
- a byte order mark

If none of the above is present, it is a fatal error for the XML to be
in anything but UTF-8.
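
A small sketch of that last rule, assuming the SAX parser bundled with the
JDK: the same element is fed to the parser as raw bytes, with and without an
encoding declaration. The class name UndeclaredEncoding and the helper
parses are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class UndeclaredEncoding {
    // Returns true if the bytes parse as well-formed XML,
    // false if the parser rejects them for any reason.
    static boolean parses(byte[] xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \u00e9 is 'é', written as an escape so the source encoding does not matter.
        String body = "<value>d\u00e9tail</value>";
        String declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body;

        // No declaration, no BOM, bytes really are UTF-8: accepted.
        System.out.println("undeclared UTF-8:      " + parses(body.getBytes(StandardCharsets.UTF_8)));
        // No declaration, no BOM, bytes are ISO-8859-1: not valid UTF-8, rejected.
        System.out.println("undeclared ISO-8859-1: " + parses(body.getBytes(StandardCharsets.ISO_8859_1)));
        // Same ISO-8859-1 bytes, but with a matching declaration: accepted.
        System.out.println("declared ISO-8859-1:   " + parses(declared.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The helper swallows the exception only to keep the example short; real code
would want to report it.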
 
