Unicode problem with Java Xerces DOM

Dale Gerdemann

I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<aText>letter 'a' with umlaut: ä</aText>

And when I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes with some
Unicode/UTF-8 text read in from another source. When the strings in my
XML file contain non-ASCII characters, then I have problems.

Hopefully, I've explained the problem well enough that someone can help.
In case it's necessary, I attach a bit of code at the end for reading
in and serializing a DOM.

Dale Gerdemann
----------------
import org.xml.sax.InputSource;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.DOMImplementationImpl;
import org.xml.sax.SAXException;
import org.w3c.dom.DOMException;
import java.io.IOException;
import org.w3c.dom.Element;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.LineSeparator;


public class AProblem {

    public static void main(String[] args)
            throws DOMException, IOException, SAXException {

        DOMParser parser = new DOMParser();
        InputSource is = new InputSource(new FileInputStream(new File("foo.xml")));
        is.setEncoding("UTF-8");
        parser.parse(is);
        Document doc = parser.getDocument();
        Element root = doc.getDocumentElement();
        System.out.println(root.getChildNodes().item(0));

        OutputFormat format = new OutputFormat(doc);
        format.setLineSeparator(LineSeparator.Unix);
        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        format.setEncoding("UTF-8");
        FileWriter fw = new FileWriter("bar.xml");

        XMLSerializer serializer = new XMLSerializer(fw, format);
        serializer.serialize(doc);
    }
}
 
Kenneth Stephen

Dale said:
I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<aText>letter 'a' with umlaut: ä</aText>

And when I serialize it, it comes out encoded as ISO-8859-1. But I
don't think the problem is with serialization. In processing my XML
files, I'm matching bits and pieces of text and attributes with some
Unicode/UTF-8 text read in from another source. When the strings in my
XML file contain non-ASCII characters, then I have problems.

Dale,

What JDK are you using and under which env?

Regards,
Kenneth
 
Steve W. Jackson

:I'm having trouble with Unicode encoding in DOM. As a simple example,
:I read in a UTF-8 encoded xml file such as:
:
:<?xml version="1.0" encoding="UTF-8" standalone="no"?>
:
:<aText>letter 'a' with umlaut: ä</aText>
:
:And when I serialize it, it comes out encoded as ISO-8859-1. But I
:don't think the problem is with serialization. In processing my XML
:files, I'm matching bits and pieces of text and attributes with some
:Unicode/UTF-8 text read in from another source. When the strings in my
:XML file contain non-ASCII characters, then I have problems.
:
:Hopefully, I've explained the problem enough so that someone can help.
:In case it's necessary, I attach at the end, a bit of code for reading
:in and serializing a DOM.
:
:Dale Gerdemann
:----------------
:import org.xml.sax.InputSource;
:import java.io.FileInputStream;
:import java.io.File;
:import java.io.FileWriter;
:import org.w3c.dom.Document;
:import org.apache.xerces.parsers.DOMParser;
:import org.apache.xerces.dom.DOMImplementationImpl;
:import org.xml.sax.SAXException;
:import org.w3c.dom.DOMException;
:import java.io.IOException;
:import org.w3c.dom.Element;
:import org.apache.xml.serialize.OutputFormat;
:import org.apache.xml.serialize.XMLSerializer;
:import org.apache.xml.serialize.LineSeparator;
:
:
:public class AProblem {
:
: public static void main(String[] args)
: throws DOMException, IOException, SAXException {
:
: DOMParser parser = new DOMParser();
: InputSource is = new InputSource(new FileInputStream(new
:File("foo.xml")));
: is.setEncoding("UTF-8");
: parser.parse(is);
: Document doc = parser.getDocument();
: Element root = doc.getDocumentElement();
: System.out.println(root.getChildNodes().item(0));
:
:
:
: OutputFormat format = new OutputFormat(doc);
: format.setLineSeparator(LineSeparator.Unix);
:
: format.setIndenting(true);
: format.setLineWidth(0);
: format.setPreserveSpace(true);
: format.setEncoding("UTF-8");
: FileWriter fw = new FileWriter("bar.xml");
:
: XMLSerializer serializer = new XMLSerializer(fw, format);
: serializer.serialize(doc);
:
:
: }
:}

I've encountered this problem myself. The solution was to use something
besides a FileWriter to output your new XML document, since the encoding
declared in the XML and the encoding of the bytes actually written to
the file both have to be set, and they have to agree.

Your OutputFormat object specifies that the XML gets UTF-8 encoding, but
the FileWriter will use your system's default encoding. What I use now
is an OutputStreamWriter with its constructor taking an OutputStream (I
use a FileOutputStream) and a String naming the encoding. That solved
the problem for me.
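
For what it's worth, here is a minimal sketch of that idea using only JDK classes. The class and method names (ExplicitEncodingWriter, open, writeAndReadBack) are made up for illustration, and it writes a bare string rather than a DOM so that it runs without Xerces on the classpath. The assumption is that the Writer returned by open() would simply be passed to the XMLSerializer constructor in place of the FileWriter.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingWriter {

    // Open a Writer that encodes characters with the named charset,
    // instead of whatever the platform default happens to be.
    static Writer open(Path file, String encoding) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(file.toFile()), encoding);
    }

    // Write the text to a temporary file with the given encoding and
    // return the raw bytes that ended up on disk.
    static byte[] writeAndReadBack(String text, String encoding) throws IOException {
        Path file = Files.createTempFile("bar", ".xml");
        try (Writer w = open(file, encoding)) {
            w.write(text);
        }
        byte[] bytes = Files.readAllBytes(file);
        Files.delete(file);
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        String text = "\u00e4"; // the letter 'a' with umlaut
        // Two bytes (0xC3 0xA4) in UTF-8, a single byte (0xE4) in ISO-8859-1.
        System.out.println("UTF-8 bytes: " + writeAndReadBack(text, "UTF-8").length);
        System.out.println("ISO-8859-1 bytes: " + writeAndReadBack(text, "ISO-8859-1").length);
    }
}
```

The same character comes out as different bytes depending on the Writer's encoding, which is why a file whose XML declaration says UTF-8 can still contain ISO-8859-1 bytes when a FileWriter is used on a system whose default is ISO-8859-1.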

I also note that you're specifying UTF-8 on input. While I doubt it
does any harm, it shouldn't be necessary.
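
To illustrate why, here is a rough sketch that hands the parser a raw byte stream and lets it pick up the encoding from the XML declaration. It uses the JDK's standard DocumentBuilder rather than the Xerces DOMParser from your post (that substitution is my assumption, made so the example runs without extra jars), and the names DeclaredEncodingParse and rootText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingParse {

    // Parse raw bytes without naming an encoding; the parser reads the
    // encoding from the XML declaration in the byte stream itself.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><aText>\u00e4</aText>";
        String text = rootText(xml.getBytes(StandardCharsets.UTF_8));
        // Once parsed, the text is an ordinary Java String and can be
        // compared with any other String regardless of the file encoding.
        System.out.println(text.equals("\u00e4"));
    }
}
```

If the strings coming from your other source are decoded with the right charset when they are read, comparing them against the DOM text should behave the same way.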

Hope this helps.

= Steve =
 
