Dale Gerdemann
I'm having trouble with Unicode encoding in DOM. As a simple example,
I read in a UTF-8 encoded xml file such as:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<aText>letter 'a' with umlaut: ä</aText>
When I serialize it, the output comes out encoded as ISO-8859-1. But I
don't think the problem is limited to serialization. While processing
my XML files, I match pieces of text and attribute values against
Unicode/UTF-8 text read in from another source. Whenever the strings
in my XML file contain non-ASCII characters, the matching fails.
Hopefully I've explained the problem well enough that someone can
help. In case it's needed, I've attached a bit of code at the end for
reading in and serializing a DOM.
Dale Gerdemann
----------------
import org.xml.sax.InputSource;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import org.w3c.dom.Document;
import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.dom.DOMImplementationImpl;
import org.xml.sax.SAXException;
import org.w3c.dom.DOMException;
import java.io.IOException;
import org.w3c.dom.Element;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.LineSeparator;

public class AProblem {
    public static void main(String[] args)
            throws DOMException, IOException, SAXException {
        // Parse foo.xml into a DOM, telling the parser the bytes are UTF-8.
        DOMParser parser = new DOMParser();
        InputSource is = new InputSource(
                new FileInputStream(new File("foo.xml")));
        is.setEncoding("UTF-8");
        parser.parse(is);
        Document doc = parser.getDocument();
        Element root = doc.getDocumentElement();
        System.out.println(root.getChildNodes().item(0));

        // Serialize the DOM to bar.xml, requesting UTF-8 in the output format.
        OutputFormat format = new OutputFormat(doc);
        format.setLineSeparator(LineSeparator.Unix);
        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        format.setEncoding("UTF-8");
        FileWriter fw = new FileWriter("bar.xml");
        XMLSerializer serializer = new XMLSerializer(fw, format);
        serializer.serialize(doc);
    }
}
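
One thing that may be worth checking in the code above: FileWriter
encodes characters with the platform default charset (on JDKs before
18), so the bytes in bar.xml can end up as ISO-8859-1 even though the
OutputFormat asks for UTF-8. The sketch below is only an illustration
of that idea using plain JDK classes, not a confirmed diagnosis. The
class name ExplicitCharsetSketch and the helpers writeUtf8 and readUtf8
are hypothetical names made up for this example.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class: shows writing and reading text with an
// explicit UTF-8 charset instead of relying on the platform default.
public class ExplicitCharsetSketch {

    // Write text through a Writer whose charset is fixed to UTF-8.
    static void writeUtf8(Path path, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                Files.newOutputStream(path), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Decode the file's bytes as UTF-8, again ignoring the default charset.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // \u00e4 is 'a' with umlaut; the escape keeps this source file ASCII.
        String text = "letter 'a' with umlaut: \u00e4";
        Path path = Files.createTempFile("bar", ".xml");

        writeUtf8(path, text);
        String roundTrip = readUtf8(path);

        // The strings match only because both sides used the same charset.
        System.out.println("round trip equal: " + text.equals(roundTrip));
        System.out.println("bytes on disk: " + Files.readAllBytes(path).length);
    }
}
```

If this is indeed the cause, the same kind of Writer could be passed to
XMLSerializer in place of the FileWriter, or the serializer could be
given a FileOutputStream through its (OutputStream, OutputFormat)
constructor so that it does the encoding itself. The same reasoning
would apply to the text read from the other source: decoding it with an
explicit UTF-8 reader should make its strings comparable with the ones
in the DOM.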