Michael
Hi all,
I'm trying to serialize an xml document with JAXP. The xml may or may not
contain international characters, and so I want any text elements to be
UTF-8 encoded. Consider the following (a brief summary is included below the
code):
---- code begin ----
// Build a small DOM with one attribute and a text node containing non-ASCII characters.
org.w3c.dom.Document doc =
    javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
org.w3c.dom.Element el = doc.createElement("element");
el.setAttribute("attr1", "attr1value");
el.appendChild(doc.createTextNode("Danish < æøå > characters!"));
doc.appendChild(el);

// Create an identity transformer and ask for indented output.
javax.xml.transform.TransformerFactory transformerFactory =
    javax.xml.transform.TransformerFactory.newInstance();
javax.xml.transform.Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(javax.xml.transform.OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

// Serialize the DOM into a StringWriter and print the result.
java.io.StringWriter xmlout = new java.io.StringWriter();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(xmlout);
transformer.transform(new javax.xml.transform.dom.DOMSource(doc), result);
System.out.println(xmlout.getBuffer());
---- code end ----
So, I'm creating a document (DOM), setting an attribute and appending a text
node with international characters (and a couple of brackets just for fun).
Then I create a transformer instance, I ask it to indent the output nicely
and finally to actually serialize my DOM into xml.
When I run this code (in a JSP file on a Tomcat 4.1.x server with the latest
Xerces2-J version installed) I get this output:
<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; æøå &gt; characters!</element>
Okay. So I got the < and > converted as I expected. However, the
international characters do not appear to have been encoded to UTF-8 or
anything else for that matter. In fact, the above isn't even a valid XML
document, and several parsers I tried (including Microsoft XML) reject it
because of the illegal character data. Clearly there is a mismatch between
what the XML header's encoding specifies and what actually appears in the
text nodes of the document. It's very curious that JAXP will transform a DOM
into a result that isn't valid.
Interestingly, when I run the same code interactively inside my WebSphere
Studio Application Developer 5 (using what is known as a scrapbook page), I
get this:
<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; &#230;&#248;&#229; &gt; characters!</element>
Well. I'm not sure that #230 is a correct UTF-8 encoding of "æ" (in fact I'm
sure it isn't), but at least the document is now valid and even Microsoft
XML will parse it without complaints.
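
For what it's worth, here is a small sketch of the difference between the two things being compared above. A numeric character reference such as &#230; names the Unicode code point (230 decimal, U+00E6) in plain ASCII and is independent of the document encoding, whereas the UTF-8 encoding of "æ" is a two-byte sequence. The class and method names (CharRefDemo, charRef, utf8Hex) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CharRefDemo {
    // Builds the decimal numeric character reference for a code point.
    static String charRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Returns the UTF-8 bytes of a string as space-separated hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ae = "\u00E6"; // the letter "æ"
        System.out.println(charRef(ae.codePointAt(0))); // prints &#230;
        System.out.println(utf8Hex(ae));                // prints C3 A6
    }
}
```

So both forms can legitimately appear in a document declared as UTF-8: the reference is pure ASCII, and the raw character is fine as long as the bytes on the wire really are C3 A6.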
I am hoping that someone out there can shed some light on this problem and
tell me what I am doing wrong. Exactly how do I instruct JAXP to encode the
text nodes in my DOM so that it doesn't break my XML parser?
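
One thing that may be worth trying, offered as a sketch under assumptions rather than a confirmed diagnosis: a StringWriter collects Java characters, not bytes, so the encoding named in the XML declaration is only honoured if whatever later turns those characters into bytes also uses UTF-8. Serializing into a byte stream with OutputKeys.ENCODING set explicitly lets the transformer do the encoding itself. The class and helper names below (Utf8Serialize, serialize, roundTrip) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8Serialize {
    // Serializes the DOM to bytes, letting the transformer apply the encoding.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Builds a one-element document, serializes it, parses the bytes back
    // and returns the text content that the parser recovered.
    static String roundTrip(String text, String encoding) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element el = doc.createElement("element");
            el.setAttribute("attr1", "attr1value");
            el.appendChild(doc.createTextNode(text));
            doc.appendChild(el);

            byte[] bytes = serialize(doc, encoding);
            Document back = builder.parse(new ByteArrayInputStream(bytes));
            return back.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Danish < \u00E6\u00F8\u00E5 > characters!";
        // If the bytes match the declared encoding, the text survives intact.
        System.out.println(text.equals(roundTrip(text, "UTF-8")));
    }
}
```

In a JSP the same idea would mean making sure the response itself is written as UTF-8, since the page's own writer does the character-to-byte conversion when a StringWriter's contents are printed.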
Regards,
Michael Berg
www.hyperpal.com