Serializing XML with JAXP - help needed

Michael

Hi all,

I'm trying to serialize an xml document with JAXP. The xml may or may not
contain international characters, and so I want any text elements to be
UTF-8 encoded. Consider the following (a brief summary is included below the
code):

---- code begin ----

// Build an empty DOM document.
org.w3c.dom.Document doc =
    javax.xml.parsers.DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();

org.w3c.dom.Element el = doc.createElement("element");
el.setAttribute("attr1", "attr1value");
el.appendChild(doc.createTextNode("Danish < æøå > characters!"));
doc.appendChild(el);

// Identity transformer, used here purely as a serializer.
javax.xml.transform.TransformerFactory transformerFactory =
    javax.xml.transform.TransformerFactory.newInstance();
javax.xml.transform.Transformer transformer =
    transformerFactory.newTransformer();

transformer.setOutputProperty(javax.xml.transform.OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

// Collect the serialized XML as characters.
java.io.StringWriter xmlout = new java.io.StringWriter();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(xmlout);
transformer.transform(new javax.xml.transform.dom.DOMSource(doc), result);

System.out.println(xmlout.getBuffer());

---- code end ----

So, I'm creating a document (DOM), setting an attribute and appending a text
node with international characters (and a couple of brackets just for fun).
Then I create a transformer instance, I ask it to indent the output nicely
and finally to actually serialize my DOM into xml.

When I run this code (in a jsp file on a tomcat 4.1.x server with the latest
xerces2-j version installed) I get this output:

<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; æøå &gt; characters!</element>

Okay. So I got the < and > converted as I expected. However, the
international characters do not appear to have been encoded to UTF-8 or
anything else for that matter. In fact, the above isn't even a valid XML
document, and several parsers I tried (including Microsoft XML) reject it
because of the illegal character data. Clearly there is a mismatch between
what the encoding in the XML header specifies and what actually appears in
the text nodes of the document. It's very curious that JAXP will transform a
DOM into a result that isn't valid.
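
One possible reading of that rejection, offered as an assumption since nothing above confirms which encoding the JSP response actually used: if the characters are written out as ISO-8859-1 while the header says UTF-8, the bytes for æøå are not a legal UTF-8 sequence. A small sketch of that check (the class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00E6\u00F8\u00E5"; // the characters æøå

        // Assumed scenario: the text leaves the server as ISO-8859-1 bytes.
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // What the header promises: UTF-8 bytes.
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(asLatin1)); // false
        System.out.println(isValidUtf8(asUtf8));   // true
    }
}
```

A parser that trusts the encoding="UTF-8" header would hit the same malformed bytes, which would be consistent with the "illegal character data" complaints.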

Interestingly, when I run the same code interactively inside my WebSphere
Studio Application Developer 5 (using what is known as a scrapbook page), I
get this:

<?xml version="1.0" encoding="UTF-8"?>
<element attr1="attr1value">Danish &lt; &#230;&#248;&#229; &gt; characters!</element>

Well. I'm not sure that #230 is a correct UTF-8 encoding of "æ" (in fact I'm
sure it isn't), but at least the document is now valid and even Microsoft
XML will parse it without complaints.
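
For what it's worth, 230 is the Unicode code point of "æ" written as a numeric character reference, which is a different thing from its UTF-8 byte encoding. A minimal sketch to show the difference (the class and helper names are invented for this example, and it assumes Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

public class CharRefDemo {
    // Returns the UTF-8 bytes of a string as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ae = "\u00E6"; // the character æ

        // Code point, as used in a numeric character reference.
        System.out.println((int) ae.charAt(0)); // 230

        // The same character as UTF-8 bytes.
        System.out.println(utf8Hex(ae)); // C3 A6
    }
}
```

A character reference uses only ASCII characters, so the document stays parseable no matter which encoding the bytes end up in.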

I am hoping that someone out there can shed some light on this problem and
tell me what I am doing wrong. Exactly how do I instruct JAXP to encode the
text nodes in my DOM so that it doesn't break my XML parser? :)

Regards,
Michael Berg
www.hyperpal.com
 
Michael Berg

Hi all,

The problem is related to the use of a StringWriter to collect the XML
output. Apparently a StringWriter only collects characters and never applies
the UTF-8 encoding that the XML header promises, so use an
OutputStreamWriter instead, like this, for example:

java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
javax.xml.transform.stream.StreamResult result =
    new javax.xml.transform.stream.StreamResult(
        new java.io.OutputStreamWriter(baos, "UTF-8"));
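
Put together as something that runs on its own, it could look roughly like the sketch below. This is a variation on the snippet above, not the same thing: it hands the ByteArrayOutputStream straight to the StreamResult and sets OutputKeys.ENCODING, so the transformer does the encoding itself. The class and method names are made up for the example, and it assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8Serialize {
    // Builds the same small document as in the original question.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element el = doc.createElement("element");
        el.setAttribute("attr1", "attr1value");
        el.appendChild(doc.createTextNode("Danish < \u00E6\u00F8\u00E5 > characters!"));
        doc.appendChild(el);
        return doc;
    }

    // Serializes the DOM to bytes; the transformer writes to a byte
    // stream, so the header and the actual bytes come from one place.
    static byte[] serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(baos));
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = serialize(sampleDocument());
        // Decode only for display; the byte array is what should be
        // written to a file or a response stream.
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

The byte array is the thing to send or store. If it gets turned back into a String and written through some other Writer, the encoding question comes right back.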

/Michael
www.hyperpal.com
 
