UTF-8 incorrect from org.apache.xml.serialize.XMLSerializer

Jim Cobban

I must be missing something.

I am using org.apache.xml.serialize.XMLSerializer to save a DOM but I am not
getting non-basic characters converted to UTF-8.

I create Text nodes in the DOM by, for example:

Document doc;
JTextArea textPrompt;
Text newTextNode;
Element descElt;
....
newTextNode = doc.createTextNode(textPrompt.getText());
descElt.appendChild(newTextNode);

The code to serialize the DOM is:

private void saveXml(Document document)
{
    // rename the existing layout file
    new File(fileName).renameTo(new File(fileName + "~"));
    // write the document out
    OutputFormat format = new OutputFormat(document);
    format.setIndenting(true);
    format.setLineWidth(0);
    format.setPreserveSpace(true);
    try {
        XMLSerializer serializer;
        serializer = new XMLSerializer(
            new FileWriter(fileName),
            format);
        serializer.asDOMSerializer();
        serializer.serialize(document);
    }
    catch (IOException ioe)
    {
        ....
    }
}

If I enter a character such as é (e with acute accent) into the JTextArea
and I look at the XML file using a non-UTF-8-aware editor, I see that the é
has been written as a single byte, not as the two-byte UTF-8 sequence. If I
subsequently try to read the XML file using Xerces, it blows up because of
the invalid byte sequence.

How do I get a valid serialization of this DOM into XML using UTF-8?


--
Jim Cobban (e-mail address removed)
34 Palomino Dr.
Kanata, ON, CANADA
K2M 1M1
+1-613-592-9438
 
Martin Honnen

Jim said:
OutputFormat format = new OutputFormat(document);

Does it help if you explicitly set the encoding when you create the format:
new OutputFormat(document, "UTF-8", true);
?
 
Jim Cobban

Martin Honnen said:
Does it help if you explicitly set the encoding when you create the format:
new OutputFormat(document, "UTF-8", true);
?

Martin Honnen
http://JavaScript.FAQTs.com/

No. Explicitly setting the encoding in the format does not change the
behavior. The non-basic character is still written to the output as a single
byte rather than as the two-byte UTF-8 sequence it should be.
 
Jim Cobban

Jim Cobban said:
I must be missing something.

I was misunderstanding something:

XMLSerializer serializer;
serializer = new XMLSerializer(
    new FileWriter(fileName),
    format);

When I replaced this with:

serializer = new XMLSerializer(
    new FileOutputStream(fileName),
    format);

it worked correctly. The problem was that a FileWriter is constructed with
the platform's default encoding, so passing one in left no opportunity to
specify the UTF-8 encoding. The second form of the constructor permits the
new instance of XMLSerializer to supply the correct encoding when it
constructs the instance of OutputStreamWriter under the covers.
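
To make the difference concrete, here is a minimal sketch that needs only the JDK. It writes é through an OutputStreamWriter once with ISO-8859-1 and once with UTF-8 and prints the resulting bytes. ISO-8859-1 is only an assumption standing in for whatever the platform default was on my machine; the class and method names (WriterEncodingDemo, encode, hex) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Encode text the way a Writer bound to the given charset would.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        // Assumed stand-in for a single-byte platform default encoding.
        System.out.println(hex(encode(eAcute, StandardCharsets.ISO_8859_1))); // E9
        // What a UTF-8 file must contain for the same character.
        System.out.println(hex(encode(eAcute, StandardCharsets.UTF_8)));      // C3 A9
    }
}
```

If a Writer really is needed, wrapping the stream as new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8") should give the same bytes as the second line. Note that from Java 18 onward the default charset is UTF-8, so the FileWriter version may happen to work there.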

Basically my mistake was copying sample code from the distribution without
taking the time to understand exactly what it was doing. Once I took that
time I realized that I was using the wrong constructor.
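
For anyone who wants to reproduce the read failure, the following sketch uses the JDK's built-in JAXP parser rather than a separate Xerces jar (an assumption on my part that it behaves the same way here). It parses a tiny document that declares encoding="UTF-8" twice: once from genuine UTF-8 bytes and once from ISO-8859-1 bytes, which is roughly what the FileWriter version produced. The document content and the names RoundTripCheck and parses are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class RoundTripCheck {
    // A document that claims to be UTF-8 and contains one non-ASCII character.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><desc>caf\u00e9</desc>";

    // Returns true if the bytes parse as XML, false if the parser rejects them.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            // Malformed byte sequences surface as one of these two exceptions.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes that match the declared encoding.
        System.out.println(parses(XML.getBytes(StandardCharsets.UTF_8)));
        // Bytes written in a single-byte encoding despite the UTF-8 declaration.
        System.out.println(parses(XML.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The second call fails because the lone byte 0xE9 looks like the start of a multi-byte UTF-8 sequence, and the byte that follows it is not a valid continuation byte. The parser may also print a fatal-error line to standard error.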
 
