UTF-8 incorrect from org.apache.xml.serialize.XMLSerializer

Discussion in 'XML' started by Jim Cobban, Dec 5, 2003.

  1. Jim Cobban

    Jim Cobban Guest

    I must be missing something.

    I am using org.apache.xml.serialize.XMLSerializer to save a DOM, but
    non-ASCII characters are not being written as UTF-8.

    I create Text nodes in the DOM by, for example:

    Document doc;
    JTextArea textPrompt;
    Text newTextNode;
    Element descElt;
    ....
    newTextNode = doc.createTextNode(textPrompt.getText());
    descElt.appendChild(newTextNode);

    The code to serialize the DOM is:

    private void saveXml(Document document)
    {
        // rename the existing layout file
        new File(fileName).renameTo(new File(fileName + "~"));
        // write the document out
        OutputFormat format = new OutputFormat(document);
        format.setIndenting(true);
        format.setLineWidth(0);
        format.setPreserveSpace(true);
        try {
            XMLSerializer serializer;
            serializer = new XMLSerializer(
                new FileWriter(fileName),
                format);
            serializer.asDOMSerializer();
            serializer.serialize(document);
        }
        catch (IOException ioe)
        {
            ....
        }
    }

    If I enter a character such as é (e with acute accent) into the JTextArea
    and look at the XML file using a non-UTF-8-aware editor, I see that the é
    has been written as a single byte, not as the two-byte UTF-8 sequence.
    If I subsequently try to read the XML file using Xerces, it fails because
    of the invalid byte sequence.

    How do I get a valid serialization of this DOM into XML using UTF-8?


    Jim Cobban, Dec 5, 2003
    #1

  2. Jim Cobban wrote:

    > OutputFormat format = new OutputFormat(document);


    Does it help if you set the encoding explicitly when you create the
    OutputFormat?

    OutputFormat format = new OutputFormat(document, "UTF-8", true);


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Dec 5, 2003
    #2

  3. Jim Cobban

    Jim Cobban Guest

    "Martin Honnen" <> wrote in message
    news:3fd09538$...
    >
    >
    > Does it help if you explicitly set
    > new OutputFormat(document, "UTF-8", true);
    > ??
    >
    > Martin Honnen
    > http://JavaScript.FAQTs.com/
    >

    No. Explicitly setting the encoding in the OutputFormat does not change
    the behavior. The non-ASCII character is still written to the output as a
    single byte rather than as the two-byte UTF-8 sequence it should be.
    Jim Cobban, Dec 5, 2003
    #3
  4. Jim Cobban

    Jim Cobban Guest

    "Jim Cobban" <> wrote in message
    news:...
    > I must be missing something.


    I was misunderstanding something:

    > XMLSerializer serializer;
    > serializer = new XMLSerializer (
    > new FileWriter(fileName),
    > format);


    When I replaced this with:

    serializer = new XMLSerializer(
        new FileOutputStream(fileName),
        format);

    it worked correctly. The problem was that a FileWriter constructed this
    way always uses the platform default encoding, so the encoding set in the
    OutputFormat never reached the bytes written to the file. The second
    form of the constructor lets the new instance of XMLSerializer supply the
    correct encoding when it constructs its own OutputStreamWriter under the
    covers.
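
    To illustrate the point about encodings, here is a minimal sketch that
    uses only the JDK, not Xerces. The class name EncodingDemo and the encode
    helper are made up for this example. It pushes the same character through
    a Writer bound to ISO-8859-1 and through one bound to UTF-8, and shows
    that the choice of charset on the Writer decides which bytes end up in
    the output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces or of the original code.
public class EncodingDemo {

    // Write the text through a Writer bound to the given charset
    // and return the raw bytes that the Writer produced.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Format bytes as space-separated hex pairs, e.g. "C3 A9".
    public static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        // A single-byte charset writes one byte, which is not valid UTF-8 on its own.
        System.out.println("ISO-8859-1: " + hex(encode(eAcute, StandardCharsets.ISO_8859_1)));
        // UTF-8 writes the two-byte sequence that an XML parser expects.
        System.out.println("UTF-8:      " + hex(encode(eAcute, StandardCharsets.UTF_8)));
    }
}
```

    On a platform whose default encoding is a single-byte charset, a
    FileWriter behaves like the first call, which matches the single byte
    seen in the file.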

    Basically my mistake was copying sample code from the distribution without
    taking the time to understand exactly what it was doing. Once I took that
    time I realized that I was using the wrong constructor.
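
    For comparison, the same idea (hand the serializer a byte stream and let
    it apply the encoding) can be sketched with the standard
    javax.xml.transform API instead of XMLSerializer. This is a different
    technique from the one used above, shown only because it runs on a plain
    JDK. The class name Utf8DomWriter and its helper methods are invented for
    this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original code.
public class Utf8DomWriter {

    // Build a tiny document with one element holding the given text.
    public static Document sample(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element desc = doc.createElement("desc");
            desc.appendChild(doc.createTextNode(text));
            doc.appendChild(desc);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize the DOM to bytes. The result wraps an OutputStream, not a
    // Writer, so the transformer itself applies the requested encoding.
    public static byte[] serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parse the bytes back into a DOM to confirm they are well-formed XML.
    public static Document parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serialize(sample("caf\u00e9"));
        Document roundTrip = parse(xml);
        // The text should survive the write and re-read unchanged.
        System.out.println("caf\u00e9".equals(
                roundTrip.getDocumentElement().getTextContent()));
    }
}
```

    To write to a file instead, a FileOutputStream could be passed to
    StreamResult in place of the ByteArrayOutputStream.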
    Jim Cobban, Dec 6, 2003
    #4
