Serializing XML with JAXP - help needed

Discussion in 'Java' started by Michael, Feb 22, 2004.

  1. Michael

    Michael Guest

    Hi all,

    I'm trying to serialize an xml document with JAXP. The xml may or may not
    contain international characters, and so I want any text elements to be
    UTF-8 encoded. Consider the following (a brief summary is included below the
    code):

    ---- code begin ----

    org.w3c.dom.Document doc =
    javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder().
    newDocument();

    org.w3c.dom.Element el = doc.createElement("element");
    el.setAttribute("attr1","attr1value");
    el.appendChild(doc.createTextNode("Danish < æøå > characters!"));
    doc.appendChild(el);

    javax.xml.transform.TransformerFactory transformerFactory =
    javax.xml.transform.TransformerFactory.newInstance();
    javax.xml.transform.Transformer transformer =
    transformerFactory.newTransformer();

    transformer.setOutputProperty(javax.xml.transform.OutputKeys.INDENT,"yes");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount","4");

    java.io.StringWriter xmlout = new java.io.StringWriter();
    javax.xml.transform.stream.StreamResult result = new
    javax.xml.transform.stream.StreamResult(xmlout);
    transformer.transform(new javax.xml.transform.dom.DOMSource(doc),result);

    System.out.println(xmlout.getBuffer());

    ---- code end ----

    So, I'm creating a document (DOM), setting an attribute and appending a text
    node with international characters (and a couple of brackets just for fun).
    Then I create a transformer instance, I ask it to indent the output nicely
    and finally to actually serialize my DOM into xml.

    When I run this code (in a jsp file on a tomcat 4.1.x server with the latest
    xerces2-j version installed) I get this output:

    <?xml version="1.0" encoding="UTF-8"?>
    <element attr1="attr1value">Danish &lt; æøå &gt; characters!</element>

    Okay. So I got the < and > converted as I expected. However, the
    international characters do not appear to have been encoded to UTF-8 or
    anything else for that matter. In fact, the above isn't even a valid xml
    document, and several parsers I tried (including Microsoft XML) reject it
    because of the illegal character data. Clearly there is a mismatch between
    what the xml header encoding specifies and what's actually appearing in the
    text nodes of the document. It's very curious that JAXP will transform a DOM
    into a result that isn't valid.
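
One possible explanation, offered as an assumption rather than a diagnosis of this particular server: a StringWriter holds Java chars, so no byte encoding has taken place when the string is printed. If the JSP response is then written out as ISO-8859-1, each of the Danish letters becomes a single byte that is not valid UTF-8, while the header still claims UTF-8. The sketch below (the class name EncodingMismatchSketch and the helper hex are made up for illustration) prints the bytes each charset produces for the same three characters.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchSketch {
    // Render a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The three Danish letters, written as escapes so the source file
        // encoding cannot interfere with the demonstration.
        String text = "\u00e6\u00f8\u00e5";

        // One byte per character; these bytes are not a valid UTF-8 sequence.
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // E6 F8 E5

        // Two bytes per character; this is what a UTF-8 document must contain.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8))); // C3 A6 C3 B8 C3 A5
    }
}
```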

    Interestingly, when I run the same code interactively inside my WebSphere
    Studio Application Developer 5 (using what is known as a scrapbook page), I
    get this:

    <?xml version="1.0" encoding="UTF-8"?>
    <element attr1="attr1value">Danish &lt; &#230;&#248;&#229; &gt;
    characters!</element>

    Well. I'm not sure that #230 is a correct UTF-8 encoding of "æ" (in fact I'm
    sure it isn't), but at least the document is now valid and even Microsoft
    XML will parse it without complaints.
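
For what it's worth, &#230; is a numeric character reference rather than a byte encoding: it names code point 230 (U+00E6) in plain ASCII, so it stays legal whatever encoding the header declares. A small sketch to check this, assuming only the standard JAXP parser (the class name CharRefSketch and the method parseText are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefSketch {
    // Parse an ASCII-only XML string and return the root element's text.
    static String parseText(String xml) throws Exception {
        byte[] ascii = xml.getBytes(StandardCharsets.US_ASCII);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(ascii));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The code point of the first Danish letter.
        System.out.println((int) '\u00e6'); // 230

        // The references resolve back to the original characters on parsing.
        String text = parseText("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<element>&#230;&#248;&#229;</element>");
        System.out.println(text.equals("\u00e6\u00f8\u00e5")); // true
    }
}
```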

    I am hoping that someone out there can shed some light on this problem and
    tell me what I am doing wrong. Exactly how do I instruct JAXP to encode the
    text nodes in my DOM so that it doesn't break my XML parser? :)

    Regards,
    Michael Berg
    www.hyperpal.com
     
    Michael, Feb 22, 2004
    #1

  2. Michael

    Michael Berg Guest

    Hi all,

    The problem is related to the use of a StringWriter to collect the XML
    output. Apparently StringWriters have their own idea about character
    encoding, so use an OutputStreamWriter instead, like this, for example:

    java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    javax.xml.transform.stream.StreamResult result = new
    javax.xml.transform.stream.StreamResult(
    new java.io.OutputStreamWriter(
    baos,
    "UTF-8"
    )
    );
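
Put together as a complete program, the approach above might look like the following. This is only a sketch: the class name Utf8SerializeSketch and the method serialize are made up, setting OutputKeys.ENCODING explicitly is an extra precaution that was not in the original code, and the indent settings are left out for brevity. The explicit flush is there because the OutputStreamWriter buffers its output.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8SerializeSketch {
    // Build the sample DOM and serialize it to UTF-8 encoded bytes.
    static byte[] serialize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element el = doc.createElement("element");
        el.setAttribute("attr1", "attr1value");
        el.appendChild(doc.createTextNode("Danish < \u00e6\u00f8\u00e5 > characters!"));
        doc.appendChild(el);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Keep the declared encoding in line with the writer's charset.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        // Push any buffered characters through to the byte stream.
        writer.flush();
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Write the raw bytes so no further charset conversion happens here.
        System.out.write(serialize());
        System.out.flush();
    }
}
```

The bytes can then be written straight to a file or a servlet output stream; turning them back into a String and printing that would reintroduce the original problem.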

    /Michael
    www.hyperpal.com
     
    Michael Berg, Feb 22, 2004
    #2