returning XML as UTF-8 from a servlet

Discussion in 'Java' started by Andy Fish, Mar 2, 2004.

  1. Andy Fish

    Andy Fish Guest

    Hi,

    I have a servlet (running under Tomcat 4.1, Java 1.4.2) that sends XML in
    the HTTP response body. I want the XML to be encoded in UTF-8.

    when I run Tomcat on windows 2000, the XML appears fine on the client end,
    but running Tomcat on debian woody linux, accented characters don't appear
    correctly. In the XML output stream, each accented character comes out as
    two characters, so obviously the fact that it's supposed to be UTF-8 is
    being lost.

    here's how I'm streaming the XML:

    response.setContentType("text/xml");
    OutputStream os = response.getOutputStream();
    OutputStreamWriter osw = new OutputStreamWriter(os , "UTF-8");
    PrintWriter pw = new PrintWriter(osw);
    pw.print("..all the xml..");

    If, instead of writing to the response object, I write to a
    FileOutputStream, the accented characters appear OK in the file.

    I'm a bit stuck here because when I wrote this code, I read up all about
    character encoding and did what I thought was right, and it all worked on my
    Win2000 test system. I can't figure out what could be going wrong on the
    linux box.

    many thanks for any advice or hints,

    Andy
    Andy Fish, Mar 2, 2004
    #1

  2. Andy Fish wrote:
    > Hi,
    >
    > I have a servlet (running under Tomcat 4.1, Java 1.4.2) that sends XML in
    > the HTTP response body. I want the XML to be encoded in UTF-8.
    >
    > when I run Tomcat on windows 2000, the XML appears fine on the client end,
    > but running Tomcat on debian woody linux, accented characters don't appear
    > correctly. In the XML output stream, each accented character comes out as
    > two characters, so obviously the fact that it's supposed to be UTF-8 is
    > being lost.


    How do you check the XML? With a browser?

    >
    > here's how I'm streaming the XML:
    >
    > response.setContentType("text/xml");


    Maybe you can add the encoding to the HTTP header:
    response.setContentType("text/xml;charset=utf-8");

    f'up2 c.t.x
    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
    Johannes Koch, Mar 2, 2004
    #2

  3. Andy Fish wrote:

    > Hi,
    >
    > I have a servlet (running under Tomcat 4.1, Java 1.4.2) that sends XML in
    > the HTTP response body. I want the XML to be encoded in UTF-8.
    >
    > when I run Tomcat on windows 2000, the XML appears fine on the client end,
    > but running Tomcat on debian woody linux, accented characters don't appear
    > correctly. In the XML output stream, each accented character comes out as
    > two characters, so obviously the fact that it's supposed to be UTF-8 is
    > being lost.


    No, that's not obvious at all. Not from the information you have given.
    Unicode provides for logical characters to be composed of two or more
    characters; for instance, a lowercase u with an umlaut could be
    represented as the latin lowercase 'u' followed by the umlaut "combining
    character". Many of the more common combinations also have
    single-character representations, including the u-umlaut example, and
    pretty much all the "diacriticalized" characters used in Western
    European languages. The alternative representations are equivalent as
    far as Unicode is concerned, and Unicode processors are permitted to
    freely substitute one for another. They should be displayed or printed
    the same by a conformant processor.

    Moreover, the fact that you are making judgements about the "UTF-8ness"
    of the stream based on the character count leads me to wonder whether
    perhaps you are confusing characters with bytes / octets, or whether you
    misunderstand the nature of character encodings. The character count
    has little to do with whether the characters are encoded in UTF-8;
    rather it has everything to do with which character or characters have
    been encoded. The byte count has more relation to the encoding, but is
    still closely tied to the characters that have been encoded.

    > here's how I'm streaming the XML:
    >
    > response.setContentType("text/xml");


    Better would probably be "text/xml; charset=UTF-8".

    > OutputStream os = response.getOutputStream();
    > OutputStreamWriter osw = new OutputStreamWriter(os , "UTF-8");
    > PrintWriter pw = new PrintWriter(osw);
    > pw.print("..all the xml..")
    >
    > If, instead of writing to the response object, I write to a
    > FileOutputStream, the accented characters appear OK in the file.


    As judged how?

    > I'm a bit stuck here because when I wrote this code, I read up all about
    > character encoding and did what I thought was right, and it all worked on my
    > Win2000 test system. I can't figure out what could be going wrong on the
    > linux box.


    The output part looks okay to me. I suspect you have a different
    problem than you think you have.
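
    To make the distinction between characters, code points, and bytes
    concrete, here is a minimal sketch. It assumes a modern JDK
    (java.text.Normalizer arrived in Java 6 and StandardCharsets in Java 7,
    so it will not compile on the 1.4.2 runtime mentioned above), and the
    class and method names are made up for illustration. It compares the
    precomposed u-umlaut with the 'u' plus combining diaeresis form.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class UmlautForms {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String precomposed = "\u00FC";  // U+00FC LATIN SMALL LETTER U WITH DIAERESIS
        String decomposed = "u\u0308";  // 'u' followed by U+0308 COMBINING DIAERESIS

        // Same logical character, different number of Java chars.
        System.out.println(precomposed.length()); // 1
        System.out.println(decomposed.length());  // 2

        // Different number of UTF-8 bytes as well.
        System.out.println(utf8Length(precomposed)); // 2 (C3 BC)
        System.out.println(utf8Length(decomposed));  // 3 (75 CC 88)

        // NFC normalization composes the pair back into the single code point.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(precomposed)); // true
    }
}
```

    Both strings are valid UTF-8 once encoded, which is why a character
    count alone says little about whether the encoding was applied.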


    John Bollinger
    John C. Bollinger, Mar 2, 2004
    #3
  4. Jon A. Cruz

    Jon A. Cruz Guest

    Andy Fish wrote:
    > correctly. In the XML output stream, each accented character comes out as
    > two characters, so obviously the fact that it's supposed to be UTF-8 is
    > being lost.


    No. Not "obviously"

    Capture and list the actual *bytes* going across the wire. Inspect them
    and then you can say one way or another.
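
    As a rough guide to what to look for in such a capture, the sketch below
    prints the bytes of a single accented character under a few scenarios.
    The class and helper names are hypothetical, it assumes Java 7 or later
    for StandardCharsets, and it is not a diagnosis of the original problem,
    only a reference for reading a hex dump.

```java
import java.nio.charset.StandardCharsets;

class WireBytes {
    // Render bytes as space-separated uppercase hex, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates double encoding: correct UTF-8 bytes are misread as
    // ISO-8859-1 characters and then encoded as UTF-8 a second time.
    static byte[] doubleEncode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE

        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));      // C3 A9
        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println(hex(doubleEncode(eAcute)));                         // C3 83 C2 A9
    }
}
```

    If the wire shows C3 A9, the stream is correct UTF-8 and the client is
    misreading it; if it shows C3 83 C2 A9, something encoded the data twice.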


    >
    > here's how I'm streaming the XML:
    >
    > response.setContentType("text/xml");
    > OutputStream os = response.getOutputStream();


    IIRC, you need to set the encoding before the call to getOutputStream().
    Jon A. Cruz, Mar 2, 2004
    #4
  5. "Jon A. Cruz" <> wrote in message
    news:...
    > >
    > > response.setContentType("text/xml");
    > > OutputStream os = response.getOutputStream();

    >
    > IIRC, you need to set the encoding before the call to getOutputStream().


    The encoding needs to be specified on several levels. One is the HTTP
    response header, one is the XML declaration ( <?xml version="1.0"
    encoding="..."?> ), and finally the output sent to the response's
    output stream needs to use the very same encoding as well.

    The background is that an OutputStream just handles bytes. You must ensure
    these bytes are in the above-mentioned encoding. This can be done by using
    an OutputStreamWriter and setting the encoding in the constructor. Now you
    can output characters and the OutputStreamWriter will ensure that the
    OutputStream gets the correct bytes.
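
    The three levels can be sketched as follows. This is only an
    illustration under stated assumptions: Utf8XmlWriter, CONTENT_TYPE and
    writeXml are invented names, the servlet calls appear only in a comment
    so the example runs without the servlet API, and StandardCharsets
    requires Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8XmlWriter {
    // Level 1: the value intended for the HTTP Content-Type header.
    static final String CONTENT_TYPE = "text/xml; charset=UTF-8";

    // Levels 2 and 3: the XML declaration names UTF-8, and the Writer
    // actually encodes the characters as UTF-8 bytes.
    static void writeXml(OutputStream out, String xmlBody) throws IOException {
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        w.write(xmlBody);
        w.flush(); // push buffered characters through the encoder
    }

    public static void main(String[] args) throws IOException {
        // In a servlet, following the advice in this thread, the order would be:
        //   response.setContentType(Utf8XmlWriter.CONTENT_TYPE);
        //   Utf8XmlWriter.writeXml(response.getOutputStream(), "...");
        // A ByteArrayOutputStream stands in for the response stream here.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeXml(sink, "<name>caf\u00E9</name>");
        System.out.println(sink.size() + " bytes written");
    }
}
```

    Keeping the charset name in one place for the header and hard-wiring the
    same encoding into the writer makes it harder for the three levels to
    drift apart.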

    Hiran
    Hiran Chaudhuri, Mar 4, 2004
    #5
  6. Jon A. Cruz

    Jon A. Cruz Guest

    Hiran Chaudhuri wrote:
    >
    > The background is that outputstream just handles bytes. You must ensure
    > these bytes are in the above mentioned encoding. This can be done by using a
    > OutputStreamWriter and setting the encoding in the constructor. Now you can
    > output characters and OutputStreamWriter will ensure that the outputstream
    > gets the correct bytes.


    My point is that the order of things is very important. In order to get
    the response headers to properly reflect what you're going to send, you
    need to set things *before* getOutputStream() or getWriter().

    That's a point that trips up a lot of people.
    Jon A. Cruz, Mar 4, 2004
    #6
  7. "Jon A. Cruz" <> wrote in message
    news:...
    > Hiran Chaudhuri wrote:
    > My point is that the order of things is very important. In order to get
    > the response headers to properly reflect what you're going to send, you
    > need to set things *before* getOutputStream() or getWriter().
    >
    > That's a point that trips up a lot of people.


    That's right.

    It should be easy to catch, as I have seen servlet containers complain
    about attempts to set headers after the response has been committed. That
    is exactly what happens when you fill the HTTP response body first and
    only afterwards take care of the headers.

    Hiran
    Hiran Chaudhuri, Mar 5, 2004
    #7
