Re: xml, windows, utf-8, and httpclient

Discussion in 'Java' started by Chris Uppal, Dec 20, 2005.

  1. Chris Uppal

    Chris Uppal Guest

    wrote:

    > In MS Windows XP Notepad I created an xml file. I pasted a single
    > tibetan character '\u0F40' as the text part of a certain element:
    > <body>?</body>


    I was amazed to find that the Ka letter in that paragraph is rendered correctly
    by my newsreader !


    > I saved this file using notepad's Save as... -> encoding = UTF-8.


    Check whether Notepad has added a Byte Order Mark. It shouldn't (for UTF-8)
    but I seem to remember that it usually does anyway.
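
    One way to check for the BOM without a hex editor, as a rough sketch: read the file's raw bytes and look for EF BB BF at the start. The class and method names here are made up for illustration.

```java
import java.util.Arrays;

final class BomCheck {
    // EF BB BF is the UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the data begins with the three-byte UTF-8 BOM. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns the data without a leading UTF-8 BOM, or the same array if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        return startsWithUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

    The byte array would come from reading the whole file in binary mode, never through a Reader, since a Reader has already applied some decoding by the time you see the data.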


    > If I use a hex editor and view the document I can see that the
    > character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
    > BD 80.
    >
    > Next, I use dom4j to read in and parse the file. dom4j should be using
    > the xerces parser. I assume that the parser knows how to read the utf-8
    > file. After all, it prepends the xml file with:
    > <?xml version="1.0" encoding="UTF-8"?>


    It's not clear at this point whether you mean that the file you created has
    such a charset declaration ?


    > Question 1:
    > At this point, is the character stored in memory as '\u0f40'?


    Why don't you try printing out the integer value of the character(s) ? If it
    is 0x0F40 then all is well so far, if not then something has already gone wrong
    (presumably the parser didn't realise that it was parsing UTF-8).
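
    Something along these lines would do for the printing; it is only a sketch, and the helper name is hypothetical. It formats each char's numeric value directly, so no character encoding is involved in producing the output.

```java
final class CharDump {
    /** Returns each UTF-16 char of s as four hex digits, separated by spaces. */
    static String hexChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Cast to int so the numeric value is formatted, not the glyph.
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }
}
```

    A correctly decoded element text should come out as the single value 0F40. Three values instead of one would mean the bytes were never decoded as UTF-8.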


    > Maybe not, because if I print my xml as a string and view it in hex I
    > can see my utf-8 characters in there 'E0 BD 80'.


    The problem with that is that you don't know how the process of printing the
    string is converting characters into binary.
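
    To illustrate: the same one-char String turns into quite different bytes depending on which charset does the conversion, and printing always picks one for you. A small sketch (the class name is invented) that makes the choice explicit:

```java
import java.nio.charset.Charset;

final class EncodingDump {
    /** Encodes s with the given charset and returns the bytes as upper-case hex. */
    static String hexBytes(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

    Passing the charset explicitly each time, rather than relying on the platform default, is the only way to know which conversion a hex dump is actually showing.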


    > Next, I want to post my xml to a webserver using jakarta commons
    > httpclient. I add a header declaring the encoding as utf-8:
    > content-type=text/xml; charset=UTF-8. This action has the same effect
    > as taking my xml string and using the String.getBytes("UTF-8")
    > function. The bytes are pushed through the utf-8 encoding algorithm
    > again and are sent as 'c3 a0 c2 bd e2 82 ac'.
    >
    > Question 2:
    > Is that how it should be done?
    >
    > Question 3:
    > 'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
    > 82 ac' comes from... Any ideas?


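    It sounds as if the Ka character's UTF-8 representation hasn't been de-UTF-8-ed
    as it was read in by the parser, thus resulting in a String containing one char
    per byte: 0x00E0, 0x00BD, and whatever the decoder made of byte 0x80. That
    String has then been encoded as UTF-8 /again/, resulting in the gibberish you
    see. The trailing 'e2 82 ac' is the UTF-8 form of U+20AC (the euro sign), which
    is what windows-1252 maps byte 0x80 to, so the bytes were probably decoded with
    the Windows default code page.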
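
    The round trip can be reproduced in a few lines. This is a sketch with made-up names, and the choice of windows-1252 as the wrong charset is an assumption about the poster's platform default, not something stated in the original question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncoding {
    /** Decodes UTF-8 bytes with the wrong charset, then encodes the result as UTF-8. */
    static byte[] misdecodeThenEncode(byte[] utf8Bytes, Charset wrongCharset) {
        // Each input byte becomes one char here instead of three bytes becoming one.
        String misread = new String(utf8Bytes, wrongCharset);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as lower-case hex separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }
}
```

    Feeding in E0 BD 80 with Charset.forName("windows-1252") gives c3 a0 c2 bd e2 82 ac, the sequence reported in the question. With ISO-8859-1 the tail would be c2 80 instead, which is how the two cases can be told apart.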

    I don't know much about dom4j (or Xerces, come to that), but it might be
    worth posting the code you use to open the XML file. I suspect it's not
    decoding the UTF-8.
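
    For comparison, here is a sketch of a read that leaves the decoding to the parser. It uses the JDK's built-in JAXP DocumentBuilder rather than dom4j, purely so that it is self-contained, and the class and method names are invented. The point is that the parser is given raw bytes and works out the encoding from the XML declaration itself. A common way to get the symptoms above is to wrap the file in a Reader that uses the platform default charset before the parser sees it, though whether that is what happened here is only a guess.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseBytes {
    /** Parses XML from raw bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) {
        try {
            // An InputStream (not a Reader) lets the parser detect the encoding
            // from the BOM or the encoding="..." declaration.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

    If the root element's text then has length 1 and its char value is 0x0F40, the read side is fine and the problem lies further down the line.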

    -- chris
     
    Chris Uppal, Dec 20, 2005
    #1

  2. Chris Uppal

    Alex Buell Guest

    On Tue, 20 Dec 2005 12:13:11 -0000 "Chris Uppal"
    <-THIS.org> wibbled:


    > > In MS Windows XP Notepad I created an xml file. I pasted a single
    > > tibetan character '\u0F40' as the text part of a certain element:
    > > <body>?</body>

    >
    > I was amazed to find that the Ka letter in that paragraph is rendered correctly
    > by my newsreader !


    No it isn't. It is shown as a ? in your post, but I can do this: ཀ.
    Perfect.

    --
    http://www.munted.org.uk

    Anyone that thinks an imaginary deity is going to protect them against
    earthquakes and hurricanes needs psychiatric help.
     
    Alex Buell, Dec 20, 2005
    #2

  3. Chris Uppal

    Chris Uppal Guest

    Alex Buell wrote:

    > > > In MS Windows XP Notepad I created an xml file. I pasted a single
    > > > tibetan character '\u0F40' as the text part of a certain element:
    > > > <body>?</body>

    > >
    > > I was amazed to find that the Ka letter in that paragraph is rendered
    > > correctly by my newsreader !

    >
    > No it isn't. It is shown as a ? in your post, but I can do this: ?.
    > Perfect.


    Well, it was /rendered/ correctly (even in the reply composition window), it's
    just that it throws the character away before actually sending the post...

    -- chris
     
    Chris Uppal, Dec 20, 2005
    #3
  4. Chris Uppal

    Alex Buell Guest

    On Tue, 20 Dec 2005 13:14:49 -0000 "Chris Uppal"
    <-THIS.org> wibbled:

    > Alex Buell wrote:
    >
    > > > > In MS Windows XP Notepad I created an xml file. I pasted a single
    > > > > tibetan character '\u0F40' as the text part of a certain element:
    > > > > <body>?</body>
    > > >
    > > > I was amazed to find that the Ka letter in that paragraph is rendered
    > > > correctly by my newsreader !

    > >
    > > No it isn't. It is shown as a ? in your post, but I can do this: ?.
    > > Perfect.

    >
    > Well, it was /rendered/ correctly (even in the reply composition window), it's
    > just that it throws the character away before actually sending the post...


    I actually posted it as a UTF-8 enabled message which might be why I
    can do ཀ. I strongly suggest you have a look at Sylpheed, there's a
    version for Windows (http://www.sylpheed.good-day.net). The author is
    Japanese and very much aware of those issues and that's why it's
    excellent.


    --
    http://www.munted.org.uk

    Anyone that thinks an imaginary deity is going to protect them against
    earthquakes and hurricanes needs psychiatric help.
     
    Alex Buell, Dec 20, 2005
    #4
  5. Chris Uppal

    Chris Uppal Guest

    Alex Buell wrote:

    > I actually posted it as an UTF-8 enabled message which might be why I
    > can do ?. I strongly suggest you have a look at Sylpheed, there's a
    > version for Windows (http://www.sylpheed.good-day.net). The author is
    > Japanese and very much aware of those issues and that's why it's
    > excellent.


    The URL seems to be:
    http://www.sylpheed.good-day.je/

    Looks interesting. I'll probably try it out when the Windows version leaves
    beta. (I'd rather not use Outlook Express, but -- for all its many defects --
    I still haven't found anything like an acceptable replacement.)

    -- chris
     
    Chris Uppal, Dec 21, 2005
    #5
  6. Chris Uppal

    Alex Buell Guest

    On Wed, 21 Dec 2005 09:37:04 -0000 "Chris Uppal"
    <-THIS.org> waved a wand and this message
    magically appeared:

    > Alex Buell wrote:
    >
    > > I actually posted it as an UTF-8 enabled message which might be why I
    > > can do ?. I strongly suggest you have a look at Sylpheed, there's a
    > > version for Windows (http://www.sylpheed.good-day.net). The author is
    > > Japanese and very much aware of those issues and that's why it's
    > > excellent.

    >
    > The URL seems to be:
    > http://www.sylpheed.good-day.je/


    Correction: http://sylpheed.good-day.net

    > Looks interesting. I'll probably try it out when the Window's version leaves
    > beta. (I'd rather not use Outlook Express, but -- for all its many defects --
    > I still haven't found anything like an acceptable replacement.)


    Anything but Outlook, please ;o)

    --
    http://www.munted.org.uk

    Anyone that thinks an imaginary deity is going to protect them against
    earthquakes and hurricanes needs psychiatric help.
     
    Alex Buell, Dec 21, 2005
    #6