URI encoding ASCII, LATIN1 or UNICODE?

Discussion in 'Java' started by Fritz Bayer, Apr 8, 2005.

  1. Fritz Bayer

    Fritz Bayer Guest

    Hello,

    I have stumbled across something that seems to be ambiguous.
    Recently I decoded the URI of a servlet request.

    At first I could not get the expected result. The umlauts äüö would
    not show up correctly, which made me wonder why.

    I then tried URLDecoder.decode(uri, "UTF-8"), which also did not work.

    After googling a bit I found out that Tomcat 5.0 (which I use) used to
    interpret URIs in the encoding of the document transferred, but now
    always interprets them as ISO-8859-1. Another encoding can be
    specified in the connector by setting the attribute URIEncoding="..".

    So I set it to "utf8" and now I can decode the URIs correctly.
    However, I was wondering how it can be that this does not seem to be
    specified.
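
A minimal sketch of the mismatch described above, assuming the client percent-encoded the UTF-8 bytes of the umlauts. The class name, the helper `decode` and the sample path are made up for illustration; only `java.net.URLDecoder` is a real API. Decoding the same escaped string with the two charsets shows why a container defaulting to ISO-8859-1 produces garbage for UTF-8 input:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriCharsetMismatch {

    // Percent-decodes s and interprets the resulting bytes in the given charset.
    // (Hypothetical helper; it only wraps URLDecoder and hides the checked exception.)
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumed input: the UTF-8 bytes of a-umlaut, u-umlaut, o-umlaut, percent-encoded.
        String raw = "/search/%C3%A4%C3%BC%C3%B6";

        // Same charset as the sender: the three umlauts come back.
        System.out.println(decode(raw, "UTF-8"));

        // ISO-8859-1 maps every byte to one character, so each umlaut
        // turns into two unrelated characters.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Note that `URLDecoder` implements form decoding, so it also turns `+` into a space; for a path that contains a literal `+` this may not be what you want.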

    I thought that the HTTP 1.1 protocol encoding is ASCII only. Of course
    the documents transferred can have a different encoding. But the URI
    is part of the start line of the message and therefore belongs to the
    protocol.

    Anyway, if somebody wants to elaborate a bit on this URI issue I would
    be interested in having a little conversation about the subject.

    Fritz

    BTW: So it seems that how URIs get treated depends on the
    implementation of each servlet engine?!
     
    Fritz Bayer, Apr 8, 2005
    #1

  2. Hi,
    Yes, you are right. I had the same problem two weeks ago, as we
    want to use UTF-8 characters (Japanese etc.) in the URI.
    The RFC for URIs initially allowed ASCII only and leaves support for
    UTF-8 as an implementation detail of the servers.

    Looking at the Tomcat 5.5 source, I realized that Tomcat undoes
    the %xx escaping first and then decodes the resulting bytes with the
    charset (which defaults to ISO-8859-1), but we can specify UTF-8 as you
    said. There is also another flag, 'useBodyEncodingForURI', which is
    said to be there for compatibility with Tomcat 4.x.
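
The two steps described above can be sketched roughly as follows. This is not Tomcat's code, just an illustration of the idea: the class and method names are invented, and the sketch assumes that everything outside a `%xx` escape is plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class TwoStepUriDecode {

    // Step 1: turn every %XX escape into the raw byte it stands for.
    // Characters outside an escape are assumed to be ASCII and copied as one byte.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                // Throws NumberFormatException on malformed escapes such as "%zz".
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Step 2: interpret the raw bytes in the configured charset.
    static String decode(String s, String charsetName) {
        return new String(unescape(s), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same character arrives as different bytes depending on the sender's charset.
        System.out.println(decode("%E4", "ISO-8859-1"));   // one byte per character
        System.out.println(decode("%C3%A4", "UTF-8"));     // two bytes for the same character
    }
}
```

The point of the split is that the escaping itself is charset-neutral; only the second step depends on a setting such as URIEncoding.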

    Yes, if we want compatibility across servers we have to stick with
    ASCII only in the URI. One of the RFCs jokingly suggested that
    supporting UTF-8 in URIs is recommended but that the transition may
    take 50 years (that was written in 1999). Sorry, I forgot the numbers
    of the RFCs I looked up.

    hope this helps

    -v

    Arjunan Venkatesh, Apr 8, 2005
    #2

  3. Fritz Bayer

    Fritz Bayer Guest

    Arjunan Venkatesh wrote:
    > Hi,
    > yes, you are right , i was having the same problem 2 weeks back as we
    > want to use UTF-8 characters for Japanese etc in the URI.
    > RFC for URI initially suggested ASCII only and leaves the support for
    > UTF-8 to the implementation details to the servers.

    So how is a URI encoded that contains non-ASCII characters like äöüß
    or, for example, Greek characters?
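
One common approach, sketched here under assumptions, is to convert the characters to bytes in some charset and percent-encode each byte, so the result depends on which charset both sides agree on. The class and helper names are invented; `java.net.URLEncoder` is a real API, but it implements form encoding (a space becomes `+`), so treat this as an illustration of the byte-level idea rather than a complete URI encoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UriEncodeDemo {

    // Percent-encodes the bytes of text in the given charset.
    // (Hypothetical helper; it only wraps URLEncoder and hides the checked exception.)
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String german = "\u00e4\u00f6\u00fc\u00df"; // a-umlaut, o-umlaut, u-umlaut, sharp s
        String greek = "\u03b1";                    // Greek small letter alpha

        // UTF-8 needs two bytes for each of these characters.
        System.out.println(encode(german, "UTF-8"));
        System.out.println(encode(greek, "UTF-8"));

        // ISO-8859-1 needs one byte per German character,
        // but has no representation for Greek letters at all.
        System.out.println(encode(german, "ISO-8859-1"));
    }
}
```

Because ISO-8859-1 cannot represent Greek at all, a URI that should carry both German and Greek characters needs an encoding such as UTF-8 on both ends.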

    Fritz Bayer, Apr 20, 2005
    #3