URI encoding: ASCII, Latin-1 or Unicode?

Fritz Bayer

Hello,

I have stumbled across something that seems to be ambiguous.
Recently I decoded the URI of a servlet request.

At first I could not get the expected result. The umlauts äüö would
not show up correctly, which made me wonder why.

I then tried URLDecoder.decode(uri, "UTF-8"), which also did not work.

After googling a bit I found out that Tomcat (I use 5.0) used to
decode URIs in the encoding of the document transferred, but now
always decodes them as ISO-8859-1. Another encoding can be
specified on the connector by setting the attribute URIEncoding="..".
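
To illustrate what the charset choice changes, here is a minimal sketch. It assumes Java 10 or later for the URLDecoder.decode(String, Charset) overload (the older decode(String, String) form behaves the same way); the class name, method names and sample path are made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UriDecodeDemo {
    // Percent-decodes the input and interprets the resulting bytes as UTF-8.
    static String decodeAsUtf8(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    // Same escapes, but the resulting bytes are interpreted as ISO-8859-1.
    static String decodeAsLatin1(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "%C3%A4" is the UTF-8 byte sequence for U+00E4 (a with diaeresis).
        String raw = "/search/%C3%A4";               // hypothetical request path
        System.out.println(decodeAsUtf8(raw));       // one character after the slash
        System.out.println(decodeAsLatin1(raw));     // two characters: U+00C3 U+00A4
    }
}
```

The same two bytes come out as one umlaut or as two unrelated characters, depending only on which charset the decoder assumes.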

So I set it to "utf8" and now I can decode the URIs correctly.
However, I wonder how it can be that this does not seem to be
specified.

I thought that the HTTP 1.1 protocol encoding is ASCII only. Of course
the documents transferred can have a different encoding. But the URI
is part of the start line of the message and therefore belongs to the
protocol.

Anyway, if somebody wants to elaborate a bit on this URI issue, I would
be interested in having a little conversation about the subject.

Fritz

BTW: So it seems that how URIs get treated depends on the
implementation of each servlet engine?!
 
Arjunan Venkatesh

Hi,
yes, you are right. I had the same problem two weeks ago, as we
want to use UTF-8 characters (Japanese etc.) in the URI.
The RFC for URIs initially suggested ASCII only and leaves support for
UTF-8 as an implementation detail of the servers.

Looking at the Tomcat 5.5 source, I came to realize that Tomcat does
the %xx unescaping first, followed by the decoding for the charset
(it defaults to ISO-8859-1), but we can specify UTF-8 as you said. There
is also another flag, 'useBodyEncodingForURI', which is said to be there
for compatibility with Tomcat 4.x.
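
The two steps can be sketched roughly like this. This is not Tomcat's actual code, only a simplified illustration of "unescape to bytes, then apply a charset"; the class and method names are hypothetical and the input is assumed to contain only ASCII characters plus %XX escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TwoStepUriDecode {
    // Step 1: turn each %XX escape into one raw byte; every other
    // character (assumed to be ASCII) is copied through as a single byte.
    static byte[] percentUnescape(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '%') {
                out.write(c);
                continue;
            }
            if (i + 2 >= raw.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int hi = Character.digit(raw.charAt(i + 1), 16);
            int lo = Character.digit(raw.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("bad escape at index " + i);
            }
            out.write((hi << 4) | lo);
            i += 2; // skip the two hex digits just consumed
        }
        return out.toByteArray();
    }

    // Step 2: interpret the raw bytes in the configured charset.
    static String decode(String raw, Charset charset) {
        return new String(percentUnescape(raw), charset);
    }

    public static void main(String[] args) {
        String raw = "%C3%A4%CE%B1"; // UTF-8 bytes of U+00E4 and U+03B1 (Greek alpha)
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Step 1 is the same whatever the charset is. Only step 2 depends on the configured encoding, which is why a wrong default turns each multi-byte UTF-8 character into several Latin-1 characters.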

Yes, if we want compatibility across servers we have to stick with
ASCII only in the URI. One of the RFCs jokingly remarked that supporting
UTF-8 in URIs is recommended but that the transition may take 50 years ...
(that was written in 1999). Sorry, I forgot the numbers of the RFCs I looked up.

hope this helps

-v
 
Fritz Bayer

Arjunan Venkatesh said:
Hi,
yes, you are right. I had the same problem two weeks ago, as we
want to use UTF-8 characters (Japanese etc.) in the URI.
The RFC for URIs initially suggested ASCII only and leaves support for
UTF-8 as an implementation detail of the servers.

So how is a URI encoded that contains non-ASCII characters like äöüß
or, for example, Greek characters?
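
For what it is worth, one widely used convention (an assumption here, the thread itself does not settle it) is to take the bytes of each non-ASCII character in some charset, usually UTF-8, and write every byte as a %XX escape. A minimal sketch, assuming Java 10 or later for URLEncoder.encode(String, Charset); note that URLEncoder implements HTML form encoding, so it also turns spaces into '+', which is not what a URI path expects. The class and method names are made up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UriEncodeDemo {
    // Escapes the text using its UTF-8 byte sequence.
    static String encodeUtf8(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Escapes the text using its ISO-8859-1 byte sequence.
    static String encodeLatin1(String text) {
        return URLEncoder.encode(text, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String umlauts = "\u00e4\u00f6\u00fc\u00df"; // the characters äöüß
        String alpha = "\u03b1";                      // Greek small letter alpha

        System.out.println(encodeUtf8(umlauts));   // %C3%A4%C3%B6%C3%BC%C3%9F
        System.out.println(encodeLatin1(umlauts)); // %E4%F6%FC%DF
        System.out.println(encodeUtf8(alpha));     // %CE%B1
    }
}
```

The receiver cannot tell from the escapes alone which charset was used, so both sides have to agree on it.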
 
