Can't decode UTF-8

Gary Thomas

Hi,

This is driving me nuts, could anyone assist? I am passing a 5-character
(Chinese) Unicode string as a GET parameter to a servlet. The
URL encoding looks fine; I've verified that the hex below corresponds to the
UTF-8 byte values of the characters:

id=%E4%BA%B8%E4%BA%B4%E4%BA%B0%E4%BA%A9%E4%BA%A3


However, my servlet does not seem to be able to decode this back into a
proper Java String (i.e. 5 characters). Code snippet:

....
String id = request.getParameter("id");
char[] c = id.toCharArray();
logger.error("BEFORE: " + id + " - # of chars: " + c.length);

id = URLDecoder.decode(id, "UTF-8");
c = id.toCharArray();
logger.error("AFTER: " + id + " - " + c.length);
....


logs then show:

....
BEFORE: 亸亴亰亩亣 - # of chars: 15
AFTER: 亸亴亰亩亣 - # of chars: 15
....

So obviously, it looks like the String was not decoded from UTF-8
properly. However, if I view the logs with an editor that reads UTF-8,
the 15 characters above show as the correct 5 Chinese characters, so the
original UTF-8 does not seem to be incorrect.

Am I missing something obvious? This seems so simple, but I just can't
get it to work... I'm using JDK 1.4.2.

Thanks,

Gary
 
Manish Jethani

Gary said:
> This is driving me nuts, could anyone assist? I am passing a 5

It has driven me nuts in the past :)

> character (chinese) unicode string as a GET parameter to a servlet. The
> URL encoding looks fine, I've verified the hex below corresponds to the
> UTF-8 character values:
>
> id=%E4%BA%B8%E4%BA%B4%E4%BA%B0%E4%BA%A9%E4%BA%A3
>
> However, my servlet does not seem to be able to decode this back into a
> proper Java String (i.e. 5 characters). Code snippet:
>
> ...
> String id = request.getParameter("id");
> char[] c = id.toCharArray();
> logger.error("BEFORE: " + id + " - # of chars: " + c.length);

You need to set the character encoding in the request object.

request.setCharacterEncoding("UTF-8");

Either you do this in code (above), or set this in the config
files of your servlet container.

If you set it in the code, then make sure this is done before
calling any getParameter() methods. So it's best set at the
beginning of your doGet() and doPost().

> id = URLDecoder.decode(id, "UTF-8");
> c = id.toCharArray();
> logger.error("AFTER: " + id + " - " + c.length);
> ...

This is redundant. getParameter() already returns the percent-decoded
value, so there's no need to decode() it again.
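
To illustrate why the second decode changes nothing, here is a minimal sketch. It assumes the container has already percent-decoded the parameter and read the bytes as ISO-8859-1 (an assumption, though it would match the 15 chars in the log). The class and helper names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedundantDecodeDemo {
    // Hypothetical helper: wraps URLDecoder.decode so callers need no checked exception.
    static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // The first three of the 15 chars: bytes E4 BA B8 read as ISO-8859-1
        // (assumed container behaviour, not confirmed by the thread).
        String alreadyDecoded = "\u00E4\u00BA\u00B8";

        // No '%' or '+' is left in the string, so decode() has nothing to do.
        String again = decodeUtf8(alreadyDecoded);
        System.out.println(again.equals(alreadyDecoded)); // true
    }
}
```

That would explain identical BEFORE and AFTER lines in the log.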

One more thing: if you're converting from a String object to a
byte[] array, and vice versa, you need to specify the encoding
explicitly in the String constructor and getBytes(); otherwise the
platform default encoding is used.
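
A short sketch of that advice, using the five characters from the URL above. The class and method names are only illustrative. It uses StandardCharsets, which needs Java 7 or later; on JDK 1.4.2 you would pass the name "UTF-8" instead and handle UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode with an explicit charset instead of the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+4EB8 U+4EB4 U+4EB0 U+4EA9 U+4EA3, i.e. %E4%BA%B8 ... %E4%BA%A3
        String original = "\u4EB8\u4EB4\u4EB0\u4EA9\u4EA3";

        byte[] bytes = toUtf8(original);
        System.out.println(bytes.length);                      // 15
        System.out.println(fromUtf8(bytes).length());          // 5
        System.out.println(fromUtf8(bytes).equals(original));  // true
    }
}
```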

HTH,
Manish
 
Gary Thomas

Manish said:
> You need to set the character encoding in the request object.
>
> request.setCharacterEncoding("UTF-8");
>
> Either you do this in code (above), or set this in the config
> files of your servlet container.
>
> If you set it in the code, then make sure this is done before
> calling any getParameter() methods. So it's best set at the
> beginning of your doGet() and doPost()

Thanks for the reply, I'm still confused though. I have been calling
request.setCharacterEncoding("UTF-8") in my request processor all along,
and it seems to be setting it correctly, but the parameter is not being
decoded. You can see below that the request encoding is correct:

Code snippet:

....
logger.error("Encoding: " + request.getCharacterEncoding());
String id = request.getParameter("id");
char[] c = id.toCharArray();
logger.error("BEFORE: " + id + " - # of chars: " + c.length);

id = new String(id.getBytes(), "UTF-8");
c = id.toCharArray();
logger.error("AFTER: " + id + " - # of chars: " + c.length);
....

Logs show:
....
Encoding: UTF-8
BEFORE: 亸亴亰亩亣 - # of chars: 15
AFTER: ????? - # of chars: 5
....

As you can see though, 'new String(id.getBytes(), "UTF-8")' works
correctly. I also tried 'request.setCharacterEncoding("UTF-8")' in the
code above, but to no avail.

Are there any problems with using 'new String(id.getBytes(), "UTF-8")' as
a workaround?
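
For reference, a sketch of what this workaround seems to be doing. It assumes the container decoded the UTF-8 bytes as ISO-8859-1, which would account for the 15 chars but is not confirmed here. Note that getBytes() with no argument uses the platform default charset, so the one-liner only round-trips when that default happens to match; the sketch names both charsets explicitly. StandardCharsets needs Java 7 or later (on JDK 1.4.2, pass the names "ISO-8859-1" and "UTF-8" instead). Class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Simulates the assumed container behaviour: UTF-8 bytes read as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4EB8\u4EB4\u4EB0\u4EA9\u4EA3";

        String garbled = misdecode(original);
        System.out.println(garbled.length());        // 15

        String fixed = repair(garbled);
        System.out.println(fixed.length());          // 5
        System.out.println(fixed.equals(original));  // true
    }
}
```

The round trip is lossless only because ISO-8859-1 maps every byte value to a char; with a charset that has unmapped bytes, some characters could be lost.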

I should also mention that I'm using the Struts framework, but this
shouldn't have an effect on the code above, correct?


Many Thanks,

Gary
 
Dave Miller

In article <G7HVa.23857$Bp2.380@fed1read07>, (e-mail address removed) says...

Strings are immutable.

try -

String id = URLDecoder.decode(request.getParameter("id"), "UTF-8");
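
A standalone sketch of what URLDecoder does with the raw, still percent-encoded value from the original post. In a servlet, getParameter() normally returns an already decoded value, so this mainly applies to raw input such as the result of request.getQueryString(). The class and helper names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RawQueryDecodeDemo {
    // Hypothetical helper: decode a percent-encoded value as UTF-8.
    static String decodeUtf8(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // The raw parameter value exactly as it appears in the URL.
        String raw = "%E4%BA%B8%E4%BA%B4%E4%BA%B0%E4%BA%A9%E4%BA%A3";

        String id = decodeUtf8(raw);
        System.out.println(id.length());                                  // 5
        System.out.println(id.equals("\u4EB8\u4EB4\u4EB0\u4EA9\u4EA3"));  // true
    }
}
```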

DM
 
Illya Kysil

Take a look @ http://www.anassina.com/struts/i18n/i18n.html
i18n with Struts tutorial
 
Gary Thomas

Thank you for the link Illya.

- Gary

 
