Internationalization and character encoding


Mickey Segal

People with different keyboard encodings are using our Java applet to send
Strings to our server. The text, once accepted, is then viewable by all the
others. In most cases the text looks fine in English, but in some cases
some characters are replaced with question marks, presumably due to
incompatible character encodings. This occurs frequently for text submitted
from Japanese and Turkish people, and occasionally for people using Spanish
and German. I'm having trouble finding information on good strategies to
deal with such diverse inputs. Also, testing is not trivial since people
are using operating systems in other languages.

What are the good approaches for dealing with this issue? Do I force their
strings into a particular character encoding, or is that what is happening
anyway and leading to the problem? If I force a particular encoding is
there a standard encoding that people use under these circumstances? Should
I expect problems to be only from characters that do not appear in English
or will it occur also for English characters encoded in different ways?

It seems like there should be a good tutorial on this, since it must be a
common issue, but I'm having trouble finding such a guide.
 

Pete Barrett

If it's a Java String object that the applet is sending (through RMI or
something), then it will be Unicode, held internally as UTF-16
(effectively UCS-2 before Java 5). (It may be encoded in UTF-8 for the
actual transmission, but when it's picked up as a String at your end, it
will be UTF-16 again.) If that's the case, then if it's not being
displayed properly, it's probably because you don't have a font with the
appropriate characters installed on your machine, in which case
installing a suitable font or language pack should do the trick.
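
For what it's worth, the question marks usually appear at the point where a
String is converted to bytes in an encoding that has no slot for the
character. Here is a minimal sketch of that effect, assuming the conversion
goes through String.getBytes. The names EncodingDemo and roundTrip are made
up for illustration and are not taken from the applet in question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: encode text to bytes in the given charset,
    // then decode those bytes again with the same charset.
    static String roundTrip(String text, Charset cs) {
        // getBytes(Charset) substitutes the charset's replacement byte
        // ('?' for these charsets) for any character it cannot map.
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself encoding-neutral.
        String turkish = "\u011F\u0131\u015F";   // g-breve, dotless i, s-cedilla
        String japanese = "\u65E5\u672C\u8A9E";  // three kanji
        String spanish = "\u00F1";               // n-tilde

        // ISO 8859-1 has none of the Turkish or Japanese characters.
        System.out.println(roundTrip(turkish, StandardCharsets.ISO_8859_1).equals("???"));
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1).equals("???"));

        // n-tilde survives ISO 8859-1 but not plain US-ASCII.
        System.out.println(roundTrip(spanish, StandardCharsets.ISO_8859_1).equals(spanish));
        System.out.println(roundTrip(spanish, StandardCharsets.US_ASCII).equals("?"));

        // UTF-8 can represent all of them, and plain English is safe everywhere.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese));
        System.out.println(roundTrip("Hello", StandardCharsets.US_ASCII).equals("Hello"));
    }
}
```

Every line printed by main should be true. That would fit the symptoms
described: English text is fine, Spanish and German break only when the
encoding is narrower than ISO 8859-1, and Turkish and Japanese break much
more often.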

If the data is being sent in some other way (HTTP, for instance), then
character encodings become an issue. The normal way is either to
insist on a particular encoding (ISO 8859-1 is probably the most
popular, and covers most Western European languages, though not
Turkish or Japanese), or to include an indication of the encoding used
in the data being sent, or to send the data as Unicode, UTF-8 encoded,
which is absolutely standard and will accommodate any likely keyboard. Of
course, once the character data is at your end, you still have to
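
A rough sketch of the "send it UTF-8 encoded" option, assuming the text
travels as a raw byte stream and that both ends are Java. The byte array
stands in for the network connection, and Utf8WireDemo, send and receive
are hypothetical names. The point is only that the charset is named
explicitly on both sides rather than left to the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8WireDemo {
    // Sender side: turn the String into UTF-8 bytes, ignoring the
    // platform default encoding of the client machine.
    static byte[] send(String text) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(wire, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return wire.toByteArray();
    }

    // Receiver side: decode with the same, explicitly named charset.
    static String receive(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // German, Turkish and Japanese characters, written as escapes.
        String text = "Gr\u00FC\u00DFe, \u011F\u0131\u015F, \u65E5\u672C\u8A9E";
        byte[] bytes = send(text);

        // Matching charsets on both ends: the text survives intact.
        System.out.println(receive(bytes).equals(text));

        // Decoding the same bytes with the wrong charset garbles the text.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(text));
    }
}
```

The first line printed should be true and the second false. The same
principle would apply to a real socket or HTTP stream: wrap it in a Reader
or Writer constructed with an explicit charset instead of relying on the
default.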


Pete Barrett
 
