JDK 1.4 character set change?


Dave Bender

Did something change in the default character set in Java 1.4? We have an
application that pulls data from a Windows database that includes characters
from the Windows character set. This application works fine in Java 1.2.2,
but when we switch to the 1.4.1 JDK, special characters, such as the em dash,
come out as question marks.

I've scoured the sun web site for information about character sets, but I'm
not finding anything that would explain the change and offer a solution.

Dave
 

Jon A. Cruz

Dave said:
Did something change in the default character set in Java 1.4? We have an
application that pulls data from a Windows database that includes characters
from the Windows character set. This application works fine in Java 1.2.2,
but when we switch to the 1.4.1 JDK, special characters, such as the em dash,
come out as question marks.

What exactly do you mean by "come out as"?

Often, programs have a few potential points of failure. Misconversion
could happen at one or more.

At what point does the data break? Straight out of the DB? Going into a
widget? When viewed in an AWT component? When viewed in a Swing
component? When viewed from the command-line?...
 

Dave Bender

The characters display improperly in the web pages that are created by the
application. The browser is sent a question mark character, so I don't
believe it is a browser display issue. Coming from the database, the
characters still appear to be correct in my debugger (JBuilder 9). Some
other details:

The application runs on Solaris and the machine's default character set is
ASCII.

The application is a Web Application containing servlets and JSPs running in
iPlanet Web Server. It appears that the character set it is using is
ISO-8859-1. That is what the page says when I include
<%=response.getCharacterEncoding()%>

Dave
 

Neal Gafter

In Java 1.4, the sequence of bytes on input is by default interpreted according
to your platform's default encoding. If you want to override that behavior, you
should specify an encoding explicitly when you create the input stream. I
suggest you try the encoding "ISO-8859-1".
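A minimal sketch of naming the charset when wrapping the stream (the file and the byte value 0xE4, which is 'ä' in ISO-8859-1, are made up for illustration):

```java
import java.io.*;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // Write a single byte, 0xE4, which is 'ä' in ISO-8859-1.
        File f = File.createTempFile("demo", ".txt");
        FileOutputStream out = new FileOutputStream(f);
        out.write(0xE4);
        out.close();

        // Name the charset explicitly instead of relying on the platform default.
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "ISO-8859-1"));
        String line = r.readLine();
        r.close();
        f.delete();

        System.out.println((int) line.charAt(0)); // prints 228, i.e. U+00E4 'ä'
    }
}
```

With the platform default in effect instead, the same byte can decode to a different character (or to '?') depending on the machine's locale, which is exactly the kind of surprise described above.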

-Neal
 

Juha Laiho

Dave Bender said:
The characters display improperly in the web pages that are created by the
application. The browser is sent a question mark character, so I don't
believe it is a browser display issue.

I did see this issue when going from 1.2.2 to 1.3.1 (on Solaris 8, in case
this is platform-dependent).

So, with 1.2.2 you could print out strings containing non-ASCII characters,
and they came out "just fine" (supposing the editing, compilation and
runtime environments all had the same character encoding in use).

Starting with 1.3.1 (or possibly 1.3.0; I didn't test), the Sun JDK required
proper character encoding declarations:
- if you compiled code containing string literals with non-ASCII characters,
those characters ended up as '?' marks in the compiled code
- if you ran code whose binary included non-ASCII characters, they
printed out as '?' marks

So, in this sense 1.2.2 was broken, and 1.3.1 onward works as it
should. However, it can take some time to explain this to some people.
 

Mike Schilling

Roedy Green said:
So the thing he did wrong was to embed Chinese characters directly
into Java source strings rather than encoding them as \uxxxx.

Is there some way he can tell Javac what encoding it is using in his
source files to get around that problem?

javac -encoding (for example, javac -encoding Big5 MyClass.java)
 

John O'Conner

Roedy said:
It sounds like he is being forced into writing a little preprocessor
that takes the Chinese characters and converts them to \uxxxx format prior
to every compilation. Ouch!

I hope one would look at the "native2ascii" utility provided in the
J2SDK before creating this preprocessor. The native2ascii utility
converts text files that have non-ASCII characters into text files that
contain \uXXXX escape sequences in place of each non-ASCII character.
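The escapes that native2ascii emits are decoded by the compiler before the rest of the source is parsed, so a \uXXXX literal is indistinguishable at runtime from the raw character. A small sketch (the class name is made up for illustration):

```java
public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // "\u00e4" is what native2ascii would emit for a raw 'ä' in the source.
        String escaped = "\u00e4";
        // The raw character only compiles correctly if javac is told the
        // source file's encoding (via -encoding or the platform default).
        String raw = "ä";
        System.out.println(escaped.equals(raw)); // prints "true"
    }
}
```

This is why the escaped form is the safe choice for source files that must compile identically on any machine, regardless of its default encoding.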

Regards,
John O'Conner
 

Juha Laiho

Roedy Green said:
So the thing he did wrong was to embed Chinese characters directly
into Java source strings rather than encoding them as \uxxxx.

Actually, in the case I saw, the non-ASCII characters were from the
ISO-8859-1 character set (a nice single-byte encoding, but outside the
ASCII range), a-umlaut being one such character.

The fix in our case was to correct the environment so that the
character set was correctly declared (set LC_CTYPE to fi_FI.ISO8859-1).
The other possibility would've been to use the -encoding flag for
javac and java.
 
When I read the file, I specify the character set explicitly. For example:

// requires java.io.* and java.nio.charset.Charset
FileInputStream fstream = new FileInputStream(url.getFile());
BufferedReader br = new BufferedReader(
        new InputStreamReader(fstream, Charset.forName("ISO-8859-1")));
String line = br.readLine();
 
