JDK 1.4 character set change?


Dave Bender

Did something change in the default character set in Java 1.4? We have an
application that pulls data from a Windows database that includes characters
from the Windows character set. This application works fine in Java 1.2.2,
but when we switch to the 1.4.1 JDK, special characters, such as the em dash,
come out as question marks.

I've scoured the sun web site for information about character sets, but I'm
not finding anything that would explain the change and offer a solution.

Dave
 

Jon A. Cruz

Dave said:
Did something change in the default character set in Java 1.4? We have an
application that pulls data from a Windows database that includes characters
from the Windows character set. This application works fine in Java 1.2.2,
but when we switch to the 1.4.1 JDK, special characters, such as the em dash,
come out as question marks.

What exactly do you mean by "come out as"?

Often, programs have a few potential points of failure. Misconversion
could happen at one or more.

At what point does the data break? Straight out of the DB? Going into a
widget? When viewed in an AWT component? When viewed in a Swing
component? When viewed from the command-line?...
 

Dave Bender

The characters display improperly in the web pages that are created by the
application. The browser is sent a question mark character, so I don't
believe it is a browser display issue. Coming from the database, the
characters still appear to be correct in my debugger (JBuilder 9). Some
other details:

The application runs on Solaris and the machine's default character set is
ASCII.

The application is a Web Application containing servlets and JSPs running in
iPlanet Web Server. It appears that the character set it is using is
ISO-8859-1. That is what the page says when I include
<%=response.getCharacterEncoding()%>

Dave
 

Neal Gafter

In Java 1.4, the sequence of bytes on input is by default interpreted according
to your platform's default encoding. If you want to override that behavior, you
should specify an encoding explicitly when you create the input stream. I
suggest you try the encoding "ISO-8859-1".
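A minimal sketch of naming the charset when wrapping the stream (the file and the byte value 0xE4, which is 'ä' in ISO-8859-1, are made up for illustration):

```java
import java.io.*;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // Write a single byte, 0xE4, which is 'ä' in ISO-8859-1.
        File f = File.createTempFile("demo", ".txt");
        FileOutputStream out = new FileOutputStream(f);
        out.write(0xE4);
        out.close();

        // Name the charset explicitly instead of relying on the platform default.
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "ISO-8859-1"));
        String line = r.readLine();
        r.close();
        f.delete();

        System.out.println((int) line.charAt(0)); // prints 228, i.e. U+00E4 'ä'
    }
}
```

With the platform default in effect instead, the same byte can decode to a different character (or to '?') depending on the machine's locale, which is exactly the kind of surprise described above.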

-Neal
 

Juha Laiho

Dave Bender said:
The characters display improperly in the web pages that are created by the
application. The browser is sent a question mark character, so I don't
believe it is a browser display issue.

I did see this issue when going from 1.2.2 to 1.3.1 (on Solaris 8, in case
this is platform-dependent).

So, with 1.2.2 you could print out strings containing non-ASCII characters,
and they came out "just fine" (supposing the editing, compilation and
runtime environments all had the same character encoding in use).

Starting with 1.3.1 (or possibly 1.3.0; I didn't test), the Sun JDK required
proper character encoding declarations:
- if you compiled code containing string literals with non-ASCII characters,
those characters ended up as '?' marks in the compiled code
- if you ran code whose binary included non-ASCII characters, they
printed out as '?' marks

So, in this sense 1.2.2 was broken, and 1.3.1 onward works as it
should. However, it can take some time to explain this to some people.
 

Mike Schilling

Roedy Green said:
So the thing he did wrong was to embed Chinese characters directly
into Java source strings rather than encoding them as \uxxxx.

Is there some way he can tell Javac what encoding it is using in his
source files to get around that problem?

javac -encoding (for example, javac -encoding Big5 MyClass.java)
 

John O'Conner

Roedy said:
It sounds like he is being forced into writing a little preprocessor
that takes the Chinese characters and converts them to \uxxxx format prior
to every compilation. Ouch!

I hope one would look at the "native2ascii" utility provided in the
J2SDK before creating this preprocessor. The native2ascii utility
converts text files that have non-ASCII characters into text files that
contain \uXXXX escape sequences in place of each non-ASCII character.
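The escapes that native2ascii emits are decoded by the compiler before the rest of the source is parsed, so a \uXXXX literal is indistinguishable at runtime from the raw character. A small sketch (the class name is made up for illustration):

```java
public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // "\u00e4" is what native2ascii would emit for a raw 'ä' in the source.
        String escaped = "\u00e4";
        // The raw character only compiles correctly if javac is told the
        // source file's encoding (via -encoding or the platform default).
        String raw = "ä";
        System.out.println(escaped.equals(raw)); // prints "true"
    }
}
```

This is why the escaped form is the safe choice for source files that must compile identically on any machine, regardless of its default encoding.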

Regards,
John O'Conner
 

Juha Laiho

Roedy Green said:
So the thing he did wrong was to embed Chinese characters directly
into Java source strings rather than encoding them as \uxxxx.

Actually, in the case I saw, the non-ASCII characters were from the
ISO-8859-1 character set (a nice single-byte encoding, but outside the
ASCII range), a-umlaut being one such character.

The fix in our case was to correct the environment so that the
character set was correctly declared (set LC_CTYPE to fi_FI.ISO8859-1).
The other possibility would've been to use the -encoding flag for
javac and java.
 
When I read the file, I specify the character set explicitly. For example:

// requires java.io.* and java.nio.charset.Charset
FileInputStream fstream = new FileInputStream(url.getFile());
BufferedReader br = new BufferedReader(
        new InputStreamReader(fstream, Charset.forName("ISO-8859-1")));
String line = br.readLine();
 
