javac -encoding problem and/or glaring bug ?

java

Hi:

Consider this file, saved to disk as utf-8, no BOM.
---------------------------------------------------
public class x
{
public static void main (String args[])
{
System.out.println("\u0222");
}
}
--------------------------------------------------------

By the way, unicode 0x0222 looks like a funky eight --> Ȣ
You may not see it in this news post because of your
newsreader, doesn't matter.

While compiling I've tried all of:

javac x.java
javac -encoding utf-8 x.java
javac -encoding utf8 x.java
javac -encoding UTF-8 x.java
javac -encoding UTF8 x.java

Using:
JDK 1.5, on both Linux and OS X (same problem)

If you run this (regardless of how you compile it), you
will see '?' instead of the proper unicode character
(regardless of output device; even if you output to a
unicode capable terminal that can properly render
0x0222, you still see '?').

Am I missing something, or is this a glaring bug?

--j
 
Real Gagnon

java said:
If you run this (regardless of how you compile it), you
will see '?' instead of the proper unicode character
(regardless of output device; even if you output to a
unicode capable terminal that can properly render
0x0222, you still see '?').

Try to run it with :

java -Dfile.encoding=UTF8 x

Bye.
 
java

Real Gagnon said:
Try to run it with :
java -Dfile.encoding=UTF8 x

Ok, I tried that and that solved the problem.

But why ?

javac -encoding UTF8 x.java --> x.class

Now, shouldn't x.class be entirely self contained ? It's not
java source anymore.

So why do I have to set this property ? Is it because
the PrintStream (System.out) uses this "file.encoding"
property internally ?

Background:
This becomes tricky when I have differently encoded web pages
(say JSPs) on the server at the same time (all of which print
debugging messages using System.out).

-j
 
Thomas Fritsch

java said:
Consider this file, saved to disk as utf-8, no BOM.
---------------------------------------------------
public class x
{
public static void main (String args[])
{
System.out.println("\u0222");
}
} [...]
While compiling I've tried all of:

javac x.java
javac -encoding utf-8 x.java
javac -encoding utf8 x.java
javac -encoding UTF-8 x.java
javac -encoding UTF8 x.java

Using:
JDK 1.5, on both linux and osx (same problem)

If you run this (regardless of how you compile it), you
will see '?' instead of the proper unicode character
(regardless of output device; even if you output to a
unicode capable terminal that can properly render
0x0222, you still see '?').

Am I missing something

Aaahm, yes.
Your *source* contains only harmless ASCII characters.
Remember, the characters \ u 0 2 2 are all in the range 0x0020...0x007F, where
ASCII is identical to UTF-8. Therefore all your effort to make the compiler
understand UTF-8 is pointless. (sorry)
Your problem is not a compile-time problem (javac), but a runtime problem
(java). Real Gagnon already showed how to tell java to use UTF-8.
But even that might not solve your problem, if the font used by your
terminal doesn't contain a glyph for the 0x0222 character.
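
To make the compile-time versus runtime split concrete, here is a minimal sketch. The class name EscapeDemo and the helper printed() are invented for this illustration; they are not from the original post. It only assumes the standard PrintStream(OutputStream, boolean, String) constructor. Inside the program "\u0222" is a single char; which bytes leave the JVM depends on the charset the PrintStream was built with.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EscapeDemo {
    // Hypothetical helper: returns the bytes a PrintStream would emit
    // for the given text when configured with the given charset name.
    public static byte[] printed(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws Exception {
        String s = "\u0222";

        // The compiler already turned the escape into one character.
        System.out.println(s.length());                        // 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // 222

        // US-ASCII cannot represent U+0222, so a single '?' byte is written.
        System.out.println(printed(s, "US-ASCII").length);     // 1
        // UTF-8 needs two bytes for U+0222 (0xC8 0xA2).
        System.out.println(printed(s, "UTF-8").length);        // 2

        // Wrapping System.out with an explicit charset avoids depending
        // on the JVM default for this one stream.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(s);
    }
}
```

Wrapping System.out like this is one possible alternative to setting -Dfile.encoding for the whole JVM; whether the character then shows up correctly still depends on the terminal and its font.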

By the way: even my "Arial Unicode MS" font, which contains all of the
Greek, Cyrillic, Armenian, Chinese etc. characters, has no glyphs in
the range 0x0220..0x024F.
 
Juha Laiho

java said:
Ok, I tried that and that solved the problem.

But why ?

javac -encoding UTF8 x.java --> x.class

Now, shouldn't x.class be entirely self contained ? It's not
java source anymore.

So why do I have to set this property ? Is it because
the PrintStream (System.out) uses this "file.encoding"
property internally ?

That is because the JVM runtime attempts to find out what
character encoding the environment outside the JVM uses, and
apparently in your environment it detects a native character set
of something other than UTF-8.
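
A quick way to see what the JVM picked up from the environment is to print the relevant settings. This is only a sketch; the class name DefaultCharsetCheck and the method describe() are made up for illustration. The result depends on the locale settings, the JDK version and any -Dfile.encoding option, so no particular output is implied.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Hypothetical helper: summarizes what the running JVM considers
    // its default encoding.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run once plainly and once with -Dfile.encoding=UTF8 to compare.
        System.out.println(describe());
    }
}
```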

So, even if you have funky UTF-8 characters in your source,
Java may be able to print them out in environments with some
other native character encoding, if that other encoding
happens to have a code point for the same character glyph.

For example, source code in UTF-8 may contain the byte
sequence [0xc3, 0xa4], signifying the lower-case a-diaeresis
character. Now, if that source code is compiled
properly, letting the compiler know that the source is in the UTF-8
character set, and subsequently the code is run in an environment
with the ISO-8859-1 character set, the program will output just
one byte, 0xE4. Also, if the same code is run in an environment
configured for the plain US-ASCII character set, it will output
only a question mark (as the US-ASCII character set does not have
a code for the a-diaeresis character).
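
The a-diaeresis example can be reproduced without touching any terminal, by encoding the same one-character string into each of the three charsets. This is a minimal sketch; the class name EncodingTargets and the helper hex() are invented for illustration, and it relies on String.getBytes(Charset), which substitutes the charset's replacement byte for characters it cannot map.

```java
import java.nio.charset.Charset;

class EncodingTargets {
    // Hypothetical helper: encodes text in the named charset and
    // returns the resulting bytes as space-separated hex.
    public static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aDiaeresis = "\u00e4"; // one char inside the JVM

        System.out.println(hex(aDiaeresis, "UTF-8"));      // C3 A4
        System.out.println(hex(aDiaeresis, "ISO-8859-1")); // E4
        System.out.println(hex(aDiaeresis, "US-ASCII"));   // 3F, i.e. '?'
    }
}
```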
 
