JNI / localization / filenames


mwaller

I'm having a bit of an issue with filenames on Korean systems.
We use the native file dialogs on Win32 (the customer insisted), and
consequently use JNI to access them. This makes life a little
complicated as we have to dip into JNI to get the filename, use it in
various classes in Java to create temporary files for the customised
file format, then save the actual file in JNI.
Following the guidelines in Sheng Liang's 'The Java Native Interface' book,
I'm using String getBytes() and new String(byte[]) to convert via
the default locale, instead of NewStringUTF.
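A minimal sketch of that round trip (the filename is hypothetical; the real one comes from the native dialog):

```java
public class DefaultCharsetRoundTrip {
    public static void main(String[] args) {
        // Hypothetical filename, standing in for one from the Win32 dialog.
        String name = "report.txt";
        // Per the book's advice: convert with the platform default encoding,
        // so the bytes match what native code expects in this locale.
        byte[] localeBytes = name.getBytes();
        String roundTripped = new String(localeBytes);
        System.out.println(name.equals(roundTripped)); // true when representable
    }
}
```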
I can now save files on Korean W2K, but the resultant file cannot be
opened by double clicking unless it is renamed (in DOS) first. (my
'tester' is on the other side of the world as I haven't got a Korean
Windows to hand, and this is what he reported.) Opening the
application then opening the file works.

Any ideas?

mlw
 

Jon A. Cruz

mwaller said:
Following the guidelines in Sheng Liang's 'The Java Native Interface' book,
I'm using String getBytes() and new String(byte[]) to convert via
the default locale, instead of NewStringUTF.

That just sounds like bad advice.

I'd recommend always staying Unicode as long as possible. Even Microsoft
does this.

I can now save files on Korean W2K, but the resultant file cannot be
opened by double clicking unless it is renamed (in DOS) first. (my
'tester' is on the other side of the world as I haven't got a Korean
Windows to hand, and this is what he reported.) Opening the
application then opening the file works.

Ahhh. MS Windows.

Did you know that Windows has an API call to convert between 16-bit
Unicode and the local code page?

WideCharToMultiByte and MultiByteToWideChar. Just pass it CP_ACP.

So use the UTF-16 versions of JNI string calls, not the UTF-8 versions.
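To see why staying Unicode works: the 16-bit values in a Java String are exactly what GetStringChars() hands to native code, and exactly what the wide (W) Win32 calls expect. A sketch with a hypothetical Korean filename:

```java
public class Utf16Units {
    public static void main(String[] args) {
        // "한글.txt", written with escapes; a hypothetical Korean filename.
        String name = "\uD55C\uAE00.txt";
        // Each char below is one of the 16-bit units GetStringChars() exposes
        // to native code -- directly usable as WCHARs by fooW calls.
        for (char c : name.toCharArray()) {
            System.out.printf("%04X ", (int) c);
        }
        System.out.println(); // prints: D55C AE00 002E 0074 0078 0074
    }
}
```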



Oh, and don't listen to them: avoid TCHAR, _T and #define UNICODE.
(Microsoft doesn't use them, due to problems.)

If you can keep all file creation and writing in JNI code, things might
be simpler. There are some long-standing bugs in Sun's VM regarding
non-ASCII filenames on Windows NT systems.

Also... be aware that most Win32 calls are really just macros that resolve
foo to fooA or fooW (for ANSI or Wide, respectively). Keep all your
characters in JNI as explicit 16-bit UTF-16 characters; then you can
call the fooW calls directly on Windows NT, Windows 2000 and Windows XP. On
Windows 95, Windows 98 and Windows ME you'll have to fall back to
converting wide-to-multibyte, calling the fooA versions, then converting the
results back from multibyte to wide-char.
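That wide-to-multibyte fallback can be mimicked on the Java side with a Charset. "MS949" (Windows code page 949, the Korean ANSI code page) is an assumed charset name here, with a UTF-8 fallback so the sketch runs on any JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AnsiCodePageSketch {
    public static void main(String[] args) {
        // "MS949" availability depends on the JDK build, hence the fallback.
        Charset acp = Charset.isSupported("MS949")
                ? Charset.forName("MS949")
                : StandardCharsets.UTF_8;
        String name = "\uD55C\uAE00.txt"; // "한글.txt", a hypothetical filename
        byte[] ansi = name.getBytes(acp);    // ~ WideCharToMultiByte(CP_ACP, ...)
        String back = new String(ansi, acp); // ~ MultiByteToWideChar(CP_ACP, ...)
        System.out.println(name.equals(back)); // true: lossless round trip
    }
}
```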
 

Chris Uppal

Jon said:
So use the UTF-16 versions of JNI string calls, not the UTF-8 versions.

But, whatever you do, don't forget that what Sun calls "UTF8" in JNI is nothing
of the kind.

It's a different encoding (admittedly similar in many ways) and you'll have to
convert it to whatever encoding of Unicode Windows uses.
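One way to see the difference without touching JNI: DataOutputStream.writeUTF() uses the same "modified UTF-8" as JNI, so NUL comes out as the two bytes C0 80 rather than standard UTF-8's single 00 byte:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // writeUTF emits a 2-byte length prefix, then modified UTF-8 bytes.
        new DataOutputStream(bos).writeUTF(nul);
        byte[] modified = bos.toByteArray();
        byte[] standard = nul.getBytes(StandardCharsets.UTF_8);
        System.out.printf("modified: %02X %02X%n",
                modified[2] & 0xFF, modified[3] & 0xFF); // modified: C0 80
        System.out.println("standard: " + standard.length + " byte(s)"); // 1
    }
}
```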

-- chris
 

Jon A. Cruz

Chris said:
But, whatever you do, don't forget that what Sun calls "UTF8" in JNI is nothing
of the kind.

It's a different encoding (admittedly similar in many ways) and you'll have to
convert it to whatever encoding of Unicode Windows uses.


But...

That's just another reason to avoid it, as I had just said.


Keep to NewString, GetStringLength, GetStringChars, and use jchar.
 

Chris Uppal

Jon said:
Chris said:
But, whatever you do, don't forget that what Sun calls "UTF8" in JNI is
nothing of the kind.
[...]

But...

That's just another reason to avoid it, as I had just said.

Sorry; I meant it to be read as adding to your point, not contradicting it.

OTOH. There is no such thing as UTF-16 in JNI. Just the option of using 16-bit
quantities to represent Java 16-bit 'char's directly -- but that's not UTF-16.
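A quick illustration of the distinction: a Java String may contain an unpaired surrogate, which no valid UTF-16 sequence can, so a real Unicode encoder has to replace it:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LoneSurrogate {
    public static void main(String[] args) {
        // A lone high surrogate is a perfectly legal Java String...
        String lone = "\uD800";
        // ...but not valid UTF-16, so getBytes() substitutes the charset's
        // replacement byte, '?' (0x3F), rather than encoding it.
        byte[] utf8 = lone.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // prints [63]
    }
}
```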

(I really wish Sun hadn't started this wretched idea of referring to its
non-standard, short-sighted, crippled, character encodings by the same name as
"proper" industry standards.)

-- chris
 

Jon A. Cruz

Chris said:
OTOH. There is no such thing as UTF-16 in JNI. Just the option of using 16-bit
quantities to represent Java 16-bit 'char's directly -- but that's not UTF-16.

Sun experts in the area would disagree with you on that. They definitely
consider it UTF-16 and not UCS-2, and are continually adding more
support for UTF-16 details in implementations, including more support of
surrogate pairs, etc. JSR 204 has more info.
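For what it's worth, the JSR 204 work (which shipped in J2SE 5.0) shows up in methods like codePointCount(): a supplementary character such as U+1D11E (musical G clef) occupies two Java chars but is one code point:

```java
public class SupplementaryChar {
    public static void main(String[] args) {
        // U+1D11E needs a surrogate pair in UTF-16.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
    }
}
```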
 

Chris Uppal

Jon said:
Sun experts in the area would disagree with you on that. They definitely
consider it UTF-16 and not UCS-2, and are continually adding more
support for UTF-16 details in implementations, including more support of
surrogate pairs, etc. JSR 204 has more info.

I don't think so. Looking at JSR 204 (thanks for the pointer; I hadn't seen it
before), I get the impression that they are attempting to come to terms with the
fact that Java and Unicode don't match.

--- the intro to JSR 204 ---
Version 3.1 of the Unicode standard is the first one to define characters that
cannot be described by single 16-bit code points and thus the standard breaks a
fundamental assumption of the Java programming language and APIs. This JSR
defines the necessary adjustments to the Java APIs to enable support for such
characters and enables the Java platform to continue to track the Unicode
standard.
----------

Unfortunately the rest of the JSR paper doesn't seem to provide much
information.

However, it seems clear that they are following the Unicode standard's very
unfortunate wording and thinking of characters with code points >= 2**16 as
somehow "additional", maybe not "real Unicode characters".

Unicode cannot be represented by 16-bit characters.

Sequences of Unicode characters (code points up to 21 bits) can be represented
as sequences of 16-bit quantities using UTF-16. However, neither Java Strings
nor the arrays of jchar manipulated by JNI are in this encoding. Java/JNI use a
direct encoding of Java's 16-bit characters as (probably) "unsigned short" in
JNI. That isn't UTF-16. Java's encoding is neither upward nor downward
compatible with UTF-16 (though there are many sequences of characters that are
encoded the same way in both).

Granted, a Java String could be used to hold a UTF-16 sequence (but then so
could a char[], a short[], a byte[], or -- hell -- even a double[], since
it's only a string of bytes). But the Java "char"s in such a sequence are not
the same as the Unicode "character"s in the same collection of bits.

That's why I say that Java doesn't support Unicode, and that Java Strings, and
char[]s, are *not* UTF-16.

The good people working on JSR204 will have to find a way to work this out.
I'd guess that they'll introduce new APIs for using Strings and char[]s to hold
UTF16-encoded data, and have int-returning methods that (e.g.) know how to do
the decoding to find the (say) 8th Unicode character in a String. The use of
the "char" primitive datatype will start to look very dodgy indeed. With luck
they'll also define a few UnicodeString classes which separate the interface
from the representation, and (internally) encode the Unicode data in
programmer-selectable ways.
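(As it turned out, J2SE 5.0 did add exactly such int-based methods. A small sketch using one of them to find the char index of the Nth Unicode character:)

```java
public class NthCodePoint {
    public static void main(String[] args) {
        // "a" + U+1D11E + "b": three Unicode characters, four Java chars.
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";
        // Char index of the third Unicode character (code point #2, 0-based).
        int idx = s.offsetByCodePoints(0, 2);
        System.out.println(idx);           // 3: the clef occupies two chars
        System.out.println(s.charAt(idx)); // b
    }
}
```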

However, none of that has happened yet. When it does, it will introduce another
boat-load of complexity into the Java programmer's life, and (partially)
invalidate a load of text-handling code that already exists. They are going
to have a very hard time trying to sell this stuff to the community, and their
job won't be made easier by the fact that Sun has traditionally blurred the
differences between the Java APIs and real Unicode -- such as the many APIs
that falsely claim to talk UTF-8.

-- chris
 
