Convert UTF8 to SJIS

rookie · Sep 11, 2005

Hello,

I have been struggling with this problem for a couple of days and I cannot
figure out a proper way to handle it. I would appreciate your help.

Basically, I would like to read Japanese kanji from my XML file (UTF-8
format) and save it into a Sybase table, which is in SJIS format. I have
the following code:

public void setinstructions (java.lang.String newinstructions) {

    byte[] utf8_bytes = null;

    try {
        utf8_bytes = newinstructions.getBytes("UTF-8");
        instructions = new String(utf8_bytes, "SJIS");
    } catch (UnsupportedEncodingException e) { System.out.println(e); }

}

The above code failed to convert the UTF-8 kanji into SJIS. It also
failed to save (insert) the text into the Sybase table as SJIS (the
character set of the Sybase server is iso_1, according to the DBA).

Can someone share some experience with me ?

Thanks
Rookie

Roedy Green · Sep 11, 2005

my (UTF-8 format) and save it into Sybase table which is SJIS format. And
I have the following code :

There are two things you could mean by UTF-8: something written by
DataOutputStream.writeUTF, which produces counted strings, or something
produced by a text editor, a file that is one great long string without
counts. Let me assume the latter.

You have UTF-8 bytes. You want to convert that to Unicode Strings.
Then you want to convert the Unicode Strings to SJIS bytes.

There is no such thing as UTF-8 chars/Strings. Ditto SJIS.

You can do the conversions as you have with String methods, or using
Readers and Writers. See http://mindprod.com/applets/fileio.html
for the latter technique.
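
A minimal sketch of those two steps using the String methods, written for a current JDK and assuming the runtime provides the Shift_JIS charset. The class name Utf8ToSjis and the method utf8ToSjis are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ToSjis {
    // Assumption: the running JDK includes the Shift_JIS charset.
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    /** Decodes UTF-8 bytes to a String, then encodes that String as Shift_JIS bytes. */
    public static byte[] utf8ToSjis(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8); // bytes -> chars
        return text.getBytes(SJIS);                                   // chars -> bytes
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        byte[] utf8 = "\u6F22\u5B57".getBytes(StandardCharsets.UTF_8);
        byte[] sjis = utf8ToSjis(utf8);
        System.out.println(utf8.length + " UTF-8 bytes -> " + sjis.length + " Shift_JIS bytes");
    }
}
```

The String in the middle is neither UTF-8 nor SJIS; only the byte arrays on either side carry an encoding.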

HGA03630 · Sep 11, 2005

Whatever the original encoding is, a Java String is a Java
String. So, your newinstructions is a Java String if it is
properly displayable by System.out et al.

Then,

Charset cs = Charset.forName("Shift_JIS");
ByteBuffer bb = cs.endoce(newinstructions);

Don't create a Java String from this ByteBuffer, because
it would be another Java String, not a Shift_JIS string!!!
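
If the encoded bytes are needed as a byte[], one way to get them out of the ByteBuffer is sketched below. The class and method names are hypothetical, and the Shift_JIS charset is assumed to be present in the runtime. Note that ByteBuffer.toString() only describes the buffer's state, and that array() returns the whole backing array, which may be longer than the encoded content:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class EncodeToBytes {
    /** Encodes text as Shift_JIS and returns exactly the encoded bytes. */
    public static byte[] toSjisBytes(String text) {
        Charset cs = Charset.forName("Shift_JIS");
        ByteBuffer bb = cs.encode(text);
        // Copy only the bytes between position and limit,
        // rather than relying on the size of the backing array.
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] sjis = toSjisBytes("\u6F22\u5B57");
        System.out.println(sjis.length + " bytes");
    }
}
```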

HGA03630 · Sep 11, 2005

Oh no.
cs.encode(...);

rookie · Sep 11, 2005

Don't create a Java String from this ByteBuffer, because
it would be another Java String, not a Shift_JIS string!!!

Thank you very much for your help. Since my final output will be a
String (instructions), should I just do :
instructions = bb.toString() ? Or should I decode ByteBuffer to
CharBuffer, then do toString in the CharBuffer to instructions ?

http://javaalmanac.com/egs/java.nio.charset/ConvertChar.html

Thanks again !

Roedy Green · Sep 11, 2005

Charset cs = Charset.forName("Shift_JIS");
ByteBuffer bb = cs.endoce(newinstructions);

endoce -> encode

Any thoughts on when to use String vs Charset methods?

Roedy Green · Sep 11, 2005

Thank you very much for your help. Since my final output will be a
String (instructions), should I just do :
instructions = bb.toString() ? Or should I decode ByteBuffer to
CharBuffer, then do toString in the CharBuffer to instructions ?

try reading http://mindprod.com/jgloss/encoding.html#CONVERTING

see if that clarifies it for you.

Stanimir Stamenkov · Sep 11, 2005

Basically, I would like to read the Japanese kanji from the xml file in
my (UTF-8 format) and save it into Sybase table which is SJIS format.

I guess you use JDBC to interact with the DB. When you use a
PreparedStatement you set the parameters to insert using:

PreparedStatement statement;
String unicodeText;
...
statement.setString(1, unicodeText);

The JDBC driver then handles whatever conversion/encoding is needed, so
you should just configure the driver (consult with your DB/driver
documentation) .

If you know in advance the DB can't handle Unicode text, you may have
chosen to store certain text as raw bytes in order to preserve all the
multilingual characters that may appear in the input string. You'll have
to handle these bytes manually in the Java code, like:

PreparedStatement statement;
String unicodeText;
...
statement.setBytes(1, unicodeText.getBytes("UTF-8"));

On reading from the DB you have to reconstruct the string the same way:

ResultSet rs;
...
String text = new String(rs.getBytes(1), "UTF-8");

Having a binary field for text in the DB however has the drawback of
not being able to use SQL text operations (text search, etc.).

And if you only care to store characters in "Shift_JIS" then you should
go the first way: configure the driver appropriately and work with
just Strings. You should probably add a check for whether the input string
contains characters which can't be represented in the DB, and write a
log entry or present a message to the user.
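
One possible form of such a check, as a sketch only: it uses CharsetEncoder.canEncode and assumes the DB column ends up holding Shift_JIS, and that the Java runtime's Shift_JIS charset matches what the server accepts. The names SjisCheck and fitsInSjis are invented for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class SjisCheck {
    /** Returns true if every character of text can be encoded in Shift_JIS. */
    public static boolean fitsInSjis(String text) {
        // A new encoder per call keeps the method thread-safe;
        // CharsetEncoder instances must not be shared between threads.
        CharsetEncoder encoder = Charset.forName("Shift_JIS").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInSjis("\u6F22\u5B57")); // two kanji
        System.out.println(fitsInSjis("\uD55C"));       // a Hangul syllable
    }
}
```

The caller could log or reject the input when the method returns false, before it ever reaches setString().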

HGA03630 · Sep 11, 2005

Don't create a Java String from this ByteBuffer, because

Thank you very much for your help. Since my final output will be a
String (instructions), should I just do :
instructions = bb.toString() ? Or should I decode ByteBuffer to
CharBuffer, then do toString in the CharBuffer to instructions ?

I said don't create a Java String because it can't be Shift_JIS.
As Stanimir has pointed out, if your JDBC driver does the necessary
conversion, that is, Java String (Unicode, 16-bit) to DB native
(Shift_JIS), according to your pre-configuration, it is best to use
that functionality.

rookie · Sep 12, 2005

Thanks a lot Roedy and everyone !

I read the page thoroughly and I have an idea of what I am doing now.
Basically, I would like to encode an incoming String (newinstructions),
which is in UTF-8 (as stated in the XML header: <?xml version="1.0"
encoding="utf-8" ?>), to become another String (instructions), which is
in SJIS. I have tried the ways shown in "Converting", but they do not
seem to work. Then I tried Stanimir's statement.setBytes(1,
unicodeText.getBytes("UTF-8")); and I can't make that work either. I
guess I am missing something. I wrote a small function which shows the
hex code for the string, which tells what encoding the string is in.

private String getHex(String str) {
    StringBuffer sBuffer = new StringBuffer("");
    for (int i = 0; i < str.length(); i++) {
        int code = (int) str.charAt(i);
        sBuffer.append( Integer.toHexString(code));
    }
    return sBuffer.toString().toUpperCase() ;
}
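
A caveat on the function above: charAt returns the 16-bit chars Java uses internally, so this dump cannot reveal which encoding the text originally came from. An encoding only becomes visible once the String is turned into bytes. A sketch that dumps the encoded bytes instead follows; the class ByteHex and its hex method are hypothetical, and Shift_JIS is assumed to be available in the runtime:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteHex {
    /** Encodes text with the given charset and returns the bytes as space-separated hex pairs. */
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kan = "\u6F22"; // first character of the word "kanji"
        System.out.println("UTF-8:     " + hex(kan, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:  " + hex(kan, StandardCharsets.UTF_16BE));
        System.out.println("Shift_JIS: " + hex(kan, Charset.forName("Shift_JIS")));
    }
}
```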

my new setinstructions:

public void setinstructions (java.lang.String newinstructions) {

    // encode String to bytes[]
    Charset cs_utf8 = Charset.forName("UTF-8");
    ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
    byte[] b = bb_utf8.array();

    // decode byte[] to String
    Charset cs_sjis = Charset.forName( "Shift_JIS");
    ByteBuffer bb_sjis = ByteBuffer.wrap( b );
    CharBuffer cb = cs_sjis.decode( bb_sjis );
    instructions = cb.toString();

    myLogger.log(getHex(newinstructions));
    myLogger.log(getHex(instructions));
    myLogger.log("Done instructions conversion!");

}

Please help point out where I went wrong.

Thanks
Rookie

Roedy Green · Sep 12, 2005

I read the page thoroughly and I have idea what I am doing right now.
Basically, I would like to encode an incoming String (newinstructions)
, which is in UTF-8 (as stated in the XML header : <?xml version="1.0"
encoding="utf-8" ?> ) to become another String (instructions), which is
in SJIS.

There is no such thing as a String in SJIS or UTF-8, ONLY byte[].
This is your essential problem. I quote from my webpage:
http://mindprod.com/jgloss/encoding.html

The key thing in converting to keep uppermost in your mind is that all
encoded files are conceptually composed of 8-bit byte[], even UTF-16
encoded files. Java internally works with Unicode 16-bit chars. Don't
try to go from String to String or byte[] to byte[]. You are always
encoding String to byte[] or decoding byte[] to String.
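
To make the point concrete, here is a small sketch contrasting the String-to-String attempt with a consistent round trip. It assumes the Shift_JIS charset is present in the runtime, and the class and method names are invented for the example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongVsRight {
    // Assumption: the running JDK includes the Shift_JIS charset.
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    /** The mistake: bytes produced by one encoding are decoded with another. */
    public static String encodeUtf8DecodeSjis(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), SJIS);
    }

    /** A consistent round trip: the same charset is used in both directions. */
    public static String encodeSjisDecodeSjis(String text) {
        return new String(text.getBytes(SJIS), SJIS);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57";
        System.out.println("mismatched charsets preserve text: "
                + kanji.equals(encodeUtf8DecodeSjis(kanji)));
        System.out.println("matching charsets preserve text:   "
                + kanji.equals(encodeSjisDecodeSjis(kanji)));
    }
}
```

Decoding UTF-8 bytes as Shift_JIS does not convert anything; it only misreads the bytes and yields a different, garbled String.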

Roedy Green · Sep 12, 2005

Any thoughts on when to use String vs Charset methods?

new String likely does a HashCode lookup on the name to get the
canonical name, then does a classForName on that. Quite a song and
dance just to convert a string. Perhaps it is clever caching encoding
classes.

With Charset you are doing that lookup only once, but then you have
all the futzing about with ByteBuffer and CharBuffer. You would have to
experiment to see the tradeoffs.

rookie · Sep 12, 2005

Thanks a lot again, Roedy.

Maybe I expressed myself wrongly in my previous post... My concept for
conversion is first to encode the string into utf8 bytes, then decode
the sjis bytes back to a string ("You are always
encoding String to byte[] or decoding byte[] to String", quoted from
your page). If this concept is right, I think I may be missing something
in the code which I posted today. I am very green in this topic. Can you
point out any mistake I made? I made up this code according to
the 4 examples I see in the converting section.

// encode String to bytes[]
Charset cs_utf8 = Charset.forName("UTF-8");
ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
byte[] b = bb_utf8.array();

// decode byte[] to String
Charset cs_sjis = Charset.forName( "Shift_JIS");
ByteBuffer bb_sjis = ByteBuffer.wrap( b );
CharBuffer cb = cs_sjis.decode( bb_sjis );
instructions = cb.toString();

Thanks
Rookie

Stanimir Stamenkov · Sep 12, 2005

/rookie/:

My concept for
conversion is first to encode the string in to utf8 bytes, then decode
the sjis byte back to string...

You essentially get apples and force them to become steaks. You don't
need to decode/encode anything - just configure your DB and/or JDBC
driver to do the correct conversion.

Thomas Hawtin · Sep 12, 2005

Roedy said:
new String likely does a HashCode lookup on the name to get the
canonical name, then does a classForName on that. Quite a song and
dance just to convert a string. Perhaps it is clever caching encoding
classes.

IIRC, String has a thread-local, soft cache of the last used converter
used for encoding and the last used for decoding (I don't know why it
uses ThreadLocal instead of just adding a package-private field onto
Thread). So if you do lots of conversions of the same type, you won't get
an enormous penalty for doing it the simpler way. Indeed it could be
much faster than a half-baked attempt to use charsets directly.

Tom Hawtin

rookie · Sep 13, 2005

Thanks Stanimir,

I have done some more testing this morning. I have now removed all the
conversion code and just pass what I read in to
setString(). But the conversion does not seem to be done properly. I
think that I have configured my JDBC (6.0) driver properly. I pass in the
connection properties as CHARSET=sjis and
DISABLE_UNICHAR_SENDING=false.

Can you let me know from your experience what I may be missing?

http://sybooks.sybase.com/onlineboo...link;pt=2779?target=%N_1072_START_RESTART_N%

rookie

rookie · Sep 13, 2005

I found something in the Variables window in Eclipse that might
be the reason. I was checking whether the connection properties are
all right and found that there is a warning message:

Character set conversion is not available between client character set
'sjis' and server character set 'iso_1'.

The error code is "2401" from Sybase
(http://manuals.sybase.com/onlinebooks/group-as/asg1250e/svrtsg/@Generic__BookTextView/28381;pt=14594).
This is not an error that stops the loading, but it does stop the
JDBC conversion from happening.

I am wondering if it means that I have to switch my server (Sybase)
charset to sjis before I can successfully store the kanji
instructions.

rookie

Stanimir Stamenkov · Sep 13, 2005

/rookie/:

I am wondering if it means that I have to switch my server (Sybase)
charset into sjis before I can successfully stored the kanji
instructions..

I have no Sybase experience but yes, the DB should be configured to
handle a specific character set (possibly Unicode), using a specific
encoding. It could be that you could configure different DBs on the
server to use different charsets/encodings, but you should consult with
a Sybase support group.

Here's what I've read in the documentation you linked to
previously:

Property:
CHARSET

Description:
Specifies the character set for strings passed to the database.
If the CHARSET value is null, jConnect uses the default character
set of the server to send string data to the server. If you specify
a CHARSET, the database must be able to handle characters in that
format. If the database cannot do so, a message is generated
indicating that character conversion cannot be properly completed.

When using jConnect 6.0 with unichar enabled, jConnect detects
when a client is trying to send characters to the server that cannot
be represented in the character set that is being used for the
connection. When that occurs, jConnect sends the character data to
the server as unichar data, which allows clients to insert Unicode
data into unichar/univarchar columns and parameters.

Default value:
Null

As additional hint I've read from the second documentation link
