Convert UTF8 to SJIS


rookie

Hello,


I have been struggling with this problem for a couple of days and I cannot
figure out a proper way to handle it. I would appreciate your help.


Basically, I would like to read Japanese kanji from an XML file
(UTF-8 format) and save it into a Sybase table, which is in SJIS format.
I have the following code:


public void setinstructions (java.lang.String newinstructions) {
    byte[] utf8_bytes = null;

    try {
        utf8_bytes = newinstructions.getBytes("UTF-8");
        instructions = new String(utf8_bytes, "SJIS");
    } catch (UnsupportedEncodingException e) { System.out.println(e); }
}


The above code failed to convert the UTF-8 kanji into SJIS. It also
failed to save (insert) into the Sybase table as SJIS (the character set
of the Sybase server is iso_1; I got this information from the DBA).

Can someone share some experience with me ?


Thanks
Rookie
 

Roedy Green

my (UTF-8 format) and save it into Sybase table which is SJIS format. And
I have the following code :

There are two things you could mean by UTF-8: something written by
DataOutputStream.writeUTF, which produces counted (length-prefixed)
strings, or something produced by a text editor, a file that is one
great long string without counts. Let me assume the latter.

You have UTF-8 bytes. You want to convert that to Unicode Strings.
Then you want to convert the Unicode Strings to SJIS bytes.

There is no such thing as UTF-8 chars/Strings. Ditto SJIS.

You can do the conversions as you have with String methods, or using
Readers and Writers. See http://mindprod.com/applets/fileio.html
for the latter technique.
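
A minimal sketch of the two-step conversion with String methods, assuming the
input really is plain UTF-8 bytes and that the JRE ships the Shift_JIS charset.
The class and method names are made up for illustration, and StandardCharsets
needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8ToSjis {
    // Step 1: decode UTF-8 bytes into a String (bytes -> chars).
    // Step 2: encode that String as Shift_JIS bytes (chars -> bytes).
    static byte[] utf8ToSjis(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(Charset.forName("Shift_JIS"));
    }

    public static void main(String[] args) {
        // Two kanji, written as escapes so the source file encoding does not matter.
        byte[] utf8 = "\u6F22\u5B57".getBytes(StandardCharsets.UTF_8);
        byte[] sjis = utf8ToSjis(utf8);
        System.out.println(utf8.length + " UTF-8 bytes -> " + sjis.length + " Shift_JIS bytes");
    }
}
```

The String in the middle is the only place where the text exists as characters; on either side of it there are only bytes.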
 

HGA03630

Whatever the original encoding is, a Java String is a Java
String. So, your newinstructions is a Java String if it is
properly displayable by System.out et al.

Then,

Charset cs = Charset.forName("Shift_JIS");
ByteBuffer bb = cs.endoce(newinstructions);

Don't create a Java String from this ByteBuffer, because
it would be another Java String, not a Shift_JIS string!!!
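
If the raw Shift_JIS bytes are needed (for example for setBytes), they can be
copied out of the ByteBuffer. This is only a sketch with a hypothetical helper
name; it copies the bytes between position and limit instead of calling
array(), because the backing array may be larger than the encoded data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

final class EncodeWithCharset {
    static byte[] toSjisBytes(String text) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Charset.encode substitutes a replacement byte for unmappable characters.
        ByteBuffer bb = sjis.encode(text);
        // Copy exactly the encoded bytes, not the whole backing array.
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] sjis = toSjisBytes("\u6F22\u5B57");
        System.out.println(sjis.length + " Shift_JIS bytes");
    }
}
```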
 

Roedy Green

Charset cs = Charset.forName("Shift_JIS");
ByteBuffer bb = cs.endoce(newinstructions);

endoce -> encode

Any thoughts on when to use String vs Charset methods?
 

Stanimir Stamenkov

Basically, I would like to read the Japanese kanji from the xml file in
my (UTF-8 format) and save it into Sybase table which is SJIS format.

I guess you use JDBC to interact with the DB. When you use a
PreparedStatement you set the parameters to insert using:

PreparedStatement statement;
String unicodeText;
...
statement.setString(1, unicodeText);

The JDBC driver then handles whatever conversion/encoding is needed, so
you should just configure the driver (consult your DB/driver
documentation).

If you know in advance the DB can't handle Unicode text, you may have
chosen to store certain text as raw bytes in order to preserve all the
multilingual characters that may appear in the input string. You'll
have to handle these bytes manually in the Java code, like:

PreparedStatement statement;
String unicodeText;
...
statement.setBytes(1, unicodeText.getBytes("UTF-8"));

On reading from the DB you have to reconstruct the string the same way:

ResultSet rs;
...
String text = new String(rs.getBytes(1), "UTF-8");

Having a binary field for text in the DB however has the drawback of
not being able to use SQL text operations (text search, etc.).

And if you only care to store characters in "Shift_JIS" then you should
go the first way - configure the driver appropriately and work with
just Strings. Probably you should put some check if the input string
contains characters which can't be represented in the DB and write a
log or present a message to the user.
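
One way such a check could look, as a sketch only: it uses the standard
CharsetEncoder.canEncode method and assumes the target is Java's "Shift_JIS"
charset, which may not match the database's own character table exactly. The
class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class SjisCheck {
    // True if every character of the text has a Shift_JIS representation.
    static boolean fitsInSjis(String text) {
        // Encoders are not thread-safe, so create one per call.
        CharsetEncoder encoder = Charset.forName("Shift_JIS").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInSjis("\u6F22\u5B57")); // two kanji
        System.out.println(fitsInSjis("\uD55C"));       // a Korean Hangul syllable
    }
}
```

A caller could log a warning or reject the input whenever fitsInSjis returns false, before the text ever reaches setString.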
 

HGA03630

Don't create a Java String from this ByteBuffer, because

Thank you very much for your help. Since my final output will be a
String (instructions), should I just do:
instructions = bb.toString()? Or should I decode the ByteBuffer to a
CharBuffer, then call toString on the CharBuffer to get instructions?

I said don't create a Java String because a Java String can't be Shift_JIS.
As Stanimir has pointed out, if your JDBC driver does the necessary
conversion, that is, Java String (16-bit Unicode) to DB native (Shift_JIS),
according to your pre-configuration, it is best to use that functionality.
 

rookie

Thanks a lot Roedy and everyone !

I read the page thoroughly and I have an idea of what I am doing now.
Basically, I would like to encode an incoming String (newinstructions),
which is in UTF-8 (as stated in the XML header: <?xml version="1.0"
encoding="utf-8" ?>), to become another String (instructions), which is
in SJIS. I have tried the ways shown in "Converting", but they don't
seem to work. Then I tried Stanimir's statement.setBytes(1,
unicodeText.getBytes("UTF-8")); and I can't make that work either. I guess
I am missing something. I wrote a small function that shows the hex code
for the string, which tells what encoding the string is in.

private String getHex(String str) {
    StringBuffer sBuffer = new StringBuffer("");
    for (int i = 0; i < str.length(); i++) {
        int code = (int) str.charAt(i);
        sBuffer.append( Integer.toHexString(code));
    }
    return sBuffer.toString().toUpperCase() ;
}


My new setinstructions:

public void setinstructions (java.lang.String newinstructions) {

    // encode String to bytes[]
    Charset cs_utf8 = Charset.forName("UTF-8");
    ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
    byte[] b = bb_utf8.array();

    // decode byte[] to String
    Charset cs_sjis = Charset.forName( "Shift_JIS");
    ByteBuffer bb_sjis = ByteBuffer.wrap( b );
    CharBuffer cb = cs_sjis.decode( bb_sjis );
    instructions = cb.toString();

    myLogger.log(getHex(newinstructions));
    myLogger.log(getHex(instructions));
    myLogger.log("Done instructions conversion!");

}

Please help to point out which place I am wrong.

Thanks
Rookie
 

Roedy Green

I read the page thoroughly and I have idea what I am doing right now.
Basically, I would like to encode an incoming String (newinstructions)
, which is in UTF-8 (as stated in the XML header : <?xml version="1.0"
encoding="utf-8" ?> ) to become another String (instructions), which is
in SJIS.

There is no such thing as a String in SJIS or UTF-8. ONLY byte[].
This is your essential problem. I quote from my webpage:
http://mindprod.com/jgloss/encoding.html

The key thing in converting to keep uppermost in your mind is that all
encoded files are conceptually composed of 8-bit byte[], even UTF-16
encoded files. Java internally works with Unicode 16-bit chars. Don't
try to go from String to String or byte[] to byte[]. You are always
encoding String to byte[] or decoding byte[] to String.
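
To make that concrete, here is a small sketch (hypothetical class and method
names, assuming the JRE has the Shift_JIS charset). It dumps the bytes one
String produces under each encoding, then reproduces the String-to-String
attempt by decoding UTF-8 bytes as if they were Shift_JIS.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Hex dump of raw bytes, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // The mistake: UTF-8 bytes decoded as if they were Shift_JIS.
    static String decodeUtf8BytesAsSjis(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), SJIS);
    }

    public static void main(String[] args) {
        String text = "\u6F22\u5B57"; // two kanji
        System.out.println("UTF-8 bytes:     " + hex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println("Shift_JIS bytes: " + hex(text.getBytes(SJIS)));
        System.out.println("text survives wrong decode? "
                + decodeUtf8BytesAsSjis(text).equals(text));
    }
}
```

The encoding shows up only in the byte dumps. A hex dump of the chars of a String shows 16-bit Unicode values regardless of which encoding the text originally came from.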
 

Roedy Green

Any thoughts on when to use String vs Charset methods?

new String likely does a hash lookup on the name to get the
canonical name, then does a Class.forName on that. Quite a song and
dance just to convert a string. Perhaps it is clever and caches the
encoding classes.

With Charset you are doing that lookup only once, but then you have
all the futzing about with ByteBuffer and CharBuffer. You would have
to experiment to see the tradeoffs.
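
For what it is worth, Java 6 and later offer a middle road: String overloads
that take a Charset object directly, so the name lookup happens once and no
buffers are involved. A sketch with hypothetical names, with no claim about
which approach is faster:

```java
import java.nio.charset.Charset;

final class CharsetReuse {
    // Looked up once, reused for every conversion.
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    static byte[] encode(String text) {
        return text.getBytes(SJIS);   // String -> Shift_JIS bytes
    }

    static String decode(byte[] sjisBytes) {
        return new String(sjisBytes, SJIS); // Shift_JIS bytes -> String
    }

    public static void main(String[] args) {
        String text = "\u6F22\u5B57";
        System.out.println(decode(encode(text)).equals(text));
    }
}
```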
 

rookie

Thanks a lot again, Roedy.

Maybe I expressed myself wrongly in my previous post... My concept for the
conversion is first to encode the string into UTF-8 bytes, then decode
the SJIS bytes back to a string ("You are always encoding String to
byte[] or decoding byte[] to String", quoted from your page). If this
concept is right, I think I may be missing something in the code I
posted today. I am very green in this topic. Can you point out any
mistake I made? I made up this code according to the 4 examples I saw
in the converting section.

// encode String to bytes[]
Charset cs_utf8 = Charset.forName("UTF-8");
ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
byte[] b = bb_utf8.array();

// decode byte[] to String
Charset cs_sjis = Charset.forName( "Shift_JIS");
ByteBuffer bb_sjis = ByteBuffer.wrap( b );
CharBuffer cb = cs_sjis.decode( bb_sjis );
instructions = cb.toString();




Thanks
Rookie
 

Stanimir Stamenkov

/rookie/:
My concept for
conversion is first to encode the string in to utf8 bytes, then decode
the sjis byte back to string...

You essentially get apples and force them to become steaks. You don't
need to decode/encode anything - just configure your DB and/or JDBC
driver to do the correct conversion.
 

Thomas Hawtin

Roedy said:
new String likely does a HashCode lookup on the name to get the
canonical name, then does a classForName on that. Quite a song and
dance just to convert a string. Perhaps it is clever caching encoding
classes.

IIRC, String has a thread-local, soft cache of the last used converter
used for encoding and the last used for decoding (I don't know why it
uses ThreadLocal instead of just adding a package-private field onto
Thread). So if you do lots of conversions of the same type, you won't get
an enormous penalty for doing it the simpler way. Indeed it could be
much faster than a half-baked attempt to use charsets directly.

Tom Hawtin
 

rookie

Thanks Stanimir,

I have done some more testing this morning. I have now removed all the
conversion code and just pass what I read into setString(). But the
conversion does not seem to be done properly. I think that I have
configured my JDBC (6.0) driver properly. I pass in the connection
properties CHARSET=sjis and DISABLE_UNICHAR_SENDING=false.

Can you let me know from your experience what I may be missing?

http://sybooks.sybase.com/onlineboo...link;pt=2779?target=%N_1072_START_RESTART_N%

rookie
 

rookie

I found something in the Variables window in Eclipse that might be the
reason. I was checking whether the connection properties were all right
and found this warning message:

Character set conversion is not available between client character set
'sjis' and server character set 'iso_1'.

The error code is "2401" from Sybase
(http://manuals.sybase.com/onlinebooks/group-as/asg1250e/svrtsg/@Generic__BookTextView/28381;pt=14594).
This is not an error that stops the load, but it does stop the
JDBC conversion from happening.

I am wondering if it means that I have to switch my server (Sybase)
charset to sjis before I can successfully store the kanji
instructions.

rookie
 

Stanimir Stamenkov

/rookie/:
I am wondering if it means that I have to switch my server (Sybase)
charset into sjis before I can successfully stored the kanji
instructions..

I have no Sybase experience but yes, the DB should be configured to
handle a specific character set (possibly Unicode), using a specific
encoding. It could be that you could configure different DBs on the
server to use different charsets/encodings, but you should consult with
a Sybase support group.

Here's what I've read in the documentation you linked to previously:

Property:
CHARSET

Description:
Specifies the character set for strings passed to the database.
If the CHARSET value is null, jConnect uses the default character
set of the server to send string data to the server. If you specify
a CHARSET, the database must be able to handle characters in that
format. If the database cannot do so, a message is generated
indicating that character conversion cannot be properly completed.

When using jConnect 6.0 with unichar enabled, jConnect detects
when a client is trying to send characters to the server that cannot
be represented in the character set that is being used for the
connection. When that occurs, jConnect sends the character data to
the server as unichar data, which allows clients to insert Unicode
data into unichar/univarchar columns and parameters.

Default value:
Null

As additional hint I've read from the second documentation link
 
