Convert UTF8 to SJIS

Discussion in 'Java' started by rookie, Sep 11, 2005.

  1. rookie

    rookie Guest

    Hello,


    I have been struggling with this problem for a couple of days and I
    cannot figure out a proper way to handle it. I would appreciate your
    help.


    Basically, I would like to read Japanese kanji from my XML file (in
    UTF-8 format) and save it into a Sybase table, which is in SJIS format.
    I have the following code:


    public void setinstructions (java.lang.String newinstructions) {
        byte[] utf8_bytes = null;

        try {
            utf8_bytes = newinstructions.getBytes("UTF-8");
            instructions = new String(utf8_bytes, "SJIS");
        } catch (UnsupportedEncodingException e) {
            System.out.println(e);
        }
    }


    The above code failed to convert the UTF-8 kanji into SJIS. It also
    failed to save (insert) into the Sybase table as SJIS (the encoding of
    the Sybase server is iso_1, according to our DBA).

    Can someone share some experience with me ?


    Thanks
    Rookie
     
    rookie, Sep 11, 2005
    #1

  2. rookie

    Roedy Green Guest

    On 10 Sep 2005 18:28:07 -0700, "rookie" <> wrote or
    quoted :

    >my (UTF-8 format) and save it into Sybase table which is SJIS format. And
    >I have the following code :


    There are two things you could mean by UTF-8: something written by
    DataOutputStream.writeUTF, which produces length-prefixed (counted)
    strings, or something produced by a text editor, where the file is one
    great long string without counts. Let me assume the latter.

    You have UTF-8 bytes. You want to convert that to Unicode Strings.
    Then you want to convert the Unicode Strings to SJIS bytes.

    There is no such thing as UTF-8 chars/Strings. Ditto SJIS.

    You can do the conversions as you have with String methods, or using
    Readers and Writers. See http://mindprod.com/applets/fileio.html
    for the latter technique.
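
    The Reader/Writer route mentioned above can be sketched roughly as
    follows. This is only an illustration, assuming a modern JDK
    (StandardCharsets and UncheckedIOException postdate this thread) and a
    runtime that ships the Shift_JIS charset; the class and method names
    are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: bytes (UTF-8) -> chars -> bytes (Shift_JIS).
final class Utf8ToSjisTranscoder {
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    // The Reader decodes UTF-8 bytes into chars; the Writer encodes chars as Shift_JIS.
    static void transcode(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        Writer writer = new OutputStreamWriter(out, SJIS);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered bytes through to 'out'
    }

    // Convenience overload for in-memory data.
    static byte[] transcode(byte[] utf8Bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(utf8Bytes), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u6F22\u5B57".getBytes(StandardCharsets.UTF_8); // the word "kanji" in kanji
        byte[] sjis = transcode(utf8);
        System.out.println(utf8.length + " UTF-8 bytes -> " + sjis.length + " Shift_JIS bytes");
    }
}
```

    Note that the result is a byte[], not a String: the String only exists
    in the middle of the pipeline, as plain Unicode chars.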
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
     
    Roedy Green, Sep 11, 2005
    #2

  3. rookie

    Guest

    Whatever the original encoding is, a Java String is a Java
    String. So, your newinstructions is a Java String if it is
    properly displayable by System.out et al.

    Then,

    Charset cs = Charset.forName("Shift_JIS");
    ByteBuffer bb = cs.endoce(newinstructions);

    Don't create a Java String from this ByteBuffer, because
    it would be another Java String, not a Shift_JIS string!!!
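
    If the Shift_JIS bytes are needed as a byte[], one way to pull them out
    of the ByteBuffer is sketched below. The class and method names are
    invented for the example. It copies only the bytes between the
    buffer's position and limit, since the array behind the buffer may be
    larger than the encoded data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper around Charset.encode.
final class SjisEncodeSketch {
    static byte[] toSjisBytes(String text) {
        Charset cs = Charset.forName("Shift_JIS");
        // Charset.encode substitutes the charset's replacement bytes for
        // characters that Shift_JIS cannot represent.
        ByteBuffer bb = cs.encode(text);
        // Copy exactly the encoded bytes; bb.array() could expose unused capacity.
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] sjis = toSjisBytes("\u6F22\u5B57");
        System.out.println(sjis.length + " bytes");
    }
}
```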
     
    , Sep 11, 2005
    #3
  4. rookie

    Guest

    Oh no.
    cs.encode(...);
     
    , Sep 11, 2005
    #4
  5. rookie

    rookie Guest

    >Don't create a Java String from this ByteBuffer, because
    >it would be another Java String, not a Shift_JIS string!!!


    Thank you very much for your help. Since my final output will be a
    String (instructions), should I just do :
    instructions = bb.toString() ? Or should I decode ByteBuffer to
    CharBuffer, then do toString in the CharBuffer to instructions ?

    http://javaalmanac.com/egs/java.nio.charset/ConvertChar.html

    Thanks again !
     
    rookie, Sep 11, 2005
    #5
  6. rookie

    Roedy Green Guest

    On 11 Sep 2005 01:44:33 -0700, wrote or quoted :

    >Charset cs = Charset.forName("Shift_JIS");
    >ByteBuffer bb = cs.endoce(newinstructions);


    endoce -> encode

    Any thoughts on when to use String vs Charset methods?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
     
    Roedy Green, Sep 11, 2005
    #6
  7. rookie

    Roedy Green Guest

    On 11 Sep 2005 02:22:26 -0700, "rookie" <> wrote or
    quoted :

    >Thank you very much for your help. Since my final output will be a
    >String (instructions), should I just do :
    >instructions = bb.toString() ? Or should I decode ByteBuffer to
    >CharBuffer, then do toString in the CharBuffer to instructions ?


    try reading http://mindprod.com/jgloss/encoding.html#CONVERTING

    see if that clarifies it for you.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
     
    Roedy Green, Sep 11, 2005
    #7
  8. > Basically, I would like to read the Japanese kanji from the xml file in
    > my (UTF-8 format) and save it into Sybase table which is SJIS format.


    I guess you use JDBC to interact with the DB. When you use a
    PreparedStatement you set the parameters to insert using:

    PreparedStatement statement;
    String unicodeText;
    ...
    statement.setString(1, unicodeText);

    The JDBC driver then handles whatever conversion/encoding is needed, so
    you should just configure the driver (consult your DB/driver
    documentation).

    If you know in advance the DB can't handle Unicode text, you may have
    chosen to store certain text as raw bytes in order to preserve all the
    multilingual characters that may appear in the input string. These
    bytes you'll have to handle manually in the Java code, like:

    PreparedStatement statement;
    String unicodeText;
    ...
    statement.setBytes(1, unicodeText.getBytes("UTF-8"));

    On reading from the DB you have to reconstruct the string the same way:

    ResultSet rs;
    ...
    String text = new String(rs.getBytes(1), "UTF-8");

    Having a binary field for text in the DB however has the drawback of
    not being able to use SQL text operations (text search, etc.).

    And if you only care to store characters in "Shift_JIS" then you should
    go the first way: configure the driver appropriately and work with
    just Strings. You should probably add a check for whether the input
    string contains characters which can't be represented in the DB, and
    write a log entry or present a message to the user.
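
    A check of that kind could look roughly like the sketch below, which
    uses CharsetEncoder.canEncode. The class and method names are
    hypothetical, and it assumes the runtime's Shift_JIS charset matches
    what the database accepts, which may not hold for vendor-specific
    variants.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical pre-insert check for characters outside Shift_JIS.
final class SjisRepresentable {
    // True if every character of 'text' can be encoded as Shift_JIS.
    static boolean fitsInSjis(String text) {
        CharsetEncoder enc = Charset.forName("Shift_JIS").newEncoder();
        return enc.canEncode(text);
    }

    // Index of the first character that cannot be encoded, or -1 if all can.
    static int firstUnmappable(String text) {
        CharsetEncoder enc = Charset.forName("Shift_JIS").newEncoder();
        for (int i = 0; i < text.length(); ) {
            int len = Character.charCount(text.codePointAt(i));
            // Test one whole code point at a time so surrogate pairs stay together.
            if (!enc.canEncode(text.subSequence(i, i + len))) {
                return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        String input = "\u6F22\uD55C"; // a kanji followed by a Hangul syllable
        if (!fitsInSjis(input)) {
            System.out.println("Cannot store character at index " + firstUnmappable(input));
        }
    }
}
```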

    --
    Stanimir
     
    Stanimir Stamenkov, Sep 11, 2005
    #8
  9. rookie

    Guest

    >>Don't create a Java String from this ByteBuffer, because
    >>it would be another Java String, not a Shift_JIS string!!!

    >
    >Thank you very much for your help. Since my final output will be a
    >String (instructions), should I just do :
    >instructions = bb.toString() ? Or should I decode ByteBuffer to
    >CharBuffer, then do toString in the CharBuffer to instructions ?

    I said don't create a Java String because it can't be Shift_JIS.
    As Stanimir has pointed out, if your JDBC driver does the necessary
    conversion, that is, Java String (16-bit Unicode) to DB native
    (Shift_JIS), according to your pre-configuration, it is best to use
    that functionality.
     
    , Sep 11, 2005
    #9
  10. rookie

    rookie Guest

    Thanks a lot Roedy and everyone !

    I read the page thoroughly and I now have an idea of what I am doing.
    Basically, I would like to encode an incoming String (newinstructions),
    which is in UTF-8 (as stated in the XML header : <?xml version="1.0"
    encoding="utf-8" ?> ) to become another String (instructions), which is
    in SJIS. I have tried the ways shown in "Converting", but they don't
    seem to work. Then I tried Stanimir's : statement.setBytes(1,
    unicodeText.getBytes("UTF-8")); I can't make that work either. I guess
    I am missing something. I wrote a small function which shows the hex
    code for the string, which tells what encoding the string is in.

    private String getHex(String str) {
        StringBuffer sBuffer = new StringBuffer("");
        for (int i = 0; i < str.length(); i++) {
            int code = (int) str.charAt(i);
            sBuffer.append( Integer.toHexString(code));
        }
        return sBuffer.toString().toUpperCase() ;
    }


    My new setinstructions:

    public void setinstructions (java.lang.String newinstructions) {

        // encode String to bytes[]
        Charset cs_utf8 = Charset.forName("UTF-8");
        ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
        byte[] b = bb_utf8.array();

        // decode byte[] to String
        Charset cs_sjis = Charset.forName( "Shift_JIS");
        ByteBuffer bb_sjis = ByteBuffer.wrap( b );
        CharBuffer cb = cs_sjis.decode( bb_sjis );
        instructions = cb.toString();

        myLogger.log(getHex(newinstructions));
        myLogger.log(getHex(instructions));
        myLogger.log("Done instructions conversion!");

    }

    Please help point out where I went wrong.

    Thanks
    Rookie
     
    rookie, Sep 12, 2005
    #10
  11. rookie

    Roedy Green Guest

    On 11 Sep 2005 22:23:29 -0700, "rookie" <> wrote or
    quoted :

    >I read the page thoroughly and I now have an idea of what I am doing.
    >Basically, I would like to encode an incoming String (newinstructions)
    >, which is in UTF-8 (as stated in the XML header : <?xml version="1.0"
    >encoding="utf-8" ?> ) to become another String (instructions), which is
    >in SJIS.


    There is no such thing as a STRING in SJIS or UTF-8. ONLY byte[].
    This is your essential problem. I quote from my webpage:
    http://mindprod.com/jgloss/encoding.html

    The key thing in converting to keep uppermost in your mind is that all
    encoded files are conceptually composed of 8-bit byte[], even UTF-16
    encoded files. Java internally works with Unicode 16-bit chars. Don't
    try to go from String to String or byte[] to byte[]. You are always
    encoding String to byte[] or decoding byte[] to String.
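
    To make the point concrete, here is a small sketch (the class and
    method names are invented, and a modern JDK is assumed) contrasting
    the two operations. Encoding a String as UTF-8 bytes and then decoding
    those bytes as Shift_JIS does not convert anything; it just
    misinterprets the bytes and yields a different, garbled String.
    Encoding the String directly as Shift_JIS gives the bytes the database
    would need.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of "String -> byte[]" versus the mistaken "String -> String".
final class EncodingMismatchDemo {
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    // The mistaken approach: UTF-8 bytes read back as if they were Shift_JIS.
    static String decodeUtf8BytesAsSjis(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, SJIS);
    }

    // The only real conversion available: encode the String into Shift_JIS bytes.
    static byte[] encodeAsSjis(String text) {
        return text.getBytes(SJIS);
    }

    public static void main(String[] args) {
        String original = "\u6F22\u5B57";
        String garbled = decodeUtf8BytesAsSjis(original);
        System.out.println("same text after mis-decoding? " + original.equals(garbled));
        System.out.println("Shift_JIS byte count: " + encodeAsSjis(original).length);
    }
}
```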
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
     
    Roedy Green, Sep 12, 2005
    #11
  12. rookie

    Roedy Green Guest

    On Sun, 11 Sep 2005 09:46:10 GMT, Roedy Green
    <> wrote or quoted :

    >Any thoughts on when to use String vs Charset methods?


    new String likely does a HashCode lookup on the name to get the
    canonical name, then does a classForName on that. Quite a song and
    dance just to convert a string. Perhaps it is clever caching encoding
    classes.

    With Charset you are doing that lookup only once, but then you have
    all the futzing about with ByteBuffer and CharBuffer. You would have
    to experiment to see the tradeoffs.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
     
    Roedy Green, Sep 12, 2005
    #12
  13. rookie

    rookie Guest

    Thanks a lot again, Roedy.

    Maybe I expressed myself wrongly in my previous post... My concept for
    conversion is first to encode the string in to utf8 bytes, then decode
    the sjis byte back to string ("You are always encoding String to byte[]
    or decoding byte[] to String", quoted from your page). If this concept
    is right, I think I may be missing something in the code which I posted
    today. I am very green in this topic. Can you point out any mistake I
    made ? I made up this code according to the 4 examples I see in the
    converting section.

    // encode String to bytes[]
    Charset cs_utf8 = Charset.forName("UTF-8");
    ByteBuffer bb_utf8 = cs_utf8.encode(newinstructions);
    byte[] b = bb_utf8.array();

    // decode byte[] to String
    Charset cs_sjis = Charset.forName( "Shift_JIS");
    ByteBuffer bb_sjis = ByteBuffer.wrap( b );
    CharBuffer cb = cs_sjis.decode( bb_sjis );
    instructions = cb.toString();




    Thanks
    Rookie
     
    rookie, Sep 12, 2005
    #13
  14. /rookie/:
    > My concept for
    > conversion is first to encode the string in to utf8 bytes, then decode
    > the sjis byte back to string...


    You essentially get apples and force them to become steaks. You don't
    need to decode/encode anything - just configure your DB and/or JDBC
    driver to do the correct conversion.

    --
    Stanimir
     
    Stanimir Stamenkov, Sep 12, 2005
    #14
  15. Roedy Green wrote:
    > On Sun, 11 Sep 2005 09:46:10 GMT, Roedy Green
    > <> wrote or quoted :
    >
    >>Any thoughts on when to use String vs Charset methods?

    >
    > new String likely does a HashCode lookup on the name to get the
    > canonical name, then does a classForName on that. Quite a song and
    > dance just to convert a string. Perhaps it is clever caching encoding
    > classes.


    IIRC, String has a thread-local, soft cache of the last converter
    used for encoding and the last used for decoding (I don't know why it
    uses ThreadLocal instead of just adding a package-private field onto
    Thread). So if you do lots of conversions of the same type, you won't
    get an enormous penalty for doing it the simpler way. Indeed it could
    be much faster than a half-baked attempt to use charsets directly.
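
    For comparison, the two styles under discussion look roughly like this.
    It is a sketch only: the class and method names are invented,
    String.getBytes(Charset) is assumed to be available (it is newer than
    this thread), and no claim is made here about which one is faster.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical side-by-side of the String method and a reused CharsetEncoder.
final class EncodeTwoWays {
    private static final Charset SJIS = Charset.forName("Shift_JIS");

    // The simple way: let String do the charset handling.
    static byte[] viaString(String s) {
        return s.getBytes(SJIS);
    }

    // An encoder that substitutes replacement bytes instead of throwing,
    // to mirror what String.getBytes does with unencodable characters.
    static CharsetEncoder newReplacingEncoder() {
        return SJIS.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }

    // The explicit way: the caller keeps one encoder and reuses it.
    // Note that a CharsetEncoder is not safe for concurrent use.
    static byte[] viaEncoder(CharsetEncoder enc, String s) {
        try {
            ByteBuffer bb = enc.encode(CharBuffer.wrap(s)); // resets the encoder first
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CharsetEncoder enc = newReplacingEncoder();
        String[] samples = { "abc", "\u6F22\u5B57" };
        for (String s : samples) {
            boolean same = java.util.Arrays.equals(viaString(s), viaEncoder(enc, s));
            System.out.println("same bytes: " + same);
        }
    }
}
```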

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, Sep 12, 2005
    #15
  16. rookie

    rookie Guest

    Thanks Stanimir,

    I have done some more testing this morning. I have now removed all the
    conversion code and just pass what I read to setString(). But the
    conversion still does not seem to be done properly. I think that
    I have configured my JDBC (6.0) driver properly. I pass in the
    connection properties as CHARSET=sjis and
    DISABLE_UNICHAR_SENDING=false.
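
    For readers following along, passing those two properties through JDBC
    might look like the sketch below. The CHARSET and
    DISABLE_UNICHAR_SENDING names and values come from this post; the
    class name, the credentials, and the connection URL in the comment are
    placeholders, and whether the server accepts the setting depends on
    its own configuration.

```java
import java.util.Properties;

// Hypothetical helper that only builds the Properties object; it does not connect.
final class JConnectPropsSketch {
    static Properties connectionProps(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);         // standard JDBC property key
        props.setProperty("password", password); // standard JDBC property key
        props.setProperty("CHARSET", "sjis");    // as described in the post
        props.setProperty("DISABLE_UNICHAR_SENDING", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = connectionProps("someUser", "somePassword");
        // With a driver on the classpath one would then call something like
        // DriverManager.getConnection(url, props), where 'url' is the
        // driver-specific connection URL (placeholder, not shown here).
        System.out.println(props.getProperty("CHARSET"));
    }
}
```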

    Can you let me know from your experience what I may be missing ?

    http://sybooks.sybase.com/onlineboo...link;pt=2779?target=%N_1072_START_RESTART_N%

    rookie
     
    rookie, Sep 13, 2005
    #16
  17. rookie

    rookie Guest

    I found something from the Variables window in Eclipse and this might
    be the reason.. I was trying to see if the connection properties are
    alright and I found out that there is a warning message :

    Character set conversion is not available between client character set
    'sjis' and server character set 'iso_1'.

    The error code is "2401" from Sybase
    (http://manuals.sybase.com/onlineboo...svrtsg/@Generic__BookTextView/28381;pt=14594).
    This is not an error that stops the load, but it does stop the
    JDBC conversion from happening.

    I am wondering if it means that I have to switch my server (Sybase)
    charset to sjis before I can successfully store the kanji
    instructions..

    rookie
     
    rookie, Sep 13, 2005
    #17
  18. /rookie/:

    > I am wondering if it means that I have to switch my server (Sybase)
    > charset to sjis before I can successfully store the kanji
    > instructions..


    I have no Sybase experience but yes, the DB should be configured to
    handle a specific character set (possibly Unicode), using a specific
    encoding. It could be that you could configure different DBs on the
    server to use different charsets/encodings, but you should consult with
    a Sybase support group.

    Here's what I've read in the documentation you linked to
    previously
    <http://sybooks.sybase.com/onlinebooks/group-jc/jcg0600e/prjdbc/@ebt-link;pt=2779?target=%25N%14_1072_START_RESTART_N%25>:

    > Property:
    > CHARSET
    >
    > Description:
    > Specifies the character set for strings passed to the database.
    > If the CHARSET value is null, jConnect uses the default character
    > set of the server to send string data to the server. If you specify
    > a CHARSET, the database must be able to handle characters in that
    > format. If the database cannot do so, a message is generated
    > indicating that character conversion cannot be properly completed.
    >
    > When using jConnect 6.0 with unichar enabled, jConnect detects
    > when a client is trying to send characters to the server that cannot
    > be represented in the character set that is being used for the
    > connection. When that occurs, jConnect sends the character data to
    > the server as unichar data, which allows clients to insert Unicode
    > data into unichar/univarchar columns and parameters.
    >
    > Default value:
    > Null


    As an additional hint, I've read in the second documentation link
    <http://manuals.sybase.com/onlinebooks/group-as/asg1250e/svrtsg/@Generic__BookTextView/28381;pt=14594>:

    > Action
    >
    > Make sure all necessary character sets are loaded, including the
    > client's character set (as shown in the error message output):


    --
    Stanimir
     
    Stanimir Stamenkov, Sep 13, 2005
    #18
