Converting to a UTF-8 String

Red

Hi all,

I have a question regarding conversion to UTF-8 from a string. I use
the getBytes() string method to get the bytes from the original Java
string. For some reason, the string gets cut off. I CAN convert this
string using the native2ascii Java tool to the correct output.

Specifically the example is:
// desired output = 2002年第4四半期の出荷面積を発表
// actual output (newString value) = 2002年第4

// here is the code...

String originalString =
"2002年第4å››å?ŠæœŸã?®å‡ºè?·é?¢ç©?ã‚'発表";
byte []b = s.getBytes();

// create UTF-8 encoded String
String newString = new String( b, 0, length, enc );


What do you think? I have tried using getBytes(<different encodings>)
and none of the tries worked. I have a feeling that the string itself
may be erroneous - the question marks in the string are questionable -
but this is returned from the database itself. ANY thoughts/comments
would be greatly appreciated!
 
Steve Horsley

Let me get pedantic - a String is a series of chars, each one being
a 16-bit UTF-16 code unit in the range 0-65535. UTF-8 is a way of
encoding Strings into a sequence of bytes. Each char may occupy
1-3 bytes; a character outside that 16-bit range is held as two chars
(a surrogate pair) and takes 4 bytes. String.getBytes("UTF-8") is what
you need to do to encode a String into a byte sequence.
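
To make the byte counts concrete, here is a minimal sketch that encodes a few sample strings and reports how many bytes each one takes. The class and method names are made up for illustration, and it uses StandardCharsets.UTF_8 (available from Java 7) in place of the "UTF-8" charset name so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // plain ASCII letter
        System.out.println(utf8Length("\u00e9"));       // Latin-1 letter (e with acute)
        System.out.println(utf8Length("\u5e74"));       // CJK character from the question
        System.out.println(utf8Length("\ud83d\ude00")); // surrogate pair, one code point
    }
}
```

The four calls should report 1, 2, 3 and 4 bytes respectively.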

This is not what you seem to want to do. You seem to want to produce
another String, where the Unicode values from the original string are
encoded as "&#ddddd" where ddddd is the decimal representation of the
unicode value.

To do the translation you want, I think you need to work through the
original String character-by-character, casting to an int, then either
passing the original character, or if its value is greater than 255 or
whatever, pass through the encoded version instead. Something like this:

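public static String encode(String s) {
    StringBuilder sb = new StringBuilder();
    for (int x = 0; x < s.length(); x++) {
        char c = s.charAt(x);
        if (c > 255) {
            // Non-Latin-1 char: emit its decimal value, closed with ';'
            // so that a following digit is not read as part of the number.
            sb.append("&#");
            sb.append((int) c);
            sb.append(';');
        } else {
            // Pass everything else through unchanged.
            sb.append(c);
        }
    }
    return sb.toString();
}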

You may be safer using a limit of 127.
You also need a way of escaping the '&' character itself to avoid
ambiguity.
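
One way those two suggestions could look in code is sketched below. It rests on assumptions the thread does not confirm: that the target format is the HTML-style decimal reference with a closing ';', that writing '&' as &#38; is an acceptable way to escape it, and that characters outside the 16-bit range should become a single reference. The class name is invented for the example.

```java
class NumericEscaper {
    // Replaces every non-ASCII code point with a decimal reference such as
    // &#24180; and writes '&' as &#38; so a literal ampersand stays unambiguous.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a surrogate pair into one value.
            int cp = s.codePointAt(i);
            if (cp > 127 || cp == '&') {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.append((char) cp);
            }
            // Advance by 1 char, or by 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("2002\u5e74 Q&A")); // prints 2002&#24180; Q&#38;A
    }
}
```

Because the ampersand itself is always escaped, any "&#" that appears in the output is known to start a reference, which is what makes decoding unambiguous.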

The odd thing is that I would think that all this should not be required.
If the database stores a String, I would expect it to take a String with
the full repertoire of Unicode values, and return same. If the database is
storing a byte[], then String.getBytes("UTF-8") and new String(byte[],
"UTF-8") should do very nicely. I'm no database expert, but I would regard
any other behaviour as broken. Opinions from others???
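
For the byte[] case, the round trip could be sketched as follows. The helper names are hypothetical, and StandardCharsets.UTF_8 (Java 7 and later) stands in for the "UTF-8" charset name; no database is involved here, the byte array simply plays the part of the stored value.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encodes text to UTF-8 bytes, as one might before storing it in a byte[] column.
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String fromBytes(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "2002\u5e74\u7b2c4\u56db\u534a\u671f";
        byte[] stored = toBytes(original);
        System.out.println(stored.length);                      // prints 20
        System.out.println(fromBytes(stored).equals(original)); // prints true
    }
}
```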

Although it produces very different looking Strings, java.net.URLEncoder
and URLDecoder might save you some coding time.

Steve
 
Jon A. Cruz

Red said:
String originalString =
"2002年第4å››å?ŠæœŸã?®å‡ºè?·é?¢ç©?ã‚'発表";

That's generally not a good thing to do in Java code. Use Unicode
escapes for any non-ASCII characters. Remember, ASCII only goes from 0 up
through 127. Anything else is not "ASCII".

String originalString = "2002\u5e74\u7b2c4\u56db\u534a\u671f\u306e...";




// create UTF-8 encoded String
String newString = new String( b, 0, length, enc );

No, it doesn't.


As Steve pointed out, bytes are bytes and chars are chars and never the
twain should mix.

Strings are sequences of chars, not arbitrary streams of bytes.



BTW, is that really UTF-8 data you have? Or are you trying to munge
things and shove raw binary into strings? If so, that's almost
guaranteed to break.
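
To illustrate how that kind of breakage can arise, the sketch below decodes UTF-8 bytes with the wrong charset and then undoes the damage. Treat it as a guess at the mechanism: the garbled literal in the question looks like UTF-8 bytes read with a single-byte charset, but the thread does not confirm that. The class and method names are invented, and ISO-8859-1 is chosen only because it maps every byte to a char, which makes the mistake reversible here.

```java
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Decodes UTF-8 bytes as if they were ISO-8859-1: one char per byte.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes misdecode(); this only works because ISO-8859-1 loses no bytes.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u56db\u534a\u671f";  // three CJK characters
        String garbled = misdecode(original);
        System.out.println(garbled.length());               // prints 9
        System.out.println(garbled.charAt(0) == '\u00e5');  // prints true
        System.out.println(repair(garbled).equals(original)); // prints true
    }
}
```

If the wrong charset has no mapping for some byte values, those bytes are replaced during decoding and cannot be recovered afterwards, which would be consistent with the question marks in the original string.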
 
