Converting to a UTF-8 String

Red · Jul 14, 2003

Hi all,

I have a question regarding conversion to UTF-8 from a string. I use
the getBytes() String method to get the bytes from the original Java
string. For some reason, the string gets cut off. I CAN convert this
string to the correct output using the JDK's native2ascii tool.

Specifically the example is:
// desired output = 2002年第4四半期の出荷面積を発表
// actual output (newString value) = 2002年第4

// here is the code...

String originalString =
"2002å¹´ç¬¬4å››å?ŠæœŸã?®å‡ºè?·é?¢ç©?ã‚'ç™ºè¡¨";
byte[] b = originalString.getBytes();

// create UTF-8 encoded String
String newString = new String(b, 0, b.length, "UTF-8");

What do you think? I have tried using getBytes(<different encodings>)
and none of the tries worked. I have a feeling that the string itself
may be erroneous - the question marks in the string are questionable -
but this is returned from the database itself. ANY thoughts/comments
would be greatly appreciated!

Steve Horsley · Jul 14, 2003

Let me get pedantic - a String is a series of characters, each one being
represented by a Unicode value in the range 0-65535. UTF-8 is a way of
encoding Strings into a sequence of bytes. Each character may occupy
1-3 bytes. String.getBytes("UTF-8") is what you need to do to encode a
String into a byte sequence.
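
To make the char/byte distinction concrete, here is a minimal sketch of the round trip. The class and method names are made up for illustration, and it assumes a JDK recent enough to have java.nio.charset.StandardCharsets (Java 7 or later); on older JDKs the "UTF-8" string form of the same calls applies.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode a String (a sequence of chars) into UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First seven characters of the desired output, written with escapes.
        String original = "2002\u5e74\u7b2c4";
        byte[] encoded = toUtf8(original);
        // 5 ASCII chars take 1 byte each, the 2 CJK chars take 3 bytes each.
        System.out.println(encoded.length);                     // 11
        System.out.println(fromUtf8(encoded).equals(original)); // true
    }
}
```

The point is that the byte[] is the UTF-8 form; the String you get back is plain chars again, not "a UTF-8 String".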

This is not what you seem to want to do. You seem to want to produce
another String, where the Unicode values from the original string are
encoded as "&#ddddd" where ddddd is the decimal representation of the
unicode value.

To do the translation you want, I think you need to work through the
original String character-by-character, casting to an int, then either
passing the original character, or if its value is greater than 255 or
whatever, pass through the encoded version instead. Something like this:

public static String encode(String s) {
    StringBuffer sb = new StringBuffer();
    for (int x = 0; x < s.length(); x++) {
        char c = s.charAt(x);
        if (c > 255) {
            // replace the character with "&#" plus its decimal value
            sb.append("&#");
            sb.append((int) c);
        } else {
            // pass the character through unchanged
            sb.append(c);
        }
    }
    return sb.toString();
}

You may be safer using a limit of 127.
You also need a way of escaping the '&' character itself to avoid
ambiguity.
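
A sketch of the method above with both of those changes applied. The choice of "&amp;" as the escape for '&' and the trailing ';' after the digits are assumptions borrowed from HTML numeric character references, not anything the original requirement states; the class name is hypothetical.

```java
class NumericEscaper {
    // Hypothetical variant of encode(): escapes '&' and every char above 127.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') {
                // assumption: HTML-style escape so a literal '&' stays unambiguous
                sb.append("&amp;");
            } else if (c > 127) {
                // assumption: ';' terminator so a following digit is not swallowed
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("4\u56db & co")); // 4&#22235; &amp; co
    }
}
```

Without a terminator, "&#22235" followed by a literal digit could not be decoded reliably, which is why the sketch adds one.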

The odd thing is that I would think that all this should not be required.
If the database stores a String, I would expect it to take a String with
the full repertoire of Unicode values, and return same. If the database is
storing a byte[], then String.getBytes("UTF-8") and new String(byte[],
"UTF-8") should do very nicely. I'm no database expert, but I would regard
any other behaviour as broken. Opinions from others???

Although it produces very different looking Strings, java.net.URLEncoder
and URLDecoder might save you some coding time.
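
For illustration, a minimal sketch of that round trip. The wrapper class and method names are invented here; the two-argument encode(String, String) and decode(String, String) calls are the standard java.net ones, and "UTF-8" is assumed as the encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class UrlCodecDemo {
    // Percent-encode every non-ASCII char as its UTF-8 bytes.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected
            throw new IllegalStateException(e);
        }
    }

    // Reverse the percent-encoding back into the original chars.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("2002\u5e74\u7b2c4");
        System.out.println(encoded); // 2002%E5%B9%B4%E7%AC%AC4
        System.out.println(decode(encoded).equals("2002\u5e74\u7b2c4")); // true
    }
}
```

The result is pure ASCII, so it survives any storage that mangles characters above 127.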

Steve

Jon A. Cruz · Jul 15, 2003

Red said:
String originalString =
"2002å¹´ç¬¬4å››å?ŠæœŸã?®å‡ºè?·é?¢ç©?ã‚'ç™ºè¡¨";

That's generally not a good thing to do in Java code. Use Unicode
escapes for any non-ASCII characters. Remember, ASCII only goes from 0 up
through 127. Anything else is not "ASCII".

String originalString = "2002\u5e74\u7b2c4\u56db\u534a\u671f\u306e...";

Red said:
// create UTF-8 encoded String
String newString = new String( b, 0, length, enc );

No, it doesn't.

As Steve pointed out, bytes are bytes and chars are chars and never the
twain should mix.

Strings are sequences of chars, not arbitrary streams of bytes.

BTW, is that really UTF-8 data you have? Or are you trying to munge
things and shove raw binary into strings? If so, that's almost
guaranteed to break.
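
One way to test that: if the String really holds UTF-8 bytes that were decoded with a single-byte charset somewhere along the way, re-encoding with that same charset and decoding as UTF-8 should bring the text back. A minimal sketch, assuming the wrong charset was ISO-8859-1 (which maps every byte to a char and back without loss); the class and method names are made up for illustration. If the faulty step used a charset with unmapped bytes, those bytes may already have been replaced (the '?' in the posted string suggest this), and nothing here can bring them back.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Simulates the suspected fault: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "2002\u5e74\u7b2c4";
        String garbled = misdecode(original);
        System.out.println(garbled.length());                 // 11, one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```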
