What is the encoding of this String?

howachen

I tried to print a string in Eclipse (console set to UTF-8):

byte b[] = {-28, -72, -83};
String str = new String(b);
System.out.println(str);

A Chinese character was shown, but the UTF-8 value of that character
should not be "{-28, -72, -83}".

So can anyone tell me what exactly {-28, -72, -83} is?

thanks!
 
Robert Klemme

The default encoding likely converted this byte sequence to something
else. You could print hex values of each char and look them up in
Unicode charsets at http://www.unicode.org/

Kind regards

robert
 
howachen

Hex value = [e, 4, b, 8, a, d] (Using apache commons to convert to the
byte array)

I can't find any related character from unicode.org
I heard that Java Unicode is modified from the standard?

thanks...
 
Mike Schilling

You don't want a byte array here, you want the characters in the string.
Try

for (int i = 0; i < str.length(); i++)
    System.out.println((int) str.charAt(i));
 
howachen

Mike Schilling wrote:
You don't want a byte array here, you want the characters in the string.
Try

for (int i = 0; i < str.length(); i++)
    System.out.println((int) str.charAt(i));

Sorry, this makes no difference.

The hex values output by your code are also [e, 4, b, 8, a, d].
 
Thomas Fritsch

I try to print a string using Eclipse (console set to UTF8)

byte b[] = {-28, -72, -83};
// You should really specify the wanted UTF-8 encoding here, instead
// of assuming that the system's default encoding is UTF-8:
String str = new String(b, "UTF-8");
System.out.println(str);
// Dump the hex values:
for (int i = 0; i < str.length(); i++)
    System.out.println("[" + i + "]=0x" + Integer.toHexString(str.charAt(i)));

a Chinese character was shown, but the UTF-8 value of that character
should not be "{-28, -72, -83}"

so can anyone tell me what exactly {-28, -72, -83} is?

With the code above I found that str is "\u4e2d", which means "middle" in
Chinese, according to <http://www.unicode.org/charts/unihan.html>
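
For anyone who wants to reproduce this without depending on the platform default charset, here is a minimal sketch. The class and method names (DecodeUtf8, decode, codeUnits) are made up for illustration; only `String(byte[], Charset)` and `StandardCharsets.UTF_8` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class DecodeUtf8 {
    // Decode the bytes as UTF-8, regardless of the platform default charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Format each 16-bit char of the string as U+XXXX, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {-28, -72, -83};
        System.out.println(codeUnits(decode(b))); // prints U+4E2D
    }
}
```

Using the `Charset` overload also avoids the checked UnsupportedEncodingException that the `String(byte[], String)` constructor declares.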
 
margie mago

Try:

String str = new String(b, "UTF-8");
 
Chris Uppal

byte b[] = {-28, -72, -83};
String str = new String(b);
System.out.println(str);

a Chinese character was shown, but the UTF-8 value of that character
should not be "{-28, -72, -83}"

so can anyone tell me what exactly {-28, -72, -83} is?

Java's signed bytes are a pain in the arse. /Everyone/ else in the world
thinks of bytes as unsigned (including the Unicode consortium) but Java wants
to be different....

So for most people, exactly the same pattern of bits would be described as:
0xE4 0xB8 0xAD
which is the UTF-8 encoding of a Unicode string consisting of a single
character:
U+4E2D
which is a character in the unified CJK area.

You can use unsigned values to initialise Java byte arrays:
byte[] b = { (byte)0xE4, (byte)0xB8, (byte)0xAD };
which is more verbose, but (IMO) a Hell of a lot clearer.

Also, when you are printing out byte values, if you want them to look like
unsigned values, you can write (for instance):

for (int i = 0; i < b.length; i++)
    System.out.println(b[i] & 0xFF);

-- chris
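
To make the step from 0xE4 0xB8 0xAD to U+4E2D concrete, here is a rough sketch that does the UTF-8 bit arithmetic by hand. The class and method names are hypothetical, and it handles only the three-byte form (1110xxxx 10xxxxxx 10xxxxxx); it does not check for overlong or surrogate values the way a real decoder would.

```java
class Utf8ByHand {
    // Decode one three-byte UTF-8 sequence into a Unicode code point.
    static int decodeThreeByte(byte b0, byte b1, byte b2) {
        int u0 = b0 & 0xFF; // mask to read the signed byte as 0..255
        int u1 = b1 & 0xFF;
        int u2 = b2 & 0xFF;
        if ((u0 & 0xF0) != 0xE0 || (u1 & 0xC0) != 0x80 || (u2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a three-byte UTF-8 sequence");
        }
        // 4 payload bits from the lead byte, 6 from each continuation byte.
        return ((u0 & 0x0F) << 12) | ((u1 & 0x3F) << 6) | (u2 & 0x3F);
    }

    public static void main(String[] args) {
        byte[] b = {-28, -72, -83};
        for (byte x : b) {
            System.out.printf("%d -> 0x%02X%n", x, x & 0xFF); // -28 -> 0xE4, etc.
        }
        System.out.printf("U+%04X%n", decodeThreeByte(b[0], b[1], b[2])); // U+4E2D
    }
}
```

So the lead byte contributes 0x4, the second byte 0x38 and the third 0x2D, which packed together give 0x4E2D.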
 
howachen

margie mago wrote:
Try:

String str = new String(b, "UTF-8");

You are right, but the question in my post was:

"What is the encoding of this String?" Is it Java's Unicode
representation? UTF-16, UTF-16LE?
 
howachen

Chris Uppal wrote:
Java's signed bytes are a pain in the arse. /Everyone/ else in the world
thinks of bytes as unsigned (including the Unicode consortium) but Java wants
to be different....

So for most people, exactly the same pattern of bits would be described as:
0xE4 0xB8 0xAD
which is the UTF-8 encoding of a Unicode string consisting of a single
character:
U+4E2D
which is a character in the unified CJK area.

You can use unsigned values to initialise Java byte arrays:
byte[] b = { (byte)0xE4, (byte)0xB8, (byte)0xAD };
which is more verbose, but (IMO) a Hell of a lot clearer.

Also, when you are printing out byte values, if you want them to look like
unsigned values, you can write (for instance):

for (int i = 0; i < b.length; i++)
    System.out.println(b[i] & 0xFF);

-- chris


GREAT!

Thanks!
 
Chris Uppal

Mike said:
"Encoding" in Java specifically means "way of representing 16-bit unicode
characters in 8-bit bytes". Characters in a Java string *are* 16-bit
unicode. In that sense, they're not encoded, because they're in their
native form.

I don't really like that way of looking at it -- I think it's misleading.
Here's how I see it:

There are two ways to think of Java Strings.

The first is the way that we are /supposed/ to be able to think about them, and
it is usually the best way. But, unfortunately, it is technically incorrect.
The second is technically correct but is harder to think about and may cause
confusion.

So here's the first way. Strings are collections of characters. Characters
are Unicode characters. And as such Strings and chars are pure Unicode data.
There is no "encoding" involved at all (since encoding is how you translate
pure Unicode data into sequences of bytes -- and Java's Strings are not
sequences of bytes). So you manipulate Strings and chars directly without
worrying about encodings (which are irrelevant). It's only when you want to
convert between Strings and sequences of bytes (e.g. writing to file) that you
have to consider what encoding to use (and you always /do/ have to consider it
since files don't hold Strings, but only sequences of bytes. If you want to
put Strings into a file then you /have/ to choose an encoding -- if you don't
then the system will choose one for you, which isn't often what you want it to
do).

That's the simple version of the story. Now the second version, which is
technically accurate, but much nastier.

Due to an unfortunate set of circumstances Java has hardwired the idea that
there are <= 2**16 Unicode characters. That assumption is incorrect. It is
unfortunate that Unicode didn't go public on that until a few months after Java
became set in stone (although there /must/ have been people working for Sun who
knew all about it long before that). It's even more unfortunate that the size
of a char /was/ set in stone; and very, very, unfortunate that instead of
responding to the problem instantly, the Java designers spent about a decade
apparently hoping that the problem would just go away by itself. It didn't and
instead the situation grew worse and worse...

Anyway, brickbats aside, what has happened is that since the 16-bit limit on a
char cannot be changed, Sun have been forced to redefine what a String /is/.
It is no longer considered to be "pure Unicode data", but is now considered to
be formally a sequence of 16-bit values which /encode/ a Unicode string using
UTF-16. So now, even though Strings are not sequences of bytes, it is now
technically correct to say that Java's Strings are encoded in UTF-16.

Fortunately, for many purposes, we can still use the simpler picture ("Strings
are pure Unicode"), since that works perfectly well provided we are only using
characters in the 16-bit range of Unicode (as the OP's example was). But if we
have to deal with characters outside that range, then we have to use the
second, more complicated, picture to understand what's going on.

-- chris
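
A small sketch of where the two pictures diverge, assuming a Java version with the code point API (Java 5 or later). The class and helper names are invented for illustration; U+20000 is just one example of a character outside the 16-bit range.

```java
class SurrogateDemo {
    // Build a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Count Unicode characters (code points) rather than 16-bit chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x4E2D);   // inside the 16-bit range
        String supp = fromCodePoint(0x20000); // outside the 16-bit range

        System.out.println(bmp.length() + " " + codePoints(bmp));   // 1 1
        System.out.println(supp.length() + " " + codePoints(supp)); // 2 1
        // The two chars are a UTF-16 surrogate pair:
        System.out.printf("%04X %04X%n",
                (int) supp.charAt(0), (int) supp.charAt(1));        // D840 DC00
        System.out.printf("%X%n", supp.codePointAt(0));             // 20000
    }
}
```

For the first string the simple picture holds: one char, one character. For the second, `length()` reports UTF-16 code units, so the simple picture no longer matches what the API returns.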
 
Chris Uppal

Mike said:
Ask yourself: how often do I want to do arithmetic on 8-bit quantities
(and thus sign-extend them when converting to 16 or 32 bits) vs. how
often I want to manipulate 8-bit octets which it would be idiotic to
sign-extend?

I have now asked ;-)

My answer (to myself) was that I can't remember /ever/ wanting to work with
values in the range -128..+127. (I did think I had one example -- dealing with
8-bit audio on Windows -- but when I checked it turned out that 8-bit is
handled specially: it uses unsigned for 8-bit, but signed integers for higher
resolution audio).

I'm not saying that no one anywhere has ever needed to do so, but if they have
then I would be interested to know what sort of programming problem produced
that requirement.

Values from a (domestic) fridge temperature probe perhaps ?

-- chris
 
Mike Schilling

Chris Uppal said:
Anyway, brickbats aside, what has happened is that since the 16-bit limit on a
char cannot be changed, Sun have been forced to redefine what a String /is/.
It is no longer considered to be "pure Unicode data", but is now considered to
be formally a sequence of 16-bit values which /encode/ a Unicode string using
UTF-16. So now, even though Strings are not sequences of bytes, it is now
technically correct to say that Java's Strings are encoded in UTF-16.

Though, not being byte-oriented, they're none of the usual UTF-16 encodings:
not LE, not BE, and no BOM. Converting a string to a byte array using
UTF-16 is *not* an identity transformation. So even if your last sentence
is technically correct, I think it will cause confusion.
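
As a quick illustration of that point, here is a sketch that converts the one-character string from the original post to bytes with each of the three UTF-16 charsets the JDK provides. The class name and the hex helper are hypothetical; the `StandardCharsets` constants are real.

```java
import java.nio.charset.StandardCharsets;

class Utf16Bytes {
    // Render bytes as unsigned two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // a single 16-bit char, 0x4E2D
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        // Plain "UTF-16" encodes big-endian and prepends a byte-order mark:
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D
    }
}
```

One char in the string becomes two bytes in an order you have to choose, or four bytes if a BOM is written, which is why the round trip is not an identity.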

Two notes:

1. AFAICT, the only change that would have to be made to Java to represent
all Unicode characters natively would be to change the size of char to three
bytes. To say it another way, if char hadn't been defined as a 2-byte
integer type in the first place, there would have been no difficulty
accommodating the extended Unicode range.

2. .NET, which came along much later, faithfully copied Java's mistakes in
this area.
 
Chris Uppal

Mike said:
Though, not being byte-oriented, they're none of the usual UTF-16
encodings: not LE, not BE, and no BOM. Converting a string to a byte
array using UTF-16 is *not* an identity transformation. So even if your
last sentence is technically correct, I think it will cause confusion.

Well, I /did/ warn that the technically correct picture was likely to be
confusing. There are two notions of encoding being used at the same time :-(

You probably know this, but for the record:

Unicode distinguishes between "encoding forms" and "encoding schemes". The
former are ways of representing Unicode data as sequences of logical integers
in some bounded range. These integers are called "code units". UTF-16 is an
encoding form using 16-bit code units. Encoding /schemes/, otoh, are the
physical representation of such encoded integers as sequences of bytes such as
can be written to file. For 8-bit encodings like UTF-8 there is no real need
to distinguish between encoding schemes and encoding forms, but for UTF-16 we
need to specify the byte order before we can translate 16-bit integers into
bytes. Hence there are two concrete encoding schemes, UTF-16BE and UTF-16LE,
with different byte orders. There's a third encoding scheme in that family,
which is also called "UTF-16" (with no adornment), which is used when either
the byte order is specified by a BOM, or where it is determined unambiguously
from context.

So, Java's strings are Unicode data represented in the encoding /form/ UTF-16
(which may be represented in physical RAM as UTF-16LE or UTF-16BE, as
determined by the machine architecture. But there's no reason to know or care
which, unless you are working with JNI or some-such -- and almost certainly not
even then).

BTW, I don't claim to be able to remember the singularly opaque Unicode
terminology for this stuff -- I had to go look it up....

1. AFAICT, the only change that would have to be made to Java to represent
all Unicode characters natively would be to change the size of char to
three bytes. To say it another way, if char hadn't been defined as a
2-byte integer type in the first place, there would have been no
difficulty accommodating the extended Unicode range.

Agreed.


2. .NET, which came along much later, faithfully copied Java's mistakes in
this area.

Odd that...

;-)

To be fair: the motivation may not have been a lemming-like urge to replicate
Java's little mistakes, but a lemming-like urge to maintain compatibility with
the Win32 APIs which .NET is supposed to make obsolete...

-- chris
 
