Slightly tricky string problem

Dirk Bruere at NeoPax · May 28, 2009

.... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a".
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code, i.e. "97".

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

Mike Schilling · May 28, 2009

Dirk said:
... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a".
I need to convert it to a String which is the decimal representation
of the UTF8 ascii code, i.e. "97".

If you know it's a single ASCII character,

String s_a = "a";
String s_b = Integer.toString((int)s_a.charAt(0));

This could be generalized to non-ASCII characters or multiple
characters, if I knew what the desired result was.
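
A minimal sketch of the same conversion with the "single ASCII character" assumption checked explicitly rather than taken on trust. The names `AsciiCode` and `toDecimal` are made up for illustration and do not come from the thread.

```java
final class AsciiCode {
    /**
     * Returns the decimal ASCII code of a one-character string, e.g. "a" -> "97".
     * Rejects anything that is not exactly one char in the range 0..127.
     */
    static String toDecimal(String s) {
        if (s.length() != 1 || s.charAt(0) > 127) {
            throw new IllegalArgumentException("expected a single ASCII character");
        }
        // The char is widened to int, so this yields its numeric value as text.
        return Integer.toString(s.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("a")); // prints 97
        System.out.println(toDecimal("A")); // prints 65
    }
}
```

Failing loudly on unexpected input is one possible choice; whether to throw, skip, or substitute depends on what the receiving side expects.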

Andrew Thompson · May 28, 2009

Cool. ..Did you have a question, or were you just
sharing that with us?

Dirk Bruere at NeoPax · May 28, 2009

Andrew said:
Cool. ..Did you have a question, or were you just
sharing that with us?

Er... the Q was "how do I do it (neatly)".
Since it's 04:52 here my brain is running on empty.

Dirk Bruere at NeoPax · May 28, 2009

Mike said:
If you know it's a single ASCII character,

String s_a = "a";
String s_b = Integer.toString((int)s_a.charAt(0));

This could be generalized to non-ASCII characters or multiple
characters, if I knew what the desired result was.

Thanks - I'll work with that.

Mayeul · May 28, 2009

Mark said:
I don't think this actually gives UTF-8, just Java's internal Unicode,
whatever that happens to be.

As indicated in java.lang.Character javadoc, a char value represents a
Unicode code point in the BMP.

So, for characters in the BMP, Java's Unicode is just plain expected
Unicode.

As for characters outside the BMP, you would need two Java chars to
represent them, in a UTF-16 way.

Conclusion: as long as we're speaking ASCII, the given method works.
Outside ASCII but still in the BMP, the given method will produce the
character's code point.
But one might wonder what "UTF-8 ascii code" is, and what to do with
non-ASCII characters, as they would be represented in more than one byte
in UTF-8.
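
To make the distinction concrete, here is a small sketch comparing a character's code point with the number of bytes it takes in UTF-8. The class and method names are hypothetical, and `StandardCharsets` assumes Java 7 or later, which is newer than this thread.

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsUtf8 {
    /** Unicode code point of the first character of s. */
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    /** Number of bytes s occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u00E9", "\u20AC" }; // a, e-acute, euro sign
        for (String s : samples) {
            System.out.println(codePoint(s) + " -> " + utf8Length(s) + " UTF-8 byte(s)");
        }
        // prints:
        // 97 -> 1 UTF-8 byte(s)
        // 233 -> 2 UTF-8 byte(s)
        // 8364 -> 3 UTF-8 byte(s)
    }
}
```

Only for the ASCII sample do "the code point" and "the UTF-8 byte" coincide as a single number.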

charlesbos73 · May 28, 2009

... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a".
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code, i.e. "97".

What did you do to try to solve your problem?

As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.

ASCII is a code defining 128 entities, each usually stored in
8 bits with the most significant bit set to 0. But in any
case "ASCII the characters" should not be confused with "ASCII the
encoding".

The same goes for Unicode.

Unicode defines many more entities (called codepoints).
The first 128 Unicode codepoints are the 128 ASCII entities.

UTF-8 is an encoding designed so that any byte with the most
significant bit set to 0 is an ASCII entity.

So a UTF-8 encoded file containing only ASCII characters is
byte-for-byte identical to an ASCII encoded file.

But in your case, if you have a String [sic] you shouldn't
care at all about encoding details: UTF-8 or little faeries
wearing boots drawing you characters using magical powder has
no importance.

Things get quickly messy in Java because when Java was created
Unicode didn't define codepoints outside the BMP. So we end
up with a backward compatible charAt(..) method that is broken
beyond repair because it definitely does NOT give back the
character at 'x' when you have a String that contains characters
outside the BMP.

That said, all hope is not lost, for we now have the codePointAt(..)
method, which works correctly for codepoints outside the BMP, as
shown in the example below:

// JUnit 4: needs org.junit.Test and a static import of org.junit.Assert.assertEquals
@Test public void tests() {
    assertEquals( "0", Integer.toString("\u0000".codePointAt(0)) );
    // Java offers no easy way to source code encode, say, U+1040B (dec 66571)
    assertEquals( "66571", Integer.toString("\uD801\uDC0B".codePointAt(0)) ); // 0x1040B (hex)
    assertEquals( "97", Integer.toString("a".codePointAt(0)) );
}

If you're curious as to how to do what Integer.toString(..) does
you can look at the source code for the Integer class.

Note that Integer.toString(int) works as expected on
entities outside the BMP:

Integer.toString("\uD801\uDC0B".codePointAt(0))

gives back the expected "66571" string.
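
For a string of more than one character, the same idea can be sketched as a loop that steps by code point rather than by char. `CodePoints` and `toDecimal` are illustrative names, not part of any library.

```java
final class CodePoints {
    /** Decimal code points of every character in s, separated by spaces. */
    static String toDecimal(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(cp);
            i += Character.charCount(cp); // 1 for a BMP char, 2 for a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD801\uDC0Bb";            // 'a', U+1040B, 'b'
        System.out.println(s.length());         // prints 4 (UTF-16 code units)
        System.out.println((int) s.charAt(1));  // prints 55297 (0xD801, a high surrogate)
        System.out.println(toDecimal(s));       // prints 97 66571 98
    }
}
```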

By now you can expect the "JLS-nazi bot" (that shall recognize
itself) to nitpick on grammatical mistakes and claim loud
that Java is perfect and that the fact that we have both a
(broken) charAt(..) method and codePointAt(..) is not a
problem at all.

But as usual the "JLS-nazi bot"'s deranged ramblings shall be
sent to /dev/null without any consideration.

charlesbos73 · May 28, 2009

As indicated in java.lang.Character javadoc, a char value represents a
Unicode code point in the BMP.

So, for characters in the BMP, Java's Unicode is just plain expected
Unicode.

As for characters outside the BMP, you would need two Java chars to
represent them, in a UTF-16 way.

Conclusion: as long as we're speaking ASCII, the given method works.
Outside ASCII but still in the BMP, the given method will produce the
character's code point.

I wholeheartedly agree with your post.

Minor remark: it only happens to produce the character's codepoint
in the BMP because it's taking the first character of the string.
Had it been charAt(1), or any index other than 0, it's not guaranteed
to work even for a BMP character (because if, say, the first character
of the string is outside the BMP, charAt(1) returns the second half of
its surrogate pair).

But by simply replacing charAt by codePointAt, the method will produce
the character's codepoint even if it's outside the BMP. For a
'character' that is not the first of the string this also works, as
long as the index passed is the char index where that character starts
(codePointAt indexes by char, not by codepoint).
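
One caveat worth sketching: the index given to codePointAt is still counted in chars, so asking for "the second character" needs the index translated first. The helper below uses String.offsetByCodePoints for that; `NthCodePoint` and `codePointAtCharacter` are hypothetical names.

```java
final class NthCodePoint {
    /** Code point of the n-th character (0-based, counted in code points). */
    static int codePointAtCharacter(String s, int n) {
        int index = s.offsetByCodePoints(0, n); // translate to a char index
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String s = "\uD801\uDC0Ba"; // U+1040B followed by 'a'
        System.out.println(s.codePointAt(1));           // prints 56331 (0xDC0B, the low surrogate)
        System.out.println(codePointAtCharacter(s, 1)); // prints 97
        System.out.println(codePointAtCharacter(s, 0)); // prints 66571
    }
}
```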

But one might wonder what "UTF-8 ascii code" is, and what to do with
non-ASCII characters, as they would be represented in more than one byte
in UTF-8.

exactly

Mike Schilling · May 28, 2009

Mark said:
I don't think this actually gives UTF-8, just Java's internal Unicode,
whatever that happens to be.

It's the same for ASCII characters (<=127), which is why I said, in
the part you clipped, that this works only for them.

Mayeul · May 28, 2009

Mark said:
I think the OP wants UTF-8, not the UTF-16 code point. I'm assuming his
request for "ASCII" was a misstatement. Outside of the first 128
characters, charAt(int) won't yield UTF-8.

If I had to guess I'd think the OP is confusing UTF-8 with Unicode, and
describes "the number associated with a character" when saying "UTF-8
ascii code". But that is a guess.

I would rather point out the fact that we don't actually know what the
OP meant.

I also wanted to point out that "whatever Java's internal Unicode is" is
actually plainly expected Unicode in the BMP. Therefore Mike's
suggestion *might* have been correct. Yours too, only the OP could
possibly know.

Mark Space · May 28, 2009

Mayeul said:
I also wanted to point out that "whatever Java's internal Unicode is" is
actually plainly expected Unicode in the BMP. Therefore Mike's

Well, I think Java's internal encoding used to be UCS-2 but is now
UTF-16. The two are different, and depending on exactly which JVM you
have, the encoding might be neither I suppose. Just pointing out that
you really can't rely 100% on those internal codes.

This doesn't affect the first 128 code points of course, but I think
charAt(int) is too brittle unless you're certain of the source. Given
the OP mixed the term "UTF-8" in there, I'd rather show him the most
robust method.

Dirk Bruere at NeoPax · May 28, 2009

Mayeul said:
If I had to guess I'd think the OP is confusing UTF-8 with Unicode, and
describes "the number associated with a character" when saying "UTF-8
ascii code". But that is a guess.

I would rather point out the fact that we don't actually know what the
OP meant.

What I meant is the values listed here
http://www.asciitable.com/

Dirk Bruere at NeoPax · May 28, 2009

charlesbos73 said:
... which I'm having trouble getting my head around.

I have a String which is a single character, e.g. "a".
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code, i.e. "97".

What did you do to try to solve your problem?

As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.

ASCII is a code defining 128 entities, each usually stored in
8 bits with the most significant bit set to 0. But in any
case "ASCII the characters" should not be confused with "ASCII the
encoding".

The same goes for Unicode.

Unicode defines many more entities (called codepoints).
The first 128 Unicode codepoints are the 128 ASCII entities.

UTF-8 is an encoding designed so that any byte with the most
significant bit set to 0 is an ASCII entity.

So a UTF-8 encoded file containing only ASCII characters is
byte-for-byte identical to an ASCII encoded file.

But in your case, if you have a String [sic] you shouldn't
care at all about encoding details: UTF-8 or little faeries
wearing boots drawing you characters using magical powder has
no importance.

It is when I have a protocol that interfaces with a machine that only
accepts ASCII encoded strings. So UTF8 would be a good starting point.

Things get quickly messy in Java because when Java was created
Unicode didn't define codepoints outside the BMP. So we end
up with a backward compatible charAt(..) method that is broken
beyond repair because it definitely does NOT give back the
character at 'x' when you have a String that contains characters
outside the BMP.

That said, all hope is not lost, for we now have the codePointAt(..)
method, which works correctly for codepoints outside the BMP, as
shown in the example below:

// JUnit 4: needs org.junit.Test and a static import of org.junit.Assert.assertEquals
@Test public void tests() {
    assertEquals( "0", Integer.toString("\u0000".codePointAt(0)) );
    // Java offers no easy way to source code encode, say, U+1040B (dec 66571)
    assertEquals( "66571", Integer.toString("\uD801\uDC0B".codePointAt(0)) ); // 0x1040B (hex)
    assertEquals( "97", Integer.toString("a".codePointAt(0)) );
}

If you're curious as to how to do what Integer.toString(..) does
you can look at the source code for the Integer class.

Note that Integer.toString(int) works as expected on
entities outside the BMP:

Integer.toString("\uD801\uDC0B".codePointAt(0))

gives back the expected "66571" string.

By now you can expect the "JLS-nazi bot" (that shall recognize
itself) to nitpick on grammatical mistakes and claim loud
that Java is perfect and that the fact that we have both a
(broken) charAt(..) method and codePointAt(..) is not a
problem at all.

But as usual the "JLS-nazi bot"'s deranged ramblings shall be
sent to /dev/null without any consideration.

Thanks.
Right now my problem is lack of full definition of the protocol, so I'll
have to return to this later.

Mike Schilling · May 28, 2009

Mark said:
Well, I think Java's internal encoding used to be UCS-2 but is now
UTF-16. The two are different, and depending on exactly which JVM you
have, the encoding might be neither I suppose. Just pointing out that
you really can't rely 100% on those internal codes.

Older versions of Java didn't support surrogates; current ones do. (I
don't know where the dividing line is.) If a code point is in the
BMP, its Java "char" value didn't change between the two. If a code
point is outside the BMP, it couldn't be represented by these older
versions of Java. In neither case did a preexisting value change.
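
As an illustration of that point, the sketch below builds strings from code points with Character.toChars: a BMP code point becomes one char, a code point outside the BMP becomes a surrogate pair. `Surrogates` and `fromCodePoint` are names invented here.

```java
final class Surrogates {
    /** Builds a String holding the single code point cp. */
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x00E9);   // inside the BMP
        String supp = fromCodePoint(0x1040B); // outside the BMP
        System.out.println(bmp.length());     // prints 1
        System.out.println(supp.length());    // prints 2
        System.out.println(supp.equals("\uD801\uDC0B"));               // prints true
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // prints true
    }
}
```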

Mark Space · May 28, 2009

Dirk said:
What I meant is the values listed here
http://www.asciitable.com/

What happens if the string contains characters that are outside that range?

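// Needs import java.util.Arrays; getBytes(String) throws UnsupportedEncodingException.
String s = "\u0080";  // first code point above the ASCII range
System.out.println((int)s.charAt(0));  // the char's numeric value
System.out.println(Arrays.toString(s.getBytes("UTF-8")));  // its UTF-8 bytes (signed)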

run:
128
[-62, -128]
BUILD SUCCESSFUL (total time: 0 seconds)

Dirk Bruere at NeoPax · May 28, 2009

Mark said:
Dirk said:

What I meant is the values listed here
http://www.asciitable.com/

What happens if the string contains characters that are outside that range?

String s = "\u0080";
System.out.println((int)s.charAt(0));
System.out.println(Arrays.toString(s.getBytes("UTF-8")));

run:
128
[-62, -128]
BUILD SUCCESSFUL (total time: 0 seconds)

I don't know - that's tomorrow's problem:-(

Mark Space · May 28, 2009

Dirk said:
I don't know - that's tomorrow's problem:-(

It doesn't have to be a problem. getBytes() works as well as charAt()
for ASCII values, and can return proper utf-8 for other values. And
it's just as easy to implement, imo.
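
A rough sketch of the getBytes() route, producing decimal strings for each UTF-8 byte. Java bytes are signed, which is why 0x80 and above print as negative numbers; masking with 0xFF gives the 0..255 values. `Utf8Decimal` and `toDecimal` are illustrative names, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Decimal {
    /** Decimal values (0..255) of the UTF-8 bytes of s, separated by spaces. */
    static String toDecimal(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(b & 0xFF); // mask the signed byte to get 0..255
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("a"));      // prints 97
        System.out.println(toDecimal("\u0080")); // prints 194 128
    }
}
```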

Mark Space · May 28, 2009

Dirk said:
It is when I have a protocol that interfaces with a machine that only
accepts ASCII encoded strings. So UTF8 would be a good starting point.

Right now my problem is lack of full definition of the protocol, so I'll
have to return to this later.

And reading this, I think I should point out that there's a lot more
character encodings available to getBytes() besides UTF-8.

getBytes("ASCII");

will I believe reject any characters that are out of range for ASCII. I
don't recall what it does if the character is out of range (throws an
error? Replaces it with a "?" That's what Java docs are for) but that
sounds safer if you really don't want to deal with non-ASCII values.
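
For what it's worth, String.getBytes(Charset) is documented to replace unmappable characters with the charset's replacement bytes, which for US-ASCII is "?", so nothing is rejected. If rejection is what's wanted, one option is to go through a CharsetEncoder, which reports such characters by default. The sketch below assumes Java 7+ for `StandardCharsets`; `StrictAscii` and `encode` are made-up names, and wrapping the failure in IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StrictAscii {
    /** Encodes s as US-ASCII, throwing if any character is outside 0..127. */
    static byte[] encode(String s) {
        try {
            // A fresh encoder reports unmappable characters instead of replacing them.
            ByteBuffer buf = StandardCharsets.US_ASCII.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not pure ASCII: " + s, e);
        }
    }

    public static void main(String[] args) {
        // Lenient: the non-ASCII char silently becomes '?' (63).
        byte[] lenient = "a\u0080".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(lenient));     // prints [97, 63]
        System.out.println(Arrays.toString(encode("a"))); // prints [97]
        try {
            encode("a\u0080");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");               // prints rejected
        }
    }
}
```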

Dirk Bruere at NeoPax · May 28, 2009

Mark said:
It doesn't have to be a problem. getBytes() works as well as charAt()
for ASCII values, and can return proper utf-8 for other values. And
it's just as easy to implement, imo.

I'll read all the replies again a bit later and get to understand it
properly. Meanwhile, I have to do a bit of tedious but straightforward
coding to "show progress". Another deadline approaching.
