Convert Java Unicode escape to UTF-8

Jeff Higgins

Hi,
How can I convert a String containing a
Java Unicode escape sequence to a String
containing the equivalent UTF8 representation?

For instance "\u4f55" -> "e4bd95"

Thanks,
Jeff Higgins
 
bugbear

Jeff said:
Hi,
How can I convert a String containing a
Java Unicode escape sequence to a String
containing the equivalent UTF8 representation?

For instance "\u4f55" -> "e4bd95"

You mean a string containing the hex representation
of the UTF-8 byte encoding of the string?

Or do you mean a byte array containing utf-8 bytes?

In Java, a string contains "characters" which are
UTF-16 code units.

So a string never contains a "unicode escape sequence",
it merely contains a character. It is the compiler
which turns the escape sequence in your source code
into a "true" string.
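To illustrate the distinction, here is a minimal sketch (the class and field names are made up for this example). The compiler replaces the escape before the program ever runs, so the first string holds a single character, while a doubled backslash yields six literal characters:

```java
public class EscapeDemo {
    // The compiler turns this escape into one char with value 0x4f55.
    static final String COMPILED = "\u4f55";
    // Doubled backslash: six literal chars (backslash, u, 4, f, 5, 5).
    static final String LITERAL = "\\u4f55";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());
        System.out.println(LITERAL.length());
        System.out.println(Integer.toHexString(COMPILED.charAt(0)));
    }
}
```

Only the second form would need parsing at run time; the first is already an ordinary character by the time the code executes.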

BugBear
 
bugbear

bugbear said:
You mean a string containing the hex representation
of the UTF-8 byte encoding of the string?

Or do you mean a byte array containing utf-8 bytes?

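import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

String str = "\u4f55";
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Charset cs1 = Charset.forName("UTF-8");
OutputStreamWriter osw = new OutputStreamWriter(baos, cs1);
osw.write(str);
osw.flush(); // without this the encoded bytes may still sit in the writer's buffer
byte[] want = baos.toByteArray();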

(neither compiled nor tested)
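For the hex string asked for in the original question, a shorter route is String.getBytes with an explicit charset. A minimal sketch, assuming Java 7 or later for StandardCharsets; the class and method names (Utf8Hex, toUtf8Hex) are invented for illustration:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Encodes the string as UTF-8 and returns the bytes as lowercase hex.
    static String toUtf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 first; %02x pads single-digit values with a zero.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf8Hex("\u4f55")); // prints e4bd95
    }
}
```

This avoids the stream and writer entirely, so there is nothing to flush or close.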

BugBear
 
Roedy Green

How can I convert a String containing a
Java Unicode escape sequence to a String
containing the equivalent UTF8 representation?

For instance "\u4f55" -> "e4bd95"

If for some reason you wanted to roll your own utility, the code for
UTF-8 reading and writing is at http://mindprod.com/jgloss/utf.html

The code is primarily to help you understand the format.
 
Jeff Higgins

Jeff said:
Hi,
How can I convert a String containing a
Java Unicode escape sequence to a String
containing the equivalent UTF8 representation?

For instance "\u4f55" -> "e4bd95"

Thanks,
Jeff Higgins

Ok,
Thanks everyone for the generous responses.
SadRed for the pointer to the UTF8 definition.
I found it kind of hard to follow at first, but
now that I've found some code to follow along
with, it's making more sense. Bugbear for the
NIO example; as you can see, I struggle with basic
IO, and now I need to understand wrapping and flipping.
And Roedy, whose excellent mindprod site has been
a continuing source of enlightenment. Thanks.

Anyway,
for anyone else who read my OP and was
only able to shake their head in amazement at
its utter incomprehensibility, here is what I
had really hoped to accomplish.

How to encode a Unicode scalar value in UTF8?

public class Encode
{
    public static void main(String[] args)
    {
        int[] intArray = {0x4f55};
        byte[] byteArray = encode(intArray);
        for (byte b : byteArray)
        {
            System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
        }
    }
}

prints e4bd95

where encode(int[]) is a method described at:
<http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>
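The linked article is not reproduced here. As a rough sketch of what such an encode(int[]) method could look like (this is an assumption written for illustration, not the article's code, and the class name Utf8Encoder is made up), each code point is split into 1 to 4 bytes according to its range:

```java
import java.io.ByteArrayOutputStream;

public class Utf8Encoder {
    // Encodes Unicode scalar values as UTF-8; rejects surrogates and out-of-range values.
    static byte[] encode(int[] codePoints) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codePoints) {
            if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
                throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
            }
            if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode(new int[] {0x4f55})) {
            System.out.print(String.format("%02x", b & 0xff));
        }
        System.out.println();
    }
}
```

ByteArrayOutputStream.write(int) stores only the low eight bits of its argument, which is why no explicit cast to byte is needed.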
 
Hendrik Maryns

Jeff Higgins wrote:
How to encode a Unicode scalar value in UTF8?

public class Encode
{
    public static void main(String[] args)
    {
        int[] intArray = {0x4f55};
        byte[] byteArray = encode(intArray);
        for (byte b : byteArray)
        {
            System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
        }
    }
}

prints e4bd95

where encode(int[]) is a method described at:
<http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>

Ok, I found out what the & 0xff is for, but would you mind explaining to me
why you do + 0x100?

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
Jeff Higgins

Hendrik said:
Ok, I found out what the & 0xff is for, but would you mind explaining to me
why you do + 0x100?

Well, quite frankly because Roedy Green told me to. Or rather showed
the technique somewhere on his mindprod site. I can't find it now. :(

Boiled down, the code that produced the result follows.
I have no idea how it works, except that it seems to produce the desired
result.
Now you have caused me to have to twiddle bits until I understand.

Thanks,
JH

public class Test
{
    public static void main(String[] args)
    {
        int in = 0x4f55;
        byte[] out = new byte[3];
        // Lead byte: 1110xxxx carries the top 4 bits.
        out[0] = (byte)(in >> 12 | 0xE0);
        // Continuation bytes: 10xxxxxx carry 6 bits each.
        out[1] = (byte)(in >> 6 & 0x3F | 0x80);
        out[2] = (byte)(in & 0x3F | 0x80);
        for (byte b : out)
        {
            // Parenthesize the mask: + binds tighter than &.
            System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
        }
    }
}
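To see how the three bytes come about, it can help to look at the bits. A 3-byte UTF-8 sequence has the pattern 1110xxxx 10xxxxxx 10xxxxxx, and the 16 bits of the code point fill the x positions from left to right (4 + 6 + 6). A small sketch that prints the pieces for 0x4f55; the class and helper names (BitLayout, bits) are made up for this example:

```java
public class BitLayout {
    // Returns the low `width` bits of value as a zero-padded binary string.
    static String bits(int value, int width) {
        String s = Integer.toBinaryString(value & ((1 << width) - 1));
        StringBuilder padded = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(s).toString();
    }

    public static void main(String[] args) {
        int in = 0x4f55;                       // 0100 111101 010101
        System.out.println(bits(in >> 12, 4)); // top 4 bits
        System.out.println(bits(in >> 6, 6));  // middle 6 bits
        System.out.println(bits(in, 6));       // low 6 bits
        // Prefix 1110 marks the lead byte; prefix 10 marks a continuation byte.
        System.out.println(bits(in >> 12 | 0xE0, 8));
        System.out.println(bits(in >> 6 & 0x3F | 0x80, 8));
        System.out.println(bits(in & 0x3F | 0x80, 8));
    }
}
```

The last three lines of output are the binary forms of e4, bd and 95.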
 
Jeff Higgins

Jeff said:
Well, quite frankly because Roedy Green told me to. Or rather showed
the technique \somewhere\ on his mindprod site. I can't find it now. :(

OK,
Wish I could find it on mindprod site, but can't.
Must have served another purpose.
This works.

System.out.println(Integer.toString((b & 0xff),16));
 
Thomas Fritsch

Hendrik said:
Jeff Higgins wrote: [...]
int[] intArray = {0x4f55};
byte[] byteArray = encode(intArray);
for(byte b : byteArray)
{
System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
}
[...]
Ok, I found out what the & 0xff is for, but would you mind explaining to me
why you do + 0x100?

I think it is for inserting the leading "0" for each byte less than
0x10, which would be missing otherwise.

For example: Suppose b = 4
Then
Integer.toString((b & 0xff), 16) gives "4",
which is not what you want. You want "04".
The missing leading "0" is produced by the tricky +0x100 and substring(1):
Integer.toString((b & 0xff) + 0x100, 16) gives "104"
Integer.toString((b & 0xff) + 0x100, 16).substring(1) gives "04"
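The same padding can also be had with String.format and the %02x conversion, which some find easier to read. A short sketch comparing the two (the class and method names are made up for this example). Note that the parentheses around b & 0xff matter: in Java, + binds more tightly than &, so an unparenthesized b & 0xff + 0x100 is parsed as b & 0x1ff and gives wrong output for bytes in the range 0x00 to 0x7f.

```java
public class HexPad {
    // The +0x100 trick: values 0x100..0x1ff always print as three hex digits,
    // so dropping the first digit leaves exactly two.
    static String viaOffset(byte b) {
        return Integer.toString((b & 0xff) + 0x100, 16).substring(1);
    }

    // Same result using a format string with zero padding to width 2.
    static String viaFormat(byte b) {
        return String.format("%02x", b & 0xff);
    }

    public static void main(String[] args) {
        byte[] samples = {4, 0x41, (byte) 0xe4};
        for (byte b : samples) {
            System.out.println(viaOffset(b) + " " + viaFormat(b));
        }
    }
}
```

Both methods return the same two-digit string for every byte value; the format version is slower but states its intent directly.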
 
