convert Java unicode escape to utf8

Discussion in 'Java' started by Jeff Higgins, Jul 6, 2007.

  1. Jeff Higgins

    Jeff Higgins Guest

    Hi,
    How can I convert a String containing a
    Java Unicode escape sequence to a String
    containing the equivalent UTF8 representation?

    For instance "\u4f55" -> "e4bd95"

    Thanks,
    Jeff Higgins
     
    Jeff Higgins, Jul 6, 2007
    #1

  2. Jeff Higgins

    SadRed Guest

    On Jul 6, 1:03 pm, "Jeff Higgins" <> wrote:
    > Hi,
    > How can I convert a String containing a
    > Java Unicode escape sequence to a String
    > containing the equivalent UTF8 representation?
    >
    > For instance "\u4f55" -> "e4bd95"
    >
    > Thanks,
    > Jeff Higgins


    See Unicode standard documentation.
    This might be handy for UTF-8 encoding:
    http://homepage1.nifty.com/algafield/core0.html
     
    SadRed, Jul 6, 2007
    #2

  3. Jeff Higgins

    bugbear Guest

    Jeff Higgins wrote:
    > Hi,
    > How can I convert a String containing a
    > Java Unicode escape sequence to a String
    > containing the equivalent UTF8 representation?
    >
    > For instance "\u4f55" -> "e4bd95"


    You mean a string containing the hex representation
    for the UTF-8 bytes encoding of the string?

    Or do you mean a byte array containing utf-8 bytes?

    In Java, a string contains "characters" (char values), which are
    UTF-16 code units.

    So a string never contains a "unicode escape sequence",
    it merely contains a character. It is the compiler
    which turns the escape sequence in your source code
    into a "true" string.

    BugBear
     
    bugbear, Jul 6, 2007
    #3
  4. Jeff Higgins

    bugbear Guest

    bugbear wrote:
    > Jeff Higgins wrote:
    >> Hi,
    >> How can I convert a String containing a
    >> Java Unicode escape sequence to a String
    >> containing the equivalent UTF8 representation?
    >>
    >> For instance "\u4f55" -> "e4bd95"

    >
    > You mean a string containing the hex representation
    > for the UTF-8 bytes encoding of the string?
    >
    > Or do you mean a byte array containing utf-8 bytes?


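    // needs java.io.ByteArrayOutputStream, java.io.OutputStreamWriter, java.nio.charset.Charset
    String str = "\u4f55";
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Charset cs1 = Charset.forName("UTF-8");
    OutputStreamWriter osw = new OutputStreamWriter(baos, cs1);
    osw.write(str);
    osw.flush(); // the writer buffers; flush before reading the bytes
    byte want[] = baos.toByteArray(); // e4 bd 95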

    (neither compiled nor tested)
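    For comparison, a shorter sketch of the same idea that skips the stream
    and asks the String for its bytes directly. It assumes Java 7 or later
    for StandardCharsets (on older JDKs, Charset.forName("UTF-8") should
    serve); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8Hex {
    // Encodes s as UTF-8 and returns the bytes as lowercase hex, two digits per byte.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // mask to 0..255 so negative bytes do not format as ffffffe4
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u4f55")); // prints e4bd95
    }
}
```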

    BugBear
     
    bugbear, Jul 6, 2007
    #4
  5. Jeff Higgins

    Roedy Green Guest

    On Fri, 6 Jul 2007 00:03:52 -0400, "Jeff Higgins"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >
    >For instance "\u4f55" -> "e4bd95"


    If by \u4f55 you mean a single 16-bit char, you just have to
    write to a Writer specifying UTF-8 as your encoding. See
    http://mindprod.com/applets/fileio.html for sample code.

    If by \u4f55 you mean six 8-bit ASCII characters, native2ascii
    will convert it to other encodings. See
    http://mindprod.com/jgloss/native2asciiexe.html and
    http://mindprod.com/jgloss/encoding.html
    for details.
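    If the six-ASCII-character reading is the one intended and the text is
    already in memory, the conversion can also be done in code. The sketch
    below is only an illustration: unescape is a hypothetical helper, it is
    deliberately simpler than what the compiler does (no doubled-backslash
    or repeated-'u' handling), and it throws NumberFormatException if the
    four characters after the 'u' are not hex digits.

```java
import java.nio.charset.StandardCharsets;

class Unescape {
    // Hypothetical helper: replaces each backslash-u-XXXX run with the char it names.
    static String unescape(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\u", i) && i + 6 <= text.length()) {
                // four hex digits follow the 'u'
                sb.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(text.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sixChars = "\\u4f55"; // backslash, 'u', '4', 'f', '5', '5'
        String oneChar = unescape(sixChars);
        for (byte b : oneChar.getBytes(StandardCharsets.UTF_8)) {
            System.out.print(String.format("%02x", b & 0xff));
        }
        System.out.println(); // the line printed is e4bd95
    }
}
```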

    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Jul 6, 2007
    #5
  6. Jeff Higgins

    Roedy Green Guest

    On Fri, 6 Jul 2007 00:03:52 -0400, "Jeff Higgins"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >How can I convert a String containing a
    >Java Unicode escape sequence to a String
    >containing the equivalent UTF8 representation?
    >
    >For instance "\u4f55" -> "e4bd95"


    If for some reason you wanted to roll your own utility, the code for
    UTF-8 reading and writing is at http://mindprod.com/jgloss/utf.html

    The code is primarily to help you understand the format.
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Jul 6, 2007
    #6
  7. Jeff Higgins

    Jeff Higgins Guest

    Jeff Higgins wrote:
    > Hi,
    > How can I convert a String containing a
    > Java Unicode escape sequence to a String
    > containing the equivalent UTF8 representation?
    >
    > For instance "\u4f55" -> "e4bd95"
    >
    > Thanks,
    > Jeff Higgins
    >


    Ok,
    Thanks everyone for the generous responses.
    SadRed for the pointer to the UTF8 definition.
    I found it kind of hard to follow at first, but
    now that I've found some code to follow along
    with, it's making more sense. Bugbear for the
    NIO example; as you can see, I struggle with basic
    IO, and now I need to understand wrapping and flipping.
    And Roedy, whose excellent mindprod site has been
    a continuing source of enlightenment. Thanks.

    Anyway,
    for anyone else who read my OP and was
    only able to shake their head in amazement at
    its utter incomprehensibility, here is what I
    had \really\ hoped to accomplish.

    How to encode a Unicode scalar value in UTF8?

    public class Encode
    {
        public static void main(String[] args)
        {
            int[] intArray = {0x4f55};
            // encode(int[]) is not defined here; see the article linked below
            byte[] byteArray = encode(intArray);
            for (byte b : byteArray)
            {
                System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
            }
        }
    }

    prints e4bd95

    where encode(int[]) is a method described at:
    <http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>
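    Since encode(int[]) is not shown in the post, here is a rough sketch of
    what such a method might look like. It is an assumption written from the
    UTF-8 bit layout, not the code from the linked article, and the class
    name is hypothetical.

```java
import java.io.ByteArrayOutputStream;

class Utf8Encoder {
    // Hypothetical stand-in for encode(int[]): one to four bytes per scalar value.
    static byte[] encode(int[] codePoints) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int cp : codePoints) {
            if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
                throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
            }
            if (cp < 0x80) {                 // 0xxxxxxx
                out.write(cp);
            } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode(new int[] {0x4f55})) {
            System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
        }
        System.out.println(); // the line printed is e4bd95
    }
}
```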
     
    Jeff Higgins, Jul 7, 2007
    #7
  8. Jeff Higgins wrote:
    > Jeff Higgins wrote:
    >> Hi,
    >> How can I convert a String containing a
    >> Java Unicode escape sequence to a String
    >> containing the equivalent UTF8 representation?
    >>
    >> For instance "\u4f55" -> "e4bd95"
    >>
    >> Thanks,
    >> Jeff Higgins
    >>

    >
    > Ok,
    > Thanks everyone for the generous responses.
    > SadRed for the pointer to the UTF8 definition.
    > I found it kind of hard to follow at first, but
    > now that I've found some code to follow along
    > with, it's making more sense. Bugbear for the
    > NIO example, as you can see I struggle with basic
    > IO now I need to understand wrapping and flipping.
    > And Roedy whose excellent mindprod site has been
    > a continuing source of enlightenment, Thanks.
    >
    > Anyway,
    > for anyone else who read my OP and was
    > only able to shake their head in amazement at
    > it's utter incomprehensibility, here is what I
    > had \really\ hoped to accomplish.
    >
    > How to encode a Unicode scalar value in UTF8?
    >
    > public class Encode
    > {
    > public static void main(String[] args)
    > {
    > int[] intArray = {0x4f55};
    > byte[] byteArray = encode(intArray);
    > for(byte b : byteArray)
    > {
    > System.out.print(Integer.toString((b & 0xff) + 0x100,
    > 16).substring(1));
    > }
    > }
    > }
    >
    > prints e4bd95
    >
    > where encode(int[]) is a method described at:
    > <http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>


    Ok, I found out what the & 0xff is for, but would you mind explaining
    why you add 0x100?

    H.
    --
    Hendrik Maryns
    http://tcl.sfs.uni-tuebingen.de/~hendrik/
    ==================
    http://aouw.org
    Ask smart questions, get good answers:
    http://www.catb.org/~esr/faqs/smart-questions.html
     
    Hendrik Maryns, Jul 11, 2007
    #8
  9. Jeff Higgins

    Jeff Higgins Guest

    Hendrik Maryns wrote:
    > Jeff Higgins wrote:
    >> Jeff Higgins wrote:
    >>> Hi,
    >>> How can I convert a String containing a
    >>> Java Unicode escape sequence to a String
    >>> containing the equivalent UTF8 representation?
    >>>
    >>> For instance "\u4f55" -> "e4bd95"
    >>>
    >>> Thanks,
    >>> Jeff Higgins
    >>>

    >>
    >> Ok,
    >> Thanks everyone for the generous responses.
    >> SadRed for the pointer to the UTF8 definition.
    >> I found it kind of hard to follow at first, but
    >> now that I've found some code to follow along
    >> with, it's making more sense. Bugbear for the
    >> NIO example, as you can see I struggle with basic
    >> IO now I need to understand wrapping and flipping.
    >> And Roedy whose excellent mindprod site has been
    >> a continuing source of enlightenment, Thanks.
    >>
    >> Anyway,
    >> for anyone else who read my OP and was
    >> only able to shake their head in amazement at
    >> it's utter incomprehensibility, here is what I
    >> had \really\ hoped to accomplish.
    >>
    >> How to encode a Unicode scalar value in UTF8?
    >>
    >> public class Encode
    >> {
    >> public static void main(String[] args)
    >> {
    >> int[] intArray = {0x4f55};
    >> byte[] byteArray = encode(intArray);
    >> for(byte b : byteArray)
    >> {
    >> System.out.print(Integer.toString((b & 0xff) + 0x100,
    >> 16).substring(1));
    >> }
    >> }
    >> }
    >>
    >> prints e4bd95
    >>
    >> where encode(int[]) is a method described at:
    >> <http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>

    >
    > Ok, I found out what the & 0xff is for, but mind explaining me why you
    > do + 0x100?
    >


    Well, quite frankly because Roedy Green told me to. Or rather showed
    the technique \somewhere\ on his mindprod site. I can't find it now. :(

    Boiled down, the code that produced the result follows.
    I have no idea how it works, except that it seems to produce the desired
    result.
    Now you have caused me to have to twiddle bits until I understand.

    Thanks,
    JH

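    public class Test
    {
        public static void main(String[] args)
        {
            int in = 0x4f55;
            // three-byte UTF-8 form: 1110xxxx 10xxxxxx 10xxxxxx (U+0800..U+FFFF only)
            byte[] out = new byte[3];
            out[0] = (byte)(in >> 12 | 0xE0);        // top 4 bits
            out[1] = (byte)(in >> 6 & 0x3F | 0x80);  // middle 6 bits
            out[2] = (byte)(in & 0x3F | 0x80);       // low 6 bits
            for (byte b : out)
            {
                // parenthesize (b & 0xff): + binds tighter than &
                System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
            }
        }
    }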
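
    One detail worth checking in the hex-printing expression: in Java, +
    binds more tightly than &, so b & 0xff + 0x100 parses as
    b & (0xff + 0x100). The sketch below (class and method names are
    hypothetical) compares the two spellings. For bytes with the high bit
    set, which includes every byte of a three-byte UTF-8 sequence, they
    happen to agree; for bytes below 0x80 the unparenthesized form loses
    its output.

```java
class PrecedenceDemo {
    // Parsed as b & 0x1ff: sign extension supplies bit 8 only for negative bytes.
    static String withoutParens(byte b) {
        return Integer.toString(b & 0xff + 0x100, 16).substring(1);
    }

    // Masks to 0..255 first, then adds 0x100 so the result always has three hex digits.
    static String withParens(byte b) {
        return Integer.toString((b & 0xff) + 0x100, 16).substring(1);
    }

    public static void main(String[] args) {
        byte high = (byte) 0xe4;
        byte low = (byte) 0x04;
        System.out.println("[" + withoutParens(high) + "] [" + withParens(high) + "]");
        System.out.println("[" + withoutParens(low) + "] [" + withParens(low) + "]");
    }
}
```

    This prints "[e4] [e4]" and then "[] [04]".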
     
    Jeff Higgins, Jul 11, 2007
    #9
  10. Jeff Higgins

    Jeff Higgins Guest

    Jeff Higgins wrote:
    > Hendrik Maryns wrote:
    >> Jeff Higgins wrote:
    >>> Jeff Higgins wrote:
    >>> How to encode a Unicode scalar value in UTF8?
    >>>
    >>> public class Encode
    >>> {
    >>> public static void main(String[] args)
    >>> {
    >>> int[] intArray = {0x4f55};
    >>> byte[] byteArray = encode(intArray);
    >>> for(byte b : byteArray)
    >>> {
    >>> System.out.print(Integer.toString((b & 0xff) + 0x100,
    >>> 16).substring(1));
    >>> }
    >>> }
    >>> }
    >>>
    >>> prints e4bd95
    >>>
    >>> where encode(int[]) is a method described at:
    >>> <http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html>

    >>
    >> Ok, I found out what the & 0xff is for, but mind explaining me why you
    >> do + 0x100?
    >>

    >
    > Well, quite frankly because Roedy Green told me to. Or rather showed
    > the technique \somewhere\ on his mindprod site. I can't find it now. :(
    >


    OK,
    Wish I could find it on mindprod site, but can't.
    Must have served another purpose.
    This works.

    System.out.println(Integer.toString((b & 0xff),16));

    > Boiled down, the code that produced the result follows.
    > I have no idea how it works, except that it seems to produce the desired
    > result.
    > Now you have caused me to have to twiddle bits until I understand.
    >
    > Thanks,
    > JH
    >
    > public class Test
    > {
    > public static void main(String[] args)
    > {
    > int in = 0x4f55;
    > byte[] out = new byte[3];
    > out[0] = (byte)(in >> 12 | 0xE0);
    > out[1] = (byte)(in >> 6 & 0x3F | 0x80);
    > out[2] = (byte)(in & 0x3F | 0x80);
    > for(byte b : out)
    > {
    > System.out.print(Integer.toString((b & 0xff + 0x100),
    > 16).substring(1));
    > }
    > }
    > }
    >
     
    Jeff Higgins, Jul 11, 2007
    #10
  11. Hendrik Maryns wrote:
    > Jeff Higgins wrote:

    [...]
    >> int[] intArray = {0x4f55};
    >> byte[] byteArray = encode(intArray);
    >> for(byte b : byteArray)
    >> {
    >> System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
    >> }

    [...]
    > Ok, I found out what the & 0xff is for, but mind explaining me why you
    > do + 0x100?

    I think it is for inserting the leading "0" for each byte less than
    0x10, which would be missing otherwise.

    For example: Suppose b = 4
    Then
    Integer.toString((b & 0xff), 16) gives "4",
    which is not what you want. You want "04".
    The missing leading "0" is produced by the tricky +0x100 and substring(1)
    Integer.toString((b & 0xff) + 0x100, 16) gives "104"
    Integer.toString((b & 0xff) + 0x100, 16).substring(1) gives "04"
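
    As a quick check of the above, this sketch runs the +0x100 trick over
    all 256 byte values and compares it with String.format("%02x", ...),
    which is another common way to get the leading zero. The class and
    method names are made up for the example.

```java
class HexPadDemo {
    // The trick from the thread: force three hex digits, then drop the leading 1.
    static String trick(byte b) {
        return Integer.toString((b & 0xff) + 0x100, 16).substring(1);
    }

    // The same result using a format width of two with zero padding.
    static String formatted(byte b) {
        return String.format("%02x", b & 0xff);
    }

    public static void main(String[] args) {
        for (int v = 0; v < 256; v++) {
            byte b = (byte) v;
            if (!trick(b).equals(formatted(b))) {
                throw new AssertionError("mismatch at " + v);
            }
        }
        byte four = 4;
        // unpadded on the left, padded on the right
        System.out.println(Integer.toString(four & 0xff, 16) + " " + trick(four));
    }
}
```

    The last line prints "4 04".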

    --
    Thomas
     
    Thomas Fritsch, Jul 11, 2007
    #11
  12. Jeff Higgins

    Roedy Green Guest

    On Wed, 11 Jul 2007 11:03:39 -0400, "Jeff Higgins"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >> Ok, I found out what the & 0xff is for, but mind explaining me why you
    >> do + 0x100?
    >>

    >
    >Well, quite frankly because Roedy Green told me to. Or rather showed
    >the technique \somewhere\ on his mindprod site. I can't find it now. :(


    It is a trick for forcing lead zeroes.
    see http://mindprod.com/jgloss/hex.html
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Jul 12, 2007
    #12
  13. Jeff Higgins

    Jeff Higgins Guest

    Jeff Higgins, Jul 12, 2007
    #13