how do I expand a unicode string to its visual UTF8 representation?

Discussion in 'Java' started by Andrew, Aug 6, 2009.

  1. Andrew

    Andrew Guest

    Hello,

    I have an example program below that contains weird Icelandic
    characters, and a copyright symbol, just for good measure. The code
    expresses these as UTF8. They print exactly as you would want/expect
    them to. So far so good. But what I want is to be able to go the other
    way. I want to take a unicode string and recreate the escape sequences
    for the funny international characters. For example, the single
    character E-acute should be expanded to \u00C9 (6 characters). Any
    ideas on how to do this please?

    public class UTF8Test {
        public UTF8Test() {
        }

        public String getString() {
            StringBuilder builder = new StringBuilder();
            builder.append("Copyright \u00A9 2009\n");
            builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
            builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
            return builder.toString();
        }

        public static void main(String[] args) {
            UTF8Test test = new UTF8Test();
            System.out.println(test.getString());
        }
    }

    FWIW, the reason I want to do this is I need to write strings like
    this to a sybase table where the column is of type varchar. We cannot
    make it univarchar (don't ask). So I need to be able to write unicode
    characters without using unicode chars! I thought by having them in
    this expanded form java can convert them just like the program above
    does.

    Regards,

    Andrew Marlow
     
    Andrew, Aug 6, 2009
    #1

  2. Andrew wrote:
    > Hello,
    >
    > I have an example program below that contains weird Icelandic
    > characters, and a copyright symbol, just for good measure. The code
    > expresses these as UTF8. They print exactly as you would want/expect
    > them to. So far so good. But what I want is to be able to go the other
    > way. I want to take a unicode string and recreate the escape sequences
    > for the funny international characters.For example, the single
    > character E-acute should be expanded to \u00C9 (6 characters). Any
    > ideas on how to do this please?
    >
    > public class UTF8Test {
    > public UTF8Test() {
    > }
    >
    > public String getString() {
    > StringBuilder builder = new StringBuilder();
    > builder.append("Copyright \u00A9 2009\n");
    > builder.append("Here is the phrase (in Icelandic): I can eat glass
    > and it doesn't hurt me\n");
    > builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    > mei\u00F0a mig");
    > return builder.toString();
    > }
    >
    > public static void main(String[] args) {
    > UTF8Test test = new UTF8Test();
    > System.out.println(test.getString());
    > }
    > }
    >
    > FWIW, the reason I want to do this is I need to write strings like
    > this to a sybase table where the column is of type varchar. We cannot
    > make it univarchar (don't ask). So I need to be able to write unicode
    > characters without using unicode chars! I thought by having them in
    > this expanded form java can convert them just like the program above
    > does.
    >
    > Regards,
    >
    > Andrew Marlow


    public class UTF8Test {
        public UTF8Test() {
        }

        public void doit() {
            StringBuilder builder = new StringBuilder();
            builder.append("Copyright \u00A9 2009\n");
            builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
            builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
            String str = builder.toString();

            System.out.println(str);

            byte[] buf = str.getBytes();
            for (byte b : buf)
                System.out.printf("\\u%04x",b);
        }

        public static void main(String[] args) {
            UTF8Test test = new UTF8Test();
            test.doit();
        }
    }

    C:\Documents and Settings\Knute Johnson>java UTF8Test
    Copyright ⌠2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
    ╔g get eti≡ gler ßn ■ess a≡ mei≡a mig
    \u0043\u006f\u0070\u0079\u0072\u0069\u0067\u0068\u0074\u0020\u00a9\u0020\u0032\u
    0030\u0030\u0039\u000a\u0048\u0065\u0072\u0065\u0020\u0069\u0073\u0020\u0074\u00
    68\u0065\u0020\u0070\u0068\u0072\u0061\u0073\u0065\u0020\u0028\u0069\u006e\u0020
    \u0049\u0063\u0065\u006c\u0061\u006e\u0064\u0069\u0063\u0029\u003a\u0020\u0049\u
    0020\u0063\u0061\u006e\u0020\u0065\u0061\u0074\u0020\u0067\u006c\u0061\u0073\u00
    73\u0020\u0061\u006e\u0064\u0020\u0069\u0074\u0020\u0064\u006f\u0065\u0073\u006e
    \u0027\u0074\u0020\u0068\u0075\u0072\u0074\u0020\u006d\u0065\u000a\u00c9\u0067\u
    0020\u0067\u0065\u0074\u0020\u0065\u0074\u0069\u00f0\u0020\u0067\u006c\u0065\u00
    72\u0020\u00e1\u006e\u0020\u00fe\u0065\u0073\u0073\u0020\u0061\u00f0\u0020\u006d
    \u0065\u0069\u00f0\u0061\u0020\u006d\u0069\u0067

    --

    Knute Johnson
    email s/nospam/knute2009/

    --
    Posted via NewsDemon.com - Premium Uncensored Newsgroup Service
    ------->>>>>>http://www.NewsDemon.com<<<<<<------
    Unlimited Access, Anonymous Accounts, Uncensored Broadband Access
     
    Knute Johnson, Aug 6, 2009
    #2

  3. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 6 Aug, 17:02, Knute Johnson wrote:
    > Andrew wrote:
    > > Hello,

    >
    > > I have an example program below that contains weird Icelandic
    > > characters, and a copyright symbol, just for good measure. The code
    > > expresses these as UTF8. They print exactly as you would want/expect
    > > them to. So far so good. But what I want is to be able to go the other
    > > way. I want to take a unicode string and recreate the escape sequences
    > > for the funny international characters.For example, the single
    > > character E-acute should be expanded to \u00C9 (6 characters). Any
    > > ideas on how to do this please?


    > C:\Documents and Settings\Knute Johnson>java UTF8Test
    > Copyright ⌠2009
    > Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
    > ╔g get eti≡ gler ßn ■ess a≡ mei≡a mig
    > \u0043\u006f\u0070\u0079\u0072\u0069\u0067\u0068\u0074\u0020\u00a9\u0020\u0032\u
    > 0030\u0030\u0039\u000a\u0048\u0065\u0072\u0065\u0020\u0069\u0073\u0020\u0074\u00
    > 68\u0065\u0020\u0070\u0068\u0072\u0061\u0073\u0065\u0020\u0028\u0069\u006e\u0020
    > \u0049\u0063\u0065\u006c\u0061\u006e\u0064\u0069\u0063\u0029\u003a\u0020\u0049\u
    > 0020\u0063\u0061\u006e\u0020\u0065\u0061\u0074\u0020\u0067\u006c\u0061\u0073\u00
    > 73\u0020\u0061\u006e\u0064\u0020\u0069\u0074\u0020\u0064\u006f\u0065\u0073\u006e
    > \u0027\u0074\u0020\u0068\u0075\u0072\u0074\u0020\u006d\u0065\u000a\u00c9\u0067\u
    > 0020\u0067\u0065\u0074\u0020\u0065\u0074\u0069\u00f0\u0020\u0067\u006c\u0065\u00
    > 72\u0020\u00e1\u006e\u0020\u00fe\u0065\u0073\u0073\u0020\u0061\u00f0\u0020\u006d
    > \u0065\u0069\u00f0\u0061\u0020\u006d\u0069\u0067


    Well, thanks for the quick reply, but that hasn't quite worked, has it?
    All the chars have come out as \uxxxx. I want the ones that are 7 bit
    ASCII to come out as the normal printable char, i.e. I want the output
    of doit to be:

    Copyright \u00A9 2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
    me
    \u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig
     
    Andrew, Aug 6, 2009
    #3
  4. Andrew

    Arne Vajhøj Guest

    Andrew wrote:
    > I have an example program below that contains weird Icelandic
    > characters, and a copyright symbol, just for good measure. The code
    > expresses these as UTF8. They print exactly as you would want/expect
    > them to. So far so good. But what I want is to be able to go the other
    > way. I want to take a unicode string and recreate the escape sequences
    > for the funny international characters.For example, the single
    > character E-acute should be expanded to \u00C9 (6 characters). Any
    > ideas on how to do this please?
    >
    > public class UTF8Test {
    > public UTF8Test() {
    > }
    >
    > public String getString() {
    > StringBuilder builder = new StringBuilder();
    > builder.append("Copyright \u00A9 2009\n");
    > builder.append("Here is the phrase (in Icelandic): I can eat glass
    > and it doesn't hurt me\n");
    > builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    > mei\u00F0a mig");
    > return builder.toString();
    > }
    >
    > public static void main(String[] args) {
    > UTF8Test test = new UTF8Test();
    > System.out.println(test.getString());
    > }
    > }
    >
    > FWIW, the reason I want to do this is I need to write strings like
    > this to a sybase table where the column is of type varchar. We cannot
    > make it univarchar (don't ask). So I need to be able to write unicode
    > characters without using unicode chars! I thought by having them in
    > this expanded form java can convert them just like the program above
    > does.


    The specific question asked can be solved with something like:

    public static String encode(String s) {
        StringBuffer sb = new StringBuffer("");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ((c >= 0) && (c <= 127)) {
                // 7-bit ASCII passes through unchanged
                sb.append(c);
            } else {
                // anything else becomes a backslash-u escape, zero-padded to four hex digits
                String hex = Integer.toHexString(c);
                sb.append("\\u" + "0000".substring(hex.length(), 4) + hex);
            }
        }
        return sb.toString();
    }

    But decoding it will also require some work, because the unescaping
    in your code is done at compile time, not at runtime.
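
    To illustrate the decoding side, here is a minimal sketch of a runtime
    decoder. The class name EscapeDecoder and its methods are hypothetical,
    not part of the JDK, and the sketch assumes the stored text uses exactly
    the \uXXXX form with four hex digits. It does not try to distinguish an
    escape from a backslash that was literally present in the original text.

```java
class EscapeDecoder {
    // Hypothetical helper: converts literal backslash-u escapes (six characters
    // each) back into the single char they name. Other text passes through.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 6 <= s.length()
                    && s.charAt(i + 1) == 'u' && isHex4(s, i + 2)) {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    // True if the four characters starting at 'from' are all hex digits.
    private static boolean isHex4(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslash makes this a runtime string of literal escapes.
        String stored = "\\u00C9g get eti\\u00F0 gler";
        System.out.println(decode(stored));
    }
}
```

    Malformed sequences are copied through unchanged rather than rejected,
    which is a design choice of this sketch, not a requirement.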

    And 1 code point -> 6 bytes is not a very efficient encoding.

    Assuming your VARCHAR supports 0-255 then you should be able
    to store your UTF-8 bytes as ISO-8859-1.

    A bit messy but more efficient space wise and less code.
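
    A rough sketch of that idea follows. The class name Latin1Carrier and its
    pack/unpack methods are made up for illustration, and StandardCharsets
    needs Java 7 or later (on older JVMs the charset names "UTF-8" and
    "ISO-8859-1" can be passed as strings instead). It assumes the column and
    the JDBC driver really do hand back every value 0-255 untouched, which
    would need checking against the actual Sybase setup.

```java
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // Encodes the text as UTF-8, then reinterprets each byte as one
    // ISO-8859-1 character, so every char in the result is in 0-255.
    static String pack(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses pack(): recovers the bytes, then decodes them as UTF-8.
    static String unpack(String stored) {
        byte[] utf8 = stored.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C9g get eti\u00F0 gler";
        String stored = pack(original);
        System.out.println(stored.length());                   // longer than the original
        System.out.println(unpack(stored).equals(original));   // round trip
    }
}
```

    The stored form looks like mojibake to anyone reading the table
    directly, which is the "messy" part.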

    Alternatively you could look at Quoted Printable but that
    will also have overhead.

    Arne
     
    Arne Vajhøj, Aug 6, 2009
    #4
  5. Andrew

    Mayeul Guest

    Andrew wrote:
    > Hello,
    >
    > I have an example program below that contains weird Icelandic
    > characters, and a copyright symbol, just for good measure. The code
    > expresses these as UTF8. They print exactly as you would want/expect
    > them to. So far so good. But what I want is to be able to go the other
    > way. I want to take a unicode string and recreate the escape sequences
    > for the funny international characters.For example, the single
    > character E-acute should be expanded to \u00C9 (6 characters). Any
    > ideas on how to do this please?


    > public class UTF8Test {
    > public UTF8Test() {
    > }
    >
    > public String getString() {
    > StringBuilder builder = new StringBuilder();
    > builder.append("Copyright \u00A9 2009\n");
    > builder.append("Here is the phrase (in Icelandic): I can eat glass
    > and it doesn't hurt me\n");
    > builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    > mei\u00F0a mig");
    > return builder.toString();
    > }
    >
    > public static void main(String[] args) {
    > UTF8Test test = new UTF8Test();
    > System.out.println(test.getString());
    > }
    > }


    You might want to read up on UTF-8, as something like \u00C9 has absolutely
    nothing to do with UTF-8. It is the Java escape notation, which lets you
    represent a character by its Unicode code point in hexadecimal.
    Nothing to do with UTF-8. A lot to do with UTF-16, though.

    As a side note, please be aware that Java Strings are sequences of Java
    char values. Char values are unsigned and 16-bit, which is not enough to
    hold characters with a Unicode code point above U+FFFF. Such characters
    are therefore encoded as a combination of two Java chars, in the same
    way UTF-16 works.
    This won't impact what you're trying to do though, since UTF-16 uses
    surrogate characters that are still non-ASCII for characters above
    U+FFFF. Their correct escape sequence is the horrible \uAAAA\uBBBB, the
    escape sequences of the surrogates. Not addressing the issue at all will
    automagically produce the desired results.
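
    A small sketch of the surrogate point, assuming a char-by-char escaper
    along the lines already posted in this thread (SurrogateDemo and
    escapeNonAscii are illustrative names only). U+1D11E, the musical G clef,
    is used here as an example of a character above U+FFFF.

```java
class SurrogateDemo {
    // Escapes every char above 127; chars are UTF-16 code units, so a
    // supplementary character is seen as two separate surrogate chars.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point above U+FFFF, held as two chars in the String.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());          // 2
        System.out.println(escapeNonAscii(clef));   // two escapes: D834 then DD1E
    }
}
```

    Because both surrogates fall outside 0-127, the loop escapes them
    without any special handling.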


    As for how to encode to or decode from such a format, I don't know of
    any direct way, but Knute and Arne showed it should be rather
    straightforward.

    > FWIW, the reason I want to do this is I need to write strings like
    > this to a sybase table where the column is of type varchar. We cannot
    > make it univarchar (don't ask). So I need to be able to write unicode
    > characters without using unicode chars!


    I recommend you store them encoded in UTF-7 or quoted-printable, then.
    This will be more efficient and more standard than what you're trying to
    do, and libraries will do it for you.

    > I thought by having them in
    > this expanded form java can convert them just like the program above
    > does.


    As far as I know, you were wrong when thinking that.

    --
    Mayeul
     
    Mayeul, Aug 6, 2009
    #5
  6. Re: how do I expand a unicode string to its visual UTF8 representation?

    Andrew wrote:
    > Well, thanks for the quick reply, but that hasn't quite worked has it?
    > All the chars have come out as \uxxxx. I want the ones that are 7 bit
    > ASCII to come out as the normal printable char, i.e I want the
    > output of doit to be:
    >
    > Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
    > glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
    > \u00FEess a\u00F0 mei\u00F0a mig


    Well I figured since you had a fairly sophisticated question and
    appeared to have some knowledge of Java that you could figure out how to
    use the 'if' statement yourself. Oh and just so you don't complain that
    I used lower case hex, I fixed that too.

    C:\Documents and Settings\Knute Johnson>java UTF8Test
    Copyright ⌠2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
    ╔g get eti≡ gler ßn ■ess a≡ mei≡a mig
    Copyright \u00A9 2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
    \u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

    public class UTF8Test {
        public UTF8Test() {
        }

        public void doit() {
            StringBuilder builder = new StringBuilder();
            builder.append("Copyright \u00A9 2009\n");
            builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
            builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
            String str = builder.toString();

            System.out.println(str);

            byte[] buf = str.getBytes();
            for (byte b : buf) {
                if ((b & 0x80) == 0)
                    System.out.print(new String(new byte[] { b }));
                else
                    System.out.printf("\\u%04X",b);
            }
        }

        public static void main(String[] args) {
            UTF8Test test = new UTF8Test();
            test.doit();
        }
    }

    --

    Knute Johnson
    email s/nospam/knute2009/

     
    Knute Johnson, Aug 6, 2009
    #6
  7. Andrew

    Roedy Green Guest

    On Thu, 6 Aug 2009 08:03:59 -0700 (PDT), Andrew
    wrote, quoted or indirectly quoted
    someone who said:

    > I want to take a unicode string and recreate the escape sequences
    >for the funny international characters.For example, the single
    >character E-acute should be expanded to \u00C9 (6 characters). Any
    >ideas on how to do this please?


    Another way of formulating your question is: how do I take some
    Unicode-16 data in RAM and write it out in 8-bit Icelandic encoding or
    possibly UTF-8 encoding?

    See http://mindprod.com/applet/file.html

    See http://mindprod.com/jgloss/encoding.html
    to find the name of the possible Icelandic encodings.

    See http://mindprod.com/applet/encodingrecogniser.html
    to help you figure out which Icelandic encoding your sample is using.

    P.S. none of these codes is "visual". Turning these codes to glyphs is
    the job of the font. See
    http://mindprod.com/jgloss/font.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "Let us pray it is not so, or if it is, that it will not become widely known."
    ~ Wife of the Bishop of Exeter on hearing of Darwin's theory of the common descent of humans and apes.
     
    Roedy Green, Aug 6, 2009
    #7
  8. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 6 Aug, 17:24, Mayeul wrote:
    > Andrew wrote:
    > > Hello,

    >
    > > I have an example program below that contains weird Icelandic
    > > characters, and a copyright symbol, just for good measure. The code
    > > expresses these as UTF8. They print exactly as you would want/expect
    > > them to. So far so good. But what I want is to be able to go the other
    > > way. I want to take a unicode string and recreate the escape sequences
    > > for the funny international characters.


    >
    > You might want to read on UTF-8, as something like \u00C9 has absolutely
    > nothing to do with UTF-8. It is the Java escape notation which enables
    > to represent a character with its Unicode code point as hexadecimal.
    > Nothing to do with UTF-8. A lot to do with UTF-16, though.


    Yes, ahem, you're right.

    > As for how to do encode to or decode from such a format, I don't know of
    > any direct way, but Knute and Arne showed it should be rather
    > straightforward.


    I am not sure about those solutions. Don't I need to convert the
    internal representation to something specific first, like UTF8? Or is
    there a formal definition of the internal representation where no
    explicit encoding is given?

    >
    > > FWIW, the reason I want to do this is I need to write strings like
    > > this to a sybase table where the column is of type varchar. We cannot
    > > make it univarchar (don't ask). So I need to be able to write unicode
    > > characters without using unicode chars!

    >
    > I recommand you store them encoded in UTF-7 or quoted-printable, then.
    > This will be more efficient and more standard than what you're trying to
    > do, and libraries will do it for you.


    If I store the data in a varchar as this:

    Copyright \u00A9 2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
    me
    \u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

    then java will do the work of conversion for me automatically.
    That's why I need to move in the other direction first.

    > > I thought by having them in
    > > this expanded form java can convert them just like the program above
    > > does.

    >
    > As far as I know, you were wrong when thinking that.


    I think I am right. When the \uxxxx strings are in a file and I read
    them in, printing gives the correct result. Therefore reading from a
    varchar should also give the correct result.
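
    Whether that holds depends on what does the reading. As a hedged
    illustration: the Java compiler translates \uXXXX in source code, and
    java.util.Properties decodes the same notation when loading a properties
    file, but a plain Reader or a JDBC getString returns the six characters
    as they are. The class and method names below are invented for this
    sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class EscapeTiming {
    // Simulates text that arrived at runtime (from a file or a varchar
    // column): a backslash, the letter u, and four hex digits.
    static String runtimeText() {
        return "\\u00C9";
    }

    // Properties.load is one JDK entry point that decodes such escapes.
    static String viaProperties(String raw) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("key=" + raw));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return p.getProperty("key");
    }

    public static void main(String[] args) {
        System.out.println(runtimeText().length());                  // 6
        System.out.println(viaProperties(runtimeText()).length());   // 1
    }
}
```

    If the file in question was a .properties file or Java source, that
    would explain why the text appeared to convert itself.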
     
    Andrew, Aug 6, 2009
    #8
  9. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 6 Aug, 18:11, Roedy Green wrote:
    > On Thu, 6 Aug 2009 08:03:59 -0700 (PDT), Andrew
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    > > I want to take a unicode string and recreate the escape sequences
    > >for the funny international characters.For example, the single
    > >character E-acute should be expanded to \u00C9 (6 characters). Any
    > >ideas on how to do this please?

    >
    > Another way of formulating your question is how to I take some
    > Unicode-16 data in RAM and write it out in 8-bit Icelandic encoding or
    > possibly UTF-8 encoding.


    No, that is not my question. Icelandic was just an example. The point
    is the data contains international characters. I don't know what
    language the text will be in and I don't care. I just need to be able
    to write it to the database without losing information but I cannot
    make the column univarchar (for reasons I won't go into here).

    > See http://mindprod.com/applet/encodingrecogniser.html
    > To help you figure out which Icelandic encoding you sample is using.


    This is not the problem (but I appreciate the thought though....).

    >
    > P.S. none of these codes is "visual". Turning these codes to glyphs is
    > the job of the font. See http://mindprod.com/jgloss/font.html


    By visual I meant NOT binary. I.e. I do not want to get to the raw bit
    pattern that represents E-acute, I want the single char that is E-
    acute to be mapped to 6 bytes of the form \uxxxx that is the
    equivalent.

    > --
    > Roedy Green Canadian Mind Products http://mindprod.com


    -Andrew M.
     
    Andrew, Aug 6, 2009
    #9
  10. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 6 Aug, 17:46, Knute Johnson wrote:
    > Andrew wrote:
    >
    >   > Well, thanks for the quick reply, but that hasn't quite worked has it?
    >
    > > All the chars have come out as \uxxxx. I want the ones that are 7 bit
    > >  ASCII to come out as the normal printable char, i.e I want the
    > > output of doit to be:

    >
    > > Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
    > > glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
    > > \u00FEess a\u00F0 mei\u00F0a mig

    >
    > Well I figured since you had a fairly sophisticated question and
    > appeared to have some knowledge of Java that you could figure out how to
    > use the 'if' statement yourself.  Oh and just so you don't complain that
    > I used lower case hex, I fixed that too.


    >      public void doit() {
    >      StringBuilder builder = new StringBuilder();
    >      builder.append("Copyright \u00A9 2009\n");
    >      builder.append("Here is the phrase (in Icelandic): I can eat glass
    > and it doesn't hurt me\n");
    >      builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    > mei\u00F0a mig");
    >      String str = builder.toString();
    >
    >      System.out.println(str);
    >
    >      byte[] buf = str.getBytes();
    >      for (byte b : buf) {
    >          if ((b & 0x80) == 0)
    >              System.out.print(new String(new byte[] { b }));
    >          else
    >              System.out.printf("\\u%04X",b);
    >      }
    >
    > }


    I do appreciate you trying to help but I'm afraid that code does not
    do the job. When I run it, this is what I get:

    Copyright \u00C2\u00A9 2009
    Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
    me
    \u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
    \u00C3\u00

    For example, the copyright symbol comes out as 00C2 when I expect
    00A9. The E-acute comes out as 00C3 where I expect 00C9.
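
    The C2 A9 and C3 89 pairs are what UTF-8 produces for U+00A9 and U+00C9,
    which suggests the default charset on this machine is UTF-8, since
    getBytes() with no argument uses the platform default. A short sketch of
    the difference, with CharsetBytes and hex as illustrative names and
    StandardCharsets assuming Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

class CharsetBytes {
    // Formats bytes as space-separated two-digit upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";
        // Same character, different byte sequences per encoding.
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));       // C2 A9
        System.out.println(hex(copyright.getBytes(StandardCharsets.ISO_8859_1)));  // A9
        // The char value itself does not depend on any encoding.
        System.out.println(String.format("%04X", (int) copyright.charAt(0)));      // 00A9
    }
}
```

    Working from charAt rather than getBytes avoids the dependence on the
    platform charset altogether.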

    -Andrew Marlow
     
    Andrew, Aug 6, 2009
    #10
  11. Andrew

    Roedy Green Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On Thu, 6 Aug 2009 10:25:17 -0700 (PDT), Andrew
    wrote, quoted or indirectly quoted
    someone who said:

    >By visual I meant NOT binary. I.e. I do not want to get to the raw bit
    >pattern that represents E-acute, I want the single char that is E-
    >acute to be mapped to 6 bytes of the form \uxxxx that is the
    >equivalent.


    If you store international characters in a database, you can do any of
    the following

    1. hand the database 16 bit Unicode and leave it up to it to convert
    them to some compact form.

    2. hand the database UTF-8. Tell the database you are giving it UTF-8
    or raw bytes.

    3. hand the database some other national encoding. Tell the database
    you are giving it that encoding or raw bytes.

    The problem is data in files is not self-identifying. HTTP has headers
    to let you know the encoding though.

    The usual way to handle a mixture of languages is to store 16-bit
    Unicode in the database.

    You said a few things that suggest you may have missed some of the
    basics about encodings. See http://mindprod.com/jgloss/encoding.html
    to fill in the holes.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

     
    Roedy Green, Aug 6, 2009
    #11
  12. Andrew

    markspace Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    Andrew wrote:

    > No, that is not my question. Icelandic was just an example. The point
    > is the data contains international characters. I don't know what
    > language the text will be in and I don't care.



    This is a big problem. If you don't know what the encoding is, you have
    binary, not text. You have to decode the text into Java strings or you
    aren't going to be able to do anything with them, really.

    If you're just storing binary as a string (which you are), consider base
    64 encoding. It's easy to do and will always work. You should be able
    to find some source code to do this, it's not hard to roll your own either.
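
    For reference, a minimal Base64 round trip might look like the sketch
    below. It relies on java.util.Base64, which only exists from Java 8
    onward, so on older JVMs a third-party or hand-rolled codec would be
    needed instead. Base64Column is a hypothetical class name, and encoding
    the text as UTF-8 first is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Column {
    // Text -> UTF-8 bytes -> Base64, which contains only ASCII characters.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 -> UTF-8 bytes -> text.
    static String decode(String stored) {
        return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00C9g get eti\u00F0 gler";
        String stored = encode(original);
        System.out.println(stored);
        System.out.println(decode(stored).equals(original));
    }
}
```

    The trade-off is that the stored value is no longer readable, even for
    the parts that were plain ASCII.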

    If you must write your own \u encoder and decoder, don't forget that you
    should probably encode the range from 0 to 31 as well as the range from
    128 to 255. Plus you'll have to encode the \ char too, or reading
    things back is going to be a pain.
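
    One way those points could be handled is sketched below. SafeEscaper is
    a hypothetical name. The sketch escapes control characters, the
    backslash itself, and everything from 127 upward (a slightly wider range
    than 128 to 255, since a Java char can go up to 65535). The decoder
    assumes its input came from encode and does no validation.

```java
class SafeEscaper {
    // Leaves printable ASCII alone, except the backslash; everything else
    // becomes a backslash, the letter u, and four upper-case hex digits.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean plain = c >= 32 && c <= 126 && c != '\\';
            if (plain) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    // Inverse of encode(). Every backslash in encoded text starts a
    // six-character escape, so no lookahead checks are needed here.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "C:\\temp\n\u00C9g get eti\u00F0";
        String stored = encode(original);
        System.out.println(stored);
        System.out.println(decode(stored).equals(original));
    }
}
```

    Escaping the backslash is what makes the decoder unambiguous: text that
    already contained something resembling an escape survives the round trip.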


    > I just need to be able
    > to write it to the database without losing information but I cannot
    > make the column univarchar (for reasons I won't go into here).



    I don't know of any built in class that does this. You'll have to roll
    your own, I think.


    > By visual I meant NOT binary. I.e. I do not want to get to the raw bit
    > pattern that represents E-acute, I want the single char that is E-
    > acute to be mapped to 6 bytes of the form \uxxxx that is the
    > equivalent.



    You don't know "equivalent" unless you know what encoding you started
    with, however.

    Once you have the encoding, you can make a Java string, then do

    byte[] binary = string.getBytes( "UTF-8" );

    to encode the string into UTF-8 binary, but then you still have to store
    the binary.



    Just curious: what is driving the need for this "\u + UTF-8" encoding?
    Is some other program reading the strings in this format? Or did you
    just think it was a good idea and decide to encode these strings like
    this on your own?
     
    markspace, Aug 6, 2009
    #12
  13. Re: how do I expand a unicode string to its visual UTF8 representation?

    Andrew wrote:
    > On 6 Aug, 17:46, Knute Johnson <> wrote:
    >> Andrew wrote:
    >>
    >> > Well, thanks for the quick reply, but that hasn't quite worked has it?

    >>
    >>> All the chars have come out as \uxxxx. I want the ones that are 7 bit
    >>> ASCII to come out as the normal printable char, i.e I want the
    >>> output of doit to be:
    >>> Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
    >>> glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
    >>> \u00FEess a\u00F0 mei\u00F0a mig

    >> Well I figured since you had a fairly sophisticated question and
    >> appeared to have some knowledge of Java that you could figure out how to
    >> use the 'if' statement yourself. Oh and just so you don't complain that
    >> I used lower case hex, I fixed that too.

    >
    >> public void doit() {
    >> StringBuilder builder = new StringBuilder();
    >> builder.append("Copyright \u00A9 2009\n");
    >> builder.append("Here is the phrase (in Icelandic): I can eat glass
    >> and it doesn't hurt me\n");
    >> builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    >> mei\u00F0a mig");
    >> String str = builder.toString();
    >>
    >> System.out.println(str);
    >>
    >> byte[] buf = str.getBytes();
    >> for (byte b : buf) {
    >> if ((b & 0x80) == 0)
    >> System.out.print(new String(new byte[] { b }));
    >> else
    >> System.out.printf("\\u%04X",b);
    >> }
    >>
    >> }

    >
    > I do appreciate you trying to help but I'm afraid that code does not
    > do the job. When I run it, this is what I get:
    >
    > Copyright \u00C2\u00A9 2009
    > Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
    > me
    > \u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
    > \u00C3\u00
    >
    > For example, the copyright symbol comes out as 00C2 when I expect
    > 00A9. The E-acute comes out as 00C3 where I expect 00C9.
    >
    > -Andrew Marlow
    >


    You saw it worked on my computer. So yours must be using a different
    character set. You will have to adjust for that.

    --

    Knute Johnson
    email s/nospam/knute2009/

     
    Knute Johnson, Aug 6, 2009
    #13
  14. Andrew

    Tom Anderson Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On Thu, 6 Aug 2009, Arne Vajhøj wrote:

    > Andrew wrote:
    >> I have an example program below that contains weird Icelandic
    >> characters, and a copyright symbol, just for good measure.

    >
    > Alternatively you could look at Quoted Printable but that will also have
    > overhead.


    Andrew, you should totally use quoted-printable (extended to 16- rather
    than 8-bit values). Your unicode escape scheme is madness.

    tom

    --
    The sunlights differ, but there is only one darkness. -- Ursula K. LeGuin,
    'The Dispossessed'
     
    Tom Anderson, Aug 7, 2009
    #14
  15. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 7 Aug, 00:26, Tom Anderson wrote:
    > On Thu, 6 Aug 2009, Arne Vajhøj wrote:
    > > Andrew wrote:
    > >> I have an example program below that contains weird Icelandic
    > >> characters, and a copyright symbol, just for good measure.

    >
    > Andrew, you should totally use quoted-printable (extended to 16- rather
    > than 8-bit values). Your unicode escape scheme is madness.
    >
    > tom


    Er, why? I am only using the same escaping convention that java itself
    uses. My example program shows the correct international text being
    output when the java convention for escaping such characters is
    employed.
     
    Andrew, Aug 7, 2009
    #15
  16. Andrew

    Andrew Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    On 6 Aug, 23:15, Knute Johnson wrote:
    > Andrew wrote:
    > > On 6 Aug, 17:46, Knute Johnson <> wrote:
    > >> Andrew wrote:

    >
    > >>   > Well, thanks for the quick reply, but that hasn't quite worked has it?

    >
    > >>> All the chars have come out as \uxxxx. I want the ones that are 7 bit
    > >>>  ASCII to come out as the normal printable char, i.e I want the
    > >>> output of doit to be:
    > >>> Copyright \u00A9 2009 Here is the phrase (in Icelandic): I can eat
    > >>> glass and it doesn't hurt me \u00C9g get eti\u00F0 gler \u00E1n
    > >>> \u00FEess a\u00F0 mei\u00F0a mig
    > >> Well I figured since you had a fairly sophisticated question and
    > >> appeared to have some knowledge of Java that you could figure out how to
    > >> use the 'if' statement yourself.  Oh and just so you don't complain that
    > >> I used lower case hex, I fixed that too.

    >
    > >>      public void doit() {
    > >>      StringBuilder builder = new StringBuilder();
    > >>      builder.append("Copyright \u00A9 2009\n");
    > >>      builder.append("Here is the phrase (in Icelandic): I can eat glass
    > >> and it doesn't hurt me\n");
    > >>      builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0
    > >> mei\u00F0a mig");
    > >>      String str = builder.toString();

    >
    > >>      System.out.println(str);

    >
    > >>      byte[] buf = str.getBytes();
    > >>      for (byte b : buf) {
    > >>          if ((b & 0x80) == 0)
    > >>              System.out.print(new String(new byte[] { b }));
    > >>          else
    > >>              System.out.printf("\\u%04X",b);
    > >>      }

    >
    > >> }

    >
    > > I do appreciate you trying to help but I'm afraid that code does not
    > > do the job. When I run it, this is what I get:

    >
    > > Copyright \u00C2\u00A9 2009
    > > Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
    > > me
    > > \u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
    > > \u00C3\u00

    >
    > > For example, the copyright symbol comes out as 00C2 when I expect
    > > 00A9. The E-acute comes out as 00C3 where I expect 00C9.

    >
    > > -Andrew Marlow

    >
    > You saw it worked on my computer.  So yours must be using a different
    > character set.  You will have to adjust for that.
    >

    Indeed, this is what I suspected and this is part of my point.
    Whatever solution I wind up with, it needs to be platform-independent.
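
The bytes in the output quoted above (C2 A9 for the copyright sign, C3 89 for E-acute) are the UTF-8 encoding of those characters, which is what `getBytes()` yields when the platform default charset is UTF-8. A minimal sketch of a charset-independent variant, assuming the goal is to leave 7-bit ASCII alone and escape everything else, walks the string's `char` values instead of its bytes. The names `UnicodeEscaper` and `escape` are hypothetical, and the sketch does not escape literal backslashes, so the result is ambiguous if the input already contains text that looks like an escape.

```java
final class UnicodeEscaper {

    // Returns a copy of s in which every char above 0x7F is replaced by a
    // backslash, the letter u and four upper-case hex digits.
    // ASCII chars (including newlines) are copied through unchanged.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // The cast matters: %X does not accept a Character argument.
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literals below are translated by the compiler, so escape()
        // receives the real characters and turns them back into escapes.
        System.out.println(escape("Copyright \u00A9 2009"));
        System.out.println(escape("\u00C9g get eti\u00F0 gler"));
    }
}
```

Because no bytes are involved, the result does not depend on the default charset. Characters outside the Basic Multilingual Plane come out as two escapes (one per surrogate), which matches how Java source files write them.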
     
    Andrew, Aug 7, 2009
    #16
  17. Andrew

    Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    What Java (brokenly) uses internally to represent String
    shouldn't concern you.

    Java was conceived with Unicode 3.0 in mind, when there
    were fewer than 65536 'codepoints'.

    Remember that you're not *ever* forced to use the broken
    'char' primitive, which does *not* represent a character
    anymore since Unicode 3.1 came out.

    Java 1.5's String codePointAt(int index) is the method that
    correctly returns a character (as an int code point), and is
    commented as doing just that in the (correct) Javadoc.

    The (broken beyond repair) charAt(int i) method is only
    there for backward compatibility and shall continue to
    mislead programmers into thinking it does actually return
    a character. The Javadoc clearly states that it returns
    a char.

    I don't care if internally Java uses UCS-2 and broken
    chars to represent Unicode strings or the color of
    moonboots little faeries are wearing.

    What is important is the abstraction the String class
    is offering.

    charAt is there for backward compatibility reasons and
    is as broken as the char primitive (the whole concept
    of primitives being disputable in an OO language anyway
    btw).

    codePointAt is the method to get characters.
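
To make the charAt/codePointAt distinction concrete, here is a small sketch, assuming a string that contains one character outside the Basic Multilingual Plane. The class name `CodePointDemo` and the helper `codePoints` are hypothetical; the String and Character methods used have been in the JDK since 1.5.

```java
final class CodePointDemo {

    // Collects the code points of s, stepping by one or two chars as needed.
    static int[] codePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef), built at run time so that this
        // source file stays plain ASCII.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());            // number of char units
        System.out.println(codePoints(clef).length);  // number of code points
        System.out.printf("U+%X%n", clef.codePointAt(0));
    }
}
```

For this string the char count is 2 and the code point count is 1, which is the gap the post is complaining about. For the Icelandic text in this thread the two counts are the same, since all of its characters fit in a single char.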

    Now, if you want to have an ASCII Java source file containing
    Unicode characters (for String or in comments), you'll have
    to use the creative (but broken) \uXXXX escaping but this is
    another Java weirdity that should not pollute
    the DB you're using.

    If you really need to escape your Unicode string in your DB
    then at least don't pollute your DB with Java-specific
    weirdities.

    I 100% agree with Mayeul.

    \uXXXX escaping has exactly *nothing* to do with UTF-8.

    A Unicode character is a Unicode character and the broken
    internal representation that Java uses to store Unicode
    strings and the broken Java char primitive (and overall
    broken primitive concept in an OO language) should be
    of no concern to you.

    The only thing that counts is the abstraction that the
    String class is offering (dropping the broken methods
    present for backward compatibility), not the internal
    representation that the JVM is using.

    Who's going to query that Sybase DB? Only your Java
    app?

    Non-Java apps are going to use that DB? How are they
    going to deal with the escaping scheme you'll come
    up with?

    Reproducing in your DB the \uXXXX escaping is
    IMHO definitely not the way to go.
     
    , Aug 7, 2009
    #17
  18. Andrew

    Arne Vajhøj Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    Andrew wrote:
    > On 7 Aug, 00:26, Tom Anderson <> wrote:
    >> On Thu, 6 Aug 2009, Arne Vajhøj wrote:
    >>> Andrew wrote:
    >>>> I have an example program below that contains weird Icelandic
    >>>> characters, and a copyright symbol, just for good measure.

    >> Andrew, you should totally use quoted-printable (extended to 16- rather
    >> than 8-bit values). Your unicode escape scheme is madness.

    >
    > Er, why? I am only using the same escaping convention that java itself
    > uses.


    Actually you are not.

    You are doing runtime processing using that syntax.

    Java uses that syntax at compile time.

    That is a significant difference.
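
A short sketch of that difference, assuming nothing beyond the standard compiler behaviour: a Unicode escape in source code is replaced by javac before the program runs, while the same six characters arriving at run time (from a file or a database column, say) are just six characters. The class and field names are hypothetical.

```java
final class EscapeTiming {

    // The compiler translates this escape, so the string holds one char.
    static final String COMPILED = "\u00C9";

    // The doubled backslash stops the translation, so this string holds the
    // six chars a program would see when it reads the escape from a DB.
    static final String RAW = "\\u00C9";

    public static void main(String[] args) {
        System.out.println(COMPILED.length()); // 1
        System.out.println(RAW.length());      // 6
        System.out.println(COMPILED.equals(RAW)); // false
    }
}
```

Nothing in the JDK converts the second form into the first automatically when a string is read at run time; the application has to parse it.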

    Arne
     
    Arne Vajhøj, Aug 7, 2009
    #18
  19. Andrew

    Arne Vajhøj Guest

    Re: how do I expand a unicode string to its visual UTF8 representation?

    wrote:
    > \uXXXX escaping has exactly *nothing* to do with UTF-8.


    Correct.

    > A Unicode character is a Unicode character and the broken
    > internal representation that Java uses to store Unicode
    > strings and the broken Java char primitive (and overall
    > broken primitive concept in an OO language) should be
    > of no concern to you.


    Java uses the same concept as other widely used languages.

    > Non Java-apps are going to use that DB? How are they
    > going to deal with the escaping scheme you'll come
    > with?


    The exact same way Java would. Parse it.

    If the other language is of C heritage, then the
    code would almost be the same.
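
As an illustration of "parse it", here is a minimal sketch of the reverse step, assuming the stored text uses a backslash, the letter u and exactly four hex digits per escaped char. The names `UnicodeUnescaper` and `unescape` are hypothetical. Anything that does not match that pattern is copied through untouched, and, as with the escaping side, there is no provision for literal backslashes.

```java
final class UnicodeUnescaper {

    // Replaces each backslash-u escape followed by four hex digits with the
    // char it denotes. Malformed or truncated escapes are left as they are.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 6 <= s.length()
                    && s.charAt(i + 1) == 'u' && isHex(s, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // True if every char of s in [from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The argument holds literal escapes; the result holds real chars.
        String restored = unescape("\\u00C9g get eti\\u00F0 gler");
        System.out.println(restored.length()); // 16 chars after unescaping
    }
}
```

The loop is a handful of character comparisons, so a port to C or another C-family language would look much the same.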

    Arne
     
    Arne Vajhøj, Aug 7, 2009
    #19
  20. Andrew

    Arne Vajhøj Guest

    Tom Anderson wrote:
    > On Thu, 6 Aug 2009, Arne Vajhøj wrote:
    >> Andrew wrote:
    >>> I have an example program below that contains weird Icelandic
    >>> characters, and a copyright symbol, just for good measure.

    >>
    >> Alternatively you could look at Quoted Printable but that will also
    >> have overhead.

    >
    > Andrew, you should totally use quoted-printable (extended to 16- rather
    > than 8-bit values).


    I would suggest standard QP on UTF-8 encoding instead of a custom QP.
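
A rough sketch of what quoted-printable over UTF-8 could look like, under loud assumptions: it only does the byte-to-`=XX` mapping and deliberately omits the soft line breaks, line-length limit and trailing-whitespace rules of the full format, so it is not a complete RFC 2045 encoder. The names `QuotedPrintableSketch` and `encode` are hypothetical, and `StandardCharsets` needs Java 7 or later (on older JDKs `getBytes("UTF-8")` is the equivalent).

```java
import java.nio.charset.StandardCharsets;

final class QuotedPrintableSketch {

    // Encodes s as UTF-8 and writes each byte either literally (printable
    // ASCII other than '=', plus space) or as '=' and two hex digits.
    // Simplification: no soft line breaks and no special handling of
    // whitespace at the end of a line.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean literal = (v >= 33 && v <= 126 && v != '=') || v == ' ';
            if (literal) {
                out.append((char) v);
            } else {
                out.append(String.format("=%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Copyright \u00A9 2009")); // Copyright =C2=A9 2009
        System.out.println(encode("\u00C9g get"));           // =C3=89g get
    }
}
```

The charset is named explicitly, so the output is the same on every platform, and the result is plain ASCII that would fit in a varchar column. Each non-ASCII Latin-1 character costs six characters here, the same as the escape scheme discussed earlier in the thread.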

    > Your unicode escape scheme is madness.


    At least rather cumbersome.

    Arne
     
    Arne Vajhøj, Aug 7, 2009
    #20