Unicode escapes and String literals?

Discussion in 'Java' started by Knute Johnson, Dec 13, 2012.

  1. I just had a great revelation as I was putting together my SSCCE for the
    question I was going to ask. So it has changed my question. How do I
    do the conversion of unicode escape sequences to a String that are done
    by string literals?

    String s = "\u0066\u0065\u0064";

    becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    without using the literal it stays \u0066\u0065\u0064. Is there a built
    in mechanism in Java for doing that translation to a String?

    --

    Knute Johnson
    Knute Johnson, Dec 13, 2012
    #1

  2. On 13.12.2012 18:31, Knute Johnson wrote:
    > I just had a great revelation as I was putting together my SSCCE for the
    > question I was going to ask. So it has changed my question. How do I do
    > the conversion of unicode escape sequences to a String that are done by
    > string literals?
    >
    > String s = "\u0066\u0065\u0064";
    >
    > becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    > without using the literal it stays \u0066\u0065\u0064. Is there a built
    > in mechanism in Java for doing that translation to a String?


    Yes. It's called "compiler". The same part of the compiler that
    translates a "\t" in a string literal to the TAB control character also
    replaces the unicode sequences in the string literal to the
    corresponding unicode encoding.

    Greetings,
    Thomas
    Thomas Richter, Dec 13, 2012
    #2

  3. On 12/13/2012 9:51 AM, Thomas Richter wrote:
    > On 13.12.2012 18:31, Knute Johnson wrote:
    >> I just had a great revelation as I was putting together my SSCCE for the
    >> question I was going to ask. So it has changed my question. How do I do
    >> the conversion of unicode escape sequences to a String that are done by
    >> string literals?
    >>
    >> String s = "\u0066\u0065\u0064";
    >>
    >> becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    >> without using the literal it stays \u0066\u0065\u0064. Is there a built
    >> in mechanism in Java for doing that translation to a String?

    >
    > Yes. It's called "compiler". The same part of the compiler that
    > translates a "\t" in a string literal to the TAB control character also
    > replaces the unicode sequences in the string literal to the
    > corresponding unicode encoding.
    >
    > Greetings,
    > Thomas


    I want to be able to do it to a String not to a string literal.

    --

    Knute Johnson
    Knute Johnson, Dec 13, 2012
    #3
  4. Knute Johnson

    Lew Guest

    Knute Johnson wrote:
    > Thomas Richter wrote:
    >> Knute Johnson wrote:
    >>> I just had a great revelation as I was putting together my SSCCE for the
    >>> question I was going to ask. So it has changed my question. How do I do
    >>> the conversion of unicode [sic] escape sequences to a String that are done by
    >>> string literals?


    They aren't done by String literals.

    >>> String s = "\u0066\u0065\u0064";
    >>> becomes "fed" but if you create a String with \u0066\u0065\u0064 in it


    Exactly how?

    >>> without using the literal it stays \u0066\u0065\u0064. Is there a built
    >>> in mechanism in Java for doing that translation to a String?


    No.

    >> Yes. It's called "compiler". The same part of the compiler that


    That's not exactly correct, and it certainly is not the same part that translates '\t'.

    >> translates a "\t" in a string literal to the TAB control character also
    >> replaces the unicode sequences in the string literal to the
    >> corresponding unicode encoding.


    Nope.

    > I want to be able to do it to a String not to a string literal.


    You want to do what, exactly? I'm not clear on what you're trying to accomplish.

    '\u' sequences are translated before compilation proper, not during it. Their presence is exactly equivalent
    to typing the corresponding Unicode character directly.

    You can embed them in identifiers, directives, anywhere the corresponding character can go.

    Not just literals.

    For that matter, you can use them in numeric literals.

    <sscce>
    package temp;

    /**
     * ShowUnicodeEscapes.
     */
    public class ShowUnicodeEscapes {

        static final \u0069nt COUN\u0054 = \u0030\u003b

        /**
         * main.
         *
         * @param args String array of arguments.
         */
        public static void main(String[] args) {
            System.out.println("COUNT = \u0022+ COUNT);
        }
    }
    </sscce>
    Lew, Dec 13, 2012
    #4
  5. Knute Johnson

    Daniel Pitts Guest

    On 12/13/12 9:31 AM, Knute Johnson wrote:
    > I just had a great revelation as I was putting together my SSCCE for the
    > question I was going to ask. So it has changed my question. How do I
    > do the conversion of unicode escape sequences to a String that are done
    > by string literals?
    >
    > String s = "\u0066\u0065\u0064";
    >
    > becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    > without using the literal it stays \u0066\u0065\u0064. Is there a built
    > in mechanism in Java for doing that translation to a String?
    >


    Do you mean that you have a String whose value is "\\u0066\\u0065\\u0064", and
    you want to pass that String to a method which will return fed?

    Meaning:

    String foo = "\\u0066\\u0065\\u0064";

    System.out.println(foo); // prints \u0066\u0065\u0064
    System.out.println(magicFunction(foo)); // prints fed

    There might be such a function in the Apache Commons library, but I don't
    think there is one in the standard API. I could be wrong though.
    Daniel Pitts, Dec 13, 2012
    #5
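A minimal sketch of what a hand-written magicFunction could look like, assuming only the backslash-u form with exactly four hex digits needs handling. The class name MagicFunctionSketch and the helper isHex are hypothetical. Copying malformed or truncated escapes through unchanged is a choice made for this sketch, not something the thread specifies.

```java
class MagicFunctionSketch {
    /** Replaces each backslash, 'u', four-hex-digit run in s with the char it names. */
    static String magicFunction(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // An escape is 6 chars long: backslash, 'u', then 4 hex digits.
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex(s, i + 2)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True if the 4 chars starting at index from are all hex digits. */
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String foo = "\\u0066\\u0065\\u0064";
        System.out.println(foo.length());       // 18
        System.out.println(magicFunction(foo)); // fed
    }
}
```

Scanning by index avoids the regex replacement-string rules entirely, at the cost of a little more code.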
  6. Knute Johnson

    Daniel Pitts Guest

    On 12/13/12 11:46 AM, Daniel Pitts wrote:
    > On 12/13/12 9:31 AM, Knute Johnson wrote:
    >> I just had a great revelation as I was putting together my SSCCE for the
    >> question I was going to ask. So it has changed my question. How do I
    >> do the conversion of unicode escape sequences to a String that are done
    >> by string literals?
    >>
    >> String s = "\u0066\u0065\u0064";
    >>
    >> becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    >> without using the literal it stays \u0066\u0065\u0064. Is there a built
    >> in mechanism in Java for doing that translation to a String?
    >>

    >
    > Do you mean, you have a String, whose value is "\\u0066\\u0065\\u0064",
    > you want to pass that String to a method which will return fed.
    >
    > meaning
    >
    > String foo = "\\u0066\\u0065\\u0064";
    >
    > System.out.println(foo); // prints \u0066\u0065\u0064
    > System.out.println(magicFunction(foo)); // prints fed
    >
    > There might be such a function in Apache Commons library, but I don't
    > think there is one in the standard API. I could be wrong though.


    Two minutes of googling and reading a Stack Overflow post gave me this link:

    <http://commons.apache.org/lang/api/org/apache/commons/lang3/StringEscapeUtils.html#unescapeJava%28java.lang.String%29>
    Daniel Pitts, Dec 13, 2012
    #6
  7. Knute Johnson

    markspace Guest

    On 12/13/2012 10:47 AM, Knute Johnson wrote:
    >
    > I want to be able to do it to a String not to a string literal.
    >


    Daniel showed one way to interpret your request. Here's another. Pay
    special attention to the bits outside the quotes. This program prints
    "fed".


    public class EscapeTest {
        public static void main(String[] args) {
            String \u0066\u0065\u0064 = "\u0066\u0065\u0064";
            System.out.println( fed );
        }
    }
    markspace, Dec 13, 2012
    #7
  8. Knute Johnson

    David Lamb Guest

    On 13/12/2012 3:58 PM, markspace wrote:
    > On 12/13/2012 10:47 AM, Knute Johnson wrote:
    >>
    >> I want to be able to do it to a String not to a string literal.
    >>

    >
    > Daniel showed one way to interpret your request. Here's another. Pay
    > special attention to the bits outside the quotes. This program prints
    > "fed".
    >
    >
    > public class EscapeTest {
    > public static void main(String[] args) {
    > String \u0066\u0065\u0064 = "\u0066\u0065\u0064";
    > System.out.println( fed );
    > }
    > }


    Cute. But presupposing that the OP isn't the idiot some people seem to
    have assumed, I suspect he meant something more like

    String line = someBufferedFile.readLine();
    ... change all \u escapes into Unicode in "line" ... [1]

    where by "\u escapes" he meant the 6-character substrings one usually
    types in string literals. The OP needs to look into "code points" and
    the corresponding code point to Character conversions at
    http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html

    [1] which, for the pedantic, really means "create a new string(buffer)
    from line"
    David Lamb, Dec 13, 2012
    #8
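To make the read-a-line scenario concrete, here is a small sketch, assuming the text arrives through a Reader (a StringReader stands in for the file). The class name RuntimeEscapes and its two helper methods are hypothetical. It shows that text read at run time keeps its escapes as ordinary characters, and how Character.toChars turns a code point into chars.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class RuntimeEscapes {
    /** Reads the first line of the given text, the way a file read would. */
    static String firstLine(String text) {
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Converts any code point (BMP or supplementary) to a String. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String line = firstLine("\\u0066\\u0065\\u0064");
        System.out.println(line.length());                   // 18: the escapes are still text
        System.out.println(fromCodePoint(0x66));             // f
        System.out.println(fromCodePoint(0x1F600).length()); // 2: a surrogate pair
    }
}
```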
  9. Knute Johnson

    markspace Guest

    On 12/13/2012 1:21 PM, David Lamb wrote:
    >
    > Cute. But presupposing that the OP isn't the idiot some people seem to
    > have assumed, I suspect he meant something more like
    >
    > String line = someBufferedFile.readLine();
    > ... change all \u escapes into unicode in "line" ... [1]



    Maybe. But your code above is obvious, imo. Either Knute had a brain
    fart and forgot about \\ to escape a backslash, or he ran into some other
    problem.

    My point was that there's a very simple pre-compiler for Java. It
    translates all \u-escapes into characters before the compiler proper
    sees it. There's no difference to the Java compiler between "fed" and
    "\u0066\u0065\u0064". It literally can't tell the difference.

    That's an important distinction.
    markspace, Dec 13, 2012
    #9
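A short sketch of the distinction, with a hypothetical class name LiteralIdentity. Both literals are compile-time constants with the same contents, so they are interned to the same String object. The third constant doubles the backslashes and therefore holds the escape text itself.

```java
class LiteralIdentity {
    static final String PLAIN = "fed";
    static final String ESCAPED = "\u0066\u0065\u0064";          // the compiler sees "fed"
    static final String RUNTIME_TEXT = "\\u0066\\u0065\\u0064";  // a real backslash, then u0066...

    public static void main(String[] args) {
        System.out.println(PLAIN == ESCAPED);       // true: one interned constant
        System.out.println(ESCAPED.length());       // 3
        System.out.println(RUNTIME_TEXT.length());  // 18
    }
}
```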
  10. Knute Johnson

    David Lamb Guest

    On 13/12/2012 5:00 PM, markspace wrote:
    > My point was that there's a very simple pre-compiler for Java. It
    > translates all \u-escapes into characters before the compiler proper
    > sees it. There's no difference to the Java compiler between "fed" and
    > "\u0066\u0065\u0064". It literally can't tell the difference.


    I should probably have found a different point in the thread to hang my
    comment, since you're perfectly correct.
    David Lamb, Dec 13, 2012
    #10
  11. Knute Johnson

    David Lamb Guest

    On 13/12/2012 5:00 PM, markspace wrote:
    > Either Knute had a brain fart and forgot about \\ to escape a slash, or
    > he ran into some other problem.


    Some other problem. As I said, I suspect he didn't know about the
    codepoint-to-character methods. Let's wait to see if he responds to my
    suggestion. Or for Lew to condemn him for not thinking of the right spot
    to read in the API docs.
    David Lamb, Dec 13, 2012
    #11
  12. On 12/13/2012 11:46 AM, Daniel Pitts wrote:
    > On 12/13/12 9:31 AM, Knute Johnson wrote:
    >> I just had a great revelation as I was putting together my SSCCE for the
    >> question I was going to ask. So it has changed my question. How do I
    >> do the conversion of unicode escape sequences to a String that are done
    >> by string literals?
    >>
    >> String s = "\u0066\u0065\u0064";
    >>
    >> becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    >> without using the literal it stays \u0066\u0065\u0064. Is there a built
    >> in mechanism in Java for doing that translation to a String?
    >>

    >
    > Do you mean, you have a String, whose value is "\\u0066\\u0065\\u0064",
    > you want to pass that String to a method which will return fed.
    >
    > meaning
    >
    > String foo = "\\u0066\\u0065\\u0064";
    >
    > System.out.println(foo); // prints \u0066\u0065\u0064
    > System.out.println(magicFunction(foo)); // prints fed
    >
    > There might be such a function in Apache Commons library, but I don't
    > think there is one in the standard API. I could be wrong though.


    I obviously didn't explain it well the first time around, so let me try
    again. I understand that the compiler reads unicode escape sequences
    pretty much anywhere and converts them to characters. What I want to be
    able to do is to do that conversion on characters that are in a String.
    So if in my String I had the characters \u0066\u0065\u0064 I would
    like to convert those to a String of "fed".

    I did look at the apache commons link you sent and that would probably
    do it but if the compiler can translate them it must have a method
    already. Maybe it's not public but that is what I was asking.

    So thanks everybody for your answers.

    --

    Knute Johnson
    Knute Johnson, Dec 13, 2012
    #12
  13. Knute Johnson

    Arne Vajhøj Guest

    On 12/13/2012 12:31 PM, Knute Johnson wrote:
    > I just had a great revelation as I was putting together my SSCCE for the
    > question I was going to ask. So it has changed my question. How do I
    > do the conversion of unicode escape sequences to a String that are done
    > by string literals?
    >
    > String s = "\u0066\u0065\u0064";
    >
    > becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    > without using the literal it stays \u0066\u0065\u0064. Is there a built
    > in mechanism in Java for doing that translation to a String?


    I don't think there is anything built in.

    But it is trivial to code.

    This was posted just a few months back:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Unescape {
        // Matches a literal backslash, 'u', and four hex digits (either case).
        private static final Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

        public static String U2U(String s) {
            //String res = s;
            //Matcher m = p.matcher(res);
            //while (m.find()) {
            //    res = res.replaceAll("\\" + m.group(0),
            //        Character.toString((char) Integer.parseInt(m.group(1), 16)));
            //}
            //return res;
            Matcher m = p.matcher(s);
            StringBuffer res = new StringBuffer();
            while (m.find()) {
                String decoded = Character.toString((char) Integer.parseInt(m.group(1), 16));
                // quoteReplacement stops a decoded '$' or '\' being read as replacement syntax.
                m.appendReplacement(res, Matcher.quoteReplacement(decoded));
            }
            m.appendTail(res);
            return res.toString();
        }

        public static void main(String[] args) {
            System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
        }
    }

    Arne
    Arne Vajhøj, Dec 13, 2012
    #13
  14. Knute Johnson

    markspace Guest

    On 12/13/2012 2:55 PM, Knute Johnson wrote:

    > I did look at the apache commons link you sent and that would probably
    > do it but if the compiler can translate them it must have a method
    > already. Maybe it's not public but that is what I was asking.


    The compiler's internal methods aren't part of the public API. The
    closest thing I'm aware of is Properties#load(), which does convert \u
    and some other escape sequences in a properties file. However, the
    method that does that is private.

    I think if it's in the Apache utils then it's fair to say there's no
    Java API equivalent. Otherwise, why make an Apache utils method?
    markspace, Dec 13, 2012
    #14
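Although the conversion method inside Properties is private, load(Reader) itself is public, so the conversion can be reached indirectly by feeding it a one-line "key=value" text. This is a sketch of that workaround, not a recommended API. The class name PropertiesUnescape and the key "k" are hypothetical. Note the assumptions: load also processes other backslash sequences, drops leading whitespace from the value, and treats a line terminator as the end of the entry, so it only suits single-line input where those rules are acceptable.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesUnescape {
    /** Decodes escapes in s by loading it as the value of a one-entry properties text. */
    static String unescape(String s) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("k=" + s));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("k");
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u0066\\u0065\\u0064")); // fed
    }
}
```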
  15. Knute Johnson

    Daniel Pitts Guest

    On 12/13/12 3:09 PM, Arne Vajhøj wrote:
    > On 12/13/2012 12:31 PM, Knute Johnson wrote:
    >> I just had a great revelation as I was putting together my SSCCE for the
    >> question I was going to ask. So it has changed my question. How do I
    >> do the conversion of unicode escape sequences to a String that are done
    >> by string literals?
    >>
    >> String s = "\u0066\u0065\u0064";
    >>
    >> becomes "fed" but if you create a String with \u0066\u0065\u0064 in it
    >> without using the literal it stays \u0066\u0065\u0064. Is there a built
    >> in mechanism in Java for doing that translation to a String?

    >
    > I don't think there is anything built in.
    >
    > But it is trivial to code.

    Famous last words. Nothing in Unicode is trivial. It may seem trivial,
    but there are potential gotchas in the spec.

    I don't know of any off the top of my head, but I wouldn't just assume
    it was trivial unless I knew the spec backward and forward.
    Daniel Pitts, Dec 13, 2012
    #15
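One concrete gotcha sits in the regex API rather than in Unicode itself: Matcher.appendReplacement gives '$' and '\' special meaning in the replacement string, so a decoder built on it can fail when an escape decodes to one of those characters (for example \u0024 or \u005c). A minimal sketch of a decoder that guards against this with Matcher.quoteReplacement follows. The class name SafeUnescape is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeUnescape {
    // A literal backslash, 'u', and four hex digits in either case.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // quoteReplacement makes '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("cost: \\u002410")); // cost: $10
        System.out.println(unescape("C:\\u005ctemp"));   // C:\temp
    }
}
```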
  16. On 12/13/2012 3:09 PM, Arne Vajhøj wrote:
    > On 12/13/2012 12:31 PM, Knute Johnson wrote:
    >> I just had a great revelation as I was putting together my SSCCE
    >> for the question I was going to ask. So it has changed my
    >> question. How do I do the conversion of unicode escape sequences
    >> to a String that are done by string literals?
    >>
    >> String s = "\u0066\u0065\u0064";
    >>
    >> becomes "fed" but if you create a String with \u0066\u0065\u0064 in
    >> it without using the literal it stays \u0066\u0065\u0064. Is there
    >> a built in mechanism in Java for doing that translation to a
    >> String?

    >
    > I don't think there is anything built in.
    >
    > But it is trivial to code.
    >
    > This was posted just a few months back:
    >
    > import java.util.regex.Matcher;
    > import java.util.regex.Pattern;
    >
    > public class Unescape {
    > private static final Pattern p = Pattern.compile("\\\\u([0-9A-F]{4})");
    > public static String U2U(String s) {
    > //String res = s;
    > //Matcher m = p.matcher(res);
    > //while (m.find()) {
    > // res = res.replaceAll("\\" + m.group(0), Character.toString((char) Integer.parseInt(m.group(1), 16)));
    > //}
    > //return res;
    > Matcher m = p.matcher(s);
    > StringBuffer res = new StringBuffer();
    > while (m.find()) {
    > m.appendReplacement(res, Character.toString((char) Integer.parseInt(m.group(1), 16)));
    > }
    > m.appendTail(res);
    > return res.toString();
    > }
    > public static void main(String[] args) {
    >
    > System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
    > }
    > }
    >
    > Arne


    Well, brilliant minds think alike. Where were you when I asked the
    first time :). I don't remember a thread on this going by but that's
    getting harder to do all the time. I originally had String.valueOf()
    instead of Character.toString(). I think the latter is better but not
    sure if it makes any difference. Could be a non-trivial Unicode gotcha
    eh Daniel?

    Thanks everybody.


    import java.util.regex.*;

    public class test6 {
        public static void main(String[] args) {
            String clear = "byte me!";
            System.out.println(clear);
            String escpd = unicodeEscape(clear);
            System.out.println(escpd);

            Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
            Matcher m = p.matcher(escpd);

            StringBuffer buf = new StringBuffer();
            while (m.find()) {
                String repl =
                    Character.toString((char)Integer.parseInt(m.group(1),16));
                // quoteReplacement keeps a decoded '$' or '\' literal.
                m.appendReplacement(buf,Matcher.quoteReplacement(repl));
            }
            m.appendTail(buf);

            System.out.println(buf);
        }

        // Escapes a single char as backslash, 'u', and four lowercase hex digits.
        public static String unicodeEscape(char c) {
            return String.format("\\u%04x",(int)c);
        }

        public static String unicodeEscape(Character c) {
            if (c == null)
                return null;

            return unicodeEscape(c.charValue());
        }

        // Escapes every char of the string, including plain ASCII.
        public static String unicodeEscape(String str) {
            StringBuilder buf = new StringBuilder();
            for (int i=0; i<str.length(); i++)
                buf.append(unicodeEscape(str.charAt(i)));

            return buf.toString();
        }
    }

    C:\Documents and Settings\Knute Johnson>java test6
    byte me!
    \u0062\u0079\u0074\u0065\u0020\u006d\u0065\u0021
    byte me!

    --

    Knute Johnson
    Knute Johnson, Dec 14, 2012
    #16
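On the String.valueOf() versus Character.toString() question, a small sketch suggests the two give the same result for a char argument. What does matter is the (char) cast: without it, the int overload of String.valueOf is chosen and the number is formatted in decimal. The class name CharToString and its methods are hypothetical.

```java
class CharToString {
    /** Decodes four hex digits and compares both ways of making a String from the char. */
    static boolean sameResult(String hex) {
        char c = (char) Integer.parseInt(hex, 16);
        return String.valueOf(c).equals(Character.toString(c));
    }

    /** Without the (char) cast, String.valueOf(int) formats the number instead. */
    static String withoutCast(String hex) {
        return String.valueOf(Integer.parseInt(hex, 16));
    }

    public static void main(String[] args) {
        System.out.println(sameResult("0066"));  // true
        System.out.println(withoutCast("0066")); // 102
    }
}
```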
  17. Knute Johnson

    Arne Vajhøj Guest

    On 12/13/2012 4:21 PM, David Lamb wrote:
    > Cute. But presupposing that the OP isn't the idiot some people seem to
    > have assumed, I suspect he meant something more like
    >
    > String line = someBufferedFile.readLine();
    > ... change all \u escapes into unicode in "line" ... [1]
    >
    > where by "\u escapes" he meant the 6-character substrings one usually
    > types in string literals. The OP needs to look into "code points" and
    > the corresponding codepoint to Character conversions at
    > http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html


    Why?

    I think he is only asking for conversion between strings with
    escapes and 16-bit chars.

    The mess with code points and surrogate pairs is no
    different from usual.

    Arne
    Arne Vajhøj, Dec 14, 2012
    #17
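A brief sketch of why surrogate pairs need no special handling here, assuming each escape is decoded independently into one 16-bit char. Two escapes that name a high and a low surrogate simply end up next to each other, and the resulting String reports a single supplementary code point. The class name SurrogateEscapes and the method decodePair are hypothetical.

```java
class SurrogateEscapes {
    /** Turns two 4-digit hex strings into a 2-char String, one char per escape. */
    static String decodePair(String firstHex, String secondHex) {
        char first = (char) Integer.parseInt(firstHex, 16);
        char second = (char) Integer.parseInt(secondHex, 16);
        return new String(new char[] { first, second });
    }

    public static void main(String[] args) {
        String s = decodePair("D83D", "DE00");
        System.out.println(s.length());                            // 2
        System.out.println(s.codePointCount(0, s.length()));       // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```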
  18. Knute Johnson

    Arne Vajhøj Guest

    On 12/13/2012 6:52 PM, Daniel Pitts wrote:
    > On 12/13/12 3:09 PM, Arne Vajhøj wrote:
    >> I don't think there is anything built in.
    >>
    >> But it is trivial to code.

    > Famous last words. Nothing in Unicode is trivial. It may seem trivial,
    > but there are potentially gotchas in the spec.
    >
    > I don't know of any off the top of my head, but I wouldn't just assume
    > it was trivial unless I knew the spec backward and forward.


    Unicode can be tricky.

    But in reality this is not really a Unicode problem.

    It is about converting substrings of length 6 to
    16-bit chars.

    Which substantially reduces the complexity.

    Arne
    Arne Vajhøj, Dec 14, 2012
    #18
  19. Knute Johnson

    Arne Vajhøj Guest

    On 12/13/2012 7:11 PM, Knute Johnson wrote:
    > On 12/13/2012 3:09 PM, Arne Vajhøj wrote:
    >> On 12/13/2012 12:31 PM, Knute Johnson wrote:
    >>> I just had a great revelation as I was putting together my SSCCE
    >>> for the question I was going to ask. So it has changed my
    >>> question. How do I do the conversion of unicode escape sequences
    >>> to a String that are done by string literals?
    >>>
    >>> String s = "\u0066\u0065\u0064";
    >>>
    >>> becomes "fed" but if you create a String with \u0066\u0065\u0064 in
    >>> it without using the literal it stays \u0066\u0065\u0064. Is there
    >>> a built in mechanism in Java for doing that translation to a
    >>> String?

    >>
    >> I don't think there is anything built in.
    >>
    >> But it is trivial to code.
    >>
    >> This was posted just a few months back:


    > Well, brilliant minds think alike. Where were you when I asked the
    > first time :). I don't remember a thread on this going by but that's
    > getting harder to do all the time.


    I am pretty sure that it was here that I posted the
    code, but with the commented-out implementation and
    that someone (Daniel?) suggested the new implementation
    as an improvement.

    Arne
    Arne Vajhøj, Dec 14, 2012
    #19
  20. On 12/13/2012 4:43 PM, Arne Vajhøj wrote:
    > On 12/13/2012 7:11 PM, Knute Johnson wrote:
    >> On 12/13/2012 3:09 PM, Arne Vajhøj wrote:
    >>> On 12/13/2012 12:31 PM, Knute Johnson wrote:
    >>>> I just had a great revelation as I was putting together my SSCCE
    >>>> for the question I was going to ask. So it has changed my
    >>>> question. How do I do the conversion of unicode escape sequences
    >>>> to a String that are done by string literals?
    >>>>
    >>>> String s = "\u0066\u0065\u0064";
    >>>>
    >>>> becomes "fed" but if you create a String with \u0066\u0065\u0064 in
    >>>> it without using the literal it stays \u0066\u0065\u0064. Is there
    >>>> a built in mechanism in Java for doing that translation to a
    >>>> String?
    >>>
    >>> I don't think there is anything built in.
    >>>
    >>> But it is trivial to code.
    >>>
    >>> This was posted just a few months back:

    >
    >> Well, brilliant minds think alike. Where were you when I asked the
    >> first time :). I don't remember a thread on this going by but that's
    >> getting harder to do all the time.

    >
    > I am pretty sure that it was here that I posted the
    > code, but with the out commented implementation and
    > that someone (Daniel?) suggested the new implementation
    > as an improvement.
    >
    > Arne
    >


    Well I appreciate everybody's help. It was driving me nuts for two days.

    --

    Knute Johnson
    Knute Johnson, Dec 14, 2012
    #20