Unescaping Unicode code points in a Java string

Discussion in 'Java' started by Greg, Aug 31, 2006.

  1. Greg

    Greg Guest

    My Java program reads in (from an external source) text that contains
    the same sort of Unicode character escape sequences as Java source
    code. For example, one such string might be:

    "En Espa\u00f1ol"

    Naturally, I would like to convert the six-character subsequence,
    "\u00f1", into the single character code point (hex 00F1) that those
    characters actually represent:

    "En Español"

    I've been browsing the J2SE 1.5 docs hoping to find a convenient method
    to perform this kind of conversion, but so far have not found one. Does
    anyone have any suggestions?

    Thanks,

    Greg
     
    Greg, Aug 31, 2006
    #1

  2. Greg wrote:
    > My Java program reads in (from an external source) text that contains
    > the same sort of unicode character escape sequences as java source
    > code. For example, one such string might be:
    >
    > "En Espa\u00f1ol"
    >
    > Naturally, I would like to convert the five characters subsequence,
    > "\u00f1", into the single character codepoint (hex 00F1) that those
    > characters actually represent:
    >
    > "En Español"
    >
    > I've been browsing the J2SE 1.5 docs hoping to find a convenient method
    > to perform this kind of conversion, but so far have not found one. Does
    > anyone have any suggestions?


    A long time ago I searched the Java API and sources for a method that
    does this kind of String decoding, but to no avail. The only thing I found
    was the method
    private String loadConvert(String)
    in class java.util.Properties. But because it is private, it is not
    reusable outside Properties.

    (You can find the source in src.zip in the JDK installation directory.)
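
A possible workaround, offered only as a rough sketch: although loadConvert is private, its decoding can be reached indirectly through the public Properties.load(Reader), which exists from Java 6 onward (so not in J2SE 1.5). The class and method names below are made up for illustration. It assumes the text is a single line; load() also decodes other escapes such as \n and \t, drops leading whitespace from the value, treats a trailing backslash as a line continuation, and throws IllegalArgumentException on a malformed escape.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class PropertiesUnescape {
    // Hypothetical helper: wraps the text as the value of a dummy key so that
    // Properties.load() applies its own escape decoding to it.
    static String unescape(String s) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("k=" + s));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return props.getProperty("k");
    }
}
```

Calling PropertiesUnescape.unescape("En Espa\\u00f1ol") (written as a Java literal) should give back the ten-character string with the single ñ in it.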

    --
    Thomas
     
    Thomas Fritsch, Aug 31, 2006
    #2

  3. Greg

    Oliver Wong Guest

    "Greg" <> wrote in message
    news:...
    > My Java program reads in (from an external source) text that contains
    > the same sort of unicode character escape sequences as java source
    > code. For example, one such string might be:
    >
    > "En Espa\u00f1ol"
    >
    > Naturally, I would like to convert the five characters subsequence,
    > "\u00f1", into the single character codepoint (hex 00F1) that those
    > characters actually represent:
    >
    > "En Español"
    >
    > I've been browsing the J2SE 1.5 docs hoping to find a convenient method
    > to perform this kind of conversion, but so far have not found one. Does
    > anyone have any suggestions?


    Iterate through each character of the String, looking for the sequence
    "\u". If you find it, delete those two chars, and read in the next 4 chars.
    Parse that sequence of 4 characters into an integer assuming hexadecimal
    notation. Take that integer and cast it to a char, and insert the resulting
    char back into the String.

    - Oliver
     
    Oliver Wong, Aug 31, 2006
    #3
  4. Greg wrote:
    > My Java program reads in (from an external source) text that contains
    > the same sort of unicode character escape sequences as java source
    > code. For example, one such string might be:
    >
    > "En Espa\u00f1ol"
    >
    > Naturally, I would like to convert the five characters subsequence,
    > "\u00f1", into the single character codepoint (hex 00F1) that those
    > characters actually represent:
    >
    > "En Español"
    >
    > I've been browsing the J2SE 1.5 docs hoping to find a convenient method
    > to perform this kind of conversion, but so far have not found one. Does
    > anyone have any suggestions?


    One of many possible solutions:

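    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // accepts upper- and lower-case hex digits, so "\u00f1" matches too
    private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
    public static String U2U(String s) {
        Matcher m = p.matcher(s);
        StringBuffer res = new StringBuffer();
        while (m.find()) {
            String decoded = Character.toString((char) Integer.parseInt(m.group(1), 16));
            // quoteReplacement stops a decoded '$' or '\' from being read as a group reference
            m.appendReplacement(res, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(res); // copy whatever follows the last escape
        return res.toString();
    }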

    Arne
     
    Arne Vajhøj, Sep 1, 2006
    #4
  5. Greg

    Dale King Guest

    Oliver Wong wrote:
    >
    > "Greg" <> wrote in message
    > news:...
    >> My Java program reads in (from an external source) text that contains
    >> the same sort of unicode character escape sequences as java source
    >> code. For example, one such string might be:
    >>
    >> "En Espa\u00f1ol"
    >>
    >> Naturally, I would like to convert the five characters subsequence,
    >> "\u00f1", into the single character codepoint (hex 00F1) that those
    >> characters actually represent:
    >>
    >> "En Español"
    >>
    >> I've been browsing the J2SE 1.5 docs hoping to find a convenient method
    >> to perform this kind of conversion, but so far have not found one. Does
    >> anyone have any suggestions?

    >
    > Iterate through each character of the String, looking for the
    > sequence "\u". If you find it, delete those two chars, and read in the
    > next 4 chars. Parse that sequence of 4 characters into a integer
    > assuming hexadecimal notation. Take that integer and cast it to a char,
    > and insert the resulting char back into the String.


    It's a bit more complicated than that because you will also need to
    support things like \\ to actually insert a backslash and perhaps
    support things like \n.
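
To make that concrete, here is a minimal sketch of a decoder that handles the backslash-u form together with a small, assumed set of single-character escapes (\\, \n, \t). The class name, the method name, and the choice of which escapes to honor are assumptions for illustration, not anything specified in this thread; unknown escapes simply keep the escaped character.

```java
final class EscapeDecoder {
    // Hypothetical helper: decodes backslash-u plus a few single-character
    // escapes; any other escaped character is kept as-is.
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.append(c);          // ordinary char, or a lone trailing backslash
                continue;
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '\\': out.append('\\'); break;
                case 'u':
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("truncated unicode escape");
                    }
                    int code = 0;
                    for (int k = 0; k < 4; k++) {
                        int d = Character.digit(s.charAt(i + k), 16);
                        if (d < 0) {
                            throw new IllegalArgumentException("bad hex digit in unicode escape");
                        }
                        code = (code << 4) | d;
                    }
                    out.append((char) code);
                    i += 4;
                    break;
                default:   out.append(e);   // unknown escape: keep the char itself
            }
        }
        return out.toString();
    }
}
```

Because a doubled backslash is consumed as one unit, a literal backslash followed by "u0041" in the input stays as text instead of being decoded to "A".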

    --
    Dale King
     
    Dale King, Sep 1, 2006
    #5
  6. On Fri, 01 Sep 2006 01:09:40 -0400, Dale King wrote:

    >> "Greg" <> wrote in message
    >> news:...
    >>> My Java program reads in (from an external source) text that contains
    >>> the same sort of unicode character escape sequences as java source
    >>> code. For example, one such string might be:
    >>>
    >>> "En Espa\u00f1ol"
    >>>
    >>> Naturally, I would like to convert the five characters subsequence,
    >>> "\u00f1", into the single character codepoint (hex 00F1) that those
    >>> characters actually represent:
    >>>
    >>> "En Español"

    >
    > It's a bit more complicated than that because you will also need to
    > support things like \\ to actually insert a backslash and perhaps
    > support things like \n.


    If he is defining a new specification for escaped input, this would be
    nice but not necessary. "\" can be escaped as "\u005C", and a newline
    as "\u000A". In Java source code, either of those escapes inside a string
    literal results in a malformed literal (which means one needs to use "\\"
    and "\n" instead), but both escape sequences are permitted in properties
    files. On the other hand, the Java compiler and Properties.load() do not
    recognize the C escape-sequences "\v" and "\a" for VT and BEL.

    I think Arne's response (that used a regular expression) was too
    complicated, and the response to which you are responding was
    poorly-thought-out (because strings are immutable in Java). Here's a
    possible solution:

    String unescape(String s) {
        int i = 0, len = s.length();
        char c;
        StringBuffer sb = new StringBuffer(len);
        while (i < len) {
            c = s.charAt(i++);
            if (c == '\\') {
                if (i < len) {
                    c = s.charAt(i++);
                    if (c == 'u') {
                        // assumes four hex digits follow; a truncated or non-hex
                        // sequence makes substring/parseInt throw
                        c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
                        i += 4;
                    } // add other cases here as desired...
                }
            } // fall through: \ escapes itself, quotes any character but u
            sb.append(c);
        }
        return sb.toString();
    }

    Unlike Arne's solution, it examines each character in the string only
    once, and it doesn't require the java.util.regex package (which was not
    introduced until Java 1.4). I also think it's more readable, to one who
    is trying to verify that it does exactly what's expected and no more.

    (What would Arne's solution do to "\u005Cu0020\u0020"? Is that the
    correct result?)

    --
    PGP key posted on website ... http://www.lmert.com/people/davidl/
     
    David Lee Lambert, Sep 1, 2006
    #6
  7. Greg

    Dale King Guest

    David Lee Lambert wrote:
    > On Fri, 01 Sep 2006 01:09:40 -0400, Dale King wrote:
    >
    >>> "Greg" <> wrote in message
    >>> news:...
    >>>> My Java program reads in (from an external source) text that contains
    >>>> the same sort of unicode character escape sequences as java source
    >>>> code. For example, one such string might be:
    >>>>
    >>>> "En Espa\u00f1ol"
    >>>>
    >>>> Naturally, I would like to convert the five characters subsequence,
    >>>> "\u00f1", into the single character codepoint (hex 00F1) that those
    >>>> characters actually represent:
    >>>>
    >>>> "En Español"

    >> It's a bit more complicated than that because you will also need to
    >> support things like \\ to actually insert a backslash and perhaps
    >> support things like \n.

    >
    > If he is defining a new specification for escaped input, this would be
    > nice but not necessary. "\" can be escaped as "\u005C", and a newline
    > as "\u000A". In Java source code, "\u005C" results in a malformed string
    > literal (which means one needs to use "\n" instead), but that escape
    > sequence is permitted in properties files.


    It's up to him what he wants to specify, but personally I would prefer
    the \\ and \n.

    > On the other hand, the Java
    > compiler and Properties.load() do not recognize the C escape-sequences
    > "\v" and "\a" for VT and BEL.


    Which is understandable. BEL is specific to consoles, and Java has no
    real support for consoles because they are too platform-specific; VT
    is rarely used.

    > I think Arne's response (that used a regular expression) was too
    > complicated, and the response to which you are responding was
    > poorly-thought-out (because strings are immutable in Java). Here's a
    > possible solution:
    >
    > String unescape(String s) {


    The proper time to do the conversion is when the text is being read from
    the "external source" using some form of FilterReader subclass. I
    remember now that I wrote one of those once, but after a long search I
    have figured out that I left that code at my previous employer and did
    not keep a copy of it (which is a shame because that was part of
    something that was some really good work).
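
Since that original code is lost, the following is only a guess at what such a FilterReader subclass might look like, not a reconstruction of it. The class name and the decode helper are hypothetical. The sketch decodes only backslash-u escapes, passes every other backslash pair through untouched (so a doubled backslash before "u" is not decoded), and overrides just the two read methods; skip, mark and reset are not adapted.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical class: decodes backslash-u escapes on the fly while reading.
final class UnicodeUnescapeReader extends FilterReader {
    private int pending = -1;   // char read ahead after a non-escape backslash

    UnicodeUnescapeReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending != -1) {
            int p = pending;
            pending = -1;
            return p;                       // emitted as-is, never re-examined
        }
        int c = in.read();
        if (c != '\\') {
            return c;                       // ordinary char, or -1 at end of stream
        }
        int next = in.read();
        if (next != 'u') {
            pending = next;                 // may be -1 if the backslash came last
            return c;
        }
        int code = 0;
        for (int k = 0; k < 4; k++) {
            int h = in.read();
            int d = (h == -1) ? -1 : Character.digit((char) h, 16);
            if (d < 0) {
                throw new IOException("malformed unicode escape");
            }
            code = (code << 4) | d;
        }
        return code;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Convenience wrapper for trying the reader on an in-memory string.
    static String decode(String s) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new UnicodeUnescapeReader(new StringReader(s))) {
            char[] buf = new char[64];
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
        return sb.toString();
    }
}
```

In real use the reader would wrap whatever Reader delivers the external text, for example inside a BufferedReader, so the rest of the program only ever sees decoded characters.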

    --
    Dale King
     
    Dale King, Sep 1, 2006
    #7
  8. Greg

    vektor

    I use the Apache Commons Lang converter:
    Code:
    org.apache.commons.lang.StringEscapeUtils
     
    vektor, May 17, 2011
    #8