retrieving escaped unicode sequences from files ...

Discussion in 'Java' started by qwertmonkey@syberianoutpost.ru, Aug 3, 2012.

  1. Guest

    Why is it that if you save a unicode sequence in a file, say "français"
    ~
    \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
    ~
    and then retrieve it as a String, you can't then convert it back to a UTF-8 String?
    ~
    As you can test with this piece of code, you can simply declare the String as
    a literal or give it on the command line, but retrieving what seems to be
    the same sequence of characters (as they print to standard out) from a file
    doesn't seem to work.
    ~
    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;
    import java.io.UnsupportedEncodingException;
    import java.io.IOException;

    // __
    public class UniKdEnk00Test{
      private static final String aNWLn = System.getProperty("line.separator");
      // __
      public static void main (String[] aArgs){
        try{
          // __ require exactly one argument (note: it is not used below)
          if((aArgs == null) || (aArgs.length != 1)){ throw new IOException(aNWLn +
            "// __ usage:" + aNWLn + aNWLn +
            " java UniKdEnk00Test \\u0066\\u0072\\u0061\\u006e\\u00e7\\u0061\\u0069\\u0073"
            + aNWLn); }
          // the compiler translates these escapes, so the literal already holds the French word
          String aUniKdEnk = "\u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073";
          // encode to UTF-8 bytes, buffer them, then decode them again
          byte[] bAr = aUniKdEnk.getBytes("UTF-8");
          ByteArrayOutputStream BOS = new ByteArrayOutputStream();
          BOS.write(bAr, 0, bAr.length);
          String aUTF8L = new String(BOS.toByteArray(), "UTF-8");
          System.out.println(aUTF8L);
          BOS.reset();
        }catch(UnsupportedEncodingException UEncX){ UEncX.printStackTrace(); }
        catch(IOException IOX) { IOX.printStackTrace(); }
        // __
      }
    }
    ~
    lbrtchx
    comp.lang.java.programmer: escape unicode sequences in files ...
    , Aug 3, 2012
    #1

  2. markspace Guest

    On 8/2/2012 8:52 PM, wrote:
    > Why is it that if you save a unicode sequence in a file, say "français"
    > ~
    > \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
    > ~
    > and then retrieve as a String you can't then convert it back to a UTF-8 String



    Because it isn't French, it's just the ASCII characters \, u, 0, 0, 6, 6
    etc. This is a totally different concept from the idea of escape
    sequences that the compiler interprets for you.

    If you want to read French out of a file, put *French* in the file, not
    ASCII. It can't work any other way.

    If you want to interpret ASCII as escape sequences, you'll have to write
    the interpreter. The Java Properties object reads escape sequences, but
    I don't think you can separate just the escape parser out.
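
    A minimal sketch of the Properties route mentioned above: since
    java.util.Properties.load decodes \uXXXX escapes while it parses, wrapping
    the text as the value of a throwaway key lets you borrow its parser. The
    class name PropertiesUnescape and the key "k" are made up for illustration.
    This is only a workaround under stated assumptions: Properties also
    interprets other backslash escapes and line breaks, so it suits a single
    line of text at best.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import java.io.UncheckedIOException;
    import java.util.Properties;

    class PropertiesUnescape {
        // Hypothetical helper: parses "k=<text>" as a one-entry properties file
        // and returns the decoded value.
        static String unescape(String escaped) {
            Properties props = new Properties();
            try {
                props.load(new StringReader("k=" + escaped));
            } catch (IOException e) {
                // A StringReader should not fail, but load() declares IOException.
                throw new UncheckedIOException(e);
            }
            return props.getProperty("k");
        }

        public static void main(String[] args) {
            // Doubled backslashes: at run time this string holds the plain ASCII
            // characters a file would contain, not the French word.
            String fromFile = "\\u0066\\u0072\\u0061\\u006e\\u00e7\\u0061\\u0069\\u0073";
            String decoded = unescape(fromFile);
            System.out.println(fromFile.length());                 // 48
            System.out.println(decoded.length());                  // 8
            System.out.println(decoded.equals("fran\u00e7ais"));   // true
        }
    }
    ```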
    markspace, Aug 3, 2012
    #2

  3. Roedy Green Guest

    On Fri, 3 Aug 2012 03:52:12 +0000 (UTC),
    wrote, quoted or indirectly quoted
    someone who said :

    > Why is it that if you save a unicode sequence in a file, say "français"


    This is a bit of a simplification.
    You need to understand encoding, which kicks in when you use a Reader
    or Writer. Otherwise you are dealing with raw bytes and InputStreams
    and OutputStreams.

    Encoding takes your 16-bit internal Unicode chars and converts them back
    and forth to bytes in some encoding, for example UTF-8.

    see http://mindprod.com/applet/fileio.html for sample code
    see http://mindprod.com/jgloss/encoding.html for an explanation of
    encoding and the various types of encoding.
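
    As a rough illustration of the Reader/Writer point, here is a sketch that
    writes the word to a temporary file as UTF-8 and reads it back with the
    same charset. The class name Utf8RoundTrip and the temp-file prefix are
    assumptions for the example, and it uses the java.nio.file API from Java 7.

    ```java
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class Utf8RoundTrip {
        // Writes one line of text as UTF-8 bytes, then reads it back through a
        // Reader that decodes with the same charset.
        static String roundTrip(String text) {
            try {
                Path file = Files.createTempFile("francais", ".txt");
                try {
                    try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                        out.write(text);
                    }
                    try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                        return in.readLine();
                    }
                } finally {
                    Files.deleteIfExists(file);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public static void main(String[] args) {
            String word = "fran\u00e7ais";   // 8 chars in memory
            byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length);                    // 9: U+00E7 takes two bytes in UTF-8
            System.out.println(roundTrip(word).equals(word));   // true
        }
    }
    ```

    The file here contains the encoded character itself, not a backslash
    sequence, which is why no unescaping step is needed when reading it.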

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    The greatest shortcoming of the human race is our inability to understand the exponential function.
    ~ Dr. Albert A. Bartlett (born: 1923-03-21 age: 89)
    http://www.youtube.com/watch?v=F-QA2rkpBSY
    Roedy Green, Aug 3, 2012
    #3
  4. glen herrmannsfeldt Guest

    wrote:
    > Why is it that if you save a unicode sequence in a file, say "français"
    > ~
    > \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073


    Note the difference between \u0066 and \uu0066.

    Specifically, consider the java program:

    class quote {
        public static void main(String args[]) {
            System.out.println(\u0022hi there\u0021\u0022);
        }
    }
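
    A small sketch of what that program hints at: the compiler translates
    Unicode escapes in the source text before it does anything else, which is
    why an escape can even stand in for a quote character. Text read from a
    file never passes through the compiler, so it stays as six separate
    characters per escape. The class and field names below are invented for
    the example.

    ```java
    class EscapeStages {
        // Translated by the compiler: this literal holds one char, U+00E7.
        static final String COMPILED = "\u00e7";
        // Backslash escaped: this literal holds six chars, the same text a
        // file containing the sequence would give you.
        static final String RAW = "\\u00e7";
        // Extra u's still form a single escape at compile time.
        static final String DOUBLED = "\uu00e7";

        public static void main(String[] args) {
            System.out.println(COMPILED.length());          // 1
            System.out.println(RAW.length());               // 6
            System.out.println(COMPILED.equals(DOUBLED));   // true
            System.out.println((int) RAW.charAt(0));        // 92, the backslash
        }
    }
    ```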

    -- glen
    glen herrmannsfeldt, Aug 4, 2012
    #4
  5. Arne Vajhøj Guest

    On 8/2/2012 11:52 PM, wrote:
    > Why is it that if you save a unicode sequence in a file, say "français"
    > ~
    > \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
    > ~
    > and then retrieve as a String you can't then convert it back to a UTF-8 String
    > ~


    Some code from my shelf:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Unescape {
        // accept lowercase hex digits too, e.g. the e7 in the original question
        private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
        public static String U2U(String s) {
            String res = s;
            Matcher m = p.matcher(res);
            while(m.find()) {
                // quoteReplacement keeps a decoded '$' or backslash from being treated specially
                res = res.replaceAll("\\" + m.group(0),
                    Matcher.quoteReplacement(
                        Character.toString((char)Integer.parseInt(m.group(1), 16))));
            }
            return res;
        }
        public static void main(String[] args) {
            System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
        }
    }

    Arne
    Arne Vajhøj, Aug 4, 2012
    #5
  6. Daniel Pitts Guest

    On 8/3/12 5:37 PM, Arne Vajhøj wrote:
    > Some code from my shelf:
    > [quoted Unescape class snipped]

    And if you wanted this to be efficient, you'd use appendReplacement
    instead of res.replaceAll()
    Daniel Pitts, Aug 4, 2012
    #6
  7. markspace Guest

    On 8/3/2012 8:49 PM, Daniel Pitts wrote:

    > And if you wanted this to be efficient, you'd use appendReplacement
    > instead of res.replaceAll()
    >



    Free code is free. Not efficient. ;-)
    markspace, Aug 4, 2012
    #7
  8. Lew Guest

    markspace wrote:
    > Daniel Pitts wrote:
    >> And if you wanted this to be efficient, you'd use appendReplacement
    >> instead of res.replaceAll()

    >
    > Free code is free. Not efficient. ;-)


    Not always. But after some reviewers suggest improvements,
    it converges on it.

    Valuably, the posting to Usenet opens up public review for
    suggestions for improvement like this.

    The pedagogical value of exposing code to tweaks offered by
    commenters is beyond measure.

    --
    Lew
    Lew, Aug 4, 2012
    #8
  9. Arne Vajhøj Guest

    On 8/3/2012 11:49 PM, Daniel Pitts wrote:
    > And if you wanted this to be efficient, you'd use appendReplacement
    > instead of res.replaceAll()


    I did not even know that existed.

    So:

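    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Unescape {
        // accept lowercase hex digits too, e.g. the e7 in the original question
        private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
        public static String U2U(String s) {
            Matcher m = p.matcher(s);
            StringBuffer res = new StringBuffer();
            while (m.find()) {
                // decode the four hex digits into a single char
                String decoded = Character.toString((char) Integer.parseInt(m.group(1), 16));
                // quoteReplacement keeps a decoded '$' or backslash from being treated specially
                m.appendReplacement(res, Matcher.quoteReplacement(decoded));
            }
            // copy whatever follows the last escape
            m.appendTail(res);
            return res.toString();
        }
        public static void main(String[] args) {
            System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
        }
    }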

    Arne
    Arne Vajhøj, Aug 7, 2012
    #9