Regex and Unicode

Discussion in 'Java' started by michael.biden@gmail.com, Mar 19, 2007.

  1. Guest

    I have a situation in which I am receiving a String from a non-Java
    system. The system that generates the String attempts to encode some
    characters, such as a slash, as Unicode escapes. However, it encodes
    them using the percent sign rather than the backslash.

    Thus the String test-victorf becomes test%u002dvictorf. I'd love to
    be able to simply replace the percent with a backslash, but it seems
    that there is no way to dynamically insert the backslash like a
    literal. For example:
    public static void main(String[] args) {
        String user = "test%u002dvictof";
        user = user.replace('%', '\\');
        System.out.println(user);
    }

    Does not work. The output is test\002dvictorf.

    So I tried to use a regular expression with capturing parentheses:
    public static void main(String[] args) {
        String user = "test%u002dvictof";
        user = user.replaceAll(
            "%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9])",
            Character.toString((char) Integer.valueOf("$1", 16).intValue()));
        System.out.println(user);
    }
    This generates a java.lang.NumberFormatException because the compiler
    does not like the $1 at runtime. It seems that the $1 is being
    interpreted literally. The real value of $1 at run time is '002d'.
    Any help is appreciated.

    Thanks.
     
    Guest, Mar 19, 2007
    #1

  2. Oliver Wong Guest

    The original poster wrote:
    > I have a situation in which I am receiving a String from a non-Java
    > system. The system that generates the String attempts to encode some
    > characters, such as a slash, as Unicode escapes. However, it encodes
    > them using the percent sign rather than the backslash.
    >
    > Thus the String test-victorf becomes test%u002dvictorf. I'd love to
    > be able to simply replace the percent with a backslash, but it seems
    > that there is no way to dynamically insert the backslash like a
    > literal. For example:
    > public static void main(String[] args) {
    >     String user = "test%u002dvictof";
    >     user = user.replace('%', '\\');
    >     System.out.println(user);
    > }
    >
    > Does not work. The output is test\002dvictorf.
    >
    > So I tried to use a regular expression with capturing parentheses:
    > public static void main(String[] args) {
    >     String user = "test%u002dvictof";
    >     user = user.replaceAll(
    >         "%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9])",
    >         Character.toString((char) Integer.valueOf("$1", 16).intValue()));
    >     System.out.println(user);
    > }
    > This generates a java.lang.NumberFormatException because the compiler
    > does not like the $1 at runtime. It seems that the $1 is being
    > interpreted literally. The real value of $1 at run time is '002d'.


    "$1" is interpreted literally, because "$1" is a literal. It has the
    same value at runtime as it does at compile time, namely the two-character
    string consisting of the character '$' followed by the character '1'.
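
To illustrate the point, here is a minimal sketch (the class and method names are made up for this example). Java evaluates the arguments before replaceAll ever runs, so Integer.valueOf("$1", 16) sees only the two characters "$1" and throws. The "$1" is treated as a group reference only when it is still present in the replacement string that replaceAll receives.

```java
final class DollarOneDemo {
    // Hypothetical helper: reports whether parsing the literal "$1" as hex fails.
    static boolean parsingLiteralFails() {
        try {
            Integer.valueOf("$1", 16); // evaluated before any regex matching happens
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    // Here "$1" reaches replaceAll intact, so the regex engine expands it per match.
    static String bracketHexDigits(String s) {
        return s.replaceAll("%u([0-9a-fA-F]{4})", "<$1>");
    }

    public static void main(String[] args) {
        System.out.println(parsingLiteralFails());                // true
        System.out.println(bracketHexDigits("test%u002dvictof")); // test<002d>victof
    }
}
```

So the group reference works, but only as text substituted into the replacement string; it cannot be fed through Java code such as Integer.valueOf on the way.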

    Do the replace in three smaller steps instead of one big step: In the
    first step, extract the "specially-encoded" char, "%u002d", and in the
    second step, convert this 6-character string into a 1-character string
    "-". In the third step, put your 1-character string where it should be in
    the original string you were parsing.
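
The three steps could be sketched roughly as follows. This is only one possible reading of the suggestion: the class name ManualDecoder and its helpers are hypothetical, and the sketch assumes every escape is exactly "%u" followed by four hex digits. Anything that does not fit that shape is copied through unchanged.

```java
final class ManualDecoder {
    // Hypothetical helper: decodes each "%u" + 4 hex digits sequence in the input.
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = input.indexOf("%u", pos);
            if (start < 0 || start + 6 > input.length()) {
                break; // no further complete escape in the input
            }
            out.append(input, pos, start); // copy the text before the escape
            // Step 1: extract the encoded form (here only its four hex digits).
            String hex = input.substring(start + 2, start + 6);
            if (isHex(hex)) {
                // Step 2: convert the hex digits into a single char.
                char decoded = (char) Integer.parseInt(hex, 16);
                // Step 3: put the char where the escape used to be.
                out.append(decoded);
                pos = start + 6;
            } else {
                // Not a valid escape: keep the percent sign and move on.
                out.append('%');
                pos = start + 1;
            }
        }
        return out.append(input, pos, input.length()).toString();
    }

    // True if every character of s is a hexadecimal digit.
    static boolean isHex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(decode("test%u002dvictof")); // test-victof
    }
}
```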

    - Oliver
     
    Oliver Wong, Mar 19, 2007
    #2

  3. On 19.03.2007 17:25, the original poster wrote:
    > I have a situation in which I am receiving a String from a non-Java
    > system. The system that generates the String attempts to encode some
    > characters, such as a slash, as Unicode escapes. However, it encodes
    > them using the percent sign rather than the backslash.
    >
    > Thus the String test-victorf becomes test%u002dvictorf. I'd love to
    > be able to simply replace the percent with a backslash, but it seems
    > that there is no way to dynamically insert the backslash like a
    > literal. For example:
    > public static void main(String[] args) {
    >     String user = "test%u002dvictof";
    >     user = user.replace('%', '\\');
    >     System.out.println(user);
    > }
    >
    > Does not work. The output is test\002dvictorf.


    Well, there is no Unicode escape sequence in the string, so there is
    actually a "%" in the string, which gets replaced. To make the Unicode
    replacement work, the string has to read "test\u002dvictof" in the
    *source code*, because the compiler does the replacement.
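
A small sketch of the difference (the class and method names are invented for illustration). One string contains the escape in the source code, the other only gets its backslash at run time:

```java
final class EscapeTimingDemo {
    // The escape is written in the source, so the compiler turns it into '-'.
    static String fromSource() {
        return "test\u002dvictof";
    }

    // The backslash only appears at run time, so nothing decodes the escape.
    static String fromRuntime() {
        return "test%u002dvictof".replace('%', '\\');
    }

    public static void main(String[] args) {
        System.out.println(fromSource());           // test-victof
        System.out.println(fromSource().length());  // 11
        System.out.println(fromRuntime().length()); // 16
    }
}
```

The second string is five characters longer because the backslash, the 'u' and the four hex digits are all still there as ordinary characters.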

    > So I tried to use a regular expression with capturing parentheses:
    > public static void main(String[] args) {
    >     String user = "test%u002dvictof";
    >     user = user.replaceAll(
    >         "%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9])",
    >         Character.toString((char) Integer.valueOf("$1", 16).intValue()));
    >     System.out.println(user);
    > }
    > This generates a java.lang.NumberFormatException because the compiler
    > does not like the $1 at runtime. It seems that the $1 is being
    > interpreted literally. The real value of $1 at run time is '002d'.


    You need to set a replacement string for every replacement *while
    replacing* because the calculation of the replacement value has to take
    place for every individual match. See

    http://java.sun.com/j2se/1.4.2/docs...ment(java.lang.StringBuffer, java.lang.String)
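
The link above is truncated, but its tail seems to match Matcher.appendReplacement(StringBuffer, String). Assuming that is the method meant, a per-match replacement loop might look like the sketch below. The class name PercentUDecoder is hypothetical. The pattern uses [0-9a-fA-F] rather than the character class from the question, because spaces and '|' inside square brackets are matched literally rather than acting as separators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentUDecoder {
    // "%u" followed by exactly four hex digits, captured as group 1.
    private static final Pattern ESCAPE = Pattern.compile("%u([0-9a-fA-F]{4})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The replacement is computed separately for every match.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded dollar sign or backslash
            // from being read as a group reference or an escape.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("test%u002dvictof")); // test-victof
    }
}
```

On Java 9 and later, Matcher.replaceAll also accepts a function of the match result, which would shorten the loop, but the version above works on older releases too.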


    > Any help is appreciated.


    I think a more proper solution would be to create a custom
    InputStreamReader that does the conversion to char when reading binary.
    Maybe even one of the default encodings does this already. IIRC
    java.util.Properties.load() already does it when reading from files. But
    that would be an ugly hack, so I'd rather either look for an existing
    solution or create your own.
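
For the curious, the Properties hack could look something like this. It is a sketch only, with a made-up class name and a made-up property key "value", and it is shown to make the idea concrete rather than to recommend it. Properties.load gives special meaning to backslashes, line breaks and leading whitespace, so input containing any of those would come out mangled, and a malformed escape makes load throw an IllegalArgumentException.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesHackDemo {
    // Hypothetical helper: lets the properties parser decode the escapes.
    static String decodeViaProperties(String input) {
        // Rewrite "%u" as backslash + 'u' so the parser recognises the escape.
        String line = "value=" + input.replace("%u", "\\u");
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("value");
    }

    public static void main(String[] args) {
        System.out.println(decodeViaProperties("test%u002dvictof")); // test-victof
    }
}
```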

    Kind regards

    robert
     
    Robert Klemme, Mar 19, 2007
    #3
  4. The original poster wrote:
    > I have a situation in which I am receiving a String from a non-Java
    > system. The system that generates the String attempts to encode some
    > characters, such as a slash, as Unicode escapes. However, it encodes
    > them using the percent sign rather than the backslash.
    >
    > Thus the String test-victorf becomes test%u002dvictorf. I'd love to
    > be able to simply replace the percent with a backslash, but it seems
    > that there is no way to dynamically insert the backslash like a
    > literal. For example:
    > public static void main(String[] args) {
    >     String user = "test%u002dvictof";
    >     user = user.replace('%', '\\');
    >     System.out.println(user);
    > }
    >
    > Does not work. The output is test\002dvictorf.


    Actually the output is test\u002dvictof, which is what I thought you
    wanted from your description. If you really want to replace the percent
    encoding with the character it represents, read the other replies.

    H.
    --
    Hendrik Maryns
    http://tcl.sfs.uni-tuebingen.de/~hendrik/
    ==================
    http://aouw.org
    Ask smart questions, get good answers:
    http://www.catb.org/~esr/faqs/smart-questions.html
     
    Hendrik Maryns, Mar 20, 2007
    #4