Regex and Unicode

michael.biden

I am receiving a String from a non-Java system. The system that
generates the String encodes some characters, such as a slash, as
Unicode escapes. However, it marks the escapes with a percent sign
rather than a backslash.

Thus the String test-victorf becomes test%u002dvictorf. I'd love to
be able to simply replace the percent sign with a backslash, but there
seems to be no way to insert the backslash dynamically so that it is
treated the way it would be in a literal. For example:
    public static void main (String args[]){
        String user = "test%u002dvictof";
        user = user.replace('%', '\\');
        System.out.println(user);
    }

Does not work. The output is test\002dvictorf.

So I tried to use a regular expression with capturing parentheses:
    public static void main (String args[]){
        String user = "test%u002dvictof";
        user = user.replaceAll(
            "%u([a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9][a-f | A-F | 0-9])",
            Character.toString((char)Integer.valueOf("$1", 16).intValue()) );
        System.out.println(user);
    }
This generates a java.lang.NumberFormatException because the compiler
does not like the $1 at runtime. It seems that the $1 is being
interpreted literally. The real value of $1 at run time is '002d'.

Any help is appreciated.

Thanks.
 
Oliver Wong

"$1" is interpreted literally, because "$1" is a literal. It has the
same value at runtime as it does at compile time, namely the two-character
string consisting of the character '$' followed by the character '1'.

Do the replacement in three smaller steps instead of one big step. In the
first step, extract the "specially-encoded" char, "%u002d". In the
second step, convert this 6-character string into the 1-character string
"-". In the third step, put your 1-character string where it should be in
the original string you were parsing.
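
The three steps above could be sketched roughly as follows. This is only an illustration: the class and method names (ThreeStepDecode, decodeFirst) are made up here, the input is assumed to be well formed (four hex digits after "%u"), and only the first encoded sequence is handled.

```java
final class ThreeStepDecode {
    // Hypothetical helper: decodes only the first %uXXXX sequence it finds.
    // Assumes the four characters after "%u" are hex digits; otherwise
    // Integer.parseInt throws a NumberFormatException.
    static String decodeFirst(String input) {
        int start = input.indexOf("%u");
        if (start < 0 || start + 6 > input.length()) {
            return input; // nothing to decode
        }
        // Step 1: extract the 6-character encoded sequence, e.g. "%u002d".
        String encoded = input.substring(start, start + 6);
        // Step 2: convert its four hex digits into a single character.
        char decoded = (char) Integer.parseInt(encoded.substring(2), 16);
        // Step 3: put that character back where the encoded sequence was.
        return input.substring(0, start) + decoded + input.substring(start + 6);
    }

    public static void main(String[] args) {
        System.out.println(decodeFirst("test%u002dvictorf")); // prints test-victorf
    }
}
```

To handle several encoded sequences, the same steps would have to be repeated in a loop until indexOf finds no more "%u".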

- Oliver
 
Robert Klemme

Well, there is no Unicode escape sequence in the string, so there is
actually a "%" in the string, and that is what gets replaced. To make the
Unicode replacement work, the string has to read "test\u002dvictof" in the
*source code*, because it is the compiler that does the replacement.
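
A small sketch of that difference, in case it helps. The class and method names (EscapeDemo, compiled, replaced) are made up for the illustration; the point is only that an escape written in the source is translated by the compiler, while a backslash created at run time is just a backslash.

```java
final class EscapeDemo {
    // The compiler translates the escape inside this literal, so the
    // returned string contains a single '-' character at that position.
    static String compiled() {
        return "test\u002dvictof";
    }

    // Here the backslash only comes into existence at run time. Nothing
    // re-scans the result, so it stays a backslash followed by the five
    // characters 'u', '0', '0', '2', 'd'.
    static String replaced() {
        return "test%u002dvictof".replace('%', '\\');
    }

    public static void main(String[] args) {
        System.out.println(compiled().length());           // 11
        System.out.println(replaced().length());           // 16
        System.out.println(compiled().equals(replaced())); // false
    }
}
```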

You need to set a replacement string for every replacement *while
replacing* because the calculation of the replacement value has to take
place for every individual match. See

http://java.sun.com/j2se/1.4.2/docs...ent(java.lang.StringBuffer, java.lang.String)
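
A minimal sketch of computing the replacement per match, assuming the truncated link above refers to Matcher.appendReplacement(StringBuffer, String). The class and method names (PercentUDecoder, decode) are made up for the example. Note that the pattern uses [0-9a-fA-F] rather than the original [a-f | A-F | 0-9]: inside a character class, spaces and '|' are literal characters, so the original class would also match them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentUDecoder {
    // "%u" followed by exactly four hex digits, captured as group 1.
    private static final Pattern ESCAPE = Pattern.compile("%u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces every %uXXXX sequence with the
    // character whose code unit is the hex value XXXX.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The replacement is computed here, once per match, from the
            // text that group 1 actually captured.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being
            // read as a group reference or an escape in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("test%u002dvictorf")); // prints test-victorf
    }
}
```

Sequences that do not match the pattern, such as a lone "%", are left untouched by this sketch.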

I think a more proper solution would be to create a custom
InputStreamReader that does the conversion to char when reading the
binary input. Maybe one of the default encodings even does this already.
IIRC java.util.Properties.load() already does it when reading from files,
but relying on that would be an ugly hack, so I'd rather either look for
an existing decoder or create your own solution.

Kind regards

robert
 
Hendrik Maryns

Actually the output is test\u002dvictof, which is what I thought you
wanted from your description. If you really want to replace the percent
encoding with the character it represents, read the other replies.

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
