How do I expand a Unicode string to its visual UTF8 representation?

Mayeul

Arne said:
Java uses the same concept as other widely used languages.

I think the point of neuneudr is that Java 'char' and
java.lang.Character types are misleading in the same way C 'char' type
is. And that charAt(), length() and other operate-on-single-char-values
methods in the String class are even more misleading.

This is all because at the time the decision was made it fitted
perfectly with Unicode, but doesn't anymore, yet Java needs to maintain
compatibility.
Still, though I don't see the need to rant on and on about it,
it's true.
 
Arne Vajhøj

Mayeul said:
I think the point of neuneudr is that Java 'char' and
java.lang.Character types are misleading in the same way C 'char' type
is. And that charAt(), length() and other operate-on-single-char-values
methods in the String class are even more misleading.

This is all because at the time the decision was made it fitted
perfectly with Unicode, but doesn't anymore, yet Java needs to maintain
compatibility.
Still, though I don't see the need to rant on and on about it,
it's true.

Gosling did not have a crystal ball.

Next language will get that part right.

And then something else will be found later for that language.

Arne
 
Roedy Green

Gosling did not have a crystal ball.
Also, his language, Oak, was intended for a set-top TV box. I think it
turned out to be far more extensible than what most people would have
created for such a purpose.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Let us pray it is not so, or if it is, that it will not become widely known."
~ Wife of the Bishop of Exeter on hearing of Darwin's theory of the common descent of humans and apes.
 
Arne Vajhøj

Andrew said:
If I store the data in a varchar as this:

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

then Java will do the work of conversion for me automatically.

No, it will not.

The Java compiler does that for Java source code, but that
is something else.
I think I am right. When the \uxxxx strings are in a file and I read
them in, printing gives the correct result.

It does not.

(not unless it is a properties file)

Arne
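
Since plain Java I/O leaves the escapes untouched, text read back from the varchar column would have to be decoded explicitly. A minimal sketch of one way to do that, assuming every escape is a backslash, a lowercase u and exactly four hex digits; the class and method names are made up for illustration, and a doubled backslash in the stored text is not treated specially:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Matches a literal backslash, a lowercase u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces each escape sequence in s with the UTF-16 unit it names.
    public static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        // StringBuffer, because the StringBuilder overloads only exist since Java 9.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or a backslash in the decoded text.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the compiler from translating the escape,
        // so this string really holds the six-character sequence.
        String stored = "Copyright \\u00A9 2009";
        System.out.println(stored.length());            // 21
        System.out.println(unescape(stored).length());  // 16
    }
}
```

Whether the decoded characters then print correctly still depends on the console encoding, as the question marks in other posts in this thread show.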
 
Mike Amling

Andrew said:
Hello,

I have an example program below that contains weird Icelandic
characters, and a copyright symbol, just for good measure. The code
expresses these as UTF8. They print exactly as you would want/expect
them to. So far so good. But what I want is to be able to go the other
way. I want to take a unicode string and recreate the escape sequences
for the funny international characters. For example, the single
character E-acute should be expanded to \u00C9 (6 characters). Any
ideas on how to do this please?

public class UTF8Test {
    public UTF8Test() {
    }

    public String getString() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        return builder.toString();
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        System.out.println(test.getString());
    }
}

FWIW, the reason I want to do this is I need to write strings like
this to a sybase table where the column is of type varchar. We cannot
make it univarchar (don't ask). So I need to be able to write unicode
characters without using unicode chars! I thought by having them in
this expanded form java can convert them just like the program above
does.

public class UTF8Test {
    public UTF8Test() {
    }

    public String getString() {
        StringBuilder builder = new StringBuilder();
        builder.append("Copyright \u00A9 2009\n");
        builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
        builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
        return builder.toString();
    }

    public static void main(String[] args) {
        UTF8Test test = new UTF8Test();
        final String what = test.getString();
        System.out.println(what);

        // Re-escape: printable ASCII passes through, everything else is escaped.
        for (int jj = 0; jj < what.length(); ++jj) {
            final char which = what.charAt(jj);
            if (which == '\n') {
                System.out.print("\\n");
            } else if (which >= ' ' && which <= 0x7E) {
                System.out.print(which);
            } else {
                // four upper-case hex digits per UTF-16 unit
                System.out.printf("\\u%04X", (int) which);
            }
        }
        System.out.println();
    }
}

I think all the talk of UTF-8 and UTF-16 and encoding and system
properties is off the make. I think this is what you're looking for.

Copyright ? 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me
?g get eti? gler ?n ?ess a? mei?a mig
Copyright \u00A9 2009\u000AHere is the phrase (in Icelandic): I can eat
glass and it doesn't hurt me\n\u00C9g get eti\u00F0 gler \u00E1n
\u00FEess a\u00F0 mei\u00F0a mig
 
Roedy Green

Copyright \u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig

The JDK utility native2ascii will do that. See
http://mindprod.com/jgloss/encoding.html
for an example of use.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"We must be very careful when we give advice to younger people: sometimes
they follow it!"
~ Edsger Wybe Dijkstra, born: 1930-05-11 died: 2002-08-06 at age: 72
 
Mike Amling

Mike said:
I think all the talk of UTF-8 and UTF-16 and encoding and system
properties is off the make. I think this is what you're looking for.

I meant to say "IMHO, ... off the MARK." That is, WADR to the other
posters, if I understand the OP, he just wants the non-ASCII (and
non-space whitespace) characters escaped to \u1234 format.
If the OP is going to use a PreparedStatement (yay!) to Insert the
escaped String into the database, then there is no requirement to escape
'\'' or '"'.

Hmm... If "\u006E" is a String literal containing lowercase n, and
"\\u006E" is a String literal containing a backslash, then what trick is
needed to get a 6-character String literal containing a backslash, a
lowercase u and four hex digits?

--Mike Amling
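
On the closing question above: as far as I can tell from the Java Language Specification (section 3.3), no trick is needed. A backslash only starts a Unicode escape when it is preceded by an even number of backslashes, so the doubled-backslash form is already the 6-character string. A small sketch to check this; the class and constant names are made up for illustration:

```java
public class BackslashULiteral {
    // Single backslash: translated to a lowercase n before the literal is parsed.
    public static final String TRANSLATED = "\u006E";

    // Doubled backslash: the second one is preceded by an odd number of
    // backslashes, so no Unicode escape starts here and the pair is an
    // ordinary escaped backslash.
    public static final String LITERAL = "\\u006E";

    public static void main(String[] args) {
        System.out.println(TRANSLATED.length()); // 1
        System.out.println(LITERAL.length());    // 6
        System.out.println(LITERAL);             // backslash, u, 0, 0, 6, E
    }
}
```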
 
Arne Vajhøj

Andrew said:
Well I figured since you had a fairly sophisticated question and
appeared to have some knowledge of Java that you could figure out how to
use the 'if' statement yourself. Oh and just so you don't complain that
I used lower case hex, I fixed that too.
public void doit() {
    StringBuilder builder = new StringBuilder();
    builder.append("Copyright \u00A9 2009\n");
    builder.append("Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt me\n");
    builder.append("\u00C9g get eti\u00F0 gler \u00E1n \u00FEess a\u00F0 mei\u00F0a mig");
    String str = builder.toString();

    System.out.println(str);

    byte[] buf = str.getBytes();
    for (byte b : buf) {
        if ((b & 0x80) == 0)
            System.out.print(new String(new byte[] { b }));
        else
            System.out.printf("\\u%04X", b);
    }
}

I do appreciate you trying to help but I'm afraid that code does not
do the job. When I run it, this is what I get:

Copyright \u00C2\u00A9 2009
Here is the phrase (in Icelandic): I can eat glass and it doesn't hurt
me
\u00C3\u0089g get eti\u00C3\u00B0 gler \u00C3\u00A1n \u00C3\u00BEess a
\u00C3\u00

For example, the copyright symbol comes out as 00C2 when I expect
00A9. The E-acute comes out as 00C3 where I expect 00C9.

You need to replace:

byte[] buf = str.getBytes();

with:

byte[] buf = str.getBytes(YOUR_CHARSET);

\u00C2\u00A9 is UTF-8.

\u00A9 is ISO-8859-1 (and possibly others)

Arne
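
To make the difference concrete, here is a small sketch that dumps the bytes of the copyright sign under both charsets. StandardCharsets needs Java 7 or later, and the class and helper names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class CharsetBytes {
    // Formats bytes as two-digit upper-case hex values separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not sign-extend.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";
        System.out.println(hex(copyright.getBytes(StandardCharsets.UTF_8)));      // C2 A9
        System.out.println(hex(copyright.getBytes(StandardCharsets.ISO_8859_1))); // A9
    }
}
```

Note that ISO-8859-1 only covers code points up to 0xFF, so getBytes() substitutes a question mark for anything beyond that. Looping over the chars of the String, as in Mike's version, avoids the encoding step altogether.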
 
Andrew

According to Andrew:
Internally, Java strings are sequences of 16-bit units (exactly the kind
which fits neatly in a Java 'char') which are equal to the Unicode code
points from the first plane. E.g., the Unicode code point for 'A' is 65,
and Java specifies that: "A".charAt(0) == 65.

Code points from the 16 other planes are represented in Java as
"surrogate pairs", i.e. two 16-bit units, the first being in the high
surrogate range (0xD800 to 0xDBFF) and the second being in the low
surrogate range (0xDC00 to 0xDFFF). Surrogate pairs are what is used in
UTF-16 encoding, and Java uses the same system for pretty much the same
reasons.
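
A short sketch of the surrogate-pair behaviour described above, using U+1D11E (the musical G clef) as an arbitrary example from outside the first plane. The class and method names are made up for illustration:

```java
public class SurrogateDemo {
    // Returns the UTF-16 units Java uses for one code point, as hex.
    public static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(utf16Units('A'));      // 0041
        System.out.println(utf16Units(0x1D11E));  // D834 DD1E

        // length() counts 16-bit units, not code points.
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
    }
}
```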

A useful summing up....
Actually Java can do such conversions, but only in some very specific
situations. I know of only two places where Java converts \u escapes
into the corresponding code points:
-- within the Java compiler, when the \u is encountered in a Java
source code;
-- when decoding java.util.Properties with Properties.load().

Yes. I thought it was more general but I now realise I was wrong.
Both of these have some strong requirements on the input data (the
text must be either a correct Java source code file, or a properly
formed encoded Properties object) so I would not advise using the
\u escapes, unless you are in a situation where you precisely
process Java source code (e.g. your application is a code generator),
or Properties.

Yes, I see that now, thanks to your advice and that of several other
people on this thread.
On a more general basis, Java does not interpret input data. When
you use a FileReader, you get the raw characters, and if you have
a \u escape, then you will read 0x5C (the Unicode code point for a
backslash), then 0x75 (the code point for 'u'), and so on.

Yes. I was mistaken.
        --Thomas Pornin
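
The two behaviours Thomas describes can be compared side by side. In this sketch a StringReader stands in for the FileReader (an assumption made to keep the example self-contained), and the class, method and property names are made up for illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapeReading {
    // The text as it would sit in a file: the key, '=', then a
    // six-character escape sequence followed by the letter g.
    public static final String RAW = "phrase=\\u00C9g";

    // Properties.load() decodes the escape into a single character.
    public static String viaProperties() {
        try {
            Properties props = new Properties();
            props.load(new StringReader(RAW));
            return props.getProperty("phrase");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A plain Reader hands back the characters exactly as stored.
    public static String viaReader() {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new StringReader(RAW)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaProperties().length()); // 2
        System.out.println(viaReader().length());     // 14
        // The backslash and the u arrive as ordinary characters.
        System.out.printf("%02X %02X%n",
                (int) viaReader().charAt(7), (int) viaReader().charAt(8)); // 5C 75
    }
}
```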

Actually, the problem has mutated. I now have two operations to
perform on the data. The first is to handle an internationalisation
issue, the second is to handle some special markup. The markup problem
made me realise that the end result I need is a string that can be
stored in a varchar column where the string contains HTML to handle
special formatting. The markup can be dealt with using XSLT. The
internationalisation will need to convert unicode characters that are
outside of the character set supported by 7 bit ASCII to equivalent
HTML using &#nnnn; character references (or something similar).

Regards,

Andrew M.
 
Arne Vajhøj

Andrew said:
Actually, the problem has mutated. I now have two operations to
perform on the data. The first is to handle an internationalisation
issue, the second is to handle some special markup. The markup problem
made me realise that the end result I need is a string that can be
stored in a varchar column where the string contains HTML to handle
special formatting. The markup can be dealt with using XSLT. The
internationalisation will need to convert unicode characters that are
outside of the character set supported by 7 bit ASCII to equivalent
HTML using &#nnnn; character references (or something similar).

This is practically the same problem as the unicode
escape with practically the same solution:

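public static String encode(String s) {
    StringBuilder sb = new StringBuilder();
    // Step by code point so a surrogate pair becomes one reference.
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp <= 127) {
            // 7-bit ASCII passes through unchanged
            sb.append((char) cp);
        } else {
            // hexadecimal reference needs the 'x' after "&#"
            sb.append("&#x").append(Integer.toHexString(cp)).append(';');
        }
        i += Character.charCount(cp);
    }
    return sb.toString();
}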

Arne
 
