Chinese character woes continued

Jeff · Jan 16, 2004

I'd like to be able to render Chinese characters from a UTF-8 database
and I'd like to be able to render them from inside the JSP. When I print
ASCII strings of Unicode characters with this page directive:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>

The Unicode strings of Chinese characters print fine, but the data from
the database does not.

If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the Unicode strings to UTF-8, I get a series of
question marks ?????

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");
byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");

<%=utfString%> <- prints ????

TIA,
Jeff

Jon A. Cruz · Jan 16, 2004

Jeff said:
I'd like to be able to render Chinese characters from a UTF-8 database
and I'd like to be able to render them from inside the JSP. When I print
ASCII strings of Unicode characters with this page directive:

Cp1252 is *not* "ASCII". "ASCII" is not a synonym for "8-bit text".

Instead, "ASCII" is a 7-bit encoding, with values ranging from 0 through
127, that's mainly useful only for US English, Latin, Hawaiian, and Swahili.
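
To make the 7-bit point concrete, here is a minimal sketch of a check for whether a Java string stays inside the ASCII range. The class name `AsciiCheck` and the method `isAscii` are made up for illustration and are not part of the original code.

```java
public class AsciiCheck {
    // Hypothetical helper: true when every char fits in 7-bit ASCII (0 through 127).
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Jasper"));                   // only ASCII letters
        System.out.println(isAscii("Jasp\u00e9"));               // e-acute is above 127
        System.out.println(isAscii("\u660e\u67b6\u5e73\u677f")); // Chinese is far above 127
    }
}
```

Anything for which this returns false needs a larger encoding, such as Cp1252 for Western European letters or UTF-8 for Chinese.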

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>

The Unicode strings of Chinese characters print fine, but the data from
the database does not.

I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.

If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the Unicode strings to UTF-8, I get a series of
question marks ?????

Ummm.. what do you mean by "get"? Where do you see those? Did you check
the raw bytes at each point to see their hex values?
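
One way to check the raw bytes is a small hex dump, sketched below. The class `HexDump` and the method `toHex` are hypothetical names, and `StandardCharsets` assumes a JDK newer than the one current when this thread was written (on older JDKs, `getBytes("UTF-8")` does the same job).

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hypothetical helper: two lowercase hex digits per byte, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative byte values print as 80..ff.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String userName = "\u660e\u67b6\u5e73\u677f";
        // Three bytes per character are expected for these code points in UTF-8.
        System.out.println(toHex(userName.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the dump at some stage shows `3f` where multi-byte sequences should be, the characters were already replaced by literal question marks before that stage.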

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");

Ok. That seems good so far.

byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");

OK. That is just wasted code. None of that should be done.

Java strings are *always* the equivalent of 16-bit UTF-16 as far as a Java
programmer is concerned. Thus you just did a big no-op.

Strings are *not* UTF8. They are always 16-bit Unicode. They don't have
different encodings. They are always sequences of 16-bit UTF-16
characters as far as your program is concerned.
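
A small sketch of why that round trip changes nothing, assuming well-formed text (no unpaired surrogates). The class name `RoundTrip` is invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Encode to UTF-8 bytes and decode straight back, as the posted code does.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u660e\u67b6\u5e73\u677f";
        String copy = roundTrip(original);
        // The decoded copy holds the same UTF-16 chars as the original.
        System.out.println(original.equals(copy));
        // Still four chars; the String never "became" UTF-8.
        System.out.println(copy.length());
    }
}
```

The encoding only matters at the boundary where chars become bytes, for example when the response is written out.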

<%=utfString%> <- prints ????

What do you mean by "prints ????"

"Appears in MSIE x.x on Windows x.x as '????'" ?

"Bytes 0x3f arrive on the browser's side of the socket when I sniff the
traffic"?

"shows 0x3f when I dump the bytes in Java as they are sent out"?

"shows as 0x3f when I save the page from my browser and look at it with
a hex editor"?

Jeff · Jan 16, 2004

Jon said:
Cp1252 is *not* "ASCII". "ASCII" is not a synonym for "8-bit text".

Instead, "ASCII" is a 7-bit encoding, with values ranging from 0 through
127, that's mainly useful only for US English, Latin, Hawaiian, and Swahili.

I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.

That's my opinion, thus far.

If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the Unicode strings to UTF-8, I get a series of
question marks ?????

Ummm.. what do you mean by "get"? Where do you see those? Did you check
the raw bytes at each point to see their hex values?

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");

Ok. That seems good so far.

byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");

OK. That is just wasted code. None of that should be done.

Java strings are *always* the equivalent of 16-bit UTF-16 as far as a Java
programmer is concerned. Thus you just did a big no-op.

Strings are *not* UTF8. They are always 16-bit Unicode. They don't have
different encodings. They are always sequences of 16-bit UTF-16
characters as far as your program is concerned.

<%=utfString%> <- prints ????

What do you mean by "prints ????"

"Appears in MSIE x.x on Windows x.x as '????'" ?

"Bytes 0x3f arrive on the browser's side of the socket when I sniff the
traffic"?

"shows 0x3f when I dump the bytes in Java as they are sent out"?

"shows as 0x3f when I save the page from my browser and look at it with
a hex editor"?

I mean, it literally prints four ?s

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Language Test</title>
</head>
<body>
<pre>????</pre>
<pre>????</pre>
1 - ??????? - [B@5a23b - 
4 - ?????? - [B@5df367 - 
5 - ?????? - [B@16fd7c - 
6 - ???? - [B@5bdc4b - 
8 - ?????? - [B@5bf9cf - 
9 - ?????? - [B@6dbfb0 - 
11 - ???? - [B@43f787 - 
21 - ??????? - [B@2fe9bf - 
22 - europäischen - [B@1b1e15 - 
50 - Jasp? - [B@541bfd - 
52 - Jaspé - [B@79ae24 - 
</html>

Jon A. Cruz · Jan 17, 2004

Jeff said:
Jon said:

Jeff said:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>

[SNIP]
I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.

That's my opinion, thus far.

Yup. I just did a quick scan for that attribute.

So first you use contentType to say "This here page will be an HTML
type text page with an encoding of UTF-8" and have that added to the
outgoing response as a literal. Then you turn around and say "use
Windows code page 1252 encoding for the contents".

I was desperate. Frankly, I expected <%=p.getProperty("userName")%> to
work.

Offhand, I'd guess that if that didn't work, then maybe the values were
not placed in there correctly in the first place.

I mean, it literally prints four ?s

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Language Test</title>
</head>
<body>
<pre>????</pre>
<pre>????</pre>
1 - ??????? - [B@5a23b - 
4 - ?????? - [B@5df367 - 
5 - ?????? - [B@16fd7c - 
6 - ???? - [B@5bdc4b - 
8 - ?????? - [B@5bf9cf - 
9 - ?????? - [B@6dbfb0 - 
11 - ???? - [B@43f787 - 
21 - ??????? - [B@2fe9bf - 
22 - europäischen - [B@1b1e15 - 
50 - Jasp? - [B@541bfd - 
52 - Jaspé - [B@79ae24 - 
</html>

You didn't answer my question. You just reiterated your original statement.

What do you mean by "prints" ?

I'm guessing that you are not on a teletype.

So... when you see that content you just pasted, where are you looking?

Is that in your browser?
Is that in your IDE?
Is that in a text editor after you saved from your browser?
Is that console output from System.out.println()?
Is that from a network sniffer grabbing the raw bytes as they arrive at
the client?
Something else?

There can be *many* different causes for "question marks appearing on
my screen". *How* you're getting that output is the key to tracking
down the cause of the problem.
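
As an illustration of just one such cause, and not a diagnosis of this particular page: when Java encodes a string into a charset that cannot represent some of its characters, `String.getBytes` substitutes the charset's replacement byte, which is `?` (0x3f) for US-ASCII and ISO-8859-1. The class `QuestionMarks` and the method `lossy` below are hypothetical names for a minimal sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarks {
    // Encode with the given charset and decode again with the same charset,
    // so any character the charset cannot represent comes back as '?'.
    static String lossy(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String chinese = "\u660e\u67b6\u5e73\u677f";
        String french = "Jasp\u00e9";
        // ISO-8859-1 has no Chinese characters, so each one is replaced.
        System.out.println(lossy(chinese, StandardCharsets.ISO_8859_1));
        // ISO-8859-1 does contain e-acute, so this survives intact.
        System.out.println(lossy(french, StandardCharsets.ISO_8859_1));
        // US-ASCII does not, so only the last character is replaced.
        System.out.println(lossy(french, StandardCharsets.US_ASCII));
    }
}
```

A display that lacks a suitable font, or a console with the wrong encoding, can also show question marks even when the bytes are correct, which is why where the output is viewed matters.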
