Chinese character woes continued

Jeff

I'd like to be able to render Chinese characters from a UTF-8 database
and I'd like to be able to render them from inside the JSP. When I print
ASCII strings of Unicode characters with this page directive:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>

The Unicode strings of Chinese characters print fine, but the data from
the database does not.

If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the Unicode strings to UTF-8, I get a series of
question marks ?????

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");
byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");

<%=utfString%> <- prints ????

TIA,
Jeff
 

Jon A. Cruz

Jeff said:
I'd like to be able to render chinese characters from a UTF-8 database
and I'd like to be able to render them from inside the JSP. When I print
ASCII strings of unicode characters with this page directive:

Cp1252 is *not* "ASCII". "ASCII" is not a synonym for "8-bit text".

Instead "ASCII" is a 7-bit encoding with values ranging from 0 through
127 that's mainly only useful for US English, Latin, Hawaiian, and Swahili.


<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>

The unicode strings of chinese characters print fine, but the data from
the database does not.

I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.
If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the unicode strings to UTF-8, I get a series of
question marks ?????

Ummm.. what do you mean by "get"? Where do you see those? Did you check
the raw bytes at each point to see their hex values?

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");

Ok. That seems good so far.


byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");

OK. That is just wasted code. None of that should be done.

Java strings are *always* the equivalent of 16-bit UTF-16 as far as a Java
programmer is concerned. Thus you just did a big no-op.

Strings are *not* UTF8. They are always 16-bit Unicode. They don't have
different encodings. They are always sequences of 16-bit UTF-16
characters as far as your program is concerned.
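
To illustrate the no-op, here is a minimal sketch, assuming Java 7 or later for StandardCharsets. The class and method names (RoundTripDemo, roundTrip) are made up for this example. It encodes the same escaped string to UTF-8 bytes, decodes it back, and compares the result with the original.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encodes a string to UTF-8 bytes and immediately decodes it again.
    // (Hypothetical helper, named only for this illustration.)
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u660e\u67b6\u5e73\u677f";
        String copy = roundTrip(original);

        // The decoded copy holds exactly the same UTF-16 content.
        System.out.println(original.equals(copy)); // prints true
        System.out.println(copy.length());         // prints 4
    }
}
```

The round trip changes nothing about the String, so whatever turns the text into question marks happens somewhere else, most likely at a point where the String is converted to bytes for output.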


<%=utfString%> <- prints ????

What do you mean by "prints ????"

"Appears in MSIE x.x on Windows x.x as '????'" ?

"Bytes 0x3f arrive on the browser's side of the socket when I sniff the
traffic"?

"shows 0x3f when I dump the bytes in Java as they are sent out"?

"shows as 0x3f when I save the page from my browser and look at it with
a hex editor"?
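
One way to do the "dump the bytes in Java" check is sketched below, assuming Java 7 or later. HexDump and toHex are hypothetical names for this example, not part of any library. It prints each byte of an encoded string as two hex digits, so you can see exactly what would go out on the wire.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Formats a byte array as space-separated two-digit hex values.
    // (Hypothetical helper, named only for this illustration.)
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of negative byte values.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u660e"; // first character of the test string
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // prints e6 98 8e
    }
}
```

If the dump shows e6 98 8e for that character, the data is correct UTF-8 at that point. If it shows 3f, the character was already lost before or during the conversion to bytes.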
 

Jeff

Jon said:
Cp1252 is *not* "ASCII". "ASCII" is not a synonym for "8-bit text".

Instead "ASCII" is a 7-bit encoding with values ranging from 0 through
127 that's mainly only useful for US English, Latin, Hawaiian, and Swahili.





I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.

That's my opinion, thus far.
If I switch to this page directive:
<%@ page pageEncoding="UTF-8" %>

And try to convert the unicode strings to UTF-8, I get a series of
question marks ?????

Ummm.. what do you mean by "get"? Where do you see those? Did you check
the raw bytes at each point to see their hex values?

java.util.Properties p = new java.util.Properties();
p.setProperty("userName", "\u660e\u67b6\u5e73\u677f");


Ok. That seems good so far.


byte[] utfBytes = p.getProperty("userName").getBytes("UTF8");
String utfString = new String(utfBytes, "UTF8");


OK. That is just wasted code. None of that should be done.

Java strings are *always* the equivalent of 16-bit UTF-16 as far as a Java
programmer is concerned. Thus you just did a big no-op.

Strings are *not* UTF8. They are always 16-bit Unicode. They don't have
different encodings. They are always sequences of 16-bit UTF-16
characters as far as your program is concerned.
<%=utfString%> <- prints ????


What do you mean by "prints ????"

"Appears in MSIE x.x on Windows x.x as '????'" ?

"Bytes 0x3f arrive on the browser's side of the socket when I sniff the
traffic"?

"shows 0x3f when I dump the bytes in Java as they are sent out"?

"shows as 0x3f when I save the page from my browser and look at it with
a hex editor"?

I mean, it literally prints four ?s

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Language Test</title>
</head>
<body>
<pre>????</pre>
<pre>????</pre>
1 - ??????? - [B@5a23b - <br>
4 - ?????? - [B@5df367 - <br>
5 - ?????? - [B@16fd7c - <br>
6 - ???? - [B@5bdc4b - <br>
8 - ?????? - [B@5bf9cf - <br>
9 - ?????? - [B@6dbfb0 - <br>
11 - ???? - [B@43f787 - <br>
21 - ??????? - [B@2fe9bf - <br>
22 - europäischen - [B@1b1e15 - <br>
50 - Jasp? - [B@541bfd - <br>
52 - Jaspé - [B@79ae24 - <br>
</html>
 

Jon A. Cruz

Jeff said:
Jon said:
Jeff said:
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="Cp1252" %>
[SNIP]
I'm not as up to speed with JSP, but that looks bad. Looks like you're
mixing two different 8-bit encodings.


That's my opinion, thus far.

Yup. I just did a quick scan for that attribute.

So first you use the contentType to say "This here page will be an HTML
type text page with an encoding of UTF-8" and have that sent out as a
literal. Then you turn around and say "use Windows code page 1252
encoding for the contents".

I was desperate. Frankly, I expected <%=p.getProperty("userName")%> to
work.

Offhand, I'd guess that if that didn't work, then maybe things were not
put in there correctly in the first place.

I mean, it literally prints four ?s

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Language Test</title>
</head>
<body>
<pre>????</pre>
<pre>????</pre>
1 - ??????? - [B@5a23b - <br>
4 - ?????? - [B@5df367 - <br>
5 - ?????? - [B@16fd7c - <br>
6 - ???? - [B@5bdc4b - <br>
8 - ?????? - [B@5bf9cf - <br>
9 - ?????? - [B@6dbfb0 - <br>
11 - ???? - [B@43f787 - <br>
21 - ??????? - [B@2fe9bf - <br>
22 - europäischen - [B@1b1e15 - <br>
50 - Jasp? - [B@541bfd - <br>
52 - Jaspé - [B@79ae24 - <br>
</html>

You didn't answer my question. You just reiterated your original statement.

What do you mean by "prints" ?

I'm guessing that you are not on a teletype.

So... when you see that content you just pasted, where are you looking?

Is that in your browser?
Is that in your IDE?
Is that in a text editor after you saved from your browser?
Is that console output from System.out.println()?
Is that from a network sniffer grabbing the raw bytes as they arrive at
the client?
Something else?

There can be *many* different causes for "question marks appearing on
my screen". *How* you're getting there is the key to tracking down
the cause of the problem.
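
As an illustration of just one of those causes, and not a diagnosis of this particular page, the sketch below (assuming Java 7 or later; QuestionMarkDemo and encodeLatin1 are made-up names) encodes the test string with a charset that has no Chinese characters. Every character the charset cannot represent is replaced by the charset's replacement byte, which for ISO-8859-1 is 0x3f, the question mark.

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes a string as ISO-8859-1; unmappable characters are replaced
    // with the charset's default replacement byte, which is '?' (0x3f).
    // (Hypothetical helper, named only for this illustration.)
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] out = encodeLatin1("\u660e\u67b6\u5e73\u677f");
        StringBuilder sb = new StringBuilder();
        for (byte b : out) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim()); // prints 3f 3f 3f 3f
    }
}
```

If something between the String and the socket encodes with a charset like this, literal question marks are what arrive at the client, regardless of what the charset label on the page says. Note also that values such as [B@5a23b in the pasted output are the default toString() of a byte[] reference, not the contents of the array, so they say nothing about which bytes were actually produced.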
 
