Sending a (UTF-8) query to Google search engine

Kevin

Hi, all! I have spent days now trying to get a simple program to
work.

I want to query Google with queries that contain Unicode (Chinese, Japanese)
characters, using URLConnection.

Try this in your favorite browser:

http://www.google.com/search?hl=en&lr=&q=新宿+&start=0&sa=N

If you go to your browser (choose View->Encoding), you will see that
the browser automatically sets it to UTF-8. If you manually
change it to ISO and type in the URL again, the search returns
wrong results.

It seems that I need to set the HTTP request correctly to UTF-8. How
do I do that?

I am using the following code, and it is NOT working.

**************************************************************
URL urlObject = new URL(url);
HttpURLConnection con = (HttpURLConnection) urlObject.openConnection();
con.setRequestProperty("User-Agent", "Mozilla/4.71 [en] (WinNT; I)");
con.setRequestProperty("Content-Type", "x-www-form-urlencoding; charset=UTF8");
con.setRequestProperty("Content-Encoding", "UTF8");
System.out.println(con.getRequestProperty("Content-Type"));
BufferedReader webData = new BufferedReader(
        new InputStreamReader(con.getInputStream(), "UTF8"));
**************************************************************

Thanks!

Kevin
 
Roedy Green

It seems that I need to set the HTTP request correctly to UTF-8. How
do I do that?

Google has a search parameter for the encoding, apart from the HTTP header.

Try going to the Google website and using their tools to build your
search boxes. They will likely include it.

Also just look at the URL Google constructs when you use one of the
search boxes on their site. e.g.

http://www.google.ca/search?client=...serial+Javax&sourceid=opera&ie=utf-8&oe=utf-8

Note the ie and oe parameters. I presume one describes the encoding of the
URL and one describes the desired encoding of the response.

It seems a bit odd to have a parameter that controls the encoding come after
the data it describes, though.
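
A minimal sketch of that idea, assuming the ie and oe parameters behave as presumed above (that is not confirmed here). The class and method names are made up for illustration. The query is percent-encoded as UTF-8 with java.net.URLEncoder before it goes into the URL, so the URL itself stays plain ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class GoogleQueryUrl {

    // Builds a search URL whose q parameter is percent-encoded as UTF-8.
    // The ie/oe parameters are copied from the URL quoted above; what they
    // mean exactly is an assumption.
    static String build(String query) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "http://www.google.com/search?hl=en&q=" + encoded
                    + "&ie=utf-8&oe=utf-8";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the two characters of the query in the question,
        // so the source file itself stays ASCII.
        System.out.println(build("\u65B0\u5BBF"));
    }
}
```

The resulting string can be passed to new URL(...) and opened with HttpURLConnection as in the original code, without setting Content-Type or Content-Encoding on the request.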
 
NOBODY

In the HTTP/URL specs, the URI (the part of the URL after the server's
host[:port]) cannot contain raw UTF-8 (unless some newer IRI spec is
considered, but that is not the point here since most servers aren't there yet).
You would have to POST to the server, not GET (change the method on
HttpURLConnection).

In HTTP, the query on a GET is typically assumed to be ISO-8859-1 (standard
Latin-1). Maybe some servers can understand a UTF-8 URI, but don't count on that.
Setting the request content type is futile, as there is no content sent
on a GET.
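
To make the charset question concrete, here is a small sketch (class and method names are hypothetical) that percent-encodes the same character under both charsets. The bytes differ, so a server that assumes one charset will misread a query escaped in the other.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryCharsetDemo {

    // Percent-encodes text using the named charset.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // e with acute accent, present in both charsets
        System.out.println(encode(eAcute, "ISO-8859-1")); // prints %E9
        System.out.println(encode(eAcute, "UTF-8"));      // prints %C3%A9
    }
}
```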

On a POST, however, you can set the type to
application/x-www-form-urlencoded; charset="UTF-8"
and in the body, send your params UTF-8 encoded and then URL-encoded
(see java.net.URLEncoder). Now, you just typed
....charset=UTF8
instead of (try these, I can't remember which is good)
....charset=\"UTF8\""
....charset=\"UTF-8\""
And don't fiddle with Content-Encoding. I don't think you should touch
that (read the HTTP RFC carefully for the meaning of each header).
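
A rough sketch of the POST variant, under a few assumptions: the endpoint passed to post() is a placeholder, whether a given server accepts POST for searches is not something this thread confirms, and the class and method names are invented for illustration. It uses the registered media type name, application/x-www-form-urlencoded, with an unquoted charset.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class Utf8FormPost {

    // Builds a one-parameter form body, percent-encoding name and value as UTF-8.
    static String formBody(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "="
                    + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Sends the body with POST. The endpoint is supplied by the caller;
    // no real service is assumed here.
    static HttpURLConnection post(String endpoint, String body) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(endpoint).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true); // required before writing a request body
        con.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        // The body is already percent-encoded, so it contains only ASCII.
        byte[] bytes = body.getBytes(StandardCharsets.US_ASCII);
        try (OutputStream out = con.getOutputStream()) {
            out.write(bytes);
        }
        return con;
    }

    public static void main(String[] args) {
        // Only builds and prints the body; no network call is made here.
        System.out.println(formBody("q", "\u65B0\u5BBF"));
    }
}
```

Content-Encoding is deliberately left alone: it describes compression such as gzip, not the character set.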


Finally, the browser's View->Encoding setting is totally useless here, as it is
only a rendering setting (although it may affect the content-type charset when
clicking). It is like forcing the browser to read the bytes in a given
encoding (and letting it fail on any errors).
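
The same effect can be reproduced in a few lines of Java. This is only an illustration with hypothetical names: the bytes stay the same, and only the charset used to read them changes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Interprets the same raw bytes under the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String original = "\u65B0\u5BBF"; // the two-character query from the question
        byte[] raw = original.getBytes(StandardCharsets.UTF_8); // six bytes

        // Read with the right charset: the original text comes back.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original)); // prints true

        // Read as ISO-8859-1: each byte becomes its own character.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length()); // prints 6
    }
}
```

This is also why the InputStreamReader in the original code has to be given the charset the server actually used for the response.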
 
