UTF-8 Reading for URL - not working

Amith

Hello all,

I have a problem, when i read a webpage contents (with UTF-8
characterset) and try to display it.. it is just considered as unicode
string
please help me

here is the code


import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/transliterate/indic?tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        url.openStream(), "UTF8"));

        String inputLine = "";
        String fullString = "";

        while ((inputLine = in.readLine()) != null)
            fullString = fullString + new String(inputLine.getBytes(), "UTF-8");

        String string = fullString.substring(fullString.indexOf("[\"") + 2,
                fullString.indexOf("\",]"));
        System.out.println(string);

        in.close();
    }
}
 
Lothar Kimmeringer

Amith said:
I have a problem, when i read a webpage contents (with UTF-8
characterset) and try to display it..

You left away the interesting part: What is your problem?


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
Amith

My problem is the UTF-8 string which i read from the URL is considered
as unicode.. i need it as UTF-8

i want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"
 
Lothar Kimmeringer

Amith said:
My problem is the UTF-8 string which i read from the URL is considered
as unicode.. i need it as UTF-8

i want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

What is this line for:
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")

First of all, use a StringBuilder rather than String concatenation.
Second, why do you create a byte array from a string, only to create
a new string from it and append that to an existing one? Just doing
fullString += inputLine
should be enough (and should solve your problem, by the way). As said
above, switch to a StringBuilder as the next step.
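
A minimal sketch of the loop Lothar describes, assuming the response really is UTF-8: the InputStreamReader performs the only decoding step, and each line is appended to a StringBuilder unchanged. The class name Utf8Lines and the method readAll are illustrative and not from the original post; the in-memory stream in main stands in for url.openStream(), and StandardCharsets needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the code posted in the thread.
final class Utf8Lines {
    // Decodes the stream as UTF-8 exactly once and joins the lines.
    // The caller owns the stream and is responsible for closing it.
    static String readAll(InputStream stream) throws IOException {
        StringBuilder text = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // 'line' is already a decoded String; no getBytes() round trip.
            text.append(line);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): a small UTF-8 encoded payload.
        byte[] utf8 = "[\"\u0CA8\u0CAE\",]".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8)));
    }
}
```

Note that readLine() drops line terminators, so the lines are joined without separators, just as in the original loop.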


Regards, Lothar
 
Amith

even if it is fullString = fullString + inputLine;
it doesn't work, i have tried it; some more useless experiments led me
to this
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")
 
Lew

Amith said:
My problem is the UTF-8 string which i [sic] read from the URL is considered
as unicode.. i [sic] need it as UTF-8

UTF-8 *is* Unicode!
i [sic] want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"
public class URLReader {
public static void main(String[] args) throws Exception {
URL url = new URL("http://www.google.com/transliterate/indic?
tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
BufferedReader in = new BufferedReader(
new InputStreamReader(
url.openStream(), "UTF8"));

String inputLine = "";

No need to initialize 'inputLine' to a value you are just going to throw away.
String fullString = "";


while ((inputLine = in.readLine()) != null)
fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

This is silly. Just do what Lothar said and append the String to the String.
I'm also pretty sure this isn't correct anyway: the way you defined the
BufferedReader, the bytes have already been decoded from UTF-8 on the way in
to 'inputLine', so 'getBytes()' re-encodes those characters using the
platform's default charset. Decoding those bytes again as UTF-8 only works if
that default happens to be UTF-8. In any event, straightforward String
concatenation or, as Lothar suggested, StringBuilder concatenation should keep
encoding issues out of the way.

Strings in Java internally will always be UTF-16.
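
To make the round trip concrete, here is a small sketch. String.getBytes() with no argument encodes with the platform default charset, so the outcome of new String(inputLine.getBytes(), "UTF-8") depends on the machine. The hypothetical helper reencode below takes that default as an explicit parameter so the result is reproducible; ISO-8859-1 is only an example of a default that cannot represent Kannada.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the code posted in the thread.
final class RoundTripDemo {
    // Mimics new String(s.getBytes(), "UTF-8"), with the platform default
    // charset passed in explicitly instead of being read from the JVM.
    static String reencode(String s, Charset platformDefault) {
        byte[] bytes = s.getBytes(platformDefault);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kannada = "\u0CA8\u0CAE"; // first two letters of the expected output

        // Default is UTF-8: the round trip changes nothing.
        System.out.println(reencode(kannada, StandardCharsets.UTF_8).equals(kannada)); // prints true

        // Default cannot encode Kannada: getBytes() substitutes '?' for each
        // character, and the original text cannot be recovered afterwards.
        System.out.println(reencode(kannada, StandardCharsets.ISO_8859_1)); // prints ??
    }
}
```

Either way the extra conversion gains nothing, which is why appending inputLine directly is the safer choice.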
String string = fullString.substring(fullString.indexOf("[\"") + 2,
fullString.indexOf("\",]"));
System.out.println(string);

This will display the String using the platform's default encoding.
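
If the console is known to expect UTF-8, one option is to wrap System.out in a PrintStream with an explicit encoding rather than relying on the default. This is a sketch under that assumption; the class and method names are made up for illustration, and the terminal still needs a font with Kannada glyphs.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of the code posted in the thread.
final class Utf8Out {
    // Wraps an output stream in an auto-flushing PrintStream that encodes
    // text as UTF-8 instead of the platform default.
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        out.println("\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1");
    }
}
```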
in.close();

This should be in a 'finally' block tightly associated with the input loop.
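
A sketch of both shapes, using a hypothetical firstLine helper in place of the full loop: the classic try/finally form, and the try-with-resources form available on Java 7 or later. In both, closing the BufferedReader also closes the underlying stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the code posted in the thread.
final class ClosingRead {
    // Reads the first line; the reader is closed even if readLine() throws.
    static String firstLine(InputStream stream) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        try {
            return in.readLine();
        } finally {
            in.close(); // runs on normal return and on exception
        }
    }

    // Same behavior with try-with-resources (Java 7 or later).
    static String firstLineTwr(InputStream stream) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }
}
```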

Do not use TAB characters for indentation of Usenet posts. Use spaces, up to
four per indent level. To get help you might want to keep the code readable.
 
Lothar Kimmeringer

Amith said:
even if it is fullString = fullString + inputLine;

Then it's quite likely that the stream you open is not
delivering bytes of UTF-8 encoded data


Regards, Lothar
 
markspace

Lothar said:
Then it's quite likely that the stream you open is not
delivering bytes of UTF-8 encoded data


or the stream actually contains the string "\u0CA8\u0CAE\u0CCD" etc.
I.e., it's UTF-8 with something else encoded on top of that.

Or the problem is he doesn't have the right glyphs installed on his
system, so he can't see the Kannada characters.

All of which sum up to "it's not in the code you've shown us."
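
If the first possibility is the real one, i.e. the response body contains the six literal characters of each \uXXXX escape, then the text has to be unescaped after it is read. Below is a minimal sketch of such a step. The class UnicodeEscapes is hypothetical, it handles only four-hex-digit escapes, and a proper JSON parser would be the more robust choice if the response is JSON.

```java
// Hypothetical helper, not part of the code posted in the thread.
final class UnicodeEscapes {
    // Replaces each literal escape (backslash, 'u', four hex digits) with
    // the character it names; all other text is copied through unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()
                    && isHex(s, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

In the original program this would be applied to the extracted substring, for example System.out.println(UnicodeEscapes.unescape(string)).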
 
Roedy Green

Hello all,

I have a problem, when i read a webpage contents (with UTF-8
characterset) and try to display it.. it is just considered as unicode
string
please help me

here is the code


import java.net.*;
import java.io.*;

while ((inputLine = in.readLine()) != null)
fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

The text comes in dribs and drabs. See
http://mindprod.com/products.html#HTML for code to do that properly
that won't go into a tight loop reading empty strings.
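
This is not the code from the link above, only a generic sketch of reading a stream to the end when data arrives in pieces: Reader.read(char[]) blocks until at least one character is available and returns -1 only at end of stream, so the loop neither spins on empty reads nor stops early. The class name ReadFully is an assumption for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the code posted in the thread.
final class ReadFully {
    // Decodes the whole stream as UTF-8, however the bytes are chunked.
    // The caller owns the stream and is responsible for closing it.
    static String readAll(InputStream stream) throws IOException {
        Reader in = new InputStreamReader(stream, StandardCharsets.UTF_8);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            // Only the first n chars of the buffer are valid on this pass.
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

Unlike a readLine() loop, this keeps line terminators in the result.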
 
Joshua Cranmer

I used the "User Agent Switcher" extension in Firefox to select a
different User-Agent string (Lynx, for example), and got the Latin1
version there too, so that's what the server is switching on. (That's
an exceedingly daft thing for a server to do, btw, when there's a
perfectly good Accept-Charset header to use instead.)

I thought UA sniffing went out of fashion years ago. Then I discovered
that Google Wave sniffed in a rather limited manner when doing other
work. Now I see that Google sniffs here too. Now I'm never going to
trust Google's actual text results when I see them in the browser window.
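
For anyone hitting the same server behavior, a hedged sketch of the client side: send explicit User-Agent and Accept-Charset request headers, then decode with whatever charset the response's Content-Type header declares instead of hard-coding one. The class name, the User-Agent value passed in, and the UTF-8 fallback are assumptions for illustration; whether a given server honors either header is up to the server.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the code posted in the thread.
final class CharsetNegotiation {
    // Opens a connection that sends a chosen User-Agent and asks for UTF-8.
    static URLConnection open(URL url, String userAgent) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        return conn;
    }

    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an
    // assumption) when the header is missing or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

A reader would then be built as new InputStreamReader(conn.getInputStream(), CharsetNegotiation.charsetOf(conn.getContentType())).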
 
