UTF-8 Reading for URL - not working

Amith · Feb 23, 2010

Hello all,

I have a problem: when I read the contents of a web page (served with
the UTF-8 character set) and try to display it, it is just treated as a
Unicode string.
Please help me.

Here is the code:

import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/transliterate/indic?tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        url.openStream(), "UTF8"));

        String inputLine = "";
        String fullString = "";

        while ((inputLine = in.readLine()) != null)
            fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

        String string = fullString.substring(fullString.indexOf("[\"") + 2,
                fullString.indexOf("\",]"));
        System.out.println(string);

        in.close();
    }
}

Amith · Feb 23, 2010

The URL in the above post should be

"http://www.google.com/transliterate/indic?tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1"

Amith · Feb 23, 2010

The URL used above is

http://www.google.com/transliterate/indic?tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1

Lothar Kimmeringer · Feb 23, 2010

Amith said:
I have a problem: when I read the contents of a web page (served with
the UTF-8 character set) and try to display it...

You left out the interesting part: what is your problem?

Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!

Amith · Feb 23, 2010

My problem is the UTF-8 string which i read from the URL is considered
as unicode.. i need it as UTF-8

i want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

Lothar Kimmeringer · Feb 23, 2010

Amith said:
My problem is the UTF-8 string which i read from the URL is considered
as unicode.. i need it as UTF-8

i want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

What is this line for?
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")

First of all, use a StringBuilder rather than String concatenation.
Second, why do you create a byte array from a string, only to create
a new string from it and add that to an existing one? Just doing
fullString += inputLine
should be enough (and should solve your problem, by the way). As said
above, switch to a StringBuilder as the next step.
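
A minimal sketch of that suggestion, assuming the Reader has already been
created with the right charset. The class and method names here are made up
for illustration, and a StringReader stands in for the InputStreamReader
wrapped around url.openStream() so the example runs without a network:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadAll {
    // Appends every line to a StringBuilder. No re-encoding is needed,
    // because the Reader has already decoded the bytes into chars.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; the real program would pass
        // new InputStreamReader(url.openStream(), "UTF-8") instead.
        System.out.println(readAll(new StringReader("first line\nsecond line\n")));
    }
}
```

Note that readLine() drops the line terminators, so the lines are joined
without separators, just as in the original loop.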

Regards, Lothar

Amith · Feb 23, 2010

Even if it is
fullString = fullString + inputLine;
it doesn't work. I have tried it; some more useless experiments led me
to this:
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")

Lew · Feb 23, 2010

Amith said:
My problem is the UTF-8 string which i [sic] read from the URL is considered
as unicode.. i [sic] need it as UTF-8

UTF-8 *is* Unicode!

i [sic] want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/transliterate/indic?tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        url.openStream(), "UTF8"));

String inputLine = "";

No need to initialize 'inputLine' to a value you are just going to throw away.

String fullString = "";

while ((inputLine = in.readLine()) != null)
fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

This is silly. Just do what Lothar said and add the String to the String.
I'm also pretty sure this isn't correct anyway: the way you defined the
BufferedReader, the bytes have already been decoded from UTF-8 on the way in
to 'inputLine', so 'getBytes()' with no argument re-encodes the String using
the platform's default charset. Decoding those bytes as UTF-8 only round-trips
if that default happens to be UTF-8. In any event, using straightforward
String concatenation, or as Lothar suggested, StringBuilder concatenation,
should keep encoding issues out of the way.

Strings in Java internally will always be UTF-16.
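
To make the point concrete, here is a small sketch (class and method names
are invented for the example) that decodes three UTF-8 bytes into a String
and encodes them back with an explicit charset. The byte values are the UTF-8
encoding of U+0CA8, the first character of the expected output:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTrip {
    // Decodes UTF-8 bytes to a String, then encodes the String back to UTF-8.
    // Naming the charset on both sides keeps the platform default out of it.
    static byte[] roundTrip(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE0, (byte) 0xB2, (byte) 0xA8}; // UTF-8 for U+0CA8
        String s = new String(utf8, StandardCharsets.UTF_8);
        // Three bytes on the wire become a single UTF-16 code unit in the String.
        System.out.println(s.length());
        System.out.println(Integer.toHexString(s.charAt(0)));
        System.out.println(Arrays.equals(utf8, roundTrip(utf8)));
    }
}
```

Once the text is in a String there is no "UTF-8 String" or "Unicode String"
to choose between; the encoding only matters again when the String is written
out as bytes.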

String string = fullString.substring(fullString.indexOf("[\"") + 2,
fullString.indexOf("\",]"));
System.out.println(string);

This will display the String using the platform's default encoding.
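
If the default encoding is the concern, one option is to wrap the output
stream in a PrintStream with an explicit charset. A rough sketch, with
hypothetical class and method names; whether the glyphs then show up
correctly still depends on the terminal and its fonts:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Out {
    // Writes text through a PrintStream that encodes as UTF-8 and returns
    // the bytes produced, so the effect of the charset can be inspected.
    static byte[] printAsUtf8(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, "UTF-8");
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Same idea applied to the console instead of a buffer.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println("\u0CA8\u0CAE");
    }
}
```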

in.close();

This should be in a 'finally' block tightly associated with the input loop.

}
}
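
The 'finally' advice above could look roughly like this. It is only a sketch:
the class and method names are invented, and the Reader argument stands in for
the InputStreamReader built from the URL:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CloseInFinally {
    // Reads all lines and closes the reader even if readLine() throws.
    static String readAndClose(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        try {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        } finally {
            in.close(); // also closes the wrapped Reader
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAndClose(new StringReader("one\ntwo\n")));
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee
with less code.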

Do not use TAB characters for indentation of Usenet posts. Use spaces, up to
four per indent level. To get help you might want to keep the code readable.

Lothar Kimmeringer · Feb 23, 2010

Amith said:
even if it is fullString = fullString + inputLine;

Then it's quite likely that the stream you open is not
delivering bytes of UTF-8 encoded data

Regards, Lothar

markspace · Feb 23, 2010

Lothar said:
Then it's quite likely that the stream you open is not
delivering bytes of UTF-8 encoded data

Or the stream actually contains the literal string "\u0CA8\u0CAE\u0CCD" etc.
I.e., it's UTF-8 with something else encoded on top of that.

Or the problem is he doesn't have the right glyphs installed on his
system, so he can't see the Kannada characters.

All of which sum up to "it's not in the code you've shown us."
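
If the first guess is right and the response body really contains the six
ASCII characters of each \uXXXX escape, the text would have to be unescaped
after reading. A minimal sketch under that assumption; the class and method
names are hypothetical, and it handles only the simple four-hex-digit form:

```java
class UnicodeUnescape {
    // True if the four chars starting at 'from' are all hexadecimal digits.
    private static boolean isHex4(String text, int from) {
        for (int i = from; i < from + 4; i++) {
            if (Character.digit(text.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Replaces each literal backslash-u escape (backslash, 'u', four hex
    // digits) with the char it names; everything else is copied unchanged.
    static String unescape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 6 <= text.length()
                    && text.charAt(i + 1) == 'u' && isHex4(text, i + 2)) {
                out.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes make these literal escapes, not real chars.
        System.out.println(unescape("\\u0CA8\\u0CAE"));
    }
}
```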

Roedy Green · Feb 23, 2010

Hello all,

I have a problem: when I read the contents of a web page (served with
the UTF-8 character set) and try to display it, it is just treated as a
Unicode string.
Please help me.

Here is the code:

import java.net.*;
import java.io.*;

while ((inputLine = in.readLine()) != null)
fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

The text comes in dribs and drabs. See
http://mindprod.com/products.html#HTML for code to do that properly
that won't go into a tight loop reading empty strings.

Joshua Cranmer · Feb 23, 2010

I used the "User Agent Switcher" extension in Firefox to select a
different User-Agent string (Lynx, for example), and got the Latin1
version there too, so that's what the server is switching on. (That's
an exceedingly daft thing for a server to do, btw, when there's a
perfectly good Accept-Charset header to use instead.)

I thought UA sniffing went out of fashion years ago. Then I discovered
that Google Wave sniffed in a rather limited manner when doing other
work. Now I see that Google sniffs here too. Now I'm never going to
trust Google's actual text results when I see them in the browser window.
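
For anyone who wants to test the User-Agent theory from Java rather than from
a browser, request headers can be set on the URLConnection before it
connects. This is only a sketch: the class name, method name, address and
User-Agent value below are placeholders, and the thread does not say which
User-Agent value makes the server answer in UTF-8:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class RequestHeaders {
    // Creates a connection object and sets the headers the server may
    // switch on. openConnection() does not contact the server yet.
    static URLConnection prepare(String address, String userAgent) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address and User-Agent, chosen for illustration only.
        URLConnection conn = prepare("http://example.com/", "ExampleAgent/1.0");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The response would then be read with
new InputStreamReader(conn.getInputStream(), "UTF-8") in place of
url.openStream().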
