Changing raw text to unicode format using Standard Java APIs

theAndroidGuy · Apr 29, 2009

Hi all,
Is there a standard API for converting arbitrary text to Unicode? I'm
downloading an HTML page from a given URL, extracting the text (the
page can be in any language; I'm working specifically with non-English
pages), and then posting it to Apache Solr for indexing. Whatever the
content is, I want to convert it to Unicode before sending it to Solr.
I'm sure there must be a standard way of doing this. I'd also like to
know which encoding web pages normally use. I think it is UTF-8 most
of the time, for non-English content as well, but if it isn't, how do
I convert the page to Unicode? Any suggestions would be appreciated.

Thanks.

Karl Uppiano · Apr 30, 2009

http://java.sun.com/javase/6/docs/api/java/nio/charset/package-summary.html

RedGrittyBrick · Apr 30, 2009

theAndroidGuy said:
Hi All,
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format.

Google keywords: recode OR iconv OR icu.

Also I'd like to know the basic encoding format for
any webpage,

The encoding is usually specified in the HTTP headers (and/or the HTML).
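
For the HTTP side, a minimal Java sketch of pulling the charset out of a Content-Type value might look like this. The class and method names are made up for illustration, and the fallback is whatever the caller passes in. The header string itself could come from URLConnection.getContentType(). It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {
    // Matches the charset parameter, with or without quotes around the value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([A-Za-z0-9._:-]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or fallback if absent or unsupported. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=utf-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs); // UTF-8
    }
}
```

A header with no charset parameter simply yields the fallback, so the caller decides what "unknown" means.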

Mayeul · Apr 30, 2009

There is no such thing as 'raw text'. The closest thing that could be
called 'raw text' would be plain old ASCII, as in, all bytes are 7-bit.
No accents, no fancy punctuation, and of course no script other than
Roman. Even this is not 'raw text', it's ASCII.

To change text from one charset to another, you first need to know which
charset you are converting from and which you are converting to.
Once you have answered that question, the conversion itself is a simple
matter of using charset-aware Java classes and methods.
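
To make the "from and to" point concrete, here is a small sketch, assuming Java 7 or later and assuming the source charset is already known. The class and method names are invented for the example. A Java String is already Unicode internally, so conversion is just decode, then encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Transcode {
    /** Re-encodes bytes known to be in charset 'from' as bytes in charset 'to'. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // decode: bytes -> Unicode chars
        return text.getBytes(to);              // encode: Unicode chars -> bytes
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, encoded as ISO-8859-1 (one byte per char).
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5: e-acute takes two bytes in UTF-8
    }
}
```

If the 'from' charset is guessed wrong, no exception is thrown here; the text is silently decoded into the wrong characters.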

Mark Space · Apr 30, 2009

theAndroidGuy said:
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be

I'd assume that you could use HttpURLConnection for that, although I
haven't tried it. Note especially the methods in its parent class,
URLConnection, such as getContentType().

unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

You've already been pointed at the Charset class. Note that both
Reader/Writer and Strings have methods for changing charsets around. E.g.

String s = ...
byte[] b = s.getBytes( "UTF-8" );

OutputStream os = ...
OutputStreamWriter osw = new OutputStreamWriter( os, "UTF-8" );
osw.write( s, 0, s.length() );

And similarly for InputStreamReader. (You'd normally wrap those
InputStreamReader/OutputStreamWriter in a BufferedReader/Writer of some
sort.)
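
The reading side might look something like the sketch below. It assumes Java 7 or later and that the charset has already been determined some other way. The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class StreamDecode {
    /** Reads the whole stream, decoding its bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The reader does the byte-to-char decoding; the buffer just makes it efficient.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 }; // e-acute encoded as UTF-8
        String s = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(s.length()); // 1: two bytes decoded to one char
    }
}
```

With a real page, the InputStream would be the one returned by the connection rather than a ByteArrayInputStream.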

Arne Vajhøj · May 1, 2009

Getting the correct character set for a web page can be tricky because
it can be specified both in the HTTP header and in a META tag.

See the code below (in C#) for my best attempt.

Arne

======================================================

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace E
{
    public class HttpDownloadCharset
    {
        // Extracts the charset parameter from a Content-Type value.
        private static Regex encpat = new Regex("charset=([A-Za-z0-9-]+)",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static string ParseContentType(string contenttype)
        {
            Match m = encpat.Match(contenttype);
            if(m.Success)
            {
                return m.Groups[1].Value;
            }
            else
            {
                return "ISO-8859-1";
            }
        }
        // Finds <META HTTP-EQUIV="Content-Type" CONTENT="..."> in the HTML.
        private static Regex metaencpat = new Regex(
            "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static string ParseMetaContentType(String html, String defenc)
        {
            Match m = metaencpat.Match(html);
            if(m.Success)
            {
                return ParseContentType(m.Groups[1].Value);
            }
            else
            {
                return defenc;
            }
        }
        private const int DEFAULT_BUFSIZ = 1000000;
        public static string Download(string urlstr)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
            using(HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                if (resp.StatusCode == HttpStatusCode.OK)
                {
                    // Start with the charset from the HTTP header.
                    string enc = ParseContentType(resp.ContentType);
                    int bufsiz = (int)resp.ContentLength;
                    if(bufsiz < 0)
                    {
                        bufsiz = DEFAULT_BUFSIZ;
                    }
                    byte[] buf = new byte[bufsiz];
                    Stream stm = resp.GetResponseStream();
                    int ix = 0;
                    int n;
                    while((n = stm.Read(buf, ix, buf.Length - ix)) > 0)
                    {
                        ix += n;
                    }
                    stm.Close();
                    // Decode only the ix bytes actually read, not the whole buffer.
                    string temp = Encoding.ASCII.GetString(buf, 0, ix);
                    // A META tag, if present, overrides the HTTP header.
                    enc = ParseMetaContentType(temp, enc);
                    return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
                }
                else
                {
                    throw new ArgumentException("URL " + urlstr + " returned " + resp.StatusDescription);
                }
            }
        }
    }
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f1.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f2.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f3.html"));
        }
    }
}
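
Since the thread is about Java, here is a rough Java sketch of the META-tag step only, assuming Java 7 or later. It is not a port of the code above: the names are invented, the pattern is looser (any charset=... inside a meta tag), and it only works for ASCII-compatible encodings, since the tag has to be readable before the real charset is known. It follows the same precedence as the code above (META overrides the HTTP-level charset), which is a policy choice rather than a rule.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class MetaCharsetSniffer {
    // Loose pattern: any charset=... that appears inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Decodes page bytes, letting a META charset override the HTTP-level one. */
    static String decode(byte[] page, Charset httpCharset) {
        // ISO-8859-1 maps every byte to exactly one char, so this probe pass
        // cannot fail; it is only used to locate the META tag.
        String probe = new String(page, StandardCharsets.ISO_8859_1);
        Charset cs = httpCharset;
        Matcher m = META_CHARSET.matcher(probe);
        if (m.find()) {
            try {
                cs = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed name: keep the HTTP-level charset.
            }
        }
        return new String(page, cs);
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\u00e9";
        byte[] page = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(page, StandardCharsets.ISO_8859_1).equals(html)); // true
    }
}
```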

Roedy Green · May 1, 2009

Is there any specific way/standard APIs for converting any text to
Unicode format.

It depends on what you mean by "any" text and "Unicode format".

Tools include:

insert and remove &xxx; entities.
http://mindprod.com/jgloss/htmlentities.html

Understanding encodings:
http://mindprod.com/jgloss/encoding.html

convert between two different encodings.
http://mindprod.com/jgloss/encoding.html#NATIVE2ASCII

One tool you might find useful is the Encoding recogniser that will
help you guess the encoding used to write a file. Unfortunately that
information is not in any way embedded in the file or its descriptor.
http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
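
As one example of the kind of guessing involved (this is not how the recogniser above works, just a common heuristic, assuming Java 7 or later): a strict decoder can at least tell you whether the bytes are well-formed UTF-8. The class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Check {
    /** Returns true if the bytes are well-formed UTF-8. */
    static boolean looksLikeUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf8(new byte[] { (byte) 0xC3, (byte) 0xA9 })); // true
        System.out.println(looksLikeUtf8(new byte[] { (byte) 0xE9 }));              // false
    }
}
```

Pure ASCII passes this check too, and a "false" only rules UTF-8 out; it does not say which encoding the bytes are actually in.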
--
Roedy Green Canadian Mind Products
http://mindprod.com

"We can allow satellites, planets, suns, universe, nay whole systems of universes, to be governed by laws, but the smallest insect, we wish to be created at once by special act."
~ Charles Darwin
