Changing raw text to unicode format using Standard Java APIs

theAndroidGuy

Hi All,
Is there a specific way or standard API for converting any text to
Unicode? I'm trying to download an HTML page for a given URL, extract
the text (the page can be in any language; I'm working specifically
with non-English pages), and then post that text to Apache Solr for
indexing. Whatever the content may be, I want to convert it to Unicode
before sending it to Solr. I'm sure there must be a standard way of
doing this. I'd also like to know what encoding web pages normally
use. I think most of the time it is UTF-8, even for non-English
content, but if it is not, how do I convert the text to Unicode? Any
suggestions would be appreciated.

Thanks.
 
Karl Uppiano

theAndroidGuy said:
Hi All,
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be
unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

http://java.sun.com/javase/6/docs/api/java/nio/charset/package-summary.html
 
RedGrittyBrick

theAndroidGuy said:
Hi All,
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format.

Google keywords: recode OR iconv OR icu.
Also I'd like to know the basic encoding format for
any webpage,

The encoding is usually specified in the HTTP headers (and/or the HTML).
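
A minimal Java sketch of reading that header value, assuming you already have the Content-Type string in hand (for instance from URLConnection.getContentType()). The class name ContentTypeCharset and the method charsetOf are made up for illustration, and the fallback charset is whatever the caller chooses to pass in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the charset parameter out of a Content-Type value.
class ContentTypeCharset {
    // Matches charset=xyz, charset="xyz" or charset='xyz', ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;                  // no header at all
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;                  // header without a charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return fallback;                  // illegal or unsupported charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

Returning a fallback instead of throwing keeps the download going when a server sends a name the JVM does not know; whether that is the right trade-off depends on how much you trust the pages.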
 
Mayeul

theAndroidGuy said:
Hi All,
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be
unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

There is no such thing as 'raw text'. The closest thing that could be
called 'raw text' would be plain old ASCII, where every character fits
in 7 bits. No accents, no fancy punctuation, and of course no script
other than Latin. Even this is not 'raw text', it's ASCII.

To change text from one charset to another, you first need to know what
charset you want to convert from and to.
Once you understand this question and answer it, the method to do so is
a simple matter of playing with charset-aware Java classes & methods.
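
As a rough illustration of that last step, here is a small Java sketch that converts bytes from a known source charset to UTF-8. It assumes the source charset has already been determined somehow; the class name Transcode and the method toUtf8 are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes bytes from a known charset as UTF-8.
class Transcode {
    static byte[] toUtf8(byte[] input, Charset source) {
        // Decode: bytes in the source charset become a Java String (UTF-16 internally).
        String text = new String(input, source);
        // Encode: the String is written back out as UTF-8 bytes.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };      // "é" in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);      // the same character needs two bytes in UTF-8
    }
}
```

Once the text is a String, the original encoding no longer matters; the charset is only needed at the two points where bytes are turned into characters and back.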
 
Mark Space

theAndroidGuy said:
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be

I'd assume that you could use HttpURLConnection for that, although I
haven't tried it. Note especially the methods in its parent class,
URLConnection.

unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

You've already been pointed at the Charset class. Note that both
Reader/Writer and Strings have methods for changing charsets around. E.g.

String s = ...
byte[] b = s.getBytes( "UTF-8" );

OutputStream os = ...
OutputStreamWriter osw = new OutputStreamWriter( os, "UTF-8" );
osw.write( s, 0, s.length() );


And similarly for InputStreamReader. (You'd normally wrap those
InputStreamReader/OutputStreamWriter in a BufferedReader/Writer of some
sort).
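
For the reading direction, a minimal sketch along the same lines, assuming the charset of the stream is already known. StreamDecode and readAll are illustrative names only, and the sketch wraps IOException in UncheckedIOException purely to keep the example short.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole stream into a String using a given charset.
class StreamDecode {
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // InputStreamReader does the byte-to-char decoding; BufferedReader adds buffering.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
    }
}
```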
 
Arne Vajhøj

theAndroidGuy said:
Is there any specific way/standard APIs for converting any text to
Unicode format. Actually I'm trying to download an html page, for a
given URL, then extract the text[ This html page can be in any
language, specifically I'm working on non-english pages] and then post
that to Apache Solr for indexing. Now I want that whatever the content
may be I'll convert that to unicode and then send it to Solr for
indexing. I'm sure there must be standard way of converting text to
unicode format. Also I'd like to know the basic encoding format for
any webpage, I think most of the times the encoding happens to be
unicode utf-8 for non-english contents as well, but what if this is
not the case then how to convert that to unicode. Any suggestions
would be appreciated.

Getting the correct character set for a web page can be tricky because
it can be specified both in the HTTP header and in a META tag.

See code below for my best attempt.

Arne

======================================================

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace E
{
    public class HttpDownloadCharset
    {
        // Picks the charset parameter out of a Content-Type value.
        private static Regex encpat = new Regex("charset=([A-Za-z0-9-]+)",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static string ParseContentType(string contenttype)
        {
            Match m = encpat.Match(contenttype);
            if(m.Success)
            {
                return m.Groups[1].Value;
            }
            else
            {
                return "ISO-8859-1";
            }
        }
        // Finds <META HTTP-EQUIV="Content-Type" CONTENT="..."> in the HTML.
        private static Regex metaencpat = new Regex(
            "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static string ParseMetaContentType(String html, String defenc)
        {
            Match m = metaencpat.Match(html);
            if(m.Success)
            {
                return ParseContentType(m.Groups[1].Value);
            }
            else
            {
                return defenc;
            }
        }
        private const int DEFAULT_BUFSIZ = 1000000;
        public static string Download(string urlstr)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
            using(HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                if(resp.StatusCode == HttpStatusCode.OK)
                {
                    // Start with the charset from the HTTP header.
                    string enc = ParseContentType(resp.ContentType);
                    int bufsiz = (int)resp.ContentLength;
                    if(bufsiz < 0)
                    {
                        bufsiz = DEFAULT_BUFSIZ;
                    }
                    byte[] buf = new byte[bufsiz];
                    Stream stm = resp.GetResponseStream();
                    int ix = 0;
                    int n;
                    while((n = stm.Read(buf, ix, buf.Length - ix)) > 0)
                    {
                        ix += n;
                    }
                    stm.Close();
                    // Decode only the ix bytes actually read, not the whole buffer.
                    string temp = Encoding.ASCII.GetString(buf, 0, ix);
                    // A META tag, if present, overrides the header charset.
                    enc = ParseMetaContentType(temp, enc);
                    return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
                }
                else
                {
                    throw new ArgumentException("URL " + urlstr + " returned " + resp.StatusDescription);
                }
            }
        }
    }
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f1.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f2.html"));
            Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f3.html"));
        }
    }
}
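
Since the question was about Java, here is a rough Java sketch of the same idea as the C# code above: start from the charset named in the HTTP header, look for a charset declaration in a META tag, and decode the bytes with whichever was found. MetaCharsetSniffer and decode are illustrative names, not an existing API, and the regex is a simplification rather than a real HTML parser. The download itself is left out; the sketch assumes the body bytes and the header charset are already available. Like the C# version, it lets the META tag override the header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes an HTML body, letting a META charset override the header.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, covering both the
    // http-equiv="Content-Type" form and the shorter <meta charset="..."> form.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static String decode(byte[] body, Charset headerCharset) {
        // ISO-8859-1 maps every byte to a char, so it is safe for a first look at the markup.
        String preview = new String(body, StandardCharsets.ISO_8859_1);
        Charset charset = headerCharset;
        Matcher m = META.matcher(preview);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name in the META tag: keep the header charset.
            }
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head><body>\u00e9</body></html>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));
    }
}
```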
 
Roedy Green

Is there any specific way/standard APIs for converting any text to
Unicode format.

It depends on what you mean by "any" text and "Unicode format".

Tools include:

insert and remove &xxx; entities.
http://mindprod.com/jgloss/htmlentities.html

Understanding encodings:
http://mindprod.com/jgloss/encoding.html

convert between two different encodings.
http://mindprod.com/jgloss/encoding.html#NATIVE2ASCII

One tool you might find useful is the Encoding Recogniser, which will
help you guess the encoding used to write a file. Unfortunately that
information is not in any way embedded in the file or its descriptor.
http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
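
One simple guess that can be made with the standard library alone is a strict UTF-8 check, sketched below. This is not the tool linked above, just an illustration; Utf8Check and isValidUtf8 are invented names. A failed check means the bytes are not UTF-8, but a passed check is only a hint, since pure ASCII passes too.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: tests whether a byte array is well-formed UTF-8.
class Utf8Check {
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```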
--
Roedy Green Canadian Mind Products
http://mindprod.com

"We can allow satellites, planets, suns, universe, nay whole systems of universes, to be governed by laws, but the smallest insect, we wish to be created at once by special act."
~ Charles Darwin
 
