Changing raw text to unicode format using Standard Java APIs

Discussion in 'Java' started by theAndroidGuy, Apr 29, 2009.

  1. Hi All,
    Is there a specific way or standard API for converting any text to
    Unicode? I'm trying to download an HTML page for a given URL, extract
    the text (the page can be in any language; I'm working specifically
    on non-English pages), and then post it to Apache Solr for indexing.
    Whatever encoding the content is in, I want to convert it to Unicode
    before sending it to Solr. I'm sure there must be a standard way of
    doing this. I'd also like to know the usual encoding for web pages.
    I think it is UTF-8 most of the time, even for non-English content,
    but if it is not, how do I convert the text to Unicode? Any
    suggestions would be appreciated.

    Thanks.
     
    theAndroidGuy, Apr 29, 2009
    #1

  2. theAndroidGuy

    Karl Uppiano Guest

    "theAndroidGuy" <> wrote in message
    news:...
    > Hi All,
    > Is there any specific way/standard APIs for converting any text to
    > Unicode format. Actually I'm trying to download an html page, for a
    > given URL, then extract the text[ This html page can be in any
    > language, specifically I'm working on non-english pages] and then post
    > that to Apache Solr for indexing. Now I want that whatever the content
    > may be I'll convert that to unicode and then send it to Solr for
    > indexing. I'm sure there must be standard way of converting text to
    > unicode format. Also I'd like to know the basic encoding format for
    > any webpage, I think most of the times the encoding happens to be
    > unicode utf-8 for non-english contents as well, but what if this is
    > not the case then how to convert that to unicode. Any suggestions
    > would be appreciated.


    http://java.sun.com/javase/6/docs/api/java/nio/charset/package-summary.html
     
    Karl Uppiano, Apr 30, 2009
    #2

  3. theAndroidGuy wrote:
    > Hi All,
    > Is there any specific way/standard APIs for converting any text to
    > Unicode format. Actually I'm trying to download an html page, for a
    > given URL, then extract the text[ This html page can be in any
    > language, specifically I'm working on non-english pages] and then post
    > that to Apache Solr for indexing. Now I want that whatever the content
    > may be I'll convert that to unicode and then send it to Solr for
    > indexing. I'm sure there must be standard way of converting text to
    > unicode format.


    Google keywords: recode OR iconv OR icu.

    > Also I'd like to know the basic encoding format for
    > any webpage,


    The encoding is usually specified in the HTTP headers (and/or the HTML).

    > I think most of the times the encoding happens to be
    > unicode utf-8 for non-english contents as well, but what if this is
    > not the case then how to convert that to unicode. Any suggestions
    > would be appreciated.
    >



    --
    RGB
     
    RedGrittyBrick, Apr 30, 2009
    #3
  4. theAndroidGuy

    Mayeul Guest

    theAndroidGuy wrote:
    > Hi All,
    > Is there any specific way/standard APIs for converting any text to
    > Unicode format. Actually I'm trying to download an html page, for a
    > given URL, then extract the text[ This html page can be in any
    > language, specifically I'm working on non-english pages] and then post
    > that to Apache Solr for indexing. Now I want that whatever the content
    > may be I'll convert that to unicode and then send it to Solr for
    > indexing. I'm sure there must be standard way of converting text to
    > unicode format. Also I'd like to know the basic encoding format for
    > any webpage, I think most of the times the encoding happens to be
    > unicode utf-8 for non-english contents as well, but what if this is
    > not the case then how to convert that to unicode. Any suggestions
    > would be appreciated.


    There is no such thing as 'raw text'. The closest thing that could be
    called 'raw text' would be plain old ASCII, in which every byte fits
    in 7 bits: no accents, no fancy punctuation, and of course no script
    other than Latin. Even this is not 'raw text', it's ASCII.

    To change text from one charset to another, you first need to know
    which charset you are converting from and which you are converting to.
    Once you have answered that question, the conversion itself is a
    simple matter of using the charset-aware Java classes and methods.
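
As a rough illustration of that point, here is a minimal sketch of converting bytes from a known source charset to UTF-8. The class and method names (TranscodeSketch, toUtf8) are made up for this example, and it assumes the source charset is already known; StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from any library: transcodes bytes to UTF-8.
final class TranscodeSketch {

    // Decode the bytes with the charset they were written in,
    // then encode the resulting String as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String text = new String(input, sourceCharset);   // bytes -> String
        return text.getBytes(StandardCharsets.UTF_8);     // String -> UTF-8 bytes
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // 5: é takes two bytes (0xC3 0xA9) in UTF-8
    }
}
```

The String in the middle is the important part: Java strings are already Unicode, so "converting to Unicode" amounts to decoding the bytes with the right source charset.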

    --
    Mayeul
     
    Mayeul, Apr 30, 2009
    #4
  5. theAndroidGuy

    Mark Space Guest

    theAndroidGuy wrote:

    > unicode format. Also I'd like to know the basic encoding format for
    > any webpage, I think most of the times the encoding happens to be


    I'd assume that you could use HttpURLConnection for that, although I
    haven't tried it. Note esp. the methods in its parent class,
    URLConnection.

    <http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html>
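
A minimal sketch of what that could look like: URLConnection.getContentType() returns the Content-Type header value, and the charset parameter can be pulled out of it. The class name, method name, and regular expression below are assumptions made for illustration, and the fallback charset is left to the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the charset from a Content-Type header value.
final class ContentTypeCharset {

    // Matches e.g. charset=utf-8 or charset="ISO-8859-1"; the name must start
    // with a letter or digit so that Charset.isSupported accepts its syntax.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
            Pattern.CASE_INSENSITIVE);

    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing
        }
        Matcher m = CHARSET.matcher(contentType);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback; // no charset parameter, or one this JVM does not know
    }

    public static void main(String[] args) {
        // With a real connection the header would come from conn.getContentType().
        Charset cs = fromContentType("text/html; charset=utf-8",
                StandardCharsets.ISO_8859_1);
        System.out.println(cs); // UTF-8
    }
}
```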

    > unicode utf-8 for non-english contents as well, but what if this is
    > not the case then how to convert that to unicode. Any suggestions
    > would be appreciated.


    You've already been pointed at the Charset class. Note that both
    Reader/Writer and Strings have methods for changing charsets around. E.g.

    String s = ...
    byte[] b = s.getBytes( "UTF-8" );

    OutputStream os = ...
    OutputStreamWriter osw = new OutputStreamWriter( os, "UTF-8" );
    osw.write( s, 0, s.length() );


    And similarly for InputStreamReader. (You'd normally wrap those
    InputStreamReader/OutputStreamWriter in a BufferedReader/Writer of some
    sort).
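
For the reading direction, a minimal sketch along the same lines: an InputStreamReader decodes the bytes with a given charset, and a BufferedReader wraps it. The class and method names are made up for this example, and it assumes the charset of the stream is already known.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole stream into a String using a given charset.
final class ReadWithCharset {

    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream).
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Three UTF-8 bytes that encode the single character U+4F60.
        byte[] utf8 = { (byte) 0xE4, (byte) 0xBD, (byte) 0xA0 };
        String s = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(s.length()); // 1
    }
}
```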
     
    Mark Space, Apr 30, 2009
    #5
  6. theAndroidGuy

    Arne Vajhøj Guest

    theAndroidGuy wrote:
    > Is there any specific way/standard APIs for converting any text to
    > Unicode format. Actually I'm trying to download an html page, for a
    > given URL, then extract the text[ This html page can be in any
    > language, specifically I'm working on non-english pages] and then post
    > that to Apache Solr for indexing. Now I want that whatever the content
    > may be I'll convert that to unicode and then send it to Solr for
    > indexing. I'm sure there must be standard way of converting text to
    > unicode format. Also I'd like to know the basic encoding format for
    > any webpage, I think most of the times the encoding happens to be
    > unicode utf-8 for non-english contents as well, but what if this is
    > not the case then how to convert that to unicode. Any suggestions
    > would be appreciated.


    Getting the correct character set for a web page can be tricky because
    it can be specified both in the HTTP header and in a META tag.

    See code below for my best attempt.

    Arne

    ======================================================

    // Note: this is C# (.NET), not Java.
    using System;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace E
    {
        public class HttpDownloadCharset
        {
            // Matches the charset parameter of a Content-Type value.
            private static Regex encpat = new Regex("charset=([A-Za-z0-9-]+)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled);

            private static string ParseContentType(string contenttype)
            {
                Match m = encpat.Match(contenttype);
                if (m.Success)
                {
                    return m.Groups[1].Value;
                }
                else
                {
                    // No charset given: fall back to ISO-8859-1.
                    return "ISO-8859-1";
                }
            }

            // Matches <META HTTP-EQUIV="Content-Type" CONTENT="...">.
            private static Regex metaencpat = new
                Regex("<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
                RegexOptions.IgnoreCase | RegexOptions.Compiled);

            private static string ParseMetaContentType(String html, String defenc)
            {
                Match m = metaencpat.Match(html);
                if (m.Success)
                {
                    return ParseContentType(m.Groups[1].Value);
                }
                else
                {
                    return defenc;
                }
            }

            private const int DEFAULT_BUFSIZ = 1000000;

            public static string Download(string urlstr)
            {
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
                using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
                {
                    if (resp.StatusCode == HttpStatusCode.OK)
                    {
                        // First guess: the charset from the HTTP header.
                        string enc = ParseContentType(resp.ContentType);
                        int bufsiz = (int)resp.ContentLength;
                        if (bufsiz < 0)
                        {
                            bufsiz = DEFAULT_BUFSIZ;
                        }
                        byte[] buf = new byte[bufsiz];
                        Stream stm = resp.GetResponseStream();
                        int ix = 0;
                        int n;
                        while ((n = stm.Read(buf, ix, buf.Length - ix)) > 0)
                        {
                            ix += n;
                        }
                        stm.Close();
                        // Decode only the ix bytes actually read, as ASCII,
                        // to look for a META tag that overrides the header.
                        string temp = Encoding.ASCII.GetString(buf, 0, ix);
                        enc = ParseMetaContentType(temp, enc);
                        return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
                    }
                    else
                    {
                        throw new ArgumentException("URL " + urlstr +
                            " returned " + resp.StatusDescription);
                    }
                }
            }
        }

        public class Program
        {
            public static void Main(string[] args)
            {
                Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f1.html"));
                Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f2.html"));
                Console.WriteLine(HttpDownloadCharset.Download("http://arne:81/~arne/f3.html"));
            }
        }
    }
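
The code above is C#. A rough Java counterpart of its META-tag step might look like the sketch below. The class name, the regular expression, and the 4096-byte probe size are assumptions made for this example; like the C# version, it lets a META declaration override the charset taken from the HTTP header, and it works on bytes that have already been downloaded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes an HTML body, honouring a charset in a META tag.
final class MetaCharsetSniffer {

    // Finds a charset value inside any <meta ...> tag (simplified on purpose).
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
            Pattern.CASE_INSENSITIVE);

    static String decode(byte[] body, Charset headerCharset) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup of the
        // page head can be searched before the real charset is known.
        int probeLen = Math.min(body.length, 4096);
        String probe = new String(body, 0, probeLen, StandardCharsets.ISO_8859_1);

        Charset cs = headerCharset;
        Matcher m = META.matcher(probe);
        if (m.find() && Charset.isSupported(m.group(1))) {
            cs = Charset.forName(m.group(1)); // META overrides the header
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body>\u00e9</body></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        // The header claims ISO-8859-1, but the META tag says utf-8.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(html)); // true
    }
}
```

A regular expression is only an approximation here; an HTML parser would cope better with unusual markup.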
     
    Arne Vajhøj, May 1, 2009
    #6
  7. theAndroidGuy

    Roedy Green Guest

    On Wed, 29 Apr 2009 00:53:59 -0700 (PDT), theAndroidGuy
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Is there any specific way/standard APIs for converting any text to
    >Unicode format.


    It depends on what you mean by "any" text and "Unicode format".

    Tools include:

    insert and remove &xxx; entities.
    http://mindprod.com/jgloss/htmlentities.html

    Understanding encodings:
    http://mindprod.com/jgloss/encoding.html

    convert between two different encodings.
    http://mindprod.com/jgloss/encoding.html#NATIVE2ASCII

    One tool you might find useful is the Encoding recogniser that will
    help you guess the encoding used to write a file. Unfortunately that
    information is not in any way embedded in the file or its descriptor.
    http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "We can allow satellites, planets, suns, universe, nay whole systems of universes, to be governed by laws, but the smallest insect, we wish to be created at once by special act."
    ~ Charles Darwin
     
    Roedy Green, May 1, 2009
    #7