Extract text from HTML (unicode)

Discussion in 'Java' started by unbending, Jan 29, 2005.

  1. unbending

    unbending Guest

    I'm having trouble using the example method (to extract text from an
    HTML document I found on Sun's site). It works fine for standard
    ANSI-based files, but when I convert them to Unicode or UTF-8, it
    doesn't work right (it includes a bunch of strange characters).

    I think the reason it's not working has to do with the 2-byte vs.
    1-byte encoding, but I have no idea how to fix it. Any ideas?

    Here's my code:
    final StringBuffer buf = new StringBuffer(1000);
    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered
                    // in the HTML file
                    public void handleText(char[] data, int pos) {
                        // Append the characters themselves; data + "\n" would
                        // append the array's toString() value, not its contents
                        buf.append(data).append('\n');
                    }
                };
            }
        };

        // Create a reader on the HTML content
        // URL url = new URI(location).toURL();
        URL url = location.toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());

        // Parse the HTML
        HTMLEditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (MalformedURLException mue) {
        System.out.println(mue.getLocalizedMessage());
    } catch (BadLocationException ble) {
        System.out.println(ble.getLocalizedMessage());
    } catch (IOException ioe) {
        System.out.println(ioe.getLocalizedMessage());
    }
    parsed = buf.toString();
     
    unbending, Jan 29, 2005
    #1

  2. Chris Smith

    Chris Smith Guest

    unbending wrote:
    > I'm having trouble using the example method (to extract text from an
    > HTML document I found on Sun's site). It works fine for standard
    > ANSI-based files, but when I convert them to Unicode or UTF-8, it
    > doesn't work right (it includes a bunch of strange characters).


    There is no such thing as a "standard ANSI-based file". ANSI
    standardizes (or jointly standardizes) a lot of things, including a good
    number of very different character encodings. If you mean ASCII, then
    say ASCII. If you mean something else, then say what you mean.

    There is also no such character encoding as "Unicode". I'll assume you
    mean one of UCS-2BE, UCS-2LE, UTF-16LE or UTF-16BE. The difference
    between UCS-2 and UTF-16 is probably not critical for you, unless you're
    using characters outside of the Basic Multilingual Plane (code points
    above U+FFFF). The difference between big-endian and little-endian is
    very important, though, and you'll need to know which one you are using.
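
    To illustrate why the byte order matters, here is a minimal sketch that
    decodes the same two bytes both ways. The class and method names are
    made up for this example, and it assumes Java 7 or later for
    StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes one byte sequence as UTF-16BE and UTF-16LE.
final class EndiannessDemo {

    // Interpret the bytes as big-endian UTF-16 (high byte first).
    static String decodeBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    // Interpret the same bytes as little-endian UTF-16 (low byte first).
    static String decodeLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x41}; // the letter 'A' encoded as UTF-16BE

        System.out.println(decodeBigEndian(bytes)); // prints "A"

        // Read with the wrong byte order, the pair becomes U+4100 instead.
        char wrong = decodeLittleEndian(bytes).charAt(0);
        System.out.println(Integer.toHexString(wrong)); // prints "4100"
    }
}
```

    Picking the wrong byte order does not raise an error; it silently
    produces a different character, which is one way "strange characters"
    can end up in the output.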

    You then wrote:

    > Reader rd = new InputStreamReader(conn.getInputStream());


    If you're having character encoding problems, this is almost certainly
    the source. The constructor you've used for InputStreamReader uses the
    platform default encoding. Because I don't know what platform you're
    working on, I can't tell you what that is. Apparently, though, it is
    (or is a superset of) the same encoding you used in the first document,
    but is not compatible with UTF-8 or whatever other Unicode encoding you
    tried.
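
    As a rough sketch of that failure mode: the code below encodes a short
    string as UTF-8 and then decodes it with a single-byte charset.
    ISO-8859-1 stands in for the platform default here, which is only an
    assumption; the actual default on your machine may differ, and the
    class name is invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when bytes are decoded
// with a charset other than the one used to encode them.
final class DefaultCharsetDemo {

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // What the no-argument InputStreamReader constructor would use.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // "caf\u00e9" is "cafe" with an acute accent on the e.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 4
        System.out.println(wrong.length()); // prints 5: the accented letter became two characters
    }
}
```

    Plain ASCII text decodes identically under both charsets, which would
    explain why the first document appeared to work.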

    There is another constructor for InputStreamReader which allows you to
    specify an encoding for the file. You should use that instead.
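
    A minimal sketch of that approach, assuming the content really is
    UTF-8 (you have to know or detect the actual encoding). The class and
    method names are hypothetical, and the Charset overload and
    try-with-resources assume Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a stream using an explicitly chosen charset
// instead of relying on the platform default.
final class ExplicitCharsetReader {

    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The two-argument constructor fixes the encoding used for decoding.
        try (Reader rd = new InputStreamReader(in, charset)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = rd.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(): UTF-8 bytes held in memory.
        byte[] utf8 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(text.length()); // prints 11
    }
}
```

    In the code from the original post, the equivalent change would be
    new InputStreamReader(conn.getInputStream(), "UTF-8"). That String
    overload can throw UnsupportedEncodingException, which is a subclass
    of IOException, so the existing catch block already covers it.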

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
     
    Chris Smith, Jan 29, 2005
    #2
