Best way to convert HTML to plain text in Java?

Discussion in 'Java' started by google@lrlart.com, Mar 19, 2006.

  1. Guest

    Hello,

    I have a java servlet that processes plain text. I'd like to point to a
    specific url and pull over a webpage, then convert it to plain text for
    further processing.

    I have written some code that simply strips tags from the html, but
    this only does an OK job as it fails on poorly written html and
    javascript (to name a few). Are there any java APIs that would perform
    a better conversion? I've looked into JEditorPane and HTMLEditorKit,
    but haven't had any luck in getting these to perform the conversion.
    Thanks for any help!
     
    , Mar 19, 2006
    #1

  2. On Sun, 19 Mar 2006 08:20:01 +0100, <> wrote:

    > Hello,
    >
    > I have a java servlet that processes plain text. I'd like to point to a
    > specific url and pull over a webpage, then convert it to plain text for
    > further processing.
    >
    > I have written some code that simply strips tags from the html, but
    > this only does an OK job as it fails on poorly written html and
    > javascript (to name a few). Are there any java APIs that would perform
    > a better conversion? I've looked into JEditorPane and HTMLEditorKit,
    > but haven't had any luck in getting these to perform the conversion.
    > Thanks for any help!
    >


    It's a bad solution, but you can always run html2text in a child process ;)

    --
    SaSol


     
    Marcin Wielgus, Mar 19, 2006
    #2
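The child-process suggestion above could be sketched roughly as follows. This is a minimal sketch, not a tested integration: it assumes an html2text executable is on the PATH and that it reads HTML on standard input and writes text to standard output, which is worth confirming for the version you have installed. The class and method names are made up for illustration, UTF-8 is an assumption, and the command is passed in as a parameter so that no tool-specific flags are hard-coded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: pipes HTML through an external converter command.
public class ExternalHtmlToText {

    public static String convert(String html, List<String> command)
            throws IOException, InterruptedException {
        // Feed the HTML from a temp file so we never block writing to the
        // child's stdin while it is blocked writing to its stdout.
        Path input = Files.createTempFile("page", ".html");
        try {
            Files.write(input, html.getBytes(StandardCharsets.UTF_8));

            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectInput(input.toFile());
            // Drop the child's stderr so warnings do not mix into the text
            // (Redirect.DISCARD needs Java 9 or later).
            pb.redirectError(ProcessBuilder.Redirect.DISCARD);

            Process process = pb.start();
            byte[] output = process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("converter exited with code " + exitCode);
            }
            return new String(output, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(input);
        }
    }
}
```

Assuming the tool behaves as described, a call would look like `ExternalHtmlToText.convert(html, List.of("html2text"))`. Starting a process per request is expensive in a servlet, which is presumably why it was called a bad solution.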

  3. Can you give some examples of how it fails on poorly written HTML? It
    may not be that hard to bulletproof the tag-stripping code you wrote.
     
    Dave Mandelin, Mar 21, 2006
    #3
  4. Roedy Green Guest

    On 20 Mar 2006 18:32:27 -0800, "Dave Mandelin"
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Can you give some examples of how it fails on poorly written HTML? It
    >may not be that hard to bulletproof the tag-stripping code you wrote.


    I wrote a tag stripper, but it presumes valid HTML. I suppose that,
    on hitting a < inside a tag, you could presume the > was missing and
    insert one just before the first space after the last <.

    You could look for standard tags.

    The other common error is a < or > lying around by itself or next to
    an =.

    From a practical point of view, it might be easiest to run your code
    through a verifier, fix the errors, and then do your strip. See
    http://mindprod.com/jgloss/htmlvalidator.html

    Anything else is going to lose some data or insert some junk.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Mar 21, 2006
    #4
  5. Guest

    One failure I've run into is with the use of JavaScript, for example:

    <script>

    function CNN_getCookies() {
    var hash = new Array;
    if ( document.cookie ) {
    var cookies = document.cookie.split( '; ' );
    for ( var i = 0; i < cookies.length; i++ ) {

    .......
    Note: Notice the "less than" symbol in the javascript above.

    </script>

    This is slightly modified source from CNN's site. The point is that a
    "<tag>" pattern can be recognized, but it's difficult to tell it
    apart from a greater-than or less-than sign inside some enclosed
    JavaScript code.

    But even if I were to write some code that could handle this case
    effectively I'd probably be dealing with loads of other special cases
    within poorly written html source.
     
    , Mar 21, 2006
    #5
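One way to handle the JavaScript case described above is to skip the whole body of script and style elements, and to treat a < as the start of a tag only when it is followed by a letter, /, ! or ?. The following is a minimal sketch under those assumptions, not a full HTML parser. The class and method names are invented for illustration. It does not decode entities such as &amp;amp;, it replaces every tag with a space (so inline tags inside a word split the word), and it will still be fooled by a > inside a quoted attribute value.

```java
// Hypothetical tag stripper that ignores <script>/<style> bodies and comments.
public class TagStripper {

    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c != '<') {
                out.append(c);
                i++;
                continue;
            }
            // HTML comment: skip to the closing "-->" (or to the end of input).
            if (html.startsWith("<!--", i)) {
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? n : end + 3;
                continue;
            }
            // <script> or <style>: skip everything up to and including the close tag.
            String raw = rawTextElementAt(html, i);
            if (raw != null) {
                int end = indexOfIgnoreCase(html, "</" + raw, i);
                int close = (end < 0) ? -1 : html.indexOf('>', end);
                i = (close < 0) ? n : close + 1;
                continue;
            }
            // Treat '<' as a tag start only if it looks like one.
            char next = (i + 1 < n) ? html.charAt(i + 1) : ' ';
            if (Character.isLetter(next) || next == '/' || next == '!' || next == '?') {
                int close = html.indexOf('>', i + 1);
                i = (close < 0) ? n : close + 1;
                out.append(' '); // keep words on either side of a tag apart
                continue;
            }
            // A stray '<' (for example "a < b"): keep it as literal text.
            out.append(c);
            i++;
        }
        // Collapse runs of whitespace left behind by removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // Returns "script" or "style" if such a start tag begins at index i, else null.
    private static String rawTextElementAt(String html, int i) {
        for (String name : new String[] {"script", "style"}) {
            String open = "<" + name;
            if (html.regionMatches(true, i, open, 0, open.length())) {
                int after = i + open.length();
                char ch = (after < html.length()) ? html.charAt(after) : '>';
                if (ch == '>' || ch == '/' || Character.isWhitespace(ch)) {
                    return name;
                }
            }
        }
        return null;
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

With this approach the `i < cookies.length` comparison never reaches the tag logic, because the scanner jumps from the opening script tag straight to its closing tag.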
  6. Chris Uppal Guest

    wrote:

    > But even if I were to write some code that could handle this case
    > effectively I'd probably be dealing with loads of other special cases
    > within poorly written html source.


    Take it from me: parsing HTML is not trivial. And that's even without
    considering all the invalid HTML out there (I don't mean stuff like incorrectly
    nested structures, but unmatched ""s, tags with no >, etc).

    JTidy appears to do what you are looking for; it might help (I've never tried
    it myself):
    http://jtidy.sourceforge.net/

    -- chris
     
    Chris Uppal, Mar 21, 2006
    #6
  7. Ah, I see. Yeah, that looks pretty rough. JTidy looks like a really
    nice program.
     
    Dave Mandelin, Mar 21, 2006
    #7
  8. kalyan_iitd

    Hi Dave, can you provide Java code for converting HTML to plain text using JTidy? For me, JTidy is working only as an HTML validator.

    Could some expert provide code for HTML to text (any Java API)?

    Thanks in advance.
    Kalyan.
     
    kalyan_iitd, Jul 4, 2006
    #8
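For the "any Java API" request, one option that needs no third-party library is the HTML parser that ships with Swing, which sits underneath the HTMLEditorKit mentioned in the first post. The following is a minimal sketch, not a JTidy example: it registers a callback with javax.swing.text.html.parser.ParserDelegator and collects the text chunks it reports. The class and method names are invented for illustration, and how well the parser copes with badly broken pages, or with script and style bodies, should be checked against real input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts text nodes using the JDK's built-in HTML parser.
public class SwingHtmlToText {

    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                text.append(data).append(' ');
            }
        };

        // The third argument (ignoreCharSet = true) tells the parser not to
        // react to a charset declared in a <meta> tag, since the Reader has
        // already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```

In a servlet, the Reader would typically wrap the stream returned by the URL connection, decoded with the page's character encoding.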
