Best way to convert html to plain text in java?

google

Hello,

I have a java servlet that processes plain text. I'd like to point to a
specific url and pull over a webpage, then convert it to plain text for
further processing.

I have written some code that simply strips tags from the html, but
this only does an OK job as it fails on poorly written html and
javascript (to name a few). Are there any java APIs that would perform
a better conversion? I've looked into JEditorPane and HTMLEditorKit,
but haven't had any luck in getting these to perform the conversion.
Thanks for any help!
 
Marcin Wielgus

It's a bad solution, but you can always run html2text in a child process ;)
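
For what it's worth, the child-process route could look roughly like the sketch below. It assumes Java 9 or later, and it assumes the external command reads HTML on standard input and writes text to standard output; check whether your html2text build does that. The class and method names (ExternalHtmlFilter, run) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: pipes a page through an external converter.
class ExternalHtmlFilter {

    // Runs the command with the HTML on stdin and returns whatever it
    // prints on stdout. Assumption: the tool behaves as a stdin/stdout filter.
    static String run(List<String> command, String html)
            throws IOException, InterruptedException {
        // Feeding stdin from a temp file avoids the classic deadlock where
        // the parent blocks writing while the child blocks on a full pipe.
        Path input = Files.createTempFile("page", ".html");
        try {
            Files.write(input, html.getBytes(StandardCharsets.UTF_8));
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectInput(input.toFile());
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);
            Process process = pb.start();
            byte[] output;
            try (InputStream stdout = process.getInputStream()) {
                output = stdout.readAllBytes();
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("converter exited with code " + exitCode);
            }
            return new String(output, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(input);
        }
    }
}
```

A call would then be something like ExternalHtmlFilter.run(List.of("html2text"), pageSource). Starting a process per request is slow, which is probably why this counts as a bad solution for a servlet.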
 
Dave Mandelin

Can you give some examples of how it fails on poorly written HTML? It
may not be that hard to bulletproof the tag-stripping code you wrote.
 
Roedy Green

I wrote a tag stripper, but it presumes valid HTML. I suppose that, on
hitting a < inside a tag, you could presume the > was missing and insert
one just before the first space after the last <.

You could look for standard tags.

The other common error is a < or > lying around by itself or next to
=.

From a practical point of view it might be easiest to run your HTML
through a validator, fix the errors, and then do your strip. See
http://mindprod.com/jgloss/htmlvalidator.html

Anything else is going to lose some data or insert some junk.
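
The two heuristics above could be sketched like this. It is a simplified reading of them, not the actual tag stripper mentioned in this post: a < counts as a tag only when followed by a letter, /, ! or ?, and a new tag that opens inside an unfinished one is taken to mean the > was missing before the first space. The names (LenientTagStripper, strip) are invented for the example.

```java
// Hypothetical tolerant tag stripper, for illustration only.
class LenientTagStripper {

    // A '<' starts a tag only if followed by a letter, '/', '!' or '?'.
    // Anything else (such as "1 < 2") is treated as ordinary text.
    private static boolean startsTag(String s, int i) {
        if (i + 1 >= s.length()) {
            return false;
        }
        char next = s.charAt(i + 1);
        return Character.isLetter(next) || next == '/' || next == '!' || next == '?';
    }

    // Text of an unclosed tag from its first whitespace up to (not including) 'to'.
    private static String afterFirstSpace(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.isWhitespace(s.charAt(k))) {
                return s.substring(k, to);
            }
        }
        return "";
    }

    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int tagStart = -1; // index of the '<' being skipped, or -1 while in text
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (tagStart >= 0) {
                if (c == '>') {
                    tagStart = -1;
                } else if (c == '<' && startsTag(html, i)) {
                    // Missing '>': pretend it sat before the first space of
                    // the unfinished tag and keep the rest as text.
                    out.append(afterFirstSpace(html, tagStart, i));
                    tagStart = i;
                }
            } else if (c == '<' && startsTag(html, i)) {
                tagStart = i;
            } else {
                out.append(c); // includes stray '<' and '>'
            }
        }
        return out.toString();
    }
}
```

With this sketch, "1 < 2 and 3 > 2" passes through unchanged, and "<b bold <i>text</i>" becomes " bold text". It still loses data or inserts junk in other cases: a tag name written straight after a comparison (i<n), a < inside a quoted attribute value, or real attributes of a tag whose > is missing.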
 
google

One failure I've run into is with the use of JavaScript. For example:

<script>

function CNN_getCookies() {
var hash = new Array;
if ( document.cookie ) {
var cookies = document.cookie.split( '; ' );
for ( var i = 0; i < cookies.length; i++ ) {

.......
Note: Notice the "less than" symbol in the javascript above.

</script>

This is some slightly modified source from CNN's site. The point is
that a "<tag>" pattern is easy to recognize, but it's difficult to
tell it apart from a greater-than or less-than sign in some enclosed
JavaScript code.

But even if I were to write some code that could handle this case
effectively I'd probably be dealing with loads of other special cases
within poorly written html source.
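
For this particular case, one option is to cut out whole script (and style) blocks before running the tag stripper, so the comparison operators never reach it. A minimal sketch follows. The class name ScriptRemover is made up, and the pattern assumes every block has a closing tag and no > inside the opening tag's attributes.

```java
import java.util.regex.Pattern;

// Hypothetical pre-pass that deletes <script>...</script> and
// <style>...</style> blocks, contents included.
class ScriptRemover {

    // DOTALL lets '.' cross line breaks. The lazy '.*?' stops at the first
    // matching end tag. '\\1' requires the end tag to name the same element.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String removeScripts(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }
}
```

Run it first and pass the result to the existing tag-stripping code. A script block with no end tag is left untouched by this pattern, so this covers only the one special case and not broken markup in general.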
 
Chris Uppal

Take it from me: parsing HTML is not trivial. And that's even without
considering all the invalid HTML out there (I don't mean stuff like incorrectly
nested structures, but unmatched ""s, tags with no >, etc).

JTidy appears to do what you are looking for and might help (I've never
tried it myself):
http://jtidy.sourceforge.net/

-- chris
 
Dave Mandelin

Ah, I see. Yeah, that looks pretty rough. JTidy looks like a really
nice program.
 
Hi Dave, can you provide Java code for converting HTML to plain text using JTidy? For me, JTidy works only as an HTML validator.

Could some experts provide code for HTML to text (using any Java API)?

Thanks in advance.
Kalyan.
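
Not a JTidy answer, but since any Java API is acceptable: the HTMLEditorKit mentioned in the first post comes with a forgiving parser that can be used without a JEditorPane. The sketch below collects the text it reports and skips script and style content. The class name HtmlToText and the line-break policy are choices made for this example. How the parser recovers from very broken markup is up to the JDK, so treat the output as best effort.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that accumulates the text reported by the JDK parser.
class HtmlToText extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // greater than 0 while inside <script> or <style>

    private static boolean isSkipped(HTML.Tag tag) {
        return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (isSkipped(tag)) {
            skipDepth++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (isSkipped(tag)) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        } else if (tag.breaksFlow()) {
            text.append('\n'); // end of a paragraph, heading, list item, ...
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            text.append('\n');
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (skipDepth == 0) {
            text.append(data);
        }
    }

    // Parses the HTML from the reader and returns the collected text.
    static String convert(Reader html) throws IOException {
        HtmlToText callback = new HtmlToText();
        // 'true' tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString().trim();
    }
}
```

To use it on a live page, wrap the stream from java.net.URL.openStream() in an InputStreamReader with the page's character encoding and pass that reader to HtmlToText.convert.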
 