Convert HTML to plain text

Marcel Kessler · Nov 13, 2006

Hi there

Does anyone know a good way of converting HTML to plain text, keeping as
much of the formatting as possible?

The HTML will be produced by an editor like FCKEditor, and
transformation should happen in Java.

So far I've found the following options, none of them really convincing:

# Using w3m or lynx to convert html to plain text
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ neat output
- need to call C from java

# Google gdata routine
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ java source available
- only basic stripping, no tables etc

# Use xml & xslt
(http://www-128.ibm.com/developerworks/java/library/x-xmlist1/)
+ good result
- complicated approach, cannot use wysiwyg-editor like FCKEditor

# Use other tools like DocFrac, Detagger, NoteTab etc.
- no better results than with w3m

Thanks and regards
Marcel

Andy Dingley · Nov 13, 2006

Marcel said:
Does anyone know a good way of converting HTML to plain text, keeping as
much of the formatting as possible?

Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Converting all HTML block elements to a marker, stripping out
everything except text and markers, normalizing whitespace and markers
and then converting markers to something local is usually a good start.
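
The four steps above can be sketched in Java roughly as follows. This is only a minimal illustration under stated assumptions: the class name MarkerStripper, the list of block tags and the private-use marker character are all made up for the example, and character entities such as &amp; are not decoded.

```java
import java.util.regex.Pattern;

final class MarkerStripper {
    // Private-use character used as the block marker (an arbitrary choice).
    private static final String MARK = "\uE000";

    // Tags treated as block-level here; this list is an assumption, extend as needed.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|h[1-6]|tr|table|blockquote|pre)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        // 1. Convert every block tag (opening or closing) to the marker.
        String s = BLOCK_TAG.matcher(html).replaceAll(MARK);
        // 2. Strip all remaining tags, leaving only text and markers.
        s = ANY_TAG.matcher(s).replaceAll("");
        // 3. Normalize whitespace, then collapse runs of markers into one.
        s = s.replaceAll("\\s+", " ");
        s = s.replaceAll(" ?(" + MARK + " ?)+", MARK);
        s = s.replaceAll("^" + MARK + "|" + MARK + "$", "");
        // 4. Convert markers to the local line separator.
        return s.replace(MARK, System.lineSeparator()).trim();
    }
}
```

For example, MarkerStripper.toText("<p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>") yields three lines: "Hello world", "one" and "two". Regex-based stripping like this is fragile on malformed input, which is where the Tidy step comes in.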

If you're already in a web context, then a DOM walker that returns the
set of text nodes might be easier.
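
On the Java side, a text-node walker might look like the sketch below. It assumes the input is already well-formed XHTML without a DOCTYPE, so the JDK's built-in XML DOM parser can read it; the class name TextNodeWalker is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class TextNodeWalker {
    /** Returns the non-blank text nodes of a well-formed XHTML fragment, in document order. */
    static List<String> textNodes(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    // Depth-first walk that keeps only text nodes with visible content.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }
}
```

This returns the text only; any layout (line breaks, bullets) would have to be derived from the element names met during the walk.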

If the HTML is crap to begin with, pre-process it with Tidy.


Marcel Kessler · Nov 14, 2006

Andy said:
Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
need for example is if we have a <li> tag, we don't want

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
aliquet risus ac velit eleifend scelerisque.

but rather

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
  nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing,
that would be great... the HTML itself should already be quite nice.
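
The hanging indent asked for here can be sketched as a small word-wrap routine in Java. The class name HangingIndent, the method wrapItem and its parameters are hypothetical; the sketch assumes the item text has already been extracted from the <li> element.

```java
final class HangingIndent {
    /**
     * Wraps text to at most `width` columns. The first line starts with the
     * bullet; continuation lines are indented by the bullet's length.
     * A single word longer than the width is left on its own line, unbroken.
     */
    static String wrapItem(String text, String bullet, int width) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < bullet.length(); i++) {
            indent.append(' ');
        }
        StringBuilder out = new StringBuilder(bullet);
        int col = bullet.length();
        boolean lineStart = true;
        for (String word : text.trim().split("\\s+")) {
            // Break before the word if it would not fit on the current line.
            if (!lineStart && col + 1 + word.length() > width) {
                out.append('\n').append(indent);
                col = indent.length();
                lineStart = true;
            }
            if (!lineStart) {
                out.append(' ');
                col++;
            }
            out.append(word);
            col += word.length();
            lineStart = false;
        }
        return out.toString();
    }
}
```

Calling HangingIndent.wrapItem("Lorem ipsum dolor sit amet", "* ", 14) gives "* Lorem ipsum", then "  dolor sit", then "  amet", each on its own line.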

Karl Uppiano · Nov 14, 2006

Marcel Kessler said:
Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
need for example is if we have a <li> tag, we don't want

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut aliquet
risus ac velit eleifend scelerisque.

but rather

* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
  est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing, that
would be great... the HTML itself should already be quite nice.

It sounds like you want an HTML parser with pluggable handlers that are
customizable. A SAX parser comes pretty close. If you could first convert
the HTML to well-formed HTML (with matching open and close tags, for
example) you might be able to get a non-validating SAX parser to work. Just
a thought. My guess is that it would take a fair bit of work to implement.
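
As a rough sketch of that idea, the JDK's SAX parser can drive a handler that decides what each element contributes to the text. This assumes the input is already well-formed XHTML (for example after a Tidy pass); the class name SaxTextRenderer, the set of block elements and the "* " bullet are choices made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextRenderer extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    static String render(String xhtml) {
        SaxTextRenderer handler = new SaxTextRenderer();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return handler.out.toString().trim();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isBlock(qName)) {
            newLine();
        }
        if (qName.equalsIgnoreCase("li")) {
            out.append("* ");   // list items get a bullet
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isBlock(qName)) {
            newLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (!Character.isWhitespace(c)) {
                out.append(c);
            } else if (out.length() > 0 && !endsWithBreak()) {
                out.append(' ');   // collapse whitespace runs to one space
            }
        }
    }

    // Block elements recognized by this sketch (an assumption, not a full list).
    private static boolean isBlock(String name) {
        return name.toLowerCase().matches("p|div|br|li|ul|ol|h[1-6]|tr");
    }

    private boolean endsWithBreak() {
        char last = out.charAt(out.length() - 1);
        return last == ' ' || last == '\n';
    }

    // Ends the current line unless the output is empty or already at a line start.
    private void newLine() {
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) == ' ') {
            out.setLength(--n);
        }
        if (n > 0 && out.charAt(n - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

Wrapping and indentation are left out here; a real handler would buffer the text of each <li> and wrap it before emitting.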
