Convert HTML to plain text

Discussion in 'Java' started by Marcel Kessler, Nov 13, 2006.

  1. Hi there

    Does anyone know a good way of converting HTML to plain text, keeping as
    much of the formatting as possible?

    The HTML will be produced by an editor like FCKEditor, and
    transformation should happen in Java.

    So far I've found the following options, none of them really convincing:

    # Using w3m or lynx to convert html to plain text
    (http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
    + neat output
    - need to call C from java

    # Google gdata routine
    (http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
    + java source available
    - only basic stripping, no tables etc

    # Use xml & xslt
    (http://www-128.ibm.com/developerworks/java/library/x-xmlist1/)
    + good result
    - complicated approach, cannot use wysiwyg-editor like FCKEditor

    # Use other tools like DocFrac, Detagger, NoteTab etc.
    - no better results than with w3m

    Thanks and regards
    Marcel
     
    Marcel Kessler, Nov 13, 2006
    #1

  2. Andy Dingley

    Marcel Kessler wrote:

    > Does anyone know a good way of converting HTML to plain text, keeping as
    > much of the formatting as possible?


    Of course not. "Plain text" doesn't have formatting. If you want to
    "keep some formatting", then you first have to know just how much is
    preservable. Some people claim "RTF" is "plain text" because it's
    editable with a text editor rather than in binary -- how much are you
    expecting to preserve?

    Converting all HTML block elements to a marker, stripping out
    everything except text and markers, normalizing whitespace and markers
    and then converting markers to something local is usually a good start.
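The marker approach described above could be sketched roughly as follows. This is only an illustration: the class name `MarkerTextExtractor`, the list of block tags, and the private-use marker character are assumptions, and the regex-based stripping ignores comments, scripts, and character entities.

```java
import java.util.regex.Pattern;

final class MarkerTextExtractor {
    // Private-use character used as a temporary block-break marker (arbitrary choice).
    private static final String MARK = "\uE000";

    // A hand-picked, incomplete list of block-level tags (assumption).
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|h[1-6]|tr|table|blockquote|pre)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        // 1. Convert block elements to a marker.
        String s = BLOCK_TAG.matcher(html).replaceAll(MARK);
        // 2. Strip every remaining (inline) tag, keeping text and markers.
        s = ANY_TAG.matcher(s).replaceAll("");
        // 3. Normalize whitespace, then collapse runs of markers into one.
        s = s.replaceAll("\\s+", " ").trim();
        s = s.replaceAll(" ?" + MARK + "(?: ?" + MARK + ")* ?", MARK);
        s = s.replaceAll("^" + MARK + "|" + MARK + "$", "");
        // 4. Convert markers to the local line separator.
        return s.replace(MARK, System.lineSeparator());
    }
}
```

For example, `MarkerTextExtractor.toText("<p>Hello <b>world</b></p><ul><li>one</li><li>two</li></ul>")` yields three lines: `Hello world`, `one`, and `two`.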

    If you're already in a web context, then a DOM walker that returns the
    set of text nodes might be easier.
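On the Java side, a DOM walker of this kind might look like the sketch below. It assumes the input is already well-formed XHTML, so the standard JAXP parser can read it; the class name `TextNodeCollector` is made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextNodeCollector {
    // Parses well-formed XHTML and returns the non-blank text nodes in document order.
    static List<String> textNodes(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Depth-first walk: record trimmed text nodes, then recurse into children.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }
}
```

The walker returns only the text; deciding where line breaks go would still need the element names of the parents.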

    If the HTML is crap to begin with, pre-process it with Tidy.
     
    Andy Dingley, Nov 13, 2006
    #2

  3. Andy Dingley wrote:
    > Marcel Kessler wrote:
    >
    >> Does anyone know a good way of converting HTML to plain text, keeping as
    >> much of the formatting as possible?

    >
    > Of course not. "Plain text" doesn't have formatting. If you want to
    > "keep some formatting", then you first have to know just how much is
    > preservable. Some people claim "RTF" is "plain text" because it's
    > editable with a text editor rather than in binary -- how much are you
    > expecting to preserve?


    Thanks, Andy!
    Obviously we can't keep e.g. a header in big letters, but one thing we
    need for example is if we have a <li> tag, we don't want

    * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
    nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
    aliquet risus ac velit eleifend scelerisque.

    but rather

    * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
      nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
      aliquet risus ac velit eleifend scelerisque.

    i.e. something that keeps the indentation...
    If there is some Java library out there that does this kind of thing,
    that would be great... the HTML itself should already be quite nice.
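The wrapping itself is the easy part once the list item's text has been extracted. A minimal sketch of a hanging-indent word wrap, with a made-up class name `HangingIndent` and assuming Java 11 or later for `String.repeat`:

```java
final class HangingIndent {
    // Wraps text to the given width; continuation lines are indented
    // by the length of the bullet so they line up under the first word.
    static String wrapItem(String bullet, String text, int width) {
        String indent = " ".repeat(bullet.length());
        StringBuilder out = new StringBuilder(bullet);
        int col = bullet.length();
        boolean lineStart = true;
        for (String word : text.trim().split("\\s+")) {
            // Break before the word if it would not fit on the current line.
            if (!lineStart && col + 1 + word.length() > width) {
                out.append('\n').append(indent);
                col = indent.length();
                lineStart = true;
            }
            if (!lineStart) {
                out.append(' ');
                col++;
            }
            out.append(word);
            col += word.length();
            lineStart = false;
        }
        return out.toString();
    }
}
```

Calling `HangingIndent.wrapItem("* ", "aaa bbb ccc ddd", 9)` gives `* aaa bbb` on the first line and `  ccc ddd` on the second. A word longer than the width is left on a line of its own rather than split.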
     
    Marcel Kessler, Nov 14, 2006
    #3
  4. Karl Uppiano

    "Marcel Kessler" <> wrote in message
    news:...
    > Andy Dingley wrote:
    >> Marcel Kessler wrote:
    >>
    >>> Does anyone know a good way of converting HTML to plain text, keeping as
    >>> much of the formatting as possible?

    >>
    >> Of course not. "Plain text" doesn't have formatting. If you want to
    >> "keep some formatting", then you first have to know just how much is
    >> preservable. Some people claim "RTF" is "plain text" because it's
    >> editable with a text editor rather than in binary -- how much are you
    >> expecting to preserve?

    >
    > Thanks, Andy!
    > Obviously we can't keep e.g. a header in big letters, but one thing we
    > need for example is if we have a <li> tag, we don't want
    >
    > * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
    > est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut aliquet
    > risus ac velit eleifend scelerisque.
    >
    > but rather
    >
    > * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque nec
    > est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
    > aliquet risus ac velit eleifend scelerisque.
    >
    > i.e. something that keeps the indention...
    > If there is some Java library out there that does this kind of thing, that
    > would be great... the HTML itself should already be quite nice.


    It sounds like you want an HTML parser with pluggable handlers that are
    customizable. A SAX parser comes pretty close. If you could first convert
    the HTML to well-formed HTML (with matching open and close tags, for
    example) you might be able to get a non-validating SAX parser to work. Just
    a thought. My guess is that it would take a fair bit of work to implement.
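As a rough idea of what such a handler could look like, here is a sketch built on the JDK's own SAX parser. It assumes the HTML has already been converted to well-formed XHTML; the class name `SaxTextHandler` and the small set of tags it reacts to are illustrative choices only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equalsIgnoreCase("li")) {
            newLine();
            out.append("* ");          // list items start with a bullet
        } else if (qName.equalsIgnoreCase("p") || qName.equalsIgnoreCase("br")) {
            newLine();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("li") || qName.equalsIgnoreCase("p")) {
            newLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).replaceAll("\\s+", " ");
        // Drop a leading space if the output already ends in whitespace (or is empty).
        if (out.length() == 0 || Character.isWhitespace(out.charAt(out.length() - 1))) {
            text = text.replaceAll("^ ", "");
        }
        out.append(text);
    }

    // Starts a new line unless the output is empty or already at a line start.
    private void newLine() {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }

    // Parses well-formed XHTML and returns the accumulated plain text.
    static String convert(String xhtml) {
        try {
            SaxTextHandler handler = new SaxTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

With the input `<div><p>Intro text</p><ul><li>one</li><li>two <b>bold</b></li></ul></div>`, `SaxTextHandler.convert` produces the lines `Intro text`, `* one`, and `* two bold`. Tables, nested lists, and line wrapping would each need further cases in the handler.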
     
    Karl Uppiano, Nov 14, 2006
    #4
