Convert HTML to Text

Discussion in 'HTML' started by cawoodm@gmail.com, Mar 9, 2006.

  1. Guest

    I have written a simple RegEx which strips all tags from an HTML file
    and replaces them with spaces.

    This was fine until I noticed that some tags should not be replaced
    with spaces. For example in the HTML:
    <b>H</b>ello World
    My program will generate "H ello World", effectively breaking a word
    apart.

    Where could I get an "authoritative" list of tags which should result
    in a space and which shouldn't? I presume these are mostly block
    elements like div, br, hr, table etc.
     
    , Mar 9, 2006
    #1

  2. Dylan Parry Guest

    Pondering the eternal question of "Hobnobs or Rich Tea?",
    finally proclaimed:

    > Where could I get an "authoritative" list of tags which should result
    > in a space and which shouldn't? I presume these are mostly block
    > elements like div, br, hr, table etc.


    You probably won't find a list that tells you the exact information you
    are after, but the HTML DTDs available from W3C[1] will show you which
    elements are block level and which are inline. From that you could
    assume that the block elements result in a space, and the inline should
    not.

    ____
    [1] http://www.w3.org/TR/html4/sgml/dtd.html
    --
    Dylan Parry
    http://webpageworkshop.co.uk -- FREE Web tutorials and references
     
    Dylan Parry, Mar 9, 2006
    #2

  3. mbstevens Guest

    wrote:
    > I have written a simple RegEx which strips all tags from an HTML file
    > and replaces them with spaces.
    >
    > This was fine until I noticed that some tags should not be replaced
    > with spaces. For example in the HTML:
    > <b>H</b>ello World
    > My program will generate "H ello World", effectively breaking a word
    > apart.
    >
    > Where could I get an "authoritative" list of tags which should result
    > in a space and which shouldn't? I presume these are mostly block
    > elements like div, br, hr, table etc.
    >


    I don't have a specific answer to your last paragraph, but:

    Have a look at Perl's HTML::Parser and related modules.

    In Python, sgmllib will be useful.

    Parsing HTML with simple regexes is more error-prone than using
    libraries that have been exercised by many users. Of course, you might
    have a good reason to re-invent the wheel for another language, but even
    then, having a look at the source of these modules might be helpful.
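
    As a rough illustration of the parser-based route: sgmllib was removed
    in Python 3, so this sketch uses the standard library's html.parser
    instead. The INLINE_TAGS set and the html_to_text name are made up for
    the example, and the tag set is only a small illustrative subset.

```python
from html.parser import HTMLParser

# Illustrative subset of inline tags; not an authoritative list.
INLINE_TAGS = {"a", "b", "i", "u", "em", "strong", "span"}


class TextExtractor(HTMLParser):
    """Collect text, putting a space wherever a non-inline tag occurs."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; inline tags contribute nothing.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())
```

    With this, html_to_text("<b>H</b>ello World") keeps the word together,
    while adjacent <div> elements still end up separated by a space.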
    --
    mbstevens
    http://www.mbstevens.com/
     
    mbstevens, Mar 9, 2006
    #3
  4. Toby Inkster Guest

    Dylan Parry wrote:

    > You probably won't find a list that tells you the exact information you
    > are after, but the HTML DTDs available from W3C[1] will show you which
    > elements are block level and which are inline. From that you could
    > assume that the block elements result in a space, and the inline should
    > not.


    In fact, you could assume that the block elements should begin and end
    with a line break. You could also add a tab between <td> and <th> elements
    in a table, add asterisks for unordered lists, add numbers for ordered
    lists and so on.

    I'll echo Mr Stevens' recommendation to use HTML::Parser for parsing
    though -- it will give far better results than a reg exp. For example, a
    reg exp won't tell you to add a line break after the word "bar" here,
    because the closing tag for a paragraph is optional:

    <body>
    <p>Foo bar.
    </body>
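
    The formatting ideas above (line breaks around block elements, tabs
    between table cells, asterisks and numbers for list items) could be
    sketched roughly like this with Python's html.parser. BREAK_TAGS is an
    illustrative subset, and the class and function names are hypothetical.
    A break is emitted at the next block-level start tag or at </body>, so a
    missing </p> does not matter.

```python
import re
from html.parser import HTMLParser

# Illustrative subset of tags that start a new line; not authoritative.
BREAK_TAGS = {"body", "div", "p", "h1", "h2", "h3", "ul", "ol", "li",
              "table", "tr", "hr", "br"}


class PlainTextFormatter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.lists = []   # one entry per open list: None for <ul>, a counter for <ol>
        self.cells = 0    # cells seen so far in the current table row

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.out.append("\n")
        if tag == "ul":
            self.lists.append(None)
        elif tag == "ol":
            self.lists.append(0)
        elif tag == "li" and self.lists:
            if self.lists[-1] is None:
                self.out.append("* ")
            else:
                self.lists[-1] += 1
                self.out.append("%d. " % self.lists[-1])
        elif tag == "tr":
            self.cells = 0
        elif tag in ("td", "th"):
            if self.cells:          # tab before every cell except the first
                self.out.append("\t")
            self.cells += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        if tag in BREAK_TAGS:
            self.out.append("\n")

    def handle_data(self, data):
        # Source line breaks inside text are ordinary whitespace.
        self.out.append(re.sub(r"\s+", " ", data))


def format_html(html):
    parser = PlainTextFormatter()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.out).split("\n"):
        line = re.sub(r" *\t *", "\t", line)
        line = re.sub(r" {2,}", " ", line).strip(" ")
        if line:                    # drop the empty lines between blocks
            lines.append(line)
    return "\n".join(lines)
```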

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Mar 9, 2006
    #4
  5. Jim Higson Guest

    wrote:

    > I have written a simple RegEx which strips all tags from an HTML file
    > and replaces them with spaces.
    >
    > This was fine until I noticed that some tags should not be replaced
    > with spaces. For example in the HTML:
    > <b>H</b>ello World
    > My program will generate "H ello World", effectively breaking a word
    > apart.
    >
    > Where could I get an "authoritative" list of tags which should result
    > in a space and which shouldn't? I presume these are mostly block
    > elements like div, br, hr, table etc.


    How about using this?

    http://www.mbayer.de/html2text/

    --
    Jim
     
    Jim Higson, Mar 10, 2006
    #5
  6. Guest

    Thank you all for the helpful feedback.
    It is true that RegEx is a bit of a dark art but I am writing a Crawler
    in VB Dot Net and not Perl or Python.
    I am not sure if the .NET framework supports HTML parsing in the way I
    want it so I've been applying RegEx.
    Basically I want to strip all tags and then remove excess whitespace so
    that I have "pure" text.
    My current strategy is to replace inline tags with an empty string and
    then replace all other tags with a space:
    HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
    HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
    Then I remove excess whitespace:
    HTMLText = RegEx.Replace(HTML, "\s+", " ")
    It's the authoritative list (b|u|i|strong|...) that I'm looking for so
    I'll take a look at the DTD recommended.
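
    For comparison, here is the same two-pass strategy sketched in Python's
    re module rather than VB.NET. The INLINE tuple is a small illustrative
    subset, and strip_tags is a made-up name. The \b after the tag name keeps
    "b" from matching the start of "<body>" or "<br>", and [^>]* lets the
    pattern accept attributes such as <b class="x">.

```python
import re

# Illustrative subset only; the full list of inline elements would go here.
INLINE = ("strong", "b", "i", "u")

INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE), re.IGNORECASE)
ANY_TAG_RE = re.compile(r"</?[^>]*>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tags(html):
    text = INLINE_RE.sub("", html)       # pass 1: inline tags vanish
    text = ANY_TAG_RE.sub(" ", text)     # pass 2: every other tag becomes a space
    return WHITESPACE_RE.sub(" ", text).strip()
```

    This still shares the usual limits of regex-based stripping: comments,
    script contents and a ">" inside an attribute value are not handled.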
    Cheers
    Jack
     
    , Mar 14, 2006
    #6
  7. Guest

    Aha!
    http://www.htmlhelp.com/reference/html40/inline.html
    -------------------------
    A
    ABBR
    ACRONYM
    B
    BASEFONT
    BDO
    BIG
    BR
    CITE
    CODE
    DFN
    EM
    FONT
    I
    IMG
    INPUT
    KBD
    LABEL
    Q
    S
    SAMP
    SELECT
    SMALL
    SPAN
    STRIKE
    STRONG
    SUB
    SUP
    TEXTAREA
    TT
    U
    VAR
    -------------------------
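
    One caveat when turning this list into a pattern: BR is inline, yet
    "one<br>two" should not become "onetwo". A hedged sketch in Python,
    where the NEEDS_SPACE set is my own assumption about which inline
    elements should still be replaced by a space, and the names are
    hypothetical:

```python
import re

# The inline elements from the list above, lowercased.
INLINE_ELEMENTS = """a abbr acronym b basefont bdo big br cite code dfn em
font i img input kbd label q s samp select small span strike strong sub sup
textarea tt u var""".split()

# Assumption: these inline elements still separate words (br) or stand in
# for non-text content, so they get a space like any block-level tag.
NEEDS_SPACE = {"br", "img", "input", "select", "textarea"}

GLUE_TAGS = sorted(set(INLINE_ELEMENTS) - NEEDS_SPACE)
GLUE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(GLUE_TAGS), re.IGNORECASE)


def remove_glue_tags(html):
    """Delete word-internal inline tags; leave every other tag in place."""
    return GLUE_RE.sub("", html)
```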
     
    , Mar 14, 2006
    #7
  8. Jim Higson Guest

    wrote:

    > Thank you all for the helpful feedback.
    > It is true that RegEx is a bit of a dark art but I am writing a Crawler
    > in VB Dot Net and not Perl or Python.
    > I am not sure if the .NET framework supports HTML parsing in the way I
    > want it so I've been applying RegEx.
    > Basically I want to strip all tags and then remove excess whitespace so
    > that I have "pure" text.
    > My current strategy is to replace inline tags with an empty string and
    > then replace all other tags with a space:
    > HTML = RegEx.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
    > HTML = RegEx.Replace(HTML, "</?[^>]*>", " ")
    > Then I remove excess whitespace:
    > HTMLText = RegEx.Replace(HTML, "\s+", " ")
    > It's the authoritative list (b|u|i|strong|...) that I'm looking for so
    > I'll take a look at the DTD recommended.
    > Cheers
    > Jack


    The program I recommended (http://www.mbayer.de/html2text/) is a simple
    command line app. You should be able to call it from just about any
    language with one line of code. I don't know how you call commands in .NET,
    but it shouldn't be difficult.
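
    To give an idea of how little glue is involved, this is roughly what
    piping a page through an external filter looks like in Python (not
    .NET). run_text_filter is a hypothetical helper; the commented usage
    assumes an html2text binary on the PATH that reads HTML from standard
    input when no file is given.

```python
import subprocess


def run_text_filter(command, html):
    """Pipe html to an external command's stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=html,
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Hypothetical usage, under the assumption stated above:
#   text = run_text_filter(["html2text"], "<p>Hello <b>World</b></p>")
```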

    --
    Jim
     
    Jim Higson, Mar 14, 2006
    #8
  9. Neredbojias Guest

    With neither quill nor qualm, quothed:

    > Aha!
    > http://www.htmlhelp.com/reference/html40/inline.html
    > -------------------------
    > A
    > ABBR
    > ACRONYM
    > B
    > BASEFONT
    > BDO
    > BIG
    > BR
    > CITE
    > CODE
    > DFN
    > EM
    > FONT
    > I
    > IMG
    > INPUT
    > KBD
    > LABEL
    > Q
    > S
    > SAMP
    > SELECT
    > SMALL
    > SPAN
    > STRIKE
    > STRONG
    > SUB
    > SUP
    > TEXTAREA
    > TT
    > U
    > VAR
    > -------------------------


    What happened to DIV?

    --
    Neredbojias
    Contrary to popular belief, it is believable.
     
    Neredbojias, Mar 14, 2006
    #9
  10. Steve Pugh Guest

    Neredbojias wrote:
    >With neither quill nor qualm, quothed:
    >
    >> Aha!
    >> http://www.htmlhelp.com/reference/html40/inline.html
    >> -------------------------


    [snip list]

    >What happened to DIV?


    Not an inline element is it?

    Steve
    --
    "My theories appal you, my heresies outrage you,
    I never answer letters and you don't like my tie." - The Doctor

    Steve Pugh <http://steve.pugh.net/>
     
    Steve Pugh, Mar 14, 2006
    #10
  11. Neredbojias Guest

    With neither quill nor qualm, Steve Pugh quothed:

    > Neredbojias wrote:
    > >With neither quill nor qualm, quothed:
    > >
    > >> Aha!
    > >> http://www.htmlhelp.com/reference/html40/inline.html
    > >> -------------------------

    >
    > [snip list]
    >
    > >What happened to DIV?

    >
    > Not an inline element is it?


    Missed the "inline" there.

    --
    Neredbojias
    Contrary to popular belief, it is believable.
     
    Neredbojias, Mar 14, 2006
    #11
  12. Toby Inkster Guest

    Jim Higson wrote:

    > The program I recommended (http://www.mbayer.de/html2text/) is a simple
    > command line app. You should be able to call it from just about any
    > language with one line of code. I don't know how you call commands in .NET,
    > but it shouldn't be difficult.


    In Visual Basic, it's the "Shell" function IIRC.

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Mar 14, 2006
    #12