Convert HTML to Text

cawoodm

I have written a simple RegEx which strips all tags from an HTML file
and replaces them with spaces.

This was fine until I noticed that some tags should not be replaced
with spaces. For example in the HTML:
<b>H</b>ello World
My program will generate "H ello World", effectively breaking a word
apart.

Where could I get an "authoritative" list of tags which should result
in a space and which shouldn't? I presume these are mostly block
elements like div, br, hr, table, etc.
 
Dylan Parry

Pondering the eternal question of "Hobnobs or Rich Tea?",
(e-mail address removed) finally proclaimed:
Where could I get an "authoritative" list of tags which should result
in a space and which shouldn't. I presume these are mostly block
elements like div, br, hr, table etc...

You probably won't find a list that tells you the exact information you
are after, but the HTML DTDs available from W3C[1] will show you which
elements are block level and which are inline. From that you could
assume that the block elements result in a space, and the inline should
not.

____
[1] http://www.w3.org/TR/html4/sgml/dtd.html
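
A minimal sketch of that idea in Python (not the language the original poster is using, but the same lookup works anywhere). The tag sets are a partial transcription of the %block and %inline entities in the HTML 4.01 DTD and should be checked against it; `classify` and `separator_for` are hypothetical helper names. One caveat: the DTD lists `br` among the inline elements, yet it clearly needs to produce whitespace, so it is deliberately left out of the inline set here. The block/inline split is a starting point rather than a complete answer.

```python
# Partial lists based on the HTML 4.01 DTD; verify against the DTD before relying on them.
BLOCK_TAGS = {
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "dl", "pre",
    "div", "noscript", "blockquote", "form", "hr", "table", "fieldset", "address",
}
# "br" is inline in the DTD but is omitted here on purpose, because it separates text.
INLINE_TAGS = {
    "tt", "i", "b", "big", "small",
    "em", "strong", "dfn", "code", "samp", "kbd", "var", "cite", "abbr", "acronym",
    "a", "q", "sub", "sup", "span", "bdo",
}


def classify(tag: str) -> str:
    """Return "block", "inline", or "other" for a tag name (case-insensitive)."""
    name = tag.lower()
    if name in BLOCK_TAGS:
        return "block"
    if name in INLINE_TAGS:
        return "inline"
    return "other"


def separator_for(tag: str) -> str:
    """Inline tags vanish; block-level and unrecognised tags become a space."""
    return "" if classify(tag) == "inline" else " "
```

Defaulting unknown tags to a space is the cautious choice: an extra space is easy to collapse later, while a missing one glues two words together.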
 
mbstevens

I don't have a specific answer to your last paragraph, but:

Have a look at Perl's HTML::Parser and related modules.

In Python, sgmllib will be useful.

Parsing HTML with simple regexes
is more error-prone than using libraries that many users have
already exercised. Of course, you might have a good reason
to reinvent the wheel for another language, but even then
a look at the source of these modules might be helpful.
 
Toby Inkster

Dylan said:
You probably won't find a list that tells you the exact information you
are after, but the HTML DTDs available from W3C[1] will show you which
elements are block level and which are inline. From that you could
assume that the block elements result in a space, and the inline should
not.

In fact, you could assume that the block elements should begin and end
with a line break. You could also add a tab between <td> and <th> elements
in a table, add asterisks for unordered lists, add numbers for ordered
lists and so on.

I'll echo Mr Stevens' recommendation to use HTML::Parser for parsing
though -- it will give far better results than a regexp. For example, a
regexp won't tell you to add a line break after the word "bar" here,
because the closing tag for a paragraph is optional:

<body>
<p>Foo bar.
</body>
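
As a rough illustration of the parser-based approach, here is a sketch using Python's standard html.parser module (sgmllib is not available in Python 3). The class name `TextExtractor`, the function `html_to_text`, and the tag list are assumptions made for this example, not part of any library. It puts block-level elements on their own lines, adds a tab after each table cell, and prefixes list items with an asterisk or a running number. Because a line break is also emitted when a block element starts, an unclosed <p> still ends up on its own line. It does not skip script or style content and does not preserve whitespace inside <pre>.

```python
import re
from html.parser import HTMLParser

# Tags that start and end on their own line (an illustrative, partial list).
LINE_BREAK_TAGS = {
    "body", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
    "table", "tr", "blockquote", "pre", "hr", "br",
}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.lists = []  # one entry per open list: None for <ul>, a counter for <ol>

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append(0 if tag == "ol" else None)
        if tag in LINE_BREAK_TAGS:
            self.parts.append("\n")
        if tag == "li":
            if self.lists and self.lists[-1] is not None:
                self.lists[-1] += 1
                self.parts.append(f"{self.lists[-1]}. ")
            else:
                self.parts.append("* ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        if tag in ("td", "th"):
            self.parts.append("\t")
        if tag in LINE_BREAK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace in HTML source (including newlines) is just a separator.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    cleaned = (re.sub(r" +", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)
```

Inline tags such as <b> are simply ignored by the handlers, so "<b>H</b>ello World" comes out as "Hello World" with no gap.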
 
Jim Higson

How about using this?

http://www.mbayer.de/html2text/
 
cawoodm

Thank you all for the helpful feedback.
It is true that RegEx is a bit of a dark art, but I am writing a crawler
in VB.NET, not Perl or Python.
I am not sure whether the .NET framework supports HTML parsing in the way I
want, so I've been applying RegEx.
Basically I want to strip all tags and then remove excess whitespace so
that I have "pure" text.
My current strategy is to replace inline tags with an empty string and
then replace all other tags with a space:
HTML = Regex.Replace(HTML, "</?(b|i|u|strong|etc)\b[^>]*>", "")
HTML = Regex.Replace(HTML, "</?[^>]*>", " ")
Then I remove excess whitespace:
HTMLText = Regex.Replace(HTML, "\s+", " ")
It's the authoritative list (b|u|i|strong|...) that I'm looking for, so
I'll take a look at the DTD recommended.
Cheers
Jack
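
For comparison, the same two-pass strategy can be sketched in Python; the pattern strings should carry over to .NET's Regex class largely unchanged. The inline list below is a hypothetical, partial one, and `strip_tags` is a name made up for this example. The inline pattern ends the tag name with `\b[^>]*` so that tags with attributes (such as <a href="...">) are matched and <br> is not mistaken for <b>.

```python
import re

# Hypothetical, partial inline-tag list; extend it from the DTD as needed.
INLINE = ("b", "i", "u", "strong", "em", "span", "a")

INLINE_TAG = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE), re.IGNORECASE)
ANY_TAG = re.compile(r"</?[^>]*>")
WHITESPACE = re.compile(r"\s+")


def strip_tags(html: str) -> str:
    text = INLINE_TAG.sub("", html)   # pass 1: inline tags disappear
    text = ANY_TAG.sub(" ", text)     # pass 2: every remaining tag becomes a space
    return WHITESPACE.sub(" ", text).strip()  # collapse runs of whitespace
```

This still shares the general weaknesses of regex-based stripping (comments, script bodies, and a ">" inside an attribute value are not handled), so it suits quick text extraction for a crawler more than faithful rendering.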
 
Jim Higson

The program I recommended (http://www.mbayer.de/html2text/) is a simple
command line app. You should be able to call it from just about any
language with one line of code. I don't know how you call commands in .NET,
but it shouldn't be difficult.
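
To show the general shape of calling a command-line converter from a program, here is a sketch in Python; .NET has its own process-launching API, which is not shown here. The helper name `convert_with_external_tool` is made up for this example, and the commented usage line assumes an `html2text` executable is on the PATH and reads HTML from standard input when given no file arguments; check the tool's own documentation before relying on that.

```python
import subprocess


def convert_with_external_tool(command, html):
    """Pipe html to an external converter's stdin and return what it prints."""
    result = subprocess.run(
        command,
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Assumed usage, not verified against the tool:
# text = convert_with_external_tool(["html2text"], "<p>Hello <b>world</b></p>")
```

Starting one process per page may be slow for a crawler that handles many documents, so this fits best where simplicity matters more than throughput.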
 
