How to strip HTML markup from string?

  • Thread starter Michal A. Valasek

Michal A. Valasek

Hello,

I want to transform text with HTML markup to plain text. Is there a
simple way to do it?

I can surely write my own function, which would simply strip everything
between < and >. But if someone has already written something similar for
.NET, I would prefer a more clever solution, which would try to retain the
original layout, at least paragraphs, hyperlinks, etc. - something like
what Outlook does when converting HTML to plain text.
 

MS News (MS ILM)

Function stripHTML(strHTML)
    'Strips the HTML tags from strHTML

    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    'Non-greedy match of everything between < and >, including line breaks
    objRegExp.Pattern = "<(.|\n)+?>"

    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    'Encode any stray < and > that remain as &lt; and &gt;
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTML = strOutput 'Return the value of strOutput

    Set objRegExp = Nothing
End Function
 

Yan-Hong Huang[MSFT]

Hello Michal,

The page on 4guysfromrolla.com (suggested by Ravikanth) and the RegEx approach (suggested by another developer) could work for you.

However, there are some other issues. Even after you entirely strip out all the <htmltags> you may be left with
HTML-encoded strings (character entities), which you will also want to decode. These are easily handled with

System.Web.HttpUtility.HtmlDecode()
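
To illustrate the order of operations (strip the tags first, then decode the entities), here is a minimal sketch. It is written in Python rather than .NET purely to keep it short; `html.unescape` stands in for the role `HttpUtility.HtmlDecode` plays in .NET, and the function name `strip_and_decode` is made up for this example.

```python
import html
import re

# Same idea as the VBScript pattern earlier in the thread:
# a non-greedy match of anything between < and >, line breaks included.
TAG_PATTERN = re.compile(r"<(.|\n)+?>")

def strip_and_decode(markup: str) -> str:
    """Remove tags first, then turn entities such as &amp; into plain characters."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags)

print(strip_and_decode("<p>Fish &amp; chips, 1 &lt; 2</p>"))
```

Decoding after stripping matters: if the entities were decoded first, an encoded `&lt;` would become a literal `<` and could then be mistaken for the start of a tag.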

And now, the long explanation of why this won't be good enough. There are still many unresolved issues (these were posted
by others before):

1) Rendered line feeds versus actual line feeds. In any HTML source, the line feeds in the source are generally NOT the
ones that are rendered. Elements such as BR and P determine where line breaks fall on the rendered page.

2) What are you going to do with any elements outside the BODY tag, and what are you going to do with text that is left
over inside elements such as OBJECT or SCRIPT?

3) Complex pages that have multiple DIV, LAYER or SPAN elements - some of which are only displayed conditionally
based on things such as browser version or client-side events.

4) TABLEs. Even though the HTML source for a table is entered in a left-to-right fashion, rows and columns can be spanned
so you may not find two words which are rendered together on the page to be next to each other in the source code.
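
Points 1 and 2 can be handled reasonably well with a real parser instead of a regex. The following is only a sketch under stated assumptions, written in Python with the standard `html.parser` module rather than in .NET: the sets `BLOCK_TAGS` and `SKIPPED_TAGS`, the class name `TextExtractor` and the choice to print link targets in angle brackets are all illustrative decisions, not anything from the posts above. It does nothing about points 3 and 4 (conditionally displayed elements and spanned table cells).

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3"}  # assumed: rendered as line breaks
SKIPPED_TAGS = {"head", "script", "style"}               # assumed: never shown as text

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "a" and self.href:
            self.parts.append(f" <{self.href}>")  # keep the hyperlink target
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Line feeds in the source are not rendered: collapse them to spaces.
            self.parts.append(re.sub(r"\s+", " ", data))

def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = [re.sub(r" +", " ", line).strip() for line in "".join(parser.parts).split("\n")]
    return "\n".join(line for line in lines if line)
```

Calling `html_to_text` on a page returns one line of text per block element or BR, with script, style and head content dropped.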

Basically, you need to decide, in advance, what you are looking for and what your end result is going to be. If you're just
trying to parse a simple HTML page with a reasonably predictable format then a simple regex will do the trick. If you are
looking for specific elements with some important text then a regex and running a for...next loop through the matches would
be in order.
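
As a rough illustration of the "regex plus a loop through the matches" approach, the sketch below pulls the text out of every H2 element. It is Python rather than VB, and the choice of H2 as the target element and the name `extract_headings` are assumptions made only for this example.

```python
import html
import re

# Assumed target: the contents of every <h2> element.
# DOTALL lets a heading span several source lines.
HEADING_PATTERN = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")

def extract_headings(markup: str) -> list:
    headings = []
    for match in HEADING_PATTERN.finditer(markup):
        inner = TAG_PATTERN.sub("", match.group(1))  # drop nested tags such as <b>
        # Decode entities, then collapse source whitespace to single spaces.
        headings.append(" ".join(html.unescape(inner).split()))
    return headings
```

This only holds up when the page format is predictable; nested or malformed markup is where a parser becomes the safer choice.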



Best regards,
Yanhong Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
!From: "Michal A. Valasek" <[email protected]>
!Subject: How to strip HTML markup from string?
!Date: Sat, 9 Aug 2003 04:48:20 +0200
!Newsgroups: microsoft.public.dotnet.framework.aspnet
!--
!Michal A. Valasek, Altair Communications, http://www.altaircom.net
!Please do not reply to this e-mail, for contact see http://www.rider.cz
!Keeping Freedom safe from Democracy
 
