Stripping html tags from text

Discussion in 'ASP .Net' started by Spondishy, Mar 6, 2006.

  1. Spondishy

    Spondishy Guest

    Hi,

    I'm looking for help with a regular expression in C#.

    I want to remove all tags from a piece of HTML except the following.

    <a>
    <b>
    <h1>
    <h2>
    <h3>

    Also, <a> could be <a href="aa">aaa</a> etc.

    Help would be appreciated, along with an explanation of the regular
    expression created.

    Thanks.
     
    Spondishy, Mar 6, 2006
    #1

  2. HTML is complex. It would be better to say instead that you want to
    *retrieve* *only* the following tags. That way, they are the only tags
    the Regular Expression has to look for.

    The following will do this:

    (?i)<\s*(a|br|h1|h2|h3)[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?

    Note: Grouping is used in this Regular Expression. It groups the tag names
    into Group 1, and the InnerText into Group 2, in case you need either of
    these.
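
    For anyone who wants to see the two groups in use, here is a minimal
    sketch in C#. The class name AllowedTagFinder, the method Find and the
    sample input are made up for illustration; only the pattern comes from
    the post. The pattern is copied as posted, so its list contains br
    rather than the b from the question; adjust the alternation to suit.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllowedTagFinder
{
    // Pattern copied verbatim from the post; the @"" string keeps the
    // backslashes as regex escapes.
    private static readonly Regex TagPattern = new Regex(
        @"(?i)<\s*(a|br|h1|h2|h3)[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?");

    // Returns one "tagName=innerText" entry per match.
    // Group 1 is the tag name, Group 2 the text up to the closing tag or line end.
    public static List<string> Find(string html)
    {
        var found = new List<string>();
        foreach (Match m in TagPattern.Matches(html))
        {
            found.Add(m.Groups[1].Value + "=" + m.Groups[2].Value);
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<h1>Title</h1>\n<p>text <a href=\"aa\">aaa</a></p>";
        foreach (string entry in Find(html))
        {
            Console.WriteLine(entry); // prints "h1=Title" then "a=aaa"
        }
    }
}
```

    Group 2 is empty when a tag is immediately followed by another tag, so
    check its Success property if that distinction matters.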

    --
    HTH,

    Kevin Spencer
    Microsoft MVP
    .Net Developer

    Presuming that God is "only an idea" -
    Ideas exist.
    Therefore, God exists.
     
    Kevin Spencer, Mar 6, 2006
    #2

  3. m.posseth

    m.posseth Guest

    I use this in VB:

    Private Function stripHTML(ByVal strHTML As String) As String
        ' Lazily match "<", then any characters (newlines included), then ">"
        Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")
        ' Replace every matched tag with an empty string
        Return objRegExp.Replace(strHTML, "")
    End Function

    So the regex <(.|\n)+?> does the trick.

    In C# it would be (I am a VB coder, so don't shoot me):

    private string stripHTML(string strHTML)
    {
        // Same pattern as the VB version: any run of characters between "<" and ">"
        System.Text.RegularExpressions.Regex objRegExp =
            new System.Text.RegularExpressions.Regex("<(.|\n)+?>");
        return objRegExp.Replace(strHTML, "");
    }

    regards

    Michel Posseth [MCP]
     
    m.posseth, Mar 6, 2006
    #3
  4. The problem with that Regular Expression (in this case) is that it simply
    matches every tag in the page. It doesn't match InnerText, as he requested,
    and it matches end tags as separate matches. It is excellent for stripping
    all HTML tags from a page, for example, but it does not meet his
    requirements.
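
    If the goal is to remove tags rather than retrieve them, one possible
    approach (not the pattern from either post above) is a negative
    lookahead that skips the permitted tag names. The sketch below assumes
    the whitelist from the original question (a, b, h1, h2, h3); the class
    name TagWhitelist and the method StripExcept are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagWhitelist
{
    // "<" or "</", then a tag name that is NOT one of the permitted names,
    // then everything up to the next ">". The \b stops "b" from also
    // protecting longer names such as "br" or "body".
    private static readonly Regex DisallowedTag = new Regex(
        @"</?(?!(?:a|b|h1|h2|h3)\b)[a-z!][^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag except the permitted ones; text content is untouched.
    public static string StripExcept(string html)
    {
        return DisallowedTag.Replace(html, "");
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<div><h1>Title</h1><p>Some <b>bold</b> and <i>italic</i> "
                    + "text with a <a href=\"aa\">link</a>.</p></div>";
        Console.WriteLine(StripExcept(html));
        // prints: <h1>Title</h1>Some <b>bold</b> and italic text with a <a href="aa">link</a>.
    }
}
```

    This is only a sketch: it assumes no ">" appears inside an attribute
    value or comment, and it leaves the attributes of the kept tags as they
    are, so it should not be relied on to sanitize untrusted input.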
     
    Kevin Spencer, Mar 6, 2006
    #4
  5. m.posseth

    m.posseth Guest

    Oops :)

    I only read the title, "Stripping html tags from text", and missed the
    exclusion part:

    >>>except the following.
    >>>
    >>> <a>
    >>> <b>
    >>> <h1>
    >>> <h2>
    >>> <h3>
    >>>
    >>> Also, <a> could be <a href="aa">aaa</a> etc.


    My code will convert

    <html>
    <head>
    </head>
    <body>
    <table>
    <tr><td>bla bla </td></tr>
    </table>
    </body>
    </html>

    into

    bla bla

    (plus the whitespace that was between the removed tags).
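
    A minimal sketch of that conversion in C#, using the <(.|\n)+?> pattern
    from the earlier post. The whitespace cleanup step is an addition that
    is not in the posted code, and the names PlainText and StripAndTidy are
    hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainText
{
    public static string StripAndTidy(string html)
    {
        // Remove every tag: "<", any characters including newlines, then ">".
        string noTags = Regex.Replace(html, @"<(.|\n)+?>", "");
        // Added step: collapse the line breaks and spaces left behind
        // by the removed tags, then trim both ends.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html>\n<body>\n<table>\n"
                    + "<tr><td>bla bla </td></tr>\n"
                    + "</table>\n</body>\n</html>";
        Console.WriteLine("[" + StripAndTidy(html) + "]"); // prints "[bla bla]"
    }
}
```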


    regards

    Michel
     
    m.posseth, Mar 7, 2006
    #5