Replacing HTML tags

Discussion in 'ASP .Net' started by jumblesale, Oct 4, 2006.

  1. jumblesale

    Hello all,
    I'm not all that bad at regex, but I'm stumped on how to approach my
    problem.

    I need to parse a string and remove all HTML tags except hyperlinks.

    I can remove all the HTML tags using:
    Regex.Replace(inputText, @"<(/?[^\>]+)>", "");
    But this also removes any hyperlinks, which I need to keep.

    I've also written a regex for finding hyperlinks:
    <a[\s]href=["'][^"]+[.\s]*["'][^<]+[.\s]*</a>
    but my problem is trying to put all this together.

    I've thought of using Regex.Matches and checking each instance, but I
    can't get that to work.

    Any ideas and/or code would be great. I'm used to C#, but VB's cool as
    well.

    Cheers in advance,
    max
    jumblesale, Oct 4, 2006
    #1
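
    The "check each instance" idea from the post above can be sketched with
    the Regex.Replace overload that takes a MatchEvaluator: match every tag,
    then decide per match whether to keep it. This is only a minimal sketch,
    not a full HTML parser. The names TagStripper, AnyTag, AnchorTag and
    StripAllButLinks are made up for illustration, and it assumes no
    attribute value contains a ">" character.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class TagStripper
    {
        // Matches any tag: "<", an optional "/", then everything up to the next ">".
        private static readonly Regex AnyTag = new Regex(@"</?[^>]+>");

        // Matches only the start of an opening or closing <a> tag.
        // The \b stops tags such as <abbr> or <area> from counting as links.
        private static readonly Regex AnchorTag =
            new Regex(@"^</?a\b", RegexOptions.IgnoreCase);

        public static string StripAllButLinks(string input)
        {
            // The evaluator runs once per matched tag: anchor tags are returned
            // unchanged, every other tag is replaced with an empty string.
            return AnyTag.Replace(input,
                m => AnchorTag.IsMatch(m.Value) ? m.Value : string.Empty);
        }

        public static void Main()
        {
            string html = "<p>See <a href=\"http://example.com\">this <b>page</b></a>.</p>";
            // Prints: See <a href="http://example.com">this page</a>.
            Console.WriteLine(StripAllButLinks(html));
        }
    }
    ```

    Tags nested inside a link (the <b> above) are still stripped; only the
    <a ...> and </a> tags themselves survive.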

  2. You could do this with the HTML Agility Pack:
    http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack

    I think it comes with an example that strips HTML tags, which you could
    probably adapt quite quickly to keep <a> tags.
    Chris Fulstow, Oct 4, 2006
    #2

  3. jumblesale

    Wow, that's a great pack, but surely there's a simpler way of doing it
    with regex? It seems like a huge number of files to import just to check
    a string.

    Cheers for your quick response,
    max
    jumblesale, Oct 4, 2006
    #3
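
    For the regex-only route asked about above, one possible sketch is a
    single pattern with a negative lookahead, so the "remove every tag" regex
    simply refuses to match <a ...> and </a>. The class and member names here
    (LinkPreservingStripper, NonAnchorTag, Strip) are hypothetical, and like
    any regex approach it assumes reasonably simple markup: no ">" inside
    attribute values, no comments or script blocks.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class LinkPreservingStripper
    {
        // "<", then a negative lookahead that rejects a tag whose name is
        // exactly "a" (with or without a leading "/"), then the rest of the tag.
        // The lookahead sits before the optional slash so "</a>" is rejected too.
        private const string NonAnchorTag = @"<(?!/?a\b)[^>]+>";

        public static string Strip(string input)
        {
            return Regex.Replace(input, NonAnchorTag, string.Empty,
                                 RegexOptions.IgnoreCase);
        }

        public static void Main()
        {
            string html = "<div><abbr>FAQ</abbr>: <a href='faq.html'>read it</a><br/></div>";
            // Prints: FAQ: <a href='faq.html'>read it</a>
            Console.WriteLine(Strip(html));
        }
    }
    ```

    If the input can contain malformed or hostile HTML, a real parser such as
    the HTML Agility Pack suggested earlier is likely the safer choice.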
  4. Woohoo! This is a great control library. Glad you posted it here as it saved
    me from writing a lot of code using the WebBrowser control to do some
    similar HTML manipulation.


    --
    Thanks again,
    Mark Fitzpatrick
    Former Microsoft FrontPage MVP 199?-2006
    Mark Fitzpatrick, Oct 4, 2006
    #4