Remove HTML tags (except anchor tag) from a string using regular expressions

Discussion in 'Python' started by Nico Grubert, Feb 1, 2005.

  1. Nico Grubert

    Nico Grubert Guest

    Hello,

    I want to remove all HTML tags from a string "content" except
    <a ...>xxx</a>.

    My script reads like this:

    ###
    import re
    content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
    ###

    It works fine. It removes all HTML tags from "content".
    Unfortunately, this also removes <a ...>xxx</a> occurrences.
    Any idea how to modify this to remove all HTML tags except <a ...>xxx</a>?

    Thanks in advance,
    Nico
     
    Nico Grubert, Feb 1, 2005
    #1

  2. Anand

    Anand Guest

    How about...

    import re
    content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
    Seems to work for me.

    HTH

    -Anand
     
    Anand, Feb 1, 2005
    #2

  3. Anand

    Anand Guest

    I meant
    content = re.sub ('<[^!(a>)]([^>]|\n)*[^!(/a)]>', '', content)

    Sorry for the mistake.
    However, this also seems to print tags like <b>, <p>, etc.

    -Anand
     
    Anand, Feb 1, 2005
    #3
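One reason the patterns in #2 and #3 misbehave is that `[^!(a>)]` is a character class: it excludes the single characters `!`, `(`, `a`, `>` and `)`, not the sequence `a>`. A negative lookahead is one way to say "not an anchor tag" instead. The following is only a minimal sketch under some assumptions: the names `TAG_EXCEPT_ANCHOR` and `strip_tags_keep_anchors` are made up for illustration, tag names are matched case-insensitively, and no attribute value contains a literal `>`.

```python
import re

# Matches any tag except <a ...> and </a>. Like the pattern in the first
# post, it leaves "<!...>" constructs (comments, doctype) untouched.
# (?!/?a[\s/>]) rejects "a" or "/a" only when followed by whitespace, "/"
# or ">", so tags such as <abbr> are still removed.
TAG_EXCEPT_ANCHOR = re.compile(r'<(?!/?a[\s/>])[^!>][^>]*>', re.IGNORECASE)


def strip_tags_keep_anchors(content):
    """Return content with every tag removed except anchor tags."""
    return TAG_EXCEPT_ANCHOR.sub('', content)


sample = ('<p>See <b>the</b> <a href="http://example.com">docs</a> '
          'and <abbr>FAQ</abbr>.</p>')
print(strip_tags_keep_anchors(sample))
# prints: See the <a href="http://example.com">docs</a> and FAQ.
```

A regular expression like this is fragile on real-world markup (for example, a `>` inside an attribute value ends the match early), so for anything beyond simple input a real HTML parser is probably the safer choice.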
  4. Max M

    Max M Guest

    Max M, Feb 1, 2005
    #4
  5. Gabriel Cooper

    'first first '

    Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
    and &lt;), why did it leave the greater-than sign, and why are there two "first"s?
     
    Gabriel Cooper, Feb 2, 2005
    #5