Remove HTML tags (except anchor tag) from a string using regular expressions

Discussion in 'Python' started by Nico Grubert, Feb 1, 2005.

  1. Nico Grubert

    Nico Grubert Guest


    I want to remove all HTML tags from a string "content" except <a ...>xxx</a>.

    My script reads like this:

    import re
    content = re.sub('<([^!>]([^>]|\n)*)>', '', content)

    It works fine. It removes all HTML tags from "content".
    Unfortunately, this also removes <a ...>xxx</a> occurrences.
    Any idea how to modify this to remove all HTML tags except <a ...>xxx</a>?

    Thanks in advance,
    Nico Grubert, Feb 1, 2005

  2. Anand

    Anand Guest

    How about...

    import re
    content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
    Seems to work for me.


    Anand, Feb 1, 2005

  3. Anand

    Anand Guest

    I meant
    content = re.sub ('<[^!(a>)]([^>]|\n)*[^!(/a)]>', '', content)

    Sorry for the mistake.
    However, this also seems to print tags like <b>, <p>, etc.

    Anand, Feb 1, 2005
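A note on why the patterns above misbehave: inside square brackets, `[^!(a>)]` is a character class, so it rejects any single character from the set `!`, `(`, `a`, `>`, `)` rather than the sequence "a>". One common way to exclude a whole tag name is a negative lookahead. The following is only a minimal sketch, assuming Python 3 and reasonably well-formed markup; the names `NON_ANCHOR_TAG` and `strip_tags_except_anchors` are made up for illustration and do not come from the thread.

```python
import re

# Match any tag whose name is not exactly "a".
# (?!/?a\b) rejects both <a ...> and </a>, but still allows <abbr>, <b>, etc.
# [^!>] keeps the original behaviour of leaving <!...> constructs alone.
# [^>]* already matches newlines, so the "|\n" alternative is not needed.
NON_ANCHOR_TAG = re.compile(r'<(?!/?a\b)[^!>][^>]*>', re.IGNORECASE)

def strip_tags_except_anchors(content):
    """Remove every tag except opening and closing anchor tags."""
    return NON_ANCHOR_TAG.sub('', content)

sample = '<p>See <a href="x.html">this <b>page</b></a>.</p><abbr>n/a</abbr>'
print(strip_tags_except_anchors(sample))
# -> See <a href="x.html">this page</a>.n/a
```

Like any regex approach, this can be fooled by a ">" inside an attribute value or by tag-like text inside comments, so it is best treated as a quick filter rather than a real HTML parser.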
  4. Max M

    Max M Guest

    Max M, Feb 1, 2005
  5. 'first first '

    Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
    and &lt;), why did it leave the greater-than sign, and why are there two "first"s?
    Gabriel Cooper, Feb 2, 2005
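Since stray ">" and "<" characters in text are exactly where regex-based stripping tends to go wrong, a parser-based variant may be more robust. The sketch below is one possible approach using the standard library's `html.parser` in Python 3, not the code discussed in this thread; the names `AnchorKeeper` and `keep_only_anchors` are hypothetical. Note that, unlike the original regex, it also drops comments and doctype declarations, and it re-escapes text on output.

```python
from html import escape
from html.parser import HTMLParser

class AnchorKeeper(HTMLParser):
    """Collect text and <a> tags; silently drop every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Rebuild the tag from the parsed attributes.
            # A value of None means the attribute had no value.
            rendered = ''.join(
                ' %s' % name if value is None
                else ' %s="%s"' % (name, escape(value, quote=True))
                for name, value in attrs
            )
            self.parts.append('<a%s>' % rendered)

    def handle_endtag(self, tag):
        if tag == 'a':
            self.parts.append('</a>')

    def handle_data(self, data):
        # Re-escape text so a literal ">" cannot be mistaken for markup.
        self.parts.append(escape(data, quote=False))

def keep_only_anchors(content):
    parser = AnchorKeeper()
    parser.feed(content)
    parser.close()
    return ''.join(parser.parts)

print(keep_only_anchors('<p>2 > 1, see <a href="x.html">this <b>page</b></a></p>'))
```

The parser lowercases tag and attribute names, so the rebuilt anchor tags may differ in case and quoting from the input.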
