Stripping HTML with RE

Discussion in 'Python' started by Steveo, Nov 9, 2004.

  1. Steveo

    Steveo Guest

    I am currently stripping HTML from a string with the following code.
    (I know it's not the best way to strip HTML but bear with me)


    I wanted to allow all H1 and H2 tags, so I changed it to:


    This seemed to work, but it also allowed the HTML tag (basically anything
    with an H, a 1, or a 2). How can I get this to strip all tags except
    H1 and H2? Any help you could give would be great.
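
    The symptom described above is what a negated character class would
    produce, since a class excludes single characters rather than the
    sequences "H1" or "H2". A minimal sketch, assuming patterns along these
    lines (both patterns and the sample string are hypothetical, not the
    code from the post):

```python
import re

html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>"

# Hypothetical baseline: remove anything that looks like a tag.
strip_all = re.sub(r"<[^>]*>", "", html)
print(strip_all)  # TitleSome bold text

# Hypothetical "allow H1/H2" attempt. The class [^>hH12] rejects the
# individual characters h, H, 1 and 2, so any tag containing one of them
# no longer matches and is left in place, including <html>.
naive = re.sub(r"<[^>hH12]*>", "", html)
print(naive)  # <html><h1>Title</h1>Some bold text</html>
```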

    Steveo, Nov 9, 2004

  2. You probably want a lookahead assertion. From the docs at

    (?!...) Matches if ... doesn't match next. This is a negative lookahead assertion.
    For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
    'Asimov'.

    So I would write your example something like:

    (I was too lazy to compile the re, but of course that's what you'd normally want
    to do.)
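
    A minimal sketch of the negative-lookahead idea, offered as one possible
    reading rather than the exact pattern from this post. The pattern, the
    name keep_headings and the sample string are assumptions; the \b keeps
    a tag name that merely starts with "h1" from being treated as H1.

```python
import re

html = ("<html><body><h1>Title</h1><h2>Sub</h2>"
        "<p>Some <b>bold</b> text</p></body></html>")

# After "<", refuse to match if the tag name is h1 or h2 (opening or
# closing form). Every other tag is consumed up to the next ">".
keep_headings = re.compile(r"<(?!h1\b|/h1\b|h2\b|/h2\b)[^>]*>", re.IGNORECASE)

result = keep_headings.sub("", html)
print(result)  # <h1>Title</h1><h2>Sub</h2>Some bold text
```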

    Steven Bethard, Nov 9, 2004

  3. Miles Fender

    Miles Fender Guest

    Instead of using REs, you might consider the StrippingParser
    from the Python Cookbook:

    It allows you to specify explicitly which tags you want to leave
    intact, so you'll be able to change your mind later without futzing
    about with a complex RE...
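
    A rough sketch of the same allow-list idea using the standard library's
    html.parser.HTMLParser. This is not the Cookbook recipe itself; the
    names AllowListStripper and strip_tags are hypothetical, and attributes
    on kept tags are dropped to keep the example short.

```python
from html import escape
from html.parser import HTMLParser


class AllowListStripper(HTMLParser):
    """Hypothetical helper: re-emits only the tags named in allowed_tags."""

    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = {tag.lower() for tag in allowed_tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lowercase; attrs are discarded.
        if tag in self.allowed_tags:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape text so decoded entities cannot reintroduce markup.
        self.parts.append(escape(data, quote=False))


def strip_tags(html, allowed_tags=("h1", "h2")):
    parser = AllowListStripper(allowed_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('<div><H1 id="top">Title</H1><p>a &lt; b</p></div>'))
# <h1>Title</h1>a &lt; b
```

    Changing which tags survive is then a matter of passing a different
    allowed_tags sequence, with no pattern to rewrite.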

    Miles Fender, Nov 9, 2004
  4. Maybe slightly better:

    I've just grouped things a bit differently so that I only have to write H1 and
    H2 once.
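
    One way the grouping could look, as a sketch rather than the pattern
    from this post: the optional slash and the tag names sit in a single
    group inside the lookahead, so each name appears once. The helper name
    make_tag_stripper and the sample string are assumptions.

```python
import re


def make_tag_stripper(allowed_tags):
    """Return a compiled pattern matching every tag except the allowed ones."""
    names = "|".join(re.escape(tag) for tag in allowed_tags)
    # /? covers both <h1> and </h1>; \b stops "h1" from matching "h1x".
    return re.compile(r"<(?!/?(?:%s)\b)[^>]*>" % names, re.IGNORECASE)


stripper = make_tag_stripper(["h1", "h2"])
text = '<div><H1 id="top">Title</H1><h2>Sub</h2><em>note</em></div>'
cleaned = stripper.sub("", text)
print(cleaned)  # <H1 id="top">Title</H1><h2>Sub</h2>note
```

    Note that the slash belongs inside the lookahead. Writing </?(?!h1|h2)
    instead lets the regex engine backtrack, match the slash as ordinary
    tag content, and strip </h1> after all.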

    Steven Bethard, Nov 9, 2004