Stripping HTML with RE

Discussion in 'Python' started by Steveo, Nov 9, 2004.

  1. Steveo

    I am currently stripping HTML from a string with the following code.
    (I know it's not the best way to strip HTML, but bear with me.)

    re.compile("<.*?>")

    I wanted to allow all H1 and H2 tags, so I changed it to:

    re.compile("<[^H1|^H2]*?>")

    This seemed to work, but it also allowed the HTML tag (basically anything
    with an H or a 1 or a 2). How can I get this to strip all tags except
    H1 and H2? Any help you could give would be great.

    Steve
     
    Steveo, Nov 9, 2004
    #1

  2. You probably want a lookahead assertion. From the docs at
    http://docs.python.org/lib/re-syntax.html:

    (?!...)
    Matches if ... doesn't match next. This is a negative lookahead assertion.
    For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
    'Asimov'.

    So I would write your example something like:
    '<H1>sdfsa</H2>'

    (I was too lazy to compile the re, but of course that's what you'd normally want
    to do.)
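
    For reference, here is a minimal sketch of how a negative lookahead could
    be applied to this problem. The pattern, the variable names, and the sample
    string are illustrative assumptions, not the exact code from this post; the
    sample was simply chosen so that the result matches the output quoted above.
    Note that "[^H1|^H2]" in the original attempt is a character class (any
    character other than H, 1, |, ^ or 2), which is why it also spared the HTML
    tag.

```python
import re

# Hypothetical pattern: match any tag unless it opens or closes H1 or H2.
# (?!...) is the negative lookahead; \b keeps "H1" from also shielding <H10>.
strip_tags = re.compile(r'<(?!/?H1\b)(?!/?H2\b)[^>]*>', re.IGNORECASE)

# Made-up input for illustration.
sample = '<B><H1>sdfsa</H2></B>'
result = strip_tags.sub('', sample)
print(result)  # '<H1>sdfsa</H2>'
```

    Compiling once and reusing strip_tags is what you'd normally do if the
    pattern is applied to many strings.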

    Steve
     
    Steven Bethard, Nov 9, 2004
    #2

  3. Miles Fender

    Instead of using REs, you might consider the StrippingParser
    from the Python Cookbook:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281

    It allows you to specify explicitly which tags you want to leave
    intact, so you'll be able to change your mind later without futzing
    about with a complex RE...
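
    To give a rough idea of the parser-based approach, here is a small sketch
    using the standard library's html.parser module. This is not the Cookbook
    recipe itself; the AllowListStripper class and the strip_except function
    are hypothetical names made up for this illustration, and the sketch
    assumes you are happy to drop attributes on the tags you keep.

```python
from html import escape
from html.parser import HTMLParser


class AllowListStripper(HTMLParser):
    """Hypothetical helper: keeps only allowed tags, drops all other markup."""

    def __init__(self, allowed=('h1', 'h2')):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attributes are discarded here.
        if tag in self.allowed:
            self.parts.append('<%s>' % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Re-escape text so a stray < or & cannot reintroduce markup.
        self.parts.append(escape(data, quote=False))


def strip_except(html_text, allowed=('h1', 'h2')):
    parser = AllowListStripper(allowed)
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.parts)


cleaned = strip_except('<HTML><H1>Title</H1><P>Some <B>bold</B> text</P></HTML>')
print(cleaned)  # '<h1>Title</h1>Some bold text'
```

    Changing which tags survive is then just a matter of passing a different
    allowed tuple, with no RE to rewrite.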


    Miles
     
    Miles Fender, Nov 9, 2004
    #3
  4. Maybe slightly better:
    '<H2>sdfsa</H2>'

    I've just grouped things a bit differently so that I only have to write H1 and
    H2 once.
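
    A sketch of what such a grouping could look like, assuming a single
    lookahead with a non-capturing group. As before, the pattern and the sample
    input are illustrative guesses rather than the exact code from this post.

```python
import re

# Hypothetical grouped form: one lookahead, tag names listed once in (?:...).
grouped = re.compile(r'<(?!/?(?:H1|H2)\b)[^>]*>', re.IGNORECASE)

# Made-up input for illustration.
sample = '<HTML><H2>sdfsa</H2></HTML>'
result = grouped.sub('', sample)
print(result)  # '<H2>sdfsa</H2>'
```

    Adding another allowed tag then only means extending the alternation, for
    example (?:H1|H2|H3).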

    Steve
     
    Steven Bethard, Nov 9, 2004
    #4
