Stripping HTML with RE

Discussion in 'Python' started by Steveo, Nov 9, 2004.

  1. Steveo

    Steveo Guest

    I am currently stripping HTML from a string with the following code.
    (I know it's not the best way to strip HTML but bear with me)

    re.compile("<.*?>")

    I wanted to allow all H1 and H2 tags, so I changed it to:

    re.compile("<[^H1|^H2]*?>")

    This seemed to work, but it also allowed the HTML tag (basically anything
    with an H or a 1 or a 2). How can I get this to strip all tags except
    H1 and H2? Any help you could give would be great.

    Steve
    Steveo, Nov 9, 2004
    #1
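
A note on why the second pattern misbehaves: inside square brackets, `[^H1|^H2]` is a single negated character class, not an alternation. It matches any one character other than `H`, `1`, `|`, `^` or `2`, so any tag containing one of those characters is left alone. A minimal sketch, with sample strings made up for illustration:

```python
import re

# The pattern from the post: a negated character class, not "not H1 or H2".
pattern = re.compile("<[^H1|^H2]*?>")

# <a> contains none of H, 1, |, ^, 2, so it is stripped.
print(pattern.sub("", "<a>text</a>"))        # text

# <HTML> contains an H, so the class cannot match it and the tag survives.
print(pattern.sub("", "<HTML>text</HTML>"))  # <HTML>text</HTML>

# <H1> survives too, but only because it happens to contain H and 1.
print(pattern.sub("", "<H1>text</H1>"))      # <H1>text</H1>
```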

  2. Steveo <stephen_p_barrett <at> hotmail.com> writes:
    >
    > I wanted to allow all H1 and H2 tags, so I changed it to:
    >
    > re.compile("<[^H1|^H2]*?>")
    >
    > This seemed to work, but it also allowed the HTML tag (basically anything
    > with an H or a 1 or a 2). How can I get this to strip all tags except
    > H1 and H2? Any help you could give would be great.


    You probably want a lookahead assertion. From the docs at
    http://docs.python.org/lib/re-syntax.html:

    (?!...)
    Matches if ... doesn't match next. This is a negative lookahead assertion.
    For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
    'Asimov'.

    So I would write your example something like:

    >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
    'sdfsa'
    >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
    '<H1>sdfsa'
    >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
    '<H1>sdfsa</H2>'

    (I was too lazy to compile the re, but of course that's what you'd normally want
    to do.)

    Steve
    Steven Bethard, Nov 9, 2004
    #2
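
As the post says, the pattern would normally be compiled once and reused. A minimal sketch of that, using the same regular expression; the names `TAG_RE` and `strip_tags` are hypothetical and not from the thread:

```python
import re

# Same pattern as in the post, compiled once so it can be reused.
# (?!...) is a negative lookahead: the match is rejected when the text
# at that point starts with H1, H2, /H1 or /H2.
TAG_RE = re.compile(r'</?(?!H1|H2|/H1|/H2)[^>]*>')

def strip_tags(text):
    """Remove every tag except H1 and H2 (hypothetical helper name)."""
    return TAG_RE.sub('', text)

print(strip_tags('<a>sdfsa</a>'))     # sdfsa
print(strip_tags('<H1>sdfsa</a>'))    # <H1>sdfsa
print(strip_tags('<H1>sdfsa</H2>'))   # <H1>sdfsa</H2>
```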

  3. Miles Fender

    Miles Fender Guest

    Steveo wrote:
    > I am currently stripping HTML from a string with the following code.
    > (I know it's not the best way to strip HTML but bear with me)
    > [...]


    Instead of using REs, you might consider the StrippingParser
    from the Python Cookbook:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281

    It allows you to specify explicitly which tags you want to leave
    intact, so you'll be able to change your mind later without futzing
    about with a complex RE...


    Miles
    Miles Fender, Nov 9, 2004
    #3
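
The parser-based approach can be sketched as follows. This is not the Cookbook recipe's code: it is a rough illustration using `html.parser.HTMLParser` from the Python 3 standard library, and the names `AllowListStripper` and `strip_html` are made up here. It assumes attributes on kept tags can be dropped and that kept tags may be emitted in lowercase.

```python
from html import escape
from html.parser import HTMLParser

class AllowListStripper(HTMLParser):
    """Keep only allowed tags; drop all other tags but keep their text."""

    def __init__(self, allowed=("h1", "h2")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attributes are dropped here.
        if tag in self.allowed:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape text so decoded entities cannot turn back into markup.
        self.parts.append(escape(data, quote=False))

def strip_html(text, allowed=("h1", "h2")):
    parser = AllowListStripper(allowed)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(strip_html('<H1>Title</H1><p>Some <b>bold</b> text</p>'))
# <h1>Title</h1>Some bold text
```

Changing which tags survive is then a matter of passing a different `allowed` tuple instead of editing a regular expression.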
  4. I wrote:
    > >>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
    > 'sdfsa'


    Maybe slightly better:

    >>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
    'sdfsa'
    >>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
    '<H1>sdfsa'
    >>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
    '<H1>sdfsa</H2>'
    >>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
    '<H2>sdfsa</H2>'

    I've just grouped things a bit differently so that I only have to write H1 and
    H2 once.

    Steve
    Steven Bethard, Nov 9, 2004
    #4
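
One possible refinement, offered as a sketch and not as part of the original answer: the grouped pattern treats `H1` and `H2` as prefixes, so it would also keep a tag such as `<H10>`, and it strips lowercase `<h1>`. Adding `\b` inside the lookahead and passing `re.IGNORECASE` addresses both. The name `TAG_RE` and the sample strings are made up for illustration.

```python
import re

# Variation on the grouped pattern: \b stops H1/H2 from matching as a
# prefix of a longer name, and IGNORECASE also keeps <h1> and <h2>.
TAG_RE = re.compile(r'<(?!/?(?:H1|H2)\b)[^>]*>', re.IGNORECASE)

print(TAG_RE.sub('', '<h1>title</h1><p>text</p>'))   # <h1>title</h1>text
print(TAG_RE.sub('', '<H1 class="x">title</H1>'))    # <H1 class="x">title</H1>
print(TAG_RE.sub('', '<H10>odd</H10>'))              # odd
```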
