Stripping HTML with RE

Steveo · Nov 9, 2004

I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)

re.compile("<.*?>")

I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also allowed the HTML tag (basically anything
with an H, a 1, or a 2). How can I get this to strip all tags except
H1 and H2? Any help you could give would be great.

Steve

Steven Bethard · Nov 9, 2004

Steveo said:
I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also allowed the HTML tag (basically anything
with an H, a 1, or a 2). How can I get this to strip all tags except
H1 and H2? Any help you could give would be great.

You probably want a lookahead assertion. From the docs at
http://docs.python.org/lib/re-syntax.html:

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion.
For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
'Asimov'.
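
To make the difference concrete, here is a small sketch (the variable names are
only illustrative). It shows that the original "[^H1|^H2]" is a negated
character class, which excludes single characters rather than tag names, and
that a negative lookahead behaves as the docs describe:

```python
import re

# A negated character class matches ONE character that is not in the set.
# "[^H1|^H2]" therefore means "any character except H, 1, 2, | or ^".
char_class = re.compile(r"<[^H1|^H2]*?>")

# <HTML> survives because it contains an H, not because it is a heading tag.
html_tag_stripped = char_class.sub("", "<HTML><b>text</b>")

# A negative lookahead checks what follows without consuming any text.
lookahead = re.compile(r"Isaac (?!Asimov)")
newton = lookahead.search("Isaac Newton")   # matches 'Isaac '
asimov = lookahead.search("Isaac Asimov")   # no match
```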

So I would write your example something like:

>>> import re
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'

(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
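
For reference, a compiled version might look like the sketch below. The names
STRIP_EXCEPT_H1_H2 and strip_tags are made up for illustration; the pattern is
the same one used in the session above.

```python
import re

# Compile once at import time and reuse the pattern object for every string.
# (Hypothetical name; the pattern itself is unchanged.)
STRIP_EXCEPT_H1_H2 = re.compile(r'</?(?!H1|H2|/H1|/H2)[^>]*>')

def strip_tags(text):
    """Remove every tag except opening and closing H1 and H2."""
    return STRIP_EXCEPT_H1_H2.sub('', text)
```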

Steve

Miles Fender · Nov 9, 2004

Steveo said:
I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)
> [...]

Instead of using REs, you might consider the StrippingParser
from the Python Cookbook:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281

It allows you to specify explicitly which tags you want to leave
intact, so you'll be able to change your mind later without futzing
about with a complex RE...
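
To give a feel for the parser-based approach, here is a rough sketch. It is
not the Cookbook recipe itself; it assumes the html.parser module from the
current standard library, and AllowlistStripper and strip_html are
hypothetical names.

```python
from html.parser import HTMLParser

class AllowlistStripper(HTMLParser):
    """Keep text plus any tag named in `allowed`; drop every other tag."""

    def __init__(self, allowed=("h1", "h2")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)   # HTMLParser reports tag names in lowercase
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append("<%s>" % tag)   # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text, allowed=("h1", "h2")):
    parser = AllowlistStripper(allowed)
    parser.feed(text)
    parser.close()   # flush any buffered text
    return "".join(parser.parts)
```

Note that kept tags come back lowercased and without attributes, and character
references in the text are decoded, so this is a starting point rather than a
drop-in replacement.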

Miles

Steven Bethard · Nov 9, 2004

I said:
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'

Maybe slightly better:

>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
'<H2>sdfsa</H2>'

I've just grouped things a bit differently so that I only have to write H1 and
H2 once.
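
Since the tag names now appear in one place, the pattern could also be built
from a list. The sketch below goes a little beyond the pattern above, and both
additions are assumptions on my part: a \b after the tag name, so that allowing
B does not also protect BODY or BR, and re.IGNORECASE, so lowercase tags are
kept too. make_stripper is a hypothetical helper name.

```python
import re

def make_stripper(allowed_tags):
    """Return a function that strips every tag not named in allowed_tags."""
    names = "|".join(re.escape(tag) for tag in allowed_tags)
    # \b requires the tag name to end here, so "B" does not match "BODY".
    pattern = re.compile(r"<(?!/?(?:%s)\b)[^>]*>" % names, re.IGNORECASE)
    return lambda text: pattern.sub("", text)

strip_except_headings = make_stripper(["H1", "H2"])
```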

Steve
