Stripping HTML with RE

Steveo

I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)

re.compile("<.*?>")

I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also allowed the HTML tag (basically anything
with an H, a 1, or a 2). How can I get this to strip all tags except
H1 and H2? Any help you could give would be great.

Steve
 
Steven Bethard

Steveo said:
I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also allowed the HTML tag (basically anything
with an H, a 1, or a 2). How can I get this to strip all tags except
H1 and H2? Any help you could give would be great.

You probably want a lookahead assertion. From the docs at
http://docs.python.org/lib/re-syntax.html:

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion.
For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
'Asimov'.
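
To make the docs' example concrete, here is a small sketch of that same pattern in action (the variable names are just mine for illustration):

```python
import re

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
pattern = re.compile(r"Isaac (?!Asimov)")

newton = pattern.search("Isaac Newton")  # a match object
asimov = pattern.search("Isaac Asimov")  # None: the lookahead rejects it

# The lookahead consumes no characters, so the match covers only 'Isaac '.
matched_text = newton.group(0)
```

The point to notice is that the lookahead only checks what comes next; it never becomes part of the matched text.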

So I would write your example something like:
re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'

(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
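
For what it's worth, a compiled version might look roughly like the sketch below. It also runs the original character-class attempt on the same input to show why that one misbehaves: [^H1|^H2] is a set of single characters (H, 1, |, ^, 2), not a pair of alternatives. The sample string and variable names are made up for illustration.

```python
import re

# The original attempt: a character class that excludes the individual
# characters H, 1, |, ^ and 2, so any tag containing one of them survives.
char_class = re.compile(r"<[^H1|^H2]*?>")

# The negative-lookahead version, compiled once for reuse.
lookahead = re.compile(r"</?(?!H1|H2|/H1|/H2)[^>]*>")

sample = "<HTML><H1>Title</H1><p>text</p></HTML>"

char_class_result = char_class.sub("", sample)
# '<HTML><H1>Title</H1>text</HTML>'  (the HTML tags slip through)

lookahead_result = lookahead.sub("", sample)
# '<H1>Title</H1>text'
```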

Steve
 
Miles Fender

Steveo said:
I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)
> [...]

Instead of using REs, you might consider the StrippingParser
from the Python Cookbook:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52281

It allows you to specify explicitly which tags you want to leave
intact, so you'll be able to change your mind later without futzing
about with a complex RE...
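
To give a rough idea of the approach, here is a minimal sketch of a parser-based whitelist stripper. This is not the recipe's code; it is my own stand-in built on the standard library's html.parser.HTMLParser, and the names TagWhitelistStripper and strip_tags are hypothetical. It assumes you only care about keeping the bare tags, so attributes on allowed tags are dropped, and the parser reports tag names in lowercase.

```python
from html.parser import HTMLParser


class TagWhitelistStripper(HTMLParser):
    """Hypothetical helper: keep whitelisted tags, drop every other tag."""

    def __init__(self, allowed=("h1", "h2")):
        # convert_charrefs=False so entities such as &amp; pass through as-is.
        super().__init__(convert_charrefs=False)
        self.allowed = {tag.lower() for tag in allowed}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html, allowed=("h1", "h2")):
    """Return html with all tags removed except those in allowed."""
    parser = TagWhitelistStripper(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


cleaned = strip_tags("<H1>Title</H1><p>a &amp; b</p>")
```

Changing your mind later is then just a matter of passing a different allowed tuple, e.g. strip_tags(text, allowed=("h1", "h2", "b")).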


Miles
 
Steven Bethard

I said:
re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'

Maybe slightly better:
re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
'<H2>sdfsa</H2>'

I've just grouped things a bit differently so that I only have to write H1 and
H2 once.
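
One possible refinement, offered as a sketch rather than something from the original question: as written, the pattern also spares any tag whose name merely starts with H1 or H2 (a made-up <H10>, say) and it only recognises uppercase names. Adding \b after the group and compiling with re.IGNORECASE tightens both, assuming that is the behaviour you want. The names below are just for illustration.

```python
import re

# The grouped pattern from above, compiled unchanged for comparison.
loose = re.compile(r"<(?!/?(?:H1|H2))[^>]*>")

# Assumed refinement: \b requires the tag name to end right after H1/H2,
# and IGNORECASE lets <h1> and <H1> both survive.
strict = re.compile(r"<(?!/?(?:H1|H2)\b)[^>]*>", re.IGNORECASE)

sample = '<h1 class="x">T</h1><H10>x</H10><b>y</b>'

loose_result = loose.sub("", sample)
# 'T<H10>x</H10>y'  (lowercase h1 stripped, H10 wrongly kept)

strict_result = strict.sub("", sample)
# '<h1 class="x">T</h1>xy'
```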

Steve
 
