re pattern for matching JS/CSS

i80and

I'm working on a program to remove tags from an HTML document, leaving
just the content, and I want to keep it simple. I've finished a system
to remove simple tags, but I also want all CSS and JS to be removed. What
re pattern could I use to do that?

I've tried
'<script[\S\s]*/script>'
but that didn't work properly. My knowledge of Python is fairly basic,
so I'm still learning re.
What pattern would work?
 
ina

i80and said:
I'm working on a program to remove tags from an HTML document, leaving
just the content, and I want to keep it simple. I've finished a system
to remove simple tags, but I also want all CSS and JS to be removed. What
re pattern could I use to do that?

I've tried
'<script[\S\s]*/script>'
but that didn't work properly. My knowledge of Python is fairly basic,
so I'm still learning re.
What pattern would work?

I use re.compile("<script.*?</script>", re.DOTALL)
for scripts. I strip these out first, since my tag-stripping re will
strip out script tags as well. Hope this was of help.
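
As a rough illustration of the difference between the two patterns, here is a small sketch. The sample string and variable names are made up for the example, and the "<style.*?</style>" pattern for CSS is an assumption that simply mirrors the script one.

```python
import re

# Hypothetical sample document: CSS, two script blocks, and real content.
html = ("<style>p {color: red}</style><script>a()</script>"
        "<p>keep me</p><script>b()</script>")

# Greedy: [\S\s]* runs on to the LAST "/script>", swallowing the paragraph.
greedy = re.compile(r"<script[\S\s]*/script>")

# Non-greedy: .*? stops at the FIRST closing tag; DOTALL lets "." match newlines.
lazy_script = re.compile(r"<script.*?</script>", re.DOTALL)
# Assumed CSS counterpart, built the same way.
lazy_style = re.compile(r"<style.*?</style>", re.DOTALL)

greedy_result = greedy.sub("", html)
lazy_result = lazy_style.sub("", lazy_script.sub("", html))

print(repr(greedy_result))  # the paragraph is gone along with both scripts
print(repr(lazy_result))    # only the paragraph remains
```

The greedy version is most likely what "didn't work properly": with more than one script block on the page, everything between the first opening tag and the last closing tag is removed.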
 
Tim Chase

ina said:
I use re.compile("<script.*?</script>", re.DOTALL)
for scripts. I strip these out first, since my tag-stripping re will
strip out script tags as well. Hope this was of help.

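This won't catch variations such as

<script
>
doEvil()
</script
>

which is valid HTML/XHTML, since whitespace is allowed before the
closing ">" of a tag. It also misses uppercase tags such as <SCRIPT>
unless re.IGNORECASE is set.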
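
A quick sketch of the problem, using a variant with a line break before each closing ">" plus an uppercase one. The loosened pattern at the end is only an assumption about how one might patch these two cases; it is not a robust fix.

```python
import re

script_re = re.compile(r"<script.*?</script>", re.DOTALL)

# A line break before ">" still gives a script element.
spaced = "<script\n>\ndoEvil()\n</script\n>"
# HTML tag names are case-insensitive.
upper = "<SCRIPT>doEvil()</SCRIPT>"

# The simple pattern leaves both inputs untouched.
spaced_result = script_re.sub("", spaced)
upper_result = script_re.sub("", upper)

# Assumed patch: ignore case and allow whitespace before the final ">".
looser_re = re.compile(r"<script\b.*?</script\s*>", re.DOTALL | re.IGNORECASE)
spaced_fixed = looser_re.sub("", spaced)
upper_fixed = looser_re.sub("", upper)

print(spaced_result == spaced, upper_result == upper)
print(repr(spaced_fixed), repr(upper_fixed))
```

Each patch like this covers one more variation, but the list of variations is long.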

For less valid HTML that someone might still try, one might find
something like

<scrip<script>hah</script>t>doEvil()</script>

which, once your pattern has removed the inner match, leaves the
valid but unwanted

<script>doEvil()</script>
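
To make that concrete, here is a short sketch of a single substitution pass on the crafted input. The helper strip_until_stable is hypothetical; repeating the substitution until nothing changes is one possible mitigation for this particular trick, not a general defence.

```python
import re

script_re = re.compile(r"<script.*?</script>", re.DOTALL)

crafted = "<scrip<script>hah</script>t>doEvil()</script>"

# One pass removes the inner block and splices the outer tag back together.
once = script_re.sub("", crafted)


def strip_until_stable(text):
    """Hypothetical helper: re-apply the pattern until the text stops changing."""
    while True:
        stripped = script_re.sub("", text)
        if stripped == text:
            return text
        text = stripped


stable = strip_until_stable(crafted)

print(repr(once))
print(repr(stable))
```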

I'd propose that it's better to use something such as
BeautifulSoup that actually parses the HTML, then walk the parsed
document, emitting only the tags on a whitelist you plan to allow
and skipping any tag that isn't on it.
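
A minimal sketch of the whitelist idea follows. To keep it dependency-free it uses the standard library's html.parser rather than BeautifulSoup, so it is a swapped-in technique. The names ALLOWED_TAGS, DROP_CONTENT, WhitelistFilter and clean are all made up for the example, and the whitelist contents are an assumption.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}  # hypothetical whitelist
DROP_CONTENT = {"script", "style"}              # drop these tags AND their bodies


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__()   # default convert_charrefs=True: text arrives unescaped
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text so a stray "<" can never become a tag in the output.
            self.out.append(escape(data, quote=False))


def clean(html_text):
    parser = WhitelistFilter()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(clean("<p>Hello <b>world</b></p><script>doEvil()</script><style>p{}</style>"))
print(clean("<scrip<script>hah</script>t>doEvil()</script>"))
```

Because only whitelisted tags are ever written and all text is re-escaped, oddly formed input may come out as garbled text, but not as a live script tag.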

-tkc
 
