Stripping unwanted html

Wild Al · Oct 5, 2006

Hi everyone:

I'm trying to strip HTML from a string, with the exception of a few HTML tags.

I have found the following code:

def strip_tags(html)
  if html.index("<")
    text = ""
    tokenizer = HTML::Tokenizer.new(html)

    while token = tokenizer.next
      node = HTML::Node.parse(nil, 0, 0, token, false)
      # result is only the content of any Text nodes
      text << node.to_s if node.class == HTML::Text
    end
    # strip any comments, and if they have a newline at the end (i.e. a line
    # with only a comment) strip that too
    text.gsub(/<!--(.*?)-->[\n]?/m, "")
  else
    html # already plain text
  end
end

I'm trying to understand what is going on in this code but cannot find
documentation for HTML::Tokenizer or HTML::Node.parse. Does anyone know
what the parameters of the parse method are for?

In the while loop, how do you access the HTML tag? If I could access
the HTML tags, I could then decide whether to keep each tag or not.

Thanks for reading,
Wild Al
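
For anyone reading along without the old Rails classes installed: in the Rails source of that time, HTML::Node.parse appears to be declared as parse(parent, line, pos, content, strict = true), so the call above passes no parent, a dummy line and position, the token text, and non-strict parsing. Treat that reading as an assumption and check your own Rails version. The sketch below is a simplified, stdlib-only stand-in (TOKEN_PATTERN, tokenize and tag_name are made-up names, not Rails API) that shows the same idea: split the markup into tag and text tokens, then look at each tag's name to decide what to do with it.

```ruby
# Simplified stand-in for the tokenizer loop; NOT the Rails HTML::Tokenizer.
# TOKEN_PATTERN, tokenize and tag_name are hypothetical names.
TOKEN_PATTERN = /<[^>]*>|[^<]+/

# Split markup into tag tokens ("<b>", "</b>") and text tokens ("hello").
def tokenize(html)
  html.scan(TOKEN_PATTERN)
end

# Return the lowercased tag name for a tag token, or nil for anything else
# (plain text, comments, doctype declarations).
def tag_name(token)
  match = token.match(%r{\A</?\s*([A-Za-z][A-Za-z0-9]*)})
  match && match[1].downcase
end

tokenize('<p>Hello <b class="x">world</b></p>').each do |token|
  name = tag_name(token)
  if name
    puts "tag  #{name}: #{token}"
  else
    puts "text: #{token}"
  end
end
```

This stand-in does not handle comments, CDATA sections or a ">" inside an attribute value, all of which a real tokenizer has to deal with.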

Michael Moen · Oct 5, 2006

Al, I recently needed a parser similar to Perl's HTML::Scrubber. This is
what I came up with; you may find it useful:
http://www.underpantsgnome.com/2006/09/09/using-hpricot-to-scrub-html/

Michael

eden · Oct 6, 2006

Hey, that's RoR's strip_tags method

I ran into the same issue, went down the same path and, instead of
fretting about the lack of docs, just hacked my way through using IRB.

Here's a modified version of what I came up with, maybe you'll find it
useful?

def strip_tags_except(html, exceptions = [])
  if html.index("<")
    text = ""
    tokenizer = HTML::Tokenizer.new(html)
    while token = tokenizer.next
      case node = HTML::Node.parse(nil, 0, 0, token, false)
      when HTML::Tag
        text << node.to_s if exceptions.include?(node.name)
      when HTML::Text
        text << node.to_s
      end
    end
    text
  else
    html
  end
end

The one I had also stripped attributes and closed up dangling tags if
it found any. Have a look at RoR's strip_links for more examples of
HTML::Node/HTML::Tokenizer usage.
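
A rough idea of what "strip attributes and close dangling tags" can look like, written as a stdlib-only sketch rather than with HTML::Tokenizer. This is not the version mentioned above; strip_tags_and_attributes and VOID_TAGS are made-up names, and the regex tokenizing is a simplification that ignores a ">" inside attribute values.

```ruby
# Stand-in sketch, stdlib only; it does NOT use HTML::Tokenizer or HTML::Node.
# strip_tags_and_attributes and VOID_TAGS are hypothetical names.
VOID_TAGS = %w[br hr img].freeze

def strip_tags_and_attributes(html, allowed = [])
  open_tags = []        # allowed tags that were opened but not yet closed
  out = String.new

  html.scan(/<[^>]*>|[^<]+/) do |token|
    unless token.start_with?("<")
      out << token                       # text token: keep as-is
      next
    end

    match = token.match(%r{\A<(/?)\s*([A-Za-z][A-Za-z0-9]*)})
    next unless match                    # comment, doctype, malformed: drop
    closing = !match[1].empty?
    name = match[2].downcase
    next unless allowed.include?(name)   # tag not on the allow-list: drop

    if closing
      # Ignore a close tag that was never opened; otherwise close any
      # inner tags that are still open, then the tag itself.
      if open_tags.include?(name)
        loop do
          top = open_tags.pop
          out << "</#{top}>"
          break if top == name
        end
      end
    else
      out << "<#{name}>"                 # rebuilt from the name, so attributes are gone
      open_tags << name unless VOID_TAGS.include?(name)
    end
  end

  # Close anything left dangling, innermost first.
  open_tags.reverse_each { |name| out << "</#{name}>" }
  out
end

puts strip_tags_and_attributes('<p onclick="x()">Hi <b>there</p>', %w[p b])
```

Dropping every attribute also drops href and src, so allowed links and images come out inert; keeping those would need a per-tag attribute allow-list on top of this.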

eden · Oct 6, 2006

eden said:
Have a look at RoR's strip_links for more examples of
HTML::Node/HTML::Tokenizer usage.

Sorry, I meant sanitize:
http://api.rubyonrails.com/classes/ActionView/Helpers/TextHelper.html#M000516

Wild Al · Oct 8, 2006

eden said:
Here's a modified version of what I came up with, maybe you'll find it
useful?

I found this method very useful; it is exactly what I needed. Thanks.
To everyone else: your suggestions helped too, especially in
understanding Ruby. Thanks again...

eden li · Oct 9, 2006

Glad I could help. One security-related caveat.

The method I posted doesn't strip attributes, so it may be possible for
someone to "hack" your site by putting JavaScript into an attribute (an
onclick handler, for example) on one of the allowed tags.

You can fix that by changing the last line of the first branch of the
if statement from "text" to "sanitize(text)", e.g.:

if html.index("<")
  ...
  sanitize(text)
else
  ...