Remove HTML from String?

jotto · Jan 8, 2006

I can't find a method to remove HTML from a string in the core API. PHP
has something called strip_tags. Does Ruby have anything like this?
http://us3.php.net/manual/en/function.strip-tags.php

Horacio Sanson · Jan 9, 2006

A regular expression can strip the HTML tags from any string...

I use this:

# Get the html data in a string by any method
html_string = get_html_method

# strip all the html tags from the html data
html_string.gsub!(/(<[^>]*>)|\n|\t/s) {" "}

This may not be the best way (robust or fast), but it is enough for my needs.

Horacio

On Monday 09 January 2006 17:38, jotto wrote:

Austin Ziegler · Jan 9, 2006

jotto said:
I can't find a method to remove HTML from a string in the core API. PHP
has something called strip_tags. Does Ruby have anything like this?
http://us3.php.net/manual/en/function.strip-tags.php

Not built in. It's not really appropriate for the core language.
That's one of the things that makes PHP easy to use for people who are
trying to do simple things, but makes it hard when you get into
engineering and maintaining real programs. As was suggested by the
other respondent, it's relatively easy to remove:

a.gsub(%r{</?[^>]+?>}, '')

-austin

Gavin Kistner · Jan 9, 2006

Austin Ziegler said:
Not built in. It's not really appropriate for the core language.
That's one of the things that makes PHP easy to use for people who are
trying to do simple things, but makes it hard when you get into
engineering and maintaining real programs. As was suggested by the
other respondent, it's relatively easy to remove:

a.gsub(%r{</?[^>]+?>}, '')

...just pray that the HTML you are modifying is valid, and not some
garbage file that web browsers happen to treat as intended. For
example, watch the above regexp go to town on some invalid HTML:

class String
  def strip_tags
    self.gsub( %r{</?[^>]+?>}, '' )
  end
end

source = <<ENDHTML
<html><body>
I'm pretending to know how to code. I <3 HTML, it's teh best!!!!<br>
<script>
for ( i=0; i<10; i++ ){ document.write(i+'<br>') }
</script>
BLASTOFFS!!!!
</body>
ENDHTML

puts source.strip_tags
#=> I'm pretending to know how to code. I
#=>
#=> for ( i=0; i') }
#=>
#=> BLASTOFFS!!!!

Eric Schwartz · Jan 9, 2006

Gavin Kistner said:
..just pray that the HTML you are modifying is valid, and not some
garbage file that web browsers happen to treat as intended.

More like, "Just pray the HTML you are modifying doesn't happen to be
completely valid, but not formed in exactly the way you are
expecting." For instance, the following HTML snippet is completely
valid, but screws up the regex:

a <img src="greaterthan.gif" alt=">" /> b

irb(main):010:0> a='a <img src="greaterthan.gif" alt=">" /> b'
=> "a <img src=\"greaterthan.gif\" alt=\">\" /> b"
irb(main):011:0> a.gsub(%r{</?[^>]+?>}, '')
=> "a \" /> b"

Finding other such examples is an exercise for the reader. This sort
of thing is why, as a rule, I avoid parsing HTML with regexes.

-=Eric
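
The `alt=">"` case above fails because the end of a tag can only be found by tracking whether you are inside a quoted attribute value, which a single character class cannot do. A minimal sketch of that idea follows, using only core Ruby. The method name `strip_tags_quote_aware` is hypothetical (it is not part of Ruby or any library), and the sketch assumes quotes inside tags are balanced. It still treats every `<` as the start of a tag, so it does no better than the regex on invalid input such as "I <3 HTML".

```ruby
# Hypothetical helper, not a library API: drops everything between "<" and
# the ">" that closes the tag, ignoring ">" inside quoted attribute values.
def strip_tags_quote_aware(html)
  out = +""
  in_tag = false
  quote = nil # the quote character currently open inside a tag, if any

  html.each_char do |ch|
    if !in_tag
      if ch == "<"
        in_tag = true
      else
        out << ch
      end
    elsif quote
      quote = nil if ch == quote # closing quote ends the attribute value
    elsif ch == '"' || ch == "'"
      quote = ch                 # opening quote: ">" no longer ends the tag
    elsif ch == ">"
      in_tag = false
    end
  end
  out
end

puts strip_tags_quote_aware('a <img src="greaterthan.gif" alt=">" /> b')
# prints "a  b" (two spaces, one from each side of the removed tag)
```

This only illustrates why quote state matters; for real documents a proper HTML parsing library remains the safer choice.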

J. Ryan Sobol · Jan 10, 2006

If you're concerned about preventing browsers from rendering the HTML in
your string, replacing < and > with &lt; and &gt; is more
effective than trying to remove the tags.

~ ryan ~
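
As a sketch of the escaping approach described above, Ruby's standard library provides `CGI.escapeHTML`, which replaces `&`, `<`, `>` and both quote characters with entities. The wrapper name `escape_for_display` is hypothetical and only there to make the example easy to call.

```ruby
require 'cgi'

# Hypothetical wrapper: escape a string so a browser shows the markup
# as literal text instead of rendering it.
def escape_for_display(text)
  CGI.escapeHTML(text)
end

puts escape_for_display("<b>bold</b> & <3")
# prints: &lt;b&gt;bold&lt;/b&gt; &amp; &lt;3
```

Nothing is removed from the input, so the original string can be recovered with `CGI.unescapeHTML`.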

Austin Ziegler · Jan 10, 2006

Eric Schwartz said:
More like, "Just pray the HTML you are modifying doesn't happen to be
completely valid, but not formed in exactly the way you are
expecting." For instance, the following HTML snippet is completely
valid, but screws up the regex:

a <img src="greaterthan.gif" alt=">" /> b

Actually, that is *not* completely valid, at least not valid XHTML
(which is what I use these days). You have to do that as:

a <img src="greaterthan.gif" alt="&gt;" /> b

But my regexp wasn't intended to be complete; there are full libraries
out there for that.

-austin

Eric Schwartz · Jan 10, 2006

Austin Ziegler said:
Actually, that is *not* completely valid, at least not valid XHTML
(which is what I use these days).

When wrapped with the appropriate tags, it validated as HTML 4.01, which
is what I recommend most people generate these days (because of some,
but not all, of the reasons elucidated at
http://codinginparadise.org/weblog/2005/08/xhtml-considered-harmful.html).
So yes, it is valid HTML, which is all I claimed it to be.

I specifically didn't mention XHTML, since the bits of the thread I
saw referenced HTML, and the two are different enough that I figured XHTML
would have been mentioned if that's what was wanted. Of course with
XHTML, you have CDATA sections, which can contain all sorts of
nastiness that can trip you up just as badly.

You have to do that as:
a <img src="greaterthan.gif" alt="&gt;" /> b

But my regexp wasn't intended to be complete; there are full libraries
out there for that.

Right; my point was that in my experience, regexes seem to work just
fine, until suddenly they don't, and then you have to spend silly
amounts of time compensating for them-- or you could just use a proper
library in the first place, and not have to worry about it.

-=Eric

Donkey Agony · Jan 11, 2006

Austin said:
Actually, that is *not* completely valid, at least not valid XHTML
(which is what I use these days). You have to do that as:
a <img src="greaterthan.gif" alt="&gt;" /> b

Not true. This ...

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Some Title</title>
</head>
<body>
a <img src="greaterthan.gif" alt=">" /> b
</body>
</html>

validates just fine. Don't trust me, verify it!

Bottom line:

You *must* escape `<` and `&` in markup.

You do *not* have to escape `>` (even in attribute values), nor do you
have to escape `'` or `"`. Those are common, but mistaken assumptions.
E.g. with numerous "HTML editors" (MS Frontpage, anyone?) when you type

And Susy said, "Hello world."

The oh-so-kind editor will generously change that in the markup to

And Susy said, &quot;Hello world.&quot;

which, while not actually incorrect, is certainly quite over-the-top
when a plain

And Susy said, "Hello world."

would have done just fine, thank you very much. ...
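
The rule above (escape `<` and `&`, leave `>` and the quote characters alone) can be sketched with a hash-based `gsub`. The names `MINIMAL_ESCAPES` and `escape_text_minimal` are hypothetical, and the sketch assumes the string is going into text content. Inside an attribute value, the quote character that delimits the value would still need escaping.

```ruby
# Hypothetical helper for text content only (not attribute values):
# escapes just the two characters that must be escaped in markup.
MINIMAL_ESCAPES = { "&" => "&amp;", "<" => "&lt;" }.freeze

def escape_text_minimal(text)
  text.gsub(/[&<]/, MINIMAL_ESCAPES)
end

puts escape_text_minimal('And Susy said, "Hello world." 1 < 2 & 3 > 2')
# prints: And Susy said, "Hello world." 1 &lt; 2 &amp; 3 > 2
```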

ruby talk · Jan 11, 2006


I like this code I found. I did not write it, and I wish I could give
credit to whoever created it, but I lost the website.

require 'cgi'

def html2text html
  # Collapse whitespace and &nbsp;, then unwrap links whose text is their own URL
  text = html.
    gsub(/(&nbsp;|\n|\s)+/im, ' ').squeeze(' ').strip.
    gsub(/<([^\s]+)[^>]*(src|href)=\s*(.?)([^>\s]*)\3[^>]*>\4<\/\1>/i, '\4')

  # Replace each remaining src/href tag with a numbered marker like [1]
  links = []
  linkregex = /<[^>]*(src|href)=\s*(.?)([^>\s]*)\2[^>]*>\s*/i
  while linkregex.match(text)
    links << $~[3]
    text.sub!(linkregex, "[#{links.size}]")
  end

  # Drop scripts, styles and comments; map block tags to plain-text layout
  text = CGI.unescapeHTML(
    text.
      gsub(/<(script|style)[^>]*>.*<\/\1>/im, '').
      gsub(/<!--.*-->/m, '').
      gsub(/<hr(| [^>]*)>/i, "___\n").
      gsub(/<li(| [^>]*)>/i, "\n* ").
      gsub(/<blockquote(| [^>]*)>/i, '> ').
      gsub(/<(br)(| [^>]*)>/i, "\n").
      gsub(/<(\/h[\d]+|p)(| [^>]*)>/i, "\n\n").
      gsub(/<[^>]*>/, '')
  ).lstrip.gsub(/\n[ ]+/, "\n") + "\n"

  # Append the collected URLs as numbered footnotes
  for i in (0...links.size).to_a
    text = text + "\n  [#{i+1}] <#{CGI.unescapeHTML(links[i])}>" unless links[i].nil?
  end
  links = nil
  text
end

input = " <h1>Title</h1> This is the body. Testing " \
        "<a href='http://www.google.com/'>link to Google</a>. " \
        "Testing image <img src='/noimage.png'>. The End."

print html2text(input)

Valery Visnakov · Aug 4, 2009

jotto said:
I can't find a method to remove HTML from a string in the core API. PHP
has something called strip_tags. Does Ruby have anything like this?
http://us3.php.net/manual/en/function.strip-tags.php

Here is a gem for sanitizing strings: http://wonko.com/post/sanitize
