Stripping HTML tags from a TEXTAREA field

Jeff North · Jan 19, 2004

Hi,
I'm using a control called HTMLArea which allows a person to enter
text and converts the format instructions to html tags. Most of my
users know nothing about html so this is perfect for my use.
http://www.interactivetools.com/products/htmlarea/
This only works with IE5.5+.

What I need to do is to take this html formatted text and only display
part of the text on a web page (much like a news article which shows
only part of the story line).

I need to be able to remove all of the html tags to correctly display
the data.

Is there a regex/replace instruction(s) that I can use to do this?

Many thanks

Michael Winter · Jan 20, 2004

Is there a regex/replace instruction(s) that I can use to do this?

This will simply delete anything that resembles a tag in a string. This
means that the user cannot include anything within angle brackets, even if
the text does not form HTML.

string.replace( /<\S+>/g, '' );

For example,

var testString = '<textarea>this should still be here</textarea>';
testString = testString.replace( /<\S+>/g, '' );

will give testString the new value of "this should still be here".

Mike

Evertjan. · Jan 20, 2004

Only for IE:

<div id=temp></div>
<SCRIPT>
t="<span>example <b>of</b> html text</span>"
temp.innerHTML=t
t=temp.innerText
temp.innerHTML=""
alert(t)
</SCRIPT>

Michael Winter · Jan 20, 2004

string.replace( /<\S+>/g, '' );

Oops. That should be something more like:

string.replace( /<.+>/g, '' );

Sorry,
Mike

Lasse Reichstein Nielsen · Jan 20, 2004

Michael Winter said:
Oops. That should be something more like:

string.replace( /<.+>/g, '' );

Sorry,

I thought it was deliberate.
The first would correctly clean up "I am <b>so very tired</b>".
The second would leave it as "I am ".

/L
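A minimal sketch of the difference between the two patterns, run on the sentence quoted above. The variable names are only illustrative.

```javascript
// Compare the two patterns on the same input.
var input = "I am <b>so very tired</b>";

// \S+ cannot cross whitespace, so here each match stays inside one tag.
var nonSpace = input.replace(/<\S+>/g, "");

// .+ is greedy: it runs from the first "<" to the last ">" on the line.
var greedy = input.replace(/<.+>/g, "");

console.log(JSON.stringify(nonSpace)); // "I am so very tired"
console.log(JSON.stringify(greedy));   // "I am "
```

Note that replace() returns a new string and leaves the original untouched, so the result has to be assigned somewhere.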

Jeff North · Jan 20, 2004

| On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
|
| > string.replace( /<\S+>/g, '' );
|
| Oops. That should be something more like:
|
| string.replace( /<.+>/g, '' );

Thanks Mike for your help. It's most appreciated.

A couple of small problems

The first example didn't remove all of the tags. It mainly left the
font opening tag but successfully removed the closing tag.

The second example wiped the entire text.

So this is what I came up with
-----------------------
var txt2= new String();
var tmp = new String();
while( !rs.EOF )
{
tmp = rs.Fields.Item("Contents").Value;
tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );
txt2 += tmp;
rs.moveNext();
}

Jeff North · Jan 20, 2004

| Only for IE:
|
| <div id=temp></div>
| <SCRIPT>
| t="<span>example <b>of</b> html text</span>"
| temp.innerHTML=t
| t=temp.innerText
| temp.innerHTML=""
| alert(t)
| </SCRIPT>

An interesting technique. Unfortunately I need it to be non-browser
specific.

Evertjan. · Jan 20, 2004

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:

So this is what I came up with
-----------------------
var txt2= new String();
var tmp = new String();
while( !rs.EOF )
{
tmp = rs.Fields.Item("Contents").Value;
tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );

the /i case insensitive is superfluous

txt2 += tmp;
rs.moveNext();
}

Besides my IE-only posting, which IMHO gives the best results and could be
used behind a browser test, you could try a regex that stops at the first '>':

tmp = tmp.replace( /<[^>]+>/g, ' ' );

Or more modern with the '?' nongreedy operator:

tmp = tmp.replace( /<.+?>/gi, ' ' );

Both will fail in this string:

<img src='x.gif' alt='not visible > hi there < not visible'>
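To see why both patterns fail on that string, and what a quote-aware alternative might look like, here is a rough sketch. The function name stripTags is hypothetical, and it assumes that every "<" in the input starts a tag and that quotes inside a tag always delimit attribute values, so a stray "<" in ordinary text would still swallow what follows.

```javascript
var tricky = "<img src='x.gif' alt='not visible > hi there < not visible'>";

// Both regexes stop at the first ">" they see, even inside a quoted attribute.
var byClass = tricky.replace(/<[^>]+>/g, "");
var byLazy = tricky.replace(/<.+?>/g, "");

// Hypothetical character scanner that ignores ">" inside quoted values.
function stripTags(html) {
  var out = "";
  var inTag = false;
  var quote = ""; // the open quote character while inside an attribute value
  for (var i = 0; i < html.length; i++) {
    var ch = html.charAt(i);
    if (!inTag) {
      if (ch === "<") inTag = true;
      else out += ch;
    } else if (quote) {
      if (ch === quote) quote = "";
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === ">") {
      inTag = false;
    }
  }
  return out;
}

console.log(JSON.stringify(byClass));           // " hi there "
console.log(JSON.stringify(byLazy));            // " hi there "
console.log(JSON.stringify(stripTags(tricky))); // ""
```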

Jeff North · Jan 20, 2004

| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > So this is what I came up with
| > -----------------------
| > var txt2= new String();
| > var tmp = new String();
| > while( !rs.EOF )
| > {
| > tmp = rs.Fields.Item("Contents").Value;
| > tmp = tmp.replace( /<\S+>/gi, ' ' );
| > tmp = tmp.replace( /<.+>/gi, ' ' );
|
| the /i case insensitive is superfluous

I thought that too but added it as a precaution

Would this add any significant processing time? The strings I'm using
can get pretty long.

| > txt2 += tmp;
| > rs.moveNext();
| >}
|
| Next to my IEonly posting, which gives IMHO the best results and could be
| used in a browser testing code, you could try a nongreedy regex:
|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

No wonder I could never understand regex

Are there any good tutorials available for regex (plus lots of examples
to use)?

Michael Winter · Jan 20, 2004

I thought it was deliberate.
The first would correctly clean up "I am <b>so very tired</b>".
The second would leave it as "I am ".

It would. However, it would not remove any tags that contain spaces. It
might not be an issue, but the second version doesn't (seem to) do any
harm.

Mike

Michael Winter · Jan 20, 2004

The first example didn't remove all of the tags. It mainly left the
font opening tag but successfully removed the closing tag.

The first wouldn't remove tags that contained any whitespace, so tags with
attributes, or XHTML-style empty tags (<br />, for example) would remain.
That's what prompted the second suggestion.

The second example wiped the entire text.

I tested it with strings that I thought would cause unwanted results, but
they came out fine. I was surprised (with a little more thought after
posting it) that the entire text wasn't wiped. I just found out why[1].

The best safe result I can get is:

.replace( /<[^<>]+>/g, '' )

The only problem is that if angle brackets appear inside tags, the tag
won't be removed properly. Such input isn't really likely, unless
someone wants to explicitly exploit this hole.

tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );

I think I can explain why this works in your tests. The expression /<.+>/
matches "<anything>", where "anything" is literally that: letters,
numbers, punctuation, symbols, etc. If a tag is paired, like this:

<em id="example">This is emphasised</em>

the "em id=....</em" part is matched by the '.+' in the regular expression.
The earlier expression, /<\S+>/ would remove the closing tag, leaving:

<em id="example">This is emphasised

which is then correctly handled by the greedy second expression. However,
if you try this:

The word, <em>this</em> is emphasised

you'll only get:

The word, is emphasised

back. That is why you should try the third suggestion, /<[^<>]+>/g,
despite its flaw.

What a mess this is becoming.

Mike

[1] The reason is inconsequential, but it made the testing unfair.
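The two cases described in this post can be checked side by side. This is only a sketch: twoPass and singlePass are made-up names, and the two-pass version uses an empty replacement rather than the ' ' in Jeff's loop so the results are easier to compare.

```javascript
// Jeff's combination: /<\S+>/ first, then the greedy /<.+>/.
function twoPass(s) {
  return s.replace(/<\S+>/g, "").replace(/<.+>/g, "");
}

// The third suggestion: a match can never span another "<" or ">".
function singlePass(s) {
  return s.replace(/<[^<>]+>/g, "");
}

var paired = '<em id="example">This is emphasised</em>';
var inline = "The word, <em>this</em> is emphasised";

console.log(JSON.stringify(twoPass(paired)));    // "This is emphasised"
console.log(JSON.stringify(singlePass(paired))); // "This is emphasised"

// Here \S+ swallows "em>this</em", so the emphasised word is lost.
console.log(JSON.stringify(twoPass(inline)));    // "The word,  is emphasised"
console.log(JSON.stringify(singlePass(inline))); // "The word, this is emphasised"
```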

Evertjan. · Jan 20, 2004

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:

| the /i case insensitive is superfluous [..]
|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

No wonder I could never understand regex

Yes, those i's in /gi have a tendency to reappear by themselves ;-)

/<[^>]+>/g

start with a <
accept all next chars except > ([^>]) with a minimum of 1 (+)
and a > at the end
/g do this ad nauseam

/<.+?>/g

start with a <
accept all next chars (.) with a minimum of 1 (+) till(?) the first and
including > at the end
/g do this ad nauseam
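One practical difference between the two forms, shown as a small sketch: in JavaScript the dot does not match a line break, while a negated class such as [^>] does. The sample string is invented for illustration.

```javascript
// A tag whose attributes continue on the next line.
var multi = "<a\nhref='x'>link</a>";

// [^>] matches the newline, so the whole opening tag is removed.
var byClass = multi.replace(/<[^>]+>/g, "");

// "." stops at the newline, so the opening tag never matches.
var byLazy = multi.replace(/<.+?>/g, "");

console.log(JSON.stringify(byClass)); // "link"
console.log(JSON.stringify(byLazy));  // "<a\nhref='x'>link"
```

If the line breaks are removed before the tags are stripped, both patterns behave the same on this input.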

Are there any good tutorials available for regex (plus lots of examples
to use)?

<http://www.google.com/search?q=regex.tutorial> 819 hits
<http://www.google.com/search?q=regex.examples> 491 hits

Jeff North · Jan 21, 2004

Mike and Evertjan, thanks for all your time and effort; it is greatly
appreciated.

Mike, I tried your 3rd suggestion and it appears to work (so I won't
annoy you anymore LOL).

Here is what I've ended up with and some sample text. I know that
there is probably a more elegant way of doing this but I think that
this is almost self-documenting and easily modifiable:
----------------------------------
//--- read data from database
//--- strip out html tags and convert symbols to characters.
//--- var msg is called in client-side script.
var msg = new String( rsDir.Fields.Item("contents").Value );
msg = msg.replace(/\n/g,"");
msg = msg.replace(/\r/g,"");

//--- any double quote -> single quote
msg = msg.replace(/"/gi,"\'");
msg = msg.replace(/–/g,"-");

//--- any left/right quotes to a single quote
msg = msg.replace(/’/g,"\'");
msg = msg.replace(/“/g,"\'");
msg = msg.replace(/”/g,"\'");

//--- replace non-breaking spaces (entity or literal U+00A0) with a space
msg = msg.replace(/&nbsp;|\u00A0/g," ");

//-- strip html tags from text (courtesy of Michael Winter at
//-- comp.lang.javascript newsgroup)
msg = msg.replace( /<[^<>]+>/g, '' );
..
..
..
..
<script>
function ShowMsg()
{
//--- display a message. Do not break/split a word.
var ct = 200; //--- max. characters
var msg = new String();
msg = "<%=msg%>";
//--- move back to first space character.
while( ct > 0 && msg.charAt(ct) != " ") ct--;

document.write( msg.substr(0,ct) + "..." );
}
</script>
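One thing to watch in ShowMsg: when the message is shorter than 200 characters, charAt(ct) returns an empty string, so the loop walks back to the last space and drops the final word, and "..." is appended regardless. A possible variant, offered as a sketch with a hypothetical function name truncateAtWord:

```javascript
// Cut text at the last space at or before maxChars; leave short text alone.
function truncateAtWord(text, maxChars) {
  if (text.length <= maxChars) return text;   // nothing to cut
  var cut = text.lastIndexOf(" ", maxChars);  // search backwards from maxChars
  if (cut <= 0) cut = maxChars;               // no usable space: hard cut
  return text.substring(0, cut) + "...";
}

console.log(truncateAtWord("hello world again", 8)); // hello...
console.log(truncateAtWord("short", 200));           // short
console.log(truncateAtWord("abcdefghij", 4));        // abcd...
```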

------------ sample text ------------
<P><FONT face="arial, helvetica, sans-serif">Dear
All,</FONT></P>\r\n<P><FONT face="arial, helvetica,
sans-serif">2003 will soon be nothing more than a memory. But to
my mind, this last year will continue to live on as an "annus
mirabilis" -  year of wonders. </FONT></P>\r\n<P><FONT
face="arial, helvetica, sans-serif">And it has been
wonderful - our staff and students really covered themselves in glory
during 2003, with awards and accolades coming from virtually every
quarter. But we all know that awards only tell part of the story. What
made this last year “truly wonderful” was the fact that
the Institute achieved so much, in spite of a host of challenges and
uncertainties. We were able to succeed because of one simple fact
– our fantastic staff. All staff regularly did more with less
and continued to provide the very best in vocational education and
training. Thank you for all your hard work.</FONT></P>\r\n<P><FONT
face="arial, helvetica, sans-serif">In many ways, the coming
year will mark the beginning of profound changes to the way in which
Sydney Institute operates. Staff numbers will increase. Reporting
lines and responsibilities will change. Our business and work culture
will have to adapt to new circumstances, personalities and
opportunities. It will be a challenge. However, I am confident we will
meet these challenges in the same way TAFE has coped with change for
over 110 years – with professionalism and dedication. Those
qualities made 2003 a year to remember and I know that 2004
won’t be any different.</FONT></P>\r\n<P><FONT face="arial,
helvetica, sans-serif">Thank you again for all your efforts
during this last year. I look forward to 2004 with anticipation. I
hope you have a safe and happy holiday
season.</FONT></P>\r\n<P><BR><FONT face="arial, helvetica,
sans-serif">

John · Feb 13, 2004

An interesting technique. Unfortunately I need it to be non-browser
specific.

How about:
<SCRIPT>

function strip(el)
{
var retVal="";
for(var z=0; z < el.childNodes.length; z++)
if(el.childNodes[z].nodeName=="#text")
retVal+=el.childNodes[z].nodeValue;
else
retVal+=strip(el.childNodes[z]);
return retVal;
}

e=document.createElement("div");
e.innerHTML="<span>example <b>of</b> html text</span>";
t=strip(e);
alert(t)
</SCRIPT>
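The recursion in strip() can be tried outside a browser by feeding it plain objects shaped like DOM nodes. The sketch below does that with a hypothetical collectText function; it tests nodeType (3 is a text node) instead of nodeName, and the hand-built tree is only a stand-in for what a real parser would produce.

```javascript
// Concatenate the text of a node and all of its descendants.
function collectText(node) {
  if (node.nodeType === 3) return node.nodeValue; // 3 = TEXT_NODE
  var text = "";
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    text += collectText(kids[i]);
  }
  return text;
}

// Stand-in for <span>example <b>of</b> html text</span>
var tree = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "example " },
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "of" }] },
    { nodeType: 3, nodeValue: " html text" }
  ]
};

console.log(collectText(tree)); // example of html text
```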

Evertjan. · Feb 13, 2004

I thought the purpose was to eliminate innerHTML?

John · Feb 14, 2004

I thought the purpose was to eliminate innerHTML?

Doesn't work very well anyway!!!
I was up late and drunk ;-)
