Stripping HTML tags from a TEXTAREA field

Jeff North

Hi,
I'm using a control called HTMLArea which allows a person to enter
text and converts the format instructions to html tags. Most of my
users know nothing about html so this is perfect for my use.
http://www.interactivetools.com/products/htmlarea/
This only works with IE5.5+.

What I need to do is to take this html formatted text and only display
part of the text on a web page (much like a news article which shows
only part of the story line).

I need to be able to remove all of the html tags to correctly display
the data.

Is there a regex/replace instruction(s) that I can use to do this?

Many thanks
 
Michael Winter

> Is there a regex/replace instruction(s) that I can use to do this?

This will simply delete anything that resembles a tag in a string. This
means that the user cannot include anything within angle brackets, even if
the text does not form HTML.

string.replace( /<\S+>/g, '' );

For example,

var testString = '<textarea>this should still be here</textarea>';
testString = testString.replace( /<\S+>/g, '' );

will give testString the new value of "this should still be here".
(replace() returns the new string; it does not change the original.)

Mike
 
Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
> I'm using a control called HTMLArea which allows a person to enter
> text and converts the format instructions to html tags. Most of my
> users know nothing about html so this is perfect for my use.
> http://www.interactivetools.com/products/htmlarea/
> This only works with IE5.5+.
>
> What I need to do is to take this html formatted text and only display
> part of the text on a web page (much like a news article which shows
> only part of the story line).
>
> I need to be able to remove all of the html tags to correctly display
> the data.
>
> Is there a regex/replace instruction(s) that I can use to do this?

Only for IE:

<div id=temp></div>
<SCRIPT>
t="<span>example <b>of</b> html text</span>"
temp.innerHTML=t
t=temp.innerText
temp.innerHTML=""
alert(t)
</SCRIPT>
 
Lasse Reichstein Nielsen

Michael Winter said:
> Oops. That should be something more like:
>
> string.replace( /<.+>/g, '' );
>
> Sorry,

I thought it was deliberate.
The first would correctly clean up "I am <b>so very tired</b>".
The second would leave it as "I am ".
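
The difference between the two patterns can be checked directly. This is a minimal sketch using the sentence above plus one invented tag with an attribute; the variable names are only for illustration.

```javascript
// Compare the two patterns on the sentence from the post.
const sample = 'I am <b>so very tired</b>';

// \S+ cannot cross whitespace, so each tag is matched on its own.
const nonSpace = sample.replace(/<\S+>/g, '');

// .+ is greedy: it runs from the first "<" to the last ">" on the line.
const greedy = sample.replace(/<.+>/g, '');

console.log(JSON.stringify(nonSpace)); // "I am so very tired"
console.log(JSON.stringify(greedy));   // "I am "

// A tag that contains a space is left alone by the \S+ pattern.
const withAttr = '<font size=2>hello</font>'.replace(/<\S+>/g, '');
console.log(JSON.stringify(withAttr)); // "<font size=2>hello"
```

So neither pattern is safe on its own: one skips tags with attributes, the other can swallow the text between tags.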

/L
 
Jeff North

| On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
|
| > string.replace( /<\S+>/g, '' );
|
| Oops. That should be something more like:
|
| string.replace( /<.+>/g, '' );

Thanks Mike for your help. It's most appreciated.

A couple of small problems :)
The first example didn't remove all of the tags. It mainly left the
font opening tag but successfully removed the closing tag.

The second example wiped the entire text.

So this is what I came up with
-----------------------
var txt2= new String();
var tmp = new String();
while( !rs.EOF )
{
tmp = rs.Fields.Item("Contents").Value;
tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );
txt2 += tmp;
rs.moveNext();
}
 
Jeff North

| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > I'm using a control called HTMLArea which allows a person to enter
| > text and converts the format instructions to html tags. Most of my
| > users know nothing about html so this is perfect for my use.
| > http://www.interactivetools.com/products/htmlarea/
| > This only works with IE5.5+.
| >
| > What I need to do is to take this html formatted text and only display
| > part of the text on a web page (much like a news article which shows
| > only part of the story line).
| >
| > I need to be able to remove all of the html tags to correctly display
| > the data.
| >
| > Is there a regex/replace instruction(s) that I can use to do this?
|
| Only for IE:
|
| <div id=temp></div>
| <SCRIPT>
| t="<span>example <b>of</b> html text</span>"
| temp.innerHTML=t
| t=temp.innerText
| temp.innerHTML=""
| alert(t)
| </SCRIPT>

An interesting technique. Unfortunately I need it to be non-browser
specific.
 
Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
> So this is what I came up with
> -----------------------
> var txt2= new String();
> var tmp = new String();
> while( !rs.EOF )
> {
> tmp = rs.Fields.Item("Contents").Value;
> tmp = tmp.replace( /<\S+>/gi, ' ' );
> tmp = tmp.replace( /<.+>/gi, ' ' );

The /i (case-insensitive) flag is superfluous here, since neither pattern
contains a letter.

> txt2 += tmp;
> rs.moveNext();
> }


Besides my IE-only posting, which IMHO gives the best results and could be
used in code that tests for the browser, you could try a regex that cannot
run past the first '>':


tmp = tmp.replace( /<[^>]+>/g, ' ' );

Or, more modern, with the '?' non-greedy operator:

tmp = tmp.replace( /<.+?>/gi, ' ' );

Both will fail on this string:

<img src='x.gif' alt='not visible > hi there < not visible'>
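
The failure is easy to reproduce: both patterns stop at the first ">" they meet, and here that character sits inside the quoted alt text. The sketch below also tries a third pattern that treats quoted attribute values as single units. That third pattern is not from this thread and is only an illustration; it is still not a real HTML parser.

```javascript
const img = "<img src='x.gif' alt='not visible > hi there < not visible'>";

// Both patterns end the "tag" at the ">" inside the alt text.
const negated = img.replace(/<[^>]+>/g, '');
const lazy = img.replace(/<.+?>/g, '');
console.log(JSON.stringify(negated)); // " hi there "
console.log(JSON.stringify(lazy));    // " hi there "

// Illustrative alternative: skip over '...' and "..." inside a tag.
const quoteAware = /<(?:"[^"]*"|'[^']*'|[^'">])*>/g;
const cleaned = img.replace(quoteAware, '');
console.log(JSON.stringify(cleaned)); // ""

// Ordinary tags are still removed.
const plain = 'I am <b>so very tired</b>'.replace(quoteAware, '');
console.log(plain); // I am so very tired
```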
 
Jeff North

| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > So this is what I came up with
| > -----------------------
| > var txt2= new String();
| > var tmp = new String();
| > while( !rs.EOF )
| > {
| > tmp = rs.Fields.Item("Contents").Value;
| > tmp = tmp.replace( /<\S+>/gi, ' ' );
| > tmp = tmp.replace( /<.+>/gi, ' ' );
|
| the /i case insensitive is superfluous

I thought that too but added it as a precaution :)
Would this add any significant processing time? The strings I'm using
can get pretty long.
| > txt2 += tmp;
| > rs.moveNext();
| >}
|
| Next to my IEonly posting, which gives IMHO the best results and could be
| used in a browser testing code, you could try a nongreedy regex:
|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

No wonder I could never understand regex :)
Are there any good tutorials available for regex (plus lots of examples
to use)?
 
Michael Winter

> I thought it was deliberate.
> The first would correctly clean up "I am <b>so very tired</b>".
> The second would leave it as "I am ".

It would. However, the first would not remove any tags that contain spaces.
That might not be an issue, but the second version doesn't (seem to) do any
harm.

Mike
 
Michael Winter

> The first example didn't remove all of the tags. It mainly left the
> font opening tag but successfully removed the closing tag.

The first wouldn't remove tags that contained any whitespace, so tags with
attributes, or XHTML-style empty tags (<br />, for example) would remain.
That's what prompted the second suggestion.

> The second example wiped the entire text.

I tested it with strings that I thought would cause unwanted results, but
they came out fine. I was surprised (with a little more thought after
posting it) that the entire text wasn't wiped. I just found out why[1].

The best safe result I can get is:

.replace( /<[^<>]+>/g, '' )

The only problem is that if angle brackets appear inside tags, the tag
won't be removed properly. Such an occurrence isn't really likely,
unless someone wants to explicitly exploit this hole.

> tmp = tmp.replace( /<\S+>/gi, ' ' );
> tmp = tmp.replace( /<.+>/gi, ' ' );

I think I can explain why this works in your tests. The expression /<.+>/
matches "<anything>", where "anything" is literally that: letters,
numbers, punctuation, symbols, etc. If a tag is paired, like this:

<em id="example">This is emphasised</em>

the "em id=....</em" matches the '.+' part of the regular expression. The
earlier expression, /<\S+>/, would remove the closing tag, leaving:

<em id="example">This is emphasised

which is then correctly handled by the greedy second expression. However,
if you try this:

The word, <em>this</em> is emphasised

you'll only get:

The word, is emphasised

back. That is why you should try the third suggestion, /<[^<>]+>/g,
despite its flaw.

What a mess this is becoming. :)

Mike


[1] The reason is inconsequential, but it made the testing unfair.
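
The behaviour described above can be reproduced with the two example strings. This is a minimal sketch; the helper names `twoPass` and `singlePass` are made up for illustration.

```javascript
// The two-step replacement, as posted.
function twoPass(s) {
  return s.replace(/<\S+>/gi, ' ').replace(/<.+>/gi, ' ');
}

// The single pattern suggested instead.
function singlePass(s) {
  return s.replace(/<[^<>]+>/g, '');
}

const paired = '<em id="example">This is emphasised</em>';
const inline = 'The word, <em>this</em> is emphasised';

// Works by accident: the closing tag goes first, then the opening tag.
console.log(JSON.stringify(twoPass(paired)));    // " This is emphasised "

// Loses the word "this": <em>this</em> has no whitespace, so \S+ takes it all.
console.log(JSON.stringify(twoPass(inline)));    // "The word,   is emphasised"

console.log(JSON.stringify(singlePass(paired))); // "This is emphasised"
console.log(JSON.stringify(singlePass(inline))); // "The word, this is emphasised"
```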
 
Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
| the /i case insensitive is superfluous [..]
|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

No wonder I could never understand regex :)

Yes, those i's in /gi have a tendency to reappear by themselves ;-)

/<[^>]+>/g

start with a <
accept all following chars except > ([^>]), with a minimum of 1 (+)
and a > at the end
/g do this ad nauseam

/<.+?>/g

start with a <
accept all following chars (.), with a minimum of 1 (+), but as few as
possible (?), up to and including the first >
/g do this ad nauseam
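
To see what each pattern actually matches, `String.prototype.match` with the /g flag returns every match. The sketch below also shows one practical difference between the two: `.` does not match a line break, while `[^>]` does. The sample strings are invented for illustration.

```javascript
const html = '<b>bold</b> and <i>italic</i>';

// On simple tags both patterns find the same four matches.
const byClass = html.match(/<[^>]+>/g);
const byLazy = html.match(/<.+?>/g);
console.log(byClass.join(' ')); // <b> </b> <i> </i>
console.log(byLazy.join(' '));  // <b> </b> <i> </i>

// A tag broken across two lines.
const wrapped = "<a\nhref='x'>link</a>";
const classResult = wrapped.replace(/<[^>]+>/g, '');
const lazyResult = wrapped.replace(/<.+?>/g, '');
console.log(JSON.stringify(classResult)); // "link"
console.log(JSON.stringify(lazyResult));  // "<a\nhref='x'>link"
```

If the stored text can contain line breaks inside tags, the character-class form is the safer of the two.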

> Are there any good tutorials available for regex (plus lots of examples
> to use)?

<http://www.google.com/search?q=regex.tutorial> 819 hits
<http://www.google.com/search?q=regex.examples> 491 hits
 
Jeff North

| On Tue, 20 Jan 2004 08:54:17 GMT, Jeff North
|
| > On Tue, 20 Jan 2004 00:31:58 GMT, in comp.lang.javascript Michael
| >
| >> | On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
| >> |
| >> | > string.replace( /<\S+>/g, '' );
| >> |
| >> | Oops. That should be something more like:
| >> |
| >> | string.replace( /<.+>/g, '' );
| >
| > The first example didn't remove all of the tags. It mainly left the
| > font opening tag but successfully removed the closing tag.
|
| The first wouldn't remove tags that contained any whitespace, so tags with
| attributes, or XHTML-style empty tags (<br />, for example) would remain.
| That's what prompted the second suggestion.
|
| > The second example wiped the entire text.
|
| I tested it with strings that I thought would cause unwanted results, but
| they came out fine. I was surprised (with a little more thought after
| posting it) that the entire text wasn't wiped. I just found out why[1].
|
| The best safe result I can get is:
|
| .replace( /<[^<>]+>/g, '' )
|
| The only problem is that if angle brackets appear inside tags, the tag
| won't be removed properly. Such an occurrence isn't really likely,
| unless someone wants to explicitly exploit this hole.
|
| > tmp = tmp.replace( /<\S+>/gi, ' ' );
| > tmp = tmp.replace( /<.+>/gi, ' ' );
|
| I think I can explain why this works in your tests. The expression /<.+>/
| matches "<anything>", where "anything" is literally that: letters,
| numbers, punctuation, symbols, etc. If a tag is paired, like this:
|
| <em id="example">This is emphasised</em>
|
| the "em id=....</em" matches the '.' token in the regular expression. The
| earlier expression, /<\S+>/ would remove the closing tag, leaving:
|
| <em id="example">This is emphasised
|
| which is then correctly handled by the greedy second expression. However,
| if you try this:
|
| The word, <em>this</em> is emphasised
|
| you'll only get:
|
| The word, is emphasised
|
| back. That is why you should try the third suggestion, /<[^<>]+>/g,
| despite its flaw.
|
| What a mess this is becoming. :)
|
| Mike
|
|
| [1] The reason is inconsequential, but it made the testing unfair.

Mike and Evertjan, thanks for all your time and effort; it is greatly
appreciated.

Mike, I tried your 3rd suggestion and it appears to work (so I won't
annoy you anymore LOL).

Here is what I've ended up with and some sample text. I know that
there is probably a more elegant way of doing this but I think that
this is almost self-documenting and easily modifiable:
----------------------------------
//--- read data from database
//--- strip out html tags and convert symbols to characters.
//--- var msg is called in client-side script.
var msg = new String( rsDir.Fields.Item("contents").Value );
msg = msg.replace(/\n/g,"");
msg = msg.replace(/\r/g,"");

//--- any double quote -> single quote
msg = msg.replace(/&quot;/gi,"\'");
msg = msg.replace(/–/g,"-");

//--- any left/right quotes to a single quote
msg = msg.replace(/’/g,"\'");
msg = msg.replace(/“/g,"\'");
msg = msg.replace(/”/g,"\'");

//--- remove non-breaking spaces
msg = msg.replace(/&nbsp;/gi," ");

//-- strip html tags from text (courtesy of Michael Winter at
//-- comp.lang.javascript newsgroup)
msg = msg.replace( /<[^<>]+>/g, '' );
..
..
..
..
<script>
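function ShowMsg()
{
//--- display a message. Do not break/split a word.
var ct = 200; //--- max. characters
var msg = new String();
msg = "<%=msg%>";
//--- short messages are shown whole.
if( msg.length <= ct )
{
document.write( msg );
return;
}
//--- move back to first space character.
while( ct > 0 && msg.charAt(ct) != " ") ct--;
//--- no space found: cut at the maximum instead of showing nothing.
if( ct == 0 ) ct = 200;

document.write( msg.substr(0,ct) + "..." );
}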
</script>

------------ sample text ------------
<P><FONT face=&quot;arial, helvetica, sans-serif&quot;>Dear
All,</FONT></P>\r\n<P><FONT face=&quot;arial, helvetica,
sans-serif&quot;>2003 will soon be nothing more than a memory. But to
my mind, this last year will continue to live on as an &quot;annus
mirabilis&quot; -&nbsp; year of wonders. </FONT></P>\r\n<P><FONT
face=&quot;arial, helvetica, sans-serif&quot;>And it has been
wonderful - our staff and students really covered themselves in glory
during 2003, with awards and accolades coming from virtually every
quarter. But we all know that awards only tell part of the story. What
made this last year “truly wonderful” was the fact that
the Institute achieved so much, in spite of a host of challenges and
uncertainties. We were able to succeed because of one simple fact
– our fantastic staff. All staff regularly did more with less
and continued to provide the very best in vocational education and
training. Thank you for all your hard work.</FONT></P>\r\n<P><FONT
face=&quot;arial, helvetica, sans-serif&quot;>In many ways, the coming
year will mark the beginning of profound changes to the way in which
Sydney Institute operates. Staff numbers will increase. Reporting
lines and responsibilities will change. Our business and work culture
will have to adapt to new circumstances, personalities and
opportunities. It will be a challenge. However, I am confident we will
meet these challenges in the same way TAFE has coped with change for
over 110 years – with professionalism and dedication. Those
qualities made 2003 a year to remember and I know that 2004
won’t be any different.</FONT></P>\r\n<P><FONT face=&quot;arial,
helvetica, sans-serif&quot;>Thank you again for all your efforts
during this last year. I look forward to 2004 with anticipation. I
hope you have a safe and happy holiday
season.</FONT></P>\r\n<P><BR><FONT face=&quot;arial, helvetica,
sans-serif&quot;>
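
For comparison, the same steps (strip tags, convert a few entities, cut at a word boundary) can be sketched as two small functions that run outside ASP. This is only a sketch under assumptions: the names `toPlainText` and `excerpt` are hypothetical, only the `&nbsp;` and `&quot;` entities are handled, and line breaks are turned into spaces rather than deleted so that words on either side do not run together.

```javascript
// Hypothetical helpers; the names are not from the thread.

// Reduce a stored HTML fragment to plain text.
function toPlainText(html) {
  return html
    .replace(/[\r\n]+/g, ' ')   // line breaks become spaces
    .replace(/<[^<>]+>/g, '')   // tag pattern from the thread
    .replace(/&nbsp;/gi, ' ')   // non-breaking space entity
    .replace(/&quot;/gi, "'")   // double-quote entity becomes a single quote
    .replace(/\s+/g, ' ')       // collapse runs of whitespace
    .trim();
}

// Cut text to at most `max` characters without splitting a word.
function excerpt(text, max) {
  if (text.length <= max) return text;   // short text is returned whole
  let cut = text.lastIndexOf(' ', max);  // last space at or before `max`
  if (cut <= 0) cut = max;               // no usable space: hard cut
  return text.slice(0, cut) + '...';
}

const stored =
  '<P><FONT face=&quot;arial&quot;>Dear All,</FONT></P>\r\n' +
  '<P>2003 was an &quot;annus mirabilis&quot; -&nbsp; year of wonders.</P>';

const plain = toPlainText(stored);
console.log(plain);              // Dear All, 2003 was an 'annus mirabilis' - year of wonders.
console.log(excerpt(plain, 20)); // Dear All, 2003 was...
```

Tags are removed before the entities are converted, so converted text can never be mistaken for a tag.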
 
John

> An interesting technique. Unfortunately I need it to be non-browser
> specific.

How about:
<SCRIPT>

function strip(el)
{
  // Collect the text of every text node below el, in document order.
  var retVal="";
  for(var z=0; z < el.childNodes.length; z++)
  {
    if(el.childNodes[z].nodeName=="#text")
      retVal+=el.childNodes[z].nodeValue;
    else
      retVal+=strip(el.childNodes[z]);
  }
  return retVal;
}

e=document.createElement("div");
e.innerHTML="<span>example <b>of</b> html text</span>";
t=strip(e);
alert(t)
</SCRIPT>
 
Evertjan.

John wrote on 13 feb 2004 in comp.lang.javascript:
>> An interesting technique. Unfortunately I need it to be non-browser
>> specific.
>
> How about:
> <SCRIPT>
>
> function strip(el)
> {
> var retVal="";
> for(var z=0; z < el.childNodes.length; z++)
> if(el.childNodes[z].nodeName=="#text")
> retVal+=el.childNodes[z].nodeValue;
> else
> retVal+=strip(el.childNodes[z]);
> return retVal;
> }
>
> e=document.createElement("div");
> e.innerHTML="<span>example <b>of</b> html text</span>";
> t=strip(e);
> alert(t)
> </SCRIPT>

I thought the purpose was to eliminate innerHTML?
 
John

> John wrote on 13 feb 2004 in comp.lang.javascript:
>>> An interesting technique. Unfortunately I need it to be non-browser
>>> specific.
>>
>> How about:
>> <SCRIPT>
>>
>> function strip(el)
>> {
>> var retVal="";
>> for(var z=0; z < el.childNodes.length; z++)
>> if(el.childNodes[z].nodeName=="#text")
>> retVal+=el.childNodes[z].nodeValue;
>> else
>> retVal+=strip(el.childNodes[z]);
>> return retVal;
>> }
>>
>> e=document.createElement("div");
>> e.innerHTML="<span>example <b>of</b> html text</span>";
>> t=strip(e);
>> alert(t)
>> </SCRIPT>
>
> I thought the purpose was to eliminate innerHTML?

Doesn't work very well anyway!!!
I was up late and drunk ;-)
 
