How to replace a word with HTML?

gregpinero

Hi guys,

What I'm trying to do is find all instances of an acronym such as IBM
on a webpage and replace each one with <acronym title="International Business
Machines">IBM</acronym>. However, in my code below it replaces the < and >
with &lt; and &gt;.

Thus it replaces IBM with:
&lt;acronym title="International Business
Machines"&gt;IBM&lt;/acronym&gt;

at the HTML level.

Any help would be greatly appreciated.

-Greg

Here's the code I'm currently using:


(function() {
  var replacements, regex, key, textnodes, node, s;
  replacements = {
    'IBM': '<acronym title="International Business Machines">IBM</acronym>'
  };
  regex = {};

  for (key in replacements) {
    regex[key] = new RegExp(key, 'gi');
  }

  textnodes = document.evaluate("//body//text()", document, null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);

  for (var i = 0; i < textnodes.snapshotLength; i++) {
    node = textnodes.snapshotItem(i);
    s = node.data;
    for (key in replacements) {
      s = s.replace(regex[key], replacements[key]);
    }
    node.data = s;
  }
})();
 
Thomas 'PointedEars' Lahn

What I'm trying to do is find all instances of an acronym such as IBM
on a webpage and replace each one with <acronym title="International Business
Machines">IBM</acronym>. However, in my code below it replaces the < and >
with &lt; and &gt;.

Thus it replaces IBM with:
&lt;acronym title="International Business
Machines"&gt;IBM&lt;/acronym&gt;

at the HTML level.

The reason is that you are accessing text nodes. `acronym', however, is
an HTML _element_, therefore it requires an _element_ node. See below.
[...]
Here's the code I'm currently using:

(function() {
var replacements, regex, key, textnodes, node, s;
replacements = {
'IBM':'<acronym title="International Business Machines">IBM</acronym>'
};
regex = {};

for (key in replacements) {
regex[key] = new RegExp(key, 'gi');

(Do not use the Tab character for code indentation, use spaces.)

You do not want to use the `i' flag here, or else you will also replace "iBm".
Furthermore, you want to use at least

regex[key] = new RegExp('\\b' + key + '\\b', 'g');

or else you would also replace within "aibmo" (apparently a Swedish word).
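
To make the effect of both changes concrete, here is a small sketch that compares the original pattern with the word-bounded, case-sensitive one. The sample strings and the `[...]` marker are made up for illustration; they are not from the original code.

```javascript
// Original pattern: case-insensitive, no word boundaries.
const loose = new RegExp("IBM", "gi");
// Suggested pattern: case-sensitive, whole words only.
const strict = new RegExp("\\b" + "IBM" + "\\b", "g");

// Illustrative inputs (assumptions, not taken from a real page).
const samples = ["IBM", "iBm", "aibmo", "The IBM home page"];

// Mark every match with square brackets so the difference is visible.
const results = samples.map(function (s) {
  return {
    input: s,
    loose: s.replace(loose, "[$&]"),
    strict: s.replace(strict, "[$&]")
  };
});

results.forEach(function (r) {
  console.log(r.input, "|", r.loose, "|", r.strict);
});
```

Only the strict pattern leaves "iBm" and "aibmo" untouched.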

Last, but not least, as an example, you would replace in "IBM/SAM
Convention", where this "IBM" (International Brotherhood of Magicians,
www.magician.org) does not have anything to do with the company you
actually mean. And there are other meanings as well, for example
}

textnodes = document.evaluate( "//body//text()", document, null,
XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);

for (var i = 0; i < textnodes.snapshotLength; i++) {
node = textnodes.snapshotItem(i);
s = node.data;
for (key in replacements) {
s = s.replace(regex[key], replacements[key]);
}
node.data = s;
}
})();

If you think about it, the issues mentioned above aside, this cannot work:
You obtain a list of text nodes and manipulate each text node. However,
using a string in a text node that resembles an HTML element does not
create that element in the document tree, as in

ATITLE CDATA "International Business Machines"
<acronym>
TEXT
CDATA "IBM"
</acronym>

Instead it creates

TEXT
CDATA '<acronym title="International Business Machines">IBM</acronym>'

which is of course rendered as you described it.
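
As a rough model of why the markup shows up escaped: when a document is serialized back to HTML, the data of a text node is written out with its markup-significant characters escaped. The helper name `serializeTextData` below is hypothetical and only imitates that step; it is not a browser API.

```javascript
// Rough model (an assumption for illustration) of how an HTML serializer
// writes out text-node data: "&", "<" and ">" are escaped, so the data
// can never be read back as tags.
function serializeTextData(data) {
  return data
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The string that the original code assigns to node.data.
const assigned =
  '<acronym title="International Business Machines">IBM</acronym>';

const serialized = serializeTextData(assigned);
console.log(serialized);
```

The result is the escaped text from the original question, not an element.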

You will have to parse the HTML code or traverse the document tree instead,
and split each text node that contains one of the keys you define into a
text node that does not contain the key string in its value, followed by an
`acronym' (or `abbr'; there is a difference!) element node that has the key
as its child text node, followed by a text node that does not contain the
key in its value. That means you have to create the infix element node
yourself. E.g. you have to split

TEXT
"The IBM corporate home page"

into

TEXT
"The "

ATITLE CDATA "International Business Machines"
<acronym>
TEXT
CDATA "IBM"
</acronym>

TEXT
CDATA " corporate home page"
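
The split described above can be sketched in two steps: a pure function that cuts a string into text and acronym segments, and a DOM helper that builds the corresponding nodes. The names `splitByKeys` and `wrapTextNode` are hypothetical, and the sketch assumes the keys contain no regular-expression metacharacters; it is one possible way to do it, not a definitive implementation.

```javascript
const replacements = {
  IBM: "International Business Machines"
};

// Cut a string into plain-text and acronym segments (no DOM needed).
function splitByKeys(text, replacements) {
  const keys = Object.keys(replacements);
  if (keys.length === 0) return [{ type: "text", value: text }];
  // Assumption: keys are plain words without regex metacharacters.
  const regex = new RegExp("\\b(" + keys.join("|") + ")\\b", "g");
  const segments = [];
  let last = 0;
  let m;
  while ((m = regex.exec(text)) !== null) {
    if (m.index > last) {
      segments.push({ type: "text", value: text.slice(last, m.index) });
    }
    segments.push({ type: "acronym", value: m[1], title: replacements[m[1]] });
    last = regex.lastIndex;
  }
  if (last < text.length) {
    segments.push({ type: "text", value: text.slice(last) });
  }
  return segments;
}

// Replace one text node with the sequence of nodes its segments describe.
// Only meaningful in a browser; it is defined here but not called.
function wrapTextNode(node, replacements) {
  const segments = splitByKeys(node.data, replacements);
  const hasKey = segments.some(function (seg) { return seg.type === "acronym"; });
  if (!hasKey) return; // nothing to wrap, leave the node alone
  const doc = node.ownerDocument;
  const fragment = doc.createDocumentFragment();
  segments.forEach(function (seg) {
    if (seg.type === "text") {
      fragment.appendChild(doc.createTextNode(seg.value));
    } else {
      const el = doc.createElement("acronym");
      el.setAttribute("title", seg.title);
      el.appendChild(doc.createTextNode(seg.value));
      fragment.appendChild(el);
    }
  });
  node.parentNode.replaceChild(fragment, node);
}

const segments = splitByKeys("The IBM corporate home page", replacements);
console.log(segments);
```

In a browser, `wrapTextNode` could be called for each item of the XPath snapshot from the original code, skipping text nodes that already sit inside `acronym` or `script` elements.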

However, this splitting is not trivial to do. For example, consider the
markup "<i>I</i>B<b>M</b>", which has the following representation in the
document tree:

<i>
TEXT
CDATA "I"
</i>

TEXT
CDATA "B"

<b>
TEXT
CDATA "M"
</b>
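
A small sketch of why this case is hard: each text node is examined on its own, and none of the three contains the key, although the rendered text reads "IBM". The array below simply lists the text-node values from the tree above.

```javascript
// Text-node values for the markup <i>I</i>B<b>M</b>, in document order.
const textNodeValues = ["I", "B", "M"];
const regex = /\bIBM\b/;

// No single text node contains the key ...
const matchesPerNode = textNodeValues.map(function (value) {
  return regex.test(value);
});

// ... although the concatenated text does.
const matchesJoined = regex.test(textNodeValues.join(""));

console.log(matchesPerNode, matchesJoined);
```

A per-node search therefore misses this occurrence; catching it would mean matching across node boundaries and wrapping several nodes at once.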


HTH

PointedEars
 
gregpinero

Thanks for that answer. It's a shame that what I want to do is so
complicated. This is my first week with JavaScript, so I'm not sure if
I'm up for what you're suggesting.

So there's no way my original method would work, perhaps with special
escape characters for the <'s? (I know that wouldn't be politically
correct, but I just want to get something working and fix it up later).

I'm also not too worried if I miss "<i>I</i>B<b>M</b>", I just want to
catch most of them.

-Greg
 
Thomas 'PointedEars' Lahn

So there's no way my original method would work, perhaps with special
escape characters for the <'s?

Yes, no. CDATA is CDATA is CDATA, not PCDATA (_Parsed_ Character DATA).

(I know that wouldn't be politically correct, but I just want to get
something working and fix it up later).

This is beyond the issue of pc-ness. It would simply not matter.

I'm also not too worried if I miss "<i>I</i>B<b>M</b>", I just want to
catch most of them.

The following proprietary approach of parsing the HTML code, based on
<...>, should work in many HTML UAs now:

if (typeof document.body.innerHTML != "undefined")
{
  var a = [];
  for (var i in replacements)
  {
    a.push(i);
  }

  document.body.innerHTML = document.body.innerHTML.replace(
    new RegExp("(<(\\w+)[^>]*>[^<]*)\\b(" + a.join("|")
      + ")\\b([^<]*<\\/\\w+[^>]*>)", "g"),
    function(match, p1, p2, p3, p4)
    {
      switch (p2.toLowerCase())
      {
        case "acronym":
        case "area":
        case "script":
          return [p1, p3, p4].join("");

        default:
          return [p1, '<acronym title="', replacements[p3], '">',
                  p3, '<\/acronym>', p4].join("");
      }
    });
}

Test input with which the above works (the key _words_ in the input
are replaced with the corresponding `acronym' elements except in
cases where this is not desired: within tags, and within `acronym',
`area' and `script' elements) in Firefox 1.5.0.1/Linux (replace
`document.body.innerHTML' with `htmlSource'):

var replacements = {
  AOL: 'America Online, Inc.',
  IBM: 'International Business Machines Corp.'
};

var htmlSource = [
  ' <h1>Visiting the <acronym title="America Online"',
  '>AOL</acronym>-Arena</h1>',
  ' <div id="Replace" class="IBM">Recently I went to a football',
  ' match in the aol^W AOL-Arena:',
  ' </div><img src="aol-arena.png" alt="AOL Arena"><div',
  ' class="AOL">They used ibM^W IBM software:</div><br><img',
  ' src="ibm-screenshot.png" alt="IBM software">',
  ' <script type="text/javascript">var IBM, AOL =',
  ' "foobar";</script>'
].join('\n');

// [see above]

// window.alert(htmlSource);
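
For trying this outside a browser, the same regular expression can be wrapped in a function that works on a plain string. The function name `annotate` and the short input below are made up for this sketch; the callback is equivalent to the one above, just written without the array joins.

```javascript
const replacements = {
  AOL: "America Online, Inc.",
  IBM: "International Business Machines Corp."
};

// Apply the regular expression from the post to an HTML string.
function annotate(htmlSource, replacements) {
  const keys = Object.keys(replacements);
  const regex = new RegExp(
    "(<(\\w+)[^>]*>[^<]*)\\b(" + keys.join("|") + ")\\b([^<]*<\\/\\w+[^>]*>)",
    "g"
  );
  return htmlSource.replace(regex, function (match, p1, p2, p3, p4) {
    switch (p2.toLowerCase()) {
      case "acronym":
      case "area":
      case "script":
        return match; // leave these elements untouched
      default:
        return p1 + '<acronym title="' + replacements[p3] + '">' +
          p3 + "</acronym>" + p4;
    }
  });
}

// Illustrative input (an assumption, shorter than the test input above).
const input =
  "<p>The IBM corporate home page</p>" +
  '<acronym title="America Online">AOL</acronym>' +
  '<script type="text/javascript">var IBM;<\/script>';

const output = annotate(input, replacements);
console.log(output);
```

One thing this makes easy to observe: the pattern consumes a whole run of text up to the next closing tag, so when a run contains several keys, only the last one is wrapped in a single pass.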

Note that non-ASCII word characters (such as German umlauts) are treated as
word delimiters by \b. See [de] <...> for a more detailed explanation and a
viable workaround.

There is still the ambiguity of abbreviations/acronyms, though.


PointedEars
 
