Remove Empty Tags on page

David

Hi All,

I am working on a script that is theoretically simple, but I cannot get it to
work completely. I am dealing with a page spit out by .NET that leaves empty
tags in the markup. I need a JavaScript solution to go behind and clean up
after the page loads.

.NET will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not strictly empty (they
contain <em> tags), they also need to be removed.

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

Any ideas of the best way to approach this?

David
 
David

David said:
Hi All,

I am working on a script that is theoretically simple, but I cannot get it
to work completely. I am dealing with a page spit out by .NET that leaves
empty tags in the markup. I need a JavaScript solution to go behind and
clean up after the page loads.
David


For anyone who looks at the page: you will see the script only loops
through a certain set of tags...

var tagArray = ["em", "span", "p", "a", "li", "ul"];

Using all tags, el = document.getElementsByTagName("*"), would be the
preferred method, but I found myself needing several loops and several
node.parentNode.removeChild() calls, and it still didn't work correctly.

David
 
Doug Miller

Any reason you can't just use the search-and-replace function in your favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.
 
David

Doug Miller said:
Any reason you can't just use the search-and-replace function in your
favorite
text editor? If you have shell access to a Unix machine, this is pretty
trivial.

Yes, the reason is that .NET is rendering this HTML live. This has to
be done to the actual rendered page on the fly, after it has loaded.

David
 
Bjoern Hoehrmann

* David wrote in comp.lang.javascript:
The .NET will leave behind any combination of nested tags. Here is an
example below. Even though the <span> tags are not empty, as they contain
<em> tags they also need to be removed.

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

Here is a simple test page of what I have done so far. It does remove some
of the tags but always leaves behind some empty tags...
http://mysite.verizon.net/res8xvny/removeTags.html

You are checking whether innerHTML equals the empty string. Obviously in
the example above, the span's innerHTML does not equal the empty string,
until after you have removed the em child. The p element might never be
removed because it contains line breaks and spaces, which may or may not
match your definition of empty.
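
To make both points concrete, here is a small sketch that runs without a browser. It uses a hand-built tree of plain objects instead of a real DOM; the helpers `el`, `text`, and `isStrictlyEmpty` are hypothetical names introduced only for this illustration, and "strictly empty" is assumed to mean "has no child nodes at all".

```javascript
// Hypothetical stand-ins for DOM nodes: just enough structure to run anywhere.
function el(name, children) {
  return { nodeType: 1, nodeName: name, childNodes: children || [] };
}
function text(data) {
  return { nodeType: 3, nodeName: '#text', data: data, childNodes: [] };
}

// Assumed definition of "empty": the node has no children of any kind.
function isStrictlyEmpty(node) {
  return node.childNodes.length === 0;
}

// Models: <p>\n<span><em></em></span>\n</p>
var em = el('em');
var span = el('span', [em]);
var p = el('p', [text('\n'), span, text('\n')]);

// The span is not empty until its em child has been removed.
var spanEmptyBefore = isStrictlyEmpty(span);
span.childNodes.splice(span.childNodes.indexOf(em), 1);
var spanEmptyAfter = isStrictlyEmpty(span);

// Even with the span gone, the p still holds two whitespace text nodes.
p.childNodes.splice(p.childNodes.indexOf(span), 1);
var pEmptyAfter = isStrictlyEmpty(p);
```

Whether the p counts as empty therefore depends on whether whitespace-only text nodes are treated as content.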
Any ideas of the best way to approach this?

You could traverse the whole document, processing children before their
parent, and check for each whether, say, node.firstChild is null and it
is one of the elements you want to remove, and if so, remove the node.
For example:

var removables = {'span': 1, 'em': 1};
function f(node) {
  while (node != null) {
    // Remember the sibling first; node may be detached below.
    var next = node.nextSibling;

    // Process children before testing the node itself.
    f(node.firstChild);

    if (node.nodeType == 1 && node.firstChild == null &&
        node.nodeName.toLowerCase() in removables)
      node.parentNode.removeChild(node);

    node = next;
  }
}

An alternative is processing the nodes in reverse document order, but
usually that would involve NodeLists, as when using getElementsByTagName,
and it's a bad idea to modify the document while iterating over
a NodeList. Using a TreeWalker you could do

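var removables = {'span': 1, 'em': 1};

// whatToShow 1 is NodeFilter.SHOW_ELEMENT; the filter returns
// 1 (FILTER_ACCEPT) for removable elements and 3 (FILTER_SKIP) otherwise.
var w = document.createTreeWalker(
  document.documentElement, 1,
  function(n) { return n.nodeName.toLowerCase()
    in removables ? 1 : 3 }, true);

// Move to the last accepted node in document order.
while (w.lastChild());

for (var node = w.currentNode; node;) {
  // Step back before removing, so the walker is not left on a detached node.
  var prev = w.previousNode();
  if (node.firstChild == null)
    node.parentNode.removeChild(node);
  node = prev;
}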

Or, similarly, with a NodeIterator:

var removables = {'span': 1, 'em': 1};

var w = document.createNodeIterator(
  document.documentElement, 1,
  function(n) { return n.nodeName.toLowerCase()
    in removables ? 1 : 3 }, true);

// Run the iterator to the end, then walk backwards.
while (w.nextNode());

for (var node = w.previousNode(); node; node = w.previousNode())
  if (node.firstChild == null)
    node.parentNode.removeChild(node);

Note, however, that neither is universally supported by current browsers.
These two solutions work because in reverse document order, a child
always precedes its parent, so the children have already been pruned when
checking whether the parent should be pruned as well.
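
Where neither interface is available, the same reverse-order idea can be sketched by first copying the elements into a plain array, which stays fixed while the tree changes, and then walking that array backwards. The example below is only an illustration on a hand-built tree of plain objects so that it runs without a browser; `el`, `detach`, `snapshot`, and `pruneReverse` are hypothetical helper names, not DOM APIs. In a real page, the snapshot would instead be a copy of the list returned by getElementsByTagName('*').

```javascript
// Hypothetical stand-in for DOM elements, with parent links.
function el(name, children) {
  var node = { nodeType: 1, nodeName: name.toUpperCase(), parentNode: null, childNodes: [] };
  (children || []).forEach(function (child) {
    child.parentNode = node;
    node.childNodes.push(child);
  });
  return node;
}

// Remove a node from its parent (plays the role of parentNode.removeChild).
function detach(node) {
  var siblings = node.parentNode.childNodes;
  siblings.splice(siblings.indexOf(node), 1);
  node.parentNode = null;
}

// Static array of all descendant elements, in document order.
function snapshot(root, out) {
  out = out || [];
  root.childNodes.forEach(function (child) {
    if (child.nodeType === 1) {
      out.push(child);
      snapshot(child, out);
    }
  });
  return out;
}

var removables = { span: 1, em: 1 };

// Walk the snapshot backwards: every child is visited before its parent.
function pruneReverse(root) {
  var all = snapshot(root);
  for (var i = all.length - 1; i >= 0; i--) {
    var node = all[i];
    if (node.childNodes.length === 0 &&
        removables.hasOwnProperty(node.nodeName.toLowerCase())) {
      detach(node);
    }
  }
}

// <body><p><span><em/></span><span><em/></span></p><span><img/></span></body>
var body = el('body', [
  el('p', [el('span', [el('em')]), el('span', [el('em')])]),
  el('span', [el('img')])
]);
pruneReverse(body);
```

After the call, the p is left with no children (it is not in `removables`, so it stays), and the span that wraps the img is untouched.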
 
Lasse Reichstein Nielsen

David said:
I am dealing with a page spit out by .NET that leaves empty
tags in the markup. I need a javascript solution to go behind and do a clean
up after the page loads.
....
Any ideas of the best way to approach this?

You need to ensure that you remove children before you test whether
parents are empty. That means that you need to control the order in which
you check nodes.

I'd do a recursive traversal from the root and remove those elements
that appear empty.

/**
 * Call with node and function that identifies removable tags.
 * Removes all empty sub-elements that are identified as removable,
 * as well as all all-whitespace text nodes, and returns true if
 * the argument node should be removed.
 */
function removeEmptyNodes(node, removableTag) {
  if (node.nodeType == 3) { // text node
    if (/^\s*$/.exec(node.nodeValue)) {
      return true;
    }
  } else if (node.nodeType == 1) { // element node, check children first
    var nextChild = null, child = node.firstChild;
    while(child) {
      nextChild = child.nextSibling;
      if (removeEmptyNodes(child, removableTag)) {
        node.removeChild(child);
      }
      child = nextChild;
    }
    if (!node.firstChild && removableTag(node.tagName)) {
      return true;
    }
  }
  return false;
}

function cleanDocument() {
  var tagsRe = /^(span|p|em|a|li|ul)$/i;
  removeEmptyNodes(document.body,
    function(tagName) { return !!tagsRe.exec(tagName); });
}


Good luck
/L
 
David

Thank you for the code and explanation. This does help...

David
 
David

Lasse Reichstein Nielsen said:
You need to ensure that you remove children before you test whether
parents are empty. That means that you need to control the order in which
you check nodes.

I'd do a recursive traversal from the root and remove those elements
that appear empty.

Likewise, this works very well. Thank you for taking the time to help out
with this.

David
 
RobG

Have you considered going down the DOM and removing any in-line element
whose textContent or innerText is empty? That way you don't have to
go down nested empty nodes; they will be removed as soon as you reach
the highest ancestor.

function getText(el)
{
  if (typeof el == 'string') el = document.getElementById(el);

  // Try DOM 3 textContent property first
  if (typeof el.textContent == 'string') {return el.textContent;}

  // Try MS innerText property
  if (typeof el.innerText == 'string') {return el.innerText;}
  return rec(el);

  // Recurse over child nodes
  function rec(el) {
    var n, x = el.childNodes;
    var txt = [];
    for (var i=0, len=x.length; i<len; ++i){
      n = x[i];

      // Use TEXT_NODE and ELEMENT_NODE as apparently IE 8 will
      // "not support enumeration of nodeType constant values"
      // G. Talbert clj
      if (n.TEXT_NODE == n.nodeType) {
        txt.push(n.data);
      } else if (n.ELEMENT_NODE == n.nodeType) {
        txt.push(rec(n));
      }
    }
    return txt.join('').replace(/\s+/g,' ');
  }
}


function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');

  // These nodes are allowed to be empty
  var allowedEmpty = 'base basefont body br col hr html image '
                   + 'input isindex link meta param title';
  var re;

  // Collection is live, so as nodes are removed, length gets shorter
  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re = new RegExp('\\b'+node.tagName+'\\b','i');

    // Only removes nodes where textContent is '', but could extend
    // to remove any node where textContent matches \s*
    if (!re.test(allowedEmpty) && getText(node) == '') {
      node.parentNode.removeChild(node);

      // node i was removed, so back up
      --i;
    }
  }
}
 
Bjoern Hoehrmann

* RobG wrote in comp.lang.javascript:
Have you considered going down the DOM and remove any in-line element
whose textContent or innerText is empty? That way you don't have to
go down nested empty nodes, they will be removed as soon as you reach
the highest ancestor.

But then you'll remove e.g. <span><img/></span>.
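
A small sketch of that pitfall, again on a hand-built tree of plain objects so it runs outside a browser. The helpers `el`, `text`, `textOf`, `containsKeptElement`, and `safeToRemove`, and the short `keepEvenIfEmpty` list, are hypothetical names chosen for illustration. `textOf` imitates what textContent would return for an element.

```javascript
// Hypothetical stand-ins for DOM nodes.
function el(name, children) {
  return { nodeType: 1, nodeName: name, childNodes: children || [] };
}
function text(data) {
  return { nodeType: 3, nodeName: '#text', data: data, childNodes: [] };
}

// Concatenated text of all descendant text nodes, like textContent.
function textOf(node) {
  if (node.nodeType === 3) return node.data;
  return node.childNodes.map(function (child) { return textOf(child); }).join('');
}

// Elements that carry meaning even though they contain no text (assumed list).
var keepEvenIfEmpty = { img: 1, br: 1, hr: 1, input: 1 };

// True if any descendant element is one that must be kept.
function containsKeptElement(node) {
  return node.childNodes.some(function (child) {
    return child.nodeType === 1 &&
      (keepEvenIfEmpty.hasOwnProperty(child.nodeName) || containsKeptElement(child));
  });
}

// A text-only test is not enough; also check for kept descendants.
function safeToRemove(node) {
  return /^\s*$/.test(textOf(node)) && !containsKeptElement(node);
}

var spanWithImg = el('span', [el('img')]);
var spanWithEm = el('span', [text(' '), el('em')]);

var imgSpanText = textOf(spanWithImg);             // no text at all
var imgSpanRemovable = safeToRemove(spanWithImg);  // but must be kept
var emSpanRemovable = safeToRemove(spanWithEm);    // whitespace and an empty em only
```

Both spans have blank text, so a test on text alone cannot tell them apart; only the extra descendant check does.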
 
David

I tried it and it does work, but it leaves in the <p></p> in the page in
this scenario...

<p>
<span><em></em></span>
<span><em></em></span>
<span><em></em></span>
</p>

David
 
RobG

* RobG wrote in comp.lang.javascript:


But then you'll remove e.g. <span><img/></span>.

Ooops. That can be fixed by going over the child nodes to see if any
contain "allowed to be empty" nodes - and hence losing its appeal. I
think Lasse's recursive DOM walk is best, as it also allows empty
#text nodes to be removed along the way.

FWIW, here's the fixed function (also removes nodes where the content
is only whitespace):

function removeEmptyNodes() {
  var node, nodes = document.getElementsByTagName('*');
  var kids, skip = false;
  var allowedEmpty = 'base basefont body br col hr html img '
                   + 'input isindex link meta param title';
  var re0 = /^\s*$/;
  var re1, re2;

  for (var i=0; i<nodes.length; i++) {
    node = nodes[i];
    re1 = new RegExp('\\b'+node.tagName+'\\b','i');

    if (!re1.test(allowedEmpty) && re0.test(getText(node))) {
      kids = node.getElementsByTagName('*');

      // Keep the node if any descendant is allowed to be empty
      for (var j=0, jlen=kids.length; j<jlen; j++) {
        re2 = new RegExp('\\b'+kids[j].tagName+'\\b','i');

        if (re2.test(allowedEmpty)) {
          skip = true;
          break;
        }
      }

      if (!skip) {
        node.parentNode.removeChild(node);
        // Live collection shrank, so back up
        --i;
      }
      skip = false;
    }
  }
}
 
David

Yep, that works as well. I really appreciate your help on this.

David
 
Henry

I am working on a script that is theoretically simple but I
cannot get it to work completely. I am dealing with a page
spit out by .NET that leaves empty tags in the markup.

No matter how bad .NET may be, it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET has been instructed to put them there. So the obvious
solution is to fix the server-side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).
I need a javascript solution to go behind and do a clean
up after the page loads.

That would be the worst possible approach to the problem.
 
David

Henry said:
No matter how bad .NET may be it is not so bad that it would be
randomly inserting mark-up into its output. If there are empty
elements in the mark-up then it is almost certain that they are there
because .NET had been instructed to put them there. So the obvious
solution is fix the server side code so that it does not output
anything but what you want it to output (i.e. take control of what you
are doing).


That would be the worst possible approach to the problem.

Henry,

I completely agree with you, and I told our developers and the powers
that be just this, but I do not make the decisions and have to deal with them.

David
 
Michael Wojcik

Lasse said:
You need to ensure that you remove children before you test whether
parents are empty.

Yes, though that doesn't necessarily mean a single depth-first
traversal. An alternative is to iterate over the tree in arbitrary
order removing empty elements, repeating that process until no changes
are made.

That's less efficient than a single depth-first traversal (asymptotic
complexity of O((M+1)*(N-M/2)), where N is the number of nodes and M
the number of empty elements), but the performance difference is
likely to be negligible in typical cases.

My personal preference would probably be a recursive depth-first
traversal like the one you propose, or perhaps an iterative one if
performance was a concern. Some people might find the nested loop more
to their liking, though.
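
A minimal sketch of the repeat-until-stable alternative, assuming a hand-built tree of plain objects in place of a real DOM so that it runs anywhere. The names `el`, `sweepOnce`, and `sweepUntilStable` are hypothetical. Each sweep here visits parents before children on purpose, to show that nested empties need more than one pass.

```javascript
// Hypothetical stand-in for DOM elements.
function el(name, children) {
  return { nodeType: 1, nodeName: name, childNodes: children || [] };
}

// One pass in document order: remove only elements that are childless now.
// Returns how many elements this pass removed.
function sweepOnce(node, isRemovable) {
  var removed = 0;
  // Iterate over a copy, because the child list is modified inside the loop.
  node.childNodes.slice().forEach(function (child) {
    if (child.nodeType !== 1) return;
    if (child.childNodes.length === 0 && isRemovable(child.nodeName)) {
      node.childNodes.splice(node.childNodes.indexOf(child), 1);
      removed++;
    } else {
      removed += sweepOnce(child, isRemovable);
    }
  });
  return removed;
}

// Repeat sweeps until one of them changes nothing; returns the sweep count.
function sweepUntilStable(root, isRemovable) {
  var passes = 0;
  var removedInPass;
  do {
    removedInPass = sweepOnce(root, isRemovable);
    passes++;
  } while (removedInPass > 0);
  return passes;
}

function isRemovable(name) {
  return /^(span|p|em)$/i.test(name);
}

// <body><p><span><em></em></span></p></body>
var body = el('body', [el('p', [el('span', [el('em')])])]);
var passes = sweepUntilStable(body, isRemovable);
```

With three levels of nested empties, this takes three sweeps that each remove one element, plus a final sweep that confirms nothing is left to remove.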
 
