Determining document structure

Jeremy

Does anyone have a clever algorithm for generating an outline of the
current document from (client-side) javascript using DOM methods?

For example, let's say I predictably have a document structured
hierarchically with <h1>...<h6> tags. I want to generate an outline of
the document wherein I have nested lists of the contents of the headers.
Take for example the following snippet of a fictional legal document:

------------
<h1>Main Title</h1>
<h2>Section One</h2>
<h3>Paragraph A</h3>
<p>Congress shall make no law regarding the production of baggy clown
pants.</p>
<h3>Paragraph B</h3>
<p>Congress shall make no law restricting the use of said clown pants,
for any purpose otherwise legal.</p>

<h2>Section Two</h2>
<h3>Paragraph A</h3>
<p>Etc, etc.</p>
----------

From this I would want to generate
------
<ul>
<li>Main Title
<ul>
<li>Section One
<ul>
<li>Paragraph A</li>
<li>Paragraph B</li>
</ul>
</li>
<li>Section Two
<ul>
<li>Paragraph A</li>
</ul>
</li>
</ul>
</li>
</ul>
------

using DOM methods. I can think of two ways to do this:
1) Scan through a flat list of all content nodes in succession
[document.getElementsByTagName("*")] and keep a structure of any <hX>
tags that are encountered by attaching any <hN> tag to the most recent
<hN-1> tag.

or

2) Get a list of each header level [document.getElementsByTagName("h1"),
document.getElementsByTagName("h2"), etc...] and somehow merge them and
sort them by order of appearance in the document. Then use that list to
generate the structure.

Method (2) seems more efficient but also more complicated. Before I
start on this, I wanted to see if anyone has done this before or has a
clever algorithm that I haven't thought of.
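
To make method (1) concrete, here is a minimal sketch of the "attach each <hN> to the most recent shallower heading" idea, using a stack. The input shape (an array of { level, text } objects) and the names buildOutline, escapeHtml and renderOutline are assumptions made for illustration; they do not come from any library.

```javascript
// Build a nested outline from headings given in document order.
// Assumed input shape: [{ level: 1, text: 'Main Title' }, ...] with level >= 1.
function buildOutline(headings) {
  var root = { level: 0, text: null, children: [] };
  var stack = [root]; // the top of the stack is the most recently opened heading
  for (var i = 0; i < headings.length; i++) {
    var node = { level: headings[i].level, text: headings[i].text, children: [] };
    // Drop every open heading at the same or a deeper level.
    while (stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    // What remains on top is the nearest shallower heading: the parent.
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Escape text so that it can be placed safely inside markup.
function escapeHtml(s) {
  return String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Render the outline as nested <ul> markup; an empty list renders as ''.
function renderOutline(nodes) {
  if (nodes.length === 0) { return ''; }
  var html = '<ul>';
  for (var i = 0; i < nodes.length; i++) {
    html += '<li>' + escapeHtml(nodes[i].text) + renderOutline(nodes[i].children) + '</li>';
  }
  return html + '</ul>';
}

// In a browser that supports querySelectorAll, the input could be gathered
// like this (not executed here); the result is already in document order,
// which would also cover the merge-and-sort step of method (2):
//   var els = document.querySelectorAll('h1,h2,h3,h4,h5,h6');
//   var headings = Array.prototype.map.call(els, function (el) {
//     return { level: Number(el.tagName.charAt(1)), text: el.textContent };
//   });
```

Keeping the tree separate from the rendering means the same structure could be turned into DOM nodes instead of a string.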

Thanks,
Jeremy
 
pcx99

<script language="JavaScript" type="text/javascript">
<!--

var _xmlStr = '';

function crawlXML(doc) {                                // Crawls an XML document
  if (doc.hasChildNodes()) {                            // If present element has children
    _xmlStr += '<ul><li>&lt;' + doc.nodeName + '&gt; '; // Display current tag name
    for (var i = 0; i < doc.childNodes.length; i++) {   // For each child node on current level
      crawlXML(doc.childNodes[i]);                      // Call this function recursively
    }                                                   // End for loop
    _xmlStr += '<\/li><\/ul>';                          // Close the list item
  } else {                                              // Current element has no children
    _xmlStr += doc.nodeValue || '';                     // So display the value of the data, if any
  }                                                     // End childNode check
}                                                       // End crawlXML

function startup() {
  crawlXML(document);
  document.getElementById('outputdiv').innerHTML = _xmlStr;
}

//-->
</script>
<body onload="startup()">
<div id='outputdiv'></div>
</body>
 
Jeff North

| Does anyone have a clever algorithm for generating an outline of the
| current document from (client-side) javascript using DOM methods?

I found a couple of relevant entries using Google:

http://www.phpied.com/suddenly-structured-articles/
http://www.bazon.net/mishoo/toc.epl
 
p.lepin

| Does anyone have a clever algorithm for generating an
| outline of the current document from (client-side)
| javascript using DOM methods?

Note that there's a DSL for that type of processing. The
following is probably of mostly theoretical interest at
the moment, since JavaScript XSLT API support across
browsers seems to be patchy AFAICT, and we are probably
not going to see any kind of XSLT2 support for a few years,
but still, the fact that you can use XSLT to juggle your
nodes instead of crawling through them using DOM API is
something to keep in mind.

| For example, let's say I predictably have a document
| structured hierarchically with <h1>...<h6> tags. I want
| to generate an outline of the document wherein I have
| nested lists of the contents of the headers. Take for
| example the following snippet of a fictional legal
| document:
|
| <h1>Main Title</h1>
| <h2>Section One</h2>
| <h3>Paragraph A</h3>
| <p>Congress shall make no law regarding the production of
| baggy clown pants.</p>
| <h3>Paragraph B</h3>
| <p>Congress shall make no law restricting the use of
| said clown pants, for any purpose otherwise legal.</p>
| <h2>Section Two</h2>
| <h3>Paragraph A</h3>
| <p>Etc, etc.</p>

The following code might seem inefficient for a task this
simple, but if you have to do something more complicated
with your document, using the built-in XSLT processor
becomes much more appealing:

<!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title></title>
<script type="text/javascript">
function genToc ( )
{
var xformSrc =
' <xsl:stylesheet version="1.0" ' +
' xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> ' +
' <xsl:output method="html"/> ' +
' <xsl:template match="@*|node()"> ' +
' <xsl:apply-templates ' +
' select="@*|node()"/> ' +
' </' + 'xsl:template> ' +
' <xsl:template match="body"> ' +
' <ul> ' +
' <xsl:apply-templates ' +
' select=".//h1"/> ' +
' </' + 'ul> ' +
' </' + 'xsl:template> ' +
' <xsl:template match="h1|h2|h3|h4|h5|h6"> ' +
' <xsl:variable name="curName" ' +
' select="local-name()"/> ' +
' <xsl:variable name="subName" ' +
' select= ' +
' "concat(\'H\',' +
' 1+number(substring-after(' +
' local-name(),\'H\')))"/> ' +
' <li> ' +
' <xsl:apply-templates select="text()" ' +
' mode="yank-text"/> ' +
' <ul> ' +
' <xsl:apply-templates ' +
' select= ' +
' "following::*[(preceding::*' +
' [local-name()=$curName][1])' +
' [generate-id(.)=generate-id(' +
' current())]][local-name()=' +
' $subName]"/> ' +
' </' + 'ul> ' +
' </' + 'li> ' +
' </' + 'xsl:template> ' +
' <xsl:template match="text()" ' +
' mode="yank-text"> ' +
' <xsl:value-of select="."/> ' +
' </' + 'xsl:template> ' +
' </' + 'xsl:stylesheet> ' ;
var parser = new DOMParser ( ) ;
var xform =
parser . parseFromString
(
xformSrc , 'text/xml'
) ;
var proc = new XSLTProcessor ( ) ;
proc . importStylesheet ( xform ) ;
var toc =
proc . transformToFragment
(
document , document
) ;
var tocDiv = document . getElementById ( 'toc' ) ;
tocDiv . appendChild ( toc ) ;
}
</script>
</head>
<body onload=" genToc ( ) ; ">
<div id="toc"></div>
<h1>Main Title</h1>
<h2>Section One</h2>
<h3>Paragraph A</h3>
<p>Congress shall make no law regarding the production of
baggy clown pants.</p>
<h3>Paragraph B</h3>
<p>Congress shall make no law restricting the use of said
clown pants, for any purpose otherwise legal.</p>
<h2>Section Two</h2>
<h3>Paragraph A</h3>
<p>Etc, etc.</p>
</body>
</html>

This only works in Gecko-based UAs (tested in Firefox 1.5.0.7).
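
Since support is limited, a page would probably want to check for the two constructors used above before calling genToc. A minimal sketch follows; the helper name supportsClientXslt is made up for illustration, and the scope object is passed in as a parameter so that the check can also be exercised outside a browser.

```javascript
// Returns true only when both constructors used by genToc exist on `scope`.
// In a browser, `scope` would be `window`.
function supportsClientXslt(scope) {
  return typeof scope.DOMParser !== 'undefined' &&
         typeof scope.XSLTProcessor !== 'undefined';
}

// Possible use in the page above (not executed here):
//   if (supportsClientXslt(window)) { genToc(); }
//   else { /* fall back to a plain DOM walk */ }
```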
 
RobG

Jeremy said:
Does anyone have a clever algorithm for generating an outline of the
current document from (client-side) javascript using DOM methods?

Just wander down the DOM and create a series of nested ULs with the
heading text in LIs. It's a tad easier using a bit of innerHTML, but not
much. Here's a pure DOM method:

<title>Outline</title>
<script type="text/javascript">

var genTOC = (function () {

  var tocHTML = '';
  var level = 1;
  var tagRE = /^h\d+/;
  var toc = genNode('ul');
  var currentEl = toc;

  // Return the text of an element, with fallbacks for older browsers
  function getText (el) {
    if (el.textContent) {return el.textContent;}
    if (el.innerText) {return el.innerText;}
    if (typeof el.innerHTML == 'string') {
      return el.innerHTML.replace(/<[^<>]+>/g,'');
    }
  }
  function genNode(t) { return document.createElement(t);}
  function genText(s) { return document.createTextNode(s);}

  // Walk up from el to its nearest ancestor with tag name t
  function previous(el, t){
    el = el.parentNode;
    while(el.tagName.toLowerCase() != t) {
      el = el.parentNode;
    }
    return el;
  }

  return {
    start: function (tocEl, startEl) {
      if (!document.getElementById)
        return;
      if (typeof tocEl == 'string')
        tocEl = document.getElementById(tocEl);
      if (typeof startEl == 'string')
        startEl = document.getElementById(startEl);

      startEl = startEl || document.body;
      this.run(startEl);
      tocEl.appendChild(toc);
    },

    run: function(el){
      var kid, kids = el.childNodes;
      var t, thisLevel;

      for (var i=0, len=kids.length; i<len; i++) {
        kid = kids[i];

        if (kid.tagName && tagRE.test(kid.tagName.toLowerCase())) {
          // Heading level as a number, e.g. 2 for H2
          thisLevel = parseInt(kid.tagName.substring(1), 10);
          if (thisLevel > level) {
            // Deeper heading: open a nested list
            currentEl.appendChild(genNode('ul'));
            currentEl = currentEl.lastChild;
            level++;
          } else {
            // Shallower heading: climb back up to the matching list
            while (thisLevel < level) {
              currentEl = previous(currentEl, 'ul');
              level--;
            }
          }
          t = genNode('li');
          t.appendChild(genText(getText(kid)));
          currentEl.appendChild(t);
        }
        if (kid.childNodes) {this.run(kid);}
      }
    }
  };
})();

window.onload = function(){genTOC.start('tocDiv');};
</script>

<body>
<div id="tocDiv"></div>
<div>
<h1>Heading 1</h1>
<p>Lorem Ipsum</p>
<h2>Heading 1.1</h2>
<p>Lorem Ipsum</p>
<h2>Heading 1.2</h2>
<p>Lorem Ipsum</p>
<h3>Heading 1.2.1</h3>
<p>Lorem Ipsum</p>
<h3>Heading 1.2.2</h3>
<p>Lorem Ipsum</p>
<h3>Heading 1.2.3</h3>
<p>Lorem Ipsum</p>
<h2>Heading 1.3</h2>
<div>
<p>Lorem Ipsum</p>
<h3>Heading 1.3.1</h3>
<p>Lorem Ipsum</p>
<h3>Heading 1.3.2</h3>
<p>Lorem Ipsum</p>
</div>
<div>
<h1>Heading 2</h1>
<p>Lorem Ipsum</p>
<h2>Heading 2.1</h2>
<p>Lorem Ipsum</p>
<h3>Heading 2.1.1</h3>
<p>Lorem Ipsum</p>
<h3>Heading 2.1.2</h3>
</div>
</div>
</body>


Lightly tested.
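
The walk can also be separated from the list building. Below is a minimal sketch that only collects the headings in document order; the function name collectHeadings and the { level, text } result shape are assumptions made for illustration. It reads nothing but tagName, childNodes and textContent, so plain objects can stand in for DOM nodes when trying it outside a browser.

```javascript
// Recursively collect heading elements (h1 to h6) in document order.
// Works on DOM nodes, or on any objects exposing tagName, childNodes
// and textContent.
function collectHeadings(node, found) {
  found = found || [];
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    var kid = kids[i];
    var match = kid.tagName ? /^h([1-6])$/i.exec(kid.tagName) : null;
    if (match) {
      // Record the numeric level and the heading text.
      found.push({ level: parseInt(match[1], 10), text: kid.textContent });
    } else {
      // Not a heading: look inside it (text nodes have no children).
      collectHeadings(kid, found);
    }
  }
  return found;
}

// In a browser (not executed here):
//   var headings = collectHeadings(document.body);
```

Because the recursion visits children in order, headings nested inside wrapper elements such as the div elements in the test page still come out in the order they appear in the document.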
 
Jeremy

RobG said:
Jeremy wrote:

Just wander down the DOM and create a series of nested ULs with the
heading text in LIs. It's a tad easier using a bit of innerHTML, but not
much. Here's a pure DOM method:

<code snipped>

Thanks to everyone who replied to this question. I've looked at all the
links and ideas you suggested and came up with an implementation I'm
pretty happy with.

Jeremy
 
