Flat HTML headers to nested XML sections

CrazyAtlantaGuy

I am working on an XSLT stylesheet that transforms HTML into an XML
format that can be imported into FrameMaker. The challenge, it turns
out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
etc.) into nested sections inside the XML. I have made significant
progress, but have run into a roadblock.

Here is an example of my input HTML:

<html><body>
<p>abc abc</p>
<h1 class='header'>A</h1>
<p>A abc abc</p>
<h2 class='header'>B</h2>
<p>B abc abc</p>
<h3 class='header'>C</h3>
<p>C abc abc</p>
<h2 class='header'>D</h2> <!-- this is missing in the output -->
<p>D abc abc</p>
<h1 class='header'>E</h1>
<p>E abc abc</p>
</body></html>

Here is an example of the output. Notice that the <h2>D</h2> section
is missing.

<?xml version="1.0" encoding="UTF-8"?>
<article>
<title/>
<para>abc abc</para>
<section depth="1" id="A">
<title>A</title>
<para>A abc abc</para>
<section depth="2" id="B">
<title>B</title>
<para>B abc abc</para>
<section depth="3" id="C">
<title>C</title>
<para>C abc abc</para>
</section>
</section>
</section>
<section depth="1" id="E">
<title>E</title>
<para>E abc abc</para>
</section>

The problem is that my code currently applies templates to all
nodes following a header whose nearest preceding header is that same
header. For this reason, when content follows a header that isn't
its own header (like an <h2> following an <h3>), it doesn't get shown.
What I don't understand is how to fix it. Any help would be much
appreciated. I'm not really an XSL guru, so I'm doing the best I can
to get through this.

Here is the relevant code from my xsl:

<xsl:template match="body">
<article>
<title>
<xsl:value-of select="$docTitle" />
</title>

<xsl:for-each select='child::*[not(preceding-sibling::*[@class="header"])]
[not(@class="header")]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

<xsl:variable name='depth'
select='substring(name(child::*[@class="header"][1]),2)'/>
<xsl:for-each select='child::*[@class="header"]
[substring(name(),2)&lt;=$depth]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</article>
</xsl:template>

<xsl:template match="h1 | h2 | h3 | h4 | h5">
<xsl:call-template name="header">
<xsl:with-param name="depth" select="substring(name(),2)"/>
</xsl:call-template>
</xsl:template>

<xsl:template name="header">
<xsl:param name="depth"/>
<section>
<xsl:attribute name="depth">
<xsl:value-of select="$depth"/>
</xsl:attribute>

<xsl:attribute name="id">
<xsl:value-of select="translate(.,' ','')" />
</xsl:attribute>
<title><xsl:value-of select="."/></title>

<xsl:variable name='thisHeader' select='generate-id(.)'/>
<xsl:for-each select='following-sibling::*
[$thisHeader=generate-id(preceding-sibling::*[@class="header"][last()])]
[not(@class="header") or (@class="header" and substring(name(),2)>=$depth)]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</section>

</xsl:template>
 
Peter Flynn

CrazyAtlantaGuy said:
I am working on an XSLT stylesheet that transforms HTML into an XML
format that can be imported into FrameMaker. The challenge, it turns
out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
etc.) into nested sections inside the XML.

This is called encapsulation, and there's a much neater way than writing
XSLT to try and reach-forward-down-the-tree-up-to-but-not-including the
next H1/H2/H3/etc.

1. Run Tidy to make the HTML into well-formed XHTML (tidy -nc -asxml)

2. Write a short script to turn the XHTML back into valid SGML
(remove NETs, namespaces)

3. Apply a DocType Declaration for the ISO 15445 HTML DTD, which
includes a DIV1/DIV2 containment structure, in "preparation" mode
(declare % Preparation as INCLUDE in the internal subset and use
pre-html as the declared root element type)

4. Run osgmlnorm to normalize the document: this adds the missing
markup, switches single quotes to double where possible, etc

<!doctype pre-html
public "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN" [
<!entity % Preparation "include" >
]>
<PRE-HTML>
<HEAD>
<META CONTENT="HTML Tidy for Linux/x86 (vers 1 September 2005), see
www.w3.org" NAME="GENERATOR">
<TITLE></TITLE>
</HEAD>
<BODY>
<P>abc abc</P>
<H1 CLASS="header">A</H1>
<DIV1>
<P>A abc abc</P>
<H2 CLASS="header">B</H2>
<DIV2>
<P>B abc abc</P>
<H3 CLASS="header">C</H3>
<DIV3>
<P>C abc abc</P>
</DIV3>
</DIV2>
<H2 CLASS="header">D</H2>
<DIV2>
<P>D abc abc</P>
</DIV2>
</DIV1>
<H1 CLASS="header">E</H1>
<DIV1>
<P>E abc abc</P>
</DIV1>
</BODY>
</PRE-HTML>

You can easily mess with the Preparation structure in the DTD if you
don't like the way they did it (I don't).
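
Step 2 does not necessarily need a hand-written script. As a rough sketch, assuming an XSLT 1.0 processor is available and that Tidy's XHTML output is the input, a stylesheet along these lines drops the namespaces and, through the html output method, writes empty elements as <br> rather than <br />. Whether the result is acceptable to the ISO 15445 DTD as-is is an assumption that would need checking against the real document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The html output method serializes empty elements such as br
       without the XML-style "/>" ending. -->
  <xsl:output method="html" indent="no"/>

  <!-- Rebuild every element under its local name, in no namespace. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Rebuild attributes by local name as well, so a prefixed
       attribute like xml:lang comes out as plain lang. -->
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- Text, comments and processing instructions pass through. -->
  <xsl:template match="text() | comment() | processing-instruction()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The DocType Declaration from step 3 would still have to be added separately before running osgmlnorm.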

///Peter
 
Joe Kesselman

You could try adapting something from the XSLT FAQ. Likely candidates
would be
http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
or
http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051

Some of the other examples on that page may also be adaptable to this
question.

(It's always worth checking Dave's page; he has done an excellent job of
collecting useful answers from XSL-List, which is unofficial but has
been in existence since before XSL was a Recommendation and has had
participation by a lot of XSL's architects and implementers. I still try
to keep half an eye on that list, though I must admit I don't watch it
as closely as I should.)
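
For anyone who wants to stay in pure XSLT 1.0, here is a minimal sketch of one way to do this kind of grouping. It is not taken from the FAQ entries above. It assumes every heading carries class="header" as in the sample input, that the input elements are in no namespace, that a p element becomes para, and that $docTitle is a stylesheet parameter (the default shown is a placeholder). Each header becomes the child of the nearest preceding header with a smaller level number, which also copes with skipped levels such as an h3 directly under an h1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Assumed default; the original stylesheet sets $docTitle elsewhere. -->
  <xsl:param name="docTitle" select="''"/>

  <xsl:template match="/">
    <xsl:apply-templates select="html/body"/>
  </xsl:template>

  <xsl:template match="body">
    <article>
      <title><xsl:value-of select="$docTitle"/></title>
      <!-- Content that comes before the first header. -->
      <xsl:apply-templates
          select="*[not(@class='header')]
                  [not(preceding-sibling::*[@class='header'])]"/>
      <!-- Top-level sections: headers with no shallower header before them. -->
      <xsl:for-each select="*[@class='header']">
        <xsl:variable name="level" select="number(substring(name(), 2))"/>
        <xsl:if test="not(preceding-sibling::*[@class='header']
                      [number(substring(name(), 2)) &lt; $level])">
          <xsl:apply-templates select="." mode="section"/>
        </xsl:if>
      </xsl:for-each>
    </article>
  </xsl:template>

  <!-- Builds one section from a header element (h1, h2, ...). -->
  <xsl:template match="*" mode="section">
    <xsl:variable name="id" select="generate-id()"/>
    <xsl:variable name="level" select="number(substring(name(), 2))"/>
    <section depth="{$level}" id="{translate(., ' ', '')}">
      <title><xsl:value-of select="."/></title>
      <!-- Own content: non-headers whose nearest preceding header is this one. -->
      <xsl:apply-templates
          select="following-sibling::*[not(@class='header')]
                  [generate-id(preceding-sibling::*[@class='header'][1]) = $id]"/>
      <!-- Child sections: deeper headers whose nearest shallower
           preceding header is this one. -->
      <xsl:for-each
          select="following-sibling::*[@class='header']
                  [number(substring(name(), 2)) &gt; $level]">
        <xsl:variable name="childLevel"
            select="number(substring(name(), 2))"/>
        <xsl:if test="generate-id(preceding-sibling::*[@class='header']
                      [number(substring(name(), 2)) &lt; $childLevel][1]) = $id">
          <xsl:apply-templates select="." mode="section"/>
        </xsl:if>
      </xsl:for-each>
    </section>
  </xsl:template>

  <!-- Assumed mapping for paragraphs. -->
  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

On a reverse axis such as preceding-sibling, [1] selects the nearest sibling and [last()] the farthest, so the sketch uses [1] to find the closest header. The sibling scans make this quadratic in the number of body children, which should be fine for ordinary documents; a key-based variant would be the usual next step for large inputs.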
 
CrazyAtlantaGuy

Thanks for the help!