Flat HTML headers to nested XML sections

Discussion in 'XML' started by CrazyAtlantaGuy, May 16, 2007.

  1. I am working on an XSLT stylesheet that transforms HTML into an XML
    format that can be imported into FrameMaker. The challenge, it turns
    out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
    etc.) into nested sections inside the XML. I have made significant
    progress, but have run into a roadblock.

    Here is an example of my input HTML:

    <html><body>
    <p>abc abc</p>
    <h1 class='header'>A</h1>
    <p>A abc abc</p>
    <h2 class='header'>B</h2>
    <p>B abc abc</p>
    <h3 class='header'>C</h3>
    <p>C abc abc</p>
    <h2 class='header'>D</h2>
    <p>D abc abc</p> <!-- this is missing in the output -->
    <h1 class='header'>E</h1>
    <p>E abc abc</p>
    </body></html>

    Here is an example of the output. You'll notice that the <h2>D</h2>
    section is missing.

    <?xml version="1.0" encoding="UTF-8"?>
    <article>
    <title/>
    <para>abc abc</para>
    <section depth="1" id="A">
    <title>A</title>
    <para>A abc abc</para>
    <section depth="2" id="B">
    <title>B</title>
    <para>B abc abc</para>
    <section depth="3" id="C">
    <title>C</title>
    <para>C abc abc</para>
    </section>
    </section>
    </section>
    <section depth="1" id="E">
    <title>E</title>
    <para>E abc abc</para>
    </section>
    </article>

    The problem is that my code currently applies templates only to the
    nodes following a header whose nearest preceding header is that same
    header. As a result, a header that follows a deeper one (like an <h2>
    following an <h3>) is never picked up by any section, so it and its
    content don't get shown. What I don't understand is how to fix it.
    Any help would be much appreciated. I'm not really an XSL guru, so
    I'm doing the best I can to get through this.

    Here is the relevant code from my xsl:

    <xsl:template match="body">
    <article>
    <title>
    <xsl:value-of select="$docTitle" />
    </title>

    <xsl:for-each select='child::*[not(preceding-sibling::*[@class="header"])][not(@class="header")]'>
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    <xsl:variable name='depth'
    select='substring(name(child::*[@class="header"][1]),2)'/>
    <xsl:for-each select='child::*[@class="header"][substring(name(),2)&lt;=$depth]'>
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    </article>
    </xsl:template>

    <xsl:template match="h1 | h2 | h3 | h4 | h5">
    <xsl:call-template name="header">
    <xsl:with-param name="depth" select="substring(name(),2)"/>
    </xsl:call-template>
    </xsl:template>

    <xsl:template name="header">
    <xsl:param name="depth"/>
    <section>
    <xsl:attribute name="depth">
    <xsl:value-of select="$depth"/>
    </xsl:attribute>

    <xsl:attribute name="id">
    <xsl:value-of select="translate(.,' ','')" />
    </xsl:attribute>
    <title><xsl:value-of select="."/></title>

    <xsl:variable name='thisHeader' select='generate-id(.)'/>
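    <!-- preceding-sibling is a reverse axis, so [1] selects the nearest
         preceding header; [last()] would select the farthest one. -->
    <xsl:for-each select='following-sibling::*[$thisHeader=generate-id(preceding-sibling::*[@class="header"][1])]
    [not(@class="header") or (@class="header" and substring(name(),2)>=$depth)]'>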
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    </section>

    </xsl:template>
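
    One possible way around the roadblock described above, offered as a
    minimal XSLT 1.0 sketch rather than a definitive fix: let each header
    select its own direct sub-headers, instead of relying only on the
    nearest preceding header. It assumes that the input elements are in no
    namespace, that heading levels never skip (an <h3> is always somewhere
    under an <h2>), and that top-level sections start at <h1>. The "p"
    template and the omission of the article title ($docTitle is not
    defined here) are simplifications for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <xsl:apply-templates select="html/body"/>
  </xsl:template>

  <xsl:template match="body">
    <article>
      <!-- Content that appears before the first header. -->
      <xsl:apply-templates
          select="*[not(@class='header')]
                   [not(preceding-sibling::*[@class='header'])]"/>
      <!-- Assumption: top-level sections are the h1 headers. -->
      <xsl:apply-templates
          select="*[@class='header'][substring(name(), 2) = 1]"/>
    </article>
  </xsl:template>

  <xsl:template match="h1 | h2 | h3 | h4 | h5">
    <xsl:variable name="depth" select="number(substring(name(), 2))"/>
    <xsl:variable name="id" select="generate-id(.)"/>
    <section depth="{$depth}" id="{translate(., ' ', '')}">
      <title><xsl:value-of select="."/></title>

      <!-- Own content: non-header siblings whose nearest preceding
           header ([1] on the reverse axis) is this header. -->
      <xsl:apply-templates
          select="following-sibling::*[not(@class='header')]
                   [generate-id(preceding-sibling::*[@class='header'][1])
                    = $id]"/>

      <!-- Direct subsections: headers exactly one level deeper whose
           nearest preceding header at this level or shallower is this
           header. Deeper headers in between are ignored, so an h2 that
           follows an h3 is still found by its parent h1. -->
      <xsl:apply-templates
          select="following-sibling::*[@class='header']
                   [substring(name(), 2) = $depth + 1]
                   [generate-id(preceding-sibling::*[@class='header']
                                [substring(name(), 2) &lt;= $depth][1])
                    = $id]"/>
    </section>
  </xsl:template>

  <!-- Illustrative mapping only; the real stylesheet has its own. -->
  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>
</xsl:stylesheet>
```

    With the sample input above, the <h2>D</h2> header should come out as
    a second depth-2 section inside section A, after section B, because A
    now selects every <h2> up to the next <h1> rather than only the one
    immediately following it.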
     
    CrazyAtlantaGuy, May 16, 2007
    #1

  2. Peter Flynn Guest

    This is called encapsulation, and there's a much neater way than writing
    XSLT to try and reach-forward-down-the-tree-up-to-but-not-including the
    next H1/H2/H3/etc.

    1. Run Tidy to make the HTML into well-formed XHTML (tidy -nc -asxml)

    2. Write a short script to turn the XHTML back into valid SGML
    (remove NETs, namespaces)

    3. Apply a DocType Declaration for the ISO 15445 HTML DTD, which
    includes a DIV1/DIV2 containment structure, in "preparation" mode
    (declare % Preparation as INCLUDE in the internal subset and use
    pre-html as the declared root element type)

    4. Run osgmlnorm to normalize the document: this adds the missing
    markup, switches single quotes to double where possible, etc

    <!doctype pre-html
    public "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN" [
    <!entity % Preparation "include" >
    ]>
    <PRE-HTML>
    <HEAD>
    <META CONTENT="HTML Tidy for Linux/x86 (vers 1 September 2005), see
    www.w3.org" NAME="GENERATOR">
    <TITLE></TITLE>
    </HEAD>
    <BODY>
    <P>abc abc</P>
    <H1 CLASS="header">A</H1>
    <DIV1>
    <P>A abc abc</P>
    <H2 CLASS="header">B</H2>
    <DIV2>
    <P>B abc abc</P>
    <H3 CLASS="header">C</H3>
    <DIV3>
    <P>C abc abc</P>
    </DIV3>
    </DIV2>
    <H2 CLASS="header">D</H2>
    <DIV2>
    <P>D abc abc</P>
    </DIV2>
    </DIV1>
    <H1 CLASS="header">E</H1>
    <DIV1>
    <P>E abc abc</P>
    </DIV1>
    </BODY>
    </PRE-HTML>

    You can easily mess with the Preparation structure in the DTD if you
    don't like the way they did it (I don't).

    ///Peter
     
    Peter Flynn, May 16, 2007
    #2

  3. You could try adapting something from the XSLT FAQ. Likely candidates
    would be
    http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
    or
    http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051

    Some of the other examples on that page may also be adaptable to this
    question.

    (It's always worth checking Dave's page; he has done an excellent job of
    collecting useful answers from XSL-List, which is unofficial but has
    been in existence since before XSL was a Recommendation and has had
    participation by a lot of XSL's architects and implementers. I still try
    to keep half an eye on that list, though I must admit I don't watch it
    as closely as I should.)
     
    Joe Kesselman, May 17, 2007
    #3
  4. Thanks for the help!
     
    CrazyAtlantaGuy, May 22, 2007
    #4