Flat HTML headers to nested XML sections

Discussion in 'XML' started by CrazyAtlantaGuy, May 16, 2007.

  1. I am working on an XSLT stylesheet that transforms HTML into an XML
    format that can be imported into FrameMaker. The challenge, it turns
    out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
    etc.) into nested sections inside the XML. I have made significant
    progress, but have run into a roadblock.

    Here is an example of my input HTML:

    <html><body>
    <p>abc abc</p>
    <h1 class='header'>A</h1>
    <p>A abc abc</p>
    <h2 class='header'>B</h2>
    <p>B abc abc</p>
    <h3 class='header'>C</h3>
    <p>Cabc abc</p>
    <h2 class='header'>D</h2> <!-- this is missing in the output -->
    <p>D abc abc</p> <!-- this is missing in the output -->
    <h1 class='header'>E</h1>
    <p>E abc abc</p>
    </body></html>

    Here is an example of the output. Notice that the <h2>D</h2> section
    and its paragraph are missing.

    <?xml version="1.0" encoding="UTF-8"?>
    <article>
    <title/>
    <para>abc abc</para>
    <section depth="1" id="A">
    <title>A</title>
    <para>A abc abc</para>
    <section depth="2" id="B">
    <title>B</title>
    <para>B abc abc</para>
    <section depth="3" id="C">
    <title>C</title>
    <para>C abc abc</para>
    </section>
    </section>
    </section>
    <section depth="1" id="E">
    <title>E</title>
    <para>E abc abc</para>
    </section>
    </article>

    The problem is that my code currently applies templates to all
    nodes following a header whose nearest preceding header is that same
    header. As a result, when a header follows a deeper header that is
    not its parent (like an <h2> following an <h3>), it and its content
    are never output. What I don't understand is how to fix it. Any help
    would be much appreciated. I'm not really an XSL guru, so I'm doing
    the best I can to get through this.

    Here is the relevant code from my xsl:

    <xsl:template match="body">
    <article>
    <title>
    <xsl:value-of select="$docTitle" />
    </title>

    <xsl:for-each select='child::*[not(preceding-sibling::*[@class="header"])][not(@class="header")]'>
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    <xsl:variable name='depth'
    select='substring(name(child::*[@class="header"][1]),2)'/>
    <xsl:for-each select='child::*[@class="header"][substring(name(),2)&lt;=$depth]'>
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    </article>
    </xsl:template>

    <xsl:template match="h1 | h2 | h3 | h4 | h5">
    <xsl:call-template name="header">
    <xsl:with-param name="depth" select="substring(name(),2)"/>
    </xsl:call-template>
    </xsl:template>

    <xsl:template name="header">
    <xsl:param name="depth"/>
    <section>
    <xsl:attribute name="depth">
    <xsl:value-of select="$depth"/>
    </xsl:attribute>

    <xsl:attribute name="id">
    <xsl:value-of select="translate(.,' ','')" />
    </xsl:attribute>
    <title><xsl:value-of select="."/></title>

    <xsl:variable name='thisHeader' select='generate-id(.)'/>
    <xsl:for-each select='following-sibling::*[$thisHeader=generate-id(preceding-sibling::*[@class="header"][last()])]
    [not(@class="header") or (@class="header" and substring(name(),2)>=
    $depth)]'>
    <xsl:apply-templates select="."/>
    </xsl:for-each>

    </section>

    </xsl:template>
     
    CrazyAtlantaGuy, May 16, 2007
    #1

  2. Peter Flynn (Guest)

    CrazyAtlantaGuy wrote:
    > I am working on creating an XSLT that transforms Html into an XML
    > format that can be imported into Framemaker. The challenge, it turns
    > out, is correctly transforming the flat html header tags (<H1>, <H2>,
    > etc) into nested sections inside the xml.


    This is called encapsulation, and there's a much neater way than writing
    XSLT to try and reach-forward-down-the-tree-up-to-but-not-including the
    next H1/H2/H3/etc.

    1. Run Tidy to make the HTML into well-formed XHTML (tidy -nc -asxml)

    2. Write a short script to turn the XHTML back into valid SGML
    (remove NETs, namespaces)

    3. Apply a DocType Declaration for the ISO 15445 HTML DTD, which
    includes a DIV1/DIV2 containment structure, in "preparation" mode
    (declare % Preparation as INCLUDE in the internal subset and use
    pre-html as the declared root element type)

    4. Run osgmlnorm to normalize the document: this adds the missing
    markup, switches single quotes to double where possible, etc

    <!doctype pre-html
    public "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN" [
    <!entity % Preparation "include" >
    ]>
    <PRE-HTML>
    <HEAD>
    <META CONTENT="HTML Tidy for Linux/x86 (vers 1 September 2005), see
    www.w3.org" NAME="GENERATOR">
    <TITLE></TITLE>
    </HEAD>
    <BODY>
    <P>abc abc</P>
    <H1 CLASS="header">A</H1>
    <DIV1>
    <P>A abc abc</P>
    <H2 CLASS="header">B</H2>
    <DIV2>
    <P>B abc abc</P>
    <H3 CLASS="header">C</H3>
    <DIV3>
    <P>Cabc abc</P>
    </DIV3>
    </DIV2>
    <H2 CLASS="header">D</H2>
    <DIV2>
    <P>D abc abc</P>
    </DIV2>
    </DIV1>
    <H1 CLASS="header">E</H1>
    <DIV1>
    <P>E abc abc</P>
    </DIV1>
    </BODY>
    </PRE-HTML>

    You can easily mess with the Preparation structure in the DTD if you
    don't like the way they did it (I don't).
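
    Once the document has this DIV containment structure, the XSLT side
    becomes a plain top-down walk. A minimal sketch, under some loud
    assumptions: the normalized SGML has first been converted to
    well-formed XML (OpenSP's osx is one tool for that), the element
    names are still upper case as in the listing above, and only the
    H1-H3/DIV1-DIV3 levels shown there occur. The section, title and
    para element names are borrowed from the original question.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="xml" indent="yes"/>

      <!-- Assumption: the root element is PRE-HTML, as in the listing. -->
      <xsl:template match="/">
        <article>
          <title><xsl:value-of select="PRE-HTML/HEAD/TITLE"/></title>
          <xsl:apply-templates select="PRE-HTML/BODY/*"/>
        </article>
      </xsl:template>

      <!-- Each header opens a section and pulls in the children of the
           DIVn element that immediately follows it. -->
      <xsl:template match="H1 | H2 | H3">
        <section depth="{substring(name(), 2)}" id="{translate(., ' ', '')}">
          <title><xsl:value-of select="."/></title>
          <xsl:apply-templates
              select="following-sibling::*[1][self::DIV1 or self::DIV2 or self::DIV3]/*"/>
        </section>
      </xsl:template>

      <!-- The DIVn wrappers are consumed by their header, so they
           produce nothing when reached directly. -->
      <xsl:template match="DIV1 | DIV2 | DIV3"/>

      <xsl:template match="P">
        <para><xsl:apply-templates/></para>
      </xsl:template>

    </xsl:stylesheet>
    ```

    No sibling-axis bookkeeping is needed here because the nesting is
    already explicit in the markup; the header template only has to look
    one element ahead.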

    ///Peter
     
    Peter Flynn, May 16, 2007
    #2

  3. You could try adapting something from the XSLT FAQ. Likely candidates
    would be
    http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
    or
    http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051

    Some of the other examples on that page may also be adaptable to this
    question.
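
    As a rough sketch of that kind of adaptation (written for this
    thread, not copied from the FAQ), the idea is to treat a header as a
    subsection of the nearest preceding header that is shallower than it,
    rather than of the nearest preceding header of any level. Assumptions:
    the input is well-formed, un-namespaced HTML using class="header" on
    h1-h5 as in the question, docTitle is a stylesheet parameter, and the
    p-to-para template stands in for whatever the real stylesheet does
    with body content.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="xml" indent="yes"/>

      <!-- Assumption: supplied by the caller; empty by default. -->
      <xsl:param name="docTitle"/>

      <xsl:template match="/">
        <xsl:apply-templates select="html/body"/>
      </xsl:template>

      <xsl:template match="body">
        <article>
          <title><xsl:value-of select="$docTitle"/></title>
          <!-- Content before the first header stays at article level. -->
          <xsl:apply-templates
              select="*[not(@class='header')][not(preceding-sibling::*[@class='header'])]"/>
          <!-- A header is top-level when no shallower header precedes it.
               Inside xsl:for-each, current() is the header being tested. -->
          <xsl:for-each select="*[@class='header']">
            <xsl:if test="not(preceding-sibling::*[@class='header']
                          [substring(name(), 2) &lt; substring(name(current()), 2)])">
              <xsl:call-template name="header"/>
            </xsl:if>
          </xsl:for-each>
        </article>
      </xsl:template>

      <xsl:template name="header">
        <xsl:variable name="thisHeader" select="generate-id(.)"/>
        <section depth="{substring(name(), 2)}" id="{translate(., ' ', '')}">
          <title><xsl:value-of select="."/></title>
          <!-- Own content: non-headers whose nearest preceding header is
               this one. On the reverse preceding-sibling axis, [1] is the
               nearest node and [last()] is the farthest. -->
          <xsl:apply-templates
              select="following-sibling::*[not(@class='header')]
                      [generate-id(preceding-sibling::*[@class='header'][1]) = $thisHeader]"/>
          <!-- Subsections: later headers whose nearest shallower
               preceding header is this one. -->
          <xsl:for-each select="following-sibling::*[@class='header']">
            <xsl:if test="generate-id(preceding-sibling::*[@class='header']
                          [substring(name(), 2) &lt; substring(name(current()), 2)][1])
                          = $thisHeader">
              <xsl:call-template name="header"/>
            </xsl:if>
          </xsl:for-each>
        </section>
      </xsl:template>

      <!-- Hypothetical stand-in for the real content templates. -->
      <xsl:template match="p">
        <para><xsl:apply-templates/></para>
      </xsl:template>

    </xsl:stylesheet>
    ```

    With the sample input from the question, the <h2>D</h2> header is
    picked up by A rather than by C, because A is the nearest preceding
    header with a smaller level number, so D becomes a second depth-2
    section inside A. The test is quadratic in the number of headers,
    which should not matter for ordinary documents.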

    (It's always worth checking Dave's page; he has done an excellent job of
    collecting useful answers from XSL-List, which is unofficial but has
    been in existence since before XSL was a Recommendation and has had
    participation by a lot of XSL's architects and implementers. I still try
    to keep half an eye on that list, though I must admit I don't watch it
    as closely as I should.)

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, May 17, 2007
    #3
  4. On May 17, 12:37 am, Joe Kesselman wrote:
    > You could try adapting something from the XSLT FAQ. Likely candidates
    > would be http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
    > or http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051
    >
    > Some of the other examples on that page may also be adaptable to this
    > question.
    >
    > (It's always worth checking Dave's page; he has done an excellent job of
    > collecting useful answers from XSL-List, which is unofficial but has
    > been in existence since before XSL was a Recommendation and has had
    > participation by a lot of XSL's architects and implementers. I still try
    > to keep half an eye on that list, though I must admit I don't watch it
    > as closely as I should.)
    >
    > --
    > () ASCII Ribbon Campaign | Joe Kesselman
    > /\ Stamp out HTML e-mail! | System architexture and kinetic poetry


    Thanks for the help!
     
    CrazyAtlantaGuy, May 22, 2007
    #4
