HTML Parsing Question

Discussion in 'XML' started by Stefan Kleineikenscheidt, Dec 31, 2006.

  1. Hi all,

    I'm trying to convert an HTML page to a hierarchical structure, but I am
    stuck. Consider a page like this:

    <h1>First Heading1</h1>
    <p>some text</p>
    <p>more text</p>

    <h2>First Heading2</h2>
    <p>more text</p>

    <h2>Second Heading2</h2>
    ...
    <h1>Second Heading1</h1>
    ...
    <h2>Third Heading2</h2>
    ...


    Now I would like to convert this into a hierarchical structure like
    this (think of DocBook):

    <article>
    |
    + <sect1>
    | |
    | + <sect2>
    | + <sect2>
    |
    + <sect1>
      |
      + <sect2>

    This is my 'h1' template, where I try to process all elements between
    two 'h1' elements:

    <xsl:template match="//h:h1">
      <section>
        <title><xsl:value-of select="text()" /></title>
        <xsl:variable name="nexth1"
            select="position(parent::*/*[(name() = 'h1')])" />
        <xsl:apply-templates
            select="following-sibling::*[position() &lt;= $nexth1]" />
      </section>
    </xsl:template>

    $nexth1 should be the position of the next 'h1' element. However,
    position() does not take any arguments, and I don't have a clue how to
    get that position. (I need to change the context node, but I don't know
    how...)

    Can you give me any directions on this?

    Thanks in advance,
    -Stefan
     
    Stefan Kleineikenscheidt, Dec 31, 2006

  2. Peter Flynn (Guest)

    Stefan Kleineikenscheidt wrote:
    > Hi all,
    >
    > I'm trying to convert an HTML page to a hierarchical structure, but I am
    > stuck. Consider a page like this:
    >
    > <h1>First Heading1</h1>
    > <p>some text</p>
    > <p>more text</p>
    >
    > <h2>First Heading2</h2>
    > <p>more text</p>
    >
    > <h2>Second Heading2</h2>
    > ...
    > <h1>Second Heading1</h1>
    > ...
    > <h2>Third Heading2</h2>
    > ...


    First of all you need to make it well-formed XHTML (use W3C Tidy for
    that). An XSLT processor only reads well-formed XML, so this keeps any
    subsequent XSLT step from choking on the input.
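
For illustration, here is a trimmed sketch of what the sample page might look like once it is well-formed XHTML, assuming Tidy is run in its XHTML output mode (the `-asxhtml` option). The `<title>` text and the shortened body are made up for the example; real Tidy output also carries a DOCTYPE and a generator `<meta>` element.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Hypothetical, shortened input document; only the element names
     come from the original post. -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample</title>
  </head>
  <body>
    <h1>First Heading1</h1>
    <p>some text</p>
    <p>more text</p>
    <h2>First Heading2</h2>
    <p>more text</p>
    <h1>Second Heading1</h1>
    <p>more text</p>
  </body>
</html>
```

Note the `xmlns` declaration on `<html>`: every element in this document is in the XHTML namespace, which matters for the match patterns in any stylesheet applied to it.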

    > This is my 'h1' template, where i try to process all elements between
    > two 'h1' elements:
    >
    > <xsl:template match="//h:h1">
    > <section>
    > <title><xsl:value-of select="text()" /></title>
    > <xsl:variable name="nexth1" select="position(parent::*/*[(name()
    > = 'h1')])" />
    > <xsl:apply-templates select="following-sibling::*[position()
    > &lt;= $nexth1]" />
    > </section>
    > </xsl:template>


    <?xml version="1.0" encoding="iso-8859-1"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">

      <xsl:output method="xml" indent="yes"/>

      <xsl:template match="h1|h2|h3|h4">
        <xsl:variable name="id" select="generate-id(.)"/>
        <!-- heading level: the digit left after stripping 'h' from the name -->
        <xsl:variable name="level">
          <xsl:value-of select="number(translate(name(),'h',''))"/>
        </xsl:variable>
        <xsl:variable name="gi" select="name()"/>
        <xsl:element name="{concat('sect',$level)}">
          <xsl:attribute name="id">
            <xsl:value-of select="$id"/>
          </xsl:attribute>
          <title>
            <xsl:apply-templates/>
          </title>
          <!-- following siblings whose nearest preceding heading of the same
               name is this one; the last two predicates are intended to drop
               higher-level headings and whatever follows them -->
          <xsl:apply-templates select="following-sibling::*
            [generate-id(preceding-sibling::*[name()=$gi][1])=$id]
            [not(substring(name(),1,1)='h' and name()!='hr' and
              number(translate(substring(name(),1,1),'h',''))&lt;$level)]
            [not(number(translate(name(preceding-sibling::*[substring(name(),1,1)='h'
              and name()!='hr'][1]),'h',''))&lt;$level)]"/>
        </xsl:element>
      </xsl:template>

      <xsl:template match="p">
        <para>
          <xsl:apply-templates/>
        </para>
      </xsl:template>

    </xsl:stylesheet>

    This needs some more work: it's not subsetting out the higher-level H*
    element types, but I've run out of time here.
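
One way the grouping could be completed for the two heading levels in the question is with `xsl:key`. The following is a minimal sketch, not Peter's method: it assumes the headings and paragraphs are direct children of `body`, that the elements are in no namespace (as in the stylesheet above), and that only `h1` and `h2` occur. The key names `content` and `subsections` are made up for the example.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Every non-heading child of body, indexed by the id of the
       nearest preceding heading (h1 or h2). -->
  <xsl:key name="content"
           match="body/*[not(self::h1 or self::h2)]"
           use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

  <!-- Every h2, indexed by the id of the nearest preceding h1. -->
  <xsl:key name="subsections"
           match="body/h2"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Start from the h1 elements only, so nothing is processed twice. -->
  <xsl:template match="/">
    <article>
      <xsl:apply-templates select="//body/h1"/>
    </article>
  </xsl:template>

  <xsl:template match="h1">
    <sect1>
      <title><xsl:apply-templates/></title>
      <!-- content directly under this h1, then its h2 subsections -->
      <xsl:apply-templates select="key('content', generate-id())"/>
      <xsl:apply-templates select="key('subsections', generate-id())"/>
    </sect1>
  </xsl:template>

  <xsl:template match="h2">
    <sect2>
      <title><xsl:apply-templates/></title>
      <xsl:apply-templates select="key('content', generate-id())"/>
    </sect2>
  </xsl:template>

  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

Because each element is reached through exactly one key lookup, no position arithmetic is needed. Anything that appears before the first `h1` is dropped by this sketch, and deeper levels (`h3`, `h4`) would each need their own key and template.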

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jan 2, 2007

  3. Peter Flynn wrote:
    > First of all you would need to make it well-formed XHTML

    [...]
    > <?xml version="1.0" encoding="iso-8859-1"?>
    > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    > version="1.0">
    >
    > <xsl:output method="xml" indent="yes"/>
    >
    > <xsl:template match="h1|h2|h3|h4">


    If the source is "well-formed XHTML" you will have to deal with
    namespaces as the OP already did.
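
A minimal sketch of what dealing with namespaces could look like, assuming the source uses the standard XHTML namespace and reusing the `h` prefix from the original post. It only illustrates the matching; the grouping logic is deliberately left out, so the sections come out flat.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- In XSLT 1.0 an unprefixed name in a pattern means "no namespace",
       so XHTML elements have to be matched through the h: prefix. -->
  <xsl:template match="h:h1|h:h2|h:h3|h:h4">
    <!-- local-name() returns the name without any prefix -->
    <xsl:variable name="level" select="translate(local-name(), 'h', '')"/>
    <xsl:element name="sect{$level}">
      <title><xsl:apply-templates/></title>
    </xsl:element>
  </xsl:template>

  <xsl:template match="h:p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

The prefix itself is arbitrary; only the namespace URI has to agree with the one declared in the source document. `exclude-result-prefixes` keeps the XHTML namespace declaration from being copied onto the output elements.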
    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
     
    Johannes Koch, Jan 2, 2007
