HTML Parsing Question

  • Thread starter Stefan Kleineikenscheidt

Stefan Kleineikenscheidt

Hi all,

I'm trying to convert an HTML page to a hierarchical structure, but I am
stuck. Consider a page like this:

<h1>First Heading1</h1>
<p>some text</p>
<p>more text</p>

<h2>First Heading2</h2>
<p>more text</p>

<h2>Second Heading2</h2>
...
<h1>Second Heading1</h1>
...
<h2>Third Heading2</h2>
...


Now I would like to convert this into a hierarchical structure like
this (think of DocBook):

<article>
|
+ <sect1>
| |
| + <sect2>
| + <sect2>
|
+ <sect1>
  |
  + <sect2>

This is my 'h1' template, where I try to process all elements between
two 'h1' elements:

<xsl:template match="//h:h1">
  <section>
    <title><xsl:value-of select="text()" /></title>
    <xsl:variable name="nexth1"
                  select="position(parent::*/*[(name() = 'h1')])" />
    <xsl:apply-templates
        select="following-sibling::*[position() &lt;= $nexth1]" />
  </section>
</xsl:template>

$nexth1 should be the position of the next 'h1' element. However,
position() does not take any arguments, and I don't have a clue how to
get that position. (I need to change the context node, but I don't know
how...)

Can you give me any directions on this?

Thanks in advance,
-Stefan
 

Peter Flynn

Stefan said:
Hi all,

I'm trying to convert an HTML page to a hierarchical structure, but I am
stuck. Consider a page like this:

<h1>First Heading1</h1>
<p>some text</p>
<p>more text</p>

<h2>First Heading2</h2>
<p>more text</p>

<h2>Second Heading2</h2>
...
<h1>Second Heading1</h1>
...
<h2>Third Heading2</h2>
...

First of all, you would need to turn it into well-formed XHTML (use W3C Tidy
for that). This ensures that any subsequent XSLT processing won't gag on it.

Stefan said:
This is my 'h1' template, where I try to process all elements between
two 'h1' elements:

<xsl:template match="//h:h1">
  <section>
    <title><xsl:value-of select="text()" /></title>
    <xsl:variable name="nexth1"
                  select="position(parent::*/*[(name() = 'h1')])" />
    <xsl:apply-templates
        select="following-sibling::*[position() &lt;= $nexth1]" />
  </section>
</xsl:template>

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="h1|h2|h3|h4">
    <xsl:variable name="id" select="generate-id(.)"/>
    <xsl:variable name="level">
      <xsl:value-of select="number(translate(name(),'h',''))"/>
    </xsl:variable>
    <xsl:variable name="gi" select="name()"/>
    <xsl:element name="{concat('sect',$level)}">
      <!-- In XSLT 1.0, xsl:attribute takes its value from its content,
           not from a select attribute -->
      <xsl:attribute name="id">
        <xsl:value-of select="$id"/>
      </xsl:attribute>
      <title>
        <xsl:apply-templates/>
      </title>
      <xsl:apply-templates select="following-sibling::*
        [generate-id(preceding-sibling::*[name()=$gi][1])=$id]
        [not(substring(name(),1,1)='h' and name()!='hr' and
             number(translate(substring(name(),1,1),'h',''))&lt;$level)]
        [not(number(translate(name(preceding-sibling::*[substring(name(),1,1)='h'
             and name()!='hr'][1]),'h',''))&lt;$level)]"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="p">
    <para>
      <xsl:apply-templates/>
    </para>
  </xsl:template>

</xsl:stylesheet>

This needs some more work: it's not subsetting out the higher-level H*
element types, but I've run out of time here.
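
For what it's worth, here is a minimal sketch of one way the grouping could be finished. It swaps the predicate chain for two keys, so it is a different technique from the stylesheet above, not a repair of it. The assumptions: the headings and paragraphs are direct children of html/body, the elements are in no namespace, and only h1 and h2 need nesting. The key names "content" and "h2-by-h1" are made up for this example.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical key: every non-heading child of body, indexed by the
       generated id of the nearest preceding h1 or h2 sibling -->
  <xsl:key name="content"
           match="body/*[not(self::h1 or self::h2)]"
           use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

  <!-- Hypothetical key: every h2, indexed by the generated id of the
       nearest preceding h1 sibling -->
  <xsl:key name="h2-by-h1"
           match="body/h2"
           use="generate-id(preceding-sibling::h1[1])"/>

  <!-- Assumes an html/body wrapper around the headings -->
  <xsl:template match="/">
    <article>
      <xsl:apply-templates select="html/body/h1"/>
    </article>
  </xsl:template>

  <xsl:template match="h1">
    <sect1>
      <title><xsl:apply-templates/></title>
      <!-- Content that directly follows this h1, up to the next heading -->
      <xsl:apply-templates select="key('content', generate-id())"/>
      <!-- The h2 sections that belong to this h1 -->
      <xsl:apply-templates select="key('h2-by-h1', generate-id())"/>
    </sect1>
  </xsl:template>

  <xsl:template match="h2">
    <sect2>
      <title><xsl:apply-templates/></title>
      <!-- Content that follows this h2, up to the next heading -->
      <xsl:apply-templates select="key('content', generate-id())"/>
    </sect2>
  </xsl:template>

  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

Because each element is pulled in exactly once through a key lookup, nothing is processed twice. Anything that appears before the first h1 is silently dropped in this sketch, and element types other than p fall through to the built-in templates, so only their text comes out.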

///Peter
 

Johannes Koch

Peter said:
First of all you would need to make it well-formed XHTML [...]
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="h1|h2|h3|h4">

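If the source is "well-formed XHTML", its elements are in the XHTML
namespace, so unprefixed patterns like the one above will not match them.
You will have to deal with namespaces, as the OP already did.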
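
A minimal sketch of what that could look like, assuming the source document declares the usual XHTML namespace (http://www.w3.org/1999/xhtml) as its default namespace. The h prefix is only a local choice in the stylesheet, matching the one the OP used. The sketch shows the namespace handling only and leaves the nesting logic out.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="h"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Matches h1..h4 only when they are in the XHTML namespace -->
  <xsl:template match="h:h1|h:h2|h:h3|h:h4">
    <!-- local-name() ignores any prefix, so 'h2' yields the number 2 -->
    <xsl:variable name="level"
                  select="number(translate(local-name(),'h',''))"/>
    <xsl:element name="{concat('sect',$level)}">
      <title>
        <xsl:apply-templates/>
      </title>
    </xsl:element>
  </xsl:template>

  <!-- Paragraphs need the prefix as well -->
  <xsl:template match="h:p">
    <para>
      <xsl:apply-templates/>
    </para>
  </xsl:template>

</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the XHTML namespace declaration from being copied onto literal result elements such as para and title.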
 
