Help to extract data from a web page


smiledragon

Hi, I am new to XSLT. Can you help me write an XSLT stylesheet to
extract the article data from the web page below? Thanks a lot.

HTML page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<title>Untitled Document</title>
</head>
<body>
<p>... Page Header ...</p>
<p class=PageTitle>Page Title</p>
<p class=ArticleTitle>Article Title</p>
<table border="0" cellspacing="0" cellpadding="5">
<tr>
<td>Article Date </td>
<td>25/8/2007</td>
</tr>
<tr>
<td colspan="2"><p>Hey, I want to extract Page Title, Article
Title, Article Date and Article Content, Request By.</p>
<p>Please help me to write XSLT code to extract article data?
<br/>
<br/>
Thanks.</p></td>
</tr>
<tr align="right">
<td colspan="2">Author David </td>
</tr>
</table>
<p>... Page Footer ...</p>
</body>
</html>




XML Result Page

<?xml version="1.0" encoding="UTF-8"?>
<HTMLPage>
<PageTitle>Page Title</PageTitle>
<ArticleTitle>Article Title</ArticleTitle>
<ArticleDate>25/8/2007</ArticleDate>
<ArticleBody>
<p>Hey, I want to extract Page Title, Article Title, Article Date
<br/>
Thanks.</p>
</ArticleBody>
<Author>David</Author>
</HTMLPage>
 

Joe Kesselman

(Despite its name, microsoft.public.xsl doesn't let me post to it, so
you're only going to get an answer in comp.text.xml.)

XSLT is designed to process XML, not HTML, and your HTML document will
not get through an XML parser. So the first thing you'll need to do is
run it through an HTML-to-XHTML conversion layer, such as the W3C's
"tidy" tool. (Alternatively, you could feed the output of an HTML-to-XML
parser, such as NekoHTML, into an XSLT processor... but that will require
a bit more programming to hook those tools to each other.)

After doing that... what do you mean by "extract article data"? You're
writing a program, so you need to be explicit about what it's supposed
to do. Page title and article title are easy; look for <p> elements with
the appropriate class attribute, using XPaths with predicates.
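
As a rough illustration of the predicate approach, here is a minimal
XSLT 1.0 sketch. It assumes the page has already been converted to
XHTML, so the elements sit in the XHTML namespace (bound here to the
arbitrary prefix "h") and the class attributes are quoted. The output
element names are taken from the desired result in the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <HTMLPage>
      <!-- The predicate [@class='...'] selects the <p> by its class. -->
      <PageTitle>
        <xsl:value-of select="normalize-space(//h:p[@class='PageTitle'])"/>
      </PageTitle>
      <ArticleTitle>
        <xsl:value-of select="normalize-space(//h:p[@class='ArticleTitle'])"/>
      </ArticleTitle>
    </HTMLPage>
  </xsl:template>

</xsl:stylesheet>
```

If the converter you use does not put elements in the XHTML namespace,
drop the "h:" prefixes from the paths.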

Article date is more of a pain since you need to search for the <td>
with the appropriate text value, then retrieve its following sibling's
value... unless you can count on the fact that it will always be in the
first <tr>, in which case you search for the second td of that tr.
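
Both variants can be sketched as XPath expressions. This is only an
illustration, again assuming XHTML input in the XHTML namespace (prefix
"h" is an arbitrary choice) and that the label cell contains exactly the
text "Article Date" once whitespace is normalized.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <ArticleDate>
      <!-- Label-based: find the cell whose text is "Article Date",
           then take the cell that immediately follows it. -->
      <xsl:value-of select="normalize-space(
          //h:td[normalize-space(.) = 'Article Date']
            /following-sibling::h:td[1])"/>
      <!-- Position-based alternative, if the date is always the second
           cell of the first row:
           normalize-space((//h:table//h:tr)[1]/h:td[2]) -->
    </ArticleDate>
  </xsl:template>

</xsl:stylesheet>
```

The label-based path survives rows being reordered; the position-based
one survives the label text changing. Which one is safer depends on how
the pages you are scraping tend to vary.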

Content -- Can you count on that being the second tr? If so, just
copying the contents of that seems to meet your need.
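
A possible sketch of the copy, assuming XHTML input and that the content
really is the second row. A plain xsl:copy-of would carry the XHTML
namespace into the result, so this version rebuilds each element by its
local name instead; the mode name "strip-ns" is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <ArticleBody>
      <!-- Process everything inside the cell of the second table row. -->
      <xsl:apply-templates select="(//h:table//h:tr)[2]/h:td/node()"
                           mode="strip-ns"/>
    </ArticleBody>
  </xsl:template>

  <!-- Recreate each element without its namespace, keeping attributes.
       Text nodes are copied through by the built-in template rule. -->
  <xsl:template match="*" mode="strip-ns">
    <xsl:element name="{local-name()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()" mode="strip-ns"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

If namespaced <p> and <br/> elements in the result are acceptable, the
whole thing reduces to a single xsl:copy-of with the same select path.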

Author -- Again assuming that it's reliably going to be the third tr,
this is more of a pain because you're going to have to do string
manipulation to extract the author's name.
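
For the string manipulation, the XPath 1.0 functions normalize-space()
and substring-after() are probably enough. The sketch below assumes
XHTML input, that the author cell is in the third row, and that its text
always starts with the literal label "Author " followed by the name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/">
    <Author>
      <!-- Trim and collapse whitespace first, then keep only what
           follows the "Author " label. -->
      <xsl:value-of select="substring-after(
          normalize-space((//h:table//h:tr)[3]/h:td),
          'Author ')"/>
    </Author>
  </xsl:template>

</xsl:stylesheet>
```

For the sample page the cell text "Author David " normalizes to
"Author David", so the element should contain just "David". If the label
is missing, substring-after() returns an empty string.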


Having broken it down to this point, you really ought to be able to
complete the task yourself by consulting a good intro-to-XSLT tutorial.
Try it, and if you run into trouble come back with specific questions.
 

Martin Honnen

Joe said:
XSLT is set up to process XML, not HTML. Your HTML document will not go
through an XML parser. So the first thing you'll need to do is put it
through an HTML-to-XHTML conversion layer, such as the W3C's "tidy"
tool. (Alternatively you could feed the output of an HTML-to-XML parser,
such as NekoHTML, into an XSLT processor... but that will require a bit
more programming to hook those tools to each other.)

If you don't want to write code to hook those tools together, you can
use TSaxon <http://ccil.org/~cowan/XML/tagsoup/tsaxon/>, which lets you
use Saxon 6.5.5 to apply XSLT 1.0 transformations to both XML and HTML
input documents.
 
