XHTML to XML conversion

Discussion in 'XML' started by hawat.thufir@gmail.com, Aug 15, 2005.

  1. Guest

    I'm trying to do some "screen scraping", and am using
    <http://www.oreilly.com/catalog/xmlhks/> for inspiration.

    First I'd like to convert XHTML to XML, or extract XML from XHTML; I'm
    not sure how to phrase that.

    "Use Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It
    for Data"
    <http://hacks.oreilly.com/pub/h/2125>

    Is what I'd like to do down the line, but for now I'm working on
    something simpler.


    First,

    "Convert an HTML Document to XHTML with HTML Tidy"
    <http://hacks.oreilly.com/pub/h/2054>

    Instead of Tidy, I went with TagSoup
    <http://mercury.ccil.org/~cowan/XML/tagsoup/>.


    Then I'd like to go from XHTML to XML in order to:

    "Generate an XSLT Identity Stylesheet with Relaxer"
    <http://hacks.oreilly.com/pub/h/2069>

    How do I get the XML from the XHTML, please?

    Here's what I have:

    [thufir@arrakis tagSoup]$ date
    Sun Aug 14 23:34:13 IST 2005
    [thufir@arrakis tagSoup]$ pwd
    /home/thufir/Desktop/tagSoup
    [thufir@arrakis tagSoup]$ ll
    total 60
    -rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
    -rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
    [thufir@arrakis tagSoup]$ java -jar tagsoup.jar --files google.html
    src: google.html dst: google.xhtml
    [thufir@arrakis tagSoup]$ ll
    total 76
    -rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
    -rw-rw-r-- 1 thufir thufir 10568 Aug 14 23:34 google.xhtml
    -rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
    [thufir@arrakis tagSoup]$ cat google.xhtml -n
    1 <?xml version="1.0" standalone="yes"?>
    2
    3 <html version="-//W3C//DTD HTML 4.01 Transitional//EN"
    xmlns="http://www.w3.org/1999/xhtml"><head><title>Google
    Directory</title><style>&lt;!--
    4 body,td,a,p,.h{font-family: arial,sans-serif;}
    .h{color:#008000}
    .q{text-decoration:none; color:#0000cc;}
    5 //--&gt;</style><script>
    6 &lt;!--
    7 function sf(){document.f.q.focus();}
    8 // --&gt;
    9 </script></head><body bgcolor="#ffffff" text="#000000"
    link="#3300cc" vlink="#660066" alink="#ff0000" onload="sf();">
    10 <center>
    11 <table cellpadding="0" cellspacing="0" border="0"><tr><td
    align="right" colspan="1" rowspan="1" valign="bottom"><img
    src="http://www.google.com/images/hp0.gif" width="158" height="78"
    alt="Google Directory"></img></td><td colspan="1" rowspan="1"
    valign="bottom"><img src="http://www.google.com/images/hp1.gif"
    width="50" height="78" alt=""></img></td><td colspan="1" rowspan="1"
    valign="bottom"><img src="http://www.google.com/images/hp2.gif"
    width="68" height="78" alt=""></img></td></tr><tr><td align="right"
    colspan="1" rowspan="1" valign="top" class="h"><b>Directory</b></td><td
    colspan="1" rowspan="1" valign="top"><img
    src="http://www.google.com/images/hp3.gif" width="50" height="32"
    alt=""></img></td><td colspan="1" rowspan="1" valign="top"
    class="h"></td></tr></table><br clear="none"></br><table border="0"
    cellspacing="0" cellpadding="0"><tr><td colspan="1" rowspan="1"
    width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
    rowspan="1" id="0" bgcolor="#efefef" width="95"><a shape="rect"
    class="q" id="0a" href="http://www.google.com/webhp?hl=en"><font
    size="-1">Web</font></a></td><td colspan="1" rowspan="1"
    width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
    rowspan="1" id="1" bgcolor="#efefef" width="95"><a shape="rect"
    class="q" id="1a" href="http://www.google.com/imghp?hl=en"><font
    size="-1">Images</font></a></td><td colspan="1" rowspan="1"
    width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
    rowspan="1" id="2" bgcolor="#efefef" width="95"><a shape="rect"
    class="q" id="2a" href="http://www.google.com/grphp?hl=en"><font
    size="-1">Groups</font></a></td><td colspan="1" rowspan="1"
    width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
    rowspan="1" id="3" bgcolor="#008000" width="95"><font color="#ffffff"
    size="-1"><b>Directory</b></font></td><td colspan="1" rowspan="1"
    width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
    rowspan="1" id="4" bgcolor="#efefef" width="95"><a shape="rect"
    class="q" id="4a" href="http://www.google.com/nwshp?hl=en"><font
    size="-1">News</font></a></td><td colspan="1" rowspan="1"
    width="15"> </td><td colspan="1" rowspan="1"
    width="15"> </td></tr><tr><td colspan="12" rowspan="1"
    bgcolor="#008000"><img width="1" height="1"
    alt=""></img></td></tr></table><br clear="none"></br><form
    enctype="application/x-www-form-urlencoded" method="get"
    action="http://www.google.com/search" name="f"><table cellpadding="0"
    cellspacing="0"><tr align="middle" valign="center"><td colspan="1"
    rowspan="1" width="150"> </td><td colspan="1" rowspan="1"><input
    maxlength="256" type="text" name="q" size="40"
    value=""></input><script>document.f.q.focus();</script><input
    type="submit" name="btnG" value="Google Search"></input><input
    type="hidden" name="hl" value="en"></input><input type="hidden"
    name="cat" value="gwd/Top"></input></td><td align="left" colspan="1"
    rowspan="1" width="150"><font size="-2"> • <a
    shape="rect" href="http://www.google.com/dirhelp.html">Directory
    Help</a></font></td></tr></table></form><p><font color="#008000"><b>The
    web organized by topic into categories.</b></font></p><p></p><table
    align="center" width="1%" border="0" cellspacing="7"
    cellpadding="0"><tr><td colspan="4" rowspan="1" bgcolor="#008000"><img
    width="1" height="1" alt=""></img></td></tr><tr><td colspan="1"
    rowspan="1"> </td><td colspan="1" nowrap="nowrap" rowspan="1">
    12 <b><a shape="rect" href="/Top/Arts/">Arts</a></b><br
    clear="none"></br>
    13 <font size="-1"><a shape="rect"
    href="/Top/Arts/Movies/">Movies</a>, <a shape="rect"
    href="/Top/Arts/Music/">Music</a>, <a shape="rect"
    href="/Top/Arts/Television/">Television</a>, ...</font><p>
    14 <b><a shape="rect" href="/Top/Business/">Business</a></b><br
    clear="none"></br>
    15 <font size="-1"><a shape="rect"
    href="/Top/Business/Major_Companies/">Companies</a>, <a shape="rect"
    href="/Top/Business/Financial_Services/">Finance</a>, <a shape="rect"
    href="/Top/Business/Employment/">Jobs</a>, ...</font></p><p>
    16 <b><a shape="rect" href="/Top/Computers/">Computers</a></b><br
    clear="none"></br>
    17 <font size="-1"><a shape="rect"
    href="/Top/Computers/Internet/">Internet</a>, <a shape="rect"
    href="/Top/Computers/Hardware/">Hardware</a>, <a shape="rect"
    href="/Top/Computers/Software/">Software</a>, ...</font></p><p>
    18 <b><a shape="rect" href="/Top/Games/">Games</a></b><br
    clear="none"></br>
    19 <font size="-1"><a shape="rect"
    href="/Top/Games/Board_Games/">Board</a>, <a shape="rect"
    href="/Top/Games/Roleplaying/">Roleplaying</a>, <a shape="rect"
    href="/Top/Games/Video_Games/">Video</a>, ...</font></p><p>
    20 <b><a shape="rect" href="/Top/Health/">Health</a></b><br
    clear="none"></br>
    21 <font size="-1"><a shape="rect"
    href="/Top/Health/Alternative/">Alternative</a>, <a shape="rect"
    href="/Top/Health/Fitness/">Fitness</a>, <a shape="rect"
    href="/Top/Health/Medicine/">Medicine</a>, ...</font></p><p>
    22 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
    23 <b><a shape="rect" href="/Top/Home/">Home</a></b><br
    clear="none"></br>
    24 <font size="-1"><a shape="rect"
    href="/Top/Home/Consumer_Information/">Consumers</a>, <a shape="rect"
    href="/Top/Home/Homeowners/">Homeowners</a>, <a shape="rect"
    href="/Top/Home/Family/">Family</a>, ...</font><p>
    25 <b><a shape="rect" href="/Top/Kids_and_Teens/">Kids and
    Teens</a></b><br clear="none"></br>
    26 <font size="-1"><a shape="rect"
    href="/Top/Kids_and_Teens/Computers/">Computers</a>, <a shape="rect"
    href="/Top/Kids_and_Teens/Entertainment/">Entertainment</a>, <a
    shape="rect" href="/Top/Kids_and_Teens/School_Time/">School</a>,
    ...</font></p><p>
    27 <b><a shape="rect" href="/Top/News/">News</a></b><br
    clear="none"></br>
    28 <font size="-1"><a shape="rect"
    href="/Top/News/Media/">Media</a>, <a shape="rect"
    href="/Top/News/Newspapers/">Newspapers</a>, <a shape="rect"
    href="/Top/News/Current_Events/">Current Events</a>, ...</font></p><p>
    29 <b><a shape="rect"
    href="/Top/Recreation/">Recreation</a></b><br
    clear="none"></br> 30 <font size="-1"><a shape="rect"
    href="/Top/Recreation/Food/">Food</a>, <a shape="rect"
    href="/Top/Recreation/Outdoors/">Outdoors</a>, <a shape="rect"
    href="/Top/Recreation/Travel/">Travel</a>, ...</font></p><p>
    31 <b><a shape="rect" href="/Top/Reference/">Reference</a></b><br
    clear="none"></br>
    32 <font size="-1"><a shape="rect"
    href="/Top/Reference/Education/">Education</a>, <a shape="rect"
    href="/Top/Reference/Libraries/">Libraries</a>, <a shape="rect"
    href="/Top/Reference/Maps/">Maps</a>, ...</font></p><p>
    33 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
    34 <b><a shape="rect" href="/Top/Regional/">Regional</a></b><br
    clear="none"></br>
    35 <font size="-1"><a shape="rect"
    href="/Top/Regional/Asia/">Asia</a>, <a shape="rect"
    href="/Top/Regional/Europe/">Europe</a>, <a shape="rect"
    href="/Top/Regional/North_America/">North America</a>, ...</font><p>
    36 <b><a shape="rect" href="/Top/Science/">Science</a></b><br
    clear="none"></br>
    37 <font size="-1"><a shape="rect"
    href="/Top/Science/Biology/">Biology</a>, <a shape="rect"
    href="/Top/Science/Social_Sciences/Psychology/">Psychology</a>, <a
    shape="rect" href="/Top/Science/Physics/">Physics</a>,
    ...</font></p><p>
    38 <b><a shape="rect" href="/Top/Shopping/">Shopping</a></b><br
    clear="none"></br>
    39 <font size="-1"><a shape="rect"
    href="/Top/Shopping/Vehicles/Autos/">Autos</a>, <a shape="rect"
    href="/Top/Shopping/Clothing/">Clothing</a>, <a shape="rect"
    href="/Top/Shopping/Gifts/">Gifts</a>, ...</font></p><p>
    40 <b><a shape="rect" href="/Top/Society/">Society</a></b><br
    clear="none"></br>
    41 <font size="-1"><a shape="rect"
    href="/Top/Society/Issues/">Issues</a>, <a shape="rect"
    href="/Top/Society/People/">People</a>, <a shape="rect"
    href="/Top/Society/Religion_and_Spirituality/">Religion</a>,
    ...</font></p><p>
    42 <b><a shape="rect" href="/Top/Sports/">Sports</a></b><br
    clear="none"></br>
    43 <font size="-1"><a shape="rect"
    href="/Top/Sports/Basketball/">Basketball</a>, <a shape="rect"
    href="/Top/Sports/Football/">Football</a>, <a shape="rect"
    href="/Top/Sports/Soccer/">Soccer</a>, ...</font></p><p>
    44 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
    colspan="3" rowspan="1"><b><a shape="rect"
    href="/Top/World/">World</a></b><br clear="none"></br>
    45 <font size="-1"><a shape="rect"
    href="/Top/World/Deutsch/">Deutsch</a>, <a shape="rect"
    href="/Top/World/Espa%C3%B1ol/">Español</a>, <a shape="rect"
    href="/Top/World/Fran%C3%A7ais/">Français</a>, <a shape="rect"
    href="/Top/World/Italiano/">Italiano</a>, <a shape="rect"
    href="/Top/World/Japanese/">Japanese</a>, <a shape="rect"
    href="/Top/World/Korean/">Korean</a>, <a shape="rect"
    href="/Top/World/Nederlands/">Nederlands</a>, <a shape="rect"
    href="/Top/World/Polska/">Polska</a>, <a shape="rect"
    href="/Top/World/Svenska/">Svenska</a>, ...</font><p>
    46 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
    colspan="1" nowrap="nowrap" rowspan="1"><font
    size="-1"> </font></td></tr><tr><td colspan="4" rowspan="1"
    bgcolor="#008000"><img width="1" height="1"
    alt=""></img></td></tr></table><br clear="none"></br><font size="-1"><a
    shape="rect"
    href="http://www.google.com/ads/">Advertise with Us</a> - <a
    shape="rect"
    href="http://www.google.com/about.html">Jobs, Press, Cool Stuff...</a></font><p><font
    face="arial,sans-serif" size="-1"> ©2004 Google</font></p><br
    clear="none"></br><table align="center" border="0" bgcolor="#336600"
    cellpadding="3" cellspacing="0"><tr><td colspan="1" rowspan="1"> <table
    width="100%" cellpadding="2" cellspacing="0" border="0"><tr
    align="center"><td colspan="1" rowspan="1"><font face="sans-serif,
    Arial, Helvetica" size="2" color="#ffffff">Help build the largest
    human-edited directory on the web.</font></td></tr><tr align="center"
    bgcolor="#cccccc"><td colspan="1" rowspan="1"><font face="sans-serif,
    Arial, Helvetica" size="2">
    47 <a shape="rect" href="http://dmoz.org/add.html">
    48 Submit a Site</a> - <a shape="rect"
    href="http://dmoz.org/about.html"><b>Open Directory Project</b></a> -
    49 <a shape="rect" href="http://dmoz.org/cgi-bin/apply.cgi">Become
    an Editor</a> </font>
    50 </td></tr></table>
    51 </td></tr></table>
    52 </center></body></html>
    53
    [thufir@arrakis tagSoup]$ date
    Sun Aug 14 23:34:57 IST 2005
    [thufir@arrakis tagSoup]$


    Thanks,

    Thufir
    , Aug 15, 2005
    #1

  2. Guest

    wrote:

    > I'm trying do some "screen scraping",


    As a general rule, this sucks. It's a vile process and very brittle
    (they change their site without telling you, your code dies). It's
    impossible to say how hard or easy it is - it's massively dependent on
    the target page you're scraping. Even within one site, one page may be
    easy and another a nightmare.

    It's also increasingly unnecessary and even illegal to do it. Chances
    are that if they _want_ you to have the content there will be an RSS
    version of it, and if they don't then they'll get pissy with suits over
    it.

    So these days you can quite probably go the easy route, and if you
    can't then there are problems ahead anyway.

    > First I'd like to convert XHTML to XML,


    If your input is XHTML, then life is a lot easier than if it's HTML.
    XHTML _is_ XML, which means that it should be amenable to processing
    with XML tools - these are generally much easier to work with than HTML
    parsing tools.
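
    As a minimal sketch of "amenable to processing with XML tools": the
    block below feeds a small XHTML fragment to a stock SAX parser and
    collects link targets. Python and its built-in xml.sax module are
    assumptions made for illustration (nobody in this thread used them),
    and SAMPLE is a made-up fragment, not a real page.

```python
import xml.sax

# Made-up, well-formed XHTML fragment (an assumption for illustration).
SAMPLE = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<a href="/Top/Arts/">Arts</a> '
    b'<a name="anchor-only">no href here</a> '
    b'<a href="/Top/Games/">Games</a>'
    b'</body></html>'
)


class LinkCollector(xml.sax.ContentHandler):
    """Collect the href attribute of every <a> start-element event."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so `name` is the raw tag name.
        if name == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])


def collect_links(xhtml_bytes):
    """Parse well-formed XHTML and return its link targets in document order."""
    handler = LinkCollector()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler.hrefs


print(collect_links(SAMPLE))
```

    The same handler would choke on real-world tag soup, which is the
    point being made: a generic XML parser only works once the input is
    well-formed.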

    OTOH, XHTML is still rare on the web, and Appendix C means that what
    there is has to be served up as HTML rather than XML (and may no longer
    work correctly as XML). Additionally, much of it is still just broken,
    as the web always has been. Be wary of any page with externally served
    ads on it!


    You will probably get your project developed most effectively by first
    hacking around with a few well-behaved RSS or Atom feeds (BBC and
    Google are good sources). Learn to work through half the problem before
    you have to dive into the nasty half of straining through random tag
    soup.

    Your previously described architecture looked like an awful lot of
    layers - I've never needed to use that many stages of processing.
    , Aug 15, 2005
    #2

  3. wrote:
    > wrote:
    >
    > It's also increasingly unnecessary and even illegal to do it.


    Why should it be illegal to save a (public) html-file and modify it? You can
    save it with the "save as" function of your browser as well!

    If you save it for your own use I do not think it is illegal.


    > Chances are that if they _want_ you to have the content there will be an
    > RSS version of it, and if they don't then they'll get pissy with suits
    > over it.


    They might get pissy, but tell me: why do they publish that information on
    the web?!

    regards

    Andreas
    Andreas Baier, Aug 15, 2005
    #3
  4. Guest

    Andreas Baier wrote:
    ....
    > > Chances are that if they _want_ you to have the content there will be an
    > > RSS version of it, and if they don't then they'll get pissy with suits
    > > over it.
    >
    > They might get pissy, but tell me: why do they publish that information on
    > the web?!

    ....

    Ok, let's take this to alt.ethics.web ;)

    If there's a beef, it should really be with O'Reilly for publishing the
    hack <http://hacks.oreilly.com/pub/h/2125>. Of course, they're
    probably protected by the "free speech" rights part of the
    constitution, but IANAL. (Bill of rights? which part?)

    Anyhow, it's for personal use. I'm not republishing the data, which'd
    be slimy. I don't know that it's illegal, but that'd be slimy.
    Whether it's illegal or slimy, I'm sure there's a book on it. I'm sure
    there are books on spam, for example.



    -Thufir
    , Aug 15, 2005
    #4
  5. Guest

    wrote:
    ....
    > If your input is XHTML, then life is a lot easier than if it's HTML.
    > XHTML _is_ XML, which means that it should be amenable to processing
    > with XML tools - these are generally much easier to work with than HTML
    > parsing tools.

    ....

    TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
    the XHTML file for this trial run. I'm more trying to understand than
    do anything practical at this stage.

    If XHTML is XML, can the hack
    "Generate an XSLT Identity Stylesheet with Relaxer"
    <http://hacks.oreilly.com/pub/h/2069>

    be run on the XHTML file to create the XSLT Identity Stylesheet? If
    not, is there some other "hack" to do something like that, but with
    XHTML?

    Once you have XHTML you have XML because, as you said, XHTML is XML.
    However, there's all that extra stuff in there which makes it XHTML.
    An XSL Stylesheet can turn the XHTML file into plain XML?

    At the moment I just want to get some sort of XSLT Stylesheet to work
    with as a baseline to try to understand this. Can Relaxer create an
    Identity Stylesheet for an XHTML file? Once I have an Identity
    Stylesheet, that'd be something to work with.

    I'm also working on this from the direction of Cocoon as per "Use
    Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It for
    Data" <http://hacks.oreilly.com/pub/h/2125>. Right now I'm just trying
    to figure out how to convert XHTML to XML. I know that an XSLT can be
    used, but can an Identity Stylesheet for an XHTML file be generated?


    Thanks,

    Thufir
    , Aug 15, 2005
    #5
  6. Soren Kuula Guest

    wrote:
    > If XHTML is XML, can the hack
    > "Generate an XSLT Identity Stylesheet with Relaxer"
    > <http://hacks.oreilly.com/pub/h/2069>
    > be run on the XHTML file to create the XSLT Identity Stylesheet? If
    > not, is there some other "hack" to do something like that, but with
    > XHTML?


    An identity stylesheet won't do anything for you. When run on an XSL
    processor, it just takes an XML document and spits out the same document.

    > Once you have XHTML you have XML because, as you said, XHTML is XML.
    > However, there's all that extra stuff in there which makes it XHTML.
    > An XSL Stylesheet can turn the XHTML file into plain XML?


    Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
    XSL, XML Schema, eclipse .project files, ant build scripts, and
    thousands of others.

    I guess you mean, can you use an XSL stylesheet to extract the data in
    which you are interested, from the XHTML? Yes, you can, but as someone
    pointed out, it will be pretty brittle -- someone introduces or removes
    a <span> around "your" data, and you will probably have to edit your
    transform stylesheet. You will get tired.
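
    A small sketch of both halves of that point, assuming Python's
    standard xml.etree.ElementTree in place of an XSL processor (that
    substitution is mine, not something used in this thread). The
    fragment is cut down from the google.xhtml listing in the first post;
    the <span> change is a hypothetical site edit.

```python
import xml.etree.ElementTree as ET

# Cut-down fragment in the shape of the TagSoup output from the first post.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Google Directory</title></head>
  <body>
    <b><a shape="rect" href="/Top/Arts/">Arts</a></b>
    <font size="-1"><a shape="rect" href="/Top/Arts/Movies/">Movies</a>,
    <a shape="rect" href="/Top/Arts/Music/">Music</a></font>
    <b><a shape="rect" href="/Top/Business/">Business</a></b>
  </body>
</html>"""

# Every element is in the XHTML namespace, so the path has to name it too.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def top_level_categories(xhtml_text):
    """Return (name, href) for each <a> that is a direct child of a <b>."""
    root = ET.fromstring(xhtml_text)
    return [(a.text, a.get("href")) for a in root.findall(".//x:b/x:a", NS)]


# Hypothetical site change: each bold link gains a <span> wrapper.
CHANGED = XHTML.replace("<b><a", "<b><span><a").replace(
    "</a></b>", "</a></span></b>"
)

print(top_level_categories(XHTML))    # the two bold category links
print(top_level_categories(CHANGED))  # nothing: the path no longer matches
```

    The extraction itself is one path expression; the fragility comes
    from how tightly that path is tied to the page's current markup.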

    BTW, tools do exist to generate extractors that can select data on web
    pages and then be set loose to suck data out of a large set of
    almost-identical or identically-generated pages. But these are expensive.

    Soren
    Soren Kuula, Aug 15, 2005
    #6
  7. Guest

    Soren Kuula wrote:
    ....
    > An identity stylesheet won't do anything for you. When run on an XSL
    > processor, it just takes an XML document and spits out the same document.


    That's ok, it'd be a starting point.

    So, Relaxer should be able to generate an identity stylesheet?
    Then I could modify the stylesheet to actually extract the data?


    ....
    > Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
    > XSL, XML Schema, eclipse .project files, ant build scripts, and
    > thousands of others.
    >
    > I guess you mean, can you use an XSL stylesheet to extract the data in
    > which you are interested, from the XHTML? Yes, you can, but as someone
    > pointed out, it will be pretty brittle -- someone introduces or removes
    > a <span> around "your" data, and you will probably have to edit your
    > transform stylesheet. You will get tired.

    ....

    This is just for a one off, to see how it works. I recognize the
    brittleness of it conceptually, although I don't exactly know what a
    span is, besides being a linear algebra term.

    If it breaks, that's ok.


    -Thufir
    , Aug 16, 2005
    #7
  8. Andy Dingley Guest

    On 15 Aug 2005 15:36:44 -0700, ""
    <> wrote:

    >TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
    >the XHTML file for this trial run. I'm more trying to understand than
    >do anything practical at this stage.


    You have roughly three problems to solve.

    - Turning HTML tag soup into a sensible document

    - Turning "information" into "data"

    - Turning your minimal raw data into something useful.


    TagSoup appears to solve the first one for you - a series of SAX events
    may be enough to work from, without even needing to save it as a
    "document".

    The second is the hard one, and the one that's most dependent on the
    target site. A well-coded semantically-detailed site is easy,
    pixel-based "visual design" can be almost impossible. You need to
    identify relationships in the page such as "the row below the row
    containing the string 'Weather' will have the expected temperature in the
    third column" - then you implement something (perhaps in complicated
    XPath and simple XSLT) that can implement this rule and extract the
    useful datum.
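
    A rough sketch of the "row below the row containing Weather, third
    column" rule, written in Python with the standard library rather than
    XPath/XSLT (an assumption for illustration). In XPath 1.0 the same
    rule would be roughly
    //tr[contains(., 'Weather')]/following-sibling::tr[1]/td[3],
    assuming un-namespaced elements. PAGE is an invented table; a real
    target page will look different.

```python
import xml.etree.ElementTree as ET

# Invented page fragment; a real target page will differ.
PAGE = """<table>
  <tr><td>Region</td><td>Weather</td><td>Temp</td></tr>
  <tr><td>Dublin</td><td>Cloudy</td><td>17</td></tr>
  <tr><td>Cork</td><td>Rain</td><td>15</td></tr>
</table>"""


def cell_below_marker(table_xml, marker, column):
    """Find the first row whose text contains `marker` and return the text of
    cell number `column` (1-based) in the row directly below it, or None."""
    rows = ET.fromstring(table_xml).findall(".//tr")
    # Skip the last row: it has no row below it.
    for i, row in enumerate(rows[:-1]):
        if marker in "".join(row.itertext()):
            cells = rows[i + 1].findall("td")
            if len(cells) >= column:
                return "".join(cells[column - 1].itertext()).strip()
            return None
    return None


print(cell_below_marker(PAGE, "Weather", 3))
```

    Every constant here ("Weather", the row offset, the column number) is
    a guess about the page layout, which is why this step depends so
    heavily on the target site.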

    Processing the raw data out into a useful output is a perfect XSLT task.
    This is relatively easy.



    >If XHTML is XML, can the hack
    >"Generate an XSLT Identity Stylesheet with Relaxer"
    ><http://hacks.oreilly.com/pub/h/2069>


    I have no idea what that is - it just looks like a title to me.

    I can't even think what an "identity stylesheet" would be either - at
    least not in any useful context.


    >An XSL Stylesheet can turn the XHTML file into plain XML?


    There's no such thing as "plain XML". All XML documents have a schema -
    although there's an abstract concept of "plain XML", you can't have any
    real concrete document without some level of schema. You might not write
    a formal schema, you might not even think through exactly what's in it,
    but as soon as you start giving your elements names, then you've started
    defining some degree of schema. Now if you have to have a schema, and if
    it has to represent web content, then you might as well be using XHTML
    for it !

    The intermediate format (between steps 2 & 3) has less reason to be
    XHTML. It's more likely to be application specific and _just_ holding
    the core data. The schema here could be custom-rolled, or it might be
    some sort of pre-existing weather schema, eBXML catalogue information,
    or even good old Dublin Core.


    As a development route, I'd suggest trying to turn some feed (Amazon top
    10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
    into a usable page. This should be an easy enough example to work on -
    then you can try a more awkward site. I think that hand-coding a simple
    example will give you better experience than diving in with Cocoon and
    having random magic happen in front of you without really
    understanding what it's doing (and Cocoon isn't my obvious thought for a
    first tool to use).
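
    For the "feed into a usable page" half of that route, a hand-coded
    example can be very small. The sketch below assumes Python's standard
    library and an Atom feed; FEED is a made-up document with placeholder
    titles and example.org URLs, not a real BBC or Amazon feed.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up feed for illustration; titles and URLs are placeholders.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example top stories</title>
  <entry>
    <title>First story</title>
    <link href="http://example.org/1"/>
  </entry>
  <entry>
    <title>Fish &amp; chips</title>
    <link href="http://example.org/2"/>
  </entry>
</feed>"""


def feed_to_html(feed_xml):
    """Render an Atom feed as a heading plus a list of linked titles."""
    root = ET.fromstring(feed_xml)
    items = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.find("atom:link", ATOM)
        href = link.get("href", "") if link is not None else ""
        # Escape on the way out so titles cannot inject markup.
        items.append('<li><a href="%s">%s</a></li>' % (escape(href), escape(title)))
    heading = escape(root.findtext("atom:title", default="", namespaces=ATOM))
    return "<h1>%s</h1>\n<ul>\n%s\n</ul>" % (heading, "\n".join(items))


print(feed_to_html(FEED))
```

    Because the feed already carries named fields (title, link), there
    is no guessing about layout; that is the easy half of the problem.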


    --
    Cats have nine lives, which is why they rarely post to Usenet.
    Andy Dingley, Aug 16, 2005
    #8
  9. Guest

    Andy Dingley wrote:
    ....
    > I have no idea what that is - it just looks like a title to me.


    Heh, I was *hoping* someone who owned the book would respond ;)

    > I can't even think what an "identity stylesheet" would be either - at
    > least not in any useful context.


    I don't think it's useful in and of itself. In the example from that
    book, IIRC they take an xml document, time.xml I believe, create an
    XSLT, run the two through ?xerces? and the result is, essentially,
    time.xml; no, it's not useful.

    The point is to automatically create a stylesheet which does nothing,
    then edit the stylesheet, versus creating the stylesheet from scratch.
    It's in the same section as using GUI XSL editors.

    >
    > >An XSL Stylesheet can turn the XHTML file into plain XML?

    >
    > There's no such thing as "plain XML". All XML documents have a schema -
    > although there's an abstract concept of "plain XML", you can't have any
    > real concrete document without some level of schema. You might not write
    > a formal schema, you might not even think through exactly what's in it,
    > but as soon as you start giving your elements names, then you've started
    > defining some degree of schema. Now if you have to have a schema, and if
    > it has to represent web content, then you might as well be using XHTML
    > for it !


    Ah, thank you. I learned something, the "schema" "thing" makes more
    sense now. I don't quite get where XML leaves off and XHTML starts,
    although I do know a bit. XML doesn't have reserved words(?) while
    XHTML does, like <p> for paragraph. XML being more general. I should
    read a bit more about XHTML.

    ....
    > As a development route, I'd suggest trying to turn some feed (Amazon top
    > 10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
    > into a usable page. This should be an easy enough example to work on -
    > then you can try a more awkward site. I think that hand-coding a simple
    > example will give you better experience than diving in with Cocoon and
    > having random magic happen in front of you without really
    > understanding what it's doing (and Cocoon isn't my obvious thought for a
    > first tool to use).

    ....

    Ok, that sounds good. I think I bit off a bit more than I can chew at
    the moment. I'll work from a feed then.


    Thanks,

    Thufir
    , Aug 16, 2005
    #9
  10. Guest

    wrote:
    > Heh, I was *hoping* someone who owned the book would respond ;)


    I haven't bought an O'Reilly in years.


    > > I can't even think what an "identity stylesheet" would be either - at
    > > least not in any useful context.


    Ah - I think I see what this "identity" stylesheet is about.

    An identity transform turns "A" into "A". There's an obvious way to
    write one in XSLT that uses wildcards to copy everything, as the
    identity transform. However (given a schema or even an example of
    input) it would be possible to generate a "longhand" identity
    stylesheet that did each element explicitly. This could then be
    modified to process each element differently, as you required it.

    However this is just a time-saving measure for writing it, not some
    fundamental technique. You can code your own pretty easily.
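
    To make the "copy everything, then override per element" idea
    concrete, here is a sketch of the same pattern in Python with the
    standard library. It is an analogue of an XSLT identity transform,
    not XSLT itself; the `rules` hook and the <b>-to-<strong> example are
    hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def transform(elem, rules=None):
    """Copy `elem` recursively. With no rules this is the identity transform.
    `rules` maps a tag name to a function that returns a replacement element,
    or None to drop the element (its trailing text is dropped with it)."""
    rules = rules or {}
    if elem.tag in rules:
        return rules[elem.tag](elem)
    out = ET.Element(elem.tag, dict(elem.attrib))
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        result = transform(child, rules)
        if result is not None:
            out.append(result)
    return out


def b_to_strong(elem):
    """Rename <b> to <strong>, keeping its text and trailing text.
    Child elements of <b> are ignored in this small sketch."""
    strong = ET.Element("strong", dict(elem.attrib))
    strong.text, strong.tail = elem.text, elem.tail
    return strong


DOC = "<p>The <b>web</b> organized by <i>topic</i>.</p>"
source = ET.fromstring(DOC)

# No rules: output serializes exactly like the input.
print(ET.tostring(transform(source), encoding="unicode"))
# One rule: only <b> is treated differently, everything else is copied.
print(ET.tostring(transform(source, {"b": b_to_strong}), encoding="unicode"))
```

    Starting from the do-nothing copy and adding one rule at a time is
    the workflow a generated identity stylesheet is meant to support; the
    generator only saves the typing.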

    I've seen any number of "stylesheet generator" tools over the years,
    from Schematron onwards. Supposedly you can transform anything to
    anything, with auto-generated XSLT, based on input schemas and clever
    code. However this whole area is a technique that has singularly
    _failed_ to deliver useful products (unusual for XML tools). I'm
    enormously skeptical about them.
    , Aug 16, 2005
    #10
  11. Guest

    wrote:
    ....
    > An identity transform turns "A" into "A". There's an obvious way to
    > write one in XSLT that uses wildcards to copy everything, as the
    > identity transform. However (given a schema or even an example of
    > input) it would be possible to generate a "longhand" identity
    > stylesheet that did each element explicitly. This could then be
    > modified to process each element differently, as you required it.
    >
    > However this is just a time-saving measure for writing it, not some
    > fundamental technique. You can code your own pretty easily.

    ....


    Take matrix A. Then there's the identity matrix I.

    AI=?=IA

    I forget. heh.


    -Thufir
    , Aug 16, 2005
    #11
  12. edgar arizmendi, Aug 31, 2005
    #12
  13. Guest

    , Sep 5, 2005
    #13