parse URL (href) from xhtml, xhtml -> text, for data

Discussion in 'XML' started by hawat.thufir@gmail.com, Feb 7, 2006.

  1. Guest

    Given an xhtml file, how can I "export" the data to plain-text? That is,
    I want:

    google www.google.com


    Whereas, if I copy and paste what the browser shows, I lose the URL and
    end up with:

    google


    The idea is that I want to import the data to MySQL using the mysqlimport
    command, but mysqlimport requires plain-text. The xhtml file in question:

    [thufir@localhost Desktop]$ cat raw.xhtml -n
    1 <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    2 <html xmlns="http://www.w3.org/1999/xhtml"><head><meta
    http-equiv="content-type" content="text/html; charset=utf-8" /><title
    /><meta name="generator" content="StarOffice/OpenOffice.org XSLT
    (http://xml.openoffice.org/sx2ml)" /><meta name="created"
    content="2006-02-07T15:19:17" /><meta name="changed"
    content="2006-02-07T15:36:55" /><base href="." /><style type="text/css">
    3 @page { }
    4 table { border-collapse:collapse; border-spacing:0;
    empty-cells:show }
    5 td, th { vertical-align:top; }
    6 h1, h2, h3, h4, h5, h6 { clear:both }
    7 ol, ul { padding:0; }
    8 * { margin:0; }
    9 *.ta1 { }
    10 *.ce1 { font-family:Courier; color:#000000;
    font-size:10pt; font-style:normal; text-shadow:none; font-weight:normal; }
    11 *.ce2 { font-family:Courier; color:#000000; }
    12 *.Default { font-family:'Bitstream Vera Sans'; }
    13 *.Heading { font-family:'Bitstream Vera Sans';
    text-align:center ! important; font-size:16pt; font-style:italic;
    font-weight:bold; }
    14 *.Heading1 { font-family:'Bitstream Vera Sans';
    text-align:center ! important; font-size:16pt; font-style:italic;
    font-weight:bold; }
    15 *.Result { font-family:'Bitstream Vera Sans';
    font-style:italic; font-weight:bold; text-decoration:underline; }
    16 *.Result2 { font-family:'Bitstream Vera Sans';
    font-style:italic; font-weight:bold; text-decoration:underline; }
    17 *.co1 { width:0.8925in; }
    18 *.ro1 { height:0.1756in; }
    19 *.ro2 { height:0.1681in; }
    20 </style></head><body dir="ltr"><table border="0"
    cellspacing="0" cellpadding="0" class="ta1"><colgroup><col width="99"
    /></colgroup><tr class="ro1"><td style="text-align:left;width:0.8925in; "
    class="ce1"><p><a href="http://www.google.com/">google
    </a>  </p></td></tr><tr class="ro2"><td
    style="text-align:left;width:0.8925in; " class="ce2" /></tr><tr
    class="ro2"><td style="text-align:left;width:0.8925in; " class="ce2"
    /></tr></table></body></html>[thufir@localhost Desktop]$ date
    Tue Feb 7 15:52:34 EST 2006
    [thufir@localhost Desktop]$




    thanks,


    Thufir
    , Feb 7, 2006
    #1

  2. First, you need to define what portions of the document are "data". It
    sounds like what you want is just the links; is that correct?

    If so, you need to search for <a> elements that have an href attribute,
    pull out their content (which may be arbitrarily complex markup, please
    remember -- rich text, images, etc. -- you need to define how much of
    that you want to return and how you want it presented!), pull the value
    of the href attribute, and report that pair of values.
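
    To make the "arbitrarily complex markup" point concrete, here is a
    minimal Python sketch (not from the original exchange) that flattens
    whatever sits inside each link down to plain text. It assumes the input
    is well-formed XHTML in the standard XHTML namespace; the name
    link_pairs and the sample string are made up for illustration, and
    silently dropping images and other non-text content is only one of
    several reasonable policies.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <html> element of an XHTML document.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def link_pairs(xhtml_text):
    """Return (label, href) pairs for every <a> element that has an href."""
    root = ET.fromstring(xhtml_text)
    pairs = []
    for a in root.iter("{%s}a" % XHTML_NS):
        href = a.get("href")
        if href is None:
            continue  # named anchors without href are not links
        # itertext() walks nested markup; split/join collapses whitespace.
        label = " ".join("".join(a.itertext()).split())
        pairs.append((label, href))
    return pairs


# Hypothetical sample: bold markup inside the link, plus a non-link anchor.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="http://www.google.com/">goo<b>gle</b>\n</a></p>'
    '<p><a name="top">no href here</a></p>'
    '</body></html>'
)
print(link_pairs(sample))
```

    Here the nested <b> is flattened into the label and the anchor without
    an href is skipped, which is the kind of decision you need to make up
    front.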

    Assuming that description of your problem is correct, you can do this by
    writing a program that uses an XML parser and the SAX or DOM APIs, or
    you can write an XSLT stylesheet such as the following. (WARNING: UNTESTED.)

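    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">

    <!-- The input is XHTML, so its elements are in the XHTML namespace.
    An unprefixed "a" in an XPath 1.0 expression would match nothing,
    hence the h: prefix bound above. -->

    <xsl:output method="text" version="1.0" encoding="UTF-8" />

    <xsl:template match="/">
    <xsl:apply-templates select="//h:a[@href]"/>
    </xsl:template>

    <!-- Emit "link text, space, URL" on one line per link. -->
    <xsl:template match="h:a[@href]">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="@href"/>
    <xsl:text>&#10;</xsl:text>
    </xsl:template>

    </xsl:stylesheet>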
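
    For comparison, the SAX route mentioned above might look roughly like
    this Python sketch built on the standard xml.sax module. The
    LinkHandler class and the sample input are hypothetical. It leaves
    namespace processing at its default (off), so it assumes the XHTML
    elements are written without a prefix, as in the file shown in the
    question.

```python
import xml.sax


class LinkHandler(xml.sax.ContentHandler):
    """Collect (label, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently open, if any
        self._text = []     # character data seen inside that link

    def startElement(self, name, attrs):
        if name == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def characters(self, content):
        if self._href is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "a" and self._href is not None:
            # Collapse runs of whitespace, including the trailing newline.
            label = " ".join("".join(self._text).split())
            self.links.append((label, self._href))
            self._href = None


# Hypothetical input modelled on the table cell in the question.
sample = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<p><a href="http://www.google.com/">google\n</a>  </p>'
    b'</body></html>'
)
handler = LinkHandler()
xml.sax.parseString(sample, handler)
for label, href in handler.links:
    print(label, href)
```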
    Joe Kesselman, Feb 8, 2006
    #2

  3. Guest

    Joe Kesselman wrote:
    > First, you need to define what portions of the document are "data". It
    > sounds like what you want is just the links; is that correct?


    I like your approach: asking what is "data" in this case. Yes, I'm
    after "just" the links for the example given. However, if I could get
    the pair of values, "google" and "http://www.google.com" that'd be
    stage two. For now, yes, I'd be happy with just the links.

    > If so, you need to search for <a> elements that have an href attribute,
    > pull out their content (which may be arbitrarily complex markup, please
    > remember -- rich text, images, etc. -- you need to define how much of
    > that you want to return and how you want it presented!), pull the value
    > of the href attribute, and report that pair of values.


    Not sure I follow you there, I'm not after the actual google page,
    simply the URL.

    > Assuming that description of your problem is correct, you can do this by
    > writing a program that uses an XML parser and the SAX or DOM APIs, or
    > you can write an XSLT stylesheet such as the following. (WARNING: UNTESTED.)

    ...

    Right, thanks for writing a transform, I get the gist. That's actually
    a big deal, I've read a tad about XSLT but it seemed arcane until just
    now.

    How do I get plain-text from the result, though? The result will be
    XML, but not XHTML, which is a step in the right direction. Would PHP
    or similar be required to parse the resulting XML to get a plain-text
    file with the link?


    -Thufir
    , Feb 8, 2006
    #3
  4. wrote:


    > Right, thanks for writing a transform, I get the gist. That's actually
    > a big deal, I've read a tad about XSLT but it seemed arcane until just
    > now.
    >
    > How do I get plain-text from the result, though?


    Joe's XSLT stylesheet has
    <xsl:output method="text" version="1.0" encoding="UTF-8" />
    so the output method is text and not XML.
    XSLT can produce XML or HTML or text depending on the output method.


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Feb 8, 2006
    #4
  5. > How do I get plain-text from the result, though?

    Note that the <xsl:output> statement in my example says to produce text
    output. That says the output should be a free-form text stream rather
    than XML. (Exactly how that differs from XML or HTML output modes is
    described in the XSLT spec, if you want the details.)

    I've said the text should be encoded as UTF-8; if you want the output in
    a different encoding, that too can be specified via xsl:output. (Not all
    processors support all encodings, admittedly.)


    As I said, this is just one possible approach. You could hand-code a
    solution almost as trivially, but I think I'm going to leave that as a
    homework assignment for now. <smile/>
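
    As a rough illustration of the hand-coded route, the Python sketch
    below writes already-extracted (label, href) pairs as one
    tab-separated line each, which is the field layout mysqlimport expects
    when no delimiter options are given. The function name, the links.txt
    file name, and the choice to collapse whitespace inside fields are
    assumptions for the example; the encoding parameter plays the same
    role as the encoding attribute of xsl:output.

```python
import os
import tempfile


def write_tab_separated(pairs, path, encoding="utf-8"):
    """Write (label, href) pairs as tab-separated, newline-terminated rows."""
    with open(path, "w", encoding=encoding, newline="\n") as out:
        for pair in pairs:
            fields = []
            for field in pair:
                # Collapse whitespace so stray tabs or newlines cannot be
                # mistaken for field or row separators.
                field = " ".join(field.split())
                # Double backslashes, the default escape character on import.
                fields.append(field.replace("\\", "\\\\"))
            out.write("\t".join(fields) + "\n")


# Hypothetical usage: write to a temporary directory and show the result.
demo_path = os.path.join(tempfile.mkdtemp(), "links.txt")
write_tab_separated([("google", "http://www.google.com/")], demo_path)
with open(demo_path, encoding="utf-8") as f:
    print(repr(f.read()))
```

    mysqlimport takes the table name from the file name with the extension
    stripped, so a file called links.txt would be loaded into a table
    named links.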
    Joe Kesselman, Feb 8, 2006
    #5
  6. Guest

    Joe Kesselman wrote:
    > > How do I get plain-text from the result, though?

    >
    > Note that the <xsl:output> statement in my example says to produce text
    > output.


    Pardon, I didn't notice that until you mentioned it--thanks!

    ...
    > I've said the text should be encoded as UTF-8; if you want the output in
    > a different encoding, that too can be specified via xsl:output. (Not all
    > processors support all encodings, admittedly.)
    >
    >
    > As I said, this is just one possible approach. You could hand-code a
    > solution almost as trivially, but I think I'm going to leave that as a
    > homework assignment for now. <smile/>


    Doh!


    Thanks for the help :)
    I have to install a JRE and Saxon (?), then I'll give it a go.


    -Thufir
    , Feb 8, 2006
    #6
