Hpricot and XPath don't work like they should?!

Discussion in 'Ruby' started by Phlip, Jul 29, 2007.

  1. Phlip

    Phlip Guest

    anansi wrote:

    > Phlip wrote:
    >> BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
    >> data feeds available somewhere?

    > Thanks for your hint about the id tags, but what do you mean by this?
    > RSS feeds? I'm not aware of any.


    That's what I mean - neither am I aware of any. But the TV guide services
    get their data from somewhere, and (under the wild assumption that TV
    programmers want you to find their shows and watch them) these feeds should
    not be proprietary.

    But note that electronic TV guides predate RSS...

    --
    Phlip
    http://www.oreilly.com/catalog/9780596510657/
    ^ assert_xpath
    http://tinyurl.com/23tlu5 <-- assert_raise_message
     
    Phlip, Jul 29, 2007
    #1

  2. Phlip

    anansi Guest

    Hi,
    I wanted to write a little console TV guide with Ruby and Hpricot. I
    installed the Firefox XPath Checker plugin and went to
    http://www.klack.de/TvEvening1.php3?HPTFRAME=/TvAtEvening.php3 . Then
    I checked the XPath of the channel fields, such as ZDF, and got:

    /html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

    So I tried to search the page for this path and print the matches, but I
    get no output. Here's the code:

    #!/usr/bin/env ruby

    $VERBOSE = true # enable Ruby warnings

    require 'hpricot'
    require 'net/http'

    url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3')
    # request_uri includes the ?HPTFRAME=... query string (url.path omits it)
    req = Net::HTTP::Get.new(url.request_uri)
    res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

    tv = Hpricot(res.body)
    # XPath reported by the XPath Checker plugin for the ZDF header cell
    path = "/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]"
    tv.search(path).each { |a| puts a }

    #eof


    Am I using hpricot in the wrong way? I thought it could handle xpaths?


    --
    greets

    one must still have chaos in oneself to be able to
    give birth to a dancing star
     
    anansi, Jul 29, 2007
    #2

  3. Phlip

    Phlip Guest

    anansi wrote:

    > Am I using hpricot in the wrong way? I thought it could handle xpaths?


    Briefly, I suspect Hpricot uses an XPath subset invented on the fly to
    permit querying into the HTML node space.

    (This isn't a bad thing; the alternative, REXML::XPath, cannot handle some
    well-formed XHTML [according to Tidy], and certainly can't handle
    traditional HTML.)

    (BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require
    'hpricot' refused to find it. This might indicate a broken .so file, so I
    switched to Windows.)

    The best way to use XPath is to locate tags by a unique id=''. (The page
    you used repeats its IDs as if they were CLASSes, so it's ill-formed. But
    that's not your problem here.)

    Don't use long XPath chains (even if an XPath visualizer provides them),
    because these locate things by incidental features that could change when
    you hit the page again. Table elements could come and go on the fly.
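
To make that concrete, here is a minimal sketch of a positional path breaking while an attribute-based one keeps working. It uses REXML (bundled with Ruby) rather than Hpricot so it runs without extra gems, which is why the sample is well-formed. The markup, the id="zdf" attribute, and the lookup helper are all made up for illustration and are not taken from the real page.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a listings page (not the real markup).
BEFORE = <<~HTML
  <html><body>
    <table>
      <tr><th>ARD</th><th id="zdf">ZDF</th></tr>
      <tr><td>Tagesschau</td><td>heute</td></tr>
    </table>
  </body></html>
HTML

# The same page after the site puts a banner table above the listings.
AFTER = BEFORE.sub('<body>', '<body><table><tr><th>Banner</th></tr></table>')

POSITIONAL = '/html/body/table[1]/tr[1]/th[2]' # depends on layout
BY_ID      = "//th[@id='zdf']"                 # depends only on the id

# Return the text of the first node matching path, or nil if nothing matches.
def lookup(html, path)
  node = REXML::XPath.first(REXML::Document.new(html), path)
  node && node.text
end

[POSITIONAL, BY_ID].each do |path|
  puts "#{path}: before=#{lookup(BEFORE, path).inspect} after=#{lookup(AFTER, path).inspect}"
end
```

The positional path finds ZDF only in the first layout, because table[1] becomes the banner once it is added. The id-based path finds it in both.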

    When I installed that XPath Checker (thanks for pointing it out!) and hit
    that page, your XPath selects ZDF, so this implicates Hpricot.
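
One other thing worth ruling out, as an assumption I have not checked against that page or against Hpricot itself: browsers add <tbody> elements to the DOM for tables whose source omits them, so a path copied from a browser tool can contain tbody steps that do not exist in the raw HTML a script downloads. The sketch below shows the effect with REXML (bundled with Ruby) on a made-up, well-formed fragment. RAW, BROWSER_PATH, SOURCE_PATH, and cells are hypothetical names.

```ruby
require 'rexml/document'

# Hypothetical raw markup as a server might send it: no <tbody> anywhere.
RAW = '<html><body><table><tr><th>ARD</th><th>ZDF</th></tr></table></body></html>'

# A path as a browser-side tool might report it, including the tbody step.
BROWSER_PATH = '/html/body/table/tbody/tr/th[2]'

# The same path with the tbody steps stripped out.
SOURCE_PATH = BROWSER_PATH.gsub('/tbody', '')

# Text of every node in RAW that matches the given path.
def cells(path)
  REXML::XPath.match(REXML::Document.new(RAW), path).map(&:text)
end

p cells(BROWSER_PATH)
p cells(SOURCE_PATH)
```

Against this fragment the browser-style path matches nothing, while the stripped path returns the ZDF cell. If the real page behaves the same way, dropping the tbody steps from the long path would be a quick test.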

    Let's find a workaround. If I want to hit, say, "Hotel Zack und Cody", I use
    Firebug's Inspect Element context menu feature, and see that blurb has a <td
    title="19:45 Hotel Zack und Cody">. So if I XPath for things like that, we
    get:

    //td[ @title ]

    That sweeps for every td with a title attribute. (The View XPath feature
    should have an option to find minimal and unique paths based on attributes,
    not long obsessive paths based on indices.)

    And that works in Hpricot, too, to select every cell with a title. Further
    poking and parsing should get you the raw TV listings.

    tv.search("//td[ @title ]").each{ |a| p a}
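
As a rough sketch of the "further poking and parsing", the title attributes can be split into a time and a show name. This again uses REXML on a made-up, well-formed fragment so it runs without Hpricot. Only the "19:45 Hotel Zack und Cody" title comes from the page; the second entry, the SAMPLE constant, and the listings helper are hypothetical, and the "HH:MM name" pattern is an assumption about how every title is laid out.

```ruby
require 'rexml/document'

# Hypothetical fragment; only the first title value was seen on the real page.
SAMPLE = <<~HTML
  <table>
    <tr>
      <td title="19:45 Hotel Zack und Cody">Hotel Zack...</td>
      <td>no title here</td>
      <td title="20:15 Tatort">Tatort</td>
    </tr>
  </table>
HTML

# Collect [time, show] pairs from every td that has a title attribute.
# Titles that do not start with a time are skipped.
def listings(html)
  doc = REXML::Document.new(html)
  pairs = REXML::XPath.match(doc, '//td[@title]').map do |td|
    m = td.attributes['title'].match(/\A(\d{1,2}:\d{2})\s+(.+)\z/)
    [m[1], m[2]] if m
  end
  pairs.compact
end

listings(SAMPLE).each { |time, show| puts "#{time}  #{show}" }
```

With Hpricot the same idea should carry over by iterating tv.search("//td[ @title ]") and reading each cell's title attribute, though I have not run that variant.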

    BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data
    feeds available somewhere?

    --
    Phlip
    http://www.oreilly.com/catalog/9780596510657/
    "Test Driven Ajax (on Rails)"
    assert_xpath, assert_javascript, & assert_ajax
     
    Phlip, Jul 29, 2007
    #3
  4. Phlip

    anansi Guest

    Phlip wrote:
    > BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
    > data feeds available somewhere?

    Thanks for your hint about the id tags, but what do you mean by this?
    RSS feeds? I'm not aware of any.



    --
    greets

    one must still have chaos in oneself to be able to
    give birth to a dancing star
     
    anansi, Jul 29, 2007
    #4
