Hpricot syntax different from XPath?

Discussion in 'Ruby' started by Celine, Dec 18, 2007.

  1. Celine

    Celine Guest

    Hi all

    I'm trying to parse a page with Hpricot in order to retrieve a value.

    I use XPather (a Firefox extension) to get the path of this value, but
    when I use that path with Hpricot it doesn't work. I have to change it
    before it works.

    Here's the path given by XPather:

    /html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/div[1]/table/tbody/tr[1]


    And here's what I have to write for Hpricot to understand it:

    /html/body/div/div/div/div/div/div/div/div/div/div/table/tr


    Could you explain why I have to write that?

    Thanks in advance
    Celine, Dec 18, 2007
    #1

  2. Chris Shea

    Chris Shea Guest

    On Dec 18, 2007 3:04 PM, Celine <> wrote:
    > Hi all
    >
    > I'm trying to parse a page with Hpricot in order to retrieve a value.
    >
    > I use Xpather (a firefox extension) in order to get the path of this
    > value. But when I use this path with Hpricot, it doesn't work. I have
    > to change it so that it works.
    >
    > Here's my path, given by Xpather :
    >
    > /html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/
    > div[1]/table/tbody/tr[1]
    >
    >
    > And here's what I have to write in order to make it understand by
    > Hpricot :
    >
    > /html/body/div/div/div/div/div/div/div/div/div/div/table/tr
    >
    >
    > Could you explain me why I have to write that ?


    Well, it depends. It'd be helpful to see the page you're working with.
    You might want to try asking the Hpricot mailing list as well (To
    join: Send a message to Cc:
    ).

    Chris
    Chris Shea, Dec 18, 2007
    #2

  3. Thomas Wieczorek

    On Dec 18, 2007 11:04 PM, Celine <> wrote:
    > Hi all
    >
    > I'm trying to parse a page with Hpricot in order to retrieve a value.
    >
    > I use Xpather (a firefox extension) in order to get the path of this
    > value. But when I use this path with Hpricot, it doesn't work. I have
    > to change it so that it works.
    >
    > Here's my path, given by Xpather :
    >
    > /html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/
    > div[1]/table/tbody/tr[1]
    >


    Hpricot doesn't support the whole XPath syntax. You can write a small
    function that translates XPath expressions with brackets into Hpricot
    expressions. I wrote one, but my SVN server is down right now and I
    can't get at it. Reply if you still need it.

    Also, Firefox inserts some missing HTML tags into the DOM. I ran into
    this when I had to write a little script: Firefox adds <tbody> inside
    <table>, and probably a few other things that I didn't track down. You
    can see the difference when you download the page with open-uri. Not
    many pages include <tbody> in their source.
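
    As a rough illustration of the <tbody> point, here is a minimal sketch
    of a helper that drops tbody steps from a path copied out of a browser
    tool. The name strip_tbody is made up for this example; it is not part
    of Hpricot and is not the unpublished function mentioned above. It only
    makes sense when the downloaded source really has no <tbody>; if the
    page does include one, stripping it would break the path.

```ruby
# Hypothetical helper (not part of Hpricot): remove "tbody" steps that a
# browser-based tool reported but that may be absent from the raw HTML.
def strip_tbody(xpath)
  xpath
    .split('/')
    .reject { |step| step.match?(/\Atbody(\[\d+\])?\z/) }
    .join('/')
end

puts strip_tbody('/html/body/div[1]/table/tbody/tr[1]')
# prints /html/body/div[1]/table/tr[1]
```

    Comparing the stripped path against the page fetched with open-uri is
    still needed, since other browser-inserted elements are not handled.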
    Thomas Wieczorek, Dec 18, 2007
    #3
  4. Celine

    Celine Guest

    On Dec 18, 11:21 pm, Chris Shea <-rack.org> wrote:

    >
    > Well, it depends. It'd be helpful to see the page you're working with.
    > You might want to try asking the Hpricot mailing list as well (To
    > join: Send a message to Cc:
    > ).
    >
    > Chris


    Hi Chris, thanks for your answer.
    Here is the page I'm working with: http://finance.yahoo.com
    I want to retrieve the value of the Nasdaq (top left of the page).
    I sent the same message to the Hpricot ML this afternoon; no answer
    so far.
    Celine, Dec 18, 2007
    #4
  5. Celine

    Celine Guest

    On Dec 18, 11:25 pm, Thomas Wieczorek <>
    wrote:
    > On Dec 18, 2007 11:04 PM, Celine <> wrote:
    >
    > > Hi all

    >
    > > I'm trying to parse a page with Hpricot in order to retrieve a value.

    >
    > > I use Xpather (a firefox extension) in order to get the path of this
    > > value. But when I use this path with Hpricot, it doesn't work. I have
    > > to change it so that it works.

    >
    > > Here's my path, given by Xpather :

    >
    > > /html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/
    > > div[1]/table/tbody/tr[1]

    >
    > HPricot doesn't include the whole XPath syntax. You can write a little
    > function which translates XPath expressions with brackets to HPricot
    > expressions. I wrote a function for that, but my SVN is right now down
    > and I can't get it. Drop an answer if you still need it
    > Firefox includes some missing HTML tags. I ran in it, when I had to
    > write a little script. <tbody> is added in <table>, probably some more
    > things, but I didn't find them. You can see the difference when you
    > download the page with open-uri. Not many pages add <tbody>.


    Hi Thomas

    I'm very interested in your function. Do you know where I can find the
    differences between XPath syntax and Hpricot syntax? What reference
    did you use to write your function?

    Celine
    Celine, Dec 18, 2007
    #5
  6. Chris Shea

    Chris Shea Guest

    On Dec 18, 2007 3:34 PM, Celine <> wrote:
    > On Dec 18, 11:21 pm, Chris Shea <-rack.org> wrote:
    >
    > >
    > > Well, it depends. It'd be helpful to see the page you're working with.
    > > You might want to try asking the Hpricot mailing list as well (To
    > > join: Send a message to Cc:
    > > ).
    > >
    > > Chris

    >
    > Hi Chris, thanks for your answer
    > Here is the page I'm working with : http://finance.yahoo.com
    > I want to retrieve value of Nasdaq (up left of the page).


    I see. It's pretty easy to get using element attributes, which is
    resilient to page changes. If Yahoo decides to add a div in the
    hierarchy, or add a new exchange, this shouldn't suddenly fail:

    # assuming doc is the Hpricot object for finance.yahoo.com
    doc.at('tr[@title="Nasdaq"]/td[2]')
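
    To illustrate why an attribute-based lookup survives layout changes,
    here is a small self-contained sketch. It uses REXML, which ships with
    Ruby, rather than Hpricot, so the leading // is REXML's way of
    searching all descendants. The markup and its values are invented
    sample data, not Yahoo's real page. The same expression finds the cell
    before and after an extra wrapper div is added.

```ruby
require 'rexml/document'

# Made-up sample markup; only the title attribute and cell order matter here.
ROWS = '<table>' \
       '<tr title="Dow"><td>Dow</td><td>1.00</td></tr>' \
       '<tr title="Nasdaq"><td>Nasdaq</td><td>2.00</td></tr>' \
       '</table>'

ORIGINAL = "<html><body><div>#{ROWS}</div></body></html>"
# Same content with one more wrapper div, as if the site changed its layout.
CHANGED  = "<html><body><div><div>#{ROWS}</div></div></body></html>"

# Return the text of the second cell in the row whose title is "Nasdaq".
def nasdaq_cell(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.first(doc, '//tr[@title="Nasdaq"]/td[2]').text
end

puts nasdaq_cell(ORIGINAL) # prints 2.00
puts nasdaq_cell(CHANGED)  # prints 2.00
```

    A full positional path would need editing after the layout change,
    while the attribute test does not depend on how deep the row sits.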

    Looking at the page source, the span element that contains the value
    actually has an id (yfs_l10_^ixic), but it doesn't look stable, does
    it?

    > I sent the same message on Hpricot ML this afternoon, actually no
    > answer.


    I'm on the Hpricot ML and never saw it. Maybe a hiccup somewhere?

    HTH,
    Chris
    Chris Shea, Dec 18, 2007
    #6
  7. Chris Shea

    Chris Shea Guest

    > > On Dec 18, 11:21 pm, Chris Shea <-rack.org> wrote:
    > Looking at the page source, the span element that contains the value
    > actually has an id (yfs_l10_^ixic), but it doesn't look stable, does
    > it?


    I take that back. That id is almost definitely stable.

    doc.at('span[@id="yfs_l10_^ixic"]')

    Chris
    Chris Shea, Dec 18, 2007
    #7
  8. Thomas Wieczorek

    On Dec 18, 2007 11:39 PM, Celine <> wrote:
    > On Dec 18, 11:25 pm, Thomas Wieczorek <>
    > wrote:
    >


    >
    > I'm very interested in your function.
    >


    I'll post it as soon as the SVN server is up again.

    > Do you know where I can find
    > differences between XPath syntax and Hpricot syntax ? What reference
    > did you use to write your function ?
    >


    I used http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions
    and related pages to get started. I found the table/tbody issue
    because I was stuck and thought I had done something wrong, until I
    downloaded the page with open-uri instead of viewing it in Firefox.
    Thomas Wieczorek, Dec 18, 2007
    #8
  9. Vitor Peres

    Vitor Peres Guest


    On Dec 18, 2007 8:34 PM, Celine <> wrote:

    > Hi Chris, thanks for your answer
    > Here is the page I'm working with : http://finance.yahoo.com
    > I want to retrieve value of Nasdaq (up left of the page).
    > I sent the same message on Hpricot ML this afternoon, actually no
    > answer.
    >
    >

    Hi, Celine.

    I know it's not nearly as fun as screen-scraping, but you can get the value
    for Nasdaq (and many other quotes) on Yahoo! Finance by querying the right
    URL for the CSV. The current value can be obtained by fetching:

    http://download.finance.yahoo.com/d/quotes.csv?s=[name]&f=sl1d1t1c1ohgv&e=.csv

    You just have to replace [name] with %5EIXIC for Nasdaq. Historical data is
    available (closings only) at:

    http://ichart.finance.yahoo.com/table.csv?&s=[name]&a=[start_month]&b=[start_day]&c=[start_year]&d=[end_month]&e=[end_day]&f=[end_year]&g=d&ignore=.csv

    Just replace [name] with the index or stock you wish to query and each
    bracketed date info with integers.

    I've replied to a topic before that involved Yahoo! Finance, but it was
    specifically about searching for a symbol. Since that isn't your case,
    here's hoping that fetching it directly will suffice.
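
    A minimal sketch of building the current-quote URL follows. The host,
    path, and f= field string are copied from this post and the service
    may no longer respond; the helper name quote_csv_url is made up. The
    only real work is percent-encoding the symbol, which is how ^IXIC
    becomes %5EIXIC.

```ruby
require 'uri'

# Base URL as given in the post; treat its availability as an assumption.
QUOTES_BASE = 'http://download.finance.yahoo.com/d/quotes.csv'

# Hypothetical helper: build the CSV URL for one symbol, encoding characters
# such as "^" that are not safe in a query string.
def quote_csv_url(symbol)
  encoded = URI.encode_www_form_component(symbol)
  "#{QUOTES_BASE}?s=#{encoded}&f=sl1d1t1c1ohgv&e=.csv"
end

puts quote_csv_url('^IXIC')
# prints http://download.finance.yahoo.com/d/quotes.csv?s=%5EIXIC&f=sl1d1t1c1ohgv&e=.csv
```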


    --
    Vitor Peres (dodecaphonic)
    ------------------------------------
    http://twitter.com/dodecaphonic
    Vitor Peres, Dec 19, 2007
    #9
  10. Celine

    Celine Guest

    On Dec 18, 23:54, Chris Shea <-rack.org> wrote:
    > > > On Dec 18, 11:21 pm, Chris Shea <-rack.org> wrote:

    > > Looking at the page source, the span element that contains the value
    > > actually has an id (yfs_l10_^ixic), but it doesn't look stable, does
    > > it?

    >
    > I take that back. That id is almost definitely stable.
    >
    > doc.at('span[@id="yfs_l10_^ixic"]')
    >
    > Chris


    Yes, thanks, it works.
    There's something I can't understand: in the XPath expression I
    posted earlier, when a node has several child DIVs, I access them with
    an index (div[2]...), but in the Hpricot syntax the DIVs aren't accessed
    by index. So what trick does Hpricot use to locate the right div?
    Celine, Dec 19, 2007
    #10
  11. Celine

    Celine Guest

    On Dec 19, 10:53, Vitor Peres <> wrote:
    >
    > I know it's not nearly as fun as screen-scraping, but you can get the value
    > for Nasdaq (and many other quotes) on Yahoo! Finance by querying the right
    > URL for the CSV. The current value can be obtained by fetching:
    >
    > http://download.finance.yahoo.com/d/quotes.csv?s=[name]&f=sl1d1t1c1ohgv&e=.csv
    >
    > You just have replace [name] with %5EIXIC for Nasdaq. Historical data is
    > available (closings only) at:
    >
    > http://ichart.finance.yahoo.com/table.csv?&s=[<http://ichart.finance.yahoo.com/table.csv?&s=%5Bquote>name]&a=[start
    > month]&b=[start_day]&c=[start
    > _year]&d=[end_month]&e=[end_day]&f=[end_year]&g=d&ignore=.csv
    >
    > Just replace [name] with the index or stock you wish to query and each
    > bracketed date info with integers.
    >
    > I've replied to a topic before that involved Yahoo! Finance, but it was
    > specifically about searching for a symbol. Since it's not your case, here's
    > hoping that directly fetching it will suffice.
    >
    > --
    > Vitor Peres (dodecaphonic)
    > ------------------------------------http://twitter.com/dodecaphonic


    Hi Vitor, thank you very much :)
    But, as you said, it isn't as much fun, is it? ;)
    (I didn't know that trick, though, thanks)
    Celine, Dec 19, 2007
    #11
  12. Chris Shea

    Chris Shea Guest

    On Dec 19, 2007 2:45 PM, Celine <> wrote:
    > Yes, thanks, it runs.
    > There's something I can't understand : in the Xpath expression I
    > posted later, when a node has several child DIVs, I access them with
    > an index (div[2]...), but in Hpricot syntax, DIVs aren't accessed
    > using an index. So, what trick Hpricot uses to locate "the good" div ?


    I'm not sure I understand. Hpricot certainly can access elements that way:

    doc = Hpricot('<body><div>one</div><div>two</div></body>')

    doc.at('body/div[1]').inner_text # => "one"
    doc.at('body/div[2]').inner_text # => "two"
    doc.at('body/div:eq(0)').inner_text # => "one"
    doc.at('body/div:eq(1)').inner_text # => "two"
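
    The four calls above suggest that a 1-based XPath index [n] and
    Hpricot's :eq(n-1) select the same element. A minimal sketch of a
    converter between the two notations follows. The function name is made
    up, it is pure string rewriting that never calls Hpricot, and it
    handles only plain numeric predicates, not attribute tests such as
    [@id="..."].

```ruby
# Hypothetical helper: rewrite 1-based "[n]" predicates as 0-based ":eq(n-1)".
def xpath_index_to_eq(path)
  path.gsub(/\[(\d+)\]/) { ":eq(#{Regexp.last_match(1).to_i - 1})" }
end

puts xpath_index_to_eq('body/div[1]') # prints body/div:eq(0)
puts xpath_index_to_eq('body/div[2]') # prints body/div:eq(1)
```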

    Chris
    Chris Shea, Dec 19, 2007
    #12
  13. Celine

    Celine Guest

    On Dec 19, 23:30, Chris Shea <-rack.org> wrote:
    >
    > I'm not sure I understand. Hpricot certainly can access elements that way:
    >
    > doc = Hpricot('<body><div>one</div><div>two</div></body>')
    >
    > doc.at('body/div[1]').inner_text # => "one"
    > doc.at('body/div[2]').inner_text # => "two"
    > doc.at('body/div:eq(0)').inner_text # => "one"
    > doc.at('body/div:eq(1)').inner_text # => "two"
    >
    > Chris


    Look:

    doc = Hpricot(open("http://finance.yahoo.com"))

    (XPath syntax with DIVs indexed, given by XPather)

    doc.at('html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/div[1]/table/tr[3]/td[2]/span').inner_text
    => NoMethodError: undefined method `inner_text' for nil:NilClass


    (without indices for DIVs)

    doc.at('html/body/div/div/div/div/div/div/div/div/div/div/table/tr[3]/td[2]/span').inner_text
    => "2,601.01"

    So, why?

    Celine
    Celine, Dec 19, 2007
    #13
  14. Chris Shea

    Chris Shea Guest

    On Dec 19, 2007 4:10 PM, Celine <> wrote:
    > Look :
    >
    > doc = Hpricot(open("http://finance.yahoo.com"))
    >
    > (Xpath syntax with DIVs indexed, given by XPather)
    >
    > doc.at('html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/
    > div[1]/div[1]/table/tr[3]/td[2]/span').inner_text
    > => NoMethodError: undefined method `inner_text' for nil:NilClass
    >
    >
    > (without indices for DIVs)
    >
    > doc.at('html/body/div/div/div/div/div/div/div/div/div/div/table/tr[3]/
    > td[2]/span').inner_text
    > => "2,601.01"
    >
    > So, why ?


    At some point the path you're using fails. That's why. You could check
    node by node, going one level lower each time to see where you start
    getting nil from your search. And then you could see what you need to
    do to fix the path. That's what I just did:

    Now you look:

    XPATH = 'html/body/div[1]/div[2]/div[2]/div[2]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/table/tr[3]/td[2]/span'
    doc = Hpricot(open('http://finance.yahoo.com/'))
    doc.at(XPATH).inner_text # => "2,601.01"

    Tools like Xpather and Firebug can give you paths, but they're not
    going to work all the time. But, as I said before, there's a span with
    an id attribute that lets you pluck the data without worrying about a
    full path, so this is sort of moot.
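
    The node-by-node check described above can be sketched as a small
    loop. This is an illustration under assumptions, not the exact steps
    used in this post: first_failing_prefix is a made-up name, and it
    relies only on the document object having an at method that returns
    nil when nothing matches, as Hpricot's does in this thread. FakeDoc is
    a stand-in so the sketch runs without Hpricot or network access.

```ruby
# Hypothetical helper: try ever-longer prefixes of the path and return the
# first one for which doc.at finds nothing, or nil if every prefix matches.
def first_failing_prefix(doc, path)
  steps = path.split('/').reject(&:empty?)
  (1..steps.length).each do |count|
    prefix = steps.first(count).join('/')
    return prefix if doc.at(prefix).nil?
  end
  nil
end

# Stand-in for a parsed document: "matches" only the paths it was given.
FakeDoc = Struct.new(:known_paths) do
  def at(path)
    known_paths.include?(path) ? path : nil
  end
end

doc = FakeDoc.new(['html', 'html/body', 'html/body/div[1]'])
puts first_failing_prefix(doc, 'html/body/div[1]/div[2]/table')
# prints html/body/div[1]/div[2]
```

    With a real document, the returned prefix points at the step whose
    index or element name needs adjusting.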

    HTH,
    Chris
    Chris Shea, Dec 19, 2007
    #14