Ruby (and programming) beginner's question regarding 'NoMethodError' while using Hpricot

Sandeep Guria

Hi!
I am trying to build a web scraper that fetches fundamental data for
listed companies from finance websites.
Let me show an example.


"<tbody>
<tr><td>PE ratio</td><td class="numericalColumn">
16.83</td><td>14/02/11</td></tr>

<tr><td>EPS (Rs)</td><td class="numericalColumn">
10.59</td><td>Mar, 10</td></tr>
<tr><td>Sales (Rs crore)</td><td class="numericalColumn">
13,963.81</td><td>Dec, 10</td></tr>
<tr><td>Face Value (Rs)</td><td
class="numericalColumn">10</td><td>&nbsp;</td></tr>
<tr><td>Net profit margin (%)</td><td class="numericalColumn">
17.72</td><td>Mar, 10</td></tr>

<tr><td>Last dividend (%)</td><td
class="numericalColumn">30</td><td>18/01/11</td></tr>
<tr><td>Return on average equity</td><td
class="numericalColumn">13.69</td><td>Mar, 10</td></tr>
</tbody>
"
I want to get the value '16.83' from the above HTML, so this is what I do:
I parse the HTML file and save it into doc.
I search doc for the inner text 'PE ratio'.
Then I choose the next element using next_sibling.
But I am getting an error:
'C:\Users\Administrator\Documents>ruby scraper.rb scraper.rb:9:in
`<main>': undefined method `next_sibling' for #<Hpricot::Elements[{elem
<td> "PE ratio" </td>}]> (NoMethodError)'

I'll be grateful for any suggestions.
Sorry about the formatting of the HTML text!

Attachments:
http://www.ruby-forum.com/attachment/5911/scraper.rb
 
Estanislau Trepat

[Note: parts of this message were removed to make it a legal post.]

Hi Sandeep.

The #search method returns an Hpricot::Elements object, which is somewhat
similar to an array. You should call #next_sibling on one of the elements
inside that collection, which are in fact Hpricot::Elem objects. For
instance:

# perform search
elements = doc.search('td[text()="PE ratio"]')
=> #<Hpricot::Elements[{elem <td> "PE ratio" </td>}]>

# get the targeted cell
cell = elements*.first.*next_sibling
=> {elem <td class="numericalColumn"> " 16.83" </td>}

# printout raw value
puts cell.to_plain_text
16.83
=> nil

Regards.

--
Estanislau Trepat


2011/2/15 Sandeep Guria said:
 
Sandeep Guria

Thank you very much, Estanislau!

If I am not bothering you too much: why wasn't 'next_sibling' working
in my code? And what are those '*' for in here?
cell = elements*.first.*next_sibling

They were giving an error: 'syntax error, unexpected '.''.
I removed them and now it's working fine.

One more thing I need to ask, if I may use this thread.
I have this web page: 'http://money.rediff.com/companies/all/1-200'.
At the bottom there is a link ('Next') to the next page of the list.
This link is driven by JavaScript.
After I finish scraping this page, I want to go to the next page
through the 'Next' link. Is there any way to do it?

Note: a cruder method would be to go to every page of the list by its
URL and scrape each one (the total number of pages would be 17).

Any suggestions are welcome!
Thank you!
Sandeep Guria
 
Estanislau Trepat

Hi Sandeep.

The #next_sibling method was not working because you were calling it on the
whole collection (in fact, an Hpricot::Elements object) and not on one of
the elements inside it. That's why we had to use elements.first to get the
first node that met our search criteria and then call #next_sibling on that
node. The #next_sibling method is only defined on each of those nodes, not on
the collection itself.
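
To make the collection-versus-element point concrete, here is a minimal plain-Ruby sketch that runs without Hpricot. The Cell struct and its next_sibling attribute are hypothetical stand-ins for Hpricot::Elem objects, and the array stands in for the Hpricot::Elements collection; it only illustrates why the call fails on the collection, not how Hpricot is implemented.

```ruby
# Plain-Ruby analogy (no Hpricot required): Cell and its next_sibling
# attribute are hypothetical stand-ins for Hpricot::Elem objects.
Cell = Struct.new(:text, :next_sibling)

value_cell = Cell.new(" 16.83", nil)
label_cell = Cell.new("PE ratio", value_cell)

# A search returns a collection, even when only one element matches.
matches = [label_cell]

# Calling the element method on the collection fails, as in the thread.
collection_error = nil
begin
  matches.next_sibling
rescue NoMethodError => e
  collection_error = e.class
end

# Taking one element out of the collection first works.
pe_ratio = matches.first.next_sibling.text.strip

puts collection_error  # NoMethodError
puts pe_ratio          # 16.83
```

The same pattern applies to the real search result: pick one element (for example with first) and only then ask for its sibling.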

I apologize for the * characters; I think I was trying to put that part in
bold and my email client mangled the formatting.

For the problem you describe, maybe you could try using Watir (http://watir.com/).
It drives a real web browser, and can thus handle JavaScript links.

If you allow me a suggestion: looking at the page you're trying to
scrape and the structure of its URL, I'd suggest extracting the
total number of results from the text at the bottom, which reads "Showing 1 - 200 of
3529". If you extract that last number (the total number of results), then
you could point your scraping script to
http://money.rediff.com/companies/all/1-3529 without needing to follow
JavaScript links.
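
A rough sketch of that idea in plain Ruby, assuming the footer text has already been pulled out of the page as a string. The helper name total_results and the BASE_URL constant are made up for illustration, and the regular expression assumes the footer keeps the "of <number>" wording quoted above.

```ruby
# Footer text as quoted in the thread; on a live page it would come
# from the parsed document instead of a literal string.
footer = "Showing 1 - 200 of 3529"

# Hypothetical helper: pull the number that follows "of" out of the
# footer text. Returns nil when the text does not match.
def total_results(footer_text)
  match = footer_text.match(/of\s+([\d,]+)/)
  return nil unless match
  match[1].delete(",").to_i
end

BASE_URL = "http://money.rediff.com/companies/all"

total = total_results(footer)
# Build a single URL covering the whole list, as suggested above.
full_list_url = total ? "#{BASE_URL}/1-#{total}" : nil

puts full_list_url  # http://money.rediff.com/companies/all/1-3529
```

Whether the site really serves the whole list for such a range is something to check by hand before relying on it.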

Hope it helps.

Regards.
 
Sandeep Guria

Hi!
Thanks for the link, Estanislau.
It certainly made my work a lot easier.

I uploaded my 'almost' final program. It looks up
some data for each company on the BSE and writes it to an Excel sheet.

First I collect all the links and save them in an array 'x'.
Then I collect the data that I need and save it to a spreadsheet I
defined earlier in the program.
Lastly I write the spreadsheet to an Excel file.
I can control how many companies I process by changing the number of
iterations (in this case 8).
The program runs fine if the number of iterations is less than 6;
otherwise, I get an error:
'links.rb:34:in `block in <main>': undefined method `next_sibling' for
nil:NilClass (NoMethodError)
from links.rb:28:in `times'
from links.rb:28:in `<main>''

I'm puzzled (like always!).
All suggestions are welcome!
Thank you
Sandeep Guria

Attachments:
http://www.ruby-forum.com/attachment/5928/links.rb
 
Sandeep Guria

Hi!
I tried to find the class of the object on which I am calling
next_sibling, using the code below (iterating 25 times):
sheet1[num,1]=doc.search('td[text()="PE ratio"]').first
puts num # num is |num|
puts sheet1[num,1].class
It turns out it gives NilClass for the 13th, 15th, 16th and 24th
iterations.
So 'next_sibling' gives a NoMethodError.

Please help me with this problem.

Attachments:
http://www.ruby-forum.com/attachment/5964/scraper.rb
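
One plausible reading of the NilClass result is that some company pages simply have no cell whose text is exactly 'PE ratio', so the search comes back empty and first returns nil. A minimal plain-Ruby sketch of guarding against that follows; Td, find_label and value_for are hypothetical names standing in for the Hpricot element and the search call, not part of Hpricot or of the attached script.

```ruby
# Td is a hypothetical stand-in for an Hpricot element; find_label
# stands in for doc.search('td[text()="PE ratio"]').first.
Td = Struct.new(:text, :next_sibling)

def find_label(cells, label)
  cells.find { |cell| cell.text == label }
end

# Returns the stripped value next to the label, or nil when the page
# has no such label (or no cell after it).
def value_for(cells, label)
  label_cell = find_label(cells, label)
  return nil if label_cell.nil? || label_cell.next_sibling.nil?
  label_cell.next_sibling.text.strip
end

page_with_ratio    = [Td.new("PE ratio", Td.new("\n16.83", nil))]
page_without_ratio = [Td.new("EPS (Rs)", Td.new("\n10.59", nil))]

[page_with_ratio, page_without_ratio].each_with_index do |cells, num|
  value = value_for(cells, "PE ratio")
  # Write a placeholder instead of crashing when the label is missing.
  puts "#{num}: #{value || 'n/a'}"
end
```

With a guard like this the loop keeps going for the pages that lack the label, and the failing pages can then be inspected one by one to see what their HTML actually contains.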
 
