Using Xpath to parse a Yahoo Finance page

Jason Hsu · Dec 2, 2012

I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.

Here is a special test script I set up to work on this issue:

import urllib
import lxml.html

# Smartmoney page: the XPath expects the label in <th><div>...</div></th>
url_local1 = "http://www.smartmoney.com/quote/FAS...=YB&isFinprint=1&framework.view=smi_emptyView"
result1 = urllib.urlopen(url_local1)
element_html1 = result1.read()
doc1 = lxml.html.document_fromstring(element_html1)
list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print list_row1

# Yahoo Finance page: the XPath expects the label in <td><strong>...</strong></td>
url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
result2 = urllib.urlopen(url_local2)
element_html2 = result2.read()
doc2 = lxml.html.document_fromstring(element_html2)
list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.

MRAB · Dec 2, 2012

The problem is that you're asking it to look for an exact match.

If you look at the HTML itself, you'll see that there's whitespace
around the "Total Assets" part.

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
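To see the difference without fetching the live page, here is a minimal sketch against a made-up snippet. SAMPLE_HTML and the values 111 and 222 are hypothetical stand-ins, not the real Yahoo Finance markup or data; the only property assumed is the whitespace around the label. The normalize-space() variant is not from this thread; it is shown as a stricter alternative that still requires the whole label to match. The sketch is written for Python 3.

```python
import lxml.html

# Hypothetical stand-in for the page: the label inside <strong> is
# surrounded by whitespace, and the values are invented.
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>111</strong></td>
    <td><strong>222</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# Exact comparison: the text node is "\n      Total Assets\n    ",
# which is not equal to "Total Assets", so nothing matches.
exact = doc.xpath(
    './/td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')

# Substring test: the surrounding whitespace no longer matters.
loose = doc.xpath(
    './/td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

# Alternative (not from the thread): trim and collapse whitespace first,
# then compare exactly.
normalized = doc.xpath(
    './/td[strong[normalize-space(text())="Total Assets"]]/following-sibling::td/strong/text()')

print(exact)
print(loose)
print(normalized)
```

Note that contains() would also match a longer label that merely includes "Total Assets", which is why the normalize-space() form may be preferable when the page has several similar row labels.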

Jason Hsu · Dec 2, 2012

Thanks, MRAB. Your suggestion works!

Stefan Behnel · Dec 3, 2012

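In reply to MRAB, 03.12.2012 03:25:

In the original script, the three lines that fetch and parse each page (urllib.urlopen(), read() and lxml.html.document_fromstring()) are unnecessarily complicated code. Just use

doc1 = lxml.html.parse(url_local1)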
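A small sketch of what lxml.html.parse() gives back, assuming a made-up page held in memory rather than a real URL. SAMPLE_HTML and its values are hypothetical; io.BytesIO is used only so the example runs offline, since parse() also accepts file-like objects. The result is an ElementTree rather than an element, and the same relative XPath can be evaluated on the tree or on its root element. Written for Python 3.

```python
import io
import lxml.html

# Hypothetical page content standing in for a downloaded document.
SAMPLE_HTML = (
    b"<html><body><table><tr>"
    b"<th><div>Total Assets</div></th><td>111</td><td>222</td>"
    b"</tr></table></body></html>"
)

# parse() reads from a filename, URL or file-like object in one step,
# replacing the separate open / read / document_fromstring calls.
tree = lxml.html.parse(io.BytesIO(SAMPLE_HTML))

# parse() returns an ElementTree; getroot() gives the <html> element.
root = tree.getroot()

xpath_expr = './/th[div[text()="Total Assets"]]/following-sibling::td/text()'
from_tree = tree.xpath(xpath_expr)
from_root = root.xpath(xpath_expr)

print(root.tag)
print(from_tree)
print(from_root)
```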

Something like contains(text(), "Total Assets") is better expressed as
contains(., "Total Assets"), because "." considers the complete text content
of the element, whereas text() passed to contains() uses only the first text node.
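A minimal sketch of a case where the two forms differ, assuming a made-up snippet in which the label is split by an inline element. SAMPLE_HTML and the value 111 are hypothetical, not taken from either site. Written for Python 3.

```python
import lxml.html

# Hypothetical markup: "Assets" is wrapped in a <span>, so the only text
# node that is a direct child of <strong> is "Total ".
SAMPLE_HTML = """
<table><tr>
  <td><strong>Total <span>Assets</span></strong></td>
  <td><strong>111</strong></td>
</tr></table>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# text() selects only direct text-node children of <strong>; contains()
# then tests the first of them, "Total ", and finds no match.
via_text = doc.xpath(
    './/td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

# "." is the string value of <strong>: all descendant text joined
# together, here "Total Assets", so the row is found.
via_dot = doc.xpath(
    './/td[strong[contains(.,"Total Assets")]]/following-sibling::td/strong/text()')

print(via_text)
print(via_dot)
```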

Stefan
