Using XPath to parse a Yahoo Finance page

Jason Hsu

I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.

Here is a special test script I set up to work on this issue:

import urllib
import lxml
import lxml.html

url_local1 = "http://www.smartmoney.com/quote/FAS...=YB&isFinprint=1&framework.view=smi_emptyView"
result1 = urllib.urlopen(url_local1)
element_html1 = result1.read()
doc1 = lxml.html.document_fromstring (element_html1)
list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
print list_row1

url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
result2 = urllib.urlopen(url_local2)
element_html2 = result2.read()
doc2 = lxml.html.document_fromstring (element_html2)
list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
print list_row2

I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
 
MRAB

The problem is that you're asking it to look for an exact match.

If you look at the HTML itself, you'll see that there's whitespace
around the "Total Assets" part.

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
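
To see the difference on a small scale, here is a minimal sketch that runs without network access. The HTML fragment and the figures in it are made up for illustration; the real Yahoo Finance markup may differ. It also shows normalize-space() as a stricter alternative to contains(), which is an addition not mentioned above:

```python
import lxml.html

# Hypothetical fragment: the label is padded with whitespace,
# as described for the Yahoo Finance page. Figures are invented.
SAMPLE = u"""
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>100</strong></td>
    <td><strong>90</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# Exact comparison: no match, because the text node includes the padding.
exact = doc.xpath(u'.//td[strong[text()="Total Assets"]]'
                  u'/following-sibling::td/strong/text()')

# Substring test: matches regardless of what surrounds the label.
loose = doc.xpath(u'.//td[strong[contains(text(),"Total Assets")]]'
                  u'/following-sibling::td/strong/text()')

# Exact comparison after trimming and collapsing whitespace.
trimmed = doc.xpath(u'.//td[strong[normalize-space(.)="Total Assets"]]'
                    u'/following-sibling::td/strong/text()')

print(exact)
print(loose)
print(trimmed)
```

Note that contains() would also match a longer label that merely includes "Total Assets", so the normalize-space() form may be preferable when the label has to match exactly.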
 
Stefan Behnel

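Replying to MRAB, 03.12.2012 03:25, who quoted the original script:

The three lines that fetch, read and parse each page (urllib.urlopen(), read() and document_fromstring()) are unnecessarily complicated code. Just use

doc1 = lxml.html.parse(url_local1)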
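
As a rough sketch of what parse() gives back: lxml.html.parse() accepts a filename, a URL or a file-like object, so the example below feeds it an in-memory file with a made-up page to stay runnable offline. The page content and figures are invented. The result is an ElementTree rather than an element; xpath() can be called on either.

```python
import io
import lxml.html

# Hypothetical page standing in for the downloaded HTML.
PAGE = (b"<html><body><table><tr>"
        b"<th><div>Total Assets</div></th><td>100</td><td>90</td>"
        b"</tr></table></body></html>")

# parse() returns an ElementTree; getroot() gives the <html> element.
tree = lxml.html.parse(io.BytesIO(PAGE))
root = tree.getroot()

query = u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()'

from_tree = tree.xpath(query)  # evaluated on the tree
from_root = root.xpath(query)  # evaluated on the root element

print(from_tree)
print(from_root)
```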

Something like "contains(text(),"Total Assets")" is better expressed as
"contains(.,"Total Assets")" because it considers the complete text content
instead of just one text node.
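
A small sketch of that difference, using a made-up fragment in which nested markup splits the label. In XPath 1.0, when the node set text() is passed to contains(), only its first node is converted to a string, whereas . uses the element's complete string value:

```python
import lxml.html

# Hypothetical markup: "Total" sits inside <em>, so the only text node
# that is a direct child of <strong> is " Assets".
SAMPLE = (u'<table><tr>'
          u'<td><strong><em>Total</em> Assets</strong></td>'
          u'<td><strong>100</strong></td>'
          u'</tr></table>')

doc = lxml.html.document_fromstring(SAMPLE)

# Looks only at the first direct text node of <strong>: no match.
by_text_node = doc.xpath(
    u'.//td[strong[contains(text(),"Total Assets")]]'
    u'/following-sibling::td/strong/text()')

# Looks at the concatenated text of <strong> and its descendants: match.
by_string_value = doc.xpath(
    u'.//td[strong[contains(.,"Total Assets")]]'
    u'/following-sibling::td/strong/text()')

print(by_text_node)
print(by_string_value)
```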

Stefan
 
