How to use XML parsing tools on this one specific URL?

seberino

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris
 
skip

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.dom.minidom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.

Skip
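The forgiving-parser approach behind tidy and BeautifulSoup can be sketched with the standard library's own lenient HTML parser (shown in modern Python; the table snippet below is a made-up stand-in for the MSN page, not its real markup):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text inside <td> cells, tolerating unclosed tags."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

p = CellCollector()
# note the unclosed <td> tags -- a strict XML parser would reject this
p.feed("<table><tr><td>Last Price<td>39.55</tr></table>")
print(p.cells)  # -> ['Last Price', '39.55']
```

Unlike an XML parser, html.parser just reports the tags it sees and never raises on tag soup, which is exactly the property that makes screen-scraping ill-formed pages possible.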
 
Jorge Godoy

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

Yes... And Microsoft is responsible for a lot of the ill-formed pages on the
web, whether on their own sites or produced by their applications.
http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

It all depends on what data you want. A non-validating parser could probably
extract some of it. Another option is to pass the page through a tool that
can fix it up, like tidy...
 
Nikita the Spider

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup as others have suggested.
 
Paul Boddie

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!
I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import libxml2dom
import urllib

f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.

Paul
 
Paul McGuire

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris

How about a pyparsing hack instead? With English-readable expression
names and a few comments, I think this is fairly easy to follow. Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t: int(t[0]))
real = Combine(Word(nums) + Word(".", nums)).setParseAction(lambda t: float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart, tdEnd = map(Suppress, makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name, label, statExpr=real):
    if isinstance(statExpr, And):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t: None))

def bistatPattern(name, label, statExpr1=real, statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t: None))

stats = [
    statPattern("last", "Last Price"),
    statPattern("hi", "52 Week High"),
    statPattern("lo", "52 Week Low"),
    statPattern("vol", "Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk", "Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day", "50 Day Moving Average"),
    statPattern("movingAve_200day", "200 Day Moving Average"),
    statPattern("volatility", "Volatility (beta)"),
    bistatPattern("relStrength_last3", "Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6", "Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12", "Last 12 Months", pct, integer),
    bistatPattern("sales", "Sales", real + Suppress(dollarUnits), pct),
    bistatPattern("income", "Income", real + Suppress(dollarUnits), pct),
    bistatPattern("divRate", "Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield", "Dividend Yield", pct, pct),
    statPattern("curQtrEPSest", "Qtr(" + date + ") EPS Estimate"),
    statPattern("curFyEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("curPE", "Current P/E"),
    statPattern("fwdEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("fwdPE", "Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that
# we are positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum(statSearchPattern.searchString(stockHTML), ParseResults([]))

# print them out
print ticker.dump()
print ticker.last, ticker.hi, ticker.lo, ticker.vol, ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73
 
Paul McGuire

P.S. Please send me 1% of all the money you make from your automated
stock-speculation program. On the other hand, if you lose money with
your program, don't bother sending me a bill.

-- Paul
 
Fredrik Lundh

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.dom.minidom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.

ElementTree can also use BeautifulSoup:

http://effbot.org/zone/element-soup.htm

as noted on that page, tidy is a bit too picky for this kind of use; it's better suited
for "normalizing" HTML that you're producing yourself than for parsing arbitrary
HTML.

</F>
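Once tidy or BeautifulSoup has produced well-formed markup, ElementTree's normal API takes over and no regexes are needed; a minimal sketch (modern Python, with a made-up snippet standing in for the cleaned page):

```python
import xml.etree.ElementTree as ET

# pretend this is the output of tidy/BeautifulSoup: well-formed XML
snippet = "<table><tr><td>Last Price</td><td>39.55</td></tr></table>"

root = ET.fromstring(snippet)
# walk every <td> element and pull out its text
cells = [td.text for td in root.iter("td")]
print(cells)  # -> ['Last Price', '39.55']
```

The cleanup step and the extraction step stay cleanly separated, so the scraping code never has to care how broken the original HTML was.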
 
Paul Boddie

I can't validate it and xml.dom.minidom.parseString won't work on it.
[...]

Valid XHTML is scarcer than hen's teeth.

It probably doesn't need to be valid: being well-formed would be
sufficient for the operation of an XML parser, and for many
applications it'd be sufficient to consider the content as vanilla XML
without the XHTML overtones.

Paul
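The valid/well-formed distinction is easy to demonstrate with the stdlib's minidom: an HTML-style void tag breaks well-formedness and kills the parse, while the self-closed equivalent parses fine even though neither snippet is valid XHTML (these tiny fragments are illustrative, not from the MSN page):

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

bad = "<p>unclosed<br></p>"    # HTML-style <br>: not well-formed XML
good = "<p>closed<br/></p>"    # well-formed, though not valid XHTML

try:
    parseString(bad)
    parsed_bad = True
except ExpatError:
    parsed_bad = False          # expat rejects the mismatched tag

doc = parseString(good)
print(parsed_bad, doc.documentElement.tagName)  # -> False p
```

So for scraping, getting the page merely well-formed (via tidy or similar) is the whole battle; DTD validity adds nothing.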
 
