How to use XML parsing tools on this one specific URL?

seberino

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris
 
skip

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.dom.minidom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.

Skip
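The forgiving-parser approach behind tidy and BeautifulSoup can be sketched with the standard library's own lenient HTML parser (shown in modern Python; the table snippet below is a made-up stand-in for the MSN page, not its real markup):

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text inside <td> cells, tolerating unclosed tags."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

p = CellCollector()
# note the unclosed <td> tags -- a strict XML parser would reject this
p.feed("<table><tr><td>Last Price<td>39.55</tr></table>")
print(p.cells)  # -> ['Last Price', '39.55']
```

Unlike an XML parser, html.parser just reports the tags it sees and never raises on tag soup, which is exactly the property that makes screen-scraping ill-formed pages possible.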
 
Jorge Godoy

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

Yes... And Microsoft is responsible for a lot of the ill-formed pages on the
web, whether on their own sites or produced by their applications.
http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

It all depends on what data you want. A non-validating parser could probably
extract some of it. Another option is to pass the page through a tool that
can fix it up, like tidy...
 
Nikita the Spider

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup as others have suggested.
 
Paul Boddie

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!
I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

import libxml2dom
import urllib

f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.

Paul
 
Paul McGuire

I understand that the web is full of ill-formed XHTML web pages but
this is Microsoft:

http://moneycentral.msn.com/companyreport?Symbol=BBBY

I can't validate it and xml.dom.minidom.parseString won't work on it.

If this was just some teenager's web site I'd move on. Is there any
hope avoiding regular expression hacks to extract the data from this
page?

Chris

How about a pyparsing hack instead? With English-readable expression
names and a few comments, I think this is fairly easy to follow. Also
note the sample statement at the end showing how to use the results
names to access the individual data fields (much easier than indexing
into a 20-element list!).

(You should also verify you are not running afoul of any terms of
service related to the content of this page.)

-- Paul

=======================
from pyparsing import *
import urllib

# define matching elements
integer = Word(nums).setParseAction(lambda t: int(t[0]))
real = Combine(Word(nums) + Word(".", nums)).setParseAction(lambda t: float(t[0]))
pct = real + Suppress("%")
date = Combine(Word(nums) + '/' + Word(nums))
tdStart, tdEnd = map(Suppress, makeHTMLTags("td"))
dollarUnits = oneOf("Mil Bil")

# stats are one of two patterns - single value or double value stat,
# wrapped in HTML <td> tags
# also, attach parse action to make sure each matches only once
def statPattern(name, label, statExpr=real):
    if isinstance(statExpr, And):
        statExpr.exprs[0] = statExpr.exprs[0].setResultsName(name)
    else:
        statExpr = statExpr.setResultsName(name)
    expr = tdStart + Suppress(label) + tdEnd + tdStart + statExpr + tdEnd
    return expr.setParseAction(OnlyOnce(lambda t: None))

def bistatPattern(name, label, statExpr1=real, statExpr2=real):
    expr = (tdStart + Suppress(label) + tdEnd +
            tdStart + statExpr1 + tdEnd +
            tdStart + statExpr2 + tdEnd).setResultsName(name)
    return expr.setParseAction(OnlyOnce(lambda t: None))

stats = [
    statPattern("last", "Last Price"),
    statPattern("hi", "52 Week High"),
    statPattern("lo", "52 Week Low"),
    statPattern("vol", "Volume", real + Suppress(dollarUnits)),
    statPattern("aveDailyVol_13wk", "Average Daily Volume (13wk)",
                real + Suppress(dollarUnits)),
    statPattern("movingAve_50day", "50 Day Moving Average"),
    statPattern("movingAve_200day", "200 Day Moving Average"),
    statPattern("volatility", "Volatility (beta)"),
    bistatPattern("relStrength_last3", "Last 3 Months", pct, integer),
    bistatPattern("relStrength_last6", "Last 6 Months", pct, integer),
    bistatPattern("relStrength_last12", "Last 12 Months", pct, integer),
    bistatPattern("sales", "Sales", real + Suppress(dollarUnits), pct),
    bistatPattern("income", "Income", real + Suppress(dollarUnits), pct),
    bistatPattern("divRate", "Dividend Rate", real, pct | "NA"),
    bistatPattern("divYield", "Dividend Yield", pct, pct),
    statPattern("curQtrEPSest", "Qtr(" + date + ") EPS Estimate"),
    statPattern("curFyEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("curPE", "Current P/E"),
    statPattern("fwdEPSest", "FY(" + date + ") EPS Estimate"),
    statPattern("fwdPE", "Forward P/E"),
    ]

# create overall search pattern - things move faster if we verify that
# we are positioned at a <td> tag before going through the MatchFirst group
statSearchPattern = FollowedBy(tdStart) + MatchFirst(stats)

# SETUP IS DONE - now get the HTML source
# read in web page
pg = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
stockHTML = pg.read()
pg.close()

# extract and merge statistics
ticker = sum(statSearchPattern.searchString(stockHTML), ParseResults([]))

# print them out
print ticker.dump()
print ticker.last, ticker.hi, ticker.lo, ticker.vol, ticker.volatility

-----------------------
prints:
[39.549999999999997, 43.32, 30.920000000000002, 2.3599999999999999,
2.7400000000000002, 40.920000000000002, 37.659999999999997,
0.72999999999999998, 1.5, 55, 15.5, 69, 9.8000000000000007, 62,
6.2999999999999998, 19.399999999999999, 586.29999999999995,
27.199999999999999, 0.0, 'NA', 0.0, 0.0, 0.78000000000000003,
2.1499999999999999, 19.399999999999999, 2.3900000000000001,
18.399999999999999]
- aveDailyVol_13wk: 2.74
- curFyEPSest: 2.15
- curPE: 19.4
- curQtrEPSest: 0.78
- divRate: [0.0, 'NA']
- divYield: [0.0, 0.0]
- fwdEPSest: 2.39
- fwdPE: 18.4
- hi: 43.32
- income: [586.29999999999995, 27.199999999999999]
- last: 39.55
- lo: 30.92
- movingAve_200day: 37.66
- movingAve_50day: 40.92
- relStrength_last12: [9.8000000000000007, 62]
- relStrength_last3: [1.5, 55]
- relStrength_last6: [15.5, 69]
- sales: [6.2999999999999998, 19.399999999999999]
- vol: 2.36
- volatility: 0.73
39.55 43.32 30.92 2.36 0.73
 
Paul McGuire

P.S. Please send me 1% of all the money you make from your automated
stock-speculation program. On the other hand, if you lose money with
your program, don't bother sending me a bill.

-- Paul
 
Fredrik Lundh

Chris> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Chris> I can't validate it and xml.dom.minidom.parseString won't work on
Chris> it.

Chris> If this was just some teenager's web site I'd move on. Is there
Chris> any hope avoiding regular expression hacks to extract the data
Chris> from this page?

Tidy it perhaps or use BeautifulSoup? ElementTree can use tidy if it's
available.

ElementTree can also use BeautifulSoup:

http://effbot.org/zone/element-soup.htm

as noted on that page, tidy is a bit too picky for this kind of use; it's better suited
for "normalizing" HTML that you're producing yourself than for parsing arbitrary
HTML.

</F>
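Once tidy or BeautifulSoup has produced well-formed markup, ElementTree's normal API takes over and no regexes are needed; a minimal sketch (modern Python, with a made-up snippet standing in for the cleaned page):

```python
import xml.etree.ElementTree as ET

# pretend this is the output of tidy/BeautifulSoup: well-formed XML
snippet = "<table><tr><td>Last Price</td><td>39.55</td></tr></table>"

root = ET.fromstring(snippet)
# walk every <td> element and pull out its text
cells = [td.text for td in root.iter("td")]
print(cells)  # -> ['Last Price', '39.55']
```

The cleanup step and the extraction step stay cleanly separated, so the scraping code never has to care how broken the original HTML was.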
 
Paul Boddie

I can't validate it and xml.dom.minidom.parseString won't work on it.
[...]

Valid XHTML is scarcer than hen's teeth.

It probably doesn't need to be valid: being well-formed would be
sufficient for the operation of an XML parser, and for many
applications it'd be sufficient to consider the content as vanilla XML
without the XHTML overtones.

Paul
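The valid/well-formed distinction is easy to demonstrate with the stdlib's minidom: an HTML-style void tag breaks well-formedness and kills the parse, while the self-closed equivalent parses fine even though neither snippet is valid XHTML (these tiny fragments are illustrative, not from the MSN page):

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

bad = "<p>unclosed<br></p>"    # HTML-style <br>: not well-formed XML
good = "<p>closed<br/></p>"    # well-formed, though not valid XHTML

try:
    parseString(bad)
    parsed_bad = True
except ExpatError:
    parsed_bad = False          # expat rejects the mismatched tag

doc = parseString(good)
print(parsed_bad, doc.documentElement.tagName)  # -> False p
```

So for scraping, getting the page merely well-formed (via tidy or similar) is the whole battle; DTD validity adds nothing.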
 
