Processing XML that's embedded in HTML

Mike Driscoll · Jan 22, 2008

Hi,

I need to parse a fairly complex HTML page that has XML embedded in
it. I've done parsing before with the xml.dom.minidom module on just
plain XML, but I cannot get it to work with this HTML page.

The XML looks like this:

<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>

<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>

It appears to be enclosed in
<XML id="grdRegistrationInquiryCustomers"><BoundData>

The rest of the document is HTML, JavaScript, div tags, etc. I need the
information only from the row where the Relationship tag is Owner and
the Priority tag is 1. The rest I can ignore. When I tried parsing it
with minidom, I got an ExpatError: mismatched tag: line 1, column 357,
so I think the HTML is probably malformed.

I looked at BeautifulSoup, but it seems to separate its HTML
processing from its XML processing. Can someone give me some pointers?

I am currently using Python 2.5 on Windows XP. I will be using
Internet Explorer 6 since the document will not display correctly in
Firefox.

Thank you very much!

Mike

Paul Boddie · Jan 22, 2008

I need to parse a fairly complex HTML page that has XML embedded in
it. I've done parsing before with the xml.dom.minidom module on just
plain XML, but I cannot get it to work with this HTML page.

It's HTML day on comp.lang.python today! ;-)

It appears to be enclosed with <XML
id="grdRegistrationInquiryCustomers"><BoundData>

You could probably find the Row elements with the following XPath
expression:

//XML/BoundData/Row

More specific would be this:

//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row

See below for the relevance of this. You could also try using
getElementById on the document, specifying the id attribute's value
given above, then descending to find the Row elements.
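
As a rough illustration of the XPath expressions above, here is a minimal sketch using Python 3's standard-library xml.etree.ElementTree instead of libxml2dom. It assumes the island has already been isolated as well-formed XML; the <page> wrapper and the trimmed rows are made up for the example. ElementTree supports only a subset of XPath, so the path starts with .// rather than //.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the data island. The <page> wrapper element
# is hypothetical and only gives the island something to sit inside.
doc = ET.fromstring("""
<page>
  <XML id="grdRegistrationInquiryCustomers"><BoundData>
    <Row status="o"><Relationship>Owner</Relationship><Priority>1</Priority><Name>Doe, John</Name></Row>
    <Row status="o"><Relationship>Owner</Relationship><Priority>2</Priority><Name>Doe, Jane</Name></Row>
  </BoundData></XML>
</page>
""")

# Select the Row elements under the island with the given id attribute.
rows = doc.findall(".//XML[@id='grdRegistrationInquiryCustomers']/BoundData/Row")
names = [row.findtext("Name") for row in rows]
print(names)
```

This only works once the input is well-formed XML; it says nothing about how a given HTML parser will treat the surrounding page.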

The rest of the document is html, javascript div tags, etc. I need the
information only from the row where the Relationship tag = Owner and
the Priority tag = 1. The rest I can ignore. When I tried parsing it
with minidom, I get an ExpatError: mismatched tag: line 1, column 357
so I think the HTML is probably malformed.

Or that it isn't well-formed XML, at least.
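
To see why an XML parser gives up on such a page, here is a small sketch in Python 3 using only the standard library. The page string is a made-up stand-in, not Mike's document: its <p> and <br> tags are never closed, which HTML allows and XML does not.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page fragment: <p> and <br> are left open, so the first
# closing tag that does not match the innermost open element is an error.
page = '<html><body><p>Registration<br><XML id="x"><Row/></XML></body></html>'


def try_parse(text):
    """Return the parsed document, or the error message if expat rejects it."""
    try:
        return minidom.parseString(text)
    except ExpatError as exc:
        return "ExpatError: %s" % exc


result = try_parse(page)
print(result)
```

The parser stops at the first tag that breaks XML's nesting rules, so the reported position points at the HTML around the island rather than at the Row data.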

I looked at BeautifulSoup, but it seems to separate its HTML
processing from its XML processing. Can someone give me some pointers?

With libxml2dom [1] I'd do something like this:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
# or: d = libxml2dom.parseURI(uri, html=1)
rows = d.xpath("//XML/BoundData/Row")
# or: rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')

Even though the document is interpreted as HTML, you should get a DOM
containing the elements as libxml2 interprets them.

I am currently using Python 2.5 on Windows XP. I will be using
Internet Explorer 6 since the document will not display correctly in
Firefox.

That shouldn't be much of a surprise, it must be said: it isn't XHTML,
where you might be able to extend the document via XML, so the whole
document has to be "proper" HTML.

Paul

[1] http://www.python.org/pypi/libxml2dom

Mike Driscoll · Jan 22, 2008

Or that it isn't well-formed XML, at least.

I probably should have posted that I got the error on the first line
of the file, which is why I think it's the HTML. But I wouldn't be
surprised if it was the XML that's behaving badly.

With libxml2dom [1] I'd do something like this:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
# or: d = libxml2dom.parseURI(uri, html=1)
rows = d.xpath("//XML/BoundData/Row")
# or: rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')

Even though the document is interpreted as HTML, you should get a DOM
containing the elements as libxml2 interprets them.

I must have tried this module quite a while ago since I already have
it installed. I see you're the author of the module, so you can
probably tell me what's what. When I do the above, I get an empty list
either way. See my code below:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')
# rows = d.xpath("//XML/BoundData/Row")
print rows

I'm not sure what is wrong here...but I got lxml to create a tree by
doing the following:

<code>
from lxml import etree
from StringIO import StringIO

parser = etree.HTMLParser()
tree = etree.parse(filename, parser)
xml_string = etree.tostring(tree)
context = etree.iterparse(StringIO(xml_string))
</code>

However, when I iterate over the contents of "context", I can't figure
out how to nab the row's contents:

for action, elem in context:
    if action == 'end' and elem.tag == 'relationship':
        # do something...but what!?
        # this if statement probably isn't even right
        pass

Thanks for the quick response, though! Any other ideas?

Mike

John Machin · Jan 22, 2008

I'm not sure what is wrong here...but I got lxml to create a tree by
doing the following:

<code>
from lxml import etree
from StringIO import StringIO

parser = etree.HTMLParser()
tree = etree.parse(filename, parser)
xml_string = etree.tostring(tree)
context = etree.iterparse(StringIO(xml_string))
</code>

However, when I iterate over the contents of "context", I can't figure
out how to nab the row's contents:

for action, elem in context:
    if action == 'end' and elem.tag == 'relationship':
        # do something...but what!?
        # this if statement probably isn't even right
        pass

lxml allegedly supports the ElementTree interface so I would expect
elem.text to refer to the contents. Sure enough:
http://codespeak.net/lxml/tutorial.html#elements-contain-text

Why do you want/need to use the iterparse technique on the 2nd pass
instead of creating another tree and then using getiterator?
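
A minimal sketch of the tree-based approach, for readers who want to see elem.text in use. It is written for Python 3 with the standard-library xml.etree.ElementTree rather than lxml, uses iter() where older code would call getiterator(), and assumes the rows are already available as well-formed XML. The sample data is trimmed and the helper name primary_owner is invented for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the rows in the data island.
xml_text = """
<BoundData>
  <Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>1</Priority>
    <Name>Doe, John</Name>
  </Row>
  <Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>2</Priority>
    <Name>Doe, Jane</Name>
  </Row>
</BoundData>
"""

root = ET.fromstring(xml_text)


def primary_owner(root):
    """Return the fields of the first Row with Relationship=Owner and Priority=1."""
    for row in root.iter("Row"):
        # Each child element's text is its content; collect them by tag name.
        fields = {child.tag: (child.text or "").strip() for child in row}
        if fields.get("Relationship") == "Owner" and fields.get("Priority") == "1":
            return fields
    return None


owner = primary_owner(root)
print(owner["Name"])
```

Note that the tag names here keep their original case because the input is parsed as XML; a tree built by an HTML parser may report them differently.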

Paul Boddie · Jan 22, 2008

I must have tried this module quite a while ago since I already have
it installed. I see you're the author of the module, so you can
probably tell me what's what. When I do the above, I get an empty list
either way. See my code below:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')
# rows = d.xpath("//XML/BoundData/Row")
print rows

It may be namespace-related, although parsing as HTML shouldn't impose
namespaces on the document, unlike parsing XHTML, say. One thing you
can try is to start with a simpler query and to expand it. Start with
the expression "//XML" and add things to make the results more
specific. Generally, namespaces can make XPath queries awkward because
you have to qualify the element names and define the namespaces for
each of the prefixes used.

Let me know how you get on!

Paul
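
One more thing worth checking when a query such as //XML comes back empty: HTML parsers commonly fold element names to lower case, in which case a case-sensitive XPath query for "XML" or "Row" finds nothing while "//xml" or "//row" would. The sketch below shows the effect with Python 3's standard-library html.parser, not with libxml2dom; whether libxml2's HTML mode does the same is an assumption here, though Mike's test against 'relationship' in his lxml loop hints at it. The class name TagCollector and the shortened markup are made up for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each start tag name exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<XML id="grd"><BoundData><Row status="o">'
               '<Relationship>Owner</Relationship></Row></BoundData></XML>')
collector.close()
print(collector.tags)
```

If the names do come back in lower case, following the advice above and starting from the simplest possible query is a quick way to confirm it.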

Paul McGuire · Jan 22, 2008

Hi,

I need to parse a fairly complex HTML page that has XML embedded in
it. I've done parsing before with the xml.dom.minidom module on just
plain XML, but I cannot get it to work with this HTML page.

The XML looks like this:

...

Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone. Try this
program out:

from pyparsing import (makeXMLTags, Word, nums, Combine, oneOf, SkipTo,
                       withAttribute)

htmlWithEmbeddedXml = """
<HTML>
<Body>
<p>
<b>Hey! this is really bold!</b>

<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>

<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StartDate>07/16/2007</StartDate>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
<Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>

<table>
<tr><Td>this is in a table, woo-hoo!</td>
more HTML
blah blah blah...
"""

# define pyparsing expressions for XML tags
rowStart,rowEnd = makeXMLTags("Row")
relationshipStart,relationshipEnd = makeXMLTags("Relationship")
priorityStart,priorityEnd = makeXMLTags("Priority")
startDateStart,startDateEnd = makeXMLTags("StartDate")
stopsExistStart,stopsExistEnd = makeXMLTags("StopsExist")
nameStart,nameEnd = makeXMLTags("Name")
addressStart,addressEnd = makeXMLTags("Address")

# define some useful expressions for data of specific types
integer = Word(nums)
date = Combine(Word(nums,exact=2)+"/"+
Word(nums,exact=2)+"/"+Word(nums,exact=4))
yesOrNo = oneOf("Yes No")

# conversion parse actions
integer.setParseAction(lambda t: int(t[0]))
yesOrNo.setParseAction(lambda t: t[0]=='Yes')
# could also define a conversion for date if you really wanted to

# define format of a <Row>, plus assign results names for each data field
rowRec = (rowStart +
    relationshipStart + SkipTo(relationshipEnd)("relationship") + relationshipEnd +
    priorityStart + integer("priority") + priorityEnd +
    startDateStart + date("startdate") + startDateEnd +
    stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd +
    nameStart + SkipTo(nameEnd)("name") + nameEnd +
    addressStart + SkipTo(addressEnd)("address") + addressEnd +
    rowEnd)

# set filtering parse action
rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))

# find all matching rows, matching grammar and filtering parse action
rows = rowRec.searchString(htmlWithEmbeddedXml)

# print the results (uncomment r.dump() statement to see full
# result for each row)
for r in rows:
    # print r.dump()
    print r.relationship
    print r.priority
    print r.startdate
    print r.stopsexist
    print r.name
    print r.address

This prints:
Owner
1
07/16/2007
False
Doe, John
1905 S 3rd Ave , Hicksville IA 99999

In addition to parsing this data, some conversions were done at parse
time, too: "1" was converted to the integer 1, and "No" was converted
to False. These were done by the conversion parse actions. The
filtering for just the Rows containing Relationship="Owner" and
Priority=1 was done in a more global parse action, built with
withAttribute. If you comment that line out, you will see that both
rows get retrieved.

-- Paul
(Find out more about pyparsing at http://pyparsing.wikispaces.com.)
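
For comparison, the same lift-and-filter idea can be sketched with only the standard-library re module (Python 3). This is not pyparsing and is far less forgiving: it assumes every field is a simple <Tag>text</Tag> pair with no attributes or nesting, and the sample string is a shortened stand-in for the page. The names ROW_RE, FIELD_RE and parse_rows are invented for the example.

```python
import re

# Shortened stand-in for the page: HTML noise around two rows.
html = """
<p><b>Hey! this is really bold!</b>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>1</Priority>
<StopsExist>No</StopsExist>
<Name>Doe, John</Name>
</Row>
<Row status="o">
<Relationship>Owner</Relationship>
<Priority>2</Priority>
<StopsExist>No</StopsExist>
<Name>Doe, Jane</Name>
</Row>
<table><tr><Td>this is in a table, woo-hoo!</td>
"""

ROW_RE = re.compile(r"<Row\b[^>]*>(.*?)</Row>", re.S)   # body of each Row
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.S)       # <Tag>text</Tag> pairs


def parse_rows(text):
    """Return one dict per Row, converting Priority and StopsExist."""
    rows = []
    for body in ROW_RE.findall(text):
        row = {tag: value.strip() for tag, value in FIELD_RE.findall(body)}
        row["Priority"] = int(row["Priority"])
        row["StopsExist"] = row["StopsExist"] == "Yes"
        rows.append(row)
    return rows


# Keep only the row Mike asked for: Relationship = Owner and Priority = 1.
matches = [r for r in parse_rows(html)
           if r["Relationship"] == "Owner" and r["Priority"] == 1]
print(matches)
```

The conversions mirror the parse actions in the pyparsing program: Priority becomes an int and StopsExist becomes a bool before the filter runs.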

Stefan Behnel · Jan 23, 2008

Hi,

Mike said:
I got lxml to create a tree by doing the following:

from lxml import etree
from StringIO import StringIO

parser = etree.HTMLParser()
tree = etree.parse(filename, parser)
xml_string = etree.tostring(tree)
context = etree.iterparse(StringIO(xml_string))

No idea why you need the two steps here. lxml 2.0 supports parsing HTML in
iterparse() directly when you pass the boolean "html" keyword.

However, when I iterate over the contents of "context", I can't figure
out how to nab the row's contents:

for action, elem in context:
    if action == 'end' and elem.tag == 'relationship':
        # do something...but what!?
        # this if statement probably isn't even right
        pass

I would really encourage you to use the normal parser here instead of iterparse().

from lxml import etree
parser = etree.HTMLParser()

# parse the HTML/XML melange
tree = etree.parse(filename, parser)

# if you want, you can construct a pure XML document
row_root = etree.Element("newroot")
for row in tree.iterfind("//Row"):
    row_root.append(row)
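
A different route to the same end (a pure XML document) is to slice the island out of the page text and hand only that slice to an XML parser. The sketch below is Python 3 with the standard library, not lxml. It assumes the island itself is well-formed and that its opening tag is written exactly as Mike quoted it; the page string and the helper name extract_island are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up page: sloppy HTML around a well-formed data island.
page = """<html><body><p>Account<br>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
<Row status="o"><Relationship>Owner</Relationship><Priority>1</Priority></Row>
</BoundData></XML>
<table><tr><td>more HTML</table>
</body></html>"""


def extract_island(text, island_id):
    """Return the text from the opening <XML id="..."> tag through </XML>."""
    start = text.find('<XML id="%s"' % island_id)
    if start == -1:
        raise ValueError("no data island with id %r" % island_id)
    end = text.find("</XML>", start)
    if end == -1:
        raise ValueError("data island is not closed")
    return text[start:end + len("</XML>")]


# Only the island is parsed as XML, so the broken HTML around it is ignored.
island = ET.fromstring(extract_island(page, "grdRegistrationInquiryCustomers"))
priorities = [row.findtext("Priority") for row in island.iter("Row")]
print(priorities)
```

Plain string searching like this is brittle if the page varies the tag's case, quoting or attribute order, which is where a real HTML parser earns its keep.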

In your specific case, I'd encourage using lxml.objectify:

http://codespeak.net/lxml/dev/objectify.html

It will allow you to do this (untested):

from lxml import etree, objectify
parser = etree.HTMLParser()
lookup = objectify.ObjectifyElementClassLookup()
parser.setElementClassLookup(lookup)

tree = etree.parse(filename, parser)

for row in tree.iterfind("//Row"):
    print row.relationship, row.StartDate, row.Priority * 2.7

Stefan

Mike Driscoll · Jan 23, 2008

John and Stefan,

No idea why you need the two steps here. lxml 2.0 supports parsing HTML in
iterparse() directly when you pass the boolean "html" keyword.

I don't know why I have 2 steps either, now that I look at it.
However, I don't do enough XML parsing to get real familiar with the
ins and outs of Python parsing either, so it's mainly just my
inexperience. And I also got lost in the lxml tutorials...

I'll give your ideas a go and also see if what the others posted will
be cleaner or faster.

Thank you all.

Mike

Mike Driscoll · Jan 23, 2008

Stefan,

Both the normal parser example and the objectify example you gave me
give a traceback as follows:

Traceback (most recent call last):
  File "\\clippy\xml_parser2.py", line 70, in -toplevel-
    for row in tree.iterfind("//Row"):
AttributeError: 'etree._ElementTree' object has no attribute 'iterfind'

Is there some kind of newer version of lxml?

Mike

Mike Driscoll · Jan 23, 2008

...

Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone. Try this
program out:

Happy post-HTML Day to you!

I've heard of this module, but never used it. Your code runs almost
out of the box on my file and returns the correct result. That's
pretty cool!

It looks like the wiki you linked to has quite a few pieces of example
code. I'll have to look this over. While I like lxml's very Object
Oriented way of doing things, I tend to get overwhelmed by their
tutorials for some reason. One more example of all those college OOP
classes being a waste of money...

Thank you for the help.

Mike

Stefan Behnel · Jan 23, 2008

Mike said:
Both the normal parser example and the objectify example you gave me
give a traceback as follows:

Traceback (most recent call last):
  File "\\clippy\xml_parser2.py", line 70, in -toplevel-
    for row in tree.iterfind("//Row"):
AttributeError: 'etree._ElementTree' object has no attribute 'iterfind'

Is there some kind of newer version of lxml?

Yep, lxml 2.0. It's currently in beta, but that doesn't say much.

http://codespeak.net/lxml/dev/

Stefan
