Extracting XML from HTML

kyosohma

Hi,

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this. I thought I could use the minidom module to do it,
but all I get is a screwy traceback:

Traceback (most recent call last):
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 69, in ?
    inst = ApptParser(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 19, in __init__
    xml = self.getXml(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 30, in getXml
    doc = xml.dom.minidom.parse(f)
  File "C:\Python24\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

Here's a sample of the html:

<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

Thanks a lot!

Mike
 
Paul Boddie

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

Probably easiest is to use an XML processing toolkit or library which
supports HTML parsing. Since the libxml2 library (written in C) makes
a fairly good job of HTML parsing, I would suggest either libxml2dom
(for a DOM-like API) or lxml (for an ElementTree-like API) as suitable
Python wrappers of libxml2. Of course, HTMLParser or SGMLParser should
work, but the programming style is a bit more convoluted unless you're
used to XML processing using a SAX-like API.
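
For illustration, a rough sketch of that SAX-like style (the tag and file
names here are only examples, not from Mike's actual page):

# Rough sketch: collect the text of a few known tags while the standard
# library HTMLParser (Python 2) walks through the document.
from HTMLParser import HTMLParser

class RowExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.current = None   # name of the tag we are currently inside
        self.data = {}        # text collected so far, keyed by tag name

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands tag names over in lower case
        if tag in ('recordnum', 'make', 'model'):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.data[self.current] = data

extractor = RowExtractor()
extractor.feed(open('page.htm').read())   # 'page.htm' is a made-up name
print extractor.data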

Paul

P.S. I'm biased towards libxml2dom, being the developer, but I use it
routinely and it generally does the job for me.
 
kyosohma

Probably easiest is to use an XML processing toolkit or library which
supports HTML parsing. Since the libxml2 library (written in C) makes
a fairly good job of HTML parsing, I would suggest either libxml2dom
(for a DOM-like API) or lxml (for an ElementTree-like API) as suitable
Python wrappers of libxml2. Of course, HTMLParser or SGMLParser should
work, but the programming style is a bit more convoluted unless you're
used to XML processing using a SAX-like API.

Paul

P.S. I'm biased towards libxml2dom, being the developer, but I use it
routinely and it generally does the job for me.

I have lxml installed and I appear to also have libxml2dom installed.
I know lxml has decent docs, but I don't see much for yours. Is this
the only place to go: http://www.boddie.org.uk/python/libxml2dom.html ?

Mike
 
Gabriel Genellina

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this. I thought I could use the minidom module to do it,
but all I get is a screwy traceback:

Traceback (most recent call last):
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

So your HTML is not a well-formed XML document, like many HTML pages, and
you can't use an XML parser (even a valid HTML document may not be valid
XML). Let's try with some mismatched tags:

py> text = '''<html>
.... <body>
.... <p>lots of <div>screwy text including divs and <span>spans</p>
.... <Row status="o">
.... <RecordNum>1126264</RecordNum>
.... <Make>Mitsubishi</Make>
.... <Model>Mirage DE</Model>
.... </Row>
.... </body>
.... </html>'''
py>
py> import xml.dom.minidom
py> doc = xml.dom.minidom.parseString(text)
Traceback (most recent call last):
....
xml.parsers.expat.ExpatError: mismatched tag: line 3, column 60

You will need a more robust parser, like BeautifulSoup
<http://www.crummy.com/software/BeautifulSoup/>

py> from BeautifulSoup import BeautifulSoup
py> soup = BeautifulSoup(text)
py> for row in soup.findAll("row"):
....     print row.recordnum, row.make.contents, row.model.string
....
<recordnum>1126264</recordnum> [u'Mitsubishi'] Mirage DE

Depending on your document, you may prefer to extract the XML blocks using
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
parser) or xml.etree.ElementTree.
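
For example, something along these lines (just a sketch; it assumes the HTML
is in the variable "text" as above, and ElementTree from Python 2.5 or the
standalone elementtree package):

# Sketch: cut each Row block out with BeautifulSoup, then parse the
# isolated block as a small XML document with ElementTree.
from BeautifulSoup import BeautifulSoup
from xml.etree import ElementTree   # Python 2.5+; use the elementtree package on 2.4

soup = BeautifulSoup(text)
for row in soup.findAll("row"):
    elem = ElementTree.fromstring(str(row))   # the block on its own is well-formed XML
    print elem.findtext("recordnum"), elem.findtext("make"), elem.findtext("model")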
 
Stefan Behnel

I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this.
Here's a sample of the html:

<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

lxml makes this pretty easy:
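
Something along these lines (a minimal sketch; the file name is made up):

# Parse the page with lxml's HTML parser; broken markup is not a problem.
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('page.htm', parser)   # 'page.htm' is just an example name
row = tree.find('//row')                 # the HTML parser lowercases tag names
print row.findtext('recordnum'), row.findtext('make')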

This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
tree iteration, ... You will also get plain XML when you serialise it to XML:
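
For instance (again just a sketch, continuing from the parse above):

# Serialising the tree gives the markup back as a plain XML string.
xml_string = etree.tostring(tree)
print xml_string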

Note that this doesn't add any namespaces, so you will not magically get valid
XHTML or something. You could rewrite the tags by hand, though.

Stefan
 
Paul Boddie

I have lxml installed and I appear to also have libxml2dom installed.
I know lxml has decent docs, but I don't see much for yours. Is this
the only place to go: http://www.boddie.org.uk/python/libxml2dom.html ?

Unfortunately yes, with regard to online documentation, although the
distribution contains API documentation, and the package has
docstrings for most of the public classes, functions and methods. And
the API is a lot like the PyXML and minidom APIs, too.

Paul
 
kyosohma

So your HTML is not a well-formed XML document, like many HTML pages, and
you can't use an XML parser (even a valid HTML document may not be valid
XML). Let's try with some mismatched tags:
Depending on your document, you may prefer to extract the XML blocks using
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
parser) or xml.etree.ElementTree

Thanks for the reply. I already knew about BeautifulSoup but I was
hoping to avoid installing *yet another module* on my PC. I got it to
work with lxml, but it's not very pretty. See my reply to Stefan.

Mike
 
kyosohma

lxml makes this pretty easy:


This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
tree iteration, ... You will also get plain XML when you serialise it to XML:


Note that this doesn't add any namespaces, so you will not magically get valid
XHTML or something. You could rewrite the tags by hand, though.

Stefan

I got it to work with lxml. See below:

def Parser(filename):
    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)
    events = ("recordnum", "primaryowner", "customeraddress")
    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike
 
George Sakkis

Thanks for the reply. I already knew about BeautifulSoup but I was
hoping to avoid installing *yet another module* on my PC.

That's a poor excuse for avoiding a self-contained, single-file module.
"Installing" it can be as simple as putting it in the same directory as
the module that imports it. Given that you can do in 2 lines what took
you around 15 with lxml, I wouldn't think twice.
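
Roughly something like this (a sketch, assuming the page source is already
in a string called text):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(text)                # copes fine with the mismatched tags
print soup.find("row").recordnum.string   # tag names come back lowercased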

George
 
Stefan Behnel

George said:
Given that you can do in 2 lines what
took you around 15 with lxml, I wouldn't think twice.

Don't judge a tool by a beginner's code.

Stefan
 
Laurent Pointal

(e-mail address removed) wrote:
I got it to work with lxml. See below:

def Parser(filename):
    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)
    events = ("recordnum", "primaryowner", "customeraddress")
    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

Mike

Q: Once you get your document into an XML tree in memory, why do you
go to event-based handling to extract your data?

Try to directly manipulate the tree.

parser = etree.HTMLParser()
tree = etree.parse(r'path/to/nextpage.htm', parser)
myrows = tree.findall(".//Row")

# Then work with the sub-elements.
for r in myrows:
    rnumelem = r.find("RecordNum")
    makeelem = r.find("Make")
    modelelem = r.find("Model")

& co.
 
Stefan Behnel

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.

def Parser(filename):

It's uncommon to give a function a capitalised name, unless it's a factory
function (which this isn't).

    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    xml_string = etree.tostring(tree)

What you do here is parse the HTML page and serialise it back into an XML
string. No need to do that - once it's a tree, you can work with it. lxml is a
highly integrated set of tools, no matter if you use it for XML or HTML.

    events = ("recordnum", "primaryowner", "customeraddress")

You're not using this anywhere below, so I assume this is left-over code.

    context = etree.iterparse(StringIO(xml_string), tag='')
    for action, elem in context:
        tag = elem.tag
        if tag == 'primaryowner':
            owner = elem.text
        elif tag == 'customeraddress':
            address = elem.text
        else:
            pass

    print 'Primary Owner: %s' % owner
    print 'Address: %s' % address

Admittedly, iterparse() doesn't currently support HTML (although this might
become possible in lxml 2.0).

You could do this more easily in a couple of ways. One is to use XPath:

print [el.text for el in tree.xpath("//primaryowner|//customeraddress")]

Note that this works directly on the tree that you retrieved right in the
third line of your code.

Another (and likely simpler) solution is to first find the "Row" element and
then start from that:

row = tree.find("//Row")
print row.findtext("primaryowner")
print row.findtext("customeraddress")

See the lxml tutorial on this, as well as the documentation on XPath support
and tree iteration:

http://codespeak.net/lxml/xpathxslt.html#xpath
http://codespeak.net/lxml/api.html#iteration

Hope this helps,
Stefan
 
kyosohma

It's uncommon to give a function a capitalised name, unless it's a factory
function (which this isn't).

Yeah. I was going to use a class (and I still might), so that's how it
got capitalized.

You're not using this anywhere below, so I assume this is left-over code.

I realized I didn't need that line soon after I posted. Sorry about
that!

You could do this more easily in a couple of ways. One is to use XPath:

print [el.text for el in tree.xpath("//primaryowner|//customeraddress")]

This works quite well. Wish I'd thought of it.

Note that this works directly on the tree that you retrieved right in the
third line of your code.

Another (and likely simpler) solution is to first find the "Row" element and
then start from that:

row = tree.find("//Row")
print row.findtext("primaryowner")
print row.findtext("customeraddress")

I tried this your way and Laurent's way and both give me this error:

AttributeError: 'NoneType' object has no attribute 'findtext'

See the lxml tutorial on this, as well as the documentation on XPath support
and tree iteration:

http://codespeak.net/lxml/xpathxslt.html#xpath
http://codespeak.net/lxml/api.html#iteration

Hope this helps,
Stefan

I'm not sure what George's deal is. I'm not a beginner with Python,
just with lxml. I don't have all of Python's hundreds of modules
memorized, and I have yet to meet anyone who does. Even if I had used
Beautiful Soup, my code would probably still suck, and I was told
explicitly by my boss to avoid adding new dependencies to my programs
whenever possible.

Thanks for the help. I'll add the list comprehension to my code.

Mike
 
Stefan Behnel

I tried this your way and Laurent's way and both give me this error:

AttributeError: 'NoneType' object has no attribute 'findtext'

Well, error handling is up to you. If find() doesn't find what you are looking
for, it will return None. Note that tag names are case sensitive - or maybe
there are namespaces involved, cannot tell from the example you posted.
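
A slightly more defensive version, for example (only a sketch):

# Guard against find() returning None, and try the lowercased tag name
# that the HTML parser will typically have produced.
row = tree.find("//Row")
if row is None:
    row = tree.find("//row")
if row is not None:
    print row.findtext("primaryowner")
    print row.findtext("customeraddress")
else:
    print "no Row element found"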

Stefan
 
