Parsing HTML/XML documents

pabloski · Apr 26, 2007

I need to parse real-world HTML/XML documents and I found two nice Python
solutions: BeautifulSoup and Tidy.

However, I also found pyXPCOM, which is a wrapper for Gecko, so I was thinking
that Gecko surely handles bad HTML in a more consistent and error-proof way
than BS and Tidy.

I'm interested in using the Mozilla DOM from inside a Python script, but
I'm a bit confused about how to use pyXPCOM to accomplish this.

Any suggestions?

Stefan Behnel · Apr 26, 2007

> I need to parse real-world HTML/XML documents and I found two nice Python
> solutions: BeautifulSoup and Tidy.

There's also lxml, in case you want a real XML tool.
http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html#parsers

> However, I also found pyXPCOM, which is a wrapper for Gecko, so I was thinking
> that Gecko surely handles bad HTML in a more consistent and error-proof way
> than BS and Tidy.
>
> I'm interested in using the Mozilla DOM from inside a Python script, but
> I'm a bit confused about how to use pyXPCOM to accomplish this.

I've never used it, but I doubt Gecko would yield substantially better results
than any of the three above. You're dealing with broken data here, so it just
depends on your input which one of them wins.
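
As a rough illustration of the lxml option, here is a minimal sketch, assuming lxml is installed and that the goal is simply to pull link targets out of malformed markup. The `BROKEN` sample and the `extract_links` helper are made up for this example; `lxml.html.fromstring` is the library call that does the recovery.

```python
from lxml import html  # third-party package: pip install lxml

# Deliberately broken markup (hypothetical sample): unclosed <p>, <b>, <li>
# and <a> tags, and no closing </ul>, </body> or </html>.
BROKEN = """
<html><body>
<p>First paragraph
<p>Second <b>bold paragraph
<ul><li><a href="/one">one<li><a href="/two">two</a>
"""


def extract_links(markup):
    """Return the href of every <a> element in the recovered tree."""
    root = html.fromstring(markup)  # the parser repairs the tree as it goes
    return [a.get("href") for a in root.iter("a")]


links = extract_links(BROKEN)
print(links)
```

Running the same input through BeautifulSoup or Tidy and comparing the resulting trees is a quick way to see which tool copes best with a particular kind of breakage.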

Stefan

Max M · Apr 26, 2007

Stefan Behnel wrote:

> There's also lxml, in case you want a real XML tool.
> http://codespeak.net/lxml/
> http://codespeak.net/lxml/dev/parsing.html#parsers

I have used both BeautifulSoup and lxml. They are both good tools.

lxml is blindingly fast compared to BeautifulSoup though.

A simple tool for importing contact information from 6000 XML files of
23 MBytes into Zope runs in about 30 seconds, with no optimisations at all,
just inefficient XPath expressions.

That is pretty good in my book.
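
A minimal sketch of that kind of bulk import follows; it is not the actual tool described above. The file layout (`<contacts><contact><name/><email/></contact></contacts>`) and the `read_contacts` helper are assumptions for illustration, and the Zope side is left out. It falls back to the standard library's ElementTree when lxml is missing, since both accept the simple path expressions used here.

```python
import glob
import os
import tempfile

try:
    from lxml import etree  # third-party, fast
except ImportError:
    # Standard-library fallback offering a compatible subset of the API.
    import xml.etree.ElementTree as etree


def read_contacts(directory):
    """Collect (name, email) pairs from every *.xml file in `directory`."""
    contacts = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        tree = etree.parse(path)
        # ".//contact" is understood by both lxml and ElementTree.
        for node in tree.findall(".//contact"):
            contacts.append((node.findtext("name"), node.findtext("email")))
    return contacts


# Demo: write two small sample files (hypothetical schema) and read them back.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, "c%d.xml" % i), "w") as f:
            f.write(
                "<contacts><contact><name>Person %d</name>"
                "<email>p%d@example.org</email></contact></contacts>" % (i, i)
            )
    demo = read_contacts(tmp)

print(demo)
```

With lxml, `tree.xpath(...)` accepts full XPath 1.0 expressions, whereas `findall` is limited to the simpler ElementPath subset.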

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
