Parsing broken HTML via Mozilla

Walter Dörwald · Aug 9, 2004

Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems
include nested comments, bare "&" in URLs, and "<" in text
(e.g. "if foo < bar").
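
To make two of the problems above concrete, here is a minimal sketch, assuming only the standard library's strict XML parser (xml.etree.ElementTree). The sample strings and the helper name is_well_formed are invented for illustration and are not taken from the pages in question.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples modelled on the problems described above.
SAMPLES = {
    "bare ampersand in URL": '<p><a href="page?a=1&b=2">link</a></p>',
    "bare < in text": "<p>if foo < bar</p>",
}


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


for label, markup in SAMPLES.items():
    status = "accepted" if is_well_formed(markup) else "rejected"
    print(label, "->", status)
```

Both samples are rejected by the strict parser, although a browser renders them without complaint. The same text with the characters escaped (e.g. "&lt;" and "&amp;") is accepted.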

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald

John J. Lee · Aug 10, 2004

Walter Dörwald said:
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

[...]

PyXPCOM. Good luck compiling it.

John

Tom B. · Aug 10, 2004

Walter Dörwald said:
Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems
include nested comments, bare "&" in URLs, and "<" in text
(e.g. "if foo < bar").

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald

Maybe you should preprocess your files with something like

http://www.zope.org/Members/chrisw/StripOGram
which can help you get rid of the stuff you don't want.

Tom

G. S. Hayes · Aug 10, 2004

Walter Dörwald said:
Hello all!

Hi!

I'm trying to parse broken HTML with several Python tools.

Unfortunately none of them works 100% reliably.

What have you tried?

I've been using Tidy with pretty good results; there's a Python
wrapper called utidylib available at http://utidylib.berlios.de

Make sure to use the "force output" option and it'll do a reasonable
job of parsing fairly broken HTML and outputting either as plain HTML,
XHTML, or several other formats (with lots of tweaky knobs available
to tune the output if you want to).
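
A rough sketch of what that could look like, assuming uTidylib is installed and exposes tidy.parseString(text, **options) with Tidy option names written with underscores instead of hyphens. The option set and the helper name tidy_to_xhtml are illustrative assumptions, not a recommended configuration.

```python
# Tidy options, assumed to map to "output-xhtml", "force-output", etc.
TIDY_OPTIONS = {
    "output_xhtml": 1,   # emit XHTML rather than plain HTML
    "force_output": 1,   # produce output even when Tidy reports errors
    "tidy_mark": 0,      # do not add a "generator" meta tag
}


def tidy_to_xhtml(html):
    """Run broken HTML through Tidy and return the cleaned-up markup.

    The import is done here so that the module still loads on a
    machine where uTidylib is not installed.
    """
    import tidy  # third-party: uTidylib

    document = tidy.parseString(html, **TIDY_OPTIONS)
    return str(document)
```

Calling tidy_to_xhtml('<p>if foo < bar') should then give back a complete XHTML document, which can be handed to an ordinary XML parser. How well this works on a given page depends on how badly the page is broken.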

Paul Wright · Aug 10, 2004

Walter said:
I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems include
nested comments, bare "&" in URLs, and "<" in text (e.g. "if foo <
bar").

Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/

Walter Dörwald · Aug 11, 2004

Paul said:
Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/

I already tried that, but it completely ignores encoding issues
and it passes broken entity references (e.g. bare & in URLs) along
literally. Furthermore, its support for DTD-aware HTML parsing
is not complete (e.g. <link> is not handled as an empty tag).
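
For the bare "&" case specifically, one possible workaround is to escape such ampersands before or after the lenient parser runs. The following is a minimal sketch using only the standard library; the helper name escape_bare_ampersands is made up for illustration, and the pattern only checks that something looks like an entity reference, not that the entity name is actually defined in HTML.

```python
import re

# An "&" that does not start a numeric reference (&#38; or &#x26;)
# or something shaped like a named entity (&amp;, &copy;, ...).
BARE_AMP = re.compile(
    r"&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def escape_bare_ampersands(text):
    """Replace every bare "&" in text with "&amp;"."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands('<a href="page?a=1&b=2">AT&T</a>'))
```

This is applied to the whole string, so it would also touch ampersands inside script blocks or CDATA sections. It is a stopgap for this one problem and does nothing for encodings or empty tags like <link>.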

Bye,
Walter Dörwald
