Parsing broken HTML via Mozilla

  • Thread starter Walter Dörwald

Walter Dörwald

Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems include
nested comments, bare "&" in URLs, and "<" in text (e.g.
"if foo < bar").

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald
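
For comparison, the event-driven ("SAX-like") interface asked for in this post can be sketched with Python's standard-library html.parser, which tolerates many kinds of broken markup. This is not Mozilla's parser and makes no claim to match what a browser does; the names EventCollector, parse_events and text_of, and the sample input, are made up for illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records SAX-style events as tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Text may arrive in several chunks, e.g. around a stray "<".
        self.events.append(("data", data))


def parse_events(html_text):
    """Feed the whole document and return the recorded events."""
    parser = EventCollector()
    parser.feed(html_text)
    parser.close()
    return parser.events


def text_of(events):
    """Join all text chunks back into one string."""
    return "".join(event[1] for event in events if event[0] == "data")


# Sample input with a bare "&" in a URL and a "<" in running text.
sample = '<a href="x?a=1&b=2">if foo < bar</a>'
events = parse_events(sample)
```

The handler methods play the role of SAX callbacks. Whether the recovery is good enough for a given set of pages would have to be checked case by case.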
 

John J. Lee

Walter Dörwald said:
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?
[...]

PyXPCOM. Good luck compiling it.


John
 

Tom B.

Walter Dörwald said:
Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems include
nested comments, bare "&" in URLs, and "<" in text (e.g.
"if foo < bar").

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald

Maybe you should preprocess your files with something like
http://www.zope.org/Members/chrisw/StripOGram
which can help you get rid of the stuff you don't want.

Tom
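
A rough standard-library sketch of that kind of preprocessing, limited to two of the problems mentioned in the thread (bare "&" and stray "<"). It does not use StripOGram and is not its API; the function names are hypothetical, and a regex pass like this will not fix nested comments or cope with script and style content.

```python
import re

# "&" that does not start a named, decimal, or hexadecimal reference.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

# "<" that is not followed by a letter, "/", "!" or "?",
# so it cannot open a tag, end tag, comment, or declaration.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_ampersands(text):
    """Replace bare '&' with '&amp;', leaving real references alone."""
    return _BARE_AMP.sub("&amp;", text)


def escape_stray_lt(text):
    """Replace '<' that cannot start markup with '&lt;'."""
    return _STRAY_LT.sub("&lt;", text)


def preprocess(text):
    """Apply both fixes; assumed to run before a stricter parser."""
    return escape_stray_lt(escape_bare_ampersands(text))
```

For example, preprocess('<a href="x?a=1&b=2">if foo < bar</a>') yields '<a href="x?a=1&amp;b=2">if foo &lt; bar</a>'.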
 

G. S. Hayes

Walter Dörwald said:
Hello all!

Hi!

Walter Dörwald said:
I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably.

What have you tried?

I've been using Tidy with pretty good results; there's a Python
wrapper called utidylib available at http://utidylib.berlios.de

Make sure to use the "force output" option and it'll do a reasonable
job of parsing fairly broken HTML and outputting either plain HTML,
XHTML, or several other formats (with lots of tweaky knobs available
to tune the output if you want to).
 

Walter Dörwald

Paul said:
Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/

I already tried that, but it completely ignores encoding issues
and passes broken entity references (e.g. bare & in URLs) along
literally. Furthermore, its support for DTD-aware HTML parsing
is incomplete (e.g. <link> is not handled as an empty tag).

Bye,
Walter Dörwald
 