Code that ought to run fast, but can't due to Python limitations.


John Nagle

Steven said:
Yes, I'm aware of that, but that's not what John's code is doing -- he's
doing a series of if expr ... elif expr tests. I don't think a case
statement can do much to optimize that.
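
A toy illustration of that point, not the html5lib code: a dispatch table
only pays off when every branch is an equality test on the same value,
while a chain of arbitrary boolean expressions still has to be evaluated
top to bottom.

    # Hypothetical example -- none of these names come from html5lib.
    DISPATCH = {" ": "space", "\t": "space", "<": "tag-open"}

    def classify_fast(ch):
        # Equality-only branches can collapse into a dictionary lookup.
        return DISPATCH.get(ch, "data")

    def classify_general(ch):
        # Branches built from membership tests, method calls and so on
        # stay an if/elif ladder, evaluated in order on every call.
        if ch in " \t\n\r\f":
            return "space"
        elif ch == "<":
            return "tag-open"
        elif ch.isalpha():
            return "letter"
        else:
            return "data"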

(I didn't write that code; it's from "http://code.google.com/p/html5lib/",
which is a general purpose HTML 5 parser written in Python. It's compatible
with ElementTree and/or BeautifulSoup. I currently use a modified
BeautifulSoup for parsing real-world HTML in a small-scale crawler, and
I'm looking at this as an HTML 5 compatible replacement.)

John Nagle
 

John Nagle

Steven said:
John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.

I'm not entirely sure *what* he is doing, and hopefully he'll speak up
and say, but whatever the problem is, it's not going to be as simple as
that.

I didn't write this code; I'm just using it. As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use. HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.

John Nagle
 

Stefan Behnel

John said:
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.

Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input. That should give you a tremendous speed-up over your
current code in most cases, while keeping things robust in the hard cases.
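
A minimal sketch of that strategy (the function name and the crude
"did it work" test are illustrative, not Stefan's; as the follow-up below
points out, detecting failure properly is the hard part):

    from lxml import etree, html
    import html5lib

    def parse_page(raw):
        """Try lxml's fast parser first; fall back to html5lib."""
        try:
            tree = html.fromstring(raw)
            # Placeholder sanity check -- real failure detection is harder.
            if tree is not None and len(tree) > 0:
                return tree
        except (etree.ParserError, etree.XMLSyntaxError):
            pass
        # Slow but browser-compatible path.  Note that the two branches
        # may hand back slightly different tree objects.
        return html5lib.parse(raw, treebuilder="lxml")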

Note the numbers that Ian Bicking has for HTML parser performance:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

You should be able to run lxml's parser ten times in different
configurations (e.g. different charset overrides) before it even reaches
the time that BeautifulSoup would need to parse a document once. Given that
undeclared character set detection is something where BS is a lot better
than lxml, you can also mix the best of both worlds and use BS's character
set detection to configure lxml's parser if you notice that the first
parsing attempts fail.
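
A sketch of that mix-and-match idea, written against the current bs4
package (older BeautifulSoup releases expose the same UnicodeDammit class
under a different import path):

    from bs4 import UnicodeDammit        # BeautifulSoup's charset detection
    from lxml import etree, html

    def parse_with_detected_charset(raw_bytes):
        # Let UnicodeDammit guess the undeclared encoding...
        detected = UnicodeDammit(raw_bytes).original_encoding
        # ...and hand that guess to lxml's parser as an explicit override.
        parser = etree.HTMLParser(encoding=detected or "utf-8")
        return html.fromstring(raw_bytes, parser=parser)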

And yes, html5lib performs pretty badly in comparison (or did, at the
time). But the numbers seem to indicate that if you can drop the ratio of
documents that require a run of html5lib below 30% and use lxml's parser
for the rest, you will still be faster than with BeautifulSoup alone.
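
Put as arithmetic (the symbols are mine, not numbers from Bicking's
measurements): with per-document times t_lxml, t_html5lib and t_bs, and a
fraction p of documents that fall through to html5lib, the mixed strategy
wins whenever t_lxml + p * t_html5lib < t_bs, i.e. whenever
p < (t_bs - t_lxml) / t_html5lib.

    def break_even_fraction(t_lxml, t_html5lib, t_bs):
        """Largest html5lib-fallback fraction at which the lxml-first
        strategy still beats parsing everything with BeautifulSoup.
        Feed it timings measured on your own document mix."""
        return (t_bs - t_lxml) / t_html5lib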

Stefan
 

John Nagle

Stefan said:
Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input.

Detecting "fail" is difficult. A common problem is badly terminated
comments which eat most of the document if you follow the spec. The
document seems to parse correctly, but most of it is missing. The
HTML 5 spec actually covers things like

<!This is a bogus SGML directive>

and treats it as a bogus comment. (That's because HTML 5 doesn't
include general SGML; the only directive recognized is DOCTYPE.
Anything else after "<!" is treated as a token-level error.)
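
A quick way to see that behaviour (hypothetical snippet; exactly where the
comment node ends up in the tree depends on where the directive appears in
the document):

    import html5lib
    from xml.dom import Node

    markup = "<!This is a bogus SGML directive><p>Rest of the page.</p>"
    doc = html5lib.parse(markup, treebuilder="dom")

    def comments_in(node):
        """Collect comment text from anywhere in the minidom tree."""
        found = []
        for child in node.childNodes:
            if child.nodeType == Node.COMMENT_NODE:
                found.append(child.data)
            found.extend(comments_in(child))
        return found

    # The bogus directive comes back as a comment node, and parsing
    # continues, so the rest of the page is still there.
    print(comments_in(doc))
    print(doc.getElementsByTagName("p")[0].firstChild.data)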

So using an agreed-upon parsing method, in the form of html5lib,
is desirable, in that it should mimic browser behavior.

John Nagle
 
