Code that ought to run fast, but can't due to Python limitations.


John Nagle

Steven said:
Yes, I'm aware of that, but that's not what John's code is doing -- he's
doing a series of if expr ... elif expr tests. I don't think a case
statement can do much to optimize that.
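
A toy illustration of that point, not the html5lib code: a dispatch table
only pays off when every branch is an equality test on the same value,
while a chain of arbitrary boolean expressions still has to be evaluated
top to bottom.

    # Hypothetical example -- none of these names come from html5lib.
    DISPATCH = {" ": "space", "\t": "space", "<": "tag-open"}

    def classify_fast(ch):
        # Equality-only branches can collapse into a dictionary lookup.
        return DISPATCH.get(ch, "data")

    def classify_general(ch):
        # Branches built from membership tests, method calls and so on
        # stay an if/elif ladder, evaluated in order on every call.
        if ch in " \t\n\r\f":
            return "space"
        elif ch == "<":
            return "tag-open"
        elif ch.isalpha():
            return "letter"
        else:
            return "data"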

(I didn't write that code; it's from "http://code.google.com/p/html5lib/",
which is a general purpose HTML 5 parser written in Python. It's compatible
with ElementTree and/or BeautifulSoup. I currently use a modified
BeautifulSoup for parsing real-world HTML in a small-scale crawler, and
I'm looking at this as an HTML 5 compatible replacement.)

John Nagle
 

John Nagle

Steven said:
John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.

I'm not entirely sure *what* he is doing, and hopefully he'll speak up
and say, but whatever the problem is, it's not going to be as simple as
that.

I didn't write this code; I'm just using it. As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use. HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.

John Nagle
 

Stefan Behnel

John said:
I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.

Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input. That should give you a tremendous speed-up over your
current code in most cases, while keeping things robust in the hard cases.
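
A minimal sketch of that strategy (the function name and the crude
"did it work" test are illustrative, not Stefan's; as the follow-up below
points out, detecting failure properly is the hard part):

    from lxml import etree, html
    import html5lib

    def parse_page(raw):
        """Try lxml's fast parser first; fall back to html5lib."""
        try:
            tree = html.fromstring(raw)
            # Placeholder sanity check -- real failure detection is harder.
            if tree is not None and len(tree) > 0:
                return tree
        except (etree.ParserError, etree.XMLSyntaxError):
            pass
        # Slow but browser-compatible path.  Note that the two branches
        # may hand back slightly different tree objects.
        return html5lib.parse(raw, treebuilder="lxml")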

Note the numbers that Ian Bicking has for HTML parser performance:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

You should be able to run lxml's parser ten times in different
configurations (e.g. different charset overrides) before it even reaches
the time that BeautifulSoup would need to parse a document once. Given that
undeclared character set detection is something where BS is a lot better
than lxml, you can also mix the best of both worlds and use BS's character
set detection to configure lxml's parser if you notice that the first
parsing attempts fail.
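
A sketch of that mix-and-match idea, written against the current bs4
package (older BeautifulSoup releases expose the same UnicodeDammit class
under a different import path):

    from bs4 import UnicodeDammit        # BeautifulSoup's charset detection
    from lxml import etree, html

    def parse_with_detected_charset(raw_bytes):
        # Let UnicodeDammit guess the undeclared encoding...
        detected = UnicodeDammit(raw_bytes).original_encoding
        # ...and hand that guess to lxml's parser as an explicit override.
        parser = etree.HTMLParser(encoding=detected or "utf-8")
        return html.fromstring(raw_bytes, parser=parser)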

And yes, html5lib performs pretty badly in comparison (or did, at the
time). But the numbers seem to indicate that if you can drop the ratio of
documents that require a run of html5lib below 30% and use lxml's parser
for the rest, you will still be faster than with BeautifulSoup alone.
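
Put as arithmetic (the symbols are mine, not numbers from Bicking's
measurements): with per-document times t_lxml, t_html5lib and t_bs, and a
fraction p of documents that fall through to html5lib, the mixed strategy
wins whenever t_lxml + p * t_html5lib < t_bs, i.e. whenever
p < (t_bs - t_lxml) / t_html5lib.

    def break_even_fraction(t_lxml, t_html5lib, t_bs):
        """Largest html5lib-fallback fraction at which the lxml-first
        strategy still beats parsing everything with BeautifulSoup.
        Feed it timings measured on your own document mix."""
        return (t_bs - t_lxml) / t_html5lib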

Stefan
 

John Nagle

Stefan said:
Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input.

Detecting "fail" is difficult. A common problem is badly terminated
comments which eat most of the document if you follow the spec. The
document seems to parse correctly, but most of it is missing. The
HTML 5 spec actually covers things like

<!This is a bogus SGML directive>

and treats it as a bogus comment. (That's because HTML 5 doesn't
include general SGML; the only directive recognized is DOCTYPE.
Anything else after "<!" is treated as a token-level error.)
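
A quick way to see that behaviour (hypothetical snippet; exactly where the
comment node ends up in the tree depends on where the directive appears in
the document):

    import html5lib
    from xml.dom import Node

    markup = "<!This is a bogus SGML directive><p>Rest of the page.</p>"
    doc = html5lib.parse(markup, treebuilder="dom")

    def comments_in(node):
        """Collect comment text from anywhere in the minidom tree."""
        found = []
        for child in node.childNodes:
            if child.nodeType == Node.COMMENT_NODE:
                found.append(child.data)
            found.extend(comments_in(child))
        return found

    # The bogus directive comes back as a comment node, and parsing
    # continues, so the rest of the page is still there.
    print(comments_in(doc))
    print(doc.getElementsByTagName("p")[0].firstChild.data)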

So using an agreed-upon parsing method, in the form of html5lib,
is desirable, in that it should mimic browser behavior.

John Nagle
 
