BeautifulSoup vs. real-world HTML comments


John Nagle

The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:


<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>


Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a single text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser-level
problem.
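
For concreteness, the failure mode can be sketched in a few lines against
the stack under discussion (Python 2, BeautifulSoup 3 on top of sgmllib);
the page text is abbreviated here:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x, Python 2 era

page = ('<!Hello there! Welcome to The Environment Directory!>'
        '<HTML><HEAD><TITLE>Environment Web Directory</TITLE></HEAD></HTML>')
soup = BeautifulSoup(page)
# The bad declaration derails the parse: the rest of the page is swallowed
# as one text chunk, so the TITLE tag is never seen.
print(soup.title)  # -> None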


John Nagle
 

Carl Banks

John said:
The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:

<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

<HTML><HEAD>
<TITLE>Environment Web Directory</TITLE>

Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle them
without problems.

BeautifulSoup can't parse this page usefully at all.
It treats the entire page as a single text chunk. It's actually
HTMLParser that parses comments, so this is really an HTMLParser-level
problem.

Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.
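
A minimal sketch of that pipeline in Python, assuming the tidy binary is
installed and on PATH (the flags used are standard HTML Tidy options):

import subprocess

def tidy_html(raw_html):
    """Run tidy(1) as a filter and return the repaired markup."""
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml", "--force-output", "yes"],
        input=raw_html, capture_output=True, text=True,
    )
    # tidy exits 1 for warnings and 2 for errors; --force-output makes it
    # emit repaired markup on stdout even for badly broken input.
    return result.stdout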


Carl Banks
 

Robert Kern

Carl said:
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

irstas

Carl said:
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.

That's a good suggestion. In fact it looks like there's a Python API
for tidy:
http://utidylib.berlios.de/
Tried it, seems to get rid of <! comments > just fine.
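
For reference, a minimal sketch of that API, assuming uTidylib's
documented parseString interface (the keyword arguments map onto tidy's
own configuration options):

import tidy  # uTidylib installs itself as the "tidy" module

raw = open("page.html").read()
# parseString returns a document object; str() yields the cleaned markup.
doc = tidy.parseString(raw, output_xhtml=1, indent=1, tidy_mark=0)
print(str(doc))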
 

Steve Holden

Carl said:
Google for a program called "tidy". Install it, and run it as a
filter on any HTML you download. "tidy" has invested in it quite a
bit of work understanding common bad HTML and how browsers deal with
it. It would be pointless to duplicate that work in the Python
standard library; let HTMLParser be small and tight, and outsource the
handling of floozy input to a dedicated program.

eGenix have produced the mxTidy library that handily incorporates these
features in a way that makes them easy for Python programmers to use.

regards
Steve
 

Carl Banks

Robert said:
Well, BeautifulSoup is just such a dedicated library.

No, not really.

Robert said:
However, it defers its handling of comments to HTMLParser. That's the
problem.

Well, it's up to the writers of Beautiful Soup to decide how much bad
HTML they want to accept. ISTM they're happy to live with the
limitations of HTMLParser, meaning that they do not consider Beautiful
Soup to be a library dedicated to reading every piece of bad HTML out
there.

(Though it's not like I read their mailing list. Maybe they aren't
happy with HTMLParser.)


Carl Banks
 

Paul Boddie

John said:
The syntax that browsers understand as HTML comments is much less
restrictive than what BeautifulSoup understands. I keep running into
sites with formally incorrect HTML comments which are parsed happily
by browsers. Here's yet another example, this one from
"http://www.webdirectory.com". The page starts like this:


<!Hello there! Welcome to The Environment Directory!>
<!Not too much exciting HTML code here but it does the job! >
<!See ya, - JD >

Anything based on libxml2 and its HTML parser will handle such broken
HTML just fine, even if they just ignore such erroneous attempts at
comments, discarding them as the plain nonsense they clearly are.
Certainly, libxml2dom seems to deal with the page:

import libxml2dom
d = libxml2dom.parseURI("http://www.webdirectory.com", html=1,
                        htmlencoding="iso-8859-1")

I guess lxml and the original libxml2 bindings work at least as well.
Note that some browsers won't be as happy if you give them such
content as XHTML.
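
For comparison, a sketch of the same fetch through lxml's HTML parser,
which also sits on top of libxml2 (assuming lxml is installed; libxml2
can retrieve HTTP URLs itself):

from lxml import html

# libxml2's HTML parser recovers from the bogus <! ... > pseudo-comments.
doc = html.parse("http://www.webdirectory.com")
print(doc.getroot().findtext(".//title"))  # -> "Environment Web Directory"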

Paul
 

Robert Kern

Carl said:
No, not really.

Yes, it is. Whether it succeeds in all particulars is beside the point. The
only mission of BeautifulSoup is to handle bad HTML. That tidy doesn't
successfully handle some other subset of bad HTML doesn't mean it's not a
dedicated program for handling bad HTML.
Carl said:
Well, it's up to the writers of Beautiful Soup to decide how much bad
HTML they want to accept. ISTM they're happy to live with the
limitations of HTMLParser, meaning that they do not consider Beautiful
Soup to be a library dedicated to reading every piece of bad HTML out
there.

Sorry, let me be clearer: the problem is that they haven't overridden
SGMLParser's comment handling (not HTMLParser's, sorry) the way they have
overridden many other parts of SGMLParser. Yes, any fix should go into
BeautifulSoup and not SGMLParser.

All it takes is someone to code up their desired behavior for these perverse
comments and submit it to Leonard Richardson.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Carl Banks

Robert said:
Yes, it is. Whether it succeeds in all particulars is beside the point. The
only mission of BeautifulSoup is to handle bad HTML.

I think the authors of BeautifulSoup have the right to decide what
their own mission is.


Carl Banks
 

Robert Kern

Carl said:
I think the authors of BeautifulSoup have the right to decide what
their own mission is.

Yes, and he's stated it pretty clearly:

"""You didn't write that awful page. You're just trying to get some data out of
it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."""

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

John Nagle

Robert said:
Yes, and he's stated it pretty clearly:

"""You didn't write that awful page. You're just trying to get some data out of
it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."""

That's a good summary of the issue. It's a real problem, because
BeautifulSoup's default behavior in the presence of a bad comment is to
silently suck up all remaining text, ignoring HTML markup.

The problem actually is in BeautifulSoup, in parse_declaration:

def parse_declaration(self, i):
    """Treat a bogus SGML declaration as raw data. Treat a CDATA
    declaration as a CData object."""
    j = None
    if self.rawdata[i:i+9] == '<![CDATA[':
        k = self.rawdata.find(']]>', i)
        if k == -1:
            k = len(self.rawdata)
        data = self.rawdata[i+9:k]
        j = k + 3
        self._toStringSubclass(data, CData)
    else:
        try:
            j = SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            toHandle = self.rawdata[i:]
            self.handle_data(toHandle)
            j = i + len(toHandle)
    return j

Note what happens when a bad declaration is found. SGMLParser.parse_declaration
raises SGMLParseError, and the exception handler just sucks up the rest of the
input (note that "rawdata[i:]"), treats it as unparsed data, and advances
the position to the end of input.

That's too brutal. One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level
to get the parser back on track. Maybe suck up input until the next ">",
treat that as data, then continue parsing from that point. That will do
the right thing most of the time, although bad declarations containing
a ">" will still be misparsed.

How about this patch?

except SGMLParseError: # bad decl, must recover
k = self.rawdata.find('>', i) # find next ">"
if k == -1 : # if no find
k = len(self.rawdata) # use entire string
toHandle = self.rawdata[i:k] # take up to ">" as data
self.handle_data(toHandle) # treat as data
j = i + len(toHandle) # pick up parsing after ">"

This is untested, but this or something close to it should make
BeautifulSoup much more robust.

It might make sense to catch SGMLParseError at some other places, too:
advance past the next ">" and restart parsing.
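
For anyone who wants to experiment without patching the installed library,
the same recovery idea can be sketched as a subclass (BeautifulSoup 3 on
sgmllib; the CDATA branch of the original method is omitted here for
brevity, so treat this as a sketch rather than a drop-in replacement):

from sgmllib import SGMLParser, SGMLParseError
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

class RobustSoup(BeautifulSoup):
    def parse_declaration(self, i):
        try:
            return SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            # Recover: emit everything up to the next ">" as text,
            # then resume parsing there instead of giving up.
            k = self.rawdata.find('>', i)
            if k == -1:
                k = len(self.rawdata)
            self.handle_data(self.rawdata[i:k])
            return k

soup = RobustSoup('<!This is an invalid comment><p>hello</p>')
print(soup.find('p'))  # -> <p>hello</p>; parsing continues past the bad decl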

John Nagle
 

John Nagle

John said:
Note what happens when a bad declaration is found.
SGMLParser.parse_declaration raises SGMLParseError, and the exception
handler just sucks up the rest of the input (note that "rawdata[i:]"),
treats it as unparsed data, and advances the position to the end of
input.

That's too brutal. One bad declaration and the whole parse is messed up.
Something needs to be done at the BeautifulSoup level to get the parser
back on track. Maybe suck up input until the next ">", treat that as
data, then continue parsing from that point. That will do the right
thing most of the time, although bad declarations containing a ">" will
still be misparsed.

How about this patch?

except SGMLParseError:               # bad decl, must recover
    k = self.rawdata.find('>', i)    # find next ">"
    if k == -1:                      # no ">" found
        k = len(self.rawdata)        # use the entire remaining string
    toHandle = self.rawdata[i:k]     # take text up to the ">" as data
    self.handle_data(toHandle)       # treat it as data
    j = i + len(toHandle)            # resume parsing at the ">"

I've been testing this, and it's improved parsing considerably. Now,
common lines like
common lines like

<!This is an invalid comment>

don't stop parsing.

John Nagle
 
