BeautifulSoup vs. Microsoft


John Nagle

Here's a construct with which BeautifulSoup has problems. It's
from "http://support.microsoft.com/contactussupport/?ws=support".

This is the original:


<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>


And this is what comes back after parsing with BeautifulSoup
and using "prettify":


<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="&lt;!--http://www.microsoft.com/usability/information.mspx-&gt;">
<br clear="all" style="line-height: 1px; overflow: hidden" />
<table id="msviFooter" width="100%" cellpadding="0"
cellspacing="0">
<tr valign="bottom">

<td id="msviFooter2"
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF',
endColorStr='#3F8CDA', gradientType='1')">
<div id="msviLocalFooter">
<nobr>
</nobr>
</div>
</td>
</tr>
</table>
</a>

All that other stuff is in the neighborhood, but not in that <a> tag.

Strictly speaking, it's Microsoft's fault.

title="<!--http://www.microsoft.com/usability/information.mspx->"

is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.

It's so Microsoft.

Unfortunately, even Firefox accepts bad comments like that.

Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text. What's the right way
to collect ordinary text only?

John Nagle
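
On the text-only question, here is one possible approach. It is a minimal sketch that uses Python's standard-library html.parser rather than BeautifulSoup, which is a deliberate substitution. The class name TextOnly, the helper ordinary_text and the sample markup are made up for illustration. The idea is to keep character data and let comments, processing instructions and declarations fall through to the parser's do-nothing handlers. In BeautifulSoup itself, comments and similar nodes are subclasses of NavigableString, so filtering on the exact node type would be the analogous idea, though that is not shown here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data only; comments, PIs and declarations are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for ordinary text between tags.
        if data.strip():
            self.chunks.append(data.strip())

    # handle_comment, handle_pi and handle_decl are left at their
    # do-nothing defaults, so that markup never reaches self.chunks.


def ordinary_text(html):
    """Return the non-blank text chunks of an HTML string (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample containing a declaration, a processing instruction and a comment.
sample = (
    '<!DOCTYPE html>'
    '<?xml-stylesheet href="x.css"?>'
    '<p>Help us improve <!-- hidden note --> our products</p>'
)
print(ordinary_text(sample))
```

Note that this parser also hands the contents of script and style elements to handle_data, so a real page would probably need those filtered out as well.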
 

Duncan Booth

John Nagle said:
Strictly speaking, it's Microsoft's fault.

title="<!--http://www.microsoft.com/usability/information.mspx->"

is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.

It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.
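
Duncan's point can be checked without BeautifulSoup. The sketch below feeds the anchor from the first post to Python's standard-library html.parser (a stand-in chosen for illustration, not the parser under discussion) and records attributes, comments and text separately. The class name AnchorProbe is hypothetical. If the attribute is treated as plain data, the title should come back intact and no comment should be reported.

```python
from html.parser import HTMLParser

# The anchor quoted at the top of the thread, verbatim.
ANCHOR = '''<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>'''


class AnchorProbe(HTMLParser):
    """Record start-tag attributes, comments and text separately."""

    def __init__(self):
        super().__init__()
        self.attrs = {}
        self.comments = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.attrs = dict(attrs)

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


probe = AnchorProbe()
probe.feed(ANCHOR)
probe.close()

print(probe.attrs["title"])   # the comment-like string, kept as attribute data
print(probe.comments)         # no comments were seen
print(probe.text)             # the link text
```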
 

Justin Ezequiel

Duncan Booth said:
It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.

FWIW, see http://tinyurl.com/yjtzjz

new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)
 

Duncan Booth

Justin Ezequiel said:
FWIW, see http://tinyurl.com/yjtzjz

new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)

I'm right behind BeautifulSoup's ability to parse bad HTML, but I still
think it should give priority to being able to parse valid HTML without
messing it up.
 

Paul McGuire

John Nagle said:
Here's a construct with which BeautifulSoup has problems. It's
from "http://support.microsoft.com/contactussupport/?ws=support".

This is the original:

<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>

Strictly speaking, it's Microsoft's fault.

title="<!--http://www.microsoft.com/usability/information.mspx->"

is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.

No, that comment is inside a quoted string, so it should be ok.

If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:


from urllib.request import urlopen
from pyparsing import makeHTMLTags

# fetch the page and decode it to text for pyparsing
pg = urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read().decode("utf-8", errors="replace")
pg.close()

# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]

for a in anchorTag.searchString(htmlSrc):
    # skip anchors that carry no title attribute
    if "title" in a:
        print("Title:", a.title)
        print("HREF:", a.href)
        # or use this statement to dump the complete tag contents
        # print(a.dump())
        print()

Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx

Title: Print this page
HREF: /gp/noscript/

Title: Print this page
HREF: /gp/noscript/

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&amp;body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&amp;body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

-- Paul
 

John Nagle

Duncan said:
It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.

I think you're right. The HTML 4 spec,

http://www.w3.org/TR/html4/intro/sgmltut.html

says "Note that comments are markup". So recognizing comment syntax
inside an attribute is, in fact, an error in BeautifulSoup.

The source HTML on the Microsoft page is thus syntactically correct,
although meaningless. That's the only place on that page with a
comment-type form in an attribute.

John Nagle
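
To illustrate why recognizing comment syntax inside an attribute goes wrong, here is a small sketch comparing two ways of finding comments. It is not a claim about how BeautifulSoup works internally. The markup in PAGE is hypothetical and shortened, and CommentCounter is a made-up name. The first approach scans the raw text for comment delimiters. The second uses Python's standard-library html.parser, which only looks for comments outside of tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical, shortened markup in the shape described in the thread.
PAGE = (
    '<a href="enroll.mspx" title="<!--information.mspx->">Help us improve</a>\n'
    '<table id="msviFooter"><tr><td>footer</td></tr></table>\n'
    '<!-- a real comment -->\n'
    '<p>after</p>\n'
)

# Naive approach: look for comment delimiters in the raw text,
# with no idea that the first "<!--" sits inside a quoted attribute.
naive_comments = re.findall(r"<!--(.*?)-->", PAGE, flags=re.DOTALL)


class CommentCounter(HTMLParser):
    """Collect only what the tokenizer itself reports as comments."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


counter = CommentCounter()
counter.feed(PAGE)
counter.close()

print(naive_comments)     # one bogus "comment" that swallows the table
print(counter.comments)   # only the real comment
```

The text-level scan starts its "comment" inside the title attribute and runs to the next "-->", taking the link text and the table with it. That resembles the symptom in the first post, where unrelated markup ended up attached to the <a> tag.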
 
