BeautifulSoup vs. Microsoft

John Nagle · Mar 29, 2007

Here's a construct with which BeautifulSoup has problems. It's
from "http://support.microsoft.com/contactussupport/?ws=support".

This is the original:

<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>

And this is what comes back after parsing with BeautifulSoup
and using "prettify":

<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->">
<br clear="all" style="line-height: 1px; overflow: hidden" />
<table id="msviFooter" width="100%" cellpadding="0"
cellspacing="0">
<tr valign="bottom">

<td id="msviFooter2"
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF',
endColorStr='#3F8CDA', gradientType='1')">
<div id="msviLocalFooter">
<nobr>
</nobr>
</div>
</td>
</tr>
</table>
</a>

All that other stuff is in the neighborhood, but not in that <a> tag.

Strictly speaking, it's Microsoft's fault.

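title="<!--http://www.microsoft.com/usability/information.mspx->" opens a
comment but ends in "->", not "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.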

It's so Microsoft.

Unfortunately, even Firefox accepts bad comments like that.

Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text. What's the right way
to collect ordinary text only?

John Nagle
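
For comparison, here is a minimal sketch of "ordinary text only" using a different parser. It assumes Python 3's standard-library html.parser rather than BeautifulSoup, so it is a swapped-in technique, not an answer from the thread. The TextCollector class and the SAMPLE markup are hypothetical names made up for illustration. The idea is that this parser reports character data, comments, and processing instructions through separate callbacks, so keeping only handle_data output leaves the real text.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: a processing instruction, a comment, and real text.
SAMPLE = '<?xml-stylesheet href="x.css"?><!-- not text --><p>Real <b>text</b> here</p>'


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps character data apart from comments and PIs."""

    def __init__(self):
        super().__init__()
        self.chunks = []     # ordinary text only
        self.comments = []   # comment bodies, kept separately
        self.pis = []        # processing instructions, kept separately

    def handle_data(self, data):
        # Called only for character data between tags.
        self.chunks.append(data)

    def handle_comment(self, data):
        # Comments arrive here, never through handle_data.
        self.comments.append(data)

    def handle_pi(self, data):
        # Processing instructions arrive here as well.
        self.pis.append(data)


collector = TextCollector()
collector.feed(SAMPLE)
collector.close()

text = "".join(collector.chunks)
print(text)
print(collector.comments)
```

In BeautifulSoup itself, comments and processing instructions are subclasses of NavigableString, so filtering the results of findAll(text=True) by type should have the same effect. That variant is not tested here.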

Duncan Booth · Mar 29, 2007

John Nagle said:
Strictly speaking, it's Microsoft's fault.

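title="<!--http://www.microsoft.com/usability/information.mspx->" opens a
comment but ends in "->", not "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.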

It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.
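
The claim that the title value is just an attribute can be checked against another parser. This is a minimal sketch, assuming Python 3's standard-library html.parser as a stand-in. It is not BeautifulSoup and nobody in the thread used it. The AnchorProbe class is a hypothetical helper. It feeds the anchor from the original post to the parser and records the attributes, the anchor text, and any comments the parser reports.

```python
from html.parser import HTMLParser

# The anchor from the original post, including the comment-like title value.
SNIPPET = """<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>"""


class AnchorProbe(HTMLParser):
    """Hypothetical helper: records <a> attributes, anchor text, and comments."""

    def __init__(self):
        super().__init__()
        self.attrs = {}
        self.text_parts = []
        self.comments = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self.attrs = dict(attrs)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep text that sits between <a> and </a>.
        if self._in_anchor:
            self.text_parts.append(data)

    def handle_comment(self, data):
        # Would fire if the parser treated "<!--" in the attribute as a comment.
        self.comments.append(data)


probe = AnchorProbe()
probe.feed(SNIPPET)
probe.close()

anchor_text = "".join(probe.text_parts).strip()
print(probe.attrs["title"])
print(anchor_text)
print(probe.comments)
```

If the parser keeps the quoted value intact, the title comes back unchanged, the anchor text is the single line from the page, and no comment is reported.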

Justin Ezequiel · Mar 29, 2007

It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.

FWIW, see http://tinyurl.com/yjtzjz

new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)

Justin Ezequiel · Mar 29, 2007

FWIW, see http://tinyurl.com/yjtzjz

hmm. not quite right.

http://tinyurl.com/ynv4ct

or

http://www.crummy.com/software/BeautifulSoup/documentation.html#Customizing the Parser

Duncan Booth · Mar 29, 2007

Justin Ezequiel said:
FWIW, see http://tinyurl.com/yjtzjz

new fan of BeautifulSoup here as it helped me parse "BAD" XML
(although my client would disagree with that description)

I'm right behind BeautifulSoup's ability to parse bad HTML, but I still
think it should give priority to being able to parse valid HTML without
messing it up.

Paul McGuire · Mar 29, 2007

Here's a construct with which BeautifulSoup has problems. It's
from "http://support.microsoft.com/contactussupport/?ws=support".

This is the original:

<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>

Strictly speaking, it's Microsoft's fault.

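title="<!--http://www.microsoft.com/usability/information.mspx->" opens a
comment but ends in "->", not "-->". So all that following stuff is from what
follows the next "-->" which terminates a comment.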

No, that comment is inside a quoted string, so it should be ok.

If you are just trying to extract <a href=...> tags, this pyparsing
scraper gets them, including this problematic one:

import urllib
from pyparsing import makeHTMLTags

# fetch the page source (Python 2 urllib)
pg = urllib.urlopen("http://support.microsoft.com/contactussupport/?ws=support")
htmlSrc = pg.read()
pg.close()

# only take first tag returned from makeHTMLTags, not interested in
# closing </a> tags
anchorTag = makeHTMLTags("A")[0]

for a in anchorTag.searchString(htmlSrc):
    if "title" in a:
        print "Title:", a.title
        print "HREF:", a.href
        # or use this statement to dump the complete tag contents
        # print a.dump()
        print

Prints:
Title: <!--http://www.microsoft.com/usability/information.mspx->
HREF: http://www.microsoft.com/usability/enroll.mspx

Title: Print this page
HREF: /gp/noscript/

Title: Print this page
HREF: /gp/noscript/

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: E-mail this page
HREF: mailto:?subject=Help%20and%20Support&body=http%3a%2f%2fsupport.microsoft.com%2fdefault.aspx%2fcontactussupport%2f%3fws%3dsupport

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Microsoft Worldwide
HREF: /common/international.aspx?rdPath=0

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Save to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Go to My Support Favorites
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

Title: Send Feedback
HREF: /gp/noscript/

-- Paul

John Nagle · Mar 29, 2007

Duncan said:
It is an attribute value, and unescaped angle brackets are valid in
attributes. It looks to me like a bug in BeautifulSoup.

I think you're right. The HTML 4 spec,

http://www.w3.org/TR/html4/intro/sgmltut.html

says "Note that comments are markup". So recognizing comment syntax
inside an attribute is, in fact, an error in BeautifulSoup.

The source HTML on the Microsoft page is thus syntactically correct,
although meaningless. That's the only place on that page with a
comment-type form in an attribute.

John Nagle
