John Nagle
Here's a construct with which BeautifulSoup has problems. It's
from "http://support.microsoft.com/contactussupport/?ws=support".
This is the original:
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->"
onclick="return MS_HandleClick(this,'C_32179', true);">
Help us improve our products
</a>
And this is what comes back after parsing with BeautifulSoup
and using "prettify":
<a href="http://www.microsoft.com/usability/enroll.mspx"
id="L_75998"
title="<!--http://www.microsoft.com/usability/information.mspx->">
<br clear="all" style="line-height: 1px; overflow: hidden" />
<table id="msviFooter" width="100%" cellpadding="0"
cellspacing="0">
<tr valign="bottom">
<td id="msviFooter2"
style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF',
endColorStr='#3F8CDA', gradientType='1')">
<div id="msviLocalFooter">
<nobr>
</nobr>
</div>
</td>
</tr>
</table>
</a>
All that other stuff is in the neighborhood, but not in that <a> tag.
Strictly speaking, it's Microsoft's fault.
title="<!--http://www.microsoft.com/usability/information.mspx->"
is supposed to be an HTML comment. But it's improperly terminated.
It should end with "-->". So the parser keeps looking for a terminator, and
all that extra stuff in the <a> tag is whatever follows the next "-->" on
the page, which is where the comment finally ends.
It's so Microsoft.
Unfortunately, even Firefox accepts bad comments like that.
Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments,
processing instructions, etc. as well as real text. What's the right way
to collect ordinary text only?
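
To make the question concrete, here is a minimal sketch of the behavior I'm after. It is not BeautifulSoup; it swaps in the standard library's html.parser, where ordinary text arrives through handle_data while comments and processing instructions go to separate callbacks. The names TextOnly and ordinary_text are made up for illustration. In BeautifulSoup, comments and processing instructions are subclasses of NavigableString, so filtering the findAll results by type might do the same job, but I have not verified that here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect ordinary character data; skip comments and processing instructions."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags. Comments are routed to
        # handle_comment and processing instructions to handle_pi,
        # which are left as the default no-ops here.
        text = data.strip()
        if text:
            self.chunks.append(text)


def ordinary_text(html):
    """Return the non-empty text chunks found in an HTML string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<p>Help us <!-- hidden note --> improve<?php x ?> our products</p>"
print(ordinary_text(sample))
```

Note that this sketch would still pick up the contents of script and style elements, since the parser reports those as data too.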
John Nagle