Is there a module/function to remove all the HTML entities from an HTML
document (e.g. &nbsp;, &amp;, &apos;, etc.)?
htmllib has this capability, but if you're not doing any other HTML
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
does nicely:

    import re
    import htmlentitydefs

    def convertentity(m):
        # m.group(1) is '#' for numeric entities, '' for named ones
        if m.group(1) == '#':
            try:
                return chr(int(m.group(2)))
            except ValueError:
                # not a decimal number, or outside chr()'s range
                return '&#%s;' % m.group(2)
        try:
            return htmlentitydefs.entitydefs[m.group(2)]
        except KeyError:
            # unknown entity name: leave it untouched
            return '&%s;' % m.group(2)

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);', convertentity, s)

    converthtml('Some &lt;html&gt; string.')  # --> 'Some <html> string.'

Unknown or invalid entities are left in &xxx; format, and Unicode entities
(numeric codes that chr() cannot handle) are left in &#nnn; format. If you
want a Unicode string to be
returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
and 'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
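
For anyone on Python 3, here is a rough sketch of that Unicode variant. It assumes the Python 3 names: htmlentitydefs became html.entities, and chr() there does what unichr() did. The extra OverflowError catch is my own addition for absurdly large numbers; everything else follows the code above.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs


def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    if m.group(1) == '#':
        try:
            # Python 3's chr() covers all of Unicode, like unichr() in Python 2
            return chr(int(m.group(2)))
        except (ValueError, OverflowError):
            # not a decimal number, or not a valid code point
            return '&#%s;' % m.group(2)
    try:
        return chr(name2codepoint[m.group(2)])
    except KeyError:
        # unknown entity name: leave it untouched
        return '&%s;' % m.group(2)


def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)


print(converthtml('Some &lt;html&gt; string, caf&#233; &bogus;'))
```

As before, hexadecimal references such as &#x41; are not decoded, because int() is called without a base, so they pass through unchanged.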
Hope this helps.