Easy way to remove HTML entities from an HTML document?

Robert Oschler · Jul 25, 2004

Is there a module/function to remove all the HTML entities from an HTML
document (e.g. - &nbsp, &amp, &apos, etc.)?

If not I'll just write one myself but I figured I'd save myself some time.

Thanks,

Christopher T King · Jul 25, 2004

Robert Oschler said:
Is there a module/function to remove all the HTML entities from an HTML
document (e.g. - &nbsp, &amp, &apos, etc.)?

htmllib has this capability, but if you're not doing any other HTML
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
does nicely:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1)=='#':
        try:
            return chr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'

Unknown or invalid entities are left in &xxx; format, and Unicode
entities are likewise left in &#xxx; format. If you want a Unicode string to be
returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
and 'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
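
For readers on Python 3, where htmlentitydefs became html.entities and chr() covers what unichr() used to do, the same idea might look roughly like the sketch below. It is not the code from this post: it assumes Python 3, it swaps the (.+?) group for the tighter (\w+) so that a stray '&' earlier in the text cannot swallow a real entity, and it returns the matched text unchanged for anything it cannot convert.

```python
import re
from html.entities import name2codepoint

def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    is_numeric, body = m.group(1), m.group(2)
    try:
        if is_numeric:
            # Decimal character reference such as &#169;
            return chr(int(body))
        # Named entity such as &amp; or &lt;
        return chr(name2codepoint[body])
    except (ValueError, KeyError, OverflowError):
        # Unknown or invalid entity: leave the original text untouched.
        return m.group(0)

def converthtml(s):
    # Assumption: (\w+) instead of the original (.+?), see the note above.
    return re.sub(r'&(#?)(\w+);', convertentity, s)

print(converthtml('Some &lt;html&gt; string.'))  # Some <html> string.
```

On Python 3.4 and later the standard library's html.unescape() does this job directly, including hexadecimal references such as &#x41; that the sketch above leaves alone, so the regex version is mainly useful for seeing how the pieces fit together.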

Hope this helps.

Michael Scarlett · Jul 26, 2004

Robert Oschler said:
Is there a module/function to remove all the HTML entities from an HTML
document (e.g. - &nbsp, &amp, &apos, etc.)?

If not I'll just write one myself but I figured I'd save myself some time.

Thanks,

Check out Mark Pilgrim's site: http://diveintopython.org/html_processing/index.html

Robert Oschler · Jul 26, 2004

Christopher T King said:
htmllib has this capability, but if you're not doing any other HTML
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
does nicely:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1)=='#':
        try:
            return chr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'

Unknown or invalid entities are left in &xxx; format, and Unicode
entities are likewise left in &#xxx; format. If you want a Unicode string to be
returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
and 'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.

Hope this helps.

Chris,

I believe the line that reads:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

Should read:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

Once I made that change it worked like a charm. I'm showing the correction
for future Usenet searchers.

So you can pass a function to re.sub() as the replacement pattern? Very
cool, I didn't know that. I think you could spend a year just learning
regular expressions and still miss something.

Thanks,
Robert.

Christopher T King · Jul 27, 2004

Robert Oschler said:
I believe the line that reads:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convert,s)

Should read:

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

Oops, you're right, mea culpa.

Robert Oschler said:
So you can pass a function to re.sub() as the replacement pattern? Very
cool, I didn't know that. I think you could spend a year just learning
regular expressions and still miss something.

That feature is only mentioned briefly in the online docs, and not at all
in sre.sub's docstring. Surprising, since it's indeed a very useful
feature.
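
As a small illustration of the feature discussed here, the sketch below passes a function as the second argument to re.sub(). The function receives each match object and whatever it returns replaces the matched text. The name double and the sample string are made up for the example, and the code assumes Python 3.

```python
import re

def double(m):
    # m is the match object for one run of digits; the returned
    # string is substituted in place of the matched text.
    return str(int(m.group(0)) * 2)

result = re.sub(r'\d+', double, '3 apples and 12 pears')
print(result)  # 6 apples and 24 pears
```

The same mechanism is what lets convertentity decide, entity by entity, what the replacement should be.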

Robert Oschler · Jul 27, 2004

Christopher T King said:
That feature is only mentioned briefly in the online docs, and not at all
in sre.sub's docstring. Surprising, since it's indeed a very useful
feature.

Chris,

Speaking of learning cool things by osmosis, do you know of a well commented
source of Python code, perhaps an Open Source project, that I could study to
learn more interesting techniques like the regexp tip you shared? I find
that studying other people's code is the best way to avoid getting in a
programming rut.

Thanks.

Christopher T King · Jul 29, 2004

Robert Oschler said:
Speaking of learning cool things by osmosis, do you know of a well commented
source of Python code, perhaps an Open Source project, that I could study to
learn more interesting techniques like the regexp tip you shared? I find
that studying other people's code is the best way to avoid getting in a
programming rut.

I seem to recall reading about that re.sub trick in something linked from
Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
are often links there to interesting and useful code snippets from
ActiveState's Python Cookbook and other sources; I'd say start there if
you want to find neat tricks you can do with Python.

I'm not sure of any particularly "well commented" Python projects though
(I've never really looked into that), but you'll probably find some
interesting small projects in the Vaults of Parnassus
(http://www.vex.net/parnassus/).

Robert Oschler · Jul 30, 2004

Christopher T King said:
I seem to recall reading about that re.sub trick in something linked from
Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
are often links there to interesting and useful code snippets from
ActiveState's Python Cookbook and other sources; I'd say start there if
you want to find neat tricks you can do with Python.

I'm not sure of any particularly "well commented" Python projects though
(I've never really looked into that), but you'll probably find some
interesting small projects in the Vaults of Parnassus
(http://www.vex.net/parnassus/).

Thanks Chris and thanks for all your other help.

With your Python skill you should work for Google. Too bad you don't, you'd
be a wealthy man soon (Google IPO). Wish I did.

Christopher T King · Jul 31, 2004

Robert Oschler said:
With your Python skill you should work for Google. Too bad you don't,
you'd be a wealthy man soon (Google IPO). Wish I did.

Thanks for the compliment.

To work at Google is my dream job, and I'm
sure that of many others on this list, too (makes me wonder if any Google
employees read this list...).
