unescaping xml escape codes



I'm working with strings that contain XML escape codes, such as '&#48;',
and need a way in Python to unescape these back to their ASCII
representation, such as '&', but can't seem to find a Python method for
this. I tried xml.sax.saxutils.unescape(s), but while it works with
'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
suggestions on how to decode numeric XML escape codes such as this?
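
For anyone who wants to reproduce the behaviour described above, here is a small sketch. It is written for Python 3, which is an assumption on my part (the thread itself predates it); the sample strings are made up for illustration.

```python
from xml.sax.saxutils import unescape

# The predefined entities &amp;, &lt; and &gt; are handled out of the box.
print(unescape('a &amp; b &lt; c'))

# Numeric character references pass through unchanged.
print(unescape('&#48; &#x30;'))

# The optional second argument maps extra entity strings to replacements,
# but it is plain string replacement, so it cannot cover every numeric code.
print(unescape('&quot;hi&quot;', {'&quot;': '"'}))
```

So saxutils.unescape is fine for the named entities, but something else is needed for the numeric forms.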

Bengt Richter

Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

If you want to do this properly, I think you have to parse the html a little and see
what the encoding is, and convert to unicode, and then do the conversions.
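
A rough sketch of that "decode first, then convert" order, assuming Python 3.4 or later (the code in this post is Python 2) and assuming the encoding is already known rather than sniffed from the document. The helper name decode_and_unescape and the sample bytes are made up for illustration. Note that html.unescape follows HTML rules, so it accepts more named entities than strict XML defines.

```python
import html


def decode_and_unescape(raw, encoding="utf-8"):
    """Decode bytes to text first, then expand character references."""
    text = raw.decode(encoding)   # step 1: bytes -> Unicode text
    return html.unescape(text)    # step 2: &amp;, &#48;, &#x3c9; -> characters


sample = b"blah [&#48;] blah [&#246;] blah [&amp;] &#x3c9;"
print(decode_and_unescape(sample))
```

Because the result is a Unicode string, characters above 255 such as the omega survive instead of being replaced.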

Very little tested!!
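====< cvthtmlent.py >======================================
# Python 2 code (print statement, file(), 8-bit str).
import re

# Matches decimal (&#246;) and hex (&#xf6;) numeric character references.
rxo = re.compile(r'\&\#(x?[0-9a-fA-F]+);')

def ent2chr(m):
    code = m.group(1)
    if code.isdigit(): code = int(code)
    else: code = int(code[1:], 16)
    if code<256: return chr(code)
    else: return '?' #XXX unichr(code).encode('utf-16le') ??

def cvthtmlent(s): return rxo.sub(ent2chr, s)

if __name__ == '__main__':
    import sys; args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-test':
            print cvthtmlent(
                'blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
        else:
            if arg == '-': fi = sys.stdin
            else: fi = file(arg)
            for line in fi:
                # The loop body is missing from the archived post; writing out
                # each converted line is an assumption about what it did.
                sys.stdout.write(cvthtmlent(line))
===========================================================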
If you run this in IDLE, you can see the umlaut, but not the omega, which becomes a '?':

>>> from cvthtmlent import cvthtmlent as cvt
>>> print cvt('blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
blah [0] blah [ö] blah [123] ?

Martin can tell you the real scoop ;-)
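
On Python 3, where str is Unicode throughout, the same regex-sub idea no longer needs the '?' fallback. The following is only a sketch under that assumption; the names NUMERIC_REF, _ref_to_char and unescape_numeric are made up here and are not from cvthtmlent.py. The pattern is also slightly stricter than the one above: hex digits are only accepted after an 'x', so something like '&#ff;' is left alone.

```python
import re

# Decimal (&#246;) or hexadecimal (&#x3c9;) numeric character references.
NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def _ref_to_char(match):
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Not a valid code point: leave the reference untouched.
        return match.group(0)


def unescape_numeric(s):
    """Replace numeric character references in s with the characters they name."""
    return NUMERIC_REF.sub(_ref_to_char, s)


print(unescape_numeric('blah [&#48;] blah [&#246;] blah [123] &#x3c9;'))
```

Named entities such as '&amp;' are deliberately not touched by this sketch, so it would still need to be combined with xml.sax.saxutils.unescape or similar.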

Bengt Richter

Bengt Richter

On 11 Aug 2003 00:09:42 GMT, (e-mail address removed) (Bengt Richter) wrote:
Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above or also the hex? If your coded entities are &#0; to &#255; or
&x00; to &xff; this might work. Other entities are converted to '?'.

That should be &#x00; and &#xff; respectively. I did implement hex entities after all.
Botched reediting this commentary however ;-P

Bengt Richter
