decode Numeric Character References to unicode

William Heymann · Feb 18, 2008

How do I decode a string back to useful unicode that has xml numeric character
references in it?

Things like &#21344;

Duncan Booth · Feb 18, 2008

William Heymann said:
How do I decode a string back to useful unicode that has xml numeric
character references in it?

Things like &#21344;

Try something like this:

import re
from htmlentitydefs import name2codepoint

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

Obviously if you really do only want numeric references you can take out 
the lines using name2codepoint and simplify the regex.
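
A numeric-only variant of that idea might look like the sketch below. The names NumericPattern and decode_numeric_refs are made up for illustration, and the unichr fallback is an assumption added so the same code also runs on Python 3, where chr() takes over that role. It expects a unicode string, so decode bytes first.

```python
import re

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # assumption: on Python 3, chr() returns a unicode character

# Group 1 captures decimal digits, group 2 captures hex digits.
# Named entities such as &ouml; are deliberately not matched.
NumericPattern = re.compile(r'&#(?:(\d+)|x([\da-fA-F]+));')

def decode_numeric_refs(text):
    """Replace &#NNN; and &#xHHH; references in a unicode string."""
    def unescape(match):
        dec, hexa = match.groups()
        try:
            return unichr(int(dec, 10) if dec else int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point out of range: leave the reference untouched.
            return match.group(0)
    return NumericPattern.sub(unescape, text)
```

Anything the pattern does not match, including named entities, passes through unchanged.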

7stud · Feb 18, 2008

How do I decode a string back to useful unicode that has xml numeric character
references in it?

Things like &#21344;

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>

Traceback (most recent call last):
File "test1.py", line 6, in ?
print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
position 0: ordinal not in range(128)

The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
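
The traceback names the 'ascii' codec: the bare print tried to encode the unicode string as ASCII, while the explicit .encode('utf-8') on the line before it succeeded. A small sketch of that difference, using only the character from the error message (the variable names are illustrative):

```python
ch = u'\u5360'  # the character reported in the traceback

# Encoding explicitly to UTF-8 works and yields three bytes.
utf8_bytes = ch.encode('utf-8')

# Encoding to ASCII fails with the same exception the bare print raised.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```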

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using unichr()
4) Convert the third format using int('0'+ match, 16) and then
unichr()
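
The four steps above can be sketched roughly as follows. ENTITY, convert and decode_all are hypothetical names, and the two try/except fallbacks are assumptions added so the sketch also runs on Python 3, where the module is html.entities and chr() replaces unichr(). It does not guard against out-of-range code points.

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2
except ImportError:
    from html.entities import name2codepoint  # Python 3 location (assumption)

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Python 3

# Step 1: one regex for all three formats. The group captures what sits
# between '&' and ';', e.g. 'ouml', '#246' or '#xf6'.
ENTITY = re.compile(r'&(#x[\da-fA-F]+|#\d+|[a-zA-Z][a-zA-Z\d]*);')

def convert(match):
    body = match.group(1)
    if body.startswith('#x'):
        # Step 4: '0' + 'xf6' gives '0xf6', which int(..., 16) accepts.
        return unichr(int('0' + body[1:], 16))
    if body.startswith('#'):
        # Step 3: decimal reference.
        return unichr(int(body[1:]))
    if body in name2codepoint:
        # Step 2: named entity looked up in the standard table.
        return unichr(name2codepoint[body])
    return match.group(0)  # unknown name: leave the text as it was

def decode_all(text):
    """Decode named, decimal and hex references in a unicode string."""
    return ENTITY.sub(convert, text)
```

On Python 3.4 and later the standard library's html.unescape() covers the same three forms, so a hand-written function like this is mainly of interest on older versions.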

7stud · Feb 18, 2008

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&_ouml_;
ö
ö

lol. It's hard to even make posts about this stuff because html
entities get converted by the forum software. Here are the three
different formats for an 'o with umlaut' with some underscores added
to keep the forum software from rendering the characters:

&_ouml_;
&_#246_;
&_#xf6_;

Duncan Booth · Feb 18, 2008

7stud said:
lol. It's hard to even make posts about this stuff because html
entities get converted by the forum software. Here are the three
different formats for an 'o with umlaut' with some underscores added
to keep the forum software from rendering the characters:

&_ouml_;
&_#246_;
&_#xf6_;

FWIW, your original post was fine, it was just the quoted text in your
followup that was wrong.

I guess that is yet another reason to use a real newsreader or the mailing
list rather than Google Groups.

Ben Finney · Feb 18, 2008

7stud said:
For instance, an 'o' with umlaut can be represented in three
different ways:

'&' followed by 'ouml;'
'&' followed by '#246;'
'&' followed by '#xf6;'

The fourth way, of course, is to simply have 'ö' appear directly as a
character in the document, and set the correct character encoding.
(Hint: UTF-8 is an excellent choice for "the correct character
encoding", if you get to choose.)
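
As a rough illustration of that fourth way, the sketch below writes a small document containing the character itself, encoded as UTF-8, and reads it back. The file name and the sample markup are made up for the example; io.open is used because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# \xf6 is the o-umlaut itself; no character reference is involved.
doc = u'<?xml version="1.0" encoding="UTF-8"?>\n<p>\xf6</p>\n'

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'umlaut.xml')

# Write the text with an explicit encoding.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(doc)

# The raw bytes on disk hold the two-byte UTF-8 form of the character.
with io.open(path, 'rb') as f:
    raw = f.read()

# Reading with the same encoding gives back the original text.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```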
