decode Numeric Character References to unicode

William Heymann

How do I decode a string back to useful unicode that has xml numeric character
references in it?

Things like &#21344;
 
Duncan Booth

William Heymann said:
How do I decode a string back to useful unicode that has xml numeric
character references in it?

Things like &#21344;

Try something like this:

import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module's own table is not modified.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# Group 1: decimal reference, group 2: hex reference, group 3: named entity.
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown entity name: leave the text unchanged.
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

Obviously if you really do only want numeric references you can take out 
the lines using name2codepoint and simplify the regex.
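
As a rough sketch of that numeric-only simplification, here is one way it might look. This version assumes Python 3 (so chr() replaces unichr() and the input is already a str), and the names NUMERIC_REF and decode_numeric_refs are made up for illustration, not taken from the code above.

```python
import re

# Matches decimal (&#21344;) and hexadecimal (&#x5360;) references only.
NUMERIC_REF = re.compile(r'&#(?:(\d+)|x([0-9a-fA-F]+));')

def decode_numeric_refs(text):
    """Replace numeric character references in text with the characters they name."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        codepoint = int(decimal, 10) if decimal else int(hexadecimal, 16)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave the reference as-is.
            return match.group(0)
    return NUMERIC_REF.sub(unescape, text)
```

Named entities such as the one for 'o' with umlaut pass through untouched, since the pattern no longer has a branch for them.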
 
7stud

How do I decode a string back to useful unicode that has xml numeric character
references in it?

Things like &#21344;

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>

Traceback (most recent call last):
  File "test1.py", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in position 0: ordinal not in range(128)

The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
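
The difference between the two print statements can be reproduced without BeautifulSoup. The following is only a sketch and assumes Python 3 syntax, where the explicit encode() calls stand in for what print did implicitly on an ASCII terminal in the session above; the variable names are illustrative.

```python
ch = '\u5360'  # the single character produced from the numeric reference

# Encoding to UTF-8 succeeds and yields three bytes.
utf8_bytes = ch.encode('utf-8')

# Encoding to ASCII fails, because the code point is above 127.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```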

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using unichr()
4) Convert the third format using int('0' + match, 16) and then
unichr()
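
Those four steps might be sketched roughly as follows. This is an assumption-laden illustration rather than tested production code: it targets Python 3, where the table lives in html.entities and chr() takes the place of the Python 2 built-in, and the names ENTITY, convert and decode_entities are invented for the example. It does not guard against out-of-range numeric references.

```python
import re
from html.entities import name2codepoint

# Step 1: one pattern for all three formats. The group captures
# whatever sits between '&' and ';'.
ENTITY = re.compile(r'&(#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);')

def convert(match):
    body = match.group(1)
    if body.startswith('#x'):
        # Step 4: 'xf6' becomes '0xf6', which int() accepts in base 16.
        return chr(int('0' + body[1:], 16))
    if body.startswith('#'):
        # Step 3: decimal reference.
        return chr(int(body[1:]))
    if body in name2codepoint:
        # Step 2: named entity looked up in the standard table.
        return chr(name2codepoint[body])
    # Unknown name: leave the original text alone.
    return match.group(0)

def decode_entities(text):
    return ENTITY.sub(convert, text)
```

On Python 3.4 and later the standard library's html.unescape() covers the same ground, so a hand-written function like this is mostly useful for seeing how the pieces fit together.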
 
7stud

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&_ouml_;
ö
ö

lol. It's hard to even make posts about this stuff because html
entities get converted by the forum software. Here are the three
different formats for an 'o with umlaut' with some underscores added
to keep the forum software from rendering the characters:

&_ouml_;
&_#246_;
&_#xf6_;
 
Duncan Booth

7stud said:
lol. It's hard to even make posts about this stuff because html
entities get converted by the forum software. Here are the three
different formats for an 'o with umlaut' with some underscores added
to keep the forum software from rendering the characters:

&_ouml_;
&_#246_;
&_#xf6_;

FWIW, your original post was fine, it was just the quoted text in your
followup that was wrong.

I guess that is yet another reason to use a real newsreader or the mailing
list rather than Google Groups.
 
Ben Finney

7stud said:
For instance, an 'o' with umlaut can be represented in three
different ways:

'&' followed by 'ouml;'
'&' followed by '#246;'
'&' followed by '#xf6;'

The fourth way, of course, is to simply have 'ö' appear directly as a
character in the document, and set the correct character encoding.
(Hint: UTF-8 is an excellent choice for "the correct character
encoding", if you get to choose.)
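
To illustrate the contrast between writing the character directly and falling back to references, here is a small sketch, assuming Python 3. It encodes the same text once as UTF-8 and once as ASCII with the standard 'xmlcharrefreplace' error handler, which emits decimal numeric character references for anything ASCII cannot represent. The sample word is arbitrary.

```python
text = 'sch\u00f6n'  # contains 'o' with umlaut as a literal character

# Direct representation: the document carries the character itself.
as_utf8 = text.encode('utf-8')

# ASCII-only output: non-ASCII characters become numeric references.
as_refs = text.encode('ascii', 'xmlcharrefreplace')
```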
 
