decode Numeric Character References to unicode

Discussion in 'Python' started by William Heymann, Feb 18, 2008.

  1. How do I decode a string back to useful unicode that has xml numeric character
    references in it?

    Things like &#21344;
    William Heymann, Feb 18, 2008

  2. Duncan Booth Guest

    Try something like this:

    import re
    from htmlentitydefs import name2codepoint

    name2codepoint = name2codepoint.copy()

    # Group 1: decimal reference, group 2: hex reference, group 3: named entity.
    EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

    def decodeEntities(s, encoding='utf-8'):
        def unescape(match):
            code = match.group(1)
            if code:
                return unichr(int(code, 10))
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            code = match.group(3)
            if code in name2codepoint:
                return unichr(name2codepoint[code])
            # Unknown entity name: leave the text unchanged.
            return match.group(0)

        return EntityPattern.sub(unescape, s.decode(encoding))

    Obviously if you really do only want numeric references you can take out
    the lines using name2codepoint and simplify the regex.
    Duncan Booth, Feb 18, 2008
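
    The numeric-only simplification mentioned above could look roughly like
    this. It is a sketch for Python 3 (where unichr is spelled chr and the
    input is assumed to be a str already), not the code from the post; the
    names NUMERIC_REF and decode_numeric_refs are made up for illustration.

```python
import re

# Matches decimal (&#21344;) and hexadecimal (&#x5360;) references only.
NUMERIC_REF = re.compile(r'&#(?:(\d+)|[xX]([0-9a-fA-F]+));')


def decode_numeric_refs(text):
    """Replace numeric character references in a str with their characters."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        codepoint = int(decimal, 10) if decimal else int(hexadecimal, 16)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the reference as-is.
            return match.group(0)

    return NUMERIC_REF.sub(unescape, text)


decoded = decode_numeric_refs('&#21344; and &#x5360;')
```

    Named entities such as &amp; are deliberately left untouched here, since
    the pattern only matches references that start with &#.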

  3. 7stud Guest

    BeautifulSoup can handle two of the three formats for html entities.
    For instance, an 'o' with umlaut can be represented in three different
    ways:

    &ouml;
    &#246;
    &#xf6;

    BeautifulSoup can convert the first two formats to unicode:

    from BeautifulSoup import BeautifulStoneSoup as BSS

    my_string = '&#21344;'
    soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
    print soup.contents[0].encode('utf-8')
    print soup.contents[0]

    <some asian looking character>

    Traceback (most recent call last):
    File "", line 6, in ?
    print soup.contents[0]
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
    position 0: ordinal not in range(128)

    The error message shows you the unicode string that BeautifulSoup
    produced: u'\u5360'
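
    The traceback comes from encoding the character with the ascii codec,
    presumably because that was the output stream's encoding. A small sketch
    in Python 3 (an assumption; the post itself uses Python 2) reproduces the
    failure by encoding explicitly:

```python
text = '\u5360'  # the character that &#21344; refers to

# UTF-8 can represent this character, so encoding succeeds.
utf8_bytes = text.encode('utf-8')

# ASCII cannot represent it, so encoding raises UnicodeEncodeError.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```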

    If that won't work for you, it's not hard to write your own conversion
    function to handle all three formats:

    1) Create a regex that will match any of the formats
    2) Convert the first format using htmlentitydefs.name2codepoint
    3) Convert the second format using unichr()
    4) Convert the third format using int('0'+ match, 16) and then unichr()
    7stud, Feb 18, 2008
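
    The four steps above can be sketched as follows. This is written for
    Python 3, so htmlentitydefs becomes html.entities and unichr becomes chr;
    the names ENTITY and decode_entities are illustrative, and the sketch does
    no range checking on the numeric values.

```python
import html
import re
from html.entities import name2codepoint

# Step 1: one regex with a group per format: name, decimal, 'x' + hex digits.
ENTITY = re.compile(r'&(?:([a-zA-Z][a-zA-Z0-9]*)|#(\d+)|#(x[0-9a-fA-F]+));')


def decode_entities(text):
    def unescape(match):
        name, decimal, hexadecimal = match.groups()
        if name is not None:
            # Step 2: named entity, looked up in name2codepoint.
            if name in name2codepoint:
                return chr(name2codepoint[name])
            return match.group(0)  # unknown name: leave unchanged
        if decimal is not None:
            # Step 3: decimal reference.
            return chr(int(decimal))
        # Step 4: hex reference; '0' + 'xf6' gives '0xf6', parsed in base 16.
        return chr(int('0' + hexadecimal, 16))

    return ENTITY.sub(unescape, text)


sample = '&ouml; &#246; &#xf6;'
by_hand = decode_entities(sample)
builtin = html.unescape(sample)  # Python 3's standard library does the same job
```

    On Python 3.4 and later, html.unescape already handles all three formats,
    so a hand-written function is mostly useful when you need different
    behaviour for unknown or malformed references.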
  4. 7stud Guest

    lol. It's hard to even make posts about this stuff because html
    entities get converted by the forum software. Here are the three
    different formats for an 'o with umlaut' with some underscores added
    to keep the forum software from rendering the characters:

    7stud, Feb 18, 2008
  5. Duncan Booth Guest

    FWIW, your original post was fine, it was just the quoted text in your
    followup that was wrong.

    I guess that is yet another reason to use a real newsreader or the mailing
    list rather than Google Groups.
    Duncan Booth, Feb 18, 2008
  6. Ben Finney Guest

    The fourth way, of course, is to simply have 'ö' appear directly as a
    character in the document, and set the correct character encoding.
    (Hint: UTF-8 is an excellent choice for "the correct character
    encoding", if you get to choose.)
    Ben Finney, Feb 18, 2008
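
    To illustrate why the declared encoding has to be the correct one, here is
    a small Python 3 sketch (the variable names are illustrative). The same
    UTF-8 bytes give 'ö' when decoded as UTF-8 but two wrong characters when
    decoded as Latin-1:

```python
text = 'ö'

# 'ö' (U+00F6) occupies two bytes in UTF-8.
encoded = text.encode('utf-8')

# Decoding with the encoding that was used to write it round-trips cleanly.
decoded = encoded.decode('utf-8')

# Decoding with the wrong encoding yields mojibake instead of an error.
wrong = encoded.decode('latin-1')
```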