converting html escape sequences to unicode characters

Discussion in 'Python' started by harrelson, Dec 10, 2004.

  1. harrelson

    harrelson Guest

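    I have a list of about 2500 html escape sequences (decimal) that I need
    to convert to utf-8. Stuff like:

    &#48708;
    &#54665;
    &#44592;
    &#47196;
    &#48372;
    &#45244;
    &#44144;
    &#50640;
    &#50836;
    &#45236;
    &#47732;
    &#44552;
    &#51060;
    &#50620;
    &#47560;
    &#51648;
    &#51104;
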
    Anyone know what the decimal is representing? It doesn't seem to
    equate to a unicode codepoint...

    culley
     
    harrelson, Dec 10, 2004
    #1

  2. harrelson

    Kent Johnson Guest

    harrelson wrote:
    > I have a list of about 2500 html escape sequences (decimal) that I need
    > to convert to utf-8. Stuff like:
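    >
    > &#48708;
    > &#54665;
    > &#44592;
    > &#47196;
    > &#48372;
    > &#45244;
    > &#44144;
    > &#50640;
    > &#50836;
    > &#45236;
    > &#47732;
    > &#44552;
    > &#51060;
    > &#50620;
    > &#47560;
    > &#51648;
    > &#51104;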
    >
    > Anyone know what the decimal is representing? It doesn't seem to
    > equate to a unicode codepoint...


    In well-formed HTML (!) these should be the decimal values of Unicode characters. See
    http://www.w3.org/TR/html4/charset.html#h-5.3.1

    These characters appear to be Hangul Syllables:
    http://www.unicode.org/charts/PDF/UAC00.pdf

    import unicodedata

    nums = [
    48708,
    54665,
    44592,
    47196,
    48372,
    45244,
    44144,
    50640,
    50836,
    45236,
    47732,
    44552,
    51060,
    50620,
    47560,
    51648,
    51104,
    ]

    for num in nums:
        print num, unicodedata.name(unichr(num), 'Unknown')

    =>
    48708 HANGUL SYLLABLE BI
    54665 HANGUL SYLLABLE HAENG
    44592 HANGUL SYLLABLE GI
    47196 HANGUL SYLLABLE RO
    48372 HANGUL SYLLABLE BO
    45244 HANGUL SYLLABLE NAEL
    44144 HANGUL SYLLABLE GEO
    50640 HANGUL SYLLABLE E
    50836 HANGUL SYLLABLE YO
    45236 HANGUL SYLLABLE NAE
    47732 HANGUL SYLLABLE MYEON
    44552 HANGUL SYLLABLE GEUM
    51060 HANGUL SYLLABLE I
    50620 HANGUL SYLLABLE EOL
    47560 HANGUL SYLLABLE MA
    51648 HANGUL SYLLABLE JI
    51104 HANGUL SYLLABLE JAM
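
For the original goal of getting UTF-8 out of a list of such references, the same lookup can be applied to the escape sequences themselves. The following is a minimal sketch in Python 3 (where `chr` replaces the `unichr` used above); the names `DECIMAL_REF` and `decode_decimal_refs` are made up for illustration, and it assumes every reference is decimal and names a valid code point.

```python
import re
import unicodedata

# Matches decimal numeric character references such as "&#48708;".
DECIMAL_REF = re.compile(r"&#(\d+);")

def decode_decimal_refs(text):
    """Replace each decimal reference in text with the character it names."""
    # chr() raises ValueError for numbers above 0x10FFFF; this sketch
    # assumes the input never contains such values.
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

sample = "&#48708;&#54665;&#44592;"
decoded = decode_decimal_refs(sample)   # a str of three Hangul syllables
utf8_bytes = decoded.encode("utf-8")    # the UTF-8 encoding asked for

for ch in decoded:
    print(ord(ch), unicodedata.name(ch, "Unknown"))
```

Text outside the references passes through unchanged, so the function can be run over a whole file's contents rather than one entity at a time.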

    Kent
     
    Kent Johnson, Dec 10, 2004
    #2

  3. harrelson

    Craig Ringer Guest

    On Fri, 2004-12-10 at 08:36, harrelson wrote:
    > I have a list of about 2500 html escape sequences (decimal) that I need
    > to convert to utf-8. Stuff like:


    I'm pretty sure this somewhat horrifying code does it, but is probably
    an example of what not to do:

    >>> escapeseq = '&#48708;'
    >>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
    >>> uescape
    u'\ube44'
    >>> print uescape
    비
    (I don't seem to have the font for it, but I think that's right - my
    terminal font seems to show it correctly).

    I just get the decimal value of the escape, format it as a Python
    unicode hex escape sequence, and tell Python to interpret it as an
    escaped unicode string.

    >>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
    ...             '&#48372;', '&#45244;', '&#44144;', '&#50640;',
    ...             '&#50836;', '&#45236;', '&#47732;', '&#44552;',
    ...             '&#51060;', '&#50620;', '&#47560;', '&#51648;',
    ...             '&#51104;']
    >>> def unescape(escapeseq):
    ...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
    ...
    >>> print ' '.join([ unescape(x) for x in entities ])
    비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠
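
One caveat with the escape-string route: `\u` must be followed by exactly four hex digits, so an unpadded `"\\u%x"` only works for code points from 0x1000 to 0xFFFF, which happens to cover every Hangul syllable here. The sketch below is a Python 3 rendering, not the code from the post; the function names are invented for illustration. It pads to the eight-digit `\U` form and compares the result with the direct conversion and with the standard library's `html.unescape`.

```python
import html

def unescape_via_escape(entity):
    """Build a zero-padded \\UXXXXXXXX escape and let the codec decode it."""
    codepoint = int(entity[2:-1])        # strip the leading "&#" and trailing ";"
    escaped = "\\U%08x" % codepoint      # e.g. "\U0000be44"
    return escaped.encode("ascii").decode("unicode_escape")

def unescape_direct(entity):
    """Convert the decimal value straight to a character."""
    return chr(int(entity[2:-1]))

entity = "&#48708;"
results = (
    unescape_via_escape(entity),
    unescape_direct(entity),
    html.unescape(entity),               # also handles hex and named references
)
print(results)
```

All three produce the same one-character string; the padded escape version also copes with small values such as `&#65;`, which the unpadded form would reject.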

    --
    Craig Ringer
     
    Craig Ringer, Dec 10, 2004
    #3
  4. harrelson

    Craig Ringer Guest

    On Fri, 2004-12-10 at 16:09, Craig Ringer wrote:
    > On Fri, 2004-12-10 at 08:36, harrelson wrote:
    > > I have a list of about 2500 html escape sequences (decimal) that I need
    > > to convert to utf-8. Stuff like:

    >
    > I'm pretty sure this somewhat horrifying code does it, but is probably
    > an example of what not to do:


    It is. Sorry. I initially misread Kent Johnson's post. He just used
    'unichr()'. Colour me an idiot. If you ever need to know the hard way to
    build a unicode character...

    --
    Craig Ringer
     
    Craig Ringer, Dec 10, 2004
    #4
