converting html escape sequences to unicode characters

harrelson

I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

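&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;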

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley
 
Kent Johnson

harrelson said:
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

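&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;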

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

In well-formed HTML (!) these should be the decimal values of Unicode characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM

Kent
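
For readers on Python 3, where unichr() no longer exists and print is a function, the same lookup might be sketched as below. The helper name describe is hypothetical, not something from the thread. It also returns the hex form of each code point (the Unicode charts are indexed in hex, which may be why the decimal values did not look like code points) and the UTF-8 bytes the original question asked for.

```python
import unicodedata

def describe(num):
    """Return (character, U+XXXX label, Unicode name, UTF-8 bytes) for a decimal code point."""
    ch = chr(num)  # Python 3 counterpart of Python 2's unichr()
    return ch, "U+%04X" % num, unicodedata.name(ch, "Unknown"), ch.encode("utf-8")

# A few of the values from the list above.
for num in [48708, 54665, 44592]:
    ch, label, name, utf8 = describe(num)
    print(num, label, name, utf8)
```

The decimal 48708 is the same code point the charts list as U+BE44.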
 
Craig Ringer

I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:
>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
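>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])
비 행 기 로 보 낼 거 에 요 내 면 금 이 얼 마 지 잠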
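
When the references are embedded in longer text rather than held one per string, slicing with [2:-1] no longer works. A minimal Python 3 sketch, assuming only decimal references of the form &#NNNNN; need handling, could use a regular expression. The names DECIMAL_REF and unescape_decimal are made up for illustration.

```python
import html
import re

# Matches decimal numeric character references such as "&#48708;".
DECIMAL_REF = re.compile(r"&#(\d+);")

def unescape_decimal(text):
    """Replace every decimal reference in text with the character it names."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

entities = ["&#48708;", "&#54665;", "&#44592;"]
converted = [unescape_decimal(e) for e in entities]
print(" ".join(converted))

# The standard library's html.unescape gives the same result for these inputs
# and additionally understands hex and named references.
assert converted == [html.unescape(e) for e in entities]

# Encoding to UTF-8 is a separate, final step.
utf8_bytes = "".join(converted).encode("utf-8")
```

In practice html.unescape (Python 3.4 and later) is probably the simpler choice; the regex version is only there to show what the conversion amounts to.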
 
Craig Ringer

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...
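
For comparison, here is a rough Python 3 rendering of both routes. The function name via_escape is hypothetical. It assumes the code point falls between 0x1000 and 0xFFFF, because "%x" does not zero-pad and a \u escape takes exactly four hex digits.

```python
def via_escape(num):
    # The "hard way": format a \uXXXX escape, then ask the codec to decode it.
    # Assumption: 0x1000 <= num <= 0xFFFF, see the note above.
    return ("\\u%x" % num).encode("ascii").decode("unicode_escape")

def via_chr(num):
    # The direct way: chr() in Python 3, unichr() in Python 2.
    return chr(num)

# Both agree for the Hangul syllables in this thread.
assert via_escape(48708) == via_chr(48708)
```

Outside that range the escape route goes wrong (for 0x1F600 it yields two characters instead of one), while chr() handles every valid code point.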
 
