converting html escape sequences to unicode characters

harrelson · Dec 9, 2004

I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

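&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;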

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

culley

Kent Johnson · Dec 9, 2004

harrelson said:
I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

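&#48708;
&#54665;
&#44592;
&#47196;
&#48372;
&#45244;
&#44144;
&#50640;
&#50836;
&#45236;
&#47732;
&#44552;
&#51060;
&#50620;
&#47560;
&#51648;
&#51104;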

Anyone know what the decimal is representing? It doesn't seem to
equate to a unicode codepoint...

In well-formed HTML (!) these should be the decimal values of Unicode characters. See
http://www.w3.org/TR/html4/charset.html#h-5.3.1

These characters appear to be Hangul Syllables:
http://www.unicode.org/charts/PDF/UAC00.pdf

import unicodedata

nums = [
48708,
54665,
44592,
47196,
48372,
45244,
44144,
50640,
50836,
45236,
47732,
44552,
51060,
50620,
47560,
51648,
51104,
]

for num in nums:
    print num, unicodedata.name(unichr(num), 'Unknown')

=>
48708 HANGUL SYLLABLE BI
54665 HANGUL SYLLABLE HAENG
44592 HANGUL SYLLABLE GI
47196 HANGUL SYLLABLE RO
48372 HANGUL SYLLABLE BO
45244 HANGUL SYLLABLE NAEL
44144 HANGUL SYLLABLE GEO
50640 HANGUL SYLLABLE E
50836 HANGUL SYLLABLE YO
45236 HANGUL SYLLABLE NAE
47732 HANGUL SYLLABLE MYEON
44552 HANGUL SYLLABLE GEUM
51060 HANGUL SYLLABLE I
50620 HANGUL SYLLABLE EOL
47560 HANGUL SYLLABLE MA
51648 HANGUL SYLLABLE JI
51104 HANGUL SYLLABLE JAM

Kent
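
For readers on Python 3, where unichr() has been folded into chr() and print is a function, the same lookup might be sketched as follows. The helper name describe is made up for illustration, and only three of the values are repeated here; the last column shows the UTF-8 bytes that the original question asked about.

```python
import unicodedata

# A few of the decimal values from the list above.
nums = [48708, 54665, 44592]

def describe(num):
    """Return (character, Unicode name) for a decimal code point."""
    ch = chr(num)  # Python 3 equivalent of Python 2's unichr()
    return ch, unicodedata.name(ch, 'Unknown')

for num in nums:
    ch, name = describe(num)
    # encode() produces the UTF-8 bytes for the character
    print(num, name, ch.encode('utf-8'))
```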

Craig Ringer · Dec 10, 2004

I have a list of about 2500 html escape sequences (decimal) that I need
to convert to utf-8. Stuff like:

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

>>> escapeseq = '&#48708;'
>>> uescape = ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
>>> uescape
u'\ube44'
>>> print uescape
비
(I don't seem to have the font for it, but I think that's right - my
terminal font seems to show it correctly).

I just get the decimal value of the escape, format it as a Python
unicode hex escape sequence, and tell Python to interpret it as an
escaped unicode string.
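
The round trip described here can be reproduced in Python 3 roughly as below, with chr() alongside for comparison. This is only a sketch: the function names are invented for the example, and the %04x padding is an addition (the original uses %x), since a \u escape needs exactly four hex digits. That also limits this route to code points up to U+FFFF.

```python
def via_escape(entity):
    """Decode '&#NNNN;' by building a \\uXXXX escape and decoding it."""
    num = int(entity[2:-1])          # strip the leading '&#' and trailing ';'
    escaped = "\\u%04x" % num        # e.g. the six characters \ube44
    return escaped.encode("ascii").decode("unicode_escape")

def via_chr(entity):
    """Decode '&#NNNN;' directly with chr()."""
    return chr(int(entity[2:-1]))

# Both routes should agree for code points in the Basic Multilingual Plane.
print(via_escape("&#48708;") == via_chr("&#48708;"))
```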

>>> entities = ['&#48708;', '&#54665;', '&#44592;', '&#47196;',
... '&#48372;', '&#45244;', '&#44144;', '&#50640;', '&#50836;', '&#45236;',
... '&#47732;', '&#44552;', '&#51060;', '&#50620;', '&#47560;', '&#51648;',
... '&#51104;']
>>> def unescape(escapeseq):
...     return ("\\u%x" % int(escapeseq[2:-1])).decode("unicode_escape")
...
>>> print ' '.join([ unescape(x) for x in entities ])

Craig Ringer · Dec 10, 2004

I'm pretty sure this somewhat horrifying code does it, but is probably
an example of what not to do:

It is. Sorry. I initially misread Kent Johnson's post. He just used
'unichr()'. Colour me an idiot. If you ever need to know the hard way to
build a unicode character...
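
To tie this back to the original question, a list of decimal escapes can be turned into UTF-8 with chr() and a regular expression. The following Python 3 sketch assumes the input consists of well-formed decimal references of the form &#NNNN; and the names DECIMAL_REF and decode_decimal_refs are hypothetical. The standard library's html.unescape() covers the same ground and also handles named and hexadecimal references.

```python
import re

# Matches decimal numeric character references such as &#48708;
DECIMAL_REF = re.compile(r"&#(\d+);")

def decode_decimal_refs(text):
    """Replace each &#NNNN; in text with the character it denotes."""
    return DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

decoded = decode_decimal_refs("&#48708;&#54665;&#44592;")
utf8_bytes = decoded.encode("utf-8")  # bytes ready to write to a file
print(decoded, utf8_bytes)
```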
