Unicode codepoints

Saul Spatz · Jun 22, 2011

Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a UTF-8 encoded file and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?

def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer

if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])

Thanks for any help you can give me.

Saul

Chris Angelico · Jun 22, 2011

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

for c in s:
    print('U+'+hex(ord(c))[2:])

But if you do need the codePoints() function, I'd do it as a generator.

def codePoints(s):
    ''' yield the Unicode codepoints in the string s '''
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            yield ord(s[k-1:k+1])
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            yield ord(c)
        else:
            skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
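
To make that distinction concrete, here is a minimal sketch (the sample text is only an illustration, not taken from the original file) of how the same character looks as a str and as UTF-8 encoded bytes in Python 3:

```python
# Illustrative only: the euro sign as text versus as UTF-8 encoded bytes.
text = "\u20ac"                # one Unicode character (str)
data = text.encode("utf-8")    # the same character as three bytes (bytes)

# Iterating over a str gives characters; ord() gives each code point.
print(["U+{:04X}".format(ord(c)) for c in text])

# Iterating over bytes gives integers, one per encoded byte.
print(["0x{:02X}".format(b) for b in data])
```

Opening the file with encoding='utf8', as the original script does, hands back a str, so the loop there runs over characters rather than bytes.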

Chris Angelico

Vlastimil Brom · Jun 22, 2011

Hi,
what functionality should codePoints(...) add over just iterating
through the characters in the unicode string directly (besides
filtering out the surrogates)?

It seems that you can just use

s = open(r'C:\install\filter-utf-8.txt', encoding='utf8', errors='replace').read()
for c in s:
    print('U+'+hex(ord(c))[2:])

or, if needed, add this condition before the print:
    if not 0xd800 <= ord(c) <= 0xdfff:

You can also use string formatting to do the hex conversion and a more
usual zero padding; the print(...) calls would be:

"older style formatting"
    print("U+%04x" % (ord(c),))

or the newer, potentially more powerful way using format(...)
    print("U+{:04x}".format(ord(c)))
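
As a small sketch of the zero-padded formatting above (the helper name format_codepoint is made up for illustration), uppercase hex with at least four digits gives the usual U+XXXX notation:

```python
def format_codepoint(cp):
    """Format an integer code point in the conventional U+XXXX notation."""
    # At least four uppercase hex digits; longer values are not truncated.
    return "U+{:04X}".format(cp)


if __name__ == "__main__":
    for ch in "a\u00e9\u20ac":
        print(format_codepoint(ord(ch)))
```

The 04 in the format spec is only a minimum width, so code points above U+FFFF simply come out with five or six digits.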

hth,
vbr

Peter Otten · Jun 22, 2011

Here's an alternative implementation that follows Chris' suggestion to use a
generator:

def codepoints(s):
    s = iter(s)
    for c in s:
        if 0xd800 <= ord(c) <= 0xdfff:
            c += next(s, "")
        yield ord(c)
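
The generator above relies on ord() accepting a two-unit surrogate pair, which is narrow-build behaviour; on Python 3.3 and later ord() raises TypeError for a two-character string. A possible build-independent sketch combines the pair arithmetically instead. The name merged_codepoints and the choice to pass unpaired surrogates through unchanged are assumptions for illustration, not part of the code above:

```python
def merged_codepoints(s):
    """Yield the code points of s, merging UTF-16 surrogate pairs.

    Hypothetical helper: unpaired surrogates are yielded unchanged.
    """
    high = None  # a high surrogate still waiting for its low partner
    for ch in s:
        cp = ord(ch)
        if high is not None:
            if 0xDC00 <= cp <= 0xDFFF:
                # Standard UTF-16 decoding of a high/low surrogate pair.
                yield 0x10000 + ((high - 0xD800) << 10) + (cp - 0xDC00)
                high = None
                continue
            yield high  # the pending high surrogate had no partner
            high = None
        if 0xD800 <= cp <= 0xDBFF:
            high = cp   # hold it until the next character is seen
        else:
            yield cp
    if high is not None:
        yield high      # string ended on an unpaired high surrogate


if __name__ == "__main__":
    for cp in merged_codepoints("a\ud800\udf30"):
        print("U+{:04X}".format(cp))
```

On Python 3.3 and later, text decoded from a valid UTF-8 file should not contain surrogates in the first place, so there the plain loop over ord(c) already prints whole code points.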

jmfauth · Jun 22, 2011

That seems to me correct.

because

File "<eta last command>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 0-5: end of string in escape sequence
from this:

>>> u'éléphant\N{EURO SIGN}'
éléphant€
>>> u = u'éléphant\N{EURO SIGN}'
>>> ''.join(['\\u{:04x}'.format(ord(c)) for c in u])
\u00e9\u006c\u00e9\u0070\u0068\u0061\u006e\u0074\u20ac

Skipping surrogate pairs makes little sense,
because the purpose is to display code points!

Vlastimil Brom · Jun 22, 2011

2011/6/22 Saul Spatz said:
Thanks. I agree with you about the generator. Using your first suggestion, code points above U+FFFF get separated into two "surrogate pair" characters from UTF-16. So instead of U+10FFFF I get U+DBFF and U+DFFF.

Hi,
If you really need the wide unicode functionality on a narrow unicode
python build and only need to get the string index of characters
including surrogate pairs counting as one item, you can build a list
of single characters or surrogate pairs, e.g. for a string surrog_txt
whose individual code units are:
[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ',
u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']

>>> import re
>>> re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)
[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ',
u'\U00010333']
This way, the indices, slices and len() would work on the
resulting list as expected for a normal string; however, it
probably won't be very efficient for longer texts.
Note that surrogates are not the only asymmetry between code points,
characters (and glyphs - to take the visual representation of those
into account) - there are combining diacritical marks, in various
combinations with precomposed diacritical characters, multiple
normalisation modes etc.
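
As a rough illustration of that last point (a sketch using the standard unicodedata module; the helper name show is made up), the same visible character can be one code point or two depending on the normalisation form:

```python
import unicodedata


def show(s):
    """Return the code points of s in U+XXXX notation."""
    return ["U+{:04X}".format(ord(c)) for c in s]


composed = "\u00e9"                                   # precomposed e with acute
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute

print(show(composed))
print(show(decomposed))

# Normalising back to NFC recombines the two code points into one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So a list of code points is not necessarily a list of what a reader would call characters, even with no surrogates involved.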

regards,
vbr
