Unicode codepoints

Saul Spatz

Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a UTF-8 encoded file and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?

def codePoints(s):
    ''' return a list of the Unicode codepoints in the string s '''
    answer = []
    skip = False
    for k, c in enumerate(s):
        if skip:
            skip = False
            answer.append(ord(s[k-1:k+1]))
            continue
        if not 0xd800 <= ord(c) <= 0xdfff:
            answer.append(ord(c))
        else:
            skip = True
    return answer

if __name__ == '__main__':
    s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
    code = codePoints(s)
    for c in code:
        print('U+'+hex(c)[2:])

Thanks for any help you can give me.

Saul
 
Chris Angelico

Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?

for c in s:
    print('U+'+hex(ord(c))[2:])
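
Whether the loop above prints one line per code point depends on the interpreter. On Python 3.3 and later, a str is indexed by code point on every build, so a character outside the Basic Multilingual Plane comes out as a single value; on older narrow builds it comes out as two surrogates. A minimal sketch, assuming Python 3.3+ (the sample string and the name `labels` are made up for illustration):

```python
# Assumes Python 3.3+, where each element of a str is one full code point.
# 'a', EURO SIGN, and GOTHIC LETTER AHSA (which lies outside the BMP).
sample = "a\u20ac\U00010330"

# Same hex() trick as in the loop above, collected into a list.
labels = ['U+' + hex(ord(ch))[2:] for ch in sample]
for label in labels:
    print(label)
```

On such an interpreter the third label is a single five-digit value rather than two values in the D800-DFFF range.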

But if you do need the codePoints() function, I'd do it as a generator.

def codePoints(s):
   ''' yield the Unicode codepoints in the string s '''
   skip = False
   for k, c in enumerate(s):
       if skip:
           skip = False
           yield ord(s[k-1:k+1])
           continue
       if not 0xd800 <= ord(c) <= 0xdfff:
           yield ord(c)
       else:
           skip = True

Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.

But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?

Chris Angelico
 
Vlastimil Brom

2011/6/22 Saul Spatz said:
Hi,

I'm just starting to learn a bit about Unicode. I want to be able to read a UTF-8 encoded file and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?

def codePoints(s):
   ''' return a list of the Unicode codepoints in the string s '''
   answer = []
   skip = False
   for k, c in enumerate(s):
       if skip:
           skip = False
           answer.append(ord(s[k-1:k+1]))
           continue
       if not 0xd800 <= ord(c) <= 0xdfff:
           answer.append(ord(c))
       else:
           skip = True
   return answer

if __name__ == '__main__':
   s = open('test.txt', encoding = 'utf8', errors = 'replace').read()
   code = codePoints(s)
   for c in code:
       print('U+'+hex(c)[2:])

Thanks for any help you can give me.

Saul

Hi,
what functionality should codePoints(...) add over just iterating
through the characters in the unicode string directly (besides
filtering out the surrogates)?

It seems that you can just use

s = open(r'C:\install\filter-utf-8.txt', encoding = 'utf8', errors = 'replace').read()
for c in s:
    print('U+'+hex(ord(c))[2:])

or, if needed, add this condition before the print:
    if not 0xd800 <= ord(c) <= 0xdfff:

you can also use string formatting to do the hex conversion and a more
usual zero padding; the print(...) calls would be:

"older style formatting"
print("U+%04x"%(ord(c),))

or the newer, potentially more powerful way using format(...)
print("U+{:04x}".format(ord(c)))
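
Both formatting styles give the same padded text. A quick check, with a hypothetical helper name `label` and an uppercase `X` chosen here only to match the conventional U+XXXX notation (the calls above use lowercase `x`):

```python
def label(ch):
    """Return a zero-padded U+XXXX label for a single character."""
    return "U+{:04X}".format(ord(ch))

for ch in "\u00e9\u20ac":  # e with acute accent, EURO SIGN
    # %-formatting and str.format produce identical text here
    assert "U+%04X" % ord(ch) == label(ch)
    print(label(ch))
```

The `04` only sets a minimum width, so code points above U+FFFF simply get five or six digits.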

hth,
vbr
 
Peter Otten

Here's an alternative implementation that follows Chris' suggestion to use a
generator:

def codepoints(s):
    s = iter(s)
    for c in s:
        if 0xd800 <= ord(c) <= 0xdfff:
            c += next(s, "")
        yield ord(c)
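
The generator above relies on ord() accepting a two-character surrogate pair, which is the behaviour of narrow builds. Where that is not available (Python 3.3 and later reject a string of length 2), the pair can be combined by hand with the standard UTF-16 decoding arithmetic. This is only a sketch of that idea, not a replacement for the code above; the name `codepoints_utf16` is made up:

```python
def codepoints_utf16(s):
    """Yield code points, combining UTF-16 surrogate pairs arithmetically."""
    i, n = 0, len(s)
    while i < n:
        cp = ord(s[i])
        # A high (lead) surrogate followed by a low (trail) surrogate
        # encodes one code point above U+FFFF.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < n:
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
                i += 1  # consume the low surrogate as well
        # Unpaired surrogates and ordinary characters pass through unchanged.
        yield cp
        i += 1

for cp in codepoints_utf16("a\udbff\udfff"):
    print("U+{:04X}".format(cp))
```

Strings that already hold full code points pass through untouched, so the same function can be used on either kind of build.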
 
jmfauth

That seems to me correct.
File "<eta last command>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 0-5: end of string in escape sequence
from this:
u'éléphant\N{EURO SIGN}'
éléphant€
u = u'éléphant\N{EURO SIGN}'
''.join(['\\u{:04x}'.format(ord(c)) for c in u])
\u00e9\u006c\u00e9\u0070\u0068\u0061\u006e\u0074\u20ac

Skipping surrogate pairs makes little sense,
because the purpose is to display code points!
 
Vlastimil Brom

2011/6/22 Saul Spatz said:
Thanks.  I agree with you about the generator.  Using your first suggestion, code points above U+FFFF get separated into two "surrogate pair" characters from UTF-16.  So instead of U+10FFFF I get U+DBFF and U+DFFF.

Hi,
If you really need the wide unicode functionality on a narrow unicode
python build and only need to get the string index of characters
including surrogate pairs counting as one item, you can build a list
of single characters or surrogate pairs, e.g.:

[u'a', u'\ud800', u'\udf30', u' ', u'\ud800', u'\udf31', u' ', u'\ud800', u'\udf32', u' ', u'\ud800', u'\udf33']

import re
re.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.", surrog_txt)

[u'a', u'\U00010330', u' ', u'\U00010331', u' ', u'\U00010332', u' ', u'\U00010333']

This way, the indices, slices and len() would work on the
supplementary list as expected for a normal string; however, it
probably won't be very efficient for longer texts.
Note that surrogates are not the only asymmetry between code points,
characters (and glyphs - to take the visual representation of those
into account) - there are combining diacritical marks, in various
combinations with precomposed diacritical characters, multiple
normalisation modes etc.
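
The point about combining marks and normalisation can be seen with the standard unicodedata module. A small sketch (the variable names are chosen here for illustration): the same visible "é" is one code point when precomposed and two when written as a base letter plus a combining accent.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same character on screen, different code point counts
print(len(composed), len(decomposed))

# NFC composes where possible, NFD decomposes
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

So a list of code points depends on the normalisation form of the text as well as on how surrogates are handled.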

regards,
vbr
 
