Unicode characters

Paul Johnston · Sep 4, 2006

Hi
I have a string which I convert into a list and then read through,
printing each character's glyph and numeric representation:

#-*- coding: utf-8 -*-

thestring = "abcd"
thelist = list(thestring)

for c in thelist:
    print c,
    print ord(c)

This works fine for Latin characters, but when I put in a Unicode
character, a two-byte character gives me two characters. For example,
an Arabic alef returns

* 216
* 167

(the first asterisk is the empty set symbol, the second a double "s")

Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
sequential listings, i.e.
216 167
216 168
216 169
So it is reading the correct details.

Is there any way to get the c in the for loop to recognise that it is
reading a multi-byte character?
I have followed the info in PEP 0263 and am using Python 2.4.3 Build
12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2

Cheers Paul

limodou · Sep 4, 2006

If the string is not a unicode object, it is stored as encoded bytes,
so iterating over it only gives you the individual bytes of each
character's encoding. You can convert it to unicode, and then if a
character's value is less than 128 it is ASCII, otherwise it is a
non-ASCII (possibly multibyte) character. For example:

a = 'string'
# Use whatever encoding the bytes are really in; the question's file is UTF-8.
b = unicode(a, 'utf-8')
for i in b:
    if ord(i) < 128:
        print ord(i), 'ascii'
    else:
        print ord(i), 'multibytes'
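
To make that check concrete, here is a rough sketch of the same idea wrapped in a function. The name classify and its UTF-8 default are assumptions for illustration only, and it is written to run on newer interpreters (Python 2.6+ or 3) rather than the 2.4.3 from the question:

```python
def classify(raw_bytes, encoding="utf-8"):
    """Decode raw_bytes and label each character 'ascii' or 'non-ascii'.

    Hypothetical helper; pass the encoding your bytes are really in.
    """
    text = raw_bytes.decode(encoding)
    return [(ord(ch), "ascii" if ord(ch) < 128 else "non-ascii") for ch in text]


# "a" followed by Arabic alef (U+0627), supplied as UTF-8 bytes.
sample = u"a\u0627".encode("utf-8")

for code_point, label in classify(sample):
    print("%d %s" % (code_point, label))
# 97 ascii
# 1575 non-ascii
```

The point is that after decoding, the alef comes back as one character (1575) instead of the two bytes 216 and 167.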

Diez B. Roggisch · Sep 4, 2006

Use unicode objects instead of byte strings. The above string literal is
_not_ affected by the coding: header whatsoever.

That applies only to

u"some text"

literals, and makes them a unicode object.

Normal string literals are just bytes. Because your encoding is set
properly in the editor, a multibyte character you enter is stored as
its sequence of bytes.

In a nutshell: try the above using u"abcd".
Diez
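
For anyone who wants to see the difference side by side, a small sketch follows. It assumes Python 2.6+ or 3 (bytearray is not available on 2.4.3), and it writes the three Arabic letters from the question as escapes so that the file encoding does not matter:

```python
# Alef, beh, teh marbuta as a unicode object: 3 characters.
text = u"\u0627\u0628\u0629"

# The same text as a UTF-8 byte string: 6 bytes, two per letter.
raw = text.encode("utf-8")

byte_values = list(bytearray(raw))    # [216, 167, 216, 168, 216, 169]
code_points = [ord(c) for c in text]  # [1575, 1576, 1577]

print("%d bytes: %r" % (len(raw), byte_values))
print("%d characters: %r" % (len(text), code_points))
```

Looping over the byte string yields the 216/167 pairs reported in the question, while looping over the unicode object yields one code point per letter.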
