Unicode characters

Discussion in 'Python' started by Paul Johnston, Sep 4, 2006.

  1. Hi
    I have a string which I convert into a list then read through it
    printing its glyph and numeric representation

    #-*- coding: utf-8 -*-

    thestring = "abcd"
    thelist = list(thestring)

    for c in thelist:
        print c,
        print ord(c)

    Works fine for Latin characters, but when I put in a Unicode character,
    a two-byte character gives me two characters. For example an Arabic
    alef returns

    * 216
    * 167

    (the first asterisk is the empty set symbol, the second a double "s")

    Putting in sequential characters, i.e. alef, beh, teh marbuta, gives me
    sequential listings, i.e.
    216 167
    216 168
    216 169
    So it is reading the correct details.


    Is there any way to get the c in the for loop to recognise it is
    reading a multiple byte character?
    I have followed the info in PEP 0263 and am using Python 2.4.3 Build
    12 on a Windows box within Eclipse 3.2.0 and Python plugins 1.2.2

    Cheers Paul
     
    Paul Johnston, Sep 4, 2006
    #1

  2. limodou Guest

    If the string is not a unicode object, it is stored as encoded bytes,
    so iterating over it only gives you the individual bytes of each
    character's encoding. You can convert it to unicode first; then if a
    character's value is less than 128 it is ASCII, otherwise it is probably
    a multibyte character. For example:

    a = 'string'
    # decode the byte string using whatever encoding it was stored in
    b = unicode(a, encoding_according_your_situation)
    for i in b:
        if ord(i) < 128:
            print ord(i), 'ascii'
        else:
            print ord(i), 'multibytes'
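
For readers on a current interpreter, the same decode-then-iterate idea can be sketched in Python 3, where `bytes` plays the role of the Python 2 byte string and `str` the role of the `unicode` object. This is only an illustration under stated assumptions: the helper name `classify_chars` is made up for this example, and the input is assumed to be UTF-8.

```python
# Python 3 sketch (assumption: the raw data is UTF-8 encoded).
# classify_chars is a hypothetical helper, not part of any library.

def classify_chars(raw, encoding="utf-8"):
    """Decode raw bytes, then label each character by its code point."""
    text = raw.decode(encoding)  # bytes -> str, one item per character
    result = []
    for ch in text:
        # Code points below 128 are ASCII; anything else needs more
        # than one byte when encoded as UTF-8.
        kind = "ascii" if ord(ch) < 128 else "multibyte"
        result.append((ord(ch), kind))
    return result


# "a" followed by the two UTF-8 bytes of ARABIC LETTER ALEF (U+0627)
sample = b"a\xd8\xa7"
for code_point, kind in classify_chars(sample):
    print(code_point, kind)
# prints:
# 97 ascii
# 1575 multibyte
```

Once the bytes are decoded, the loop sees one item per character, so the alef shows up as the single code point 1575 rather than as the pair 216, 167.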

    --
    I like python!
    My Blog: http://www.donews.net/limodou
    UliPad Site: http://wiki.woodpecker.org.cn/moin/UliPad
    UliPad Maillist: http://groups.google.com/group/ulipad
     
    limodou, Sep 4, 2006
    #2

  3. Paul Johnston wrote:

    > thestring = "abcd"
    > thelist = list(thestring)
    >
    > for c in thelist:
    >     print c,
    >     print ord(c)
    >
    > [...]
    >
    > Is there any way to get the c in the for loop to recognise it is
    > reading a multiple byte character?


    Use unicode objects instead of byte strings. The above string literal is
    _not_ affected by the coding: header whatsoever.

    That applies only to

    u"some text"

    literals, and makes them unicode objects.

    Normal string literals are just bytes. Because the encoding is set
    properly in your editor, a multibyte character you enter is stored as
    its individual bytes, and the loop then sees each byte separately.
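
The numbers reported in the question can be reproduced with a short sketch. It is written for Python 3 (an assumption, since the thread itself uses Python 2.4) and assumes the console interpreted each byte as a Latin-1 character, which would explain the two stray glyphs.

```python
# Python 3 sketch: why an Arabic alef shows up as 216 and 167.
alef = "\u0627"  # ARABIC LETTER ALEF

# UTF-8 stores this character as two bytes.
utf8_bytes = alef.encode("utf-8")
byte_values = list(utf8_bytes)  # iterating over bytes yields integers
print(byte_values)              # [216, 167]

# Assumption: the bytes were displayed one at a time as Latin-1.
# 0xD8 is U+00D8 (capital O with stroke) and 0xA7 is U+00A7 (section sign).
misread = utf8_bytes.decode("latin-1")
print([hex(ord(ch)) for ch in misread])  # ['0xd8', '0xa7']
```

The first byte stays at 216 for alef, beh and teh marbuta because all three code points (U+0627 to U+0629) share the same UTF-8 lead byte; only the second byte changes.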

    In a nutshell: try the above using u"abcd".
    Diez
     
    Diez B. Roggisch, Sep 4, 2006
    #3
