How to find number of characters in a unicode string?

Discussion in 'Python' started by Preben Randhol, Sep 18, 2006.

  1. Hi

    If I use len() on a string containing unicode letters I get the number
    of bytes the string uses. This means that len() can report a size of 6
    when the unicode string only contains 3 characters (as one would write
    them by hand or see them on the screen). Is there a way to count the
    characters rather than the bytes used to represent them?

    The reason for asking is that PyGTK needs the number of characters to
    set the width of Entry widgets to a certain length, and it expects
    viewable characters, not the number of bytes used to represent them.


    Thanks in advance


    Preben
     
    Preben Randhol, Sep 18, 2006
    #1

  2. Yes and you already seem to know the answer: Decode the byte string and
    use `len()` on the unicode string.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Sep 18, 2006
    #2
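A minimal sketch of the decode-then-`len()` advice, written for Python 3 (in the Python 2 of this thread, `raw` would be a plain `str` and `text` a `unicode` object). The sample bytes and the assumption that the input is UTF-8 are illustrative only; they are not taken from the original post.

```python
# Hypothetical input: the UTF-8 bytes for the three letters "æøå".
raw = b"\xc3\xa6\xc3\xb8\xc3\xa5"

# Assumes the bytes really are UTF-8; another encoding needs another codec name.
text = raw.decode("utf-8")

byte_count = len(raw)    # length of the byte string
char_count = len(text)   # length of the decoded string, in code points

print(byte_count, char_count)
```

Note that `len()` on the decoded string counts code points, which is not always the same as the number of characters a reader sees on screen.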

  3. Are you sure you're using unicode objects?
    len(u'\uffff') == 1
    the encodings module should help you turn '\xff\xff' into u'\uffff'.
     
    faulkner, Sep 18, 2006
    #3
  4. .decode("utf-8") did the trick. Thanks!

    Preben
     
    Preben Randhol, Sep 19, 2006
    #4
  5. Hmmm, for some reason

    len(u"C\u0327")

    returns 2.
     
    Lawrence D'Oliveiro, Sep 29, 2006
    #5
  6. Okay, decode and normalize and then use `len()` on the unicode string.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Sep 29, 2006
    #6
  7. That's correct: these are two Unicode characters,
    C and combining cedilla, displayed as Ç. From
    <http://en.wikipedia.org/wiki/Unicode>:

    "Unicode takes the role of providing a unique
    code point — a number, not a glyph — for each
    character. In other words, Unicode represents a
    character in an abstract way, and leaves the
    visual rendering (size, shape, font or style) to
    other software [...] This simple aim becomes
    complicated, however, by concessions made by
    Unicode's designers, in the hope of encouraging a
    more rapid adoption of Unicode. [...] A lot of
    essentially identical characters were encoded
    multiple times at different code points to
    preserve distinctions used by legacy encodings
    and therefore allow conversion from those
    encodings to Unicode (and back) without losing
    any information. [...] Also, while Unicode allows
    for combining characters, it also contains
    precomposed versions of most letter/diacritic
    combinations in normal use. These make conversion
    to and from legacy encodings simpler and allow
    applications to use Unicode as an internal text
    format without having to implement combining
    characters. For example é can be represented in
    Unicode as U+0065 (Latin small letter e) followed
    by U+0301 (combining acute) but it can also be
    represented as the precomposed character U+00E9
    (Latin small letter e with acute)."

    Gabriel Genellina
    Softlab SRL





     
    Gabriel Genellina, Sep 29, 2006
    #7
  8. Is len(unicodedata.normalize('NFC', u"C\u0327")) what you want?
     
    Leif K-Brooks, Sep 29, 2006
    #8
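The normalization suggestion can be sketched as follows, again in Python 3. The helper names `nfc_len` and `base_char_count` are made up for this illustration. NFC only shortens a string when a precomposed code point exists, so the second helper, which skips code points with a non-zero canonical combining class, is shown as a rough fallback and not as a complete answer.

```python
import unicodedata

def nfc_len(s):
    """Length in code points after composing to NFC."""
    return len(unicodedata.normalize("NFC", s))

def base_char_count(s):
    """Count code points whose canonical combining class is 0.

    Rough approximation only: spacing marks (class 0) are still counted.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

decomposed = "C\u0327"        # C followed by combining cedilla
print(len(decomposed))        # 2
print(nfc_len(decomposed))    # 1, composes to U+00C7

no_precomposed = "q\u0303"    # q with combining tilde has no precomposed form
print(nfc_len(no_precomposed))          # 2
print(base_char_count(no_precomposed))  # 1
```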
  9. If Python ever provides this functionality, I guess it would be
    u"C\u0327".width() == 1. But it's not clear when unicode.org will
    provide recommended fixed-width font character width information for
    *all* characters. I recently stumbled upon the Tamil language, where
    for example u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca',
    u'\u0b95\u0bcc' look like they have widths of 1, 2, 3 and 4 columns.
    To add insult to injury, these 4 symbols are all considered *single*
    letter symbols :) If your email reader is able to show them, here
    they are in all their glory:
    க், கா, கொ, கௌ.
     
    Leo Kislov, Oct 11, 2006
    #9
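As a rough illustration of the Tamil example (Python 3), the four sequences each contain two code points, yet only one of them is not a mark. The helper `count_non_marks` is hypothetical; it uses the Unicode general category to skip marks, which approximates "letters as a reader counts them" for these samples. It is not full grapheme cluster segmentation and says nothing about how many columns a font will use.

```python
import unicodedata

def count_non_marks(s):
    """Count code points whose general category is not a mark (Mn, Mc, Me)."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

# The four sequences quoted in the post above.
samples = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc"]

# Pair each sequence's code point count with its non-mark count.
counts = [(len(s), count_non_marks(s)) for s in samples]
print(counts)
```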