How to find number of characters in a unicode string?

Discussion in 'Python' started by Preben Randhol, Sep 18, 2006.

  1. Hi

    If I use len() on a string containing unicode letters I get the number
    of bytes the string uses. This means that len() can report size 6 when
    the unicode string only contains 3 characters (that one would write by
    hand or see on the screen). Is there a way to count the characters
    rather than the bytes used to represent them?

    The reason for asking is that PyGTK needs the number of characters to
    set the width of Entry widgets to a certain length, and it expects
    viewable characters, not the number of bytes used to represent them.


    Thanks in advance


    Preben
     
    Preben Randhol, Sep 18, 2006
    #1

  2. Preben Randhol wrote:

    > If I use len() on a string containing unicode letters I get the number
    > of bytes the string uses. This means that len() can report size 6 when
    > the unicode string only contains 3 characters (that one would write by
    > hand or see on the screen). Is there a way to calculate in characters
    > and not in bytes to represent the characters.


    Yes and you already seem to know the answer: Decode the byte string and
    use `len()` on the unicode string.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Sep 18, 2006
    #2
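
To make the bytes-versus-characters difference concrete, here is a minimal sketch. It is written for Python 3, where `bytes` plays the role of the byte string discussed in this thread and `str` plays the role of the unicode object; the sample text and the choice of UTF-8 are assumptions for illustration only.

```python
# Python 3 sketch: bytes ~ the thread's "byte string", str ~ "unicode string".
# The sample text (three Norwegian letters) is an assumption for illustration.
text = "\u00e6\u00f8\u00e5"        # ae-ligature, o-slash, a-ring
raw = text.encode("utf-8")         # each of these letters takes two bytes in UTF-8

byte_count = len(raw)                  # length of the encoded data
char_count = len(raw.decode("utf-8"))  # length in code points after decoding

print(byte_count, char_count)
```

The decoded length is the one to compare against a width given in characters.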

  3. are you sure you're using unicode objects?
    len(u'\uffff') == 1
    the encodings module should help you turn '\xff\xff' into u'\uffff'.

    Preben Randhol wrote:
    > Hi
    >
    > If I use len() on a string containing unicode letters I get the number
    > of bytes the string uses. This means that len() can report size 6 when
    > the unicode string only contains 3 characters (that one would write by
    > hand or see on the screen). Is there a way to calculate in characters
    > and not in bytes to represent the characters.
    >
    > The reason for asking is that PyGTK needs number of characters to set
    > the width of Entry widgets to a certain length, and it expects viewable
    > characters and not number of bytes to represent them.
    >
    >
    > Thanks in advance
    >
    >
    > Preben
     
    faulkner, Sep 18, 2006
    #3
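
The post above does not say which encoding turns '\xff\xff' into u'\uffff'. The sketch below assumes little-endian UTF-16, which is one codec that gives that result, and shows that the same two bytes decode differently (or not at all) under other codecs. It uses Python 3 syntax, where byte strings are written with a `b` prefix.

```python
data = b"\xff\xff"

# Read as one little-endian UTF-16 code unit, the two bytes are U+FFFF.
as_utf16 = data.decode("utf-16-le")

# Read as Latin-1, every byte becomes its own character.
as_latin1 = data.decode("latin-1")

# Read as UTF-8, the bytes are invalid and decoding raises an error.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(as_utf16), len(as_latin1), utf8_ok)
```

So the character count depends on decoding with the encoding the bytes were actually written in.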
  4. On Mon, 18 Sep 2006 22:29:20 +0200
    Marc 'BlackJack' Rintsch wrote:

    > Yes and you already seem to know the answer: Decode the byte string
    > and use `len()` on the unicode string.


    .decode("utf-8") did the trick. Thanks!

    Preben
     
    Preben Randhol, Sep 19, 2006
    #4
  5. Marc 'BlackJack' Rintsch wrote:

    > In <>,
    > Preben Randhol wrote:
    >
    >> Is there a way to calculate in characters
    >> and not in bytes to represent the characters.

    >
    > Decode the byte string and use `len()` on the unicode string.


    Hmmm, for some reason

    len(u"C\u0327")

    returns 2.
     
    Lawrence D'Oliveiro, Sep 29, 2006
    #5
  6. In <efija1$357$>, Lawrence D'Oliveiro wrote:

    > In message <>, Marc 'BlackJack'
    > Rintsch wrote:
    >
    >> In <>,
    >> Preben Randhol wrote:
    >>
    >>> Is there a way to calculate in characters
    >>> and not in bytes to represent the characters.

    >>
    >> Decode the byte string and use `len()` on the unicode string.

    >
    > Hmmm, for some reason
    >
    > len(u"C\u0327")
    >
    > returns 2.


    Okay, decode and normalize and then use `len()` on the unicode string.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Sep 29, 2006
    #6
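
A rough sketch of the "decode, normalize, then `len()`" recipe, in Python 3 syntax. The helper name `char_count` and the UTF-8 default are assumptions made for this example, not something named in the thread.

```python
import unicodedata

def char_count(raw, encoding="utf-8"):
    """Decode bytes, compose to NFC, and count the resulting code points.

    Hypothetical helper; the default encoding is an assumption.
    """
    text = raw.decode(encoding)
    return len(unicodedata.normalize("NFC", text))

# 'C' followed by COMBINING CEDILLA, encoded as UTF-8 (1 byte + 2 bytes).
cedilla_bytes = "C\u0327".encode("utf-8")

print(len(cedilla_bytes), char_count(cedilla_bytes))
```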
  7. At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:

    > >> Is there a way to calculate in characters
    > >> and not in bytes to represent the characters.

    > >
    > > Decode the byte string and use `len()` on the unicode string.

    >
    >Hmmm, for some reason
    >
    > len(u"C\u0327")
    >
    >returns 2.


    That's correct: these are two Unicode code points,
    C and combining cedilla, displayed together as Ç. From
    <http://en.wikipedia.org/wiki/Unicode>:

    "Unicode takes the role of providing a unique
    code point — a number, not a glyph — for each
    character. In other words, Unicode represents a
    character in an abstract way, and leaves the
    visual rendering (size, shape, font or style) to
    other software [...] This simple aim becomes
    complicated, however, by concessions made by
    Unicode's designers, in the hope of encouraging a
    more rapid adoption of Unicode. [...] A lot of
    essentially identical characters were encoded
    multiple times at different code points to
    preserve distinctions used by legacy encodings
    and therefore allow conversion from those
    encodings to Unicode (and back) without losing
    any information. [...] Also, while Unicode allows
    for combining characters, it also contains
    precomposed versions of most letter/diacritic
    combinations in normal use. These make conversion
    to and from legacy encodings simpler and allow
    applications to use Unicode as an internal text
    format without having to implement combining
    characters. For example é can be represented in
    Unicode as U+0065 (Latin small letter e) followed
    by U+0301 (combining acute) but it can also be
    represented as the precomposed character U+00E9
    (Latin small letter e with acute)."

    Gabriel Genellina
    Softlab SRL





     
    Gabriel Genellina, Sep 29, 2006
    #7
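
The two spellings of é mentioned in the quoted passage can be inspected with the standard `unicodedata` module. This is only an illustrative sketch in Python 3 syntax; the variable names are made up for the example.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301
precomposed = "\u00e9"    # U+00E9

# Official names of the two code points in the decomposed form.
names = [unicodedata.name(ch) for ch in decomposed]

# The strings differ as code point sequences, but normalization maps
# one form onto the other in both directions.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(names, len(decomposed), len(precomposed))
```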
  8. Lawrence D'Oliveiro wrote:
    > Hmmm, for some reason
    >
    > len(u"C\u0327")
    >
    > returns 2.


    Is len(unicodedata.normalize('NFC', u"C\u0327")) what you want?
     
    Leif K-Brooks, Sep 29, 2006
    #8
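
One caveat worth sketching: NFC only shortens a sequence when Unicode has a precomposed character for it. The second example below (x with a combining acute) is an assumption chosen because no precomposed form exists for it, so the length stays at 2 even after normalizing. Python 3 syntax.

```python
import unicodedata

# C + COMBINING CEDILLA has a precomposed form (U+00C7), so NFC merges them.
composed = unicodedata.normalize("NFC", "C\u0327")

# x + COMBINING ACUTE ACCENT has no precomposed form, so NFC leaves two
# code points even though a reader sees one letter.
no_precomposed = unicodedata.normalize("NFC", "x\u0301")

print(len(composed), len(no_precomposed))
```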
  9. Lawrence D'Oliveiro wrote:
    > In message <>, Marc 'BlackJack'
    > Rintsch wrote:
    >
    > > In <>,
    > > Preben Randhol wrote:
    > >
    > >> Is there a way to calculate in characters
    > >> and not in bytes to represent the characters.

    > >
    > > Decode the byte string and use `len()` on the unicode string.

    >
    > Hmmm, for some reason
    >
    > len(u"C\u0327")
    >
    > returns 2.


    If Python ever provides this functionality, I guess it would be
    u"C\u0327".width() == 1. But it's not clear when unicode.org will
    provide recommended fixed-font character width information for *all*
    characters. I recently stumbled upon the Tamil language, where for example
    u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
    look like they have widths of 1, 2, 3 and 4 columns. To add insult to injury,
    these 4 symbols are all considered *single* letter symbols :) If your
    email reader is able to show them, here they are in all their glory:
    க், கா, கொ, கௌ.
     
    Leo Kislov, Oct 11, 2006
    #9
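
There is no `width()` method on strings, as the post above notes. As a rough heuristic only, one can count the code points that are not combining marks, which treats each of the four Tamil examples as a single letter. The helper name `count_base_characters` is made up for this sketch (Python 3 syntax). It says nothing about display columns and is not a full implementation of Unicode text segmentation.

```python
import unicodedata

def count_base_characters(text):
    """Count code points whose general category is not a mark (Mn, Mc, Me).

    Hypothetical helper: an approximation of "letters as a reader counts
    them", not a measure of display width.
    """
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

# The four Tamil examples from the post: KA plus a virama or vowel sign.
tamil = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc"]

code_points = [len(s) for s in tamil]
base_counts = [count_base_characters(s) for s in tamil]

print(code_points, base_counts)
```

The same helper also gives 1 for u"C\u0327" without normalizing first, since the cedilla is a combining mark.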
