How to find number of characters in a unicode string?

Preben Randhol

Hi

If I use len() on a string containing unicode letters I get the number
of bytes the string uses. This means that len() can report size 6 when
the unicode string only contains 3 characters (that one would write by
hand or see on the screen). Is there a way to count the characters
rather than the bytes used to represent them?

The reason for asking is that PyGTK needs number of characters to set
the width of Entry widgets to a certain length, and it expects viewable
characters and not number of bytes to represent them.


Thanks in advance


Preben
 
Marc 'BlackJack' Rintsch

If I use len() on a string containing unicode letters I get the number
of bytes the string uses. This means that len() can report size 6 when
the unicode string only contains 3 characters (that one would write by
hand or see on the screen). Is there a way to calculate in characters
and not in bytes to represent the characters.

Yes and you already seem to know the answer: Decode the byte string and
use `len()` on the unicode string.
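
A minimal sketch of that advice, written for Python 3, where the byte string is a `bytes` object (in the Python 2 of this thread the same `.decode()` call is made on a `str`). The helper name `char_count` and the sample letters are assumptions chosen for illustration, and UTF-8 is assumed as the encoding:

```python
def char_count(raw, encoding="utf-8"):
    """Return the number of code points in an encoded byte string."""
    return len(raw.decode(encoding))

# Three Norwegian letters (ae, o-slash, a-ring), each 2 bytes in UTF-8.
raw = "\u00e6\u00f8\u00e5".encode("utf-8")

print(len(raw))         # 6 (bytes)
print(char_count(raw))  # 3 (code points)
```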

Ciao,
Marc 'BlackJack' Rintsch
 
faulkner

are you sure you're using unicode objects?
len(u'\uffff') == 1
the encodings module should help you turn '\xff\xff' into u'\uffff'.
 
Preben Randhol

Yes and you already seem to know the answer: Decode the byte string
and use `len()` on the unicode string.

.decode("utf-8") did the trick. Thanks!

Preben
 
Marc 'BlackJack' Rintsch

Hmmm, for some reason

len(u"C\u0327")

returns 2.

Okay, decode and normalize and then use `len()` on the unicode string.
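
A rough sketch of the "normalize, then `len()`" step using the standard `unicodedata` module (Python 3 syntax; the helper name `nfc_len` is an assumption). NFC only helps when a precomposed form exists, so it will not collapse every base-plus-mark sequence to a single code point:

```python
import unicodedata

def nfc_len(text):
    """Length in code points after composing characters with NFC."""
    return len(unicodedata.normalize("NFC", text))

decomposed = "C\u0327"  # C followed by COMBINING CEDILLA

print(len(decomposed))      # 2
print(nfc_len(decomposed))  # 1, NFC composes the pair to U+00C7
```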

Ciao,
Marc 'BlackJack' Rintsch
 
Gabriel Genellina

Lawrence said:
Hmmm, for some reason

len(u"C\u0327")

returns 2.

That's correct: these are two Unicode characters,
C and combining cedilla, displayed together as Ç. From
<http://en.wikipedia.org/wiki/Unicode>:

"Unicode takes the role of providing a unique
code point — a number, not a glyph — for each
character. In other words, Unicode represents a
character in an abstract way, and leaves the
visual rendering (size, shape, font or style) to
other software [...] This simple aim becomes
complicated, however, by concessions made by
Unicode's designers, in the hope of encouraging a
more rapid adoption of Unicode. [...] A lot of
essentially identical characters were encoded
multiple times at different code points to
preserve distinctions used by legacy encodings
and therefore allow conversion from those
encodings to Unicode (and back) without losing
any information. [...] Also, while Unicode allows
for combining characters, it also contains
precomposed versions of most letter/diacritic
combinations in normal use. These make conversion
to and from legacy encodings simpler and allow
applications to use Unicode as an internal text
format without having to implement combining
characters. For example é can be represented in
Unicode as U+0065 (Latin small letter e) followed
by U+0301 (combining acute) but it can also be
represented as the precomposed character U+00E9
(Latin small letter e with acute)."
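
The two representations of é mentioned in the quote can be compared directly. This is a small illustration in Python 3 using `unicodedata.normalize`; the variable names are made up for the example:

```python
import unicodedata

combining = "e\u0301"    # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00e9"   # U+00E9 (Latin small letter e with acute)

# The raw strings differ, even though both display as the same letter.
print(combining == precomposed)  # False

# NFC composes, NFD decomposes; after either one the two forms agree.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```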

Gabriel Genellina
Softlab SRL





 
Leo Kislov

Lawrence said:
Hmmm, for some reason

len(u"C\u0327")

returns 2.

If Python ever provides this functionality, I guess it would be
something like u"C\u0327".width() == 1. But it's not clear when
unicode.org will provide recommended fixed-width font character width
information for *all* characters. I recently stumbled upon the Tamil
language, where for example
u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
look like they have widths of 1, 2, 3 and 4 columns. To add insult to
injury, these 4 symbols are all considered *single* letter symbols :) If
your email reader is able to show them, here they are in all their glory:
க், கா, கொ, கௌ.
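
One rough approximation of "letters as a reader counts them" is to skip every code point whose Unicode general category is a mark (Mn, Mc, Me). The sketch below is Python 3 and the helper name `count_without_marks` is an assumption. It says nothing about how many columns a font will use, which is the part that remains unsolved here:

```python
import unicodedata

def count_without_marks(text):
    """Count code points, skipping those in the mark categories (M*)."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

# The four Tamil sequences from the post: KA plus a dependent sign.
samples = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc"]

for s in samples:
    # len() sees two code points; skipping marks leaves one letter.
    print(len(s), count_without_marks(s))
```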
 
