How to find number of characters in a unicode string?

Preben Randhol · Sep 18, 2006

Hi

If I use len() on a string containing unicode letters I get the number
of bytes the string uses. This means that len() can report size 6 when
the unicode string only contains 3 characters (that one would write by
hand or see on the screen). Is there a way to count the characters
rather than the bytes used to represent them?

The reason for asking is that PyGTK needs the number of characters to set
the width of Entry widgets to a certain length, and it expects visible
characters, not the number of bytes used to represent them.

Thanks in advance

Preben

Marc 'BlackJack' Rintsch · Sep 18, 2006

If I use len() on a string containing unicode letters I get the number
of bytes the string uses. This means that len() can report size 6 when
the unicode string only contains 3 characters (that one would write by
hand or see on the screen). Is there a way to count the characters
rather than the bytes used to represent them?

Yes and you already seem to know the answer: Decode the byte string and
use `len()` on the unicode string.
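
A minimal sketch of this suggestion, written for Python 3 (where the byte string type is `bytes` and text is `str`). The sample letters and the UTF-8 encoding are assumptions, since the original post does not say what the string contained:

```python
# Three letters that each take two bytes in UTF-8 (assumed sample data).
raw = "\u00e6\u00f8\u00e5".encode("utf-8")

# len() on the byte string counts bytes.
byte_count = len(raw)

# Decoding yields a text string; len() then counts code points.
text = raw.decode("utf-8")
char_count = len(text)

print(byte_count, char_count)  # 6 3
```

In Python 2, as used in this thread, the equivalent call is `.decode("utf-8")` on a plain `str`, which returns a `unicode` object.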

Ciao,
Marc 'BlackJack' Rintsch

faulkner · Sep 18, 2006

are you sure you're using unicode objects?
len(u'\uffff') == 1
the encodings module should help you turn '\xff\xff' into u'\uffff'.
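
As a rough illustration in Python 3: the post does not name the encoding, so UTF-16 little-endian is an assumption here, chosen because it maps those two bytes to the single code point U+FFFF.

```python
# Two bytes that form one UTF-16 code unit (the encoding is an assumption).
raw = b"\xff\xff"

# Decode the bytes into a text string of one code point.
text = raw.decode("utf-16-le")

print(len(raw), len(text))  # 2 1
```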

Preben Randhol · Sep 19, 2006

Yes and you already seem to know the answer: Decode the byte string
and use `len()` on the unicode string.

.decode("utf-8") did the trick. Thanks!

Preben

Lawrence D'Oliveiro · Sep 29, 2006

Marc 'BlackJack' said:
Decode the byte string and use `len()` on the unicode string.

Hmmm, for some reason

len(u"C\u0327")

returns 2.

Marc 'BlackJack' Rintsch · Sep 29, 2006

Hmmm, for some reason

len(u"C\u0327")

returns 2.

Okay, decode and normalize and then use `len()` on the unicode string.

Ciao,
Marc 'BlackJack' Rintsch

Gabriel Genellina · Sep 29, 2006

Lawrence said:
Hmmm, for some reason

len(u"C\u0327")

returns 2.

That's correct: these are two Unicode characters,
C and combining cedilla, which display as Ç. From
<http://en.wikipedia.org/wiki/Unicode>:

"Unicode takes the role of providing a unique
code point — a number, not a glyph — for each
character. In other words, Unicode represents a
character in an abstract way, and leaves the
visual rendering (size, shape, font or style) to
other software [...] This simple aim becomes
complicated, however, by concessions made by
Unicode's designers, in the hope of encouraging a
more rapid adoption of Unicode. [...] A lot of
essentially identical characters were encoded
multiple times at different code points to
preserve distinctions used by legacy encodings
and therefore allow conversion from those
encodings to Unicode (and back) without losing
any information. [...] Also, while Unicode allows
for combining characters, it also contains
precomposed versions of most letter/diacritic
combinations in normal use. These make conversion
to and from legacy encodings simpler and allow
applications to use Unicode as an internal text
format without having to implement combining
characters. For example é can be represented in
Unicode as U+0065 (Latin small letter e) followed
by U+0301 (combining acute) but it can also be
represented as the precomposed character U+00E9
(Latin small letter e with acute)."
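
The é example from the quote can be checked directly. This is a small Python 3 sketch using the standard `unicodedata` module; the variable names are only illustrative:

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00e9"   # U+00E9 (Latin small letter e with acute)

# The two spellings have different lengths and do not compare equal as-is.
lengths = (len(decomposed), len(precomposed))   # (2, 1)
equal_raw = decomposed == precomposed           # False

# After normalizing to the same form they become identical.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed   # True
equal_nfd = unicodedata.normalize("NFD", precomposed) == decomposed   # True
```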

Gabriel Genellina
Softlab SRL


Leif K-Brooks · Sep 29, 2006

Lawrence said:
Hmmm, for some reason

len(u"C\u0327")

returns 2.

Is len(unicodedata.normalize('NFC', u"C\u0327")) what you want?
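
A short Python 3 sketch of that idea, with one caveat: NFC only shortens a sequence when a precomposed code point exists. The helper name `nfc_len` is hypothetical, and the "q" plus combining tilde sample is an added illustration of a sequence with no precomposed form:

```python
import unicodedata

def nfc_len(text):
    """Return the number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

# C + combining cedilla composes to the single code point U+00C7.
c_cedilla = nfc_len("C\u0327")   # 1

# q + combining tilde has no precomposed form, so it stays two code points.
q_tilde = nfc_len("q\u0303")     # 2
```

So normalization helps for common letter/diacritic pairs, but it is not a general count of what a reader would see as one character.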

Leo Kislov · Oct 11, 2006

Lawrence said:
Hmmm, for some reason

len(u"C\u0327")

returns 2.

If Python ever provides this functionality, I guess it would be
u"C\u0327".width() == 1. But it's not clear when unicode.org will
provide recommended fixed-width font character width information for *all*
characters. I recently stumbled upon the Tamil language, where for example
u'\u0b95\u0bcd', u'\u0b95\u0bbe', u'\u0b95\u0bca', u'\u0b95\u0bcc'
look like they have widths of 1, 2, 3 and 4 columns. To add insult to injury,
these 4 symbols are all considered *single* letter symbols.

If your email reader is able to show them, here they are in all their glory:
க், கா, கொ, கௌ.
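
One rough way to approximate "letters as a reader counts them" for the Tamil strings above is to skip code points whose Unicode general category is a mark (Mn, Mc, Me). This is only a sketch in Python 3: the helper name `count_base_chars` is hypothetical, it is not full grapheme cluster segmentation, and it says nothing about display width in columns.

```python
import unicodedata

def count_base_chars(text):
    """Count code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

# The four Tamil sequences from the post, plus C + combining cedilla.
samples = ["\u0b95\u0bcd", "\u0b95\u0bbe", "\u0b95\u0bca", "\u0b95\u0bcc", "C\u0327"]

code_points = [len(s) for s in samples]           # [2, 2, 2, 2, 2]
base_chars = [count_base_chars(s) for s in samples]  # [1, 1, 1, 1, 1]
```

Each sample is two code points but one base letter, which matches the observation that they are considered single letters even though their rendered widths differ.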
