Counting Unicode graphemes in Python

Srinath Avadhanula · Oct 24, 2003

Hello,

I am wondering if there is a way of counting graphemes (or
glyphs) in python. For example, in the following string:

u'\u0915\u093e\u0915'
(or equivalently,
u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}")

the first two "code points" represent a single character on the screen.
In my application, the GUI seems to handle that part (i.e. combining
characters). However, I need to handle cursor movement myself. The GUI
can only be told to move forward by a specified number of bytes.
Therefore, to make the cursor keys move over graphemes or glyphs rather
than code points, I need to figure out a way to calculate grapheme
boundaries in Python. I searched the web for a long time and came up
with a few results, the most relevant of which seems to be:

http://www.unicode.org/reports/tr29/tr29-2.html

This page contains rules for calculating grapheme boundaries for Hangul
characters or something of that sort. However, I did not find any
information about more general algorithms.

I also took a look at the unicodedata module in Python, which has a
function called unicodedata.category. This function seems to return
the strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
have been unable to find a reference for what these strings signify.
Where should I look for them? (I am hoping for something more specific
than "Look at www.unicode.org".) Is this information relevant at all for
counting graphemes?

Thanks,
Srinath

vincent wehren · Oct 24, 2003

| Hello,
|
| I am wondering if there is a way of counting graphemes (or
| glyphs) in python. For example, in the following string:
|
| u'\u0915\u093e\u0915'
| (
| or equivalently,
| u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
| )
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can't do that unless you *know* exactly which code
points form ligatures. In Devanagari these are, e.g., the so-called
dependent vowels in the range 093e - 094c, wherein 093f stands "left of
the consonant" when rendered. (My knowledge of Indic languages is limited
at best, so there may be more to it.)

| In my application, the GUI seems to handle that part (i.e combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?

| Therefore, to make cursor keys move over graphemes or glyps rather than
| code-points, I need to figure out a way to calculate grapheme boundaries
| in python. I searched the web for a long long time and came up with a
| few results, the most relevant of which seems to be:
|
| http://www.unicode.org/reports/tr29/tr29-2.html
|
| This page contains rules for calculating grapheme boundaries for Hangul
| characters or something of that sort. However, I did not find any
| information about more general algorithms.

Some systems such as the X Server on IndiX seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html

|
| I also took a look at the unicodedata module in python and that seems to
| have a function called unicodedata.category. This function seems to
| returns strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
| have been unable to find a reference for what these strings signify.
| Where should I look for them? (I am hoping for something more specific
| than "Look at www.unicode.org")

Would "Look at
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?
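
As a quick illustration of those category values, here is a minimal sketch (written for Python 3 rather than the Python 2 literals used in this thread). The first letter of each code gives the major class (L = letter, M = mark) and the second letter refines it (Lo = other letter, Mn = nonspacing mark, Mc = spacing combining mark). The helper name `describe` is made up for this example, and the exact values reported depend on the Unicode database version bundled with the interpreter.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, character name) per code point."""
    return [
        ("U+%04X" % ord(ch),
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))  # fall back if the code point has no name
        for ch in text
    ]

# Inspect the sample string from the question: KA, VOWEL SIGN AA, KA.
for row in describe("\u0915\u093e\u0915"):
    print(row)
```

Any category beginning with 'M' marks a code point that attaches to the preceding one, which is the property that matters for finding grapheme boundaries.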

HTH,
Vincent Wehren

|
| Thanks,
| Srinath
|

Srinath Avadhanula · Oct 24, 2003

|
| the first two "code points" represent a single character on the screen.

| My GUESS is that you can't do that unless you *know* exactly which code
| points form ligatures. In Devanagari these are, e.g., the so-called
| dependent vowels in the range 093e - 094c, wherein 093f stands "left of
| the consonant" when rendered. (My knowledge of Indic languages is limited
| at best, so there may be more to it.)

After a sleepless night, I finally found out that calculating grapheme
boundaries for Devanagari is not so hard after all. It seems to work
reasonably well if I use just three simple rules to decide whether, in
the code point sequence 'ab', the junction between 'a' and 'b' is a
glyph boundary:

1. If 'b' is some kind of mark (i.e. unicodedata.category(b) starts
with 'M'), then the 'ab' junction is not a glyph boundary.

2. If 'b' is not a mark but is a Devanagari letter (i.e. category 'Lo')
AND 'a' is a VIRAMA character (i.e. 'VIRAMA' in unicodedata.name(a)),
then the 'ab' junction is not a glyph boundary.

3. In every other situation, the 'ab' junction is a glyph boundary.

I don't really know if this is completely correct, but it performs
pretty well on quite a big Sanskrit text I have. It handles things like

NA + HALANT + DHA + HALANT + YA + AA

and reports it (correctly) as a single glyph.
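
The three rules above can be sketched roughly as follows. This is a minimal Python 3 sketch of the heuristic described in this post, not the full boundary algorithm from the Unicode report linked earlier; the function names `is_boundary` and `split_clusters` are made up for illustration.

```python
import unicodedata

def is_boundary(a, b):
    """Return True if the junction between code points a and b is a glyph boundary."""
    cat_b = unicodedata.category(b)
    # Rule 1: a mark (category M*) attaches to what precedes it.
    if cat_b.startswith("M"):
        return False
    # Rule 2: a letter directly after a VIRAMA continues the conjunct.
    if cat_b == "Lo" and "VIRAMA" in unicodedata.name(a, ""):
        return False
    # Rule 3: everything else is a boundary.
    return True

def split_clusters(text):
    """Group the code points of text into clusters using is_boundary."""
    clusters = []
    for ch in text:
        if clusters and not is_boundary(clusters[-1][-1], ch):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# NA + HALANT + DHA + HALANT + YA + AA
conjunct = "\u0928\u094d\u0927\u094d\u092f\u093e"
print(len(split_clusters(conjunct)))
# The sample string from the first post: KA + AA, then KA
print(len(split_clusters("\u0915\u093e\u0915")))
```

The number of graphemes is then simply `len(split_clusters(text))`. Rule 2 checks the character name, so it would also apply to VIRAMA characters of other scripts; whether that is desirable there has not been tested in this thread.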

| In my application, the GUI seems to handle that part (i.e combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

| What GUI are you working with?

I am using wxPython on Windows XP. There are two text display widgets,
wxTextCtrl and wxStyledTextCtrl. The former is pretty basic, but its
caret positioning is pretty robust. The latter is very fancy (it handles
syntax highlighting etc.) but has some serious problems with combining
characters.
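
Since the GUI can only be told to move by a number of bytes, the boundary rule has to be turned into byte steps, one per cursor key press. Here is a rough Python 3 sketch, assuming the widget stores its text as UTF-8 (an assumption, not something confirmed in this thread) and reusing the three-rule heuristic from earlier in this post; `is_boundary` and `cluster_byte_lengths` are illustrative names.

```python
import unicodedata

def is_boundary(a, b):
    """Three-rule heuristic: True if the a/b junction is a glyph boundary."""
    cat_b = unicodedata.category(b)
    if cat_b.startswith("M"):
        return False
    if cat_b == "Lo" and "VIRAMA" in unicodedata.name(a, ""):
        return False
    return True

def cluster_byte_lengths(text):
    """Return the UTF-8 byte length of each cluster in text, in order.

    Each entry is how far the cursor would have to move (in bytes) to
    step over one cluster, assuming a UTF-8 text buffer.
    """
    lengths = []
    prev = None
    for ch in text:
        n = len(ch.encode("utf-8"))
        if prev is not None and not is_boundary(prev, ch):
            lengths[-1] += n   # same cluster: widen the current step
        else:
            lengths.append(n)  # new cluster: new step
        prev = ch
    return lengths

# KA + AA is one step, the trailing KA is another.
print(cluster_byte_lengths("\u0915\u093e\u0915"))
```

Moving the cursor right from the start of the text would then mean advancing by `lengths[0]` bytes, then `lengths[1]`, and so on; moving left walks the same list backwards.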

| Some systems such as the X Server on IndiX seem to dig into the GPOS and
| GSUB tables in the OpenType font. See:
|
| http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html

Thanks for the link!

| Would "Look at
| http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?

It does indeed. Notice my new-found fluency with unicodedata.category?

Thanks,
Srinath
