Counting Unicode graphemes in Python

Srinath Avadhanula

Hello,

I am wondering if there is a way of counting graphemes (or
glyphs) in Python. For example, in the following string:

u'\u0915\u093e\u0915'
(or equivalently,
u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}")

the first two "code points" represent a single character on the screen.
In my application, the GUI seems to handle that part (i.e., combining
characters). However, I need to handle cursor movement myself. The GUI
can only be told to move forward by a specified number of bytes.
Therefore, to make cursor keys move over graphemes or glyphs rather than
code points, I need to figure out a way to calculate grapheme boundaries
in Python. I searched the web for a long time and came up with a
few results, the most relevant of which seems to be:

http://www.unicode.org/reports/tr29/tr29-2.html

This page contains rules for calculating grapheme boundaries for Hangul
characters or something of that sort. However, I did not find any
information about more general algorithms.

I also took a look at the unicodedata module in Python and that seems to
have a function called unicodedata.category. This function seems to
return strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
have been unable to find a reference for what these strings signify.
Where should I look for them? (I am hoping for something more specific
than "Look at www.unicode.org") Is this information relevant at all for
counting graphemes?

Thanks,
Srinath
 
vincent wehren

| Hello,
|
| I am wondering if there is a way of counting graphemes (or
| glyphs) in python. For example, in the following string:
|
| u'\u0915\u093e\u0915'
| (
| or equivalently,
| u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
| )
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can't do that unless you *know* exactly which code points
form ligatures. In Devanagari these are, e.g., the so-called dependent vowels
in the range 093e - 094c, wherein 093f stands "left of the consonant" when
rendered. (My knowledge of Indic languages is limited, at best, so there may
be more to it.)



| In my application, the GUI seems to handle that part (i.e., combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?

| Therefore, to make cursor keys move over graphemes or glyphs rather than
| code-points, I need to figure out a way to calculate grapheme boundaries
| in python. I searched the web for a long long time and came up with a
| few results, the most relevant of which seems to be:
|
| http://www.unicode.org/reports/tr29/tr29-2.html
|
| This page contains rules for calculating grapheme boundaries for Hangul
| characters or something of that sort. However, I did not find any
| information about more general algorithms.


Some systems such as the X Server on IndiX seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html



|
| I also took a look at the unicodedata module in python and that seems to
| have a function called unicodedata.category. This function seems to
| return strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
| have been unable to find a reference for what these strings signify.
| Where should I look for them? (I am hoping for something more specific
| than "Look at www.unicode.org")

Would "Look at
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?
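
For readers with the same question: the two-letter strings are Unicode General Category values. The first letter gives the major class (L for letter, M for mark) and the second refines it (Lo is "Letter, other", Mn is "Mark, nonspacing", Mc is "Mark, spacing combining"). The following is a minimal sketch for inspecting them, assuming Python 3, where every str is Unicode. The helper name describe is made up for illustration, and the exact categories reported depend on the Unicode database bundled with the interpreter.

```python
import unicodedata

# The example string from the question: KA, VOWEL SIGN AA, KA.
text = "\u0915\u093e\u0915"


def describe(s):
    """Return (code point, general category, name) for each character in s.

    'describe' is a hypothetical helper, not part of unicodedata.
    """
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


for row in describe(text):
    print(row)
```

Running this on any code point of interest shows directly whether it is a letter or some kind of mark, which is the distinction that matters for grouping code points.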

HTH,
Vincent Wehren

|
| Thanks,
| Srinath
|
 
Srinath Avadhanula

|
| the first two "code points" represent a single character on the screen.

| My GUESS is that you can't do that unless you *know* exactly which code points
| form ligatures. In Devanagari these are, e.g., the so-called dependent vowels
| in the range 093e - 094c, wherein 093f stands "left of the consonant" when
| rendered. (My knowledge of Indic languages is limited, at best, so there may
| be more to it.)

After a sleepless night, I finally found out that calculating grapheme
boundaries for Devanagari is not so hard after all. It seems to work
reasonably well if I use just three simple rules.

The rules decide whether, in the code point sequence 'ab', the junction
between 'a' and 'b' is a glyph boundary:

1. If 'b' is some kind of a mark (i.e., unicodedata.category(b) starts
with 'M'), then the 'ab' junction is not a glyph boundary.

2. If 'b' is not a mark, but is a Devanagari letter (i.e., category 'Lo')
AND 'a' is a VIRAMA character, i.e., 'VIRAMA' in unicodedata.name(a),
then the 'ab' junction is not a glyph boundary.

3. In every other situation, the 'ab' junction is a glyph boundary.

I don't really know if this is completely correct, but it performs pretty
well on quite a big Sanskrit text I have. It handles things like

NA + HALANT + DHA + HALANT + YA + AA

and reports it (correctly) as a single glyph.
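
The three rules above can be sketched in a few lines. This is only an illustration of the heuristic as described in this post, assuming Python 3; it is not an implementation of the full Unicode segmentation rules. The function names is_boundary, split_clusters and cluster_byte_lengths are made up for the example, and the byte lengths assume the GUI counts UTF-8 bytes, which the thread does not state.

```python
import unicodedata


def is_boundary(a, b):
    """Apply the three rules to the junction between code points a and b."""
    # Rule 1: any kind of mark attaches to whatever precedes it.
    if unicodedata.category(b).startswith("M"):
        return False
    # Rule 2: a letter that follows a virama continues the conjunct.
    if unicodedata.category(b) == "Lo" and "VIRAMA" in unicodedata.name(a, ""):
        return False
    # Rule 3: everything else is a boundary.
    return True


def split_clusters(text):
    """Group the code points of text into clusters using is_boundary."""
    clusters = []
    for ch in text:
        if clusters and not is_boundary(clusters[-1][-1], ch):
            clusters[-1] += ch  # ch continues the current cluster
        else:
            clusters.append(ch)  # ch starts a new cluster
    return clusters


def cluster_byte_lengths(text, encoding="utf-8"):
    """Bytes per cluster, for a cursor that advances by bytes.

    The UTF-8 default is an assumption, not something the GUI guarantees.
    """
    return [len(c.encode(encoding)) for c in split_clusters(text)]


# NA + HALANT + DHA + HALANT + YA + AA from the post.
conjunct = (
    "\N{DEVANAGARI LETTER NA}\N{DEVANAGARI SIGN VIRAMA}"
    "\N{DEVANAGARI LETTER DHA}\N{DEVANAGARI SIGN VIRAMA}"
    "\N{DEVANAGARI LETTER YA}\N{DEVANAGARI VOWEL SIGN AA}"
)
print(len(split_clusters(conjunct)))
print(cluster_byte_lengths("\u0915\u093e\u0915"))
```

Counting graphemes is then len(split_clusters(text)), and moving the cursor one step forward means advancing by the byte length of the next cluster.
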

| In my application, the GUI seems to handle that part (i.e., combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

| What GUI are you working with?

I am using wxPython on Windows XP. There are two text display widgets,
wxTextCtrl and wxStyledTextCtrl. The former is pretty basic but the
caret positioning is pretty robust. The latter is very fancy, handles
syntax highlighting etc., but has some serious problems with combining
characters.

| Some systems such as the X Server on IndiX seem to dig into the GPOS and
| GSUB tables in the OpenType font. See:
|
| http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html

Thanks for the link!

| Would "Look at
| http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?

It does indeed. Notice my new-found fluency with unicodedata.category?
:)

Thanks,
Srinath
 
