Counting Unicode graphemes in Python

Discussion in 'Python' started by Srinath Avadhanula, Oct 24, 2003.

  1. Hello,

    I am wondering if there is a way of counting graphemes (or
    glyphs) in python. For example, in the following string:

    u'\u0915\u093e\u0915'
    (
    or equivalently,
    u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
    )

    the first two "code points" represent a single character on the screen.
    In my application, the GUI seems to handle that part (i.e. combining
    characters). However, I need to handle cursor movement myself. The GUI
    can only be told to move forward by a specified number of bytes.
    Therefore, to make cursor keys move over graphemes or glyphs rather than
    code-points, I need to figure out a way to calculate grapheme boundaries
    in python. I searched the web for a long long time and came up with a
    few results, the most relevant of which seems to be:

    http://www.unicode.org/reports/tr29/tr29-2.html

    This page contains rules for calculating grapheme boundaries for Hangul
    characters or something of that sort. However, I did not find any
    information about more general algorithms.

    I also took a look at the unicodedata module in python and that seems to
    have a function called unicodedata.category. This function seems to
    return strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
    have been unable to find a reference for what these strings signify.
    Where should I look for them? (I am hoping for something more specific
    than "Look at www.unicode.org") Is this information relevant at all for
    counting graphemes?

    Thanks,
    Srinath
    Srinath Avadhanula, Oct 24, 2003

  2. "Srinath Avadhanula" <> wrote in message
    news:p...
    | Hello,
    |
    | I am wondering if there is a way of counting graphemes (or
    | glyphs) in python. For example, in the following string:
    |
    | u'\u0915\u093e\u0915'
    | (
    | or equivalently,
    | u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
    | )
    |
    | the first two "code points" represent a single character on the screen.

    My GUESS is that you can't do that unless you *know* exactly which code points
    form ligatures. In Devanagari these are, e.g., the so-called dependent vowels
    in the range 093e - 094c, wherein 093f stands "left of the consonant" when
    rendered. (My knowledge of Indic languages is limited, at best, so there may
    be more to it.)



    | In my application, the GUI seems to handle that part (i.e. combining
    | characters). However, I need to handle cursor movement myself. The GUI
    | can only be told to move forward by a specified number of bytes.

    What GUI are you working with?

    | Therefore, to make cursor keys move over graphemes or glyphs rather than
    | code-points, I need to figure out a way to calculate grapheme boundaries
    | in python. I searched the web for a long long time and came up with a
    | few results, the most relevant of which seems to be:
    |
    | http://www.unicode.org/reports/tr29/tr29-2.html
    |
    | This page contains rules for calculating grapheme boundaries for Hangul
    | characters or something of that sort. However, I did not find any
    | information about more general algorithms.


    Some systems such as the X Server on IndiX seem to dig into the GPOS and
    GSUB tables in the OpenType font. See:

    http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html



    |
    | I also took a look at the unicodedata module in python and that seems to
    | have a function called unicodedata.category. This function seems to
    | return strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
    | have been unable to find a reference for what these strings signify.
    | Where should I look for them? (I am hoping for something more specific
    | than "Look at www.unicode.org")

    Would "Look at
    http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?
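
To see what those codes look like for a concrete string, here is a minimal sketch (the helper name `describe` is made up for illustration, and it is written for Python 3, where every str is Unicode). The first letter of the code is the major class (L letter, M mark, N number, P punctuation, S symbol, Z separator, C other); the second letter refines it, e.g. 'Lo' is "Letter, other" and 'Mn' is "Mark, nonspacing".

```python
import unicodedata


def describe(text):
    """Return (code point, general category, character name) for each code point."""
    return [
        (
            "U+%04X" % ord(ch),
            unicodedata.category(ch),            # two-letter General Category code
            unicodedata.name(ch, "<unnamed>"),   # fallback for unnamed code points
        )
        for ch in text
    ]


if __name__ == "__main__":
    # The string from the original question: KA + VOWEL SIGN AA + KA
    for row in describe("\u0915\u093e\u0915"):
        print(row)
```

Anything whose category starts with 'M' is a combining mark of some kind, which is the part that matters for grapheme counting.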

    HTH,
    Vincent Wehren

    |
    | Thanks,
    | Srinath
    |
    vincent wehren, Oct 24, 2003

  3. On Fri, 24 Oct 2003, vincent wehren wrote:
    > |
    > | the first two "code points" represent a single character on the screen.
    >
    > My GUESS is that you can't do that unless you *know* exactly which code points
    > form ligatures. In Devanagari these are, e.g., the so-called dependent vowels
    > in the range 093e - 094c, wherein 093f stands "left of the consonant" when
    > rendered. (My knowledge of Indic languages is limited, at best, so there may
    > be more to it.)
    >

    After a sleepless night, I finally found out that calculating grapheme
    boundaries for Devanagari is not so hard after all. It seems to work
    reasonably well if I use just three simple rules.

    The rules decide whether, in the code point sequence 'ab', the junction
    between 'a' and 'b' is a glyph boundary:

    1. If 'b' is some kind of mark (i.e. unicodedata.category(b) starts
    with 'M'), then the 'ab' junction is not a glyph boundary.

    2. If 'b' is not a mark, but is a Devanagari letter (i.e. category 'Lo')
    AND 'a' is a VIRAMA character, i.e. 'VIRAMA' in unicodedata.name(a),
    then the 'ab' junction is not a glyph boundary.

    3. In every other situation, the 'ab' junction is a glyph boundary.

    I don't really know if this is completely correct, but it performs pretty
    well on quite a big Sanskrit text I have. It handles things like

    NA + HALANT + DHA + HALANT + YA + AA

    and reports it (correctly) as a single glyph.
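
For anyone who wants to try it, the three rules above can be sketched roughly as follows. This is only an illustration of the heuristic, not a full implementation of the Unicode segmentation rules; the function names `is_glyph_boundary` and `split_glyphs` are made up here, and the code is written for Python 3, where every str is Unicode.

```python
import unicodedata


def is_glyph_boundary(a, b):
    """Apply the three rules to the junction between code points a and b."""
    category_b = unicodedata.category(b)
    # Rule 1: any kind of mark attaches to what came before it.
    if category_b.startswith("M"):
        return False
    # Rule 2: a letter (category 'Lo') right after a VIRAMA continues the conjunct.
    if category_b == "Lo" and "VIRAMA" in unicodedata.name(a, ""):
        return False
    # Rule 3: every other junction is a boundary.
    return True


def split_glyphs(text):
    """Split text into a list of strings, one per glyph according to the rules."""
    glyphs = []
    for ch in text:
        if glyphs and not is_glyph_boundary(glyphs[-1][-1], ch):
            glyphs[-1] += ch      # ch joins the glyph in progress
        else:
            glyphs.append(ch)     # ch starts a new glyph
    return glyphs


if __name__ == "__main__":
    # NA + HALANT + DHA + HALANT + YA + AA
    print(len(split_glyphs("\u0928\u094d\u0927\u094d\u092f\u093e")))  # -> 1
    # KA + AA + KA, the string from the original question
    print(len(split_glyphs("\u0915\u093e\u0915")))                    # -> 2
```

Rule 2 as written only tests for category 'Lo', so it would also join a non-Devanagari 'Lo' letter that happens to follow a virama; for a text in a single script that should not matter.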

    > | In my application, the GUI seems to handle that part (i.e. combining
    > | characters). However, I need to handle cursor movement myself. The GUI
    > | can only be told to move forward by a specified number of bytes.
    >
    > What GUI are you working with?
    >

    I am using wxPython on Windows XP. There are two text display widgets,
    wxTextCtrl and wxStyledTextCtrl. The former is pretty basic, but its
    caret positioning is pretty robust. The latter is very fancy and handles
    syntax highlighting etc., but has some serious problems with combining
    characters.
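
Since the widget has to be told how many bytes to move, the glyph rules need to be turned into a byte count for each cursor step. A rough sketch of that, assuming the widget stores its text as UTF-8 (that is an assumption, so the encoding is a parameter); the names `joins_previous` and `bytes_to_next_glyph` are made up for illustration, and the code is written for Python 3.

```python
import unicodedata


def joins_previous(a, b):
    """True if code point b belongs to the same glyph as the preceding code point a."""
    category_b = unicodedata.category(b)
    return category_b.startswith("M") or (
        category_b == "Lo" and "VIRAMA" in unicodedata.name(a, "")
    )


def bytes_to_next_glyph(text, pos, encoding="utf-8"):
    """Count the encoded bytes from code point index pos to the next glyph boundary.

    pos is assumed to sit on a glyph boundary already.
    """
    end = pos + 1
    # Extend the span while the next code point still joins the current glyph.
    while end < len(text) and joins_previous(text[end - 1], text[end]):
        end += 1
    return len(text[pos:end].encode(encoding))


if __name__ == "__main__":
    text = "\u0915\u093e\u0915"  # KA + AA + KA
    print(bytes_to_next_glyph(text, 0))  # -> 6 (two code points, 3 UTF-8 bytes each)
    print(bytes_to_next_glyph(text, 2))  # -> 3 (the final KA on its own)
```

Moving the cursor backwards would need the mirror image of this: scan left from pos until a boundary is found.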

    > Some systems such as the X Server on IndiX seem to dig into the GPOS and
    > GSUB tables in the OpenType font. See:
    >
    > http://rohini.ncst.ernet.in/indix/doc/HOWTO/Devanagari-HOWTO-5.html
    >

    Thanks for the link!

    > Would "Look at
    > http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values " do?


    It does indeed. Notice my new-found fluency with unicodedata.category?
    :)

    Thanks,
    Srinath
    Srinath Avadhanula, Oct 24, 2003
