unicodedata.normalize (NFD - NFC) inconsistency

Discussion in 'Python' started by Christos TZOTZIOY Georgiou, Nov 8, 2004.

  1. I found at least one case where decombining and recombining a unicode
    character does not result in the same character (see at end).

    I have no extensive knowledge about Unicode, yet I believe that this
    must be a problem of the Unicode 3.2 specification and not Python's.
    However, I haven't found out how the decomp_data (in unicodedata_db.h)
    is built, and neither did I find much more info about the specifics of
    Unicode 3.2. I thought about posting here; anyone more knowing could
    give it a look.

    If we find out that it's a problem with Python, I'll open a bug report
    (and volunteer work).

    *** Example ***

    >>> import unicodedata as ud
    >>> def report(utext):
    ...     for uchar in utext:
    ...         print ord(uchar), ud.name(uchar)
    ...
    >>> u1=u'\N{greek small letter alpha with oxia}'
    >>> report(u1)
    8049 GREEK SMALL LETTER ALPHA WITH OXIA
    >>> u2=ud.normalize('NFD', u1)
    >>> report(u2)
    945 GREEK SMALL LETTER ALPHA
    769 COMBINING ACUTE ACCENT
    >>> u3=ud.normalize('NFC', u2)
    >>> report(u3)
    940 GREEK SMALL LETTER ALPHA WITH TONOS

    *** End of Example ***

    I can understand this confusion: if, as I have found, there is no
    COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
    then decomposing has to use the 'oxeia' (acute) accent instead...
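
    The claim about the missing combining tonos can be checked against the
    character database that ships with Python. This is only a sketch, in
    Python 3 syntax; the helper lookup_or_none and the list of candidate
    names are illustrative assumptions, not part of the original session.

```python
import unicodedata as ud

def lookup_or_none(name):
    """Return the character with this Unicode name, or None if no such name exists."""
    try:
        return ud.lookup(name)
    except KeyError:
        return None

# Two guessed names for a dedicated combining tonos, plus the accent
# that NFD actually produces.
candidates = [
    "COMBINING GREEK TONOS",
    "COMBINING TONOS ACCENT",
    "COMBINING ACUTE ACCENT",
]

found = {name: lookup_or_none(name) for name in candidates}

for name, char in found.items():
    code = "no such character" if char is None else "U+%04X" % ord(char)
    print(name, "->", code)
```

    Only the last name should resolve, to U+0301, which is why both the
    oxia and the tonos forms end up sharing one combining mark.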
    --
    TZOTZIOY, I speak England very best,
    "Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
     
    Christos TZOTZIOY Georgiou, Nov 8, 2004
    #1

  2. Christos TZOTZIOY Georgiou wrote:
    > I have no extensive knowledge about Unicode, yet I believe that this
    > must be a problem of the Unicode 3.2 specification and not Python's.


    Without checking the details: very well possible. Could this be
    an instance of python.org/sf/1054943 ?

    Regards,
    Martin
     
    Martin v. Löwis, Nov 8, 2004
    #2

  3. Christos TZOTZIOY Georgiou wrote:
    > I found at least one case where decombining and recombining a unicode
    > character does not result in the same character (see at end).
    >
    > I have no extensive knowledge about Unicode, yet I believe that this
    > must be a problem of the Unicode 3.2 specification and not Python's.


    I've been spending some time lately writing a normalizer (in PHP of all
    things -- yeesh!), and yes Unicode is a scary world. :) Although it may
    seem counterintuitive, it is in fact perfectly legitimate for a
    character not to be its own canonical composition.

    > >>> u1=u'\N{greek small letter alpha with oxia}'
    > >>> report(u1)
    > 8049 GREEK SMALL LETTER ALPHA WITH OXIA


    This character has a "singleton decomposition": its canonical
    decomposition is the single character GREEK SMALL LETTER ALPHA WITH
    TONOS, which further decomposes into GREEK SMALL LETTER ALPHA and a
    COMBINING ACUTE ACCENT.

    It is by definition not normalized, so when you normalize it to form C
    it will turn into GREEK SMALL LETTER ALPHA WITH TONOS; there is no way
    to get "back" to the original character in a normalized string. For some
    more info see:
    http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table
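
    The two-step decomposition can be inspected directly with
    unicodedata.decomposition, which returns the raw mapping for one
    character. A minimal sketch, in Python 3 syntax rather than the
    Python 2 of the quoted session; the variable names are illustrative:

```python
import unicodedata as ud

oxia = "\N{GREEK SMALL LETTER ALPHA WITH OXIA}"    # U+1F71
tonos = "\N{GREEK SMALL LETTER ALPHA WITH TONOS}"  # U+03AC

# Raw canonical decomposition mappings, as space-separated hex code points.
oxia_mapping = ud.decomposition(oxia)    # one code point: a singleton
tonos_mapping = ud.decomposition(tonos)  # base letter plus combining accent

print(oxia_mapping)   # 03AC
print(tonos_mapping)  # 03B1 0301

# NFC never composes to a character with a singleton decomposition,
# so the oxia form comes out as the tonos form.
nfc = ud.normalize("NFC", oxia)
print(ud.name(nfc))   # GREEK SMALL LETTER ALPHA WITH TONOS
```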

    > >>> u2=ud.normalize('NFD', u1)
    > >>> report(u2)
    > 945 GREEK SMALL LETTER ALPHA
    > 769 COMBINING ACUTE ACCENT
    > >>> u3=ud.normalize('NFC', u2)
    > >>> report(u3)
    > 940 GREEK SMALL LETTER ALPHA WITH TONOS


    You should get this same result directly for ud.normalize('NFC', u1).
    Converting directly to NFC should always give the same result as
    converting to NFD and then NFC. Either will give you back the string you
    started with if and only if it's already normalized to form C.
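
    Both properties are easy to spot-check. The sketch below uses Python 3
    syntax; the is_nfc helper and the handful of sample strings are
    assumptions chosen for illustration, not an exhaustive test:

```python
import unicodedata as ud

def is_nfc(text):
    """True if the text is already in Normalization Form C."""
    return ud.normalize("NFC", text) == text

samples = [
    "\u1f71",   # GREEK SMALL LETTER ALPHA WITH OXIA (not NFC)
    "\u03ac",   # GREEK SMALL LETTER ALPHA WITH TONOS (NFC)
    "e\u0301",  # e + COMBINING ACUTE ACCENT (not NFC)
    "\u212b",   # ANGSTROM SIGN, another singleton (not NFC)
    "plain",    # ASCII text is unchanged by every normalization form
]

results = []
for s in samples:
    direct = ud.normalize("NFC", s)
    via_nfd = ud.normalize("NFC", ud.normalize("NFD", s))
    results.append((s, direct, via_nfd))
    # The round trip restores the input exactly when it was NFC to begin with.
    print(ascii(s), "direct == via NFD:", direct == via_nfd,
          "| round trip == input:", via_nfd == s,
          "| already NFC:", is_nfc(s))
```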

    -- brion vibber (brion @ pobox.com)
     
    Brion Vibber, Nov 9, 2004
    #3
  4. On Mon, 08 Nov 2004 17:40:47 -0800, rumours say that Brion Vibber
    might have written:

    >I've been spending some time lately writing a normalizer (in PHP of all
    >things -- yeesh!), and yes Unicode is a scary world. :)


    ....

    >http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table


    Thanks for the pointer, very informative, explaining why the observed
    behaviour is well inside the definition of Unicode. Thanks go to Martin
    also for taking a look at this.
    --
    TZOTZIOY, I speak England very best,
    "Tssss!" --Brad Pitt as Achilles in unprecedented Ancient Greek
     
    Christos TZOTZIOY Georgiou, Nov 10, 2004
    #4
