unicodedata . normalize (NFD - NFC) inconsistency

Christos TZOTZIOY Georgiou · Nov 8, 2004

I found at least one case where decombining and recombining a unicode
character does not result in the same character (see at end).

I have no extensive knowledge about Unicode, yet I believe that this
must be a problem of the Unicode 3.2 specification and not Python's.
However, I haven't found out how the decomp_data (in unicodedata_db.h)
is built, and neither did I find much more info about the specifics of
Unicode 3.2. I thought I would post here so that someone more
knowledgeable could take a look.

If we find out that it's a problem with Python, I'll open a bug report
(and volunteer work).

*** Example ***
for uchar in utext:
    print ord(uchar), ud.name(uchar)

945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT
*** End of Example ***

I can understand this confusion: as far as I have found, there is no
COMBINING GREEK TONOS or COMBINING TONOS ACCENT in the Unicode table,
so when decombining, one has to use the 'oxeia' (acute) accent...
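
For anyone wanting to reproduce the round trip, here is a minimal sketch in Python 3 syntax (the session above is Python 2). It assumes the starting character is U+1F71, GREEK SMALL LETTER ALPHA WITH OXIA. The helpers describe and name_exists are hypothetical names used only for illustration; the library calls are unicodedata.normalize, unicodedata.name and unicodedata.lookup.

```python
import unicodedata as ud


def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(ord(ch), ud.name(ch)) for ch in text]


def name_exists(name):
    """Return True if the Unicode database has a character with this name."""
    try:
        ud.lookup(name)
    except KeyError:
        return False
    return True


# Assumption: the character under test is U+1F71 (decimal 8049).
original = "\u1f71"
decomposed = ud.normalize("NFD", original)    # "decombining"
recomposed = ud.normalize("NFC", decomposed)  # "recombining"

for label, text in (("original", original),
                    ("NFD", decomposed),
                    ("NFC of NFD", recomposed)):
    for codepoint, name in describe(text):
        print(label, codepoint, name)

print("round trip preserved:", recomposed == original)
print("COMBINING GREEK TONOS exists:", name_exists("COMBINING GREEK TONOS"))
```

The last two lines check the two observations separately: whether recombining gives back the starting character, and whether a dedicated combining tonos exists under that name.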

Martin v. Löwis · Nov 8, 2004

Christos said:
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.

Without checking the details: very well possible. Could this be
an instance of python.org/sf/1054943 ?

Regards,
Martin

Brion Vibber · Nov 9, 2004

Christos said:
> I found at least one case where decombining and recombining a unicode
> character does not result in the same character (see at end).
>
> I have no extensive knowledge about Unicode, yet I believe that this
> must be a problem of the Unicode 3.2 specification and not Python's.

I've been spending some time lately writing a normalizer (in PHP of all
things -- yeesh!), and yes Unicode is a scary world.

Although it may seem counterintuitive, it is in fact perfectly
legitimate for a character not to be its own canonical composition.

8049 GREEK SMALL LETTER ALPHA WITH OXIA

This character is a "singleton decomposition". It decomposes into GREEK
SMALL LETTER ALPHA WITH TONOS, which further decomposes into GREEK SMALL
LETTER ALPHA and a COMBINING ACUTE ACCENT.
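
The chain can be inspected one level at a time with unicodedata.decomposition, which returns the raw mapping from the character database. This is only a rough sketch in Python 3 syntax: decompose_steps is a hypothetical helper, and it is not a full NFD implementation (it skips canonical reordering and Hangul, which do not matter for this character).

```python
import unicodedata as ud


def decompose_steps(text):
    """Apply one level of canonical decomposition repeatedly.

    Returns the list of intermediate strings, starting with the input and
    ending with the first string that no longer changes.
    """
    steps = [text]
    while True:
        pieces = []
        for ch in steps[-1]:
            mapping = ud.decomposition(ch)
            # Compatibility mappings start with a tag such as "<compat>";
            # only canonical mappings (no tag) are followed here.
            if mapping and not mapping.startswith("<"):
                pieces.extend(chr(int(cp, 16)) for cp in mapping.split())
            else:
                pieces.append(ch)
        expanded = "".join(pieces)
        if expanded == steps[-1]:
            return steps
        steps.append(expanded)


for step in decompose_steps("\u1f71"):
    print(" + ".join("%d %s" % (ord(ch), ud.name(ch)) for ch in step))
```

The first step has a single code point in its mapping, which is what makes it a singleton; the second step is an ordinary base-plus-accent pair.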

It is by definition not normalized, so when you normalize it to form C
it will turn into GREEK SMALL LETTER ALPHA WITH TONOS; there is no way
to get "back" to the original character in a normalized string. For some
more info see:
http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table

945 GREEK SMALL LETTER ALPHA
769 COMBINING ACUTE ACCENT

940 GREEK SMALL LETTER ALPHA WITH TONOS

You should get this same result directly for ud.normalize('NFC', u1).
Converting directly to NFC should always give the same result as
converting to NFD and then NFC. Either will give you back the string you
started with if and only if it's already normalized to form C.
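
Both claims are easy to spot-check. The sketch below (Python 3 syntax) runs over the Greek Extended block, U+1F00 to U+1FFF, which is an arbitrary sample picked for illustration; the helper names nfc, nfc_via_nfd and survives_round_trip are hypothetical.

```python
import unicodedata as ud


def nfc(text):
    """Normalize directly to form C."""
    return ud.normalize("NFC", text)


def nfc_via_nfd(text):
    """Normalize to form D first, then to form C."""
    return ud.normalize("NFC", ud.normalize("NFD", text))


def survives_round_trip(text):
    """True if NFD followed by NFC gives back the same string."""
    return nfc_via_nfd(text) == text


# Assumption: the Greek Extended block is a representative enough sample.
greek_extended = [chr(cp) for cp in range(0x1F00, 0x2000)]

# Direct NFC and NFD-then-NFC should agree for every code point.
assert all(nfc(ch) == nfc_via_nfd(ch) for ch in greek_extended)

# Characters that are not their own NFC form do not come back.
changed = [ch for ch in greek_extended if not survives_round_trip(ch)]
for ch in changed:
    targets = " ".join("U+%04X" % ord(c) for c in nfc(ch))
    print("U+%04X -> %s" % (ord(ch), targets))
```

Every character it prints is one that is not already in form C, so the round trip cannot return it.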

-- brion vibber (brion @ pobox.com)

Christos TZOTZIOY Georgiou · Nov 10, 2004

Brion said:
> I've been spending some time lately writing a normalizer (in PHP of all
> things -- yeesh!), and yes Unicode is a scary world.
> ....
>
> http://www.unicode.org/unicode/reports/tr15/#Primary_Exclusion_List_Table

Thanks for the pointer, very informative, explaining why the observed
behaviour is well inside the definition of Unicode. Thanks go to Martin
also for taking a look at this.
