Christopher Subich
I don't think Python's regular expression module (re) handles combining
marks correctly; it gives inconsistent results for canonically equivalent
forms of the same string:
>>> import re, sys, unicodedata
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w',unicodedata.normalize('NFD',u'\xf1'),re.UNICODE).group(0)
u'n'
>>> re.match('\w',unicodedata.normalize('NFC',u'\xf1'),re.UNICODE).group(0)
u'\xf1'
In the above example, u'\xf1' is n-with-tilde (ñ). NFC happens to be a
no-op, and NFD decomposes it into u'n\u0303', which splits out the tilde
as a combining mark.
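
To make the difference between the two forms concrete, here is a minimal sketch that prints the code points of each form and shows where \w stops. It assumes Python 3 string literals rather than the 2.4 session above, and match_with_marks is a hypothetical helper written for illustration, not something the re module provides. It only extends the match over characters that unicodedata.combining reports with a nonzero combining class, so it is a rough workaround rather than full grapheme handling.

```python
import re
import unicodedata

composed = "\xf1"  # n-with-tilde as a single code point (already NFC)
decomposed = unicodedata.normalize("NFD", composed)  # 'n' followed by U+0303


def code_points(text):
    """Return the code points of text as a list of U+XXXX strings."""
    return ["U+%04X" % ord(ch) for ch in text]


def match_with_marks(text):
    """Hypothetical helper: match one \\w character at the start of text,
    then extend the match over any combining marks that follow it."""
    m = re.match(r"\w", text)
    if m is None:
        return None
    end = m.end()
    # unicodedata.combining() returns a nonzero class for marks like U+0303.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[:end]


print(code_points(composed))    # ['U+00F1']
print(code_points(decomposed))  # ['U+006E', 'U+0303']

# \w alone stops after the base letter in the decomposed form.
print(code_points(re.match(r"\w", decomposed).group(0)))  # ['U+006E']

# The helper keeps the base letter and its combining tilde together.
print(code_points(match_with_marks(decomposed)))  # ['U+006E', 'U+0303']
```

Another option, under the same assumptions, would be to normalize the input to NFC before matching, though that only helps for sequences that have a precomposed form.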
Is this a limitation-by-design, or a bug? If the latter, is it already
known/to-be-fixed?