Unicode regular expressions -- buggy?

Christopher Subich

I don't think the Python regular expression module (re) handles combining
marks correctly; it gives inconsistent results for canonically equivalent
forms of the same string:
>>> import sys, re, unicodedata
>>> sys.version
'2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]'
>>> re.match('\w',unicodedata.normalize('NFD',u'\xf1'),re.UNICODE).group(0)
u'n'
>>> re.match('\w',unicodedata.normalize('NFC',u'\xf1'),re.UNICODE).group(0)
u'\xf1'

In the above example, u'\xf1' is n-with-tilde (ñ). NFC happens to be a
no-op, and NFD decomposes it into u'n\u0303', which splits out the tilde
as a combining mark.
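
For anyone who wants to reproduce this outside the interactive prompt, here is a minimal sketch of the decomposition and of one possible workaround. It assumes Python 3 (where str patterns are Unicode by default, so re.UNICODE is not needed), and first_word_char is a hypothetical helper name of my own, not anything provided by the re module. It treats any character whose Unicode general category starts with "M" as a mark attached to the preceding base character, which is a simplification and not full grapheme-cluster segmentation.

```python
import re
import unicodedata

composed = "\xf1"                                    # U+00F1, n-with-tilde
decomposed = unicodedata.normalize("NFD", composed)  # 'n' followed by U+0303

def first_word_char(text):
    """Hypothetical helper: match one \\w character at the start of text,
    then extend the match over any combining marks that follow it."""
    m = re.match(r"\w", text)
    if m is None:
        return None
    end = m.end()
    # General categories Mn, Mc and Me are the mark categories.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]

# Plain \w sees only the base letter in the decomposed form.
plain_nfd = re.match(r"\w", decomposed).group(0)
plain_nfc = re.match(r"\w", composed).group(0)

# The helper returns base letter plus tilde, which recomposes to U+00F1.
whole_nfd = first_word_char(decomposed)
recomposed = unicodedata.normalize("NFC", whole_nfd)
```

If only the comparison matters, simply normalizing the input to NFC before matching may be enough, though that will not help for sequences that have no precomposed form.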

Is this a limitation-by-design, or a bug? If the latter, is it already
known/to-be-fixed?
 
