Correct handling of case in unicode and regexps

Devin Jeanpierre · Feb 23, 2013

Hi folks,

I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()
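
To make the difference concrete, here is a minimal sketch of the two comparisons. The helper name same_ignoring_case is made up for illustration; str.casefold() itself is standard in Python 3.3 and later.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison using full Unicode case folding."""
    return a.casefold() == b.casefold()


a, b = "ss", "ß"

# 'ß' is already lowercase, so lower() leaves it alone and the strings differ.
lower_equal = a.lower() == b.lower()

# casefold() expands 'ß' to 'ss', so the folded strings agree.
folded_equal = same_ignoring_case(a, b)

print(lower_equal, folded_equal)
```

Note that folding can change the length of a string, which matters as soon as positions in the string are involved.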

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

  >>> import re
  >>> a = 'ss'
  >>> b = 'ß'
  >>> a.casefold() == b.casefold()
  True
  >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
  >>> # oh dear!

In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?

I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?

-- Devin

jmfauth · Feb 24, 2013

Devin Jeanpierre wrote:

Hi folks,

I'm pretty unsure of myself when it comes to unicode. As I understand
it, you're generally supposed to compare things in a case insensitive
manner by case folding, right? So instead of a.lower() == b.lower()
(the ASCII way), you do a.casefold() == b.casefold()

However, I'm struggling to figure out how regular expressions should
treat case. Python's re module doesn't "work properly" to my
understanding, because:

  >>> import re
  >>> a = 'ss'
  >>> b = 'ß'
  >>> a.casefold() == b.casefold()
  True
  >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
  >>> # oh dear!

In addition, it seems improbable that this ever _could_ work. Because
if it did work like that, then what would the value be of
re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?

I'd really like to hear the thoughts of people more experienced with
unicode. What is the ideal correct behavior here? Or do I
misunderstand things?

-----

I'm just wondering whether there is a real issue here. After all,
this is only a question of conventions. Unicode has some
conventions, and re modules may (indeed have to) adopt some
conventions of their own.

It seems to me that the safest way is to preprocess the text
that is to be examined.
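
One possible form of such preprocessing, sketched under my own assumptions: casefold the text before searching, and remember which original character each folded character came from. The names casefold_with_origins and search_caseless are hypothetical, and the sketch only handles literal search strings, not general patterns. A match that would start or end inside the expansion of a single character (the "0.5" case) is skipped rather than reported.

```python
import re


def casefold_with_origins(text):
    """Casefold text one character at a time.

    Returns the folded string and a list that gives, for each folded
    character, the index of the original character it came from.
    """
    folded = []
    origins = []
    for index, char in enumerate(text):
        for folded_char in char.casefold():
            folded.append(folded_char)
            origins.append(index)
    return "".join(folded), origins


def search_caseless(needle, text):
    """Find a literal needle caselessly; return (start, end) in text, or None.

    Candidate matches that split the expansion of one original
    character are rejected and the search continues further right.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    folded, origins = casefold_with_origins(text)
    pattern = re.compile(re.escape(needle.casefold()))
    pos = 0
    while True:
        match = pattern.search(folded, pos)
        if match is None:
            return None
        start, end = match.start(), match.end()
        # A boundary is clean if it does not fall between two folded
        # characters that came from the same original character.
        starts_clean = start == 0 or origins[start] != origins[start - 1]
        ends_clean = end == len(folded) or origins[end] != origins[end - 1]
        if starts_clean and ends_clean:
            return origins[start], origins[end - 1] + 1
        pos = start + 1


print(search_caseless("ss", "Straße"))   # (4, 5): the span covering 'ß'
print(search_caseless("s", "ß"))         # None: 's' is only half of the folded 'ß'
```

Rejecting the half-character match is just one convention among several; another would be to widen the match to the whole original character.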

Proposed case study:
How should ss/ß/SS/ẞ be interpreted?

'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
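
For what it is worth, here is a small sketch of how Python's own conventions treat these four spellings. It only uses str.lower() and str.casefold(); whether collapsing all four into one form is the desired interpretation is exactly the open question.

```python
spellings = [
    "Richard-Strauss-Straße",
    "Richard-Strauss-Strasse",
    "RICHARD-STRAUSS-STRASSE",
    "RICHARD-STRAUSS-STRA\u1e9eE",  # U+1E9E, LATIN CAPITAL LETTER SHARP S
]

# lower() keeps 'ß' and maps the capital sharp s to 'ß',
# so two distinct lowercase forms remain.
lowered = {s.lower() for s in spellings}

# casefold() expands both sharp s letters to 'ss',
# so all four spellings collapse into a single form.
folded = {s.casefold() for s in spellings}

print(len(lowered), len(folded))
```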

The situation is more or less the same with sorting.
Unicode cannot do everything, and it may be necessary to
preprocess the input.

For example, here is a function I once wrote for fun. It sorts
French words (without unicodedata or locale).

  >>> import libfrancais
  >>> z = ['oeuf', 'œuf', 'od', 'of']
  >>> zo = libfrancais.sortedfr(z)
  >>> zo
  ['od', 'oeuf', 'œuf', 'of']
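
The libfrancais module is not shown in this thread, so the following is not its implementation. It is a rough sketch of the same preprocessing idea, with two swapped-in assumptions: it does use unicodedata (to strip accents), and it ignores the finer rules of French collation. The names LIGATURES, french_key and sorted_fr are made up for illustration.

```python
import unicodedata

# Ligatures are expanded before comparison so that 'œuf' sorts next to 'oeuf'.
LIGATURES = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def french_key(word):
    """Build a sort key: ligatures expanded, accents removed, case folded.

    The original word is kept as a tie-breaker so that words with the
    same base form still come out in a stable, predictable order.
    """
    expanded = "".join(LIGATURES.get(char, char) for char in word)
    decomposed = unicodedata.normalize("NFD", expanded)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)


def sorted_fr(words):
    """Return the words sorted by the simplified French key above."""
    return sorted(words, key=french_key)


print(sorted_fr(["oeuf", "œuf", "od", "of"]))
```

On the list above this gives the same order as the session shown earlier, but that is the only case checked here.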

jmf
