Regexps with Unicode-aware character classes?

Stefan Rank

Hi all,

In a Python re pattern, how do I match all Unicode uppercase characters
(in a unicode string or in a UTF-8 byte string)?

I know that there are string.uppercase and string.lowercase, which are
'locale-aware', but I don't think there is an "all locales" locale.

I know that there is the re.U switch, which makes \w match all Unicode word
characters, but there are no subclasses of that ([[:upper:]] or,
preferably, \u).
Or is there a module/extension that provides this?

There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.

<wishful thinking>

re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

re.compile('(?u)[[:upper:]]')

or::

re.compile('(?u)\u')

For the latter two to work on UTF-8 strings, would I have to set the
defaultencoding to utf-8?

</wishful thinking>
 
Guest

Stefan said:
<wishful thinking>

re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (matching
time is linear in the number of alternatives). A more realistic approach is:

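import re, sys

# Python 2: build one character class holding every uppercase code point
uppers = [u'[']
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
    c = unichr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)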

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
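To save startup cost, consider caching the generated pattern string; note
that pickling a compiled expression stores only its pattern and flags, so
it is recompiled on load.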

(Syntax note: this works only because none of the characters that are
special inside an RE character class (], -, ^, \) is an uppercase letter;
otherwise, escaping would be needed.)
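
For readers on Python 3, the same idea might look roughly like the sketch below. It assumes Python 3 (chr() instead of unichr(), and str is already Unicode); the names build_upper_class and UPPER_RE are made up for illustration. re.escape() is applied only as a safeguard, so the class stays valid even if a special character ever qualified as uppercase.

```python
import re
import sys


def build_upper_class():
    """Return a regex character class covering every code point
    for which str.isupper() is true (hypothetical helper name)."""
    chars = []
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        if c.isupper():
            # Escaping is a precaution; no uppercase letter is
            # currently special inside a character class.
            chars.append(re.escape(c))
    return "[" + "".join(chars) + "]"


# Build and compile once at import time; both steps are comparatively costly.
UPPER_RE = re.compile(build_upper_class())

print(UPPER_RE.findall("Ärger mit Ωmega und Straße"))  # ['Ä', 'Ω', 'S']
```

Scanning the whole code point range is the slow part, so it makes sense to do it once per process and reuse the compiled object.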

Stefan said:
For the latter two to work on UTF-8 strings, would I have to set the
defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
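
To illustrate the point about byte strings, here is a small sketch, again assuming Python 3. The sample text and variable names are invented for the example. It decodes UTF-8 bytes once at the boundary and then matches against text; the bytes pattern is shown only for contrast.

```python
import re

# Hypothetical input: UTF-8 encoded bytes, e.g. read from a file or socket.
raw = "Élan Österreich zebra".encode("utf-8")

# A bytes pattern only knows ASCII word characters, so the
# multi-byte sequences for É and Ö are not matched by \w.
byte_words = re.findall(rb"\w+", raw)
print(byte_words)  # [b'lan', b'sterreich', b'zebra']

# Decode once, then work with text only.
text = raw.decode("utf-8")

# In Python 3, str patterns are Unicode-aware by default.
words = re.findall(r"\w+", text)
capitalized = [w for w in words if w[0].isupper()]
print(capitalized)  # ['Élan', 'Österreich']
```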

Regards,
Martin
 
