[regex] case-splitting strings in unicode

John Perks and Sarah Mount · Oct 8, 2005

I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?

Thanks

John

Micah Elliott · Oct 8, 2005

John said:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to
denote "uppercase"?

Not sure what your output should look like but something like this could
work:

import re
re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')

'the First Test the Second Test'

This can be adapted for multiline input, etc., but maybe '[A-Z]' is
sufficiently general. The re module does have some understanding of
Unicode (but I don't, sorry); you could add (?u) to make classes such as
\w Unicode-aware, although that does not change what [A-Z] matches.
For programming language identifiers I wouldn't think that Unicode
should be an issue. Sorry I'm no help with Unicode specifics.

Some useful links:

http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html
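
For reference, the string-walking approach mentioned in the question could look roughly like the sketch below. It is written for Python 3 and is only one possible reading of the idea: the helper name split_cased is hypothetical, and it treats every character in Unicode category Lu as the start of a new word.

```python
import unicodedata

def split_cased(identifier):
    """Split a casedLikeThis identifier before each uppercase letter (category Lu)."""
    words = []
    current = ""
    for ch in identifier:
        # Start a new word when an uppercase letter follows existing text.
        if unicodedata.category(ch) == "Lu" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

print(split_cased("casedLikeThis"))           # ['cased', 'Like', 'This']
print(split_cased("\u00fcber\u00c4nderung"))  # ['über', 'Änderung']
```

Note that a run of capitals such as "HTTPServer" is split before every capital, so acronyms would need extra handling.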

Martin v. Löwis · Oct 9, 2005

John said:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?

In this form, it is currently not implemented, although it should be
(written as [[:upper:]], I believe); contributions are welcome (make
sure you read the Unicode consortium's guidelines on regular expressions
before attempting to implement it).

Until then, the "best" way is to use a regular character class,
precomputed or computed at runtime.

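import sys
# Collect every code point that Python considers uppercase (Python 2: unichr).
uni_upper = [unichr(i) for i in range(sys.maxunicode + 1)
             if unichr(i).isupper()]
# Join them into one character class; no uppercase character needs escaping.
uni_re = u"[" + u"".join(uni_upper) + u"]"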

On my machine, this takes approximately one second to compute,
which may or may not be too much as a startup cost. To speed
this up, you could dump the resulting uni_re into a Python
source file.
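
A minimal sketch of that caching idea, assuming Python 3 (chr instead of unichr). The file name uni_upper_cache.py is hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```python
import importlib.util
import os
import sys
import tempfile

# Build the character class once (the slow step).
uni_upper = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
uni_re = "[" + "".join(uni_upper) + "]"

# Dump it as an ASCII-only Python source file; ascii() escapes every
# non-ASCII character, so the file is safe regardless of source encoding.
cache_path = os.path.join(tempfile.mkdtemp(), "uni_upper_cache.py")
with open(cache_path, "w", encoding="ascii") as f:
    f.write("uni_re = " + ascii(uni_re) + "\n")

# Later runs can load the precomputed value instead of recomputing it.
spec = importlib.util.spec_from_file_location("uni_upper_cache", cache_path)
cache = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cache)
print(cache.uni_re == uni_re)  # True
```

In a real project the generated file would sit next to the other modules and simply be imported.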

Regards,
Martin
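
Putting the pieces of this thread together, a computed character class can be dropped into the earlier re.sub call in place of [A-Z]. This is a sketch for Python 3, not a tested recipe; the names upper_pattern and space_before_upper are made up for the example.

```python
import re
import sys

# Character class of every code point for which str.isupper() is true.
uni_upper = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
uni_re = "[" + "".join(uni_upper) + "]"

# Capture each uppercase character so it can be re-emitted after a space.
upper_pattern = re.compile("(" + uni_re + ")")

def space_before_upper(text):
    """Insert a space before every uppercase character in text."""
    return upper_pattern.sub(r" \1", text)

print(space_before_upper("theFirstTest theSecondTest"))
# the First Test the Second Test
print(space_before_upper("\u00fcber\u00c4nderung"))
# über Änderung
```

Compiling the pattern once and reusing it keeps the per-call cost low; only building uni_re is slow.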
