[regex] case-splitting strings in unicode

  • Thread starter John Perks and Sarah Mount

John Perks and Sarah Mount

I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?

Thanks

John
 

Micah Elliott

John said:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to
denote "uppercase"?

Not sure what your output should look like but something like this could
work:
>>> import re
>>> re.sub(r'([A-Z])', r' \1', 'theFirstTest theSecondTest')
'the First Test the Second Test'

This can be adapted for multiline input, etc., but maybe '[A-Z]' is
sufficiently general. The re module does have an understanding of
unicode (but I don't, sorry); you could add (?u) to make it unicode aware.
For programming language identifiers I wouldn't think that unicode
should be an issue. Sorry I'm no help with unicode specifics.
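
As a rough illustration of the substitution above, here is a minimal sketch that wraps it in a helper returning a list of words. It is written for Python 3, which is an assumption (the thread predates it), and the name split_ascii_camel is hypothetical. It also shows where the ASCII-only class stops working: an explicit range such as [A-Z] only covers the ASCII letters, so a non-ASCII capital is not treated as a word boundary.

```python
import re

def split_ascii_camel(identifier):
    """Split an identifier before each ASCII uppercase letter.

    Hypothetical helper for illustration; it reuses the re.sub call
    from the post and then splits on the inserted spaces.
    """
    spaced = re.sub(r'([A-Z])', r' \1', identifier)
    return spaced.split()

# ASCII capitals are found as expected.
print(split_ascii_camel('casedLikeThis'))

# U+00C9 (capital E with acute) is outside [A-Z], so nothing is split.
print(split_ascii_camel('\u00e9lan\u00c9cole'))
```

The first call gives ['cased', 'Like', 'This']; the second returns the identifier unsplit, which is the gap the original question is about.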

Some useful links:

http://www.python.org/doc/2.4.2/lib/module-re.html
http://www.amk.ca/python/howto/regex/regex.html
 

Martin v. Löwis

John said:
I have to split some identifiers that are casedLikeThis into their
component words. In this instance I can safely use [A-Z] to represent
uppercase, but what pattern should I use if I wanted it to work more
generally? I can envisage walking the string testing the
unicodedata.category of each char, but is there a regex'y way to denote
"uppercase"?

In this form, it is currently not implemented, although it should be
(written as [[:upper:]], I believe); contributions are welcome (make
sure you read the Unicode consortium's guidelines on regular expressions
before attempting to implement it).

Until then, the "best" way is to use a regular character class,
precomputed or computed at runtime.

import sys
# Every code point whose unicode string reports isupper() (Python 2: unichr).
uni_upper = [unichr(i) for i in range(sys.maxunicode + 1)
             if unichr(i).isupper()]
uni_re = u"[" + u"".join(uni_upper) + u"]"

On my machine, this takes approximately one second to compute,
which may or may not be too much as a startup cost. To speed
this up, you could dump the resulting uni_re into a Python
source file.
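
To show how the precomputed class might be put to use on the original problem, here is a minimal sketch adapted to Python 3. That adaptation is an assumption: chr() stands in for unichr(), and the names upper_class, upper_re and split_cased are hypothetical. It combines the character class built above with the substitution idea from earlier in the thread.

```python
import re
import sys

# Python 3 adaptation: chr() replaces unichr(). The "+ 1" makes sure
# the last code point is included in the scan.
upper_chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]

# No escaping is applied here, on the assumption that none of the
# characters reported by isupper() is special inside [...].
upper_class = "[" + "".join(upper_chars) + "]"
upper_re = re.compile("(" + upper_class + ")")

def split_cased(identifier):
    """Split an identifier before each uppercase character (hypothetical helper)."""
    return upper_re.sub(r" \1", identifier).split()

print(split_cased("casedLikeThis"))
print(split_cased("\u00e9lan\u00c9cole\u03a9mega"))
```

Building upper_chars is the slow step, so in practice it would be computed once at import time, or the resulting string written out to a source file as suggested above.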

Regards,
Martin
 
