regular expressions and the LOCALE flag

Baz Walter · Aug 3, 2010

the python docs say that re.LOCALE makes certain character classes
"dependent on the current locale".

here's what i currently see on my system:

>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
>>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

it seems wrong to me that re.LOCALE fails to give the "right" result
when the locale encoding is utf-8 - i think it should give the same
result as re.UNICODE.

is this a bug, or does the documentation just need to be made clearer?
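
for comparison, here is a minimal sketch of what i mean by the re.UNICODE result, using the same sample text as in the session above. the names text, raw, unicode_matches and decoded_matches are just made up for illustration, and the second half assumes the input arrives as utf-8 encoded bytes that get decoded before matching. it doesn't explain the re.LOCALE behaviour, it only shows the result i was expecting:

```python
import re

# same sample text as in the session above
text = u'a b c \xe5 \xe6 \xe7'

# with re.UNICODE, \w is classified by unicode character properties,
# so the outcome does not depend on the current locale setting
unicode_matches = re.findall(r'\w', text, re.UNICODE)
print(unicode_matches)

# assumption: the data starts out as utf-8 encoded bytes;
# decode to a unicode string first, then match with re.UNICODE
raw = text.encode('utf-8')
decoded_matches = re.findall(r'\w', raw.decode('utf-8'), re.UNICODE)
print(decoded_matches == unicode_matches)
```

this should pick up all six characters (a, b, c and the three accented letters) whatever locale is set, which is the result i'd hoped to get from re.LOCALE under a utf-8 locale.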
