regular expressions and the LOCALE flag

Baz Walter

the python docs say that re.LOCALE makes certain character classes
"dependent on the current locale".

here's what i currently see on my system:
>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
>>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

it seems wrong to me that re.LOCALE fails to give the "right" result
when the locale encoding is UTF-8 - i think it should give the same
result as re.UNICODE.
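
for comparison, a minimal sketch of the locale-independent behaviour of
re.UNICODE. it assumes python 3 (re.ASCII does not exist in python 2),
and the variable names are only illustrative:

```python
import re

text = u'a b c \xe5 \xe6 \xe7'

# re.UNICODE classifies characters using the Unicode database,
# so the result does not depend on the current locale.
unicode_words = re.findall(r'\w', text, re.UNICODE)

# re.ASCII (an assumption: Python 3 only) restricts \w to [a-zA-Z0-9_],
# which matches what the UTF-8 locale session above returned.
ascii_words = re.findall(r'\w', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

the first list contains all six letters; the second contains only a, b and c.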

is this a bug, or does the documentation just need to be made clearer?
 
