UNICODE mode for regular expressions - time to change the default?

John Nagle · Apr 5, 2007

Regular expressions are compiled in ASCII mode unless
Unicode mode is specified to "re.compile". The difference is that regular
expressions in ASCII mode don't recognize things like
Unicode whitespace, even when applied to Unicode strings.
For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
This can create some strange bugs.
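
A minimal sketch of the difference, assuming Python 3 rather than the Python 2 discussed in this thread. In Python 3 the default for str patterns is Unicode-aware and re.ASCII opts out; in Python 2 the roles are reversed and re.UNICODE opts in. The sample string "10\u00a0kg" is only an illustration.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

# Python 3 str patterns use Unicode character classes by default.
unicode_ws = re.compile(r"\s")
# re.ASCII restricts \s, \d and \w to the ASCII range.
ascii_ws = re.compile(r"\s", re.ASCII)

unicode_hit = unicode_ws.match(NBSP) is not None
ascii_hit = ascii_ws.match(NBSP) is not None
print(unicode_hit, ascii_hit)  # True False

# The kind of bug this causes: a split on whitespace that silently does nothing.
sample = "10" + NBSP + "kg"
split_unicode = re.split(r"\s+", sample)
split_ascii = re.split(r"\s+", sample, flags=re.ASCII)
print(split_unicode)  # ['10', 'kg']
print(split_ascii)    # ['10\xa0kg']
```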

Is the current default good? Or is it time to compile all regular
expressions in Unicode mode by default? It shouldn't hurt processing of
ASCII strings to do that. The current setup is really a legacy of when
most things in Python didn't work in Unicode mode, and you didn't want to
introduce Unicode unnecessarily. It's another one of those obscure
Unicode "gotchas" that really should go away.

John Nagle

John Machin · Apr 5, 2007

> Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile". The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.

AFAICT, the default is that \s, \d, etc are interpreted according to
the current locale's properties. Specifying re.U changes that to use
the unicodedata properties instead. There is no such thing as "ASCII
mode".

> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.

> Is the current default good? Or is it time to compile all regular
> expressions in Unicode mode by default? It shouldn't hurt processing of
> ASCII strings to do that.

Believe it or not: there are folk out there who have data which is
encoded in 8-bit encodings which are not ASCII and for which
"\xA0".decode('whatever') does not produce u"\xA0" ... it could for
example be a box-drawing character or a letter.
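
As a rough illustration of that point, here is a sketch in Python 3 syntax (bytes.decode instead of the Python 2 str.decode used above). The three codecs are assumptions picked for the example, not encodings named in the thread.

```python
import unicodedata

raw = b"\xa0"  # the same single byte, read under different 8-bit encodings

# Decode the byte under each codec and keep the resulting character.
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp437", "koi8_r")}

for codec, ch in decoded.items():
    # Show the code point, its Unicode name, and whether it counts as whitespace.
    print(codec, hex(ord(ch)), unicodedata.name(ch), ch.isspace())
```

Only the latin-1 result is NO-BREAK SPACE; the other two come out as a letter and a box-drawing character, so a Unicode-aware \s would treat the "same" byte differently depending on how it was decoded.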

Problem number 2: It's probable that users in locale X wouldn't want a
match to succeed on a character that is regarded as a digit (say) in
distant locale Y. If such a character turns up in a data file in locale X,
it more likely indicates that binary data was read as Unicode text.
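
A small sketch of that situation, again assuming Python 3, where re.ASCII stands in for the non-Unicode default of the time. The Arabic-Indic digits are just one hypothetical example of "digits from a distant locale".

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

# Unicode-aware \d accepts any character in Unicode category Nd.
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None
# With re.ASCII, \d is limited to 0-9.
ascii_digits = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None

print(unicode_digits, ascii_digits)  # True False
```

A validator written with \d in Unicode mode would accept this string, which may or may not be what a user in locale X intends.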

> The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily. It's another one of those obscure
> Unicode "gotchas" that really should go away.

It's the ASCII-centric mindset that creates gotchas and really should
go away.

HTH,
John

Steve Holden · Apr 5, 2007

John Nagle said:
> Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile". The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.
> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.
>
> Is the current default good? Or is it time to compile all regular
> expressions in Unicode mode by default? It shouldn't hurt processing of
> ASCII strings to do that. The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily. It's another one of those obscure
> Unicode "gotchas" that really should go away.
>
> John Nagle

Personally I'd leave it to go away with Python 3.0, when all strings
will be Unicode.

regards
Steve
