UNICODE mode for regular expressions - time to change the default?

Discussion in 'Python' started by John Nagle, Apr 5, 2007.

  1. John Nagle

    John Nagle Guest

    Regular expressions are compiled in ASCII mode unless
    Unicode mode is specified to "re.compile". The difference is that regular
    expressions in ASCII mode don't recognize things like
    Unicode whitespace, even when applied to Unicode strings.
    For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
    a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
    This can create some strange bugs.
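
    A minimal sketch of the kind of bug described above. It assumes Python 3, where the re.ASCII flag exists and makes the ASCII-only behaviour explicit (that flag was not available in the Python 2 releases current when this thread was written). The sample text and variable names are made up for illustration.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

# ASCII-only matching: \s covers just [ \t\n\r\f\v], so NBSP is not whitespace.
ascii_ws = re.compile(r"\s", re.ASCII)
# Unicode matching: \s follows the Unicode whitespace property.
unicode_ws = re.compile(r"\s", re.UNICODE)

text = "price:" + NBSP + "42"  # hypothetical input scraped from HTML

ascii_split = ascii_ws.split(text)
unicode_split = unicode_ws.split(text)

print(ascii_split)    # the NBSP stays embedded in a single field
print(unicode_split)  # the NBSP acts as a separator
```

    The two patterns are textually identical; only the flag decides whether the no-break space counts as a separator.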

    Is the current default good? Or is it time to compile all regular
    expressions in Unicode mode by default? It shouldn't hurt processing of
    ASCII strings to do that. The current setup is really a legacy of when
    most things in Python didn't work in Unicode mode, and you didn't want to
    introduce Unicode unnecessarily. It's another one of those obscure
    Unicode "gotchas" that really should go away.

    John Nagle
     
    John Nagle, Apr 5, 2007
    #1

  2. John Machin

    John Machin Guest

    On Apr 6, 5:50 am, John Nagle wrote:
    > Regular expressions are compiled in ASCII mode unless
    > Unicode mode is specified to "re.compile". The difference is that regular
    > expressions in ASCII mode don't recognize things like
    > Unicode whitespace, even when applied to Unicode strings.


    AFAICT, the default is that \s, \d, etc are interpreted according to
    the current locale's properties. Specifying re.U changes that to use
    the unicodedata properties instead. There is no such thing as "ASCII
    mode".

    > For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
    > a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
    > This can create some strange bugs.
    >
    > Is the current default good? Or is it time to compile all regular
    > expressions in Unicode mode by default? It shouldn't hurt processing of
    > ASCII strings to do that.


    Believe it or not: there are folk out there who have data which is
    encoded in 8-bit encodings which are not ASCII and for which
    "\xA0".decode('whatever') does not produce u"\xA0" ... it could for
    example be a box-drawing character or a letter:

    >>> import unicodedata as ucd
    >>> "\xA0".decode('koi8-r')
    u'\u2550'
    >>> ucd.name(_)
    'BOX DRAWINGS DOUBLE HORIZONTAL'
    >>> "\xA0".decode('cp850')
    u'\xe1'
    >>> ucd.name(_)
    'LATIN SMALL LETTER A WITH ACUTE'
    >>>
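
    The same point can be sketched in Python 3, tying it back to the regex question: whether \s matches depends on what the byte 0xA0 decodes to. This is an illustration only. The describe() helper is hypothetical, and latin-1 is added for comparison (it does not appear in the session above).

```python
import re
import unicodedata

RAW = b"\xa0"  # one byte; its meaning depends entirely on the encoding


def describe(encoding):
    """Decode RAW with the given codec; return its Unicode name and whether \\s matches."""
    char = RAW.decode(encoding)
    is_space = re.fullmatch(r"\s", char) is not None
    return unicodedata.name(char), is_space


results = {enc: describe(enc) for enc in ("latin-1", "koi8-r", "cp850")}
for enc, (name, is_space) in results.items():
    print(enc, name, is_space)
```

    Only the latin-1 decoding yields a whitespace character; under the other two codecs the same byte is a box-drawing character or a letter, so a Unicode-aware \s does not match it.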


    Problem number 2: users in locale X probably wouldn't want a match to
    succeed on a character that is regarded as a digit (say) only in
    distant locale Y. If such a character turns up in a data file in
    locale X, it is more likely a sign that binary data was read as if it
    were Unicode text:

    >>> import re
    >>> ucd.name(u"\u0f20")
    'TIBETAN DIGIT ZERO'
    >>> re.match(ur"\d", u"\u0f20")
    >>> re.match(ur"\d", u"\u0f20", re.U)
    <_sre.SRE_Match object at 0x00EFC9C0>
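
    For readers on Python 3, a rough equivalent of the session above. It assumes Python 3 semantics, where \d in a str pattern is Unicode-aware unless re.ASCII is given. The explicit [0-9] class is shown as one possible way to avoid the surprise, not as a recommendation from the post.

```python
import re
import unicodedata

TIBETAN_ZERO = "\u0f20"  # TIBETAN DIGIT ZERO

unicode_digit = re.compile(r"\d")          # Unicode-aware for str patterns in Python 3
ascii_digit = re.compile(r"\d", re.ASCII)  # restricted to [0-9]
explicit_digit = re.compile(r"[0-9]")      # independent of any flag

matches = {
    "unicode": bool(unicode_digit.match(TIBETAN_ZERO)),
    "ascii": bool(ascii_digit.match(TIBETAN_ZERO)),
    "explicit": bool(explicit_digit.match(TIBETAN_ZERO)),
}
print(unicodedata.name(TIBETAN_ZERO), matches)
```

    Only the Unicode-aware pattern accepts the Tibetan digit; the other two reject it.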


    > The current setup is really a legacy of when
    > most things in Python didn't work in Unicode mode, and you didn't want to
    > introduce Unicode unnecessarily. It's another one of those obscure
    > Unicode "gotchas" that really should go away.


    It's the ASCII-centric mindset that creates gotchas and really should
    go away :)

    HTH,
    John
     
    John Machin, Apr 5, 2007
    #2

  3. Steve Holden

    Steve Holden Guest

    John Nagle wrote:
    > Regular expressions are compiled in ASCII mode unless
    > Unicode mode is specified to "re.compile". The difference is that regular
    > expressions in ASCII mode don't recognize things like
    > Unicode whitespace, even when applied to Unicode strings.
    > For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
    > a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
    > This can create some strange bugs.
    >
    > Is the current default good? Or is it time to compile all regular
    > expressions in Unicode mode by default? It shouldn't hurt processing of
    > ASCII strings to do that. The current setup is really a legacy of when
    > most things in Python didn't work in Unicode mode, and you didn't want to
    > introduce Unicode unnecessarily. It's another one of those obscure
    > Unicode "gotchas" that really should go away.
    >
    > John Nagle


    Personally I'd leave it to go away with Python 3.0, when all strings
    will be Unicode.
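
    As a sketch of how this looks in Python 3 as released (an observation added in hindsight, not something the post itself states): patterns written as str are compiled in Unicode mode by default, while bytes patterns stay ASCII-only and cannot be mixed with str input. Variable names are illustrative.

```python
import re

str_pattern = re.compile(r"\s")     # str pattern: Unicode mode by default
bytes_pattern = re.compile(rb"\s")  # bytes pattern: ASCII-only matching

str_is_unicode = bool(str_pattern.flags & re.UNICODE)
bytes_is_unicode = bool(bytes_pattern.flags & re.UNICODE)

# NO-BREAK SPACE as text matches \s; the raw byte 0xA0 does not.
str_hit = str_pattern.match("\u00a0") is not None
bytes_hit = bytes_pattern.match(b"\xa0") is not None

# Applying a str pattern to bytes is rejected outright.
try:
    str_pattern.match(b"\xa0")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_is_unicode, bytes_is_unicode, str_hit, bytes_hit, mixed_ok)
```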

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    Recent Ramblings http://holdenweb.blogspot.com
     
    Steve Holden, Apr 5, 2007
    #3
