On re / regex replacement

jmfauth · Aug 28, 2011

There is currently a discussion on the dev-list about replacing
"re" with "regex".

I'm not a regular-expression specialist, nor am I a regex user.
However, there is one point in regex that bothers me a little.

The regex module offers flags to select the "coding" (wrong word,
just to be short):

The global flags are: ASCII, LOCALE, NEW, REVERSE, UNICODE.

While I can understand the ASCII flag, ASCII being the "lingua franca" of
almost all codings, I am more skeptical about the LOCALE/UNICODE
flags.

There is in my mind some kind of conflict here. Whatever is 100% Unicode
compliant should be locale independent ("Unicode.org"), and a locale
dependency means a loss of Unicode compliance.

I fear some potential problems here: users or modules working
in one mode while others work in the other mode.

Nothing technical here. It seems to me nobody has pointed this
out.

jmf

Vlastimil Brom · Aug 28, 2011

2011/8/28 jmfauth said:
There is currently a discussion on the dev-list about replacing
"re" with "regex".
...
While I can understand the ASCII flag, ASCII being the "lingua franca" of
almost all codings, I am more skeptical about the LOCALE/UNICODE
flags.

There is in my mind some kind of conflict here. Whatever is 100% Unicode
compliant should be locale independent ("Unicode.org"), and a locale
dependency means a loss of Unicode compliance.

I fear some potential problems here: users or modules working
in one mode while others work in the other mode.

...
jmf

As I understand it, regex was designed to be as compatible with
re as possible; sometimes even behaviour that is problematic (in some
interpretation) is retained as the default and "corrected" via
the NEW flag (e.g. zero-width split). The LOCALE flag also seems to be
considered a legacy feature and is kept with the same behaviour as in re;
cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
In my opinion, the LOCALE flag is not reliable (in the way I would
imagine) in either re or regex.

In the area of flags, regex should either work the same way as re or just
add more possibilities (REVERSE for backwards search, ASCII as the
complement of UNICODE, NEW to enable some incompatible additions or
corrections where the original behaviour could be relied on).

The only (understandable) incompatibilities I encounter in regex are the
new features requiring special syntax, which would obviously raise
errors in re or be matched literally instead.
See
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
for an overview of the additions.

Personally, I am very happy with regex, both with its features and
with the support and maintenance by its developer;
however, I mostly use it for manually entered patterns and less
for hardcoded operations.

regards,
Vlastimil Brom

MRAB · Aug 28, 2011

As I understand it, regex was designed to be as compatible with
re as possible; sometimes even behaviour that is problematic (in some
interpretation) is retained as the default and "corrected" via
the NEW flag (e.g. zero-width split). The LOCALE flag also seems to be
considered a legacy feature and is kept with the same behaviour as in re;
cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
In my opinion, the LOCALE flag is not reliable (in the way I would
imagine) in either re or regex.

In Python 2, re defaults to ASCII and you must use UNICODE for Unicode
strings (the str type is a bytestring). In Python 3, re defaults to
UNICODE and you must use ASCII for ASCII bytestrings (the str type is a
Unicode string).
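
A minimal sketch of the Python 3 half of that description, using only the standard re module. The sample string is made up for illustration; the non-ASCII letters are written as escapes so the snippet does not depend on the file encoding.

```python
import re

# Hypothetical sample text: "naive" with i-diaeresis, "cafe" with e-acute.
text = "na\u00efve caf\u00e9 123"

# With a str pattern, \w is Unicode-aware by default in Python 3.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w (and \d, \s, \b) to the ASCII range.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café', '123']
print(ascii_words)    # ['na', 've', 'caf', '123']
```

The same pattern therefore splits words differently depending on the flag, which is the kind of mode difference raised at the start of the thread.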

The LOCALE flag is for locale-dependent 8-bit bytestrings. It uses the
toupper and tolower functions of the underlying C library.
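
As a rough illustration with the standard re module: LOCALE applies to bytes patterns, and what it matches depends on the process locale, so no output is shown for that part. The last check assumes Python 3.6 or later, where LOCALE on a str pattern is rejected; older versions behaved differently.

```python
import re

# A bytes pattern may carry the LOCALE flag; what \w matches then depends
# on the current C locale of the process.
locale_pattern = re.compile(rb"\w+", re.LOCALE)

# Without LOCALE, a bytes pattern only treats ASCII letters, digits and
# the underscore as word characters, so the byte 0xE9 ends the match.
ascii_only = re.findall(rb"\w+", b"caf\xe9")
print(ascii_only)  # [b'caf']

# Assumption: Python 3.6+, where LOCALE cannot be combined with a str pattern.
try:
    re.compile(r"\w+", re.LOCALE)
    str_locale_allowed = True
except ValueError:
    str_locale_allowed = False
print(str_locale_allowed)
```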

The regex module tries to be drop-in compatible. It supports the LOCALE
flag only because the re module has it. Even Perl has something similar.

In the area of flags, regex should either work the same way as re or just
add more possibilities (REVERSE for backwards search, ASCII as the
complement of UNICODE, NEW to enable some incompatible additions or
corrections where the original behaviour could be relied on).

The only (understandable) incompatibilities I encounter in regex are the
new features requiring special syntax, which would obviously raise
errors in re or be matched literally instead.
See
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
for an overview of the additions.

In the re module, unknown escape sequences are treated as literals, e.g.
\K is treated as K.
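
That describes re at the time of this thread. The sketch below assumes Python 3.7 or later, where, as far as the re documentation indicates, an unknown escape made of a backslash and an ASCII letter is rejected instead of being read as a literal. The helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

known_escape = compiles(r"\d")    # \d is a defined escape in re
unknown_escape = compiles(r"\K")  # \K is not defined by the re module

print(known_escape, unknown_escape)
```

On the assumed versions the second value is False, so a pattern that relied on \K meaning a plain K no longer compiles.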

The regex module has more escape sequences, so that may break existing
regexes, e.g. \X isn't treated as X, but matches a grapheme. Unknown
escape sequences are still treated as literals, as in re.

My view is that you shouldn't be relying on that behaviour. If it looks
like an escape sequence, it may very well be one. It's like their use
in string literals for file paths on Windows. I would've preferred
that an invalid escape sequence in a string literal raised an exception
(either it's valid and has a meaning, or it's invalid/reserved for
future use).
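
The Windows path comparison can be sketched as follows. The path is a made-up example, chosen so that both backslash sequences happen to be valid string escapes.

```python
# In a normal string literal, \t and \n are escapes for TAB and NEWLINE,
# so this "path" silently contains two control characters.
path_plain = "C:\temp\new"

# A raw string literal keeps every backslash as written.
path_raw = r"C:\temp\new"

print(len(path_plain))     # 9
print(len(path_raw))       # 11
print("\t" in path_plain)  # True
print("\\" in path_plain)  # False
```

Nothing is reported for the first literal, which is why a sequence that merely looks like an escape is risky to rely on.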

It's a balancing act. Requiring the NEW flag for _any_ deviation from
re would be very annoying.

Personally, I am very happy with regex, both with its features and
with the support and maintenance by its developer;
however, I mostly use it for manually entered patterns and less
for hardcoded operations.

And I'm very happy with your feedback. ;-)

jmfauth · Aug 29, 2011

...

The regex module tries to be drop-in compatible. It supports the LOCALE
flag only because the re module has it. Even Perl has something similar.
...

Ok. That's quite logical.

jmf
