On re / regex replacement

jmfauth

There is currently a discussion on the dev-list about replacing
"re" with "regex".

I'm not a regular expressions specialist, nor a regex user.
However, there is one point in regex that bothers me a little.

The regex module offers flags to select the "coding" (wrong word,
just to be short):

The global flags are: ASCII, LOCALE, NEW, REVERSE, UNICODE.

If I can understand the ASCII flag, ASCII being the "lingua franca" of
almost all codings, I am more skeptical about the LOCALE/UNICODE
flags.

There is, in my mind, some kind of conflict here. What is 100% Unicode
compliant should be locale independent ("Unicode.org"), and a locale
dependency means a loss of Unicode compliance.

I fear some potential problems here: users or modules working
in one mode while others work in the other mode.

Nothing technical here. It seems to me nobody has pointed this
out.

jmf
 
Vlastimil Brom

2011/8/28 jmfauth said:
> There is currently a discussion on the dev-list about replacing
> "re" with "regex".
> ...
> If I can understand the ASCII flag, ASCII being the "lingua franca" of
> almost all codings, I am more skeptical about the LOCALE/UNICODE
> flags.
>
> There is, in my mind, some kind of conflict here. What is 100% Unicode
> compliant should be locale independent ("Unicode.org"), and a locale
> dependency means a loss of Unicode compliance.
>
> I fear some potential problems here: users or modules working
> in one mode while others work in the other mode.
>
> ...
> jmf


As I understand it, regex was designed to be as compatible with
re as possible; sometimes even problematic (in some
interpretation) behaviour is retained as the default and "corrected" via
the NEW flag (e.g. zero-width split). The LOCALE flag also seems to be
considered a legacy feature and is kept with the same behaviour as in re;
cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
In my opinion, the LOCALE flag is not reliable (in the way I would
imagine) in either re or regex.

As far as flags are concerned, regex should work the same way as re, or
it just adds more possibilities (REVERSE for backwards search, ASCII as
the complement of UNICODE, NEW to enable some incompatible additions or
corrections where the original behaviour might be relied on).

The only (understandable) incompatibilities I have encountered in regex
are the new features requiring special syntax, which would obviously
raise errors in re or be matched literally instead.
See
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
for an overview of the additions.

Personally, I am very happy with regex, both with its features and
with the support and maintenance by its developer;
however, I mostly use it for manually entered patterns, and less
for hardcoded operations.

regards,
Vlastimil Brom
 
MRAB

Vlastimil Brom said:
> As I understand it, regex was designed to be as compatible with
> re as possible; sometimes even problematic (in some
> interpretation) behaviour is retained as the default and "corrected" via
> the NEW flag (e.g. zero-width split). The LOCALE flag also seems to be
> considered a legacy feature and is kept with the same behaviour as in re;
> cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
> In my opinion, the LOCALE flag is not reliable (in the way I would
> imagine) in either re or regex.

In Python 2, re defaults to ASCII and you must use UNICODE for Unicode
strings (the str type is a bytestring). In Python 3, re defaults to
UNICODE and you must use ASCII for ASCII bytestrings (the str type is a
Unicode string).
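
A minimal sketch of the Python 3 side of this, using only the standard re module. The sample string is made up, and the inline (?a) form is shown only as one way to keep the choice attached to a single pattern:

```python
import re

text = "na\u00efve caf\u00e9"  # made-up sample: "naïve café"

# Python 3, str pattern: \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# The same flag written inline, so it travels with the pattern itself.
inline_words = re.findall(r"(?a)\w+", text)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
print(inline_words)   # ['na', 've', 'caf']
```

Since the flag belongs to each compiled pattern, two modules can use different modes without affecting each other.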

The LOCALE flag is for locale-dependent 8-bit bytestrings. It uses the
toupper and tolower functions of the underlying C library.
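
A rough illustration with the standard re module, assuming a current Python 3 (3.6 or later, as far as I know), where LOCALE is accepted only for bytes patterns. The sample bytes are made up, and no output is claimed for the non-ASCII byte because it depends on the active locale:

```python
import locale
import re

# LOCALE combined with a bytes pattern.
word = re.compile(rb"\w+", re.LOCALE)

data = "caf\u00e9 au lait".encode("latin-1")  # made-up 8-bit sample

# Whether byte 0xE9 counts as a word character depends on the
# current LC_CTYPE locale, so the first item may vary.
print(locale.setlocale(locale.LC_CTYPE))  # query only, changes nothing
print(word.findall(data))

# Combining LOCALE with a str pattern is rejected.
try:
    re.compile(r"\w+", re.LOCALE)
    str_locale_ok = True
except ValueError:
    str_locale_ok = False
print(str_locale_ok)
```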

The regex module tries to be drop-in compatible. It supports the LOCALE
flag only because the re module has it. Even Perl has something similar.

> As far as flags are concerned, regex should work the same way as re, or
> it just adds more possibilities (REVERSE for backwards search, ASCII as
> the complement of UNICODE, NEW to enable some incompatible additions or
> corrections where the original behaviour might be relied on).
>
> The only (understandable) incompatibilities I have encountered in regex
> are the new features requiring special syntax, which would obviously
> raise errors in re or be matched literally instead.
> See
> http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
> for an overview of the additions.

In the re module, unknown escape sequences are treated as literals, e.g.
\K is treated as K.

The regex module has more escape sequences, so that may break existing
regexes, e.g. \X isn't treated as X, but matches a grapheme. Unknown
escape sequences are still treated as literals, as in re.
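
A small sketch of why leaning on this is fragile, using only the standard re module. The helper name compiles and the sample text are made up for illustration. Newer Python releases reject unknown letter escapes in patterns instead of treating them as literals, so the first result depends on the version:

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Version-dependent: older re treated \K as a literal K,
# newer releases raise an error for unknown letter escapes.
print(compiles(r"\K"))

# Safer: say what you mean. A literal K needs no backslash, and
# re.escape() quotes arbitrary literal text, backslashes included.
literal = re.escape("C:\\K")
print(re.fullmatch(literal, "C:\\K") is not None)  # True
```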

My view is that you shouldn't be relying on that behaviour. If it looks
like an escape sequence, it may very well be one. It's like their use
in string literals for file paths on Windows. I would've preferred
that an invalid escape sequence in a string literal raised an exception
(either it's valid and has a meaning, or it's invalid/reserved for
future use).
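
The Windows path comparison can be illustrated like this (the path is made up). Here the backslashes happen to form valid escapes, which is exactly the trap:

```python
plain = "C:\new\table"   # \n and \t are real escapes here
raw = r"C:\new\table"    # a raw string keeps the backslashes

print(len(plain))  # 10
print(len(raw))    # 12
print("\n" in plain, "\t" in plain)  # True True
print(raw == "C:\\new\\table")       # True
```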

It's a balancing act. Requiring the NEW flag for _any_ deviation from
re would be very annoying.

> Personally, I am very happy with regex, both with its features and
> with the support and maintenance by its developer;
> however, I mostly use it for manually entered patterns, and less
> for hardcoded operations.

And I'm very happy with your feedback. ;-)
 
