PyWart: Python regular expression syntax is not intuitive.

Rick Johnson

In particular i find the "extension notation" syntax to be woefully
inadequate. You should be able to infer the action of the extension
syntax intuitively, simply from looking at its signature. I find
myself continually needing to consult the docs because of the lacking
or misleading style of the current syntax. Consider:

(...) # Group Capture
Okay here. Parentheses feel very natural for delimiting a group.

(?...) # Base Extension Syntax
All extensions are wrapped in parentheses and start with a question
mark, but i believe the question mark was a very bad choice, since the
question mark is already specific to "zero or one repetitions of
preceding RE". This simple error is why i believe Python re's are so
damn difficult to eyeball parse. You'll constantly be forced to spend
too much time deciding if this question mark is referring to
repeats, or is the start of an extension syntax. We should have
chosen another char, and the char should NOT be known to RE in any
other place. Maybe the tilde would work? Wait, i have a MUCH better
idea!!!

Actually the best choice would have been using BRACES instead of
PARENTHESES to delimit the extension syntax, since parentheses are
used (wisely i might add!) for group captures. Also, anything
contained in braces is more likely to be understood (by almost all
programmers) as a "command block" -- unfortunately some idiot decided
to use braces for specifying ranges! WHAT A F'ING WASTE of intuitive
chars!

(?iLmsux) # Passing Flags Internally
This is ridiculous. re's are cryptic enough without inviting TIMTOWTDI
over to play. Passing flags this way does nothing BUT harm
readability. Please people, pass your flags as an argument to the
appropriate re.method() and NOT as another cryptic syntax.
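
For anyone comparing the two spellings, here is a minimal sketch of the same case-insensitive search written both ways. The sample text and variable names are made up for illustration.

```python
import re

# Flag passed as an argument to the compile call.
by_argument = re.compile(r"spam", re.IGNORECASE)

# The same flag written inline at the start of the pattern.
inline = re.compile(r"(?i)spam")

text = "Spam and SPAM"
print(by_argument.findall(text))  # ['Spam', 'SPAM']
print(inline.findall(text))       # ['Spam', 'SPAM']
```

Both patterns behave the same here; the difference is only where the flag is written.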

(?:...) # Non-Capturing Group
When i look at this pattern "non-capturing" DOES NOT scream out at me,
and again, the question mark is used incorrectly. When i think of a
char that screams NEGATIVE, i think of the exclamation mark, NOT the
question mark. And how the HELL is the colon helping me to interpret
this syntax?
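
To make the complaint concrete, a small sketch of what the colon actually changes: the group still groups, but it no longer shows up in the match's group list. The sample string is an assumption chosen for illustration.

```python
import re

# Plain parentheses: the repeated group is captured (last repetition wins).
capturing = re.search(r"(ab)+(\d)", "ababab7")

# (?:...) groups for repetition but captures nothing.
non_capturing = re.search(r"(?:ab)+(\d)", "ababab7")

print(capturing.groups())      # ('ab', '7')
print(non_capturing.groups())  # ('7',)
```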

(?P<name>...) # Named Group Capture
(?P=name) # Named Group Reference
(?#...) # Comment
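
For reference, a short sketch of the named group and named reference forms listed above, used together to find a doubled word. The group name `word` and the sample sentence are illustrative choices, not anything from the docs.

```python
import re

# (?P<word>...) names the group; (?P=word) must match the same text again.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("it was the the best of times")
print(m.group("word"))  # the
print(m.group(0))       # the the
```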

################################################
## The following assertions are highly flawed ##
################################################

(?=...) # positive look ahead
(?!...) # negative look ahead
(?<=...) # positive look behind
(?<!...) # negative look behind
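
As a quick reminder of what each of the four assertions does in the current syntax, here is a minimal sketch. The sample text and the patterns are assumptions picked to make each case visible.

```python
import re

text = "price: $42, cost: 17, fee: $5"

# (?=...)  positive lookahead: words that are followed by a colon
labels = re.findall(r"\w+(?=:)", text)

# (?!...)  negative lookahead: whole words that are NOT followed by a colon
values = re.findall(r"\b\w+\b(?!:)", text)

# (?<=...) positive lookbehind: digits that are preceded by "$"
dollars = re.findall(r"(?<=\$)\d+", text)

# (?<!...) negative lookbehind: numbers that are NOT preceded by "$"
plain = re.findall(r"(?<!\$)\b\d+", text)

print(labels)   # ['price', 'cost', 'fee']
print(values)   # ['42', '17', '5']
print(dollars)  # ['42', '5']
print(plain)    # ['17']
```

In every case the asserted text is checked but not consumed, which is why the colon and the dollar sign never appear in the results.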

I cannot decipher these patterns in their current syntactical forms.
Too much information is missing or misleading. I have no idea which
pattern is looking forward, which pattern is looking backward, which
pattern is negative, and which pattern is positive. I need syntactical
clues! Consider these:

(?>=...) #Read as "forward equals pattern?"
(?>!=...) #Read as "forward NOT equals pattern?"
(?<=...) #Read as "backwards equals pattern?"
(?<!=...) #Read as "backwards NOT equals pattern?"

However, i really don't like the fact that negative assertions need
one more char than positive assertions. Here is an alternative:

(?>+...) #Read as "forward equals pattern?"
(?>-...) #Read as "forward NOT equals pattern?"
(?<+...) #Read as "backwards equals pattern?"
(?<-...) #Read as "backwards NOT equals pattern?"

Looks much better HOWEVER we still have too much useless noise.
Replace the parenthesis delimiters with braces, and drop the "where's
waldo" question mark, and we have a simplistically intuitive
syntactical bliss!

{...} # Base Extension Syntax
{iLmsux} # Passing Flags Internally
{!()...} or (!...) # Non Capturing.
{NG=identifier...} # Named Group Capture
{NG.name} # Named Group Reference
{#...} # Comment
{>+...} # Positive Look Ahead Assertion
{>-...} # Negative Look Ahead Assertion
{<+...} # Positive Look Behind Assertion
{<-...} # Negative Look Behind Assertion
{(id/name)yes-pat|no-pat}

*school-bell-rings*

PS: In my eyes, Python 3000 is already a dinosaur.
 
Rick Johnson

{!()...} or (!...) # Non Capturing.

Yuck: on second thought, i don't like {!()...}, mainly because
non-capturing groups should use the parentheses delimiters to keep the
API consistent. Try this instead --> (!:...)

{NG=identifier...}  # Named Group Capture
{NG.name}  # Named Group Reference

....should be {NG.identifier}. I am also feeling like named group
syntax could be more simplistic without sacrificing readability.

{=identifier...}  # Named Group Capture
{.identifier}  # Named Group Reference
 
Ian Kelly

(?...)  # Base Extension Syntax
All extensions are wrapped in parentheses and start with a question
mark, but i believe the question mark was a very bad choice, since the
question mark is already specific to "zero or one repetitions of
preceding RE". This simple error is why i believe Python re's are so
damn difficult to eyeball parse. You'll constantly be forced to spend
too much time deciding if this question mark is referring to
repeats, or is the start of an extension syntax. We should have
chosen another char, and the char should NOT be known to RE in any
other place. Maybe the tilde would work? Wait, i have a MUCH better
idea!!!

Did you read the very first sentence of the re module documentation?
"This module provides regular expression matching operations *similar
to those found in Perl*" (my emphasis). The goal here is
compatibility with existing RE syntaxes, not readability. Perl uses
the (?...) syntax, so the re module does too.

(?iLmsux) # Passing Flags Internally
This is ridiculous. re's are cryptic enough without inviting TIMTOWTDI
over to play. Passing flags this way does nothing BUT harm
readability. Please people, pass your flags as an argument to the
appropriate re.method() and NOT as another cryptic syntax.

1) Not all regular expressions are hard-coded. Some applications even
allow users to supply regular expressions as data. Permitting flags
in the regular expression allows the user to specify or override the
defaults set by the application.

2) Permitting flags in the regular expression allows different
combinations of flags to be in effect for different parts of complex
regular expressions. You can't do that just by passing in the flags
as an argument.
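
Point 1 can be sketched as follows. The helper `find_lines` and the log lines are hypothetical; the idea is only that the application compiles whatever pattern it is handed and passes no flags of its own, so the inline flag is the user's one way to ask for case-insensitive matching.

```python
import re

def find_lines(user_pattern, lines):
    """Hypothetical application helper: the pattern arrives as data."""
    compiled = re.compile(user_pattern)  # the application passes no flags
    return [line for line in lines if compiled.search(line)]

log = ["ERROR disk full", "error: retrying", "ok"]

print(find_lines(r"error", log))      # ['error: retrying']
print(find_lines(r"(?i)error", log))  # ['ERROR disk full', 'error: retrying']
```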
(?:...) # Non-Capturing Group
When i look at this pattern "non-capturing" DOES NOT scream out at me,
and again, the question mark is used incorrectly. When i think of a
char that screams NEGATIVE, i think of the exclamation mark, NOT the
question mark. And how the HELL is the colon helping me to interpret
this syntax?

Don't ask us. Ask Larry Wall.

(?=...)  # positive look ahead
(?!...)  # negative look ahead
(?<=...) # positive look behind
(?<!...) # negative look behind

I cannot decipher these patterns in their current syntactical forms.
Too much information is missing or misleading. I have no idea which
pattern is looking forward, which pattern is looking backward, which
pattern is negative, and which pattern is positive. I need syntactical
clues! Consider these:

(?>=...) #Read as "forward equals pattern?"
(?>!=...) #Read as "forward NOT equals pattern?"
(?<=...) #Read as "backwards equals pattern?"
(?<!=...) #Read as "backwards NOT equals pattern?"

However, i really don't like the fact that negative assertions need
one more char than positive assertions. Here is an alternative:

(?>+...) #Read as "forward equals pattern?"
(?>-...) #Read as "forward NOT equals pattern?"
(?<+...) #Read as "backwards equals pattern?"
(?<-...) #Read as "backwards NOT equals pattern?"

Looks much better HOWEVER we still have too much useless noise.
Replace the parenthesis delimiters with braces, and drop the "where's
waldo" question mark,  and we have a simplistically intuitive
syntactical bliss!

Once again, these come from Perl. Note also that Perl already has
(?>...) which means something entirely different.

{...}  # Base Extension Syntax
{iLmsux}  # Passing Flags Internally
{!()...} or (!...) # Non Capturing.
{NG=identifier...}  # Named Group Capture
{NG.name}  # Named Group Reference
{#...}  # Comment
{>+...}  # Positive Look Ahead Assertion
{>-...}  # Negative Look Ahead Assertion
{<+...}  # Positive Look Behind Assertion
{<-...}  # Negative Look Behind Assertion
{(id/name)yes-pat|no-pat}

*school-bell-rings*

Regular expression reform is not necessarily a bad thing, but this is
just forcing everybody to learn Yet Another Regex Syntax for no real
purpose. All that you've changed here is window dressing. For an
overview of many of the *real* problems with regular expression
syntax, see

http://www.perl.com/pub/2002/06/04/apo5.html

Ian
 
Terry Reedy

(?...) # Base Extension Syntax
All extensions are wrapped in parentheses and start with a question
mark, but i believe the question mark was a very bad choice, since the

I think that syntax came either from Perl or the pcre library used by
several open source programs, including several Python versions.
https://en.wikipedia.org/wiki/Pcre
has some info on this.
 
Rick Johnson

Did you read the very first sentence of the re module documentation?
"This module provides regular expression matching operations *similar
to those found in Perl*" (my emphasis).  The goal here is
compatibility with existing RE syntaxes, not readability.  Perl uses
the (?...) syntax, so the re module does too.

@Duncan and Ian:
Did you not read the title of my post? :o) "Python regular expression
syntax is not intuitive." While i understand WHERE the syntax
originates from, that fact does not solve the problem. The syntax is
not intuitive, and Python should ALWAYS be intuitive! We should always
borrow ideas from anyone (even our enemies) when those ideas support
our ideology. Perl style regexes are not Pythonic. They violate our
philosophy in too many places.

1) Not all regular expressions are hard-coded.  Some applications even
allow users to supply regular expressions as data.  Permitting flags
in the regular expression allows the user to specify or override the
defaults set by the application.

2) Permitting flags in the regular expression allows different
combinations of flags to be in effect for different parts of complex
regular expressions.  You can't do that just by passing in the flags
as an argument.

This is a valid argument, and i totally agree with you that we should
not remove the ability to pass flags internally. However, my main
point still stands strong (with a slight tweak). """Please people,
pass your flags as an argument to the appropriate re.method() and NOT
as another cryptic syntax, UNLESS YOU HAVE NO OTHER CHOICE!""" Thanks
for pointing this out.

Regular expression reform is not necessarily a bad thing, but this is
just forcing everybody to learn Yet Another Regex Syntax for no real
purpose.

I disagree here.
Whilst some people may be "die-hard" fans of the un-intuitive perl
regex syntax, i believe many, if not exponentially MORE people would
like to have a better alternative. Do i want to remove the current
"well established" re module? No. But i would like to create a new
regex module that is more pythonic. A regex module that we can be
proud of. And just maybe, a regex module that "sets the bar" for all
other regular expressions.

Listen. Backwards compatibility and cross pollination is wonderful
WHEN you can make it work. However, in the case of Perl regex syntax,
this is not a "cross pollination", this is a "cross pollution".

All that you've changed here is window dressing.  For an
overview of many of the *real* problems with regular expression
syntax, see

Window dressing is important Ian, if not, then shop owners would not
continue to show displays in their shop windows. What does window
dressing do exactly? It attracts the masses, and without the masses
all merchants will eventually go out of business. Note: my argument
HAS NOTHING to do with the number of folks programming python (or any
language). The argument is focused on module sustainability in a
community. Modules that are morbidly DIFFICULT to learn do not last.

I know about PyParsing but i believe we have room for PyParsing and a
more Pythonic take on Perl style regular expressions. I don't see why
we could not keep all three. Let the people decide what is best for
them.

The greatest aspect of regexes is their compactness, and we should
keep them compact. And in that respect regexes will always be cryptic
to the neophyte. However, regexes do not have to be a scourge to the
initiated. We must balance the compact and the intuitive nature of
regexes. But most importantly, we must understand that these aspects
of regexes are NOT mutually exclusive.
 
Rick Johnson

Or we could implement de-facto standards where they exist.

Are you so naive as to think that the Perl folks are even *slightly*
interested in intuitive regexps? Have you written, or even read, any
Perl code my friend? The *standards* are broken. Obviously they don't
care, or they prefer the esoteric nature of their cryptic creation.

And good day to you.
 
Devin Jeanpierre

In particular i find the "extension notation" syntax to be woefully
inadequate. You should be able to infer the action of the extension
syntax intuitively, simply from looking at its signature.

This is nice in theory. I see no reason to believe this is possible,
or that your syntax is closer to this ideal than the existing syntax.

Perhaps you should perform some experiments to prove intuitiveness?
Science is more convincing than insults.

Also, the "!" in negative assertions doesn't stand for "not equal" --
matches aren't equality. It stands for "not". It's the "=" that's a
misnomer.

-- Devin
 
Ian Kelly

I disagree here.
Whilst some people may be "die-hard" fans of the un-intuitive perl
regex syntax, i believe many, if not exponentially MORE people would
like to have a better alternative. Do i want to remove the current
"well established" re module? No. But i would like to create a new
regex module that is more pythonic. A regex module that we can be
proud of. And just maybe, a regex module that "sets the bar" for all
other regular expressions.

Compact regex notations are inherently unpythonic. While your
reimplementation may be more intuitive to you, I don't think that it's
more pythonic at all.

Window dressing is important Ian, if not, then shop owners would not
continue to show displays in their shop windows. What does window
dressing do exactly? It attracts the masses, and without the masses
all merchants will eventually go out of business. Note: my argument
HAS NOTHING to do with the number of folks programming python (or any
language). The argument is focused on module sustainability in a
community. Modules that are morbidly DIFFICULT to learn do not last.

Well, FWIW, I think that the current re module was easier for me to
learn than your version would have been, mainly because the re module
matches the syntax that I was already familiar with well before I
started using Python. If you think you can do better, though, I
encourage you to actually write your regex module and put it up on
PyPI.

I know about PyParsing but i believe we have room for PyParsing and a
more Pythonic take on Perl style regular expressions. I don't see why
we could not keep all three. Let the people decide what is best for
them.

PyParsing produces recursive descent parsers. It's an alternative to
regular expressions for a different class of parsing problems, not a
replacement, and so it's not particularly germane to this discussion.
 
Rick Johnson

Perhaps you should perform some experiments to prove intuitiveness [of your syntax]?

I've posted my thoughts and my initial syntax. You (and everyone else)
are free to critique or offer suggestions of your own. Listen, none of
these issues that plague Python are going to be resolved until people
around here set aside the grudges and haughty arrogance. We need to
get to work. But step one is NOT writing code. Step one is to gather
the community into lively discussion on these crucial topics. And the
folks who really want to get involved are not going to speak up unless
the rhetoric is toned down a bit.

Science is more convincing than insults.

I can assure you my intentions are not to insult. My blanket
observation is that the current Python re syntax is not intuitive
enough for Python, and that we can make it better.
 
Rick Johnson

Compact regex notations are inherently unpythonic.  While your
reimplementation may be more intuitive to you, I don't think that it's
more pythonic at all.

Regexps will never be "truly Pythonic". By their very nature they must
be implicit, complicated, most times nested and dense, not as readable
as we'd like, special cases everywhere, not very practical, hard(sic)
to explain, and just plain cryptic. They violate almost every aspect
of the zen. The point is NOT to make regexes "Pythonic", the point is
to make them as "Pythonic" as we can and not a bit more. I discussed
this very topic earlier, did you miss my speech? I thought it was quite
elegant...

Rick Johnson's stump speech 2.0: """ The greatest aspect of regexes is
their compactness, and not only should we keep them compact, we should
celebrate their compactness. It is in that respect that regexes will
always be cryptic to the neophyte, however, we must NEVER allow
regexes to be a scourge on the initiated, no. We must balance the
compact and the intuitive natures of regexes until we reach a natural
harmony. But most importantly, we must understand that these aspects
of regexes are NOT mutually exclusive -- for it is our understanding
that is flawed."""

*applause*

PyParsing produces recursive descent parsers.  It's an alternative to
regular expressions for a different class of parsing problems, not a
replacement, and so it's not particularly germane to this discussion.

It is germane in the fact that i believe PyParsing, re, and my new
regex module can co-exist in harmony.
 
Steven D'Aprano

In particular i find the "extension notation" syntax to be woefully
inadequate. You should be able to infer the action of the extension
syntax intuitively, simply from looking at its signature. I find myself
continually needing to consult the docs because of the lacking or
misleading style of the current syntax. Consider:

The only intuitive interface is the nipple. Everything else is learned.

Nevertheless, there are legitimate problems with Python's regex syntax.
It is based on Perl's syntax, and even Larry Wall agrees that it has some
serious problems.

Read Apocalypse 5: Wall gives a fantastic explanation of what's wrong
with current regex syntax (without such trivial platitudes as "it is not
intuitive", as if we can all agree on what is intuitive), why it has
become that way, and what Perl 6 will do about it.

http://www.perl.com/pub/2002/06/04/apo5.html

Regexes are essentially a programming language. They may or may not be
Turing complete, depending on the implementation (true regexes are not,
but Perl regexes are more powerful than true regexes), but they are still
a programming language. And users want regexes to be concise, otherwise
they would ask for richer string search methods and avoid regexes
altogether.

The problem is that conciseness and readability are usually (but not
always) in opposition. So regexes will never be as readable as Python
code, because the requirements of regexes -- that they be short, concise,
and usually written as one-liners (or at least one-liners must be
possible) -- do not meet Python standards of readability. How can they?
Regexes are shorthand. If you want longhand, write your search in
straight Python.

PS: In my eyes, Python 3000 is already a dinosaur.

We look forward to seeing your re-write. I'm sure all right-thinking
programmers will flock to your Python fork as soon as you start writing
it.
 
Steven D'Aprano

It is germane in the fact that i believe PyParsing, re, and my new regex
module can co-exist in harmony.

You don't have a new regex module.

When you have written it, then you will have a new regex module. Until
then, you're all talk.
 
Michael Torrie

The only intuitive interface is the nipple. Everything else is learned.

I think young mothers would even disagree with that. It's learned just
like everything else in life. Albeit very rapidly.
 
Devin Jeanpierre

It is germane in the fact that i believe PyParsing, re, and my new
regex module can co-exist in harmony.

If all you're going to change is the parser, maybe it'd be easier to
get things to coexist if parsers were pluggable in the re module.

It's more generally useful, too. Would let re gain a PyParsing/SNOBOL
like expression "syntax", for example. Or a regular grammar syntax.
Neat for experimentation.
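
The re module has no such plug-in hook that I know of, but the "only change the parser" idea can be approximated today by translating a pattern before handing it to re. Below is a rough sketch under that assumption: `PREFIXES` and `translate` are hypothetical names, only a handful of the brace forms proposed earlier in this thread are covered, and character classes containing braces are not handled.

```python
import re

# Hypothetical mapping from the brace syntax proposed in this thread
# to the prefixes the re module actually understands.
PREFIXES = {
    "{>+": "(?=",   # positive lookahead
    "{>-": "(?!",   # negative lookahead
    "{<+": "(?<=",  # positive lookbehind
    "{<-": "(?<!",  # negative lookbehind
    "{#": "(?#",    # comment
}

def translate(pattern):
    """Rewrite proposed-syntax extensions as standard re extensions."""
    out = []
    stack = []  # True for an extension brace, False for a plain one
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            out.append(pattern[i:i + 2])  # copy escapes untouched
            i += 2
            continue
        prefix = next((p for p in PREFIXES if pattern.startswith(p, i)), None)
        if prefix is not None:
            out.append(PREFIXES[prefix])
            stack.append(True)
            i += len(prefix)
            continue
        ch = pattern[i]
        if ch == "{":
            stack.append(False)      # ordinary quantifier brace such as {2}
        elif ch == "}" and stack:
            if stack.pop():
                ch = ")"             # closes a translated extension
        out.append(ch)
        i += 1
    return "".join(out)

print(translate(r"\w+{>+:}"))                           # \w+(?=:)
print(re.findall(translate(r"{<+\$}\d{2}"), "$42 $7 99"))  # ['42']
```

The stack is what keeps a quantifier like {2} from being mistaken for the end of an extension, which is the same ambiguity complained about at the top of the thread.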

-- Devin
 
Rick Johnson

If all you're going to change is the parser, maybe it'd be easier to
get things to coexist if parsers were pluggable in the re module.

It's more generally useful, too. Would let re gain a PyParsing/SNOBOL
like expression "syntax", for example. Or a regular grammar syntax.
Neat for experimentation.

I like your idea. Not sure about feasibility though. Unfortunately the
Python module "re" is under proprietary copyright. Hmm, seems not
everything is completely open source in the python world.

# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license. For any other use, please contact Secret Labs
# AB ([email protected]).

I need to dive into the "re" base code and see what is possible. My
original idea was to just start from scratch, but that may be foolish
considering all the scaffolding that will need to be erected.
 
Steven D'Aprano

I've posted my thoughts and my initial syntax. You (and everyone else)
are free to critique or offer suggestions of your own. Listen, none of
these issues that plague Python are going to be resolved until people
around here set aside the grudges and haughty arrogance. We need to get
to work. But step one is NOT writing code.

Well, that suits you well then, since you're an expert on not writing
code.

How is that fork of Python coming along? I really look forward to the day
that you make good on your promise to fork the language so all the right-
thinking people can follow you to the Promised Land.
 
Steven D'Aprano

2) Permitting flags in the regular expression allows different
combinations of flags to be in effect for different parts of complex
regular expressions. You can't do that just by passing in the flags as
an argument.

I don't believe Python's regex engine supports scoped flags, I think all
flags are global to the entire regex.

MRAB's regex engine does support scoped flags.

http://pypi.python.org/pypi/regex
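
A note for readers on newer interpreters: as far as I am aware, the standard re module in Python 3.6 and later does accept a scoped form, (?i:...), so the observation above describes the engine as it was at the time of this thread. A small sketch, assuming such a version:

```python
import re

# Case-insensitivity applies only inside the (?i:...) group.
scoped = re.compile(r"(?i:hello) World")

print(bool(scoped.fullmatch("HELLO World")))  # True
print(bool(scoped.fullmatch("HELLO world")))  # False
```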
 
Evan Driscoll

If all you're going to change is the parser, maybe it'd be easier to
get things to coexist if parsers were pluggable in the re module.

It's more generally useful, too. Would let re gain a PyParsing/SNOBOL
like expression "syntax", for example. Or a regular grammar syntax.
Neat for experimentation.

I don't know what would be involved in that, but if it could be made to
work, that sounds to me like a remarkably good idea to have come out of
this thread.

(Now it's time for my own troll: "About as good of an idea as no longer
calling PCRE-alikes 'regular expressions', because they aren't." Ahhh,
got that out of my system. :))

Evan



 
