Python regular expressions just ain't PCRE

Wiseman

I'm kind of disappointed with the re regular expressions module. In
particular, the lack of support for recursion ( (?R) or (?n) ) is a
major drawback to me. So many great things can be accomplished with
recursive regular expressions, such as validating a mathematical
expression or parsing a language with nested parentheses, quoting, or
expressions.
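
For example, a balanced-parentheses matcher of the kind I mean could
look like the sketch below. I can't write it with the stdlib re module
(no (?R) there), so this uses the third-party regex module, which does
accept the construct:

    import regex  # third-party; stdlib re rejects (?R)

    # A '(' followed by any mix of non-paren characters or a recursive
    # match of the whole pattern, then a ')'.
    BALANCED = regex.compile(r"\((?:[^()]+|(?R))*\)")

    print(BALANCED.fullmatch("(a(b)(c(d)))") is not None)  # True
    print(BALANCED.fullmatch("(a(b") is not None)          # False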

Other features I'm missing are once-only subpatterns and possessive
quantifiers ( (?>...) and ?+ *+ ++ {...}+ ), which are great for
avoiding deep recursion and inefficiency in complex patterns with
nested quantifiers. Even java.util.regex supports them.
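
To illustrate what these buy you, here's a sketch, again with the
third-party regex module since stdlib re doesn't accept the syntax:

    import regex

    # Possessive quantifier: a++ keeps every 'a' it consumes and never
    # gives one back, so nothing is left for the trailing literal.
    print(regex.fullmatch(r"a++ab", "aaab"))     # None: no backtracking
    print(regex.fullmatch(r"a+ab", "aaab"))      # matches: a+ backs off

    # Once-only (atomic) group: the same idea for a whole subpattern.
    print(regex.fullmatch(r"(?>a+)ab", "aaab"))  # None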

Are there any plans to support these features in re? They would be
great features for Python 2.6: they wouldn't clutter anything, and
they'd mean one less reason to use Perl instead of Python.

Note: I know there are LALR parser generators/parsers for Python, but
the very reason re exists is to provide a much simpler, more
productive way to parse or validate simple languages and process text.
(The pyparse/yappy/yapps/<insert your favourite Python parser
generator here> argument could have been used to skip regular
expression support in the language, or to deprecate re. Would you want
that? And by the same logic, why would we have Python when there's C?)
 
Terry Reedy

| I'm kind of disappointed with the re regular expressions module.

I believe the current Python re module was written to replace the Python
wrapping of pcre in order to support unicode.

| In particular, the lack of support for recursion ( (?R) or (?n) ) is a
| major drawback to me.

I don't remember those being in the pcre Python once had. Perhaps they are
new.

| Are there any plans to support these features in re?

I have not seen any. You would have to ask the author. But I suspect that
this would be a non-trivial project outside his needs.

tjr
 
Marc 'BlackJack' Rintsch

| Note: I know there are LALR parser generators/parsers for Python, but
| the very reason re exists is to provide a much simpler, more
| productive way to parse or validate simple languages and process text.
| (The pyparse/yappy/yapps/<insert your favourite Python parser
| generator here> argument could have been used to skip regular
| expression support in the language, or to deprecate re. Would you want
| that? And by the same logic, why would we have Python when there's C?)

I don't follow your reasoning here. `re` is useful for matching tokens
for a higher-level parser, and C is useful for writing parts that need
hardware access or "raw speed" where pure Python is too slow.

Regular expressions can become very unreadable compared to Python source
code or EBNF grammars, but modeling the tokens in EBNF or Python objects
isn't as compact and readable as simple regular expressions. So both `re`
and higher-level parsers are useful together and don't supersede each
other.
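
As a sketch of that division of labour (the token names are invented
for the example), `re` does the lexing while a few lines of Python
stand in for the parser:

    import re

    TOKEN_RE = re.compile(r"""
          (?P<NUMBER>   \d+          )
        | (?P<NAME>     [A-Za-z_]\w* )
        | (?P<OP>       [+\-*/()=]   )
        | (?P<WS>       \s+          )
        | (?P<MISMATCH> .            )
    """, re.VERBOSE)

    def tokenize(text):
        for m in TOKEN_RE.finditer(text):
            kind = m.lastgroup
            if kind == "WS":
                continue
            if kind == "MISMATCH":
                raise SyntaxError("unexpected character %r" % m.group())
            yield kind, m.group()  # pairs a real parser would consume

    print(list(tokenize("x = 2 + 40")))
    # [('NAME', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', '+'),
    #  ('NUMBER', '40')]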

The same holds for C and Python. IMHO.

Ciao,
Marc 'BlackJack' Rintsch
 
Wiseman

| I believe the current Python re module was written to replace the Python
| wrapping of pcre in order to support unicode.

I don't know how PCRE was back then, but right now it supports UTF-8
Unicode patterns and strings, and Unicode character properties. Maybe
it could be reintroduced into Python?

| I don't remember those being in the pcre Python once had. Perhaps they are
| new.

At least today, PCRE supports recursion and a recursion check,
possessive quantifiers and once-only subpatterns (which disable
backtracking in a subpattern), callouts (user functions to call at
given points), and other interesting, powerful features.
 
Wiseman

| I don't follow your reasoning here. `re` is useful for matching tokens
| for a higher-level parser, and C is useful for writing parts that need
| hardware access or "raw speed" where pure Python is too slow.

| Regular expressions can become very unreadable compared to Python source
| code or EBNF grammars, but modeling the tokens in EBNF or Python objects
| isn't as compact and readable as simple regular expressions. So both `re`
| and higher-level parsers are useful together and don't supersede each
| other.

| The same holds for C and Python. IMHO.

| Ciao,
| Marc 'BlackJack' Rintsch

Sure, they don't supersede each other, and they don't need to. My point
was that the more things you can do with regexes (not really regular
expressions anymore), the better, as long as they are powerful enough
for what you need to accomplish and don't become a giant Perl-style
hack. Regular expressions are a built-in, standard feature of Python;
they are much faster to use and write than Python code or some LALR
parser definition, and they are more generally known and understood.
You aren't going to parse a programming language with a regex, but you
can save a lot of time if you can parse simple (though not trivial)
languages with them. Regular expressions offer a productive alternative
to full-fledged parsers for the cases where you don't need one. So
saying that if you want feature X or feature Y in regular expressions
you should use a Bison-like parser sounds a bit like an excuse, because
the very reason regular expressions like these exist is to avoid using
big, complex parsers for simple cases. As an analogy, I mentioned
Python vs. C: you want to develop in high-level languages because they
are simpler and more productive than working with C, even if you can do
anything with the latter.
 
dustin

| I don't know how PCRE was back then, but right now it supports UTF-8
| Unicode patterns and strings, and Unicode character properties. Maybe
| it could be reintroduced into Python?

I would say this is a case for "rough consensus and working code". With
something as big and ugly[1] as a regexp library, I think the "working
code" part will be the hard part.

So, if you have a patch, there's a decent chance such a thing would be
adopted.

I'm not sure what your skill level is, but I would suggest studying the
code, starting in on a patch for one or more of these features, and then
corresponding with the module's maintainers to improve your patch to the
point where it can be accepted.

Dustin
 
Martin v. Löwis

| Are there any plans to support these features in re?

This question is impossible to answer. I don't have such
plans, and I don't know of any, but how could I speak for
the hundreds of contributors to Python world-wide, including
those future contributors who haven't contributed *yet*?

Do you have plans for such features in re?

Regards,
Martin
 
sjdevnull

Wiseman said:
| I'm kind of disappointed with the re regular expressions module. In
| particular, the lack of support for recursion ( (?R) or (?n) ) is a
| major drawback to me. So many great things can be accomplished with
| recursive regular expressions, such as validating a mathematical
| expression or parsing a language with nested parentheses, quoting, or
| expressions.

-1 on this from me. In the past 10 years as a professional
programmer, I've used the weird extended "regex" features maybe 5
times total, whether it be in Perl or Python. In contrast, I've had
to work around the slowness of PCRE-style engines by forking off a
grep() or something similar practically every other month. I think
it'd be far more valuable for most programmers if Python moved toward
dropping the extended semantics to something one of the efficient
regex libraries (linked in a recent thread here on comp.lang.python)
could work with, and then added a parsing library to the standard
library for more complex jobs. Alternatively, if the additional
memory used isn't huge, we could consider having more intelligence in
the re compiler and having it choose between a smarter PCRE engine and
a faster regex engine based on the input. I'm playing with a patch
for the latter that I hope to get into a useful state for discussion
soon.

But regexes are one area where speed very often makes the difference
between whether they're usable or not, and that's far more often been
a limitation for me--and I'd think for most programmers--than any lack
in their current Python semantics. So I'd rather see that attacked
first.
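
(For the curious, the workaround I mean is literally shelling out,
something like the sketch below; the helper name is made up:

    import subprocess

    def grep_lines(pattern, path):
        # Hand the scan to grep -E, whose POSIX-ERE engine typically
        # runs in time linear in the input, at the cost of losing
        # backreferences and the other PCRE extensions.
        proc = subprocess.run(
            ["grep", "-E", "-e", pattern, path],
            stdout=subprocess.PIPE, text=True, check=False,
        )
        return proc.stdout.splitlines()

Clunky, but it sidesteps the backtracking engine entirely.)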
 
John Machin

| I don't know how PCRE was back then, but right now it supports UTF-8
| Unicode patterns and strings, and Unicode character properties. Maybe
| it could be reintroduced into Python?

"UTF-8 Unicode" is meaningless. Python has internal unicode string
objects, with comprehensive support for converting to/from str (8-bit)
string objects. The re module supports unicode patterns and strings.
PCRE "supports" patterns and strings which are encoded in UTF-8. This
is quite different, a kludge, incomparable. Operations which inspect/
modify UTF-8-encoded data are of interest only to folk who are
constrained to use a language which has nothing resembling a proper
unicode datatype.

| At least today, PCRE supports recursion and a recursion check,
| possessive quantifiers and once-only subpatterns (which disable
| backtracking in a subpattern), callouts (user functions to call at
| given points), and other interesting, powerful features.

The more features are put into a regular expression module, the more
difficult it is to maintain and the more the patterns look like line
noise.

There's also the YAGNI factor; most folk would restrict using regular
expressions to simple grep-like functionality and data validation --
e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
recognise yet another little language tend to reach for parsers, using
regular expressions only in the lexing phase.

If you really want to have PCRE functionality in Python, you have a
few options:
(1) create a wrapper for PCRE using e.g. SWIG or pyrex or hand-crafting
(2) write a PEP, get it agreed, and add the functionality to the re
module
(3) wait until someone does (1) or (2) for free
(4) fund someone to do (1) or (2)

HTH,
John
 
Wiseman

| I'm not sure what your skill level is, but I would suggest studying the
| code, starting in on a patch for one or more of these features, and then
| corresponding with the module's maintainers to improve your patch to the
| point where it can be accepted.

I'll consider creating a new PCRE module for Python that uses the
latest version of the PCRE library. It'll depend on my time
availability. I can write Python extensions, and though I haven't used
PCRE in a long time and recall it was a bit of a hassle, I could get
it done.
 
Wiseman

| -1 on this from me. In the past 10 years as a professional
| programmer, I've used the weird extended "regex" features maybe 5
| times total, whether it be in Perl or Python. In contrast, I've had
| to work around the slowness of PCRE-style engines by forking off a
| grep() or something similar practically every other month.

I use these complex features every month on my job, and performance is
rarely an issue, at least for our particular application of PCRE.

By the way, if you're concerned about performance, you should be
interested in once-only subpatterns.
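
They're exactly what saves you from pathological backtracking. A rough
sketch (the timing varies by machine; the point is the exponential
blowup):

    import re
    import timeit

    # Nested quantifiers force the backtracking engine to try an
    # exponential number of ways to split the a's once the match fails.
    subject = "a" * 22 + "b"
    t = timeit.timeit(lambda: re.match(r"(a+)+$", subject), number=1)
    print("%.2f s on a 23-char string" % t)  # roughly doubles per extra 'a'

    # A once-only subpattern makes the failure immediate; the
    # third-party regex module accepts the syntax:
    #   import regex
    #   regex.match(r"(?>a+)+$", subject)  # None, in microseconds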
 
Wiseman

"UTF-8 Unicode" is meaningless. Python has internal unicode string
objects, with comprehensive support for converting to/from str (8-bit)
string objects. The re module supports unicode patterns and strings.
PCRE "supports" patterns and strings which are encoded in UTF-8. This
is quite different, a kludge, incomparable. Operations which inspect/
modify UTF-8-encoded data are of interest only to folk who are
constrained to use a language which has nothing resembling a proper
unicode datatype.

Sure, I know that's mediocre Unicode support for an application, but
we're not talking about an application here. If I get the PCRE module
done, I'll just PyArg_ParseTuple(args, "et#", "utf-8", &str, &len),
which will be fine for Python's Unicode support and for what PCRE
does, and I won't have to deal with the string at all, so I couldn't
care less how it's encoded or whether I have proper Unicode support in
C. (I'm unsure of how Pyrex or SWIG would treat this, so I'll just
hand-craft it. It's not like it would be complex; most of the magic
will be pure C, dealing with PCRE's API.)

| There's also the YAGNI factor; most folk would restrict using regular
| expressions to simple grep-like functionality and data validation --
| e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
| recognise yet another little language tend to reach for parsers, using
| regular expressions only in the lexing phase.

Well, I find these features very useful. I've used a complex LALR
parser to parse complex grammars, but I've solved many problems with
just the PCRE lib. Either way, seeing that nobody's interested in
these features, I'll see if I can expose PCRE to Python myself; it
sounds like the fairest solution because it doesn't even touch the re
module: you can do whatever you want with it (though I'd rather have
it stay as it is or enhance it), and I'll still have PCRE. That's if I
find the time to do it, though, even having no life.
 
Klaas

| | There's also the YAGNI factor; most folk would restrict using regular
| | expressions to simple grep-like functionality and data validation --
| | e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
| | recognise yet another little language tend to reach for parsers, using
| | regular expressions only in the lexing phase.
|
| Well, I find these features very useful. I've used a complex LALR
| parser to parse complex grammars, but I've solved many problems with
| just the PCRE lib. Either way, seeing that nobody's interested in
| these features, I'll see if I can expose PCRE to Python myself; it
| sounds like the fairest solution because it doesn't even touch the re
| module: you can do whatever you want with it (though I'd rather have
| it stay as it is or enhance it), and I'll still have PCRE. That's if I
| find the time to do it, though, even having no life.

A polished wrapper for PCRE would be a great contribution to the
Python community. If it becomes popular, then the argument for
replacing the existing re engine becomes much stronger.

-Mike
 
John Machin

| | | There's also the YAGNI factor; most folk would restrict using regular
| | | expressions to simple grep-like functionality and data validation --
| | | e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
| | | recognise yet another little language tend to reach for parsers, using
| | | regular expressions only in the lexing phase.
| |
| | Well, I find these features very useful. I've used a complex LALR
| | parser to parse complex grammars, but I've solved many problems with
| | just the PCRE lib. Either way, seeing that nobody's interested in
| | these features, I'll see if I can expose PCRE to Python myself; it
| | sounds like the fairest solution because it doesn't even touch the re
| | module: you can do whatever you want with it (though I'd rather have
| | it stay as it is or enhance it), and I'll still have PCRE. That's if I
| | find the time to do it, though, even having no life.
|
| A polished wrapper for PCRE would be a great contribution to the
| Python community. If it becomes popular, then the argument for
| replacing the existing re engine becomes much stronger.
|
| -Mike

You seem to be overlooking my point that PCRE's unicode support isn't,
just like the Holy Roman Empire wasn't.
 
