Recursive regexps?

Magnus Lie Hetland · Nov 19, 2004

Hi!

I've been looking at ways of dealing with nested structures in regexps
(becuase I figured that would be faster than the Python parsing code
I've currently got) and came across a few interesting things...

First there is this, mentioning the (?<DEPTH>) and (?<-DEPTH>)
constructs of the .NET regexp matcher:

http://www.rassoc.com/gregr/weblog/archive.aspx?post=590

Then there was some documentation on PCRE, mentioning (among other
things) The (?R) construct, which (as far as I can tell) matches the
pattern recursively):

http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=pcrepattern

It seems that this functionality was introduced in Perl 5.6.

Is this (or something similar) something that would be realistic to
get in the re module? I know I would find it very useful...

Magnus Lie Hetland · Nov 19, 2004

It seems there are a *couple* of (experimental) PCRE features that
allow recursive matching:

[T]here is some experimental support for recursive patterns using
the non-Perl items (?R), (?number) and (?P>name). Also, the PCRE
"callout" feature allows an external function to be called during
pattern matching.

[http://www.slabihoud.de/spampal/pcrecompat.html]

All of these seem pretty useful/neat :-}

Paul McGuire · Nov 19, 2004

Magnus Lie Hetland said:
It seems there are a *couple* of (experimental) PCRE features that
allow recursive matching:

[T]here is some experimental support for recursive patterns using
the non-Perl items (?R), (?number) and (?P>name). Also, the PCRE
"callout" feature allows an external function to be called during
pattern matching.

[http://www.slabihoud.de/spampal/pcrecompat.html]

All of these seem pretty useful/neat :-}

For those who find regexp strings to be fairly opaque, pyparsing supports
both of these features.
- Recursive grammars are defined using the Forward() parsing class. See
this snippet from the fourFn.py infix notation parser:

addop = plus | minus
multop = mult | div
pi = CaselessLiteral( "PI" )

expr = Forward()
factor = ( pi | e | fnumber ).setParseAction( pushFirst ) | ( "(" +
expr.suppress() + ")" )
term = factor + ZeroOrMore( ( multop + factor ).setParseAction(
pushFirst ) )
expr << term + ZeroOrMore( ( addop + term ).setParseAction(
pushFirst ) )

expr is defined to be a Forward() instance, and then the recursive contents
are inserted using the "<<" operator.

- Note also the calls to setParseAction(). These represent callouts to
external functions to be invoked during the parsing process. These external
functions can modify the matched text, update global data structures, or
even supersede the matching rules and raise a ParseException, thereby
cancelling the match. In this case, the expression elements are being
pushed onto an expression stack, to be recursively evaluated after parsing
is complete.

-- Paul
Download pyparsing at http://pyparsing.sourceforge.net.

James Stroud · Nov 19, 2004

This may or may not be topical to the original post, but I think this would be
cool in python:
('abc 123 123 abc', ('123 ', '456 '))

(or similar), instead of
('abc 123 123 abc', '123 ')

I dont have a CS degree, but I'm pretty sure that computers can do this sort
of thing. It must not be obvious. I'm thinking of patenting the idea (regex
engine returns nested matches) and copyrighting the name for it.

James

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
611 Charles E. Young Dr. S.
MBI 205, UCLA 951570
Los Angeles CA 90095-1570
http://www.jamesstroud.com/

Jeremy Bowers · Nov 20, 2004

I dont have a CS degree, but I'm pretty sure that computers can do this sort
of thing. It must not be obvious. I'm thinking of patenting the idea (regex
engine returns nested matches) and copyrighting the name for it.

The only possible name is Recursive Descent Regular Expressions.

(CS joke. There are three basic language domains, based on parser power
necessary to recognize it. "Regular Expressions" is the lowest, but the
actual 'regular expression' the math refers to is much, much weaker the re
library, which nowadays is technically on the top level I think since you
can shell out to functions in the middle of an RE operation. A "Recursive
descent parser" is a parser for a subset of the middle class of languages,
the "context-free languages". "Unrestricted grammars" are the top, parsed
by Turing Machines.

Of course, it would be even more accurate to call it "Recursive Descent
Regular Expressions with a side of Turing Completeness" to mix all three
levels, but that's not so catchy for a brand name. Ought to make the
patent office swoon, though.)

Replacing bidirectional unicode strings/regexps	0	Jan 15, 2006
Can anyone write this recursion for simple regexp more beautifullyand clearly than the braggarts	157	Aug 29, 2009
Announcing Reg 0.4.0	21	Apr 24, 2005
"my" variables and recursive regexp strangeness	4	May 13, 2004
Treetop parser (or PEG in general?) questions	18	Jan 28, 2008
Request for Feedback; a module making it easier to use regular expressions.	1	Jan 31, 2005
Learning C++ Memory Management	15	Feb 12, 2009
pre-PEP: Object-oriented file module	2	Aug 25, 2005

Recursive regexps?

Magnus Lie Hetland

Magnus Lie Hetland

Paul McGuire

James Stroud

Jeremy Bowers

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads