Recursive regexps?

M

Magnus Lie Hetland

Hi!

I've been looking at ways of dealing with nested structures in regexps
(becuase I figured that would be faster than the Python parsing code
I've currently got) and came across a few interesting things...

First there is this, mentioning the (?<DEPTH>) and (?<-DEPTH>)
constructs of the .NET regexp matcher:

http://www.rassoc.com/gregr/weblog/archive.aspx?post=590

Then there was some documentation on PCRE, mentioning (among other
things) The (?R) construct, which (as far as I can tell) matches the
pattern recursively):

http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=pcrepattern

It seems that this functionality was introduced in Perl 5.6.

Is this (or something similar) something that would be realistic to
get in the re module? I know I would find it very useful...
 
M

Magnus Lie Hetland

It seems there are a *couple* of (experimental) PCRE features that
allow recursive matching:

[T]here is some experimental support for recursive patterns using
the non-Perl items (?R), (?number) and (?P>name). Also, the PCRE
"callout" feature allows an external function to be called during
pattern matching.

[http://www.slabihoud.de/spampal/pcrecompat.html]

All of these seem pretty useful/neat :-}
 
P

Paul McGuire

Magnus Lie Hetland said:
It seems there are a *couple* of (experimental) PCRE features that
allow recursive matching:

[T]here is some experimental support for recursive patterns using
the non-Perl items (?R), (?number) and (?P>name). Also, the PCRE
"callout" feature allows an external function to be called during
pattern matching.

[http://www.slabihoud.de/spampal/pcrecompat.html]

All of these seem pretty useful/neat :-}

For those who find regexp strings to be fairly opaque, pyparsing supports
both of these features.
- Recursive grammars are defined using the Forward() parsing class. See
this snippet from the fourFn.py infix notation parser:

addop = plus | minus
multop = mult | div
pi = CaselessLiteral( "PI" )

expr = Forward()
factor = ( pi | e | fnumber ).setParseAction( pushFirst ) | ( "(" +
expr.suppress() + ")" )
term = factor + ZeroOrMore( ( multop + factor ).setParseAction(
pushFirst ) )
expr << term + ZeroOrMore( ( addop + term ).setParseAction(
pushFirst ) )

expr is defined to be a Forward() instance, and then the recursive contents
are inserted using the "<<" operator.

- Note also the calls to setParseAction(). These represent callouts to
external functions to be invoked during the parsing process. These external
functions can modify the matched text, update global data structures, or
even supersede the matching rules and raise a ParseException, thereby
cancelling the match. In this case, the expression elements are being
pushed onto an expression stack, to be recursively evaluated after parsing
is complete.

-- Paul
Download pyparsing at http://pyparsing.sourceforge.net.
 
J

James Stroud

This may or may not be topical to the original post, but I think this would be
cool in python:
('abc 123 123 abc', ('123 ', '456 '))

(or similar), instead of
('abc 123 123 abc', '123 ')


I dont have a CS degree, but I'm pretty sure that computers can do this sort
of thing. It must not be obvious. I'm thinking of patenting the idea (regex
engine returns nested matches) and copyrighting the name for it.

James

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
611 Charles E. Young Dr. S.
MBI 205, UCLA 951570
Los Angeles CA 90095-1570
http://www.jamesstroud.com/
 
J

Jeremy Bowers

I dont have a CS degree, but I'm pretty sure that computers can do this sort
of thing. It must not be obvious. I'm thinking of patenting the idea (regex
engine returns nested matches) and copyrighting the name for it.

The only possible name is Recursive Descent Regular Expressions.

(CS joke. There are three basic language domains, based on parser power
necessary to recognize it. "Regular Expressions" is the lowest, but the
actual 'regular expression' the math refers to is much, much weaker the re
library, which nowadays is technically on the top level I think since you
can shell out to functions in the middle of an RE operation. A "Recursive
descent parser" is a parser for a subset of the middle class of languages,
the "context-free languages". "Unrestricted grammars" are the top, parsed
by Turing Machines.

Of course, it would be even more accurate to call it "Recursive Descent
Regular Expressions with a side of Turing Completeness" to mix all three
levels, but that's not so catchy for a brand name. Ought to make the
patent office swoon, though.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top