regular expression for nested parentheses


N

Noah Hoffman

I have been trying to write a regular expression that identifies a
block of text enclosed by (potentially nested) parentheses. I've found
solutions using other regular expression engines (for example, my text
editor, BBEdit, which uses the PCRE library), but have not been able
to replicate it using python's re module.

Here's a version that works using the PCRE syntax, along with the
python error message. I'm hoping for this to identify the string '(foo
(bar) (baz))'

% python -V
Python 2.5.1
% python
py> import re
py> text = 'buh (foo (bar) (baz)) blee'
py> no_ws = lambda s: ''.join(s.split())
py> rexp = r"""(?P<parens>
.... \(
.... (?>
.... (?> [^()]+ ) |
.... (?P>parens)
.... )*
.... \)
.... )"""
py> print re.findall(no_ws(rexp), text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/re.py", line 167, in findall
return _compile(pattern, flags).findall(string)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/re.py", line 233, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of pattern

From what I understand of the PCRE syntax, the (?>) construct is a non-
capturing subpattern, and (?P>parens) is a recursive call to the
enclosing (named) pattern. So my best guess at a python equivalent is
this:

py> rexp2 = r"""(?P<parens>
.... \(
.... (?=
.... (?= [^()]+ ) |
.... (?P=parens)
.... )*
.... \)
.... )"""
py> print re.findall(no_ws(rexp2), text)
[]

....which results in no match. I've played around quite a bit with
variations on this theme, but haven't been able to come up with one
that works.

Can anyone help me understand how to construct a regular expression
that does the job in python?

Thanks -
 
Ad

Advertisements

J

John Machin

I have been trying to write a regular expression that identifies a
block of text enclosed by (potentially nested) parentheses. I've found
solutions using other regular expression engines (for example, my text
editor, BBEdit, which uses the PCRE library), but have not been able
to replicate it using python's re module.

A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.

Looks like you need a parser; try pyparsing.

[snip]
py> rexp = r"""(?P<parens>
... \(
... (?>
... (?> [^()]+ ) |
... (?P>parens)
... )*
... \)
... )"""
py> print re.findall(no_ws(rexp), text)

Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?
 
N

Noah Hoffman

A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.

Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?

Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?
 
J

John Machin

A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.

Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?

Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?

Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""
 
M

MRAB

A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.
Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?

Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""

You can also escape a literal space:

pattern = r"""
foo
\ +
bar
"""
 
Ad

Advertisements

J

John Machin

A pattern that can validly be described as a "regular expression"
cannot count and thus can't match balanced parentheses. Some "RE"
engines provide a method of tagging a sub-pattern so that a match must
include balanced () (or [] or {}); Python's doesn't.
Okay, thanks for the clarification. So recursion is not possible using
python regular expressions?
Ummm ... even if Python's re engine did do what you want, wouldn't you
need flags=re.VERBOSE in there?
Ah, thanks for letting me know about that flag; but removing
whitespace as I did with the no_ws lambda expression should also work,
no?
Under a very limited definition of "work". That technique would not
produce correct answers on patterns that contain any *significant*
whitespace e.g. you want to match "foo" and "bar" separated by one or
more spaces (but not tabs, newlines etc) ....
pattern = r"""
foo
[ ]+
bar
"""

You can also escape a literal space:

pattern = r"""
foo
\ +
bar
"""

I know that. *Any* method of putting in a literal significant space is
clobbered by the OP's "trick" of removing *all* whitespace instead of
using the VERBOSE flag, which also permits comments:
pattern = r"""
\ + # ugly
[ ]+ # not quite so ugly
"""
 
Ad

Advertisements


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top