regex help for a newbie

Marco Herrn · Apr 5, 2004

Hi,

I am not very familiar with regular expressions. So I hope someone can
help me to achieve what I want.

I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().
There are 3 levels of nesting. The first level is named
'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
I do not need to extract the third level at this moment, since I extract
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

I tried it with the following:

re.search(r"%\(.*\)", string).group()

But that returns:

%(BBB%(CCC)BBB)aaa%(DDD)

which is, of course, not what I want.
So what must the regex look like so that I get the two strings I need?

Marco

Diez B. Roggisch · Apr 5, 2004

Marco said:
I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().
There are 3 levels of nesting. The first level is named
'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
I do not need to extract the third level at this moment, since I extract
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

Regexes aren't powerful enough for this: they are stateless, which means
that they have no way to count the number of open parentheses already found,
so you can't solve your problem with them.
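
To illustrate the point, here is a minimal sketch (not from the original post) of what the two obvious patterns return on the example string. Neither the greedy nor the non-greedy form can tell which closing parenthesis belongs to which opening `%(`:

```python
import re

s = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

# Greedy: .* runs on to the last ")" in the whole string.
greedy = re.search(r"%\(.*\)", s).group()

# Non-greedy: .*? stops at the first ")", cutting the nested group short.
lazy = re.findall(r"%\(.*?\)", s)

print(greedy)  # %(BBB%(CCC)BBB)aaa%(DDD)
print(lazy)    # ['%(BBB%(CCC)', '%(DDD)']
```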

So what you need here is a parser that has state. You can either use one of
the existing parser frameworks (I personally use spark) or write it
yourself, as your problem is fairly simple:

def parse(input):
    res = ""
    level = 0
    for c in input:
        if c == "(":
            level += 1
        elif c == ")":
            level -= 1
        if level > 0 and c != "(":
            res += c
    return res
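
The same depth-counting idea can be adapted to return the outermost `%(...)` groups that the question asks for. This is only a sketch under assumptions: the name `top_level_groups` and its parameters are made up for illustration, only `%(` counts as an opener (so a bare `(` inside a group is not tracked), and unclosed groups are silently dropped.

```python
def top_level_groups(text, opener="%(", closer=")"):
    """Return the outermost opener...closer spans in text, delimiters included."""
    groups = []
    depth = 0       # how many openers are currently unclosed
    start = None    # index where the current outermost group began
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            if depth == 0:
                start = i
            depth += 1
            i += len(opener)
        elif text.startswith(closer, i) and depth > 0:
            depth -= 1
            i += len(closer)
            if depth == 0:
                groups.append(text[start:i])
        else:
            i += 1
    return groups


print(top_level_groups("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"))
# ['%(BBB%(CCC)BBB)', '%(DDD)']
```

Each returned group could then be stripped of its delimiters and passed back into the same function to reach the next nesting level.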

Marco Herrn · Apr 5, 2004

Thanks, that helped a lot. I had to rewrite it a bit, but now it works.
Many Thanks.
Marco

marco · Apr 6, 2004

Marco Herrn said:
I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"
[snip]

the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Cheers,

Marco Herrn · Apr 6, 2004

Marco Herrn said:

the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

I need to do this by real parsing. In fact the solution from Diez isn't
enough. I will have to write a much more flexible parser, as I realized.
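
As a rough sketch of what such a parser might look like for the rule just described: the function below replaces every `%(...)` with the result of evaluating its content, innermost first. The names `expand` and `evaluate` are hypothetical; `evaluate` stands in for whatever the real program does with an expression, and a `)` is assumed always to close the nearest `%(`.

```python
def expand(text, evaluate):
    """Replace each %(...) in text with evaluate(inner), innermost first."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("%(", i):
            # Scan forward to the matching ")" by counting nesting depth.
            depth, j = 1, i + 2
            while j < len(text) and depth:
                if text.startswith("%(", j):
                    depth += 1
                    j += 2
                elif text[j] == ")":
                    depth -= 1
                    j += 1
                else:
                    j += 1
            if depth:
                raise ValueError("unbalanced %( in " + repr(text))
            inner = text[i + 2:j - 1]
            # Expand nested expressions before evaluating this one.
            out.append(evaluate(expand(inner, evaluate)))
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Stand-in evaluator that just marks what was evaluated.
print(expand("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa", lambda e: "<" + e + ">"))
# aaa<BBB<CCC>BBB>aaa<DDD>aaa
```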

Diez mentioned spark as a parser. I also found yappy, which is a parser
generator. I do not have much experience with parsers. What is the
difference between these two? When should one use one rather than the
other?

Marco

marco · Apr 6, 2004

Marco Herrn said:

the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

Ah, that's clearer.

Does the "aaa"-type string really show up three times? Or is it actually:

"maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

If it's like you describe then maybe:

"aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("%(")[0])[1:-1]

helps (but I doubt it -- I guess you'll need a real parser).

Cheers,

Marco Herrn · Apr 6, 2004

Marco Herrn said:

the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

Does the "aaa"-type string really show up three times? Or is it actually:

"maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

Yes, it is this. I just used the same strings to indicate the nesting
levels. All strings in this expression are arbitrary strings.

(but I doubt it -- I guess you'll need a real parser)

Yes, I already realized that.

Marco

F. Petitjean · Apr 8, 2004

A solution without any re or parser:
the basic idea is nesting; wrapping parsplit as a true recursive
function is left as an exercise to the reader.

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
#
# parparse.py
#
class NestingParenError(Exception):
    """Parens %( ) do not match"""

def parsplit(s, begin='%(', end=')'):
    """returns before, inside, after or s, None, None
    raises NestingParenError if begin, end pairs are not nested"""
    pbegin = s.find(begin)
    if pbegin == -1:
        return s, None, None
    before = s[:pbegin]
    # rfind pairs the first begin with the LAST end in s
    pend = s.rfind(end)
    if pend == -1:
        raise NestingParenError("in '%s' '%s' found without matching '%s'" %
                                (s, begin, end))
    inside = s[pbegin+len(begin):pend]
    return before, inside, s[pend+len(end):]

def usage(s):
    """Typical use of parsplit"""
    before, inside, after = parsplit(s)
    if inside is None:
        print("'%s' has no %%( ) part" % (s,))
        return
    # process :
    print("before %s\ninside %s\nafter %s" % (before, inside, after))
    while inside:
        before, inside, after = parsplit(inside)
        # process :
        print("before %s\ninside %s\nafter %s" % (before, inside, after))

if __name__ == '__main__':
    # basic tests
    s1 = """aaaa a%(bbb bbb%(iiii) ccc)dddd"""
    print("nested case %s" % (s1,))
    usage(s1)
    print("")
    print("")
    usage("""0123before%()""")
    print("")
    usage("""%(inside)""")
    print("")
    usage("""%()after""")
    print("")
    s2 = """without closing %( paren"""
    s3 = """without opening ) paren"""
    try:
        usage(s2)
    except NestingParenError as e:
        print(e)
    print("")
    usage(s3)

Hope that helps
Regards
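
One possible take on the exercise mentioned above, as a sketch only: `nest` is a hypothetical name, and the block carries a compact copy of `parsplit` (raising `ValueError` instead of `NestingParenError`) so that it runs on its own. Because `parsplit` pairs the first `%(` with the last `)`, this handles purely nested input; sibling groups such as `%(A)x%(B)` would be mis-split.

```python
def parsplit(s, begin="%(", end=")"):
    """Compact copy of parsplit from the post, so this block runs alone."""
    pbegin = s.find(begin)
    if pbegin == -1:
        return s, None, None
    pend = s.rfind(end)
    if pend == -1:
        raise ValueError("%r found without matching %r in %r" % (begin, end, s))
    return s[:pbegin], s[pbegin + len(begin):pend], s[pend + len(end):]


def nest(s):
    """Recursively split s into [before, nested inside, after]."""
    before, inside, after = parsplit(s)
    if inside is None:
        return before  # no %( ) part left: return the plain string
    return [before, nest(inside), after]


print(nest("aaaa a%(bbb bbb%(iiii) ccc)dddd"))
# ['aaaa a', ['bbb bbb', 'iiii', ' ccc'], 'dddd']
```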

Tobiah · Apr 9, 2004

I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().

#!/usr/bin/python

### I realize that this will not serve you in all of the cases
### that you are likely to need to handle, but just to show
### that the case that you mention can be handled with regular
### expressions, I submit the following:

import re

string = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

m = re.search(r'([^%]*)%\(([^%]*)%\(([^)]*)\)([^)]*)\)([^)]+)%\(([^)]*)\)(.*)', string)

print (m.groups())

### This yields:
###
### ('aaa', 'BBB', 'CCC', 'BBB', 'aaa', 'DDD', 'aaa')
###
###
###
### Tobiah

Diez B. Roggisch · Apr 9, 2004

I need to do this by real parsing. In fact the solution from Diez isn't
enough. I will have to write a much more flexible parser, as I realized.

Why not? If all you need is to extract that parenthesized structure, a
self-written parser should be the easiest. Consider this:

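import re

def parse(sg):
    # sg is a shared token generator; recursive calls keep consuming it
    res = []
    for c in sg:
        if c == "%(":
            res.append(parse(sg))
        elif c == ")":
            return res
        else:
            res.append(c)
    return res

def sgen(s):
    # the capturing group makes split() keep the delimiters as tokens
    rex = re.compile(r"(%\(|\))")
    for token in rex.split(s):
        yield token

print(parse(sgen("%(BBB%(CCC)BBB)")))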
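
For reference, a small sketch of the token stream that the split in `sgen` produces for that input. Because the pattern is wrapped in a capturing group, `re.split` keeps the delimiters in the result, and empty strings appear where the input starts or ends with a delimiter:

```python
import re

# Same pattern as in sgen above, applied directly.
tokens = re.split(r"(%\(|\))", "%(BBB%(CCC)BBB)")
print(tokens)
# ['', '%(', 'BBB', '%(', 'CCC', ')', 'BBB', ')', '']
```

The empty strings end up in the parsed list as well, so a caller may want to filter them out.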

Diez mentioned spark as a parser. I also found yappy, which is a parser
generator. I do not have much experience with parsers. What is the
difference between these two? When should one use one rather than the
other?

yappy is an LR(1) parser, and spark is an Earley parser. Both of them are
suited for your problem.

I personally found spark easy to use, as it's very declarative, but I don't
know yappy; maybe that's cool, too.
