regex help for a newbie

Marco Herrn

Hi,

I am not very familiar with regular expressions. So I hope someone can
help me to achieve what I want.

I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().
There are 3 levels of nesting. The first level is named
'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
I do not need to extract the third level at this moment, since I extract
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

I tried it with the following:

re.search(r"%\(.*\)", string).group()

But that returns:

'%(BBB%(CCC)BBB)aaa%(DDD)'

which is, of course, not what I want.
So what must the regex look like so that I get the two strings I need?


Marco
 
Diez B. Roggisch

Marco said:
I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().
There are 3 levels of nesting. The first level is named
'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
I do not need to extract the third level at this moment, since I extract
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).


Regexes aren't powerful enough for this - they are stateless, which means
that they have no way to count the number of open parentheses already found.
So you can't solve your problem with them.
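
To make the limitation concrete, here is a small sketch comparing the greedy pattern from the question with a non-greedy variant. The non-greedy pattern and the variable names are illustrative assumptions, not something posted in the thread. Neither pattern yields the two wanted parts, because neither one tracks nesting depth:

```python
import re

s = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

# Greedy: .* runs on to the LAST ")" in the string.
greedy = re.search(r"%\(.*\)", s).group()

# Non-greedy: .*? stops at the FIRST ")" after each "%(".
lazy = re.findall(r"%\(.*?\)", s)

print(greedy)  # %(BBB%(CCC)BBB)aaa%(DDD)
print(lazy)    # ['%(BBB%(CCC)', '%(DDD)']
```

The greedy match swallows both groups, and the non-greedy one cuts the first group off at the inner closing parenthesis.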

So what you need here is a parser that has state. You can either use one of
the existing parser frameworks (I personally use spark) or write one
yourself, as your problem is fairly easy:

def parse(input):
    res = ""
    level = 0
    for c in input:
        if c == "(":
            level += 1
        elif c == ")":
            level -= 1
        if level > 0 and c != "(":
            res += c
    return res
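
A minimal sketch of how the same depth counter could return the top-level %( ... ) parts as separate strings, which is what the original question asks for. The name extract_groups and the choice to raise ValueError on unbalanced input are assumptions for illustration. A bare "(" is only counted once a "%(" group is already open:

```python
def extract_groups(text):
    """Return every top-level %( ... ) group in text, delimiters included."""
    groups = []
    depth = 0      # number of parentheses currently open inside a group
    start = None   # index of the "%" that opened the current top-level group
    i = 0
    while i < len(text):
        if text.startswith("%(", i):
            if depth == 0:
                start = i
            depth += 1
            i += 2
            continue
        if depth > 0:
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    # back at the top level: the group is complete
                    groups.append(text[start:i + 1])
        i += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses in %r" % text)
    return groups


print(extract_groups("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"))
# ['%(BBB%(CCC)BBB)', '%(DDD)']
```

Calling the function again on the inside of each returned group would give the next nesting level.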
 
Marco Herrn

def parse(input):
    res = ""
    level = 0
    for c in input:
        if c == "(":
            level += 1
        elif c == ")":
            level -= 1
        if level > 0 and c != "(":
            res += c
    return res

Thanks, that helped a lot. I had to rewrite it a bit, but now it works.
Many Thanks.
Marco
 
marco

Marco Herrn said:
I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"
[snip]

the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Cheers,
 
Marco Herrn

Marco Herrn said:
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

I need to do this by real parsing. In fact the solution from Diez isn't
enough. I will have to write a much more flexible parser, as I realized.

Diez mentioned spark as a parser. I also found yappy, which is a parser
generator. I don't have much experience with parsers. What is the
difference between these two? When should one use one rather than the
other?

Marco
 
marco

Marco Herrn said:
Marco Herrn said:
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

Ah, that's clearer ;)

Does the "aaa"-type string really show up three times? Or is it actually:

"maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

If it's like you describe then maybe:

"aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("%(")[0])[1:-1]

helps (but I doubt it -- I guess you'll need a real parser :)

Cheers,
 
Marco Herrn

Marco Herrn said:
the parts in a recursive function. So the thing I want to achieve here
is to extract %(BBB%(CCC)BBB) and %(DDD).

p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

Doesn't help, since I do not know that there is the string "aaa". It was
just an example. I do not know any of the strings/characters. The only
thing I know is that a percent sign indicates that the content inside
the following parentheses is an expression that has to be evaluated.

Does the "aaa"-type string really show up three times? Or is it actually:

"maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

Yes, it is this. I just used the same strings to indicate the nesting
levels. All strings in this expression are arbitrary strings.

(but I doubt it -- I guess you'll need a real parser :)

Yes, I already realized that :)

Marco
 
F. Petitjean

Yes, it is this. I just used the same strings to indicate the nesting
levels. All strings in this expression are arbitrary strings.


Yes, I already realized that :)

Marco
A solution without any re or parser:
the basic idea is nesting; wrapping parsplit as a true recursive
function is left as an exercise to the reader.

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
#
# parparse.py
#
class NestingParenError(Exception):
    """Parens %( ) do not match"""

def parsplit(s, begin='%(', end=')'):
    """returns before, inside, after or s, None, None
    raises NestingParenError if begin, end pairs are not nested"""
    pbegin = s.find(begin)
    if pbegin == -1:
        return s, None, None
    before = s[:pbegin]
    # rfind picks the LAST end marker, so sibling groups are not separated
    pend = s.rfind(end)
    if pend == -1:
        raise NestingParenError("in '%s' '%s' found without matching '%s'" %
                                (s, begin, end))
    inside = s[pbegin+len(begin):pend]
    return before, inside, s[pend+len(end):]

def usage(s):
    """Typical use of parsplit"""
    before, inside, after = parsplit(s)
    if inside is None:
        print("'%s' has no %%( ) part" % (s,))
        return
    # process :
    print("before %s\ninside %s\nafter %s" % (before, inside, after))
    while inside:
        before, inside, after = parsplit(inside)
        # process :
        print("before %s\ninside %s\nafter %s" % (before, inside, after))

if __name__ == '__main__':
    # basic tests
    s1 = """aaaa a%(bbb bbb%(iiii) ccc)dddd"""
    print("nested case %s" % (s1,))
    usage(s1)
    print()
    print()
    usage("""0123before%()""")
    print()
    usage("""%(inside)""")
    print()
    usage("""%()after""")
    print()
    s2 = """without closing %( paren"""
    s3 = """without opening ) paren"""
    try:
        usage(s2)
    except NestingParenError as e:
        print(e)
    print()
    usage(s3)
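
One possible take on the exercise, as a sketch rather than the author's intended solution. The helper split_first and the generator walk are hypothetical names. Unlike parsplit, split_first looks for the end marker that matches the first begin marker by counting, so sibling groups such as %(DDD) are visited too. Only the begin and end markers are counted, so a bare "(" inside a group is not balanced here:

```python
def split_first(s, begin="%(", end=")"):
    """Return (before, inside, after) for the first balanced group,
    or (s, None, None) if s contains no begin marker."""
    start = s.find(begin)
    if start == -1:
        return s, None, None
    depth = 0
    i = start
    while i < len(s):
        if s.startswith(begin, i):
            depth += 1
            i += len(begin)
        elif s.startswith(end, i):
            depth -= 1
            if depth == 0:
                return s[:start], s[start + len(begin):i], s[i + len(end):]
            i += len(end)
        else:
            i += 1
    raise ValueError("%r opened without matching %r in %r" % (begin, end, s))


def walk(s, level=1):
    """Recursively yield (level, inside) for every group, outermost first."""
    before, inside, after = split_first(s)
    if inside is None:
        return
    yield level, inside
    yield from walk(inside, level + 1)  # groups nested inside this one
    yield from walk(after, level)       # sibling groups on the same level


for level, inside in walk("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"):
    print(level, inside)
```

For the example string this visits BBB%(CCC)BBB at level 1, CCC at level 2 and DDD at level 1.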

Hope that helps
Regards
 
Tobiah

I have the following string in my program:

string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

Now I need to extract the parts that are enclosed in %().


#!/usr/bin/python

### I realize that this will not serve you in all of the cases
### that you are likely to need to handle, but just to show
### that the case that you mention can be handled with regular
### expressions, I submit the following:

import re

string = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

m = re.search(r'([^%]*)%\(([^%]*)%\(([^)]*)\)([^)]*)\)([^)]+)%\(([^)]*)\)(.*)', string)

print(m.groups())


### This yields:
###
### ('aaa', 'BBB', 'CCC', 'BBB', 'aaa', 'DDD', 'aaa')
###
###
###
### Tobiah
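
As the comment in the script already concedes, the pattern is tied to this exact shape. A short sketch of that limitation, where other_shape is a made-up input with the two groups swapped:

```python
import re

# Same pattern as in the script above, written as a raw string.
pattern = re.compile(
    r'([^%]*)%\(([^%]*)%\(([^)]*)\)([^)]*)\)([^)]+)%\(([^)]*)\)(.*)')

same_shape = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"
other_shape = "aaa%(DDD)aaa%(BBB%(CCC)BBB)aaa"  # hypothetical: groups swapped

print(pattern.search(same_shape) is not None)   # True
print(pattern.search(other_shape) is not None)  # False
```

Each different nesting layout would need its own hand-written pattern, which is why the other replies suggest a parser.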
 
Diez B. Roggisch

I need to do this by real parsing. In fact the solution from Diez isn't
enough. I will have to write a much more flexible parser, as I realized.

Why not? If all you need is to extract that parenthesized structure, a
self-written parser should be the easiest. Consider this:

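import re

def parse(sg):
    # sg is a token generator; recursive calls keep consuming the same one
    res = []
    for c in sg:
        if c == "%(":
            res.append(parse(sg))
        elif c == ")":
            return res
        else:
            res.append(c)
    return res


def sgen(s):
    # split on "%(" and ")", keeping the delimiters as tokens
    rex = re.compile(r"(%\(|\))")
    for token in rex.split(s):
        yield token


print(parse(sgen("%(BBB%(CCC)BBB)")))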
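
The parser above accepts unbalanced input without complaint: a missing ")" simply ends the list, and a stray ")" at the top level drops the rest of the input. A sketch of a stricter variant follows. The name parse_strict, the use of ValueError, and the filtering of empty tokens are assumptions for illustration:

```python
import re

TOKEN = re.compile(r"(%\(|\))")


def parse_strict(text):
    """Build the same kind of nested list, but reject unbalanced input."""
    # One shared generator of tokens; empty strings from split() are dropped.
    tokens = (t for t in TOKEN.split(text) if t)

    def group(depth):
        res = []
        for tok in tokens:
            if tok == "%(":
                res.append(group(depth + 1))
            elif tok == ")":
                if depth == 0:
                    raise ValueError("unexpected ')' in %r" % text)
                return res
            else:
                res.append(tok)
        # Tokens ran out: fine at the top level, an error inside a group.
        if depth > 0:
            raise ValueError("missing ')' in %r" % text)
        return res

    return group(0)


print(parse_strict("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"))
# ['aaa', ['BBB', ['CCC'], 'BBB'], 'aaa', ['DDD'], 'aaa']
```

Each nested list corresponds to one %( ... ) group, so the two second-level parts from the original question are the list items at indexes 1 and 3.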
Diez mentioned spark as a parser. I also found yappy, which is a parser
generator. I don't have much experience with parsers. What is the
difference between these two? When should one use one rather than the
other?

yappy is an LR(1) parser, and spark is an Earley parser. Both of them are
suited to your problem.

I personally found spark easy to use, as it's very declarative - but I don't
know yappy, maybe that's cool, too.
 
