regex help for a newbie

Discussion in 'Python' started by Marco Herrn, Apr 5, 2004.

  1. Marco Herrn

    Marco Herrn Guest

    Hi,

    I am not very familiar with regular expressions. So I hope someone can
    help me to achieve what I want.

    I have the following string in my program:

    string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

    Now I need to extract the parts that are enclosed in %().
    There are 3 levels of nesting. The first level is named
    'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
    I do not need to extract the third level at this moment, since I extract
    the parts in a recursive function. So the thing I want to achieve here
    is to extract %(BBB%(CCC)BBB) and %(DDD).

    I tried it with the following:

    re.search("%\(.*\)", string).group()

    But that returns:

    '%(BBB%(CCC)BBB)aaa%(DDD)'

    which is, of course, not what I want.
    So what must the regex look like to get the two strings I need?


    Marco


    --
    Marco Herrn
    (GnuPG/PGP-signed and crypted mail preferred)
    Key ID: 0x94620736
     
    Marco Herrn, Apr 5, 2004
    #1

  2. Diez B. Roggisch

    Marco Herrn wrote:

    > I have the following string in my program:
    >
    > string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"
    >
    > Now I need to extract the parts that are enclosed in %().
    > There are 3 levels of nesting. The first level is named
    > 'aaa', the second 'BBB' and 'DDD' and the third 'CCC'.
    > I do not need to extract the third level at this moment, since I extract
    > the parts in a recursive function. So the thing I want to achieve here
    > is to extract %(BBB%(CCC)BBB) and %(DDD).



    Regexes aren't powerful enough for this. They are stateless, which means
    that they have no way to count the number of open parentheses already
    found, so you can't solve your problem with them.
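
To see the limitation concretely, here is a small sketch using only the standard `re` module and the string from the question. The variable names are made up for illustration. Neither the greedy nor the non-greedy quantifier tracks nesting depth, so neither returns the two wanted groups.

```python
import re

s = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

# Greedy: .* runs to the LAST ")" in the string, so both groups are merged.
greedy = re.search(r"%\(.*\)", s).group()

# Non-greedy: .*? stops at the FIRST ")", so the nested group is cut short.
lazy = re.findall(r"%\(.*?\)", s)

print(greedy)  # %(BBB%(CCC)BBB)aaa%(DDD)
print(lazy)    # ['%(BBB%(CCC)', '%(DDD)']
```

The wanted result, `%(BBB%(CCC)BBB)` and `%(DDD)`, lies between these two extremes, and reaching it needs a count of how many groups are currently open.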

    So what you need here is a parser that has state. You can either use one of
    the existing parser frameworks (I personally use spark) or write one
    yourself, as your problem is fairly easy:

    def parse(input):
        res = ""
        level = 0
        for c in input:
            if c == "(":
                level += 1
            elif c == ")":
                level -= 1
            if level > 0 and c != "(":
                res += c
        return res
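
The same depth-counting idea can be adapted so that it returns the outermost `%(...)` groups themselves, which is what the question asks for. This is only a sketch under stated assumptions: the name `top_level_groups` is hypothetical, only `%(` opens a group, and any `)` seen while a group is open closes it (so a plain `(...)` inside an expression would confuse it).

```python
def top_level_groups(text, begin="%(", end=")"):
    """Return the outermost begin...end spans found in text."""
    groups = []
    depth = 0      # number of currently open groups
    start = None   # index where the current outermost group began
    i = 0
    while i < len(text):
        if text.startswith(begin, i):
            if depth == 0:
                start = i
            depth += 1
            i += len(begin)
        elif depth > 0 and text.startswith(end, i):
            depth -= 1
            i += len(end)
            if depth == 0:
                groups.append(text[start:i])
        else:
            i += 1
    if depth != 0:
        raise ValueError("unbalanced %r in %r" % (begin, text))
    return groups


print(top_level_groups("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"))
# ['%(BBB%(CCC)BBB)', '%(DDD)']
```

Each returned group can then be stripped of its outer `%(` and `)` and passed back into the same function to reach the next nesting level.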

    --
    Regards,

    Diez B. Roggisch
     
    Diez B. Roggisch, Apr 5, 2004
    #2

  3. Marco Herrn

    Marco Herrn Guest

    On 2004-04-05, Diez B. Roggisch <> wrote:
    > def parse(input):
    >     res = ""
    >     level = 0
    >     for c in input:
    >         if c == "(":
    >             level += 1
    >         elif c == ")":
    >             level -= 1
    >         if level > 0 and c != "(":
    >             res += c
    >     return res


    Thanks, that helped a lot. I had to rewrite it a bit, but now it works.
    Many Thanks.
    Marco


    --
    Marco Herrn
    (GnuPG/PGP-signed and crypted mail preferred)
    Key ID: 0x94620736
     
    Marco Herrn, Apr 5, 2004
    #3
  4. marco

    marco Guest

    Marco Herrn <> writes:

    > I have the following string in my program:
    >
    > string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"


    [snip]

    > the parts in a recursive function. So the thing I want to achieve here
    > is to extract %(BBB%(CCC)BBB) and %(DDD).


    p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

    Cheers,

    --

    Gunnm: Broken Angel http://amv.reimeika.ca
    http://reimeika.ca/ http://photo.reimeika.ca
     
    marco, Apr 6, 2004
    #4
  5. Marco Herrn

    Marco Herrn Guest

    On 2004-04-06, marco <> wrote:
    > Marco Herrn <> writes:
    >> the parts in a recursive function. So the thing I want to achieve here
    >> is to extract %(BBB%(CCC)BBB) and %(DDD).

    >
    > p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]


    Doesn't help, since I do not know that there is the string "aaa". It was
    just an example. I do not know any of the strings/characters. The only
    thing I know is that a percent sign indicates that the content inside
    the following parentheses is an expression that has to be evaluated.

    I need to do this by real parsing. In fact the solution from Diez isn't
    enough. I will have to write a much more flexible parser, as I realized.

    Diez mentioned spark as a parser. I also found yappy, which is a parser
    generator. I have not much experience with parsers. What is the
    difference between these two? When should one use the one, when the
    other?

    Marco
    --
    Marco Herrn
    (GnuPG/PGP-signed and crypted mail preferred)
    Key ID: 0x94620736
     
    Marco Herrn, Apr 6, 2004
    #5
  6. marco

    marco Guest

    Marco Herrn <> writes:

    > On 2004-04-06, marco <> wrote:
    > > Marco Herrn <> writes:
    > >> the parts in a recursive function. So the thing I want to achieve here
    > >> is to extract %(BBB%(CCC)BBB) and %(DDD).

    > >
    > > p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

    >
    > Doesn't help, since I do not know that there is the string "aaa". It was
    > just an example. I do not know any of the strings/characters. The only
    > thing I know is that a percent sign indicates that the content inside
    > the following parentheses is an expression that has to be evaluated.


    Ah, that's clearer ;)

    Does the "aaa"-type string really show up three times? Or is it actually:

    "maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

    If it's like you describe then maybe:

    "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("%(")[0])[1:-1]

    helps (but I doubt it -- I guess you'll need a real parser :)

    Cheers,

    --

    Gunnm: Broken Angel http://amv.reimeika.ca
    http://reimeika.ca/ http://photo.reimeika.ca
     
    marco, Apr 6, 2004
    #6
  7. Marco Herrn

    Marco Herrn Guest

    On 2004-04-06, marco <> wrote:
    > Marco Herrn <> writes:
    >
    >> On 2004-04-06, marco <> wrote:
    >> > Marco Herrn <> writes:
    >> >> the parts in a recursive function. So the thing I want to achieve here
    >> >> is to extract %(BBB%(CCC)BBB) and %(DDD).
    >> >
    >> > p1, p2 = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa".split("aaa")[1:-1]

    >>
    >> Doesn't help, since I do not know that there is the string "aaa". It was
    >> just an example. I do not know any of the strings/characters. The only
    >> thing I know is that a percent sign indicates that the content inside
    >> the following parentheses is an expression that has to be evaluated.

    >
    > Does the "aaa"-type string really show up three times? Or is it actually:
    >
    > "maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"


    Yes, it is this. I just used the same strings to indicate the nesting
    levels. All strings in this expression are arbitrary strings.

    > (but I doubt it -- I guess you'll need a real parser :)


    Yes, I already realized that :)

    Marco

    --
    Marco Herrn
    (GnuPG/PGP-signed and crypted mail preferred)
    Key ID: 0x94620736
     
    Marco Herrn, Apr 6, 2004
    #7
  8. F. Petitjean

    F. Petitjean Guest

    On 6 Apr 2004 22:38:24 GMT, Marco Herrn <> wrote:
    > On 2004-04-06, marco <> wrote:
    >> Marco Herrn <> writes:
    >>
    >>> On 2004-04-06, marco <> wrote:
    >>> > Marco Herrn <> writes:
    >>> >> the parts in a recursive function. So the thing I want to achieve here
    >>> >> is to extract %(BBB%(CCC)BBB) and %(DDD).
    >>> >

    >>
    >> Does the "aaa"-type string really show up three times? Or is it actually:
    >>
    >> "maybeeggs%(BBB%(CCC)BBB)maybeham%(DDD)maybespam"

    >
    > Yes, it is this. I just used the same strings to indicate the nesting
    > levels. All strings in this expression are arbitrary strings.
    >
    >> (but I doubt it -- I guess you'll need a real parser :)

    >
    > Yes, I already realized that :)
    >
    > Marco
    >

    A solution without any re or parser:
    the basic idea is nesting. Wrapping parsplit as a true recursive
    function is left as an exercise to the reader.

    #! /usr/bin/env python
    # -*- coding: iso-8859-1 -*-
    #
    # parparse.py
    #
    class NestingParenError(Exception):
        """Parens %( ) do not match"""

    def parsplit(s, begin='%(', end=')'):
        """returns before, inside, after or s, None, None
        raises NestingParenError if begin, end pairs are not nested"""
        pbegin = s.find(begin)          # first opening marker
        if pbegin == -1:
            return s, None, None
        before = s[:pbegin]
        pend = s.rfind(end)             # last closing marker
        if pend == -1:
            raise NestingParenError("in '%s' '%s' found without matching '%s'" %
                                    (s, begin, end))
        inside = s[pbegin+len(begin):pend]
        return before, inside, s[pend+len(end):]

    def usage(s):
        """Typical use of parsplit"""
        before, inside, after = parsplit(s)
        if inside is None:
            print("'%s' has no %%( ) part" % (s,))
            return
        # process :
        print("before %s\ninside %s\nafter %s" % (before, inside, after))
        while inside:
            before, inside, after = parsplit(inside)
            # process :
            print("before %s\ninside %s\nafter %s" % (before, inside, after))

    if __name__ == '__main__':
        # basic tests
        s1 = """aaaa a%(bbb bbb%(iiii) ccc)dddd"""
        print("nested case %s" % (s1,))
        usage(s1)
        print()
        print()
        usage("""0123before%()""")
        print()
        usage("""%(inside)""")
        print()
        usage("""%()after""")
        print()
        s2 = """without closing %( paren"""
        s3 = """without opening ) paren"""
        try:
            usage(s2)
        except NestingParenError as e:
            print(e)
        print()
        usage(s3)
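
One possible shape for the recursive wrapper left as an exercise is sketched below. It is an illustration, not the poster's own solution: `nested_parts` is a hypothetical name, and `parsplit` is repeated in compact form (raising `ValueError` instead of `NestingParenError`) only so that the block runs on its own.

```python
def parsplit(s, begin="%(", end=")"):
    """Compact copy of parsplit: split s at the first begin and the last end."""
    pbegin = s.find(begin)
    if pbegin == -1:
        return s, None, None
    pend = s.rfind(end)
    if pend == -1:
        raise ValueError("%r found without matching %r in %r" % (begin, end, s))
    return s[:pbegin], s[pbegin + len(begin):pend], s[pend + len(end):]


def nested_parts(s):
    """Recursively split s into [before, [nested inside], after]."""
    before, inside, after = parsplit(s)
    if inside is None:
        return [before]           # no %( ) part left: plain text
    return [before, nested_parts(inside), after]


print(nested_parts("aaaa a%(bbb bbb%(iiii) ccc)dddd"))
# ['aaaa a', ['bbb bbb', ['iiii'], ' ccc'], 'dddd']
```

Because `parsplit` pairs the first `%(` with the last `)`, this handles only a single chain of nested groups. Sibling groups on the same level, such as `%(BBB)` and `%(DDD)` in the original question, are not separated.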

    Hope that helps
    Regards
     
    F. Petitjean, Apr 8, 2004
    #8
  9. Tobiah

    Tobiah Guest

    >
    > I have the following string in my program:
    >
    > string= "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"
    >
    > Now I need to extract the parts that are enclosed in %().



    #!/usr/bin/python

    ### I realize that this will not serve you in all of the cases
    ### that you are likely to need to handle, but just to show
    ### that the case that you mention can be handled with regular
    ### expressions, I submit the following:

    import re

    string = "aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa"

    m = re.search(r'([^%]*)%\(([^%]*)%\(([^)]*)\)([^)]*)\)([^)]+)%\(([^)]*)\)(.*)', string)

    print(m.groups())


    ### This yields:
    ###
    ### ('aaa', 'BBB', 'CCC', 'BBB', 'aaa', 'DDD', 'aaa')
    ###
    ###
    ###
    ### Tobiah
     
    Tobiah, Apr 9, 2004
    #9
  10. Diez B. Roggisch

    > I need to do this by real parsing. In fact the solution from Diez isn't
    > enough. I will have to write a much more flexible parser, as I realized.


    Why not? If all you need is to extract that parenthesized structure, a
    self-written parser should be the easiest. Consider this:

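    import re

    def parse(sg):
        # sg is a token generator shared by all recursion levels
        res = []
        for c in sg:
            if c == "%(":
                res.append(parse(sg))   # nested group becomes a sub-list
            elif c == ")":
                return res
            else:
                res.append(c)
        return res


    def sgen(s):
        # split on "%(" and ")" and keep the delimiters as tokens
        rex = re.compile(r"(%\(|\))")
        for token in rex.split(s):
            yield token


    print(parse(sgen("%(BBB%(CCC)BBB)")))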
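
Since the goal stated earlier in the thread is to evaluate each `%(...)` expression, the nested list can be walked innermost-first. The sketch below is an assumption about how that could look, not part of the original post: `expand` and the `evaluate` callback are hypothetical names, and the lambda merely lowercases and brackets the text to make the evaluation order visible. The tokenizer and parser are repeated so that the block runs on its own; unlike the version above, it drops the empty strings that `re.split` produces.

```python
import re

TOKEN = re.compile(r"(%\(|\))")


def parse(tokens):
    """Build a nested list from an iterator of tokens."""
    res = []
    for tok in tokens:
        if tok == "%(":
            res.append(parse(tokens))   # recurse; shares the same iterator
        elif tok == ")":
            return res
        elif tok:                       # skip empty strings from re.split
            res.append(tok)
    return res


def expand(tree, evaluate):
    """Evaluate innermost groups first and splice the results into the text."""
    out = []
    for node in tree:
        if isinstance(node, list):
            out.append(evaluate(expand(node, evaluate)))
        else:
            out.append(node)
    return "".join(out)


tree = parse(iter(TOKEN.split("aaa%(BBB%(CCC)BBB)aaa%(DDD)aaa")))
print(tree)
# ['aaa', ['BBB', ['CCC'], 'BBB'], 'aaa', ['DDD'], 'aaa']
print(expand(tree, lambda expr: "<" + expr.lower() + ">"))
# aaa<bbb<ccc>bbb>aaa<ddd>aaa
```

Note that this sketch does not report unbalanced input: a stray `)` at the top level simply ends parsing early.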

    >
    > Diez mentioned spark as a parser. I also found yappy, which is a parser
    > generator. I have not much experience with parsers. What is the
    > difference between these two? When should one use the one, when the
    > other?


    yappy is an LR(1) parser, and spark is an Earley parser. Both of them are
    suited to your problem.

    I personally found spark easy to use, as it's very declarative, but I don't
    know yappy; maybe that's cool, too.

    --
    Regards,

    Diez B. Roggisch
     
    Diez B. Roggisch, Apr 9, 2004
    #10
