brackets content regular expression

Discussion in 'Python' started by netimen, Oct 31, 2008.

  1. netimen

    netimen Guest

    I have a text containing brackets (or what is the correct term for
    '>'?). I'd like to match text in the uppermost level of brackets.

    So, I have something like: 'aaaa 123 < 1 aaa < t bbb < a <tt > ff > > 2 >
    bbbbb'. How do I match the text between the uppermost brackets ( 1 aaa < t
    bbb < a <tt > ff > > 2 )?

    P.S. sorry for my english.
     
    netimen, Oct 31, 2008
    #1

  2. netimen

    Paul McGuire Guest

    On Oct 31, 12:25 pm, netimen <> wrote:
    > I have a text containing brackets (or what is the correct term for
    > '>'?). I'd like to match text in the uppermost level of brackets.
    >
    > So, I have sth like: 'aaaa 123 < 1 aaa < t bbb < a <tt  > ff > > 2 >
    > bbbbb'. How to match text between the uppermost brackets ( 1 aaa < t
    > bbb < a <tt  > ff > > 2 )?
    >
    > P.S. sorry for my english.


    To match opening and closing parens, delimiters, whatever (I refer to
    these '<>' as "angle brackets" when talking about them in this
    context, otherwise they are just "less than" and "greater than"), you
    will need some kind of stack-based parser. You can write your own
    without much trouble - there are built-ins in pyparsing that do most
    of the work.

    Here is the nestedExpr method:
    >>> from pyparsing import nestedExpr
    >>> print(nestedExpr('<','>').searchString('aaaa 123 < 1 aaa < t bbb < a <tt > ff > > 2 > bbbbb'))

    [[['1', 'aaa', ['t', 'bbb', ['a', ['tt'], 'ff']], '2']]]

    Note that the results show not the original nested text, but the
    parsed words in a fully nested structure.
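For comparison, a hand-written stack-based parser of the kind mentioned above might look roughly like the sketch below. This is not how pyparsing implements nestedExpr; the function name `parse_nested` and the whitespace-based tokenising are assumptions made purely for illustration.

```python
def parse_nested(text, opener="<", closer=">"):
    """Return one nested list per top-level bracket group (illustrative sketch)."""
    # Pad the delimiters so that a plain split() yields them as separate tokens.
    padded = text.replace(opener, f" {opener} ").replace(closer, f" {closer} ")
    results = []  # completed top-level groups
    stack = []    # one partially built list per bracket that is still open
    for tok in padded.split():
        if tok == opener:
            stack.append([])
        elif tok == closer:
            if not stack:
                continue  # stray closer outside any group: ignored in this sketch
            group = stack.pop()
            if stack:
                stack[-1].append(group)  # nested group goes into its parent
            else:
                results.append(group)    # a top-level group is finished
        elif stack:
            stack[-1].append(tok)        # ordinary word inside a group
        # words outside every bracket are skipped
    return results


sample = "aaaa 123 < 1 aaa < t bbb < a <tt > ff > > 2 > bbbbb"
print(parse_nested(sample))
# [['1', 'aaa', ['t', 'bbb', ['a', ['tt'], 'ff']], '2']]
```

The result has one level of list nesting fewer than the searchString output, because searchString additionally wraps each match in its own result list.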

    If all you want is the highest-level text, then you can wrap your
    nestedExpr parser inside a call to originalTextFor:

    >>> from pyparsing import originalTextFor
    >>> print(originalTextFor(nestedExpr('<','>')).searchString('aaaa 123 < 1 aaa < t bbb < a <tt > ff > > 2 > bbbbb'))

    [['< 1 aaa < t bbb < a <tt > ff > > 2 >']]

    More on pyparsing at http://pyparsing.wikispaces.com.

    -- Paul
     
    Paul McGuire, Oct 31, 2008
    #2

  3. netimen

    Matimus Guest

    On Oct 31, 10:25 am, netimen <> wrote:
    > I have a text containing brackets (or what is the correct term for
    > '>'?). I'd like to match text in the uppermost level of brackets.
    >
    > So, I have sth like: 'aaaa 123 < 1 aaa < t bbb < a <tt  > ff > > 2 >
    > bbbbb'. How to match text between the uppermost brackets ( 1 aaa < t
    > bbb < a <tt  > ff > > 2 )?
    >
    > P.S. sorry for my english.


    I think most people call them "angle brackets". Anyway, it should be
    easy to match just the outermost brackets:

    >>> import re
    >>> text = "aaaa 123 < 1 aaa < t bbb < a <tt > ff > > 2 >"
    >>> r = re.compile("<(.+)>")
    >>> m = r.search(text)
    >>> m.group(1)

    ' 1 aaa < t bbb < a <tt > ff > > 2 '

    In this case the regular expression is greedy by default, so it matches
    the largest span possible. Note, however, that it won't work if you have
    something like this: "<first> <second>".
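A small sketch of that limitation, using only the standard `re` module. The variable names and the two sample strings are made up for illustration. The greedy pattern over-matches across sibling groups, and switching to a non-greedy quantifier fixes the flat case but then breaks on nested input.

```python
import re

flat = "<first> <second>"
nested = "< a < b > c >"

greedy = re.compile(r"<(.+)>")   # .+ grabs as much as it can
lazy = re.compile(r"<(.+?)>")    # .+? stops at the first '>'

# Greedy runs from the first '<' to the last '>' and swallows both groups.
print(greedy.search(flat).group(1))   # first> <second

# Non-greedy handles the flat case ...
print(lazy.findall(flat))             # ['first', 'second']

# ... but cuts a nested group short at the first closing bracket.
print(lazy.findall(nested))           # [' a < b ']
```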

    Matt
     
    Matimus, Oct 31, 2008
    #3
  4. netimen

    netimen Guest

    Thanks, but what if I have several top-level groups and want to match
    them one by one:

    text = "a < b < c > d > here starts a new group: < e < f > g >"

    I want to match first " b < c > d " and then " e < f > g ", but not
    " b < c > d > here starts a new group: < e < f > g ".
     
    netimen, Oct 31, 2008
    #4
  5. netimen

    netimen Guest

    There may be different levels of nesting:

    "a < b < c > d > here starts a new group: < 1 < e < f > g > 2 >
    another group: < 3 >"

     
    netimen, Oct 31, 2008
    #5
  6. netimen

    Guest

    netimen:
    > Thank's but if i have several top-level groups and want them match one
    > by one:
    > text = "a < b < c > d > here starts a new group:  < e < f  > g >"


    What other requirements do you have? If you list them all at once
    people will write you the code faster.

    bye,
    Bearophile
     
    , Oct 31, 2008
    #6
  7. On Oct 31, 20:38, netimen <> wrote:
    > there may be different levels of nesting:
    >
    > "a < b < c > d > here starts a new group: < 1 < e < f  > g > 2 >
    > another group: < 3 >"

    Hi,

    Regular expressions or pyparsing might be overkill for this problem;
    you can use a simple algorithm that reads each character, increments a
    counter when it finds a < and decrements it when it finds a >. When the
    counter returns to its initial value, you have reached the end of a
    top-level group.

    Something like:

    def top_level(txt):
        level = 0
        start = None  # index of the '<' that opened the current top-level group
        groups = []
        for i, car in enumerate(txt):
            if car == "<":
                level += 1
                if start is None:  # 'not start' would misfire for a '<' at index 0
                    start = i
            elif car == ">":
                level -= 1
                if start is not None and level == 0:
                    groups.append(txt[start + 1:i])
                    start = None
        return groups

    print(top_level("a < b < 0 > d > < 1 < e < f > g > 2 > < 3 >"))

    >> [' b < 0 > d ', ' 1 < e < f > g > 2 ', ' 3 ']
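The counter approach silently tolerates unbalanced input. If that matters, a stricter variant could raise an error instead. The sketch below is one possible way to do it; the name `top_level_strict` and the choice of ValueError are assumptions, not part of the code above.

```python
def top_level_strict(txt):
    """Return the top-level <...> groups, raising ValueError on unbalanced brackets."""
    level = 0
    start = None  # index of the '<' that opened the current top-level group
    groups = []
    for i, ch in enumerate(txt):
        if ch == "<":
            if level == 0:
                start = i
            level += 1
        elif ch == ">":
            if level == 0:
                raise ValueError(f"unmatched '>' at index {i}")
            level -= 1
            if level == 0:
                groups.append(txt[start + 1:i])
    if level:
        raise ValueError(f"unclosed '<' at index {start}")
    return groups


print(top_level_strict("< a > b < c < d > >"))
# [' a ', ' c < d > ']
```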


    Best,
    Pierre
     
    Pierre Quentel, Oct 31, 2008
    #7
  8. netimen

    Matimus Guest

    On Oct 31, 11:57 am, netimen <> wrote:
    > Thank's but if i have several top-level groups and want them match one
    > by one:
    >
    > text = "a < b < c > d > here starts a new group:  < e < f  > g >"
    >
    > I want to match first " b < c > d " and then " e < f  > g " but not "
    > b < c > d > here starts a new group:  < e < f  > g "

    As far as I know, you can't do that with a regular expression (by
    definition, regular expressions aren't recursive). You can use a
    regular expression to aid you, but there is no magic expression that
    will give it to you for free.
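One way a regular expression can "aid" here, sketched under assumptions: let `re.finditer` locate the brackets and keep the depth bookkeeping in ordinary Python. The names `BRACKET` and `top_level_groups` are hypothetical, and stray closing brackets are simply ignored.

```python
import re

BRACKET = re.compile(r"[<>]")  # the regex only finds the delimiters


def top_level_groups(text):
    """Yield the text inside each outermost <...> pair."""
    depth = 0
    start = None  # position just after the '<' that opened the current group
    for m in BRACKET.finditer(text):
        if m.group() == "<":
            if depth == 0:
                start = m.end()
            depth += 1
        elif depth > 0:  # a '>' that closes something; stray closers are skipped
            depth -= 1
            if depth == 0:
                yield text[start:m.start()]


example = "a < b < c > d > here starts a new group: < e < f > g >"
print(list(top_level_groups(example)))
# [' b < c > d ', ' e < f > g ']
```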

    In this case it is actually pretty easy to do it without regular
    expressions at all:

    >>> text = "a < b < O > d > here starts a new group: < e < f > g >"
    >>> def get_nested_strings(text, depth=0):
    ...     stack = []  # indices of the '<' characters that are still open
    ...     for i, c in enumerate(text):
    ...         if c == '<':
    ...             stack.append(i)
    ...         elif c == '>':
    ...             start = stack.pop() + 1
    ...             if len(stack) == depth:
    ...                 yield text[start:i]
    ...
    >>> for seg in get_nested_strings(text):
    ...     print(seg)
    ...
     b < O > d
     e < f > g


    Matt
     
    Matimus, Oct 31, 2008
    #8
  9. netimen

    netimen Guest

    Yeah, I know it's quite simple to do manually. I was just interested
    if it could be done by regular expressions. Thank you anyway.
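On the regular-expression question itself: the standard `re` module has no recursion construct, but if the maximum nesting depth is known in advance, a pattern can be built by expanding the nesting a fixed number of times. The sketch below shows that idea. The helper name `bounded_nesting_pattern` is hypothetical, and input nested more deeply than the chosen limit will not be matched correctly.

```python
import re


def bounded_nesting_pattern(max_inner_depth):
    """Compile a regex for one <...> group whose content nests at most
    max_inner_depth further levels of brackets."""
    inner = r"[^<>]*"  # content with no brackets at all
    for _ in range(max_inner_depth):
        # Bracket-free text interleaved with groups of the previous depth.
        inner = r"[^<>]*(?:<" + inner + r">[^<>]*)*"
    return re.compile("<(" + inner + ")>")


text = ("a < b < c > d > here starts a new group: "
        "< 1 < e < f > g > 2 > another group: < 3 >")
print(bounded_nesting_pattern(2).findall(text))
# [' b < c > d ', ' 1 < e < f > g > 2 ', ' 3 ']
```

The pattern grows linearly with the depth limit, so this only suits input whose nesting is shallow and bounded. For arbitrary depth, the counter or stack approaches earlier in the thread are the simpler choice.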
     
    netimen, Nov 1, 2008
    #9