Match First Sequence in Regular Expression?

Discussion in 'Python' started by Roger L. Cauvin, Jan 26, 2006.

  1. Say I have some string that begins with an arbitrary sequence of characters
    and then alternates repeating the letters 'a' and 'b' any number of times,
    e.g.

    "xyz123aaabbaabbbbababbbbaaabb"

    I'm looking for a regular expression that matches the first, and only the
    first, sequence of the letter 'a', and only if the length of the sequence is
    exactly 3.

    Does such a regular expression exist? If so, any ideas as to what it could
    be?

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #1

  2. Roger L. Cauvin enlightened us with:
    > I'm looking for a regular expression that matches the first, and
    > only the first, sequence of the letter 'a', and only if the length
    > of the sequence is exactly 3.


    Your request is ambiguous:

    1) You're looking for the first, and only the first, sequence of the
    letter 'a'. If the length of this first, and only the first,
    sequence of the letter 'a' is not 3, no match is made at all.

    2) You're looking for the first, and only the first, sequence of
    length 3 of the letter 'a'.

    What is it?

    Sybren
    --
    The problem with the world is stupidity. Not saying there should be a
    capital punishment for stupidity, but why don't we just take the
    safety labels off of everything and let the problem solve itself?
    Frank Zappa
     
    Sybren Stuvel, Jan 26, 2006
    #2

  3. Hello Roger,

    > I'm looking for a regular expression that matches the first, and only
    > the first, sequence of the letter 'a', and only if the length of the
    > sequence is exactly 3.


    import re

    if __name__ == '__main__':
        # Search for the first run of three consecutive 'a' characters.
        m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
        print(m.group(0))
        print("Preceded by: \"" + m.string[0:m.start(0)] + "\"")

    Best wishes,
    Christoph
     
    Christoph Conrad, Jan 26, 2006
    #3
  4. Roger L. Cauvin wrote:

    > Say I have some string that begins with an arbitrary
    > sequence of characters and then alternates repeating the
    > letters 'a' and 'b' any number of times, e.g.
    > "xyz123aaabbaabbbbababbbbaaabb"
    >
    > I'm looking for a regular expression that matches the
    > first, and only the first, sequence of the letter 'a', and
    > only if the length of the sequence is exactly 3.
    >
    > Does such a regular expression exist? If so, any ideas as
    > to what it could be?
    >


    I'm not quite sure what your intent here is, as the
    resulting find would obviously be "aaa", of length 3.

    If you mean that you want to test against a number of
    things, and only find items where "aaa" is the first "a" on
    the line, you might try something like

    import re
    listOfStringsToTest = [
        'helloworld',
        'xyz123aaabbaabababbab',
        'cantalopeaaabababa',
        'baabbbaaabbbbb',
        'xyzaa123aaabbabbabababaa']
    r = re.compile("[^a]*(a{3})b+(a+b+)*")
    matches = [s for s in listOfStringsToTest if r.match(s)]
    print(repr(matches))

    If you just want the *first* triad of "aaa", you can change
    the regexp to

    r = re.compile(".*?(a{3})b+(a+b+)*")

    With a little more detail as to the gist of the problem,
    perhaps a better solution can be found. In particular, are
    there items in the listOfStringsToTest that should be found
    but aren't with either of the regexps?

    -tkc
     
    Tim Chase, Jan 26, 2006
    #4
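The two patterns in the post above can be compared side by side on a current interpreter. What follows is a minimal Python 3 sketch, not the poster's own code: the helper name `filter_matches` and the snake_case list name are assumptions made for illustration, while the strings and patterns are copied from the post.

```python
import re

list_of_strings_to_test = [
    'helloworld',
    'xyz123aaabbaabababbab',
    'cantalopeaaabababa',
    'baabbbaaabbbbb',
    'xyzaa123aaabbabbabababaa',
]

def filter_matches(pattern, strings):
    """Return the strings that the pattern matches starting at position 0."""
    compiled = re.compile(pattern)
    return [s for s in strings if compiled.match(s)]

# Only non-'a' characters may precede the "aaa" run.
first_a_is_triad = filter_matches(r"[^a]*(a{3})b+(a+b+)*", list_of_strings_to_test)

# Any leading text is allowed before the first "aaab".
any_triad = filter_matches(r".*?(a{3})b+(a+b+)*", list_of_strings_to_test)

print(first_a_is_triad)
print(any_triad)
```

With `match`, the first pattern keeps only the string whose first 'a' starts a run of exactly three, while the lazy `.*?` variant keeps every string in the list that contains "aaab" somewhere.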
  5. Tim Chase wrote:
    ...
    > I'm not quite sure what your intent here is, as the
    > resulting find would obviously be "aaa", of length 3.


    But that would also match 'aaaa'; I think he wants negative lookbehind
    and lookahead assertions around the 'aaa' part. But then there's the
    spec about matching only if the sequence is the first occurrence of
    'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
    parentheses around the 'aaa' to somehow 'match' it specially?).

    It's definitely not very clear what exactly the intent is, no...


    Alex
     
    Alex Martelli, Jan 26, 2006
    #5
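The lookaround idea from the post above can be sketched as follows. The exact pattern is an assumption about what the post has in mind, since the thread does not spell it out: a negative lookbehind and a negative lookahead keep `a{3}` from matching inside a longer run of 'a'. The sample strings are made up for illustration.

```python
import re

# (?<!a) : the character before the run is not "a"
# (?!a)  : the character after the run is not "a"
exactly_three = re.compile(r"(?<!a)a{3}(?!a)")

for s in ["xyzaaab", "xyzaaaab", "xaabaaab"]:
    m = exactly_three.search(s)
    print(s, "->", m.span() if m else None)
```

The run of four in "xyzaaaab" is rejected, but in "xaabaaab" the search simply moves past the shorter first run and finds the later "aaa". The lookarounds alone therefore do not express the "first sequence only" part of the request.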
  6. "Sybren Stuvel" wrote in message
    news:...
    > Roger L. Cauvin enlightened us with:
    >> I'm looking for a regular expression that matches the first, and
    >> only the first, sequence of the letter 'a', and only if the length
    >> of the sequence is exactly 3.

    >
    > Your request is ambiguous:
    >
    > 1) You're looking for the first, and only the first, sequence of the
    > letter 'a'. If the length of this first, and only the first,
    > sequence of the letter 'a' is not 3, no match is made at all.
    >
    > 2) You're looking for the first, and only the first, sequence of
    > length 3 of the letter 'a'.
    >
    > What is it?


    The first option describes what I want, with the additional restriction that
    the "first sequence of the letter 'a'" is defined as 1 or more consecutive
    occurrences of the letter 'a', followed directly by the letter 'b'.

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #6
  7. "Christoph Conrad" wrote in message
    news:-berlin.de...
    > Hello Roger,
    >
    >> I'm looking for a regular expression that matches the first, and only
    >> the first, sequence of the letter 'a', and only if the length of the
    >> sequence is exactly 3.

    >
    > import sys, re, os
    >
    > if __name__=='__main__':
    >
    > m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
    > print m.group(0)
    > print "Preceded by: \"" + m.string[0:m.start(0)] + "\""


    The correct pattern should reject the string:

    'xyz123aabbaaab'

    since the length of the first sequence of the letter 'a' is 2. Yours
    accepts it, right?

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #7
  8. "Alex Martelli" wrote in message
    news:1h9reyq.z7u4ziv8itblN%...
    > Tim Chase wrote:
    > ...
    >> I'm not quite sure what your intent here is, as the
    >> resulting find would obviously be "aaa", of length 3.

    >
    > But that would also match 'aaaa'; I think he wants negative lookbehind
    > and lookahead assertions around the 'aaa' part. But then there's the
    > spec about matching only if the sequence is the first occurrence of
    > 'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
    > parentheses around the 'aaa' to somehow 'match' it specially?).
    >
    > It's definitely not very clear what exactly the intent is, no...


    Sorry for the confusion. The correct pattern should reject all strings
    except those in which the first sequence of the letter 'a' that is followed
    by the letter 'b' has a length of exactly three.

    Hope that's clearer . . . .

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #8
  9. Hello Roger,

    > since the length of the first sequence of the letter 'a' is 2. Yours
    > accepts it, right?


    Yes, I misunderstood your requirements. So it must be modified to
    essentially what Tim Chase wrote:

    m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')

    Best wishes from Germany,
    Christoph
     
    Christoph Conrad, Jan 26, 2006
    #9
  10. On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin"
    might have written:

    >Say I have some string that begins with an arbitrary sequence of characters
    >and then alternates repeating the letters 'a' and 'b' any number of times,
    >e.g.
    >
    >"xyz123aaabbaabbbbababbbbaaabb"
    >
    >I'm looking for a regular expression that matches the first, and only the
    >first, sequence of the letter 'a', and only if the length of the sequence is
    >exactly 3.
    >
    >Does such a regular expression exist? If so, any ideas as to what it could
    >be?


    Is this what you mean?

    ^[^a]*(a{3})(?:[^a].*)?$

    This fits your description.
    --
    TZOTZIOY, I speak England very best.
    "Dear Paul,
    please stop spamming us."
    The Corinthians
     
    Christos Georgiou, Jan 26, 2006
    #10
  11. Roger L. Cauvin wrote:

    > Sorry for the confusion. The correct pattern should reject
    > all strings except those in which the first sequence of the
    > letter 'a' that is followed by the letter 'b' has a length of
    > exactly three.


    Ah...a little more clear.

    r = re.compile("[^a]*a{3}b+(a+b*)*")
    matches = [s for s in listOfStringsToTest if r.match(s)]

    or (as you've only got 3 of 'em)

    r = re.compile("[^a]*aaab+(a+b*)*")
    matches = [s for s in listOfStringsToTest if r.match(s)]

    should do the trick. To exposit:

    [^a]*         a bunch of stuff that's not "a"
    a{3} or aaa   three letter "a"s
    b+            one or more "b"s
    (a+b*)        any number of "a"s followed optionally by "b"s

    Hope this helps,

    -tkc
     
    Tim Chase, Jan 26, 2006
    #11
  12. Tim Chase wrote:

    > > Sorry for the confusion. The correct pattern should reject
    > > all strings except those in which the first sequence of the
    > > letter 'a' that is followed by the letter 'b' has a length of
    > > exactly three.

    >
    > Ah...a little more clear.
    >
    > r = re.compile("[^a]*a{3}b+(a+b*)*")
    > matches = [s for s in listOfStringsToTest if r.match(s)]


    Unfortunately, the OP's spec is even more complex than this, if we are
    to take to the letter what you just quoted; e.g.
    aazaaab
    SHOULD match, because the sequence 'aaz' (being 'a' NOT followed by the
    letter 'b') should not invalidate the match that follows. I don't think
    he means the strings contain only a's and b's.

    Locating 'the first sequence of a followed by b' is easy, and it is
    reasonably easy to check that the sequence is exactly of length 3 (e.g.
    with a negative lookbehind) -- but I don't know how to tell a RE to *stop*
    searching for more if the check fails.

    If a little more than just REs and matching were allowed, it would be
    reasonably easy, but I don't know how to fashion a RE r such that
    r.match(s) will succeed if and only if s meets those very precise and
    complicated specs. That doesn't mean it just can't be done, just that I
    can't do it so far. Perhaps the OP can tell us what constrains him to
    use r.match ONLY, rather than a little bit of logic around it, so we can
    see if we're trying to work in an artificially overconstrained domain?


    Alex
     
    Alex Martelli, Jan 26, 2006
    #12
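One way to read "a little bit of logic around it" from the post above is to let the regular expression locate the first run of 'a' that is directly followed by 'b', and to check its length in ordinary code. This is a minimal sketch under that reading; the names `FIRST_RUN` and `first_run_is_three` are hypothetical and do not come from the thread.

```python
import re

# One or more "a" immediately followed by "b"; the run of "a"s is captured.
FIRST_RUN = re.compile(r"(a+)b")

def first_run_is_three(s):
    """True if the first run of 'a' directly followed by 'b' has length 3."""
    m = FIRST_RUN.search(s)
    return m is not None and len(m.group(1)) == 3

print(first_run_is_three("aazaaab"))
print(first_run_is_three("xyz123aabbaaab"))
```

Because `search` returns the leftmost match and `a+` is greedy, the captured group is always the whole first qualifying run, so the length check is applied once and no later run is considered.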
  13. Christoph Conrad wrote:

    > Hello Roger,
    >
    > > since the length of the first sequence of the letter 'a' is 2. Yours
    > > accepts it, right?

    >
    > Yes, I misunderstood your requirements. So it must be modified to
    > essentially what Tim Chase wrote:
    >
    > m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')


    ...but that rejects 'aazaaab' which should apparently be accepted.


    Alex
     
    Alex Martelli, Jan 26, 2006
    #13
  14. Hello Alex,

    >> r = re.compile("[^a]*a{3}b+(a+b*)*")
    >> matches = [s for s in listOfStringsToTest if r.match(s)]


    > Unfortunately, the OP's spec is even more complex than this, if we are
    > to take to the letter what you just quoted; e.g. aazaaab SHOULD match,


    Then it's again "a{3}b", isn't it?

    Kind regards,
    Christoph
     
    Christoph Conrad, Jan 26, 2006
    #14
  15. "Tim Chase" wrote in message
    news:...
    >> Sorry for the confusion. The correct pattern should reject
    >> all strings except those in which the first sequence of the
    >> letter 'a' that is followed by the letter 'b' has a length of
    >> exactly three.

    >
    > Ah...a little more clear.
    >
    > r = re.compile("[^a]*a{3}b+(a+b*)*")
    > matches = [s for s in listOfStringsToTest if r.match(s)]


    Wow, I like it, but it allows some strings it shouldn't. For example:

    "xyz123aabbaaab"

    (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #15
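Whether the counterexample above is accepted depends on how the pattern is applied. The following is a small sketch, assuming the test was run with `search` rather than with the `match` call shown in the quoted code:

```python
import re

r = re.compile(r"[^a]*a{3}b+(a+b*)*")
s = "xyz123aabbaaab"

# match() only tries position 0, where the first run "aa" blocks a{3}.
print(r.match(s))

# search() may start anywhere, so it can begin after the "aa" run.
m = r.search(s)
print(m.group(0))
```

Here `match` returns `None`, while `search` finds 'bbaaab', which is the behaviour described in the post.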
  16. "Christos Georgiou" wrote in message
    news:...
    > On Thu, 26 Jan 2006 14:09:54 GMT, rumours say that "Roger L. Cauvin"
    > might have written:
    >
    >>Say I have some string that begins with an arbitrary sequence of characters
    >>and then alternates repeating the letters 'a' and 'b' any number of times,
    >>e.g.
    >>
    >>"xyz123aaabbaabbbbababbbbaaabb"
    >>
    >>I'm looking for a regular expression that matches the first, and only the
    >>first, sequence of the letter 'a', and only if the length of the sequence is
    >>exactly 3.
    >>
    >>Does such a regular expression exist? If so, any ideas as to what it could
    >>be?

    >
    > Is this what you mean?
    >
    > ^[^a]*(a{3})(?:[^a].*)?$


    Close, but the pattern should allow "arbitrary sequence of characters" that
    precede the alternating a's and b's to contain the letter 'a'. In other
    words, the pattern should accept:

    "xayz123aaabbab"

    since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

    Your proposed pattern rejects this string.

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #16
  17. Roger L. Cauvin wrote:

    >>r = re.compile("[^a]*a{3}b+(a+b*)*")
    >>matches = [s for s in listOfStringsToTest if r.match(s)]

    >
    > Wow, I like it, but it allows some strings it shouldn't. For example:
    >
    > "xyz123aabbaaab"
    >
    > (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)


    Anchoring it to the beginning/end might solve that:

    r = re.compile("^[^a]*a{3}b+(a+b*)*$")

    this ensures that no "a"s come before the first 3x"a" and nothing
    but "b" and "a" follows it.

    -tkc
    (who's translating from vim regexps, which are just different enough
    to throw a wrench in the works...)
     
    Tim Chase, Jan 26, 2006
    #17
  18. Roger L. Cauvin wrote:
    > Sorry for the confusion. The correct pattern should reject all strings
    > except those in which the first sequence of the letter 'a' that is followed
    > by the letter 'b' has a length of exactly three.
    >
    > Hope that's clearer . . . .


    Examples are a *really* good way to clarify ambiguous or complex
    requirements. In fact, when made executable they're called "test cases"
    :), and supplying a few of those (showing input values and expected
    output values) would help, not only to clarify your goals for the
    humans, but also to let the proposed solutions easily be tested.

    (After all, are you going to just trust that whatever you are handed
    here is correctly implemented, and based on a perfect understanding of
    your apparently unclear requirements?)

    -Peter
     
    Peter Hansen, Jan 26, 2006
    #18
  19. "Tim Chase" wrote in message
    news:...
    >>>r = re.compile("[^a]*a{3}b+(a+b*)*")
    >>>matches = [s for s in listOfStringsToTest if r.match(s)]

    >>
    >> Wow, I like it, but it allows some strings it shouldn't. For example:
    >>
    >> "xyz123aabbaaab"
    >>
    >> (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

    >
    > Anchoring it to the beginning/end might solve that:
    >
    > r = re.compile("^[^a]*a{3}b+(a+b*)*$")
    >
    > this ensures that no "a"s come before the first 3x"a" and nothing but "b"
    > and "a" follows it.


    Anchoring may be the key here, but this pattern rejects

    "xayz123aaabab"

    which it should accept, since the 'a' between the 'x' and the 'y' is not
    directly followed by the letter 'b'.

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #19
  20. "Peter Hansen" wrote in message
    news:...
    > Roger L. Cauvin wrote:
    >> Sorry for the confusion. The correct pattern should reject all strings
    >> except those in which the first sequence of the letter 'a' that is
    >> followed by the letter 'b' has a length of exactly three.
    >>
    >> Hope that's clearer . . . .

    >
    > Examples are a *really* good way to clarify ambiguous or complex
    > requirements. In fact, when made executable they're called "test cases"
    > :), and supplying a few of those (showing input values and expected
    > output values) would help, not only to clarify your goals for the humans,
    > but also to let the proposed solutions easily be tested.


    Good suggestion. Here are some "test cases":

    "xyz123aaabbab" accept
    "xyz123aabbaab" reject
    "xayz123aaabab" accept
    "xaaayz123abab" reject
    "xaaayz123aaabab" accept

    --
    Roger L. Cauvin
    (omit the "nospam_" part)
    Cauvin, Inc.
    Product Management / Market Research
    http://www.cauvin-inc.com
     
    Roger L. Cauvin, Jan 26, 2006
    #20
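The test cases above can be made executable, as suggested earlier in the thread. This is a minimal Python 3 sketch: the first four patterns are copied from earlier posts and applied with `search`, while the last pattern is NOT from the thread. It is an untested-by-the-participants guess added here for illustration, and the names `TEST_CASES`, `CANDIDATES` and `failures` are likewise hypothetical.

```python
import re

# (string, should_accept) pairs taken from the post above.
TEST_CASES = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True),
]

CANDIDATES = [
    r"a{3}",                        # post #3
    r"^[^a]*a{3}b",                 # post #9
    r"^[^a]*(a{3})(?:[^a].*)?$",    # post #10
    r"^[^a]*a{3}b+(a+b*)*$",        # post #17
    # Assumption, not from the thread: skip non-'a' characters and runs of
    # 'a' that are followed by something other than 'a' or 'b', then
    # require exactly "aaab".
    r"^(?:[^a]|a+[^ab])*a{3}b",
]

def failures(pattern):
    """Return the test strings on which the pattern gives the wrong verdict."""
    compiled = re.compile(pattern)
    return [s for s, expected in TEST_CASES
            if bool(compiled.search(s)) != expected]

for pattern in CANDIDATES:
    bad = failures(pattern)
    verdict = "passes all five cases" if not bad else "fails on %r" % bad
    print(pattern, "->", verdict)
```

Passing these five cases does not prove that a pattern meets the full requirement; it only shows agreement with the examples given so far.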