Match First Sequence in Regular Expression?

Christos Georgiou · Jan 26, 2006

Say I have some string that begins with an arbitrary sequence of
characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.

"xyz123aaabbaabbbbababbbbaaabb"

I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence
is
exactly 3.

Does such a regular expression exist? If so, any ideas as to what it
could
be?

Click to expand...

Is this what you mean?

^[^a]*(a{3})(?:[^a].*)?$

Click to expand...

Close, but the pattern should allow "arbitrary sequence of characters" that
precede the alternating a's and b's to contain the letter 'a'. In other
words, the pattern should accept:

"xayz123aaabbab"

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

Your proposed pattern rejects this string.

1.

(a{3})(?:b[ab]*)?$

This finds the first (leftmost) "aaa" either at the end of the string or
followed by 'b' and then arbitrary sequences of 'a' and 'b'.

This will also match "aaaa" (from second position on).

2.

If you insist in only three 'a's and you can add the constraint that:

* let s be the "arbitrary sequence of characters" at the start of your
searched text
* len(s) >= 1 and not s.endswith('a')

then you'll have this reg.ex.

(?<=[^a])(a{3})(?:b[ab]*)?$

3.

If you want to allow for a possible empty "arbitrary sequence of characters"
at the start and you don't mind search speed

^(?:.?*[^a])?(a{3})(?:b[ab]*)?$

This should cover you:

s="xayzbaaa123aaabbab"
r=re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
m= r.match(s)
m.group(1) 'aaa'
m.start(1) 11
s[11:]

Click to expand...

Click to expand...

'aaabbab'

Roger L. Cauvin · Jan 26, 2006

Alex Martelli said:
....
....
If a little more than just REs and matching was allowed, it would be
reasonably easy, but I don't know how to fashion a RE r such that
r.match(s) will succeed if and only if s meets those very precise and
complicated specs. That doesn't mean it just can't be done, just that I
can't do it so far. Perhaps the OP can tell us what constrains him to
use r.match ONLY, rather than a little bit of logic around it, so we can
see if we're trying to work in an artificially overconstrained domain?

Alex, you seem to grasp exactly what the requirements are in this case. I
of course don't *have* to use regular expressions only, but I'm working with
an infrastructure that uses regexps in configuration files so that the code
doesn't have to change to add or change patterns. Before throwing up my
hands and re-architecting, I wanted to see if regexps would handle the job
(they have in every case but one).

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Christos Georgiou · Jan 26, 2006

Good suggestion. Here are some "test cases":

"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Applying my last regex to your test cases:
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$

You should also remember to check the (match_object).start(1) to verify that
it matches the "aaa" you want.

Dave Hansen · Jan 26, 2006

Is this what you mean?

^[^a]*(a{3})(?:[^a].*)?$

Click to expand...

Close, but the pattern should allow "arbitrary sequence of characters" that
precede the alternating a's and b's to contain the letter 'a'. In other
words, the pattern should accept:

"xayz123aaabbab"

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

I don't know an RE is the best solution to this problem. If I
understand the problem correctly, building a state machine to solve
this is trivial. The following took about 5 minutes of coding:

---begin included file
# Define our states.
# state[0] is next state if character is 'a'
# state[1] is next state if character is 'b'
# state[2] is next state for any other character

# Accept state means we've found a match
Accept = []
for i in range(3):
Accept.append(Accept)

# Reject state means the string cannot match
Reject = []
for i in range(3):
Reject.append(Reject)

# Remaining states: Start, 'a' found, 'aa', 'aaa', and 'aaaa'
Start = [0,1,2]
a1 = [0,1,2]
a2 = [0,1,2]
a3 = [0,1,2]
a4 = [0,1,2]

# Start: looking for first 'a'
Start[0] = a1
Start[1] = Start
Start[2] = Start

# a1: 1 'a' found so far
a1[0] = a2
a1[1] = Reject
a1[2] = Start

# a2: 'aa' found
a2[0] = a3
a2[1] = Reject
a2[2] = Start

# a3: 'aaa' found
a3[0] = a4
a3[1] = Accept
a3[2] = Start

# a4: four or more 'a' in a row
a4[0] = a4
a4[1] = Reject
a4[2] = Start

def detect(s):
"""
Return 1 if first substring aa*b has exactly 3 a's
Return 0 otherwise
"""
state = Start
for c in s:
if c == 'a':
state = state[0]
elif c == 'b':
state = state[1]
else:
state = state[2]
if state is Accept:
return 1
return 0

print detect("xyza123abc")
print detect("xyzaaa123aabc")
print detect("xyzaa123aaabc")
print detect("xyza123aaaabc")

--- end included file ---

And I'm pretty sure it does what you need, though it's pretty naive.
Note that if '3' isn't a magic number, states a1, a2, a3, and a4 could
be re-implemented as a single state with a counter, but the logic
inside detect gets a little hairier.

I haven't timed it, but it's not doing anything other than simple
comparisons and assignments. It's a little (OK, a lot) more code than
a simple RE, but I know it works.

HTH,
-=Dave

Fredrik Lundh · Jan 26, 2006

Roger said:
Good suggestion. Here are some "test cases":

"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

$ more test.py

import re

print "got expected"
print "------ --------"

testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)

for string, result in testsuite:
m = re.search("aaab", string)
if m:
print "accept",
else:
print "reject",
print result

$ python test.py
got expected
---------------
accept accept
reject reject
accept accept
reject reject
accept accept

</F>

Roger L. Cauvin · Jan 26, 2006

Fredrik Lundh said:
$ more test.py

import re

print "got expected"
print "------ --------"

testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)

for string, result in testsuite:
m = re.search("aaab", string)
if m:
print "accept",
else:
print "reject",
print result

$ python test.py
got expected
---------------
accept accept
reject reject
accept accept
reject reject
accept accept

Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:

"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Your pattern fails the second test.

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Christos Georgiou · Jan 26, 2006

$ more test.py

[snip of code]

m = re.search("aaab", string)

[snip of more code]

$ python test.py
got expected
---------------
accept accept
reject reject
accept accept
reject reject
accept accept

You're right, Fredrik, but we (graciously as a group

take also notice of
the other requirements that the OP has provided elsewhere and that are not
covered by the simple test that he specified.

The code above works for "aaaab" too, which the OP has already ruled out,
and it doesn't work for "aaa".

Roger L. Cauvin · Jan 26, 2006

Christos Georgiou said:
Good suggestion. Here are some "test cases":

"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Click to expand...

Applying my last regex to your test cases:
^(?:.*?[^a])?(a{3})(?:b[ab]*)?$

You should also remember to check the (match_object).start(1) to verify
that
it matches the "aaa" you want.

Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:

"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Your pattern fails the second test.

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Roger L. Cauvin · Jan 26, 2006

Christos Georgiou said:
Say I have some string that begins with an arbitrary sequence of
characters
and then alternates repeating the letters 'a' and 'b' any number of
times,
e.g.

"xyz123aaabbaabbbbababbbbaaabb"

I'm looking for a regular expression that matches the first, and only
the
first, sequence of the letter 'a', and only if the length of the
sequence
is
exactly 3.

Does such a regular expression exist? If so, any ideas as to what it
could
be?

Is this what you mean?

^[^a]*(a{3})(?:[^a].*)?$

Click to expand...

Close, but the pattern should allow "arbitrary sequence of characters"
that
precede the alternating a's and b's to contain the letter 'a'. In other
words, the pattern should accept:

"xayz123aaabbab"

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

Your proposed pattern rejects this string.

Click to expand...

1.

(a{3})(?:b[ab]*)?$

This finds the first (leftmost) "aaa" either at the end of the string or
followed by 'b' and then arbitrary sequences of 'a' and 'b'.

This will also match "aaaa" (from second position on).

2.

If you insist in only three 'a's and you can add the constraint that:

* let s be the "arbitrary sequence of characters" at the start of your
searched text
* len(s) >= 1 and not s.endswith('a')

then you'll have this reg.ex.

(?<=[^a])(a{3})(?:b[ab]*)?$

3.

If you want to allow for a possible empty "arbitrary sequence of
characters"
at the start and you don't mind search speed

^(?:.?*[^a])?(a{3})(?:b[ab]*)?$

This should cover you:

s="xayzbaaa123aaabbab"
r=re.compile(r"^(?:.*?[^a])?(a{3})(?:b[ab]*)?$")
m= r.match(s)
m.group(1) 'aaa'
m.start(1) 11
s[11:]

Click to expand...

Click to expand...

'aaabbab'

Thanks for continuing to follow up, Christos. Please see my reply to your
other post (in which you applied the test cases).

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Christos Georgiou · Jan 26, 2006

Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:

"xyz123aaabbab" accept
"xyz123aabbaaab" reject

Here I object to either you or your need for a regular expression. You see,
before the "aaa" in your second test case, you have an "arbitrary sequence
of characters", so your requirements are met.

Roger L. Cauvin · Jan 26, 2006

Christos Georgiou said:
$ more test.py

Click to expand...

[snip of code]

m = re.search("aaab", string)

Click to expand...

[snip of more code]

$ python test.py
got expected
---------------
accept accept
reject reject
accept accept
reject reject
accept accept

Click to expand...

You're right, Fredrik, but we (graciously as a group take also notice
of
the other requirements that the OP has provided elsewhere and that are not
covered by the simple test that he specified.

My fault, guys. The second test case should be

"xyz123aabbaaab" reject

instead of

"xyz123aabbaab" reject

Fredrik's pattern fails this test case.

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Fredrik Lundh · Jan 26, 2006

Roger said:
Thanks, but the second test case I listed contained a typo. It should have
contained a sequence of three of the letter 'a'. The test cases should be:

"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Your pattern fails the second test.

$ more test.py

import re

print "got expected"
print "------ --------"

testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)

for string, result in testsuite:
m = re.search("a+b", string)
if m and len(m.group()) == 4:
print "accept",
else:
print "reject",
print result

$ python test.py

got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept

</F>

Roger L. Cauvin · Jan 26, 2006

Christos Georgiou said:
Here I object to either you or your need for a regular expression. You
see,
before the "aaa" in your second test case, you have an "arbitrary sequence
of characters", so your requirements are met.

Well, thank you for your efforts so far, Christos.

My purpose is to determine whether it's possible to do this using regular
expressions, since my application is already architected around
configuration files that use regular expressions. It may not be the best
architecture, but I still don't know the answer to my question. Is it
*possible* to fulfill my requirements with regular expressions, even if it's
not the best way to do it?

The requirements are not met by your regular expression, since by definition
the "arbitrary sequence of characters" stops once the sequences of a's and
b's starts.

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Roger L. Cauvin · Jan 26, 2006

Fredrik Lundh said:
$ more test.py

import re

print "got expected"
print "------ --------"

testsuite = (
("xyz123aaabbab", "accept"),
("xyz123aabbaaab", "reject"),
("xayz123aaabab", "accept"),
("xaaayz123abab", "reject"),
("xaaayz123aaabab", "accept"),
)

for string, result in testsuite:
m = re.search("a+b", string)
if m and len(m.group()) == 4:
print "accept",
else:
print "reject",
print result

$ python test.py

got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept

Thanks, but I'm looking for a solution in terms of a regular expression
only. In other words, "accept" means the regular expression matched, and
"reject" means the regular expression did not match. I want to see if I can
fulfill the requirements without additional code (such as checking
"len(m.group())").

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Tim Chase · Jan 26, 2006

The below seems to pass all the tests you threw at it (taking the
modified 2nd test into consideration)

One other test that occurs to me would be

"xyz123aaabbaaabab"

where you have "aaab" in there twice.

-tkc

import re
tests = [
("xyz123aaabbab",True),
("xyz123aabbaaab", False),
("xayz123aaabab",True),
("xaaayz123abab", False),
("xaaayz123aaabab",True)
]

exp = '^([^b]|((?<!a)b))*aaab+[ab]*$'
r = re.compile(exp)
print "Using regexp: %s" % exp
for test,expectedResult in tests:
if r.match(test):
result = True
else:
result = False
if result == expectedResult:
print "%s passed" % test
else:
print "%s failed (expected %s, got %s)" % (
test,expectedResult,result)

Michael Spencer · Jan 26, 2006

This passes your tests. I haven't closely followed the thread for other
requirements:

>>> pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b
>>> test(pattern)

Click to expand...

Click to expand...

got expected
------ --------
accept accept
reject reject
accept accept
reject reject
accept accept ... ("xyz123aaabbab", "accept"),
... ("xyz123aabbaaab", "reject"),
... ("xayz123aaabab", "accept"),
... ("xaaayz123abab", "reject"),
... ("xaaayz123aaabab", "accept"),
... ) ... print "got expected"
... print "------ --------"
... for string, result in testsuite:
... m = re.match(pattern, string)
... if m:
... print "accept",
... else:
... print "reject",
... print result
...
Michael

Roger L. Cauvin · Jan 26, 2006

Michael Spencer said:
This passes your tests. I haven't closely followed the thread for other
requirements:

pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b

Click to expand...

Click to expand...

Very interesting. I think you may have solved the problem. The key seems
to be the "not preceded by" part. I'm unfamiliar with some of the notation.
Can you explain what "[a+b]" and the "(?<!" do?

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Roger L. Cauvin · Jan 26, 2006

Tim Chase said:
The below seems to pass all the tests you threw at it (taking the modified
2nd test into consideration)

One other test that occurs to me would be

"xyz123aaabbaaabab"

where you have "aaab" in there twice.

Good suggestion.

^([^b]|((?<!a)b))*aaab+[ab]*$

Looks good, although I've been unable to find a good explanation of the
"negative lookbehind" construct "(?<". How does it work?

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com

Tim Chase · Jan 26, 2006

"xyz123aaabbaaabab"

Good suggestion.

I assumed that this would be a valid case. If not, the
expression would need tweaking.

^([^b]|((?<!a)b))*aaab+[ab]*$

Click to expand...

Looks good, although I've been unable to find a good
explanation of the "negative lookbehind" construct "(?<". How
does it work?

The beginning part of the expression

([^b]|((?<!a)b))*

breaks down as

[^b] anything that isn't a "b"
| or
(...) this other thing

where "this other thing" is

(?<!a)b a "b" as long as it isn't immediately
preceeded by an "a"

The "(?<!...)" construct means that the "..." portion can't come
before the following token in the regexp...in this case, before a
"b".

There's also a "negative lookahead" (rather than "lookbehind")
which prevents items from following. This should be usable in
this scenario as wall and works with the aforementioned tests, using

"^([^a]|(a(?!b)))*aaab+[ab]*$"

which would be "anything that's not an 'a'; or an 'a' as long as
it's not followed by a 'b'"

The gospel is at:
http://docs.python.org/lib/re-syntax.html

but is a bit terse. O'reily has a fairly good book on regexps if
you want to dig a bit deeper.

-tkc

Peter Hansen · Jan 26, 2006

Roger said:
Roger said:

"xyz123aaabbab" accept
"xyz123aabbaaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept

Click to expand...

This passes your tests. I haven't closely followed the thread for other
requirements:

pattern = ".*?(?<![a+b])aaab" #look for aaab not preceded by any a+b

Click to expand...

Click to expand...

Very interesting. I think you may have solved the problem. The key seems
to be the "not preceded by" part. I'm unfamiliar with some of the notation.
Can you explain what "[a+b]" and the "(?<!" do?

I think you might need to add a test case involving a pattern of aaaab
prior to another aaab. From what I gather (not reading too closely),
you would want this to be rejected. Is that true?

xyz123aaaababaaabab

-Peter

Exact match with regular expression	4	Oct 26, 2008
grimace: a fluent regular expression generator in Python	0	Jul 15, 2013
Help with regular expression in python	1	Aug 18, 2011
Utility to locate errors in regular expressions	3	May 24, 2013
Idk need help in editing this source code	0	Nov 5, 2022
Regular expression for different date formats in Python	4	Nov 26, 2012
Regular expression to structure HTML	11	Oct 2, 2009
Regular Expression	6	Oct 22, 2007

Match First Sequence in Regular Expression?

Christos Georgiou

Roger L. Cauvin

Christos Georgiou

Dave Hansen

Fredrik Lundh

Roger L. Cauvin

Christos Georgiou

Roger L. Cauvin

Roger L. Cauvin

Christos Georgiou

Roger L. Cauvin

Fredrik Lundh

Roger L. Cauvin

Roger L. Cauvin

Tim Chase

Michael Spencer

Roger L. Cauvin

Roger L. Cauvin

Tim Chase

Peter Hansen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads