Match First Sequence in Regular Expression?

Roger L. Cauvin

Say I have some string that begins with an arbitrary sequence of characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.

"xyz123aaabbaabbbbababbbbaaabb"

I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence is
exactly 3.

Does such a regular expression exist? If so, any ideas as to what it could
be?

--
Roger L. Cauvin
(e-mail address removed) (omit the "nospam_" part)
Cauvin, Inc.
Product Management / Market Research
http://www.cauvin-inc.com
 
Sybren Stuvel

Roger L. Cauvin enlightened us with:
> I'm looking for a regular expression that matches the first, and
> only the first, sequence of the letter 'a', and only if the length
> of the sequence is exactly 3.

Your request is ambiguous:

1) You're looking for the first, and only the first, sequence of the
letter 'a'. If the length of this first, and only the first,
sequence of the letter 'a' is not 3, no match is made at all.

2) You're looking for the first, and only the first, sequence of
length 3 of the letter 'a'.

Which is it?

Sybren
 
Christoph Conrad

Hello Roger,
> I'm looking for a regular expression that matches the first, and only
> the first, sequence of the letter 'a', and only if the length of the
> sequence is exactly 3.

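import re

if __name__ == '__main__':
    # Find the first run of three consecutive 'a's anywhere in the string.
    m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
    print(m.group(0))
    # Show the text that precedes the match.
    print('Preceded by: "' + m.string[0:m.start(0)] + '"')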

Best wishes,
Christoph
 
Tim Chase

> Say I have some string that begins with an arbitrary
> sequence of characters and then alternates repeating the
> letters 'a' and 'b' any number of times, e.g.
> "xyz123aaabbaabbbbababbbbaaabb"
>
> I'm looking for a regular expression that matches the
> first, and only the first, sequence of the letter 'a', and
> only if the length of the sequence is exactly 3.
>
> Does such a regular expression exist? If so, any ideas as
> to what it could be?
>

I'm not quite sure what your intent here is, as the
resulting find would obviously be "aaa", of length 3.

If you mean that you want to test against a number of
things, and only find items where "aaa" is the first "a" on
the line, you might try something like

import re
listOfStringsToTest = [
'helloworld',
'xyz123aaabbaabababbab',
'cantalopeaaabababa',
'baabbbaaabbbbb',
'xyzaa123aaabbabbabababaa']
r = re.compile("[^a]*(a{3})b+(a+b+)*")
# match() anchors at the start, so [^a]* must cover everything before the "aaa"
matches = [s for s in listOfStringsToTest if r.match(s)]
print(repr(matches))

If you just want the *first* triad of "aaa", you can change
the regexp to

r = re.compile(".*?(a{3})b+(a+b+)*")

With a little more detail as to the gist of the problem,
perhaps a better solution can be found. In particular, are
there items in the listOfStringsToTest that should be found
but aren't with either of the regexps?

-tkc
 
Alex Martelli

Tim Chase said:
I'm not quite sure what your intent here is, as the
resulting find would obviously be "aaa", of length 3.

But that would also match 'aaaa'; I think he wants negative lookbehind
and lookahead assertions around the 'aaa' part. But then there's the
spec about matching only if the sequence is the first occurrence of
'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
parentheses around the 'aaa' to somehow 'match' it specially?).

It's definitely not very clear what exactly the intent is, no...
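
The lookaround idea can be sketched roughly as follows. This is only an illustration, not a pattern anyone posted in the thread: the name `exact_three` and the sample strings are assumptions. The negative lookbehind `(?<!a)` and negative lookahead `(?!a)` keep `aaa` from matching inside a longer run of 'a's.

```python
import re

# Hypothetical name and pattern, shown only to illustrate lookarounds.
exact_three = re.compile(r"(?<!a)aaa(?!a)")

plain_hit = re.search(r"a{3}", "xaaaab")      # bare a{3} matches inside 'aaaa'
guarded_miss = exact_three.search("xaaaab")   # a run of four is rejected
guarded_hit = exact_three.search("xaaab")     # a run of exactly three is found

print(plain_hit is not None, guarded_miss is None, guarded_hit.group(0))
```

Note that this only enforces "exactly three"; it does nothing about the "first sequence" part of the spec.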


Alex
 
Roger L. Cauvin

Sybren Stuvel said:
Roger L. Cauvin enlightened us with:

Your request is ambiguous:

1) You're looking for the first, and only the first, sequence of the
letter 'a'. If the length of this first, and only the first,
sequence of the letter 'a' is not 3, no match is made at all.

2) You're looking for the first, and only the first, sequence of
length 3 of the letter 'a'.

Which is it?

The first option describes what I want, with the additional restriction that
the "first sequence of the letter 'a'" is defined as 1 or more consecutive
occurrences of the letter 'a', followed directly by the letter 'b'.

 
Roger L. Cauvin

Christoph Conrad said:
Hello Roger,
I'm looking for a regular expression that matches the first, and only
the first, sequence of the letter 'a', and only if the length of the
sequence is exactly 3.

import re

if __name__ == '__main__':
    m = re.search('a{3}', 'xyz123aaabbaaabbbbababbbbaabb')
    print(m.group(0))
    print('Preceded by: "' + m.string[0:m.start(0)] + '"')

The correct pattern should reject the string:

'xyz123aabbaaab'

since the length of the first sequence of the letter 'a' is 2. Yours
accepts it, right?

 
Roger L. Cauvin

Alex Martelli said:
Tim Chase said:
I'm not quite sure what your intent here is, as the
resulting find would obviously be "aaa", of length 3.

But that would also match 'aaaa'; I think he wants negative lookbehind
and lookahead assertions around the 'aaa' part. But then there's the
spec about matching only if the sequence is the first occurrence of
'a's, so maybe he wants '^[^a]*' instead of the lookbehind (and maybe
parentheses around the 'aaa' to somehow 'match' it specially?).

It's definitely not very clear what exactly the intent is, no...

Sorry for the confusion. The correct pattern should reject all strings
except those in which the first sequence of the letter 'a' that is followed
by the letter 'b' has a length of exactly three.

Hope that's clearer . . . .

 
Christoph Conrad

Hello Roger,
> since the length of the first sequence of the letter 'a' is 2. Yours
> accepts it, right?

Yes, I misunderstood your requirements. So it must be modified to
essentially what Tim Chase wrote:

m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')

Best wishes from Germany,
Christoph
 
Christos Georgiou

> Say I have some string that begins with an arbitrary sequence of characters
> and then alternates repeating the letters 'a' and 'b' any number of times,
> e.g.
>
> "xyz123aaabbaabbbbababbbbaaabb"
>
> I'm looking for a regular expression that matches the first, and only the
> first, sequence of the letter 'a', and only if the length of the sequence is
> exactly 3.
>
> Does such a regular expression exist? If so, any ideas as to what it could
> be?

Is this what you mean?

^[^a]*(a{3})(?:[^a].*)?$

This fits your description.
 
Tim Chase

> Sorry for the confusion. The correct pattern should reject
> all strings except those in which the first sequence of the
> letter 'a' that is followed by the letter 'b' has a length of
> exactly three.

Ah...a little more clear.

r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

or (as you've only got 3 of 'em)

r = re.compile("[^a]*aaab+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

should do the trick. To exposit:

[^a]*        a bunch of stuff that's not "a"
a{3} or aaa  three letter "a"s
b+           one or more "b"s
(a+b*)*      zero or more groups of one or more "a"s followed optionally by "b"s
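
The same breakdown can be kept next to the pattern itself by using `re.VERBOSE`. This is a small sketch; the names `annotated` and `compact` are only for illustration, and the two sample strings are taken from elsewhere in the thread.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece carries
# its own comment. Whitespace and '#' comments are ignored by the regex engine.
annotated = re.compile(r"""
    [^a]*       # any run of characters other than 'a'
    a{3}        # exactly three 'a's
    b+          # one or more 'b's
    (a+b*)*     # zero or more groups: 'a's optionally followed by 'b's
""", re.VERBOSE)

compact = re.compile("[^a]*a{3}b+(a+b*)*")

for s in ["xyz123aaabbab", "xyz123aabbaab"]:
    # Both spellings should agree on every input.
    print(s, bool(annotated.match(s)), bool(compact.match(s)))
```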

Hope this helps,

-tkc
 
Alex Martelli

Tim Chase said:
Sorry for the confusion. The correct pattern should reject
all strings except those in which the first sequence of the
letter 'a' that is followed by the letter 'b' has a length of
exactly three.

Ah...a little more clear.

r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

Unfortunately, the OP's spec is even more complex than this, if we are
to take to the letter what you just quoted; e.g.
aazaaab
SHOULD match, because the sequence 'aaz' (being 'a' NOT followed by the
letter 'b') should not invalidate the match that follows. I don't think
he means the strings contain only a's and b's.

Locating 'the first sequence of a followed by b' is easy, and reasonably
easy to check the sequence is exactly of length 3 (e.g. with a negative
lookbehind) -- but I don't know how to tell a RE to *stop* searching for
more if the check fails.

If a little more than just REs and matching was allowed, it would be
reasonably easy, but I don't know how to fashion a RE r such that
r.match(s) will succeed if and only if s meets those very precise and
complicated specs. That doesn't mean it just can't be done, just that I
can't do it so far. Perhaps the OP can tell us what constrains him to
use r.match ONLY, rather than a little bit of logic around it, so we can
see if we're trying to work in an artificially overconstrained domain?
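
For what "a little bit of logic around it" might look like, here is one possible sketch. The function name `first_ab_run_is_three` is hypothetical, and the approach is an assumption about what would satisfy the spec as quoted, not something confirmed by the OP.

```python
import re

def first_ab_run_is_three(s):
    """Return True if the first run of 'a's directly followed by 'b' has length 3."""
    # a+(?=b) finds the leftmost run of 'a's that is immediately followed by 'b'.
    # A run followed by anything else cannot match, so the search moves past it.
    m = re.search(r"a+(?=b)", s)
    return m is not None and len(m.group(0)) == 3

for s in ["aazaaab", "xyz123aabbaaab", "xyz123aaabbab"]:
    print(s, first_ab_run_is_three(s))
```

The regex only locates the run; the length check happens in plain Python, which sidesteps the problem of telling the RE to stop searching once the check fails.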


Alex
 
Alex Martelli

Christoph Conrad said:
Hello Roger,
since the length of the first sequence of the letter 'a' is 2. Yours
accepts it, right?

Yes, I misunderstood your requirements. So it must be modified to
essentially what Tim Chase wrote:

m = re.search('^[^a]*a{3}b', 'xyz123aabbaaab')

...but that rejects 'aazaaab', which should apparently be accepted.


Alex
 
Christoph Conrad

Hello Alex,
> r = re.compile("[^a]*a{3}b+(a+b*)*")
> matches = [s for s in listOfStringsToTest if r.match(s)]
>
> Unfortunately, the OP's spec is even more complex than this, if we are
> to take to the letter what you just quoted; e.g. aazaaab SHOULD match,

Then it's again "a{3}b", isn't it?

Kind regards,
Christoph
 
Roger L. Cauvin

Tim Chase said:
Sorry for the confusion. The correct pattern should reject
all strings except those in which the first sequence of the
letter 'a' that is followed by the letter 'b' has a length of
exactly three.

Ah...a little more clear.

r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

Wow, I like it, but it allows some strings it shouldn't. For example:

"xyz123aabbaaab"

(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)
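
One detail that may matter here: in Python, the outcome depends on whether the compiled pattern is applied with `match` or `search`. The sketch below assumes Python's `re` module; the variable names `anchored` and `floating` are only for illustration. `match` tries the pattern at position 0 only, while `search` may start anywhere, and it is the `search` behaviour that produces the 'bbaaab' result.

```python
import re

r = re.compile("[^a]*a{3}b+(a+b*)*")
s = "xyz123aabbaaab"

anchored = r.match(s)     # match() only tries to match at position 0
floating = r.search(s)    # search() may start the match anywhere

print(anchored)
print(floating.group(0), floating.start())
```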

 
Roger L. Cauvin

Christos Georgiou said:
Say I have some string that begins with an arbitrary sequence of characters
and then alternates repeating the letters 'a' and 'b' any number of times,
e.g.

"xyz123aaabbaabbbbababbbbaaabb"

I'm looking for a regular expression that matches the first, and only the
first, sequence of the letter 'a', and only if the length of the sequence is
exactly 3.

Does such a regular expression exist? If so, any ideas as to what it could
be?

Is this what you mean?

^[^a]*(a{3})(?:[^a].*)?$

Close, but the pattern should allow "arbitrary sequence of characters" that
precede the alternating a's and b's to contain the letter 'a'. In other
words, the pattern should accept:

"xayz123aaabbab"

since the 'a' between the 'x' and 'y' is not directly followed by a 'b'.

Your proposed pattern rejects this string.

 
Tim Chase

> r = re.compile("[^a]*a{3}b+(a+b*)*")
> matches = [s for s in listOfStringsToTest if r.match(s)]
>
> Wow, I like it, but it allows some strings it shouldn't. For example:
>
> "xyz123aabbaaab"
>
> (It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

Anchoring it to the beginning/end might solve that:

r = re.compile("^[^a]*a{3}b+(a+b*)*$")

this ensures that no "a"s come before the first 3x"a" and nothing
but "b" and "a" follows it.
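
A quick way to see the effect of the anchors is to run the anchored pattern through `search`, where `^` and `$` do the work that `match` would otherwise only half do. This is a rough sketch; the third sample string is made up here to show what `$` rules out.

```python
import re

anchored = re.compile(r"^[^a]*a{3}b+(a+b*)*$")

samples = [
    "xyz123aaabbab",    # three 'a's first, then only 'a's and 'b's
    "xyz123aabbaaab",   # first run of 'a' has length 2
    "xyz123aaabbabz",   # trailing 'z' (made-up example) violates the $ anchor
]
results = {s: anchored.search(s) is not None for s in samples}
print(results)
```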

-tkc
(who's translating from vim regexps, which are just different enough
to throw a wrench in the works...)
 
Peter Hansen

Roger said:
Sorry for the confusion. The correct pattern should reject all strings
except those in which the first sequence of the letter 'a' that is followed
by the letter 'b' has a length of exactly three.

Hope that's clearer . . . .

Examples are a *really* good way to clarify ambiguous or complex
requirements. In fact, when made executable they're called "test cases"
:), and supplying a few of those (showing input values and expected
output values) would help, not only to clarify your goals for the
humans, but also to let the proposed solutions easily be tested.

(After all, are you going to just trust that whatever you are handed
here is correctly implemented, and based on a perfect understanding of
your apparently unclear requirements?)

-Peter
 
Roger L. Cauvin

Tim Chase said:
r = re.compile("[^a]*a{3}b+(a+b*)*")
matches = [s for s in listOfStringsToTest if r.match(s)]

Wow, I like it, but it allows some strings it shouldn't. For example:

"xyz123aabbaaab"

(It skips over the two-letter sequence of 'a' and matches 'bbaaab'.)

Anchoring it to the beginning/end might solve that:

r = re.compile("^[^a]*a{3}b+(a+b*)*$")

this ensures that no "a"s come before the first 3x"a" and nothing but "b"
and "a" follows it.

Anchoring may be the key here, but this pattern rejects

"xayz123aaabab"

which it should accept, since the 'a' between the 'x' and the 'y' is not
directly followed by the letter 'b'.

 
Roger L. Cauvin

Peter Hansen said:
Examples are a *really* good way to clarify ambiguous or complex
requirements. In fact, when made executable they're called "test cases"
:), and supplying a few of those (showing input values and expected
output values) would help, not only to clarify your goals for the humans,
but also to let the proposed solutions easily be tested.

Good suggestion. Here are some "test cases":

"xyz123aaabbab" accept
"xyz123aabbaab" reject
"xayz123aaabab" accept
"xaaayz123abab" reject
"xaaayz123aaabab" accept
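
These cases can be made executable along the lines Peter suggested. The sketch below runs the patterns posted earlier in the thread against them using `re.search`. The last pattern, `CANDIDATE`, is not from the thread: it is one possible attempt, offered as an assumption and checked only against these five cases. The helper name `failures` is likewise hypothetical.

```python
import re

# Roger's test cases: (input, should_accept)
CASES = [
    ("xyz123aaabbab", True),
    ("xyz123aabbaab", False),
    ("xayz123aaabab", True),
    ("xaaayz123abab", False),
    ("xaaayz123aaabab", True),
]

# Patterns posted earlier in the thread.
THREAD_PATTERNS = [
    r"a{3}",
    r"^[^a]*a{3}b",
    r"^[^a]*(a{3})(?:[^a].*)?$",
    r"^[^a]*a{3}b+(a+b*)*$",
]

# NOT from the thread: skip non-'a' characters and whole runs of 'a' that are
# not followed by 'b', then require exactly "aaab".
CANDIDATE = r"^(?:[^a]|a+(?![ab]))*aaab"

def failures(pattern):
    """Return the inputs on which re.search disagrees with the expected verdict."""
    rx = re.compile(pattern)
    return [s for s, ok in CASES if (rx.search(s) is not None) != ok]

for pattern in THREAD_PATTERNS + [CANDIDATE]:
    print(pattern, "->", failures(pattern) or "all pass")
```

In the candidate, `a+(?![ab])` can only consume a complete run of 'a's that is not followed by 'b', so the prefix cannot step over a run such as the "aa" in "xyz123aabbaab". Passing five cases is of course not proof that it matches the full intent.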

 
