Regular expression intricacies: why do REs skip some matches?

C

Chris Lasher

Hey guys and gals,
This is a followup of my "Counting all permutations of a substring"
thread (see
http://groups.google.com/group/comp...b7ae381b0a9/7657235b3fd3966f#7657235b3fd3966f
in Google Groups) I'm still having a difficult time figuring out the
intricacies of regular expressions and consecutive matches. Here's a
brief example:

In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?

Using redemo.py, one can see that the matches are occurring at the
following spots:
abababa
^ ^ (where ^ indicates the start of a match)
Ideally, there'd be a way to create the regular expression to get at
this match, too:
abababa
^
So that the total matches are:
abababa
^ ^ ^

Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?

Confusedly yours,
Chris
 
D

Diez B. Roggisch

Is this simply not the way REs work? Does this sort of matching really
have to be home-coded?

Yes. The reason is basically that consumed characters can't be
"unconsumed". However, if you use the search-variant with a
start-argument you can search from the last occurence start+1 to achieve
what you're after.

Diez
 
J

John Machin

It's nothing to do with how/what a regular expression matches. It's all
to do with the definition of whatever convenience methods like
findall() that have been built on top of the basic match() method --
having found a match, where do they start looking for the next match?
Typically, one does not want overlapping matches.
 
T

Tim Chase

In [1]: import re
In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?

Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa')

which will return 3. However, each result is empty:

>>> print re.findall('(?=aba)', 'abababa')
['','','']

You'd have to do some chicanary to get the actual pieces:

s = 'abababa'
for f in re.finditer('(?=aba)', s):
print "Found %s at %i" % (
s[f.start():f.start()+3],
f.start())

or

[s[f.start():f.start()+3] for f in
re.finditer('(?=aba)', s)]

Note that both of these know the length of the desired
piece. If not, you may have to do additional processing to
get them to work the way you want. Yippie.

All lovely hacks, but they each return all three hits.

-tim
PS: These likely only work in Python...to use them in grep
or another regexp engine, you'd have to tweak them :*)
 
B

Ben Cartwright

Tim said:
In [1]: import re

In [2]: aba_re = re.compile('aba')

In [3]: aba_re.findall('abababa')
Out[3]: ['aba', 'aba']

The return is two matches, whereas, I expected three. Why does this
regular expression work this way?

It's just the way regexes work. You may disagree, but it's more
intuitive that iterated pattern searching be non-overlapping by
default. See also:
2
Well, if you don't need the actual results, just their
count, you can use

how_many = len(re.findall('(?=aba)', 'abababa')

which will return 3. However, each result is empty:

>>> print re.findall('(?=aba)', 'abababa')
['','','']

You'd have to do some chicanary to get the actual pieces:
(snip)

Actually, you can just define a group inside the lookahead assertion:
['aba', 'aba', 'aba']

--Ben
 
C

Chris Lasher

Diez, John, Tim, and Ben, thank you all so much. I now "get it". It
makes logical sense now that the difficulty was actually in the
implementation of findall, which does non-overlapping matches. It also
makes sense, now, that one can get around this by using a lookahead
assertion. Thanks a bunch, guys; this really helped!

Chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top