reusing parts of a string in RE matches?

John Salerno · May 10, 2006

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway

So my question is, how can I find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Thanks.

Murali · May 10, 2006

John said:
So my question is, how can I find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

Why not something like

import re
string = 'abababababababab'
pattern = re.compile(r"^aba")
ans = []
for x in xrange(len(string)):
    m = pattern.match(string[x:])
    if m:
        ans.append((x + m.start(), x + m.end()))

# now ans is a list of pairs (p, q) where the substring string[p:q]
# matches pattern

- Murali
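
A minimal Python 3 sketch of the same position-by-position idea, for readers following along. It assumes a compiled pattern's match() is given a start position instead of a slice, so the reported indices are already absolute. The leading '^' is dropped because match() anchors at the given position anyway, and '^' would only succeed at the true start of the string. The helper name overlapping_spans is hypothetical.

```python
import re

def overlapping_spans(pattern, text):
    """Return (start, end) for every position at which pattern matches."""
    compiled = re.compile(pattern)
    spans = []
    for pos in range(len(text)):
        # match() only tries a match beginning exactly at pos
        m = compiled.match(text, pos)
        if m:
            spans.append((m.start(), m.end()))
    return spans

print(overlapping_spans(r'aba', 'abababababababab'))
```

This avoids building a new substring for every position, at the cost of one match attempt per character.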

Bo Yang · May 10, 2006

John Salerno wrote:

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway

So my question is, how can I find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')

m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

I tried the code, but I get seven results!

BartlebyScrivener · May 10, 2006

Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

John Salerno · May 10, 2006

Bo said:
I tried the code, but I get seven results!

Sorry, I meant that findall() only returns 4 results when searching for
'aba', when there are actually seven instances of 'aba'. This doesn't
involve the look-ahead RE.
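
To make the four-versus-seven difference concrete, here is a small Python 3 sketch comparing the two patterns discussed so far. The variable names plain and lookahead are only illustrative.

```python
import re

s = 'abababababababab'

# A plain pattern consumes all three characters, so the next search
# starts after the match and every other 'aba' is skipped.
plain = re.findall(r'aba', s)

# The look-ahead version consumes only 'ab'; the following 'a' is
# checked but left in place, so it can start the next match.
lookahead = re.findall(r'ab(?=a)', s)

print(len(plain), plain)
print(len(lookahead), lookahead)
```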

John Salerno · May 10, 2006

BartlebyScrivener said:
Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

Heh heh, I'm sure you're right, but this is more just an exercise for me
in REs, so I'm curious how you might do it, unless the answer is that
it's just too complicated to be worth it (like Murali's example!) That
goes beyond just an RE pattern.

BartlebyScrivener · May 10, 2006

I have to at least try

s = "abababababababab"

for x in range(len(s)):
    try:
        # index() raises ValueError unless "aba" starts exactly at x
        print s.index("aba", x, x + 3)
    except ValueError:
        pass

rd
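
A variation on the string-method approach, sketched in Python 3 under the assumption that only the start positions are wanted. It uses str.find() with a moving start index instead of catching ValueError; the helper name find_all is hypothetical.

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub, overlaps included."""
    starts = []
    pos = text.find(sub)
    while pos != -1:
        starts.append(pos)
        # Advance by one character rather than len(sub) so that
        # overlapping occurrences are not skipped.
        pos = text.find(sub, pos + 1)
    return starts

print(find_all('abababababababab', 'aba'))
```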

John Salerno · May 10, 2006

BartlebyScrivener said:
I have to at least try

s = "abababababababab"

for x in range(len(s)):
    try:
        s.index("aba", x, x + 3)
    except ValueError:
        pass

rd

yeah, looks like index() or find() can be used to do it instead of RE,
but still, i'd like to know if there's a way you can write an RE
expression to do it (and just an RE expression, without all the other
for loops and extra nonsense...otherwise i might as well just use string
methods)

mpeters42 · May 10, 2006

From the Python 2.4 docs:

findall( pattern, string[, flags])
Return a list of all ***non-overlapping*** matches of pattern in
string....

By design, the regex functions return non-overlapping patterns.

Without doing some kind of looping, I think you are out of luck.

If your pattern is fixed, then a solution might be:

string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

If the pattern is not fixed (i.e. 'a.a') then this method can still get
a count of overlapping matches, but cannot get the individual match
strings themselves.

A simple loop should do in this case, though:

for i in range(len(string)):
    r = re.match(pat, string[i:])
    if r: print r.group()

aba
aba
aba
aba
aba
aba
aba

BartlebyScrivener · May 10, 2006

John said:
otherwise i might as well just use string methods

I think you're supposed to use string methods if you can, to avoid the
old adage about having two problems instead of one when using regex.

rd

John Salerno · May 10, 2006

string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

John Salerno · May 10, 2006

John said:
string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

Nevermind, I get it! The point is that you *aren't* matching anything
(except the empty string), and this is only to see how many times it
occurs; then you are replacing those occurrences with the pattern
string. So this is basically like a counter?
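
The "counter" reading can be checked directly. This Python 3 sketch shows that the bare look-ahead matches the empty string at each qualifying position, and that re.finditer() can report where those positions are. The names hits and starts are illustrative.

```python
import re

s = 'abababababababab'

# Each match is zero-width, so findall() returns empty strings;
# only their number carries information.
hits = re.findall(r'(?=aba)', s)
print(hits)
print(len(hits))

# finditer() yields match objects, which also expose the positions.
starts = [m.start() for m in re.finditer(r'(?=aba)', s)]
print(starts)
```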

Kent Johnson · May 10, 2006

John said:
I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway

So my question is, how can I find all occurrences of a pattern in a
string, including overlapping matches?

You can specify a start location to re.search(), and get the location of
a match from a match object. This allows you to loop, searching the
string following the last match:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')

ans = []
start = 0
while True:
    m = pattern.search(string, start)
    if not m: break
    ans.append( (m.start(), m.end()) )
    start = m.start() + 1

print ans # => [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]

Kent
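
The same search-and-restart loop can be wrapped in a generator. This is a Python 3 sketch, not Kent's code: the name iter_overlapping is hypothetical, and the bound on start is an added guard so that a pattern able to match the empty string cannot loop forever.

```python
import re

def iter_overlapping(pattern, text):
    """Yield match objects, restarting one character after each match start."""
    compiled = re.compile(pattern)
    start = 0
    while start <= len(text):
        m = compiled.search(text, start)
        if m is None:
            return
        yield m
        # Resume just past the start of this match, not past its end,
        # so a later match may overlap it.
        start = m.start() + 1

print([m.group() for m in iter_overlapping(r'a.a', 'abacadabcda')])
```

Because the match objects are yielded whole, this also works for patterns with wildcards: the matched text is available through group() and the location through span().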

mpeters42 · May 10, 2006

Exactly,

Now this will work as long as there are no wildcards in the pattern.
Thus, only with fixed strings. But if you have a fixed string, there
is really no need to use regex, as it will complicate your life for no
real reason (as opposed to simple string methods).

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

To actually do that you will need to iterate through the string and
apply the pattern match (which matches only the beginning of a string)
to an indexed subset of the original (see example in the last post).

Ben Cartwright · May 11, 2006

John said:
So my question is, how can I find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:

import re
re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

--Ben
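
If the positions are wanted as well as the text, the same group-inside-lookahead pattern can be used with re.finditer(). A Python 3 sketch; the variable name spans is illustrative.

```python
import re

s = 'abababababababab'

# The overall match is empty, but group 1 (inside the look-ahead)
# records the text that was looked at and where it sits.
for m in re.finditer(r'(?=(aba))', s):
    print(m.start(1), m.end(1), m.group(1))

spans = [m.span(1) for m in re.finditer(r'(?=(aba))', s)]
```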

Murali · May 11, 2006

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:

import re
re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
    return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

- Murali
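
One caveat worth sketching: if pat contains capturing groups of its own, re.findall() returns tuples rather than strings. The Python 3 variant below assumes only the whole matched text is wanted and reads group 1 from re.finditer() instead; the name all_occurrences_text is hypothetical.

```python
import re

def all_occurrences_text(pat, text):
    """Return the text of every (possibly overlapping) match of pat."""
    wrapped = re.compile(r'(?=(%s))' % pat)
    # Group 1 is always the outer wrapper, whatever groups pat adds.
    return [m.group(1) for m in wrapped.finditer(text)]

print(re.findall(r'(?=(%s))' % r'a(.)a', 'abacada'))  # tuples
print(all_occurrences_text(r'a(.)a', 'abacada'))      # plain strings
```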

Ben Cartwright · May 11, 2006

Murali said:
Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:

import re
re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
    return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

Careful. That won't work as expected for *all* regexps. Example:
['abaca', 'aca']

Note that this does *not* find 'aba'. You might think that making it
non-greedy might help, but:
['aba', 'aca']

Nope, now it's not finding 'abaca'.

This is by design, though. From
http://www.regular-expressions.info/lookaround.html (a good read, by
the way):

"""As soon as the lookaround condition is satisfied, the regex engine
forgets about everything inside the lookaround. It will not backtrack
inside the lookaround to try different permutations."""

Moral of the story: keep lookahead assertions simple whenever
possible.

--Ben
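
The calls that produced the two result lists above are not shown in the post. The Python 3 sketch below assumes they were the greedy pattern a.*a and the non-greedy a.*?a applied to 'abaca'; that is a guess, chosen because it reproduces both lists. The helper every_match is hypothetical and simply tries every start and end pair, which is quadratic in the length of the string.

```python
import re

def all_occurences(pat, text):
    return re.findall(r'(?=(%s))' % pat, text)

# Assumed inputs: one match per start position, so either the long
# or the short candidate at position 0 is reported, never both.
greedy = all_occurences(r'a.*a', 'abaca')
lazy = all_occurences(r'a.*?a', 'abaca')
print(greedy)
print(lazy)

def every_match(pat, text):
    """Return every substring of text that pat matches in full."""
    compiled = re.compile(pat)
    n = len(text)
    return [text[i:j]
            for i in range(n)
            for j in range(i + 1, n + 1)
            if compiled.fullmatch(text, i, j)]

print(every_match(r'a.*a', 'abaca'))
```

Trying each end position separately is what recovers both 'aba' and 'abaca' from position 0; a single look-ahead cannot do that because it settles on one alternative per position.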

BartlebyScrivener · May 11, 2006

Thanks, Ben. Quite an education!

rick

Mirco Wahab · May 11, 2006

Hi mpeters42 & John

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

Let's take this as a starting point for another example
that comes to mind. You have a string of characters
interspersed with numbers: tx = 'a1a2a3A4a35a6b7b8c9c'

Now you try to find all _numbers_, which have
symmetrical characters (like a<-2->a) which
are not in 3/3/3... synced groups.

This can easily be done in P(ytho|nerl) etc. by
positive lookahead (even the same pattern does

Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)
Pe:
$_ = 'a1a2a3A4a35a6b7b8c9c';
print /(\w)(?=(.)\1)/g;

(should find 1,2,7,9 only, python regex
written to var in order to prevent
clunky lines ;-)

BTW, Py Regex Engine seems to
be very close to the perl one:
Naive (!) matching of a pattern
with 14 o's (interspersed by
anything) against a string of
16 o's takes about exactly the same
time here in Py(2.4.3) and Pe (5.8.7):

tl = 'oooooooooooooooo'
rg = r'o*o*o*o*o*o*o*o*o*o*o*o*o*o*[\W]'
print re.search(rg, tl)

Py: 101 sec
Pe: 109 sec

(which would find no match because there's
no \W-like character at the end of the
string here)

Regards

Mirco
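
For anyone who wants to try the backtracking experiment without waiting minutes, here is a scaled-down Python 3 sketch. The sizes are an assumption picked so that it finishes quickly, and the timings will differ by machine and interpreter version; the helper name time_failed_search is hypothetical.

```python
import re
import time

def time_failed_search(k, n):
    """Search n o's with k copies of 'o*' followed by a non-word character."""
    pattern = re.compile(r'o*' * k + r'[\W]')
    text = 'o' * n
    t0 = time.perf_counter()
    result = pattern.search(text)  # no \W in text, so this must fail
    return result, time.perf_counter() - t0

for k in (2, 4, 6):
    result, seconds = time_failed_search(k, k + 2)
    print(k, result, '%.6f s' % seconds)
```

The search can only fail after the engine has tried the ways of dividing the o's among the starred terms, which is why the cost climbs so steeply as k and n grow.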

John Salerno · May 11, 2006

Mirco said:
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)

The only problem seems to be (and I ran into this with my original
example too) that what gets returned by this code isn't exactly what you
are looking for, i.e. the numbers '1', '2', etc. You get a list of
tuples, and the second item in this tuple contains the number, but also
the following \w character.

So there still seems to be some work that must be done when dealing with
overlapping patterns/look-ahead/behind.

Oh wait, a thought just hit me. Instead of doing it as you did:

rg = r'(\w)(?=(.\1))'

Could you do:

rg = r'(\w)(?=(.)\1)'

That would at least isolate the number, although you'd still have to get
it out of the list/tuple.
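
A quick Python 3 check of the two group placements, with the last step of pulling the numbers out of the tuples. The names wide, narrow and numbers are illustrative.

```python
import re

tx = 'a1a2a3A4a35a6b7b8c9c'

# Group 2 spans the middle character plus the repeated letter.
wide = re.findall(r'(\w)(?=(.\1))', tx)

# Group 2 spans only the middle character; \1 sits outside it.
narrow = re.findall(r'(\w)(?=(.)\1)', tx)

# findall() returns one tuple per match when there are several groups,
# so the second element has to be picked out explicitly.
numbers = [middle for _, middle in narrow]

print(wide)
print(narrow)
print(numbers)
```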
