reusing parts of a string in RE matches?

J

John Salerno

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Thanks.
 
M

Murali

John said:
So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

Why not something like

import re
string = 'abababababababab'
pattern = re.compile(r"^aba")
ans = []
for x in xrange(len(string)):
m = pattern.match(string[x:])
if m: ans.append( (x+m.start(),x+m.end()))

# now ans is a list of pairs (p,q) where the substring string[p:q]
matches pattern

- Murali
 
B

Bo Yang

John Salerno 写é“:
I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
Below from the python library reference

*|(?=...)|*
Matches if ... matches next, but doesn't consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match |'Isaac '| only if it's followed by |'Asimov'|.


m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.
I try the code , but I give seven results !
 
B

BartlebyScrivener

Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd
 
J

John Salerno

Bo said:
I try the code , but I give seven results !

Sorry, I meant that findall() only returns 4 results when searching for
'aba', when there are actually seven instances of 'aba'. This doesn't
involve the look-ahead RE.
 
J

John Salerno

BartlebyScrivener said:
Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

Heh heh, I'm sure you're right, but this is more just an exercise for me
in REs, so I'm curious how you might do it, unless the answer is that
it's just too complicated to be worth it (like Murali's example!) That
goes beyond just an RE pattern.
 
B

BartlebyScrivener

I have to at least try :)

s = "abababababababab"

for x in range(len(s)):
.... try:
.... s.index("aba", x, x + 3)
.... except ValueError:
.... pass

rd
 
J

John Salerno

BartlebyScrivener said:
I have to at least try :)

s = "abababababababab"

for x in range(len(s)):
... try:
... s.index("aba", x, x + 3)
... except ValueError:
... pass

rd

yeah, looks like index() or find() can be used to do it instead of RE,
but still, i'd like to know if there's a way you can write an RE
expression to do it (and just an RE expression, without all the other
for loops and extra nonsense...otherwise i might as well just use string
methods)
 
M

mpeters42

From the Python 2.4 docs:

findall( pattern, string[, flags])
Return a list of all ***non-overlapping*** matches of pattern in
string....

By design, the regex functions return non-overlapping patterns.

Without doing some kind of looping, I think you are out of luck.

If you pattern is fixed, then a solution might be:
string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

If the pattern is not fixed (i.e. 'a.a') then this method can still get
a count of overlapping matches, but cannot get the individual match
strings themselves.

A simple loop should do in this case, though:
.... r= re.match(pat,string[i:])
.... if r: print r.group()
....
aba
aba
aba
aba
aba
aba
aba
 
B

BartlebyScrivener

otherwise i might as well just use string
I think you're supposed to use string methods if you can, to avoid the
old adage about having two problems instead of one when using regex.

rd
 
J

John Salerno

string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?
 
J

John Salerno

John said:
string = 'abababababababab'
pat = 'aba'
[pat for s in re.compile('(?='+pat+')').findall(string)]
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

Nevermind, I get it! The point is that you *aren'* matching anything
(except the empty string), and this only to see how many times it
occurs, then you are replacing those occurrences with the pattern
string. So this is basically like a counter?
 
K

Kent Johnson

John said:
I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches?

You can specify a start location to re.search(), and get the location of
a match from a match object. This allows you to loop, searching the
string following the last match:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')

ans = []
start = 0
while True:
m = pattern.search(string, start)
if not m: break
ans.append( (m.start(), m.end()) )
start = m.start() + 1

print ans # => [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]

Kent
 
M

mpeters42

Exactly,

Now this will work as long as there are no wildcards in the pattern.
Thus, only with fixed strings. But if you have a fixed string, there
is really no need to use regex, as it will complicate you life for no
real reason (as opposed to simple string methods).

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

To actually do that you will need to iterate through the string and
apply the pattern match (which matches only the beginning of a string)
to a indexed subset of the original (see example in the last post)
 
B

Ben Cartwright

John said:
So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

--Ben
 
M

Murali

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

- Murali
 
B

Ben Cartwright

Murali said:
Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
import re
re.findall(r'(?=(aba))', 'abababababababab')
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.


Careful. That won't work as expected for *all* regexps. Example:
['abaca', 'aca']

Note that this does *not* find 'aba'. You might think that making it
non-greedy might help, but:
['aba', 'aca']

Nope, now it's not finding 'abaca'.

This is by design, though. From
http://www.regular-expressions.info/lookaround.html (a good read, by
the way):

"""As soon as the lookaround condition is satisfied, the regex engine
forgets about everything inside the lookaround. It will not backtrack
inside the lookaround to try different permutations."""

Moral of the story: keep lookahead assertions simple whenever
possible. :)

--Ben
 
M

Mirco Wahab

Hi mpeters42 & John
With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

Lets take this as a starting point for another example
that comes to mind. You have a string of characters
interspersed with numbers: tx = 'a1a2a3A4a35a6b7b8c9c'

Now you try to find all _numbers_, which have
symmetrical characters (like a<-2->a) which
are not in 3/3/3... synced groups.

This can easy be done in P(ytho|nerl) etc. by
positive lookahead (even the same pattern does:)
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)
Pe:
$_ = 'a1a2a3A4a35a6b7b8c9c';
print /(\w)(?=(.)\1)/g;

(should find 1,2,7,9 only, python regex
written to var in order to prevent
clunky lines ;-)

BTW, Py Regex Engine seems to
be very close to the perl one:
Naive (!) matching of a pattern
with 14 o's (intersperded by
anything) against a string of
16 o's takes about exaclty the same
time here in Py(2.4.3) and Pe (5.8.7):

tl = 'oooooooooooooooo'
rg = r'o*o*o*o*o*o*o*o*o*o*o*o*o*o*[\W]'
print re.search(rg, tl)

Py: 101 sec
Pe: 109 sec

(which would find no match because there's
no \W-like character at the end of the
string here)

Regards

Mirco
 
J

John Salerno

Mirco said:
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)

The only problem seems to be (and I ran into this with my original
example too) that what gets returned by this code isn't exactly what you
are looking for, i.e. the numbers '1', '2', etc. You get a list of
tuples, and the second item in this tuple contains the number, but also
the following \w character.

So there still seems to be some work that must be done when dealing with
overlapping patterns/look-ahead/behind.

Oh wait, a thought just hit me. Instead of doing it as you did:

rg = r'(\w)(?=(.\1))'

Could you do:

rg = r'(\w)(?=(.)\1)'

That would at least isolate the number, although you'd still have to get
it out of the list/tuple.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,575
Members
45,053
Latest member
billing-software

Latest Threads

Top