Overlapping Regular Expression Matches With findall()

Mystilleef

Hello,

Is there a simple flag to set to allow overlapping matches
for the findall() regular expression method? In other words,
if a string contains five occurrences of the string pattern
"cat", calling findall on the string returns a list
containing five "cat" strings. Is it possible for findall()
to just return one "cat" string?

Thanks
 
Fredrik Lundh

Mystilleef said:
Is there a simple flag to set to allow overlapping matches
for the findall() regular expression method? In other words,
if a string contains five occurrences of the string pattern
"cat", calling findall on the string returns a list
containing five "cat" strings. Is it possible for findall()
to just return one "cat" string?

your definition of "overlapping" seems to be a bit odd, but assuming
your description is correct, the answer is no.

on the other hand, if you only want one hit, why not use "search"
instead of "findall" ?
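For readers comparing the two calls, here is a minimal sketch of the difference. The sample string is a made-up assumption, not taken from the original question.

```python
import re

# Hypothetical sample input with five occurrences of "cat".
text = "cat cat cat cat cat"

# findall() returns every non-overlapping match as a list of strings.
all_hits = re.findall(r"cat", text)

# search() stops at the first match and returns a match object (or None).
first_hit = re.search(r"cat", text)

print(all_hits)
print(first_hit.group(0) if first_hit else None)
```

Note that search() answers "is there at least one hit?", so it does not help when several different strings can match and each should be reported once.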

</F>
 
Mystilleef

Hello,

Thanks for your response. I was going by the definition in
the manual. I believe a search only returns the first
match of a regular expression pattern in a string and then
stops further searches if one is found. That's not what I
want.

I want a pattern that scans the entire string but avoids
returning duplicate matches. For example "cat", "cate",
"cater" may all well be valid matches, but I don't want
duplicate matches of any of them. I know I can filter the
list containing found matches myself, but that is somewhat
expensive for a list containing thousands of matches.

Thanks
 
Simon Brunning

Mystilleef said:
I want a pattern that scans the entire string but avoids
returning duplicate matches. For example "cat", "cate",
"cater" may all well be valid matches, but I don't want
duplicate matches of any of them. I know I can filter the
list containing found matches myself, but that is somewhat
expensive for a list containing thousands of matches.

Probably the cheapest way of de-duping the list would be to dump it
straight into a set, provided that you aren't concerned about the
order.
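As a rough illustration of the set approach, assuming a made-up sample string and a hypothetical pattern (neither comes from the original post):

```python
import re

# Hypothetical sample text and pattern, for illustration only.
text = "cat cate cater cat cate"
pattern = re.compile(r"cat\w*")

# Passing the list of matches to set() drops duplicates in one step.
# The iteration order of the resulting set is not guaranteed.
unique = set(pattern.findall(text))

print(sorted(unique))
```

Building the set is a single pass over the matches, so it should stay cheap even for thousands of hits.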
 
Bengt Richter

Simon Brunning said:
Probably the cheapest way of de-duping the list would be to dump it
straight into a set, provided that you aren't concerned about the
order.

Or if concerned, maybe try a combination like:

>>> s = """\
... I want a pattern that scans the entire string but avoids
... returning duplicate matches. For example "cat", "cate",
... "cater" may all well be valid matches, but I don't want
... duplicate matches of any of them. I know I can filter the
... list containing found matches myself, but that is somewhat
... expensive for a list containing thousands of matches.
... """
>>> import re
>>> rxo = re.compile(r'cat(?:er|e)?')
>>> rxo.findall(s)
['cate', 'cat', 'cate', 'cater', 'cate']
>>> seen = set()
>>> [w for w in (m.group(0) for m in rxo.finditer(s)) if w not in seen and not seen.add(w)]
['cate', 'cat', 'cater']

BTW, note to put the longer alternative first in the re when one alternative is a prefix of another, e.g., not r'cat(?:e|er)?' for the above, since that would match only "cate" in "cater".
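A small sketch of why the order of alternatives matters. The test word is just an example. Python's re module tries alternatives left to right and keeps the first one that lets the overall match succeed:

```python
import re

word = "cater"

# Longer alternative first: "er" is tried before "e", so the whole word matches.
longer_first = re.findall(r"cat(?:er|e)?", word)

# Shorter alternative first: "e" succeeds immediately, so the match stops at "cate".
shorter_first = re.findall(r"cat(?:e|er)?", word)

print(longer_first)
print(shorter_first)
```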

Regards,
Bengt Richter
 
Fredrik Lundh

Mystilleef said:
Thanks for your response. I was going by the definition in
the manual.

"non-overlapping" in that context means that if you e.g. search for "(ba)+"
in the string "bababa", you get one match ("bababa"), not three or six.
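To make the "non-overlapping" point concrete, here is a hedged sketch. The lookahead trick in the second half is not from the thread. It is one common way to see what overlapping matches would look like:

```python
import re

text = "bababa"

# Normal scanning: each match consumes its characters, so the greedy
# pattern yields a single match covering the whole string.
# finditer() with group(0) is used because findall() would return only
# the contents of the capturing group.
non_overlapping = [m.group(0) for m in re.finditer(r"(ba)+", text)]

# Overlapping view (illustrative only): a zero-width lookahead consumes
# nothing, so a match can start at every position where "ba" begins.
overlapping = re.findall(r"(?=((?:ba)+))", text)

print(non_overlapping)
print(overlapping)
```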

in your case, it sounds like you want a search for "ba" to return only one
match.

Mystilleef said:
I know I can filter the list containing found matches myself, but that
is somewhat expensive for a list containing thousands of matches.

if the order doesn't matter, you don't have to build a list:
>>> set(m.group() for m in re.finditer(r"cat\w*", text))
set(['catatonic', 'catnip', 'catched', 'cat'])

if you need to preserve the order, you could use a combination of a
list and a set (or a dictionary):
>>> s = set(); w = []
>>> for m in re.finditer(r"cat\w*", text):
...     m = m.group()
...     if m not in s:
...         s.add(m); w.append(m)
...
>>> w
['cat', 'catched', 'catnip', 'catatonic']
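The dictionary variant mentioned above can be sketched as follows. This assumes Python 3.7 or later, where dicts keep insertion order, and the sample text is invented for illustration:

```python
import re

# Hypothetical sample text containing repeated words.
text = "cat catched cat catnip catched catatonic"

# dict.fromkeys() keeps only the first occurrence of each key, and since
# dicts preserve insertion order, the first-seen order of matches survives.
unique_in_order = list(
    dict.fromkeys(m.group() for m in re.finditer(r"cat\w*", text))
)

print(unique_in_order)
```

The generator feeds matches straight into the dict, so no intermediate list of all matches is built.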

</F>
 