Efficient way of testing for substring being one of a set?

tinnews · Apr 3, 2008

What's the neatest and/or most efficient way of testing if one of a
set of strings (contained in a dictionary, list or similar) is a
sub-string of a given string?

I.e. I have a string delivered into my program and I want to see if
any of a set of strings is a substring of the string I have been
given. It's quite OK to stop at the first one found. Ideally the
strings being searched through will be the keys of a dictionary but
this isn't a necessity, they can just be in a list if it could be done
more efficiently using a list.

Is this the best one can do (ignoring the likelihood that I've got
some syntax wrong) :-

# l is the list
# str is the incoming string
answer = ""
for x in l:
    if str.find(x) < 0:
        continue
    answer = x

Paul Hankin · Apr 3, 2008

I'd not use 'l' (easily confused with '1') or 'str' (the name of the
built-in string type) as variable names. Your code checks every string
in the list even after it has found one; you can reverse the test and
break when the first one is found. Using 'in' rather than testing the
return value of find is nicer as a substring test. Finally, using the
'else' clause lets you make it clear that answer is set to the empty
string when no match is found.

for answer in l:
    # test whether the list item occurs in the incoming string
    if answer in str: break
else:
    answer = ''
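
As a small self-contained sketch of the break/else pattern just described, the loop can be wrapped in a function. The name first_match and its parameter names are made up for illustration and are not from the thread:

```python
def first_match(candidates, text):
    """Return the first candidate that occurs in text, or '' if none does."""
    for answer in candidates:
        if answer in text:
            break  # stop at the first hit
    else:
        # the loop finished without break, so nothing matched
        answer = ''
    return answer


print(first_match(["he", "sh", "bla"], "blah"))         # prints: bla
print(repr(first_match(["he", "sh", "bla"], "bling")))  # prints: ''
```

The else branch also covers an empty candidate list, since the loop body then never runs.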

Jeff · Apr 3, 2008

def foo(sample, strings):
    for s in strings:
        if sample in s:
            return True
    return False

This was an order of magnitude faster for me than using str.find or
str.index. That was finding rare words in the entire word-list (w/
duplicates) of War and Peace.

tinnews · Apr 3, 2008

Paul Hankin said:
> I'd not use 'l' (easily confused with '1') or 'str' (the name of the
> built-in string type) as variable names.

Neither would I, it was just thrown together to show how I was
thinking.

> Your code checks every string in the list even after it has found
> one; you can reverse the test and break when the first one is found.
> Using 'in' rather than testing the return value of find is nicer as
> a substring test. Finally, using the 'else' clause lets you make it
> clear that answer is set to the empty string when no match is found.
>
> for answer in l:
>     if answer in str: break
> else:
>     answer = ''

OK, that does improve things somewhat, thanks.

tinnews · Apr 3, 2008

Jeff said:
> def foo(sample, strings):
>     for s in strings:
>         if sample in s:
>             return True
>     return False
>
> This was an order of magnitude faster for me than using str.find or
> str.index. That was finding rare words in the entire word-list (w/
> duplicates) of War and Peace.

However it's the wrong way around, in my case 'sample' is the longer
string and I want to know if s is in it. It's simple enough to do it
the other way around though:-

def foo(sample, strings):
    for s in strings:
        if s in sample:
            return True
    return False

Using in rather than find() and making it a function would seem to be
the way to go, thanks.
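
Timings like the one reported above depend heavily on the data and the Python version, so it may be worth measuring on your own input. A rough sketch using the standard timeit module follows; the word list and sample string are made up for illustration and are not the War and Peace data mentioned earlier:

```python
import timeit

# Made-up data for illustration only.
strings = ["alpha", "beta", "gamma", "delta"] * 1000
sample = "the quick brown fox jumps over the lazy dog"


def any_in(sample, strings):
    # substring test with the 'in' operator
    for s in strings:
        if s in sample:
            return True
    return False


def any_find(sample, strings):
    # same test via str.find, which returns -1 when nothing is found
    for s in strings:
        if sample.find(s) >= 0:
            return True
    return False


for func in (any_in, any_find):
    seconds = timeit.timeit(lambda: func(sample, strings), number=100)
    print("%s: %.4f s" % (func.__name__, seconds))
```

No expected timings are shown because they vary from machine to machine; only the relative difference on your own data is meaningful.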

George Sakkis · Apr 3, 2008

> def foo(sample, strings):
>     for s in strings:
>         if sample in s:
>             return True
>     return False
>
> This was an order of magnitude faster for me than using str.find or
> str.index. That was finding rare words in the entire word-list (w/
> duplicates) of War and Peace.

If you test against the same substrings over and over again, an
alternative would be to build a regular expression:

import re
# build one alternation pattern from all (escaped) substrings
search = re.compile('|'.join(re.escape(x)
                             for x in substrings)).search
p = search(somestring)
if p is not None:
    print('Found %s' % p.group())

George
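
A self-contained variant of the same idea, as a sketch only: the helper name make_searcher is hypothetical, and the guard for an empty list is an addition, since joining no alternatives yields an empty pattern that matches every string.

```python
import re


def make_searcher(substrings):
    """Return a function that looks for any of the substrings in a string."""
    substrings = list(substrings)
    if not substrings:
        # An empty pattern would match everywhere, so handle this case
        # explicitly and never report a match.
        return lambda text: None
    pattern = re.compile('|'.join(re.escape(s) for s in substrings))
    return pattern.search


search = make_searcher(["he", "sh", "bla"])
match = search("blah")
if match is not None:
    print("Found %s" % match.group())  # prints: Found bla
print(search("bling"))                  # prints: None
```

The pattern is compiled once and the returned search function can be reused for every incoming string.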

Ant · Apr 3, 2008

> What's the neatest and/or most efficient way of testing if one of a
> [...]

A different approach:

>>> words = ["he", "sh", "bla"]
>>> name = "blah"
>>> True in (word in name for word in words)
True
>>> name = "bling"
>>> True in (word in name for word in words)
False

Perhaps not as obvious or readable as Jeff's example, but is
essentially doing the same thing using generator syntax.

Jeff · Apr 3, 2008

> If you test against the same substrings over and over again, an
> alternative would be to build a regular expression:
>
> import re
> search = re.compile('|'.join(re.escape(x)
>                              for x in substrings)).search
> p = search(somestring)
> if p is not None:
>     print('Found %s' % p.group())
>
> George

That would be an enormous regular expression and eat a lot of memory.
But over an enormous number of substrings, it would be O(log n),
rather than O(n).

Jeff · Apr 3, 2008

> > What's the neatest and/or most efficient way of testing if one of a
> > [...]
>
> A different approach:
>
> >>> words = ["he", "sh", "bla"]
> >>> name = "blah"
> >>> True in (word in name for word in words)
> True
> >>> name = "bling"
> >>> True in (word in name for word in words)
> False
>
> Perhaps not as obvious or readable as Jeff's example, but is
> essentially doing the same thing using generator syntax.

That's pretty

Dennis.Benzinger · Apr 3, 2008

> What's the neatest and/or most efficient way of testing if one of a
> set of strings (contained in a dictionary, list or similar) is a
> sub-string of a given string?
> [...]

You could use the Aho-Corasick algorithm
<http://en.wikipedia.org/wiki/Aho-Corasick_algorithm>.
I don't know if there's a Python implementation yet.

Dennis Benzinger
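
To make the suggestion concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is not taken from any existing package; the function names build_automaton and find_first are made up for illustration, and the sketch assumes every pattern is a non-empty string. It scans the text once, whatever the number of patterns.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables for a list of non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state (a trie)
    fail = [0]    # fail[state] -> state of the longest proper suffix
    out = [None]  # out[state] -> a pattern ending at this state, if any
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(None)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] = pattern
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit a match that ends at the suffix state.
            if out[nxt] is None:
                out[nxt] = out[fail[nxt]]
    return goto, fail, out


def find_first(text, automaton):
    """Return the pattern whose match ends earliest in text, or None."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state] is not None:
            return out[state]
    return None


automaton = build_automaton(["he", "sh", "bla"])
print(find_first("blah", automaton))   # prints: bla
print(find_first("bling", automaton))  # prints: None
```

For a handful of short patterns the plain loop with 'in' is likely to be simpler and fast enough; the automaton mainly pays off when the set of patterns is large and reused.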

bearophileHUGS · Apr 3, 2008

Dennis Benzinger:

> You could use the Aho-Corasick algorithm
> <http://en.wikipedia.org/wiki/Aho-Corasick_algorithm>.
> I don't know if there's a Python implementation yet.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Bye,
bearophile

George Sakkis · Apr 3, 2008

On Apr 3, 12:37 pm, (e-mail address removed) wrote:

> > A different approach:
> >
> > >>> words = ["he", "sh", "bla"]
> > >>> name = "blah"
> > >>> True in (word in name for word in words)
> > True
> > >>> name = "bling"
> > >>> True in (word in name for word in words)
> > False
> >
> > Perhaps not as obvious or readable as Jeff's example, but is
> > essentially doing the same thing using generator syntax.
>
> That's pretty

It's even prettier in 2.5:

any(word in name for word in words)

George
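
A short sketch of how this reads in practice, reusing the words list from the earlier example. The next() call is an addition for the case where the matching word itself is wanted; note that the next() built-in only exists from Python 2.6 onward.

```python
words = ["he", "sh", "bla"]

# any() stops at the first true value, like the explicit loop with return.
print(any(word in "blah" for word in words))   # prints: True
print(any(word in "bling" for word in words))  # prints: False

# next() with a default retrieves the first matching word, or None.
print(next((word for word in words if word in "blah"), None))   # prints: bla
print(next((word for word in words if word in "bling"), None))  # prints: None
```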

Ant · Apr 3, 2008

On Apr 3, George Sakkis said:
> It's even prettier in 2.5:
>
> any(word in name for word in words)
>
> George

And arguably the most readable yet!

Pop User · Apr 3, 2008

> Dennis Benzinger:
>
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
>

http://nicolas.lehuen.com/download/pytst/ can do it as well.
