How to check if any item from a list of strings is in a big string?

inkhorn

Hi all,

For one of my projects, I came across the need to check if one of many
items from a list of strings could be found in a long string. I came
up with a pretty quick helper function to check this, but I just want
to find out if there's something a little more elegant than what I've
cooked up. The helper function follows:

def list_items_in_string(list_items, string):
    for item in list_items:
        if item in string:
            return True
    return False

So if you define a list x = ['Blah','Yadda','Hoohoo'] and a string y =
'Yip yip yippee Blah' and you run list_items_in_string(x, y), it
should return True.

Any ideas how to make that function look nicer? :)

Matt Dubins
 
Chris Rebert

any(substr in y for substr in x)

Note that any() was added in Python 2.5.

Cheers,
Chris
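
A minimal sketch of the one-liner above, using the x and y from the original post. The names found and first_match, and the use of next() to recover the matching item, are illustrative additions rather than part of the reply.

```python
x = ['Blah', 'Yadda', 'Hoohoo']
y = 'Yip yip yippee Blah'

# any() stops at the first substring that is found (it short-circuits),
# just like the early "return True" in the original loop.
found = any(substr in y for substr in x)

# Illustrative extra: next() with a default returns the first matching
# item, or None when nothing matches.
first_match = next((substr for substr in x if substr in y), None)

print(found, first_match)
```

The generator expression matters here: a list comprehension inside any() would test every item before any() saw the first result.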
 
Nobody

For one of my projects, I came across the need to check if one of many
items from a list of strings could be found in a long string.

If you need to match many strings or very long strings against the same
list of items, the following should (theoretically) be optimal:

import re

r = re.compile('|'.join(map(re.escape, list_items)))
...
result = r.search(string)
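
A rough, self-contained sketch of the same idea, assuming the items are plain literal strings. The helper name build_matcher and the sample data are hypothetical, chosen only for illustration.

```python
import re

def build_matcher(list_items):
    """Compile one pattern that matches any of the literal items."""
    # re.escape keeps characters such as '.' or '*' from being treated
    # as regex syntax; '|' joins the items into a single alternation.
    return re.compile('|'.join(map(re.escape, list_items)))

r = build_matcher(['Blah', 'Yadda', 'Hoohoo', 'a.b'])

hit = r.search('Yip yip yippee Blah')
miss = r.search('nothing to see here')
literal_dot = r.search('axb')  # 'a.b' was escaped, so 'axb' does not match

# The match object also says which item was found.
print(hit.group(0) if hit else None)
```

One caveat: an empty list produces the empty pattern, which matches every string, so the empty case probably needs separate handling.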
 
Steven D'Aprano

def list_items_in_string(list_items, string):
    for item in list_items:
        if item in string:
            return True
    return False
....
Any ideas how to make that function look nicer? :)

Change the names. Reverse the order of the arguments. Add a docstring.

Otherwise looks pretty nice to me. Simple, straightforward, and correct.

If you're running Python 2.5 or better, then this is even shorter (and
probably faster):

def contains(s, targets):
    """Return True if any item of targets is in string s."""
    return any(target in s for target in targets)
 
John Machin

If you need to match many strings or very long strings against the same
list of items, the following should (theoretically) be optimal:

        r = re.compile('|'.join(map(re.escape,list_items)))
        ...
        result = r.search(string)

"theoretically optimal" happens only if the search mechanism builds a
DFA or similar out of the list of strings. AFAIK Python's re module
doesn't.

Try this:
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
 
Paul Rubin

inkhorn said:
def list_items_in_string(list_items, string):
    for item in list_items:
        if item in string:
            return True
    return False

You could write that as (untested):

def list_items_in_string(list_items, string):
    return any(item in string for item in list_items)

but there are faster algorithms you could use if the list is large and
you want to do the test on lots of long strings, etc.
 
inkhorn

Thanks all!! I found the following to be most helpful: any(substr in
long_string for substr in list_of_strings)

This bang-for-your-buck is one of the many many reasons why I love
Python programming :)

Matt Dubins
 
denis

Matt, how many words are you looking for, and in how long a string?
Were you able to time any(substr in long_string for substr in list_items)
against re.compile("|".join(list_items))?
(REs are my method of choice, but different inputs of course give
different times; see google regex speed site:groups.google.com /
site:stackoverflow.com.)

cheers
-- denis
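
One way such a timing comparison might be set up, as a sketch only. The test data, the function names with_any and with_re, and the repeat count are all made up for illustration, and the results will vary with the inputs.

```python
import re
import timeit

list_items = ['Blah', 'Yadda', 'Hoohoo']         # made-up test data
long_string = 'Yip yip yippee ' * 1000 + 'Blah'  # match sits at the very end

# Compile once, outside the timed function, and escape the items so they
# are matched literally.
pattern = re.compile('|'.join(map(re.escape, list_items)))

def with_any():
    return any(substr in long_string for substr in list_items)

def with_re():
    return pattern.search(long_string) is not None

# Both approaches must agree before their timings are worth comparing.
assert with_any() == with_re()

for func in (with_any, with_re):
    seconds = timeit.timeit(func, number=200)
    print(func.__name__, round(seconds, 4))
```

Changing the number of items, the string length, or where the match falls can change which approach comes out ahead, so it is worth timing with data that resembles the real use.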
 
Gabriel Genellina

Matt, how many words are you looking for, in how long a string?
Were you able to time any( substr in long_string ) against
re.compile( "|".join( list_items )) ?

There is a known algorithm that solves specifically this problem
(Aho-Corasick). A good implementation should perform better than a
regular expression, and better than the generator expression, with the
added advantage of returning WHICH string matched.
There is a C extension somewhere implementing Aho-Corasick.
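
For readers who want to see the shape of the algorithm, here is an unoptimized pure-Python sketch of Aho-Corasick. It is not the C extension mentioned above; the function names build_automaton and find_first are hypothetical, and the patterns are assumed to be non-empty strings.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and outputs for the patterns."""
    goto = [{}]  # goto[state][char] -> next state
    fail = [0]   # fail[state] -> state for the longest proper suffix
    out = [[]]   # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: a node's failure link depends on shallower nodes.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_first(patterns, text):
    """Return (end_index, pattern) for the first match found, else None."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return i, out[state][0]
    return None

print(find_first(['Blah', 'Yadda', 'Hoohoo'], 'Yip yip yippee Blah'))
```

The text is scanned once regardless of how many patterns there are, and the result names the pattern that matched. When many strings are searched against the same list, the automaton would be built once and reused rather than rebuilt inside find_first.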
 
Nobody

There is a known algorithm to solve specifically this problem
(Aho-Corasick), a good implementation should perform better than R.E. (and
better than the gen.expr. with the advantage of returning WHICH string
matched)

Aho-Corasick has the advantage of being linear in the length of the
patterns, so the setup may be faster than re.compile(). The actual
searching won't necessarily be any faster (assuming optimal
implementations; I don't know how safe that assumption is).
 
inkhorn

Hi all,

This was more a question of programming aesthetics for me than one of
great practical significance. I was looking to perform a certain
function on files in a directory so long as those files weren't found
in certain standard directories. In other words, I was using os.walk()
to get multiple root directory strings, and the lists of files in
each directory. The function was to be performed on those files, so
long as certain terms weren't in the root directory string.
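
The use case described above might look roughly like this. It is a sketch under assumptions: the names files_to_process and excluded_terms, and the throwaway directory tree, are invented for illustration and are not taken from the actual project.

```python
import os
import tempfile

def files_to_process(top, excluded_terms):
    """Yield file paths under top whose root directory contains no excluded term."""
    for root, dirs, files in os.walk(top):
        if any(term in root for term in excluded_terms):
            continue  # skip the files in this directory
        for name in files:
            yield os.path.join(root, name)

# Build a tiny throwaway tree so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as top:
    for sub in ('keep', 'skip_me'):
        os.mkdir(os.path.join(top, sub))
        with open(os.path.join(top, sub, 'data.txt'), 'w') as handle:
            handle.write('x')
    selected = sorted(os.path.relpath(path, top)
                      for path in files_to_process(top, ['skip_me']))

print(selected)
```

Note that the terms are tested against the whole root string, so a term that happens to appear in a parent directory's name excludes everything beneath it.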

In actuality, I could have stuck with the helper function I created,
but I'm always curious to see how well multiple lines of code can turn
into fewer lines of code in Python and retain the same functional
value :)

Matt
 
Pablo Torres N.

Change the names. Reverse the order of the arguments. Add a docstring.

Why reverse the order of the arguments? Is there a design principle there?

I always make a mess out of the order of my arguments...
 
Steven D'Aprano

Why reverse the order of the arguments? Is there a design principle
there?

It's just a convention. Before strings had methods, you used the string
module, e.g.:

string.find(source, target)
=> find target in source

This became source.find(target).

In your function:

list_items_in_string(list_items, string)

"list_items" is equivalent to target, and "string" is equivalent to
source. It's conventional to write the source first.
 
