Counting occurences of words in a list of strings

Travers Naran · May 25, 2005

Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

Robert Kern · May 25, 2005

Travers said:
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

Find a character that doesn't exist in any of self.dict.keys(). '\x00'
will probably do.

whole = '\x00'.join(tc.strings)
for key in self.dict.keys():
self.dict[key]['count'] = whole.count(key)

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

John Machin · May 25, 2005

Travers said:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.

Are you comparing BMH implemented in Python with str.count() implemented
in C?

So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict_count[query_key] = big_string_count(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).

3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John

Travers Naran · May 25, 2005

John said:
Are you comparing BMH implemented in Python with str.count() implemented
in C?

Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().

1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).

Nifty! Thanks muchly.

3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.

John Machin · May 25, 2005

Travers said:
I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.

Do you mean NIH syndrome? Sorry, I should have been clearer, like "if
you want faster, you will have to roll your own; start with ...". The
Aho-Corasick algorithm is about 30 years old. Navarro is part of, and
summarises the rest of, the state of the art.

Cheers,
John

Is there a way to add strings to a list without the quotation marks in C++?	1	Nov 9, 2020
Add a list of videos each one in a different button in a web page	1	Dec 10, 2022
split lines from stdin into a list of unicode strings	0	Aug 28, 2013
Searching for a list of strings in a file with Python	3	Oct 14, 2013
How does a HEAD pointer end up pointing to the first node in a linked list?	3	Jan 24, 2023
Trying to modify a list in Python, receiving "variable not defined" error	3	Feb 6, 2023
How to try a range of hex values in C# code ?	0	Nov 19, 2022
matching strings in a large set of strings	13	Apr 29, 2010

Counting occurences of words in a list of strings

Travers Naran

Robert Kern

John Machin

Travers Naran

John Machin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads