Counting occurences of words in a list of strings

T

Travers Naran

Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
 
R

Robert Kern

Travers said:
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

Find a character that doesn't exist in any of self.dict.keys(). '\x00'
will probably do.

whole = '\x00'.join(tc.strings)
for key in self.dict.keys():
self.dict[key]['count'] = whole.count(key)

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
J

John Machin

Travers said:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.

Are you comparing BMH implemented in Python with str.count() implemented
in C?
So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict_count[query_key] = big_string_count(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).

3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John
 
T

Travers Naran

John said:
Are you comparing BMH implemented in Python with str.count() implemented
in C?

Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().
1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).

Nifty! Thanks muchly.
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
 
J

John Machin

Travers said:
I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.

Do you mean NIH syndrome? Sorry, I should have been clearer, like "if
you want faster, you will have to roll your own; start with ...". The
Aho-Corasick algorithm is about 30 years old. Navarro is part of, and
summarises the rest of, the state of the art.

Cheers,
John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top