T
Travers Naran
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:
for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c
I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.
So, am I doing this the optimal way or am I overlooking something _really_
simple to do?
Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:
for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c
I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.
So, am I doing this the optimal way or am I overlooking something _really_
simple to do?
Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.