Help Optimizing Word Search

C

Case Nelson

Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

I'm using a trie structure to load and query my dictionary, which
returns a 1 if the word is found, 4 if a partial word is found, and 3
if there is no possible word.

I guess I'm wondering if anyone could think of a faster way to deal
with the wildcards, perhaps removing the inner for-loop or else
something entirely different (iterative, don't use the trie?).

Any other suggestions would be welcome

findword recursion runs in 9.16300010681 secs
words.log is simply my list of words:
['hats', 'easts', 'baizes',...,'sash']

--- Code Begin ---
PY> import trie
PY> import time
PY>
PY> print 'loading trie ...',
PY> mytrie = trie.Trie('dicts.txt')
PY> print 'done.'
PY>
PY> alpha = list('abcdefghijgklmnopqrstuvwxyz')
PY>
PY> words = {}
PY> def findword(word, letters):
PY> # run out of letters, recursion stop point
PY> if letters == []:
PY> return
PY>
PY> if word != "":
PY> # Check if word is valid
PY> result = mytrie.query(word)
PY>
PY> # Found a word, add it to the database
PY> if result == 1:
PY> words[word] = 1
PY>
PY> # No other word starts with my current word, recursion stop
point
PY> if result == 3:
PY> return
PY>
PY> # Run through every letter in our list
PY> for i,letter in enumerate(letters):
PY> # Remove current letter and recurse
PY> newletters = letters[:]
PY> newletters.pop(i)
PY>
PY> # if current letter is ? must recurse through all 26
letters
PY> if letter == '?':
PY> for c in alpha:
PY> # recurse word+wildcard letter
PY> findword(word+c,newletters)
PY> else:
PY> # recurse word+letter
PY> findword(word+letter,newletters)
PY>
PY> letters = list('abae?s?')
PY> s = time.time()
PY> findword("",letters)
PY> print time.time() - s
PY>
PY> output = open('words.log','w')
PY> print >> output, words.keys()
PY> output.close()
PY>
--- Code End ---
Thanks
(Hopefully google doesn't eat my whitespace)
 
J

John Machin

Case said:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway:

You appear to want to find all words that have one or more letters in
common with your query or candidate string.

Aside: Have you been following the long thread started by the poster
who appeared to want to store all possible strings that were _not_
words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit mask approach. You attach a bit mask to
each word; simple data structure -- a list of 2-tuples, or two parallel
lists.

!def mask(word):
! m = 0
! for letter in word:
! m |= 1 << (ord(letter) - ord('a'))
! return m

Searching without wildcards:

!def nowc_search(candidate, mylistof2tuples):
! candmask = mask(candidate) # treating candidate as str, not list
! for word, mask in mylistof2tuples:
! if mask & candmask:
! # one or more letters in common
! yield word

Note: this treats "mississippi" and "misp" the same. If "aa" is in your
dictionary, what queries would retrieve it? Depending on your exact
requirements, this technique may suit you, or you may want to use it as
a fast(?) filter, with the matches it throws up needing further
checking. You may need a "count number of bits that are set in an int"
function.

Ref: Fred J. Damerau, "A technique for computer detection and
correction of spelling errors", CACM vol 7 number 3, March 1961.

Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :)

An alternative for you to think about: instead of a bitmask, store the
letter-sorted transformation of the words: cat->act, act->act,
dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value =
list of all words that have that key.

A further suggestion which should always be considered when setting up
a search where the timing is worse than average O(1): have a separate
dictionary for each different wordlength, or some other
impossible-length-avoidance filter; that way, with minimum preliminary
calculation you can avoid considering words that are so long or so
short that they cannot possibly be matches. For example, with
approximate matching based on edit distance, if you are searching for a
10-letter word allowing for 2 errors, you can avoid doing the
complicated comparison on words shorter than 8 or longer than 12.
HTH,
John
 
P

Paul Rubin

Case Nelson said:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.

For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.
 
J

John Machin

Paul said:
significantly.

For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.

Paul, given the OP appears to want something like words that match any
(per)mutation of any substring of his query string -- and that's before
factoring in wildcards -- I'd like to see an example of a regexp that
could handle that.
Cheers,
John
 
M

Mike C. Fletcher

To search for a word which is a jumble of a given set of characters in a
(sizable) lexicon, see this posting:

http://blog.vrplumber.com/427

your alterations would be to check for length == to length -
number-of-wildcards (with the wildcards removed from the translation
table, of course) and then some tweaks to the "expensive" loop to allow
for up to wildcard-count ValueErrors. There's some (informal) analysis
in the comments of that post regarding why its a fairly good mechanism
for searching large sets of words.

HTH,
Mike

Case Nelson wrote:
....
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz
....

________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com
 
S

snoe

First of all thank you very much for the reply. I hope I'm not too
verbose here, just hope that if someone else runs into a similar
problem they can find an answer here.
This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway

Heh, it's been awhile since I took data structures 101 but I asked here
after reading about python generators, and although I can't quite get
my head around them, I have a sense that they could help me check all
mutations (is this the right word?) of the candidate strings.

Thanks for the ideas on finding matches, I will have to try the bit
filter out to see if its faster. The trie works similar to this in that
it will return a 3 if the substring is not in the dictionary so that
you don't have to check any superset of the mutation. Although to
clarify, the reason I posted was mostly for help on my weak wildcard
algorithm.
If "aa" is in your dictionary, what queries would retrieve it?

Any query that contains either "a" and "a" or "a" and "?" or "?" and
"?". So some examples would be:
"aa", "a?", "aba", "baa", "bbbaba", "b?b?", "bab?", "?abbbbb" etc...
Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :)

My apologies, in my haste my example of what query == "??" yields was
unclear. Every list of candidate letters should yield every unique (if
possible) mutation, of length 1 to len(candidate).

Heh, exactly what each of those strings would yield is pretty much the
root of my problem:

?? yields (the order of the results dont matter):
a, b, c, d, e, f, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb,
bc, ..., ya, yb, yc, ..., yx, yy, yz, za, zb, zc, ..., zx, zy, zz

a? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, ca, da, ...,
xa, ya, za

?a yields same as a?

ab? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ...,
bx, by, bz, ca, cb, da, db, ea, eb, ..., xa, xb, ya, yb, za, zb, aba,
abb, abc, abd, ..., abx, aby, abz, baa, bab, bac, ..., bax, bay, baz,
aab, acb, adb, ..., axb, ayb, azb, bba, bca, bda, ... bxa, bya, bza,
cba, dba, eba, xba, yba, zba

HTH, basically every possible string out of the given pool of up to 10
letters.

Ok writing these out just gave me an idea on how I could try and write
this another way.

Ok done. Wow, this program actually ran much much slower than my
original 135 seconds vs 9 seconds!! I'm not sure if I made a huge
mistake or if it's because in the original I can stop searching
superset-candidates if the candidate is not a substring of a dictionary
word (in the same way the mask would reject impossible mutations).

This produces the correct mutations however (although not unique as I
wrote out above):

-- begin code --
!def getmutations(pool):
!
! for i, letter in enumerate(pool):
! pool2 = pool[:]
! pool2.pop(i)
!
! for mutation in getmutations(pool2):
! if letter != '?':
! yield letter + mutation
! else:
! for c in alpha:
! yield c + mutation
!
! yield ""
!letters = list('abae?s?')
!s = time.time()
!output = open('words.log','w')
!for candidate in getmutations(letters):
! # Check if word is valid
! result = mytrie.query(candidate)
!
! # Found a word, add it to the database
! if result == 1:
! words[candidate] = 1
!
!print time.time() - s
--- end code ---
 
S

snoe

Mike,

Thanks, this worked great, and was 4x faster than my method!

Thanks everyone for replying!


The changes I made were:

!rest = ''.join([chr(i) for i in range(256) if chr(i).upper() not in
WORD])
!# a wildcard in the word means that we check all letters
!if '?' in WORD:
! rest = ''
!translator = string.maketrans('','')

and

! try:
! pool.remove( char )
! except ValueError:
! # nested try to remove possible wildcard
! try:
! pool.remove( '?' )
! except ValueError:
! return False
 
G

Ganesan R

Case" == Case Nelson said:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

You're trying to cheat at scrabble aren't you ;-). I once wrote a solution
using regular expressions, which I brushed up and posted here. Can you
check how it compares with your version?

Call as "wordsearch abcd". You can also use wildcards.

=====
..#!/usr/bin/python
..
..import sys
..import os
..import re
..
..wordlist = "/usr/share/dict/words"
..
..def get_freq(str):
.. """Create a hash of number of times a character occurs in string"""
.. freq = { }
.. for c in str:
.. freq[c] = freq.get(c, 0) + 1
.. return freq
..
..
..def main(argv = sys.argv):
.. chars = argv[1]
.. if argv[2]:
.. wordlist = argv[2]
.. wild = chars.count("?")
.. if wild:
.. chars = chars.replace("?", "") # remove the wild cards
.. if chars:
.. charpat = "[" + chars + "]*"
.. else:
.. charpat = ""
..
.. # create a regexp suitable for matching
.. pattern = charpat
.. for w in xrange(wild):
.. pattern += ".?" + charpat
.. pattern = "^" + pattern + "$"
.. print >> sys.stderr, pattern
..
.. # char count for our list of chars
.. charfreq = get_freq(chars)
..
.. pattern_re = re.compile(pattern, re.IGNORECASE)
.. # for line in os.popen("zegrep -i '%s' %s" % (pattern, wordlist)):
.. for line in open(wordlist):
.. line = line.rstrip()
.. if not pattern_re.search(line):
.. continue
.. if not line or len(line) > len(chars) + wild:
.. continue
.. line = line.lower()
.. linefreq = get_freq(line)
..
.. # check if chars occur fewer times than in the given string
.. extra = 0
.. for c in linefreq.keys():
.. if linefreq[c] > charfreq.get(c, 0):
.. extra += linefreq[c] - charfreq.get(c, 0)
.. if extra > wild:
.. break
.. else:
.. print line
..
..if __name__=='__main__':
.. main()

=====
 
W

Win

Hi Case.

Just in case you're really, truly looking for a fast Scrabble
word-search algorithm, the classic AI/CS article here is:

Andrew W. Appel and Guy J. Jacobson, The world's fastest Scrabble
program, Communications of the ACM, 31(5), pp 572--578, May 1988.

You can find a copy of this article at, for example:
http://counties.csc.calpoly.edu/~team12fk/F02/acm_worlds_fastest_scrabble_program.pdf

This algorithm uses a "dawg" (Directed Acyclic Word Graph). You can
find a well-documented example of this data structure implemented in
Python in the the WordUtils package:
http://cedar-solutions.com/software/wordutils/

Regards,

Win
 
S

Scott David Daniels

Paul said:
For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.

If these queries happen often, build a dictionary of sorted letters to
lists-of-words. The setup will be slow, but use of such a secondary
dictionary should be quite fast, with relatively easy dealing with ?s
by iterations.

--Scott David Daniels
(e-mail address removed)
 
S

snoe

With a list of letters: 'ABAE?S?' your implementation ran 3.5 times
faster than the one from http://blog.vrplumber.com/427 (in 0.437
seconds vs 1.515)

Without wildcards yours runs slightly quicker as well.

I guess with the wildcards, using an re as a quick filter against each
word, versus the translate method is much faster.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top