[SUMMARY] DictionaryMatcher (#103)

R

Ruby Quiz

As stated in the quiz, this question comes up quite a bit. There are a few ways
to address including a fairly simple approach. Let's examine Jamie Macey's
rather short solution for an example of that:

class DictionaryMatcher < Array
alias_method :===, :=~
alias_method :match, :=~

def initialize(default = [], options = nil)
super(default)

unless options.nil? or options.is_a? Fixnum
options = Regexp::IGNORECASE
end
@regexp_options = options
end

def =~(string)
self.map{|e| Regexp.new(e, @regexp_options) =~ string }.compact.min
end
end

Jamie's idea is easy enough to follow: DictionaryMatcher is just an Array of
expressions and we can hit the String with them one at a time to find a match.
This code even has an advantage over many of the solutions in that the
individual expressions themselves can be arbitrarily complex (full regular
expressions).

The big downside to this approach though is speed. Unfortunately, it's a pretty
big downside in this case because if you had been dealing with a small number of
matches to begin with, you probably wouldn't have needed a DictionaryMatcher.
The main reason performance is bad is that all expressions must be tested, to
find the match that occurs first in the String. All that context shifting in
and out of the regular expression engine just takes time.

If we're going to get around that, we need a clever way to store the terms we
are looking for and a custom match process that takes advantage of that data
structure.

The structure used by the majority of the solutions to store the words is called
a trie or prefix tree. Louis J Scoras wrote a solution using Trie objects, and
even taught them how to display themselves. This might help you see how this
works. Have a look at this example (with corrected indentation):
t = Trie.new =>
t["cat"] = true => true
t["cab"] = true => true
t["cate"] = true => true
t["bob"] = true => true
t
=> b =>
o =>
b =>
value: true
c =>
a =>
b =>
value: true
t =>
value: true
e =>
value: true

This structure is similar to a Hash, in that it has keys and values. (Values
are only used to indicate word boundaries, so we will leave those as true and
focus on the keys.) Where a Trie differs from a Hash is how it stores the keys.

The key "cat" isn't stored as one complete String, instead it is stored as a
separate "c", "a", and "t". Those individual letters are nested beneath each
other in the structure to indicate order. "c" comes before the "a" beneath it,
which comes before the "t" beneath it.

Now, to add "cab" or even "cate" into the structure just involves adding in the
new letters at the right depth. "bob", on the other hand, begins a new tree,
since it starts with a different letter.

This structure can be used to make rather efficient large matches. Where a
Regexp has to ask does "cat", "cab", or "cate" match here, the tree version just
checks to see if the "c" matches here. When it does, more letters still need to
be checked, but when it doesn't we instantly know that all "c" words are no good
and we can move on.

To see how that comes together, let's examine some of Ross Bamford's solution:

class DictionaryMatcher
# ...

def initialize(*words)
@pt = {}
words.each { |word| self << word }
end

def <<(word)
# small memory optimization - if there's a longer word that shares
# this prefix, we can discard it since we'll only ever take the
# shortest match anyway.
word.split('').inject(@pt) do |pt, chr|
pt[chr] ||= {}
end.clear[:__WORD__] = true

self
end

# ...

def match(str, start_ofs = 0)
start_ofs.upto(str.length) do |i|
word = ""
next_pt = @pt
si = i
while next_pt = next_pt[chr = str[i,1]]
word << chr
return MatchData.new(si, i, word, str) if next_pt[:__WORD__]
i+=1
end
end

nil
end

def =~(str)
m = match(str) and m.start_offset
end

# ...
end

I've trimmed a lot of code here, but what I've shown is the heart of the
matching algorithm. Ross uses a simple set of nested Hashes to build his prefix
tree (@pt) in initialize().

Words are inserted into the tree in the <<() method. It splits the word into
letters, and walks the nested Hashes inserting each letter. When it reaches the
end of the word, any further nesting is cleared (an optimization explained in
the comment) and the special :__WORD__ marker is inserted to indicate a word
boundary.

The rest of the magic is in the match() method. It works much as I explained
before. It walks the String index by index. If a character is found in the
prefix tree, it tries to find a run of matches by walking the tree forward (the
while loop). When it makes it all the way to a word boundary marker, it
declares victory by returning Ross's custom MatchData object (not shown). If
the tree walk fails, the code advances to the next index and when the indices
are exhausted, nil is tossed to indicate that no match could be found.

The =~() method also uses match(), but changes the return value to mimic Ruby.

There's certainly more to see in the solutions and I hope this gives you enough
of a map to encourage your own spelunking expidition into the code. A big thank
you to all of the programmers who helped reduce this FAQ to a link we can pass
on to those who ask.

Sharpen your turtles folks, because it's pretty picture drawing time with
tomorrow's Ruby Quiz...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top