[SUMMARY] Shirt Reader (#140)

R

Ruby Quiz

This quiz is surprisingly challenging. Our brain is capable of making some
pretty complex associations between sounds to make sense of these puzzles quite
quickly. To code that up is a non-trivial task.

All of the solutions used some variant of the Metaphone algorithm to match words
with similar sounds. The solution from steve d is probably the most complete.
It pulls from a pronunciation dictionary, uses the afore mentioned Metaphone
algorithm, refines matches using a Levenshtein edit distance algorithm, and even
inlines some C for speed. I do recommend everyone have a look at that code.

I'm going to go through one of Eugene Kalenkovich's solutions in this summary
though. It gets pretty good answers in the spot checking I did, it's
performance isn't too bad, and it's a bit smaller than steve's code. It starts
by pulling in some external resources:

require 'rubygems'
require 'text'
include Text::Metaphone
include Text::Levenshtein
load 'expectations.rb'

# ...

Most of this code is about loading the Text gem. For those not familiar with
it, Text includes a slew of algorithms useful in processing text. This code
will use the Metaphone and the improved Double Metaphone algorithms to compare
words by sound. It also uses the Levenshtein distance algorithm to see how far
apart two possible words are.

The expectations.rb file just fills a global Hash with some test cases.

Another thing most of the solutions did was to perform some translation of the
provided words to make them more likely to match sounds. One particular problem
point seemed to be "ee" sounds at the end of words, typically ending in y.
Performing some phonetic substitutions can increase the matching accuracy in
these cases. Here's Eugene's code for that:

# ...

subs={'1'=>'wan','2'=>'to','3'=>'tre','4'=>'for','5'=>'five','6'=>'six',
'7'=>'seven','8'=>'ate','9'=>'nine','10'=>'ten',
'c'=>'see','h'=>'eich','j'=>'jey','k'=>'key','q'=>'que','r'=>'ar'}
subsy={}
%w[b c d g p t v z].each {|l| subsy[l]=l+'y'}
%w[b c d g p t v z].each {|l| subs[l]=l+'ee'}
%w[f l m n s x].each{|l| subs[l]='e'+l}

# ...

Here we see Hashes constructed to provide the matching of numbers and improve
the matching of certain letters. Note that these are just used to match
standalone chunks of the input. Substitutions will not be made inside words.

Next we have some wrappers over the algorithms provided in the Text gem:

# ...

def metadist(str1,str2)
2*distance(metaphone(str1),metaphone(str2))+
distance(str1,str2)
end

def short_double_metaphone(word)
m1,m2=double_metaphone(word)
[m1[0,2],m2 ? m2[0,2] : nil]
end

# ...

The first method is an enhanced distance function for checking how similar two
words are. It works by applying the sound algorithm to each words then building
and edit distance between those and adding that to the normal edit distance for
the words. The sound distance is weighted double in the overall result.

The second wrapper is for building a word index using a Double Metaphone
algorithm. It returns shortened versions of the provides sounds for use as Hash
keys. The code that builds that Hash is what we want to look at next:

# ...

hash=Hash.new{|h,k|h[k]=[]}

File.open("/usr/share/dict/words") {|f| f.readlines}.each do |w|
word=w.downcase.delete("^a-z")
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
end
$expectations.values.each { |word|
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
}

hash.each_key{|k| hash[k].uniq!}

# ...

This code just loops over the built-in dictionary (on Unix), building sound keys
for each word and populating the Hash. Note that the File loop can be
simplified to File.foreach(...) { |w| ... }.

One point of interest here is that the code loops over test cases adding the
solution words to the dictionary. This raises a good point: the quality of the
dictionary will have a big impact on your results. The built-in dictionary is
convenient to test with, but you probably want to trade up to a better word list
when quality really counts.

The input handling code is trivial:

# ...

inputs=[]
if (ARGV.empty?)
inputs=$expectations.keys
else
inputs << ARGV
end

# ...

This just supports the quiz interface, or defaults to running the included test
cases.

All we have left is the application code:

inputs.each { |rebus|
y_ed=rebus[0..-2]<<(subsy[rebus[-1]] || rebus[-1])
word=y_ed.map{|w| subs[w] || w }.join.downcase.gsub(/[^a-z0-9]/,'')
m1,m2=short_double_metaphone(word)
results=hash[m1]
results+=hash[m2] if m2 && m2!=m1
res=results.uniq.sort_by{|a| [metadist(word,a),a.length]}.first(5)
print "'#{rebus.join(' ')}' => #{res[0]}"
expected=$expectations[rebus]
print ", expected '#{expected}' is at position #{res.index(expected)}" \
if expected
puts
}

The process here isn't too hard to follow. The inputs are combined into a list,
with the last element undergoing the "ee" sound substitutions we discussed
earlier. After that, all elements go through the general letter and number
substitutions and get combined into a word.

The sound Hash keys are pulled for the resulting word and used to lookup a
result set of possible matches. That result set is then ordered by the enhanced
distance function and and pruned to the top five matches.

The rest of the code just handles the printing.

My thanks to all who managed to get surprisingly accurate results for a tough
problem. As always, you've taught me new tricks.

It's pretty likely that tomorrow's problem will center around probability...
 
R

Robert Dober

It's pretty likely that tomorrow's problem will center around probability...
Nice 1
Robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top