How to Write a Spelling Corrector

Brian Adkins · Apr 9, 2007

Peter Norvig wrote a simple spelling corrector in 20 lines of Python 2.5,
so I thought I'd see what it looks like in Ruby. I'm not too pleased with
my version, if anyone can make it more elegant, that would be great. Some
of the sticking points were:

1) List comprehensions in Python made the edits1 function more elegant
IMO. Hopefully someone can improve that function.

2) The boolean expression in the correct function evaluates empty
sets/arrays as false in Python but not in Ruby, so I had to add the
"result.empty? ? nil : result" expression to several functions. I expect
there's a better way to handle this also.

3) Minor point, but apparently Python has a built in constant for the set
of lower case characters "string.lowercase", so I just defined a constant.

Otherwise, the translation was pretty straightforward.

Here's a link to Norvig's page: http://www.norvig.com/spell-correct.html

That page includes a link to a text file that I saved locally as
holmes.txt: http://www.norvig.com/holmes.txt

Note: I wrapped a few of the longer lines for posting.

def words text
text.downcase.scan(/[a-z]+/)
end

def train features
model = Hash.new(1)
features.each {|f| model[f] += 1 }
return model
end

NWORDS = train(words(File.new('holmes.txt').read))
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1 word
n = word.length
deletion = (0...n).collect {|i| word[0...i]+word[i+1..-1] }
transposition = (0...n-1).collect {
|i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
alteration = []
n.times {|i| LETTERS.each_byte {
|l| alteration << word[0...i]+l.chr+word[i+1..-1] } }
insertion = []
(n+1).times {|i| LETTERS.each_byte {
|l| insertion << word[0...i]+l.chr+word[i..-1] } }
result = deletion + transposition + alteration + insertion
result.empty? ? nil : result
end

def known_edits2 word
result = []
edits1(word).each {|e1| edits1(e1).each {
|e2| result << e2 if NWORDS.has_key?(e2) }}
result.empty? ? nil : result
end

def known words
result = words.find_all {|w| NWORDS.has_key?(w) }
result.empty? ? nil : result
end

def correct word
(known([word]) or known(edits1(word)) or known_edits2(word) or
[word]).max {|a,b| NWORDS[a] <=> NWORDS }
end

Brian Adkins
http://lojic.com/blog/

Joel VanderWerf · Apr 9, 2007

Brian said:
3) Minor point, but apparently Python has a built in constant for the set
of lower case characters "string.lowercase", so I just defined a constant. ...
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

A minor comment: you can construct this string with

("a".."z").to_a.join
=> "abcdefghijklmnopqrstuvwxyz"

Robert Dober · Apr 10, 2007

A minor comment: you can construct this string with

("a".."z").to_a.join
=> "abcdefghijklmnopqrstuvwxyz"

Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.
The performance is about the same, the former seems faster for short
arrays/ranges ~ 500
and from about 1k elements the later gets slightly faster.
I feel that [*a..b] is the right/better/more logical construct to use
if one wants an array, or do I miss something?

Cheers
Robert

Brian Adkins · Apr 10, 2007

A minor comment: you can construct this string with

("a".."z").to_a.join
=> "abcdefghijklmnopqrstuvwxyz"

Click to expand...

Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.

Interesting. I didn't realize that would work with ranges. Here's what I
read in "Programming Ruby" p. 348:

"Following these parameters may be a single parameter prefixed with an
asterisk. If this parameter is an array, Ruby replaces it with zero or
more parameters corresponding to the elements of the array."

I guess to_a is called on the range implicitly prior to evaluating *.

The performance is about the same, the former seems faster for short
arrays/ranges ~ 500
and from about 1k elements the later gets slightly faster.
I feel that [*a..b] is the right/better/more logical construct to use
if one wants an array, or do I miss something?

Cheers
Robert

Robert Dober · Apr 10, 2007

Brian Adkins wrote:
3) Minor point, but apparently Python has a built in constant for the set
of lower case characters "string.lowercase", so I just defined a constant.
...
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

A minor comment: you can construct this string with

("a".."z").to_a.join
=> "abcdefghijklmnopqrstuvwxyz"

Click to expand...

Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.

Click to expand...

Interesting. I didn't realize that would work with ranges. Here's what I
read in "Programming Ruby" p. 348:

"Following these parameters may be a single parameter prefixed with an
asterisk. If this parameter is an array, Ruby replaces it with zero or
more parameters corresponding to the elements of the array."

I guess to_a is called on the range implicitly prior to evaluating *.

Hmm I guessed rather not, but I am wrong:

429/5 > cat ast.rb
# vim: sts=2 sw=2 tw=0 expandtab nu:

a = [*1..2]
robert@PC:~/log/ruby/tests/theory 18:09:04
430/6 > ruby -rprofile ast.rb
% cumulative self self total
time seconds seconds calls ms/call ms/call name
0.00 0.00 0.00 1 0.00 0.00 Enumerable.to_a
0.00 0.00 0.00 1 0.00 0.00 Range#each
0.00 0.01 0.00 1 0.00 10.00 #toplevel

Good thinking there Brian !!
Cheers
Robert

The performance is about the same, the former seems faster for short
arrays/ranges ~ 500
and from about 1k elements the later gets slightly faster.
I feel that [*a..b] is the right/better/more logical construct to use
if one wants an array, or do I miss something? Well I did as shown above, but I still rather go for the shorter

Cheers
Robert

Click to expand...

William James · Apr 11, 2007

Brian said:
Peter Norvig wrote a simple spelling corrector in 20 lines of Python 2.5,
so I thought I'd see what it looks like in Ruby. I'm not too pleased with
my version, if anyone can make it more elegant, that would be great. Some
of the sticking points were:

1) List comprehensions in Python made the edits1 function more elegant
IMO. Hopefully someone can improve that function.

2) The boolean expression in the correct function evaluates empty
sets/arrays as false in Python but not in Ruby, so I had to add the
"result.empty? ? nil : result" expression to several functions. I expect
there's a better way to handle this also.

3) Minor point, but apparently Python has a built in constant for the set
of lower case characters "string.lowercase", so I just defined a constant.

Otherwise, the translation was pretty straightforward.

Here's a link to Norvig's page: http://www.norvig.com/spell-correct.html

That page includes a link to a text file that I saved locally as
holmes.txt: http://www.norvig.com/holmes.txt

Note: I wrapped a few of the longer lines for posting.

def words text
text.downcase.scan(/[a-z]+/)
end

def train features
model = Hash.new(1)
features.each {|f| model[f] += 1 }
return model
end

NWORDS = train(words(File.new('holmes.txt').read))
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1 word
n = word.length
deletion = (0...n).collect {|i| word[0...i]+word[i+1..-1] }
transposition = (0...n-1).collect {
|i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
alteration = []
n.times {|i| LETTERS.each_byte {
|l| alteration << word[0...i]+l.chr+word[i+1..-1] } }
insertion = []
(n+1).times {|i| LETTERS.each_byte {
|l| insertion << word[0...i]+l.chr+word[i..-1] } }
result = deletion + transposition + alteration + insertion
result.empty? ? nil : result
end

Letters = ('a'..'z').to_a

class String
def replace( which, what )
self[0...which] + what + self[which+1 .. -1]
end
end

def edits1 word
n = word.size
return nil if n.zero?
deletion = (0...n).map{|pos| word.replace(pos, "") }
transposition = (0...n-1).map{|i|
word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
alteration = (0 ... n).to_a.map{|pos|
Letters.map{|ch| word.replace(pos, ch) }}.flatten
insertion = (0 .. n).to_a.map{|pos|
Letters.map{|ch| word[0...pos] + ch + word[pos..-1] }}.flatten
deletion + transposition + alteration + insertion
end

Phrogz · Apr 11, 2007

class String
def replace( which, what )
self[0...which] + what + self[which+1 .. -1]
end
end

Cleaner, IMHO:

class String
def replace( where, what )
s2 = dup
s2[ where ] = what
s2
end
end

Drew Olson · Apr 11, 2007

This may be a bit OT, but I am really struck as to how elegantly both of
these languages (and the programmers as well) handle this seemingly
difficult problem. I've been working with ruby mainly for fun over the
past 8-9 months and seeing this solution (in Python and ruby) really
heightens my respect for the elegance and expressiveness of both
languages.

</rave>

Florian Gross · Apr 12, 2007

I am so glad I find this topic here! I was on the plane yesterday, and

just like Peter Norwig I managed to write a spelling corrector. An
obvious difference is that he started from nothing, while I started
from his code... Anyway, I did it in Ruby, it works, but... it's dead
slow. If I need to compute the second array (the one with 2 mistakes),
it takes about 1 second per suggestion.

I avoided doing two-character edits. Instead I use Text::Metaphone and
mutate that string and Text::Levenshtein for filtering by distance. It
seems to work fairly well. (You will need the text gem to try this
out.)

The code (together with the holmes.txt as a sample for training it) is
available at http://flgr.0x42.net/code/spell_correct/

Brian Adkins · Apr 12, 2007

I am so glad I find this topic here! I was on the plane yesterday, and
just like Peter Norwig I managed to write a spelling corrector. An
obvious difference is that he started from nothing, while I started
from his code... Anyway, I did it in Ruby, it works, but... it's dead
slow. If I need to compute the second array (the one with 2 mistakes),
it takes about 1 second per suggestion.

My edit1 function is like this:
----------------------------
def edits1(word)
n = word.length

return ( (0..n-1).collect {|i| word[0,i] + word[i+1, n-1]} + #deletion
(0..n-2).collect {|i| word[0,i] + word[i+1,1] +word[i,1] +
word[i+2, n-1]} + #transposition
('a'..'z').collect {|c| (0..n-1).collect { |i| word[0,i] + c +
word[i+1, n-1]} } + #alteration
('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
word[i, n]} } ).flatten #insertion

end

You may want to compare to my original at the start of this thread. I
don't have your code, so I can't benchmark it, but I have no noticeable
lag when I correct a word. It may have something to do with you having two
ranges, two collects and a flatten just to get the insertion set. Compare

('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
word[i, n]} } ).flatten

to:

(n+1).times {|i| LETTERS.each_byte {
|l| insertion << word[0...i]+l.chr+word[i..-1] } }

Giles Bowkett · Apr 15, 2007

from his code... Anyway, I did it in Ruby, it works, but... it's dead

You may want to compare to my original at the start of this thread. I
don't have your code, so I can't benchmark it, but I have no noticeable
lag when I correct a word. It may have something to do with you having two
ranges, two collects and a flatten just to get the insertion set. Compare

I haven't taken the time to put this together as code -- sorry -- but
just off the top of my head, in Norvig's post, the Python version, he
uses set() a lot. My Python's rusty - well, that sounds a bit obscene,
but what I mean is, my skill in Python is imperfect - anyway, point
is, I'm not **certain**, but I think to get the Python set() in Ruby
you need a gem. Using arrays instead of sets could be the source of
the bottleneck.

Hang on, I just checked the docs (the docs being Google) and you don't
need a gem, it's in the standard library. All you need to do is:

require 'set'

and then either

xyz = Set.new

or

xyz = %w{x y z}.to_set

(etc.)

at which point you'll probably get the benefit of Norvig's design, and
things will go faster.

SpellChecker	3	May 20, 2009
Code Explaination: Spelling correction code	4	Apr 11, 2007
Need Help	4	May 24, 2009
Genetic algoritm generating the text	0	Aug 18, 2023
How to parse this line of code manually	4	Aug 27, 2007
Hello I am learning how to code and I tried making a calculator with HTML and js with some CSS I am stuck at thing, Like the screen value is	0	Mar 13, 2025
Chatbot	0	Oct 8, 2024
How to use Densenet121 in monai	0	Feb 16, 2024

How to Write a Spelling Corrector

Brian Adkins

Joel VanderWerf

Robert Dober

Brian Adkins

Robert Dober

William James

Phrogz

Drew Olson

Florian Gross

Brian Adkins

Giles Bowkett

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads