Damerau-Levenshtein_distance

Jeremy · May 18, 2007

I wrote the method below by copying the algorithm from
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance (and matrix is
a really simple 2d array implementation). But the problem is that it
slows wat down as the string size gets bigger. At string length of about
150 it takes 1s, at 500 10s. Is there any way to recode this to get
better performance without rewriting it in C, and would rewrting it in C
even help or is this just a slow algorithm?

def self.distance(string1, string2)
string1 = string1.unpack('C*')
string2 = string2.unpack('C*')
s1n = string1.length
s2n = string2.length

m = DLDiff::Matrix.new(s1n+1, s2n+1)
cost = 0

(0..s1n).each {|i| m[i,0] = i}
(1..s2n).each {|j| m[0,j] = j}

(1..s1n).each do |i|
(1..s2n).each do |j|
cost = string1 == string2[j] ? 0 : 1
m[i, j] = [ m[i-1, j] + 1,
m[i, j-1] + 1,
m[i-1, j-1] + cost ].min
m[i, j] = [ m[i,j], m[i-2,j-2] + cost].min if(i > 1 && j > 1 &&
string1 == string2[j-1] && string1[i-1] == string2[j])
end
end

m[s1n, s2n]
end

class Matrix
def initialize(columns, rows)
ac = Array.new(columns, 0)
@am = Array.new(rows, 0)
@am = @am.collect{|r| ac.dup}
end

def [](c, r)
@am[r][c]
end

def []=(c,r,value)
@am[r][c] = value
end

def inspect
@am.collect{|a| a.inspect}.join("\n")
end

def to_s
@am.to_s
end
end

Daniel Martin · May 18, 2007

Jeremy said:
I wrote the method below by copying the algorithm from
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance (and matrix
is a really simple 2d array implementation). But the problem is that
it slows wat down as the string size gets bigger. At string length of
about 150 it takes 1s, at 500 10s. Is there any way to recode this to
get better performance without rewriting it in C, and would rewrting
it in C even help or is this just a slow algorithm?

Well, the algorithm itself is O(n*m), where n and m are the size of
the strings involved, so on large strings it's going to get slow.

I was able to shave about 40% off the time for your method, and fix a
bug.

The bug was caused because the wikipedia article indexes the strings
starting from 1, but indexes the array starting form 0. In ruby both
start at 0, of course. To fix this, I added a fake element at the
start of both strings.

Anyway, here's my slightly faster version:

def self.distance(string1, string2)
string1 = string1.unpack('C*')
string2 = string2.unpack('C*')
s1n = string1.length
s2n = string2.length
string1.unshift -1
string2.unshift -1
m = Array.new(s1n+1){Array.new(s2n+1,0)}
cost = 0
a = nil; b = nil
(0..s1n).each {|i| m[0] = i}
(1..s2n).each {|j| m[0][j] = j}
(1..s1n).each do |i|
(1..s2n).each do |j|
cost = string1 == string2[j] ? 0 : 1

a = m[i-1][j] + 1
a = b if ((b = m[j-1]+1) < a)
a = b if ((b = m[i-1][j-1]+cost) < a)

if(i > 1 && j > 1 &&
string1 == string2[j-1] &&
string1[i-1] == string2[j]) then
a = b if ((b = m[i-2][j-2] + 1) < a)
end
m[j] = a
end
end
m[s1n][s2n]
end

Jeremy · May 18, 2007

Daniel said:
Well, the algorithm itself is O(n*m), where n and m are the size of
the strings involved, so on large strings it's going to get slow.

I was able to shave about 40% off the time for your method, and fix a
bug.

Thanks, thats really useful. I guess that now the data i'm comparing
has grown it would be better to use a diff type method than this.

Java matrix problem	3	Sep 10, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
Python Gurobi Optimizing Cost has no errors but I get no sensible solution	0	Aug 30, 2022
Boomer trying to learn coding in C and C++	6	Dec 16, 2022
Need help with this Python code.	2	Jun 13, 2023
Python code problem	2	Apr 23, 2023
levenshtein function issue	6	Sep 24, 2010
Difference between using "let" in a "for" loop	0	Jul 3, 2022

Damerau-Levenshtein_distance

Jeremy

Daniel Martin

Jeremy

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads