40 million levenshtein distances for two long strings

John · May 15, 2008

I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

Michael Fellinger · May 15, 2008

I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

Take a look at the source of the Text gem, the algorithm for
Text::Levenshtein::distance is not too hard to read or long, maybe you
can just modify that to suit your needs?

^ manveru

ara.t.howard · May 15, 2008

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value

diff file|head -$max

??

a @ http://codeforpeople.com/

Erik Veenstra · May 15, 2008

I have to do a lot of comparisons, about 40 million.

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

$ ll lev[12]
-rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require "ev/levenshtein"

s1 = File.read("lev1")
s2 = File.read("lev2")

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

----------------------------------------------------------------

Axel Etzold · May 15, 2008

-------- Original-Nachricht --------

Datum: Thu, 15 May 2008 20:24:42 +0900
Von: Erik Veenstra <[email protected]>
An: (e-mail address removed)
Betreff: Re: 40 million levenshtein distances for two long strings

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

Hi Erik,

Could you post your implementation here ?

Thank you,

Axel

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

$ ll lev[12]
-rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require "ev/levenshtein"

s1 = File.read("lev1")
s2 = File.read("lev2")

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

----------------------------------------------------------------

John · May 15, 2008

....48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.

Vidar Hokstad · May 16, 2008

...48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.

The wikipedia article on Levenshtein distance has a reasonable C
implementation - using Ruby inline on it is fairly trivial.

I used the Wikipedia version from Ruby for some sanity checks on
results for my MSc thesis, but unfortunately I can't find my cleaned
up code right now (at work - I can look for it later today if you
like, and unless someone else has posted a full version by then)

Vidar

Erik Veenstra · May 16, 2008

I've requested a new RubyForge project. Until it's available,
I'll paste the code on PasteBin.

Comments and suggestions are welcome. To be honest, its' my
first and only C extensions 'till now...

I've never released a C extension as a GEM. Anybody willing to
help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

Ruby library: http://pastebin.com/m36427195
C extension: http://pastebin.com/m5bdf8ceb
extconf.rb: http://pastebin.com/m4c4c40b5
Unit Tests: http://pastebin.com/m6e956e5e

----------------------------------------------------------------

Erik Veenstra · May 16, 2008

The Ruby version of Levenshtein#distance didn't pass the
tests... :}

Here's a new version.

gegroet,
Erik V. - http://www.erikveen.dds.nl/

DJ Jazzy Linefeed · May 16, 2008

Let us know when your project is approved! Good work sir.

Erik Veenstra · May 24, 2008

I've released the Levenshtein module as a gem. The most
expensive part of the algorithm is written in C. But before
entering the most expensive part of the algorithm, it tries to
optimize as much as possible (in Ruby land).

There's a TAR.GZ archive and a GEM for Linux (tested), OS X
(untested) and Cygwin (tested). I wasn't able to compile a
version that worked with the One-Click Ruby Installer. I'm not
familiar with Windows, nor with its development environment.
Anyone willing to help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

The Levenshtein distance is a metric for measuring the amount
of difference between two sequences (i.e., the so called edit
distance). The Levenshtein distance between two strings is
given by the minimum number of operations needed to transform
one string into the other, where an operation is an insertion,
deletion, or substitution of a single character.

More information about the Levenshtein distance algorithm:
http://en.wikipedia.org/wiki/Levenshtein_distance .

More information about the library:
http://www.erikveen.dds.nl/levenshtein/doc/index.html .

----------------------------------------------------------------

Dave Bass · May 25, 2008

John said:
I am trying to discover similar files to reduce redundancy on a large
project.

Edit distance isn't the only way to describe (dis)similarity, you know!
Moreover, what you're proposing is a really brute-force approach.

Defining similarity is not a trivial exercise. Two identical documents
should have a similarity of one, no matter how measured, but dissimilar
documents can have very different similarity figures depending on how
you measure them.

I suggest you look at the information retrieval literature (search
engines etc).

Shallow copy and long long double on gcc 3.4.5	6	Sep 18, 2009
Any working translation framework available for Ruby	2	May 13, 2011
[ANN] main-4.0.0 (for avdi)	0	Oct 21, 2009
print header for output	0	Jun 19, 2011
ANN main-4.4.0	0	Nov 25, 2010
For Peer Review	1	Apr 2, 2010
Better crypto hash functions, long, with code	2	Aug 26, 2005
How can you make idle processors pick up java work?	2	Jul 31, 2012

40 million levenshtein distances for two long strings

John

Michael Fellinger

ara.t.howard

Erik Veenstra

Axel Etzold

John

Vidar Hokstad

Erik Veenstra

Erik Veenstra

DJ Jazzy Linefeed

Erik Veenstra

Dave Bass

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads