40 million levenshtein distances for two long strings

J

John

I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)


It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?
 
M

Michael Fellinger

I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)


It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

Take a look at the source of the Text gem, the algorithm for
Text::Levenshtein::distance is not too hard to read or long, maybe you
can just modify that to suit your needs?

^ manveru
 
E

Erik Veenstra

I have to do a lot of comparisons, about 40 million.

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

$ ll lev[12]
-rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require "ev/levenshtein"

s1 = File.read("lev1")
s2 = File.read("lev2")

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

----------------------------------------------------------------
 
A

Axel Etzold

-------- Original-Nachricht --------
Datum: Thu, 15 May 2008 20:24:42 +0900
Von: Erik Veenstra <[email protected]>
An: (e-mail address removed)
Betreff: Re: 40 million levenshtein distances for two long strings

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

Hi Erik,

Could you post your implementation here ?

Thank you,

Axel
gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

$ ll lev[12]
-rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require "ev/levenshtein"

s1 = File.read("lev1")
s2 = File.read("lev2")

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

----------------------------------------------------------------
 
J

John

....48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.
 
V

Vidar Hokstad

...48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.

The wikipedia article on Levenshtein distance has a reasonable C
implementation - using Ruby inline on it is fairly trivial.

I used the Wikipedia version from Ruby for some sanity checks on
results for my MSc thesis, but unfortunately I can't find my cleaned
up code right now (at work - I can look for it later today if you
like, and unless someone else has posted a full version by then)

Vidar
 
E

Erik Veenstra

I've requested a new RubyForge project. Until it's available,
I'll paste the code on PasteBin.

Comments and suggestions are welcome. To be honest, its' my
first and only C extensions 'till now...

I've never released a C extension as a GEM. Anybody willing to
help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

Ruby library: http://pastebin.com/m36427195
C extension: http://pastebin.com/m5bdf8ceb
extconf.rb: http://pastebin.com/m4c4c40b5
Unit Tests: http://pastebin.com/m6e956e5e

----------------------------------------------------------------
 
E

Erik Veenstra

I've released the Levenshtein module as a gem. The most
expensive part of the algorithm is written in C. But before
entering the most expensive part of the algorithm, it tries to
optimize as much as possible (in Ruby land).

There's a TAR.GZ archive and a GEM for Linux (tested), OS X
(untested) and Cygwin (tested). I wasn't able to compile a
version that worked with the One-Click Ruby Installer. I'm not
familiar with Windows, nor with its development environment.
Anyone willing to help?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

The Levenshtein distance is a metric for measuring the amount
of difference between two sequences (i.e., the so called edit
distance). The Levenshtein distance between two strings is
given by the minimum number of operations needed to transform
one string into the other, where an operation is an insertion,
deletion, or substitution of a single character.

More information about the Levenshtein distance algorithm:
http://en.wikipedia.org/wiki/Levenshtein_distance .

More information about the library:
http://www.erikveen.dds.nl/levenshtein/doc/index.html .

----------------------------------------------------------------
 
D

Dave Bass

John said:
I am trying to discover similar files to reduce redundancy on a large
project.

Edit distance isn't the only way to describe (dis)similarity, you know!
Moreover, what you're proposing is a really brute-force approach.

Defining similarity is not a trivial exercise. Two identical documents
should have a similarity of one, no matter how measured, but dissimilar
documents can have very different similarity figures depending on how
you measure them.

I suggest you look at the information retrieval literature (search
engines etc).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,585
Members
45,082
Latest member
KetonaraKetoACV

Latest Threads

Top