edit distance algorithms

R

Rogan Dawes

Hi folks,

I am trying to implement a way of showing similar documents (html pages,
mostly), and sorting them according to the degree of similarity.

The easiest way to do this is to calculate an Edit Distance (i.e. the
minimum number of changes required to transform one document to another)
from the target document.

However, the naive Levenshtein algorithm is O(n^2), and when you are
comparing 2 10k documents, it can take quite a while (9 or so seconds on
my Thinkpad T30 per comparison!)

Does anyone know more computationally efficient comparison algorithms?

Many thanks!

Rogan
 
I

Ingo R. Homann

Hi,

Rogan said:
[Levenshtein algorithm]

Does anyone know more computationally efficient comparison algorithms?

You may modify the Levenshtein-Distance in that way that it uses words
(sentences/paragraphs) instead of characters. "diff" compares lines, for
example.

Ciao,
Ingo
 
R

Rogan Dawes

Ingo said:
Hi,

Rogan said:
[Levenshtein algorithm]

Does anyone know more computationally efficient comparison algorithms?


You may modify the Levenshtein-Distance in that way that it uses words
(sentences/paragraphs) instead of characters. "diff" compares lines, for
example.

Ciao,
Ingo

Thanks for the suggestion. I'll have to see how easy that is to implement.

Rogan
 
L

Lucy

Rogan Dawes said:
Ingo said:
Hi,

Rogan said:
[Levenshtein algorithm]

Does anyone know more computationally efficient comparison algorithms?


You may modify the Levenshtein-Distance in that way that it uses words
(sentences/paragraphs) instead of characters. "diff" compares lines, for
example.

Ciao,
Ingo

Thanks for the suggestion. I'll have to see how easy that is to implement.

Rogan

diff x y | wc -l
 
H

HK

Rogan said:
I am trying to implement a way of showing similar documents (html pages,
mostly), and sorting them according to the degree of similarity.

Count the number of common trigrams, possibly not character based
but (normalized) word based. It might help to weight
trigrams according to frequency or according to the
frequencies of the words contained.

Never tried this for whole documents, but
for words or multi word terms trigram
similarity seems to be not too bad.

Harald.
 
R

Robert Maas, see http://tinyurl.com/uh3t

From: Rogan Dawes said:
I am trying to implement a way of showing similar documents (html
pages, mostly), and sorting them according to the degree of similarity.

Would you be willing to do the opposite, namely measure the quantity of
*difference* instead of a measure of similarity?
The easiest way to do this is to calculate an Edit Distance (i.e. the
minimum number of changes required to transform one document to
another) from the target document.

Note: Now you're doing the opposite of what you originally said, now
you're computing a measure if *difference* instead of similiarity. So
now despite what you originally said, you're taking my preferred
overall methodology. Within that generic, more specifically have you
considered a ProxHash as a proxy for Edit Distance? Basically for each
document you compile statistics of trigrams or whatever you think will
make a good proxy for position in abstract information space, then you
feed that into a ProxHash which reduces the unbounded-dimension
trigram-or-whatever point to a point in a fixed dimension normalized
linear space with Euclidean metric easily and quickly computable. At
that point with document -> trigramStatistics -> euclideanVector
accomplished, performing clustering on such vectors is much faster than
directly performing clustering on the original documents, because
calculating Euclidean distance (sqrt of sum of squares of
per-coordinate differences) is much faster than directly computing Edit
Distance between two original documents. Once you have the proxhashes
clustered, you can use them directly to presume distances between
original documents are proportional to proxhash distances, if that's a
good enough approximation for your purposes, or you can use compute
actual Edit Distance just for those few documents that are close enough
together to be considered for user's result set.

By the way, your question has nothing to do with Java, so you should
have posted to comp.programming instead, and I'm cross-posting my
response there.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top