R
Rogan Dawes
Hi folks,
I am trying to implement a way of showing similar documents (html pages,
mostly), and sorting them according to the degree of similarity.
The easiest way to do this is to calculate an Edit Distance (i.e. the
minimum number of changes required to transform one document to another)
from the target document.
However, the naive Levenshtein algorithm is O(n^2), and when you are
comparing 2 10k documents, it can take quite a while (9 or so seconds on
my Thinkpad T30 per comparison!)
Does anyone know more computationally efficient comparison algorithms?
Many thanks!
Rogan
I am trying to implement a way of showing similar documents (html pages,
mostly), and sorting them according to the degree of similarity.
The easiest way to do this is to calculate an Edit Distance (i.e. the
minimum number of changes required to transform one document to another)
from the target document.
However, the naive Levenshtein algorithm is O(n^2), and when you are
comparing 2 10k documents, it can take quite a while (9 or so seconds on
my Thinkpad T30 per comparison!)
Does anyone know more computationally efficient comparison algorithms?
Many thanks!
Rogan