John
I am trying to find similar files to reduce redundancy in a large
project. The 'Text' gem works well for this, but even short strings
take a long time to compare, and large strings, such as 20k HTML
files, take an enormous amount of time. My script looks like this:
require 'rubygems'
require 'text'
# file_one and file_two stand in for the paths of the two files to compare
a = File.read(file_one)
b = File.read(file_two)
puts Text::Levenshtein.distance(a, b)
It would be nice to be able to short-circuit the comparison once the
distance crosses a maximum value, but that isn't possible. It would be
even better to be able to compare long strings the way PHP's
similar_text does, which gives a nice percentage output. I have to do
a lot of comparisons, about 40 million. Is there something already
written?
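
To make the short-circuit idea concrete, here is a rough sketch of what I
mean. It is plain Ruby and not part of the Text gem; the names
bounded_levenshtein and similarity_percent are made up for illustration.
The percentage is derived from the Levenshtein distance, which is an
assumption on my part and not the algorithm PHP's similar_text uses.

```ruby
# Hypothetical helper (not from the Text gem): Levenshtein distance that
# returns nil as soon as the result is certain to exceed max.
def bounded_levenshtein(a, b, max)
  a = a.chars
  b = b.chars
  # The difference in length is a lower bound on the distance.
  return nil if (a.length - b.length).abs > max

  prev = (0..b.length).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    # The smallest value in a row is a lower bound on the final distance,
    # so once it passes max there is no point in continuing.
    return nil if curr.min > max
    prev = curr
  end
  prev.last <= max ? prev.last : nil
end

# Hypothetical helper: similarity as a percentage of the longer string,
# based on Levenshtein distance (not PHP's similar_text algorithm).
def similarity_percent(a, b)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?
  dist = bounded_levenshtein(a, b, longest)
  100.0 * (longest - dist) / longest
end
```

With something like this, a pair whose lengths differ by more than the
threshold is rejected without touching the matrix at all, which should
matter a lot over 40 million comparisons. It is still pure Ruby, though, so
I would not expect it to be fast on 20k files.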