Chris Plumber
I have two lists of strings (they are actually film titles gleaned from two
different sources) and I want to identify the matches between the lists.
Sadly these have been entered by different organisations and so there is a
fair degree of difference in the way that films are named. It's no problem
for a human to spot that the strings refer to the same film, but harder for
Perl.
For example:
"Terminator 2" and "Terminator II" should match
"Star Wars: Return of the Jedi" and "Return of the Jedi" should match
"The Sound of Music" and "Music, The sound of" should match
etc ...
What I am after is a module that does what a human does in terms of saying:
"those strings aren't identical, but they clearly mean the same thing". It
needs to handle differences in word order, misspellings, additional/missing
words, different ways of representing numbers, variations in punctuation,
etc. I suppose the output should be a statistical one: how "likely" it is
that the strings match, where 100% means an absolute match and 0% means no
commonality at all.
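To make the requirement concrete, here is a rough sketch of the kind of scoring I mean. It is written in Python purely to pin the idea down (the same steps should carry over to Perl), and it is not an existing module. The names `tokens` and `match_score`, the small Roman-numeral table, and the choice of an overlap coefficient (shared words divided by the size of the smaller word set) are all assumptions made for illustration.

```python
import re

# Hypothetical lookup table, for illustration only. Single-letter numerals
# ("i", "v", "x") are left out on purpose because they collide with real
# words and titles.
ROMAN = {"ii": "2", "iii": "3", "iv": "4", "vi": "6",
         "vii": "7", "viii": "8", "ix": "9"}


def tokens(title):
    """Lower-case the title, drop punctuation, map Roman numerals to
    digits, and return the set of words (so word order is ignored)."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {ROMAN.get(w, w) for w in words}


def match_score(a, b):
    """Return a percentage: shared words / size of the smaller word set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 100.0 * len(ta & tb) / min(len(ta), len(tb))


print(match_score("Terminator 2", "Terminator II"))                           # 100.0
print(match_score("Star Wars: Return of the Jedi", "Return of the Jedi"))     # 100.0
print(match_score("The Sound of Music", "Music, The sound of"))               # 100.0
print(match_score("Terminator 2", "The Sound of Music"))                      # 0.0
```

This covers word order, punctuation, extra/missing words and one way of writing numbers, but it does nothing about misspellings, and it is far too generous to short titles (a one-word title scores 100% against anything containing that word). That is the gap I am hoping a proper module would fill.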
Any thoughts?