Loose "matching" of strings

Chris Plumber

I have two lists of strings (they are actually film titles gleaned from two
different sources) and I want to identify the matches between the lists.
Sadly these have been entered by different organisations and so there is a
fair degree of difference in the way that films are named. It's no problem
for a human to spot that the strings refer to the same film, but harder for
perl.

So, for example:
"Terminator 2" and "Terminator II" should match
"Star Wars: Return of the Jedi" and "Return of the Jedi" should match
"The Sound of Music" and "Music, The sound of" should match

etc.

What I am after is a module that does what a human does in saying,
"those strings aren't identical, but they clearly mean the same thing". It
needs to handle differences in word order, misspellings, additional or missing
words, different ways of representing numbers, variations in punctuation,
etc. I suppose the output should be a statistical one: how likely it is
that the strings match, where 100% means an absolute match and 0% means
nothing in common at all.

Any thoughts?
 
Chris Mattern

Chris said:
> I have two lists of strings (they are actually film titles gleaned from two
> different sources) and I want to identify the matches between the lists.
> Sadly these have been entered by different organisations and so there is a
> fair degree of difference in the way that films are named. It's no problem
> for a human to spot that the strings refer to the same film, but harder for
> perl.

It's *hard* for a human to do, and it's pretty well impossible for perl to do.

> So: For example..
> "Terminator 2" and "Terminator II" should match

Should they? There's a movie out there called "Terminator II" which is *not*
Terminator 2: Judgment Day, and is not part of the Schwarzenegger SF series
(it's also pretty bad).

> "Star Wars: Return of the Jedi" and "Return of the Jedi" should match
> "The Sound of Music" and "Music, The sound of" should match

To what? The famous Julie Andrews movie or an animated short that has
the same name? What about "Sound of Music"? Is that the same as "The
Sound of Music" or does it refer to two other films, both of which
actually have that title?

> etc ...
>
> What I am after is a module that does what a human does in terms of saying -
> "those strings arent identical, but they clearly mean the same thing". It
> needs to handle differences in word order, mis-spellings, additional/missing
> words, different ways of representing numbers, variations in punctuation -
> etc. I suppose the output should be a statistical one - how "likely" it is
> that the strings match - where 100% means an absolute match and 0% means no
> commonality at all.
>
> Any thoughts?

If you can do this, you will deserve a research position at the AI institute
of your choice.

Chris Mattern
 
Tore Aursand

> I have two lists of strings (they are actually film titles gleaned from
> two different sources) and I want to identify the matches between the
> lists.

Did you search for the appropriate modules on CPAN? There are lots of
modules which can help you out. Here are two for you:

String::Approx
Text::Soundex
 
David K. Wall

Tore Aursand said:
> Did you search for the appropriate modules on CPAN? There are lots of
> modules which can help you out. Here's two for you:
>
> String::Approx
> Text::Soundex

Text::Metaphone looks as if it might be useful, too.
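
Whichever of these modules is used, it would probably help to canonicalize the titles first, since approximate and phonetic matchers compare characters and know nothing about word order or Roman numerals. Here is a minimal sketch of that idea using only core Perl. The names `title_key`, `%ROMAN` and `%STOP` are hypothetical, and the rules (drop articles, map a few Roman numerals, sort the words) are assumptions about what matters for film titles, not a tested recipe.

```perl
use strict;
use warnings;

# Hypothetical lookup tables. Single-letter numerals (i, v, x) are left out
# on purpose because they collide with ordinary words and initials.
my %ROMAN = (ii => 2, iii => 3, iv => 4, vi => 6, vii => 7, viii => 8, ix => 9);
my %STOP  = map { $_ => 1 } qw(the a an);

# Reduce a title to a canonical key: lowercase, strip punctuation,
# convert Roman numerals, drop articles, and sort the remaining words.
sub title_key {
    my ($title) = @_;
    my @words = grep { length $_ } split /[^a-z0-9]+/, lc($title);
    @words = map  { exists $ROMAN{$_} ? $ROMAN{$_} : $_ } @words;
    @words = grep { !$STOP{$_} } @words;
    return join ' ', sort @words;
}

print title_key('Terminator II'), "\n";        # 2 terminator
print title_key('Music, The sound of'), "\n";  # music of sound
```

Titles whose keys are equal can be paired straight away. This does nothing for misspellings or for extra words such as the "Star Wars:" prefix, and sorting the words will also make some genuinely different titles collide, so the keys are best treated as a first pass only.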
 
Simon Taylor

Hi Chris,
> It's *hard* for a human to do, and it's pretty well impossible for
> perl to do.

I'm not so sure I agree. ;-) Naturally, the semantic aspect of the task
is indeed a 'hard problem', but as usual, CPAN provides tools that
do some of the preliminary work wonderfully well.

I think the OP should have a look at

Text::PhraseDistance
Text::Levenshtein

on CPAN.

These modules, or something like them, would do a lot of the preliminary
work that might make matching the lists of film titles achievable.
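
To give a feel for the kind of score the OP asked for, here is a rough sketch that turns a Levenshtein edit distance into a 0 to 100 figure. It is written in plain Perl so it runs without installing anything; it is not the Text::Levenshtein implementation, and the names `edit_distance` and `similarity`, along with the choice to divide by the longer string's length, are assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Classic dynamic-programming edit distance (insert, delete, substitute),
# keeping only the previous row of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1,          # deletion
                           $cur[$j - 1] + 1,       # insertion
                           $prev[$j - 1] + $cost); # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical scoring: 100 for identical strings (ignoring case),
# 0 when every character has to change.
sub similarity {
    my ($s, $t) = map { lc $_ } @_;
    my $longest = max(length($s), length($t));
    return 100 if $longest == 0;
    return 100 * (1 - edit_distance($s, $t) / $longest);
}

printf "%.1f\n", similarity('Terminator 2', 'Terminator II');  # 84.6
```

Pairs scoring above some cutoff could then be put in front of a human to confirm rather than accepted blindly, which also deals with the point that two different films can share a title. A character-level distance like this copes with misspellings but not with reordered words or long prefixes, so it is only part of the job.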

Regards,

Simon Taylor
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,280
Latest member
BGBBrock56

Latest Threads

Top