Ensuring symmetry in difflib.SequenceMatcher

J

John Yeung

I'm generally pleased with difflib.SequenceMatcher: It's probably not
the best available string matcher out there, but it's in the standard
library and I've seen worse in the wild. One thing that kind of
bothers me is that it's sensitive to which argument you pick as "seq1"
and which you pick as "seq2":

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Is this a bug? I am guessing the algorithm is implemented correctly,
and that it's just an inherent property of the algorithm used. It's
certainly not what I'd call a desirably property. Are there any
simple adjustments that can be made without sacrificing (too much)
performance?

John
 
P

Peter Otten

John said:
I'm generally pleased with difflib.SequenceMatcher: It's probably not
the best available string matcher out there, but it's in the standard
library and I've seen worse in the wild. One thing that kind of
bothers me is that it's sensitive to which argument you pick as "seq1"
and which you pick as "seq2":

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
Is this a bug? I am guessing the algorithm is implemented correctly,
and that it's just an inherent property of the algorithm used. It's
certainly not what I'd call a desirably property. Are there any
simple adjustments that can be made without sacrificing (too much)
performance?

def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

I'm expecting 50% performance loss ;)

Seriously, have you tried to calculate the ratio with realistic data?
Without looking into the source I would expect the two ratios to get more
similar.

Peter
 
J

John Machin

John said:
I'm generally pleased with difflib.SequenceMatcher:  It's probably not
the best available string matcher out there, but it's in the standard
library and I've seen worse in the wild.  One thing that kind of
bothers me is that it's sensitive to which argument you pick as "seq1"
and which you pick as "seq2":
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
import difflib
difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio() 0.44444444444444442
difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio() 0.66666666666666663

Is this a bug?  I am guessing the algorithm is implemented correctly,
and that it's just an inherent property of the algorithm used.  It's
certainly not what I'd call a desirably property.  Are there any
simple adjustments that can be made without sacrificing (too much)
performance?

def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

I'm expecting 50% performance loss ;)

Seriously, have you tried to calculate the ratio with realistic data?
Without looking into the source I would expect the two ratios to get more
similar.

Peter

Surnames are extremely realistic data. The OP should consider using
Levenshtein distance, which is "symmetric". A good (non-naive)
implementation should be much faster than difflib.

ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,901
Latest member
Noble71S45

Latest Threads

Top