Ensuring symmetry in difflib.SequenceMatcher

Discussion in 'Python' started by John Yeung, Nov 24, 2010.

  1. John Yeung

    John Yeung Guest

    I'm generally pleased with difflib.SequenceMatcher: It's probably not
    the best available string matcher out there, but it's in the standard
    library and I've seen worse in the wild. One thing that kind of
    bothers me is that it's sensitive to which argument you pick as "seq1"
    and which you pick as "seq2":

    Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import difflib
    >>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()

    0.44444444444444442
    >>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()

    0.66666666666666663
    >>>


    Is this a bug? I am guessing the algorithm is implemented correctly,
    and that it's just an inherent property of the algorithm used. It's
    certainly not what I'd call a desirably property. Are there any
    simple adjustments that can be made without sacrificing (too much)
    performance?

    John
    John Yeung, Nov 24, 2010
    #1
    1. Advertising

  2. On Mittwoch 24 November 2010, John Yeung wrote:
    > Are there any
    > simple adjustments that can be made without sacrificing (too
    > much) performance?


    >>> difflib.SequenceMatcher(None,*sorted(('BYRD','BRADY'))).ratio()

    0.66666666666666663

    --
    Wolfgang
    Wolfgang Rohdewald, Nov 24, 2010
    #2
    1. Advertising

  3. John Yeung

    Peter Otten Guest

    John Yeung wrote:

    > I'm generally pleased with difflib.SequenceMatcher: It's probably not
    > the best available string matcher out there, but it's in the standard
    > library and I've seen worse in the wild. One thing that kind of
    > bothers me is that it's sensitive to which argument you pick as "seq1"
    > and which you pick as "seq2":
    >
    > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    > (Intel)] on
    > win32
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> import difflib
    >>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()

    > 0.44444444444444442
    >>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()

    > 0.66666666666666663
    >>>>

    >
    > Is this a bug? I am guessing the algorithm is implemented correctly,
    > and that it's just an inherent property of the algorithm used. It's
    > certainly not what I'd call a desirably property. Are there any
    > simple adjustments that can be made without sacrificing (too much)
    > performance?


    def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0

    I'm expecting 50% performance loss ;)

    Seriously, have you tried to calculate the ratio with realistic data?
    Without looking into the source I would expect the two ratios to get more
    similar.

    Peter
    Peter Otten, Nov 24, 2010
    #3
  4. John Yeung

    John Machin Guest

    On Nov 24, 8:43 pm, Peter Otten <> wrote:
    > John Yeung wrote:
    > > I'm generally pleased with difflib.SequenceMatcher:  It's probably not
    > > the best available string matcher out there, but it's in the standard
    > > library and I've seen worse in the wild.  One thing that kind of
    > > bothers me is that it's sensitive to which argument you pick as "seq1"
    > > and which you pick as "seq2":

    >
    > > Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
    > > (Intel)] on
    > > win32
    > > Type "help", "copyright", "credits" or "license" for more information.
    > >>>> import difflib
    > >>>> difflib.SequenceMatcher(None, 'BYRD', 'BRADY').ratio()

    > > 0.44444444444444442
    > >>>> difflib.SequenceMatcher(None, 'BRADY', 'BYRD').ratio()

    > > 0.66666666666666663

    >
    > > Is this a bug?  I am guessing the algorithm is implemented correctly,
    > > and that it's just an inherent property of the algorithm used.  It's
    > > certainly not what I'd call a desirably property.  Are there any
    > > simple adjustments that can be made without sacrificing (too much)
    > > performance?

    >
    > def symmetric_ratio(a, b, S=difflib.SequenceMatcher):
    >     return (S(None, a, b).ratio() + S(None, b, a).ratio())/2.0
    >
    > I'm expecting 50% performance loss ;)
    >
    > Seriously, have you tried to calculate the ratio with realistic data?
    > Without looking into the source I would expect the two ratios to get more
    > similar.
    >
    > Peter


    Surnames are extremely realistic data. The OP should consider using
    Levenshtein distance, which is "symmetric". A good (non-naive)
    implementation should be much faster than difflib.

    ratio = 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))
    John Machin, Nov 24, 2010
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?RXJpayBFVFM=?=

    A symmetry issue

    =?Utf-8?B?RXJpayBFVFM=?=, Jun 19, 2006, in forum: ASP .Net
    Replies:
    0
    Views:
    359
    =?Utf-8?B?RXJpayBFVFM=?=
    Jun 19, 2006
  2. shuhsien
    Replies:
    0
    Views:
    303
    shuhsien
    Oct 17, 2003
  3. Tim Peters
    Replies:
    0
    Views:
    391
    Tim Peters
    Oct 17, 2003
  4. Vlastimil Brom
    Replies:
    0
    Views:
    267
    Vlastimil Brom
    Apr 16, 2010
  5. Antoon Pardon
    Replies:
    0
    Views:
    143
    Antoon Pardon
    Jun 21, 2011
Loading...

Share This Page