How fuzzy is get_close_matches() in difflib?

John Henry · Nov 16, 2006

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Steven D'Aprano · Nov 16, 2006

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Why don't you try it and see?

from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"]) ['apple', 'ape']
import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist) ['while']
get_close_matches("apple", _keyword.kwlist) []
get_close_matches("accept", _keyword.kwlist)

Click to expand...

Click to expand...

['except']

Those example, by the way, come from here:

John Henry · Nov 16, 2006

I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Click to expand...

Why don't you try it and see?

from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"]) ['apple', 'ape']
import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist) ['while']
get_close_matches("apple", _keyword.kwlist) []
get_close_matches("accept", _keyword.kwlist)

Click to expand...

Click to expand...

['except']

Those example, by the way, come from here:

Steven D'Aprano · Nov 17, 2006

I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.

John Henry · Nov 17, 2006

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

Steven D'Aprano · Nov 17, 2006

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:

This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people.
[end quote]

There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results

.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect

John Machin · Nov 17, 2006

John said:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John

Antoon Pardon · Nov 17, 2006

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

John Henry · Nov 17, 2006

I suppose you are right. I guess I ended up with an odd case.

I was thinking that:

To change "HIDE*S*ST1" to "HIDE*D*ST1", all you do is remove the "*S*"
from the source and the "*D*" from the target.

In order to change "HIDE*SC*T1" to "HIDE*DS*T1", I thought you have to
remove 2 characters *SC* from the source. Then I realize that it's
not true. If you remove the "C" from the source, and the "D" from the
*DS* of the destination, it's a match (!)

So, yes, they have the same distance!

Antoon said:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

Click to expand...

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

John Henry · Nov 17, 2006

Learn something new everyday. I always wondered how spell checkers are
done. Thanks.

John Henry · Nov 17, 2006

Steven D'Aprano wrote:

... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect

Looks like for this particular case, looking at number of matched
groups worked better.

John Machin · Nov 17, 2006

On Nov 17, 7:19 pm, Steven D'Aprano <[email protected]>
wrote:
[snip]

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results

Take 1, firing from the hip:

The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"

After reflection that the author of difflib is known not to be so
silly, take 2:

| >>> list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
[' H', ' I', ' D', ' E', '- D', '+ S', ' C', ' T', ' 1']

And the docs shed some light on what is going on:

| >>> help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
Compare two sequences of lines; generate the resulting delta.

Each sequence must contain individual single-line strings ending
with
newlines. Such sequences can be obtained from the `readlines()`
method
of file-like objects. The delta generated also consists of
newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.

Example:
''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
...
'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu

HTH,
John

difflib-like library supporting moved blocks detection?	2	Jul 13, 2011
Remote SSH and Configuring code help	0	Dec 13, 2023
Fuzzy Logic Library for Python 3	0	Dec 19, 2012
Fuzzy Lookups	24	Jan 30, 2006
Efficiently determine where documents differ	2	Jan 4, 2010
how to find difference in number of characters	12	Oct 9, 2010
Don't understand SequenceMatcher from difflib	0	Jun 21, 2011
Fuzzy string comparison	14	Dec 26, 2006

How fuzzy is get_close_matches() in difflib?

John Henry

Steven D'Aprano

John Henry

Steven D'Aprano

John Henry

Steven D'Aprano

John Machin

Antoon Pardon

John Henry

John Henry

John Henry

John Machin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads