How fuzzy is get_close_matches() in difflib?

J

John Henry

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
 
S

Steven D'Aprano

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?


Why don't you try it and see?
from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"]) ['apple', 'ape']
import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist) ['while']
get_close_matches("apple", _keyword.kwlist) []
get_close_matches("accept", _keyword.kwlist)
['except']


Those example, by the way, come from here:
 
J

John Henry

I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?


Why don't you try it and see?
from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"]) ['apple', 'ape']
import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist) ['while']
get_close_matches("apple", _keyword.kwlist) []
get_close_matches("accept", _keyword.kwlist)
['except']


Those example, by the way, come from here:
 
S

Steven D'Aprano

I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.
 
J

John Henry

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
 
S

Steven D'Aprano

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:

This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people.
[end quote]

There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results :(

.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :)
 
J

John Machin

John said:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John
 
A

Antoon Pardon

I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.
 
J

John Henry

I suppose you are right. I guess I ended up with an odd case.

I was thinking that:

To change "HIDE*S*ST1" to "HIDE*D*ST1", all you do is remove the "*S*"
from the source and the "*D*" from the target.

In order to change "HIDE*SC*T1" to "HIDE*DS*T1", I thought you have to
remove 2 characters *SC* from the source. Then I realize that it's
not true. If you remove the "C" from the source, and the "D" from the
*DS* of the destination, it's a match (!)

So, yes, they have the same distance!


Antoon said:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.
 
J

John Henry

Learn something new everyday. I always wondered how spell checkers are
done. Thanks.
 
J

John Henry

Steven D'Aprano wrote:

... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :)

Looks like for this particular case, looking at number of matched
groups worked better.
 
J

John Machin

On Nov 17, 7:19 pm, Steven D'Aprano <[email protected]>
wrote:
[snip]
You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results :(

Take 1, firing from the hip:

The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"

After reflection that the author of difflib is known not to be so
silly, take 2:

| >>> list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
[' H', ' I', ' D', ' E', '- D', '+ S', ' C', ' T', ' 1']

And the docs shed some light on what is going on:

| >>> help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
Compare two sequences of lines; generate the resulting delta.

Each sequence must contain individual single-line strings ending
with
newlines. Such sequences can be obtained from the `readlines()`
method
of file-like objects. The delta generated also consists of
newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.

Example:
''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
...
'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu

HTH,
John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,432
Messages
2,571,681
Members
48,796
Latest member
Greg L.

Latest Threads

Top