Fuzzy Lookups

B

BBands

I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab
 
D

Diez B. Roggisch

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = distance(a,b)
longer = float(max((len(a), len(b))))
shorter = float(min((len(a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r


distance is a run-of-the mill levenshtein-distance as yours.

The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3, but the normalized distance is


0.729591836735

and

0.015306122449

respectively - making you chose the better match :)

regards,

Diez
 
F

Fredrik Lundh

Diez said:
The advantage becomes apparent when you try to e.g. compare

"Angelina Jolie"

with

"AngelinaJolei"

and

"Bob"

Both have a l-dist of 3
13

what did I miss ?

</F>
 
D

Diez B. Roggisch

Fredrik said:
13

what did I miss ?

Hmm. I missed something - the "1" before the "3" in 13 when I looked on my
terminal after running the example. And according to

http://www.reference.com/browse/wiki/Levenshtein_distance

it has the property

"""It is always at least the difference of the sizes of the two strings."""

And my implementation I got from there (or better from Magnus Lie Hetland
whoms python version is referenced there)

So you are right, my example is crap.

But I ran into cases where my normalizing made sense - otherwise I wouldn't
have done it :)

I guess it is more along the lines of (coughed up example)

"abcdef"

compared to

"abcefd"

"abcd"

I can only say that I used it to fuzzy-compare people's and hotel names, and
applying the normalization made my results by far better.

Sorry to cause confusion.

Diez
 
B

BBands

Diez said:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab
 
G

gene tani

BBands said:
Thanks for the snippet. I agree that normalizing is important. A
distance of three is one thing when your strings are long, but quite
another when they are short. I'd been thinking about something along
these lines myself, but hadn't gotten there yet. It'll be interesting
to have a look at the distribution of the normalized numbers, I'd guess
that there may be a rough threshold that effectively separates the
wheat from the chaff.

jab

i noticed this guy, who's quite a good ruby developer spent some time
on distances:

http://ruby.brian-schroeder.de/editierdistanz/

and also look at soundex, other algorithms (Double Metaphone, NYSIIS,
Phonex, I have notes to investigate but I haven't looked at them
myhself)
 
A

ajones

BBands said:
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1] == n[1]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

-Greg


I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
[[genre, artist, album, song]...]. I wanted to add a search function to
locate all the versions of a particular song. This is harder than you
might think. For example the Cajun "national anthem" is Jolie Blond,
but it can be spelled several different ways jolie, joli, blon, blond,
etc... In addition the various online services that provide song info
are riddled with typos, so an ordinary string match just doesn't get
you there. What is needed is a fuzzy string match and it turns out that
there is a very good one, the Levenshtein distance, which is the number
of inserts, deletions and substitutions needed to morph one string into
another. In my application I match the desired song title against all
song titles in my list and return the ones with the lowest Levenshtein
distances. This is remarkably, one might even say stunningly,
effective, easily finding all the version of Jolie Blon in the list.

I am using the following snippet (found on the web, proper attribution
unsure), which calculates the Levenshtein distance.

def distance(a,b):
c = {}
n = len(a); m = len(b)

for i in range(0,n+1):
c[i,0] = i
for j in range(0,m+1):
c[0,j] = j

for i in range(1,n+1):
for j in range(1,m+1):
x = c[i-1,j]+1
y = c[i,j-1]+1
if a[i-1] == b[j-1]:
z = c[i-1,j-1]
else:
z = c[i-1,j-1]+1
c[i,j] = min(x,y,z)
return c[n,m]

As mentioned above this works quite well and I am happy with it, but I
wonder if there is a more Pythonic way of doing this type of lookup?

jab

Here is my stab at it, didn't fully test it so it may not work
correctly. The tuples could be avoided by using len(x), but the extra
lookups may cost in execution speed[1].

def distance(a, b):
"""
Computes the levenshtein distance between two strings
"""
m, n = (len(a),a), (len(b),b)
if(m[0] < n[0]): #ensure that the 'm' tuple holds
the longest string
m, n = n, m
dist = m[0] #assume distance = length of
longest string (worst case)
for i in range(0, n[0]): # reduce the distance for each char
match in shorter string
if m[1] == n[1]:
dist = dist - 1
return dist

[1] I have no if this is true or not. I can see arguments for both.
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

Thanks for that, I'll have a look. (So many packages, so little

Yes, there's a standard library for everything it seems! Except for a
MySQL api :-(
I would be very interested it seeing that.

Done, see:
http://www.blendedtechnologies.com/removing-duplicate-mp3s-with-python-a-naive-yet-fuzzy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking for find exact matches regardless of name
- Put each set of duplicates in its own subfolder.
 
K

Kent Johnson

Gregory said:
Ok, ok, I got it! The Pythonic way is to use an existing library ;-)

import difflib
CloseMatches=difflib.get_close_matches(AFileName,AllFiles,20,.7)

I wrote a script to delete duplicate mp3's by filename a few years
back with this. If anyone's interested in seeing it, I'll post a blog
entry on it. I'm betting it uses a similiar algorithm your functions.

A quick trip to difflib.py produces this description of the matching
algorithm:

The basic
algorithm predates, and is a little fancier than, an algorithm
published in the late 1980's by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The basic idea is to find
the longest contiguous matching subsequence that contains no "junk"
elements (R-O doesn't address junk). The same idea is then applied
recursively to the pieces of the sequences to the left and to the
right of the matching subsequence.

So no, it doesn't seem to be using Levenshtein distance.

Kent
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

I wonder which algorithm determines the similarity between two strings better?
 
S

Steven D'Aprano

http://www.blendedtechnologies.com/removing-duplicate-mp3s-with-python-a-naive-yet-fuzzy-approach/60

If anyone would be kind enough to improve it I'd love to have these
features but I'm swamped this week!

- MD5 checking for find exact matches regardless of name
- Put each set of duplicates in its own subfolder.

This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?
 
P

Paul Rubin

Steven D'Aprano said:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?

It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.
 
T

Tom Anderson

I often wonder that!
It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates (i.e.
if files 637 and 2945 have the same contents, you want to discover
that). The obvious way is make a list of hashes, and sort the list.

Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, they all differ within the first 20 bytes - probably due to the ID3
tags, so this should work for these.

Better yet, note that if two files are identical, they must have the same
length, and that finding the length can be done very cheaply, so a quicker
yet approach is to make a list of lengths, sort that, and look for
duplicates; when you find one, do a byte-by-byte comparison of the files
(probably terminating in the first block) to see if they really are the
same.

By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within the
first 18 bytes (mostly, within the first 9 bytes). Therefore, i could rule
out duplication with just 22 data blocks read from disk (plus rather more
blocks of directory information and inodes, of course). A hash-based
approach would have had to wade through a touch over 13 GB of data before
it could even get started.

Of course, there are situations where this is the wrong approach - if you
have a collection of serialised sparse matrices, for example, which
consist of identically-sized blocks of zeroes with a scattering of ones
throughout, then lengths and prefixes will be useless, whereas hashes will
work perfectly. However, here, we're looking at MP3s, where lengths and
prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one is a song from that and The Doors' 'Horse Latitudes', and
one is between to Calexico songs ('The Ride (Pt II)' and 'Minas De
Cobre'). Why on earth are eleven of the twelve pairs pairs of songs from
the same artist? Is it really that they're pairs of songs from the same
compressor (those tracks aren't from CD), i wonder?
 
S

Steven D'Aprano

It's not just a matter of comparing two files. The idea is you have
10,000 local files and you want to find which ones are duplicates
(i.e. if files 637 and 2945 have the same contents, you want to
discover that). The obvious way is make a list of hashes, and sort
the list.

Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)

I can't see any, but I thought maybe that's because I'm not thinking
outside the box.
 
P

Paul Rubin

Tom Anderson said:
Obvious, perhaps, prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, ...

The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for. Hashing the first
few hundred bytes would be more reliable, since it would include some
actual audio in the hash. But then you're back to a list of hashes.
Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so
a quicker yet approach is to make a list of lengths, sort that, and
look for duplicates; when you find one, do a byte-by-byte comparison
of the files (probably terminating in the first block) to see if they
really are the same.

Yes, checking the file lengths first is an obvious heuristic, but if
you fine you have a bunch of files with the same length, what do you
do? You're back to a list of hashes.
By way of example, of the 2690 music files in my iTunes library, i
have twelve pairs of same-sized files [1], and all of these differ
within the first 18 bytes (mostly, within the first 9 bytes).

That's a small enough set of matches that you don't need a general
purpose algorithm.
 
P

Paul Rubin

Steven D'Aprano said:
Sure. But if you are just comparing two files, is there any reason to
bother with a checksum? (MD5 or other.)

No of course not, except in special situations, like some problem
opening and reading both files simultaneously. E.g.: the files are on
two different DVD-R's, they are too big to fit in ram, and you only
have one DVD drive. If you want to compare byte by byte, you have to
either copy one of the DVD's to your hard disk (if you have the space
available) or else swap DVD's back and forth in the DVD drive reading
and comparing a bufferload at a time. But you can easily read in the
first DVD and compute its hash on the fly, then read and hash the
second DVD and compare the hashes.

If it's a normal situation with two files on HD, just open both files
simultaneously, and use large buffers to keep the amount of seeking
reasonable. That will be faster than big md5 computations, and more
reliable (there are known ways to construct pairs of distinct files
that have the same md5 hash.)
 
E

Erik Max Francis

Steven said:
This isn't a criticism, it is a genuine question. Why do people compare
local files with MD5 instead of doing a byte-to-byte compare? Is it purely
a caching thing (once you have the checksum, you don't need to read the
file again)? Are there any other reasons?

Because if you store a hash, then you can keep that around even when the
original file is archived, moved elsewhere, or deleted. It's awfully
helpful for building databases of files you've seen before.
 
T

Tom Anderson

The possibility of two different mp3 files having the same id3 tags is
something you might specifically be checking for.

So read from the end of the file, rather than the beginning.
Yes, checking the file lengths first is an obvious heuristic, but if you
fine you have a bunch of files with the same length, what do you do?
You're back to a list of hashes.

Or prefixes or suffixes.
By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within
the first 18 bytes (mostly, within the first 9 bytes).

That's a small enough set of matches that you don't need a general
purpose algorithm.

True - and this is *exactly* the situation that the OP was talking about,
so this algorithm is appropriate. Moreover, i believe is representative of
most situations where you have a bunch of files to compare. Of course,
cases where files are tougher to tell apart do exist, but i think they're
corner cases. Could you suggest a common kind of file with degenerate
lengths, prefixes and suffixes?

The only one that springs to mind is a set of same-sized image files in
some noncompressed format, recording similar images (frames in a movie,
say), where the differences might be buried deep in the pixel data. As it
happens, i have just such a dataset on disk: with the images in TIFF
format, i get differences between subsequent frames after 9 bytes, but i
suspect that's a timestamp or something; if i convert everything to a nice
simple BMP (throwing away 8 bits per sample of precision in the process -
probably turning most of the pixels to 0!), then i find differences about
a megabyte in. If i compare from the tail in, i also have to wade through
about a megabyte before finding a difference. Here, hashes would be ideal.

tom
 
B

BBands

Diez said:
I did a levenshtein-fuzzy-search myself, however I enhanced my version by
normalizing the distance the following way:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
<snip>

Hello,

I adapted your approach to my needs and thought you might like to see
the result

def LevenshteinRelative(a, b):
"""
Returns the Levenshtein distance between two strings
as a relative quantity in the range 1 to 0 where
1.0 is a perfect match.
"""
# Calculate the Levenshtein distance. (returns an integer)
dist = LevenshteinDistance(a, b)
# dist is at most the length of the longer string.
max_dist = float(max(len(a), len(b)))
# dist is always at least the difference of the sizes of the two
strings.
min_dist = max_dist - float(min(len(a), len(b)))
try: # If max_dist and min_dist are equal use simple form.
relative = 1.0 - (dist - min_dist) / (max_dist - min_dist)
except ZeroDivisionError:
relative = 1.0 - dist / max_dist
return relative

Thanks,

jab
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top