Looking for a library to estimate the likeness of two strings


agenkin

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!
 

Tim Chase

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

It sounds like you're interested in calculating the Levenshtein
distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

which gives you a measure of how different they are. A distance of 0
means the inputs are identical; the more different the two strings
are, the greater the resulting output of the function.

Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
N=len(word2)) from my understanding of the code I've seen.
However it really is the best approximation I've seen of a "how
similar are these two strings" function. Googling for

python levenshtein distance

brings up oodles of hits.
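
For example, with the third-party python-Levenshtein package (one of
those hits; any pure-Python implementation of the algorithm behaves
the same way), usage looks roughly like this:

# Rough sketch; assumes the third-party python-Levenshtein extension
# is installed.
import Levenshtein

print(Levenshtein.distance("smith", "smith"))   # 0 -- identical strings
print(Levenshtein.distance("smith", "smyth"))   # 1 -- one substitution
print(Levenshtein.distance("jon", "jonathan"))  # 5 -- five insertions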

-tkc
 

Jeff Schwab

Tim said:
It sounds like you're interested in calculating the Levenshtein distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

which gives you a measure of how different they are. A distance of 0
means the inputs are identical; the more different the two strings
are, the greater the resulting output of the function.

Unfortunately, it's an O(MN) algorithm (where M=len(word1) and
N=len(word2)) from my understanding of the code I've seen. However it
really is the best approximation I've seen of a "how similar are these
two strings" function. Googling for

python levenshtein distance

brings up oodles of hits.

If the strings happen to be the same length, the Levenshtein distance is
equivalent to the Hamming distance. The Wikipedia article gives the
following Python implementation:

# http://en.wikipedia.org/wiki/Hamming_distance
def hamdist(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
 

Robert Kern

Jeff said:
If the strings happen to be the same length, the Levenshtein distance is
equivalent to the Hamming distance. The Wikipedia article gives the
following Python implementation:

# http://en.wikipedia.org/wiki/Hamming_distance
def hamdist(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

I'm afraid that it isn't. Using Magnus Lie Hetland's implementation:

http://hetland.org/coding/python/levenshtein.py

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop.
:def levenshtein(a,b):
:    "Calculates the Levenshtein distance between a and b."
:    n, m = len(a), len(b)
:    if n > m:
:        # Make sure n <= m, to use O(min(n,m)) space
:        a,b = b,a
:        n,m = m,n
:
:    current = range(n+1)
:    for i in range(1,m+1):
:        previous, current = current, [i]+[0]*n
:        for j in range(1,n+1):
:            add, delete = previous[j]+1, current[j-1]+1
:            change = previous[j-1]
:            if a[j-1] != b[i-1]:
:                change = change + 1
:            current[j] = min(add, delete, change)
:
:    return current[n]
:--

In [2]:

In [3]: def hamdist(s1, s2):
   ...:     assert len(s1) == len(s2)
   ...:     return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
   ...:

In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6

In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2


--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Steven D'Aprano

Jeff Schwab wrote: ....
If the strings happen to be the same length, the Levenshtein distance
is equivalent to the Hamming distance.
....

Robert Kern wrote:
I'm afraid that it isn't. Using Magnus Lie Hetland's implementation: ....
In [4]: hamdist('abcdef', 'fabcde')
Out[4]: 6

In [5]: levenshtein('abcdef', 'fabcde')
Out[5]: 2


I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.


For what it's worth, here's my implementation, which I wrote some years
ago. I doubt it's optimal.





def levenshtein(s1, s2):
    """Return the Levenshtein edit distance between two strings.

    s1 and s2 should be two arbitrary length strings. Returns an
    integer count of the edit distance between the strings.
    """
    m, n = len(s1), len(s2)
    if m == 0: return n
    elif n == 0: return m
    # Create an array with rows 0..m and columns 0..n.
    array = [None]*(m+1)
    for i in range(m+1):
        array[i] = [None]*(n+1)
    # Initialise the first row to 0..n.
    for i in range(n+1):
        array[0][i] = i
    # Initialise the first column to 0..m.
    for i in range(m+1):
        array[i][0] = i
    # Measure the differences.
    for i in range(1, m+1):
        c1 = s1[i-1]
        for j in range(1, n+1):
            c2 = s2[j-1]
            cost = int(c1 != c2)
            x = array[i][j-1] + 1       # Cell immediately left plus one.
            y = array[i-1][j] + 1       # Cell immediately above plus one.
            z = array[i-1][j-1] + cost  # Cell diagonally above and
                                        # to the left, plus the cost.
            array[i][j] = min(x, y, z)
    # When done, the bottom-right cell contains the Levenshtein distance.
    return array[m][n]
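
For instance, it agrees with the figures Robert posted above:

>>> levenshtein('abcdef', 'fabcde')
2
>>> levenshtein('abcdef', 'abcdef')
0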
 

Jeff Schwab

Steven said:
I can confirm Robert's results, using a different implementation of the
Levenshtein edit distance. It isn't true that Levenshtein simplifies to
Hamming when the strings are equal length.

Whoops! My mistake.
 

Daniel Fetchinson

Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!


Hi folks, I just went through this thread and a related one from 2006,
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something close to it (close being defined by one of the
string metrics discussed above), it seems I have to select each and
every word from the database and compare it with the query word, which
is very inefficient.

Very concretely, if I have an SQLite db with SQLObject as the ORM, what
would be the most efficient way of doing such a query?

Cheers,
Daniel
 

Lee Capps

agenkin said:
Are there any Python libraries implementing measurement of similarity
of two strings of Latin characters?

I'm writing a script to guess-merge two tables based on people's
names, which are not necessarily spelled the same way in both tables
(especially the given names). I would like some function that would
help me make the best guess.

Many thanks in advance!

I used difflib.get_close_matches for something similar:

http://docs.python.org/lib/module-difflib.html
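
A rough sketch of how that might look for the name-merging case (the
names below are invented):

from difflib import get_close_matches

table2_names = ["jon smith", "jonathan smyth", "mary jones", "brian o'neil"]

for name in ["jonathan smith", "maria jones"]:
    # n=1 keeps only the best candidate; cutoff is the minimum similarity
    # ratio (0.0 .. 1.0) below which no match is returned.
    print(name, "->", get_close_matches(name, table2_names, n=1, cutoff=0.6))
    # should pick 'jonathan smyth' and 'mary jones' respectively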

HTH.
 

Steve Holden

Is this really what the OP was asking for? If I understand it correctly,
Levenshtein distance works out the number of edits required to transform
one string into the target string. The smaller the distance, the more
similar the strings, but with the OP's problem I would expect


table1    table2
brian     briam
          erian


I think the OP would like to guess at 'briam' rather than 'erian', but
Levenshtein would rate them equally good guesses?

I know this is pushing it more toward phonetic analysis of the words or
something similar, and that's orders of magnitude more complex.

just in case,

http://www.linguistlist.org/sp/Software.html#97

might be a good place to start looking into it, along with the NLTK
libraries here:

http://nltk.sourceforge.net/index.php/Documentation

You could perhaps use soundex to try to choose between different
possibilities with the same Levenshtein distance from the sample.
Soundex by itself is horrible, but it might work as a prioritizer.
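
A rough sketch of that two-stage idea: rank by edit distance first, and
break ties with Soundex. The soundex() below is a minimal American
Soundex; "levenshtein" stands for any of the edit-distance functions
posted earlier in this thread, passed in as a parameter.

def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    digits = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            digits[ch] = digit
    name = name.upper()
    code, last = name[0], digits.get(name[0], "")
    for ch in name[1:]:
        digit = digits.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "HW":   # H and W do not separate repeated codes
            last = digit
    return (code + "000")[:4]

def best_match(name, candidates, levenshtein):
    """Pick the candidate with the smallest edit distance, preferring a
    matching Soundex code when the distances are tied."""
    target = soundex(name)
    return min(candidates,
               key=lambda c: (levenshtein(name, c), soundex(c) != target))

With the table above, best_match('brian', ['briam', 'erian'],
levenshtein) should prefer 'briam': both candidates are one edit away,
but only 'briam' shares brian's Soundex code (B650).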

regards
Steve
 

JKPeck

I used difflib.get_close_matches for something similar:

http://docs.python.org/lib/module-difflib.html

HTH.

Algorithms typically used for name comparisons include soundex,
nysiis, and levenshtein distance. The last is more general and
variations are used in spell checkers. You can probably Google for
Python versions. You can find implementations of these comparison
functions at www.spss.com/devcentral in the extendedTransforms.py
module (requires a login, but it's free).

HTH,
Jon Peck
 

agenkin

Hi, All:

Many thanks for the excellent leads. I've also found several
functions for measuring phonetic similarity between English names: the
soundex mentioned above, and also one called metaphone. I'm now
thinking about the best way to use some combination of these functions.
 

agenkin

Hi folks, I just went through this thread and a related one from 2006,
and I was wondering what the best solution is for using these string
metrics in a database search. If I want to query the database for a
string or something close to it (close being defined by one of the
string metrics discussed above), it seems I have to select each and
every word from the database and compare it with the query word, which
is very inefficient.

I have never used an SQLite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john');
SELECT * FROM s WHERE difference(name, 'john') > 2;

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html
 

Guilherme Polo

agenkin said:
I have never used sqlite database, but Postgres has a module that
implements levenshtein(), soundex() and metaphone() functions, so you
can do something like this:

SELECT * FROM s WHERE soundex(name) = soundex('john');
SELECT * FROM s WHERE difference(name, 'john') > 2;

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

SQLite supports soundex, but it is disabled by default; you need to
compile it with -DSQLITE_SOUNDEX=1.
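
If recompiling isn't convenient, another option (just a sketch, using
the standard sqlite3 module directly rather than SQLObject) is to
register a Python similarity function on the connection and call it
from SQL; difflib stands in here for any of the metrics discussed
above. Note that this still scans every row, so it mainly saves
round-trips rather than comparisons.

import sqlite3
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio between 0.0 and 1.0; 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()

conn = sqlite3.connect(":memory:")
conn.create_function("similarity", 2, similarity)
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)",
                 [("john",), ("jon",), ("joan",), ("mary",)])

# The 0.7 threshold is arbitrary; tune it against real data.
rows = conn.execute("SELECT name FROM people "
                    "WHERE similarity(name, ?) > 0.7", ("john",))
print([r[0] for r in rows])   # e.g. ['john', 'jon', 'joan']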
 

Gabriel Genellina

Many thanks for the excellent leads. I've also found several
functions for measuring phonetic similarity between English names: the
soundex mentioned above, and also one called metaphone. I'm now
thinking about the best way to use some combination of these functions.

You may be interested in this article, which discusses some techniques
used to match similar records:

Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
 
