Jaro-Winkler algorithm: Measuring similarity between strings


Øyvind

Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.
Useful for measuring similarity between two strings, for example if
you want to detect that the user made a typo.



def jarow(s1, s2):
    """ Returns a number between 1 and 0, where 1 is the most similar

    example:

    print jarow("martha","marhta")

    """
    m = jarow_m(s1, s2)
    t1 = jarow_t(s1, s2)
    t2 = jarow_t(s2, s1)
    t = float(t1)/float(t2)

    d = 0.1

    # this is the jaro-distance
    d_j = 1.0/3.0 * ((m/len(s1)) + (m/len(s2)) + ((m - t)/float(m)))

    # if the strings are prefixed similar, they are weighted more heavily
    l = winkler_l(s1, s2)
    print l
    return d_j + (l * 0.1 * (1 - d_j))

def winkler_l(s1, s2):
    """ Number of the four first characters matching """

    l = 0
    counter = 0
    for s_j, s_i in zip(s1, s2):
        if s_j == s_i:
            l += 1
        counter += 1
        if counter > 4:
            break
    return l

def jarow_m(s1, s2):
    """ Number of matching characters """
    m = 0
    d = {}
    for s in s1:
        d[s] = True
    for s in s2:
        if d.has_key(s):
            m += 1
    return m

def jarow_t(s1, s2):
    """
    Number of transpositions
    """
    t = 0
    pos = {}
    counter = 0
    for s in s1:
        pos[s] = counter
        counter += 1
    counter = 0
    for s in s2:
        if pos.has_key(s):
            if pos[s] != counter:
                t += 1
        counter += 1
    return t
 

John Machin

Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.

For another Python implementation, google "febrl".
Useful for measuring similarity between two strings, for example if
you want to detect that the user made a typo.

You mean like comparing the user's input word with some collection of
valid words? You would need to be using something else as a quick-and-
dirty filter ... Jaro-Winkler is relatively slow.
def jarow(s1,s2):

    """  Returns a number between 1 and 0, where 1 is the most similar

        example:

        print jarow("martha","marhta")

        """
    m= jarow_m(s1,s2)
    t1 = jarow_t(s1,s2)
    t2 = jarow_t(s2,s1)
    t = float(t1)/float(t2)

Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.

One usually prefers symmetry, i.e. dist(s1, s2) == dist(s2, s1). You
can't have symmetry if t = t1/t2. Also as the Wikipedia article says,
it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
dist(b, c).
    d = 0.1

    # this is the jaro-distance
    d_j = 1.0/3.0 * ((m/len(s1)) + (m/len(s2)) + ((m - t)/float(m)))

    # if the strings are prefixed similar, they are weighted more heavily
    l = winkler_l(s1,s2)
    print l
    return d_j + (l * 0.1 * (1 - d_j))

def winkler_l(s1,s2):
    """ Number of the four first characters matching """

    l = 0
    counter = 0
    for s_j,s_i in zip(s1,s2):

        if s_j == s_i:

            l += 1
        counter += 1

        if counter > 4:
            break

    return l

def jarow_m(s1,s2):

    """ Number of matching characters """

This code ignores the caveat from the Wikipedia article
"""
Two characters from s1 and s2 respectively, are considered matching
only if they are not farther than \left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor - 1.
"""
which looks as though it is missing the words "characters apart" or
some such from the end of it. FWIW I've never seen this "distance
apart" restriction expressed unambiguously in the case where the
strings are of unequal length.
    m = 0
    d = {}
    for s in s1:

        d[s] = True

    for s in s2:

        if d.has_key(s):

            m += 1
    return m


The above code is not symmetrical; jarow_m(s1, s2) does not
necessarily equal jarow_m(s2, s1). The article talks about one "m",
not two of them.
def jarow_t(s1,s2):

    """
    Number of transpositions

    """

    t= 0
    pos ={}
    counter = 0
    for s in s1:

        pos[s] = counter
        counter += 1
    counter = 0
    for s in s2:

        if pos.has_key(s):

            if pos[s] != counter:


This test is likely to come up with the wrong answer if the string
lengths differ.
                t += 1

        counter += 1

    return t

Cheers,
John
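
For reference, here is a sketch of the full definition the critique above
points at: a matching window of floor(max(|s1|,|s2|)/2) - 1, a single
half-counted transposition total, and the prefix boost. The function and
variable names are illustrative only, not taken from the thread or from
the Wikipedia article.

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity per the Wikipedia definition (symmetric)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)

    # m: characters of s1 paired with an unused character of s2
    # no more than `window` positions away.
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t: half the number of matched characters that are out of order.
    k = 0
    out_of_order = 0
    for i, ch in enumerate(s1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2.0

    d_j = (float(m) / len(s1) + float(m) / len(s2) + (m - t) / m) / 3.0

    # Winkler boost: common prefix of at most four characters.
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return d_j + l * p * (1.0 - d_j)

With this version jaro_winkler("martha", "marhta") gives roughly 0.961, the
worked example from the Wikipedia page, and the result is the same whichever
order the arguments are passed in.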
 

Roger Binns


Øyvind said:
Based on examples and formulas from http://en.wikipedia.org/wiki/Jaro-Winkler.
Useful for measuring similarity between two strings, for example if
you want to detect that the user made a typo.

Jaro-Winkler is best when dealing with names (Winkler works for the US
census). There are pure Python and C accelerated implementations at
http://bitpim.svn.sourceforge.net/viewvc/bitpim/trunk/bitpim/src/native/strings/


If you are concerned about typos then taking into account the keyboard
layout will help. For example for a user with a US keyboard, the 'a' or
'd' keys would be a common typo for 's'.

Also consider Levenshtein distance:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance

Roger
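
As a rough sketch of combining the two suggestions above, here is a plain
Levenshtein distance whose substitution cost is discounted when the two
characters are neighbours on a US QWERTY keyboard. The adjacency table is
deliberately partial and the 0.5 discount is an arbitrary illustrative
choice, not something taken from either link.

# Partial US-QWERTY neighbour map; extend it for real use.
ADJACENT = {
    'a': set('qwsz'), 's': set('awedxz'), 'd': set('serfcx'),
    'e': set('wsdr'),  'r': set('edft'),  't': set('rfgy'),
}

def typo_distance(a, b):
    """Levenshtein distance where substituting a key for one of its
    keyboard neighbours costs 0.5 instead of 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif cb in ADJACENT.get(ca, ()) or ca in ADJACENT.get(cb, ()):
                sub = 0.5          # plausible fat-finger substitution
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1,         # deletion
                           cur[j - 1] + 1,      # insertion
                           prev[j - 1] + sub))  # substitution
        prev = cur
    return prev[-1]

For example typo_distance("paste", "pasre") comes out at 0.5 because 'r'
sits next to 't', while typo_distance("paste", "pasme") stays at 1.0.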
 

Øyvind

Thanks for the useful comments.


For another Python implementation, google "febrl".


You mean like comparing the user's input word with some collection of
valid words? You would need to be using something else as a quick-and-
dirty filter ... Jaro-Winkler is relatively slow.


Do you have any alternative suggestions?
Huh? t1 and t2 are supposed to be counts of transpositions. So is t.
So how come t is a ratio of t1 to t2?? BTW, suppose t2 is zero.

Good point. There should be only one jarow_t.
Also as the Wikipedia article says,
it's not a metric. I.e. it doesn't satisfy dist(a, c) <= dist(a, b) +
dist(b, c).

It's not a mathematically correct metric, but it is a useful heuristic.
The above code is not symmetrical; jarow_m(s1, s2) does not
necessarily equal jarow_m(s2, s1). The article talks about one "m",
not two of them.


Hmm.. also a good point. I will make it count all the matches, not
just the matches in s1.
 

John Machin

Thanks for the useful comments.





Do you have any alternative suggestions?

Use a Levenshtein or Damerau edit distance. The latter handles
adjacent transpositions (...AB...->...BA...) which covers well over
90% of the transposition errors I've ever seen. The remainder
e.g. ...AXB...->...BXA... are given the same weight by Jaro-Winkler;
this is IMHO silly.
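
A sketch of the "optimal string alignment" flavour of that idea, i.e. the
plain Levenshtein recurrence plus adjacent transpositions counted as a
single edit. This is the restricted form of Damerau's distance, written
from the textbook recurrence rather than from any particular library.

def osa_distance(a, b):
    """Levenshtein distance extended with adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # adjacent swap ...AB... -> ...BA... counted as one edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

Here osa_distance("martha", "marhta") is 1 (the th/ht swap is one edit),
whereas plain Levenshtein charges 2 for it.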

Read a little bit further than the CS101 level on Levenshtein edit
distance and ignore any implementation which calculates all N*M cells
of a dynamic programming matrix. In the case where you want the exact
distance only while it stays under a threshold (you will reject a match
if the distance is too big), you only need to fill in the entries in a
narrow band astride the diagonal, and you can reliably give up once the
current distance on the diagonal becomes too high.
Ukkonen's the man in this field.
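
A rough sketch of that cut-off idea, under the assumption that any distance
above a chosen threshold k can simply be reported as "too far": only a band
of width 2k+1 around the diagonal is filled in, and the search gives up as
soon as every cell in the current row is already over k. This is the simple
version of the trick, not Ukkonen's full algorithm.

def levenshtein_within(a, b, k):
    """Levenshtein distance if it is <= k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                  # lengths alone already rule it out
    INF = k + 1
    prev = list(range(len(b) + 1))   # row 0 of the DP matrix
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        if min(cur) > k:
            return None              # the whole band is over the threshold
        prev = cur
    return prev[-1] if prev[-1] <= k else None

So levenshtein_within("kitten", "sitting", 3) returns 3, while
levenshtein_within("Featherstonehaugh", "Fu", 3) returns None straight
from the length check.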

It gets better. You can forget the matrix when the word lengths are
less than the number of bits in your computer word (even 32 bits is
enough for most words in most languages). The bit-parallel techniques
include Damerau distance. This paper by Heikki Hyyrö is well worth
reading, and refers to a whole lot of previous work, including
Ukkonen's:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2242
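
To make the bit-parallel idea concrete, here is a sketch of Myers' 1999
bit-vector recurrence as presented in Hyyrö's papers, computing the plain
Levenshtein distance (the Damerau extension is left out). The Eq/Pv/Mv
names follow the usual notation, and Python's unbounded integers stand in
for the machine word.

def bitparallel_distance(pattern, text):
    """Levenshtein distance via Myers' bit-parallel recurrence."""
    m = len(pattern)
    if m == 0:
        return len(text)
    # Bitmask of the positions at which each character occurs in pattern.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv = (1 << m) - 1        # positive vertical deltas
    mv = 0                   # negative vertical deltas
    score = m
    top = 1 << (m - 1)
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | ~(xh | pv)
        mh = pv & xh
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = (ph << 1) | 1
        mh = mh << 1
        pv = mh | ~(xv | ph)
        mv = ph & xv
    return score

For example bitparallel_distance("kitten", "sitting") returns 3; each text
character costs a constant number of word-sized bit operations instead of a
whole row of the matrix.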

And you can forget about all the fancy techniques when the word
lengths are so different that they can't be the same (e.g. Fu and
Featherstonehaugh) -- or the shorter should perhaps be checked to see
if it's a plausible abbreviation or contraction of the longer.
It's not a mathematically correct metric, but it is a useful heuristic.

Useful for what? IMHO: If it doesn't satisfy the symmetry condition
and the triangle rule, it's just plain dodgy. J-W is dodgy and slow.
Might as well use soundex. End of story.
Hmm.. also a good point. I will make it count all the matches, not
just the matches in s1.

The whole Jaro definition is woolly ... what is a match when you have
more than one occurrence of the same letter? Have a look at what the
matches might be between MISSISSIPPI and a variant with an extra I in
the middle, e.g. MISSISISIPPI.

Cheers,
John
 

bearophileHUGS

John Machin:
This paper by Heikki Hyyrö is well worth
reading, and refers to a whole lot of previous work, including
Ukkonen's:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.2242

This is the site of the author:
http://www.cs.uta.fi/~helmu/pubs/pubs.html
There you can find updates too:
http://www.cs.uta.fi/~helmu/pubs/cpm02.pdf

But such researchers really ought to offer C/Pascal/Java code along with
their papers. Implementing things from scratch every time is a waste of
time.

Using SSE instructions like ANDPS, ANDNPS, ORPS and XORPS would probably
allow the bitwise operations to work on 128 bits, making it possible to
handle longer strings in the same time.
And the GPUs of graphics cards offer other possibilities:
http://www.cbcb.umd.edu/software/cmatch/Cmatch.pdf

Bye,
bearophile
 
