Aproximative string matching

javuchi · Nov 21, 2005

I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".

Is there such a library?

elbertlev · Nov 21, 2005

This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm

Steven D'Aprano · Nov 21, 2005

This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm

Soundex is *one* particular algorithm for approximate
string matching. It is optimised for matching
Anglo-American names (like Smith/Smythe), and is
considered to be quite old and obsolete for all but the
most trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will
match both cat and cet, but it won't match cat and mat.

A more sophisticated approximate string matching
algorithm will use the Levenshtein distance. You can
find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108

Given a function levenshtein(s1, s2) that returns the
distance between two strings, you could use it for
approximate matching like this:

def approx_matching(strlist, target, dist=1):
"""Matches approximately strings in strlist to
a target string.

Returns a list of strings, where each string
matched is no further than an edit distance of
dist from the target.
"""
found = []
for s in strlist:
if levenshtein(s, target) <= dist:
found.append(s)
return s

Tim Roberts · Nov 21, 2005

javuchi said:
I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".

Is there such a library?

There is an algorithm called Soundex that replaces each word by a
4-character string, such that all words that are pronounced similarly
encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.

Daniel Dittmar · Nov 21, 2005

javuchi said:
I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".

Is there such a library?

agrep (aproximate grep) allows for a certain amount of errors and there
exist Python bindings (http://www.bio.cam.ac.uk/~mw263/pyagrep.html)

Or google for "agrep python".

Daniel

Fredrik Lundh · Nov 21, 2005

Tim said:
There is an algorithm called Soundex that replaces each word by a
4-character string, such that all words that are pronounced similarly
encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.

Python used to ship with a soundex module, but it was removed
in 1.6, for various reasons. here's a replacement:

http://orca.mojam.com/~skip/python/soundex.py

</F>

Tim Heaney · Nov 21, 2005

javuchi said:
I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".

Is there such a library?

I kind of like the one at

http://www.personal.psu.edu/staff/i/u/iua1/python/apse/

which you might use something like

>>> import Apse
>>> ap = Apse.Approx('motorcycle', edit=1)
>>> ap.match(['motorcycle', 'motorcicle', 'motorscooter'])

Click to expand...

Click to expand...

['motorcycle', 'motorcicle']

That page mentions several alternatives, as well.

I hope this helps,

Tim

Diez B. Roggisch · Nov 21, 2005

Steven said:
This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm

Click to expand...

Soundex is *one* particular algorithm for approximate string matching.
It is optimised for matching Anglo-American names (like Smith/Smythe),
and is considered to be quite old and obsolete for all but the most
trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will match both cat and
cet, but it won't match cat and mat.

A more sophisticated approximate string matching algorithm will use the
Levenshtein distance. You can find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108

Given a function levenshtein(s1, s2) that returns the distance between
two strings, you could use it for approximate matching like this:

def approx_matching(strlist, target, dist=1):
"""Matches approximately strings in strlist to
a target string.

Returns a list of strings, where each string
matched is no further than an edit distance of
dist from the target.
"""
found = []
for s in strlist:
if levenshtein(s, target) <= dist:
found.append(s)
return s

I compute a relative metric based on th every same implementation like this:

def relative(a, b):
"""
Computes a relative distance between two strings. Its in the range
(0-1] where 1 means total equality.
@type a: string
@param a: arg one
@type b: string
@param b: arg two
@rtype: float
@return: the distance
"""
d = levensthein(a,b)
longer = float(max((len(a), len(b))))
shorter = float(min((len(a), len(b))))
r = ((longer - d) / longer) * (shorter / longer)
return r

The idea is that otherwise e.g. "cat" and "hippopothamus" have a
l-distance of only 3, which one would consider good at the first look.

Regards,

Diez

Steven D'Aprano · Nov 21, 2005

The idea is that otherwise e.g. "cat" and "hippopothamus" have a
l-distance of only 3, which one would consider good at the first look.

???

I make it that the L-distance between cat and hippopothamus is twelve, not
three. With len(cat)=3 and len(hippopothamus) = 13, you need ten
insertions just to get the lengths equal, so the L-distance must be at
least ten.

Irmen de Jong · Nov 21, 2005

javuchi said:
I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word "motorcycle", but
returns similar strings like "motorcicle".

Is there such a library?

Perhaps the get_close_matches function that is presentt in the standard
library (in the difflib module) could be useful ?"

--Irmen

Removing matching items from a list?	16	Aug 3, 2013
How to open a window not running on chrome?	3	Nov 15, 2022
Tasks	1	Nov 29, 2022
Finding all regex matches by index?	1	May 30, 2012
I have to finish this code for my assignment but I cant figure out how to solve it	1	Jun 27, 2023
How to test the high-score in my Python-game	2	Jan 5, 2023
Mutually referencing imports -- impossible?	5	Jul 13, 2008
I have issues installing pycrypto (and thus fabric) with pip	3	Jan 28, 2013

Aproximative string matching

javuchi

elbertlev

Steven D'Aprano

Tim Roberts

Daniel Dittmar

Fredrik Lundh

Tim Heaney

Diez B. Roggisch

Steven D'Aprano

Irmen de Jong

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads