J

#### javuchi

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

You are using an out of date browser. It may not display this or other websites correctly.

You should upgrade or use an alternative browser.

You should upgrade or use an alternative browser.

J

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

Ad

E

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:

http://effbot.org/librarybook/soundex.htm

S

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:

http://effbot.org/librarybook/soundex.htm

Soundex is *one* particular algorithm for approximate

string matching. It is optimised for matching

Anglo-American names (like Smith/Smythe), and is

considered to be quite old and obsolete for all but the

most trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will

match both cat and cet, but it won't match cat and mat.

A more sophisticated approximate string matching

algorithm will use the Levenshtein distance. You can

find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108

Given a function levenshtein(s1, s2) that returns the

distance between two strings, you could use it for

approximate matching like this:

def approx_matching(strlist, target, dist=1):

"""Matches approximately strings in strlist to

a target string.

Returns a list of strings, where each string

matched is no further than an edit distance of

dist from the target.

"""

found = []

for s in strlist:

if levenshtein(s, target) <= dist:

found.append(s)

return s

T

javuchi said:

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

There is an algorithm called Soundex that replaces each word by a

4-character string, such that all words that are pronounced similarly

encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.

D

javuchi said:

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

agrep (aproximate grep) allows for a certain amount of errors and there

exist Python bindings (http://www.bio.cam.ac.uk/~mw263/pyagrep.html)

Or google for "agrep python".

Daniel

F

Tim said:There is an algorithm called Soundex that replaces each word by a

4-character string, such that all words that are pronounced similarly

encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.

Python used to ship with a soundex module, but it was removed

in 1.6, for various reasons. here's a replacement:

http://orca.mojam.com/~skip/python/soundex.py

</F>

Ad

T

javuchi said:

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

I kind of like the one at

http://www.personal.psu.edu/staff/i/u/iua1/python/apse/

which you might use something like

['motorcycle', 'motorcicle']>>> import Apse

>>> ap = Apse.Approx('motorcycle', edit=1)

>>> ap.match(['motorcycle', 'motorcicle', 'motorscooter'])

That page mentions several alternatives, as well.

I hope this helps,

Tim

D

Steven said:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:

http://effbot.org/librarybook/soundex.htm

Soundex is *one* particular algorithm for approximate string matching.

It is optimised for matching Anglo-American names (like Smith/Smythe),

and is considered to be quite old and obsolete for all but the most

trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will match both cat and

cet, but it won't match cat and mat.

A more sophisticated approximate string matching algorithm will use the

Levenshtein distance. You can find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108

Given a function levenshtein(s1, s2) that returns the distance between

two strings, you could use it for approximate matching like this:

def approx_matching(strlist, target, dist=1):

"""Matches approximately strings in strlist to

a target string.

Returns a list of strings, where each string

matched is no further than an edit distance of

dist from the target.

"""

found = []

for s in strlist:

if levenshtein(s, target) <= dist:

found.append(s)

return s

I compute a relative metric based on th every same implementation like this:

def relative(a, b):

"""

Computes a relative distance between two strings. Its in the range

(0-1] where 1 means total equality.

@type a: string

@param a: arg one

@type b: string

@param b: arg two

@rtype: float

@return: the distance

"""

d = levensthein(a,b)

longer = float(max((len(a), len(b))))

shorter = float(min((len(a), len(b))))

r = ((longer - d) / longer) * (shorter / longer)

return r

The idea is that otherwise e.g. "cat" and "hippopothamus" have a

l-distance of only 3, which one would consider good at the first look.

Regards,

Diez

S

The idea is that otherwise e.g. "cat" and "hippopothamus" have a

l-distance of only 3, which one would consider good at the first look.

???

I make it that the L-distance between cat and hippopothamus is twelve, not

three. With len(cat)=3 and len(hippopothamus) = 13, you need ten

insertions just to get the lengths equal, so the L-distance must be at

least ten.

Ad

I

javuchi said:

for example, searching in a dictionary the word "motorcycle", but

returns similar strings like "motorcicle".

Is there such a library?

Perhaps the get_close_matches function that is presentt in the standard

library (in the difflib module) could be useful ?"

--Irmen

**Want to reply to this thread or ask your own question?**

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.