locating strings approximately

BBands

I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one. I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

jab
 
Irmen de Jong

BBands said:
I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one. I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

jab

I collected a few pointers in this article:

http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53

It contains one or two additional methods besides the ones you mentioned.
Sorry, it's in Dutch, but you can find the links without much trouble I guess.


--Irmen
 
John Machin

I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one.

There is a trivial difference between the traditional
distance-matrix-based Levenshtein algorithm for edit distance and the
corresponding one for approximate string searching. Ditto between
finite-state-machine approaches. Ditto between modern bit-bashing
approaches.
I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

You got no ideas from googling "approximate string search python"???
 
John Machin

Yes, many including agrepy and soundex in addition to those I
mentioned already, but none seem really handy at approximately looking
up smaller strings in larger ones. I also note that this has been the
topic of prior discussion without resolution.

jab

It helps if you tell all that you've done. Otherwise people will tell
you to do what you've done already :)

It helps if you reply on-list so that others can see. You may get better
answers sooner. I have to vanish now. Will check back tonight.

Cheers,
John

agrepy = approximate-grep-python -- why doesn't that suit you?
 
Jim Segrave

I collected a few pointers in this article:

http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53

It contains one or two additional methods besides the ones you mentioned.
Sorry, it's in Dutch, but you can find the links without much trouble I guess.

For those who don't read Dutch, here's a translation of that page:

========================

Python </frog/user/irmen/category/2>

By fuzzy string comparison, I mean comparing two strings and getting
a result that indicates how similar the strings are to each other.
That is, it's not an absolute string comparison (which only tells you
whether the two strings are exactly the same or not), but a more
quantitative string comparison.

It can be used, for example, to remove typos from commands, as a
specific form of spelling checker, or for other natural language
processing.

Python has a few modules which offer fuzzy string comparison.

—

*QWERTY - or protein distances*

Recently, Rob Hooft posted an interesting module in comp.lang.python
which uses an algorithm that is normally used for comparing proteins,
but now works for ASCII strings. The result is a float which is
negative if the strings are very different and increases up to
len(string) + 1 if the strings are identical; the score can be viewed
as 'the number of corresponding letters'. The algorithm uses the
layout of the QWERTY keyboard to specify the distance between the
strings and is therefore restricted to ASCII strings (and only makes
sense if one uses a QWERTY keyboard), but for that reason is very
well suited to comparing an input command with all the possible
commands and thus identifying typos.

Why not use a normalised score between 0.0 and 1.0?
Rob: It doesn't matter if you only want the 'best score' from a number
of strings. Negative results appear if the strings are not similar and
also don't even have the same length.

The module can be found here: comparestrings.py
<http://starship.python.net/crew/hooft/comparestrings.py>.

*Python standard library*

The standard library has difflib with the function get_close_matches
<http://docs.python.org/lib/module-difflib.html#l2h-922>. The big
advantage is that this is standard with Python.
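
For reference, here is a minimal example of the standard-library call (the song list below is invented purely for illustration):

import difflib

songs = ["beatlemania", "blackbird", "yellow submarine"]

# get_close_matches scores whole-string similarity, so a short pattern
# inside a longer title tends to score only modestly; that is the
# limitation the original poster ran into.
print difflib.get_close_matches("beatles", songs, n=3, cutoff=0.5)
# -> ['beatlemania']

# The underlying whole-string ratio for one pair (roughly 0.667):
print difflib.SequenceMatcher(None, "beatles", "beatlemania").ratio()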

*Febrl (biometric)*

Febrl <https://sourceforge.net/projects/febrl/> has a number of string
comparison functions, including the 'edit distance' or 'Levenshtein
distance' function. For the latter, there is an implementation on
Useless Python
<http://www.uselesspython.com/download.php?script_id=108>.

*Apse*

An extension module (written in C): Apse
<http://www.personal.psu.edu/staff/i/u/iua1/python/apse/>.

Requires python, SWIG, and a C compiler

*Java*

SecondString <http://sourceforge.net/projects/secondstring/> is an
open source package with Java implementations of a large number of
string comparison algorithms. They should not be too difficult to
port to Python.
 
John Machin

The trivial difference is that in the searching case one edge of the
dynamic programming matrix is initialised (in theory) to [0] *
(len(text)+1) whereas in the distance case it is set to range(len(text)+1).
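
To make that concrete, here is a minimal full-matrix sketch of my own (the function dp_matrix is invented for illustration; the functions below use a rolling pair of arrays instead of the full matrix):

def dp_matrix(pattern, text, searching):
    # Build the full (len(pattern)+1) x (len(text)+1) matrix.
    # searching=False: classic Levenshtein distance
    #                  (text edge initialised to range(len(text)+1))
    # searching=True:  approximate search
    #                  (text edge initialised to [0] * (len(text)+1),
    #                   i.e. a match may start anywhere in text for free)
    m, n = len(pattern), len(text)
    if searching:
        first_row = [0] * (n + 1)
    else:
        first_row = range(n + 1)
    matrix = [first_row]
    for i in xrange(1, m + 1):
        prev = matrix[-1]
        curr = [i] + [0] * n    # pattern edge is always 0..len(pattern)
        for j in xrange(1, n + 1):
            curr[j] = min(prev[j] + 1,                      # deletion
                          curr[j - 1] + 1,                  # insertion
                          prev[j - 1] + (pattern[i - 1] != text[j - 1]))
        matrix.append(curr)
    return matrix

# Distance: the bottom-right cell.  Searching: the minimum of the bottom
# row is the best distance of pattern against any substring of text.
print dp_matrix("beatles", "beatlemania", searching=False)[-1][-1]  # -> 5
print min(dp_matrix("beatles", "beatlemania", searching=True)[-1])  # -> 1
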
It helps if you tell all that you've done. Otherwise people will tell
you to do what you've done already :)

It helps if you reply on-list so that others can see. You may get better
answers sooner. I have to vanish now. Will check back tonight.

Here's a possibly better answer:

def approx_search(pattern, text, max_dist):
    """
    Return a generator which yields tuples (endpos, dist)
    for each possible endpos such that
    (1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
    (2) 0 <= dist <= max_dist
    for some (undetermined) startpos.
    Not restricted to strings:
    (a) pattern must support indexing (pattern[px]) and len(pattern).
    (b) text needs only to support enumerate(text).
    (c) contents of pattern and text need only support the != operator.
    Example:
    list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 3)) ->
    [(6, 3), (7, 2), (8, 3),
     (12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
     (26, 3), (27, 3)]
    """
    plen = len(pattern)
    prange = range(plen)
    dprev = range(plen+1)
    dcurr = [0] * (plen+1)
    dist = plen
    for tx, tc in enumerate(text):
        # dcurr[0] = 0 # not needed, never changes
        for px in prange:
            dist = dprev[px] + (tc != pattern[px])
            temp = dcurr[px] + 1
            if temp < dist: dist = temp
            temp = dprev[px+1] + 1
            if temp < dist: dist = temp
            dcurr[px+1] = dist
        if dist <= max_dist:
            yield tx+1, dist
        dprev, dcurr = dcurr, dprev
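
To turn that into the single confidence number jab asked for, here is a small wrapper of my own (the name find_approx and the scoring 1 - best_distance/len(pattern) are assumptions on my part, not something John proposed):

def find_approx(pattern, text):
    # Confidence in [0.0, 1.0]: 1.0 means pattern occurs exactly
    # somewhere in text; lower values mean the best approximate
    # occurrence is further away.
    plen = len(pattern)
    if not plen:
        return 1.0
    if not text:
        return 0.0
    best = min(dist for _end, dist in approx_search(pattern, text, plen))
    return 1.0 - float(best) / plen

print find_approx("beatles", "beatlemania")   # -> about 0.857, as in the question
print find_approx("black", "blakbird")        # -> 0.8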

If you need/want to poke around discovering "startpos", then you need
something like the following, which is similar to the function by Magnus
Lie Hetland which you'll find on the web, but appears to be about twice
as fast without destroying legibility completely :)

def Levenshtein_SJM(aseq, bseq):
    """
    Calculate Levenshtein distance between aseq and bseq.
    Space O(min(m,n))
    Time O(mn)
    """
    alen = len(aseq)
    blen = len(bseq)
    if alen > blen:
        aseq, bseq = bseq, aseq
        alen, blen = blen, alen
    if not alen: return blen
    arange = range(alen)
    dprev = range(alen+1)
    dcurr = [0] * (alen+1)
    for bx in xrange(blen):
        bc = bseq[bx]
        dcurr[0] = bx + 1
        for ax in arange:
            dist = dprev[ax] + (bc != aseq[ax])
            temp = dcurr[ax] + 1
            if temp < dist: dist = temp
            temp = dprev[ax+1] + 1
            if temp < dist: dist = temp
            dcurr[ax+1] = dist
        dprev, dcurr = dcurr, dprev
    return dist
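
As a quick sanity check of my own (not part of the post), the whole-string distance shows why jab found plain edit distance a poor fit for locating a short string inside a longer one:

print Levenshtein_SJM("beatles", "beatlemania")
# -> 5: one substitution (s -> m) plus four insertions, even though
#       "beatle" is an excellent partial match; the substring-oriented
#       approx_search above reports a best distance of only 1.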

Note that both of the above functions use the ancient original
algorithm, which takes O(mn) time. For heavy lifting a modern algorithm
and use of Pyrex or C are indicated.

HTH,
John
 
