locating strings approximately

BBands

I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one. I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

jab
 
Irmen de Jong

BBands said:
I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one. I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

jab

I collected a few pointers in this article:

http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53

It contains one or two additional methods besides the ones you mentioned.
Sorry, it's in Dutch, but you can find the links without much trouble I guess.


--Irmen
 
John Machin

I'd like to see if a string exists, even approximately, in another. For
example if "black" exists in "blakbird" or if "beatles" exists in
"beatlemania". The application is to look through a long list of songs
and return any approximate matches along with a confidence factor. I
have looked at edit distance, but that isn't a good choice for finding
a short string in a longer one.

There is a trivial difference between the traditional
distance-matrix-based Levenshtein algorithm for edit distance and the
corresponding one for approximate string searching. Ditto between
finite-state-machine approaches. Ditto between modern bit-bashing
approaches.
I have also explored
difflib.SequenceMatcher and .get_close_matches, but what I'd really
like is something like:

a = FindApprox("beatles", "beatlemania")
print a
0.857

Any ideas?

You got no ideas from googling "approximate string search python"???
 
John Machin

Yes, many including agrepy and soundex in addition to those I
mentioned already, but none seem really handy at approximately looking
up smaller strings in larger ones. I also note that this has been the
topic of prior discussion without resolution.

jab

It helps if you tell all that you've done. Otherwise people will tell
you to do what you've done already :)

It helps if you reply on-list so that others can see. You may get better
answers sooner. I have to vanish now. Will check back tonight.

Cheers,
John

agrepy = approximate-grep-python -- why doesn't that suit you?
 
Jim Segrave

I collected a few pointers in this article:

http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53

It contains one or two additional methods besides the ones you mentioned.
Sorry, it's in Dutch, but you can find the links without much trouble I guess.

For those who don't read Dutch, here's a translation of that page:

========================

Python </frog/user/irmen/category/2>

By fuzzy string comparison, I mean comparing two strings and getting
a result that indicates how similar the strings are to each other.
That is, it's not an absolute string comparison (which only tells you
whether the two strings are exactly the same or not), but a more
quantitative string comparison.

It can be used, for example, to remove typos from commands, as a
specific form of spelling checker, or for other natural language
processing.

Python has a few modules which offer fuzzy string comparison.

—

*QWERTY - or protein distances*

Recently, Rob Hooft posted an interesting module in comp.lang.python
which uses an algorithm that is normally used for comparing proteins,
but now works for ASCII strings. The result is a float which is
negative if the strings are very different and increases up to
len(string) + 1 if the strings are identical; the score can be viewed
as 'the number of corresponding letters'. The algorithm uses the
layout of the QWERTY keyboard to specify the distance between the
strings and is therefore restricted to ASCII strings (and only makes
sense if one uses a QWERTY keyboard), but for that reason is very
well suited to comparing an input command with all the possible
commands and thus identifying typos.

Why not use a normalised score between 0.0 and 1.0?
Rob: It doesn't matter if you only want the 'best score' from a number
of strings. Negative results appear if the strings are not similar and
also don't even have the same length.

The module can be found here: comparestrings.py
<http://starship.python.net/crew/hooft/comparestrings.py>.

*Python standard library*

The standard library has difflib with the function get_close_matches
<http://docs.python.org/lib/module-difflib.html#l2h-922>. The big
advantage is that this is standard with Python.
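
For reference, here is a minimal example of the standard-library call (the song list below is invented purely for illustration):

import difflib

songs = ["beatlemania", "blackbird", "yellow submarine"]

# get_close_matches scores whole-string similarity, so a short pattern
# inside a longer title tends to score only modestly; that is the
# limitation the original poster ran into.
print difflib.get_close_matches("beatles", songs, n=3, cutoff=0.5)
# -> ['beatlemania']

# The underlying whole-string ratio for one pair (roughly 0.667):
print difflib.SequenceMatcher(None, "beatles", "beatlemania").ratio()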

*Febrl (biometric)*

Febrl <https://sourceforge.net/projects/febrl/> has a number of string
comparison functions, including the 'edit distance' or 'Levenshtein
distance' function. For the latter, there is an implementation on
Useless Python
<http://www.uselesspython.com/download.php?script_id=108>.

*Apse*

An extension module (written in C): Apse
<http://www.personal.psu.edu/staff/i/u/iua1/python/apse/>.

Requires python, SWIG, and a C compiler

*Java*

SecondString <http://sourceforge.net/projects/secondstring/> is an
open source package with Java implementations of a large number of
string comparison algorithms. They should not be too difficult to
port to Python.
 
John Machin

The trivial difference is that in the searching case one edge of the
dynamic programming matrix is initialised (in theory) to [0] *
(len(text)+1) whereas in the distance case it is set to range(len(text)+1).
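
To make that concrete, here is a minimal full-matrix sketch of my own (the function dp_matrix is invented for illustration; the functions below use a rolling pair of arrays instead of the full matrix):

def dp_matrix(pattern, text, searching):
    # Build the full (len(pattern)+1) x (len(text)+1) matrix.
    # searching=False: classic Levenshtein distance
    #                  (text edge initialised to range(len(text)+1))
    # searching=True:  approximate search
    #                  (text edge initialised to [0] * (len(text)+1),
    #                   i.e. a match may start anywhere in text for free)
    m, n = len(pattern), len(text)
    if searching:
        first_row = [0] * (n + 1)
    else:
        first_row = range(n + 1)
    matrix = [first_row]
    for i in xrange(1, m + 1):
        prev = matrix[-1]
        curr = [i] + [0] * n    # pattern edge is always 0..len(pattern)
        for j in xrange(1, n + 1):
            curr[j] = min(prev[j] + 1,                      # deletion
                          curr[j - 1] + 1,                  # insertion
                          prev[j - 1] + (pattern[i - 1] != text[j - 1]))
        matrix.append(curr)
    return matrix

# Distance: the bottom-right cell.  Searching: the minimum of the bottom
# row is the best distance of pattern against any substring of text.
print dp_matrix("beatles", "beatlemania", searching=False)[-1][-1]  # -> 5
print min(dp_matrix("beatles", "beatlemania", searching=True)[-1])  # -> 1
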
It helps if you tell all that you've done. Otherwise people will tell
you to do what you've done already :)

It helps if you reply on-list so that others can see. You may get better
answers sooner. I have to vanish now. Will check back tonight.

Here's a possibly better answer:

def approx_search(pattern, text, max_dist):
    """
    Return a generator which yields tuples (endpos, dist)
    for each possible endpos such that
    (1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
    (2) 0 <= dist <= max_dist
    for some (undetermined) startpos.
    Not restricted to strings:
    (a) pattern must support indexing (pattern[px]) and len(pattern).
    (b) text needs only to support enumerate(text).
    (c) contents of pattern and text need only support the != operator.
    Example:
    list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 3)) ->
    [(6, 3), (7, 2), (8, 3),
     (12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
     (26, 3), (27, 3)]
    """
    plen = len(pattern)
    prange = range(plen)
    dprev = range(plen+1)
    dcurr = [0] * (plen+1)
    dist = plen
    for tx, tc in enumerate(text):
        # dcurr[0] = 0 # not needed, never changes
        for px in prange:
            dist = dprev[px] + (tc != pattern[px])
            temp = dcurr[px] + 1
            if temp < dist: dist = temp
            temp = dprev[px+1] + 1
            if temp < dist: dist = temp
            dcurr[px+1] = dist
        if dist <= max_dist:
            yield tx+1, dist
        dprev, dcurr = dcurr, dprev
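
To turn that into the single confidence number jab asked for, here is a small wrapper of my own (the name find_approx and the scoring 1 - best_distance/len(pattern) are assumptions on my part, not something John proposed):

def find_approx(pattern, text):
    # Confidence in [0.0, 1.0]: 1.0 means pattern occurs exactly
    # somewhere in text; lower values mean the best approximate
    # occurrence is further away.
    plen = len(pattern)
    if not plen:
        return 1.0
    if not text:
        return 0.0
    best = min(dist for _end, dist in approx_search(pattern, text, plen))
    return 1.0 - float(best) / plen

print find_approx("beatles", "beatlemania")   # -> about 0.857, as in the question
print find_approx("black", "blakbird")        # -> 0.8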

If you need/want to poke around discovering "startpos", then you need
something like the following, which is similar to the function by Magnus
Lie Hetland which you'll find on the web, but appears to be about twice
as fast without destroying legibility completely :)

def Levenshtein_SJM(aseq, bseq):
    """
    Calculate Levenshtein distance between aseq and bseq.
    Space O(min(m,n))
    Time O(mn)
    """
    alen = len(aseq)
    blen = len(bseq)
    if alen > blen:
        aseq, bseq = bseq, aseq
        alen, blen = blen, alen
    if not alen: return blen
    arange = range(alen)
    dprev = range(alen+1)
    dcurr = [0] * (alen+1)
    for bx in xrange(blen):
        bc = bseq[bx]
        dcurr[0] = bx + 1
        for ax in arange:
            dist = dprev[ax] + (bc != aseq[ax])
            temp = dcurr[ax] + 1
            if temp < dist: dist = temp
            temp = dprev[ax+1] + 1
            if temp < dist: dist = temp
            dcurr[ax+1] = dist
        dprev, dcurr = dcurr, dprev
    return dist
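
As a quick sanity check of my own (not part of the post), the whole-string distance shows why jab found plain edit distance a poor fit for locating a short string inside a longer one:

print Levenshtein_SJM("beatles", "beatlemania")
# -> 5: one substitution (s -> m) plus four insertions, even though
#       "beatle" is an excellent partial match; the substring-oriented
#       approx_search above reports a best distance of only 1.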

Note that both of the above functions use the ancient original
algorithm, which takes O(mn) time. For heavy lifting a modern algorithm
and use of Pyrex or C are indicated.

HTH,
John
 
