locating strings approximately

Discussion in 'Python' started by BBands, Jun 29, 2006.

  1. BBands

    BBands Guest

    I'd like to see if a string exists, even approximately, in another. For
    example if "black" exists in "blakbird" or if "beatles" exists in
    "beatlemania". The application is to look though a long list of songs
    and return any approximate matches along with a confidence factor. I
    have looked at edit distance, but that isn't a good choice for finding
    a short string in a longer one. I have also explored
    difflib.SequenceMatcher and .get_close_matches, but what I'd really
    like is something like:

    a = FindApprox("beatles", "beatlemania")
    print a
    0.857

    Any ideas?

    jab
     
    BBands, Jun 29, 2006
    #1
    1. Advertising

  2. BBands wrote:
    > I'd like to see if a string exists, even approximately, in another. For
    > example if "black" exists in "blakbird" or if "beatles" exists in
    > "beatlemania". The application is to look though a long list of songs
    > and return any approximate matches along with a confidence factor. I
    > have looked at edit distance, but that isn't a good choice for finding
    > a short string in a longer one. I have also explored
    > difflib.SequenceMatcher and .get_close_matches, but what I'd really
    > like is something like:
    >
    > a = FindApprox("beatles", "beatlemania")
    > print a
    > 0.857
    >
    > Any ideas?
    >
    > jab
    >


    I collected a few pointers in this article:

    http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53

    It contains one or two additional methods besides the ones you mentioned.
    Sorry, it's in Dutch, but you can find the links without much trouble i guess.


    --Irmen
     
    Irmen de Jong, Jun 29, 2006
    #2
    1. Advertising

  3. BBands

    John Machin Guest

    On 29/06/2006 9:28 AM, BBands wrote:
    > I'd like to see if a string exists, even approximately, in another. For
    > example if "black" exists in "blakbird" or if "beatles" exists in
    > "beatlemania". The application is to look though a long list of songs
    > and return any approximate matches along with a confidence factor. I
    > have looked at edit distance, but that isn't a good choice for finding
    > a short string in a longer one.


    There is a trivial difference between the traditional
    distance-matrix-based Levenshtein algorithm for edit distance and the
    corresponding one for approximate string searching. Ditto between
    finite-state-machine approaches. Ditto between modern bit-bashing
    approaches.

    > I have also explored
    > difflib.SequenceMatcher and .get_close_matches, but what I'd really
    > like is something like:
    >
    > a = FindApprox("beatles", "beatlemania")
    > print a
    > 0.857
    >
    > Any ideas?


    You got no ideas from googling "approximate string search python"???
     
    John Machin, Jun 29, 2006
    #3
  4. BBands

    John Machin Guest

    On 29/06/2006 10:07 AM, BBands wrote:
    > On 6/28/06, John Machin <> wrote:
    >> On 29/06/2006 9:28 AM, BBands wrote:
    >> > I'd like to see if a string exists, even approximately, in another. For
    >> > example if "black" exists in "blakbird" or if "beatles" exists in
    >> > "beatlemania". The application is to look though a long list of songs
    >> > and return any approximate matches along with a confidence factor. I
    >> > have looked at edit distance, but that isn't a good choice for finding
    >> > a short string in a longer one.

    >>
    >> There is a trivial difference between the traditional
    >> distance-matrix-based Levenshtein algorithm for edit distance and the
    >> corresponding one for approximate string searching. Ditto between
    >> finite-state-machine approaches. Ditto between modern bit-bashing
    >> approaches.
    >>
    >> > I have also explored
    >> > difflib.SequenceMatcher and .get_close_matches, but what I'd really
    >> > like is something like:
    >> >
    >> > a = FindApprox("beatles", "beatlemania")
    >> > print a
    >> > 0.857
    >> >
    >> > Any ideas?

    >>
    >> You got no ideas from googling "approximate string search python"???

    >
    > Yes, many including agrepy and soundex in addition to those I
    > mentioned already, but none seem really handy at approximately looking
    > up smaller strings in larger ones. I also note that this has been the
    > topic of prior discussion without resolutiuon.
    >
    > jab


    It helps if you tell all that you've done. Otherwise people will tell
    you to do what you've done already :)

    It helps if you reply on-list so that others can see. You may get better
    answers sooner. I have to vanish now. Will check back tonight.

    Cheers,
    John

    agrepy = approximate-grep-python -- why doesn't that suit you?
     
    John Machin, Jun 29, 2006
    #4
  5. BBands

    Jim Segrave Guest

    In article <44a315ad$0$31638$4all.nl>,
    Irmen de Jong <> wrote:
    >BBands wrote:
    >> I'd like to see if a string exists, even approximately, in another. For
    >> example if "black" exists in "blakbird" or if "beatles" exists in
    >> "beatlemania". The application is to look though a long list of songs
    >> and return any approximate matches along with a confidence factor. I
    >> have looked at edit distance, but that isn't a good choice for finding
    >> a short string in a longer one. I have also explored
    >> difflib.SequenceMatcher and .get_close_matches, but what I'd really
    >> like is something like:
    >>
    >> a = FindApprox("beatles", "beatlemania")
    >> print a
    >> 0.857
    >>
    >> Any ideas?
    >>
    >> jab
    >>

    >
    >I collected a few pointers in this article:
    >
    >http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53
    >
    >It contains one or two additional methods besides the ones you mentioned.
    >Sorry, it's in Dutch, but you can find the links without much trouble i guess.


    For those who don't read Dutch, here's a translation of that page:

    ========================

    Python </frog/user/irmen/category/2>

    By fuzzy string comparision, I mean comparing two strings and getting
    a result that indicates how similar the strings are to each
    other. That is, it's not an absoulte string compaision (that only
    tells if the two strings are absolutely the same or not), but a more
    quantative string comparision.

    It can be used to, for example, to remove typos from commands, a
    specific form of spelling checker or for other natural language
    processing.

    Python has a few modules which offer a fuzzy string comparision.

    —

    QWERTY - or protein distances

    Recently, BoB Hooft posted an interesting module in comp.lang.python
    which uses an algorithm that is normally used for comparing proteins,
    but now works for ASCII strings. The result is a float which is
    negative if the strings are very different and increases up to
    len(string) + 1 if the strings are identical, the score can be viewd
    as 'the number of correspondingn letters'. The algorithm uses the
    layout of the QWERTY keyboard to specifiy the distance between the
    stings and therefore is restricted to ASCII strings (and only makes
    sense if one uses a QWERTY keyboard), but therefor is very well suited
    to compare an imput command with all the possible commands and thus to
    identify typos.

    Why not use a normalised score between 0.0 and 1.0?
    Rob: It doesn't matter if you only want the 'best score' from a number
    of strings. Negative results appear if the strings are not similar and
    also don't even have the same length.

    The module can be found here: : comparestrings.py
    <http://starship.python.net/crew/hooft/comparestrings.py>.

    *Python standard library*

    In de standard lib hebben we difflib met de functie get_close_matches
    <http://docs.python.org/lib/module-difflib.html#l2h-922>. Groot voordeel
    dat deze dus standaard bij Python zit.

    The standard library has difflib with the function get_close_matches
    <http://docs.python.org/lib/module-difflib.html#l2h-922>. The big
    advantage is that this is standard with Python.

    *Febrl (biometrisch)*

    Febrl <https://sourceforge.net/projects/febrl/> has a number of string
    comparision functions, including the 'edit distance' or 'Levenshtein
    distance' function. For the latter, there is an implementation on
    Useless Python
    <http://www.uselesspython.com/download.php?script_id=108>.

    *Apse*

    An extension module (written in C): Apse
    <http://www.personal.psu.edu/staff/i/u/iua1/python/apse/>.

    Requires python, SWIG, and a C compiler

    *Java*

    SecondString <http://sourceforge.net/projects/secondstring/> is an
    open source package with Java implementations of a large number of
    string comparision algorithms. They should not be too difficult to
    port to Python.



    --
    Jim Segrave ()
     
    Jim Segrave, Jun 29, 2006
    #5
  6. BBands

    John Machin Guest

    On 29/06/2006 10:52 AM, John Machin wrote:
    > On 29/06/2006 10:07 AM, BBands wrote:
    >> On 6/28/06, John Machin <> wrote:
    >>> On 29/06/2006 9:28 AM, BBands wrote:
    >>> > I'd like to see if a string exists, even approximately, in another.
    >>> For
    >>> > example if "black" exists in "blakbird" or if "beatles" exists in
    >>> > "beatlemania". The application is to look though a long list of songs
    >>> > and return any approximate matches along with a confidence factor. I
    >>> > have looked at edit distance, but that isn't a good choice for finding
    >>> > a short string in a longer one.
    >>>
    >>> There is a trivial difference between the traditional
    >>> distance-matrix-based Levenshtein algorithm for edit distance and the
    >>> corresponding one for approximate string searching. Ditto between
    >>> finite-state-machine approaches. Ditto between modern bit-bashing
    >>> approaches.


    The trivial difference is that in the searching case one edge of the
    dynamic programming matrix is initialised (in theory) to [0] *
    (len(text)+1) whereas in the distance case it is set to range(len(text)+1).

    >>>
    >>> > I have also explored
    >>> > difflib.SequenceMatcher and .get_close_matches, but what I'd really
    >>> > like is something like:
    >>> >
    >>> > a = FindApprox("beatles", "beatlemania")
    >>> > print a
    >>> > 0.857
    >>> >
    >>> > Any ideas?
    >>>
    >>> You got no ideas from googling "approximate string search python"???

    >>
    >> Yes, many including agrepy and soundex in addition to those I
    >> mentioned already, but none seem really handy at approximately looking
    >> up smaller strings in larger ones. I also note that this has been the
    >> topic of prior discussion without resolutiuon.
    >>
    >> jab

    >
    > It helps if you tell all that you've done. Otherwise people will tell
    > you to do what you've done already :)
    >
    > It helps if you reply on-list so that others can see. You may get better
    > answers sooner. I have to vanish now. Will check back tonight.
    >


    Here's a possibly better answer:

    def approx_search(pattern, text, max_dist):
    """
    Return a generator which yields tuples (endpos, dist)
    for each possible endpos such that
    (1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
    (2) 0 <= dist <= max_dist
    for some (undetermined) startpos.
    Not restricted to strings:
    (a) pattern must support pattern and len(pattern).
    (b) text needs only to support enumerate(text).
    (c) contents of pattern and text need only support the != operator.
    Example:
    list(approx_search('Beatles', 'beetles Beat leverage Betelguese',
    3)) ->
    [(6, 3), (7, 2), (8, 3),
    (12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
    (26, 3), (27, 3)]
    """
    plen = len(pattern)
    prange = range(plen)
    dprev = range(plen+1)
    dcurr = [0] * (plen+1)
    dist = plen
    for tx, tc in enumerate(text):
    # dcurr[0] = 0 # not needed, never changes
    for px in prange:
    dist = dprev[px] + (tc != pattern[px])
    temp = dcurr[px] + 1
    if temp < dist: dist = temp
    temp = dprev[px+1] + 1
    if temp < dist: dist = temp
    dcurr[px+1] = dist
    if dist <= max_dist:
    yield tx+1, dist
    dprev, dcurr = dcurr, dprev

    If you need/want to poke around discovering "startpos", then you need
    something like the following, which is similar to the function by Magnus
    Lie Hetland which you'll find on the web, but appears to be about twice
    as fast without destroying legibility completely :)

    def Levenshtein_SJM(aseq, bseq):
    """
    Calculate Levenshtein distance between aseq and bseq.
    Space O(min(m,n))
    Time O(mn)
    """
    alen = len(aseq)
    blen = len(bseq)
    if alen > blen:
    aseq, bseq = bseq, aseq
    alen, blen = blen, alen
    if not alen: return blen
    arange = range(alen)
    dprev = range(alen+1)
    dcurr = [0] * (alen+1)
    for bx in xrange(blen):
    bc = bseq[bx]
    dcurr[0] = bx + 1
    for ax in arange:
    dist = dprev[ax] + (bc != aseq[ax])
    temp = dcurr[ax] + 1
    if temp < dist: dist = temp
    temp = dprev[ax+1] + 1
    if temp < dist: dist = temp
    dcurr[ax+1] = dist
    dprev, dcurr = dcurr, dprev
    return dist

    Note that both of the above functions use the ancient original
    algorithm, which takes O(mn) time. For heavy lifting a modern algorithm
    and use of Pyrex or C are indicated.

    HTH,
    John
     
    John Machin, Jun 30, 2006
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Hilton

    Locating assemblies in ASP.NET

    Hilton, Jul 22, 2003, in forum: ASP .Net
    Replies:
    0
    Views:
    384
    Hilton
    Jul 22, 2003
  2. Anne
    Replies:
    0
    Views:
    366
  3. =?ISO-8859-1?Q?Ney_Andr=E9_de_Mello_Zunino?=

    Strategy for locating strings based on initial characters

    =?ISO-8859-1?Q?Ney_Andr=E9_de_Mello_Zunino?=, Apr 28, 2005, in forum: C++
    Replies:
    4
    Views:
    310
    Chris \( Val \)
    Apr 28, 2005
  4. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    787
    Malcolm
    Jun 24, 2006
  5. Dilip
    Replies:
    8
    Views:
    330
    Jerry Coffin
    Sep 18, 2006
Loading...

Share This Page