String similarity

Discussion in 'Python' started by Luca Montecchiani, Oct 10, 2003.

  1. Introduction
    ------------
    The need to find files that "resembled" in the name has pushed me to write
    an utility that unlike the other it was not based on the content of the files
    but on its name. Initially I start adding this functionality to
    one "C" program for Unix called "fdupes" witch give me good performance
    and good precision.
    The algorithm that I have chosen for the comparison between string was
    "Levenshtein Distance".
    For reasons that I leave you to imagine I have rewrite that utility using
    Python reducing the code from 1394 lines of "C" to 152 lines of Python
    enclosed comments :)
    Initially I have rearranged part of "C" code in one module "simil"
    to which I have added an helper and others two algorithms that in reality
    are two versions of the same algorithm "Ratcliff-Obershelp".
    The small script works great scanning my download directory or my audio files
    collection finding duplicates that I'll never be able to find manually.

    Considerations
    --------------
    During the test I have been able to verify that the performances of the
    Python module "difflib" are light years far from the algorithms of my
    module "simil", initially I thought that the bottle neck was the
    fact of having to compare all the possible combinations (without repetitions)
    of files. This is bad IMHO.
    This implementation is still in an embryonic phase however I have decided
    of posting in this group for being able to discuss it and to carry ahead,
    remaining in my HD would not have some future ;)
    I'm sorry for the quality of Python code, I'm a beginner on that language..

    Installation
    ------------
    Once decompressed the file: http://spazioinwind.libero.it/montecchiani/pyfdupes-0.7.tgz
    run the Build.sh command in order to compile and to install the "simil" module.
    To the test the modules you need to modify the Test.sh shell before the execution.
    ( the "int" mode is intentionally commented because on slower machine you could
    never see the end, 515 files will generate 132355 iterations! )
    Comments, credits, TODO and FIXME can be found inside the source code.

    Ideas and optimization
    ----------------------
    In the comparison between the content of files where the key is the size, the
    iterations can be limited using a binary tree structure.
    Comparing string this is not possible (IMHO) and the only way that I have found
    was a "preliminary Check" that avoids to make comparison between string of a great
    different length.

    Critics, ideas, comments, fix and help are welcomes
    luca

    PS: sorry for my english
    Luca Montecchiani, Oct 10, 2003
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Fabian Leitritz

    Document-Document similarity

    Fabian Leitritz, Jan 14, 2005, in forum: Java
    Replies:
    0
    Views:
    402
    Fabian Leitritz
    Jan 14, 2005
  2. =?iso-8859-1?B?bW9vcJk=?=

    What are the similarity and difference b/w EBJ and COM+?

    =?iso-8859-1?B?bW9vcJk=?=, May 30, 2006, in forum: Java
    Replies:
    1
    Views:
    409
    dimitar
    May 30, 2006
  3. Tim Churches

    Re: String similarity

    Tim Churches, Oct 10, 2003, in forum: Python
    Replies:
    1
    Views:
    376
    Luca Montecchiani
    Oct 10, 2003
  4. Achim Domma

    string similarity in python

    Achim Domma, Nov 24, 2003, in forum: Python
    Replies:
    5
    Views:
    813
    Luca Montecchiani
    Nov 24, 2003
  5. Chris Chris
    Replies:
    3
    Views:
    107
    Dave Bass
    Jul 9, 2008
Loading...

Share This Page