[perl-python] a program to delete duplicate files

Discussion in 'Python' started by Xah Lee, Mar 9, 2005.

  1. Xah Lee

    Xah Lee Guest

    here's a large exercise that uses what we built before.

    suppose you have tens of thousands of files in various directories.
    Some of these files are identical, but you don't know which ones are
    identical with which. Write a program that prints out which files are
    redundant copies.

    Here's the spec.
    --------------------------
    The program is to be used on the command line. Its arguments are one or
    more full paths of directories.

    perl del_dup.pl dir1

    prints the full paths of all files in dir1 that are duplicates
    (including files in sub-directories). More specifically, if file A has
    duplicates, A's full path will be printed on a line, immediately
    followed by the full paths of all other files that are copies of A. These
    duplicates' full paths will be prefixed with the string "rm ". An empty
    line follows each group of duplicates.

    Here's a sample output.

    inPath/a.jpg
    rm inPath/b.jpg
    rm inPath/3/a.jpg
    rm inPath/hh/eu.jpg

    inPath/ou.jpg
    rm inPath/23/a.jpg
    rm inPath/hh33/eu.jpg

    order does not matter. (i.e. which file does not get the "rm " prefix
    does not matter.)

    ------------------------

    perl del_dup.pl dir1 dir2

    will do the same as above, except that duplicates within dir1 or dir2
    themselves are not considered. That is, all files in dir1 are compared to
    all files in dir2 (including subdirectories), and only files in dir2
    will have the "rm " prefix.

    One way to understand this is to imagine lots of image files in both
    dirs. One is certain that there are no duplicates within each dir
    itself. (imagine that del_dup.pl has already been run on each) Files in
    dir1 have already been categorized into sub-directories by a human, so
    that when there are duplicates between dir1 and dir2, one wants the
    version in dir2 to be deleted, leaving the organization in dir1 intact.

    perl del_dup.pl dir1 dir2 dir3 ...

    does the same as above, except files in later dirs will get the "rm "
    prefix first. So, if there are these identical files:

    dir2/a
    dir2/b
    dir4/c
    dir4/d

    then c and d will both have the "rm " prefix for sure. (which of the two
    files in dir2 gets "rm " does not matter) Note that although files
    inside dir2 are not compared with each other, duplicates within it may
    still be found implicitly by indirect comparison, i.e. a==c and b==c,
    therefore a==b, even though a and b are never compared directly.


    --------------------------

    Write a Perl or Python version of the program.

    an absolute requirement in this problem is to minimize the number of
    comparisons made between files. This is a part of the spec.

    feel free to write it however you want. I'll post my version in a few
    days.

    http://www.xahlee.org/perl-python/python.html

    Xah

    http://xahlee.org/PageTwo_dir/more.html
    Xah Lee, Mar 9, 2005
    #1

  2. On 9 Mar 2005 04:56:13 -0800, rumours say that "Xah Lee" <> might
    have written:

    >Write a Perl or Python version of the program.
    >
    >an absolute requirement in this problem is to minimize the number of
    >comparisons made between files. This is a part of the spec.


    http://groups-beta.google.com/group..._frm/thread/de90435f3e56ab0c/048e292ec9adb82d

    The whole thread is about finding duplicate files.
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 9, 2005
    #2

  3. On Wednesday 09 March 2005 06:56 am, Xah Lee wrote:
    > here's a large exercise that uses what we built before.
    >
    > suppose you have tens of thousands of files in various directories.
    > Some of these files are identical, but you don't know which ones are
    > identical with which. Write a program that prints out which files are
    > redundant copies.


    For anyone interested in responding to the above, a starting
    place might be this maintenance script I wrote for my own use. I don't
    think it exactly matches the spec, but it addresses the problem. I wrote
    this to clean up a large tree of image files once. The exact behavior
    described requires the '--exec="ls %s"' option as mentioned in the help.

    #!/usr/bin/env python
    # (C) 2003 Anansi Spaceworks
    #---------------------------------------------------------------------------
    # find_duplicates
    """
    Utility to find duplicate files in a directory tree by
    comparing their checksums.
    """
    #---------------------------------------------------------------------------
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation; either version 2 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    # GNU General Public License for more details.
    #
    # You should have received a copy of the GNU General Public License
    # along with this program; if not, write to the Free Software
    # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
    #---------------------------------------------------------------------------

    import os, sys, md5, getopt


    def file_walker(tbl, srcpath, files):
        """
        Visit a path and collect data (including checksum) for files in it.
        """
        for file in files:
            filepath = os.path.join(srcpath, file)
            if os.path.isfile(filepath):
                # Read in binary mode so the checksum is right on every platform.
                chksum = md5.new(open(filepath, 'rb').read()).digest()
                if not tbl.has_key(chksum): tbl[chksum] = []
                tbl[chksum].append(filepath)

    def find_duplicates(treeroot, tbl=None):
        """
        Find duplicate files in directory.
        """
        dup = {}
        if tbl is None: tbl = {}
        os.path.walk(treeroot, file_walker, tbl)
        for k, v in tbl.items():
            if len(v) > 1:
                dup[k] = v
        return dup

    usage = """
    USAGE: find_duplicates <options> [<path> ...]

    Find duplicate files (by matching md5 checksums) in a
    collection of paths (defaults to the current directory).

    Note that the order of the paths searched will be retained
    in the resulting duplicate file lists. This can be used
    with --exec and --index to automate handling.

    Options:
        -h, -H, --help
            Print this help.

        -q, --quiet
            Don't print normal report.

        -x, --exec=<command string>
            Python-formatted command string to act on the indexed
            duplicate in each duplicate group found. E.g. try
            --exec="ls %s"

        -n, --index=<index into duplicates>
            Which in a series of duplicates to use. Begins with '1'.
            Default is '1' (i.e. the first file listed).

    Example:
        You've copied many files from path ./A into path ./B. You want
        to delete all the ones you've processed already, but not
        delete anything else:

        % find_duplicates -q --exec="rm %s" --index=1 ./A ./B
    """

    def main():
        action = None
        quiet = 0
        index = 1
        dup = {}

        opts, args = getopt.getopt(sys.argv[1:], 'qhHn:x:',
                                   ['quiet', 'help', 'exec=', 'index='])

        for opt, val in opts:
            if opt in ('-h', '-H', '--help'):
                print usage
                sys.exit()
            elif opt in ('-x', '--exec'):
                action = str(val)
            elif opt in ('-n', '--index'):
                index = int(val)
            elif opt in ('-q', '--quiet'):
                quiet = 1

        if len(args) == 0:
            dup = find_duplicates('.')
        else:
            tbl = {}
            for arg in args:
                dup = find_duplicates(arg, tbl=tbl)

        for k, v in dup.items():
            if not quiet:
                print "Duplicates:"
                for f in v: print "\t%s" % f
            if action:
                os.system(action % v[index - 1])

    if __name__ == '__main__':
        main()



    --
    --
    Terry Hancock ( hancock at anansispaceworks.com )
    Anansi Spaceworks http://www.anansispaceworks.com
    Terry Hancock, Mar 9, 2005
    #3
  4. I wrote something similar, have a look at
    http://www.homepages.lu/pu/fdups.html.

    Patrick Useldinger, Mar 10, 2005
    #4
  5. On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
    <> might have written:

    >For anyone interested in responding to the above, a starting
    >place might be this maintenance script I wrote for my own use. I don't
    >think it exactly matches the spec, but it addresses the problem. I wrote
    >this to clean up a large tree of image files once. The exact behavior
    >described requires the '--exec="ls %s"' option as mentioned in the help.


    The drawback of this method is that you have to read everything. For example,
    if you have ten files less than 100KiB each and one file more than 2 GiB in
    size, there is no need to read the 2 GiB file, is there?

    If it's a one-shot attempt, I guess it won't mind a lot.

    On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
    st_inum), because you know that they are the same file.
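
    For illustration, a minimal sketch of that size-first idea (the name
    candidate_groups and the grouping structure are invented here, not taken
    from any program in this thread): files are bucketed by st_size, so a file
    whose size is unique is never read, and paths sharing (st_dev, st_ino)
    count as one physical file.

    import os
    from collections import defaultdict

    def candidate_groups(paths):
        """Bucket files by size, counting hard links to the same inode only once."""
        by_size = defaultdict(dict)              # st_size -> {(st_dev, st_ino): path}
        for path in paths:
            st = os.stat(path)
            by_size[st.st_size].setdefault((st.st_dev, st.st_ino), path)
        # Only sizes shared by two or more distinct inodes ever need their data read.
        return [list(g.values()) for g in by_size.values() if len(g) > 1]
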
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 10, 2005
    #5
  6. Pádraig

    Guest

    I've written a python GUI wrapper around some shell scripts:
    http://www.pixelbeat.org/fslint/

    the shell script logic is essentially:

    exclude hard linked files
    only include files where there are more than 1 with the same size
    print files with matching md5sum
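
    Spelled out in Python, those three steps might look roughly like this (a
    sketch only, not the actual fslint code; the function name is invented, and
    checksums are computed only for size groups holding more than one distinct
    file):

    import os, hashlib
    from collections import defaultdict

    def fslint_style_dups(paths):
        """Drop hard-link twins, keep only shared sizes, then match md5sums."""
        unique = {}                                      # (st_dev, st_ino) -> path
        for p in paths:
            st = os.stat(p)
            unique.setdefault((st.st_dev, st.st_ino), p) # exclude hard linked files
        by_size = defaultdict(list)
        for p in unique.values():
            by_size[os.path.getsize(p)].append(p)
        by_sum = defaultdict(list)
        for group in by_size.values():
            if len(group) > 1:                           # more than 1 file of this size
                for p in group:
                    with open(p, 'rb') as f:
                        by_sum[hashlib.md5(f.read()).hexdigest()].append(p)
        return [v for v in by_sum.values() if len(v) > 1]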

    Pádraig.
    , Mar 10, 2005
    #6
  7. On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger
    <> might have written:

    >I wrote something similar, have a look at
    >http://www.homepages.lu/pu/fdups.html.


    That's fast and good.

    A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).

    Have you found any way to test if two files on NTFS are hard linked without
    opening them first to get a file handle?
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 10, 2005
    #7
  8. Christos TZOTZIOY Georgiou wrote:

    > On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
    > st_inum), because you know that they are the same file.


    I then have a bug here - I consider all files with the same inode equal,
    but according to what you say I need to consider the tuple
    (st_dev,ST_ium). I'll have to fix that for 0.13.

    Thanks ;-)
    -pu
    Patrick Useldinger, Mar 11, 2005
    #8
  9. Christos TZOTZIOY Georgiou wrote:

    > That's fast and good.


    Nice to hear.

    > A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).


    I'll look into that.

    > Have you found any way to test if two files on NTFS are hard linked without
    > opening them first to get a file handle?


    No. And even then, I wouldn't know how to find out.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #9
  10. In article <>,
    "Xah Lee" <> wrote:

    > an absolute requirement in this problem is to minimize the number of
    > comparisons made between files. This is a part of the spec.


    You need do no comparisons between files. Just use a sufficiently
    strong hash algorithm (SHA-256 maybe?) and compare the hashes.
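
    A minimal sketch of that hash-only approach (the function name is invented;
    each file is read once, in chunks so it need not fit in memory, and files
    sharing a digest are reported together without any pairwise comparison):

    import hashlib
    from collections import defaultdict

    def duplicates_by_hash(paths, chunk=1 << 20):
        """Map each SHA-256 digest to the paths that hash to it."""
        groups = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, 'rb') as f:
                for block in iter(lambda: f.read(chunk), b''):
                    h.update(block)
            groups[h.hexdigest()].append(path)
        return [v for v in groups.values() if len(v) > 1]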

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #10
  11. John Bokma

    John Bokma Guest

    David Eppstein wrote:

    > In article <>,
    > "Xah Lee" <> wrote:
    >
    >> an absolute requirement in this problem is to minimize the number of
    >> comparisons made between files. This is a part of the spec.

    >
    > You need do no comparisons between files. Just use a sufficiently
    > strong hash algorithm (SHA-256 maybe?) and compare the hashes.


    I did it as follows (some time ago):

    is filesize in hash?
        calculate md5 (and store), if equal then compare files.

    store info in hash.

    In some cases it might be faster to drop the md5 (since it reads all data)
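
    For a single pair of files, that layered check could look something like
    this (a sketch with invented names; size first, then md5, and a
    byte-for-byte comparison via filecmp only when the checksums agree):

    import os, filecmp, hashlib

    def same_file(path_a, path_b):
        """Cheapest test first; read file data only when the sizes already match."""
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            if hashlib.md5(a.read()).digest() != hashlib.md5(b.read()).digest():
                return False
        # The checksums matched; confirm byte-for-byte before calling it a duplicate.
        return filecmp.cmp(path_a, path_b, shallow=False)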

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Mar 11, 2005
    #11
  12. On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger
    <> might have written:

    >> Have you found any way to test if two files on NTFS are hard linked without
    >> opening them first to get a file handle?

    >
    >No. And even then, I wouldn't know how to find out.


    MSDN is our friend for Windows stuff.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createhardlink.asp

    and then

    http://msdn.microsoft.com/library/d...us/fileio/base/getfileinformationbyhandle.asp

    http://msdn.microsoft.com/library/d...ileio/base/by_handle_file_information_str.asp

    The relevant parts from this last page:

    st_dev <-> dwVolumeSerialNumber

    st_ino <-> (nFileIndexHigh, nFileIndexLow)
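
    For what it's worth, newer Python versions fill st_dev and st_ino from
    exactly those fields on Windows, so the check can be written with os.stat
    alone (a sketch assuming such a Python; the function name is invented, and
    os.path.samefile performs the same comparison):

    import os

    def same_physical_file(path_a, path_b):
        """True if both paths refer to the same file, e.g. two NTFS hard links."""
        st_a, st_b = os.stat(path_a), os.stat(path_b)
        # On NTFS, st_dev maps to dwVolumeSerialNumber and st_ino to
        # (nFileIndexHigh, nFileIndexLow), per the mapping above.
        return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
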
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 11, 2005
    #12
  13. On Fri, 11 Mar 2005 01:12:14 +0100, rumours say that Patrick Useldinger
    <> might have written:

    >> On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
    >> st_inum), because you know that they are the same file.

    >
    >I then have a bug here - I consider all files with the same inode equal,
    > but according to what you say I need to consider the tuple
    >(st_dev,ST_ium). I'll have to fix that for 0.13.


    I have a bug here too-- I wrote st_inum meaning st_ino, but that would be quick
    to find!

    > Thanks ;-)


    You are very welcome.
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 11, 2005
    #13
  14. Christos TZOTZIOY Georgiou wrote:

    > The relevant parts from this last page:
    > st_dev <-> dwVolumeSerialNumber
    > st_ino <-> (nFileIndexHigh, nFileIndexLow)


    I see. But if I am not mistaken, that would mean that I
    (1) would have to detect NTFS volumes
    (2) would have to use non-standard libraries to find this information (like
    the Python Win extensions).

    I am not seriously motivated to do so, but if somebody is interested in
    helping, I am open to it.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #14
  15. David Eppstein wrote:

    > You need do no comparisons between files. Just use a sufficiently
    > strong hash algorithm (SHA-256 maybe?) and compare the hashes.


    That's not very efficient. IMO, it only makes sense in network-based
    operations such as rsync.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #15
  16. Christos TZOTZIOY Georgiou wrote:

    > A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).


    Changed.
    Patrick Useldinger, Mar 11, 2005
    #16
  17. In article <4231c703$>,
    Patrick Useldinger <> wrote:

    > > You need do no comparisons between files. Just use a sufficiently
    > > strong hash algorithm (SHA-256 maybe?) and compare the hashes.

    >
    > That's not very efficient. IMO, it only makes sense in network-based
    > operations such as rsync.


    Well, but the spec didn't say efficiency was the primary criterion, it
    said minimizing the number of comparisons was.

    More seriously, the best I can think of that doesn't use a strong slow
    hash would be to group files by (file size, cheap hash) then compare
    each file in a group with a representative of each distinct file found
    among earlier files in the same group -- that leads to an average of
    about three reads per duplicated file copy: one to hash it, and two for
    the comparison between it and its representative (almost all of the
    comparisons will turn out equal but you still need to check unless you
    use a strong hash).

    I'm assuming of course that there are too many files and/or they're too
    large just to keep them all in core.

    Anyone have any data on whether reading files and SHA-256'ing them (or
    whatever other cryptographic hash you think is strong enough) is
    I/O-bound or CPU-bound? That is, is three reads with very little CPU
    overhead really cheaper than one read with a strong hash?

    I guess it also depends on the number of files you expect to have
    duplicates of. If most of the files exist in only one copy, it's clear
    that the cheap hash will find them more cheaply than the expensive hash.
    In that case you could combine the (file size, cheap hash) filtering
    with the expensive hash and get only two reads per copy rather than
    three.
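
    A sketch of that scheme (names invented; the "cheap hash" here is an
    adler32 of the first block, and within each (size, cheap hash) group every
    file is compared byte-for-byte against one representative of each distinct
    content seen so far):

    import os, zlib, filecmp
    from collections import defaultdict

    def duplicate_groups(paths, probe=64 * 1024):
        """Group by (size, cheap hash of first block), then confirm against representatives."""
        buckets = defaultdict(list)
        for p in paths:
            with open(p, 'rb') as f:
                cheap = zlib.adler32(f.read(probe))
            buckets[(os.path.getsize(p), cheap)].append(p)

        groups = []
        for bucket in buckets.values():
            reps = []                                # one list per distinct content
            for p in bucket:
                for rep in reps:
                    if filecmp.cmp(rep[0], p, shallow=False):
                        rep.append(p)                # duplicate of this representative
                        break
                else:
                    reps.append([p])                 # new distinct content
            groups.extend(g for g in reps if len(g) > 1)
        return groups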

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #17
  18. On Thursday 10 March 2005 11:02 am, Christos "TZOTZIOY" Georgiou wrote:
    > On Wed, 9 Mar 2005 16:13:20 -0600, rumours say that Terry Hancock
    > <> might have written:
    >
    > >For anyone interested in responding to the above, a starting
    > >place might be this maintenance script I wrote for my own use. I don't
    > >think it exactly matches the spec, but it addresses the problem. I wrote
    > >this to clean up a large tree of image files once. The exact behavior
    > >described requires the '--exec="ls %s"' option as mentioned in the help.

    >
    > The drawback of this method is that you have to read everything. For example,
    > if you have ten files less than 100KiB each and one file more than 2 GiB in
    > size, there is no need to read the 2 GiB file, is there?
    >
    > If it's a one-shot attempt, I guess it won't mind a lot.
    >
    > On POSIX filesystems, one has also to avoid comparing files having same (st_dev,
    > st_inum), because you know that they are the same file.


    Two good points, but hey, it ran fast enough for me. ;-)

    Cheers,
    Terry

    --
    --
    Terry Hancock ( hancock at anansispaceworks.com )
    Anansi Spaceworks http://www.anansispaceworks.com
    Terry Hancock, Mar 11, 2005
    #18
  19. David Eppstein wrote:

    > Well, but the spec didn't say efficiency was the primary criterion, it
    > said minimizing the number of comparisons was.


    That's exactly what my program does.

    > More seriously, the best I can think of that doesn't use a strong slow
    > hash would be to group files by (file size, cheap hash) then compare
    > each file in a group with a representative of each distinct file found
    > among earlier files in the same group -- that leads to an average of
    > about three reads per duplicated file copy: one to hash it, and two for
    > the comparison between it and its representative (almost all of the
    > comparisons will turn out equal but you still need to check unless you


    My point is : forget hashes. If you work with hashes, you do have to
    read each file completely, cheap hash or not. My program normally reads
    *at most* 100% of the files to analyse, but usually much less. Also, I
    do plain comparisons which are much cheaper than hash calculations.
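
    The comparison itself, reduced to a single pair, is just a read-and-compare
    loop that stops at the first differing block (a sketch of the idea only,
    not fdups code, which works on whole groups of same-size files at once):

    def identical(path_a, path_b, blocksize=64 * 1024):
        """Compare two files block by block, stopping at the first difference."""
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            while True:
                block_a = a.read(blocksize)
                if block_a != b.read(blocksize):
                    return False             # stop early; the rest is never read
                if not block_a:
                    return True              # both exhausted and equal throughout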

    > I'm assuming of course that there are too many files and/or they're too
    > large just to keep them all in core.


    I assume that file handles are sufficient to keep one open per file of
    the same size. This led to trouble on Windows installations, but I
    guess that's a parameter to change. On Linux, I never had the problem.

    Regarding buffer size, I use a maximum which is then split up between
    all open files.

    > Anyone have any data on whether reading files and SHA-256'ing them (or
    > whatever other cryptographic hash you think is strong enough) is
    > I/O-bound or CPU-bound? That is, is three reads with very little CPU
    > overhead really cheaper than one read with a strong hash?


    It also depends on the OS. I found that my program runs much slower on
    Windows, probably due to the way Linux anticipates reads and tries to
    reduce head movement.

    > I guess it also depends on the number of files you expect to have
    > duplicates of. If most of the files exist in only one copy, it's clear
    > that the cheap hash will find them more cheaply than the expensive hash.
    > In that case you could combine the (file size, cheap hash) filtering
    > with the expensive hash and get only two reads per copy rather than
    > three.


    Sorry, but I still cannot see a point in using hashes. Maybe you'll have
    a look at my program and tell me where a hash could be useful?

    It's available at http://www.homepages.lu/pu/fdups.html.

    Regards,
    -pu
    Patrick Useldinger, Mar 11, 2005
    #19
  20. In article <4232079b$>,
    Patrick Useldinger <> wrote:

    > > Well, but the spec didn't say efficiency was the primary criterion, it
    > > said minimizing the number of comparisons was.

    >
    > That's exactly what my program does.


    If you're doing any comparisons at all, you're not minimizing the number
    of comparisons.

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #20
