[perl-python] a program to delete duplicate files

Discussion in 'Perl Misc' started by Xah Lee, Mar 9, 2005.

  1. Xah Lee

    here's a large exercise that uses what we built before.

    suppose you have tens of thousands of files in various directories.
    Some of these files are identical, but you don't know which ones are
    identical with which. Write a program that prints out which files are
    redundant copies.

    Here's the spec.
    --------------------------
    The program is to be used on the command line. Its arguments are one or
    more full paths of directories.

    perl del_dup.pl dir1

    prints the full paths of all files in dir1 that are duplicates
    (including files in sub-directories). More specifically, if file A has
    duplicates, A's full path will be printed on a line, immediately
    followed by the full paths of all other files that are copies of A. These
    duplicates' full paths will be prefixed with the string "rm ". An empty
    line follows each group of duplicates.

    Here's a sample output.

    inPath/a.jpg
    rm inPath/b.jpg
    rm inPath/3/a.jpg
    rm inPath/hh/eu.jpg

    inPath/ou.jpg
    rm inPath/23/a.jpg
    rm inPath/hh33/eu.jpg

    Order does not matter (i.e. which file is left without the "rm "
    prefix does not matter).

    ------------------------

    perl del_dup.pl dir1 dir2

    will do the same as above, except that duplicates within dir1 or dir2
    themselves are not considered. That is, all files in dir1 are compared to
    all files in dir2 (including subdirectories), and only files in dir2
    will have the "rm " prefix.

    One way to understand this is to imagine lots of image files in both
    dirs. One is certain that there are no duplicates within each dir
    itself (imagine that del_dup.pl has already been run on each). Files in
    dir1 have already been categorized into sub-directories by a human, so
    that when there are duplicates between dir1 and dir2, one wants the
    version in dir2 to be deleted, leaving the organization in dir1 intact.

    perl del_dup.pl dir1 dir2 dir3 ...

    does the same as above, except that files in later dirs get the "rm "
    prefix first. So, if there are these identical files:

    dir2/a
    dir2/b
    dir4/c
    dir4/d

    then c and d will both have the "rm " prefix for sure (which of the two
    files in dir2 gets the "rm " prefix does not matter). Note that although
    files inside dir2 are not compared with each other, duplicates may still
    be found implicitly by indirect comparison: a==c and b==c, therefore
    a==b, even though a and b are never compared directly.


    --------------------------

    Write a Perl or Python version of the program.

    An absolute requirement in this problem is to minimize the number of
    comparisons made between files. This is part of the spec.

    Feel free to write it however you want. I'll post my version in a few
    days.
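
    As a rough illustration of the single-directory case, a sketch (modern
    Python, not the solution posted later in this thread) might group files
    by size and then compare byte-by-byte within each size group; it ignores
    the multi-directory ordering rule:

    #!/usr/bin/env python
    # dedup_sketch.py -- group files by size, then compare byte-by-byte
    # within each size group, printing duplicate groups in the "rm " format.
    import filecmp
    import os
    import sys
    from collections import defaultdict

    def files_under(root):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path):
                    yield path

    def main(root):
        by_size = defaultdict(list)
        for path in files_under(root):
            by_size[os.path.getsize(path)].append(path)

        for paths in by_size.values():
            if len(paths) < 2:
                continue                  # a unique size cannot have duplicates
            groups = []                   # each group: a list of identical files
            for path in paths:
                for group in groups:
                    # compare only against one representative of each group
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    groups.append([path])
            for group in groups:
                if len(group) > 1:
                    print(group[0])       # which copy is kept does not matter
                    for dup in group[1:]:
                        print("rm " + dup)
                    print()

    if __name__ == "__main__":
        main(sys.argv[1])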

    http://www.xahlee.org/perl-python/python.html

    Xah

    http://xahlee.org/PageTwo_dir/more.html
    Xah Lee, Mar 9, 2005
    #1
  2. On 9 Mar 2005 04:56:13 -0800, rumours say that "Xah Lee" <> might
    have written:

    >Write a Perl or Python version of the program.
    >
    >An absolute requirement in this problem is to minimize the number of
    >comparisons made between files. This is part of the spec.


    http://groups-beta.google.com/group..._frm/thread/de90435f3e56ab0c/048e292ec9adb82d

    The whole thread is about finding duplicate files.
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 9, 2005
    #2
  3. I wrote something similar, have a look at
    http://www.homepages.lu/pu/fdups.html.
    Patrick Useldinger, Mar 10, 2005
    #3
  4. On Thu, 10 Mar 2005 10:54:05 +0100, rumours say that Patrick Useldinger
    <> might have written:

    >I wrote something similar, have a look at
    >http://www.homepages.lu/pu/fdups.html.


    That's fast and good.

    A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).

    Have you found any way to test if two files on NTFS are hard linked without
    opening them first to get a file handle?
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 10, 2005
    #4
  5. Christos TZOTZIOY Georgiou wrote:

    > That's fast and good.


    Nice to hear.

    > A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).


    I'll look into that.

    > Have you found any way to test if two files on NTFS are hard linked without
    > opening them first to get a file handle?


    No. And even then, I wouldn't know how to find out.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #5
  6. In article <>,
    "Xah Lee" <> wrote:

    > An absolute requirement in this problem is to minimize the number of
    > comparisons made between files. This is part of the spec.


    You need do no comparisons between files. Just use a sufficiently
    strong hash algorithm (SHA-256 maybe?) and compare the hashes.
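
    In Python terms, a minimal sketch of that hash-only approach (the module
    choices and names here are assumptions, not part of the suggestion):

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def digest(path, blocksize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                h.update(block)
        return h.hexdigest()

    # every file is read exactly once; no byte-by-byte comparisons at all
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                groups[digest(path)].append(path)

    for paths in groups.values():
        if len(paths) > 1:
            print(paths)    # any list with more than one entry is a duplicate set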

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #6
  7. David Eppstein wrote:

    > In article <>,
    > "Xah Lee" <> wrote:
    >
    >> An absolute requirement in this problem is to minimize the number of
    >> comparisons made between files. This is part of the spec.

    >
    > You need do no comparisons between files. Just use a sufficiently
    > strong hash algorithm (SHA-256 maybe?) and compare the hashes.


    I did it as follows (some time ago):

        is filesize in hash?
            calculate md5 (and store); if equal, then compare files.
        store info in hash.

    In some cases it might be faster to drop the md5 (since it reads all
    the data).
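
    Roughly, in Python, the structure above might look like this (a sketch;
    the helper names are made up):

    import filecmp
    import hashlib
    import os

    md5_cache = {}        # path -> md5 hex digest, computed lazily
    by_size = {}          # file size -> paths seen so far with that size

    def md5_of(path):
        if path not in md5_cache:
            with open(path, "rb") as f:
                # reads the whole file into memory; fine for a sketch
                md5_cache[path] = hashlib.md5(f.read()).hexdigest()
        return md5_cache[path]

    def find_duplicate(path):
        """Return a previously seen path identical to `path`, or None."""
        peers = by_size.setdefault(os.path.getsize(path), [])
        for peer in peers:                    # only equal-sized files are candidates
            if md5_of(peer) == md5_of(path):  # md5 computed only on size collisions
                # "if equal, then compare files"
                if filecmp.cmp(peer, path, shallow=False):
                    return peer
        peers.append(path)
        return None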

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Mar 11, 2005
    #7
  8. On Fri, 11 Mar 2005 01:24:59 +0100, rumours say that Patrick Useldinger
    <> might have written:

    >> Have you found any way to test if two files on NTFS are hard linked without
    >> opening them first to get a file handle?

    >
    >No. And even then, I wouldn't know how to find out.


    MSDN is our friend for Windows stuff.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createhardlink.asp

    and then

    http://msdn.microsoft.com/library/d...us/fileio/base/getfileinformationbyhandle.asp

    http://msdn.microsoft.com/library/d...ileio/base/by_handle_file_information_str.asp

    The relevant parts from this last page:

    st_dev <-> dwVolumeSerialNumber

    st_ino <-> (nFileIndexHigh, nFileIndexLow)
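
    For comparison, on POSIX filesystems this check needs no file handle at
    all; a small sketch using os.stat (os.path.samefile does the same thing):

    import os

    def same_file(path_a, path_b):
        # Hard links to the same file share (st_dev, st_ino) on POSIX.
        # On NTFS the equivalent pair (dwVolumeSerialNumber, nFileIndexHigh/
        # nFileIndexLow) has to come from GetFileInformationByHandle, which
        # means opening the files first.
        a, b = os.stat(path_a), os.stat(path_b)
        return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)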
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 11, 2005
    #8
  9. Christos TZOTZIOY Georgiou wrote:

    > The relevant parts from this last page:
    > st_dev <-> dwVolumeSerialNumber
    > st_ino <-> (nFileIndexHigh, nFileIndexLow)


    I see. But if I am not mistaken, that would mean that I
    (1) had to detect NTFS volumes, and
    (2) use non-standard libraries to find this information (like the
    Python Win extensions).

    I am not seriously motivated to do so, but if somebody is interested in
    helping, I am open to it.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #9
  10. David Eppstein wrote:

    > You need do no comparisons between files. Just use a sufficiently
    > strong hash algorithm (SHA-256 maybe?) and compare the hashes.


    That's not very efficient. IMO, it only makes sense in network-based
    operations such as rsync.

    -pu
    Patrick Useldinger, Mar 11, 2005
    #10
  11. Christos TZOTZIOY Georgiou wrote:

    > A minor nit-pick: `fdups.py -r .` does nothing (at least on Linux).


    Changed.
    Patrick Useldinger, Mar 11, 2005
    #11
  12. In article <4231c703$>,
    Patrick Useldinger <> wrote:

    > > You need do no comparisons between files. Just use a sufficiently
    > > strong hash algorithm (SHA-256 maybe?) and compare the hashes.

    >
    > That's not very efficient. IMO, it only makes sense in network-based
    > operations such as rsync.


    Well, but the spec didn't say efficiency was the primary criterion, it
    said minimizing the number of comparisons was.

    More seriously, the best I can think of that doesn't use a strong slow
    hash would be to group files by (file size, cheap hash) then compare
    each file in a group with a representative of each distinct file found
    among earlier files in the same group -- that leads to an average of
    about three reads per duplicated file copy: one to hash it, and two for
    the comparison between it and its representative (almost all of the
    comparisons will turn out equal but you still need to check unless you
    use a strong hash).

    I'm assuming of course that there are too many files and/or they're too
    large just to keep them all in core.

    Anyone have any data on whether reading files and SHA-256'ing them (or
    whatever other cryptographic hash you think is strong enough) is
    I/O-bound or CPU-bound? That is, is three reads with very little CPU
    overhead really cheaper than one read with a strong hash?

    I guess it also depends on the number of files you expect to have
    duplicates of. If most of the files exist in only one copy, it's clear
    that the cheap hash will find them more cheaply than the expensive hash.
    In that case you could combine the (file size, cheap hash) filtering
    with the expensive hash and get only two reads per copy rather than
    three.
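
    A rough sketch of that grouping scheme (here the "cheap hash" is simply
    the first few kilobytes of the file, which is an assumption, not part of
    the description above):

    import filecmp
    import os
    from collections import defaultdict

    CHEAP_BYTES = 8192                      # "cheap hash": the first few KB

    def cheap_key(path):
        with open(path, "rb") as f:
            return (os.path.getsize(path), f.read(CHEAP_BYTES))

    def duplicate_groups(paths):
        # Group by (file size, cheap hash), then compare each file against
        # one representative of every distinct file already found in its
        # group, as described above.
        buckets = defaultdict(list)
        for path in paths:
            buckets[cheap_key(path)].append(path)

        for candidates in buckets.values():
            groups = []                     # each: files known to be identical
            for path in candidates:
                for group in groups:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    groups.append([path])
            for group in groups:
                if len(group) > 1:
                    yield group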

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #12
  13. David Eppstein wrote:

    > Well, but the spec didn't say efficiency was the primary criterion, it
    > said minimizing the number of comparisons was.


    That's exactly what my program does.

    > More seriously, the best I can think of that doesn't use a strong slow
    > hash would be to group files by (file size, cheap hash) then compare
    > each file in a group with a representative of each distinct file found
    > among earlier files in the same group -- that leads to an average of
    > about three reads per duplicated file copy: one to hash it, and two for
    > the comparison between it and its representative (almost all of the
    > comparisons will turn out equal but you still need to check unless you


    My point is: forget hashes. If you work with hashes, you do have to
    read each file completely, cheap hash or not. My program normally reads
    *at most* 100% of the files to analyse, but usually much less. Also, I
    do plain comparisons, which are much cheaper than hash calculations.

    > I'm assuming of course that there are too many files and/or they're too
    > large just to keep them all in core.


    I assume that the file handle limit is sufficient to keep one handle
    open per file of the same size. This led to trouble on Windows
    installations, but I guess that's a parameter to change. On Linux, I
    never had the problem.

    Regarding buffer size, I use a maximum which is then split up between
    all open files.
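
    A compressed sketch of the idea (not the actual fdups code): read the
    same-sized candidates in parallel, block by block, splitting the
    candidate set whenever a block differs, and stop reading a file as soon
    as its group shrinks to a single member:

    from collections import defaultdict

    # fixed block size here; fdups splits a maximum buffer between open files
    BLOCK = 8192

    def identical_groups(paths):
        # `paths` are files of identical size; returns lists of identical files.
        handles = {p: open(p, "rb") for p in paths}
        try:
            groups = [list(paths)]               # start with one candidate group
            result = []
            while groups:
                next_groups = []
                for group in groups:
                    buckets = defaultdict(list)  # split the group by this block
                    for p in group:
                        buckets[handles[p].read(BLOCK)].append(p)
                    for block, members in buckets.items():
                        if len(members) == 1:
                            continue             # unique so far: not a duplicate
                        if block == b"":
                            result.append(members)   # EOF reached together: identical
                        else:
                            next_groups.append(members)
                groups = next_groups
            return result
        finally:
            for f in handles.values():
                f.close()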

    > Anyone have any data on whether reading files and SHA-256'ing them (or
    > whatever other cryptographic hash you think is strong enough) is
    > I/O-bound or CPU-bound? That is, is three reads with very little CPU
    > overhead really cheaper than one read with a strong hash?


    It also depends on the OS. I found that my program runs much slower on
    Windows, probably due to the way Linux anticipates reads and tries to
    reduce head movement.

    > I guess it also depends on the number of files you expect to have
    > duplicates of. If most of the files exist in only one copy, it's clear
    > that the cheap hash will find them more cheaply than the expensive hash.
    > In that case you could combine the (file size, cheap hash) filtering
    > with the expensive hash and get only two reads per copy rather than
    > three.


    Sorry, but I still cannot see the point of using hashes. Maybe you could
    have a look at my program and tell me where a hash would be useful?

    It's available at http://www.homepages.lu/pu/fdups.html.

    Regards,
    -pu
    Patrick Useldinger, Mar 11, 2005
    #13
  14. In article <4232079b$>,
    Patrick Useldinger <> wrote:

    > > Well, but the spec didn't say efficiency was the primary criterion, it
    > > said minimizing the number of comparisons was.

    >
    > That's exactly what my program does.


    If you're doing any comparisons at all, you're not minimizing the number
    of comparisons.

    --
    David Eppstein
    Computer Science Dept., Univ. of California, Irvine
    http://www.ics.uci.edu/~eppstein/
    David Eppstein, Mar 11, 2005
    #14
  15. On Fri, 11 Mar 2005 11:07:02 -0800, rumours say that David Eppstein
    <> might have written:

    >More seriously, the best I can think of that doesn't use a strong slow
    >hash would be to group files by (file size, cheap hash) then compare
    >each file in a group with a representative of each distinct file found
    >among earlier files in the same group -- that leads to an average of
    >about three reads per duplicated file copy: one to hash it, and two for
    >the comparison between it and its representative (almost all of the
    >comparisons will turn out equal but you still need to check unless you
    >use a strong hash).


    The code I posted in another thread (and provided a link in this one) does
    exactly that (a quick hash of the first few K before calculating the whole
    file's md5 sum). However, Patrick's code is faster, reading only what's
    necessary (he does what I intended to do, but I was too lazy -- I actually
    rewrote from scratch one of the first programs I wrote in Python, which
    obviously was code too amateurish for me to publish :)

    It seems your objections are related to Xah Lee's specifications; I have no
    objections to your objections (-:) other than that we are just trying to produce
    something of practical value out of an otherwise doomed thread...
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
    Christos TZOTZIOY Georgiou, Mar 11, 2005
    #15
  16. On Fri, 11 Mar 2005 14:06:27 -0800, David Eppstein <> wrote:

    >In article <4232079b$>,
    > Patrick Useldinger <> wrote:
    >
    >> > Well, but the spec didn't say efficiency was the primary criterion, it
    >> > said minimizing the number of comparisons was.

    >>
    >> That's exactly what my program does.

    >
    >If you're doing any comparisons at all, you're not minimizing the number
    >of comparisons.
    >

    ISTM a lot will depend on the distribution of differences between candidate
    (equal-sized, obviously) files. If multi-GB files differ in the last byte only
    (but this is not known), they will all have to be crunched to the last byte to
    eliminate them from possible duplicate-set membership. If there is a-priori knowledge
    of highly probable difference locations (e.g., internal date stamp at some offset etc)
    then obviously the candidates can be pre-classified into smaller candidate sets by some
    simple heuristic probes.

    If differences are likely to appear early in a sequential scan, perhaps developing hashes
    in parallel would be a good strategy. But I doubt if blindly pre-computing full hashes would be optimal.
    Compare to sort|uniq as a sieve. You wouldn't want to read through any files any further than you had to.
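
    For just two candidates, the "read no further than you have to" part looks
    roughly like this (a sketch; the block size is arbitrary):

    def differ(path_a, path_b, blocksize=8192):
        # Compare two equal-sized files block by block, stopping at the
        # first differing block -- no more data is read than necessary.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                block_a = fa.read(blocksize)
                block_b = fb.read(blocksize)
                if block_a != block_b:
                    return True           # first difference found: stop reading
                if not block_a:           # both at EOF: the files are identical
                    return False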

    Regards,
    Bengt Richter
    Bengt Richter, Mar 11, 2005
    #16
  17. Xah Lee

    Sorry, I've been busy...

    Here's the Perl code. I have yet to clean up the code and make it
    compatible with the cleaned spec above. The code as it is performs the
    same algorithm as the spec, it just doesn't print the output as such. In
    a few days, I'll post a clean version, and also a Python version, as
    well as a sample directory for testing purposes. (The Perl code has gone
    through much testing and is considered correct.)

    The Perl code comes in 3 files as it is:

    Combo114.pm
    Genpair114.pm
    del_dup.pl

    The main program is del_dup.pl. Run it on the command line as per the
    spec. If you want to actually delete the dup files, uncomment the
    "unlink" line at the bottom. Note: the module names don't have any
    significance.


    Note: these Python files are also ready to go for the final Python
    version. Possibly the final program should be just a single file...

    Combo114.py
    Genpair114.py


    Here are the files: del_dup.zip
    -----
    to get the code and full detail with latest update, please see:
    http://xahlee.org/perl-python/delete_dup_files.html

    Xah

    http://xahlee.org/PageTwo_dir/more.html


    Claudio Grondi wrote:
    > >> I'll post my version in a few days.

    > Have I missed something?
    > Where can I see your version?
    >
    > Claudio
    Xah Lee, Mar 20, 2005
    #17
  18. Guest

    Re: a program to delete duplicate files

    Why not try NoClone? It finds and deletes duplicate files by true
    byte-by-byte comparison. A smart marker filters the duplicate files to
    delete. With a GUI.
    http://noclone.net


    Xah Lee wrote:
    > here's a large exercise that uses what we built before.
    >
    > suppose you have tens of thousands of files in various directories.
    > Some of these files are identical, but you don't know which ones are
    > identical with which. Write a program that prints out which files are
    > redundant copies.
    >
    > [rest of the spec snipped -- see the original post above]
    , Mar 29, 2005
    #18