Checking if two files are the same

Discussion in 'Ruby' started by New C., Mar 7, 2011.

  1. New C.

    New C. Guest

    I have a few folders which may contain the same files under different
    names.
    Is there any way I can find out which files these are using Ruby?

    Text files (*.doc, *.txt ...) should be pretty easy to check,
    but what about pdf files, exe files etc.?

    I am wondering if there is some sort of diff module that can do this.

     
    New C., Mar 7, 2011
    #1

  2. Xavier Noria

    Xavier Noria Guest

    On Mon, Mar 7, 2011 at 11:48 AM, New C. wrote:

    > I have a few folders which may contain the same files under different
    > names.
    > Is there any way I can find out which files these are using Ruby?
    >
    > Text files (*.doc, *.txt ...) should be pretty easy to check,
    > but what about pdf files, exe files etc.?
    >
    > I am wondering if there is some sort of diff module that can do this.


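    There's File.compare (require 'ftools' on Ruby 1.8; on Ruby 1.9 and
    later, where ftools is gone, use FileUtils.compare_file).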

    Depending on how many comparisons you're going to do, it might be a
    good idea to precompute checksums and compare the checksums.
     
    Xavier Noria, Mar 7, 2011
    #2
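
A minimal sketch of both suggestions in post #2, assuming Ruby 1.9 or later, where FileUtils.compare_file does the byte-for-byte check. The helper names same_content? and checksum_index are made up for illustration, and the size shortcut is an addition that is not in the post.

```ruby
require "fileutils"
require "digest/sha1"

# Byte-for-byte comparison of two files. Nothing is interpreted as text,
# so it works equally well for pdf, exe and other binary files.
def same_content?(path_a, path_b)
  # Files of different sizes can never be identical, so skip reading them.
  return false unless File.size(path_a) == File.size(path_b)
  FileUtils.compare_file(path_a, path_b)
end

# Precompute one checksum per file, so that many later comparisons only
# compare short strings instead of re-reading the files.
def checksum_index(paths)
  paths.each_with_object({}) do |path, index|
    index[path] = Digest::SHA1.file(path).hexdigest
  end
end
```

With only a handful of files, same_content? alone is probably enough; the index starts to pay off once each file would otherwise be read for several comparisons.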

  3. On 03/07/2011 06:09 AM, Xavier Noria wrote:

    >
    > Depending on how many comparisons you're going to do, it might be a
    > good idea to precompute checksums and compare the checksums.


    +1
     
    Reid Thompson, Mar 7, 2011
    #3
  4. On Mon, Mar 7, 2011 at 11:09 AM, Xavier Noria wrote:
    > On Mon, Mar 7, 2011 at 11:48 AM, New C. wrote:
    >
    >> I have a few folders which may contain the same files under different
    >> names. Is there any way I can find out which files these are using Ruby?
    >> ...
    >> I am wondering if there is some sort of diff module that can do this.

    >
    > There's File.compare.
    >
    > Depending on how many comparisons you're going to do, it might be a
    > good idea to precompute checksums and compare the checksums.


    I'm interested in any (Ruby) solutions (actual or ideas) for this, as
    I have needed to do it in the past, and want to do something similar
    in the very near future.

    For comparing directories where the file names might have changed what
    I've done in the past is to first match on file name, then for the
    unmatching files in each directory see if there are any matches on
    file size, and for those matches either make a direct File.compare (if
    only two files match on a size) or compute checksums and use those to
    exclude definitely unmatching files, and then use File.compare on what
    (if anything) remains matching for that file size and checksum.

    I assume something similar would work for finding duplicates in
    general, not just comparing directories? (If there are likely to be
    many matches on file size, then presumably one might as well compute
    checksums for all files?)
     
    Colin Bartlett, Mar 7, 2011
    #4
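
The staged approach in post #4 could be sketched roughly as follows for the general "find duplicates" case. It skips the initial match on file name. The method name duplicate_groups is hypothetical, and FileUtils.compare_file stands in for File.compare on the assumption of Ruby 1.9 or later.

```ruby
require "fileutils"
require "digest/sha1"

# Returns an array of groups (arrays of paths) whose contents are identical.
def duplicate_groups(paths)
  groups = []
  # Stage 1: the file size is cheap to read and rules out most pairs.
  paths.group_by { |path| File.size(path) }.each_value do |same_size|
    next if same_size.size < 2
    if same_size.size == 2
      # Only two candidates of this size: compare them directly.
      groups << same_size if FileUtils.compare_file(*same_size)
      next
    end
    # Stage 2: checksum only the files that still share a size.
    by_sum = same_size.group_by { |path| Digest::SHA1.file(path).hexdigest }
    by_sum.each_value do |same_sum|
      next if same_sum.size < 2
      # Stage 3: confirm byte for byte against the first file of the group.
      first, *rest = same_sum
      confirmed = rest.select { |path| FileUtils.compare_file(first, path) }
      groups << [first, *confirmed] unless confirmed.empty?
    end
  end
  groups
end
```

Called as duplicate_groups(Dir["some_dir/**/*"].select { |f| File.file?(f) }), where some_dir is a placeholder, it reads a file's contents only when at least one other file has the same size.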
  5. Xavier Noria

    Xavier Noria Guest

    On Mon, Mar 7, 2011 at 6:38 PM, Colin Bartlett wrote:

    > I'm interested in any (Ruby) solutions (actual or ideas) for this, as
    > I have needed to do it in the past, and want to do something similar
    > in the very near future.
    >
    > For comparing directories where the file names might have changed what
    > I've done in the past is to first match on file name, then for the
    > unmatching files in each directory see if there are any matches on
    > file size, and for those matches either make a direct File.compare (if
    > only two files match on a size) or compute checksums and use those to
    > exclude definitely unmatching files, and then use File.compare on what
    > (if anything) remains matching for that file size and checksum.


    I have played with this as an exercise. The idea is to filter
    candidates iteratively applying different criteria, from cheap to
    expensive, until you arrive at the solution.

    It is just a proof of concept in pseudocode that I wrote off the top
    of my head; it does not even run:

    https://gist.github.com/859046

    The code above assumes a generic scenario with m-to-n possible
    duplicates. If a particular situation has details that can speed up
    the process, they should of course be taken into account.
     
    Xavier Noria, Mar 7, 2011
    #5
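
The idea in post #5 of filtering candidates with criteria ordered from cheap to expensive might look something like the sketch below. This is not the content of the linked gist; the constant CRITERIA, the method refine and the choice of a 4096-byte prefix are all assumptions made for illustration.

```ruby
require "digest/sha1"

# Criteria ordered from cheap to expensive. Each maps a path to a key;
# two files can only be duplicates if they agree on every key.
CRITERIA = [
  ->(path) { File.size(path) },                    # metadata only
  ->(path) { File.binread(path, 4096) },           # first 4096 bytes
  ->(path) { Digest::SHA1.file(path).hexdigest }   # whole file
]

# Splits every candidate group by the next criterion and drops groups
# that shrink to a single file, so later criteria see fewer files.
def refine(groups, criteria = CRITERIA)
  criteria.reduce(groups) do |candidates, criterion|
    candidates.flat_map do |group|
      group.group_by(&criterion).values.select { |g| g.size > 1 }
    end
  end
end
```

Starting from a single group holding every file, refine([all_paths]) returns the groups that survive every criterion. A final byte-for-byte comparison can be appended if a checksum match is not considered proof enough.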
  6. New C. wrote in post #985925:
    > I have a few folders which may contain the same files under different
    > names.
    > Is there any way I can find out which files these are using Ruby?


    Here is a little ruby script I use for finding and/or deleting duplicate
    image and video files downloaded from my camera - it will work for any
    sort of file.

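    #!/usr/bin/ruby -w
    require 'digest/sha1'

    # With -d as the first argument, each duplicate is also deleted
    # after it has been reported.
    if ARGV[0] == "-d"
      do_delete = true
      ARGV.shift
    end

    seen = {}  # SHA1 hex digest => first path seen with that content
    dirs = ARGV.empty? ? ["#{ENV["HOME"]}/Pictures"] : ARGV

    dirs.each do |dir|
      # Walk each tree in sorted order; the first path with a given
      # digest is kept, later ones are reported as duplicates.
      Dir["#{dir}/**/*"].sort.each do |fn|
        next if File.directory?(fn)
        hash = Digest::SHA1.file(fn).hexdigest
        if seen[hash]
          puts "#{fn} is dupe of #{seen[hash]}"
          if do_delete
            File.delete(fn)
            puts "DELETED"
          end
        else
          seen[hash] = fn
        end
      end
    end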

     
    Brian Candler, Mar 8, 2011
    #6
  7. On Tue, Mar 8, 2011 at 9:48 AM, Brian Candler wrote:
    > New C. wrote in post #985925:
    >> I have a few folders which may contain the same files under different
    >> names.
    >> Is there any way I can find out which files these are using Ruby?

    >
    > Here is a little ruby script I use for finding and/or deleting duplicate
    > image and video files downloaded from my camera - it will work for any
    > sort of file.
    >
    > #!/usr/bin/ruby -w
    > require 'digest/sha1'
    > if ARGV[0] == "-d"
    >   do_delete = true
    >   ARGV.shift
    > end
    >
    > seen = {}
    > dirs = ARGV.empty? ? ["#{ENV["HOME"]}/Pictures"] : ARGV
    >
    > dirs.each do |dir|
    >   Dir["#{dir}/**/*"].sort.each do |fn|
    >     next if File.directory?(fn)
    >     hash = Digest::SHA1.file(fn).hexdigest
    >     if seen[hash]
    >       puts "#{fn} is dupe of #{seen[hash]}"
    >       if do_delete
    >         File.delete(fn)
    >         puts "DELETED"
    >       end
    >     else
    >       seen[hash] = fn
    >     end
    >   end
    > end


    For this idiom Hash#fetch can be used nicely:

    irb(main):008:0> h={};
    10.times {|i|
      puts i
      h.fetch(i % 3) {|x| printf "first %p\n", i; h[x]=true; nil} and
        printf "duplicate %p\n", i}
    0
    first 0
    1
    first 1
    2
    first 2
    3
    duplicate 3
    4
    duplicate 4
    5
    duplicate 5
    6
    duplicate 6
    7
    duplicate 7
    8
    duplicate 8
    9
    duplicate 9
    => 10
    irb(main):009:0>

    ... for arbitrary values of "nice". ;-)

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Mar 8, 2011
    #7
  8. It looks pretty obfuscated to my eyes, but each to his own.

    dirs.each do |dir|
      Dir["#{dir}/**/*"].sort.each do |fn|
        next if File.directory?(fn)
        hash = Digest::SHA1.file(fn).hexdigest
        if seen.fetch(hash) { seen[hash]=fn; false }
          puts "#{fn} is dupe of #{seen[hash]}"
          if do_delete
            File.delete(fn)
            puts "DELETED"
          end
        end
      end
    end

     
    Brian Candler, Mar 8, 2011
    #8