Duplicate filenames with equal sizes too?

Discussion in 'Perl Misc' started by rishid@gmail.com, May 3, 2005.

  1. Guest

    Hi,

     I found some code on here to find duplicate file names, but I cannot
     figure out how to add a check that the file size is also the same
     before adding the file to the duplicates hash.

    Here is the code, thanks for any help.

     find(\&check_file, $dir);

     sub check_file {
         if (-d $_) { next; }
         else
         {
             if ($seen{$_})
             {
                 if (exists $duplicates{$_})
                 {
                     push(@{$duplicates{$_}}, $File::Find::name);
                 }
                 else
                 {
                     $duplicates{$_} = [$seen{$_}, $File::Find::name];
                 }
             }
             else
             {
                 $seen{$_} = $File::Find::name;
             }
         }
     }
     rishid@gmail.com, May 3, 2005
    #1

  2. Guest

     I added the size check; I'm still trying to learn hashes. It is
     working, but I found a bug and cannot figure out a workaround. Say I
     have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4 won't
     be found, since it won't match the original size of 1000 KB.

    Any ideas on how to get around this?

    Thanks a lot

     sub check_file {
         if (-d $_) { next; }
         else
         {
             if ($seen{$_})
             {
                 @Stats = stat($seen{$_});
                 $orgsize = $Stats[7];
                 @Stats = stat($File::Find::name);
                 $dupsize = $Stats[7];

                 if ($orgsize == $dupsize) {
                     if (exists $duplicates{$_})
                     {
                         push(@{$duplicates{$_}}, $File::Find::name);
                     }
                     else
                     {
                         $duplicates{$_} = [$seen{$_}, $File::Find::name];
                     }
                 }
             }
             else
             {
                 $seen{$_} = $File::Find::name;
             }
         }
     }
     rishid@gmail.com, May 3, 2005
    #2

  3. "" <> wrote in
    news::

     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same


    perldoc -f stat

     You should read the list of Perl functions and familiarize yourself
     with it:

    perldoc perlfunc

    That is also the first place to turn to when you wonder if Perl has a
    builtin that does what you want.
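
     For example, both of these give a file's size in bytes (the path here
     is just a placeholder):

     my $size = (stat 'some/file.txt')[7];   # element 7 of stat() is the size
     my $same = -s 'some/file.txt';          # the -s filetest returns it directly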

    Sinan
    A. Sinan Unur, May 3, 2005
    #3
  4. "" <> wrote in
    news::

     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.


     The code you posted is doing exactly what you told it to do: consider
     files distinct if they differ in size, even if the file names are the
     same.

    What do you want us to do?

    Figure out what you want to do and then do it for you?

    You are the person who decides what is a duplicate and what is not.

    Also, post code that can be run with no effort on the part of the reader.
    That means a short but *complete* example.

    Sinan
    A. Sinan Unur, May 3, 2005
    #4
  5. Thomas Kratz

     rishid@gmail.com wrote:
    > Hi,
    >
     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same
     > before adding the file to the duplicates hash.
    >
    > Here is the code, thanks for any help.


    First: Please read the posting guidelines posted here regularly. You
    should post a short and *complete* example, not snippets.

    >
    > find (\&check_file, $dir)
    >
     > sub check_file {
     >     if (-d $_) { next; }
     >     else
     >     {


    You seem not to have

    use strict;
    use warnings;

     at the top of your script. Otherwise you would have seen a warning
     about exiting a subroutine via next. Replace the first three lines
     with:

     return if -d;

     Testing $_ is the default.

     >         if ($seen{$_})
     >         {
     >             if (exists $duplicates{$_})


     if (
         exists $duplicates{$_} and
         -s == -s $duplicates{$_}->[-1]
     )

     >             {
     >                 push(@{$duplicates{$_}}, $File::Find::name);
     >             }
     >             else
     >             {
     >                 $duplicates{$_} = [$seen{$_}, $File::Find::name];
     >             }
     >         }
     >         else
     >         {
     >             $seen{$_} = $File::Find::name;
     >         }
     >     }
     > }
     >


     (untested, because no runnable script was provided)
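
     Pulling those pieces together, a minimal sketch (equally untested;
     two assumptions of mine: I store the size next to the first-seen path
     so each file is stat'ed only once, and I keep your original
     one-size-per-name behaviour):

     #!/usr/bin/perl
     use strict;
     use warnings;
     use File::Find;
     use Data::Dumper;

     my $dir = shift || '.';    # starting directory (placeholder default)
     my (%seen, %duplicates);

     find(\&check_file, $dir);
     print Dumper \%duplicates;

     sub check_file {
         return if -d;          # skip directories; $_ is the default operand
         my $size = -s _;       # reuse the stat buffer filled by the -d test

         if (my $first = $seen{$_}) {
             # same name seen before: record a duplicate only if the size
             # matches the first occurrence
             if ($size == $first->[1]) {
                 push @{ $duplicates{$_} ||= [ $first->[0] ] },
                     $File::Find::name;
             }
         }
         else {
             $seen{$_} = [ $File::Find::name, $size ];
         }
     }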

    By the way: it would be better not to use tabs for indenting. Every decent
    editor should have an option to indent with spaces.

    Thomas

    --
    $/=$,,$_=<DATA>,s,(.*),$1,see;__END__
    s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
    $_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
    '%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
    print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
    Thomas Kratz, May 3, 2005
    #5
  6. Thomas Kratz

     rishid@gmail.com wrote:
     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.
    >
    > Any ideas on how to get around this?


    Use your original algorithm and go over the duplicates in a second pass
    (perhaps sorting them by filesize first). This is clearer and saves you
    from keeping the complete directory information around for comparison.
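
     Roughly like this (my sketch; it assumes %duplicates maps each file
     name to an array of full paths, as in your original code):

     # second pass: split every name-based group by file size; only
     # sub-groups with more than one member are real duplicates
     my @real_dups;
     for my $paths (values %duplicates) {
         my %by_size;
         push @{ $by_size{ -s $_ } }, $_ for @$paths;
         push @real_dups, grep { @$_ > 1 } values %by_size;
     }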

    Thomas

    --
    $/=$,,$_=<DATA>,s,(.*),$1,see;__END__
    s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
    $_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
    '%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
    print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
    Thomas Kratz, May 3, 2005
    #6
  7. rishid@gmail.com wrote:

     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.
    >
    > Any ideas on how to get around this?


     Well, checking the file size with -s() isn't that difficult. Your
     problem is exactly what you have just described. I suggest working
     with inner hashes (instead of arrays) -- you can use the file size as
     an unambiguous key. Also, your destination hash %duplicates should be
     replaced by an array, since you want to store the duplicates of
     "test.txt" twice.

     Others have already pointed out some shortcomings in your code, so I
     won't repeat them. I would do something similar to:


    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    my $dir = 'foo';
    my %seen;
     find( sub {
         return if -d;
         push @{ $seen{$_}->{ -s _ } }, $File::Find::name;
     }, $dir );

     my @duplicates;
     for ( values %seen ) {
         for ( values %$_ ) {
             push @duplicates, $_ if @$_ > 1;
         }
     }

    use Data::Dumper;
    # print Dumper \%seen;
    print Dumper \@duplicates;
    __END__
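
     With your example (test.txt in f1/f2 at one size and in f3/f4 at
     another), the dump should come out roughly like this -- both size
     groups are found (paths assumed, with $dir = 'foo'):

     # $VAR1 = [
     #           [ 'foo/f1/test.txt', 'foo/f2/test.txt' ],  # 1000 KB pair
     #           [ 'foo/f3/test.txt', 'foo/f4/test.txt' ],  # 2000 KB pair
     #         ];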


    regards,
    fabian
    Fabian Pilkowski, May 3, 2005
    #7
  8. Alan

    Hi,

     I like programming tools, but a finished product saves your valuable
     time. Try NoClone and you can see how it was programmed. It was
     written in VB, however, and runs on Windows.
     NoClone finds and removes duplicate mp3s, photos and any other type
     of file by true byte-by-byte comparison. Its time-saving Smart Marker
     feature filters duplicates for removal. It offers image previews and
     flexible removal and archival options.
    http://noclone.net

    Alan


    "" <> wrote in message news:<>...
    > Hi,
    >
     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same
     > before adding the file to the duplicates hash.
     >
     > Here is the code, thanks for any help.
     >
     > [original code snipped]
    Alan, May 13, 2005
    #8
