Duplicate filenames and size is equal too?

rishid

Hi,

I found some code on here to find duplicate file names, but I cannot
figure out how to add a check that the file size is also the same
before adding the file to the duplicates hash.

Here is the code, thanks for any help.

find (\&check_file, $dir)

sub check_file {
    if (-d $_) { next; }
    else
    {
        if ($seen{$_})
        {
            if (exists $duplicates{$_})
            {
                push (@{$duplicates{$_}}, $File::Find::name);
            }
            else
            {
                $duplicates{$_} = [$seen{$_}, $File::Find::name];
            }
        }
        else
        {
            $seen{$_} = $File::Find::name;
        }
    }
}
 
rishid

I added the size check, still trying to learn hashes. It is working,
but I found a bug and cannot seem to figure out a workaround. Say I have
test.txt (1000kb) in the f1 and f2 folders. Then I have test.txt (2000kb)
in the f3 and f4 folders. The duplicate that is in f3 and f4 won't be
found, since it won't match the original size of 1000kb.

Any ideas on how to get around this?

Thanks a lot

sub check_file {
    if (-d $_) { next; }
    else
    {
        if ($seen{$_})
        {
            @Stats = stat($seen{$_});
            $orgsize = $Stats[7];
            @Stats = stat($File::Find::name);
            $dupsize = $Stats[7];

            if ($orgsize == $dupsize) {
                if (exists $duplicates{$_})
                {
                    push (@{$duplicates{$_}}, $File::Find::name);
                }
                else
                {
                    $duplicates{$_} = [$seen{$_}, $File::Find::name];
                }
            }
        }
        else
        {
            $seen{$_} = $File::Find::name;
        }
    }
}
 
A. Sinan Unur

I found some code on here to find duplicate file names, but cannot
figure out how to add a check to make sure the file size is also the

perldoc -f stat

You should read the list of Perl functions, and familiarize yourself with
it:

perldoc perlfunc

That is also the first place to turn to when you wonder if Perl has a
builtin that does what you want.
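
For instance, a minimal (untested) sketch of what perldoc -f stat
describes -- field 7 of the list stat() returns is the size in bytes,
and -s is the shorthand file test; the file name below is only a
placeholder:

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'test.txt';              # placeholder name

my @stats = stat($file) or die "stat $file: $!";
my $size  = $stats[7];              # field 7 is the size in bytes

my $same  = -s $file;               # -s returns the same value

print "$file is $size bytes ($same via -s)\n";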

Sinan
 
A. Sinan Unur

I added the size function, still trying to learn hashes. It is
working but found a bug, cannot seem to figure out a work around. Say
I have test.txt (1000kb) in f1 and f2 folders. Then I have test.txt
(2000kb) in f3 and f4 folders. The duplicate won't be found that is
in f3 and f4, since it won't match the original size of 1000kb.

The code you posted is doing exactly what you told it to do: it considers
files distinct if they differ in size, even if the file names are the same.

What do you want us to do?

Figure out what you want to do and then do it for you?

You are the person who decides what is a duplicate and what is not.

Also, post code that can be run with no effort on the part of the reader.
That means a short but *complete* example.

Sinan
 
Thomas Kratz

Hi,

I found some code on here to find duplicate file names, but cannot
figure out how to add a check to make sure the file size is also the
same then only add it to the duplicates hash.

Here is the code, thanks for any help.

First: Please read the posting guidelines posted here regularly. You
should post a short and *complete* example, not snippets.
find (\&check_file, $dir)

sub check_file {
if (-d $_) { next; }
else
{

You seem not to have

use strict;
use warnings;

at the top of your script. Otherwise you would have seen a warning about
exiting a sub via next. Replace the first three lines with:

return if -d;

Testing $_ is the default.
if ($seen{$_})
{
if (exists $duplicates{$_})

if (
    exists $duplicates{$_} and
    -s == -s $duplicates{$_}->[-1]
)
{
push (@{$duplicates{$_}}, $File::Find::name);
}
else
{
$duplicates{$_} = [$seen{$_}, $File::Find::name];
}
}
else
{
$seen{$_} = $File::Find::name;
}
}
}

(untested, because no runnable script provided)

By the way: it would be better not to use tabs for indenting. Every decent
editor should have an option to indent with spaces.

Thomas

--
$/=$,,$_=<DATA>,s,(.*),$1,see;__END__
s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
$_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
'%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
 
Thomas Kratz

I added the size function, still trying to learn hashes. It is working
but found a bug, cannot seem to figure out a work around. Say I have
test.txt (1000kb) in f1 and f2 folders. Then I have test.txt (2000kb)
in f3 and f4 folders. The duplicate won't be found that is in f3 and
f4, since it won't match the original size of 1000kb.

Any ideas on how to get around this?

Use your original algorithm and go over the duplicates in a second pass
(perhaps sorting them by filesize first). This is clearer and saves you
from keeping the complete directory information around for comparison.
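
A rough, untested sketch of that two-pass idea (the directory and
variable names are just placeholders, not your actual script):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $dir = 'foo';    # placeholder directory

# First pass: collect every full path under each file name.
my %by_name;
find( sub {
    return if -d;
    push @{ $by_name{$_} }, $File::Find::name;
}, $dir );

# Second pass: within each name, group the paths by size and keep only
# the groups that hold more than one file.
my @duplicates;
for my $paths ( values %by_name ) {
    next unless @$paths > 1;
    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @$paths;
    push @duplicates, grep { @$_ > 1 } values %by_size;
}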

Thomas

--
$/=$,,$_=<DATA>,s,(.*),$1,see;__END__
s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
$_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
'%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
 
Fabian Pilkowski

I added the size function, still trying to learn hashes. It is working
but found a bug, cannot seem to figure out a work around. Say I have
test.txt (1000kb) in f1 and f2 folders. Then I have test.txt (2000kb)
in f3 and f4 folders. The duplicate won't be found that is in f3 and
f4, since it won't match the original size of 1000kb.

Any ideas on how to get around this?

Well, checking the filesize with -s() isn't that difficult. Your problem
is exactly what you just described. I suggest working with inner hashes
(instead of arrays) -- you can use the filesize as an unambiguous key.
Also, your destination hash %duplicates should be replaced by an array,
since you want to store the duplicates of "test.txt" twice.

Others have already pointed out some shortcomings in your code, so I
won't repeat them. I would do something similar to:


#!/usr/bin/perl -w
use strict;
use File::Find;

my $dir = 'foo';

# group full paths first by file name, then by file size
my %seen;
find( sub {
    return if -d;
    push @{ $seen{ $_ }->{ -s _ } }, $File::Find::name;
}, $dir );

# a duplicate group is any (name, size) bucket with more than one path
my @duplicates;
for ( values %seen ) {
    for ( values %$_ ) {
        push @duplicates, $_ if @$_ > 1;
    }
}

use Data::Dumper;
# print Dumper \%seen;
print Dumper \@duplicates;
__END__
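
Just to illustrate the structure this builds: each entry of @duplicates
is an array ref holding the full paths of files that share both name and
size, so a simple (untested) report loop could look like

for my $group (@duplicates) {
    print scalar(@$group), " files with the same name and size:\n";
    print "    $_\n" for @$group;
}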


regards,
fabian
 
Alan

Hi,

I like a programming tool, but a finished product saves your valuable
time. Try NoClone and you can see how it was programmed. However, it was
programmed in VB and runs on Windows.
NoClone finds and removes duplicate mp3s, photos and any other type of
file by true byte-by-byte comparison. The time-saving, unique Smart Marker
filters duplicates for removal. Preview images, plus flexible removal
and archival options.
http://noclone.net

Alan
 
