Duplicate filenames with equal sizes too?

Discussion in 'Perl Misc' started by rishid@gmail.com, May 3, 2005.

  1. Guest

    Hi,

     I found some code on here to find duplicate file names, but I cannot
     figure out how to add a check that the file size is also the same
     before adding the file to the duplicates hash.

    Here is the code, thanks for any help.

     find(\&check_file, $dir);

     sub check_file {
         if (-d $_) { next; }
         else
         {
             if ($seen{$_})
             {
                 if (exists $duplicates{$_})
                 {
                     push(@{$duplicates{$_}}, $File::Find::name);
                 }
                 else
                 {
                     $duplicates{$_} = [$seen{$_}, $File::Find::name];
                 }
             }
             else
             {
                 $seen{$_} = $File::Find::name;
             }
         }
     }
     rishid@gmail.com, May 3, 2005
    #1

  2. Guest

     I added the size check; I'm still trying to learn hashes. It is
     working, but I found a bug and cannot figure out a workaround. Say I
     have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4 won't
     be found, since it won't match the original size of 1000 KB.

    Any ideas on how to get around this?

    Thanks a lot

     sub check_file {
         if (-d $_) { next; }
         else
         {
             if ($seen{$_})
             {
                 @Stats = stat($seen{$_});
                 $orgsize = $Stats[7];
                 @Stats = stat($File::Find::name);
                 $dupsize = $Stats[7];

                 if ($orgsize == $dupsize) {
                     if (exists $duplicates{$_})
                     {
                         push(@{$duplicates{$_}}, $File::Find::name);
                     }
                     else
                     {
                         $duplicates{$_} = [$seen{$_}, $File::Find::name];
                     }
                 }
             }
             else
             {
                 $seen{$_} = $File::Find::name;
             }
         }
     }
     rishid@gmail.com, May 3, 2005
    #2

  3. "" <> wrote in
    news::

     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same


    perldoc -f stat

     You should read the list of Perl functions and familiarize yourself
     with it:

    perldoc perlfunc

    That is also the first place to turn to when you wonder if Perl has a
    builtin that does what you want.
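
     For example, both of these give a file's size in bytes (the path here
     is just a placeholder):

     my $size = (stat 'some/file.txt')[7];   # element 7 of stat() is the size
     my $same = -s 'some/file.txt';          # the -s filetest returns it directly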

    Sinan
    A. Sinan Unur, May 3, 2005
    #3
  4. "" <> wrote in
    news::

     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.


     The code you posted is doing exactly what you told it to do: consider
     files distinct if they differ in size, even if the file names are the
     same.

    What do you want us to do?

    Figure out what you want to do and then do it for you?

    You are the person who decides what is a duplicate and what is not.

    Also, post code that can be run with no effort on the part of the reader.
    That means a short but *complete* example.

    Sinan
    A. Sinan Unur, May 3, 2005
    #4
  5. Thomas Kratz

     rishid@gmail.com wrote:
    > Hi,
    >
     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same
     > before adding the file to the duplicates hash.
    >
    > Here is the code, thanks for any help.


    First: Please read the posting guidelines posted here regularly. You
    should post a short and *complete* example, not snippets.

    >
    > find (\&check_file, $dir)
    >
     > sub check_file {
     >     if (-d $_) { next; }
     >     else
     >     {


    You seem not to have

    use strict;
    use warnings;

     at the top of your script. Otherwise you would have seen a warning
     about exiting a subroutine via next. Replace the first three lines
     with:

     return if -d;

     Testing $_ is the default.

     >         if ($seen{$_})
     >         {
     >             if (exists $duplicates{$_})


     if (
         exists $duplicates{$_} and
         -s == -s $duplicates{$_}->[-1]
     )

     >             {
     >                 push(@{$duplicates{$_}}, $File::Find::name);
     >             }
     >             else
     >             {
     >                 $duplicates{$_} = [$seen{$_}, $File::Find::name];
     >             }
     >         }
     >         else
     >         {
     >             $seen{$_} = $File::Find::name;
     >         }
     >     }
     > }
     >


     (untested, because no runnable script was provided)
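
     Pulling those pieces together, a minimal sketch (equally untested;
     two assumptions of mine: I store the size next to the first-seen path
     so each file is stat'ed only once, and I keep your original
     one-size-per-name behaviour):

     #!/usr/bin/perl
     use strict;
     use warnings;
     use File::Find;
     use Data::Dumper;

     my $dir = shift || '.';    # starting directory (placeholder default)
     my (%seen, %duplicates);

     find(\&check_file, $dir);
     print Dumper \%duplicates;

     sub check_file {
         return if -d;          # skip directories; $_ is the default operand
         my $size = -s _;       # reuse the stat buffer filled by the -d test

         if (my $first = $seen{$_}) {
             # same name seen before: record a duplicate only if the size
             # matches the first occurrence
             if ($size == $first->[1]) {
                 push @{ $duplicates{$_} ||= [ $first->[0] ] },
                     $File::Find::name;
             }
         }
         else {
             $seen{$_} = [ $File::Find::name, $size ];
         }
     }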

    By the way: it would be better not to use tabs for indenting. Every decent
    editor should have an option to indent with spaces.

    Thomas

    --
    $/=$,,$_=<DATA>,s,(.*),$1,see;__END__
    s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
    $_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
    '%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
    print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
    Thomas Kratz, May 3, 2005
    #5
  6. Thomas Kratz

     rishid@gmail.com wrote:
     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.
    >
    > Any ideas on how to get around this?


    Use your original algorithm and go over the duplicates in a second pass
    (perhaps sorting them by filesize first). This is clearer and saves you
    from keeping the complete directory information around for comparison.
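
     Roughly like this (my sketch; it assumes %duplicates maps each file
     name to an array of full paths, as in your original code):

     # second pass: split every name-based group by file size; only
     # sub-groups with more than one member are real duplicates
     my @real_dups;
     for my $paths (values %duplicates) {
         my %by_size;
         push @{ $by_size{ -s $_ } }, $_ for @$paths;
         push @real_dups, grep { @$_ > 1 } values %by_size;
     }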

    Thomas

    --
    $/=$,,$_=<DATA>,s,(.*),$1,see;__END__
    s,^(.*\043),,mg,@_=map{[split'']}split;{#>J~.>_an~>>e~......>r~
    $_=$_[$%][$"];y,<~>^,-++-,?{$/=--$|?'"':#..u.t.^.o.P.r.>ha~.e..
    '%',s,(.),\$$/$1=1,,$;=$_}:/\w/?{y,_, ,,#..>s^~ht<._..._..c....
    print}:y,.,,||last,,,,,,$_=$;;eval,redo}#.....>.e.r^.>l^..>k^.-
    Thomas Kratz, May 3, 2005
    #6
  7. rishid@gmail.com wrote:

     > I added the size check; I'm still trying to learn hashes. It is
     > working, but I found a bug and cannot figure out a workaround. Say
     > I have test.txt (1000 KB) in the f1 and f2 folders, and test.txt
     > (2000 KB) in the f3 and f4 folders. The duplicate in f3 and f4
     > won't be found, since it won't match the original size of 1000 KB.
    >
    > Any ideas on how to get around this?


     Well, checking the file size with -s() isn't that difficult. Your
     problem is exactly what you have just described. I suggest working
     with inner hashes (instead of arrays) -- you can use the file size as
     an unambiguous key. Also, your destination hash %duplicates should be
     replaced by an array, since you want to store the duplicates of
     "test.txt" twice.

     Others have already pointed out some shortcomings in your code, so I
     won't repeat them. I would do something similar to:


    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    my $dir = 'foo';
    my %seen;
     find( sub {
         return if -d;
         push @{ $seen{$_}->{ -s _ } }, $File::Find::name;
     }, $dir );

     my @duplicates;
     for ( values %seen ) {
         for ( values %$_ ) {
             push @duplicates, $_ if @$_ > 1;
         }
     }

    use Data::Dumper;
    # print Dumper \%seen;
    print Dumper \@duplicates;
    __END__
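
     With your example (test.txt in f1/f2 at one size and in f3/f4 at
     another), the dump should come out roughly like this -- both size
     groups are found (paths assumed, with $dir = 'foo'):

     # $VAR1 = [
     #           [ 'foo/f1/test.txt', 'foo/f2/test.txt' ],  # 1000 KB pair
     #           [ 'foo/f3/test.txt', 'foo/f4/test.txt' ],  # 2000 KB pair
     #         ];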


    regards,
    fabian
    Fabian Pilkowski, May 3, 2005
    #7
  8. Alan

    Hi,

     I like programming tools, but a finished product saves your valuable
     time. Try NoClone and you can see how it was programmed. It was
     written in VB, however, and runs on Windows.
     NoClone finds and removes duplicate mp3s, photos and any other type
     of file by true byte-by-byte comparison. Its time-saving Smart Marker
     feature filters duplicates for removal. It offers image previews and
     flexible removal and archival options.
    http://noclone.net

    Alan


    "" <> wrote in message news:<>...
    > Hi,
    >
     > I found some code on here to find duplicate file names, but I cannot
     > figure out how to add a check that the file size is also the same
     > before adding the file to the duplicates hash.
     >
     > Here is the code, thanks for any help.
     >
     > [original code snipped]
    Alan, May 13, 2005
    #8
