[perl-python] a program to delete duplicate files

Jeff Shannon

Patrick said:
I am not an expert in this field. All I know is that MD5 and SHA1 can
create collisions. Are there stronger algorithms that do not? And, more
importantly, has it been *proved* that they do not?

I'm not an expert either, but I seem to remember reading recently
that, while it's been proven that it's possible for SHA1 to have
collisions, no actual collisions have been found. Even if that's not
completely correct, you're *far* more likely to be killed by a
meteorite than to stumble across a SHA1 collision. Heck, I'd expect
that it's more likely for civilization to be destroyed by a
dinosaur-killer-sized meteor.

With very few exceptions, if you're contorting yourself to avoid SHA1
hash collisions, then you should also be wearing meteor-proof (and
lightning-proof) armor everywhere you go. (Those few exceptions would
be cases where a malicious attacker stands to gain enough from
constructing a single hash collision to make it worthwhile to invest a
*large* number of petaflops of processing power.) Sure, it's not "100%
perfect", but... how perfect do you *really* need it to be?
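
A rough way to put a number on this is the birthday bound. The sketch
below assumes SHA1 behaves like an ideal 160-bit random function (an
idealisation) and that nobody is deliberately constructing collisions,
so it says nothing about the malicious-attacker case. The function
name is made up for illustration.

```python
from fractions import Fraction


def collision_probability_bound(num_files: int, hash_bits: int = 160) -> Fraction:
    """Upper bound on the chance that any two of num_files distinct
    inputs share a digest.

    Birthday (union) bound: (number of pairs) / 2**hash_bits.
    Assumes the hash output is uniformly random, which is an
    idealisation and ignores deliberate attacks.
    """
    pairs = num_files * (num_files - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


if __name__ == "__main__":
    # Print the bound for a few (hypothetical) collection sizes.
    for n in (10**6, 10**9, 10**12):
        bound = collision_probability_bound(n)
        print(f"{n:>17,d} files: p <= {float(bound):.3e}")
```

Even for a billion files the bound stays many orders of magnitude
below anything you'd worry about in everyday life, which is the point
of the meteorite comparison.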

Jeff Shannon
 
David Eppstein

John J. Lee said:
Hmm, Patrick's right, David, isn't he?

Yes, I was only considering pairwise comparisons. As he says,
simultaneously comparing all files in a group would avoid repeated reads
without the CPU overhead of a strong hash. Assuming you use a system
that allows you to have enough files open at once...
And I'm not sure what the trade-off between disk seeks and disk reads
does to the problem, in practice (with caching and realistic memory
constraints).

Another interesting point.
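
The group-wise comparison discussed above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not
anyone's posted code: the function names and the chunk size are made
up, files are first grouped by size, and every file in a group is
opened at the same time, so a large group can hit the OS limit on open
files. It makes no attempt to address the seek-versus-read trade-off.

```python
import os
from collections import defaultdict
from contextlib import ExitStack

CHUNK_SIZE = 64 * 1024  # assumed read size; tune for your disk


def split_identical(paths, chunk_size=CHUNK_SIZE):
    """Partition files into lists of byte-identical files.

    All files are opened at once and read in lockstep, one chunk at a
    time. Each file is read at most once and no hash is computed.
    """
    with ExitStack() as stack:
        handles = [(p, stack.enter_context(open(p, "rb"))) for p in paths]
        pending = [handles]  # groups that are still indistinguishable
        finished = []
        while pending:
            group = pending.pop()
            if len(group) < 2:
                continue  # a file on its own cannot be a duplicate
            # Bucket the group's members by the next chunk they contain.
            buckets = defaultdict(list)
            for path, fh in group:
                buckets[fh.read(chunk_size)].append((path, fh))
            for chunk, members in buckets.items():
                if chunk == b"":
                    # These files hit end-of-file together: identical.
                    if len(members) > 1:
                        finished.append([p for p, _ in members])
                else:
                    pending.append(members)
        return finished


def find_duplicates(paths):
    """Group paths by size, then compare each same-size group."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    duplicates = []
    for same_size in by_size.values():
        if len(same_size) > 1:
            duplicates.extend(split_identical(same_size))
    return duplicates
```

Calling find_duplicates() on a list of file paths returns a list of
lists, each holding the paths of files with identical contents. A file
drops out of the comparison as soon as its next chunk differs from
every other file's, so unique files are usually abandoned early.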
 
Claudio Grondi

I'll post my version in a few days.
Have I missed something?
Where can I see your version?

Claudio
 
Xah Lee

Sorry, I've been busy...

Here's the Perl code. I have yet to clean it up and make it
compatible with the cleaned-up spec above. As it stands, the code
performs the same algorithm as the spec; it just doesn't print the
output in the specified form. In a few days, I'll post a clean version,
and also a Python version, as well as a sample directory for testing
purposes. (The Perl code has been through many rounds of testing and is
considered correct.)

The Perl code currently comes in 3 files:

Combo114.pm
Genpair114.pm
del_dup.pl

The main program is del_dup.pl. Run it on the command line as
described in the spec. If you want to actually delete the duplicate
files, uncomment the "unlink" line at the bottom. Note: the module
names don't have any significance.


Note: here are also these Python files, ready to go for the final Python
version. Possibly the final program should be just a single file...

Combo114.py
Genpair114.py


Here are the files: del_dup.zip
-----
To get the code and full details with the latest updates, please see:
http://xahlee.org/perl-python/delete_dup_files.html

Xah
http://xahlee.org/PageTwo_dir/more.html
 
