Create MD5 checksums of files in directories and subdirectories

nicogroen

Can somebody help me out with the following problem? I tried to use
the following script by Ron Savage to create MD5 checksums of files in
a directory and all its subdirectories, posted here:

http://groups.google.nl/[email protected]&rnum=2

On OpenBSD:
It takes a long time to create MD5 checksums of large files (about 4
seconds for a 3 MB file, 12 seconds for a 5.5 MB file, 43 seconds for a
10.5 MB file).

On Windows:
Files having the same file size (all 14.5 MB) produce the same MD5
checksum. This process goes very fast (perhaps too fast).

On Redhat and FreeBSD:
The script returns the following error message:

can't open (#path#
): No such file or directory at md5.pl line 39.

The script should work on all operating systems.

Thanks in advance,
Nico
 
James Willmore

Can somebody help me out with the following problem? I tried to use the
following script by Ron Savage to create MD5 checksums of files in a
directory and all its subdirectories, posted here:

http://groups.google.nl/[email protected]&rnum=2

On OpenBSD:
It takes a long time to create MD5 checksums of large files (about 4
seconds for a 3 MB file, 12 seconds for a 5.5 MB file, 43 seconds for a
10.5 MB file).

On Windows:
Files having the same file size (all 14.5 MB) produce the same MD5
checksum. This process goes very fast (perhaps too fast).

On Redhat and FreeBSD:
The script returns the following error message:

can't open (#path#
): No such file or directory at md5.pl line 39.

The script should work on all operating systems.

Yes, it should (and appears it has) work(ed) on almost all platforms -
because you noticed a difference in the execution times :)

Posting your code would be helpful :)

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
I'll defend to the death your right to say that, but I never
said I'd listen to it! -- Tom Galloway with apologies to
Voltaire
 
James Willmore

[ ... ]
Posting your code would be helpful :)

My bad, you did post code :)

I ran it on ye olde Linux box and it worked up until it ran into a
directory that I had no permission to access ... bummer :-(

Your execution time will depend greatly on the OS and the filesystem
being accessed. That's not the script's fault (in most cases).

IMHO, you might be able to speed up the script by using File::Find instead
of using Cwd.
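
Off the top of my head, a rough, untested sketch of what that could look
like (my own layout and names, not Ron's script):

#!/usr/bin/perl -w
# rough sketch: walk the directories with File::Find and digest each file
use strict;
use File::Find;
use Digest::MD5;

find(sub {
    return unless -f;           # plain files only
    open my $fh, '<', $_
        or warn "can't open $File::Find::name: $!\n" and return;
    binmode $fh;                # keep CRLF translation out of the digest
    print Digest::MD5->new->addfile($fh)->hexdigest, "  $File::Find::name\n";
    close $fh;
}, @ARGV ? @ARGV : '.');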

Another option is to use this script as a filter and use a command native
to the OS to feed files to the script. Meaning, use `find` (on *nix) and pipe the
output to the script you're working with. Now your only concern is checking
the MD5 digest of each file the script is being fed :) An added plus
to this idea is ... you can check one -or- many files with your script
without the script having to figure out *how* to find the files (using Cwd
or File::Find).

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
It is very difficult to prophesy, especially when it pertains to
the future.
 
James Willmore

Can somebody help me out with the following problem? I tried to use
the following script by Ron Savage to create MD5 checksums of files in
a directory and all its subdirectories, posted here:

http://groups.google.nl/[email protected]&rnum=2

On OpenBSD:
It takes a long time to create MD5 checksums of large files (about 4
seconds for a 3 MB file, 12 seconds for a 5.5 MB file, 43 seconds for a
10.5 MB file).

On Windows:
Files having the same file size (all 14.5 MB) produce the same MD5
checksum. This process goes very fast (perhaps too fast).

On Redhat and FreeBSD:
The script returns the following error message:

can't open (#path#
): No such file or directory at md5.pl line 39.

The script should work on all operating systems.

I figured I'd post an example of what I meant by 'filter'. You'll
notice that I used 'warn' instead of 'die' if the script can't digest
a file. This will prevent the script from bombing out if it's run from
cron or some other unattended method, or by an ordinary user. The output
format of the script is, again, your call. And whether to sort (or not)
is your call too.

I tested it on a Linux box with the following command:
find /home/jim | perl news.pl | sort

However, I'd refine the find command to avoid picking up files from other
filesystems, NFS mounts, symlinks, etc. And if you use it on a
Windows box, you'll have to find the right switches for `dir`. I
don't really do Windows :)
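
For example, something like this (switches from memory, so double-check
them on your system):

find /home/jim -xdev -type f | perl news.pl | sort

-xdev keeps find on one filesystem, and -type f already skips
directories and symlinks. On Windows, I *think* the rough equivalent is

dir /s /b /a-d C:\somedir | perl news.pl | sort

(/s recurses, /b gives bare full paths, /a-d leaves out directories) -
but verify that on a real Windows box.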

I didn't benchmark it. There will be variations on this, because each
OS and filesystem is different.

Enjoy :)

==start (what I called news.pl)==
#!/usr/gnu/bin/perl -w
#
# Name:
#   MD5.pl.
#
# Purpose:
#   Calculate the MD5 digest of all files in a directory and its
#   subdirectories.
#
# Parameter:
#   File(s) provided to the script on STDIN, one name per line.
#
# Output:
#   Digest of each file read from STDIN.
#
# Output format:
#   <Dirname/File name>: <MD5>\n
#   <Dirname/File name>: <MD5>\n
#   ...

use strict;

use Digest::MD5;

# -------------------------------------------------------------------
my $md5 = Digest::MD5->new();

while (<>) {
    chomp;
    newprocess($md5, $_);
}

sub newprocess {
    my ($md5, $file) = @_;
    # warn (not die) so one unreadable file doesn't kill the whole run
    open(my $fh, '<', $file)
        or warn "FAILED TO DIGEST $file: $!\n" and return;
    binmode $fh;    # important on Windows: no CRLF translation in the digest
    # addfile() reads the handle in chunks, so big files aren't slurped into
    # memory; hexdigest() also resets the object for the next file
    print "$file: ", $md5->addfile($fh)->hexdigest(), "\n";
    close $fh;
}

==end==
 
Michele Dondi

Can somebody help me out with the following problem? I tried to use
the following script by Ron Savage to create MD5 checksums of files in
a directory and all its subdirectories, posted here:

http://groups.google.nl/[email protected]&rnum=2

I didn't see that: well, on *nix (linux), I'd just do

find <dir> -type f | xargs md5sums

but if you want it in Perl, running on virtually any system perl
runs on, then see if something like this works for you or can be
adapted to your needs:

#!/usr/bin/perl -l

use strict;
use warnings;
use File::Find;
use Digest::MD5;

@ARGV = grep { -d or !warn "`$_': not a directory!\n" } @ARGV;
die "Usage: $0 <dir> [<dirs>]" unless @ARGV;

find { no_chdir => 1,
       wanted   => sub {
           return unless -f;
           open my $fh, '<:raw', $_ or
               warn "Can't open `$_': $!\n" and return;
           print Digest::MD5->new->addfile($fh)->hexdigest,
               ' ', $_;
       } }, @ARGV;

__END__
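
(To be explicit: assuming you save that as, say, md5dirs.pl - my name for
it, pick your own - you'd run it as

perl md5dirs.pl /some/dir /another/dir

and it prints one "<digest> <path>" line per file.)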


[Tested to work correctly in Linux (2.6.5) and W98...]
On Windows:
Files having the same file size (all 14.5 MB) produce the same MD5
checksum. This process goes very fast (perhaps too fast).

Huh?!? Are you *really* sure about that? Is there any chance that those
files don't just have the same file size but are actually identical?


HTH,
Michele
 
Michele Dondi

I didn't see that: well, on *nix (linux), I'd just do

find <dir> -type f | xargs md5sums

find <dir> -type f | xargs md5sum

actually! (sorry: a typo!)


Michele
 
Joe Smith

nicogroen said:
On OpenBSD:
It takes a long time to create MD5 checksums of large files (about 4
seconds for a 3 MB file, 12 seconds for a 5.5 MB file, 43 seconds for a
10.5 MB file).

That is expected if you are stuck with the pure-perl implementation of
MD5 as opposed to the compiled XS module.

On Windows:
Files having the same file size (all 14.5 MB) produce the same MD5
checksum. This process goes very fast (perhaps too fast).

On Redhat and FreeBSD:
The script returns the following error message:

can't open (#path#
): No such file or directory at md5.pl line 39.

The script should work on all operating systems.

It works on all systems where Digest::MD5 is properly installed.

Looks like you're running into the slow method that is invoked
whenever the MD5.so loadable object cannot be found.

eval {
    Digest::MD5->bootstrap($VERSION);   # Load the fast MD5.so object
};
if ($@) {
    eval {
        # Try to load the pure perl version if bootstrap fails
        require Digest::Perl::MD5;
        Digest::Perl::MD5->import(qw(md5 md5_hex md5_base64));
        push(@ISA, "Digest::Perl::MD5"); # make OO interface work
    };
}
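
If you want to check which implementation you actually ended up with, a
quick test (my own guess at one, based on the fallback above) is to see
whether the pure-perl module got pulled in:

perl -MDigest::MD5 -le 'print exists $INC{"Digest/Perl/MD5.pm"} ? "pure-perl fallback" : "XS version"'

(adjust the quoting for cmd.exe on Windows). If that prints "pure-perl
fallback", it would explain the slow digests.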

-Joe
 
Joe Smith

Michele said:
I didn't see that: well, on *nix (linux), I'd just do

find <dir> -type f | xargs md5sum

Not recommended for Samba shares or anywhere that file names
and/or directory names have embedded blanks.

find <dir> -type f -print0 | xargs -0 md5sum

-Joe
 
nicogroen

Thanks for your replies. My problem in Windows is solved by updating
the ActivePerl version (from 5.6.1 to 5.8.3).
 