Efficient checksum calculation on large files


Ola Natvig

Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
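
To make the file-object problem concrete: CPython's default hash of a
file object is derived from the object's address, not from the file's
contents, so two handles on the same file hash differently (the file
name here is just an example):

>>> a = open("logo.png", "rb")
>>> b = open("logo.png", "rb")
>>> hash(a) == hash(b)   # identity-based, useless as a content ETag
False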
 

Robin Becker

Ola said:
Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module

def md5sum(fn):
    import subprocess
    return subprocess.Popen(["md5sum.exe", fn],
                            stdout=subprocess.PIPE).communicate()[0]

import time
t0 = time.time()
print md5sum('test.rml')
t1 = time.time()
print t1-t0

and got

C:\Tmp>md5sum.py
b68e4efa5e5dbca37718414f6020f6ff *test.rml

0.0160000324249


Tried with the original
C:\Tmp>timethis md5sum.exe test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005

b68e4efa5e5dbca37718414f6020f6ff *test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005
TimeThis : End Time : Tue Feb 08 16:12:26 2005
TimeThis : Elapsed Time : 00:00:00.437

C:\Tmp>ls -l test.rml
-rw-rw-rw- 1 user group 996688 Dec 31 09:57 test.rml

C:\Tmp>
 

Fredrik Lundh

Robin said:
well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)
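
btw, an untested variant: opening the file read-only and passing
access=mmap.ACCESS_READ should give the same result without requiring
write permission on the file:

import os, md5, mmap

f = open(fn, "rb")   # read-only is enough with ACCESS_READ
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)).hexdigest()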

</F>
 

Michael Hoffman

Ola said:
Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?

Is there a reason you can't use the sha module? Using a random large file I had
lying around:

sha.new(file("jdk-1_5_0-linux-i586.rpm").read()).hexdigest() # loads all into memory first

If you don't want to load the whole object into memory at once, you can always call out to the sha1sum utility yourself as well:
>>> subprocess.Popen(["sha1sum", ".bashrc"], stdout=subprocess.PIPE).communicate()[0].split()[0]
'5c59906733bf780c446ea290646709a14750eaad'
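
Or stay in Python and feed sha in blocks, so the file never has to fit
in memory. An untested sketch; the 64 KB block size is an arbitrary choice:

import sha

def sha1file(filename, blocksize=65536):
    # hash the file a block at a time instead of reading it all at once
    digest = sha.new()
    fileobj = file(filename, "rb")
    while True:
        data = fileobj.read(blocksize)
        if not data:
            break
        digest.update(data)
    fileobj.close()
    return digest.hexdigest()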
 

Michael Hoffman

Michael said:
Is there a reason you can't use the sha module?

BTW, I'm using SHA-1 instead of MD5 because of the reported vulnerabilities
in MD5, which may not be important for your application, but I consider it
best to just avoid MD5 entirely in the future.
 

Christos TZOTZIOY Georgiou

well md5sum is usable on many systems. I run it on win32 and darwin.

[snip use of some md5sum.exe]

Why not use the md5 module?

The following md5sum.py is in use and tested, but not "failproof".

import sys, os, md5
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        fp = file(filename, "rb")
        md5sum = md5.new()
        while True:
            data = fp.read(65536)
            if not data: break
            md5sum.update(data)
        fp.close()
        print md5sum.hexdigest(), filename

It's fast enough, especially if you cache results.
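
By caching I mean something along these lines (untested sketch, with the
read loop above wrapped up as an md5file() function): key on name, size
and modification time, and recompute only when the file changes.

import os

_digest_cache = {}

def cached_md5sum(filename):
    # size+mtime changes invalidate the cached digest
    st = os.stat(filename)
    key = (filename, st.st_size, st.st_mtime)
    if key not in _digest_cache:
        _digest_cache[key] = md5file(filename)
    return _digest_cache[key]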
 

Nick Craig-Wood

Ola Natvig said:
Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?

Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.046s
user 0m0.946s
sys 0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.033s
user 0m0.926s
sys 0m0.108s

$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file


"""
Re-implementation of md5sum in python
"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    fh = open(filename)
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])
 

Thomas Heller

Nick Craig-Wood said:
Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

But there's a perfectly working version in the Python distribution already:
Tools/scripts/md5sum.py

Thomas
 

Christos TZOTZIOY Georgiou

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
[snip]

My first reaction was that "r+" should be "r+b"... but then one presumes that an
mmap'ed file does not care about stdio text/binary conventions (on platforms
where that matters).
 

Nick Craig-Wood

Fredrik Lundh said:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)

But you won't be able to md5sum a file bigger than about 4 GB if using
a 32-bit processor (like x86), will you? (I don't know how the kernel /
user space VM split works on Windows, but on Linux 3 GB is the maximum
possible size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z
Traceback (most recent call last):
 

Nick Craig-Wood

Thomas Heller said:
Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

Yes, you are correct (good old Windows ;-)
But there's a perfectly working version in the Python distribution already:
Tools/scripts/md5sum.py

The above is easier to understand though.
 

Christos TZOTZIOY Georgiou

But you won't be able to md5sum a file bigger than about 4 GB if using
a 32-bit processor (like x86), will you? (I don't know how the kernel /
user space VM split works on Windows, but on Linux 3 GB is the maximum
possible size you can mmap.)

Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3 GiB, but ICBW :)
 

Nick Craig-Wood

Christos TZOTZIOY Georgiou said:
Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3 GiB, but ICBW :)

You are certainly right ;-)

However I did want to make the point that while mmap is extremely
attractive for certain things, it does limit you to files < 4 GB, which
is something that people don't always realise.
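
The two approaches combine naturally if you want mmap's speed without
the size limit; an untested sketch (the 1 GB cut-off is an arbitrary
safety margin, well under the 32-bit ceiling):

import os, md5, mmap

MMAP_LIMIT = 1 << 30   # 1 GB; arbitrary, adjust to taste

def md5file(filename):
    # mmap files that fit comfortably, stream the rest in blocks
    size = os.path.getsize(filename)
    f = open(filename, "rb")
    try:
        if 0 < size < MMAP_LIMIT:
            m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
            return md5.new(m).hexdigest()
        digest = md5.new()
        while True:
            buf = f.read(65536)
            if not buf:
                break
            digest.update(buf)
        return digest.hexdigest()
    finally:
        f.close()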
 
