Efficient checksum calculation on large files


Ola Natvig

Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
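
To make the file-object problem concrete: CPython's default hash of a
file object is derived from the object's address, not from the file's
contents, so two handles on the same file hash differently (the file
name here is just an example):

>>> a = open("logo.png", "rb")
>>> b = open("logo.png", "rb")
>>> hash(a) == hash(b)   # identity-based, useless as a content ETag
False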
 

Robin Becker

Ola said:
Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?
well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module

def md5sum(fn):
    import subprocess
    return subprocess.Popen(["md5sum.exe", fn],
                            stdout=subprocess.PIPE).communicate()[0]

import time
t0 = time.time()
print md5sum('test.rml')
t1 = time.time()
print t1-t0

and got

C:\Tmp>md5sum.py
b68e4efa5e5dbca37718414f6020f6ff *test.rml

0.0160000324249


Tried with the original
C:\Tmp>timethis md5sum.exe test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005

b68e4efa5e5dbca37718414f6020f6ff *test.rml

TimeThis : Command Line : md5sum.exe test.rml
TimeThis : Start Time : Tue Feb 08 16:12:26 2005
TimeThis : End Time : Tue Feb 08 16:12:26 2005
TimeThis : Elapsed Time : 00:00:00.437

C:\Tmp>ls -l test.rml
-rw-rw-rw- 1 user group 996688 Dec 31 09:57 test.rml

C:\Tmp>
 

Fredrik Lundh

Robin said:
well md5sum is usable on many systems. I run it on win32 and darwin.

I tried this in 2.4 with the new subprocess module

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)
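
btw, an untested variant: opening the file read-only and passing
access=mmap.ACCESS_READ should give the same result without requiring
write permission on the file:

import os, md5, mmap

f = open(fn, "rb")   # read-only is enough with ACCESS_READ
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)).hexdigest()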

</F>
 

Michael Hoffman

Ola said:
Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?

Is there a reason you can't use the sha module? Using a random large file I had
lying around:

sha.new(file("jdk-1_5_0-linux-i586.rpm").read()).hexdigest() # loads all into memory first

If you don't want to load the whole object into memory at once, you can always call out to the sha1sum utility yourself as well:
>>> subprocess.Popen(["sha1sum", ".bashrc"], stdout=subprocess.PIPE).communicate()[0].split()[0]
'5c59906733bf780c446ea290646709a14750eaad'
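
Or stay in Python and feed sha in blocks, so the file never has to fit
in memory. An untested sketch; the 64 KB block size is an arbitrary choice:

import sha

def sha1file(filename, blocksize=65536):
    # hash the file a block at a time instead of reading it all at once
    digest = sha.new()
    fileobj = file(filename, "rb")
    while True:
        data = fileobj.read(blocksize)
        if not data:
            break
        digest.update(data)
    fileobj.close()
    return digest.hexdigest()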
 

Michael Hoffman

Michael said:
Is there a reason you can't use the sha module?

BTW, I'm using SHA-1 instead of MD5 because of the reported vulnerabilities
in MD5, which may not be important for your application, but I consider it
best to just avoid MD5 entirely in the future.
 

Christos TZOTZIOY Georgiou

well md5sum is usable on many systems. I run it on win32 and darwin.

[snip use of some md5sum.exe]

Why not use the md5 module?

The following md5sum.py is in use and tested, but not "failproof".

import sys, os, md5
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        fp = file(filename, "rb")
        md5sum = md5.new()
        while True:
            data = fp.read(65536)
            if not data: break
            md5sum.update(data)
        fp.close()
        print md5sum.hexdigest(), filename

It's fast enough, especially if you cache results.
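
By caching I mean something along these lines (untested sketch, with the
read loop above wrapped up as an md5file() function): key on name, size
and modification time, and recompute only when the file changes.

import os

_digest_cache = {}

def cached_md5sum(filename):
    # size+mtime changes invalidate the cached digest
    st = os.stat(filename)
    key = (filename, st.st_size, st.st_mtime)
    if key not in _digest_cache:
        _digest_cache[key] = md5file(filename)
    return _digest_cache[key]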
 

Nick Craig-Wood

Ola Natvig said:
Hi all

Does anyone know of a fast way to calculate checksums for a large file?
I need a way to generate ETag keys for a webserver. ETags for large
files are not really necessary, but it would be nice if I could do it. I'm
using the Python hash function on dynamically generated strings (like
page content), but for things like images I use shutil's copyfileobj
function, and the hash of a file object is just its handle's memory address.

Does anyone know of a Python utility I could use, perhaps
something like the md5sum utility on *nix systems?

Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

I discarded the first run so both tests ran with large_file in the
cache.

$ time md5sum large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.046s
user 0m0.946s
sys 0m0.071s

$ time python md5sum.py large_file
e7668fdc06b68fbf087a95ba888e8054 large_file

real 0m1.033s
user 0m0.926s
sys 0m0.108s

$ ls -l large_file
-rw-r--r-- 1 ncw ncw 115933184 Jul 8 2004 large_file


"""
Re-implementation of md5sum in python
"""

import sys
import md5

def md5file(filename):
    """Return the hex digest of a file without loading it all into memory"""
    fh = open(filename)
    digest = md5.new()
    while 1:
        buf = fh.read(4096)
        if buf == "":
            break
        digest.update(buf)
    fh.close()
    return digest.hexdigest()

def md5sum(files):
    for filename in files:
        try:
            print "%s %s" % (md5file(filename), filename)
        except IOError, e:
            print >> sys.stderr, "Error on %s: %s" % (filename, e)

if __name__ == "__main__":
    md5sum(sys.argv[1:])
 

Thomas Heller

Nick Craig-Wood said:
Here is an implementation of md5sum in Python. It's the same speed,
give or take, as md5sum itself. This isn't surprising, since md5sum is
dominated by CPU usage of the MD5 routine (in C in both cases) and/or
I/O (also in C).

Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

But there's a perfectly working version in the Python distribution already:
Tools/scripts/md5sum.py

Thomas
 

Christos TZOTZIOY Georgiou

on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
[snip]

My first reaction was that "r+" should be "r+b"... but then one presumes that an
mmap'ed file does not care about stdio text/binary conventions (on platforms
where that matters).
 

Nick Craig-Wood

Fredrik Lundh said:
on my machine, Python's md5+mmap is a little bit faster than
subprocess+md5sum:

import os, md5, mmap

file = open(fn, "r+")
size = os.path.getsize(fn)
hash = md5.md5(mmap.mmap(file.fileno(), size)).hexdigest()

(I suspect that md5sum also uses mmap, so the difference is
probably just the subprocess overhead)

But you won't be able to md5sum a file bigger than about 4 GB if using
a 32-bit processor (like x86), will you? (I don't know how the kernel /
user space VM split works on Windows, but on Linux 3 GB is the maximum
possible size you can mmap.)

$ dd if=/dev/zero of=z count=1 bs=1048576 seek=8192
$ ls -l z
-rw-r--r-- 1 ncw ncw 8590983168 Feb 9 09:26 z
Traceback (most recent call last):
 

Nick Craig-Wood

Thomas Heller said:
Your code won't work correctly on Windows, since you have to open files
with mode 'rb'.

Yes, you are correct (good old Windows ;-)
But there's a perfectly working version in the Python distribution already:
Tools/scripts/md5sum.py

The above is easier to understand though.
 

Christos TZOTZIOY Georgiou

But you won't be able to md5sum a file bigger than about 4 GB if using
a 32-bit processor (like x86), will you? (I don't know how the kernel /
user space VM split works on Windows, but on Linux 3 GB is the maximum
possible size you can mmap.)

Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3 GiB, but ICBW :)
 

Nick Craig-Wood

Christos TZOTZIOY Georgiou said:
Indeed... but the context was efficiently calculating checksums for large files
to be /served/ by a webserver. I deduce it's almost certain that the files
won't be larger than 3 GiB, but ICBW :)

You are certainly right ;-)

However I did want to make the point that while mmap is extremely
attractive for certain things, it does limit you to files < 4 GB, which
is something that people don't always realise.
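
The two approaches combine naturally if you want mmap's speed without
the size limit; an untested sketch (the 1 GB cut-off is an arbitrary
safety margin, well under the 32-bit ceiling):

import os, md5, mmap

MMAP_LIMIT = 1 << 30   # 1 GB; arbitrary, adjust to taste

def md5file(filename):
    # mmap files that fit comfortably, stream the rest in blocks
    size = os.path.getsize(filename)
    f = open(filename, "rb")
    try:
        if 0 < size < MMAP_LIMIT:
            m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
            return md5.new(m).hexdigest()
        digest = md5.new()
        while True:
            buf = f.read(65536)
            if not buf:
                break
            digest.update(buf)
        return digest.hexdigest()
    finally:
        f.close()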
 
