Does hashlib support a file mode?

Phlip

Pythonistas:

Consider this hashing code:

import hashlib
file = open(path, 'rb')
m = hashlib.md5()
m.update(file.read())
digest = m.hexdigest()
file.close()

If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

So if I do the stream trick - read one byte, update one byte, in a
loop - then I'm essentially dragging that movie through 8 bits of a
64-bit CPU. So that's the same problem; it would still be slow.

So now I try this:

sum = os.popen('sha256sum %r' % path).read()

Those of you who like to lie awake at night thinking of new ways to
flame abusers of 'eval()' may have a good vent, there.

Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

Prematurely optimizingly y'rs
 
Thomas Rachel

On 06.07.2011 07:54, Phlip wrote:
Pythonistas:

Consider this hashing code:

import hashlib
file = open(path)
m = hashlib.md5()
m.update(file.read())
digest = m.hexdigest()
file.close()

If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

So if I do the stream trick - read one byte, update one byte, in a
loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
CPU. So that's the same problem; it would still be slow.

Yes. That is why you should read with a reasonable block size, not too
small and not too big.

def filechunks(f, size=8192):
    while True:
        s = f.read(size)
        if not s:
            break
        yield s
    # f.close()  # maybe...

import hashlib
file = open(path, 'rb')
m = hashlib.md5()
fc = filechunks(file)
for chunk in fc:
    m.update(chunk)
digest = m.hexdigest()
file.close()

So you are reading in 8 KiB chunks. Feel free to modify this - maybe use
os.stat(path).st_blksize instead (which is AFAIK the recommended
minimum), or a value of about 1 MiB...
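A sketch of that block-size lookup (the helper name and the 8 KiB
fallback are mine, just for illustration; st_blksize is POSIX-only):

```python
import os

def pick_block_size(path, default=8192):
    # st_blksize is the filesystem's preferred I/O size (POSIX only);
    # fall back to a fixed default where it is missing or unreadable.
    try:
        return os.stat(path).st_blksize or default
    except (AttributeError, OSError):
        return default
```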

So now I try this:

sum = os.popen('sha256sum %r' % path).read()

This is not as nice as the above, especially not with a path containing
strange characters. What about, at least,

def shellquote(*strs):
    return " ".join([
        "'" + st.replace("'", "'\\''") + "'"
        for st in strs
    ])

sum = os.popen('sha256sum %s' % shellquote(path)).read()


or, even better,

import subprocess
sp = subprocess.Popen(['sha256sum', path],
                      stdin=subprocess.PIPE, stdout=subprocess.PIPE)
sp.stdin.close()  # generate EOF
sum = sp.stdout.read()
sp.wait()

?

Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

AFAIK not.


Thomas
 
Chris Rebert

Pythonistas:

Consider this hashing code:

 import hashlib
 file = open(path)
 m = hashlib.md5()
 m.update(file.read())
 digest = m.hexdigest()
 file.close()

If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

So if I do the stream trick - read one byte, update one byte, in a
loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
CPU. So that's the same problem; it would still be slow.

So now I try this:

 sum = os.popen('sha256sum %r' % path).read()

Those of you who like to lie awake at night thinking of new ways to
flame abusers of 'eval()' may have a good vent, there.

Indeed (*eyelid twitch*). That one-liner is arguably better written as:
sum = subprocess.check_output(['sha256sum', path])
Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

Barring undocumented voodoo, no, it doesn't appear to. You could
always read from the file in suitably large chunks instead (rather
than byte-by-byte, which is indeed ridiculous); see
io.DEFAULT_BUFFER_SIZE and/or the os.stat() trick referenced therein
and/or the block_size attribute of hash objects.
http://docs.python.org/library/io.html#io.DEFAULT_BUFFER_SIZE
http://docs.python.org/library/hashlib.html#hashlib.hash.block_size
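Sketching those hints together (the function name is mine; binary mode
matters so the digest matches what command-line tools report):

```python
import hashlib
import io

def file_md5(path):
    m = hashlib.md5()
    # 'rb' is essential: text mode would translate newlines on some
    # platforms and silently change the digest.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(io.DEFAULT_BUFFER_SIZE)
            if not chunk:
                break
            m.update(chunk)
    return m.hexdigest()
```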

Cheers,
Chris
 
Anssi Saari

Phlip said:
If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

I did a crc32 calculator like that and actually ran into some kind of
string length limit with large files. So I switched to 4k blocks and
the speed is about the same as a C implementation in the program
cksfv. Well, of course crc32 is usually done with a table lookup, so
it's always fast.

I just picked 4k, since it's the page size in x86 systems and also a
common block size for file systems. Seems to be big enough.
io.DEFAULT_BUFFER_SIZE is 8k here. I suppose using that would be the
proper way.
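The table-driven CRC Anssi mentions is what zlib exposes; a chunked
version along those lines might look like this (function name and the
4k default are assumptions for illustration):

```python
import zlib

def file_crc32(path, blocksize=4096):
    # zlib.crc32 takes the running CRC as its second argument,
    # so the file never has to fit in memory at once.
    crc = 0
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF  # mask for a consistent unsigned result
```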
 
Adam Tauno Williams

Pythonistas
Consider this hashing code:
import hashlib
file = open(path)
m = hashlib.md5()
m.update(file.read())
digest = m.hexdigest()
file.close()
If the file were huge, the file.read() would allocate a big string and
thrash memory. (Yes, in 2011 that's still a problem, because these
files could be movies and whatnot.)

Yes, the simple rule is: do not *ever* call file.read() on a large file
without a size argument. No matter what the year, this will never be
OK. Always chunk reads of a file into reasonable I/O blocks.

For example, I use this function to copy a stream and return a SHA512
along with the stream's size:

def write(self, in_handle, out_handle):
    m = hashlib.sha512()
    data = in_handle.read(4096)
    while True:
        if not data:
            break
        m.update(data)
        out_handle.write(data)
        data = in_handle.read(4096)
    out_handle.flush()
    return (m.hexdigest(), in_handle.tell())
Does hashlib have a file-ready mode, to hide the streaming inside some
clever DMA operations?

Chunk it to something close to the block size of your underlying
filesystem.
 
Phlip

wow, tx y'all!

I forgot to mention that hashlib itself is not required; I could also
use Brand X. But y'all agree that blocking up the file in Python adds
no overhead to hashing each block in C, so hashlib in a loop it is!
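For completeness, one way that loop might be written without an
explicit break, using the two-argument form of iter() (the function
name and block size below are illustrative, not from any post above):

```python
import hashlib

def hash_file(path, algorithm='sha256', blocksize=8192):
    m = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling the lambda and yields
        # each result until it returns b'' - i.e. end of file.
        for chunk in iter(lambda: f.read(blocksize), b''):
            m.update(chunk)
    return m.hexdigest()
```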
 
