md5 and large files

Brad Tilley

I have some large files (between 2 & 4 GB) that I want to do a few
things with. Here's how I've been using the md5 module in Python:

import md5

# 'path' and 'f' come from the surrounding code that walks the directory
original = open(path + f, 'rb')
data = original.read(4096)      # only the first 4096 bytes
original.close()
verify = md5.new(data)
print verify.hexdigest(), f

Is reading the first 4096 bytes of the files and calculating the md5 sum
based on that sufficient for uniquely identifying the files or am I
going about this totally wrong? Any advice or ideas appreciated.
 
Jeff Epler

It seems likely that 2 files would have the same 4k "preamble".

For instance, a unix tar file containing a 16k "file1" and then a 1k
"file2" would have the same leading bytes as a unix tar file containing
a 16k "file1" and a 1k "file3", and therefore the md5sum over the first
4k would match. (these two tar files would also have the same byte
length)

If all pages on some website begin
<HTML>
<HEAD>
<SCRIPT> pages and pages of javascript here (at least 4k) </SCRIPT>
<TITLE> ...
the initial 4k might match, too.

But anyway, if s1 != s2, then the odds that hash(s1) == hash(s2) should
be small, and that shouldn't depend on the length of the string.

Jeff

 
Martin v. Löwis

Brad said:
Is reading the first 4096 bytes of the files and calculating the md5 sum
based on that sufficient for uniquely identifying the files or am I
going about this totally wrong? Any advice or ideas appreciated.

Clearly, you need to use the same procedure for later verification. The
usual approach is to compute the md5sum for the entire file.

Whether this is sufficient somewhat depends on what you want to achieve:
- uniquely identify the file: this works reliably if there is some
guarantee that no two such files will be identical within the first
4096 bytes. If your files are, say, log files with different starting
dates, and the log file lines contain the starting dates, this is a
safe assumption. If these are different versions of essentially the
same file (e.g. different compilations of the same source code), I
would not bet that different files already differ within the first
4096 bytes.

- verify that the file is not corrupted, tampered with, modified.
Your approach is clearly insufficient, as it can only detect
modifications within the first 4096 bytes.

Regards,
Martin
 
Brad Tilley

Martin said:
Clearly, you need to use the same procedure for later verification. The
usual approach is to compute the md5sum for the entire file.

Whether this is sufficient somewhat depends on what you want to achieve:
- uniquely identify the file: this works reliably if there is some
guarantee that no two such files will be identical within the first
4096 bytes. If your files are, say, log files with different starting
dates, and the log file lines contain the starting dates, this is a
safe assumption. If these are different versions of essentially the
same file (e.g. different compilations of the same source code), I
would not bet that different files already differ within the first
4096 bytes.

- verify that the file is not corrupted, tampered with, modified.
Your approach is clearly insufficient, as it can only detect
modifications within the first 4096 bytes.

I would like to verify that the files are not corrupt so what's the most
efficient way to calculate md5 sums on 4GB files? The machine doing the
calculations is a small desktop with 256MB of RAM.
 
Tim Peters

[Brad Tilley]
I would like to verify that the files are not corrupt so what's the most
efficient way to calculate md5 sums on 4GB files? The machine doing the
calculations is a small desktop with 256MB of RAM.

You'll find md5sum.py in your Python distribution. It reads 8KB at a
time, so requires little RAM regardless of file size. You can use it
directly, or copy the small calculation loop into your program. Note
that reading 4GB of data will take significant time no matter what you
do.
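
For reference, here is a minimal version of that read-and-update loop, as a
sketch (the function name is mine, not taken from md5sum.py):

import md5

def md5_of_file(path, bufsize=8192):
    # Read the file 8 KB at a time so memory use stays small
    # regardless of the file size.
    digest = md5.new()
    f = open(path, 'rb')
    while 1:
        block = f.read(bufsize)
        if not block:
            break
        digest.update(block)
    f.close()
    return digest.hexdigest()

A larger buffer (say 64 KB) should behave much the same; as Tim says, the
time is dominated by reading the 4 GB of data.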
 
Nelson Minar

Brad Tilley said:
I would like to verify that the files are not corrupt so what's the
most efficient way to calculate md5 sums on 4GB files? The machine
doing the calculations is a small desktop with 256MB of RAM.

If all you want to do is verify that a file is not corrupt, MD5 is the
wrong algorithm to use. Use something fast like crc32.

If you're worried about corruption anywhere in the file, then testing
the first 4k isn't going to help you very much.

If you really need it to be efficient, don't use Python. Use a native
program like md5sum or sum or something.

If this is your homework, you'll learn a lot more by figuring it out
yourself.
 
Josiah Carlson

If all you want to do is verify that a file is not corrupt, MD5 is the
wrong algorithm to use. Use something fast like crc32.

CRC32 is only useful for detecting line transmission errors on
relatively small blocks of data, and even then, it does poorly.

MD5 is also fairly quick. I can compute md5 checksums at roughly
10 MB/second with a 400 MHz processor.

If you really need it to be efficient, don't use Python. Use a native
program like md5sum or sum or something.

Since the md5 module is implemented in C, the only slow parts are the few
lines of Python and perhaps the I/O, though Python's I/O speed is comparable
to C's.


- Josiah
 
Andrew Dalke

Nelson said:
If all you want to do is verify that a file is not corrupt, MD5 is the
wrong algorithm to use. Use something fast like crc32.

How much faster is that in Python? It looks about the
same to me.
>>> import binascii, md5, time
>>> def crc32file(infile):
...     crc = 0
...     while 1:
...         s = infile.read(16384)
...         if not s:
...             return crc
...         crc = binascii.crc32(s, crc)
...
>>> def md5file(infile):
...     md5obj = md5.new()
...     while 1:
...         s = infile.read(16384)
...         if not s:
...             return md5obj.hexdigest()
...         md5obj.update(s)
...
>>> t1 = time.time()
>>> print md5file(open("/Users/dalke/databases/sprot/sprot40.dat"))
a2f54de61e4db857aadce04298ab177e
>>> t2 = time.time()
>>> print t2-t1
10.9378840923
>>> t1 = time.time()
>>> print crc32file(open("/Users/dalke/databases/sprot/sprot40.dat"))
-1921799528
>>> t2 = time.time()
>>> print t2-t1
10.7424199581
I think most of the time is spent doing I/O, not computing
the checksum. That's probably even true if written in C.

Andrew
 
Roger Binns

Brad said:
I would like to verify that the files are not corrupt so what's the
most efficient way to calculate md5 sums on 4GB files? The machine
doing the calculations is a small desktop with 256MB of RAM.

If you need to be 100% certain, then only doing md5sum over the entire
file will work as Tim points out.

If you don't need to be 100% certain, then you can pick random blocks
out of the file and check their sums only. The more blocks you use,
the more certain you will be.
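
A rough sketch of that random-block idea, assuming a function name and
sampling parameters of my own choosing (not Roger's):

import md5, os, random

def sample_md5(path, nblocks=32, blocksize=65536):
    # md5 over a handful of randomly chosen blocks instead of the whole file.
    size = os.path.getsize(path)
    total = (size + blocksize - 1) // blocksize    # number of blocks in the file
    chosen = sorted(random.sample(xrange(total), min(nblocks, total)))
    digest = md5.new()
    f = open(path, 'rb')
    for index in chosen:
        f.seek(index * blocksize)
        digest.update(f.read(blocksize))
    f.close()
    return digest.hexdigest()

Note that for verification both sides must sample the same offsets, so in
practice you would seed the random choice or store the chosen block indexes
alongside the digest.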

You may also want to consider external tools such as rsync which can
transfer large files, but also when the source changes sends the minimum
to keep the destination up to date. It will also cope correctly (and
efficiently) if network connections break during transfers.

Other things you can do is use the random method above to check a few
blocks (always check the beginning and end since they would be the most
likely to be corrupted), and then schedule a more thorough background
check on the files which you may be able to offload to another machine
or another time.

The one thing you don't want to do is check multiple files at the same
time. That will cause the disk heads to keep jumping around and drastically
slow overall disk throughput.

You also didn't say if you are actually having a performance problem. On
my machine, I did some timing tests against a 2GB file. Using the md5sum
program that comes with the operating system, it took 47.5 seconds. I
then tried the Python md5sum program. It took 57 seconds with 8 KB or 64 KB
buffer sizes but 65.8 seconds with a 1 MB buffer size. (The Python times
include the time to start and stop the interpreter.)

So in my case the machine can do 1GB every 30 seconds (ATA100 controller
and disk under Linux).

Roger
 
Paul Rubin

Brad Tilley said:
I would like to verify that the files are not corrupt so what's the
most efficient way to calculate md5 sums on 4GB files?

What kind of corruption are you talking about? The best way is just
to run md5 over the old file. You could use either the external
md5sum command, or use Python's md5 module. You'd use the md5.update
operation to feed the file through md5 a few kbytes at a time, and
then md5.digest or md5.hexdigest at the end to get the checksum. You
don't need to read the whole file into memory at once or anything like
that.
 
Roger Binns

Andrew said:
I think most of the time is spent doing I/O, not computing
the checksum. That's probably even true if written in C.

The OP said they were using 4GB files, so unless the machine
has more than that amount of memory, either algorithm on
the entire file will be almost entirely disk bound.

Roger
 
Tobias Pfeiffer

Hi!

If you need to be 100% certain, then only doing md5sum over the
entire file will work as Tim points out.

This is not true. I'd say there are quite a lot of 2 GB files that
produce the same md5 hash...

I think he has to see what he really wants to do with that file. If the
goal is "compute the md5sum", then a loop with md5.update() seems most
appropriate to me. If the goal is "check the equality" or "check whether
they are corrupted", why md5? He can just read small blocks from the file
and then do a simple string comparison. Might even be faster. And here,
the chance is really close to 100% he'd notice a change in the files. :)

Bye
Tobias
 
Peter L Hansen

Tobias said:
This is not true. I'd say there are quite a lot of 2 GB files that
produce the same md5 hash...

Without deliberately contriving an example using the recently
discovered technique, can you offer even a single example? ;-)

(If you were trying to point out that 100.00000000000% or whatever
is not possible with MD5, okay, but note that Roger didn't specify
the precision. 100% is close enough to what you'd get with MD5.)
I think he has to see what he really wants to do with that file. If the
goal is "compute the md5sum", then a loop with md5.update() seems most
appropriate to me. If the goal is "check the equality" or "check whether
they are corrupted", why md5? He can just read small blocks from the file
and then do a simple string comparison. Might even be faster.

Simple string comparisons with *what*? Are you assuming that there
is a known-good copy of the file sitting right next to it, that he
can compare against?
 
Peter Hickman

Tobias said:
I'd say there are quite a lot of 2 GB files that
produce the same md5 hash...

If you were concerned that two large files of the exact same size might be
different but produce the same MD5 then please mail me copies of these files!
Proof of an MD5 collision, which must happen, is rarely seen in the wild. But if
you are still concerned then write a program that reads each file in 32 KB
chunks (for example), creates an MD5 of each chunk, and compares them. If the
files are identical then they will match chunk for chunk.

The trouble is that you are now calculating the MD5 for two files, rather than
calculating one and comparing it to a known value.
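
A short sketch of that chunk-by-chunk comparison (my own illustration of the
idea; the function name is hypothetical):

import md5

def files_match(path_a, path_b, chunksize=32768):
    # Compare two files by hashing corresponding 32 KB chunks.
    a = open(path_a, 'rb')
    b = open(path_b, 'rb')
    try:
        while 1:
            chunk_a = a.read(chunksize)
            chunk_b = b.read(chunksize)
            if md5.new(chunk_a).digest() != md5.new(chunk_b).digest():
                return False
            if not chunk_a:            # both files ended at the same point
                return True
    finally:
        a.close()
        b.close()

Comparing the raw chunks directly (chunk_a != chunk_b) would work just as
well here; the per-chunk MD5 only earns its keep if you store the digests
and compare against them later instead of keeping a second copy of the file.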
 
Tobias Pfeiffer

Hi!

Without deliberately contriving an example using the recently
discovered technique, can you offer even a single example? ;-)

Of course I can't... *grin* -- But actually (correct me if I'm wrong), an
MD5 sum is 128 bits long, so there are 2^128 different possibilities. Now a
2 GB file has 8*2*1024^3 = 17179869184 bits, so there are 2^17179869184
different possibilities for a 2 GB file. Am I right in thinking that the
number of files with an identical md5 sum is now 2^17179869184 / 2^128 =
2^17179869056?
(If you were trying to point out that 100.00000000000% or whatever
is not possible with MD5, okay, but note that Roger didn't specify
the precision. 100% is close enough to what you'd get with MD5.)

And am I also right in thinking that the probability of getting the same md5
for two files is then 2^-128? OK, I think I have to admit that this
chance is small enough... *grin*
Simple string comparisons with *what*? Are you assuming that there
is a known-good copy of the file sitting right next to it, that he
can compare against?

That was why I said he has to know the goal of his project. I think if he
wants to compare the first 4K of the files, he also has to have a known-
good copy.

Bye
Tobias
 
Josiah Carlson

Of course I can't... *grin* -- But actually (correct me if I'm wrong), an
MD5 sum is 128 bits long, so there are 2^128 different possibilities. Now a
2 GB file has 8*2*1024^3 = 17179869184 bits, so there are 2^17179869184
different possibilities for a 2 GB file. Am I right in thinking that the
number of files with an identical md5 sum is now 2^17179869184 / 2^128 =
2^17179869056?

(I know this is a major simplification, but bear with me)

Your probability and numbers are correct. The problem is that 2^128 is
so damn huge, if a billion computers running at a billion md5s/sec were
all running, it would still take ~10,790,283,070,806 years to see all
md5 digests, assuming that every attempt produces a unique digest (until
we have actually seen them all).
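
A quick back-of-the-envelope check of that figure (my own arithmetic, not
from Josiah's post):

hashes_per_second = 10**9 * 10**9     # a billion machines, a billion MD5s each per second
seconds_per_year = 60 * 60 * 24 * 365
print 2**128 // (hashes_per_second * seconds_per_year)
# -> 10790283070806, i.e. roughly 10.8 trillion years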

Without a specific attack on md5 (and there is one), the probability of
choosing two arbitrary strings of K characters each that hash to the same
value with md5 is around 1 in 2^127. Put another way, you are more likely
to get hit by lightning while being a major candidate for the US
presidential election than to discover such a pair.

- Josiah
 
Brad Tilley

Josiah said:
(I know this is a major simplification, but bear with me)

Your probability and numbers are correct. The problem is that 2^128 is
so damn huge

All you need is 2^128+1 files to find a duplicate, no? The problem, as I
understand it, is getting to the end (2^128+1) as sufficient computing
power isn't available... yet.
 
Martin v. Löwis

Brad said:
All you need is 2^128+1 files to find a duplicate, no? The problem, as I
understand it, is getting to the end (2^128+1) as sufficient computing
power isn't available... yet.

No. After 2^128+1, you will have found two files with the same md5sum.
However, you won't necessarily have found a file that hashes the same
as your *original* file. You may well search the space of all possible
files (in all file lengths), and never find a single file that hashes
the same as your original file.

Regards,
Martin
 
Tobias Pfeiffer

Hi!

You may well search the space of
all possible files (in all file lengths), and never find a single
file that hashes the same as your original file.

Are you sure about that? I have no clue how the md5 algorithm works, but
I'd think one could prove that, with a large enough number of files, every
hash occurs twice. After all, md5 is not random.

Bye
Tobias
 
Martin v. Löwis

Tobias said:
Are you sure about that? I have no clue about what the md5 algorithm
works like, but I'd think one could prove that with an number large
enough, every hash occurs twice. At last, md5 is not random.

I'm not sure. However, I'm also not sure of the contrary.

It is well possible to define a hash function which produces certain
hash values only for a single input. Consider this hash function
(operating on byte strings):

def hash(s):
    if len(s) == 100:
        for c in s:
            if c != 'X': break
        else:
            return 0
    return len(s) + 1

This hash function returns 0 only for a string of 100 Xs.
For any other input, it returns a nonzero value.
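
A quick check of that claim (my own illustration, not from Martin's post):

print hash('X' * 100)         # -> 0
print hash('X' * 99 + 'Y')    # -> 101 (length 100, but not all Xs)
print hash('hello')           # -> 6 (length 5, plus one)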

So even though there are many collisions in the hash, there is
one input which never collides. The same may be true for md5. Whether
it actually is true is still an open question, AFAIK. The proof
could go either way:
- here is string S which hashes as md5(S), and I can prove that
no other string can possibly hash as md5(S).
- here is a procedure which, for every hash value H, produces two
distinct strings S1 and S2 so that H==md5(S1)==md5(S2).

Regards,
Martin
 
