How good are checksums?


Chris

I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

What if I also check to make sure the document lengths are the same?
 

Roedy Green

Chris said:
My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The logic I use in The Replicator and Untouch for this is to compute a
checksum and compare file lengths. This makes the odds of a false
duplicate very low. I use a fast Adler checksum.

Even without the length check, the odds are only 1 in 2^32 that any
given pair of different documents turns up as a false duplicate.

See http://mindprod.com/products.html#REPLICATOR
http://mindprod.com/projuntouch.html

I use this for redating files that have not really changed back to
their original dates, e.g. when macros expanded to the same thing as
last time.
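
For reference, a minimal sketch of that checksum-plus-length key in plain
JDK Java. java.util.zip.Adler32 is the standard class; the class and method
names below are only illustrative, not the actual Replicator code:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class QuasiKey {

    // Build a "checksum:length" key. Two files with different keys are
    // certainly different; files with equal keys are only probably equal.
    static String keyFor(Path file) throws IOException {
        Adler32 adler = new Adler32();
        byte[] buffer = new byte[64 * 1024];
        long length = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                adler.update(buffer, 0, n);
                length += n;
            }
        }
        return Long.toHexString(adler.getValue()) + ":" + length;
    }
}

Files with different keys are certainly different; files with equal keys
are only probably identical, which is where the finer checks discussed
later in the thread come in.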
 

Roedy Green

Chris said:
My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

You can also fold the timestamp into the checksum. Normally you would
consider two documents with the same timestamp and contents to be
duplicates even if they have different names.

If you don't want that, you can add the name, or dir+name, into the
checksum as well.
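
A minimal sketch of folding those extra fields into the checksum, using
the JDK's java.util.zip.CRC32. The class is standard; the method and the
choice of fields here are assumptions, shown only to illustrate the idea:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class DocKey {

    // Fold the content, the timestamp and (optionally) the name into one
    // CRC, so that "duplicate" means whatever you decide it means.
    static long checksum(byte[] content, long timestamp, String name) {
        CRC32 crc = new CRC32();
        crc.update(content);
        // fold in the timestamp, one byte at a time
        for (int shift = 56; shift >= 0; shift -= 8) {
            crc.update((int) (timestamp >>> shift) & 0xFF);
        }
        // drop this line if same content under different names should match
        crc.update(name.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}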
 

Eric Sosman

Chris said:
I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

Look up the "birthday paradox." If there are D documents
each with one of C equiprobable checksum values, the chance of
having no duplications at all is easily seen to be

(C/C) * ((C-1)/C) * ... * ((C-D+1)/C)

For large D this value would be, er, "tedious" to compute.
Also, it tells you only the probability of having no duplicates
(or of having one or more), whereas you're really interested in
how many duplicates you should expect.

There are probably fancier and more accurate ways to estimate
the average, but I think you could get a reasonable approximation
by considering the D*(D-1)/2 ~= D*D/2 pairs of documents. The
two documents in a pair have the same checksum value with the
probability 1/C. If the D*D/2 pairings were independent (they're
not, but we're approximating), you'd expect (D*D/2)/C matches.
For a 32-bit checksum and D ~= 100 million, this comes to about

100e6 * 100e6 / 2 / 4.3e9 ~= 1.2e6

pairs of false duplicates.

What if I also check to make sure the document lengths are the same?

That would increase the value of C, and hence decrease the
likelihood of duplicates. You could also increase C by using
two different checksums instead of just one. However, it's
unlikely that all of (1) CRC-32 and (2) Adler and (3) document
length will be 100% statistically independent; for a trivial
counterexample, consider 100 million one-byte documents. This
means that a pair of distinct documents is a little more likely
to have matching "signatures" than the probability 1/C suggests.
Still, it seems likely that using two as-different-as-possible
checksums and considering the lengths should get the number of
false duplicates down from 1.2 million to a more manageable
number.
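
To make Eric's estimate easy to rerun with other numbers, here is a tiny
sketch of the D*D/(2*C) approximation. The class and method names are made
up purely for illustration:

public class BirthdayEstimate {

    // Expected number of falsely-matching pairs among d documents hashed
    // into c equally likely checksum values: roughly d*d / (2*c).
    static double expectedCollidingPairs(double d, double c) {
        return d * d / 2.0 / c;
    }

    public static void main(String[] args) {
        double c32 = Math.pow(2, 32);
        System.out.println(expectedCollidingPairs(1e6, c32));   // ~116 pairs for 1 million docs
        System.out.println(expectedCollidingPairs(100e6, c32)); // ~1.16e6 pairs for 100 million docs
    }
}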
 

Roedy Green

Roedy Green said:
The logic I use in The Replicator and Untouch for this is to compute a
checksum and compare file lengths. This makes the odds of a false
duplicate very low. I use a fast Adler checksum.

Even without the length check, the odds are only 1 in 2^32 that any
given pair of different documents turns up as a false duplicate.

In The Replicator, it does not matter if two unrelated documents hash
to the same key, only whether the immediately previous version of a
document hashes to the same key and length.

Your problem is similar to what to do with hash table collisions.

See http://mindprod.com/jgloss/hashtable.html

Two different documents can hash to the same key, but two identical
documents cannot hash to different keys. If two documents hash to the
same key, you can do a finer check for duplicates, even a byte-by-byte
compare. This is not too painful if you do the I/O in whacking big
chunks of raw bytes.

Most of the time you won't have collisions, so you won't have to do
the compare.

If the two files have the same name and same timestamp, you should be
able to trust the OS that they are the same file without even
examining contents. :)
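
A minimal sketch of that finer byte-by-byte check, buffered so the
underlying I/O still happens in big chunks. The class and method names are
illustrative; newer JDKs also offer Files.mismatch, which does essentially
the same job:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExactCompare {

    // Called only when checksum and length already match, so it runs rarely.
    // The BufferedInputStreams keep the underlying reads in large chunks.
    static boolean sameContent(Path a, Path b) throws IOException {
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a), 64 * 1024);
             InputStream inB = new BufferedInputStream(Files.newInputStream(b), 64 * 1024)) {
            int byteA;
            int byteB;
            do {
                byteA = inA.read();
                byteB = inB.read();
                if (byteA != byteB) {
                    return false;
                }
            } while (byteA != -1);
            return true; // both streams ended together with no mismatch
        }
    }
}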
 

Michael Scovetta

I always use MD5(input) converted to hex for my hashes. Prepend the
file name, size, date, or whatever to it if that makes you feel more
comfortable, but I think you'll overload any database in existence
before you find a collision on even just the MD5 hash of the data.
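
For what it's worth, a minimal sketch of the MD5-to-hex step using the
JDK's java.security.MessageDigest (the helper name is just an illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Key {

    // 128-bit MD5 of the file contents, rendered as 32 hex characters.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}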

Mike Scovetta
 

Michael Borgwardt

Michael said:
I always use MD5(input) converted to hex for my hashes. Prepend the
file name, size, date, or whatever to it if that makes you feel more
comfortable, but I think you'll overload any database in existence
before you find a collision on even just the MD5 hash of the data.

The method of hash generation makes very little difference as long as
it's not broken. The length of the hash is what determines the
likelihood of collisions. MD5 is 128 bits, which is much better than
32 bits; that's all.
 

Michael Borgwardt

Can't you simply do an actual comparison of the document content
if the checksums are identical? That would avoid false duplicates
completely.
 

Tom McGlynn

Chris said:
I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

What if I also check to make sure the document lengths are the same?

I don't see any other posters who answered the first question.
Very roughly, you have a good chance of getting duplicate checksums
once the number of elements exceeds the square root of the number of
different checksum values. With 32 bits you have about 4x10^9
different checksums, so you might expect duplicates once the number
of files begins to exceed sqrt(4x10^9) ~= 65,000 -- well below the
numbers you are worried about. A 128-bit checksum, on the other hand,
could do just fine. So use a larger checksum, or use some of the
other methods suggested.

This is the same math that comes up when you ask how many people you
need in a room for it to be likely that two of them share a birthday.
There's a > 50% chance with 23 people, which is a bit more than the
square root of 365.
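
A small sketch that computes that probability exactly, using a log-sum so
the product doesn't get tedious (the names are purely illustrative):

public class BirthdayProbability {

    // Probability that d items drawn from c equally likely values contain
    // at least one duplicate, via the log of the no-collision product.
    static double collisionProbability(long d, double c) {
        double logNoCollision = 0.0;
        for (long i = 1; i < d; i++) {
            logNoCollision += Math.log1p(-i / c);
        }
        return 1.0 - Math.exp(logNoCollision);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365));                 // ~0.507, the classic birthday result
        System.out.println(collisionProbability(100000, Math.pow(2, 32))); // ~0.69 for 100,000 docs with 32-bit checksums
    }
}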

Regards,
Tom McGlynn
 
