How good are checksums?


Chris

I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

What if I also check to make sure the document lengths are the same?
 

Roedy Green

Chris said:
My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The logic I use in The Replicator and Untouch for this is to compute a
checksum and compare file lengths. This makes the odds of a false
duplicate very low. I use a fast Adler checksum.

Even without the length check, the odds are only 1 in 2^32 that any
given pair of different documents turns up as a false duplicate.

See http://mindprod.com/products.html#REPLICATOR
http://mindprod.com/projuntouch.html

I use this for redating files that have not really changed back to
their original dates, e.g. when macros expanded to the same thing as
last time.
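
For reference, a minimal sketch of that checksum-plus-length key in plain
JDK Java. java.util.zip.Adler32 is the standard class; the class and method
names below are only illustrative, not the actual Replicator code:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class QuasiKey {

    // Build a "checksum:length" key. Two files with different keys are
    // certainly different; files with equal keys are only probably equal.
    static String keyFor(Path file) throws IOException {
        Adler32 adler = new Adler32();
        byte[] buffer = new byte[64 * 1024];
        long length = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                adler.update(buffer, 0, n);
                length += n;
            }
        }
        return Long.toHexString(adler.getValue()) + ":" + length;
    }
}

Files with different keys are certainly different; files with equal keys
are only probably identical, which is where the finer checks discussed
later in the thread come in.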
 

Roedy Green

Chris said:
My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

You can also fold the timestamp into the checksum. Normally you would
consider two documents with the same timestamp and contents to be
duplicates even if they have different names.

If you don't want that, you can add the name, or dir+name, into the
checksum as well.
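
A minimal sketch of folding those extra fields into the checksum, using
the JDK's java.util.zip.CRC32. The class is standard; the method and the
choice of fields here are assumptions, shown only to illustrate the idea:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class DocKey {

    // Fold the content, the timestamp and (optionally) the name into one
    // CRC, so that "duplicate" means whatever you decide it means.
    static long checksum(byte[] content, long timestamp, String name) {
        CRC32 crc = new CRC32();
        crc.update(content);
        // fold in the timestamp, one byte at a time
        for (int shift = 56; shift >= 0; shift -= 8) {
            crc.update((int) (timestamp >>> shift) & 0xFF);
        }
        // drop this line if same content under different names should match
        crc.update(name.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}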
 

Eric Sosman

Chris said:
I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

Look up the "birthday paradox." If there are D documents
each with one of C equiprobable checksum values, the chance of
having no duplications at all is easily seen to be

(C/C) * ((C-1)/C) * ... * ((C-D+1)/C)

For large D this value would be, er, "tedious" to compute.
Also, it tells you only the probability of having no duplicates
(or of having one or more), whereas you're really interested in
how many duplicates you should expect.

There are probably fancier and more accurate ways to estimate
the average, but I think you could get a reasonable approximation
by considering the D*(D-1)/2 ~= D*D/2 pairs of documents. The
two documents in a pair have the same checksum value with the
probability 1/C. If the D*D/2 pairings were independent (they're
not, but we're approximating), you'd expect (D*D/2)/C matches.
For a 32-bit checksum and D ~= 100 million, this comes to about

100e6 * 100e6 / 2 / 4.3e9 ~= 1.2e6

pairs of false duplicates.

What if I also check to make sure the document lengths are the same?

That would increase the value of C, and hence decrease the
likelihood of duplicates. You could also increase C by using
two different checksums instead of just one. However, it's
unlikely that all of (1) CRC-32 and (2) Adler and (3) document
length will be 100% statistically independent; for a trivial
counterexample, consider 100 million one-byte documents. This
means that a pair of distinct documents is a little more likely
to have matching "signatures" than the probability 1/C suggests.
Still, it seems likely that using two as-different-as-possible
checksums and considering the lengths should get the number of
false duplicates down from 1.2 million to a more manageable
number.
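
To make Eric's estimate easy to rerun with other numbers, here is a tiny
sketch of the D*D/(2*C) approximation. The class and method names are made
up purely for illustration:

public class BirthdayEstimate {

    // Expected number of falsely-matching pairs among d documents hashed
    // into c equally likely checksum values: roughly d*d / (2*c).
    static double expectedCollidingPairs(double d, double c) {
        return d * d / 2.0 / c;
    }

    public static void main(String[] args) {
        double c32 = Math.pow(2, 32);
        System.out.println(expectedCollidingPairs(1e6, c32));   // ~116 pairs for 1 million docs
        System.out.println(expectedCollidingPairs(100e6, c32)); // ~1.16e6 pairs for 100 million docs
    }
}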
 

Roedy Green

Roedy Green said:
The logic I use in The Replicator and Untouch for this is to compute a
checksum and compare file lengths. This makes the odds of a false
duplicate very low. I use a fast Adler checksum.

Even without the length check, the odds are only 1 in 2^32 that any
given pair of different documents turns up as a false duplicate.

In The Replicator, it does not matter if two unrelated documents hash
to the same key, only whether the immediately previous version of a
document hashes to the same key and length.

Your problem is similar to what to do with hash table collisions.

See http://mindprod.com/jgloss/hashtable.html

Two different documents can hash to the same key, but two identical
documents cannot hash to different keys. If two documents hash to the
same key, you can do a finer check for duplicates, even a byte-by-byte
compare. This is not too painful if you do the I/O in whacking big
chunks of raw bytes.

Most of the time you won't have collisions, so you won't have to do
the compare.

If the two files have the same name and same timestamp, you should be
able to trust the OS that they are the same file without even
examining contents. :)
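
A minimal sketch of that finer byte-by-byte check, buffered so the
underlying I/O still happens in big chunks. The class and method names are
illustrative; newer JDKs also offer Files.mismatch, which does essentially
the same job:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExactCompare {

    // Called only when checksum and length already match, so it runs rarely.
    // The BufferedInputStreams keep the underlying reads in large chunks.
    static boolean sameContent(Path a, Path b) throws IOException {
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a), 64 * 1024);
             InputStream inB = new BufferedInputStream(Files.newInputStream(b), 64 * 1024)) {
            int byteA;
            int byteB;
            do {
                byteA = inA.read();
                byteB = inB.read();
                if (byteA != byteB) {
                    return false;
                }
            } while (byteA != -1);
            return true; // both streams ended together with no mismatch
        }
    }
}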
 

Michael Scovetta

I always use MD5(input) converted to hex for my hashes. Prepend the
file name, size, date, or whatever to it if that makes you feel more
comfortable, but I think you'll overload any database in existence
before you find a collision on even just the MD5 hash of the data.
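
For what it's worth, a minimal sketch of the MD5-to-hex step using the
JDK's java.security.MessageDigest (the helper name is just an illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Key {

    // 128-bit MD5 of the file contents, rendered as 32 hex characters.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}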

Mike Scovetta
 

Michael Borgwardt

Michael said:
I always use MD5(input) converted to hex for my hashes. Prepend the
file name, size, date, or whatever to it if that makes you feel more
comfortable, but I think you'll overload any database in existence
before you find a collision on even just the MD5 hash of the data.

The method of hash generation makes very little difference as long as
it's not broken. The length of the hash is what determines the
likelihood of collisions. MD5 is 128 bits, which is much better than
32 bits; that's all.
 

Michael Borgwardt

Can't you simply do an actual comparison of the document content
if the checksums are identical? That would avoid false duplicates
completely.
 

Tom McGlynn

Chris said:
I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.

My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.

The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).

What if I also check to make sure the document lengths are the same?

I don't see any other posters who answered the first question.
Very roughly, you have a good chance of getting duplicate checksums
once the number of elements exceeds the square root of the number of
different checksum values. With 32 bits you have about 4x10^9
different checksums, so you might expect duplicates once the number
of files begins to exceed sqrt(4x10^9) ~= 65,000 -- well below the
numbers you are worried about. A 128-bit checksum, on the other hand,
could do just fine. So use a larger checksum, or use some of the
other methods suggested.

This is the same math that comes up when you ask how many people you
need in a room for it to be likely that two of them share a birthday.
There's a > 50% chance with 23 people, which is a bit more than the
square root of 365.
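
A small sketch that computes that probability exactly, using a log-sum so
the product doesn't get tedious (the names are purely illustrative):

public class BirthdayProbability {

    // Probability that d items drawn from c equally likely values contain
    // at least one duplicate, via the log of the no-collision product.
    static double collisionProbability(long d, double c) {
        double logNoCollision = 0.0;
        for (long i = 1; i < d; i++) {
            logNoCollision += Math.log1p(-i / c);
        }
        return 1.0 - Math.exp(logNoCollision);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365));                 // ~0.507, the classic birthday result
        System.out.println(collisionProbability(100000, Math.pow(2, 32))); // ~0.69 for 100,000 docs with 32-bit checksums
    }
}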

Regards,
Tom McGlynn
 
