Chris
I've got an app where we're adding documents to a database. I need to detect
duplicates, and we don't have a unique primary key. The documents will be
anywhere from 100 bytes to 100K in length.
My thought was that I could calculate a 32-bit checksum, either CRC or
Adler, and use that as a quasi-primary key. Documents with the same checksum
would be assumed to be duplicates.
The question is, how often will this fail? Given a database of 1 million
documents, what is the likelihood of two different documents having the same
checksum? What about 100 million documents? (Yes, we really will have a
database that large).
What if I also check to make sure the document lengths are the same?
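For anyone working out the numbers: the collision odds here follow the birthday problem. A minimal sketch of the standard approximation (the function names are mine, and it assumes the 32-bit checksums are uniformly distributed over the full 2^32 space, which is roughly true for CRC-32 on varied inputs):

```python
import math

def expected_collisions(n: int, bits: int = 32) -> float:
    """Birthday approximation: expected number of colliding pairs
    when n values are drawn uniformly from a 2**bits space."""
    space = 2 ** bits
    return n * (n - 1) / (2 * space)

def collision_probability(n: int, bits: int = 32) -> float:
    """Approximate probability of at least one collision among n values."""
    return 1.0 - math.exp(-expected_collisions(n, bits))

# With 1 million documents and 32-bit checksums, the expected number of
# colliding pairs is already over a hundred, so at least one collision
# is a near certainty; with 100 million documents it is worse still.
print(expected_collisions(1_000_000))    # roughly 116 expected colliding pairs
print(collision_probability(1_000_000))  # effectively 1.0
```

Checking document lengths as well effectively partitions the documents into buckets by length, so collisions only matter within each bucket; that shrinks the per-bucket n in the formula above but does not change the overall picture much when many documents share common lengths.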