How good are checksums?

Discussion in 'Java' started by Chris, Apr 29, 2004.

  1. Chris

    Chris Guest

    I've got an app where we're adding documents to a database. I need to detect
    duplicates, and we don't have a unique primary key. The documents will be
    anywhere from 100 bytes to 100K in length.

    My thought was that I could calculate a 32-bit checksum, either CRC or
    Adler, and use that as a quasi-primary key. Documents with the same checksum
    would be assumed to be duplicates.

    The question is, how often will this fail? Given a database of 1 million
    documents, what is the likelihood of two different documents having the same
    checksum? What about 100 million documents? (Yes, we really will have a
    database that large).

    What if I also check to make sure the document lengths are the same?
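    (Both checksums mentioned here ship with the JDK in java.util.zip. A
    minimal sketch of computing one over a document's byte array; the class
    name and the useAdler flag are just illustrative:)

    import java.util.zip.Adler32;
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    // Minimal sketch: compute a 32-bit checksum over a document's bytes.
    public class DocChecksum {
        public static long checksum(byte[] doc, boolean useAdler) {
            Checksum ck = useAdler ? new Adler32() : new CRC32();
            ck.update(doc, 0, doc.length);
            return ck.getValue();   // the 32-bit checksum, returned as a long
        }
    }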
    Chris, Apr 29, 2004
    #1

  2. Roedy Green

    Roedy Green Guest

    On Thu, 29 Apr 2004 12:17:47 -0500, "Chris" <> wrote
    or quoted :

    >My thought was that I could calculate a 32-bit checksum, either CRC or
    >Adler, and use that as a quasi-primary key. Documents with the same checksum
    >would be assumed to be duplicates.


    The logic I use in The Replicator and Untouch for this is to compute a
    checksum and compare file lengths. That makes the odds of a false
    duplicate very low. I use a fast Adler checksum.

    Even without the length check, the odds of any given pair of documents
    producing a false duplicate are only 1 in 2^32.
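    A sketch of that checksum-plus-length idea, assuming the documents are
    available as byte arrays (the DocKey class and field names are just
    illustrative): two documents are only treated as candidate duplicates
    when both the length and the Adler value match.

    import java.util.zip.Adler32;

    // Composite duplicate key: length plus Adler-32 value.
    public final class DocKey {
        private final long length;
        private final long adler;

        public DocKey(byte[] doc) {
            Adler32 a = new Adler32();
            a.update(doc, 0, doc.length);
            this.length = doc.length;
            this.adler = a.getValue();
        }

        public boolean equals(Object o) {
            if (!(o instanceof DocKey)) return false;
            DocKey k = (DocKey) o;
            return length == k.length && adler == k.adler;
        }

        public int hashCode() {
            return (int) (adler ^ (length * 31));
        }
    }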

    See http://mindprod.com/products.html#REPLICATOR
    http://mindprod.com/projuntouch.html

    I use this for redating files that have not really changed back to
    their original dates, e.g. when macros expanded to the same thing as
    last time.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Apr 29, 2004
    #2

  3. Roedy Green

    Roedy Green Guest

    On Thu, 29 Apr 2004 12:17:47 -0500, "Chris" <> wrote
    or quoted :

    >My thought was that I could calculate a 32-bit checksum, either CRC or
    >Adler, and use that as a quasi-primary key. Documents with the same checksum
    >would be assumed to be duplicates.


    You can throw the timestamp into the checksum too. Normally you would
    consider two documents with the same timestamp and contents to be
    duplicates even if they have different names.

    If you don't want that, you can add the name, or dir+name, into the
    checksum as well.
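    A sketch of folding such metadata into the same checksum; which fields
    to include is a policy choice, and the helper name and parameters here
    are just illustrative.

    import java.util.zip.Adler32;

    // Fold the timestamp (and optionally dir+name) into the checksum,
    // so two documents only match when those fields match too.
    public class MetaChecksum {
        public static long checksum(byte[] contents, long timestamp, String dirAndName) {
            Adler32 ck = new Adler32();
            ck.update(contents, 0, contents.length);
            for (int i = 0; i < 8; i++) {
                ck.update(((int) (timestamp >>> (8 * i))) & 0xFF);  // one byte at a time
            }
            if (dirAndName != null) {
                ck.update(dirAndName.getBytes());  // omit this to ignore the name
            }
            return ck.getValue();
        }
    }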
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Apr 29, 2004
    #3
  4. Eric Sosman

    Eric Sosman Guest

    Chris wrote:
    >
    > I've got an app where we're adding documents to a database. I need to detect
    > duplicates, and we don't have a unique primary key. The documents will be
    > anywhere from 100 bytes to 100K in length.
    >
    > My thought was that I could calculate a 32-bit checksum, either CRC or
    > Adler, and use that as a quasi-primary key. Documents with the same checksum
    > would be assumed to be duplicates.
    >
    > The question is, how often will this fail? Given a database of 1 million
    > documents, what is the likelihood of two different documents having the same
    > checksum? What about 100 million documents? (Yes, we really will have a
    > database that large).


    Look up the "birthday paradox." If there are D documents
    each with one of C equiprobable checksum values, the chance of
    having no duplications at all is easily seen to be

    (C/C) * ((C-1)/C) * ... * ((C-D+1)/C)

    For large D this value would be, er, "tedious" to compute.
    Also, it tells you only the probability of having no duplicates
    (or of having one or more), whereas you're really interested in
    how many duplicates you should expect.

    There are probably fancier and more accurate ways to estimate
    the average, but I think you could get a reasonable approximation
    by considering the D*(D-1)/2 ~= D*D/2 pairs of documents. The
    two documents in a pair have the same checksum value with the
    probability 1/C. If the D*D/2 pairings were independent (they're
    not, but we're approximating), you'd expect (D*D/2)/C matches.
    For a 32-bit checksum and D ~= 100 million, this comes to about

    100e6 * 100e6 / 2 / 4.3e9 ~= 1.2e6

    pairs of false duplicates.
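    A quick arithmetic check of that estimate, using the expected-pairs
    approximation D*(D-1)/(2*C); the class name is just illustrative.

    // Expected colliding pairs for D documents over C equally likely values.
    public class BirthdayEstimate {
        public static void main(String[] args) {
            double c = Math.pow(2, 32);
            double[] counts = { 1e6, 100e6 };
            for (int i = 0; i < counts.length; i++) {
                double d = counts[i];
                double expectedPairs = d * (d - 1) / (2 * c);
                System.out.println((long) d + " docs -> about " + expectedPairs
                        + " expected false-duplicate pairs");
            }
        }
    }

    This prints roughly 116 expected pairs for 1 million documents and
    roughly 1.2 million pairs for 100 million documents, matching the
    figure above.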

    > What if I also check to make sure the document lengths are the same?


    That would increase the value of C, and hence decrease the
    likelihood of duplicates. You could also increase C by using
    two different checksums instead of just one. However, it's
    unlikely that all of (1) CRC-32 and (2) Adler and (3) document
    length will be 100% statistically independent; for a trivial
    counterexample, consider 100 million one-byte documents. This
    means that a pair of distinct documents is a little more likely
    to have matching "signatures" than the probability 1/C suggests.
    Still, it seems likely that using two as-different-as-possible
    checksums and considering the lengths should get the number of
    false duplicates down from 1.2 million to a more manageable
    number.

    --
    Eric Sosman, Apr 29, 2004
    #4
  5. Roedy Green

    Roedy Green Guest

    On Thu, 29 Apr 2004 18:58:24 GMT, Roedy Green
    <> wrote or quoted :

    >the logic I use in The Replicator and Untouch for this is to compute a
    >checksum and compare file lengths. This makes the odds of a false
    >duplicate very low. I use a fast Adlerian checksum.
    >
    >Even without the length check, the odds are only 1/2^32 of getting a
    >false duplicate.


    In the Replicator, it does not matter if two unrelated documents hash
    to the same key; it matters only if the immediately previous version
    of a document hashes to the same key and length.

    Your problem is similar to what to do with hash table collisions.

    See http://mindprod.com/jgloss/hashtable.html

    Two different documents can hash to the same key. But two identical
    documents cannot hash to different keys. If two documents hash to the
    same key, you can do some finer check for duplicates, even a byte by
    byte compare. This is not too painful if you do the i/o in whacking
    big chunks as raw bytes.

    Most of the time you won't have collisions, so you won't have to do
    the compare.
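    A sketch of that finer check: a byte-by-byte compare through big
    buffered chunks, only run once two documents already share a checksum
    and length. The class name and file paths are just placeholders.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Byte-by-byte compare of two files, buffered in 64K chunks.
    public class FileCompare {
        public static boolean sameContents(String fileA, String fileB) throws IOException {
            InputStream a = new BufferedInputStream(new FileInputStream(fileA), 64 * 1024);
            InputStream b = new BufferedInputStream(new FileInputStream(fileB), 64 * 1024);
            try {
                int x, y;
                do {
                    x = a.read();
                    y = b.read();
                    if (x != y) {
                        return false;       // first mismatch settles it
                    }
                } while (x != -1);
                return true;                // both streams ended together
            } finally {
                a.close();
                b.close();
            }
        }
    }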

    If the two files have the same name and same timestamp, you should be
    able to trust the OS that they are the same file without even
    examining contents. :)



    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Apr 29, 2004
    #5
  6. I always use MD5(input) converted to hex for my hashes. Prepend it
    with the file name, size, date, or whatever to make you feel more
    comfortable, but I think you'll overload any database in existence
    before you find a collision on even just the MD5 hash of the data.
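    A sketch of that MD5-to-hex approach with java.security.MessageDigest
    (the class name is just illustrative); the result is a 32-character hex
    string that can serve as the quasi-key.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // MD5 digest of the document rendered as a 32-character hex string.
    public class Md5Key {
        public static String md5Hex(byte[] doc) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5").digest(doc);
            StringBuffer hex = new StringBuffer(32);
            for (int i = 0; i < digest.length; i++) {
                int b = digest[i] & 0xFF;
                if (b < 0x10) {
                    hex.append('0');        // keep two hex digits per byte
                }
                hex.append(Integer.toHexString(b));
            }
            return hex.toString();
        }
    }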

    Mike Scovetta



    "Chris" <> wrote in message news:<>...
    > I've got an app where we're adding documents to a database. I need to detect
    > duplicates, and we don't have a unique primary key. The documents will be
    > anywhere from 100 bytes to 100K in length.
    >
    > My thought was that I could calculate a 32-bit checksum, either CRC or
    > Adler, and use that as a quasi-primary key. Documents with the same checksum
    > would be assumed to be duplicates.
    >
    > The question is, how often will this fail? Given a database of 1 million
    > documents, what is the likelihood of two different documents having the same
    > checksum? What about 100 million documents? (Yes, we really will have a
    > database that large).
    >
    > What if I also check to make sure the document lengths are the same?
    Michael Scovetta, Apr 29, 2004
    #6
  7. Liz

    Liz Guest

    MD5 is better, and there are some tools on the web you can use; they
    are pretty fast too.


    "Chris" <> wrote in message
    news:...
    > I've got an app where we're adding documents to a database. I need to detect
    > duplicates, and we don't have a unique primary key. The documents will be
    > anywhere from 100 bytes to 100K in length.
    >
    > My thought was that I could calculate a 32-bit checksum, either CRC or
    > Adler, and use that as a quasi-primary key. Documents with the same checksum
    > would be assumed to be duplicates.
    >
    > The question is, how often will this fail? Given a database of 1 million
    > documents, what is the likelihood of two different documents having the same
    > checksum? What about 100 million documents? (Yes, we really will have a
    > database that large).
    >
    > What if I also check to make sure the document lengths are the same?
    >
    >
    Liz, Apr 30, 2004
    #7
  8. Roedy Green

    Roedy Green Guest

    On Fri, 30 Apr 2004 03:38:46 GMT, "Liz" <> wrote or
    quoted :

    >md5 is better and there are some tools
    >on the web you can use, pretty fast too


    Any idea how much slower MD5 is than Adler?


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Apr 30, 2004
    #8
  9. Michael Scovetta wrote:

    > I always use MD5(input) converted to hex for my hashes. Prepend it
    > with the file name, size, date, or whatever to make you feel more
    > comfortable, but I think you'll overload any database in existence
    > before you find a collision on even just the MD5 hash of the data.


    The method of hash generation makes very little difference as long as
    it's not broken. The length of the hash is what determines the
    likelihood of collisions. MD5 is 128 bits, which is much better than
    32 bits; that's all.
    Michael Borgwardt, Apr 30, 2004
    #9
  10. Can't you simply do an actual comparison of the document content
    if the checksum is identical? That would avoid false duplicates
    completely.
    Michael Borgwardt, Apr 30, 2004
    #10
  11. Tom McGlynn

    Tom McGlynn Guest

    Chris wrote:

    > I've got an app where we're adding documents to a database. I need to detect
    > duplicates, and we don't have a unique primary key. The documents will be
    > anywhere from 100 bytes to 100K in length.
    >
    > My thought was that I could calculate a 32-bit checksum, either CRC or
    > Adler, and use that as a quasi-primary key. Documents with the same checksum
    > would be assumed to be duplicates.
    >
    > The question is, how often will this fail? Given a database of 1 million
    > documents, what is the likelihood of two different documents having the same
    > checksum? What about 100 million documents? (Yes, we really will have a
    > database that large).
    >
    > What if I also check to make sure the document lengths are the same?
    >
    >


    I don't see any other posters who answered the first question.
    Very roughly, you have a good chance of seeing duplicate checksums
    once the number of documents exceeds the square root of the number of
    possible checksum values. With 32 bits you have about 4x10^9 different
    checksums, so you might expect duplicates once the number of files
    approaches sqrt(4x10^9), i.e. roughly 65,000 -- well below the numbers
    you are worried about. A 128-bit checksum, on the other hand, should
    do just fine. So use a larger checksum, or use some of the other
    methods suggested.

    This is the same math that comes up when you ask how many people
    do you need to have in a room for it to be likely that two people
    have the same birthday. There's a > 50% chance with 23 people which
    is a bit more than the square root of 365.
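    A quick check of those two numbers -- the 23-people birthday
    probability and the square root of the 32-bit key space (the class
    name is just illustrative):

    public class BirthdayCheck {
        public static void main(String[] args) {
            double noDuplicate = 1.0;
            for (int i = 0; i < 23; i++) {
                noDuplicate *= (365.0 - i) / 365.0;   // no shared birthday so far
            }
            System.out.println("P(shared birthday, 23 people) = " + (1 - noDuplicate)); // ~0.507
            System.out.println("sqrt(2^32) = " + Math.sqrt(Math.pow(2, 32)));           // 65536
        }
    }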

    Regards,
    Tom McGlynn
    Tom McGlynn, Apr 30, 2004
    #11
  12. Liz

    Liz Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Fri, 30 Apr 2004 03:38:46 GMT, "Liz" <> wrote or
    > quoted :
    >
    > >md5 is better and there are some tools
    > >on the web you can use, pretty fast too

    >
    > Any idea how much slower MD5 is than Adler?


    Just tried it on a 71MB file and it took about 2 seconds.

    >
    >
    > --
    > Canadian Mind Products, Roedy Green.
    > Coaching, problem solving, economical contract programming.
    > See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    >
    Liz, Apr 30, 2004
    #12
  13. Roedy Green

    Roedy Green Guest

    On Fri, 30 Apr 2004 19:11:13 GMT, "Liz" <> wrote or
    quoted :

    >> Any idea how much slower MD5 is than Adler?

    >
    >Just tried it on a 71MB file and it took about 2 seconds.


    Have you tried it both ways, i.e. Adler as well as MD5 on the same file?
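    A rough sketch of timing it both ways over the same file; this is not a
    careful benchmark, and the command-line argument and buffer size are
    just placeholders.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.zip.Adler32;

    // Rough timing of Adler-32 vs MD5 over the same file.
    public class SpeedCheck {
        public static void main(String[] args) throws Exception {
            String file = args[0];
            byte[] buf = new byte[64 * 1024];

            long t0 = System.currentTimeMillis();
            Adler32 adler = new Adler32();
            InputStream in = new FileInputStream(file);
            for (int n; (n = in.read(buf)) != -1; ) {
                adler.update(buf, 0, n);
            }
            in.close();
            long t1 = System.currentTimeMillis();

            MessageDigest md5 = MessageDigest.getInstance("MD5");
            in = new FileInputStream(file);
            for (int n; (n = in.read(buf)) != -1; ) {
                md5.update(buf, 0, n);
            }
            in.close();
            md5.digest();
            long t2 = System.currentTimeMillis();

            System.out.println("Adler-32: " + (t1 - t0) + " ms");
            System.out.println("MD5:      " + (t2 - t1) + " ms");
        }
    }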

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Apr 30, 2004
    #13