Re: binary file compare...

Discussion in 'Python' started by Nigel Rantor, Apr 16, 2009.

  1. Nigel Rantor

    Nigel Rantor Guest

    Adam Olsen wrote:
    > On Apr 15, 12:56 pm, Nigel Rantor <> wrote:
    >> Adam Olsen wrote:
    >>> The chance of *accidentally* producing a collision, although
    >>> technically possible, is so extraordinarily rare that it's completely
    >>> overshadowed by the risk of a hardware or software failure producing
    >>> an incorrect result.

    >> Not when you're using them to compare lots of files.
    >>
    >> Trust me. Been there, done that, got the t-shirt.
    >>
    >> Using hash functions to tell whether or not files are identical is an
    >> error waiting to happen.
    >>
    >> But please, do so if it makes you feel happy, you'll just eventually get
    >> an incorrect result and not know it.

    >
    > Please tell us what hash you used and provide the two files that
    > collided.


    MD5

    > If your hash is 256 bits, then you need around 2**128 files to produce
    > a collision. This is known as a Birthday Attack. I seriously doubt
    > you had that many files, which suggests something else went wrong.


    Okay, before I tell you about the empirical, real-world evidence I have
    could you please accept that hashes collide and that no matter how many
    samples you use the probability of finding two files that do collide is
    small but not zero.

    Which is the only thing I've been saying.

    Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.

    If you are of the opinion though that a hash function can be used to
    tell you whether or not two files are identical then you are wrong. It
    really is that simple.

    I'm not sitting here discussing this for my health, I'm just trying to
    give the OP the benefit of my experience, I have worked with other
    people who insisted on this route and had to find out the hard way that
    it was a Bad Idea (tm). They just wouldn't be told.

    Regards,

    Nige
     
    Nigel Rantor, Apr 16, 2009
    #1

  2. Nigel Rantor

    Adam Olsen Guest

    On Apr 16, 3:16 am, Nigel Rantor <> wrote:
    > Adam Olsen wrote:
    > > On Apr 15, 12:56 pm, Nigel Rantor <> wrote:
    > >> Adam Olsen wrote:
    > >>> The chance of *accidentally* producing a collision, although
    > >>> technically possible, is so extraordinarily rare that it's completely
    > >>> overshadowed by the risk of a hardware or software failure producing
    > >>> an incorrect result.
    > >> Not when you're using them to compare lots of files.

    >
    > >> Trust me. Been there, done that, got the t-shirt.

    >
    > >> Using hash functions to tell whether or not files are identical is an
    > >> error waiting to happen.

    >
    > >> But please, do so if it makes you feel happy, you'll just eventually get
    > >> an incorrect result and not know it.

    >
    > > Please tell us what hash you used and provide the two files that
    > > collided.

    >
    > MD5
    >
    > > If your hash is 256 bits, then you need around 2**128 files to produce
    > > a collision.  This is known as a Birthday Attack.  I seriously doubt
    > > you had that many files, which suggests something else went wrong.

    >
    > Okay, before I tell you about the empirical, real-world evidence I have
    > could you please accept that hashes collide and that no matter how many
    > samples you use the probability of finding two files that do collide is
    > small but not zero.


    I'm afraid you will need to back up your claims with real files.
    Although MD5 is a smaller, older hash (128 bits, so you only need
    2**64 files to find collisions), and it has substantial known
    vulnerabilities, the scenario you suggest where you *accidentally*
    find collisions (and you imply multiple collisions!) would be a rather
    significant finding.

    Please help us all by justifying your claim.

    Mind you, since you use MD5 I wouldn't be surprised if your files were
    maliciously produced. As I said before, you need to consider
    upgrading your hash every few years to avoid new attacks.
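
    For a sense of scale, the birthday estimate quoted above can be sketched
    as follows. This is a rough approximation, assuming the hash behaves like
    a uniform random function and that no input was crafted maliciously; the
    function name and the sample file counts are illustrative only.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate probability that at least two of num_files inputs share a digest.

    Assumes the hash behaves like a uniform random function (no malicious inputs).
    """
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) // 2
    # -expm1(-x) evaluates 1 - exp(-x) without rounding tiny results to zero.
    return -math.expm1(-pairs / 2**hash_bits)


if __name__ == "__main__":
    for label, n in [("ten million files", 10**7), ("2**64 files", 2**64)]:
        print(f"128-bit hash, {label}: {collision_probability(n, 128):.3g}")
```

    Under these assumptions, 2**64 files against a 128-bit hash gives a
    probability of about 0.39, which is why 2**64 is quoted as the point where
    accidental collisions become likely.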
     
    Adam Olsen, Apr 16, 2009
    #2

  3. Nigel Rantor

    Nigel Rantor Guest

    Adam Olsen wrote:
    > On Apr 16, 3:16 am, Nigel Rantor <> wrote:
    >> Adam Olsen wrote:
    >>> On Apr 15, 12:56 pm, Nigel Rantor <> wrote:
    >>>> Adam Olsen wrote:
    >>>>> The chance of *accidentally* producing a collision, although
    >>>>> technically possible, is so extraordinarily rare that it's completely
    >>>>> overshadowed by the risk of a hardware or software failure producing
    >>>>> an incorrect result.
    >>>> Not when you're using them to compare lots of files.
    >>>> Trust me. Been there, done that, got the t-shirt.
    >>>> Using hash functions to tell whether or not files are identical is an
    >>>> error waiting to happen.
    >>>> But please, do so if it makes you feel happy, you'll just eventually get
    >>>> an incorrect result and not know it.
    >>> Please tell us what hash you used and provide the two files that
    >>> collided.

    >> MD5
    >>
    >>> If your hash is 256 bits, then you need around 2**128 files to produce
    >>> a collision. This is known as a Birthday Attack. I seriously doubt
    >>> you had that many files, which suggests something else went wrong.

    >> Okay, before I tell you about the empirical, real-world evidence I have
    >> could you please accept that hashes collide and that no matter how many
    >> samples you use the probability of finding two files that do collide is
    >> small but not zero.

    >
    > I'm afraid you will need to back up your claims with real files.
    > Although MD5 is a smaller, older hash (128 bits, so you only need
    > 2**64 files to find collisions), and it has substantial known
    > vulnerabilities, the scenario you suggest where you *accidentally*
    > find collisions (and you imply multiple collisions!) would be a rather
    > significant finding.


    No. It wouldn't. It isn't.

    The files in question were millions of audio files. I no longer work at
    the company where I had access to them so I cannot give you examples,
    and even if I did Data Protection regulations wouldn't have allowed it.

    If you still don't believe me you can easily verify what I'm saying by
    doing some simple experiments. Go spider the web for images, keep
    collecting them until you get an MD5 hash collision.

    It won't take long.

    > Please help us all by justifying your claim.


    Now, please go and re-read my request first and admit that everything I
    have said so far is correct.

    > Mind you, since you use MD5 I wouldn't be surprised if your files were
    > maliciously produced. As I said before, you need to consider
    > upgrading your hash every few years to avoid new attacks.


    Good grief, this is nothing to do with security concerns, this is about
    someone suggesting to the OP that they use a hash function to determine
    whether or not two files are identical.
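
    When both files are readable from the same machine, the direct alternative
    to hashing is a byte-for-byte comparison. A minimal sketch, assuming local
    paths; the function name and chunk size are illustrative choices, and the
    standard library's filecmp.cmp(a, b, shallow=False) does a similar job.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 16):
    """Return True only if both files hold exactly the same bytes."""
    # Files of different length cannot be identical, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True  # both files ended with no difference found
```

    This reads each file once and stops at the first difference, so it never
    reports two different files as equal.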

    Regards,

    Nige
     
    Nigel Rantor, Apr 16, 2009
    #3
  4. On Apr 16, 3:16 am, Nigel Rantor <> wrote:
    > Adam Olsen wrote:
    > > On Apr 15, 12:56 pm, Nigel Rantor <> wrote:
    > >> Adam Olsen wrote:
    > >>> The chance of *accidentally* producing a collision, although
    > >>> technically possible, is so extraordinarily rare that it's completely
    > >>> overshadowed by the risk of a hardware or software failure producing
    > >>> an incorrect result.
    > >> Not when you're using them to compare lots of files.

    >
    > >> Trust me. Been there, done that, got the t-shirt.

    >
    > >> Using hash functions to tell whether or not files are identical is an
    > >> error waiting to happen.

    >
    > >> But please, do so if it makes you feel happy, you'll just eventually get
    > >> an incorrect result and not know it.

    >
    > > Please tell us what hash you used and provide the two files that
    > > collided.

    >
    > MD5
    >
    > > If your hash is 256 bits, then you need around 2**128 files to produce
    > > a collision.  This is known as a Birthday Attack.  I seriously doubt
    > > you had that many files, which suggests something else went wrong.

    >
    > Okay, before I tell you about the empirical, real-world evidence I have
    > could you please accept that hashes collide and that no matter how many
    > samples you use the probability of finding two files that do collide is
    > small but not zero.
    >
    > Which is the only thing I've been saying.
    >
    > Yes, it's unlikely. Yes, it's possible. Yes, it happens in practice.
    >
    > If you are of the opinion though that a hash function can be used to
    > tell you whether or not two files are identical then you are wrong. It
    > really is that simple.
    >
    > I'm not sitting here discussing this for my health, I'm just trying to
    > give the OP the benefit of my experience, I have worked with other
    > people who insisted on this route and had to find out the hard way that
    > it was a Bad Idea (tm). They just wouldn't be told.
    >
    > Regards,
    >
    >    Nige


    And yes, he is right: CRCs and hashes all have some probability of saying
    that the files are identical when in fact they are not.
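
    With a short checksum this is easy to see in practice. The sketch below
    searches small inputs for two different byte strings with the same 16-bit
    CRC, using binascii.crc_hqx from the standard library. The inputs (decimal
    numbers as ASCII) and the function name are arbitrary illustrations; since
    only 65536 CRC values exist, a match is guaranteed before the limit.

```python
import binascii


def first_crc16_collision(limit: int = 70000):
    """Return two distinct byte strings with the same 16-bit CRC, or None."""
    seen = {}  # maps CRC value -> first input that produced it
    for i in range(limit):
        data = str(i).encode("ascii")
        crc = binascii.crc_hqx(data, 0)
        if crc in seen:
            return seen[crc], data
        seen[crc] = data
    return None


if __name__ == "__main__":
    a, b = first_crc16_collision()
    print(a, b, hex(binascii.crc_hqx(a, 0)))
```

    The same pigeonhole argument applies to any fixed-size digest; only the
    number of inputs needed before a match becomes likely changes.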
     
    SpreadTooThin, Apr 16, 2009
    #4
  5. Nigel Rantor

    Adam Olsen Guest

    On Apr 16, 11:15 am, SpreadTooThin <> wrote:
    > And yes he is right CRCs hashing all have a probability of saying that
    > the files are identical when in fact they are not.


    Here's the bottom line. It is either:

    A) Several hundred years of mathematics and cryptography are wrong.
    The birthday problem as described is incorrect, so a collision is far
    more likely than 42 trillion trillion to 1. You are simply the first
    person to have noticed it.

    B) Your software was buggy, or possibly the input was maliciously
    produced. Or, a really tiny chance that your particular files
    contained a pattern that provoked bad behaviour from MD5.

    Finding a specific limitation of the algorithm is one thing. Claiming
    that the math is fundamentally wrong is quite another.
     
    Adam Olsen, Apr 16, 2009
    #5
  6. Nigel Rantor

    Nigel Rantor Guest

    Adam Olsen wrote:
    > On Apr 16, 11:15 am, SpreadTooThin <> wrote:
    >> And yes he is right CRCs hashing all have a probability of saying that
    >> the files are identical when in fact they are not.

    >
    > Here's the bottom line. It is either:
    >
    > A) Several hundred years of mathematics and cryptography are wrong.
    > The birthday problem as described is incorrect, so a collision is far
    > more likely than 42 trillion trillion to 1. You are simply the first
    > person to have noticed it.
    >
    > B) Your software was buggy, or possibly the input was maliciously
    > produced. Or, a really tiny chance that your particular files
    > contained a pattern that provoked bad behaviour from MD5.
    >
    > Finding a specific limitation of the algorithm is one thing. Claiming
    > that the math is fundamentally wrong is quite another.


    You are confusing yourself about probabilities, young man.

    Just because something is extremely unlikely does not mean it can't
    happen on the first attempt.

    This is true *no matter how big the numbers are*.

    If you persist in making these ridiculous claims that people *cannot*
    have found collisions then as I said, that's up to you, but I'm not
    going to employ you to do anything except make tea.

    Thanks,

    Nigel
     
    Nigel Rantor, Apr 17, 2009
    #6
  7. Nigel Rantor

    norseman Guest

    Adam Olsen wrote:
    > On Apr 16, 11:15 am, SpreadTooThin <> wrote:
    >> And yes he is right CRCs hashing all have a probability of saying that
    >> the files are identical when in fact they are not.

    >
    > Here's the bottom line. It is either:
    >
    > A) Several hundred years of mathematics and cryptography are wrong.
    > The birthday problem as described is incorrect, so a collision is far
    > more likely than 42 trillion trillion to 1. You are simply the first
    > person to have noticed it.
    >
    > B) Your software was buggy, or possibly the input was maliciously
    > produced. Or, a really tiny chance that your particular files
    > contained a pattern that provoked bad behaviour from MD5.
    >
    > Finding a specific limitation of the algorithm is one thing. Claiming
    > that the math is fundamentally wrong is quite another.
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >

    ================================
    Spending a lifetime in applied math has taught me:
    1) All applied math is finite.
    2) Any algorithm failing to handle all contingencies is flawed.

    The meaning of 1) is that it is limited in what it can actually do.
    The meaning of 2) is that the designer missed or left out something.

    Neither should be taken as bad. Both need to be accepted 'as is' and
    the decision to use (when, where, conditions) based on the probability of
    non-failure.


    "...a pattern that provoked bad behavior... " does mean the algorithm is
    incomplete and may be fundamentally wrong. Underscore "is" and "may".

    The more complicated the math the harder it is to keep a higher form of
    math from checking (or improperly displacing) a lower one. Which, of
    course, breaks the rules. Commonly called improper thinking. A number
    of math teasers make use of that.



    Steve
     
    norseman, Apr 17, 2009
    #7
  8. On Apr 17, 4:54 am, Nigel Rantor <> wrote:
    > Adam Olsen wrote:
    > > On Apr 16, 11:15 am, SpreadTooThin <> wrote:
    > >> And yes he is right CRCs hashing all have a probability of saying that
    > >> the files are identical when in fact they are not.

    >
    > > Here's the bottom line.  It is either:

    >
    > > A) Several hundred years of mathematics and cryptography are wrong.
    > > The birthday problem as described is incorrect, so a collision is far
    > > more likely than 42 trillion trillion to 1.  You are simply the first
    > > person to have noticed it.

    >
    > > B) Your software was buggy, or possibly the input was maliciously
    > > produced.  Or, a really tiny chance that your particular files
    > > contained a pattern that provoked bad behaviour from MD5.

    >
    > > Finding a specific limitation of the algorithm is one thing.  Claiming
    > > that the math is fundamentally wrong is quite another.

    >
    > You are confusing yourself about probabilities young man.
    >
    > Just because something is extremely unlikely does not mean it can't
    > happen on the first attempt.
    >
    > This is true *no matter how big the numbers are*.
    >
    > If you persist in making these ridiculous claims that people *cannot*
    > have found collisions then as I said, that's up to you, but I'm not
    > going to employ you to do anything except make tea.
    >
    > Thanks,
    >
    >    Nigel


    You know this is just insane. I'd be satisfied with a CRC16 or
    something in the situation I'm in.
    I have two large files, one local and one remote. Transferring every
    byte across the internet to be sure that the two files are identical
    is just not feasible. If two servers one on one side and the other on
    the other side both calculate the CRCs and transmit the CRCs for
    comparison I'm happy.
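
    A minimal sketch of that scheme, assuming each server runs the same
    function on its own copy and only the resulting integer is sent across.
    It uses zlib.crc32 (a 32-bit CRC) because the standard library provides
    it; the function name and chunk size are illustrative, and how the value
    is transmitted is left out.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with open(path, "rb") as f:
        # Feed the running CRC back in so the result matches a one-shot CRC.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

    Matching values mean the files are very probably the same; differing
    values mean they are certainly different.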
     
    SpreadTooThin, Apr 17, 2009
    #8
  9. Nigel Rantor

    Adam Olsen Guest

    On Apr 17, 9:59 am, norseman <> wrote:
    > The more complicated the math the harder it is to keep a higher form of
    > math from checking (or improperly displacing) a lower one.  Which, of
    > course, breaks the rules.  Commonly called improper thinking. A number
    > of math teasers make use of that.


    Of course, designing a hash is hard. That's why the *recommended*
    ones get so many years of peer review and attempted attacks first.

    I'd love it if Nigel provided evidence that MD5 was broken, I really
    would. It'd be quite interesting to investigate, assuming malicious
    content can be ruled out. Of course even he doesn't think that. He
    claims that his 42 trillion trillion to 1 odds happened not just once,
    but multiple times.
     
    Adam Olsen, Apr 17, 2009
    #9
  10. Nigel Rantor

    Adam Olsen Guest

    On Apr 17, 9:59 am, SpreadTooThin <> wrote:
    > You know this is just insane.  I'd be satisfied with a CRC16 or
    > something in the situation i'm in.
    > I have two large files, one local and one remote.  Transferring every
    > byte across the internet to be sure that the two files are identical
    > is just not feasible.  If two servers one on one side and the other on
    > the other side both calculate the CRCs and transmit the CRCs for
    > comparison I'm happy.


    Definitely use a hash, ignore Nigel. SHA-256 or SHA-512. Or, if you
    might need to update one of the files, look at rsync. Rsync still
    uses MD4 and MD5 (optionally!), but they're fine in a trusted
    environment.
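
    For the hash route, a minimal sketch using hashlib from the standard
    library, assuming each side computes the digest of its own copy and the
    two hex strings are then compared. The function name and chunk size are
    illustrative only.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

    Only the 64-character digest has to cross the network, regardless of how
    large the files are.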
     
    Adam Olsen, Apr 17, 2009
    #10