binary file compare...

Discussion in 'Python' started by SpreadTooThin, Apr 13, 2009.

  1. I want to compare two binary files and see if they are the same.
    I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    that it is doing a byte by byte comparison of two files to see if they
    are the same.

    What should I be using if not filecmp.cmp?
     
    SpreadTooThin, Apr 13, 2009
    #1

  2. On Apr 13, 2:00 pm, Przemyslaw Kaminski <> wrote:
    > SpreadTooThin wrote:
    > > I want to compare two binary files and see if they are the same.
    > > I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    > > that it is doing a byte by byte comparison of two files to see if they
    > > are they same.

    >
    > > What should I be using if not filecmp.cmp?

    >
    > Well, here's something: http://www.daniweb.com/forums/thread115959.html
    > but it seems from the post on the bottom that filecmp does comparison of
    > binary files.


    I just want to be clear, the comparison is not just based on file size
    and creation date but by a byte by byte comparison of the data in each
    file.
     
    SpreadTooThin, Apr 13, 2009
    #2

  3. On Apr 13, 2:03 pm, Grant Edwards <invalid@invalid> wrote:
    > On 2009-04-13, SpreadTooThin <> wrote:
    >
    > > I want to compare two binary files and see if they are the same.
    > > I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    > > that it is doing a byte by byte comparison of two files to see if they
    > > are they same.

    >
    > Perhaps I'm being dim, but how else are you going to decide if
    > two files are the same unless you compare the bytes in the
    > files?
    >
    > You could hash them and compare the hashes, but that's a lot
    > more work than just comparing the two byte streams.
    >
    > > What should I be using if not filecmp.cmp?

    >
    > I don't understand what you've got against comparing the files
    > when you stated that what you wanted to do was compare the files.
    >


    I think it's just the way the documentation was worded:
    http://www.python.org/doc/2.5.2/lib/module-filecmp.html

    Unless shallow is given and is false, files with identical os.stat()
    signatures are taken to be equal.
    Files that were compared using this function will not be compared
    again unless their os.stat() signature changes.

    So to do a comparison:
    filecmp.cmp(filea, fileb, False)
    ?
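
    A minimal sketch of that call on a current Python 3, assuming three
    hypothetical scratch files created only for the illustration. Passing
    shallow=False asks filecmp.cmp to read the contents instead of trusting
    matching os.stat() signatures.

```python
import filecmp
import os
import tempfile

# Hypothetical scratch files, created only for this illustration.
tmpdir = tempfile.mkdtemp()
file_a = os.path.join(tmpdir, "a.bin")
file_b = os.path.join(tmpdir, "b.bin")
file_c = os.path.join(tmpdir, "c.bin")

with open(file_a, "wb") as f:
    f.write(b"\x00\x01\x02\x03")
with open(file_b, "wb") as f:
    f.write(b"\x00\x01\x02\x03")
with open(file_c, "wb") as f:
    f.write(b"\x00\x01\x02\xff")  # same length, last byte differs

# shallow=False forces a comparison of the file contents.
same = filecmp.cmp(file_a, file_b, shallow=False)
different = filecmp.cmp(file_a, file_c, shallow=False)
print(same, different)
```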




    > --
    > Grant Edwards                   grante             Yow! Look DEEP into the
    >                                   at               OPENINGS!!  Do you see any
    >                                visi.com            ELVES or EDSELS ... or a
    >                                                    HIGHBALL?? ...
     
    SpreadTooThin, Apr 13, 2009
    #3
  4. On Apr 13, 2:37 pm, Grant Edwards <invalid@invalid> wrote:
    > On 2009-04-13, Grant Edwards <invalid@invalid> wrote:
    >
    >
    >
    > > On 2009-04-13, SpreadTooThin <> wrote:

    >
    > >> I want to compare two binary files and see if they are the same.
    > >> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    > >> that it is doing a byte by byte comparison of two files to see if they
    > >> are they same.

    >
    > > Perhaps I'm being dim, but how else are you going to decide if
    > > two files are the same unless you compare the bytes in the
    > > files?

    >
    > > You could hash them and compare the hashes, but that's a lot
    > > more work than just comparing the two byte streams.

    >
    > >> What should I be using if not filecmp.cmp?

    >
    > > I don't understand what you've got against comparing the files
    > > when you stated that what you wanted to do was compare the files.

    >
    > Doh!  I misread your post and thought you weren't getting a
    > warm fuzzy feeling _because_ it was doing a byte-byte
    > compare. Now I'm a bit confused.  Are you under the impression
    > it's _not_ doing a byte-byte compare?  Here's the code:
    >
    > def _do_cmp(f1, f2):
    >     bufsize = BUFSIZE
    >     fp1 = open(f1, 'rb')
    >     fp2 = open(f2, 'rb')
    >     while True:
    >         b1 = fp1.read(bufsize)
    >         b2 = fp2.read(bufsize)
    >         if b1 != b2:
    >             return False
    >         if not b1:
    >             return True
    >
    > It looks like a byte-by-byte comparison to me.  Note that when
    > this function is called the file lengths have already been
    > compared and found to be equal.
    >
    > --
    > Grant Edwards                   grante             Yow! Alright, you!!
    >                                   at               Imitate a WOUNDED SEAL
    >                                visi.com            pleading for a PARKING
    >                                                    SPACE!!


    I am indeed under the impression that it is not always doing a byte by
    byte comparison...
    as well the documentation states:
    Compare the files named f1 and f2, returning True if they seem equal,
    False otherwise.

    That word... Seeeeem... makes me wonder.

    Thanks for the code! :)
     
    SpreadTooThin, Apr 13, 2009
    #4
  5. SpreadTooThin

    Peter Otten Guest

    Grant Edwards wrote:

    > On 2009-04-13, Grant Edwards <invalid@invalid> wrote:
    >> On 2009-04-13, SpreadTooThin <> wrote:
    >>
    >>> I want to compare two binary files and see if they are the same.
    >>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    >>> that it is doing a byte by byte comparison of two files to see if they
    >>> are they same.

    >>
    >> Perhaps I'm being dim, but how else are you going to decide if
    >> two files are the same unless you compare the bytes in the
    >> files?
    >>
    >> You could hash them and compare the hashes, but that's a lot
    >> more work than just comparing the two byte streams.
    >>
    >>> What should I be using if not filecmp.cmp?

    >>
    >> I don't understand what you've got against comparing the files
    >> when you stated that what you wanted to do was compare the files.

    >
    > Doh! I misread your post and thought you weren't getting a
    > warm fuzzy feeling _because_ it was doing a byte-byte
    > compare. Now I'm a bit confused. Are you under the impression
    > it's _not_ doing a byte-byte compare? Here's the code:
    >
    > def _do_cmp(f1, f2):
    >     bufsize = BUFSIZE
    >     fp1 = open(f1, 'rb')
    >     fp2 = open(f2, 'rb')
    >     while True:
    >         b1 = fp1.read(bufsize)
    >         b2 = fp2.read(bufsize)
    >         if b1 != b2:
    >             return False
    >         if not b1:
    >             return True
    >
    > It looks like a byte-by-byte comparison to me. Note that when
    > this function is called the file lengths have already been
    > compared and found to be equal.


    But there's a cache. A change of file contents may go undetected as long as
    the file stats don't change:

    $ cat fool_filecmp.py
    import filecmp, shutil, sys

    for fn in "adb":
        with open(fn, "w") as f:
            f.write("yadda")

    shutil.copystat("d", "a")
    filecmp.cmp("a", "b", False)

    with open("a", "w") as f:
        f.write("*****")
    shutil.copystat("d", "a")

    if "--clear" in sys.argv:
        print "clearing cache"
        filecmp._cache.clear()

    if filecmp.cmp("a", "b", False):
        print "file a and b are equal"
    else:
        print "file a and b differ"
    print "a's contents:", open("a").read()
    print "b's contents:", open("b").read()

    $ python2.6 fool_filecmp.py
    file a and b are equal
    a's contents: *****
    b's contents: yadda

    Oops. If you are paranoid you have to clear the cache before doing the
    comparison:

    $ python2.6 fool_filecmp.py --clear
    clearing cache
    file a and b differ
    a's contents: *****
    b's contents: yadda
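
    For readers on Python 3, here is a hedged sketch of the same experiment.
    It assumes an interpreter where filecmp still keeps its comparison cache
    and uses the public filecmp.clear_cache() (available since Python 3.4)
    rather than touching the private _cache. The helper name
    stale_then_fresh and the temporary file names are made up for the
    illustration.

```python
import filecmp
import os
import shutil
import tempfile


def stale_then_fresh():
    """Return (result before clearing the cache, result after clearing it)."""
    tmpdir = tempfile.mkdtemp()
    a, b, d = (os.path.join(tmpdir, name) for name in "abd")
    for path in (a, b, d):
        with open(path, "w") as f:
            f.write("yadda")

    shutil.copystat(d, a)              # give "a" the stat values of "d"
    filecmp.cmp(a, b, shallow=False)   # this result may be cached

    with open(a, "w") as f:            # same length, different contents
        f.write("*****")
    shutil.copystat(d, a)              # put the old modification time back

    before = filecmp.cmp(a, b, shallow=False)
    filecmp.clear_cache()              # drop any cached comparison results
    after = filecmp.cmp(a, b, shallow=False)
    shutil.rmtree(tmpdir)
    return before, after


before, after = stale_then_fresh()
print("before clearing:", before)
print("after clearing:", after)
```

    On an interpreter that caches results, the first value is expected to be
    the stale one; the value obtained after clearing the cache reflects the
    actual file contents.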

    Peter
     
    Peter Otten, Apr 13, 2009
    #5
  6. On Mon, 13 Apr 2009 15:03:32 -0500, Grant Edwards wrote:

    > On 2009-04-13, SpreadTooThin <> wrote:
    >
    >> I want to compare two binary files and see if they are the same. I see
    >> the filecmp.cmp function but I don't get a warm fuzzy feeling that it
    >> is doing a byte by byte comparison of two files to see if they are they
    >> same.

    >
    > Perhaps I'm being dim, but how else are you going to decide if two files
    > are the same unless you compare the bytes in the files?


    If you start with an image in one format (e.g. PNG), and convert it to
    another format (e.g. JPEG), you might want the two files to compare equal
    even though their byte contents are completely different, because their
    contents (the image itself) is visually identical.

    Or you might want a heuristic as a short cut for comparing large files,
    and decide that if two files have the same size and modification dates,
    and the first (say) 100KB are equal, that you will assume the rest are
    probably equal too.

    Neither of these are what the OP wants, I'm just mentioning them to
    answer your rhetorical question :)
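
    The second heuristic can be sketched roughly as follows. This is only an
    illustration of the shortcut described above, not something filecmp
    does; probably_equal and its prefix_bytes parameter are hypothetical
    names. The example also shows how the shortcut is fooled by two files
    that differ only near the end.

```python
import os
import tempfile


def probably_equal(path_a, path_b, prefix_bytes=100 * 1024):
    """Heuristic only: same size, same mtime, and same leading bytes."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    if stat_a.st_size != stat_b.st_size:
        return False
    if stat_a.st_mtime != stat_b.st_mtime:
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read(prefix_bytes) == fb.read(prefix_bytes)


# Two hypothetical files that differ only in their last byte.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.bin")
path_b = os.path.join(tmpdir, "b.bin")
with open(path_a, "wb") as f:
    f.write(b"x" * 200_000 + b"A")
with open(path_b, "wb") as f:
    f.write(b"x" * 200_000 + b"B")

# Give both files the same timestamps so only the contents differ.
stat_a = os.stat(path_a)
os.utime(path_b, ns=(stat_a.st_atime_ns, stat_a.st_mtime_ns))

fooled = probably_equal(path_a, path_b)
print(fooled)
```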



    --
    Steven
     
    Steven D'Aprano, Apr 14, 2009
    #6
  7. SpreadTooThin

    Dave Angel Guest

    SpreadTooThin wrote:
    > On Apr 13, 2:37 pm, Grant Edwards <invalid@invalid> wrote:
    >
    >> On 2009-04-13, Grant Edwards <invalid@invalid> wrote:
    >>
    >>
    >>
    >>
    >>> On 2009-04-13, SpreadTooThin <> wrote:
    >>>
    >>>> I want to compare two binary files and see if they are the same.
    >>>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    >>>> that it is doing a byte by byte comparison of two files to see if they
    >>>> are they same.
    >>>>
    >>> Perhaps I'm being dim, but how else are you going to decide if
    >>> two files are the same unless you compare the bytes in the
    >>> files?
    >>>
    >>> You could hash them and compare the hashes, but that's a lot
    >>> more work than just comparing the two byte streams.
    >>>
    >>>> What should I be using if not filecmp.cmp?
    >>>>
    >>> I don't understand what you've got against comparing the files
    >>> when you stated that what you wanted to do was compare the files.
    >>>

    >> Doh! I misread your post and thought you weren't getting a
    >> warm fuzzy feeling _because_ it was doing a byte-byte
    >> compare. Now I'm a bit confused. Are you under the impression
    >> it's _not_ doing a byte-byte compare? Here's the code:
    >>
    >> def _do_cmp(f1, f2):
    >>     bufsize = BUFSIZE
    >>     fp1 = open(f1, 'rb')
    >>     fp2 = open(f2, 'rb')
    >>     while True:
    >>         b1 = fp1.read(bufsize)
    >>         b2 = fp2.read(bufsize)
    >>         if b1 != b2:
    >>             return False
    >>         if not b1:
    >>             return True
    >>
    >> It looks like a byte-by-byte comparison to me. Note that when
    >> this function is called the file lengths have already been
    >> compared and found to be equal.
    >>
    >> --
    >> Grant Edwards grante Yow! Alright, you!!
    >> at Imitate a WOUNDED SEAL
    >> visi.com pleading for a PARKING
    >> SPACE!!
    >>

    >
    > I am indeed under the impression that it is not always doing a byte by
    > byte comparison...
    > as well the documentation states:
    > Compare the files named f1 and f2, returning True if they seem equal,
    > False otherwise.
    >
    > That word... Seeeeem... makes me wonder.
    >
    > Thanks for the code! :)
    >
    >
    >

    Some of this discussion depends on the version of Python, but nobody said
    which one. In version 2.6.1, the code is different (and more complex) than
    what's listed above. The docs are different too. In this version, at
    least, you'll want to explicitly pass the shallow=False parameter. It
    defaults to 1, by which they must mean True. I think it's a bad
    default, but it's still a useful function. Just be careful to include
    that parameter in your call.

    Further, you want to check the copy of filecmp.py included with your
    version of Python. The file is in the Lib directory, so it's no trouble
    to check it.
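
    One way to do that check from the interpreter itself, sketched for
    Python 3 (the variable name shallow_default is just for illustration):

```python
import filecmp
import inspect

# Where the installed module lives, so its source can be read directly.
print(filecmp.__file__)

# The default value of the "shallow" parameter in this interpreter.
shallow_default = inspect.signature(filecmp.cmp).parameters["shallow"].default
print(shallow_default)
```

    If the default turns out to be true on your version, pass shallow=False
    explicitly whenever you need the contents compared.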
     
    Dave Angel, Apr 14, 2009
    #7
  8. SpreadTooThin

    Adam Olsen Guest

    On Apr 13, 8:39 pm, Grant Edwards <> wrote:
    > On 2009-04-13, Peter Otten <> wrote:
    >
    > > But there's a cache. A change of file contents may go
    > > undetected as long as the file stats don't change:

    >
    > Good point.  You can fool it if you force the stats to their
    > old values after you modify a file and you don't clear the
    > cache.


    The timestamps stored on the filesystem (for ext3 and most other
    filesystems) are fairly coarse, so it's quite possible for a check/
    update/check sequence to have the same timestamp at the beginning and
    end.
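
    A small sketch of why that matters. The filecmp documentation describes
    the os.stat() signature as file type, size, and modification time;
    stat_signature below is a hypothetical helper that builds the same
    triple, and os.utime is used to imitate two writes that land inside one
    timestamp tick.

```python
import os
import stat
import tempfile


def stat_signature(path):
    """Return (file type, size, modification time) for path."""
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"yadda")
    path = tmp.name

before = stat_signature(path)
mtime_ns = os.stat(path).st_mtime_ns

with open(path, "wb") as f:   # same length, different bytes
    f.write(b"*****")
# Put the old timestamp back to imitate a coarse filesystem clock.
os.utime(path, ns=(mtime_ns, mtime_ns))

after = stat_signature(path)
print(before == after)
os.remove(path)
```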
     
    Adam Olsen, Apr 15, 2009
    #8
  9. SpreadTooThin

    Martin Guest

    Hi,

    On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid@invalid> wrote:
    > On 2009-04-13, SpreadTooThin <> wrote:
    >
    >> I want to compare two binary files and see if they are the same.
    >> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    >> that it is doing a byte by byte comparison of two files to see if they
    >> are they same.

    >
    > Perhaps I'm being dim, but how else are you going to decide if
    > two files are the same unless you compare the bytes in the
    > files?


    I'd say checksums, just about every download relies on checksums to
    verify you do have indeed the same file.

    >
    > You could hash them and compare the hashes, but that's a lot
    > more work than just comparing the two byte streams.


    Hashing is not exactly much work; in its simplest form it's 2 lines per file.

    $ dd if=/dev/urandom of=testfile.data bs=1M count=5
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
    $ dd if=/dev/urandom of=testfile2.data bs=1M count=5
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
    $ cp testfile.data testfile3.data
    $ python
    Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import hashlib
    >>> sha = hashlib.sha256()
    >>> sha.update(file("testfile.data").read())
    >>> sha.hexdigest()
    'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
    >>> sha2=hashlib.sha256()
    >>> sha2.update(file("testfile2.data").read())
    >>> sha2.hexdigest()
    '25597380f833f287e8dad936b15ddb616669102c38f54dbd60ce57998d99ad3b'
    >>> sha3=hashlib.sha256()
    >>> sha3.update(file("testfile3.data").read())
    >>> sha3.hexdigest()
    'a0a8b5d1fd7b8181e0131fff8fd6acce39917e4498c86704354221fd96815797'
    >>> sha.hexdigest() == sha2.hexdigest()
    False
    >>> sha.hexdigest() == sha3.hexdigest()
    True
    >>> sha2.hexdigest() == sha3.hexdigest()
    False
    >>>
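
    The session above is Python 2 and reads each file into memory in one go.
    A rough Python 3 equivalent, reading in binary mode and in chunks, might
    look like this. The function name sha256_of_file, the chunk size, and
    the small random scratch files standing in for testfile*.data are all
    assumptions made for the sketch.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so it never sits in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-ins for testfile.data, testfile2.data, testfile3.data.
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, name) for name in ("t1", "t2", "t3")]
data = os.urandom(4096)
for path, payload in zip(paths, (data, os.urandom(4096), data)):
    with open(path, "wb") as f:
        f.write(payload)

h1, h2, h3 = (sha256_of_file(p) for p in paths)
print(h1 == h3, h1 == h2)
```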




    --
    http://soup.alt.delete.co.at
    http://www.xing.com/profile/Martin_Marcher
    http://www.linkedin.com/in/martinmarcher

    You are not free to read this message,
    by doing so, you have violated my licence
    and are required to urinate publicly. Thank you.

    Please avoid sending me Word or PowerPoint attachments.
    See http://www.gnu.org/philosophy/no-word-attachments.html
     
    Martin, Apr 15, 2009
    #9
  10. On Wed, 15 Apr 2009 07:54:20 +0200, Martin wrote:

    >> Perhaps I'm being dim, but how else are you going to decide if two
    >> files are the same unless you compare the bytes in the files?

    >
    > I'd say checksums, just about every download relies on checksums to
    > verify you do have indeed the same file.


    The checksum does look at every byte in each file. Checksumming isn't a
    way to avoid looking at each byte of the two files, it is a way of
    mapping all the bytes to a single number.



    >> You could hash them and compare the hashes, but that's a lot more work
    >> than just comparing the two byte streams.

    >
    > hashing is not exactly much mork in it's simplest form it's 2 lines per
    > file.


    Hashing is a *lot* more work than just comparing two bytes. The MD5
    checksum has been specifically designed to be fast and compact, and the
    algorithm is still complicated:

    http://en.wikipedia.org/wiki/MD5#Pseudocode

    The reference implementation is here:

    http://www.fastsum.com/rfc1321.php#APPENDIXA

    SHA-1 is even more complicated still:

    http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode


    Just because *calling* some checksum function is easy doesn't make the
    checksum function itself simple. They do a LOT more work than just a
    simple comparison between bytes, and that's totally unnecessary work if
    you are making a one-off comparison of two local files.



    --
    Steven
     
    Steven D'Aprano, Apr 15, 2009
    #10
  11. SpreadTooThin

    Martin Guest

    On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
    <> wrote:
    > The checksum does look at every byte in each file. Checksumming isn't a
    > way to avoid looking at each byte of the two files, it is a way of
    > mapping all the bytes to a single number.


    My understanding of the original question was that it asked for a way to
    determine whether 2 files are equal or not. Creating a checksum of 1-n
    files and comparing those checksums IMHO is a valid way to do that. I
    know it's a (one-way) mapping between a (possibly) longer byte sequence
    and another one, but how does checksumming not take each byte in the
    original sequence into account?

    I'd still say rather burn CPU cycles than development hours (if I got
    the question right), if not then with binary files you will have to
    find some way of representing differences between the 2 files in a
    readable manner anyway.

    > Hashing is a *lot* more work than just comparing two bytes. The MD5
    > checksum has been specifically designed to be fast and compact, and the
    > algorithm is still complicated:


    I know that the various checksum algorithms aren't exactly cheap, but
    I do think that just to know whether 2 files are different, a solution
    which takes 5 minutes to implement wins hands down against a lengthy
    discussion which optimizes too early.

    regards,
    martin

     
    Martin, Apr 15, 2009
    #11
  12. SpreadTooThin

    Nigel Rantor Guest

    Martin wrote:
    > On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
    > <> wrote:
    >> The checksum does look at every byte in each file. Checksumming isn't a
    >> way to avoid looking at each byte of the two files, it is a way of
    >> mapping all the bytes to a single number.

    >
    > My understanding of the original question was a way to determine
    > wether 2 files are equal or not. Creating a checksum of 1-n files and
    > comparing those checksums IMHO is a valid way to do that. I know it's
    > a (one way) mapping between a (possibly) longer byte sequence and
    > another one, how does checksumming not take each byte in the original
    > sequence into account.


    The fact that two md5 hashes are equal does not mean that the sources
    they were generated from are equal. To do that you must still perform a
    byte-by-byte comparison which is much less work for the processor than
    generating an md5 or sha hash.

    If you insist on using a hashing algorithm to determine the equivalence
    of two files you will eventually realise that it is a flawed plan
    because you will eventually find two files with different contents that
    nonetheless hash to the same value.

    The more files you test with the quicker you will find out this basic truth.

    This is not complex, it's a simple fact about how hashing algorithms work.

    n
     
    Nigel Rantor, Apr 15, 2009
    #12
  13. SpreadTooThin

    Nigel Rantor Guest

    Grant Edwards wrote:
    > We all rail against premature optimization, but using a
    > checksum instead of a direct comparison is premature
    > unoptimization. ;)


    And more than that, will provide false positives for some inputs.

    So, basically it's a worse-than-useless approach for determining if two
    files are the same.

    n
     
    Nigel Rantor, Apr 15, 2009
    #13
  14. On Apr 15, 8:04 am, Grant Edwards <invalid@invalid> wrote:
    > On 2009-04-15, Martin <> wrote:
    >
    >
    >
    > > Hi,

    >
    > > On Mon, Apr 13, 2009 at 10:03 PM, Grant Edwards <invalid@invalid> wrote:
    > >> On 2009-04-13, SpreadTooThin <> wrote:

    >
    > >>> I want to compare two binary files and see if they are the same.
    > >>> I see the filecmp.cmp function but I don't get a warm fuzzy feeling
    > >>> that it is doing a byte by byte comparison of two files to see if they
    > >>> are they same.

    >
    > >> Perhaps I'm being dim, but how else are you going to decide if
    > >> two files are the same unless you compare the bytes in the
    > >> files?

    >
    > > I'd say checksums, just about every download relies on checksums to
    > > verify you do have indeed the same file.

    >
    > That's slower than a byte-by-byte compare.
    >
    > >> You could hash them and compare the hashes, but that's a lot
    > >> more work than just comparing the two byte streams.

    >
    > > hashing is not exactly much mork in it's simplest form it's 2
    > > lines per file.

    >
    > I meant a lot more CPU time/cycles.
    >
    > --
    > Grant Edwards                   grante             Yow! Was my SOY LOAF left
    >                                   at               out in th'RAIN?  It tastes
    >                                visi.com            REAL GOOD!!


    I'd like to add my 2 cents here.. (That's 1.8 cents US)
    All I was trying to get was a clarification of the documentation of
    the cmp method.
    It isn't clear.

    byte by byte comparison is good enough for me as long as there are no
    cache issues.
    a check sum is not good because a simple sum gives the same result for
    1 + 2 + 3 and 3 + 2 + 1
    a crc of any sort is more work than a byte by byte comparison and
    doesn't give you any more information.
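
    A minimal sketch of both points, assuming Python 3. files_equal is a
    hypothetical helper that compares chunk by chunk and keeps no cache
    between calls; additive_checksum is a deliberately naive checksum used
    only to show the ordering problem.

```python
import os
import tempfile


def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files chunk by chunk, with no caching between calls."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:          # both files ended at the same point
                return True


def additive_checksum(data):
    """A deliberately naive checksum: the sum of the bytes modulo 256."""
    return sum(data) % 256


forward, backward = b"\x01\x02\x03", b"\x03\x02\x01"
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "forward.bin")
path_b = os.path.join(tmpdir, "backward.bin")
for path, payload in ((path_a, forward), (path_b, backward)):
    with open(path, "wb") as f:
        f.write(payload)

# The naive checksum cannot tell the two orderings apart...
print(additive_checksum(forward) == additive_checksum(backward))
# ...while the direct comparison can.
print(files_equal(path_a, path_b))
```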
     
    SpreadTooThin, Apr 15, 2009
    #14
  15. SpreadTooThin

    Adam Olsen Guest

    On Apr 15, 11:04 am, Nigel Rantor <> wrote:
    > The fact that two md5 hashes are equal does not mean that the sources
    > they were generated from are equal. To do that you must still perform a
    > byte-by-byte comparison which is much less work for the processor than
    > generating an md5 or sha hash.
    >
    > If you insist on using a hashing algorithm to determine the equivalence
    > of two files you will eventually realise that it is a flawed plan
    > because you will eventually find two files with different contents that
    > nonetheless hash to the same value.
    >
    > The more files you test with the quicker you will find out this basic truth.
    >
    > This is not complex, it's a simple fact about how hashing algorithms work..


    The only flaw on a cryptographic hash is the increasing number of
    attacks that are found on it. You need to pick a trusted one when you
    start and consider replacing it every few years.

    The chance of *accidentally* producing a collision, although
    technically possible, is so extraordinarily rare that it's completely
    overshadowed by the risk of a hardware or software failure producing
    an incorrect result.
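
    To put a rough number on "extraordinarily rare", here is a sketch using
    the usual birthday-bound approximation, p ~ n^2 / 2^(bits + 1). It
    assumes the hash output behaves like a uniformly random value and that
    nobody is constructing collisions on purpose; the function name and the
    figure of one billion files are assumptions chosen for illustration.

```python
def accidental_collision_probability(num_files, hash_bits):
    """Birthday-bound approximation: n^2 / 2^(bits + 1)."""
    return num_files * num_files / 2 ** (hash_bits + 1)


one_billion = 10 ** 9
p_md5 = accidental_collision_probability(one_billion, 128)
p_sha256 = accidental_collision_probability(one_billion, 256)
print(p_md5, p_sha256)
```

    The approximation says nothing about deliberate attacks, which is the
    separate concern raised in the first paragraph above.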
     
    Adam Olsen, Apr 15, 2009
    #15
  16. SpreadTooThin

    Nigel Rantor Guest

    Adam Olsen wrote:
    > The chance of *accidentally* producing a collision, although
    > technically possible, is so extraordinarily rare that it's completely
    > overshadowed by the risk of a hardware or software failure producing
    > an incorrect result.


    Not when you're using them to compare lots of files.

    Trust me. Been there, done that, got the t-shirt.

    Using hash functions to tell whether or not files are identical is an
    error waiting to happen.

    But please, do so if it makes you feel happy, you'll just eventually get
    an incorrect result and not know it.

    n
     
    Nigel Rantor, Apr 15, 2009
    #16
  17. In message <>, Nigel
    Rantor wrote:

    > Adam Olsen wrote:
    >
    >> The chance of *accidentally* producing a collision, although
    >> technically possible, is so extraordinarily rare that it's completely
    >> overshadowed by the risk of a hardware or software failure producing
    >> an incorrect result.

    >
    > Not when you're using them to compare lots of files.
    >
    > Trust me. Been there, done that, got the t-shirt.


    Not with any cryptographically-strong hash, you haven't.
     
    Lawrence D'Oliveiro, Apr 18, 2009
    #17
