binary file compare...

SpreadTooThin · Apr 13, 2009

I want to compare two binary files and see if they are the same.
I see the filecmp.cmp function but I don't get a warm fuzzy feeling
that it is doing a byte by byte comparison of two files to see if they
are the same.

What should I be using if not filecmp.cmp?

SpreadTooThin · Apr 13, 2009

Well, here's something: http://www.daniweb.com/forums/thread115959.html
but it seems from the post on the bottom that filecmp does comparison of
binary files.

I just want to be clear, the comparison is not just based on file size
and creation date but by a byte by byte comparison of the data in each
file.

SpreadTooThin · Apr 13, 2009

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.

I don't understand what you've got against comparing the files
when you stated that what you wanted to do was compare the files.

I think it's just the way the documentation was worded
http://www.python.org/doc/2.5.2/lib/module-filecmp.html

Unless shallow is given and is false, files with identical os.stat()
signatures are taken to be equal.
Files that were compared using this function will not be compared
again unless their os.stat() signature changes.

So to do a comparison:
filecmp.cmp(filea, fileb, False)
?
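A minimal sketch of the difference the question is about, written in Python 3 syntax rather than the 2.5 of the thread. The file names, contents and the fixed timestamp are made up for illustration. Two files of equal size with different contents are given the same modification time, so their os.stat() signatures match; the default (shallow) call then treats them as equal, while shallow=False reads the data.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")  # hypothetical file names
    path_b = os.path.join(tmp, "b.bin")
    with open(path_a, "wb") as f:
        f.write(b"yadda")
    with open(path_b, "wb") as f:
        f.write(b"*****")  # same length, different bytes

    # Force identical modification times so both os.stat() signatures
    # (file type, size, mtime) are the same.
    for path in (path_a, path_b):
        os.utime(path, (1_000_000_000, 1_000_000_000))

    # shallow defaults to True: only the stat signatures are compared.
    shallow_result = filecmp.cmp(path_a, path_b)
    # shallow=False: the file contents are actually read and compared.
    deep_result = filecmp.cmp(path_a, path_b, shallow=False)

print("shallow:", shallow_result)
print("deep:   ", deep_result)
```

So yes, passing False (or shallow=False) as the third argument is the call that looks at the bytes.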

SpreadTooThin · Apr 13, 2009

Doh! I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-byte
compare. Now I'm a bit confused. Are you under the impression
it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when
this function is called the file lengths have already been
compared and found to be equal.

--
Grant Edwards grante Yow! Alright, you!!
at Imitate a WOUNDED SEAL
visi.com pleading for a PARKING
SPACE!!

I am indeed under the impression that it is not always doing a byte by
byte comparison...
as well the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seeeeem... makes me wonder.

Thanks for the code!

Peter Otten · Apr 13, 2009

Grant said:
Doh! I misread your post and thought you weren't getting a
warm fuzzy feeling _because_ it was doing a byte-byte
compare. Now I'm a bit confused. Are you under the impression
it's _not_ doing a byte-byte compare? Here's the code:

def _do_cmp(f1, f2):
    bufsize = BUFSIZE
    fp1 = open(f1, 'rb')
    fp2 = open(f2, 'rb')
    while True:
        b1 = fp1.read(bufsize)
        b2 = fp2.read(bufsize)
        if b1 != b2:
            return False
        if not b1:
            return True

It looks like a byte-by-byte comparison to me. Note that when
this function is called the file lengths have already been
compared and found to be equal.

But there's a cache. A change of file contents may go undetected as long as
the file stats don't change:

$ cat fool_filecmp.py
import filecmp, shutil, sys

for fn in "adb":
    with open(fn, "w") as f:
        f.write("yadda")

shutil.copystat("d", "a")
filecmp.cmp("a", "b", False)

with open("a", "w") as f:
    f.write("*****")
shutil.copystat("d", "a")

if "--clear" in sys.argv:
    print "clearing cache"
    filecmp._cache.clear()

if filecmp.cmp("a", "b", False):
    print "file a and b are equal"
else:
    print "file a and b differ"
print "a's contents:", open("a").read()
print "b's contents:", open("b").read()

$ python2.6 fool_filecmp.py
file a and b are equal
a's contents: *****
b's contents: yadda

Oops. If you are paranoid you have to clear the cache before doing the
comparison:

$ python2.6 fool_filecmp.py --clear
clearing cache
file a and b differ
a's contents: *****
b's contents: yadda

Peter
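For readers on Python 3, the same effect can be sketched as follows. This is an illustration under stated assumptions, not the script from the post: it uses os.utime with an arbitrary fixed timestamp instead of shutil.copystat, temporary file names of its own, and the public filecmp.clear_cache() function (available since Python 3.4) instead of reaching into filecmp._cache.

```python
import filecmp
import os
import tempfile

FIXED_TIME = (1_000_000_000, 1_000_000_000)  # arbitrary (atime, mtime) pair

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a")
    path_b = os.path.join(tmp, "b")
    for path in (path_a, path_b):
        with open(path, "w") as f:
            f.write("yadda")
    os.utime(path_a, FIXED_TIME)

    filecmp.clear_cache()  # start from an empty cache
    # Deep comparison; the result is cached under the stat signatures.
    first = filecmp.cmp(path_a, path_b, shallow=False)

    # Change the contents but keep the size, then restore the old mtime,
    # so the stat signature of "a" is unchanged.
    with open(path_a, "w") as f:
        f.write("*****")
    os.utime(path_a, FIXED_TIME)

    # Same signatures as before, so the cached answer is returned.
    stale = filecmp.cmp(path_a, path_b, shallow=False)

    # After clearing the cache the files are read again.
    filecmp.clear_cache()
    fresh = filecmp.cmp(path_a, path_b, shallow=False)

print("first:", first, "stale:", stale, "fresh:", fresh)
```

The stale answer only appears because the size and modification time were deliberately kept identical; without that, the changed signature would invalidate the cache entry by itself.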

Steven D'Aprano · Apr 14, 2009

Perhaps I'm being dim, but how else are you going to decide if two files
are the same unless you compare the bytes in the files?

If you start with an image in one format (e.g. PNG), and convert it to
another format (e.g. JPEG), you might want the two files to compare equal
even though their byte contents are completely different, because their
contents (the image itself) is visually identical.

Or you might want a heuristic as a short cut for comparing large files,
and decide that if two files have the same size and modification dates,
and the first (say) 100KB are equal, that you will assume the rest are
probably equal too.

Neither of these is what the OP wants; I'm just mentioning them to
answer your rhetorical question.

Dave Angel · Apr 14, 2009

SpreadTooThin said:
I am indeed under the impression that it is not always doing a byte by
byte comparison...
as well the documentation states:
Compare the files named f1 and f2, returning True if they seem equal,
False otherwise.

That word... Seeeeem... makes me wonder.

Thanks for the code!

Some of this discussion depends on the version of Python, but didn't say
so. In version 2.6.1, the code is different (and more complex) than
what's listed above. The docs are different too. In this version, at
least, you'll want to explicitly pass the shallow=False parameter. It
defaults to 1, by which they must mean True. I think it's a bad
default, but it's still a useful function. Just be careful to include
that parameter in your call.

Further, you want to check the filecmp.py included with your version of
Python. The file is in the Lib directory, so it's no trouble to check it.

Adam Olsen · Apr 15, 2009

Good point. You can fool it if you force the stats to their
old values after you modify a file and you don't clear the
cache.

The timestamps stored on the filesystem (for ext3 and most other
filesystems) are fairly coarse, so it's quite possible for a check/
update/check sequence to have the same timestamp at the beginning and
end.

Martin · Apr 15, 2009

Hi,

Perhaps I'm being dim, but how else are you going to decide if
two files are the same unless you compare the bytes in the
files?

I'd say checksums, just about every download relies on checksums to
verify you do have indeed the same file.

You could hash them and compare the hashes, but that's a lot
more work than just comparing the two byte streams.

Hashing is not exactly much work; in its simplest form it's 2 lines per file.

$ dd if=/dev/urandom of=testfile.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.4491 s, 3.6 MB/s
$ dd if=/dev/urandom of=testfile2.data bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.92479 s, 2.7 MB/s
$ cp testfile.data testfile3.data
$ python
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

--
http://soup.alt.delete.co.at
http://www.xing.com/profile/Martin_Marcher
http://www.linkedin.com/in/martinmarcher

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

Steven D'Aprano · Apr 15, 2009

I'd say checksums, just about every download relies on checksums to
verify you do have indeed the same file.

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.

Hashing is not exactly much work; in its simplest form it's 2 lines per
file.

Hashing is a *lot* more work than just comparing two bytes. The MD5
checksum has been specifically designed to be fast and compact, and the
algorithm is still complicated:

http://en.wikipedia.org/wiki/MD5#Pseudocode

The reference implementation is here:

http://www.fastsum.com/rfc1321.php#APPENDIXA

SHA-1 is even more complicated still:

http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudocode

Just because *calling* some checksum function is easy doesn't make the
checksum function itself simple. They do a LOT more work than just a
simple comparison between bytes, and that's totally unnecessary work if
you are making a one-off comparison of two local files.
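For reference, the hash-based approach being debated could look like the sketch below. It is not the interactive session from the earlier post, which is cut off there; the helper name sha256_of_file, the chunk size, the choice of SHA-256 and the demo file names are all assumptions made for illustration. Note that every byte of both files is still read, which is the point made above.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=64 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small demo with throwaway files (names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "testfile.data")
    copy = os.path.join(tmp, "copy.data")
    other = os.path.join(tmp, "other.data")
    payload = os.urandom(1024)
    with open(original, "wb") as f:
        f.write(payload)
    with open(copy, "wb") as f:
        f.write(payload)
    with open(other, "wb") as f:
        f.write(payload + b"\x00")  # one extra byte

    copy_matches = sha256_of_file(original) == sha256_of_file(copy)
    other_matches = sha256_of_file(original) == sha256_of_file(other)

print(copy_matches, other_matches)
```

A digest is mainly useful when one of the files is not available locally, or when the same file will be compared many times and the digest can be stored.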

Martin · Apr 15, 2009

The checksum does look at every byte in each file. Checksumming isn't a
way to avoid looking at each byte of the two files, it is a way of
mapping all the bytes to a single number.

My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one way) mapping between a (possibly) longer byte sequence and
another one, how does checksumming not take each byte in the original
sequence into account?

I'd still say rather burn CPU cycles than development hours (if I got
the question right), if not then with binary files you will have to
find some way of representing differences between the 2 files in a
readable manner anyway.

Hashing is a *lot* more work than just comparing two bytes. The MD5
checksum has been specifically designed to be fast and compact, and the
algorithm is still complicated:

I know that the various checksum algorithms aren't exactly cheap, but
I do think that just to know whether 2 files are different, a solution
which takes 5 minutes to implement wins hands down against a lengthy
discussion which optimizes too early.

regards,
martin

Nigel Rantor · Apr 15, 2009

Martin said:
My understanding of the original question was a way to determine
whether 2 files are equal or not. Creating a checksum of 1-n files and
comparing those checksums IMHO is a valid way to do that. I know it's
a (one way) mapping between a (possibly) longer byte sequence and
another one, how does checksumming not take each byte in the original
sequence into account?

The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

n

Nigel Rantor · Apr 15, 2009

Grant said:
We all rail against premature optimization, but using a
checksum instead of a direct comparison is premature
unoptimization.

And more than that, will provide false positives for some inputs.

So, basically it's a worse-than-useless approach for determining if two
files are the same.

n

SpreadTooThin · Apr 15, 2009

That's slower than a byte-by-byte compare.

I meant a lot more CPU time/cycles.

I'd like to add my 2 cents here.. (That's 1.8 cents US)
All I was trying to get was a clarification of the documentation of
the cmp method.
It isn't clear.

byte by byte comparison is good enough for me as long as there are no
cache issues.
a simple checksum is not good because it can't tell 1 + 2 + 3 apart
from 3 + 2 + 1
a crc of any sort is more work than a byte by byte comparison and
doesn't give you any more information.
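A byte-by-byte comparison with no cache involved can be sketched in a few lines. This is an illustration in Python 3, modelled loosely on the _do_cmp loop quoted earlier in the thread; the name files_identical, the chunk size and the demo files are assumptions, not part of filecmp.

```python
import os
import tempfile


def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files by content; nothing is cached between calls."""
    # Different sizes can never have identical contents.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True   # both files hit end-of-file together


# Small demo with throwaway files (names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    path_1 = os.path.join(tmp, "one.bin")
    path_2 = os.path.join(tmp, "two.bin")
    path_3 = os.path.join(tmp, "three.bin")
    with open(path_1, "wb") as f:
        f.write(bytes([1, 2, 3]))
    with open(path_2, "wb") as f:
        f.write(bytes([1, 2, 3]))
    with open(path_3, "wb") as f:
        f.write(bytes([3, 2, 1]))  # same bytes, different order

    same = files_identical(path_1, path_2)
    reordered = files_identical(path_1, path_3)

print(same, reordered)
```

Unlike an additive checksum, the direct comparison distinguishes the reordered file, and it stops reading at the first chunk that differs.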

Adam Olsen · Apr 15, 2009

The fact that two md5 hashes are equal does not mean that the sources
they were generated from are equal. To do that you must still perform a
byte-by-byte comparison which is much less work for the processor than
generating an md5 or sha hash.

If you insist on using a hashing algorithm to determine the equivalence
of two files you will eventually realise that it is a flawed plan
because you will eventually find two files with different contents that
nonetheless hash to the same value.

The more files you test with the quicker you will find out this basic truth.

This is not complex, it's a simple fact about how hashing algorithms work.

The only flaw on a cryptographic hash is the increasing number of
attacks that are found on it. You need to pick a trusted one when you
start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.
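To put a rough number on "extraordinarily rare": assuming an idealised hash whose outputs are uniformly distributed (an assumption that ignores deliberate attacks and any weakness of a real algorithm), the chance that any two of n distinct files share a digest is at most the number of file pairs divided by the number of possible digests. The function name and the sample figures below are illustrative only.

```python
from fractions import Fraction


def collision_upper_bound(n_files, hash_bits):
    """Union bound on P(at least one accidental collision) for an ideal hash."""
    pairs = n_files * (n_files - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** hash_bits)


# One billion distinct files hashed with a 128-bit and a 160-bit digest.
bound_128 = collision_upper_bound(10 ** 9, 128)
bound_160 = collision_upper_bound(10 ** 9, 160)

print("128-bit bound: %.3e" % float(bound_128))
print("160-bit bound: %.3e" % float(bound_160))
```

The bound grows with the square of the number of files, which is why the digest width matters far more for large collections than for a one-off comparison of two files.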

Nigel Rantor · Apr 15, 2009

Adam said:
The chance of *accidentally* producing a collision, although
technically possible, is so extraordinarily rare that it's completely
overshadowed by the risk of a hardware or software failure producing
an incorrect result.

Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Using hash functions to tell whether or not files are identical is an
error waiting to happen.

But please, do so if it makes you feel happy, you'll just eventually get
an incorrect result and not know it.

n

Lawrence D'Oliveiro · Apr 18, 2009

Nigel said:
Not when you're using them to compare lots of files.

Trust me. Been there, done that, got the t-shirt.

Not with any cryptographically-strong hash, you haven't.
