Istvan Albert said:
...
What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.
My finger slipped.
Other functional deficiency when compared to
the cmp diffs is that you don't know which
file has changed or which byte differs
... adding that brings about the potential
for another set of bugs. Then someone else comes along
who knows a little less about python and adds
a little feature to the program that actually
silently breaks it ...
You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", which was originally asked
about.
Whether or not it is actually faster remains
to be seen. And that was my whole point,
not to don't dismiss cmp to soon, see how it works
test it, then armed with some real numbers one
can make better decisions.
Ok. Let me see. I am using the fastest computer in the apartment; my
wife's 2.26 ghz P4 without hyperthreading, with a gig of memory, running
two 40GB ATA-100 drives. I have created 3x1.5gb files on one of the
drives (I don't have 4.5tb free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.
cmp looks to be taking 0-12% processor, so I was correct in my statement
that it is likely disk bound (unless your disks are 8 times faster or
your processor is 1/10 as fast).
Oh, goodness, it is still going. I've been typing this email for over 8
minutes. It sure would be nice if cmp had a progress bar. At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.
*make little fix to code to give me a progress bar, time elapsed, and an
expected total time*
cmp: at least 8 1/2 minutes for the first cmp, was not finished, I
killed it. At least 17 minutes in total;
Python: 15 minutes, 3 seconds to finish it all.
Hrm, that is only around 5 megs/second. I think we can do better (even
if it is already better than cmp)...
New algorithm...
def compare_32(fn1, fn2, fn3):
mn = md5.new
st = time.time()
f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
b = 2**20 #tune this as necessary
p = -1
good = 1
total = 1.5*2**30
digs = []
m = mn()
while f1.tell() > p:
p = f1.tell()
digs.append(m.update(f1.read(b)).digest())
if not ((p>>20)&7) and p:
a = 100.0*p/total/3
d = time.time()-st
print "%8.1f%% %8.0f\t\r"%(a,100*d/a),
m = mn()
for dig in digs:
if dig != m.update(f2.read(b)).digest():
print "files 1 and 2 differ before", f2.tell()
good = 0
break
m = mn()
for dig in digs:
if dig != m.update(f3.read(b)).digest():
print "files 1 and 2 differ before", f3.tell()
good = 0
break
if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
print "files are identical"
f1.close() #I prefer to explicitly close my file handles
f2.close()
f3.close()
That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast. It does use ~20% processor to
compute the md5s, which only makes a difference if your processor is as
fast as a 400mhz P4.
I'd say that Python wins here. Is that concrete enough for you?
- Josiah