comparing multiple copies of terrabytes of data?


Dan Stromberg

We will soon have 3 copies, for testing purposes, of what should be about
4.5 terrabytes of data.

Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block - with a sufficiently large
blocksize, of course.

My question then is, does python have a high-level API that would
facilitate this sort of thing, or should I just code something up based on
open and read?

Thanks!
 

Istvan Albert

Dan said:
Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3

Use cmp. So what if you must run it twice ... by the way, I
really doubt that you could speed up the process in Python
... you'll probably end up with a much slower version.

Istvan.
 

Josiah Carlson

Istvan Albert said:
Use cmp. So what if you must run it twice ... by the way, I
really doubt that you could speed up the process in Python
... you'll probably end up with a much slower version.

In this case you would be wrong. Comparing data on a processor is
trivial (and would be done using Python's C internals anyway if a
strict string equality is all that matters), but I/O is expensive.
Reading terabytes of data is going to be the bottleneck, so reducing I/O
is /the/ optimization that can and should be done.

The code to do so is simple:

def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20 #tune this as necessary
    p = -1
    good = 1
    while f1.tell() < p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()


Note that it /may/ be faster to first convert the data into arrays
(module array) to get 2, 4 or 8 byte block comparisons.
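
A minimal (untested) sketch of that array idea, assuming the two blocks
are the same length and that length is a multiple of the item size:

import array

def blocks_equal(block1, block2):
    # reinterpret each block as 4-byte unsigned ints before comparing;
    # array.array('I', ...) raises ValueError if the length isn't a
    # multiple of the item size
    a1 = array.array('I', block1)
    a2 = array.array('I', block2)
    return a1 == a2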

- Josiah
 

Eddie Corns

Dan said:
We will soon have 3 copies, for testing purposes, of what should be about
4.5 terrabytes of data.
Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block - with a sufficiently large
blocksize, of course.
My question then is, does python have a high-level API that would
facilitate this sort of thing, or should I just code something up based on
open and read?

Taking a checksum of each file and comparing them would probably be much
faster. A quick test with md5 versus cmp gave me a 10 times speedup.
Though, unexpectedly, running 2 md5 processes in parallel was slower than
running 2 in sequence - could be the cheap'n'nasty HD in my desktop;
normally I'd expect a gain here, as one process uses the CPU while the
other does I/O.
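
For illustration, a sketch of that kind of checksum test (not Eddie's
actual commands; it assumes a GNU md5sum binary on the PATH, and the
file names are made up):

import os

def md5sum(path):
    # run the md5sum command and keep only the hex digest column
    output = os.popen('md5sum "%s"' % path).read()
    return output.split()[0]

def same_contents(paths):
    digests = [md5sum(p) for p in paths]
    return digests.count(digests[0]) == len(digests)

# hypothetical names for the three copies
if same_contents(['/copy1/big.dat', '/copy2/big.dat', '/copy3/big.dat']):
    print "checksums match"
else:
    print "checksums differ"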

Eddie
 

Istvan Albert

Josiah said:
The code to do so is simple:
....
> p = -1
> good = 1
> while f1.tell() < p:
>     p = f1.tell()
>     if f1.read(b) == f2.read(b) == f3.read(b):
>         continue

....

What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.

Another functional deficiency, compared to the cmp approach, is that
you don't know which file has changed or which byte differs
... adding that brings about the potential
for another set of bugs. Then someone else comes along
who knows a little less about Python and adds
a little feature to the program that actually
silently breaks it ...

Whether or not it is actually faster remains
to be seen. And that was my whole point:
don't dismiss cmp too soon; see how it works,
test it, and then, armed with some real numbers,
one can make better decisions.

Istvan.
 

Josiah Carlson

Istvan Albert said:
...

What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.

My finger slipped.

Another functional deficiency, compared to the cmp approach, is that
you don't know which file has changed or which byte differs
... adding that brings about the potential
for another set of bugs. Then someone else comes along
who knows a little less about Python and adds
a little feature to the program that actually
silently breaks it ...

You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", which was originally asked
about.

Whether or not it is actually faster remains
to be seen. And that was my whole point:
don't dismiss cmp too soon; see how it works,
test it, and then, armed with some real numbers,
one can make better decisions.

Ok. Let me see. I am using the fastest computer in the apartment; my
wife's 2.26 GHz P4 without hyperthreading, with a gig of memory, running
two 40 GB ATA-100 drives. I have created 3 x 1.5 GB files on one of the
drives (I don't have 4.5 TB free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.

cmp looks to be taking 0-12% processor, so I was correct in my statement
that it is likely disk bound (unless your disks are 8 times faster or
your processor is 1/10 as fast).

Oh, goodness, it is still going. I've been typing this email for over 8
minutes. It sure would be nice if cmp had a progress bar. At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.

*make little fix to code to give me a progress bar, time elapsed, and an
expected total time*

cmp: at least 8 1/2 minutes for the first cmp; it was not finished when I
killed it. At least 17 minutes in total.
Python: 15 minutes, 3 seconds to finish it all.

Hrm, that is only around 5 megs/second. I think we can do better (even
if it is already better than cmp)...
New algorithm...


def compare_32(fn1, fn2, fn3):
    mn = md5.new
    st = time.time()
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20 #tune this as necessary
    p = -1
    good = 1
    total = 1.5*2**30
    digs = []
    m = mn()
    while f1.tell() > p:
        p = f1.tell()
        digs.append(m.update(f1.read(b)).digest())
        if not ((p>>20)&7) and p:
            a = 100.0*p/total/3
            d = time.time()-st
            print "%8.1f%% %8.0f\t\r"%(a,100*d/a),
    m = mn()
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()
            good = 0
            break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()


That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast. It does use ~20% processor to
compute the md5s, which only makes a difference if your processor is as
fast as a 400 MHz P4.

I'd say that Python wins here. Is that concrete enough for you?


- Josiah
 

Josiah Carlson

Josiah said:
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()

Oops, copy/paste bug in that last line ("files 1 and 2").
Should be:

            print "files 1 and 3 differ before", f3.tell()

- Josiah
 

Josiah Carlson

Obviously this was running old code as the following lines don't work...
digs.append(m.update(f1.read(b)).digest())
if dig != m.update(f2.read(b)).digest():
if dig != m.update(f3.read(b)).digest():

Those got split up into:
m.update(f1.read(b))
digs.append(m.digest())

m.update(f2.read(b))
if dig != m.digest():

m.update(f3.read(b))
if dig != m.digest():

respectively. Gah, I don't like mornings.

- Josiah
 

Istvan Albert

Josiah said:
measure, if shy by around 3 orders of magnitude in terms of time.
That new one runs in 5 minutes 15 seconds total, because it exploits the

My point was never to say that it is not possible to write
a better way, nor to imply that you could not do it; I simply
said that there is no easy way around this problem.

Your solution, while short and nice, is not simple, and
requires quite a bit of knowledge to understand
why and how it works.

Other topic ... sometimes I have a hard time visualizing how much
a terabyte of data actually is; this is a good example
of that. Even this optimized algorithm would take over three
days to perform a simple identity check ...

Istvan.
 

Josiah Carlson

Istvan Albert said:
My point was never to say that it is not possible to write
a better way, nor to imply that you could not do it; I simply
said that there is no easy way around this problem.

Funny, I thought the implementation I offered was an easy way. No
easier than to use someone else's code, which took all of 5 minutes to
write.

Your solution, while short and nice, is not simple, and
requires quite a bit of knowledge to understand
why and how it works.

Anyone who has difficulty understanding the algorithm I offered that
used md5 should seriously consider switching hobbies/jobs. The algorithm
is trivial to understand.

Other topic ... sometimes I have a hard time visualizing how much
a terabyte of data actually is; this is a good example
of that. Even this optimized algorithm would take over three
days to perform a simple identity check ...

It would only take that long if you could only get 15 megs/second from
your drives (it was the main boot drive and the files were fragmented).
At multi-TB data sizes, RAID arrays are common, so read speed is
generally much higher, which would then reduce the running time
significantly.

In the case of multiple drives in a non-RAID array, splitting up the
process could scale linearly, up to the limit of the drive and
processor(s) speed.
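
A rough sketch of that splitting idea (untested; it assumes each copy
sits on its own drive, and the mount points are made up), with one
thread hashing each copy while the others read from their own spindles:

import md5
import threading

def md5_of_file(path, results, block=2**20):
    # hash one copy sequentially; each thread reads from its own drive
    m = md5.new()
    f = open(path, 'rb')
    while 1:
        data = f.read(block)
        if not data:
            break
        m.update(data)
    f.close()
    results[path] = m.hexdigest()

def copies_match(paths):
    results = {}
    threads = [threading.Thread(target=md5_of_file, args=(p, results))
               for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    digests = [results[p] for p in paths]
    return digests.count(digests[0]) == len(digests)

# hypothetical mount points, one per drive
print copies_match(['/drive1/big.dat', '/drive2/big.dat', '/drive3/big.dat'])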


- Josiah
 

Dan Stromberg

Folks, I'm a little overwhelmed by how helpful this thread has been to me
- thanks a bunch.

I'd like to point out that I have some code written up
that I hope to use correctly the first time, in verifying that our three
copies of a 3+ terrabyte collection of data... well, that the first copy
is a (not-necessarily-proper) subset of the other two copies. Once we
sign off, saying "the data copies look correct", then IBM's going to say
"OK, for better or worse, it's your storage system now, problems and all."

This storage system is the first of its kind in the known universe (as
far as the filesystem software combined with very high storage density
Linux boxes goes), and I'm guessing that there's around $500K sunk into
it, just on hardware. -But-, it's far cheaper than almost any other
solution of its size.

I'll add that the system we are transitioning from is open source from
Red Hat, and is apparently pure-play GPL, and the system we are
transitioning to is from ClusterFS, which has a dual-license GPL/closed
source thing going on (where the closed source stuff is transitioned to
GPL eventually, kind of like Aladdin Ghostscript).

Rather than letting the specifics of the situation creep into the code too
much, I've tried to make it into a fairly general, reusable tool that
others might be able to benefit from as well.

You basically give the program a sampling frequency (number of files to
skip before checking one), and a list of directories. It uses md5
hashes to keep resource utilization down.
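
Roughly, the idea is along these lines (a simplified sketch, not the
actual verify script - see the URL later in this post; the directory
handling here is stripped down):

import md5
import os

def file_md5(path, block=2**20):
    m = md5.new()
    f = open(path, 'rb')
    while 1:
        data = f.read(block)
        if not data:
            break
        m.update(data)
    f.close()
    return m.hexdigest()

def verify_sample(dirs, frequency):
    # walk the first copy and hash every Nth file in every copy
    count = 0
    for root, subdirs, files in os.walk(dirs[0]):
        for name in files:
            count += 1
            if count % frequency:
                continue
            relative = os.path.join(root, name)[len(dirs[0]):].lstrip(os.sep)
            digests = [file_md5(os.path.join(d, relative)) for d in dirs]
            if digests.count(digests[0]) != len(digests):
                print "mismatch:", relative

# e.g. verify_sample(['/copy1', '/copy2', '/copy3'], 100)  # hypothetical paths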

I'm basically soliciting additional eyeballs for what I hope I've
persuaded you is a good cause for more than one reason, if you find
yourself with some time and curiosity. It's 108 lines of code (including
the usage() function), and 15 lines of comments.

The URL is http://dcs.nac.uci.edu/~strombrg/verify.html

Thanks for even considering looking at this with me!
 

Andrew Dalke

Dan said:
that I hope to use correctly the first time, in verifying that our three
copies of a 3+ terrabyte collection of data... well, that the first copy

Not trying to be too much of a pedant, but the prefix is "tera"
not "terra".

To make up for that pedantry, here are some pointers:

for i in range(len(dirlist)):
    fullfilename = os.path.join(dirlist[i],filename)
    os.chdir(dirlist[i])
    statbufs.append(os.lstat(fullfilename))

could be rewritten as

for dirname in dirlist:
    fullfilename = os.path.join(dirname, filename)
    os.chdir(dirname)
    statbufs.append(os.lstat(fullfilename))

BTW, do you really need that chdir? You've already got the full
filename so the current directory doesn't make a difference. Unless
you allow relative directories there ... ?

for i in range(len(dirlist)-1):
    for field in [stat.ST_GID, stat.ST_UID, stat.ST_MODE, stat.ST_SIZE]:
        if statbufs[i][field] != statbufs[i+1][field]:
            return 0

Can dirlist ever be empty?

You just want to check if all elements in statbufs are the
same, right? I prefer this style

first = statbufs[0]
for statbuf in statbufs[1:]:
    for field in (stat.ST_GID, stat.ST_UID, stat.ST_MODE, stat.ST_SIZE):
        if first[field] != statbuf[field]:
            return 0

because I find the chaining-if comparing i and i+1 harder
to understand. Even if this is one line longer.


def main():
    try:
        (list,dirs) = getopt.getopt(sys.argv[1:],'n:')
    except:
        usage()
    if len(dirs) == 0:
        usage()

Have you tried to see whether your usage handling works when you give it
the wrong parameters? It will print out the error message and
then give a NameError, because dirs doesn't exist. Try this
instead:

def main():
    try:
        (list,dirs) = getopt.getopt(sys.argv[1:],'n:')
    except getopt.error, err:
        usage()
        raise SystemExit(str(err))

This will only catch getopt errors, and it raises a
SystemExit so that 1) it prints out the text of the
getopt error (helpful for users to know what went wrong)
and 2) it sets the program's exit code to 1.


for dir in dirs:
    os.chdir(dir)
os.chdir(dirs[0])


Again, I don't understand the chdir calls here. Perhaps
I should have read the rest of this thread?

If you want to see if that directory exists, and want
to do it in this style, you should return to the original
directory. Why? Because this will fail if someone
specifies relative directories on the command-line.

origdir = os.getcwd()
for dir in dirs:
    os.chdir(dir)
    os.chdir(origdir)
os.chdir(dirs[0])

I still find it suspicious. I almost never use chdir
in my code because it can really confuse libraries
that expect relative paths to always work during the
lifetime of a program.
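
A tiny example of that failure mode (made-up path):

import os

relative = os.path.join('data', 'input.txt')
print os.path.abspath(relative)   # resolves against the starting directory

os.chdir('/tmp')
print os.path.abspath(relative)   # now resolves against /tmp instead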


main()


That should be

if __name__ == "__main__":
    main()


because it lets you import the file as well as use it
as a mainline program. Why is that useful? It makes
regression tests a lot easier to write. To help that
I'll also do

if __name__ == "__main__":
    main(sys.argv)

then make it so that the main function takes an argv.

def main(argv):
    ...

Again, it makes it easier to write tests that way.
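
For example (hypothetical module name and arguments), a test can then
import the script and drive main() directly instead of spawning a
separate process:

# hypothetical: assumes the script is importable as verify.py
import verify

def test_three_identical_dirs():
    # argv[0] is the program name, just like sys.argv
    verify.main(['verify', '-n', '10', '/copy1', '/copy2', '/copy3'])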


Hmmm, or is this response also showing me to be a pedant?

:)

Best wishes,

Andrew
 
