hard disk activity


VSmirk

I have a task that involves knowing when a file has changed. For
small files this is easy enough: check the modification dates, or do
a compare on the contents. But I need to be able to do this for very
large files.

Is there anything already available in Python that will allow me to
check the hard disk itself, or that can make my routines aware when a
disk write has occurred?

Thanks for any help,

V
 

Rene Pijlman

VSmirk:
I have a task that involves knowing when a file has changed. For
small files this is easy enough: check the modification dates,

Checking the modification time works the same way for large files. Why is
that not good enough?
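For what it's worth, the mtime check is the same call no matter the
size. A minimal sketch (the file name is made up):

    import os.path

    path = "big.dat"                     # hypothetical file name
    last_mtime = os.path.getmtime(path)  # remember the last-modified time

    # ... later, poll again ...
    if os.path.getmtime(path) != last_mtime:
        print("file has been modified")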

What's your platform?
 
V

VSmirk

I'm working primarily on Windows XP, but my solution needs to be
cross-platform.

The problem is that I need more than the fact that a file has been
modified. I need to know what has been modified in that file.

I need to synchronize the file to a remote folder, and my current
solution, which simply copies the file if a date comparison or a
content comparison shows a difference, becomes a bit unmanageable for very large files.
Some of the files I'm working with are hundreds of MB in size, or
larger.

So I need to skip copying a hundred MB file that has had only a few
bytes changed and instead identify which few bytes have changed and
where those changes are. I was thinking that a module that worked
below the file system, at the device level, might be a place to
look for a solution.
 
P

Paul Rubin

VSmirk said:
I need to synchronize the file to a remote folder, and my current
solution, which simply copies the file if a date comparison or a
content comparison shows a difference, becomes a bit unmanageable for very large files.
Some of the files I'm working with are hundreds of MB in size, or
larger.

Why don't you look at the rsync program:

http://samba.anu.edu.au/rsync/

But for that much data, just plopping it all into a huge file is not
a great approach if you can help it. Maybe you can use a database instead.
 

VSmirk

I agree with you wholeheartedly, but the large files are part of the
business requirements.

Thanks for the link. I'll look into it.

V
 

gene tani

VSmirk said:
I'm working primarily on Windows XP, but my solution needs to be
cross-platform.

The problem is that I need more than the fact that a file has been
modified. I need to know what has been modified in that file.

I need to synchronize the file to a remote folder, and my current
solution, which simply copies the file if a date comparison or a
content comparison shows a difference, becomes a bit unmanageable for
very large files. Some of the files I'm working with are hundreds of
MB in size, or larger.

So I need to skip copying a hundred MB file that has had only a few
bytes changed and instead identify which few bytes have changed and
where those changes are. I was thinking that a module that worked
below the file system, at the device level, might be a place to
look for a solution.

Sounds like diffing the files is the crux of it. Look at
sequence-matching libs like difflib (don't know if they'll handle strings this big):

http://docs.python.org/lib/module-difflib.html

For watching files' last-modified flags:
http://www.amk.ca/python/simple/dirwatch.html
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/215418

http://python-fam.sourceforge.net/

http://pyinotify.sourceforge.net/

(there are a few recipes in the online cookbook, in fact)
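For the diffing itself, a minimal difflib sketch (file names made up;
this reads whole files into memory, so it's only sane for smallish ones):

    import difflib

    old = open("old.txt").read()   # hypothetical files
    new = open("new.txt").read()

    sm = difflib.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            print(tag, "old[%d:%d]" % (i1, i2), "new[%d:%d]" % (j1, j2))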
 

VSmirk

Pretty much, yeah. Except I need to diff a pair of files that exist on
opposite ends of a network, without causing the entire contents of the
file to be transferred over that network.

Now, I have the option of doing this: if I am able to determine that
(for instance) bytes 1473 to 10468 in an 849308-byte file are the only
segment that has changed, I can send that range over the network and
insert it into the right place; and then, with some downtime overnight,
I can do a file-copy synchronization to ensure there were no errors
during the day. (I'm reading this and wondering if it even makes
sense; sorry if it doesn't.)

But the trick in my mind is figuring out which specific bytes have been
written to disk. That's why I was thinking device level. Am I going
to have to work in C++ or Assembler for something like this?

Sorry if this sounds like a newbie question. I've been working with
Python long enough to know that someone out there has already solved
one or another really obscure problem, so I thought I'd take a stab
at it.

Thanks everyone for the great links.

V
 

Paul Rubin

VSmirk said:
But the trick in my mind is figuring out which specific bytes have been
written to disk. That's why I was thinking device level. Am I going
to have to work in C++ or Assembler for something like this?

No, you can do it in Python. The basic idea is: locally compute a
separate checksum for (say) each 1% chunk of the file. Do the same
thing on the remote side. So for a 1GB file, you compute 100
checksums at each end, each checksum covering 10 MB. Then send the
100 checksums over the network, which is just a few kbytes. Compare
the checksums and you know which 10MB chunks have changed. For the
chunks that have changed, divide them into 100-kbyte sub-chunks and
checksum those, etc. The optimal number of chunks at each level
depends on network speed and various other things. Anyway this is
basically how rsync works.

Doing anything device level will be highly OS dependent.
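A minimal sketch of the first level of that scheme, assuming md5 from
hashlib as the checksum (the path is made up; the remote side would run
the same function and send back its digest list):

    import hashlib
    import os

    def chunk_checksums(path, num_chunks=100):
        # One checksum per (roughly) equal-sized chunk of the file;
        # only these digests, a few KB, ever cross the network.
        chunk_size = max(1, os.path.getsize(path) // num_chunks)
        sums = []
        with open(path, "rb") as f:
            chunk = f.read(chunk_size)
            while chunk:
                sums.append(hashlib.md5(chunk).hexdigest())
                chunk = f.read(chunk_size)
        return sums

    local = chunk_checksums("big.dat")   # hypothetical path
    # With `remote` being the list the far end sends back, the changed
    # chunks are simply the indexes where the digests disagree:
    # changed = [i for i, (a, b) in enumerate(zip(local, remote)) if a != b]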
 

VSmirk

Awesome!!! I got as far as segmenting the large file on my own, and I
ran out of ideas. I kind of thought about checksums, but I never put
the two together.

Thanks. You've helped a lot....

V
 

Paul Rubin

VSmirk said:
Awesome!!! I got as far as segmenting the large file on my own, and I
ran out of ideas. I kind of thought about checksums, but I never put
the two together.

Thanks. You've helped a lot....

The checksum method I described works OK if bytes change in the middle
of the file but don't get inserted (pieces of the file don't move
around). If you insert one byte in the middle of a 1GB file (so it
becomes 1GB+1 bytes) then all the checksums after the insertion point
change, which is no good for your purpose.

Rsync is a very clever program. Rather than re-implement its
algorithm maybe you should just install it and use it, either directly
(instead of writing a Python program) or under control of a Python
program, using os.system or the subprocess module.
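For instance, driving it with subprocess (path, host, and flags here
are only illustrative):

    import subprocess

    # -a preserves attributes, -z compresses in transit; rsync itself
    # performs the block-checksum delta transfer described above.
    rc = subprocess.call(
        ["rsync", "-az", "/data/big.dat", "user@remote:/data/"])
    if rc != 0:
        print("rsync failed with exit code", rc)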
 

VSmirk

Thanks for the heads-up. I was so giddy with the simplicity of the
solution, I stopped trying to poke holes in it.

I agree with your philosophy of not "reinventing the wheel", but I did
notice two things: first, the link you provided claims in the features
section that rsync is for *nix systems, so I am assuming I'll need a
port of it for Windows systems; second, looking at a Python rsync
module I found, it looks like it's just doing a file copy (which I have
already solved).

So I'm wondering if you know off-hand which Windows port does the
checksum validation you outlined.
 

ironkan

Maybe an example will help:

file A

abef | 1938 | 4bac | 0def | 8675

file B

abef | 0083 | abfd | 3356 | 2465

File A is different from file B and you want to make file A look like
file B. So do the segmentation (I have chosen ' | ' as the divider
between segments).

After that, do checksums on each segment. Wherever a segment's
checksums differ, there's a discrepancy between the two segments, so
make the changes to have one segment look like the other.

In this example the first segment's checksum would be the same whereas
the checksum for segments 2, 3, 4, and 5 will be different. So modify
the bits and bytes accordingly.

You may want to pursue this subject further by looking into various
error correction algorithms.
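A toy version of that patching step in Python, assuming fixed four-byte
segments as above and crc32 as the stand-in checksum:

    import zlib

    def segments(data, seg=4):
        return [data[i:i + seg] for i in range(0, len(data), seg)]

    a = segments(b"abef19384bac0def8675")   # file A
    b = segments(b"abef0083abfd33562465")   # file B

    # Replace every segment of A whose checksum differs from B's.
    for i, (sa, sb) in enumerate(zip(a, b)):
        if zlib.crc32(sa) != zlib.crc32(sb):
            a[i] = sb

    assert b"".join(a) == b"abef0083abfd33562465"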
 

Paul Rubin

VSmirk said:
So I'm wondering if you know off-hand which Windows port does the
checksum validation you outlined.

I think rsync has been ported to Windows but I don't know any details.
I don't use Windows.
 

VSmirk

Of course that was the first thing I tried.

But what I meant to say was that at least one port, the Python one,
didn't have the checksum validation that Paul was talking about, so I
was wondering if he knew of one that was faithful to the Unix
original.

Thanks much for the links, though, and all the help.
 

Terry Hancock

The checksum method I described works OK if bytes change
in the middle of the file but don't get inserted (pieces of
the file don't move around). If you insert one byte in the
middle of a 1GB file (so it becomes 1GB+1 bytes) then all
the checksums after the insertion point change, which is no
good for your purpose.

But of course, the OS will (I hope) give you the exact
length of the file, so you *could* assume that the beginning
and end are the same, then work towards the middle.
Somewhere in between, when you hit the insertion point, both
will disagree, and you've found it. Same for deletion.

Of course, if *many* changes have been made to the file,
then this will break down. But then, if that's the case,
you're going to have to do an expensive transfer anyway, so
expensive analysis is justified.

In fact, you could proceed by analyzing the top and bottom
checksum lists at the point of failure -- download that
frame, do a byte-by-byte compare and see if you can derive
the frameshift. Then compensate, and go back to checksums
until they fail again. Actually, that will work just coming
from the beginning, too.

If instead the region continues to be unrecognizable to
the end of the frame, then you need the next frame anyway.

Seems like it could get pretty close to optimal (but we
probably are re-inventing rsync).
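A rough sketch of locating a single insertion by matching from both
ends (it assumes exactly one contiguous change):

    def changed_region(old, new):
        # Match a common prefix, then a common suffix; whatever is left
        # in the middle is the span that actually differs.
        start = 0
        while start < min(len(old), len(new)) and old[start] == new[start]:
            start += 1
        end_old, end_new = len(old), len(new)
        while end_old > start and end_new > start \
                and old[end_old - 1] == new[end_new - 1]:
            end_old -= 1
            end_new -= 1
        return start, end_old, end_new

    # One byte inserted: only new[3:4] needs to cross the network.
    print(changed_region(b"abcdef", b"abcXdef"))   # (3, 3, 4)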

Cheers,
Terry
 

VSmirk

Terry,

Yeah, I was sketching out a scenario much like that. It does break
things down pretty well, and that gets my file-sync scenario up to much
larger files. Even if many changes are made to a file, if you keep
track of the number of bytes and slide the checksum window one byte at
a time (that is, [abcd]ef, a[bcde]f, ab[cdef]) until a checksum matches
again, you should be able to find some point where the checksums line
up again, and then continue up (or down) doing only the checksums
without all the overhead.
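A crude sketch of that window-shifting idea, using zlib.adler32 as the
block checksum (real rsync uses a rolling checksum so each one-byte
shift is O(1) instead of re-hashing the whole window):

    import zlib

    def resync(block_sum, data, start, window):
        # Slide a window through `data` one byte at a time until its
        # checksum matches the old block's; return the offset, or -1.
        for off in range(start, len(data) - window + 1):
            if zlib.adler32(data[off:off + window]) == block_sum:
                return off
        return -1

    old, new = b"abcdef", b"abcXdef"
    target = zlib.adler32(old[3:6])    # checksum of the old block "def"
    print(resync(target, new, 3, 3))   # 4: the block shifted by one byte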

The question in my mind that I will have to test is how much overhead
this causes.

One of the business rules underlying this task is to work with files
that are being continuously written to, say by logging systems or
database servers. This brings with it some obvious problems of file
access, but even in cases where you don't have file access issues, I am
very concerned about race conditions where one of the already-handled
blocks of data is written to. The synced copy on the remote system
then no longer represents a true image of the local file.

This is one of the reasons I was looking into a device-level solution
that would let me know when a hard-disk write had occurred. One
colleague suggested I was going to have to write assembler to do this,
and I may ultimately just have to use the solutions described here for
files that don't have locking and race-condition issues.

Regardless, it's a fun project, and I have to say this list is one of
the more polite lists I've been involved with. Thanks!

V
 
