Fast checksums?

George Adams

Hi, all. I'm trying to make a simple backup program for myself that
will check to see if certain files have been modified before backing them
up. (It's got to be portable to multiple OSes that have Perl, but
possibly not other handy tools like, say, rsync or some such).

Anyway, I had originally planned to use the Windows archive bit, or else
file modification dates to determine if a file had been changed. That
turned out to be unreliable, though (and, in the case of the archive
bit, impossible on Linux). So I decided instead to create a checksum of
the original file, then compare future versions of the file with the
stored checksum to see if it's changed (and hence needs to be backed up).
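
Roughly, what I'm doing looks like this (a simplified sketch; the
%$stored hash stands in for however the saved checksums get persisted
between runs):

    use Digest::SHA;

    # Hash the whole file and compare against the checksum saved on
    # the previous run.  %$stored maps filename => hex digest.
    sub needs_backup {
        my ($file, $stored) = @_;
        my $sha = Digest::SHA->new(1);    # SHA-1
        $sha->addfile($file, "b");        # "b" = binary-safe read
        my $digest = $sha->hexdigest;
        my $changed = !defined $stored->{$file}
                   || $stored->{$file} ne $digest;
        $stored->{$file} = $digest;       # remember for next run
        return $changed;
    }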

This works... except it's really, really slow. I started with SHA-1,
but it was taking just too long. I switched to MD5 and then CRC32, but
even that was fairly slow. And when the backup directory contains
several gigs of files to check, it's just too much.

So, given the problem of "how can I tell if a file has been changed", am
I tackling it the wrong way? Should I be using some other method
besides simply hashing an entire file (using whatever algorithm) and
comparing it to a stored value? Obviously backup software makers have
solved this problem to make incremental backups pretty fast - what's the
trick to it?

Thanks to anyone who can help.
 
A. Sinan Unur

> So, given the problem of "how can I tell if a file has been changed",
> am I tackling it the wrong way? Should I be using some other method
> besides simply hashing an entire file (using whatever algorithm) and
> comparing it to a stored value? Obviously backup software makers have
> solved this problem to make incremental backups pretty fast - what's
> the trick to it?

I don't know if there is a trick to it, but why not check the modification
time returned by stat?
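
Something along these lines, say (untested):

    use File::stat;

    # Back the file up if its mtime is newer than the time of the
    # last backup run ($last_run, recorded wherever you like).
    sub modified_since {
        my ($file, $last_run) = @_;
        my $st = stat($file) or die "Cannot stat $file: $!";
        return $st->mtime > $last_run;
    }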

Sinan
 
comp.lang.perl.moderated

> Hi, all. I'm trying to make a simple backup program for myself that
> will check to see if certain files have been modified before backing
> them up. (It's got to be portable to multiple OSes that have Perl,
> but possibly not other handy tools like, say, rsync or some such).
>
> Anyway, I had originally planned to use the Windows archive bit, or
> else file modification dates to determine if a file had been changed.
> That turned out to be unreliable, though (and, in the case of the
> archive bit, impossible on Linux). So I decided instead to create a
> checksum of the original file, then compare future versions of the
> file with the stored checksum to see if it's changed (and hence needs
> to be backed up).
>
> This works... except it's really, really slow. I started with SHA-1,
> but it was taking just too long. I switched to MD5 and then CRC32,
> but even that was fairly slow. And when the backup directory contains
> several gigs of files to check, it's just too much.
>
> So, given the problem of "how can I tell if a file has been changed",
> am I tackling it the wrong way? Should I be using some other method
> besides simply hashing an entire file (using whatever algorithm) and
> comparing it to a stored value? Obviously backup software makers have
> solved this problem to make incremental backups pretty fast - what's
> the trick to it?

Maybe timestamp as well as file size with a
fallback to SHA-1 or MD5 only if time/size are
unchanged.
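
Schematically, something like this (untested; %$seen stands in for
whatever you persist between runs):

    use Digest::MD5;

    # Quick metadata check first; fall back to an MD5 digest only
    # when size and mtime both look unchanged.  %$seen holds
    # [size, mtime, digest] per file from the previous run.
    sub file_changed {
        my ($file, $seen) = @_;
        my ($size, $mtime) = (stat $file)[7, 9];
        my $old = $seen->{$file}
            or return 1;                  # never seen it: back it up
        return 1 if $size  != $old->[0]
                 || $mtime != $old->[1];
        open my $fh, '<:raw', $file or die "Can't open $file: $!";
        return Digest::MD5->new->addfile($fh)->hexdigest ne $old->[2];
    }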
 
A. Sinan Unur

> Maybe timestamp as well as file size with a
> fallback to SHA-1 or MD5 only if time/size are
> unchanged.

That is not going to save work.

If either file size or time stamp has changed, then the checksum needs
to be recomputed (so that it can be checked the next time).

If neither has changed, the checksum needs to be recomputed just to be
sure.

That is, the checksum needs to be recomputed for all the files on each
run.

Sinan
 
Ben Morrow

Quoth Big and Blue:
> Just checking the last mod time should be sufficient (it's what a
> backup system would do). The assumption is that if something has
> changed the file *without* changing the mod time then that would mean
> it has reset the mod time after making the change deliberately, and
> the file didn't need to be backed up again.

Not necessarily. There may be a good reason the file needs that date but
the change in data should still be recorded. Under Unix this is what
ctime is for: changing the data and then fixing the mtime will update
the ctime, so dump(8) can tell the file needs dumping again.
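
That is, roughly (a sketch, for Unix):

    # ctime is bumped by the kernel whenever a file's data or
    # metadata change, even if an application puts the mtime back
    # afterwards -- so checking both catches the "fixed-up" case.
    sub needs_dump {
        my ($file, $last_dump_time) = @_;
        my ($mtime, $ctime) = (stat $file)[9, 10];
        return $mtime > $last_dump_time
            || $ctime > $last_dump_time;
    }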

Ben
 
A. Sinan Unur

> Depends what you are trying to achieve. Doing a checksum for each
> file every time means you have to read everything every time, which
> is rather unproductive.
>
> Just checking the last mod time should be sufficient (it's what a
> backup system would do). The assumption is that if something has
> changed the file *without* changing the mod time then that would mean
> it has reset the mod time after making the change deliberately, and
> the file didn't need to be backed up again.

I agree with that. You snipped too much, changing the intended meaning of
the snippet above. The point I was trying to make was that with the
algorithm Charles DeRykus was proposing, one would have to compute to
checksum for each and every file on each run, not saving any time at all.

Elsethread, I did indeed propose using the modification time for deciding
if the file needs to be backed up again.

Sinan
 
George Adams

Thank you all for the replies. Let me explain more about why I
originally decided against relying on ctime/mtime (this program will
mainly be run in a Windows environment).

1) First, I decided to use mtime, which worked fine... until I
discovered that for some reason, certain Windows files didn't HAVE an
mtime. In Windows file properties, they were listed as "September 10,
1956, 6:13:51 AM". In Perl, the mtime was -1.

2) Even so, I figured I could still use mtime, and just assume that if it
was greater than ctime (which always exists under Windows, even if mtime
does not), then the mtime was usable. If it was -1, I would just use
the ctime as the baseline instead.

3) The final blow, however, was the discovery that Windows apparently
sets a new mtime for a file *even when no changes have taken place*.
I discovered this when trying to back up some MP3s. After some research,
I discovered that when certain media programs load an MP3 file, they
will go out to retrieve additional info about that file (perhaps storing
the info they find in the ID3 tags?). Even if no changes are ultimately
made to the file, the mtime is still changed, fooling my Perl script
into thinking it needs backing up. And that's just for MP3s.

So finally I decided to give up on ctime/mtime and work with something I
was sure would be accurate - hashing. But now I've hit this stumbling
block.

Anyway, that's why I was hoping to find a hashing shortcut. This is a
fairly fault-tolerant program - if a file is occasionally tagged as a
false positive, the worst that happens is that it gets backed up even
when it didn't need to be. So I don't necessarily need SHA-512 or
anything. But still, even the humble CRC32 is awfully slow for really
large files...

Again, thanks for the help - any other suggestions are welcomed!
 
Ben Morrow

Quoth George Adams:
> Thank you all for the replies. Let me explain more about why I
> originally decided against relying on ctime/mtime (this program will
> mainly be run in a Windows environment).
>
> 1) First, I decided to use mtime, which worked fine... until I
> discovered that for some reason, certain Windows files didn't HAVE an
> mtime. In Windows file properties, they were listed as "September 10,
> 1956, 6:13:51 AM". In Perl, the mtime was -1.

Weird... I've never seen that... Was this FAT or NTFS, and were the
files created under Win9x or NT?

> 2) Even so, I figured I could still use mtime, and just assume that
> if it was greater than ctime (which always exists under Windows, even
> if mtime does not), then the mtime was usable. If it was -1, I would
> just use the ctime as the baseline instead.

The ctime slot under Win32 holds the file create time, and Windows does
some weird things with the create time. In particular, if you copy a
file in Explorer the new file has a new ctime but the old mtime, so
ctime > mtime is not uncommon. It makes some sense, I guess: the file's
contents are 'older' than the file itself, since they used to be
somewhere else.

> 3) The final blow, however, was the discovery that Windows apparently
> sets a new mtime for a file *even when no changes have taken place*.

This is a very common occurrence. Lots of programs stupidly update a
file when they didn't need to. One way out (presumably) is to make such
files read-only: in the case of an MP3 collection, I'd want to make all
files and directories involved read-only anyway, to avoid accidentally
deleting them.

> Anyway, that's why I was hoping to find a hashing shortcut. This is a
> fairly fault-tolerant program - if a file is occasionally tagged as a
> false positive, the worst that happens is that it gets backed up even
> when it didn't need to be. So I don't necessarily need SHA-512 or
> anything. But still, even the humble CRC32 is awfully slow for really
> large files...

One (slightly evil) suggestion would be to use the archive flag, and
write a daemon that sits in the background (hah!) and uses
Win32::ChangeNotify or some such to watch for files that have been
changed, recalculate the hash, and fix up the flag if needed. It's still
doing the same work (well, more, actually) but if it's done in the
background it might not be so noticeable. Of course, you'd then have
lots of fun synchronisation issues... (your backup program needs to know
that all the files are in a valid state, and the daemon isn't playing
with any of them, while it works) :).
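
The watching end might look vaguely like this (entirely untested, and
recalculate_hashes() is a stub for your own bookkeeping):

    use Win32::ChangeNotify;

    # Watch a directory tree for writes; when something changes,
    # rehash whatever is stale and fix up the archive flag.
    my $dir    = 'C:/backup-source';    # hypothetical path
    my $notify = Win32::ChangeNotify->new($dir, 1, 'LAST_WRITE')
        or die "Can't watch $dir: $^E";

    while (1) {
        $notify->wait or next;     # block until something changes
        $notify->reset;            # re-arm before doing the slow work
        recalculate_hashes($dir);  # rehash changed files, fix the flag
    }

    sub recalculate_hashes { die "left as an exercise" }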

Ben
 
comp.lang.perl.moderated

> That is not going to save work.
>
> If either file size or time stamp has changed, then the checksum
> needs to be recomputed (so that it can be checked the next time).
>
> If neither has changed, the checksum needs to be recomputed just to
> be sure.
>
> That is, the checksum needs to be recomputed for all the files on
> each run.

True, to be really thorough, you'd need to checksum each time. From
the original problem description, I guessed that mtimes were not
being updated, so backups were being missed, and that a size check
would usually catch the update even if the mtime was unchanged --
particularly with checksumming as a last resort. Of course, the size
check might still help if, as the OP mentions, Win32 spuriously
updates mtime when there wasn't any change.

However, the problem under Win32 with a non-existent mtime for some
files, or a ctime more current than the mtime for others, certainly
makes the problem more complicated than that :)
 
Michele Dondi

> Maybe timestamp as well as file size with a
> fallback to SHA-1 or MD5 only if time/size are
> unchanged.

That's what I do in my own duplicate-file-finding script, except for
the timestamp, which isn't really useful in that case.
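
Schematically (a sketch of that approach):

    use Digest::MD5;
    use File::Find;

    # Group candidate files by size first; only files whose sizes
    # collide ever get hashed, which skips most of the slow work.
    my %by_size;
    find(sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f },
         @ARGV ? @ARGV : '.');

    for my $files (grep { @$_ > 1 } values %by_size) {
        my %by_digest;
        for my $file (@$files) {
            open my $fh, '<:raw', $file or next;
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            push @{ $by_digest{$md5} }, $file;
        }
        print "dups: @$_\n" for grep { @$_ > 1 } values %by_digest;
    }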


Michele
 
