Fast checksums?

Discussion in 'Perl Misc' started by George Adams, Feb 13, 2008.

  1. George Adams

    George Adams Guest

    Hi, all. I'm trying to make a simple backup program for myself that
    will check to see if certain files have been modified before backing them
    up. (It's got to be portable to multiple OSes that have Perl, but
    possibly not other handy tools like, say, rsync or some such).

    Anyway, I had originally planned to use the Windows archive bit, or else
    file modification dates to determine if a file had been changed. That
    turned out to be unreliable, though (and, in the case of the archive
    bit, impossible on Linux). So I decided instead to create a checksum of
    the original file, then compare future versions of the file with the
    stored checksum to see if it's changed (and hence needs to be backed up).

    This works... except it's really, really slow. I started with SHA-1,
    but it was taking just too long. I switched to MD5 and then CRC32, but
    even that was fairly slow. And when the backup directory contains
    several gigs of files to check, it was just too much.

    So, given the problem of "how can I tell if a file has been changed", am
    I tackling it the wrong way? Should I be using some other method
    besides simply hashing an entire file (using whatever algorithm) and
    comparing it to a stored value? Obviously backup software makers have
    solved this problem to make incremental backups pretty fast - what's the
    trick to it?

    Thanks to anyone who can help.
    George Adams, Feb 13, 2008
    #1

  2. George Adams <> wrote in
    news::


    > I tackling it the wrong way? Should I be using some other method
    > besides simply hashing an entire file (using whatever algorithm) and
    > comparing it to a stored value? Obviously backup software makers have
    > solved this problem to make incremental backups pretty fast - what's the
    > trick to it?


    I don't know if there is a trick to it, but why not check the modification
    time returned by stat?
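
    (Purely as an illustration, not tested: a minimal sketch of that
    stat-based check. The %last_mtime hash, and how it gets loaded and
    saved between runs, are assumptions made up for the sketch.)

        use strict;
        use warnings;

        # %last_mtime would hold the mtimes recorded on the previous run
        # (loading/saving it is omitted here).
        my %last_mtime;

        for my $file (@ARGV) {
            my $mtime = ( stat $file )[9];    # element 9 of stat() is mtime
            next unless defined $mtime;       # couldn't stat it
            if ( !defined $last_mtime{$file} or $mtime != $last_mtime{$file} ) {
                print "$file has changed since the last run\n";
            }
        }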

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>
    A. Sinan Unur, Feb 13, 2008
    #2

  3. On Feb 13, 2:25 pm, George Adams <>
    wrote:
    > Hi, all. I'm trying to make a simple backup program for myself that
    > will check to see if certain files have been modifed before backing them
    > up. (It's got to be portable to multiple OSes that have Perl, but
    > possibly not other handy tools like, say, rsync or some such).
    >
    > Anyway, I had originally planned to use the Windows archive bit, or else
    > file modification dates to determine if a file had been changed. That
    > turned out to be unreliable, though (and, in the case of the archive
    > bit, impossible on Linux). So I decided instead to create a checksum of
    > the original file, then compare future versions of the file with the
    > stored checksum to see if it's changed (and hence needs to be backed up).
    >
    > This works... except it's really, really slow. I started with SHA-1,
    > but it was taking just too long. I switched to MD5 and then CRC32, but
    > even that was fairly slow. And when the backup directory contains
    > several gigs of files to check, it was just too much.
    >
    > So, given the problem of "how can I tell if a file has been changed", am
    > I tackling it the wrong way? Should I be using some other method
    > besides simply hashing an entire file (using whatever algorithm) and
    > comparing it to a stored value? Obviously backup software makers have
    > solved this problem to make incremental backups pretty fast - what's the
    > trick to it?


    Maybe timestamp as well as file size with a
    fallback to SHA-1 or MD5 only if time/size are
    unchanged.
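
    (A rough, untested sketch of that idea; the %$seen bookkeeping, its
    [size, mtime, digest] layout, and the choice of Digest::MD5 are
    assumptions for the sketch, not anything from the thread:)

        use strict;
        use warnings;
        use Digest::MD5;

        # $seen maps path => [ size, mtime, digest ] from the previous run
        # (how you persist it -- Storable, a flat file, etc. -- is up to you).
        sub needs_backup {
            my ( $file, $seen ) = @_;
            my ( $size, $mtime ) = ( stat $file )[ 7, 9 ];
            return 1 unless defined $size;        # can't stat it: play safe
            my $old = $seen->{$file}
                or return 1;                      # never seen before
            return 1 if $size  != $old->[0];      # size changed
            return 1 if $mtime != $old->[1];      # timestamp changed
            # time/size unchanged: fall back to the (slow) digest check
            open my $fh, '<', $file or return 1;
            binmode $fh;
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            return $digest ne $old->[2];
        }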

    --
    Charles DeRykus
    comp.llang.perl.moderated, Feb 13, 2008
    #3
  4. "comp.llang.perl.moderated" <> wrote in
    news::

    > On Feb 13, 2:25 pm, George Adams <>
    > wrote:
    >> Hi, all. I'm trying to make a simple backup program for myself that
    >> will check to see if certain files have been modifed before backing
    >> them up. (It's got to be portable to multiple OSes that have Perl,
    >> but possibly not other handy tools like, say, rsync or some such).
    >>
    >> Anyway, I had originally planned to use the Windows archive bit, or
    >> else file modification dates to determine if a file had been changed.
    >> That turned out to be unreliable, though (and, in the case of the
    >> archive bit, impossible on Linux). So I decided instead to create a
    >> checksum of the original file, then compare future versions of the
    >> file with the stored checksum to see if it's changed (and hence needs
    >> to be backed up).
    >>
    >> This works... except it's really, really slow. I started with SHA-1,
    >> but it was taking just too long. I switched to MD5 and then CRC32,
    >> but even that was fairly slow. And when the backup directory
    >> contains several gigs of files to check, it was just too much.
    >>
    >> So, given the problem of "how can I tell if a file has been changed",
    >> am I tackling it the wrong way? Should I be using some other method
    >> besides simply hashing an entire file (using whatever algorithm) and
    >> comparing it to a stored value? Obviously backup software makers
    >> have solved this problem to make incremental backups pretty fast -
    >> what's the trick to it?

    >
    > Maybe timestamp as well as file size with a
    > fallback to SHA-1 or MD5 only if time/size are
    > unchanged.


    That is not going to save work.

    If either file size or time stamp has changed, then the checksum needs
    to be recomputed (so that it can be checked the next time).

    If neither has changed, the checksum needs to be recomputed just to be
    sure.

    That is, the checksum needs to be recomputed for all the files on each
    run.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>
    A. Sinan Unur, Feb 13, 2008
    #4
  5. George Adams

    Ben Morrow Guest

    Quoth Big and Blue <>:
    >
    > Just checking the last mod time should be sufficient (it's what a backup
    > system would do). The assumption is that if something has changed the file
    > *without* changing the mod time then that would mean it has reset the mod
    > time after making the change deliberately, and the file didn't need to be
    > backed up again.


    Not necessarily. There may be a good reason the file needs that date but
    the change in data should still be recorded. Under Unix this is what
    ctime is for: changing the data and then fixing the mtime will update
    the ctime, so dump(8) can tell the file needs dumping again.

    Ben
    Ben Morrow, Feb 14, 2008
    #5
  6. Big and Blue <> wrote in
    news::

    > A. Sinan Unur wrote:
    >
    >> That is, the checksum needs to be recomputed for all the files on
    >> each run.

    >
    > Depends what you are trying to achieve. Doing a checksum for each
    > file every time means you have to read everything every time, which is
    > rather unproductive.
    >
    > Just checking the last mod time should be sufficient (it's what a
    > backup system would do). The assumption is that if something has
    > changed the file *without* changing the mod time then that would mean
    > it has reset the mod time after making the change deliberately, and
    > the file didn't need to be backed up again.


    I agree with that. You snipped too much, changing the intended meaning of
    the snippet above. The point I was trying to make was that with the
    algorithm Charles DeRykus was proposing, one would have to compute the
    checksum for each and every file on each run, not saving any time at all.

    Elsethread, I did indeed propose using the modification time for deciding
    if the file needs to be backed up again.

    Sinan


    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>
    A. Sinan Unur, Feb 14, 2008
    #6
  7. George Adams

    George Adams Guest

    Thank you all for the replies. Let me explain more why I originally
    decided against relying on ctime/mtime (this program will mainly be
    run in a Windows environment).

    1) First, I decided to use mtime, which worked fine... until I
    discovered that for some reason, certain Windows files didn't HAVE a
    mtime. In Windows file properties, they were listed as "September 10,
    1956, 6:13:51 AM". In Perl, the mtime was -1.

    2) Still, I figured I could use mtime, and just assume that if it
    was greater than ctime (which always exists under Windows, even if mtime
    does not), then the mtime was usable. If it was -1, I would just use
    the ctime as the baseline instead.

    3) The final blow, however, was the discovery that Windows apparently
    sets a new mtime for a file *even when no changes have taken place* .
    I discovered this when trying to back up some MP3s. After some research,
    I discovered that when certain media programs load an MP3 file, they
    will go out to retrieve additional info about that file (perhaps storing
    the info they find in the ID3 tags?). Even if no changes are ultimately
    made to the file, the mtime is still changed, fooling my Perl script
    into thinking it needs backing up. And that's just for MP3s.

    So finally I decided to give up on ctime/mtime and work with something I
    was sure would be accurate - hashing. But now I've hit this stumbling
    block.

    Anyway, that's why I was hoping to find a hashing shortcut. This is a
    fairly fault-tolerant program - if files are occasionally tagged as
    false positives, the worst that happens is that they get backed up even
    when it wasn't needed. So I don't necessarily need SHA-512 or anything.
    But still, even the humble CRC32 is awfully slow for really large files...
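
    (For reference, one way to keep the digest step itself cheap on the
    Perl side, assuming Digest::MD5, is addfile on a binmode'd handle:
    it hashes the file in fixed-size chunks in C rather than slurping it
    into a Perl scalar. Untested sketch:)

        use strict;
        use warnings;
        use Digest::MD5;

        # Hash one file without reading it all into memory at once.
        sub file_digest {
            my ($file) = @_;
            open my $fh, '<', $file or die "Can't open $file: $!";
            binmode $fh;                          # important on Windows
            return Digest::MD5->new->addfile($fh)->hexdigest;
        }

        print file_digest($_), "  $_\n" for @ARGV;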

    Again, thanks for the help - any other suggestions are welcomed!
    George Adams, Feb 14, 2008
    #7
  8. George Adams

    Ben Morrow Guest

    Quoth George Adams <>:
    > Thank you all for the replies. Let me explain more why I originally
    > decided against relying on ctime/mtime (this program will mainly be
    > mainly run in a Windows environment)
    >
    > 1) First, I decided to use mtime, which worked fine... until I
    > discovered that for some reason, certain Windows files didn't HAVE a
    > mtime. In Windows file properties, they were listed as "September 10,
    > 1956, 6:13:51 AM". In Perl, the mtime was -1.


    Weird... I've never seen that... Was this FAT or NTFS, and were the
    files created under Win9x or NT?

    > 2) Still, I figured I could still use mtime, and just assume that if it
    > was greater than ctime (which always exists under Windows, even if mtime
    > does not), then the mtime was usable. If it was -1, I would just use
    > the ctime as the baseline instead.


    The ctime slot under Win32 holds the file create time, and Windows does
    some weird things with the create time. In particular, if you copy a
    file in Explorer the new file has a new ctime but the old mtime, so
    ctime > mtime is not uncommon. It makes some sense, I guess: the file's
    contents are 'older' than the file itself, since they used to be
    somewhere else.

    > 3) The final blow, however, was the discovery that Windows apparently
    > sets a new mtime for a file *even when no changes have taken place* .


    This is a very common occurrence. Lots of programs stupidly update a
    file when they didn't need to. One way out (presumably) is to make such
    files read-only: in the case of an MP3 collection, I'd want to make all
    files and directories involved read-only anyway, to avoid accidentally
    deleting them.

    > Anyway, that's why I was hoping to find a hashing shortcut. This is a
    > fairly fault-tolerant program - if files are occassionally tagged
    > false-positive, then the worst that happens is that it is backed up even
    > when it wasn't needed. So I don't necessarily need SHA-512 or anything.
    > But still, even the humble CRC32 is awfully slow for really large files...


    One (slightly evil) suggestion would be to use the archive flag, and
    write a daemon that sits in the background (hah!) and uses
    Win32::ChangeNotify or some such to watch for files that have been
    changed, recalculate the hash, and fix up the flag if needed. It's still
    doing the same work (well, more, actually) but if it's done in the
    background it might not be so noticeable. Of course, you'd then have
    lots of fun synchronisation issues... (your backup program needs to know
    that all the files are in a valid state, and the daemon isn't playing
    with any of them, while it works) :).
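
    (A very rough, untested sketch of the watching part, assuming
    Win32::ChangeNotify's new(path, watch_subtree, filter) / wait / reset
    interface; rescan_and_rehash() is a made-up placeholder, and all the
    synchronisation issues mentioned above are ignored:)

        use strict;
        use warnings;
        use Win32::ChangeNotify;

        my $dir = 'C:/backup-source';

        # Watch the whole tree for size or last-write changes.
        my $notify = Win32::ChangeNotify->new( $dir, 1, 'LAST_WRITE SIZE' )
            or die "Can't watch $dir: $!";

        while (1) {
            # Blocks until something under $dir changes, or the timeout
            # (in milliseconds) expires.
            if ( $notify->wait(60_000) ) {
                rescan_and_rehash($dir);
                $notify->reset;    # re-arm the notification object
            }
        }

        sub rescan_and_rehash {
            my ($dir) = @_;
            # Placeholder: re-stat/re-hash whatever bookkeeping the backup uses.
            warn "something under $dir changed at " . localtime() . "\n";
        }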

    Ben
    Ben Morrow, Feb 14, 2008
    #8
  9. George Adams

    John Bokma Guest

    David Filmer <> wrote:

    > George Adams wrote:
    >> Obviously backup software makers have solved this problem to
    > > make incremental backups pretty fast - what's the trick to it?

    >
    > Indeed. The rsync program, for example, is able to determine which
    > files have changed. It's really fast and really smart (it is able to
    > syncronize changed files by only transmitting the delta, not the whole
    > file). I don't know how it works,


    Me neither, but
    http://en.wikipedia.org/wiki/Rsync

    has some information on the checksum method used, and also has a
    link to the algorithm: http://rsync.samba.org/tech_report/node2.html
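
    (The 'weak' rolling checksum described in that tech report is simple
    enough to sketch in a few lines of Perl; this is only an untested
    illustration of the rolling property, not rsync itself:)

        use strict;
        use warnings;

        my $M = 2**16;

        # Weak rsync-style checksum of a block of bytes:
        # a = sum of bytes, b = position-weighted sum, s = a + 2^16 * b.
        sub weak_sum {
            my ($block) = @_;
            my @x = unpack 'C*', $block;
            my $l = @x;
            my ( $a, $b ) = ( 0, 0 );
            for my $i ( 0 .. $#x ) {
                $a = ( $a + $x[$i] ) % $M;
                $b = ( $b + ( $l - $i ) * $x[$i] ) % $M;
            }
            return ( $a, $b, $a + $M * $b );
        }

        # Slide a window of length $l one byte to the right: drop byte
        # $out from the front, append byte $in at the back.
        sub roll {
            my ( $a, $b, $out, $in, $l ) = @_;
            $a = ( $a - $out + $in ) % $M;
            $b = ( $b - $l * $out + $a ) % $M;
            return ( $a, $b, $a + $M * $b );
        }

        # Sanity check: the rolled value matches a direct recomputation.
        my $data = 'the quick brown fox jumps over the lazy dog';
        my $len  = 16;
        my ( $a, $b, $s ) = weak_sum( substr $data, 0, $len );
        ( $a, $b, $s ) = roll( $a, $b, ord( substr $data, 0, 1 ),
                               ord( substr $data, $len, 1 ), $len );
        my ( undef, undef, $direct ) = weak_sum( substr $data, 1, $len );
        print $s == $direct ? "rolling sum matches\n" : "mismatch\n";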

    --
    John

    Arachnids near Coyolillo - part 1
    http://johnbokma.com/mexit/2006/05/04/arachnids-coyolillo-1.html
    John Bokma, Feb 14, 2008
    #9
  10. On Feb 13, 3:54 pm, "A. Sinan Unur" <> wrote:
    > "comp.llang.perl.moderated" <> wrote innews::
    >
    >
    >
    > > On Feb 13, 2:25 pm, George Adams <>
    > > wrote:
    > >> Hi, all. I'm trying to make a simple backup program for myself that
    > >> will check to see if certain files have been modifed before backing
    > >> them up. (It's got to be portable to multiple OSes that have Perl,
    > >> but possibly not other handy tools like, say, rsync or some such).

    >
    > >> Anyway, I had originally planned to use the Windows archive bit, or
    > >> else file modification dates to determine if a file had been changed.
    > >> That turned out to be unreliable, though (and, in the case of the
    > >> archive bit, impossible on Linux). So I decided instead to create a
    > >> checksum of the original file, then compare future versions of the
    > >> file with the stored checksum to see if it's changed (and hence needs
    > >> to be backed up).

    >
    > >> This works... except it's really, really slow. I started with SHA-1,
    > >> but it was taking just too long. I switched to MD5 and then CRC32,
    > >> but even that was fairly slow. And when the backup directory
    > >> contains several gigs of files to check, it was just too much.

    >
    > >> So, given the problem of "how can I tell if a file has been changed",
    > >> am I tackling it the wrong way? Should I be using some other method
    > >> besides simply hashing an entire file (using whatever algorithm) and
    > >> comparing it to a stored value? Obviously backup software makers
    > >> have solved this problem to make incremental backups pretty fast -
    > >> what's the trick to it?

    >
    > > Maybe timestamp as well as file size with a
    > > fallback to SHA-1 or MD5 only if time/size are
    > > unchanged.

    >
    > That is not going to save work.
    >
    > If either file size or time stamp has changed, then the checksum needs
    > to be recomputed (so that it can be checked the next time).
    >
    > If neither has changed, the checksum needs to be recomputed just to be
    > sure.
    >
    > That is, the checksum needs to be recomputed for all the files on each
    > run.
    >


    True, to be really thorough, you'd need to checksum each time. From
    the original problem description, I guessed that mtimes were not
    being updated, so backups were being missed, and that a size check
    would usually catch the update even if the mtime was unchanged,
    particularly with checksumming as a last resort. Of course, the size
    check might still help if, as the OP mentions, Win32 spuriously
    updates mtime when there wasn't any change.

    However, the Win32 quirks of a non-existent mtime for some files,
    and a ctime more recent than the mtime for others, certainly make
    the problem more complicated than that :)


    --
    Charles DeRykus
    comp.llang.perl.moderated, Feb 14, 2008
    #10
  11. On Wed, 13 Feb 2008 15:39:48 -0800 (PST), "comp.llang.perl.moderated"
    <> wrote:

    >Maybe timestamp as well as file size with a
    >fallback to SHA-1 or MD5 only if time/size are
    >unchanged.


    That's what I do in my own duplicate-file-finding script, except for
    the timestamp, which isn't really useful in that case.
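
    (The size-first idea in a nutshell: group files by size, and only
    hash the groups with more than one member. Untested sketch; the
    reporting format is made up:)

        use strict;
        use warnings;
        use Digest::MD5;
        use File::Find;

        # Group every plain file under the given directories by size.
        my %by_size;
        find( { no_chdir => 1, wanted => sub {
            return unless -f $_;
            push @{ $by_size{ -s _ } }, $_;
        } }, @ARGV );

        for my $size ( keys %by_size ) {
            my @files = @{ $by_size{$size} };
            next if @files < 2;          # unique size: can't be a duplicate
            my %by_md5;
            for my $file (@files) {
                open my $fh, '<', $file or next;
                binmode $fh;
                push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } },
                    $file;
            }
            for my $dups ( grep { @$_ > 1 } values %by_md5 ) {
                print "Possible duplicates:\n", map { "  $_\n" } @$dups;
            }
        }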


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
    Michele Dondi, Feb 14, 2008
    #11
