how to know if folder contents have changed

Discussion in 'Python' started by devnew@gmail.com, Nov 12, 2007.

  1. Guest

    Hi,
    I am trying to create a cache of digitized values for around 100
    image files in a folder. From time to time, my program needs to know
    whether an image has been added to or removed from the folder.

    One scheme suggested to me was to build a string from the sorted names
    of the image files and use it as the cache name. For example, if I have
    one.jpg, three.jpg and new.jpg, I would name the cache
    'newonethree.cache'. Every time I want to check whether an image has
    been added or removed, I would build a string from the folder's
    contents and compare it with the cache name.

    This scheme is OK for a small number of files.

    Can someone suggest a better way? I know it is a general programming
    problem, but I would like to know whether a Python solution exists.
    , Nov 12, 2007
    #1

  2. On Sun, 11 Nov 2007 21:03:33 -0800, wrote:

    > One scheme suggested to me was to build a string from the sorted names
    > of the image files and use it as the cache name. For example, if I have
    > one.jpg, three.jpg and new.jpg, I would name the cache
    > 'newonethree.cache'. Every time I want to check whether an image has
    > been added or removed, I would build a string from the folder's
    > contents and compare it with the cache name.
    >
    > This scheme is OK for a small number of files.


    Not really.

    `xxx.jpg` -> `xxx.cache`

    Now `xxx.jpg` is deleted and `x.jpg` and `xx.jpg` are created.

    `x.jpg`, `xx.jpg` -> `xxx.cache`
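
To make the collision concrete, here is a small sketch. `cache_name` is a hypothetical helper that mimics the naming scheme from the question (sorted names, extensions dropped, joined together); it is not code from the original poster.

```python
import os


def cache_name(filenames):
    """Join the sorted base names (extensions dropped) into one cache file name."""
    stems = sorted(os.path.splitext(name)[0] for name in filenames)
    return "".join(stems) + ".cache"


# Two different folder contents that end up with the same cache name.
before = cache_name(["xxx.jpg"])
after = cache_name(["x.jpg", "xx.jpg"])
print(before, after, before == after)
```

Because the names are concatenated without a separator, different sets of files can map to the same string, so comparing cache names cannot reliably detect a change.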

    > Can someone suggest a better way? I know it is a general programming
    > problem, but I would like to know whether a Python solution exists.


    Don't store the names in the cache file name but in the cache file. Take
    a look at the `set()` type for operations to easily find the
    differences between two sets of names, and at the `pickle` module to
    store Python objects in files.
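
A minimal sketch of this suggestion, assuming the cache file lives outside the image folder (otherwise it would show up in the listing itself). The helper names `load_names`, `save_names` and `diff_folder` are made up for illustration.

```python
import os
import pickle


def load_names(cache_path):
    """Return the set of names saved by the previous run (empty if no cache yet)."""
    try:
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return set()


def save_names(cache_path, names):
    """Pickle the given names as a set."""
    with open(cache_path, "wb") as fh:
        pickle.dump(set(names), fh)


def diff_folder(folder, cache_path):
    """Compare the folder's current names with the cached set, then update the cache."""
    current = set(os.listdir(folder))
    previous = load_names(cache_path)
    added = current - previous      # in the folder now, not in the cache
    removed = previous - current    # in the cache, no longer in the folder
    save_names(cache_path, current)
    return added, removed
```

Calling `diff_folder("images", "names.pickle")` from time to time would return the names added and removed since the previous call. It only compares names, so a file whose contents change in place goes unnoticed.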

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Nov 12, 2007
    #2

  3. Jorge Godoy Guest

    wrote:

    > Can someone suggest a better way? I know it is a general programming
    > problem, but I would like to know whether a Python solution exists.


    Use pyfam. I believe all the docs are in FAM itself, but pyfam integrates with it.
    Jorge Godoy, Nov 12, 2007
    #3
  4. Guest

    On Nov 11, 11:03 pm, "" <> wrote:
    > Hi,
    > I am trying to create a cache of digitized values for around 100
    > image files in a folder. From time to time, my program needs to know
    > whether an image has been added to or removed from the folder.
    >


    Why not use the file creation/modification timestamps?
    , Nov 12, 2007
    #4
  5. 2007/11/12, <>:
    > Why not use the file creation/modification timestamps?


    because you'd have to

    a) create a thread that polls all the time for changes or
    b) test everytime for changes

    fam informs in a notification like way.

    Personally I'd create a "hidden" cache file parsable by configparser
    and have filename = $favorite_checksum_algo - key value pairs in it if
    it's not a long running process.
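
One possible reading of this idea, as a sketch only. The cache file name `.folder_cache.ini`, the section name, and the choice of SHA-256 are assumptions made for the example; the post does not specify any of them.

```python
import configparser
import hashlib
import os

CACHE_NAME = ".folder_cache.ini"   # hypothetical "hidden" cache file
SECTION = "checksums"              # hypothetical section name


def checksum(path):
    """Hash a file's bytes in chunks; SHA-256 is an arbitrary choice here."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(folder):
    """Map every regular file in the folder (except the cache) to its checksum."""
    result = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if name != CACHE_NAME and os.path.isfile(path):
            result[name] = checksum(path)
    return result


def _parser():
    # Only "=" separates key and value, and keys keep their case.
    # File names containing "=" or starting with "#" or ";" would still break this.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str
    return parser


def read_cache(folder):
    """Load the cached name -> checksum pairs (empty dict if there is no cache)."""
    parser = _parser()
    parser.read(os.path.join(folder, CACHE_NAME))  # missing file is ignored
    return dict(parser[SECTION]) if parser.has_section(SECTION) else {}


def write_cache(folder, checksums):
    """Store the name -> checksum pairs as `filename = checksum` lines."""
    parser = _parser()
    parser[SECTION] = checksums
    with open(os.path.join(folder, CACHE_NAME), "w") as fh:
        parser.write(fh)


def changed_names(old, new):
    """Names that were added, removed, or whose checksum differs."""
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}
```

A check would then be `changed_names(read_cache(folder), scan(folder))`, followed by `write_cache` to record the new state. Every check reads every file in full, which is the cost debated later in the thread.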

    Otherwise I'd probably go with fam (or hal i think that's the other
    thing that does that)

    hth
    martin

    --
    http://noneisyours.marcher.name
    http://feeds.feedburner.com/NoneIsYours
    Martin Marcher, Nov 12, 2007
    #5
  6. Guest

    On Nov 12, 11:27 am, "Martin Marcher" <> wrote:
    > 2007/11/12, <>:
    >
    > > Why not use the file creation/modification timestamps?

    >
    > because you'd have to
    >
    > a) create a thread that polls all the time for changes or


    Given that it would only involve a check of one timestamp (the
    directory the files are located in), I don't think polling "from time
    to time" would be unreasonable. The modification timestamp of the
    directory should be sufficient given the use case. Even if it's not,
    tracking modification times for the files in the directory would not
    be unreasonable.
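
A sketch of the single-timestamp check described here. `DirectoryPoller` is a made-up name. On typical POSIX filesystems a directory's modification time changes when entries are added, removed or renamed, but not when an existing file is rewritten in place, so this only answers the "added or removed" part of the question.

```python
import os


class DirectoryPoller:
    """Remember a directory's mtime and report when it differs on a later check."""

    def __init__(self, folder):
        self.folder = folder
        self.last_mtime_ns = os.stat(folder).st_mtime_ns

    def changed(self):
        """Return True once for each observed change of the directory's mtime."""
        current = os.stat(self.folder).st_mtime_ns
        if current != self.last_mtime_ns:
            self.last_mtime_ns = current
            return True
        return False
```

The program would create one `DirectoryPoller("images")` and call `changed()` from time to time, rebuilding its cache only when it returns `True`. Changes that happen within the filesystem's timestamp resolution of the previous check can be missed.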

    > b) test everytime for changes
    >


    Checking a timestamp should be a very quick operation. Unless
    "everytime" occurs *very* frequently, it's certainly not unreasonable.

    > fam informs in a notification like way.
    >


    FAM would work too. However,
    1) According to http://oss.sgi.com/projects/fam/faq.html#what_os_fam,
    FAM "should be fairly easy to port to ... Unix-like operating
    systems ....". If the original poster is a user of a "Unix-like
    operating system" he/she may actually be able to use it. Regardless,
    it seems to me that you would lose a great deal of portability (i.e.,
    is there a Windows port?), which may or may not be important to the
    poster.
    2) FAM undoubtedly uses some system resources. Probably very little,
    but it's still an overhead that must be taken into account.
    3) You still need to use another method for maintaining state across
    program invocations, do you not?

    Using timestamps is:
    1) Portable. Can you name one OS that does not provide timestamps?
    Last I checked, even Windows does :)
    2) Storage efficient. I don't have to actually *store* the
    timestamps. I can just check to see if a file/directory was modified
    after the last time I checked.
    3) Easy to maintain persistent state -- just store the timestamp!
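
A sketch of points 2 and 3 taken together: only the time of the last check is persisted, and files modified after it are reported. The function names and the plain-text state file are assumptions for the example, and the state file is assumed to live outside the watched folder.

```python
import os
import time


def read_last_check(state_path):
    """Return the stored time of the previous check, or 0.0 if there is none."""
    try:
        with open(state_path) as fh:
            return float(fh.read())
    except (FileNotFoundError, ValueError):
        return 0.0


def modified_since_last_check(folder, state_path):
    """List files whose mtime is later than the last check, then record a new check time."""
    last_check = read_last_check(state_path)
    now = time.time()  # taken before scanning, so late changes are re-reported rather than missed
    with os.scandir(folder) as entries:
        changed = sorted(
            entry.name
            for entry in entries
            if entry.is_file() and entry.stat().st_mtime > last_check
        )
    with open(state_path, "w") as fh:
        fh.write(repr(now))
    return changed
```

On its own this approach does not notice deleted files, and it misses files copied in with an old modification time, so it would need to be combined with a stored list of names if those cases matter.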

    > Personally I'd create a "hidden" cache file parsable by configparser
    > and have filename = $favorite_checksum_algo - key value pairs in it if
    > it's not a long running process.
    >


    What is your reasoning for this? It seems to me that it is
    inefficient and unreliable. First of all you have to compute the
    checksum (which undoubtedly would involve reading every byte of the file)
    -- not just once, but "everytime" (or however often you perform the
    check). Secondly, it is possible for the checksum to be the same even
    if the file has changed. Unlikely? Perhaps (depends on checksum
    algorithm used). Impossible? No. So, in effect, you are using a
    "slow" algorithm that is known to give incorrect results in certain
    cases -- all to replace something as basic as timestamps?

    > Otherwise I'd probably go with fam (or hal i think that's the other
    > thing that does that)
    >
    > hth
    > martin
    >
    > --
    > http://noneisyours.marcher.name
    > http://feeds.feedburner.com/NoneIsYours


    Thanks for the critique -- feel free to punch holes.

    --Nathan Davis
    , Nov 14, 2007
    #6
  7. I think that without further information from the OP about the
    requirements, all we can do is guess. So both of our solutions are
    just theory after all (just my personal opinion).

    2007/11/14, <>:
    > On Nov 12, 11:27 am, "Martin Marcher" <> wrote:
    > > 2007/11/12, <>:
    > >
    > > a) create a thread that polls all the time for changes or

    >
    > Given that it would only involve a check of one timestamp (the
    > directory the files are located in), I don't think polling "from time
    > to time" would be unreasonable. The modification timestamp of the
    > directory should be sufficient given the use case. Even if it's not,
    > tracking modification times for the files in the directory would not
    > be unreasonable.


    Not for the 400 files, but the OP asks about more files too. How about
    40,000 or 400,000 files? That could be a problem...

    > > b) test everytime for changes

    >
    > Checking a timestamp should be a very quick operation. Unless
    > "everytime" occurs *very* frequently, it's certainly not unreasonable.


    See above; I think it also depends on the number of files.

    > > fam informs in a notification like way.

    >
    > FAM would work too. However,
    > 1) According to http://oss.sgi.com/projects/fam/faq.html#what_os_fam,
    > FAM "should be fairly easy to port to ... Unix-like operating
    > systems ....". If the original poster is a user of a "Unix-like
    > operating system" he/she may actually be able to use it. Regardless,
    > it seems to me that you would lose a great deal of portability (i.e.,
    > is there a Windows port?), which may or may not be important to the
    > poster.


    I don't use Windows, so as far as portability goes you are right. It
    may be a personal thing, but I stopped providing solutions (or trying to
    think about them) for Windows (another discussion, probably best placed
    in a forum about social interests or something....)

    > 2) FAM undoubtedly uses some system resources. Probably very little,
    > but it's still an overhead that must be taken into account.


    Both are true, but most Linux distributions use FAM at some point
    anyway, so the overhead is actually very little. Also, I think that on
    most OSs there is something similar to FAM that could be used...

    > 3) You still need to use another method for maintaining state across
    > program invocations, do you not?


    You need some method no matter whether your program is a long-running
    process or just invoked at irregular intervals.

    After all, I'm pretty sure that something FAM-like is available on
    most OSs. FAM probably isn't available on OS X either, but I guess
    they provide some mechanism. If you want it really portable, I'd use
    an abstraction layer that tries to communicate with whatever
    notification daemon is available on the host OS and, if all that
    fails, provides a fallback implementation that does naive tests, all
    accessible through the same abstraction interface.
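
The layering described here could be sketched roughly as follows. All names (`FolderWatcher`, `ListingWatcher`, `make_watcher`) are hypothetical, and no real notification backend is wired in: `backends` stands for whatever platform-specific factories an application might supply.

```python
import os
from abc import ABC, abstractmethod


class FolderWatcher(ABC):
    """Common interface, whatever mechanism sits underneath."""

    @abstractmethod
    def poll(self):
        """Return (added, removed) sets of names since the previous call."""


class ListingWatcher(FolderWatcher):
    """Naive fallback: re-list the folder and diff the names."""

    def __init__(self, folder):
        self.folder = folder
        self.known = set(os.listdir(folder))

    def poll(self):
        current = set(os.listdir(self.folder))
        added, removed = current - self.known, self.known - current
        self.known = current
        return added, removed


def make_watcher(folder, backends=()):
    """Try each notification-backed factory in turn; fall back to ListingWatcher."""
    for factory in backends:
        try:
            return factory(folder)
        except (ImportError, OSError):
            continue  # this backend is not usable on the host, try the next one
    return ListingWatcher(folder)
```

Calling code only ever sees `poll()`, so a FAM-style backend could be swapped in later without touching the rest of the program.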

    > Using timestamps is:
    > 1) Portable. Can you name one OS that does not provide timestamps?
    > Last I checked, even Windows does :)
    > 2) Storage efficient. I don't have to actually *store* the
    > timestamps. I can just check to see if a file/directory was modified
    > after the last time I checked.


    Read below: a changed timestamp isn't necessarily a sign that a file
    has indeed changed (backups, ...).

    > 3) Easy to maintain persistent state -- just store the timestamp!


    Well, "I don't have to actually *store* the timestamps." and
    "just store the timestamp!" are a bit confusing together. I think you
    absolutely need to store the timestamp, since between runs you won't
    otherwise know what to check for (new files, deleted files, changed
    files, if these cases are important to you).

    > > Personally I'd create a "hidden" cache file parsable by configparser
    > > and have filename = $favorite_checksum_algo - key value pairs in it if
    > > it's not a long running process.

    >
    > What is your reasoning for this?


    because all I need to do to check for changes is getCache(configFile)
    and compare the results to getActual(os.listdir) and those 2 methods
    would give me the needed info (of course I'm just blindly guessing as
    I don't know anything about the further requirements)

    Of course, with a lot of files this could be a problem. I wouldn't want
    a configparser object with 40,000 (or even just a few thousand)
    entries to be alive all the time. You'd probably have to create some
    iterator for the file so that you can go through the entries in a
    memory-efficient way...

    > It seems to me that it is
    > inefficient and unreliable. First of all you have to compute the
    > checksum (which undoubtedly would involve reading every byte of the file)
    > -- not just once, but "everytime" (or however often you perform the
    > check). Secondly, it is possible for the checksum to be the same even
    > if the file has changed. Unlikely? Perhaps (depends on checksum
    > algorithm used). Impossible? No. So, in effect, you are using a
    > "slow" algorithm that is known to give incorrect results in certain
    > cases -- all to replace something as basic as timestamps?


    It seems you are firmly equating "checksum" with something like MD5 or SHA...

    Maybe that was badly stated. Depending on the use case, a timestamp
    could of course also be considered a valid checksum. However, to be
    safe, a timestamp alone doesn't really give me the information. A lot
    of backup tools update the timestamp (atime on Unix, dunno about
    Windows), and that could waste even more resources.
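
For reference, on systems with POSIX-style `stat` fields the access time (`st_atime`) and the modification time (`st_mtime`) are tracked separately. The sketch below uses two hypothetical helpers to show that changing only the access time leaves a check based on `st_mtime` undisturbed; it says nothing about tools that rewrite or restore files, which may well change the modification time.

```python
import os


def set_access_time(path, atime_ns):
    """Set only the access time; the modification time is written back unchanged."""
    current = os.stat(path)
    os.utime(path, ns=(atime_ns, current.st_mtime_ns))


def mtime_changed(path, recorded_mtime_ns):
    """True if the file's modification time differs from the recorded one."""
    return os.stat(path).st_mtime_ns != recorded_mtime_ns
```

Recording `os.stat(path).st_mtime_ns` at cache time and calling `mtime_changed` later therefore ignores pure access-time updates.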

    Consider that you are checking some CSV files by timestamp, and a
    change initiates some really intensive number crunching. Now you do
    that because you figured "Hey, the timestamp has changed, I need to
    redo my calculations..." while in fact just the backup program was
    running. But as I said, it depends on the use case what you consider
    a valid way to know that a file changed...

    So the checksum algo is something that should be chosen depending on

    a) interval of checks (like you say)
    b) need to be sure that 2 files don't actually have the same checksum

    I guess a simple approach could be something like the Message-ID
    header in emails, a bit adapted to local use cases.

    > > Otherwise I'd probably go with fam (or hal i think that's the other
    > > thing that does that)




    --
    http://noneisyours.marcher.name
    http://feeds.feedburner.com/NoneIsYours
    Martin Marcher, Nov 17, 2007
    #7