Trim Multiple Dirs to Max Total Space Used - by Date

Discussion in 'Perl Misc' started by Ron Heiby, Jun 25, 2004.

  1. Ron Heiby

    Ron Heiby Guest

    Hi! I've done a lot of FAQ reading and Google-ing and reading in O'Reilly books, but
    I'm still stuck.

    I have a system where data files are created in multiple directories. I need to run a
    daily script that will total the disk space used by all the files in all the
    directories and see whether the space exceeds some MAXSPACE value. In this case, all
    but one of the directories are subdirectories of a common parent dir, while the other
    one is off on its own. If the space does exceed the maximum, I need to start deleting
    files, oldest first, until the total space used drops just below the maximum.

    I've been looking at File::Find, and File::stat, among others, but don't quite see how
    this all can be hung together to accomplish this seemingly simple task.

    Any help would be much appreciated. Thanks!

    P.S. I'll be looking for responses here. If using Email, remove the "_u" from my name
    to avoid getting shuffled into an infrequently perused mailbox.

    --
    Ron.
    Ron Heiby, Jun 25, 2004
    #1

  2. Ron Heiby wrote:
    > Hi! I've done a lot of FAQ reading and Google-ing and reading in
    > O'Reilly books, but I'm still stuck.
    >
    > I have a system where data files are created in multiple directories.
    > I need to run a daily script that will total the disk space used by
    > all the files in all the directories and see whether the space
    > exceeds some MAXSPACE value. In this case, all but one of the
    > directories are subdirectories of a common parent dir, while the
    > other one is off on its own. If the space does exceed the maximum, I
    > need to start deleting files, oldest first, until the total space
    > used drops just below the maximum.
    >
    > I've been looking at File::Find, and File::stat, among others, but
    > don't quite see how this all can be hung together to accomplish this
    > seemingly simple task.


    I would attack the problem in four steps:

    First loop through all the directories to create an internal array of all
    files which you are interested in. Forget File::Find, you don't need it
    because you already have the comprehensive list of all directories.
    For your purposes a file consists of the name including the full path, the
    file size, and the date.
    The obvious data structure would be an array of hashes, where each hash
    contains three items, namely the qualified file name, the size, and the
    date.

    In step two you simply add all the sizes to determine your total used space.
    Or you can do that while collecting the files in step 1 already.

    Then sort the array by the date element.

    And then beginning with the oldest file delete files (you got the fully
    qualified name in the hash) until the added size of all deleted files is
    larger than the difference between desired size and actual size as
    determined in step 2.
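    Those four steps might be sketched roughly as follows (untested against real
    data; the directory paths and the `trim_to_max' name are made up for
    illustration):

```perl
use strict;
use warnings;
use File::Find;

# Sketch of the four steps; returns the names of the deleted files.
sub trim_to_max {
    my ($maxspace, @dirs) = @_;

    # Step 1: collect one hash per file: name, size, date (mtime)
    my @files;
    find(sub {
        return unless -f;
        push @files, { name => $File::Find::name,
                       size => -s _,
                       date => (stat _)[9] };
    }, grep { -d } @dirs);

    # Step 2: total used space
    my $total = 0;
    $total += $_->{size} for @files;

    # Step 3: sort oldest first
    my @sorted = sort { $a->{date} <=> $b->{date} } @files;

    # Step 4: delete until we are back under the limit
    my @deleted;
    for my $f (@sorted) {
        last if $total <= $maxspace;
        unlink $f->{name} or next;
        push @deleted, $f->{name};
        $total -= $f->{size};
    }
    return @deleted;
}

# Example call (paths are placeholders):
# trim_to_max(250 * 1024 * 1024, '/data/parent', '/data/oddball');
```

    The daily script would simply call this with the configured maximum and the
    known directories.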

    jue
    Jürgen Exner, Jun 25, 2004
    #2

  3. Ron Heiby

    Ron Heiby Guest

    Purl Gurl <> wrote:
    >You need to provide your system type.


    Sorry. The system is Red Hat 8.0 Linux.

    >quota -v (total disk usage per ownership)
    >du -ask (per directory total disk usage kilobytes)


    These look like they would tell me how much space is being used, but I do not see how
    they would address the aspect of deleting the oldest.

    >ls -la (returns a list of files / sizes)
    >ls -laR (recursive list of files / sizes)


    With either of these (and -t), I'm pretty sure that I can get a date-sorted list of the
    files in the various directories, but each directory is listed separately. If I could
    see how to get one combined list, that would be a big step forward.

    >Use of quota seems to be best suited for your task.


    I don't see how quota deletes the old files. I admit that I've never used the quota
    system, but the little I've looked at it, it seems that it is more for preventing the
    creation of new files that would exceed the limit. I cannot do that. I must delete the
    old ones.

    Thanks!
    Ron Heiby, Jun 25, 2004
    #3
  4. Ron Heiby

    Ron Heiby Guest

    "Jürgen Exner" <> wrote:
    >Forget File::Find, you don't need it
    >because you already have the comprehensive list of all directories.


    Sorry I didn't make that part clear. I know the odd-ball directory and I know the
    parent directory of the other directories of interest. However, I do not know, a
    priori, what their names are.

    >For your purposes a file consists of the name including the full path, the
    >file size, and the date.


    Makes sense.

    >The obvious data structure would be an array of hash where each hash
    >contains three items, namely the qualified file name, the size, and the
    >date.


    I thought that a hash matched a single key with a single value. What would you have as
    the key? Would I have the value be an array reference with the array holding the other
    two? Or, am I as confused as I think I am? :)

    >In step two you simply add all the sizes to determine your total used space.
    >Or you can do that while collecting the files in step 1 already.


    Yes, during collection makes sense to me.

    >Then sort the array by the date element.


    Perhaps when I better understand how you are picturing the data structure this will
    become clearer. It sounds like the date is the hash key. I'm thinking that if this is
    the case, I'll want to use the "raw" UNIX style seconds-since-epoch date value. But, I
    think I'll still need to be careful of potential collisions, where multiple files have
    the same modification date. This should happen rarely, and if I just increment the date
    value of the colliders until the date is unique, that won't be a problem. Maybe there's
    no reason why the date has to be the key, though. The full pathname of each file is
    already unique, and could probably be the key just as well. I'm still confused about
    having two values for each key in the hash, though.

    >And then beginning with the oldest file delete files (you got the fully
    >qualified name in the hash) until the added size of all deleted files is
    >larger than the difference between desired size and actual size as
    >determined in step 2.


    Speaking of size -- I think the size that matters here is the number of Kbytes that the
    file is actually taking up on the drive, which is likely slightly larger than its
    length might imply. On the other hand, if that's a real pain, I can pretty easily
    ignore that slop, as this does not have to be completely exact. If I leave a few of the
    files lying around an extra day, it's no problem.
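    For the "space actually taken on the drive" question: on Linux the stat call
    also reports the number of 512-byte blocks allocated, so a sketch along these
    lines (field indices per perldoc -f stat) would measure real usage rather than
    logical length:

```perl
use strict;
use warnings;

my $file = $0;   # any existing file will do; here, this script itself
my @st = stat $file or die "stat $file: $!\n";

my $length  = $st[7];         # st_size: logical length in bytes
my $on_disk = $st[12] * 512;  # st_blocks: 512-byte blocks actually allocated

printf "%s: %d bytes long, %d bytes on disk\n", $file, $length, $on_disk;
```

    As noted, the difference is small slop for 50-100 Kbyte files, so st_size
    alone would also be acceptable here.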

    A couple other things I failed to mention earlier that may be useful to know -- The
    typical size of each of these files will be in the 50-100 Kbyte realm. We're talking
    about keeping around a configurable amount of these files, with the default being 250
    Megabytes.

    Thanks!
    Ron Heiby, Jun 25, 2004
    #4
  5. Ron Heiby wrote:
    > "Jürgen Exner" <> wrote:
    >> Forget File::Find, you don't need it
    >> because you already have the comprehensive list of all directories.

    >
    > Sorry I didn't make that part clear. I know the odd-ball directory
    > and I know the parent directory of the other directories of interest.
    > However, I do not know, a priori, what their names are.


    Well, ok, then yes, File::Find would be the best tool to enumerate all files.

    >> For your purposes a file consists of the name including the full
    >> path, the file size, and the date.

    >
    > Makes sense.
    >
    >> The obvious data structure would be an array of hash where each hash
    >> contains three items, namely the qualified file name, the size, and
    >> the date.

    >
    > I thought that a hash matched a single key with a single value. What
    > would you have as the key?


    Each hash would contain 3 elements, the keys being: 'name', 'size', and
    'date'.
    This represents one abstract file.

    > Would I have the value be an array
    > reference with the array holding the other two? Or, am I as confused
    > as I think I am? :)


    You need the complete list of all files. The easiest technical implementation
    is an array (= list) of hashes (= files).
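    In code, with invented file names, such a list of "abstract files" looks like
    this:

```perl
use strict;
use warnings;

# each element is one file: an anonymous hash with the three keys
my @files = (
    { name => '/data/a/one.dat', size => 51_200, date => 1088140000 },
    { name => '/data/b/two.dat', size => 73_728, date => 1088050000 },
);

# fields are reached through each element's reference
print "$files[0]{name} is $files[0]{size} bytes\n";
```

    Summing and sorting then operate on @files as a whole.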

    >> In step two you simply add all the sizes to determine your total
    >> used space. Or you can do that while collecting the files in step 1
    >> already.

    >
    > Yes, during collection makes sense to me.
    >
    >> Then sort the array by the date element.

    >
    > Perhaps when I better understand how you are picturing the data
    > structure this will become clearer. It sounds like the date is the
    > hash key. I'm thinking that if this is the case, I'll want to use the
    > "raw" UNIX style seconds-since-epoch date value. But, I think I'll
    > still need to be careful of potential collisions, where multiple
    > files have the same modification date.

    [...]

    You are thinking way too complicated. You have a list of files, implemented as
    an array of hashes. Now just sort that list by the date of each file and
    then start deleting from the upper (or lower) end of the sorted array.
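    With sample data standing in for the collected list, that sort-and-delete step
    is only a few lines (the unlink is replaced by a log list here so the sketch
    runs harmlessly; names and numbers are invented):

```perl
use strict;
use warnings;

# sample data standing in for the collected file list
my @files = (
    { name => 'old.dat', size => 300, date => 100 },
    { name => 'mid.dat', size => 300, date => 200 },
    { name => 'new.dat', size => 300, date => 300 },
);
my $total = 900;   # sum of all sizes, from the collection pass
my $max   = 500;   # the configured ceiling

# sort oldest first: ascending modification time
my @by_age = sort { $a->{date} <=> $b->{date} } @files;

# walk from the old end until enough space has been freed
my $excess = $total - $max;          # bytes that must go
my @deleted;
for my $f (@by_age) {
    last if $excess <= 0;
    push @deleted, $f->{name};       # a real script would unlink() here
    $excess -= $f->{size};
}
print "would delete: @deleted\n";
```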

    jue
    Jürgen Exner, Jun 25, 2004
    #5
  6. On Fri, 25 Jun 2004 03:43:53 GMT, Ron Heiby <>
    wrote:

    >I have a system where data files are created in multiple directories. I need to run a
    >daily script that will total the disk space used by all the files in all the
    >directories and see whether the space exceeds some MAXSPACE value. In this case, all
    >but one of the directories are subdirectories of a common parent dir, while the other
    >one is off on its own. If the space does exceed the maximum, I need to start deleting
    >files, oldest first, until the total space used drops just below the maximum.
    >
    >I've been looking at File::Find, and File::stat, among others, but don't quite see how
    >this all can be hung together to accomplish this seemingly simple task.


    Generally it's not considered a good idea to post complete solutions,
    but see if this (untested!) can help you:


    #!/usr/bin/perl -l

    use strict;
    use warnings;
    use File::Find;
    use constant MAXSPACE => 0xA00_000; # 10Mb

    @ARGV = grep { -d or !warn "`$_': not a directory!\n" } @ARGV;
    die <<"EOD" unless @ARGV;
    Usage: $0 <dir> [<dirs>]
    EOD

    my @files;

    find { no_chdir => 1,
           wanted   => sub {
               return unless -f;
               print "Examining ", $_;
               push @files, [ $_, (stat _)[7,9] ];
           } }, @ARGV;

    my $t = -(MAXSPACE);
    $t += $_->[1] for @files;

    print "No file needs to be deleted" and exit if $t <= 0;

    for (sort { $a->[2] <=> $b->[2] } @files) {
        unlink $_->[0] and
            print "Removing `$_->[0]'" or
            warn "Can't remove `$_->[0]': $!\n";
        last if ($t -= $_->[1]) <= 0;
    }

    __END__


    Michele
    --
    you'll see that it shouldn't be so. AND, the writting as usuall is
    fantastic incompetent. To illustrate, i quote:
    - Xah Lee trolling on clpmisc,
    "perl bug File::Basename and Perl's nature"
    Michele Dondi, Jun 25, 2004
    #6
  7. Ron Heiby

    Ron Heiby Guest

    Purl Gurl <> wrote:
    >Sorry Ron, our family cannot allow you to visit
    >as much as we would like for you to visit.


    Golly. Sorry about all the problems you've been having. I was able to get to the page
    you listed and have a copy of the script. I will be taking a look at it today. I
    appreciate all the help I've received. Thanks!

    --
    Ron.
    Ron Heiby, Jun 25, 2004
    #7
  8. Ron Heiby

    Ron Heiby Guest

    Michele Dondi <> wrote:
    Thanks! I'll be looking at this today. One thing is for sure, I'm learning some new (to
    me) things about using Perl!

    --
    Ron.
    Ron Heiby, Jun 25, 2004
    #8
  9. Ron Heiby wrote:

    > Golly. Sorry about all the problems you've been having.


    Do yourself a favor and read a few more of her rants before you start
    feeling too sorry for her. She's delusional, and the "problems" she speaks
    of exist only in her imagination.

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
    Sherm Pendley, Jun 25, 2004
    #9
  10. On Fri, 25 Jun 2004 12:32:10 GMT, Ron Heiby <>
    wrote:

    >Michele Dondi <> wrote:
    >Thanks! I'll be looking at this today. One thing is for sure, I'm learning some new (to
    >me) things about using Perl!


    Well, since I wrote the script in the first place, you may (modify it
    suitably and) try it on a sample directory: please tell me if there's
    anything wrong with it and ask for clarification...


    Michele
    Michele Dondi, Jun 25, 2004
    #10
  11. >>>>> "Ron" == Ron Heiby <> writes:

    Ron> I have a system where data files are created in multiple
    Ron> directories. I need to run a daily script that will total the
    Ron> disk space used by all the files in all the directories and see
    Ron> whether the space exceeds some MAXSPACE value. In this case, all
    Ron> but one of the directories are subdirectories of a common parent
    Ron> dir, while the other one is off on its own. If the space does
    Ron> exceed the maximum, I need to start deleting files, oldest first,
    Ron> until the total space used drops just below the maximum.

    Off the top of my head, using File::Finder (my module in the CPAN):

    my $MAXSIZE = 102400;   # 100K, let's say
    my @START = qw(. /tmp); # current directory and /tmp being trimmed

    use File::Finder;
    my @list = sort { $a->[2] <=> $b->[2] }  # sort by age, newest first
        File::Finder->type('f')->collect(sub {
            [$File::Find::name, -s, -M]
        }, @START);
    my $size = 0; # start the accumulator
    # keep all new files under the right size
    shift @list while @list and ($size += $list[0][1]) < $MAXSIZE;
    # delete the rest
    unlink or warn "Cannot delete $_: $!" for @list;

    Untested, but I usually get this stuff right. :)

    print "Just another Perl hacker,"

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
    See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
    Randal L. Schwartz, Jun 25, 2004
    #11
  12. Ron Heiby

    Ron Heiby Guest

    Michele's script was very close to what I was looking for, so I started with it (before
    seeing Randal's version, which I'll now have to study).

    Michele Dondi <> wrote:
    > use constant MAXSPACE => 0xA00_000; # 10Mb


    I needed to be able to get the maximum space value from a configuration file, so this
    line was replaced in my version with:

    my $image_megs = `cat /path/to/config_file 2>/dev/null`;
    $image_megs = 250 unless $image_megs;

    This protects me against there being no config_file; however, I don't think that I'm
    protected against some random crap in the config_file, so I should probably change
    this to look for a "reasonable" value and fall back to the default if the value is
    out of range.
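    A possible shape for that "reasonable value" check (the 10..10_000 bounds are
    invented; pick whatever range actually makes sense):

```perl
use strict;
use warnings;

my $raw = "300\n";   # stand-in for what the backticks returned
chomp $raw;

# accept only a plausible number of megabytes, else fall back to the default
my $image_megs = ($raw =~ /^\d+$/ && $raw >= 10 && $raw <= 10_000) ? $raw : 250;
print "using $image_megs megabytes\n";
```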

    > find { no_chdir => 1,
    > wanted => sub {
    > return unless -f;


    I hadn't mentioned it, but all of the files of interest have the same extension, and
    there are other files in the directories, so at this point, I added:
    return unless /.extension$/;

    > push @files, [ $_, (stat _)[7,9] ];


    This is cool. I hadn't really seen an example of this that I had understood. I think
    that what is happening is that the outer [] contains an unnamed array and returns a
    reference to it, and that reference is pushed onto the @files array. I've seen this
    talked about in various documentation, but until I saw it here, the concept hadn't
    "clicked". Understanding the values going in and seeing how they were accessed later to
    do an actual task that I understood was a big help.
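    That reading is exactly right; a stripped-down illustration of the construct
    (names invented):

```perl
use strict;
use warnings;

my @files;
# each [ ... ] builds an anonymous array and yields a reference to it,
# and that reference is what actually gets pushed onto @files
push @files, [ 'a.dat', 1024, 1088140000 ];  # [ name, size, mtime ]
push @files, [ 'b.dat', 2048, 1088050000 ];

print "$files[0][0] is $files[0][1] bytes\n";
```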

    Along the way, I had added a bunch of additional "print" statements, to show me what
    was going on at different points, to make sure I understood it. I commented out the
    "unlink" statement during most of my investigation and early testing. I confused myself
    at one point by having a low maximum space value and choosing a directory that had
    files of various sizes, ranging from a couple dozen bytes to a couple megs. I was
    surprised when my initial runs "deleted" all but two very small files, until I realized
    that one of the most recently modified files was larger than my max limit.

    Anyway, I got things going great. I really appreciate all of the assistance. I did look
    at Purl Gurl's script too. It was interesting to see how the direct directory accessing
    could be used, although I generally find myself philosophically aligned more closely
    with the use of library routines when they are available.

    Thanks to all!
    --
    Ron.
    Ron Heiby, Jun 26, 2004
    #12
  13. On Sat, 26 Jun 2004 03:39:40 GMT, Ron Heiby <>
    wrote:

    >Michele's script was very close to what I was looking for, so I started with it (before
    >seeing Randal's version, which I'll now have to study).


    Well, I've always used File::Find for my needs of "this kind", and it has
    always proved to be perfectly suited for them, though I must say that Randal's
    script is cool in that his module already provides what I am doing manually.

    >Michele Dondi <> wrote:
    >> use constant MAXSPACE => 0xA00_000; # 10Mb

    >
    >I needed to be able to get the maximum space value from a configuration file, so this
    >line was replaced in my version with:
    >
    >my $image_megs = `cat /path/to/config_file 2>/dev/null`;
    >$image_megs = 250 unless $image_megs;


    This is good in that it's only one line. Personally I'm a bit
    idiosyncratic about backticks, but that's definitely a personal thing,
    so don't mind!

    However you may want to do some more checks by adopting something
    along the lines of this (untested):

    my $image_megs = do {
        open my $fh, '<', 'path/to/config_file'
            or die "Can't open config file: $!\n";
        local $_ = <$fh>;
        chomp; # not necessary IIRC, but no harm done...
        warn "config file not in the expected format\n" unless /^\d+$/;
        /^\d+$/ ? $_ : 0; # 0 makes the || below pick the default
    } || 250;

    OTOH have you considered using an environment variable instead?

    my $image_megs = $ENV{IMAGEMEGS} || 250;

    >> find { no_chdir => 1,
    >> wanted => sub {
    >> return unless -f;

    >
    >I hadn't mentioned it, but all of the files of interest have the same extension, and
    >there are other files in the directories, so at this point, I added:
    > return unless /.extension$/;


    I suspected that: of course yours is the obvious workaround. Though you may
    want to be really fussy and write

    return unless /\.extension$/;

    instead (note the added backslash), although I doubt that there would be many
    "false positives"...

    >> push @files, [ $_, (stat _)[7,9] ];

    >
    >This is cool. I hadn't really seen an example of this that I had understood. I think


    It's a standard Perl(5) construct.

    >that what is happening is that the outer [] contains an unnamed array and returns a
    >reference to it, and that reference is pushed onto the @files array. I've seen this


    Yes, it's a reference to an anonymous array.

    >Along the way, I had added a bunch of additional "print" statements, to show me what
    >was going on at different points, to make sure I understood it. I commented out the
    >"unlink" statement during most of my investigation and early testing. I confused myself


    A Very Good Thing(TM)! As a totally minor side note, the code as
    written:

    unlink $_->[0] and
    print "Removing `$_->[0]'" or
    warn "Can't remove `$_->[0]': $!\n";

    makes it easy to comment out just the line with unlink() and still have
    the statement working for debugging/testing purposes:

    # unlink $_->[0] and
    print "Removing `$_->[0]'" or
    warn "Can't remove `$_->[0]': $!\n";

    This is why I often format it this way, BTW...


    Michele
    Michele Dondi, Jun 26, 2004
    #13
  14. Ron Heiby

    Joe Smith Guest

    Randal L. Schwartz wrote:

    > shift @list while @list and $size += $list[0][1] < $MAXSIZE;
    > # delete the rest
    > unlink or warn "Cannot delete $_: $!" for @list;
    >
    > Untested, but I usually get this stuff right. :)


    Shouldn't that last part be:
    unlink $_->[0] or warn "Cannot delete $_->[0]: $!" for @list;

    -Joe
    Joe Smith, Jun 29, 2004
    #14
