Trim Multiple Dirs to Max Total Space Used - by Date


Ron Heiby

Hi! I've done a lot of FAQ reading and Google-ing and reading in O'Reilly books, but
I'm still stuck.

I have a system where data files are created in multiple directories. I need to run a
daily script that will total the disk space used by all the files in all the
directories and see whether the space exceeds some MAXSPACE value. In this case, all
but one of the directories are subdirectories of a common parent dir, while the other
one is off on its own. If the space does exceed the maximum, I need to start deleting
files, oldest first, until the total space used drops just below the maximum.

I've been looking at File::Find, and File::stat, among others, but don't quite see how
this all can be hung together to accomplish this seemingly simple task.

Any help would be much appreciated. Thanks!

P.S. I'll be looking for responses here. If using Email, remove the "_u" from my name
to avoid getting shuffled into an infrequently perused mailbox.
 

Jürgen Exner

Ron said:
Hi! I've done a lot of FAQ reading and Google-ing and reading in
O'Reilly books, but I'm still stuck.

I have a system where data files are created in multiple directories.
I need to run a daily script that will total the disk space used by
all the files in all the directories and see whether the space
exceeds some MAXSPACE value. In this case, all but one of the
directories are subdirectories of a common parent dir, while the
other one is off on its own. If the space does exceed the maximum, I
need to start deleting files, oldest first, until the total space
used drops just below the maximum.

I've been looking at File::Find, and File::stat, among others, but
don't quite see how this all can be hung together to accomplish this
seemingly simple task.

I would attack the problem in four steps:

First loop through all the directories to create an internal array of all
files which you are interested in. Forget File::Find, you don't need it
because you already have the comprehensive list of all directories.
For your purposes a file consists of the name including the full path, the
file size, and the date.
The obvious data structure would be an array of hashes, where each hash
contains three items, namely the qualified file name, the size, and the
date.

In step two you simply add all the sizes to determine your total used space.
Or you can do that while collecting the files in step 1 already.

Then sort the array by the date element.

And then beginning with the oldest file delete files (you got the fully
qualified name in the hash) until the added size of all deleted files is
larger than the difference between desired size and actual size as
determined in step 2.
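
Put together, a minimal (untested) sketch of those four steps, with
made-up directory names and a made-up limit:

#!/usr/bin/perl
use strict;
use warnings;

my $MAXSPACE = 250 * 1024 * 1024;  # made-up limit, in bytes
my @dirs = ('/data/parent/a', '/data/parent/b', '/other/dir');  # made-up

# steps 1 and 2: collect name, size, and date of every file, summing sizes
my @files;
my $total = 0;
for my $dir (@dirs) {
    opendir my $dh, $dir or die "Can't open $dir: $!\n";
    while (defined(my $entry = readdir $dh)) {
        my $name = "$dir/$entry";
        next unless -f $name;
        my ($size, $date) = (stat _)[7, 9];
        push @files, { name => $name, size => $size, date => $date };
        $total += $size;
    }
    closedir $dh;
}

# step 3: sort by date, oldest first
@files = sort { $a->{date} <=> $b->{date} } @files;

# step 4: delete the oldest files until we are back under the limit
for my $file (@files) {
    last if $total <= $MAXSPACE;
    if (unlink $file->{name}) {
        $total -= $file->{size};
    } else {
        warn "Can't remove $file->{name}: $!\n";
    }
}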

jue
 

Ron Heiby

Purl Gurl said:
You need to provide your system type.

Sorry. The system is Red Hat 8.0 Linux.
quota -v (total disk usage per ownership)
du -ask (per directory total disk usage kilobytes)

These look like they would tell me how much space is being used, but I do not see how
they would address the aspect of deleting the oldest.
ls -la (returns a list of files / sizes)
ls -laR (recursive list of files / sizes)

With either of these (and -t), I'm pretty sure that I can get a date-sorted list of the
files in the various directories, but each directory is listed separately. If I could
see how to get one combined list, that would be a big step forward.
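Could something like this one-liner (untested, with made-up paths) be
bent into producing one combined, oldest-first list?

perl -le 'print for sort { -M $b <=> -M $a } glob "/data/parent/*/* /other/dir/*"'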
Use of quota seems to be best suited for your task.

I don't see how quota deletes the old files. I admit that I've never used the quota
system, but the little I've looked at it, it seems that it is more for preventing the
creation of new files that would exceed the limit. I cannot do that. I must delete the
old ones.

Thanks!
 

Ron Heiby

Jürgen Exner said:
Forget File::Find, you don't need it
because you already have the comprehensive list of all directories.

Sorry I didn't make that part clear. I know the odd-ball directory and I know the
parent directory of the other directories of interest. However, I do not know, a
priori, what their names are.
For your purposes a file consists of the name including the full path, the
file size, and the date.

Makes sense.
The obvious data structure would be an array of hashes, where each hash
contains three items, namely the qualified file name, the size, and the
date.

I thought that a hash matched a single key with a single value. What would you have as
the key? Would I have the value be an array reference with the array holding the other
two? Or, am I as confused as I think I am? :)
In step two you simply add all the sizes to determine your total used space.
Or you can do that while collecting the files in step 1 already.

Yes, during collection makes sense to me.
Then sort the array by the date element.

Perhaps when I better understand how you are picturing the data structure this will
become clearer. It sounds like the date is the hash key. I'm thinking that if this is
the case, I'll want to use the "raw" UNIX style seconds-since-epoch date value. But, I
think I'll still need to be careful of potential collisions, where multiple files have
the same modification date. This should happen rarely, and if I just increment the date
value of the colliders until the date is unique, that won't be a problem. Maybe there's
no reason why the date has to be the key, though. The full pathname of each file is
already unique, and could probably be the key just as well. I'm still confused about
having two values for each key in the hash, though.
And then beginning with the oldest file delete files (you got the fully
qualified name in the hash) until the added size of all deleted files is
larger than the difference between desired size and actual size as
determined in step 2.

Speaking of size -- I think the size that matters here is the number of Kbytes that the
file is actually taking up on the drive, which is likely slightly larger than its
length might imply. On the other hand, if that's a real pain, I can pretty easily
ignore that slop, as this does not have to be completely exact. If I leave a few of the
files lying around an extra day, it's no problem.
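
If the on-disk size ever does matter, I gather stat's block count would
give it (untested sketch, made-up path):

use strict;
use warnings;

my $path = '/some/dir/sample.dat';            # made-up
my ($length, $blocks) = (stat $path)[7, 12];  # byte length, 512-byte blocks
my $on_disk = $blocks * 512;                  # space actually allocated
print "$path: $length bytes long, $on_disk bytes on disk\n";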

A couple other things I failed to mention earlier that may be useful to know -- The
typical size of each of these files will be in the 50-100 Kbyte realm. We're talking
about keeping around a configurable amount of these files, with the default being 250
Megabytes.

Thanks!
 

Jürgen Exner

Ron said:
Sorry I didn't make that part clear. I know the odd-ball directory
and I know the parent directory of the other directories of interest.
However, I do not know, a priori, what their names are.

Well, ok, then yes, File::Find would be the best tool to enumerate all files.
Makes sense.


I thought that a hash matched a single key with a single value. What
would you have as the key?

Each hash would contain 3 elements, the keys being: 'name', 'size', and
'date'.
This represents one abstract file.
Would I have the value be an array
reference with the array holding the other two? Or, am I as confused
as I think I am? :)

You need the complete list of all files. The easiest technical
implementation is an array (= list) of hashes (= files).
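
For instance, with made-up values, one file would enter the list like this:

my @files;  # the complete list: one hash per file
push @files, {
    name => '/data/parent/a/sample.dat',  # fully qualified name (made-up)
    size => 51_200,                       # bytes
    date => 1_066_000_000,                # mtime, seconds since epoch
};
print "$files[0]{name} is $files[0]{size} bytes\n";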
Yes, during collection makes sense to me.


Perhaps when I better understand how you are picturing the data
structure this will become clearer. It sounds like the date is the
hash key. I'm thinking that if this is the case, I'll want to use the
"raw" UNIX style seconds-since-epoch date value. But, I think I'll
still need to be careful of potential collisions, where multiple
files have the same modification date.
[...]

You are thinking way too complicated. You got a list of files, implemented as
an array of hashes. Now just sort that list by the date of each file and
then start deleting from the upper (or lower) end of the sorted array.
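
For instance (assuming @files as built above):

# no unique key is involved: sort the array itself by each file's date;
# two files sharing a date are harmless here
my @oldest_first = sort { $a->{date} <=> $b->{date} } @files;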

jue
 

Michele Dondi

I have a system where data files are created in multiple directories. I need to run a
daily script that will total the disk space used by all the files in all the
directories and see whether the space exceeds some MAXSPACE value. In this case, all
but one of the directories are subdirectories of a common parent dir, while the other
one is off on its own. If the space does exceed the maximum, I need to start deleting
files, oldest first, until the total space used drops just below the maximum.

I've been looking at File::Find, and File::stat, among others, but don't quite see how
this all can be hung together to accomplish this seemingly simple task.

Generally it's not considered a good idea to post complete solutions,
but see if this (untested!) can help you:


#!/usr/bin/perl -l

use strict;
use warnings;
use File::Find;
use constant MAXSPACE => 0xA00_000;  # 10 MB

# keep only the arguments that really are directories
@ARGV = grep { -d or !warn "`$_': not a directory!\n" } @ARGV;
die <<"EOD" unless @ARGV;
Usage: $0 <dir> [<dirs>]
EOD

my @files;

# collect [ name, size, mtime ] for every plain file under the given dirs
find { no_chdir => 1,
       wanted   => sub {
           return unless -f;
           print "Examining ", $_;
           push @files, [ $_, (stat _)[7, 9] ];
       } }, @ARGV;

# $t ends up as (total size) - MAXSPACE, i.e. the amount to be freed
my $t = -(MAXSPACE);
$t += $_->[1] for @files;

print "No file needs to be deleted" and exit if $t <= 0;

# oldest first, delete until enough space has been freed
for (sort { $a->[2] <=> $b->[2] } @files) {
    unlink $_->[0] and
        print "Removing `$_->[0]'" or
        warn "Can't remove `$_->[0]': $!\n";
    last if ($t -= $_->[1]) <= 0;
}

__END__


Michele
 

Ron Heiby

Purl Gurl said:
Sorry Ron, our family cannot allow you to visit
as much as we would like for you to visit.

Golly. Sorry about all the problems you've been having. I was able to get to the page
you listed and have a copy of the script. I will be taking a look at it today. I
appreciate all the help I've received. Thanks!
 

Ron Heiby

Thanks! I'll be looking at this today. One thing is for sure, I'm learning some new (to
me) things about using Perl!
 

Sherm Pendley

Ron said:
Golly. Sorry about all the problems you've been having.

Do yourself a favor and read a few more of her rants before you start
feeling too sorry for her. She's delusional, and the "problems" she speaks
of exist only in her imagination.

sherm--
 

Michele Dondi

Thanks! I'll be looking at this today. One thing is for sure, I'm learning some new (to
me) things about using Perl!

Well, since I wrote the script in the first place, you may (modify it
suitably and) try it on a sample directory: please tell me if there's
anything wrong with it and ask for clarification...


Michele
 

Randal L. Schwartz

Ron> I have a system where data files are created in multiple
Ron> directories. I need to run a daily script that will total the
Ron> disk space used by all the files in all the directories and see
Ron> whether the space exceeds some MAXSPACE value. In this case, all
Ron> but one of the directories are subdirectories of a common parent
Ron> dir, while the other one is off on its own. If the space does
Ron> exceed the maximum, I need to start deleting files, oldest first,
Ron> until the total space used drops just below the maximum.

Off the top of my head, using File::Finder (my module in the CPAN):

my $MAXSIZE = 102400;   # 100K, let's say
my @START = qw(. /tmp); # current directory and /tmp being trimmed

use File::Finder;
my @list = sort { $a->[2] <=> $b->[2] }  # sort by age, newest first
    File::Finder->type('f')->collect(sub {
        [$File::Find::name, -s, -M]      # name, size, age in days
    }, @START);
my $size = 0; # start the accumulator
# keep the newest files while the running total stays under the limit
shift @list while @list and ($size += $list[0][1]) < $MAXSIZE;
# delete the rest
unlink or warn "Cannot delete $_: $!" for @list;

Untested, but I usually get this stuff right. :)

print "Just another Perl hacker,"
 

Ron Heiby

Michele's script was very close to what I was looking for, so I started with it (before
seeing Randal's version, which I'll now have to study).

Michele Dondi said:
use constant MAXSPACE => 0xA00_000;  # 10 MB

I needed to be able to get the maximum space value from a configuration file, so this
line was replaced in my version with:

my $image_megs = `cat /path/to/config_file 2>/dev/null`;
$image_megs = 250 unless $image_megs;

This protects me against there being no config_file; however, I don't think that I'm
protected against some random crap in the config_file, so I probably should change this
to look for a "reasonable" value and default if the value is out of range.
find { no_chdir => 1,
       wanted   => sub {
           return unless -f;

I hadn't mentioned it, but all of the files of interest have the same extension, and
there are other files in the directories, so at this point, I added:
return unless /.extension$/;
push @files, [ $_, (stat _)[7,9] ];

This is cool. I hadn't really seen an example of this that I had understood. I think
that what is happening is that the outer [] contains an unnamed array and returns a
reference to it, and that reference is pushed onto the @files array. I've seen this
talked about in various documentation, but until I saw it here, the concept hadn't
"clicked". Understanding the values going in and seeing how they were accessed later to
do an actual task that I understood was a big help.

Along the way, I had added a bunch of additional "print" statements, to show me what
was going on at different points, to make sure I understood it. I commented out the
"unlink" statement during most of my investigation and early testing. I confused myself
at one point by having a low maximum space value and choosing a directory that had
files of various sizes, ranging from a couple dozen bytes to a couple megs. I was
surprised when my initial runs "deleted" all but two very small files, until I realized
that one of the most recently modified files was larger than my max limit.

Anyway, I got things going great. I really appreciate all of the assistance. I did look
at Purl Gurl's script too. It was interesting to see how the direct directory accessing
could be used, although I generally find myself philosophically aligned more closely
with the use of library routines when they are available.

Thanks to all!
 

Michele Dondi

Michele's script was very close to what I was looking for, so I started with it (before
seeing Randal's version, which I'll now have to study).

Well, I've always used File::Find for my needs of "this kind", and it
has always proved to be perfectly suited for them, though I must say
that Randal's script is cool in that his module already provides what
I am doing manually.
I needed to be able to get the maximum space value from a configuration file, so this
line was replaced in my version with:

my $image_megs = `cat /path/to/config_file 2>/dev/null`;
$image_megs = 250 unless $image_megs;

This is good in that it's only one line. Personally I'm a bit
idiosyncratic about backticks, but that's definitely a personal thing,
so don't mind!

However you may want to do some more checks by adopting something
along the lines of this (untested):

my $image_megs = do {
    open my $fh, '<', '/path/to/config_file'
        or die "Can't open config file: $!\n";
    my $line = <$fh>;
    $line = '' unless defined $line;
    chomp $line;  # not strictly necessary, but no harm done...
    $line =~ /^\d+$/
        ? $line
        : (warn("config file not in the expected format\n"), 0);
} || 250;

OTOH have you considered using an environment variable instead?

my $image_megs = $ENV{IMAGEMEGS} || 250;
I hadn't mentioned it, but all of the files of interest have the same extension, and
there are other files in the directories, so at this point, I added:
return unless /.extension$/;

I suspected that: of course yours is the obvious workaround. Only you
may want to be really fussy and write

return unless /\.extension$/;

instead (note the added backslash, which makes the dot match a literal
dot), although I doubt that there would be many "false positives"...
push @files, [ $_, (stat _)[7,9] ];

This is cool. I hadn't really seen an example of this that I had understood. I think

It's a standard Perl(5) construct.
that what is happening is that the outer [] contains an unnamed array and returns a
reference to it, and that reference is pushed onto the @files array. I've seen this

Yes, it's a reference to an anonymous array.
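For instance (made-up values):

my @files;
my $ref = [ 'name.ext', 51_200, 1_066_000_000 ];  # reference to an anonymous array
push @files, $ref;          # same as pushing [ ... ] directly
print $files[0][0], "\n";   # prints 'name.ext'; '->' is implied between subscripts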
Along the way, I had added a bunch of additional "print" statements, to show me what
was going on at different points, to make sure I understood it. I commented out the
"unlink" statement during most of my investigation and early testing. I confused myself

A Very Good Thing(TM)! As a totally minor side note, the code as
written:

unlink $_->[0] and
    print "Removing `$_->[0]'" or
    warn "Can't remove `$_->[0]': $!\n";

makes it easy to comment out just the line with unlink() while keeping
the statement working for debugging/testing purposes:

# unlink $_->[0] and
    print "Removing `$_->[0]'" or
    warn "Can't remove `$_->[0]': $!\n";

This is why I often format it this way, BTW...


Michele
 

Joe Smith

Randal said:
shift @list while @list and ($size += $list[0][1]) < $MAXSIZE;
# delete the rest
unlink or warn "Cannot delete $_: $!" for @list;

Untested, but I usually get this stuff right. :)

Shouldn't that last part be:
unlink $_->[0] or warn "Cannot delete $_->[0]: $!" for @list;

-Joe
 
