finding directory sizes


Zebee Johnstone

I want to archive directories to CD. I have many of them in
various places, I don't care if one from /data/web is on the
same CD as one from /home as long as the specified directory is
not split any further.

The important point is that there are things I need to exclude, such as
log files.

I'm currently getting the size by running du in an open() and munging
the result. Is there a better way?

open(DU, "find $snapshot -type d -maxdepth 1 " .
         "-exec du -sk --exclude=access_log* --exclude=error_log* {} \\; |")
    || die "can't do find for $snapshot $!\n";

I did think of using stat to add up every file, but if I'm talking
a few hundred per directory, is that wise? And how would I exclude
files, considering that each main directory set has more than one file
pattern to exclude? (this has 2, others have 3 or 4)

Zebee
 

Damian James

Zebee Johnstone said:
> ...
> open(DU, "find $snapshot -type d -maxdepth 1 " .
>          "-exec du -sk --exclude=access_log* --exclude=error_log* {} \\; |")
>     || die "can't do find for $snapshot $!\n";
>
> I did think of using stat to add up every file, but if I'm talking
> a few hundred per directory, is that wise? And how would I exclude
> files, considering that each main directory set has more than one file
> pattern to exclude? (this has 2, others have 3 or 4)

I'd suggest using File::Find with an appropriate callback sub. It's in
the standard distribution, and the docs have a few recipes.
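
Something along these lines, for instance (a sketch only: the two
exclusion patterns are the ones from the original post, and bucketing
by first-level directory is just one way to slice it):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $snapshot = shift || '.';
my %size;    # first-level directory => total bytes

find(sub {
    return unless -f $_;                  # plain files only
    return if /^(?:access|error)_log/;    # the patterns from the original post
    # Bucket each file under the first-level directory it lives in.
    my ($top) = $File::Find::dir =~ m{^\Q$snapshot\E(?:/([^/]+))?};
    $size{ defined $top ? $top : '.' } += -s _;   # -s _ reuses the -f stat
}, $snapshot);

printf "%12d  %s\n", $size{$_}, $_ for sort keys %size;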

Cheers,
Damian
 

Zebee Johnstone

In comp.lang.perl.misc on 23 Aug 2004 07:07:30 GMT
Damian James said:
> I'd suggest using File::Find with an appropriate callback sub. It's in
> the standard distribution, and the docs have a few recipes.

I'm not sure what you mean by "appropriate callback sub".

Do you mean use File::Find recursively to run stat on every file?

As far as I know, if you do that, you can't pass parameters to
the sub that's processing the files, so suddenly everything's global?

And as I say, is running stat on every file in dirs that have hundreds
of files the right way to go? And how do I exclude the ones I don't
want? I know the patterns I want to exclude; how do I pass those to the
File::Find subroutine?

Zebee
 

Joe Smith

Zebee said:
> And as I say, is running stat on every file in dirs that have hundreds
> of files the right way to go?

If you run `du` on a directory with hundreds of files, it is going
to stat() every file in the directory and all its subdirectories.

> And how do I exclude the ones I don't want?

Use the global $prune variable.

> I know the patterns I want to exclude; how do I pass those to the
> File::Find subroutine?

sub wanted {
    return($File::Find::prune = 1) if /unwanted|directory/;
    ...
}

-Joe
 

Brian McCauley

Zebee said:
> I'm not sure what you mean by "appropriate callback sub".
>
> Do you mean use File::Find recursively to run stat on every file?
>
> As far as I know, if you do that, you can't pass parameters to
> the sub that's processing the files, so suddenly everything's global?

Do not have an irrational fear of using package variables and local().
(Have only a rational fear.) Some time ago someone motivated by
irrational fear actually modified File::Find itself not to use package
variables for its globals, but instead to use file-scoped lexicals
(still global in the programming sense). Because local() doesn't work
on lexicals, this person just unthinkingly removed all the local()s.
In so doing they, of course, broke the re-entrancy of File::Find.

However, that said, you only need to use global variables (meaning
file-scoped lexicals or package-scoped variables) if you want the
callback to be a named subroutine. If you use an anonymous subroutine
then it acts as a closure, meaning it can see lexically scoped variables
that were in scope where the anonymous sub was defined.

sub do_find {
    my $foo = 'something';
    my $wanted = sub {
        # do stuff with $foo
    };
    find($wanted, '/foo', '/bar');
}

> And as I say, is running stat on every file in dirs that have hundreds
> of files the right way to go?

Well, obviously you have to do this in some way - but on Win32 IIRC the
implementation of stat() is (was?) pathological. If speed is of the
essence on Win32 then spawn a native Windows recursive directory lister
and parse the output.

> And how do I exclude the ones I don't want?

    return if ....

Or to exclude whole directories:

    $File::Find::prune = 1 if ...

> I know the patterns I want to exclude; how do I pass those to the
> File::Find subroutine?

Shared variables (either global or via closures).
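
Concretely, passing the patterns in via a closure might look something
like this (a sketch; the directory name is invented, and the two
patterns are the ones from the original post):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Sum the size of everything under $dir, skipping files that match
# any of the supplied patterns.  The callback is an anonymous sub,
# so it closes over $total and @exclude; no package globals needed.
sub dir_size {
    my ($dir, @exclude) = @_;
    my $total = 0;
    find(sub {
        return unless -f $_;
        foreach my $pat (@exclude) {
            return if $_ =~ $pat;       # excluded pattern: skip it
        }
        $total += -s _;                 # reuse the stat done by -f
    }, $dir);
    return $total;
}

my $bytes = dir_size('/data/web/site1', qr/^access_log/, qr/^error_log/);
print "$bytes bytes\n";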

 

Zebee Johnstone

In comp.lang.perl.misc on Mon, 23 Aug 2004 07:44:31 GMT
Joe Smith said:
> Use the global $prune variable.

Where is that documented? I saw it in the File::Find perldoc, but only
in passing, and it doesn't really explain what it is or does. perldoc -f
and perltoc don't mention it.

Does it exclude files or just directories, or whatever's matched by
the regexp?

Zebee
 

Jim Keenan

Zebee Johnstone said:
> I want to archive directories to CD. I have many of them in
> various places, I don't care if one from /data/web is on the
> same CD as one from /home as long as the specified directory is
> not split any further.
>
> The important point is that there are things I need to exclude, such as
> log files.
>
> I'm currently getting the size by running du in an open() and munging
> the result. Is there a better way?

Unless you can demonstrate through benchmarking that this is faster
than another approach such as using stat, I don't see why you need
to open a filehandle and parse its output when you are simply
interested in each file's name and size.

jimk
 

Jürgen Exner

Zebee said:
> I'm not sure what you mean by "appropriate callback sub".

The wanted() function, which _you_ need to provide so that File::Find
knows what to do with each file.

> Do you mean use File::Find recursively

No need to. That is the beauty of File::Find: it will recurse
automatically without _you_ doing all the leg work.

> to run stat on every file?

Try "-s" instead.

> And as I say, is running stat on every file in dirs that have
> hundreds of files the right way to go? And how do I exclude the
> ones I don't want? I know the patterns I want to exclude; how do I
> pass those to the File::Find subroutine?

Did you look at the documentation and examples that come with File::Find?
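
That is, inside your wanted() callback, accumulating sizes can be as
simple as (a fragment, assuming a $total in scope):

$total += -s $_ if -f $_;    # -s returns the size in bytes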

jue
 

Joe Smith

Zebee said:
> Where is that documented?

It matches the option by the same name in /usr/bin/find. See the
man page for 'find'. (A bit of history: the perl script find2perl
accepts the same command-line arguments as /usr/bin/find, and
outputs a perl script to implement that command.)

    /usr/bin/find / -fstype nfs -prune -o -name 'tmp' -prune -o -print

> Does it exclude files or just directories, or whatever's matched by
> the regexp?

File::Find calls the wanted() function for everything it comes across.
After your wanted() function returns, if the thing being looked at
is a directory, File::Find will process that directory recursively
unless $prune is set. Setting $prune while looking at a plain file
does nothing. Setting $prune while looking at a directory says to
pretend that the directory is empty.
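
In other words, something like this skips everything under any
directory named 'logs' (a sketch; the directory names are invented):

use strict;
use warnings;
use File::Find;

find(sub {
    # On a directory named 'logs', tell File::Find to pretend it
    # is empty: nothing below it will be visited.
    if (-d $_ && $_ eq 'logs') {
        $File::Find::prune = 1;
        return;
    }
    print "$File::Find::name\n";
}, '/data/web');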
-Joe
 

Zebee Johnstone

In comp.lang.perl.misc on Mon, 23 Aug 2004 14:21:13 GMT
Jürgen Exner said:
> Did you look at the documentation and examples that come with File::Find?

Yes, and found I couldn't understand it. I could read the words, but
there was prerequisite knowledge I was expected to have, such as
"prune", that I didn't have.

Zebee
 

Zebee Johnstone

In comp.lang.perl.misc on 23 Aug 2004 04:46:12 -0700
Jim Keenan said:
> Unless you can demonstrate through benchmarking that this is faster
> than another approach such as using stat, I don't see why you need
> to open a filehandle and parse its output when you are simply
> interested in each file's name and size.


I'm not running du for each file, only on directories.

If I use File::Find, I have to go through every single file, then later
work out how to decide which directories to keep together and which to
split.

Using du on directories I can go:
start at root.
Check all directories one level below root, get their size.
If one of them is too big to fit on a CD, then go down one level and
do it again. Recurse if necessary, though it usually isn't.

This gives me the smallest number of directories to then fit on CD.
It won't be the most efficient use of CD space, but then the efficient
use of human time to find things and get them back is more important
than a few meg here and there.

If I use File::Find to look at every single file, then I have to do some
kind of later munging to work out that directory split so as to keep as
much as possible of the directories below root together.

So root might have /web, /home and /other, and /web might have 15 sites
all smaller than a CD (some quite small, some quite large), plus there's
/web/web2 which has a similar mix of sites below it, but /web/web2
itself is larger than a CD. Meanwhile /home has at least one
directory below it that is too big to fit on a CD, so it has to be
split, and the directories below it do too.

But I don't know in advance which will have to be split and which won't.
If /web/web2/website1 is big enough to take up a CD on its own, I don't
want to split it.

Yes, if I have to recurse, then I have to re-run du on that directory,
so if there's a reasonable way to record the info for each file only
once and then do the splitting, that might be better.
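
For what it's worth, that top-down strategy might look something like
this (a sketch; the 700 MB limit and GNU du are assumptions, and plain
files sitting directly in an oversized directory are not handled):

#!/usr/bin/perl
use strict;
use warnings;

my $CD_LIMIT = 700 * 1024;    # assumed CD capacity, in kilobytes

# Size of $dir in KB, using the same du options as the original post.
sub du_size {
    my $dir = shift;
    my ($kb) = split ' ',
        `du -sk --exclude=access_log* --exclude=error_log* "$dir"`;
    return $kb;
}

# Collect the largest directories that still fit on one CD.
sub collect {
    my ($dir, $out) = @_;
    if (du_size($dir) <= $CD_LIMIT) {
        push @$out, $dir;     # fits: keep it whole
        return;
    }
    # Too big: descend one level and try each subdirectory.
    opendir my $dh, $dir or die "can't read $dir: $!";
    for my $entry (grep { !/^\.\.?$/ } readdir $dh) {
        my $path = "$dir/$entry";
        collect($path, $out) if -d $path && !-l $path;
    }
    closedir $dh;
}

my @pieces;
collect('/data/web', \@pieces);
print "$_\n" for @pieces;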

Zebee
 

Darren Dunham

Zebee Johnstone said:
> Yes, if I have to recurse, then I have to re-run du on that directory,
> so if there's a reasonable way to record the info for each file only
> once and then do the splitting, that might be better.

When you use 'du -s', you're doing all the work, but throwing away the
intermediate information. You could just keep the entire output of
'du', then parse the directories you need.
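
For instance (a sketch; this keeps every line of du's output rather
than just the summary):

use strict;
use warnings;

# Run du once, without -s, and remember every directory's size.
my %kb;
open my $du, '-|', 'du', '-k', '/data/web' or die "can't run du: $!";
while (<$du>) {
    chomp;
    my ($size, $dir) = split /\t/, $_, 2;
    $kb{$dir} = $size;
}
close $du;

print "$kb{'/data/web'} KB total\n";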

Alternatively, you could create a hash in memory that reflected the
filesystem, along with the sizes of every subdirectory. Just form the
subtree as you're parsing.

Then you can check the size of the top-level. If it's too big, just
look for all the sub-components and check their size...

(Using straight filenames instead of a hierarchical hash would have been
easier in some ways, but if the directory was oversized, it would have
been much harder to find all the subdirectories.)

#!/usr/bin/perl -w
use strict;
use File::Find;
use File::Spec;

my $size_key = "SIZE%:/_-SIZE";   # just a string that is unlikely
                                  # to match the name of a subdirectory

my $root_dir = ".";               # Default. Change as needed.
my $dir_tree = {};
find(\&wanted, $root_dir);

# Now you can see the sizes of various dirs and subdirs.

print "Dir <.> is ", $dir_tree->{"."}{$size_key}, " blocks.\n";
print "Dir <./foo> is ", $dir_tree->{"."}{"foo"}{$size_key}, " blocks.\n";
print "Dir <./foo/bar> is ",
    $dir_tree->{"."}{"foo"}{"bar"}{$size_key}, " blocks.\n";

sub wanted
{
    my $blocks = (lstat $_)[12];
    my @path_components = File::Spec->splitdir($File::Find::name);
    pop @path_components;         # discard the entry's own name

    # Walk down the tree, adding this entry's blocks to every
    # directory above it.
    my $pointer = $dir_tree;
    foreach my $component (@path_components)
    {
        $pointer = $pointer->{$component};
        $pointer->{$size_key} += $blocks;
    }
    if (-d $_)
    {
        # add a new hash component.
        $pointer->{$_} = { $size_key => $blocks };
    }
}


A method like this is a big win whenever the cost of the tree walk and
the stats is large. If it takes 2 minutes to do the top-level 'du', and
you might have to repeat it a few times to get what you want, the extra
overhead in the program to do it once is worth it.
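
Building on that, the splitting step could then walk the tree instead of
re-running du (a sketch to append to the script above; the 700 MB limit
is an assumption, and an oversized directory with no subdirectories
would need special handling):

# 700 MB expressed in 512-byte blocks, as reported by lstat.
my $cd_blocks = 700 * 1024 * 2;

# Recurse into any directory too big for one CD; keep the rest whole.
sub pick_pieces {
    my ($tree, $path, $out) = @_;
    if ($tree->{$size_key} <= $cd_blocks) {
        push @$out, $path;    # fits: keep it whole
        return;
    }
    foreach my $name (grep { $_ ne $size_key } keys %$tree) {
        pick_pieces($tree->{$name}, "$path/$name", $out);
    }
}

my @pieces;
pick_pieces($dir_tree->{"."}, ".", \@pieces);
print "$_\n" for @pieces;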
 
