finding directory sizes

Discussion in 'Perl Misc' started by Zebee Johnstone, Aug 23, 2004.

  1. I want to archive directories to CD. I have many of them in
    various places, I don't care if one from /data/web is on the
    same CD as one from /home as long as the specified directory is
    not split any further.

    The important point is that there are things I need to exclude, such as
    log files.

    I'm currently getting the size by using du in an open and munging
    the result. Is there a better way?

    open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";

    I did think of using stat to add up every file, but if I'm talking
    a few hundred per directory, is that wise? And how would I exclude
    files, considering that each main directory set has more than one file
    pattern to exclude? (this has 2, others have 3 or 4)

    Zebee
    Zebee Johnstone, Aug 23, 2004
    #1

  2. Damian James (Guest)

    On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
    >...
    >open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
    >
    >I did think of using stat to add up every file, but if I'm talking
    >a few hundred per directory, is that wise? And how would I exclude
    >files, considering that each main directory set has more than one file
    >pattern to exclude? (this has 2, others have 3 or 4)


    I'd suggest using File::Find with an appropriate callback sub. It's in
    the standard distribution, and the docs have a few recipes.
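
    Untested, but something along these lines should work; the exclude
    patterns and the path are just placeholders for your own:

    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    my $total = 0;
    find(sub {
        return if /^(?:access_log|error_log)/;   # skip excluded names
        $total += -s $_ if -f $_;                # bytes in plain files
    }, '/data/web/some_site');
    print "$total bytes\n";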

    Cheers,
    Damian
    Damian James, Aug 23, 2004
    #2

  3. In comp.lang.perl.misc on 23 Aug 2004 07:07:30 GMT
    Damian James <> wrote:
    > On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
    >>...
    >>open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
    >>
    >>I did think of using stat to add up every file, but if I'm talking
    >>a few hundred per directory, is that wise? And how would I exclude
    >>files, considering that each main directory set has more than one file
    >>pattern to exclude? (this has 2, others have 3 or 4)

    >
    > I'd suggest using File::Find with an appropriate callback sub. It's in
    > the standard distribution, and the docs have a few recipes.


    I'm not sure what you mean by "appropriate callback sub".

    Do you mean use File::Find recursively to run stat on every file?

    As far as I know, if you do that, you can't pass parameters to
    the sub that's processing the files, so suddenly everything's global?

    And as I say, is running stat on every file in dirs that have hundreds
    of files the right way to go? And how to exclude ones you don't want?
    I know the patterns I want to exclude; how do I pass those to the
    File::Find subroutine?

    Zebee
    Zebee Johnstone, Aug 23, 2004
    #3
  4. Joe Smith (Guest)

    Zebee Johnstone wrote:

    > and as I say, is running stat on every file in dirs that have hundreds
    > of files the right way to go?


    If you run `du` on a directory with hundreds of files, it is going
    to stat() every file in the directory and all its subdirectories.

    > and how to exclude ones you don't want?


    Use the global $prune variable.

    > I know the patterns I want to exclude, how do I pass those to the
    > File::Find subroutine?


    sub wanted {
        return ($File::Find::prune = 1) if /unwanted|directory/;
        ...
    }
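
    For example, something like this (the pattern and path are made up)
    skips matching files, prunes matching directories, and adds up
    everything else:

    use File::Find;
    my $total = 0;
    sub wanted {
        # pruning a plain file is a no-op, but the return still skips it
        return ($File::Find::prune = 1) if /^(?:access_log|error_log)/;
        $total += -s $_ if -f $_;    # size in bytes
    }
    find(\&wanted, '/data/web');
    print "$total bytes\n";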

    -Joe
    Joe Smith, Aug 23, 2004
    #4
  5. Zebee Johnstone wrote:
    > In comp.lang.perl.misc on 23 Aug 2004 07:07:30 GMT
    > Damian James <> wrote:
    >
    >>On Mon, 23 Aug 2004 06:48:43 GMT, Zebee Johnstone said:
    >>
    >>>...
    >>>open (DU,"find $snapshot -type d -maxdepth 1 -exec du -sk --exclude=access_log* --exclude=error_log* {} \\;|") || die "can't do find for $snapshot $!\n";
    >>>
    >>>I did think of using stat to add up every file, but if I'm talking
    >>>a few hundred per directory, is that wise? And how would I exclude
    >>>files, considering that each main directory set has more than one file
    >>>pattern to exclude? (this has 2, others have 3 or 4)

    >>
    >>I'd suggest using File::Find with an appropriate callback sub. It's in
    >>the standard distribution, and the docs have a few recipes.

    >
    >
    > I'm not sure what you mean by "appropriate callback sub".
    >
    > Do you mean use File::Find recursively to run stat on every file?
    >
    > As far as I know, if you do that, you can't pass parameters to
    > the sub that's processing the files, so suddenly everything's global?


    Do not have an irrational fear of using package variables and local().
    (Have only a rational fear.) Some time ago someone motivated by
    irrational fear actually modified File::Find itself not to use package
    variables for its global variables but instead to use file-scoped
    lexicals (still global in the programming sense). Because local()
    doesn't work on lexicals, this person just unthinkingly removed all the
    local()s. In so doing they, of course, broke the re-entrancy of File::Find.

    However, that said, you only need to use global variables (meaning
    file-scoped lexicals or package-scoped variables) if you want the
    callback to be a named subroutine. If you use an anonymous subroutine
    then it acts as a closure, meaning it can see lexically scoped variables
    that were in scope where the anonymous sub was defined.

    sub do_find {
        my $foo = 'something';
        my $wanted = sub {
            # do stuff with $foo
        };
        find($wanted, '/foo', '/bar');
    }

    > and as I say, is running stat on every file in dirs that have hundreds
    > of files the right way to go?


    Well, obviously you have to do this in some way - but on Win32, IIRC,
    the implementation of stat() is (was?) pathological. If speed is of
    the essence on Win32 then spawn a native Windows recursive directory
    lister and parse the output.

    > and how to exclude ones you don't want?


    return if ....

    Or to exclude whole directories

    $File::Find::prune = 1 if ...

    > I know the patterns I want to exclude, how do I pass those to the
    > File::Find subroutine?


    Shared variables (either global or via closures).
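
    For instance, as a sketch (the dir_size() name, the patterns, and the
    path are all made up):

    use strict;
    use File::Find;

    sub dir_size {
        my ($dir, @patterns) = @_;
        my $size = 0;
        my $wanted = sub {
            # the closure sees @patterns and $size from dir_size()
            foreach my $pat (@patterns) {
                return ($File::Find::prune = 1) if /$pat/;
            }
            $size += -s $_ if -f $_;
        };
        find($wanted, $dir);
        return $size;
    }

    my $bytes = dir_size('/data/web/site1', qr/^access_log/, qr/^error_log/);
    print "$bytes bytes\n";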

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
    Brian McCauley, Aug 23, 2004
    #5
  6. In comp.lang.perl.misc on Mon, 23 Aug 2004 07:44:31 GMT
    Joe Smith <> wrote:
    > Zebee Johnstone wrote:
    >
    >
    >> and how to exclude ones you don't want?

    >
    > Use the global $prune variable.


    Where is that documented? I saw it in the File::Find perldoc, but only
    in passing, and it doesn't really explain what it is or does. perldoc -f
    and perltoc don't mention it.

    Does it exclude files or just directories, or whatever's matched by
    the regexp?

    Zebee
    Zebee Johnstone, Aug 23, 2004
    #6
  7. Jim Keenan (Guest)

    Zebee Johnstone <> wrote in message news:<>...
    > I want to archive directories to CD. I have many of them in
    > various places, I don't care if one from /data/web is on the
    > same CD as one from /home as long as the specified directory is
    > not split any further.
    >
    > The important point is that there are things I need to exclude, such as
    > log files.
    >
    > I'm currently getting the size by using du in an open, and munging
    > the result, is there a better way?
    >


    Unless you can demonstrate through benchmarking that this is a faster
    approach than another, such as using 'stat', I don't see why you need
    to open a filehandle connection to read a file when you are simply
    interested in the file's name and size.

    jimk
    Jim Keenan, Aug 23, 2004
    #7
  8. Zebee Johnstone wrote:
    > Damian James <> wrote:
    >> I'd suggest using File::Find with an appropriate callback sub. It's
    >> in the standard distribution, and the docs have a few recipes.

    >
    > I'm not sure what you mean by "appropriate callback sub".


    The "wanted()" function, that _you_ need to provide sucht hat File::Find
    knows what to do with each file.

    > Do you mean use File::Find recursively


    No need for that. That is the beauty of File::Find: it will recurse
    automatically without _you_ doing all the legwork.

    > to run stat on every file?


    Try "-s" instead.

    > and as I say, is running stat on every file in dirs that have
    > hundreds of files the right way to go? and how to exclude ones you
    > don't want? I know the patterns I want to exclude, how do I pass
    > those to the File::Find subroutine?


    Did you look at the documentation and examples that come with File::Find?

    jue
    Jürgen Exner, Aug 23, 2004
    #8
  9. Joe Smith (Guest)

    Zebee Johnstone wrote:

    > In comp.lang.perl.misc on Mon, 23 Aug 2004 07:44:31 GMT
    > Joe Smith <> wrote:


    >>Use the global $prune variable.

    >
    > Where is that documented?


    It matches the option by the same name in /usr/bin/find. See the
    man page for 'find'. (A bit of history: the perl script find2perl
    accepts the same command line arguments as /usr/bin/find, and
    outputs a perl script to implement that command.)

    /usr/bin/find / -fstype nfs -prune -o -name 'tmp' -prune -o -print

    > Does it exclude files or just directories, or whatever's matched by
    > the regexp?


    File::Find calls the 'wanted' function for everything it comes across.
    After your wanted() function returns, if the thing being looked at
    is a directory, File::Find will process that directory recursively
    unless $prune is set. Setting $prune while looking at a plain file
    does nothing. Setting $prune while looking at a directory says to
    pretend that the directory is empty.
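
    For instance (directory name made up):

    sub wanted {
        if (-d $_ and $_ eq 'logs') {
            $File::Find::prune = 1;   # pretend this directory is empty
            return;
        }
        # plain files reach this point; setting $prune here would do nothing
    }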
    -Joe
    Joe Smith, Aug 23, 2004
    #9
  10. In comp.lang.perl.misc on Mon, 23 Aug 2004 14:21:13 GMT
    Jürgen Exner <> wrote:
    > Did you look at the documentation and examples that come with File::Find?
    >


    Yes. And found I couldn't understand it. As in, I could read the words,
    but there were things missing or prerequisite knowledge I was expected
    to have, such as "prune", that I didn't have.

    Zebee
    Zebee Johnstone, Aug 23, 2004
    #10
  11. In comp.lang.perl.misc on 23 Aug 2004 04:46:12 -0700
    Jim Keenan <> wrote:
    >
    > Unless you can demonstrate through benchmarking that this is a faster
    > approach than another such as using 'stat', I don't see why you need
    > to open a filehandle connection to read a file when you are simply
    > interested in the file's name and size.



    I'm not running du for each file, but for directories.

    If I use File::Find, I have to go through every single file, then later
    work out how to decide which directories to keep together and which to
    split.

    Using du on directories I can go:
    start at root.
    Check all directories one level below root, get their size.
    If one of them is too big to fit on a CD, then go down one level and
    do it again. Recurse if necessary, though it usually isn't.

    This gives me the smallest number of directories to then fit on CD.
    It won't be the most efficient use of CD space, but then the efficient
    use of human time to find things and get them back is more important
    than a few meg here and there.

    If I use File::Find to look at every single file, then I have to do some
    kind of later munging to work out that directory split, so as to keep as
    much as possible of the directories below root together.

    So root might have /web, /home and /other, and /web might have 15 sites
    all smaller than a CD (some quite small, some quite large), plus there's
    /web/web2 which has a similar mix of sites below it, but /web/web2
    itself is larger than a CD. Meanwhile /home has at least one
    directory below it that is too big to fit on a CD, so it has to be
    split, and the directories below it do too.

    But I don't know in advance which will have to be split, and which
    won't. If /web/web2/website1 is big enough to take up a CD on its own,
    I don't want to split it.

    Yes, if I have to recurse, then I have to re-do du on that directory,
    so if there's a reasonable way to record the info for each file only
    once and then do the splitting, that might be better.
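
    Roughly what I mean, as an untested sketch (the 650MB capacity and the
    dir_kb()/collect() names are just mine):

    #!/usr/bin/perl -w
    use strict;

    my $CD_KB = 650 * 1024;    # example CD capacity, in KB

    # size of a directory in KB, via du, as I'm doing now
    sub dir_kb {
        my ($dir) = @_;
        my ($kb) = split ' ', `du -sk $dir`;
        return $kb;
    }

    # return a list of directories, none larger than a CD,
    # splitting only the ones that won't fit whole
    sub collect {
        my ($dir) = @_;
        return ($dir) if dir_kb($dir) <= $CD_KB;
        opendir my $dh, $dir or die "can't opendir $dir: $!\n";
        my @subs = grep { -d } map { "$dir/$_" }
                   grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;
        # note: loose files directly under $dir still need a home
        return map { collect($_) } @subs;
    }

    print "$_\n" for collect('/data/web');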

    Zebee
    Zebee Johnstone, Aug 24, 2004
    #11
  12. Zebee Johnstone <> wrote:
    > yes, if I have to recurse, then I have to re-do du on that directory,
    > so if there's a reasonable way to record the info for each file only
    > once and then do the splitting that might be better.


    When you use 'du -s', you're doing all the work, but throwing away the
    intermediate information. You could just keep the entire output of
    'du', then parse the directories you need.

    Alternatively, you could create a hash in memory that reflected the
    filesystem, along with the sizes of every subdirectory. Just form the
    subtree as you're parsing.

    Then you can check the size of the top-level. If it's too big, just
    look for all the sub-components and check their size...

    (Using straight filenames instead of a hierarchical hash would have been
    easier in some ways, but if the directory was oversized, it would have
    been much harder to find all the subdirectories.)

    #!/usr/bin/perl -w
    use strict;
    use File::Find;
    use File::Spec;

    # just a string that is unlikely to match the name of a subdirectory
    my $size_key = "SIZE%:/_-SIZE";

    my $root_dir = "."; # Default. Change as needed.
    my $dir_tree = {};
    find(\&wanted, $root_dir);

    # Now you can see the sizes of various dirs and subdirs.

    print "Dir <.> is ", $dir_tree->{"."}{$size_key}, " blocks.\n";
    print "Dir <./foo> is ", $dir_tree->{"."}{"foo"}{$size_key}, " blocks.\n";
    print "Dir <./foo/bar> is ", $dir_tree->{"."}{"foo"}{"bar"}{$size_key}, " blocks.\n";

    sub wanted
    {
        my $blocks = (lstat $_)[12];    # block count for this entry
        my @path_components = File::Spec->splitdir($File::Find::name);
        pop @path_components;           # drop the entry's own name

        # walk down the tree, adding this entry's blocks to each
        # directory on the way
        my $pointer = $dir_tree;
        foreach my $component (@path_components)
        {
            $pointer = $pointer->{$component};
            $pointer->{$size_key} += $blocks;
        }
        if (-d $_)
        {
            # start a new subtree for this directory
            $pointer->{$_} = { $size_key => $blocks };
        }
    }


    A method like this is a big win whenever the cost of the treewalk and
    the stats are large. If it takes 2 minutes to do the top-level 'du',
    and you might have to repeat it a few times to get what you want, the
    extra overhead in the program to do it once is worth it.

    --
    Darren Dunham
    Senior Technical Consultant TAOS http://www.taos.com/
    Got some Dr Pepper? San Francisco, CA bay area
    < This line left intentionally blank to confuse you. >
    Darren Dunham, Aug 26, 2004
    #12
