Why Is My Hash Assignment Taking So Long?

Discussion in 'Perl Misc' started by James E Keenan, Oct 13, 2003.

  1. I need to build a hash whose elements represent characteristics of each of
    9600 files in a single directory. I am trying to figure out why this hash
    assignment is taking so long. Here is my code:

    my %threads_chars = (
        dir     => "~/Threads",
        pattern => "\.thr\.txt",
    );

    $threadsref = analyze_dir(\%threads_chars);

    sub analyze_dir {
        my $hashref = shift;
        my $dir     = ${$hashref}{'dir'};
        my $pattern = ${$hashref}{'pattern'};
        my (%filechars, $count);
        chdir $dir or die "Couldn't change to $dir: $!";
        opendir DIR, $dir or die "Couldn't open $dir: $!";

        #1: get files in $dir which match pattern
        my @files = grep { m|$pattern$|o } readdir DIR;
        print '@files established for ', "$dir\n";

        #2: build a hash where each element is keyed by the core of the file name
        #   and the value is a ref to an array holding the file name, full path,
        #   atime, mtime.
        #   Set a counter so I know how fast it's going
        foreach (@files) {
            $_ =~ m|^(.*)$pattern$|;
            my $core = $1;
            $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
            $count++;
            if ($count % 100 == 0) { print 'count: ', "$count\n"; }
        }
        closedir DIR or die "Couldn't close $dir: $!";
        print "$dir HAS BEEN ANALYZED\n";
        return \%filechars;
    }

    The counter prints to screen in 100-file increments. The first couple of
    hundred files zip right by, but then the process slows to a crawl. This
    occurred despite there being no other significant activity happening on this
    disk. Why?

    jimk
    James E Keenan, Oct 13, 2003
    #1

  2. Anno Siegel

    James E Keenan <> wrote in comp.lang.perl.misc:
    > I need to build a hash whose elements represent characteristics of each of
    > 9600 files in a single directory. I am trying to figure out why this hash
    > assignment is taking so long. Here is my code:


    That's a lot of files to keep in a single directory. That is the
    reason why it's slow.

    There are a few obvious inefficiencies in your code, but these have
    no measurable effect.

    > my %threads_chars = (
    > dir => "~/Threads",
    > pattern => "\.thr\.txt",


    You have a quoting problem here. Backslashing the dots in a *string*
    has no effect. After the string is built, the backslashes are gone.
    Use qr// to quote patterns:

    pattern => qr/\.thr\.txt/,
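
    A one-liner shows the difference (a minimal sketch; the file name
    "Xthr.txt" is only an illustration):

        my $str = "\.thr\.txt";   # the backslashes are gone: this is ".thr.txt"
        my $re  = qr/\.thr\.txt/; # the backslashes survive; the dots stay literal

        print "Xthr.txt" =~ /$str$/ ? "string: match\n" : "string: no match\n";
        # prints "string: match" -- the unescaped '.' happily matches the 'X'
        print "Xthr.txt" =~ /$re$/ ? "qr//: match\n" : "qr//: no match\n";
        # prints "qr//: no match" -- a literal '.' is now required before "thr"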

    > );
    >
    > $threadsref = analyze_dir(\%threads_chars);
    >
    > sub analyze_dir {
    > my $hashref = shift;
    > my $dir = ${$hashref}{'dir'};
    > my $pattern = ${$hashref}{'pattern'};
    > my (%filechars, $count);
    > chdir $dir or die "Couldn't change to $dir: $!";
    > opendir DIR, $dir or die "Couldn't open $dir: $!";
    >
    > #1: get files in $dir which match pattern
    > my @files = grep { m|$pattern$|o } readdir DIR;


    Why do you build a list of all files just to process them once? Walk
    through the directory and skip what you don't want. You'll save
    the list (space efficiency) and you'll only have to apply the regex
    once (time efficiency).

    You should also take care to de-select anything that isn't a plain file.

    > print '@files established for ', "$dir\n";
    >
    > #2: build a hash where each element is keyed by the core of the file name
    > #   and the value is a ref to an array holding the file name, full path,
    > #   atime, mtime.
    > # Set a counter so I know how fast it's going
    > foreach (@files) {
    > $_ =~ m|^(.*)$pattern$|;
    > my $core = $1;
    > $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
    > $count++;
    > if ($count % 100 == 0) { print 'count: ', "$count\n";}
    > }
    > closedir DIR or die "Couldn't close $dir: $!";
    > print "$dir HAS BEEN ANALYZED\n";
    > return \%filechars;
    > }
    >
    > The counter prints to screen in 100-file increments. The first couple of
    > hundred files zip right by, but then the process slows to a crawl. This
    > occurred despite there being no other significant activity happening on this
    > disk. Why?


    The stat() operation takes longer for files that are "late" in the
    directory, in the sense that readdir() produces them late. Since
    your directory is oversized, the difference is significant. Since
    you process the files in the order in which they were found, the process
    seems to slow down progressively. To test the hypothesis, just change

    my @files = grep { m|$pattern$|o } readdir DIR;

    in your code to

    my @files = reverse grep { m|$pattern$|o } readdir DIR;

    and watch it start out slow and grow faster as it goes. To gain speed,
    spread the files out over more smaller directories.

    Below is a cleaned-up version of analyze_dir(). It isn't any faster.

    sub analyze_dir {
        my ( $dir, $pattern ) = @{ shift() }{ qw( dir pattern ) };
        opendir my $d, $dir or die "Can't open dir $dir: $!";
        chdir $dir or die "Can't cd to $dir: $!";
        my ( %filechars, $count );
        while ( defined( $_ = readdir $d ) ) {
            next unless -f;                 # skip anything that isn't a plain file
            next unless /^(.*)$pattern$/;   # capture the core of the file name
            $filechars{$1} = [ $_, "$dir/$_", (stat $_)[ 8 .. 9 ] ];
            print "count: $count\n" if ++$count % 100 == 0;
        }
        closedir $d or die "Can't close dir $dir: $!";
        return \%filechars;
    }
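
    It is called the same way as the original. One caveat worth noting (the
    usage line below is illustrative, not from the post): Perl does not
    expand "~", so chdir "~/Threads" fails unless a directory literally
    named "~" exists. Spell the path out, e.g.:

        my $threadsref = analyze_dir(
            { dir => "$ENV{HOME}/Threads", pattern => qr/\.thr\.txt/ }
        );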

    Anno
    Anno Siegel, Oct 13, 2003
    #2

  3. Sisyphus

    James E Keenan wrote:

    > foreach (@files) {
    > $_ =~ m|^(.*)$pattern$|;
    > my $core = $1;
    > $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
    > $count++;
    > if ($count % 100 == 0) { print 'count: ', "$count\n";}
    > }


    Can't really see the answer to your question, but it seems to me that
    you could readily play around with that loop in order to find out just
    which part of it is killing the show.

    I would start by assigning to $filechars{$count} instead of
    $filechars{$core}. If that makes no difference comment out 'my $core =
    $1;' - then comment out the regex - then remove the stat() results from
    the array reference. Hopefully by then you will have found out where the
    time is going, and we can start pondering "why?" :)
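
    One way to put numbers on that bisection (a sketch, not from the original
    post; Time::HiRes is a core module, and the commented calls are just
    examples of what you might time):

        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );

        # Time one candidate culprit over the whole file list.
        sub time_loop {
            my ( $label, $code, @files ) = @_;
            my $t0 = [ gettimeofday ];
            $code->($_) for @files;
            printf "%-12s %.3f s\n", $label, tv_interval($t0);
        }

        # time_loop( 'stat only',  sub { my @t = (stat $_[0])[8..9] },    @files );
        # time_loop( 'regex only', sub { $_[0] =~ m|^(.*)\.thr\.txt$| }, @files );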

    Hth.

    Cheers,
    Rob


    --
    To reply by email u have to take out the u in kalinaubears.
    Sisyphus, Oct 13, 2003
    #3
  4. "Anno Siegel" <-berlin.de> wrote in message
    news:bmdrkl$gji$-Berlin.DE...
    > James E Keenan <> wrote in comp.lang.perl.misc:
    > > I need to build a hash whose elements represent characteristics of each of
    > > 9600 files in a single directory. I am trying to figure out why this hash
    > > assignment is taking so long. Here is my code:

    >
    > That's a lot of files to keep in a single directory. That is the
    > reason why it's slow.
    >
    > There are a few obvious inefficiencies in your code, but these have
    > no measurable effect.
    >
    > > my %threads_chars = (
    > > dir => "~/Threads",
    > > pattern => "\.thr\.txt",

    >
    > You have a quoting problem here. Backslashing the dots in a *string*
    > has no effect. After the string is built, the backslashes are gone.
    > Use qr// to quote patterns:
    >
    > pattern => qr/\.thr\.txt/,
    >


    Ah! An incentive to learn the 'qr' operator!

    > > );
    > >
    > > $threadsref = analyze_dir(\%threads_chars);
    > >
    > > sub analyze_dir {
    > > my $hashref = shift;
    > > my $dir = ${$hashref}{'dir'};
    > > my $pattern = ${$hashref}{'pattern'};
    > > my (%filechars, $count);
    > > chdir $dir or die "Couldn't change to $dir: $!";
    > > opendir DIR, $dir or die "Couldn't open $dir: $!";
    > >
    > > #1: get files in $dir which match pattern
    > > my @files = grep { m|$pattern$|o } readdir DIR;

    >
    > Why do you build a list of all files just to process them once? Walk
    > through the directory and skip what you don't want. You'll save
    > the list (space efficiency) and you'll only have to apply the regex
    > once (time efficiency).
    >

    I was trying to think of the way to walk thru the directory, but I was just
    too tired to think. After a good night's sleep, I located the answer in the
    Llama book, similar to what you have below.

    > You should also take care to de-select anything that isn't a plain file.
    >

    As it happened, there was only one non-matching file in the directory. But
    your point is well taken and I've done that in other scripts.

    > > print '@files established for ', "$dir\n";
    > >
    > > #2: build a hash where each element is keyed by the core of the file name
    > > #   and the value is a ref to an array holding the file name, full path,
    > > #   atime, mtime.
    > > # Set a counter so I know how fast it's going
    > > foreach (@files) {
    > > $_ =~ m|^(.*)$pattern$|;
    > > my $core = $1;
    > > $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
    > > $count++;
    > > if ($count % 100 == 0) { print 'count: ', "$count\n";}
    > > }
    > > closedir DIR or die "Couldn't close $dir: $!";
    > > print "$dir HAS BEEN ANALYZED\n";
    > > return \%filechars;
    > > }
    > >
    > > The counter prints to screen in 100-file increments. The first couple of
    > > hundred files zip right by, but then the process slows to a crawl. This
    > > occurred despite there being no other significant activity happening on this
    > > disk. Why?

    >
    > The stat() operation takes longer for files that are "late" in the
    > directory, in the sense that readdir() produces them late. Since
    > your directory is oversized, the difference is significant. Since
    > you process the files in the order in which they were found, the process
    > seems to slow down progressively.


    Very interesting!

    > To test the hypothesis, just change
    >
    > my @files = grep { m|$pattern$|o } readdir DIR;
    >
    > in your code to
    >
    > my @files = reverse grep { m|$pattern$|o } readdir DIR;
    >
    > and watch it start out slow and grow faster as it goes. To gain speed,
    > spread the files out over more smaller directories.
    >

    I'll test these out! Thanks, Anno.

    More generally, is the sort of too-many-files-in-a-directory slowdown I
    experienced here true across operating systems? Or just on this one
    (Windows98 SE)? (It does seem as if all directory-related operations
    calling upon this directory are slow.)

    jimk
    James E Keenan, Oct 14, 2003
    #4
  5. James E Keenan wrote:
    >
    > "Anno Siegel" <-berlin.de> wrote in message
    > news:bmdrkl$gji$-Berlin.DE...
    > >
    > > To test the hypothesis, just change
    > >
    > > my @files = grep { m|$pattern$|o } readdir DIR;
    > >
    > > in your code to
    > >
    > > my @files = reverse grep { m|$pattern$|o } readdir DIR;
    > >
    > > and watch it start out slow and grow faster as it goes. To gain speed,
    > > spread the files out over more smaller directories.

    >
    > I'll test these out! Thanks, Anno.
    >
    > More generally, is the sort of too-many-files-in-a-directory slowdown I
    > experienced here true across operating systems? Or just on this one
    > (Windows98 SE)? (It does seem as if all directory-related operations
    > calling upon this directory are slow.)


    The FAT32 file system that Win9x uses is based on a file system
    originally (IIRC) designed for floppy disks. Even if you defragmented
    your entire disk (including directories) you would not get much of a
    speed improvement. And yes, there are operating systems/file systems
    that handle this better but I don't have any specifics. :-(


    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Oct 14, 2003
    #5
  6. Anno Siegel

    James E Keenan <> wrote in comp.lang.perl.misc:
    >
    > "Anno Siegel" <-berlin.de> wrote in message
    > news:bmdrkl$gji$-Berlin.DE...
    > > James E Keenan <> wrote in comp.lang.perl.misc:
    > > > I need to build a hash whose elements represent characteristics of each of
    > > > 9600 files in a single directory. I am trying to figure out why this


    [...]

    > > The stat() operation takes longer for files that are "late" in the
    > > directory, in the sense that readdir() produces them late. Since
    > > your directory is oversized, the difference is significant. Since
    > > you process the files in the order in which they were found, the process
    > > seems to slow down progressively.

    >
    > Very interesting!


    [...]

    > More generally, is the sort of too-many-files-in-a-directory slowdown I
    > experienced here true across operating systems? Or just on this one
    > (Windows98 SE)? (It does seem as if all directory-related operations
    > calling upon this directory are slow.)


    Generally, very large directories are slow, though there may be differences,
    even large ones, between various file systems. Directories were invented
    so you don't have all your files in one place, so they are perhaps not
    optimized for large numbers. Most people would consider a few hundred
    files in a directory quite crowded, especially if the directory is active.
    Split them up so that no directory has more than 256 entries; that seems
    to be a reasonable number.

    When a directory is mostly passive (like, say, the fonts of something big
    and bulky), you can store more files per directory, but even then I'd put
    a limit at a thousand or so.

    Anno
    Anno Siegel, Oct 14, 2003
    #6
  7. Uri Guttman

    >>>>> "AS" == Anno Siegel <-berlin.de> writes:

    AS> When a directory is mostly passive (like, say, the fonts of
    AS> something big and bulky), you can store more files per directory,
    AS> but even then I'd put a limit at a thousand or so.

    the size of a file has little to do with how many files should be in a
    directory. classic unix file systems totally isolate the directory
    entry from the file contents. directory searches are linear which is why
    they get very slow when there are many entries. i ran into this many
    years ago and redesigned a subsystem to use multiple subdirs before
    finding a particular file. it effectively converted a linear search to a
    tree search. i used 2 digits of the log file name for 2 levels keeping
    the number of entries in any directory to under 100. it was much much
    faster than the old system which had 10k+ entries in one dir. this is
    still a common and easy technique. some modern unix/linux filesystems
    use trees or hashes for lookup and are much faster than linear. but you
    are always guaranteed to have a classic linear filesystem around so this
    subdir trick will work anywhere.
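
    a sketch of that two-level layout (illustrative only; the "logs" root,
    the digit-picking regex, and the file names are assumptions, not the
    original subsystem):

        use strict;
        use warnings;
        use File::Copy qw( move );

        # file "log_4711.thr.txt" lands in logs/4/7/ -- the first two
        # digits of the name pick the two directory levels, so no level
        # ever holds more than 10 subdirs and leaf dirs stay small.
        sub file_away {
            my ($name) = @_;
            my ( $d1, $d2 ) = $name =~ /(\d)\D*(\d)/
                or die "no two digits in '$name'";
            for my $dir ( "logs", "logs/$d1", "logs/$d1/$d2" ) {
                mkdir $dir unless -d $dir;
            }
            move( $name, "logs/$d1/$d2/$name" ) or die "can't move '$name': $!";
        }

        # lookups recompute the same two digits, turning one linear scan
        # of 10k+ entries into a few scans of tiny directories.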

    and i would limit dir sizes to way less than 1000. 100-300 seems to be a
    good value.

    and this is way OT as it has no perl content at all.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Oct 14, 2003
    #7
  8. Anno Siegel

    Uri Guttman <> wrote in comp.lang.perl.misc:
    > >>>>> "AS" == Anno Siegel <-berlin.de> writes:

    >
    > AS> When a directory is mostly passive (like, say, the fonts of
    > AS> something big and bulky), you can store more files per directory,
    > AS> but even then I'd put a limit at a thousand or so.
    >
    > the size of a file has little to do with how many files should be in a
    > directory...


    Just for the record, I didn't mean to say it does.

    Anno
    Anno Siegel, Oct 15, 2003
    #8
  9. "Uri Guttman" <> wrote in message
    news:...
    > >>>>> "AS" == Anno Siegel <-berlin.de> writes:

    > [snip]
    > i ran into this many
    > years ago and redesigned a subsystem to use multiple subdirs before
    > finding a particular file. it effectively converted a linear search to a
    > tree search. i used 2 digits of the log file name for 2 levels keeping
    > the number of entries in any directory to under 100. it was much much
    > faster than the old system which had 10k+ entries in one dir. this is
    > still a common and easy technique.


    And after I pondered Anno's first response to my OP, I began to design a new
    directory structure that works along very much the same principle (viz.,
    breaking up this one big directory into smaller directories where the files
    sort on a very simple principle such as the 1st letter of the file name).

    > [snip]
    >
    > and i would limit dir sizes to way less than 1000. 100-300 seems to be a
    > good value.
    >


    Agreed.

    > and this is way OT as it has no perl content at all.
    >

    Disagree. See my original posting. I thought the slowdown might be due to
    some problem in my Perl hash assignment. While Anno pointed out some ways
    in which my code could be optimized (which I've since implemented), he
    explained something about file systems I didn't previously know and
    correctly diagnosed the problem as a system problem rather than a Perl
    problem.

    jimk
    James E Keenan, Oct 15, 2003
    #9
  10. Anno Siegel

    James E Keenan <> wrote in comp.lang.perl.misc:
    >
    > "Uri Guttman" <> wrote in message
    > news:...
    > > >>>>> "AS" == Anno Siegel <-berlin.de> writes:


    > > and i would limit dir sizes to way less than 1000. 100-300 seems to be a
    > > good value.
    > >

    >
    > Agreed.
    >
    > > and this is way OT as it has no perl content at all.
    > >

    > Disagree. See my original posting. I thought the slowdown might be due to
    > some problem in my Perl hash assignment. While Anno pointed out some ways
    > in which my code could be optimized (which I've since implemented), he
    > explained something about file systems I didn't previously know and
    > correctly diagnosed the problem as a system problem rather than a Perl
    > problem.


    ...which means that my reply was off topic already. Let's not prolong it.

    Anno
    Anno Siegel, Oct 16, 2003
    #10
