Why Is My Hash Assignment Taking So Long?

James E Keenan

I need to build a hash whose elements represent characteristics of each of
9600 files in a single directory. I am trying to figure out why this hash
assignment is taking so long. Here is my code:

my %threads_chars = (
    dir     => "~/Threads",
    pattern => "\.thr\.txt",
);

$threadsref = analyze_dir(\%threads_chars);

sub analyze_dir {
    my $hashref = shift;
    my $dir     = ${$hashref}{'dir'};
    my $pattern = ${$hashref}{'pattern'};
    my (%filechars, $count);
    chdir $dir or die "Couldn't change to $dir: $!";
    opendir DIR, $dir or die "Couldn't open $dir: $!";

    # 1: get files in $dir which match pattern
    my @files = grep { m|$pattern$|o } readdir DIR;
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

jimk
 
Anno Siegel

James E Keenan said:
I need to build a hash whose elements represent characteristics of each of
9600 files in a single directory. I am trying to figure out why this hash
assignment is taking so long. Here is my code:

That's a lot of files to keep in a single directory. That is the
reason why it's slow.

There are a few obvious inefficiencies in your code, but these have
no measurable effect.
my %threads_chars = (
    dir     => "~/Threads",
    pattern => "\.thr\.txt",

You have a quoting problem here. Backslashing the dots in a *string*
has no effect. After the string is built, the backslashes are gone.
Use qr// to quote patterns:

pattern => qr/\.thr\.txt/,
);

$threadsref = analyze_dir(\%threads_chars);

sub analyze_dir {
    my $hashref = shift;
    my $dir     = ${$hashref}{'dir'};
    my $pattern = ${$hashref}{'pattern'};
    my (%filechars, $count);
    chdir $dir or die "Couldn't change to $dir: $!";
    opendir DIR, $dir or die "Couldn't open $dir: $!";

    # 1: get files in $dir which match pattern
    my @files = grep { m|$pattern$|o } readdir DIR;

Why do you build a list of all files just to process them once? Walk
through the directory and skip what you don't want. You'll save
building the list (space efficiency) and you'll only have to apply the
regex once (time efficiency).

You should also take care to de-select anything that isn't a plain file.
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively. To test the hypothesis, just change

my @files = grep { m|$pattern$|o } readdir DIR;

in your code to

my @files = reverse grep { m|$pattern$|o } readdir DIR;

and watch it start out slow and speed up as it goes. To gain speed,
spread the files out over a larger number of smaller directories.

Below is a cleaned-up version of analyze_dir(). It isn't any faster.

sub analyze_dir {
    my ( $dir, $pattern ) = @{ shift() }{ qw( dir pattern ) };
    opendir my $d, $dir or die "Can't open dir $dir: $!";
    chdir $dir or die "Can't cd to $dir: $!";
    my ( %filechars, $count );
    while ( defined( $_ = readdir $d ) ) {
        next unless -f;                   # skip anything that isn't a plain file
        next unless /^(.*)$pattern$/;     # capture the core of the file name
        $filechars{$1} = [ $_, "$dir/$_", ( stat $_ )[ 8 .. 9 ] ];
        print "count: $count\n" if ++$count % 100 == 0;
    }
    closedir $d;
    return \%filechars;
}
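
For completeness, a hypothetical call mirroring the original usage (the
directory path below is just a stand-in):

my $threadsref = analyze_dir( {
    dir     => '/home/jimk/Threads',    # stand-in for the real directory
    pattern => qr/\.thr\.txt/,
} );
printf "%d files analyzed\n", scalar keys %{ $threadsref };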

Anno
 
Sisyphus

James said:
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }

Can't really see the answer to your question, but it seems to me that
you could readily play around with that loop in order to find out just
which part of it is killing the show.

I would start by assigning to $filechars{$count} instead of
$filechars{$core}. If that makes no difference, comment out 'my $core =
$1;' - then comment out the regex - then remove the stat() results from
the array reference. Hopefully by then you will have found out where the
time is going, and we can start pondering "why?" :)
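
A minimal sketch of that kind of elimination test, using Time::HiRes to
time each suspect step in isolation (the directory and pattern below are
made-up stand-ins for the ones in the original script):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-ins for the directory and pattern in the original script.
my $dir     = './Threads';
my $pattern = qr/\.thr\.txt$/;

opendir my $dh, $dir or die "Couldn't open $dir: $!";
my @files = grep { /$pattern/ } readdir $dh;
closedir $dh;

# Time each suspect step of the loop body in isolation over the same file
# list, so the slow one stands out.
my %suspect = (
    'regex capture' => sub {
        for (@files) { my ($core) = m/^(.*)$pattern/; }
    },
    'stat calls' => sub {
        for (@files) { my @times = ( stat "$dir/$_" )[ 8, 9 ]; }
    },
    'hash assignment' => sub {
        my %h;
        for (@files) { $h{$_} = [ $_, "$dir/$_" ]; }
    },
);

for my $step ( sort keys %suspect ) {
    my $t0 = [gettimeofday];
    $suspect{$step}->();
    printf "%-16s %.3f seconds\n", $step, tv_interval($t0);
}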

Hth.

Cheers,
Rob
 
James E Keenan

Anno Siegel said:
That's a lot of files to keep in a single directory. That is the
reason why it's slow.

There are a few obvious inefficiencies in your code, but these have
no measurable effect.


You have a quoting problem here. Backslashing the dots in a *string*
has no effect. After the string is built, the backslashes are gone.
Use qr// to quote patterns:

pattern => qr/\.thr\.txt/,

Ah! An incentive to learn the 'qr' operator!
Why do you build a list of all files just to process them once? Walk
through the directory and skip what you don't want. You'll save
building the list (space efficiency) and you'll only have to apply the
regex once (time efficiency).
I was trying to think of a way to walk through the directory, but I was just
too tired to think. After a good night's sleep, I located the answer in the
Llama book, similar to what you have below.
You should also take care to de-select anything that isn't a plain file.
As it happened, there was only one non-matching file in the directory. But
your point is well taken and I've done that in other scripts.
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively.

Very interesting!
To test the hypothesis, just change

my @files = grep { m|$pattern$|o } readdir DIR;

in your code to

my @files = reverse grep { m|$pattern$|o } readdir DIR;

and watch it start out slow and speed up as it goes. To gain speed,
spread the files out over a larger number of smaller directories.
I'll test these out! Thanks, Anno.

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

jimk
 
John W. Krahn

James said:
I'll test these out! Thanks, Anno.

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

The FAT32 file system that Win9x uses is based on a file system
originally (IIRC) designed for floppy disks. Even if you defragmented
your entire disk (including directories) you would not get much of a
speed improvement. And yes, there are operating systems/file systems
that handle this better but I don't have any specifics. :-(


John
 
Anno Siegel

James E Keenan said:
[...]
The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively.

Very interesting!
[...]

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

Generally, very large directories are slow, though there may be differences,
even large ones, between various file systems. Directories were invented
so you don't have all your files in one place, so they are perhaps not
optimized for large numbers. Most people would consider a few hundred
files in a directory quite crowded, especially if the directory is active.
Split them up so that no directory has more than 256 entries; that seems
to be a reasonable number.

When a directory is mostly passive (like, say, the fonts of something big
and bulky), you can store more files per directory, but even then I'd put
a limit at a thousand or so.

Anno
 
Uri Guttman

AS> When a directory is mostly passive (like, say, the fonts of
AS> something big and bulky), you can store more files per directory,
AS> but even then I'd put a limit at a thousand or so.

the size of a file has little to do with how many files should be in a
directory. classic unix file systems totally isolate the directory
entry from the file contents. directory searches are linear which is why
they get very slow when there are many entries. i ran into this many
years ago and redesigned a subsystem to use multiple subdirs before
finding a particular file. it effectively converted a linear search to a
tree search. i used 2 digits of the log file name for 2 levels keeping
the number of entries in any directory to under 100. it was much much
faster than the old system which had 10k+ entries in one dir. this is
still a common and easy technique. some modern unix/linux filesystems
use trees or hashes for lookup and are much faster than linear. but you
are always guaranteed to have a classic linear filesystem around so this
subdir trick will work anywhere.
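
a rough sketch of that kind of two-level split, here keyed on the first two
characters of the file name rather than log-file digits (the source and
destination paths are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Copy qw(move);

# Hypothetical source and destination; the bucketing below uses the first
# two characters of each file name, one per directory level, so no single
# directory grows past a manageable size.
my $src  = './Threads';
my $dest = './Threads.split';

opendir my $dh, $src or die "Couldn't open $src: $!";
my @files = grep { /\.thr\.txt$/ && -f "$src/$_" } readdir $dh;
closedir $dh;

for my $file (@files) {
    my $lvl1   = substr $file, 0, 1;
    my $lvl2   = substr $file, 1, 1;
    my $subdir = "$dest/$lvl1/$lvl2";
    mkpath($subdir) unless -d $subdir;
    move( "$src/$file", "$subdir/$file" )
        or die "Couldn't move $file to $subdir: $!";
}

lookups then go straight to "$dest/x/y/$file" instead of scanning one huge
directory.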

and i would limit dir sizes to way less than 1000. 100-300 seems to be a
good value.

and this is way OT as it has no perl content at all.

uri
 
Anno Siegel

Uri Guttman said:
AS> When a directory is mostly passive (like, say, the fonts of
AS> something big and bulky), you can store more files per directory,
AS> but even then I'd put a limit at a thousand or so.

the size of a file has little to do with how many files should be in a
directory...

Just for the record, I didn't mean to say it does.

Anno
 
James E Keenan

Uri Guttman said:
[snip]
i ran into this many
years ago and redesigned a subsystem to use multiple subdirs before
finding a particular file. it effectively converted a linear search to a
tree search. i used 2 digits of the log file name for 2 levels keeping
the number of entries in any directory to under 100. it was much much
faster than the old system which had 10k+ entries in one dir. this is
still a common and easy technique.

And after I pondered Anno's first response to my OP, I began to design a new
directory structure that works along very much the same principle (viz.,
breaking up this one big directory into smaller directories where the files
sort on a very simple principle such as the 1st letter of the file name).
[snip]

and i would limit dir sizes to way less than 1000. 100-300 seems to be a
good value.

Agreed.

and this is way OT as it has no perl content at all.
Disagree. See my original posting. I thought the slowdown might be due to
some problem in my Perl hash assignment. While Anno pointed out some ways
in which my code could be optimized (which I've since implemented), he
explained something about file systems I didn't previously know and
correctly diagnosed the problem as a system problem rather than a Perl
problem.

jimk
 
Anno Siegel

James E Keenan said:
Disagree. See my original posting. I thought the slowdown might be due to
some problem in my Perl hash assignment. While Anno pointed out some ways
in which my code could be optimized (which I've since implemented), he
explained something about file systems I didn't previously know and
correctly diagnosed the problem as a system problem rather than a Perl
problem.

...which means that my reply was off topic already. Let's not prolong it.

Anno
 
