Why Is My Hash Assignment Taking So Long?

James E Keenan

I need to build a hash whose elements represent characteristics of each of
9600 files in a single directory. I am trying to figure out why this hash
assignment is taking so long. Here is my code:

my %threads_chars = (
    dir     => "~/Threads",
    pattern => "\.thr\.txt",
);

$threadsref = analyze_dir(\%threads_chars);

sub analyze_dir {
    my $hashref = shift;
    my $dir     = ${$hashref}{'dir'};
    my $pattern = ${$hashref}{'pattern'};
    my (%filechars, $count);
    chdir $dir or die "Couldn't change to $dir: $!";
    opendir DIR, $dir or die "Couldn't open $dir: $!";

    # 1: get files in $dir which match pattern
    my @files = grep { m|$pattern$|o } readdir DIR;
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

jimk
 
Anno Siegel

James E Keenan said:
I need to build a hash whose elements represent characteristics of each of
9600 files in a single directory. I am trying to figure out why this hash
assignment is taking so long. Here is my code:

That's a lot of files to keep in a single directory. That is the
reason why it's slow.

There are a few obvious inefficiencies in your code, but these have
no measurable effect.
my %threads_chars = (
    dir     => "~/Threads",
    pattern => "\.thr\.txt",

You have a quoting problem here. Backslashing the dots in a *string*
has no effect. After the string is built, the backslashes are gone.
Use qr// to quote patterns:

pattern => qr/\.thr\.txt/,
);

$threadsref = analyze_dir(\%threads_chars);

sub analyze_dir {
    my $hashref = shift;
    my $dir     = ${$hashref}{'dir'};
    my $pattern = ${$hashref}{'pattern'};
    my (%filechars, $count);
    chdir $dir or die "Couldn't change to $dir: $!";
    opendir DIR, $dir or die "Couldn't open $dir: $!";

    # 1: get files in $dir which match pattern
    my @files = grep { m|$pattern$|o } readdir DIR;

Why do you build a list of all files just to process them once? Walk
through the directory and skip what you don't want. You'll save
building the list (space efficiency) and you'll only have to apply the
regex once (time efficiency).

You should also take care to de-select anything that isn't a plain file.
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively. To test the hypothesis, just change

my @files = grep { m|$pattern$|o } readdir DIR;

in your code to

my @files = reverse grep { m|$pattern$|o } readdir DIR;

and watch it start out slow and speed up as it goes. To gain speed,
spread the files out over a larger number of smaller directories.

Below is a cleaned-up version of analyze_dir(). It isn't any faster.

sub analyze_dir {
    my ( $dir, $pattern ) = @{ shift() }{ qw( dir pattern ) };
    opendir my $d, $dir or die "Can't open dir $dir: $!";
    chdir $dir or die "Can't cd to $dir: $!";
    my ( %filechars, $count );
    while ( defined( $_ = readdir $d ) ) {
        next unless -f;                   # skip anything that isn't a plain file
        next unless /^(.*)$pattern$/;     # capture the core of the file name
        $filechars{$1} = [ $_, "$dir/$_", ( stat $_ )[ 8 .. 9 ] ];
        print "count: $count\n" if ++$count % 100 == 0;
    }
    closedir $d;
    return \%filechars;
}
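
For completeness, a hypothetical call mirroring the original usage (the
directory path below is just a stand-in):

my $threadsref = analyze_dir( {
    dir     => '/home/jimk/Threads',    # stand-in for the real directory
    pattern => qr/\.thr\.txt/,
} );
printf "%d files analyzed\n", scalar keys %{ $threadsref };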

Anno
 
Sisyphus

James said:
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }

Can't really see the answer to your question, but it seems to me that
you could readily play around with that loop in order to find out just
which part of it is killing the show.

I would start by assigning to $filechars{$count} instead of
$filechars{$core}. If that makes no difference, comment out 'my $core =
$1;' - then comment out the regex - then remove the stat() results from
the array reference. Hopefully by then you will have found out where the
time is going, and we can start pondering "why?" :)
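
A minimal sketch of that kind of elimination test, using Time::HiRes to
time each suspect step in isolation (the directory and pattern below are
made-up stand-ins for the ones in the original script):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-ins for the directory and pattern in the original script.
my $dir     = './Threads';
my $pattern = qr/\.thr\.txt$/;

opendir my $dh, $dir or die "Couldn't open $dir: $!";
my @files = grep { /$pattern/ } readdir $dh;
closedir $dh;

# Time each suspect step of the loop body in isolation over the same file
# list, so the slow one stands out.
my %suspect = (
    'regex capture' => sub {
        for (@files) { my ($core) = m/^(.*)$pattern/; }
    },
    'stat calls' => sub {
        for (@files) { my @times = ( stat "$dir/$_" )[ 8, 9 ]; }
    },
    'hash assignment' => sub {
        my %h;
        for (@files) { $h{$_} = [ $_, "$dir/$_" ]; }
    },
);

for my $step ( sort keys %suspect ) {
    my $t0 = [gettimeofday];
    $suspect{$step}->();
    printf "%-16s %.3f seconds\n", $step, tv_interval($t0);
}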

Hth.

Cheers,
Rob
 
James E Keenan

Anno Siegel said:
That's a lot of files to keep in a single directory. That is the
reason why it's slow.

There are a few obvious inefficiencies in your code, but these have
no measurable effect.


You have a quoting problem here. Backslashing the dots in a *string*
has no effect. After the string is built, the backslashes are gone.
Use qr// to quote patterns:

pattern => qr/\.thr\.txt/,

Ah! An incentive to learn the 'qr' operator!
Why do you build a list of all files just to process them once? Walk
through the directory and skip what you don't want. You'll save
building the list (space efficiency) and you'll only have to apply the
regex once (time efficiency).
I was trying to think of a way to walk through the directory, but I was just
too tired to think. After a good night's sleep, I located the answer in the
Llama book, similar to what you have below.
You should also take care to de-select anything that isn't a plain file.
As it happened, there was only one non-matching file in the directory. But
your point is well taken and I've done that in other scripts.
    print '@files established for ', "$dir\n";

    # 2: build a hash where each element is keyed by the core of the file name
    #    and the value is a ref to an array holding the file name, full path,
    #    atime, mtime.  Set a counter so I know how fast it's going
    foreach (@files) {
        $_ =~ m|^(.*)$pattern$|;
        my $core = $1;
        $filechars{$core} = [ $_, "$dir/$_", (stat($_))[8..9] ];
        $count++;
        if ($count % 100 == 0) { print 'count: ', "$count\n"; }
    }
    closedir DIR or die "Couldn't close $dir: $!";
    print "$dir HAS BEEN ANALYZED\n";
    return \%filechars;
}

The counter prints to screen in 100-file increments. The first couple of
hundred files zip right by, but then the process slows to a crawl. This
occurred despite there being no other significant activity happening on this
disk. Why?

The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively.

Very interesting!
To test the hypothesis, just change

my @files = grep { m|$pattern$|o } readdir DIR;

in your code to

my @files = reverse grep { m|$pattern$|o } readdir DIR;

and watch it start out slow and speed up as it goes. To gain speed,
spread the files out over a larger number of smaller directories.
I'll test these out! Thanks, Anno.

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

jimk
 
John W. Krahn

James said:
I'll test these out! Thanks, Anno.

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

The FAT32 file system that Win9x uses is based on a file system
originally (IIRC) designed for floppy disks. Even if you defragmented
your entire disk (including directories) you would not get much of a
speed improvement. And yes, there are operating systems/file systems
that handle this better but I don't have any specifics. :-(


John
 
Anno Siegel

James E Keenan said:
[...]
The stat() operation takes longer for files that are "late" in the
directory, in the sense that readdir() produces them late. Since
your directory is oversized, the difference is significant. Since
you process the files in the order in which they were found, the process
seems to slow down progressively.

Very interesting!
[...]

More generally, is the sort of too-many-files-in-a-directory slowdown I
experienced here true across operating systems? Or just on this one
(Windows98 SE)? (It does seem as if all directory-related operations
calling upon this directory are slow.)

Generally, very large directories are slow, though there may be differences,
even large ones, between various file systems. Directories were invented
so you don't have all your files in one place, so they are perhaps not
optimized for large numbers. Most people would consider a few hundred
files in a directory quite crowded, especially if the directory is active.
Split them up so that no directory has more than 256 entries; that seems
to be a reasonable number.

When a directory is mostly passive (like, say, the fonts of something big
and bulky), you can store more files per directory, but even then I'd put
a limit at a thousand or so.

Anno
 
Uri Guttman

AS> When a directory is mostly passive (like, say, the fonts of
AS> something big and bulky), you can store more files per directory,
AS> but even then I'd put a limit at a thousand or so.

the size of a file has little to do with how many files should be in a
directory. classic unix file systems totally isolate the directory
entry from the file contents. directory searches are linear which is why
they get very slow when there are many entries. i ran into this many
years ago and redesigned a subsystem to use multiple subdirs before
finding a particular file. it effectively converted a linear search to a
tree search. i used 2 digits of the log file name for 2 levels keeping
the number of entries in any directory to under 100. it was much much
faster than the old system which had 10k+ entries in one dir. this is
still a common and easy technique. some modern unix/linux filesystems
use trees or hashes for lookup and are much faster than linear. but you
are always guaranteed to have a classic linear filesystem around so this
subdir trick will work anywhere.
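
a rough sketch of that kind of two-level split, here keyed on the first two
characters of the file name rather than log-file digits (the source and
destination paths are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Copy qw(move);

# Hypothetical source and destination; the bucketing below uses the first
# two characters of each file name, one per directory level, so no single
# directory grows past a manageable size.
my $src  = './Threads';
my $dest = './Threads.split';

opendir my $dh, $src or die "Couldn't open $src: $!";
my @files = grep { /\.thr\.txt$/ && -f "$src/$_" } readdir $dh;
closedir $dh;

for my $file (@files) {
    my $lvl1   = substr $file, 0, 1;
    my $lvl2   = substr $file, 1, 1;
    my $subdir = "$dest/$lvl1/$lvl2";
    mkpath($subdir) unless -d $subdir;
    move( "$src/$file", "$subdir/$file" )
        or die "Couldn't move $file to $subdir: $!";
}

lookups then go straight to "$dest/x/y/$file" instead of scanning one huge
directory.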

and i would limit dir sizes to way less than 1000. 100-300 seems to be a
good value.

and this is way OT as it has no perl content at all.

uri
 
Anno Siegel

Uri Guttman said:
AS> When a directory is mostly passive (like, say, the fonts of
AS> something big and bulky), you can store more files per directory,
AS> but even then I'd put a limit at a thousand or so.

the size of a file has little to do with how many files should be in a
directory...

Just for the record, I didn't mean to say it does.

Anno
 
James E Keenan

Uri Guttman said:
[snip]
i ran into this many
years ago and redesigned a subsystem to use multiple subdirs before
finding a particular file. it effectively converted a linear search to a
tree search. i used 2 digits of the log file name for 2 levels keeping
the number of entries in any directory to under 100. it was much much
faster than the old system which had 10k+ entries in one dir. this is
still a common and easy technique.

And after I pondered Anno's first response to my OP, I began to design a new
directory structure that works along very much the same principle (viz.,
breaking up this one big directory into smaller directories where the files
sort on a very simple principle such as the 1st letter of the file name).
[snip]

and i would limit dir sizes to way less than 1000. 100-300 seems to be a
good value.

Agreed.

and this is way OT as it has no perl content at all.
Disagree. See my original posting. I thought the slowdown might be due to
some problem in my Perl hash assignment. While Anno pointed out some ways
in which my code could be optimized (which I've since implemented), he
explained something about file systems I didn't previously know and
correctly diagnosed the problem as a system problem rather than a Perl
problem.

jimk
 
Anno Siegel

James E Keenan said:
Disagree. See my original posting. I thought the slowdown might be due to
some problem in my Perl hash assignment. While Anno pointed out some ways
in which my code could be optimized (which I've since implemented), he
explained something about file systems I didn't previously know and
correctly diagnosed the problem as a system problem rather than a Perl
problem.

...which means that my reply was off topic already. Let's not prolong it.

Anno
 
