Optimized count of files in tree

Patrick

Hello,

In an application I write in Perl, I must count the total number of
files (not directories) in a complete tree.

What is the most efficient way to do it?

My current code is:

use File::Recurse;
my $nb = 0;
recurse { -f && $nb++ } $dir;

With this code, I can scan 10,000 files in 15 seconds.

I want to know if there is a better (i.e. quicker) way to do it?

Thanks for your help.

Patrick
 
Paul Lalli

In an application I write in Perl, I must count the total number of
files (not directories) in a complete tree.

What is the most efficient way to do it?

My current code is:

use File::Recurse;
my $nb = 0;
recurse { -f && $nb++ } $dir;

With this code, I can scan 10,000 files in 15 seconds.

I want to know if there is a better (i.e. quicker) way to do it?

I don't know the answer for sure, but looking at File::Recurse, it
seems to be a bit bloated for what you want to do. In addition to
recursing through the directory structure, it also checks a hash of
options for each file found, and stores information about each entry
to be later returned from the recurse() function.

I would try just using the standard File::Find module and see if it's
any faster.

use File::Find;
my $nb = 0;
find(sub { -f and $nb++ }, $dir);

You may also wish to use the standard Benchmark module to actually
compare the two techniques.
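
Something along these lines, for example (just a sketch: the path is a
placeholder, and note that Benchmark reports CPU time by default, which can
understate I/O-bound differences, so also watch the wall-clock time):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Find;
use File::Recurse;

my $dir = '/some/tree';    # placeholder: the tree you want to scan

cmpthese( 10, {
    'File::Recurse' => sub {
        my $nb = 0;
        recurse { -f && $nb++ } $dir;
    },
    'File::Find' => sub {
        my $nb = 0;
        find( sub { -f and $nb++ }, $dir );
    },
} );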

Paul Lalli
 
Patrick

Paul Lalli wrote:
I don't know the answer for sure, but looking at File::Recurse, it
seems to be a bit bloated for what you want to do. In addition to
recursing through the directory structure, it also checks a hash of
options for each file found, and stores information about each entry
to be later returned from the recurse() function.

I would try just using the standard File::Find module and see if it's
any faster.

use File::Find;
my $nb = 0;
find(sub { -f and $nb++ }, $dir);

You may also wish to use the standard Benchmark module to actually
compare the two techniques.

Paul Lalli

Thanks for your answer, but I have tried File::Find: I got exactly
the same time for the same tree: 16 seconds for 10,000 files.

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

    return $nb;
}

The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

Patrick
 
Martijn Lievaart

Thanks for your answer, but I have tried File::Find: I got exactly
the same time for the same tree: 16 seconds for 10,000 files.

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

    return $nb;
}

The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

Looks like IO is the bottleneck. If you are on unix, do a

$ time find . -type f | wc -l

and see if that is significantly faster. If not, get a faster hard disk,
put the files on another (faster) disk, switch to RAID, add more memory
so you have more buffers, tune OS parameters, or use another type of
filesystem. Mix and match to taste.

Note that

$ time yourscript.pl

will give you insight into how much time is spent waiting on I/O.
Real - (user + sys) is the time spent waiting on other tasks and I/O. On a
lightly loaded system, this will be mainly I/O.

Look at this:

$ time find . -type f | wc -l
49956

real 0m14.594s
user 0m0.216s
sys 0m1.609s

The find command itself took only a fraction of the total time: about a
second and a half was spent in the kernel, and roughly 13 seconds were
spent waiting on I/O.

HTH,
M4
 
Michele Dondi

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

You're stat()ing each entry twice here (once for -f and once for -d). That
is time consuming; you'd better do it once (see the sketch below).

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

The &-form of sub call is obsolete and not likely to do what you mean.
Just avoid it.
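
Coming back to the double stat: one stat() per entry can feed both tests
through the special "_" filehandle, e.g. (a rough sketch reusing your $dir,
@list, $nb and @subdirs):

my @subdirs;
for my $entry (@list) {
    stat "$dir/$entry";                  # one stat() per entry ...
    $nb++ if -f _;                       # ... reused for the file test ...
    push @subdirs, $entry                # ... and for the directory test
        if $entry !~ /^\./ && -d _;
}
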
The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

I wouldn't go for a recursive solution then, but for an iterative one.
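
Something like this, for instance (only a sketch; the starting directory is
taken from the command line, and stat() here follows symlinks just as your
-d test did):

use strict;
use warnings;

my $dir = shift(@ARGV) || '.';    # top of the tree to scan

my $nb   = 0;
my @todo = ($dir);                # directories still to be visited

while (@todo) {
    my $d = shift @todo;
    opendir( my $dh, $d ) or next;
    for my $entry ( readdir $dh ) {
        next if $entry eq '.' or $entry eq '..';
        stat "$d/$entry";                   # one stat() per entry ...
        $nb++                   if -f _;    # ... reused for the file test
        push @todo, "$d/$entry" if -d _;    # ... and the directory test
    }
    closedir $dh;
}

print "$nb\n";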


Michele
 
Randal L. Schwartz

Patrick> In an application I write in Perl, I must count the total number of files (not
Patrick> directories) in a complete tree.

You haven't clarified whether you want to count a symlink pointing
at a file as a separate file or not. Your code counts it separately,
even if it's pointing at a file not in your starting tree.

Patrick> recurse { -f && $nb++ } $dir;
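
If you do want to skip symlinks, one way (just a sketch; the starting
directory comes from the command line here) is to lstat() each entry and
reuse that result for the file test:

use strict;
use warnings;
use File::Find;

my $dir = shift(@ARGV) || '.';   # starting directory

my $nb = 0;
find( sub {
    # "-l" does an lstat(); "-f _" reuses that result, so a symlink
    # (even one pointing at a regular file) is never counted.
    $nb++ if ! -l $_ && -f _;
}, $dir );

print "$nb\n";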
 
Ingo Menger

Thanks for your answer, but I have tried File::Find: I got exactly the
same time for the same tree: 16 seconds for 10,000 files.
I have also tried a "manual" solution:
sub getFileNb {
    my $dir = shift;
    my $nb  = 0;
    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;
    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;
    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }
    return $nb;
}
The result is about the same: 16 seconds for my 10,000 files.
I am looking for significantly better performance...

Looks like IO is the bottleneck. If you are on unix, do a

$ time find . -type f | wc -l

and see if that is significantly faster. If not, get a faster hard disk,
put the files on another (faster) disk, switch to RAID, add more memory
so you have more buffers, tune OS parameters, or use another type of
filesystem. Mix and match to taste.

And, in addition, repeat the command to see whether it gets faster the
second time. Most likely, the directory blocks to be read will already be
in memory, so the second attempt may be much faster.
 
Martijn Lievaart

And, in addition, repeat the command to see whether it gets faster the
second time. Most likely, the directory blocks to be read will already be
in memory, so the second attempt may be much faster.

Good point! Very important to keep in mind.

M4
 
