Optimized count of files in tree

Patrick

Hello,

In an application I write in Perl, I must count the total number of
files (not directories) in a complete tree.

What is the most efficient way to do it?

My current code is:

use File::Recurse;
my $nb = 0;
recurse { -f && $nb++ } $dir;

With this code, I can scan 10,000 files in 15 seconds.

I want to know if there is a better (i.e. quicker) way to do it?

Thanks for your help.

Patrick
 
Paul Lalli

In an application I write in Perl, I must count the total number of
files (not directories) in a complete tree.

What is the most efficient way to do it?

My current code is:

use File::Recurse;
my $nb = 0;
recurse { -f && $nb++ } $dir;

With this code, I can scan 10,000 files in 15 seconds.

I want to know if there is a better (i.e. quicker) way to do it?

I don't know the answer for sure, but looking at File::Recurse, it
seems to be a bit bloated for what you want to do. In addition to
recursing through the directory structure, it also checks a hash of
options for each file found, and stores information about each entry
to be later returned from the recurse() function.

I would try just using the standard File::Find module and see if it's
any faster.

use File::Find;
my $nb = 0;
find(sub { -f and $nb++ }, $dir);

You may also wish to use the standard Benchmark module to actually
compare the two techniques.
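
Something along these lines, for example (just a sketch: the path is a
placeholder, and note that Benchmark reports CPU time by default, which can
understate I/O-bound differences, so also watch the wall-clock time):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Find;
use File::Recurse;

my $dir = '/some/tree';    # placeholder: the tree you want to scan

cmpthese( 10, {
    'File::Recurse' => sub {
        my $nb = 0;
        recurse { -f && $nb++ } $dir;
    },
    'File::Find' => sub {
        my $nb = 0;
        find( sub { -f and $nb++ }, $dir );
    },
} );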

Paul Lalli
 
Patrick

Paul Lalli wrote:
I don't know the answer for sure, but looking at File::Recurse, it
seems to be a bit bloated for what you want to do. In addition to
recursing through the directory structure, it also checks a hash of
options for each file found, and stores information about each entry
to be later returned from the recurse() function.

I would try just using the standard File::Find module and see if it's
any faster.

use File::Find;
my $nb = 0;
find(sub { -f and $nb++ }, $dir);

You may also wish to use the standard Benchmark module to actually
compare the two techniques.

Paul Lalli

Thanks for your answer, but I have tried File::Find: I got exactly
the same time for the same tree: 16 seconds for 10,000 files.

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

    return $nb;
}

The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

Patrick
 
Martijn Lievaart

Thanks for your answer, but I have tried File::Find: I got exactly
the same time for the same tree: 16 seconds for 10,000 files.

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

    return $nb;
}

The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

Looks like IO is the bottleneck. If you are on unix, do a

$ time find . -type f | wc -l

and see if that is significantly faster. If not, get a faster hard disk,
put the files on another (faster) disk, switch to RAID, add more memory
so you have more buffers, tune OS parameters, or use another type of
filesystem. Mix and match to taste.

Note that

$ time yourscript.pl

will give you insight into how much time is spent waiting on I/O.
Real - (user + sys) is the time spent waiting on other tasks and I/O. On a
lightly loaded system, this will be mainly I/O.

Look at this:

$ time find . -type f | wc -l
49956

real 0m14.594s
user 0m0.216s
sys 0m1.609s

The find command itself took only a fraction of the total time: about a
second and a half was spent in the kernel, and roughly 13 seconds were
spent waiting on I/O.

HTH,
M4
 
Michele Dondi

I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb  = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

You're stat()ing each entry twice here (once for -f and once for -d). That
is time consuming; you'd better do it once (see the sketch below).

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

The &-form of sub call is obsolete and not likely to do what you mean.
Just avoid it.
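
Coming back to the double stat: one stat() per entry can feed both tests
through the special "_" filehandle, e.g. (a rough sketch reusing your $dir,
@list, $nb and @subdirs):

my @subdirs;
for my $entry (@list) {
    stat "$dir/$entry";                  # one stat() per entry ...
    $nb++ if -f _;                       # ... reused for the file test ...
    push @subdirs, $entry                # ... and for the directory test
        if $entry !~ /^\./ && -d _;
}
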
The result is about the same: 16 seconds for my 10,000 files.

I am looking for significantly better performance...

I wouldn't go for a recursive solution then, but for an iterative one.
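
Something like this, for instance (only a sketch; the starting directory is
taken from the command line, and stat() here follows symlinks just as your
-d test did):

use strict;
use warnings;

my $dir = shift(@ARGV) || '.';    # top of the tree to scan

my $nb   = 0;
my @todo = ($dir);                # directories still to be visited

while (@todo) {
    my $d = shift @todo;
    opendir( my $dh, $d ) or next;
    for my $entry ( readdir $dh ) {
        next if $entry eq '.' or $entry eq '..';
        stat "$d/$entry";                   # one stat() per entry ...
        $nb++                   if -f _;    # ... reused for the file test
        push @todo, "$d/$entry" if -d _;    # ... and the directory test
    }
    closedir $dh;
}

print "$nb\n";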


Michele
 
Randal L. Schwartz

Patrick> In an application I write in Perl, I must count the total number of files (not
Patrick> directories) in a complete tree.

You haven't clarified whether you want to count a symlink pointing
at a file as a separate file or not. Your code counts it separately,
even if it's pointing at a file not in your starting tree.

Patrick> recurse { -f && $nb++ } $dir;
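
If you do want to skip symlinks, one way (just a sketch; the starting
directory comes from the command line here) is to lstat() each entry and
reuse that result for the file test:

use strict;
use warnings;
use File::Find;

my $dir = shift(@ARGV) || '.';   # starting directory

my $nb = 0;
find( sub {
    # "-l" does an lstat(); "-f _" reuses that result, so a symlink
    # (even one pointing at a regular file) is never counted.
    $nb++ if ! -l $_ && -f _;
}, $dir );

print "$nb\n";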
 
Ingo Menger

Thanks for your answer, but I have tried File::Find: I got exactly the
same time for the same tree: 16 seconds for 10,000 files.
I have also tried a "manual" solution:
sub getFileNb {
    my $dir = shift;
    my $nb  = 0;
    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;
    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;
    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }
    return $nb;
}
The result is about the same: 16 seconds for my 10,000 files.
I am looking for significantly better performance...

Looks like IO is the bottleneck. If you are on unix, do a

$ time find . -type f | wc -l

and see if that is significantly faster. If not, get a faster hard disk,
put the files on another (faster) disk, switch to RAID, add more memory
so you have more buffers, tune OS parameters, or use another type of
filesystem. Mix and match to taste.

And, in addition, repeat the command to see whether it gets faster the
second time. Most likely, the directory blocks to be read will already be
in memory, so the second attempt may be much faster.
 
Martijn Lievaart

And, in addition, repeat the command to see whether it gets faster the
second time. Most likely, the directory blocks to be read will already be
in memory, so the second attempt may be much faster.

Good point! Very important to keep in mind.

M4
 
