Quickly get a count of files on Linux


Ram Prasad

I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further? Currently the program takes longer than optimal:
0.7 s for a queue size of 300 k.

The program will always run only on Linux, so I don't worry about
portability.




#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>

#if STAT_MACROS_BROKEN
# undef S_ISDIR
#endif

#define MAXPATH 1000

#if !defined S_ISDIR && defined S_IFDIR
# define S_ISDIR(Mode) (((Mode) & S_IFMT) == S_IFDIR)
#endif

/* I think this function is the bottleneck */
int isdir(const char *path)
{
    struct stat stats;
    return stat(path, &stats) == 0 && S_ISDIR(stats.st_mode);
}

int dirnscan(const char *path)
{
    char fullpath[MAXPATH];
    DIR *dp;
    struct dirent *ep;
    int n = 0;

    dp = opendir(path);
    if (dp == NULL)
        return 0;
    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        sprintf(fullpath, "%s/%s", path, ep->d_name);
        if (isdir(fullpath) == 0)
            ++n;
        else
            n += dirnscan(fullpath);
    }
    closedir(dp);
    return n;
}

int main(int argc, char *argv[])
{
    printf("%d\n", dirnscan(argv[1]));
    return 0;
}
 

Keith Thompson

Ram Prasad said:
I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further? Currently the program takes longer than optimal:
0.7 s for a queue size of 300 k.

The program will always run only on Linux, so I don't worry about
portability.
[43 lines deleted]

You'll get better answers on comp.unix.programmer.

(But I doubt that you'll get much improvement; the "find" command
already has to do all the work you're doing. Can your queueing
system just keep track of the number of jobs itself?)
 

Ben Bacarisse

China Blue Corn Chips said:
Keith Thompson said:
I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l
(But I doubt that you'll get much improvement; the "find" command
already has to do all the work you're doing. Can your queueing
system just keep track of the number of jobs itself?)

The problem isn't find but wc. The only relevant output of find is
the newlines, but wc has to read every other character to find them.

I don't think that matters all that much:

$ time find /dir/with/long/paths -type f | wc -l
115562

real 0m0.385s
user 0m0.090s
sys 0m0.320s

$ time find /dir/with/long/paths -type f -printf "\n" | wc -l
115562

real 0m0.322s
user 0m0.100s
sys 0m0.210s

It's faster, but not by much (average path length about 100 bytes).

<snip>
 

Nobody

I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further?

If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

For more information, see the readdir(3) manual page (note: *not* the
readdir(2) manual page; the readdir() function in libc uses the getdents()
system call, not the readdir() system call).

Also, if this is something you have to do repeatedly, you can cache the
total for each directory, and only re-scan if the directory's modification
time changes.
 

Ram Prasad

If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.
 

Eric Sosman

[...]
If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.

Please note that the helpful tip had nothing at all to do with
the C programming language, and everything to do with the behavior
of the Linux system on which your program runs. Ponder that, next
time you have a question and are wondering which forum would be
most suitable.
 

Jens Thoms Toerring

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.

Note that that will not work on all types of file systems (see
the details in the NOTES section of the readdir(3) man page).
And then there are other possible issues with your program:

#define MAXPATH 1000
...
char fullpath[MAXPATH];
...
sprintf(fullpath, "%s/%s", path, ep->d_name);

First of all, MAXPATH might be way too short for some possible
paths (there are actually ways to find out what the maximum is;
more about that in a Linux or Unix newsgroup). And if it's too
short, you could easily write past the end of the 'fullpath'
buffer, making everything your program outputs afterwards (if
it doesn't crash) at least dubious...

And no, if you use the d_type element you only have to construct
the full path when you already know the entry is a directory, not
for normal files (since you then don't need to call stat()).
BTW, are you sure you want stat() and not lstat()?

Moreover, you could copy 'path' and the slash only once into
the buffer, store the position after the slash and then
later on only overwrite the 'ep->d_name' part.

But when you compare run times you should be aware that they
may depend a great deal on how much information about the file
system is already buffered by the OS: with a similar program,
counting about 300 K files took about 2 minutes when run for the
first time and only about 0.4 seconds when run again immediately
afterwards on my machine. I got the same kind of behaviour from
'find'. So the run time of your program is most likely dominated
by caching effects, and all your attempts to optimize might
hardly be noticeable in real life if your program is rarely run
on the same directory with lots of time in between.
Regards, Jens
 

Keith Thompson

Ram Prasad said:
[...]
If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

On a reiserfs filesystem, dirent_obj->d_type is always set to 0...
I think I will have to ask on a Unix forum now.

As I suggested some time ago. Try comp.unix.programmer.
 
