Quickly get a count of files on Linux


Ram Prasad

I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further? Currently the program takes longer than optimal:
0.7 s for a queue size of 300 k.

The program will always run only on Linux, so I don't worry about
portability.




#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>

#if STAT_MACROS_BROKEN
# undef S_ISDIR
#endif

#define MAXPATH 1000

#if !defined S_ISDIR && defined S_IFDIR
# define S_ISDIR(Mode) (((Mode) & S_IFMT) == S_IFDIR)
#endif

/* I think this function is the bottleneck */
int isdir(const char *path)
{
    struct stat stats;
    return stat(path, &stats) == 0 && S_ISDIR(stats.st_mode);
}

int dirnscan(const char *path)
{
    char fullpath[MAXPATH];
    DIR *dp;
    struct dirent *ep;
    int n = 0;

    dp = opendir(path);
    if (dp == NULL)
        return 0;
    while ((ep = readdir(dp))) {
        if (ep->d_name[0] == '.')
            continue;
        sprintf(fullpath, "%s/%s", path, ep->d_name);
        if (isdir(fullpath) == 0)
            ++n;
        else
            n += dirnscan(fullpath);
    }
    closedir(dp);
    return n;
}

int main(int argc, char *argv[])
{
    printf("%d\n", dirnscan(argv[1]));
    return 0;
}
 

Keith Thompson

Ram Prasad said:
I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further? Currently the program takes longer than optimal:
0.7 s for a queue size of 300 k.

The program will always run only on Linux, so I don't worry about
portability.
[43 lines deleted]

You'll get better answers on comp.unix.programmer.

(But I doubt that you'll get much improvement; the "find" command
already has to do all the work you're doing. Can your queueing
system just keep track of the number of jobs itself?)
 

Ben Bacarisse

China Blue Corn Chips said:
Keith Thompson said:
I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l
(But I doubt that you'll get much improvement; the "find" command
already has to do all the work you're doing. Can your queueing
system just keep track of the number of jobs itself?)

The problem isn't find but wc. The only relevant output of find is
the newlines, but wc has to read every other character to find them.

I don't think that matters all that much:

$ time find /dir/with/long/paths -type f | wc -l
115562

real 0m0.385s
user 0m0.090s
sys 0m0.320s

$ time find /dir/with/long/paths -type f -printf "\n" | wc -l
115562

real 0m0.322s
user 0m0.100s
sys 0m0.210s

It's faster, but not by much (average path length about 100 bytes).

<snip>
 

Nobody

I have a system that gets jobs in files which are stored in a
directory tree structure.
To get the current job queue size, I simply have to count all the
files in a particular directory (including subdirectories).
The queue size may be up to 2 million files.
I can get the count by using

find /path -type f | wc -l

But this is not fast enough, so I wrote a small directory-scanning
program to just count the number of files. Can I optimize this
further?

If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

For more information, see the readdir(3) manual page (note: *not* the
readdir(2) manual page; the readdir() function in libc uses the getdents()
system call, not the readdir() system call).

Also, if this is something you have to do repeatedly, you can cache the
total for each directory, and only re-scan if the directory's modification
time changes.
 

Ram Prasad

If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.
 

Eric Sosman

[...]
If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.

Please note that the helpful tip had nothing at all to do with
the C programming language, and everything to do with the behavior
of the Linux system on which your program runs. Ponder that, next
time you have a question and are wondering which forum would be
most suitable.
 

Jens Thoms Toerring

Thanks for the tip; that greatly helped the speed.
Also, do I have to call sprintf() every time in the loop? Sorry, I am
not a regular C programmer; I've used Perl and PHP all the while.

Note that that will not work on all types of file systems (see
the details in the NOTES section of the readdir(3) man page).
And then there are other possible issues with your program:

#define MAXPATH 1000
...
char fullpath[MAXPATH];
...
sprintf(fullpath, "%s/%s", path, ep->d_name);

First of all, MAXPATH might be way too short for some possible
paths (there are actually ways to find out what the maximum is;
more about that in a Linux or Unix newsgroup). And if it's too
short, you could easily write past the end of the 'fullpath'
buffer, making everything your program outputs afterwards (if
it doesn't crash) at least dubious...

And no, if you use the d_type element you only have to construct
the full path when you already know the entry is a directory, not
for normal files (since you then don't need to call stat()).
BTW, are you sure you want stat() and not lstat()?

Moreover, you could copy 'path' and the slash only once into
the buffer, store the position after the slash and then
later on only overwrite the 'ep->d_name' part.

But when you compare run times you should be aware that they
may depend a great deal on how much information about the file
system is already buffered by the OS: with a similar program,
counting about 300 K files took about 2 minutes when run for the
first time and only about 0.4 seconds when run again immediately
afterwards on my machine. I got the same kind of behaviour from
'find'. So the run time of your program is most likely dominated
by caching effects, and all your attempts to optimize might
hardly be noticeable in real life if your program is rarely run
on the same directory with lots of time in between.
Regards, Jens
 

Keith Thompson

Ram Prasad said:
[...]
If you only need it to work on Linux, you can usually eliminate the stat()
call by using the d_type field in the dirent structure.

On a reiserfs filesystem, dirent_obj->d_type is always set to 0...
I think I will have to ask on a Unix forum now.

As I suggested some time ago. Try comp.unix.programmer.
 
