Quickly get count of files on linux

Discussion in 'C Programming' started by Ram Prasad, Sep 28, 2011.

  1. Ram  Prasad

    Ram Prasad Guest

    I have a system that gets jobs in files which are stored in a
    directory tree structure.
    To get the current job queue size , I simply have to count all files
    in a particular directory ( including sub dirs )
    The queue size may be upto 2 million files
    I can get the size by using

    find /path -type f | wc -l

    But this is not fast enough. So I wrote a small directory search
    script to just count the number of files , can I optimize this
    further. Currently the script takes longer than optimal
    0.7 s for a queue size of 300 k

    The script will always run only on linux .. so I dont bother about
    compatibility anyway.




    #include <stdio.h>
    #include <sys/types.h>
    #include <dirent.h>
    #include <sys/stat.h>
    #include <stdlib.h>
    #include <string.h>

    #if STAT_MACROS_BROKEN
    # undef S_ISDIR
    #endif

    #define MAXPATH 1000
    #if !defined S_ISDIR && defined S_IFDIR
    # define S_ISDIR(Mode) (((Mode) & S_IFMT) == S_IFDIR)
    #endif
    /* I Think this function is the bottleneck */
    int isdir (const char *path){
    struct stat stats;
    return stat (path, &stats) == 0 && S_ISDIR (stats.st_mode);
    }
    int dirnscan (const char *path){
    char fullpath[MAXPATH];
    DIR *dp;
    struct dirent *ep;
    int n=0;
    dp = opendir (path);
    if(dp==NULL) return 0;
    while ((ep = readdir (dp))){
    if(ep->d_name[0] == '.') continue;
    sprintf(fullpath,"%s/%s",path,ep->d_name);
    if(isdir(fullpath) == 0){
    ++n;
    } else {
    n = n + dirnscan(fullpath);
    }
    }
    closedir(dp);
    return(n);
    }
    int main(int argc,char *argv[]){
    printf("%d\n",dirnscan(argv[1]));
    return(0);
    }
     
    Ram Prasad, Sep 28, 2011
    #1
    1. Advertising

  2. Ram Prasad <> writes:
    > I have a system that gets jobs in files which are stored in a
    > directory tree structure.
    > To get the current job queue size , I simply have to count all files
    > in a particular directory ( including sub dirs )
    > The queue size may be upto 2 million files
    > I can get the size by using
    >
    > find /path -type f | wc -l
    >
    > But this is not fast enough. So I wrote a small directory search
    > script to just count the number of files , can I optimize this
    > further. Currently the script takes longer than optimal
    > 0.7 s for a queue size of 300 k
    >
    > The script will always run only on linux .. so I dont bother about
    > compatibility anyway.
    >

    [43 lines deleted]

    You'll get better answers on comp.unix.programmer.

    (But I doubt that you'll get much improvement; the "find" command
    already has to do all the work you're doing. Can your queueing
    system just keep track of the number of jobs itself?)

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Sep 28, 2011
    #2
    1. Advertising

  3. China Blue Corn Chips <> writes:

    > In article <>, Keith Thompson <>
    > wrote:
    >
    >> Ram Prasad <> writes:
    >> > I have a system that gets jobs in files which are stored in a
    >> > directory tree structure.
    >> > To get the current job queue size , I simply have to count all files
    >> > in a particular directory ( including sub dirs )
    >> > The queue size may be upto 2 million files
    >> > I can get the size by using
    >> >
    >> > find /path -type f | wc -l

    >
    >> (But I doubt that you'll get much improvement; the "find" command
    >> already has to do all the work you're doing. Can your queueing
    >> system just keep track of the number of jobs itself?)

    >
    > The problem isn't find but wc. The only relevant output of find are
    > the \n, but wc has to read every other character to find those.


    I don't think that matters all that much:

    $ time find /dir/with/long/paths -type f | wc -l
    115562

    real 0m0.385s
    user 0m0.090s
    sys 0m0.320s

    $ time find /dir/with/long/paths -type f -printf "\n" | wc -l
    115562

    real 0m0.322s
    user 0m0.100s
    sys 0m0.210s

    It's faster, but not by much (average path length about 100 bytes).

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Sep 29, 2011
    #3
  4. Ram  Prasad

    Nobody Guest

    On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:

    > I have a system that gets jobs in files which are stored in a
    > directory tree structure.
    > To get the current job queue size , I simply have to count all files
    > in a particular directory ( including sub dirs )
    > The queue size may be upto 2 million files
    > I can get the size by using
    >
    > find /path -type f | wc -l
    >
    > But this is not fast enough. So I wrote a small directory search
    > script to just count the number of files , can I optimize this
    > further.


    If you only need it to work on Linux, you can usually eliminate the stat()
    call by using the d_type field in the dirent structure.

    For more information, see the readdir(3) manual page (note: *not* the
    readdir(2) manual page; the readdir() function in libc uses the getdents()
    system call, not the readdir() system call).

    Also, if this is something you have to do repeatedly, you can cache the
    total for each directory, and only re-scan if the directory's modification
    time changes.
     
    Nobody, Sep 29, 2011
    #4
  5. Ram  Prasad

    Ram Prasad Guest

    On Sep 29, 8:15 am, Nobody <> wrote:
    > On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:
    > > I have a system that gets jobs in files which are stored in a
    > > directory tree structure.
    > > To get the current job queue size , I simply have to count all files
    > > in a particular directory ( including sub dirs )
    > > The queue size may be upto 2 million files
    > > I can get the size by using

    >
    > > find /path  -type f   | wc -l

    >
    > > But this is not fast enough.   So I wrote a small directory search
    > > script to just count the number of files , can I optimize this
    > > further.

    >
    > If you only need it to work on Linux, you can usually eliminate the stat()
    > call by using the d_type field in the dirent structure.
    >


    Thanks for the tip that greatly helped the speed.
    Also do I have to sprintf() everytime in the loop. Sorry I am not a
    regular C programmer .. used perl / php all the while
     
    Ram Prasad, Sep 29, 2011
    #5
  6. Ram  Prasad

    Eric Sosman Guest

    On 9/29/2011 7:23 AM, Ram Prasad wrote:
    > On Sep 29, 8:15 am, Nobody<> wrote:
    >>[...]
    >> If you only need it to work on Linux, you can usually eliminate the stat()
    >> call by using the d_type field in the dirent structure.

    >
    > Thanks for the tip that greatly helped the speed.
    > Also do I have to sprintf() everytime in the loop. Sorry I am not a
    > regular C programmer .. used perl / php all the while


    Please note that the helpful tip had nothing at all to do with
    the C programming language, and everything to do with the behavior
    of the Linux system on which your program runs. Ponder that, next
    time you have a question and are wondering which forum would be
    most suitable.

    --
    Eric Sosman
    d
     
    Eric Sosman, Sep 29, 2011
    #6
  7. Ram Prasad <> wrote:
    > On Sep 29, 8:15 am, Nobody <> wrote:
    > > On Wed, 28 Sep 2011 08:30:56 -0700, Ram Prasad wrote:
    > > > I have a system that gets jobs in files which are stored in a
    > > > directory tree structure.
    > > > To get the current job queue size , I simply have to count all files
    > > > in a particular directory ( including sub dirs )
    > > > The queue size may be upto 2 million files
    > > > I can get the size by using

    > >
    > > > find /path  -type f   | wc -l

    > >
    > > > But this is not fast enough.   So I wrote a small directory search
    > > > script to just count the number of files , can I optimize this
    > > > further.

    > >
    > > If you only need it to work on Linux, you can usually eliminate the stat()
    > > call by using the d_type field in the dirent structure.


    > Thanks for the tip that greatly helped the speed.
    > Also do I have to sprintf() everytime in the loop. Sorry I am not a
    > regular C programmer .. used perl / php all the while


    Note that that will not work on all types of file systems (see
    the details in the NOTES section of of the man page readdir(3)).
    And then there are other possible issues with your program:

    > #define MAXPATH 1000

    ....
    > char fullpath[MAXPATH];

    ....
    > sprintf(fullpath,"%s/%s",path,ep->d_name);


    First of all, MAXPATH might be way too short for all possible
    paths (there are actually ways to find out what the maximum is,
    more about that in a Linux or Unix newsgroup). And if it's too
    short you easily could write past the end of the 'fullpath'
    buffer, making all what your program outputs afterward (if
    it doesn't crash) at least dubious...

    And no, if you use the d_type element you only have to cons-
    truct the directory name when you already know it's a direc-
    tory but not for normal files (since you then don't need to
    call stat(). BTW, are you sure you want stat() and not lstat()?

    Moreover you could copy 'path' and the slash only once to
    the buffer, store the possition after the slash and then
    later on only overwrite the 'ep->d_name' part.

    But when you compare run times you should be aware that they
    might depend a real lot on information about the file system
    already buffered by the OS - with a similar program counting
    about 300 K files took about 2 minutes when run for the first
    time and only about 0.4 seconds when run again for a second
    time immediately afterwards on my machine. Got the same kind
    of behaviour from 'find'. So the run time of your program is
    may likely dominated by caching issues and all your attempts
    to optimize might hardly be noticable in real life when your
    program is rarely run on the same directory with lots of time
    in between.
    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
     
    Jens Thoms Toerring, Sep 29, 2011
    #7
  8. Ram  Prasad

    Ram Prasad Guest

    On Sep 29, 4:41 pm, Eric Sosman <> wrote:
    > On 9/29/2011 7:23 AM, Ram Prasad wrote:
    >
    > > On Sep 29, 8:15 am, Nobody<>  wrote:
    > >>[...]
    > >> If you only need it to work on Linux, you can usually eliminate the stat()
    > >> call by using the d_type field in the dirent structure.

    >





    On a reiserfs filesystem
    dirent_obj->d_type is always set to 0 ... I think I will have to ask
    on a unix forum now
     
    Ram Prasad, Sep 29, 2011
    #8
  9. Ram Prasad <> writes:
    > On Sep 29, 4:41 pm, Eric Sosman <> wrote:
    >> On 9/29/2011 7:23 AM, Ram Prasad wrote:
    >> > On Sep 29, 8:15 am, Nobody<>  wrote:
    >> >>[...]
    >> >> If you only need it to work on Linux, you can usually eliminate the stat()
    >> >> call by using the d_type field in the dirent structure.

    >
    > On a reiserfs filesystem
    > dirent_obj->d_type is always set to 0 ... I think I will have to ask
    > on a unix forum now


    As I suggested some time ago. Try comp.unix.programmer.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Sep 29, 2011
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. rbt

    writing large files quickly

    rbt, Jan 27, 2006, in forum: Python
    Replies:
    36
    Views:
    944
    Thomas Bellman
    Jan 29, 2006
  2. JuHui
    Replies:
    1
    Views:
    291
    Bruno Desthuilliers
    Mar 17, 2006
  3. Damon Hastings
    Replies:
    2
    Views:
    241
    Lack Mr G M
    Sep 9, 2003
  4. Replies:
    4
    Views:
    500
    Robert Klemme
    Oct 3, 2012
  5. Replies:
    3
    Views:
    132
    Roy Smith
    Sep 25, 2013
Loading...

Share This Page